Struct grep_printer::JSON

pub struct JSON<W> { /* private fields */ }

The JSON printer, which emits results in a JSON lines format.

This type is generic over W, which represents any implementation of the standard library io::Write trait.

Format

This section describes the JSON format used by this printer.

To skip the rigamarole, take a look at the example at the end.

The format of this printer is the JSON Lines format. Specifically, this printer emits a sequence of messages, where each message is encoded as a single JSON value on a single line. There are four different types of messages (and this number may expand over time):

begin - A message that indicates a file is being searched.
end - A message that indicates a file is done being searched. This message also includes summary statistics about the search.
match - A message that indicates a match was found. This includes the text and offsets of the match.
context - A message that indicates a contextual line was found. This includes the text of the line, along with any match information if the search was inverted.

Every message is encoded in the same envelope format, which includes a tag indicating the message type along with an object for the payload:

{
    "type": "{begin|end|match|context}",
    "data": { ... }
}

The message itself is encoded in the envelope’s data key.
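
As a rough consumer-side sketch (not part of this crate), the envelope’s type tag could be mapped onto a Rust enum before the payload is inspected. The names MessageKind and from_tag are made up for illustration, and the Unknown variant is an assumption that reflects the note above that the set of message types may expand over time.

```rust
/// Hypothetical consumer-side view of the envelope's `type` tag.
#[derive(Debug, PartialEq, Eq)]
enum MessageKind {
    Begin,
    End,
    Match,
    Context,
    /// Catch-all for tags this sketch does not know about, so that a
    /// newer producer does not break an older consumer.
    Unknown(String),
}

impl MessageKind {
    /// Map the string found under the envelope's `type` key to a variant.
    fn from_tag(tag: &str) -> MessageKind {
        match tag {
            "begin" => MessageKind::Begin,
            "end" => MessageKind::End,
            "match" => MessageKind::Match,
            "context" => MessageKind::Context,
            other => MessageKind::Unknown(other.to_string()),
        }
    }
}
```

A consumer would typically read the tag first and only then decide how to interpret the object under the data key.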

Text encoding

Before describing each message format, we first must briefly discuss text encoding, since it factors into every type of message. In particular, JSON may only be encoded in UTF-8, UTF-16 or UTF-32. For the purposes of this printer, we need only worry about UTF-8. The problem here is that searching is not limited to UTF-8 exclusively, which in turn implies that matches may be reported that contain invalid UTF-8. Moreover, this printer may also print file paths, and the encoding of file paths is itself not guaranteed to be valid UTF-8. Therefore, this printer must deal with the presence of invalid UTF-8 somehow. The printer could silently ignore such things completely, or even lossily transcode invalid UTF-8 to valid UTF-8 by replacing all invalid sequences with the Unicode replacement character. However, this would prevent consumers of this format from accessing the original data in a non-lossy way.

Therefore, this printer will emit valid UTF-8 encoded bytes as normal JSON strings and otherwise base64 encode data that isn’t valid UTF-8. To communicate whether this process occurs or not, strings are keyed by the name text whereas arbitrary bytes are keyed by bytes.

For example, when a path is included in a message, it is formatted like so, if and only if the path is valid UTF-8:

{
    "path": {
        "text": "/home/ubuntu/lib.rs"
    }
}

If instead our path was /home/ubuntu/lib\xFF.rs, where the \xFF byte makes it invalid UTF-8, the path would instead be encoded like so:

{
    "path": {
        "bytes": "L2hvbWUvdWJ1bnR1L2xpYv8ucnM="
    }
}

This same representation is used for reporting matches as well.

The printer guarantees that the text field is used whenever the underlying bytes are valid UTF-8.
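
The rule above can be sketched in a few lines of standard-library Rust. This is only an illustration of the text-or-bytes decision, not the printer’s actual implementation: the Data enum, encode and base64_encode are hypothetical names, and the sketch assumes the standard base64 alphabet with = padding, which is consistent with the encoded path shown above.

```rust
/// Hypothetical model of the two keys described above.
#[derive(Debug, PartialEq, Eq)]
enum Data {
    /// Emitted under the `text` key: the bytes were valid UTF-8.
    Text(String),
    /// Emitted under the `bytes` key: base64 of the raw bytes.
    Bytes(String),
}

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Minimal base64 encoder (standard alphabet, `=` padding).
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Use `Text` whenever the bytes are valid UTF-8, otherwise fall back to base64.
fn encode(raw: &[u8]) -> Data {
    match std::str::from_utf8(raw) {
        Ok(s) => Data::Text(s.to_string()),
        Err(_) => Data::Bytes(base64_encode(raw)),
    }
}
```

With this sketch, encode(b"/home/ubuntu/lib.rs") produces the Text variant, while encode(b"/home/ubuntu/lib\xFF.rs") produces Bytes("L2hvbWUvdWJ1bnR1L2xpYv8ucnM="), the same string as in the example above.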

Wire format

This section documents the wire format emitted by this printer, starting with the four types of messages.

Each message has its own format, and is contained inside an envelope that indicates the type of message. The envelope has these fields:

type - A string indicating the type of this message. It may be one of four possible strings: begin, end, match or context. This list may expand over time.
data - The actual message data. The format of this field depends on the value of type. The possible message formats are begin, end, match, context.

Message: begin

This message indicates that a search has begun. It has these fields:

path - An arbitrary data object representing the file path corresponding to the search, if one is present. If no file path is available, then this field is null.

Message: end

This message indicates that a search has finished. It has these fields:

path - An arbitrary data object representing the file path corresponding to the search, if one is present. If no file path is available, then this field is null.
binary_offset - The absolute offset in the data searched corresponding to the place at which binary data was detected. If no binary data was detected (or if binary detection was disabled), then this field is null.
stats - A stats object that contains summary statistics for the previous search.

Message: match

This message indicates that a match has been found. A match generally corresponds to a single line of text, although it may correspond to multiple lines if the search can emit matches over multiple lines. It has these fields:

path - An arbitrary data object representing the file path corresponding to the search, if one is present. If no file path is available, then this field is null.
lines - An arbitrary data object representing one or more lines contained in this match.
line_number - If the searcher has been configured to report line numbers, then this corresponds to the line number of the first line in lines. If no line numbers are available, then this is null.
absolute_offset - The absolute byte offset corresponding to the start of lines in the data being searched.
submatches - An array of submatch objects corresponding to matches in lines. The offsets included in each submatch correspond to byte offsets into lines. (If lines is base64 encoded, then the byte offsets correspond to the data after base64 decoding.) The submatch objects are guaranteed to be sorted by their starting offsets. Note that it is possible for this array to be empty, for example, when searching reports inverted matches.

Message: context

This message indicates that a contextual line has been found. A contextual line is a line that doesn’t contain a match, but is generally adjacent to a line that does contain a match. The precise way in which contextual lines are reported is determined by the searcher. It has these fields, which are exactly the same fields found in a match:

path - An arbitrary data object representing the file path corresponding to the search, if one is present. If no file path is available, then this field is null.
lines - An arbitrary data object representing one or more lines contained in this context. This includes line terminators, if they’re present.
line_number - If the searcher has been configured to report line numbers, then this corresponds to the line number of the first line in lines. If no line numbers are available, then this is null.
absolute_offset - The absolute byte offset corresponding to the start of lines in the data being searched.
submatches - An array of submatch objects corresponding to matches in lines. The offsets included in each submatch correspond to byte offsets into lines. (If lines is base64 encoded, then the byte offsets correspond to the data after base64 decoding.) The submatch objects are guaranteed to be sorted by their starting offsets. Note that it is possible for this array to be non-empty, for example, when searching reports inverted matches such that the original matcher could match things in the contextual lines.

Object: submatch

This object describes submatches found within match or context messages. The start and end fields indicate the half-open interval on which the match occurs (start is included, but end is not). It is guaranteed that start <= end. It has these fields:

match - An arbitrary data object corresponding to the text in this submatch.
start - A byte offset indicating the start of this match. This offset is generally reported in terms of the parent object’s data, for example the lines field in the match or context messages.
end - A byte offset indicating the end of this match. This offset is generally reported in terms of the parent object’s data, for example the lines field in the match or context messages.
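
To make the half-open interval concrete, here is a small sketch of how a consumer might slice the decoded lines data with the offsets of each submatch. The Submatch struct and matched_spans function are hypothetical names used only for this illustration; the input is assumed to be the raw bytes of lines (after base64 decoding, if that applied).

```rust
/// Hypothetical mirror of the `start` and `end` fields of a submatch.
struct Submatch {
    start: usize,
    end: usize,
}

/// Return the bytes of `lines` covered by each submatch.
///
/// `start..end` is a half-open range, which is exactly how Rust slices
/// are indexed. Ranges that do not fit inside `lines` are skipped
/// instead of panicking.
fn matched_spans<'a>(lines: &'a [u8], submatches: &[Submatch]) -> Vec<&'a [u8]> {
    submatches
        .iter()
        .filter_map(|m| lines.get(m.start..m.end))
        .collect()
}
```

For a line such as "but Doctor Watson has to have it taken out for him and dusted,\n", a submatch with start 11 and end 17 selects the six bytes of "Watson". A submatch with start equal to end selects an empty span.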

Object: stats

This object is included in end messages and contains summary statistics about a search. It has these fields:

elapsed - A duration object describing the length of time that elapsed while performing the search.
searches - The number of searches that have run. For this printer, this value is always 1. (Implementations may emit additional message types that use this same stats object that represents summary statistics over multiple searches.)
searches_with_match - The number of searches that have run that have found at least one match. This is never more than searches.
bytes_searched - The total number of bytes that have been searched.
bytes_printed - The total number of bytes that have been printed. This includes everything emitted by this printer.
matched_lines - The total number of lines that participated in a match. When matches may contain multiple lines, then this includes every line that is part of every match.
matches - The total number of matches. There may be multiple matches per line. When matches may contain multiple lines, each match is counted only once, regardless of how many lines it spans.
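
The difference between matched_lines and matches can be illustrated with a toy counter. This sketch uses a plain, non-overlapping substring search on single lines rather than the real matcher, so it only demonstrates how the two numbers diverge when a line contains several matches. The function name count_stats is made up, and the needle is assumed to be non-empty.

```rust
/// Count (matched_lines, matches) for a literal `needle` in `haystack`.
///
/// Illustration only: uses non-overlapping substring search per line and
/// does not model multi-line matches.
fn count_stats(haystack: &str, needle: &str) -> (u64, u64) {
    let mut matched_lines = 0;
    let mut total = 0;
    for line in haystack.lines() {
        let n = line.matches(needle).count() as u64;
        if n > 0 {
            matched_lines += 1; // the line counts once, however many hits it has
            total += n; // every individual match counts
        }
    }
    (matched_lines, total)
}
```

For the input "a a\nb\na\n" and the needle "a", this returns (2, 3): two lines participate in a match, but there are three matches in total.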

Object: duration

This object includes a few fields for describing a duration. Two of its fields, secs and nanos, can be combined to give nanosecond precision on systems that support it. It has these fields:

secs - A whole number of seconds indicating the length of this duration.
nanos - A fractional part of this duration represented by nanoseconds. If nanosecond precision isn’t supported, then this is typically rounded up to the nearest number of nanoseconds.
human - A human readable string describing the length of the duration. The format of the string is itself unspecified.
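
On the consumer side, secs and nanos map directly onto the standard library’s Duration type. The sketch below assumes the two fields have already been parsed out of the JSON; JsonDuration and to_std are hypothetical names, and the human field is left out because its format is unspecified.

```rust
use std::time::Duration;

/// Hypothetical holder for the two numeric fields of the duration object.
struct JsonDuration {
    secs: u64,
    nanos: u32,
}

/// Combine whole seconds and the fractional nanoseconds into one value.
fn to_std(d: &JsonDuration) -> Duration {
    Duration::new(d.secs, d.nanos)
}
```

For example, secs 0 and nanos 36296 combine into a duration of 36296 nanoseconds, roughly 36 microseconds.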

Object: arbitrary data

This object is used whenever arbitrary data needs to be represented as a JSON value. This object contains two fields, where generally only one of the fields is present:

text - A normal JSON string that is UTF-8 encoded. This field is populated if and only if the underlying data is valid UTF-8.
bytes - A normal JSON string that is a base64 encoding of the underlying bytes.

More information on the motivation for this representation can be seen in the section text encoding above.
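
A consumer that wants the original bytes back, regardless of which key was present, could proceed roughly as follows. This is a sketch under assumptions: Data, into_bytes and base64_decode are hypothetical names, the decoder assumes the standard base64 alphabet, and it is lenient about padding rather than strictly validating it.

```rust
/// Hypothetical model of an arbitrary data object after JSON parsing.
enum Data {
    /// Value of the `text` key.
    Text(String),
    /// Value of the `bytes` key (still base64 encoded).
    Bytes(String),
}

/// Map one base64 character to its 6-bit value.
fn sextet(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Minimal base64 decoder. Returns None on characters outside the alphabet.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let trimmed = input.trim_end_matches('=');
    let mut out = Vec::with_capacity(trimmed.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator; only the low `bits` bits matter
    let mut bits: u32 = 0;
    for &c in trimmed.as_bytes() {
        acc = (acc << 6) | sextet(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
        }
    }
    Some(out)
}

/// Recover the underlying bytes from either representation.
fn into_bytes(data: Data) -> Option<Vec<u8>> {
    match data {
        Data::Text(s) => Some(s.into_bytes()),
        Data::Bytes(b64) => base64_decode(&b64),
    }
}
```

Applied to the bytes value "L2hvbWUvdWJ1bnR1L2xpYv8ucnM=" from the text encoding section, into_bytes returns the 20 bytes of /home/ubuntu/lib\xFF.rs, including the invalid \xFF byte.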

Example

This section shows a small example that includes all message types.

Here’s the file we want to search, located at /home/andrew/sherlock:

For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.

Searching for Watson with a before_context of 1 with line numbers enabled shows something like this using the standard printer:

sherlock:1:For the Doctor Watsons of this world, as opposed to the Sherlock
--
sherlock-4-can extract a clew from a wisp of straw or a flake of cigar ash;
sherlock:5:but Doctor Watson has to have it taken out for him and dusted,

Here’s what the same search looks like using the JSON wire format described above, where we show semi-prettified JSON (instead of a strict JSON Lines format) for illustrative purposes:

{
  "type": "begin",
  "data": {
    "path": {"text": "/home/andrew/sherlock"}
  }
}
{
  "type": "match",
  "data": {
    "path": {"text": "/home/andrew/sherlock"},
    "lines": {"text": "For the Doctor Watsons of this world, as opposed to the Sherlock\n"},
    "line_number": 1,
    "absolute_offset": 0,
    "submatches": [
      {"match": {"text": "Watson"}, "start": 15, "end": 21}
    ]
  }
}
{
  "type": "context",
  "data": {
    "path": {"text": "/home/andrew/sherlock"},
    "lines": {"text": "can extract a clew from a wisp of straw or a flake of cigar ash;\n"},
    "line_number": 4,
    "absolute_offset": 193,
    "submatches": []
  }
}
{
  "type": "match",
  "data": {
    "path": {"text": "/home/andrew/sherlock"},
    "lines": {"text": "but Doctor Watson has to have it taken out for him and dusted,\n"},
    "line_number": 5,
    "absolute_offset": 258,
    "submatches": [
      {"match": {"text": "Watson"}, "start": 11, "end": 17}
    ]
  }
}
{
  "type": "end",
  "data": {
    "path": {"text": "/home/andrew/sherlock"},
    "binary_offset": null,
    "stats": {
      "elapsed": {"secs": 0, "nanos": 36296, "human": "0.0000s"},
      "searches": 1,
      "searches_with_match": 1,
      "bytes_searched": 367,
      "bytes_printed": 1151,
      "matched_lines": 2,
      "matches": 2
    }
  }
}
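
The line_number and absolute_offset values in the example can be reproduced by walking the file line by line and summing line lengths, terminators included. The following is a sketch for checking the arithmetic, not how the searcher computes offsets internally; SHERLOCK and line_offsets are names chosen for this illustration, and the file is assumed to end with a trailing newline.

```rust
/// The example file from above, with a trailing newline.
const SHERLOCK: &str = "\
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.
";

/// Return (line_number, absolute_offset) for every line in `haystack`.
///
/// Line numbers start at 1. The offset of a line is the number of bytes
/// that precede it, counting each `\n` terminator.
fn line_offsets(haystack: &[u8]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut offset = 0u64;
    for (i, line) in haystack.split_inclusive(|&b| b == b'\n').enumerate() {
        out.push((i as u64 + 1, offset));
        offset += line.len() as u64;
    }
    out
}
```

Calling line_offsets(SHERLOCK.as_bytes()) yields (1, 0) for the first line, (4, 193) for the context line and (5, 258) for the second match, and the total length of 367 bytes agrees with bytes_searched in the end message.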

Implementations

impl<W: Write> JSON<W>

pub fn new(wtr: W) -> JSON<W>

pub fn sink<'s, M: Matcher>( &'s mut self, matcher: M ) -> JSONSink<'static, 's, M, W>

pub fn sink_with_path<'p, 's, M, P>( &'s mut self, matcher: M, path: &'p P ) -> JSONSink<'p, 's, M, W> where M: Matcher, P: ?Sized + AsRef<Path>,

impl<W> JSON<W>

pub fn has_written(&self) -> bool

pub fn get_mut(&mut self) -> &mut W

pub fn into_inner(self) -> W

Trait Implementations

impl<W: Debug> Debug for JSON<W>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations

impl<W> RefUnwindSafe for JSON<W> where W: RefUnwindSafe,

impl<W> Send for JSON<W> where W: Send,

impl<W> Sync for JSON<W> where W: Sync,

impl<W> Unpin for JSON<W> where W: Unpin,

impl<W> UnwindSafe for JSON<W> where W: UnwindSafe,

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>