I'm going to use a perhaps unusual structure for this review. Consuming
text is a tiring activity that demands energy. With that in mind, I've
ignored the order in which the library is presented in the docs and moved
the discussion of all trivial concepts to the end. That's my attempt to
make sure the message gets through while the reader still has it -- the energy.
## A short introduction to pull parsers vs push parsers
The act of parsing an incoming stream will generate events from a
predefined set. Users match the generated events against rules to trigger
actions. For the following JSON sample:
```json
{ "foo" : 42 , "bar" : 33 }
```
One such event stream could be:
```
begin_object -> string_key -> number -> string_key -> number -> end_object
```
The essential difference between pull parsers and push parsers is how this
event stream is delivered. A push parser (a.k.a. a SAX parser) uses the
callback handler model. A pull parser (a.k.a. a StAX parser) stores the
current parsing state and allows the user to query this state.
Essentially every push parser has an implicit pull parser that doesn't get
exposed to the user. This is to say that a push parser would be implemented
along the lines of:
```cpp
// parsing loop
while (it != end) {
    match_token();
    dispatch_token(user_handlers);
}
```
Whereas a pull parser doesn't have the loop construct above -- just its
body. The loop construct is under user control. That's why pull parsers
are the more fundamental model: a push parser can be built on top of a
pull parser, but not the other way around.
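To make the contrast concrete before the real examples, here's a minimal sketch of that user-controlled loop. The `reader`, `next()` and `kind()` names are hypothetical, chosen only for this sketch (they don't come from any specific library):

```cpp
// The pull model: the loop lives in user code, which queries the parser's
// state instead of registering callbacks with it.
while (reader.next()) {
    switch (reader.kind()) {
    case token::string_key:
        // plain user code takes the place of dispatch_token()
        break;
    // handle the other events here...
    default:
        break;
    }
}
```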
Concrete examples will be shown later in this text.
## Performance: a common misconception
Some push parsing proponents believe pull parsers are slower. This belief
is unsubstantiated. As we have seen in the previous section, a push parser
is always built on top of an implicit pull parser. The only thing that
differentiates the two is that, under the pulling model, a small part of
the parser logic moves out of the library's interface to become user
code. That's the only difference.
One of the arguments brought up by push parser proponents is the ability to
merge the matching and decoding steps to perform less work. We can use the
previous sample to illustrate this idea:
```
{ "foo" : 42 , "bar" : 33 }
  ^^^^^
```
If we are to decode the `"foo"` token above, the parser first needs to
discover where the token begins and where the token ends. This is the
required matching step that every parser needs to perform. JSON strings can
be escaped and one such example would be:
```json
"na\u00EFve"
```
The decoding step takes care of converting escaped string slices into
unescaped ones. So user code would receive 3 string runs while this token
is being decoded:
- "na"
- "ï"
- "ve"
The parser does not allocate any memory to decode any string. The slices
are fed to the user, who in turn is responsible for putting them together
(possibly allocating heap memory).
The idea here is... having separate match and decode steps forces the
parser to walk over the tokens twice. For Boost.JSON, the merged match and
decode steps materialize in the `on_string_part()` (and `on_string()`)
callback:
```cpp
bool on_string_part(string_view v, std::size_t, error_code& ec)
{
    current_str.append(v.data(), v.size());
    return true;
}
```
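Boost.JSON pairs that callback with `on_string()`, which receives the final piece of the token (or the whole token when it isn't split). A sketch of how the pair fits together -- `consume()` below is a hypothetical user function, not part of the library:

```cpp
bool on_string(string_view v, std::size_t, error_code& ec)
{
    // Last slice: the collected value is now fully decoded.
    current_str.append(v.data(), v.size());
    consume(current_str); // hypothetical user code
    current_str.clear();
    return true;
}
```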
For a pull parser with _separate_ match and decode steps the interface
would be along the lines of:
```cpp
reader.next();
if (reader.kind() == token::string)
    reader.string(current_str);
```
But as we've seen before, the only difference between pull and push parsers
is something else. Any features lateral to this classification originate
from different taxonomies. Just as a push parser can merge the matching and
decoding steps, so can a pull parser:
```cpp
// A new overload.
//
// `my_match_options` here would provide
// the string collector upfront.
reader.next(my_match_options);
```
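To make that sketch a little more concrete, here is one shape such an overload could take. Everything below is hypothetical -- `match_options` and the `next(options)` overload are illustrations, not an existing API:

```cpp
// Hypothetical options bag handed to the reader so that matching and
// decoding happen in a single pass, just like in the push model.
struct match_options
{
    std::string* string_collector = nullptr; // where decoded slices go
};

std::string current_str;
match_options opts;
opts.string_collector = &current_str;

// One call: the reader matches the token and, when it is a string,
// decodes it straight into *opts.string_collector.
reader.next(opts);
```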
Another small note: just as merging the matching and decoding steps can be
beneficial performance-wise, so can skipping the decoding step
altogether. I'll touch on this idea again in the section below.
Appendix A will present how another feature from Boost.JSON can be
implemented in pull parsers: strings split among different buffers.
## The case for pull parsers
So far we've only been exposed to the difference between push parsers and
pull parsers, but no case has been made for which choice is ideal (aside
from the already established point that pull parsers are the more
fundamental model and push parsers can be built on top when desired).
The defining trait present in pull parsers that is impossible to replicate
in push parsers is their ability to compose algorithms. Composition is
enabled by temporarily handing over the token stream to a different
consumer. Not every consumer needs to know how to process every part of the
stream. That's a trait impossible to find in push parsers.
I have created a [gawk](https://www.gnu.org/software/gawk/) plug-in to
illustrate this property. The same plug-in was implemented several times
using different libraries/approaches. I incrementally added more features
under different branches (there are 8 branches in total). You can find a
rationale for the AWK choice in Appendix B, at the end of this review.
The plug-in's code can be found at
https://gitlab.com/vinipsmaker/gawk-jsonstream-archive. Please forgive my
beginner code -- I had no experience writing gawk plug-ins before this
one; it's my first.
AWK is a C-like language (in syntax) where you define pattern-action
pairs. You can find a quick'n'dirty introduction to the language on
Wikipedia. This plug-in adds JSON capabilities to the tool. Given the
following data-set:
```
{"name": "Beth", "pay_rate": 4.00, "hours_worked": 0}
{"name": "Dan", "pay_rate": 3.75, "hours_worked": 0}
{"name": "Kathy", "pay_rate": 4.00, "hours_worked": 10}
{"name": "Mark", "pay_rate": 5.00, "hours_worked": 20}
{"name": "Mary", "pay_rate": 5.50, "hours_worked": 22}
{"name": "Susie", "pay_rate": 4.25, "hours_worked": 18}
```
And the following AWK program (with this plug-in loaded):
```awk
BEGIN {
    JPAT[1] = "/name"
    JPAT[2] = "/pay_rate"
    JPAT[3] = "/hours_worked"
}
J[3] > 0 { print J[1], J[2] * J[3] }
```
The output will be:
```
Kathy 40
Mark 100
Mary 121
Susie 76.5
```
The paths for the JSON tree are given in [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) syntax. That's the plugin's
core and it's the code that doesn't change among branches, so it's the
appropriate place to begin the explanation.
In the file `jpat.ypp` you'll find a `read_jpat()` function that gets
called for each record. There's some boilerplate at the beginning to bail
out early if there were no changes to `JPAT`. Then there is this snippet
that gets processed by [re2c](https://re2c.org/) to generate valid C++
code:
```cpp
* {
    if (*begin != '\0')
        return bad_path();

    jpat[i].components.emplace_back(std::move(current_component));
    goto end;
}

{prefix} {
    jpat[i].components.emplace_back(std::move(current_component));
    current_component = std::string{};
    goto loop;
}

{unescaped} {
    current_component.append(begin, YYCURSOR - begin);
    goto loop;
}

{escaped_tilde} {
    current_component.push_back('~');
    goto loop;
}

{escaped_slash} {
    current_component.push_back('/');
    goto loop;
}
```
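For readers who don't use re2c: the rules above are essentially RFC 6901's tokenization and unescaping (`~1` stands for `/` and `~0` stands for `~`). A rough plain-C++ equivalent -- a sketch only, the plug-in's actual state machine differs in its details -- would be:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Split a JSON Pointer such as "/pay_rate" or "/a~1b" into its unescaped
// reference tokens ("pay_rate"; "a/b").
std::vector<std::string> parse_json_pointer(const std::string& pointer)
{
    std::vector<std::string> components;
    if (pointer.empty())
        return components; // "" refers to the whole document
    if (pointer.front() != '/')
        throw std::invalid_argument{"bad path"};

    std::string current;
    for (std::size_t i = 1; i < pointer.size(); ++i) {
        char c = pointer[i];
        if (c == '/') {
            components.push_back(std::move(current));
            current.clear();
        } else if (c == '~') {
            if (i + 1 < pointer.size() && pointer[i + 1] == '0')
                current.push_back('~');
            else if (i + 1 < pointer.size() && pointer[i + 1] == '1')
                current.push_back('/');
            else
                throw std::invalid_argument{"bad escape"};
            ++i; // consume the escape digit
        } else {
            current.push_back(c);
        }
    }
    components.push_back(std::move(current));
    return components;
}
```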
The definitions for the data structures follow:
```cpp
struct Path
{
    std::vector<std::string> components;
    awk_value_cookie_t cookie = nullptr;
    awk_string_t orig;
};

struct Tree: std::map<std::string, std::variant<Tree, size_t>, std::less<>>
{};
```
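To make the decision-tree idea concrete: the JPAT from the earlier AWK program produces a tree with three leaves, each leaf holding the index of the `J` entry it fills. A purely illustrative sketch (not code from the repository; whether indexes are stored 0- or 1-based is a plug-in detail I'm glossing over):

```cpp
// Decision tree for:
//   JPAT[1] = "/name"; JPAT[2] = "/pay_rate"; JPAT[3] = "/hours_worked"
Tree tree;
tree["name"] = size_t{1};
tree["pay_rate"] = size_t{2};
tree["hours_worked"] = size_t{3};
```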
Let's dive into the code to see how the branches differ.
### `basic`
Every branch reads records as one JSON document per line. That's the
`get_record()` function. For the `boostjson-dom` branch, the code goes
along the lines of:
```cpp
try {
    parsed_value.emplace_null();
    json_parser.reset(parsed_value.storage());

    std::error_code ec;
    json_parser.write(line.data(), line.size(), ec);
    if (ec)
        throw std::system_error{ec};

    json_parser.finish(ec);
    if (ec)
        throw std::system_error{ec};

    parsed_value = json_parser.release(ec);
    assert(!ec);

    switch (parsed_value.kind()) {
    case json::kind::array:
        fill_j(parsed_value.get_array(), tree);
        break;
    case json::kind::object:
        fill_j(parsed_value.get_object(), tree);
        break;
    // ...
    }
} catch (const std::exception& e) {
    warning(ext_id, "Error while processing record: %s", e.what());
}
```
The code parses the whole JSON document at once and then proceeds to
consume it by calling the `fill_j()` function:
```cpp
// Several overloads:
static void fill_j(const json::array& arr, Tree& node);
static void fill_j(const json::object& obj, Tree& node);
static void fill_j(const json::value& v, Tree& node);
// Here is the object overload. You can find the rest in the repository.
static void fill_j(const json::object& obj, Tree& node)
{
    for (auto& [key, value]: node) {
        auto it = obj.find(key);
        if (it == obj.end())
            continue;

        std::visit(
            hana::overload(
                [&](Tree& tree) { fill_j(it->value(), tree); },
                [&](size_t idx) { consume_value(it->value(), idx); }),
            value);
    }
}
```
The code is pretty straightforward. We check whether the fields the user
asked for are present in the document and recurse on them. When we reach a
leaf node, the `J` AWK variable is set (that'd be the `consume_value()`
function).
Do notice that there is no uncontrolled recursion here. Recursion is always
JPAT-delimited. And JPAT is part of the source code, not user-input. There
is no risk of stack overflow here.
The pull parser approach (branch `master`) somewhat follows the same trend,
but the code is bigger. Here's a selected highlight:
```cpp
void fill_j(json::reader& reader, Tree& node)
{
    // ...
    switch (reader.symbol()) {
    case json::token::symbol::error:
        throw json::error{reader.error()};
    case json::token::symbol::begin_object: {
        if (!reader.next()) throw json::error{reader.error()};

        std::string current_key;
        for (;;) {
            if (reader.symbol() == json::token::symbol::end_object) {
                reader.next();
                break;
            }

            // Key
            assert(reader.symbol() == json::token::symbol::string);
            current_key.clear();
            auto ec = reader.string(current_key);
            assert(!ec); (void)ec;
            if (!reader.next()) throw json::error{reader.error()};

            // Value
            auto it = node.find(current_key);
            if (it == node.end()) {
                json::partial::skip(reader);
                continue;
            }

            std::visit(
                hana::overload(
                    [&](Tree& tree) {
                        switch (reader.symbol()) {
                        case json::token::symbol::begin_array:
                        case json::token::symbol::begin_object:
                            fill_j(reader, tree);
                            break;
                        default:
                            json::partial::skip(reader);
                        }
                    },
                    consume_value),
                it->second);
        }
        if (reader.symbol() == json::token::symbol::error)
            throw json::error{reader.error()};
    }
    // ...
    }
}
```
And here is the first sample of what I meant by composability. Our
"callback" only gets called to handle trees that are part of the decision
tree one way or another (and the root node always belongs here). When it
reaches a subtree that doesn't belong to the JPAT-defined decision tree,
`json::partial::skip()` is called. `json::partial::skip()` is defined
elsewhere and has no knowledge of the internals of our algorithm. This is a
different "callback" being called to handle a different part of the
tree. The two of them know how to handle different groups of tokens.
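For illustration, here is roughly how such a generic skip can be written on top of a pull reader. This is a sketch of the idea using the `reader` interface seen above, not necessarily how `json::partial::skip()` is actually implemented:

```cpp
void skip(json::reader& reader)
{
    // Consume exactly one value: an atom, or a whole subtree delimited by
    // matching begin/end tokens.
    int depth = 0;
    do {
        switch (reader.symbol()) {
        case json::token::symbol::begin_array:
        case json::token::symbol::begin_object:
            ++depth;
            break;
        case json::token::symbol::end_array:
        case json::token::symbol::end_object:
            --depth;
            break;
        case json::token::symbol::error:
            throw json::error{reader.error()};
        default:
            break; // atoms don't change the nesting depth
        }
        if (!reader.next() && depth > 0)
            throw json::error{reader.error()};
    } while (depth > 0);
}
```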
A nice side effect is that these "phantom" trees will be fully skipped
over. We don't allocate strings to collect values over them. This is an
optimization we get for free. [Designing JSON parsers to exploit this idea
is not unknown](https://github.com/zserge/jsmn).
The last one is the code for the `boostjson` branch, but I feel it'll be
easier to understand if I first introduce the tasks it has to perform on
the token stream.
```
{ "foo" : 42 , "bar" : { "foo" : 33 } }
                                 ^^
```
When the field's value is parsed, we must match the key against our
rule-set to possibly trigger an action. But it is not enough to store the
current key. The whole path must be stored to support JSON's nestable
structure. We start with the structure that saves the current context
information:
```cpp
struct State
{
    bool is_object; //< false = ARRAY
    std::string current_key;
    size_t current_idx;

    Tree* node;
};
std::vector<State> state;
```
So the beginning of the code for `on_object_begin()` must be:
```cpp
Tree* node = nullptr;
if (state.size() == 0) {
    node = &tree;
} else if (state.back().node) {
    auto it = state.back().node->find(current_key());
    if (it != state.back().node->end())
        node = std::get_if<Tree>(&it->second);
}
state.emplace_back();
state.back().is_object = true;
state.back().node = node;
```
And a matching action in the code for `on_object_end()`:
```cpp
state.pop_back();
```
But it's not really that simple. Paths such as
`/features/10000/properties/LOT_NUM` won't work because `current_key` won't
be provided by any `on_key()`-like callback when current node is
`10000`. `current_key` must be computed for arrays in the value handler
itself. That's why the following snippet must be present in the beginning
of every value-callback (and `on_{object,array}_begin()` too):
```cpp
if (state.size() > 0 && /*is_array=*/!state.back().is_object)
    update_current_array_key();
```
And the matching helper functions:
```cpp
std::string_view current_key()
{
    assert(state.size() > 0);
    return state.back().current_key;
}
void update_current_array_key()
{
    assert(state.size() > 0);
    assert(!state.back().is_object);

    state.back().current_key.resize(max_digits);
    auto s_size = std::to_chars(
        state.back().current_key.data(),
        state.back().current_key.data() + state.back().current_key.size(),
        state.back().current_idx++
    ).ptr - state.back().current_key.data();
    state.back().current_key.resize(s_size);
}
```
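`max_digits` isn't shown in these snippets. One reasonable definition -- an assumption of this review, not necessarily the plug-in's code -- is the number of characters needed for the decimal representation of any `size_t` (note that `std::to_chars` above also requires `<charconv>`):

```cpp
#include <cstddef>
#include <limits>

constexpr std::size_t max_digits =
    std::numeric_limits<std::size_t>::digits10 + 1;
```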
The final touch is to realize that the event stream will be something like:
```
string_key -> value -> string_key -> value -> string_key -> value -> ...
```
And we need to clear the `string_key` by the end of every value consumed:
```cpp
if (state.size() > 0) state.back().current_key.clear();
```
So the callback implementation for a simple atom value such as a string is:
```cpp
bool on_string_part(std::string_view v, std::error_code&)
{
    if (state.size() > 0) {
        if (state.back().node == nullptr)
            return true;

        auto it = state.back().node->find(current_key());
        if (it == state.back().node->end())
            return true;

        if (!std::holds_alternative<size_t>(it->second))
            return true;

        // ... (the rest of the callback can be found in the repository)
    }

    return true;
}
```
Compare this with the claim made earlier in this review:

> The defining trait present in pull parsers that is impossible to replicate in push parsers is their ability to compose algorithms. Composition is enabled by temporarily handing over the token stream to a different consumer. Not every consumer needs to know how to process every part of the stream. That's a trait impossible to find in push parsers.
Every callback depends on implementation knowledge from every other
callback. Spaghetti code, as they say.

I also implemented another change to both branches -- `master` and
`boostjson` -- to further demonstrate this property. What if we want to
store the string representation of array indexes in static arrays instead
of `std::string` objects?

There's this changeset for the `master` branch:
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -184,20 +184,19 @@ void fill_j(json::reader& reader, Tree& node)
         case json::token::symbol::begin_array: {
             if (!reader.next()) throw json::error{reader.error()};

-            std::string current_key;
+            std::array<char, max_digits> current_key;
```
## Appendix A: Strings split among different buffers

Boost.JSON's parser can be fed its input in chunks, so a single string token may be delivered split among different buffers -- that's part of what `on_string_part()` deals with. Can a pull parser offer the same feature?

> But as we've seen before, the only difference between pull and push parsers is something else. Any features lateral to this classification originate from different taxonomies.
The enabling feature here is not the push interface. Rather it is the chosen set of generated events for the JSON token stream. If you change the notified events for the stream, you enable the same feature. For a pull parser, these are the events it'd have to provide (among other valid choices):

- `string` (if non-split)
- `string_run_begin`
- `string_run`
- `string_run_end`

## Appendix B: Why gawk?

There are a few reasons why I've chosen a (GNU) AWK plug-in to illustrate the concepts presented in this review.

First, I don't want to invent a new tool. Artificial examples could always be created to fabricate an argument. AWK is one of the classical UNIX tools. It's established and old. From the [gawk manual](https://www.gnu.org/software/gawk/manual/html_node/History.html):
> The original version of `awk` was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful [...] This new version became widely available with Unix System V Release 3.1 (1987).
We're talking about 40 years of history here. The tool is established -- that's the first reason.

Another reason why I've chosen AWK is that AWK has no JSON processing facilities, so the plug-in fills a gap in an existing tool.

The third reason why I've chosen GNU AWK to create this plug-in is that gawk already has an extension mechanism. Therefore I don't have the freedom to modify the AWK processor to accommodate the JSON library's idiosyncrasies. On the contrary, it's the JSON library that has to prove its flexibility.

Now there's another question to answer here: why take MongoDB as inspiration when designing the gap to be filled in AWK (such as the `$unwind` operator)? And the answer is the same. MongoDB is one of the JSON-oriented DBMSs that got popular back in the NoSQL boom days. The end result tries to fit into AWK's mindset and not the other way around.

AWK is a tool designed for tabular data processing. JSON is in no way tabular, but its structure allows us to build a tabular view of the underlying data. There are two main variables in gawk that control how fields are extracted:

- `FS`: defines how to find the field _separators_.
- `FPAT`: defines how to find the field _contents_.

The plug-in just follows this trend and introduces a `JPAT` variable that also instructs how to find the field _contents_. `JPAT` is an array and each element holds a [standard JSON Pointer](https://tools.ietf.org/html/rfc6901) expression -- a syntax that was designed exactly for this purpose.

There's one very interesting fact about this outcome. The usage pattern here removes from the library the separate step that processes the stream (the step that would happen with the DOM approach) and embeds much of the processing logic in the parsing phase itself. That's a continuation of the trend we saw earlier of merging the matching and decoding steps. It's also the same approach we'd use to build serious JSON processing tools such as [JMESPath](https://jmespath.org/) (some JMESPath expressions require a partial DOM, but a hybrid approach is doable anyway, as has already been demonstrated in the `JUNWIND` code).

## Appendix C: Static decision trees with Boost.Hana

The reader may be interested to know that the decision tree that assists the parsing process in our (GNU) AWK plug-in is useful not only for AWK but for C++ programs as well. The decision tree strategy may even use Boost.Hana CT structures to fully unroll the code. It can be used to implement facilities such as:

```cpp
using hana::literals::operator""_s;

int foo;
std::string bar;
json::scanf(input, "{ foo: { bar: ? }, bar: ? }"_s, foo, bar);
```

That's also possible with Boost.JSON's current parser. But the following is not:

```cpp
using hana::literals::operator""_s;

Foo foo;
std::string bar;
json::scanf(
    input,
    "{ foo: { bar: ? }, bar: ? }"_s,
    [&foo](json::reader& reader) {
        // ...
    },
    bar);
```

As has been explained before, push parsers don't compose. And you aren't limited to root-level scanning: you should have a `json::partial::scanf()` to act on subtrees too. A prototype for this idea can be found at https://github.com/breese/trial.protocol/pull/43.

## Review questions
> Please be explicit about your decision (ACCEPT or REJECT).
REJECT.
> How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I've stalled a project of mine for a few weeks just to participate in this review process. I think I've spent close to 50 hours. I wrote a JSON tool in Boost.JSON and the same tool under a different parser model to highlight important traits that were being dismissed in previous debates. I've gone through the documentation and source code to better understand some Boost.JSON details. I've also been involved in previous Boost.JSON debates in this very mailing list.
> Are you knowledgeable about the problem domain?
I've tested many JSON parser libraries in C++ to this point. In my last job, I leveraged a JSON (pull) parser to merge multiple validation layers into a one-pass operation, and later leveraged Boost.Hana to automate the parser generation. Our API is JSON-based and we apply plenty of transformations (including DOM-based ones when unavoidable), some similar to the GNU AWK plug-in demonstrated here (e.g. synthesizing a new field on routed messages while avoiding DOM-tree-related costs). I've contributed a few algorithms and subtree parsing ideas back to the leveraged JSON (pull) parser. I offered the same composability ideas to Boost.JSON a few months back when the topic appeared here in this very mailing list.

In the past, I've designed parsers in C++ -- pull and push parsers. I've played with some tools for parser generation such as re2c, gperf (gperf might not really be parsing, but it is related), and Boost.Xpressive. I've also experienced the pain of parsers that try to be too lightweight but ultimately backfire by moving not only performance costs back to the consumer but also greatly increasing complexity.

--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/