Hi Everyone,
I recommend that we *ACCEPT* the library into Boost. It meets the standard for
Boost, and I am definitely going to use it.
I tried to use the library in two configurations in header-only mode:
g++ 9.2.0 with -std=c++2a on MinGW on Windows,
g++ 10.2.1 on Fedora with -DBOOST_JSON_STANDALONE and -std=c++20
All worked fine.
I studied library docs and played with examples for about 30 hours. I am
not particularly knowledgeable about the problem domain. But my programs
have to consume and produce JSON files.
DOCUMENTATION
It is very good, and meets the highest Boost standards. I mean both the
tutorial and reference sections. While doing this review I had to look at
the documentation of the competing libraries, and they are really bad. This
is why I can appreciate the big effort you put in making these docs. There
is a section "Comparison" that compares Boost.JSON to other JSON libraries.
I personally think it should also mention the documentation.
Some observations:
1. Add a reference to RFC 7159 (https://www.rfc-editor.org/info/rfc7159). It
allows implementations to set limits on the range and precision of numbers
accepted.
2. The documentation for array, object, and index mentions "Satisfies the
requirements of ContiguousContainer, ReversibleContainer, and
SequenceContainer". These requirements are never defined or referenced. The
user will not even know whether you defined them or whether they are "commonly
known". Maybe add links to cppreference.com:
https://en.cppreference.com/w/cpp/named_req/ReversibleContainer
https://en.cppreference.com/w/cpp/named_req/ContiguousContainer
https://en.cppreference.com/w/cpp/named_req/SequenceContainer
3. The `string` page has two "Member Functions" sections:
https://master.json.cpp.al/json/ref/boost__json__string.html
4. Documentation for value::operator= seems corrupt:
https://master.json.cpp.al/json/ref/boost__json__value/operator_eq_/overloa…
* it has sections with no content
* It is an unconstrained template; it looks like it can accept any type
5. Examples of input JSON-s contain a lot of escape characters in string
literals, like in
https://master.json.cpp.al/json/usage/conversion.html
This makes it difficult to read. Consider using raw string literals.
Instead of
assert( serialize( jv ) == "{\"x\":4,\"y\":1,\"z\":4}" );
We would have
assert( serialize( jv ) == R"({"x":4,"y":1,"z":4})" );
6. Some functions (like operator== or swap(x, y)) are naturally part of a
class's interface, even though they are not member functions. It is annoying
and confusing that rather than being listed along with the class they
belong to, they are listed separately as "free functions".
The docs are not uniform about this issue. For `json::value`, operator== is
listed as a friend along with the member functions, but the 2-argument
swap() is not.
7. In https://master.json.cpp.al/json/usage/conversion.html we read:
" Given the following definition of customer::customer( const value& )
[...] Objects of type customer can be converted *to* and from value".
This "to" is surprising here. How can a constructor converting from `value`
to `customer` could be used to convert `customer` to `value`?
8. As discussed in another thread, the documentation would benefit from
information on the design choices related to implementation limits.
a. Any json::value that you can build can be serialized and then
deserialized, and you are guaranteed that the resulting json::value will be
equal to the original.
b. JSON inputs where number values cannot be represented losslessly in
uint64_t, int64_t, or double may render different values when parsed and
then serialized back, and extremely large number values can even fail to
parse.
c. Whatever JSON output you can produce with this library, we guarantee it
can be parsed by any common JSON implementation (one probably also based on
a uint64_t+int64_t+double representation).
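To illustrate what I mean in point 7: if I understand the design correctly,
the "to value" direction is not handled by that constructor at all, but by a
separate customization point. A minimal sketch of that other half, assuming
the field names from the docs' customer example and the tag_invoke/value_from
mechanism described in the conversion section (I have not verified the exact
signatures):

```cpp
// Sketch only: the "customer -> value" half of the conversion via a
// tag_invoke overload found through ADL. Field names are assumptions
// taken from the docs' example.
#include <boost/json.hpp>
#include <cstdint>
#include <string>

struct customer
{
    std::uint64_t id;
    std::string name;
    bool delinquent;
};

void tag_invoke( const boost::json::value_from_tag&, boost::json::value& jv,
                 const customer& c )
{
    // Builds an object with one member per field.
    jv = { { "id", c.id }, { "name", c.name }, { "delinquent", c.delinquent } };
}

// Usage:
//   boost::json::value jv = boost::json::value_from( customer{ 1, "Alice", false } );
```

So the constructor covers only the "from value" half, and the "to value" half
needs a separate overload like the one above.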
DESIGN
Clear, sound and practical. The design decisions correspond with my
expectations of a library: fast, easy to use and learn, with value
semantics. The interface is very intuitive, I practically do not have to
consult the documentation: every functionality (serialization, value
alteration) has the interface that I would expect. I appreciate that it
targets embedded environments.
Some observations:
1. The handling of infinities and NaNs needs to be addressed. Currently, I
can do:
`json::serialize(std::numeric_limits<double>::infinity())`
and it produces invalid JSON. You cannot easily put a precondition on
json::value assignment/construction because you expose the double as a
mutable member:
void fun(double x)
// precondition: x is not infinity or NaN
{
    json::value v = x;     // ok
    v.as_double() *= 2.0;  // this may produce infinity
}
So maybe you need a precondition in the serializer.
2. Boost.JSON provides its own string type. This raises a lot of design
questions for `json::string`. Is it meant as a better replacement for std::string?
std::string provides the following capability:
```
void set(std::string& s, char v, char x, char z, char k)
{
if (cond)
s = {v, k};
else
s = {v, x, z, k};
}
```
json::string does not. How do I do a similar thing with json::string? (A
possible workaround is sketched after this list.)
3. string::swap() has a strange precondition: you cannot swap an object
with itself
https://master.json.cpp.al/json/ref/boost__json__string/swap.html
This is really unexpected and potentially dangerous, and at minimum
requires a very strong rationale. This would be the first swap function in
std or Boost that has a precondition. While discussing a somewhat related
issue (on the guarantees of the moved-from state) in the LEWG, someone
remarked that there was at least one STL implementation that did a
self-swap for convenience of implementation.
Is it really necessary?
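Back to design point 2, one possible workaround -- a sketch only, assuming
json::string keeps std::string-like assignment from a string_view (I have not
checked every overload) -- is to build the character sequence first and assign
it in one go:

```cpp
// Sketch: emulate std::string's braced-list assignment for json::string.
// `cond` is an unspecified flag, as in the example above.
#include <boost/json.hpp>

void set(boost::json::string& s, char v, char x, char z, char k, bool cond)
{
    if (cond) {
        const char buf[] = { v, k };
        s = boost::json::string_view(buf, sizeof buf);
    } else {
        const char buf[] = { v, x, z, k };
        s = boost::json::string_view(buf, sizeof buf);
    }
}
```

It works, but it is clearly clunkier than the std::string version.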
IMPLEMENTATION
I didn't look at it much. Just some quick points I observed.
1. Values of type `double` representing infinity and NaN can be stored in
json::value, and when later serialized they render malformed JSON
output. json::value should either validate special values of `double` or
state as a precondition that they must not be assigned. Alternatively,
the serializer should either validate this or state it as a precondition.
(A sketch of such a check appears after this list.)
2. tag_invoke() takes its tag arguments by reference; does it have to? Passing
by reference requires the constexpr tag variables to be instantiated and to
have addresses.
3. Customization through ADL-discovered overloads can cause ODR violations.
There is no superior alternative, though. But maybe put a warning in the
docs that the user must make sure that all translation units see the
same set of overloads.
4. I do not like the 2-state-switch mechanics of BOOST_JSON_STANDALONE. It
assumes that a user either wants a no-Boost version or wants the whole
package: all Boost components.
But what if I am using C++14 and I want to use std::error_code together with
the Boost-emulated memory_resource and string_view? What if I use the Boost
version with C++17 and want to use std::string_view?
Maybe you can additionally provide macros for turning the Boost-emulation
libraries on or off one by one? I mean, one for string_view, one for
error_code, and one for memory_resource.
5. The header-only mode -- I really appreciate it. It is good for initial
experiments.
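For implementation point 1, here is a sketch of what the suggested check could
look like on the caller's side. This is not part of the library; it only
assumes the documented kind()/get_*() accessors and serialize():

```cpp
// Sketch: reject non-finite doubles before handing the value to
// json::serialize(). Recurses through objects and arrays.
#include <boost/json.hpp>
#include <cmath>
#include <stdexcept>
#include <string>

void check_finite(const boost::json::value& jv)
{
    namespace json = boost::json;
    switch (jv.kind()) {
    case json::kind::double_:
        if (!std::isfinite(jv.get_double()))
            throw std::invalid_argument("non-finite double has no JSON representation");
        break;
    case json::kind::array:
        for (const auto& e : jv.get_array())
            check_finite(e);
        break;
    case json::kind::object:
        for (const auto& kv : jv.get_object())
            check_finite(kv.value());
        break;
    default:
        break; // other kinds cannot hold non-finite values
    }
}

std::string checked_serialize(const boost::json::value& jv)
{
    check_finite(jv);
    return boost::json::serialize(jv);
}
```

Of course, having something like this inside the serializer (or a documented
precondition) would be preferable to every user writing their own wrapper.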
Finally, I would like to thank Vinnie and Krystian for writing and sharing
this library.
Regards,
&rzej;
The Boost formal review of the FPR (Flat Precise Reflection, ex Magic
Get, ex PODs Flat Reflection) will take place from September 28, 2020
to October 7, 2020.
The library is meant for accessing structure elements by index and
providing other std::tuple-like methods for user-defined types without
any macros or boilerplate code.
The library is authored by Antony Polukhin, author of the Boost.DLL,
Stacktrace, and TypeIndex libraries.
Documentation: http://apolukhin.github.io/magic_get/
Source: https://github.com/apolukhin/magic_get
If the description isn't immediately obvious to you, here's a
motivating example:
// requires: C++14
#include <iostream>
#include <string>
#include "boost/pfr/precise.hpp"

struct some_person {
    std::string name;
    unsigned birth_year;
};

int main() {
    some_person val{"Edgar Allan Poe", 1809};
    std::cout << boost::pfr::get<0>(val)                     // No macro!
              << " was born in " << boost::pfr::get<1>(val); // Works with any aggregate initializables!
}
See more at: https://github.com/apolukhin/magic_get/blob/develop/README.md
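For completeness, iterating over all fields is also possible. A small sketch,
assuming a for_each_field helper is available from precise.hpp (treat the
exact name and header as an assumption; check the README for the current API):

```cpp
// Sketch: visit every field of the aggregate, in declaration order.
#include <iostream>
#include <string>
#include "boost/pfr/precise.hpp"

struct some_person {
    std::string name;
    unsigned birth_year;
};

int main() {
    some_person val{"Edgar Allan Poe", 1809};
    boost::pfr::for_each_field(val, [](const auto& field) {
        std::cout << field << '\n';   // prints each field without any macro
    });
}
```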
Benedek
Howdy,
I vote to ACCEPT Boost.JSON into boost. I think it’s a fine implementation for what it’s intended for.
I don’t understand some of the other reviews that reject it for not being something else - they got an apple and are complaining it’s the wrong color for an orange.
It’s fine to prefer oranges over apples, but it’s also ok to offer apples for sale in your store even if you wouldn’t buy it yourself. A lot of people like apples - they’re simple and convenient, without having to peel away the layers of complexity involved in something more pulpy. (Ok, I’ve squeezed as much juice out of that metaphor as possible...)
- What is your evaluation of the design?
Good. It took me a bit to get used to it, because I was used to something else (Facebook’s folly::dynamic mostly); but that would be true of any new library.
- What is your evaluation of the implementation?
I didn’t study the internal details. I used it as a consumer of the library. I thought the user-facing API was reasonable. The one change I made to the code was to force it to use std::string_view instead of boost::string_view, even though I used it in non-standalone mode. It would be nice if there were a separate compile flag for that, but it’s not a big deal.
Having yet another custom/purpose-built string implementation was surprising. It’s not an issue, but just surprising that yet another string type is being created.
It was also a bit surprising there wasn’t a function to pretty-print the serialized output (e.g., via an option arg for `serialize()`). It’s fairly common to need such a thing, for example to write to user-viewable files (e.g., configuration or property files), and potentially to debug logs, etc. Obviously the user can write a pretty-printer of their own, but this is one of those things that I think should come out-of-the-box eventually.
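For illustration, here is roughly what such a user-written pretty-printer looks like, built only on the public value interface (a sketch of my own, assuming the documented kind()/get_*() accessors; it is not something the library ships):

```cpp
// Sketch: indent objects and arrays; reuse the compact serializer for scalars.
#include <boost/json.hpp>
#include <string>

void pretty_print(std::string& out, const boost::json::value& jv, int indent = 0)
{
    namespace json = boost::json;
    const std::string pad(indent * 2, ' ');
    switch (jv.kind()) {
    case json::kind::object: {
        out += "{\n";
        const auto& obj = jv.get_object();
        for (auto it = obj.begin(); it != obj.end(); ++it) {
            if (it != obj.begin()) out += ",\n";
            out += pad + "  " + json::serialize(json::value(it->key())) + ": ";
            pretty_print(out, it->value(), indent + 1);
        }
        out += "\n" + pad + "}";
        break;
    }
    case json::kind::array: {
        out += "[\n";
        const auto& arr = jv.get_array();
        for (auto it = arr.begin(); it != arr.end(); ++it) {
            if (it != arr.begin()) out += ",\n";
            out += pad + "  ";
            pretty_print(out, *it, indent + 1);
        }
        out += "\n" + pad + "]";
        break;
    }
    default:
        out += json::serialize(jv);   // null, bool, numbers, strings
    }
}

// Usage: std::string out; pretty_print(out, jv);
```

It’s not hard to write, but it’s exactly the kind of thing many users would rather get from the library.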
- What is your evaluation of the documentation?
Very good. As a consumer, it had most everything I needed.
- What is your evaluation of the potential usefulness of the library?
I think it’s very useful.
Whether it will become _popular_ I don’t know. There’s no debate that there is high demand for a JSON-based variant-structure library similar to Boost.JSON, but there are several popular choices already. Boost.JSON has better performance and memory usage than some of the ones I know of (nlohmann and folly::dynamic), so that should give it a leg up. But those libraries also have a lot of bells and whistles, and it’s not clear how much that will matter to people.
While my company also has a custom implementation for some specific uses, we do still use folly::dynamic fairly heavily, and I would seriously consider migrating to Boost.JSON in our code-base instead.
- Did you try to use the library?
Yes, I used it in non-standalone mode, and ran it through some tests we have in my employer’s code base.
- With which compiler(s)?
Gcc 7.3, C++17 mode.
- Did you have any problems?
Just a couple bugs that were quickly fixed by the authors.
One test failure I got was with the usage of scientific notation in serialized output, as has been noted by other reviewers. It’s not _wrong_, of course, but it’s unusual - and unusual is frequently bad for interoperability.
Also, I first tried it in standalone and was surprised it didn’t work even though we use C++17 - sadly, gcc 7.3 doesn’t have the pmr::memory_resource implementation. Gcc doesn’t have pmr until version 9+, which is a shame because incorporating it as standalone would be ideal; we do use boost heavily, but we almost always have issues when upgrading boost versions, so we don’t upgrade often.
I think being usable as standalone would increase the uptake for Boost.JSON… but requiring gcc 9+ negates that somewhat, perhaps.
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
I spent about a day testing it. I’m sorry I didn’t have more time this past week. Life happens. :(
- Are you knowledgeable about the problem domain?
I’ve been using `folly::dynamic` from Facebook’s folly library since 2014, and have made changes to it for my company’s code base. It provides a similar variant-style structure based on JSON-ish types, with a parser and serializer for JSON encoding.
I’ve also written a custom JSON-based variant-structure for some of my company’s particular needs, with built-in schema support, JsonPointer and XPath-style query support, etc.
-hadriel
On Mon, Sep 21, 2020 at 10:29 AM Mathias Gaunard
<mathias.gaunard(a)ens-lyon.org> wrote:
> > > In both cases, I'd like to read/write my data from/to JSON with the
> > > same framework.
> >
> > Why? What specifically, is the requirement here?
>
> What I'd like is a way to describe how my C++ types map to a key-value
> structure with normalized types so that I can easily convert my
> objects back and forth through a structured self-describing and
> human-readable interchange format.
Right, but what I'm asking you is: *specifically* in what way would a
framework that offers both JSON DOM and JSON Serialization be
"consistent?" Can you show in terms of declarations, what that would
look like? In other words I'm asking you to show using example code
how putting these two separate concerns in a single library would
offer benefits over having them in separate libraries.
Thanks
Hi,
This is my review of Boost.JSON by Vinnie Falco and Krystian Stasiowski.
I believe that the library should be REJECTED, though it would be a
much better candidate with just a few minor tweaks here and there.
The library provides the following components:
- a parser framework
- a DOM to represent/construct/edit trees of JSON-like values (json::value)
- a serializer for that type
Unfortunately, all three of these components are not quite
satisfactory when it comes to addressing the problem domain of JSON as
an interchange format.
JSON is ubiquitous technology, and I feel like this library is not
deserving of a name such as Boost.JSON, which would suggest "the right
way to do JSON as seen by the C++ community".
Parser:
--------
Parsing is arguably the most important feature of any such library.
However with Boost.JSON it really feels like it is an afterthought,
and only its coupling with json::value was considered.
These are the main issues:
- The lack of proper number support from the DOM apparently propagates
to the parser as well, even though there is no cost for the parser to
properly support them.
- The interface is closely tied to a text format and doesn't support
JSON-inspired binary formats. This kind of forward thinking is what
would really elevate those parsers.
- The fact that only a push parser is provided. Parsing into one's own types
usually requires a pull parser in order for nested members' parsers to
combine into the outer parser.
The way the parser interface handles numbers is not only inconsistent
with the rest of the interface but also limiting.
bool on_number_part( string_view, error_code& ) { return true; }
bool on_int64( std::int64_t, string_view, error_code& ) { return true; }
bool on_uint64( std::uint64_t, string_view, error_code& ) { return true; }
bool on_double( double, string_view, error_code& ) { return true; }
I would expect an interface that looks like
bool on_number_part( string_view, std::size_t n, error_code& ) { return true; }
bool on_number( string_view, std::size_t n, error_code& ) { return true; }
bool on_int64( std::int64_t, error_code& ) { return true; }
bool on_uint64( std::uint64_t, error_code& ) { return true; }
bool on_double( double, error_code& ) { return true; }
I see the push parser integrates nicely with Beast/Asio segmented
buffering and incremental parsing, which is good.
DOM:
------
The DOM provides a canonical or vocabulary type to represent any JSON
document, which is useful in itself.
The DOM does not support numbers that are neither 64-bit integral
numbers nor double-precision floating-point. This is a common
limitation, but still quite disappointing, since the proposal of it
being a canonical type is quite tainted by the fact it can't represent
JSON data losslessly. I think it would be better if another type was
provided for numbers that don't fit in the previous two categories.
It uses a region allocator model based on polymorphic value, which is
what the new polymorphic allocator model from Lakos et al. was
specifically intended for. It puts some kind of reference-counting for
life-extension semantics on top, which I'm not sure I'm a big fan of,
but that seems optional.
Querying capabilities for that type are pretty limited and quite
un-C++ like. Examples use a C-style switch/extractor paradigm instead
of a much safer variant-like visitor approach.
There is no direct nested xpath-style querying like with boost
property tree or other similar libraries.
It feels like the interface is only a marginal improvement on that of rapidjson.
Serializer:
-----------
There is no serializer framework to speak of: just a serializer for
the DOM type json::value, i.e. the only way to generate a JSON
document is to generate a json::value and have it be serialized.
A serializer interface symmetric to that of the parser could be
provided instead, allowing arbitrary types to be serialized to JSON
(or any other similar format) without conversion to json::value.
When it comes to serializing floating-point numbers, it seems it
always uses scientific notation. Most libraries have an option so that
you get to choose a digit threshold at which to switch.
Integers always use a fixed representation, which makes this all quite
inconsistent.
Review questions:
----------------------
> - What is your evaluation of the design?
I can see the design is mostly concerned with receiving arbitrary JSON
documents over the network, incrementally building a DOM as the data
is received, and then doing the opposite on the sending path.
It is good at doing that, I suppose, but it is otherwise quite
limited, as it is unsuitable for any use of JSON as a serialization
format for otherwise known messages.
> - What is your evaluation of the implementation?
I only took a quick glance, but it seemed ok. Can't say I'm a fan of
this header-only idea, but some people like it.
> - What is your evaluation of the documentation?
It is quite lacking when it comes to usage of the parser beyond the DOM.
It also doesn't really say how numbers are formatted in the serializer.
> - What is your evaluation of the potential usefulness of the library?
Limited. Given the scope of the library and the quality of its
interface, I don't see any reason to use it instead of rapidjson,
which is an established library for which I already have converters
and Asio stream adapters.
> - Did you try to use the library? With which compiler(s)? Did you have any problems?
I didn't.
> - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Maybe half a day reading through the documentation and the mailing list.
> - Are you knowledgeable about the problem domain?
I have used a lot of different JSON libraries and I have written
several serialization/deserialization frameworks using JSON, some of
which are deeply tied into everything I do everyday.
My review is primarily from the perspective of a library consumer. I use boost
libraries and JSON on a daily basis, but I have never written either a boost
library or a JSON library.
# Boost JSON Review
> Please provide in your review information you think is valuable to
> understand your choice to ACCEPT or REJECT including JSON as a
> Boost library. Please be explicit about your decision (ACCEPT or REJECT).
I recommend ACCEPTing the library into boost. It's well designed,
professionally implemented and fills an important use case. The library is
well maintained and I trust the authors to keep doing so. The documentation is
good, but the reference documentation could use some improvements in a few
spots. Individual reference sections appear copy-pasted a bit too aggressively,
which is clearly good for the doc writer but not so much for the doc reader.
I consider none of the issues below show-stoppers that would block acceptance
into boost. This is also based on the experience that the authors are
responsive in maintaining the library (and other libraries), and I trust them
to use their best judgement as to whether and how to address any issues.
Brief side note: I'm a user of boost libraries and I consider boost a very
important part of the C++ library ecosystem, in particular, for building
reliable and long lasting software components and systems. It sets a high bar
for acceptance. When using individual boost libraries, an important
consideration is whether the library has aged well with the evolution of the
C++ standard, whether it is currently well maintained and whether it will
remain well maintained in the future. Here boost sometimes appears to be a bit
of a mixed bag. So I consider the maintenance promise to be quite important
and in particular more important than any of the issues raised below.
I want to thank all the people involved in boost for keeping it alive,
useful and high quality.
> Some other questions you might want to consider answering:
>
> - What is your evaluation of the design?
I evaluated primarily the DOM interface, that is the value container, but
didn't look into the parser and serializer in detail.
The design is solid and clear. Smaller issues raised on slack during the
review have been addressed. I particularly like that it provides an easy to
use, automatic ownership allocator interface. It relieves me from the pain of
manually maintaining the life-time of non-global memory pools etc.
Minor comments:
- The `value::is_xxx` functions inconsistently return bools or pointers. This
can have unexpected consequences if they are used from a context that does not
implicitly convert to bool. Also, I really can't remember which one returns
what. For example, `is_null`, `is_bool`, `is_number`, `is_string`,
`is_structured` return a really nice mix. (Fixed on develop)
- The library provides `initializer_list` construction for `value`. I personally
would prefer it didn't: too many pitfalls with a missing pair of braces
causing the meaning to change quite dramatically (see the sketch after this
list). Still, I think many expect it in modern JSON libs, so it's up to the authors.
- The library is inconsistent about `shrink_to_fit`: `object` does not have it,
`array` has it. Also, the documentation does not state what `shrink_to_fit`
does, basically repeating the errors of `std::vector`. I firmly believe that
the function should either be removed or clear and precise semantics should
be documented. In its current state the following question comes to mind:
why would I ever want to call it?
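To make the brace pitfall concrete, here is a small sketch, assuming the
documented rule that a list whose elements all look like {string, value}
pairs becomes an object and anything else becomes an array (I have only
checked this against the docs, not every corner case):

```cpp
#include <boost/json.hpp>
#include <cassert>

int main()
{
    using namespace boost::json;

    // Every element is a {string, value} pair, so this becomes an object.
    value a = { {"k1", 1}, {"k2", 2} };
    assert( a.is_object() );

    // Drop the inner braces and the very same tokens become an array of
    // four scalar elements -- the dramatic meaning change mentioned above.
    value b = { "k1", 1, "k2", 2 };
    assert( b.is_array() && b.get_array().size() == 4 );
}
```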
> - What is your evaluation of the implementation?
I looked at the implementation of the `value` container. I did not look into the
parser and serializer.
The implementation appears clean and professional. The authors responded quickly
to issues raised while studying the source code.
> - What is your evaluation of the documentation?
The documentation is generally good and well structured. However, the
reference is occasionally pretty terse. Some parts of it are written with the
assumption that any user going to the reference section has recently read the
entire documentation including examples upfront.
Some comments (mostly against master and develop from early Sep 20):
* `value::storage()` should say: Return a reference to the `storage_ptr` to the
`memory_resource` associated with the value. The same thing applies to a few
other classes.
* The reference for the `initializer_list` constructor could use some improvements:
e.g. "Construct an object or array from an `initializer_list`". I find this
hugely important since just the mere presence of an `initializer_list`
constructor changes the meaning of brace initialization.
* Additionally, it doesn't mention how nested `initializer_list`s are treated.
Most importantly it should show or link to some examples of how
`initializer_list`s are converted to objects, arrays, and non-structured
values, because some developers (if not most) might not immediately realize
from just looking at the reference documentation how nested
`initializer_list`s are parsed. Maybe also reference the
`initializer_list` constructors of `array` and `object`.
* `object::insert` is not well-documented. I did not understand what it does
when I insert an already existing key, in particular, whether assignment
takes place or not. The same is true for `object::emplace`.
* The `object` initializer constructor is listed under "Copy constructor".
* The `object::erase` return value documentation is wrong.
* `object::reserve` says "This inserts an element into the container."
* `object::swap(object)` and the friend version `swap(object,object)` have
rather inconsistent briefs.
* The initializer_list constructor for array is listed under move constructors.
* The pilfer constructors are not quite explained. There's a reference to the
paper, but I think the library should have a page that concisely explains what
it does and what a user of the library can expect from pilfered objects. In
particular, how the thing differs from move and why a user would ever want to
use it instead of move. Also the reference pages for pilfer and so on don't
explain what it does.
* `array::reserve` shows `BOOST_FORCEINLINE` in the docs.
* `value_to` is unclear about the precedence of the three listed options
(constructor, library-provided conversion or `tag_invoke`) if multiple are
present (provided this is actually possible). The same is true for `value_from`.
* `storage_ptr` thread safety is not documented. That's a real bummer for a
`shared_ptr`-like construct.
* I found it rather difficult to figure out the difference in the behavior of
`parser::write` and `parser::write_some` just from reading the documentation.
The reference documentation of each should clearly state what the difference is.
I had to open both pages side by side to actually spot the wording
difference. For example, I think that the documentation of `write_some`
should clearly state that its intended use case is parsing multiple JSON
values from a byte stream presented through one or more buffers, whereas
`write` is intended for parsing one JSON value from one or more buffers. At
least that's what I concluded from the documentation after some time.
* The `parser` documentation should either link to or inline provide one or
more examples how the parser is supposed to be used. In particular
`write_some` would benefit hugely from an example that shows how to properly
extract a stream of JSON values from a sequence of buffers.
* If I read the documentation correctly, `basic_parser::write` behaves more like
`parser::write_some` than `parser::write`. I would prefer to call it
`basic_parser::write_some`.
* The `basic_parser` reference documentation would benefit from either a link
to an example or an inline example.
> - What is your evaluation of the potential usefulness of the library?
I think the library supports a very narrow yet widespread use-case. It offers a
modern interface and high performance. So I will definitely use it, and I believe
it will also be useful to a wide audience.
> - Did you try to use the library? With which compiler(s)? Did you have
> any problems?
I played with it in its STANDALONE mode with gcc 9.3.
> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
Roughly 5 hours playing with small examples, but mostly looking into the documentation
and into the code of `json::value` and related parts of the DOM implementation.
> - Are you knowledgeable about the problem domain?
I use JSON on a regular basis across different languages including c++
(mostly nlohmann's up to date). I've not implemented a JSON library myself.
Kind regards,
Max
The boost formal review for JSON will take place from Sept 14, 2020 to Sept
23, 2020.
JSON is authored by Vinnie Falco (author: Boost.Beast and Static String) and
Krystian Stasiowski (author: Static String), and submitted with the help of Peter
Dimov (author: too many libraries to mention 😛).
The use of the JSON format to share data keeps getting bigger and wider
because it is lightweight, supported cross-platform, and more.
JSON has numerous uses, and this library is introduced to make it easy to
deal with in C++.
Documentation can be found here:
<http://vinniefalco.github.io/doc/json/>
It includes examples, usage and comparison. The documentation is still in
progress but already provides enough information.
Official repository:
<https://github.com/CPPAlliance/json>
Glad to address any questions or issues, here or in the repo:
<https://github.com/CPPAlliance/json/issues/>
The purpose of this announcement is to give people some time to start
looking at the documentation and source code for the library. Feel free
to begin asking questions. Most of all, get involved! Nowadays most of
us (presumably) have dealt with JSON in one way or another. Tell the
Boost community what you think of JSON.
--
Thank you,
Pranam Lashkari, https://lpranam.github.io/
I'm gonna use a perhaps unusual text structure for this review. Consuming
text is a tiring activity that demands energy. With that in mind, I
ignored the order in which the library is presented through the docs and
moved all discussion of trivial concepts to the end. That's my attempt to make
sure the message gets through while the reader still has it -- the energy.
## A short introduction to pull parsers vs push parsers
The act of parsing an incoming stream will generate events from a
predefined set. Users match the generated events against rules to trigger
actions. For the following JSON sample:
```json
{ "foo" : 42 , "bar" : 33 }
```
One of such event streams could be:
```
begin_object -> string_key -> number -> string_key -> number -> end_object
```
The essential difference between pull parsers and push parsers is how this
event stream is delivered. A push parser (a.k.a. SAX parser) uses the
callback handler model. A pull parser (a.k.a. StAX parser) stores the
current parsing state and allows the user to query this state.
Essentially every push parser has an implicit pull parser that doesn't get
exposed to the user. This is to say that a push parser would be implemented
along the lines of:
```cpp
// parsing loop
while (it != end) {
match_token();
dispatch_token(user_handlers);
}
```
Whereas a pull parser doesn't have the loop construct above, just its
body. The loop construct would be under user-control. That's why pull
parsers are the more fundamental model. A push parser can be built on top
of a pull parser, but not the inverse.
Concrete examples will be shown later in this text.
## Performance: a common misconception
Some push parsing proponents believe pull parsers are slower. This belief
is unsubstantiated. As we have seen in the previous section, a push parser
is always built on top of an implicit pull parser. The only thing that
differentiates the two is that a small part of the parser logic under the
pulling model moves out of the interface to become user code. That's the
only difference.
One of the arguments brought by push parser proponents is the ability to
merge matching and decoding steps to perform less work. We can use the
previous sample to illustrate this idea:
```
{ "foo" : 42 , "bar" : 33 }
^^^^^
```
If we are to decode the `"foo"` token above, the parser first needs to
discover where the token begins and where the token ends. This is the
required matching step that every parser needs to perform. JSON strings can
be escaped and one such example would be:
```json
"na\u00EFve"
```
The decoding steps take care of converting escaped string slices to
unescaped ones. So user code would receive 3 string runs while this token
is being decoded:
- "na"
- "ï"
- "ve"
The parser does not allocate any memory to decode any string. The slices
are fed to the user, who in turn is responsible for putting them together
(possibly allocating heap memory).
The idea here is... having separate match and decode steps forces the
parser to walk over the tokens twice. For Boost.JSON, the merged match and
decode steps materialize in the `on_string_part()` (and `on_string()`)
callback:
```cpp
bool on_string_part(string_view v, std::size_t, error_code& ec)
{
current_str.append(v.data(), v.size());
return true;
}
```
For a pull parser with _separate_ match and decode steps the interface
would be along the lines of:
```cpp
reader.next();
if (reader.kind() == token::string)
reader.string(current_str);
```
But as we've seen before, the real difference between pull and push parsers
lies elsewhere. Any features lateral to this classification originate
from different taxonomies. Just as a push parser can merge the matching and
decoding steps, so can a pull parser:
```cpp
// A new overload.
//
// `my_match_options` here would provide
// the string collector upfront.
reader.next(my_match_options);
```
Another small note here is that just as merging matching and decoding steps
can be beneficial performance-wise so can skipping the whole decoding
step. I'll be touching on this very idea again in the section below.
Appendix A will present how another feature from Boost.JSON can be
implemented in pull parsers: strings split among different buffers.
## The case for pull parsers
So far we've only been exposed to the difference between push parsers and
pull parsers, but there wasn't a case for which choice is ideal (aside from
the naturally understood concept that pull parsers are more fundamental and
push parsers can be built on top when desired).
The defining trait present in pull parsers that is impossible to replicate
in push parsers is their ability to compose algorithms. Composition is
enabled by temporarily handing over the token stream to a different
consumer. Not every consumer needs to know how to process every part of the
stream. That's a trait impossible to find in push parsers.
I have created a [gawk](https://www.gnu.org/software/gawk/) plug-in to
illustrate this property. The same plug-in was implemented several times
using different libraries/approaches. I incrementally added more features
under different branches (there are 8 branches in total). You can find a
rationale for the AWK choice in Appendix B by the end of this review.
The plugin's code can be found at
<https://gitlab.com/vinipsmaker/gawk-jsonstream-archive>. And you may
forgive my beginner code, but I had no experience writing gawk plug-ins
beforehand. This is my first.
AWK is a C-like language (in syntax) where you define pattern-action
pairs. You can find a quick'n'dirty introduction to the language from
Wikipedia. This plug-in adds JSON capabilities to the tool. Given the
following data-set:
```
{"name": "Beth", "pay_rate": 4.00, "hours_worked": 0}
{"name": "Dan", "pay_rate": 3.75, "hours_worked": 0}
{"name": "Kathy", "pay_rate": 4.00, "hours_worked": 10}
{"name": "Mark", "pay_rate": 5.00, "hours_worked": 20}
{"name": "Mary", "pay_rate": 5.50, "hours_worked": 22}
{"name": "Susie", "pay_rate": 4.25, "hours_worked": 18}
```
And the following AWK program (with this plug-in loaded):
```awk
BEGIN {
JPAT[1] = "/name"
JPAT[2] = "/pay_rate"
JPAT[3] = "/hours_worked"
}
J[3] > 0 { print J[1], J[2] * J[3] }
```
The output will be:
```
Kathy 40
Mark 100
Mary 121
Susie 76.5
```
The paths for the JSON tree are given in [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) syntax. That's the plugin's
core and it's the code that doesn't change among branches, so it's the
appropriate place to begin the explanation.
In the file `jpat.ypp` you'll find a `read_jpat()` function that gets
called for each record. There's some boilerplate in the beginning to bail
out early if there were no changes to `JPAT`. Then there is this snippet
that gets processed by [re2c](https://re2c.org/) to generate valid C++
code:
```cpp
* {
if (*begin != '\0')
return bad_path();
jpat[i].components.emplace_back(std::move(current_component));
goto end;
}
{prefix} {
jpat[i].components.emplace_back(std::move(current_component));
current_component = std::string{};
goto loop;
}
{unescaped} {
current_component.append(begin, YYCURSOR - begin);
goto loop;
}
{escaped_tilde} {
current_component.push_back('~');
goto loop;
}
{escaped_slash} {
current_component.push_back('/');
goto loop;
}
```
The definitions for the data structures follow:
```cpp
struct Path
{
std::vector<std::string> components;
awk_value_cookie_t cookie = nullptr;
awk_string_t orig;
};
struct Tree: std::map<std::string, std::variant<Tree, size_t>>
{};
static std::vector<Path> jpat;
Tree tree;
```
By the end of the re2c-generated loop, the variable `jpat` will be
filled. Then there are some validation/sanitization steps and we start to
fill the `tree` variable out of the extracted paths.
`tree` is a decision tree auxiliary to the process of parsing the JSON
documents. Leaf nodes represent which index in the `J` AWK variable will
receive that value. Given the following `JPAT`:
```awk
JPAT[1] = "/foo/bar"
JPAT[2] = "/foobar"
```
`tree` will have the following properties (pseudocode):
```cpp
// AWK arrays are 1-indexed, so the code
// translation has a 1-element offset
assert(tree["foo"]["bar"] == 0);
assert(tree["foobar"] == 1);
```
The user may find other uses for a JSON decision tree in Appendix C. And
here is where the code branches diverge. So let's present an overview and
then walk them one-by-one (or two-by-two):
```
+------------+---------------+--------------+
| | Boost.JSON | pull parser |
+------------+---------------+--------------+
| basic | boostjson | master |
| dom | boostjson-dom | |
| transform | boostjson2-b | master2 |
| transform2 | boostjson3 | master3 |
+------------+---------------+--------------+
```
The `master` branch is just 506 lines of C++ code total and `boostjson` is
613 lines. The descriptions for the other branches follow:
- `dom`: built after `basic`. This branch is useful to highlight
performance differences between the DOM parser and the rest.
- `transform`: `boostjson2-b` is built on top of the `boostjson-dom`
branch. `master2` is built on top of the `master` branch. It adds the power
to update the JSON document. Here we start to notice how the ability to
mark certain points in the token stream is important for a JSON library.
- `transform2`: built on top of the previous branch, it adds the `JUNWIND`
operation in the likes of MongoDB's `$unwind` operator.
The first thing I'll do is to compare the DOM parsing layer cost. [There's
this 161MiB JSON file that'll be
used](https://drive.google.com/file/d/1-lVxGYWvvmBsjrDVgju_lOTBaK03MdQc/vie….
Just as `jq` can be used to extract a field from this file:
```bash
jq -cM '.features[10000].properties.LOT_NUM' citylots2.json
```
So can AWK:
```bash
gawk -l jsonstream -e '
BEGIN { JPAT[1] = "/features/10000/properties/LOT_NUM" }
{ print J[1] }' citylots2.json
```
The execution time for the boostjson branch will be:
```
real 0m2,112s
user 0m1,476s
sys 0m0,618s
```
And the execution time for the boostjson-dom branch will be:
```
real 0m5,639s
user 0m4,170s
sys 0m1,418s
```
We should always strive to do more than just throw random numbers
around. Anyone can do that. The timings show that the DOM layer is
obviously much slower, but there is no surprise here. It's kind of
ridiculous trying to compare both branches. They do different things
(although the plug-in feature-set is the same). The DOM approach will
always decode and save every matched token. And that's one of the
principles of delivering faster code: do less.
The reason why I've chosen a 161MiB JSON file was to make sure most of the
program's time would be spent not in the gawk VM but in the parsing
plug-in.
If we wanted to expose every value from the DOM tree, `json::value` would
also be useless because the consumer here expects `awk_value_t` objects
(and the DOM interface would guarantee the costs for unnecessarily saving
decoded results twice would be paid). We don't always control the
interface/protocol -- that's part of the C++ programmer's job. Would it be
reasonable to deliver code twice as slow to my client because "hey I have a
nice DOM container... gotta use it"?
As Marcel Weiher has noted in [his call for "less lethargic JSON
support"](https://blog.metaobject.com/2020/04/somewhat-less-lethargic-json-support.html)
[sic]:
> Wait a second! 312MB/s is almost 10x slower than Daniel Lemire's parser,
> the one that actually parses JSON, and we are only creating the objects
> that would result if we were parsing, without doing any actual parsing.
>
> This is one of the many unintuitive aspects of parsing performance: the
> actual low-level, character-level parsing is generally the least
> important part for overall performance. Unless you do something crazy
> like use `NSScanner`. Don't use `NSScanner`. Please.
Let's dive in the code to see how they differ.
### `basic`
Every branch reads a record as one document per line. That's the
`get_record()` function. For the `boostjson-dom` branch, the code goes
along the lines of:
```cpp
try {
parsed_value.emplace_null();
json_parser.reset(parsed_value.storage());
std::error_code ec;
json_parser.write(line.data(), line.size(), ec);
if (ec)
throw std::system_error{ec};
json_parser.finish(ec);
if (ec)
throw std::system_error{ec};
parsed_value = json_parser.release(ec);
assert(!ec);
switch (parsed_value.kind()) {
case json::kind::array:
fill_j(parsed_value.get_array(), tree);
break;
case json::kind::object:
fill_j(parsed_value.get_object(), tree);
break;
// ...
}
} catch (const std::exception& e) {
warning(ext_id, "Error while processing record: %s", e.what());
}
```
The code parses the whole JSON at once and then follows to consume the
document by calling the `fill_j()` function:
```cpp
// Several overloads:
static void fill_j(const json::array& arr, Tree& node);
static void fill_j(const json::object& obj, Tree& node);
static void fill_j(const json::value& v, Tree& node);
// Here is the object overload. You can find the rest in the repository.
static void fill_j(const json::object& obj, Tree& node)
{
for (auto& [key, value]: node) {
auto it = obj.find(key);
if (it == obj.end())
continue;
std::visit(
hana::overload(
[&](Tree& tree) { fill_j(it->value(), tree); },
[&](size_t idx) { consume_value(it->value(), idx); }),
value);
}
}
```
The code is pretty straight-forward. We check if the fields the user asked
for are present in the document and recurse on them. When we reach a leaf
node, the `J` AWK variable is set (that'd be the `consume_value()`
function).
Do notice that there is no uncontrolled recursion here. Recursion is always
JPAT-delimited. And JPAT is part of the source code, not user-input. There
is no risk of stack overflow here.
The pull parser approach (branch master) somewhat follows the same trend,
but the code is bigger. Here's a selected highlight:
```cpp
void fill_j(json::reader& reader, Tree& node)
{
// ...
switch (reader.symbol()) {
case json::token::symbol::error:
throw json::error{reader.error()};
case json::token::symbol::begin_object: {
if (!reader.next()) throw json::error{reader.error()};
std::string current_key;
for (;;) {
if (reader.symbol() == json::token::symbol::end_object) {
reader.next();
break;
}
// Key
assert(reader.symbol() == json::token::symbol::string);
current_key.clear();
auto ec = reader.string(current_key);
assert(!ec); (void)ec;
if (!reader.next()) throw json::error{reader.error()};
// Value
auto it = node.find(current_key);
if (it == node.end()) {
json::partial::skip(reader);
continue;
}
std::visit(
hana::overload(
[&](Tree& tree) {
switch (reader.symbol()) {
case json::token::symbol::begin_array:
case json::token::symbol::begin_object:
fill_j(reader, tree);
break;
default:
json::partial::skip(reader);
}
},
consume_value),
it->second);
}
if (reader.symbol() == json::token::symbol::error)
throw json::error{reader.error()};
}
// ...
}
}
```
And here is the first example of what I meant by composability. Our
"callback" only gets called to handle trees that are part of the decision
tree one way or another (and the root node always belongs here). When it
reaches a subtree that doesn't belong to the JPAT-defined decision tree,
`json::partial::skip()` is called. `json::partial::skip()` is defined
elsewhere and has no knowledge on the internals of our algorithm. This is a
different "callback" being called to handle a different part of the
tree. The two of them know how to handle different groups of tokens.
A nice side effect is that these "phantom" trees will be fully skipped
over. We don't allocate strings to collect values over them. This is an
optimization we get for free. [Designing JSON parsers to exploit this idea
is not unknown](https://github.com/zserge/jsmn).
The last one will be the code for the `boostjson` branch, but I feel like
it'll be easier to understand if I first introduce the tasks it has to
perform in the token stream.
```
{ "foo" : 42 , "bar" : { "foo" : 33 } }
^^
```
When the field's value is parsed, we must match the key against our
rule-set to possibly trigger an action. But it is not enough to store the
current key. The whole path must be stored to support the JSON nest-able
structure. We start with the structure to save current context information:
```cpp
struct State
{
bool is_object; //< false = ARRAY
std::string current_key;
size_t current_idx;
Tree* node;
};
std::vector<State> state;
```
So the beginning of the code for `on_object_begin()` must be:
```cpp
Tree* node = nullptr;
if (state.size() == 0) {
node = &tree;
} else if (state.back().node) {
auto it = state.back().node->find(current_key());
if (it != state.back().node->end())
node = std::get_if<Tree>(&it->second);
}
state.emplace_back();
state.back().is_object = true;
state.back().node = node;
```
And a matching action in the code for `on_object_end()`:
```cpp
state.pop_back();
```
But it's not really that simple. Paths such as
`/features/10000/properties/LOT_NUM` won't work because `current_key` won't
be provided by any `on_key()`-like callback when the current node is
`10000`. `current_key` must be computed for arrays in the value handler
itself. That's why the following snippet must be present in the beginning
of every value-callback (and `on_{object,array}_begin()` too):
```cpp
if (state.size() > 0 && /*is_array=*/!state.back().is_object)
update_current_array_key();
```
And the matching helper functions:
```cpp
std::string_view current_key()
{
assert(state.size() > 0);
return state.back().current_key;
}
void update_current_array_key()
{
assert(state.size() > 0);
assert(!state.back().is_object);
state.back().current_key.resize(max_digits);
auto s_size = std::to_chars(
state.back().current_key.data(),
state.back().current_key.data() + state.back().current_key.size(),
state.back().current_idx++
).ptr - state.back().current_key.data();
state.back().current_key.resize(s_size);
}
```
The final touch is to realize that the event stream will be something like:
```
string_key -> value -> string_key -> value -> string_key -> value -> ...
```
And we need to clear the `string_key` by the end of every value consumed:
```cpp
if (state.size() > 0) state.back().current_key.clear();
```
So the callback implementation for a simple atom value such as a string is:
```cpp
bool on_string_part(std::string_view v, std::error_code&)
{
if (state.size() > 0) {
if (state.back().node == nullptr)
return true;
auto it = state.back().node->find(current_key());
if (it == state.back().node->end())
return true;
if (!std::holds_alternative<size_t>(it->second))
return true;
}
current_str.append(v.data(), v.size());
return true;
}
bool on_string(std::string_view v, std::error_code&)
{
awk_value_t idx_as_val;
if (state.size() == 0) {
current_str.append(v.data(), v.size());
make_number(0, &idx_as_val);
} else {
if (/*is_array=*/!state.back().is_object)
update_current_array_key();
if (state.back().node == nullptr)
goto end;
auto it = state.back().node->find(current_key());
if (it == state.back().node->end())
goto end;
auto idx = std::get_if<size_t>(&it->second);
if (!idx)
goto end;
current_str.append(v.data(), v.size());
make_number(*idx + 1, &idx_as_val);
}
{
awk_value_t value_as_val;
char* mem = (char*)gawk_malloc(current_str.size() + 1);
memcpy(mem, current_str.c_str(), current_str.size() + 1);
set_array_element(
var_j,
&idx_as_val,
make_malloced_string(mem, current_str.size(), &value_as_val));
}
end:
if (state.size() > 0) state.back().current_key.clear();
current_str.clear();
return true;
}
```
But the code still performs differently. There's that optimization we got
for free in the pull parser choice. We must implement a mechanism to skip
over unwanted trees to make the comparison fair. And here's where a
`skipped_levels` variable is introduced. It'll change every callback again
but will finally let us have the following code for the `string_key`
callback:
```cpp
bool on_key_part(std::string_view v, std::error_code&)
{
if (skipped_levels)
return true;
// not called on phantom trees
state.back().current_key.append(v.data(), v.size());
return true;
}
bool on_key(std::string_view v, std::error_code&)
{
if (skipped_levels)
return true;
// not called on phantom trees
state.back().current_key.append(v.data(), v.size());
return true;
}
```
The full code can be found on the `include/jsonstream/parser.hpp` file.
This explanation should suffice to understand (to quote myself):
> The defining trait present in pull parsers that is impossible to
> replicate in push parsers is their ability to compose
> algorithms. Composition is enabled by temporarily handing over the token
> stream to a different consumer. Not every consumer needs to know how to
> process every part of the stream. That's a trait impossible to find in
> push parsers.
Every callback depends on implementation knowledge from every other
callback. Spaghetti code as they say. I also implemented another change to
both branches -- `master` and `boostjson` -- to further demonstrate this
property. What if we want to store the string representation of array indexes
in static arrays instead of `std::string` objects?
There's this changeset for `master` branch:
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -184,20 +184,19 @@ void fill_j(json::reader& reader, Tree& node)
case json::token::symbol::begin_array: {
if (!reader.next()) throw json::error{reader.error()};
- std::string current_key;
+ std::array<char, max_digits> current_key;
for (size_t i = 0 ;; ++i) {
if (reader.symbol() == json::token::symbol::end_array) {
reader.next();
break;
}
- current_key.resize(max_digits);
auto s_size = std::to_chars(current_key.data(),
current_key.data() +
current_key.size(),
i).ptr - current_key.data();
- current_key.resize(s_size);
- if (node.count(current_key) == 0) {
+ auto it = node.find(std::string_view(current_key.data(), s_size));
+ if (it == node.end()) {
json::partial::skip(reader);
continue;
}
@@ -214,7 +213,7 @@ void fill_j(json::reader& reader, Tree& node)
}
},
consume_value),
- node[current_key]);
+ it->second);
}
if (reader.symbol() == json::token::symbol::error)
```
The changes are local to the code that parses this structure (only the
`case json::token::symbol::begin_array` "callback" changed). It doesn't
affect the rest. As for the `boostjson` branch, the code for every callback
must now execute new code (if you're curious why we'd need a
`current_key()` helper function):
<https://gitlab.com/vinipsmaker/gawk-jsonstream-archive/-/commit/f4c81abf7e7…>.
As an exercise for the reader, try to imagine what changes would be
required in the push parser approach if we wanted to expose a DOM handle in
the `J` AWK variable. As for the pull parser approach we have a call to
`partial::skip()` within the `consume_value()` function. This call would be
replaced with `partial::parse()` and that's the change.
### `transform`
By now I have already made my main point: push parsers don't compose. I'll keep
the explanation for the remaining branches brief and only give an outline of
them.
We've made a plug-in to read JSON values. Naturally the next step is to
have the ability to update/write them.
It's not possible to do this with Boost.JSON parser, so I had to give up on
branch boostjson2 and start a new one -- boostjson2-b. `boostjson2-b` is
based on `boostjson-dom` so you can expect similar prohibitive costs from
the DOM layer.
The strategy here is the same as before. We recursively traverse the stored
DOM using the decision tree to guide the whole process. The previous step
had a `consume_value()` function and the new step will have a
`produce_value()` one:
```cpp
static void produce_value(json::value& dom, size_t idx)
{
awk_value_t idx_as_val; make_number(idx + 1, &idx_as_val);
awk_value_t value;
if (!get_array_element(var_j, &idx_as_val, AWK_UNDEFINED, &value)) {
dom.emplace_null();
return;
}
switch (value.val_type) {
case AWK_UNDEFINED:
case AWK_ARRAY:
dom.emplace_null();
break;
case AWK_NUMBER:
if (value.num_value == 1.0 && dom.is_bool()) {
dom.emplace_bool() = true;
} else if (value.num_value == 0.0 && dom.is_bool()) {
dom.emplace_bool() = false;
} else {
// TODO: respect AWK's OFMT
dom.emplace_double() = value.num_value;
}
break;
case AWK_STRING:
case AWK_REGEX:
case AWK_STRNUM: {
dom.emplace_string() = std::string_view{
value.str_value.str, value.str_value.len};
break;
}
case AWK_SCALAR:
case AWK_VALUE_COOKIE:
assert(false);
}
}
```
Then we just serialize the whole DOM tree when the AWK function
`json::output()` is called:
```cpp
static awk_value_t* do_output(int /*num_actual_args*/, awk_value_t* result,
awk_ext_func_t* /*finfo*/)
{
query_j(parsed_value, tree);
auto serialized = json::to_string(parsed_value);
char* mem = (char*)gawk_malloc(serialized.size() + 1);
memcpy(mem, serialized.c_str(), serialized.size() + 1);
return make_malloced_string(mem, serialized.size(), result);
}
```
For the pull parser we modify the read loop to store the document offsets
(this small diff is the part impossible to achieve with Boost.JSON parser):
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -96,6 +96,11 @@ void fill_j(json::reader& reader, Tree& node)
// atom values at root level were moved out of this recursive function.
auto consume_value = [&](size_t idx) {
+ Slice slice;
+ slice.rel_off = reader.literal().data() - cursor;
+ slice.jidx = idx;
+ slice.bool_src = false;
+
switch (reader.symbol()) {
case json::token::symbol::end:
case json::token::symbol::error:
@@ -109,6 +114,8 @@ void fill_j(json::reader& reader, Tree& node)
var_j,
make_number(idx + 1, &idx_as_val),
make_number(reader.value<bool>() ? 1 : 0, &value));
+
+ slice.bool_src = true;
}
if (false)
case json::token::symbol::integer:
@@ -133,12 +140,15 @@ void fill_j(json::reader& reader, Tree& node)
make_malloced_string(mem, value.size(), &value_as_val));
}
case json::token::symbol::null:
+ slice.size = reader.literal().size();
if (!reader.next()) throw json::error{reader.error()};
break;
case json::token::symbol::begin_array:
case json::token::symbol::begin_object:
- json::partial::skip(reader);
+ slice.size = json::partial::skip(reader).size();
}
+ cursor += slice.rel_off + slice.size;
+ slices.push_back(slice);
};
switch (reader.symbol()) {
```
And we implement the AWK `json::output()` function as simple string
concatenation:
```cpp
static awk_value_t* do_output(int /*num_actual_args*/, awk_value_t* result,
                              awk_ext_func_t* /*finfo*/)
{
    size_t output_size = 0;
    for (auto& slice: slices) {
        output_size += slice.rel_off;
        slice.buffer.clear();
        awk_value_t idx_as_val; make_number(slice.jidx + 1, &idx_as_val);
        awk_value_t value;
        if (!get_array_element(var_j, &idx_as_val, AWK_UNDEFINED, &value)) {
            slice.buffer = "null";
            output_size += slice.buffer.size();
            continue;
        }
        switch (value.val_type) {
        case AWK_UNDEFINED:
        case AWK_ARRAY:
            slice.buffer = "null";
            break;
        case AWK_NUMBER:
            if (value.num_value == 1.0 && slice.bool_src) {
                slice.buffer = "true";
            } else if (value.num_value == 0.0 && slice.bool_src) {
                slice.buffer = "false";
            } else {
                // TODO: respect AWK's OFMT
                json::writer writer{slice.buffer};
                writer.value(value.num_value);
            }
            break;
        case AWK_STRING:
        case AWK_REGEX:
        case AWK_STRNUM: {
            json::writer writer{slice.buffer};
            writer.value(json::writer::view_type{
                value.str_value.str, value.str_value.len});
            break;
        }
        case AWK_SCALAR:
        case AWK_VALUE_COOKIE:
            assert(false);
        }
        output_size += slice.buffer.size();
    }
    auto buffer_end = buffer.data() + record_size - record_rt_size;
    output_size += buffer_end - cursor;
    char* const mem = (char*)gawk_malloc(output_size + 1);
    auto out_it = mem;
    auto in_it = buffer.data();
    for (auto& slice: slices) {
        memcpy(out_it, in_it, slice.rel_off);
        out_it += slice.rel_off;
        in_it += slice.rel_off + slice.size;
        memcpy(out_it, slice.buffer.data(), slice.buffer.size());
        out_it += slice.buffer.size();
    }
    memcpy(out_it, cursor, buffer_end - cursor);
    out_it += buffer_end - cursor;
    *out_it = '\0';
    return make_malloced_string(mem, output_size, result);
}
```
It's clear which solution will perform better here (string concatenation of
a few string slices vs. serializing the whole DOM tree again), but there is
another interesting difference to highlight. This difference has nothing to
do with pull parsers vs. push parsers and everything to do with the
serialization framework. As we can see from the alternative library here,
we're able to serialize the document piece by piece. Boost.JSON has one
serialization facility -- `json::serializer` and its by-products such as
`json::to_string()` -- and that's it. If you have a domain object along the
lines of:
```cpp
struct Reply
{
int req_id;
std::vector<std::string> result;
};
```
The only answer Boost.JSON offers is to first convert the whole object into
a `json::value` and then serialize the resulting DOM tree. That's an
unacceptable answer.
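To make the point concrete, here is a minimal sketch of that only available
route, assuming the `tag_invoke`/`value_from` customization point (released
versions spell the serialization call `json::serialize()`; this review
refers to the facility as `json::to_string()`). The whole `Reply` is
materialized as a DOM tree before a single byte of output is produced:
```cpp
#include <boost/json.hpp>
#include <string>
#include <vector>

namespace json = boost::json;

struct Reply
{
    int req_id;
    std::vector<std::string> result;
};

// Customization point so json::value_from() knows how to build the DOM node.
void tag_invoke(json::value_from_tag, json::value& jv, Reply const& r)
{
    json::object& o = jv.emplace_object();
    o["req_id"] = r.req_id;
    o["result"] = json::value_from(r.result);
}

std::string serialize_reply(Reply const& r)
{
    // The full DOM tree is allocated here only to be discarded right after
    // serialization -- the cost complained about above.
    return json::serialize(json::value_from(r));
}
```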
### `transform2`
For the last developed branches, I demonstrate how partial DOM trees fit
into the design of pull parsers. MongoDB has an `$unwind` operator that
outputs a new document for each element of a certain array field of the
input document. So, for the following document:
```
{ "foo" 33 , "bar" : [ 3, 4, "abc" ] }
```
MongoDB's `$unwind` operator can be used to output:
```
{ "foo" 33 , "bar" : 3 }
{ "foo" 33 , "bar" : 4 }
{ "foo" 33 , "bar" : "abc" }
```
For our plug-in, you just fill in the index you'd like to unwind:
```awk
BEGIN {
JPAT[1] = "/bar"
JUNWIND = 1 #< index in JPAT array
}
```
For the `boostjson3` branch, the idea is to swap the desired array into a
different place so we can later call `json::output()` cleanly:
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -74,13 +74,32 @@ static const awk_fieldwidth_info_t zero_fields = {
static json::parser json_parser;
static json::value parsed_value;
-static void consume_value(const json::value& v, size_t idx)
+int unwind_idx; //< 1-indexed ; 0 is root ; -1 is unwind unset
+json::array unwinded_values;
+static bool unwind_in_progress;
+static size_t nunwinded;
+
+static void consume_value(json::value& v, size_t idx)
{
switch (v.kind()) {
case json::kind::array:
+ if ((int)idx + 1 == unwind_idx) {
+ v.get_array().swap(unwinded_values);
+ {
+ awk_value_t val;
+ auto ok = sym_update_scalar(
+ var_julength,
+ make_number(unwinded_values.size(), &val));
+ assert(ok); (void)ok;
+ }
+ unwind_in_progress = true;
+ nunwinded = 0;
+ break;
+ }
case json::kind::object:
break;
case json::kind::string: {
```
For the `master3` branch, only one change to the read loop was needed
(again demonstrating the composability of pull parsers). The idea is to
store the unwind target directly into a domain object (so almost a partial
DOM, but even cheaper). Do notice that only the
`json::token::symbol::begin_array` "callback" changed. Every other
"callback" stays the same:
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -146,6 +151,63 @@ void fill_j(json::reader& reader, Tree& node)
if (!reader.next()) throw json::error{reader.error()};
break;
case json::token::symbol::begin_array:
+ if ((int)idx + 1 == unwind_idx) {
+ const char* head = reader.literal().data();
+ if (!reader.next()) throw json::error{reader.error()};
+ for (;;) {
+ if (reader.symbol() == json::token::symbol::end_array)
+ break;
+
+ awk_value_t value;
+
+ switch (reader.symbol()) {
+ case json::token::symbol::end:
+ case json::token::symbol::error:
+ case json::token::symbol::end_object:
+ case json::token::symbol::end_array:
+ assert(false);
+ case json::token::symbol::boolean:
+ make_number(reader.value<bool>() ? 1 : 0, &value);
+ break;
+ case json::token::symbol::integer:
+ case json::token::symbol::real:
+ make_number(reader.value<double>(), &value);
+ break;
+ case json::token::symbol::string: {
+ auto data = reader.value<std::string>();
+ char* mem = (char*)gawk_malloc(data.size() + 1);
+ memcpy(mem, data.c_str(), data.size() + 1);
+ make_malloced_string(mem, data.size(), &value);
+ }
+ break;
+ case json::token::symbol::null:
+ value.val_type = AWK_UNDEFINED;
+ break;
+ case json::token::symbol::begin_array:
+ case json::token::symbol::begin_object:
+ value.val_type = AWK_UNDEFINED;
+ unwinded_values.push_back(value);
+ json::partial::skip(reader);
+ continue;
+ }
+ unwinded_values.push_back(value);
+ if (!reader.next()) throw json::error{reader.error()};
+ }
+ assert(reader.symbol() == json::token::symbol::end_array);
+ const char* tail = reader.literal().end();
+ if (!reader.next()) throw json::error{reader.error()};
+ slice.size = std::distance(head, tail);
+ {
+ awk_value_t val;
+ auto ok = sym_update_scalar(
+ var_julength,
+ make_number(unwinded_values.size(), &val));
+ assert(ok); (void)ok;
+ }
+ unwind_in_progress = true;
+ nunwinded = 0;
+ break;
+ }
```
I have explored a few concepts, but there are many more that were left out
(e.g. backtracking, wrapping parsers, chunking). I'm also familiar with
these topics, and my assessment would be the same: either there is no loss
in the pull parser approach and both perform well, or the push parser
performs significantly worse.
## A faster DOM tree
There is another class of parsers that I haven't talked about. If
parsing-to-DOM speed is the only relevant pursuit for this library, then
it's failing there too. It should be using destructive parsing techniques:
decoding strings can be performed in place to put less pressure on
allocators (and to skip several memory copies). I can expand on this topic
if there is interest.
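To make "destructive parsing" concrete, here is a minimal, standalone sketch
(not Boost.JSON code) that collapses simple escape sequences inside the
input buffer itself, so the decoded string is a view into memory that
already exists and no allocation or copy is needed; `\uXXXX` handling is
omitted for brevity:
```cpp
#include <cstddef>
#include <string_view>

// Decode a JSON string's simple escapes in place and return a view into the
// same buffer. The original buffer contents are overwritten in the process.
// Assumes the input was already validated by the scanner.
std::string_view decode_in_place(char* first, char* last)
{
    char* out = first;
    for (char* in = first; in != last; ++in) {
        if (*in != '\\') { *out++ = *in; continue; }
        switch (*++in) {
        case 'n':  *out++ = '\n'; break;
        case 't':  *out++ = '\t'; break;
        case '"':  *out++ = '"';  break;
        case '\\': *out++ = '\\'; break;
        default:   *out++ = *in;  break; // \uXXXX and other escapes omitted
        }
    }
    return {first, static_cast<std::size_t>(out - first)};
}
```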
## Misc
Some assorted issues and observations:
- The Boost.JSON serialization framework is essentially non-existent.
  There's only one facility. The functions don't help you generate arbitrary
  serialized JSON output; any value to be serialized must first be converted
  to the allocating variant-like class `json::value`.
- The `json::value::get_*()` functions break the usual convention for the
  bounds-checked versions of accessor functions in a variant-like class.
- There is no `std::monostate` overload to initialize `json::value`. On a
  personal note, I'd prefer to see an explicit `json::null_t`. I don't like
  to conflate the meanings of terms created for different purposes.
- Documentation for the parser interface is rather lacking. Parameters are
  poorly explained and plenty of other questions go unanswered. Can the
  handler pause the parsing loop by returning `false` without setting any
  error code (and how would the stream be resumed)? Can the handler
  callbacks access parser methods such as `depth()`? Can the handler
  callbacks copy the parser object (this would be useful for backtracking)?
- The code compiles with warnings.
- The library offers far too little, and the soundness of its design can't
  be proven. All it does is convert to/from a variant-like class; every
  other JSON-related use case is left out. In the future, another library
  along the lines of QtXmlPatterns (but for JSON) will be built, but this
  core -- Boost.JSON -- will prove unfit to build anything higher-level on,
  and a new core will have to be developed from the ground up because of the
  current bad design choices.
## Appendix A: Strings through multiple buffers in pull parsers
There are always compromises when it comes to designing programming
abstractions. For parsers, one such compromise is a parser that never
"fails" to consume input: every input given is fully consumed.
The idea here is to avoid "reparsing" old elements from the stream. For
instance, one document may be received as the following two separate
buffers:
```
"na\u00
EFve"
```
The parser may fail to consume the escape sequence due to the missing
characters and return `nread=3` for the first buffer. The performance
argument in favour of a parser that doesn't fail is irrelevant. There is no
performance gain in not reparsing 4 bytes. If your buffer is so small that
you find yourself constantly reparsing elements, then you have a problem
well before the parsing level: the read syscall overhead would dominate.
That's a pathological case, and you must never design abstractions for a
pathological case.
I'll refrain from initiating any debate here on what would constitute a
pathological case for JSON parsers. I just want to make sure the reader has
this in mind and is careful not to ask for every feature under the sun. One
must carry a grain of doubt over whether a feature is really desired. In
fact, the parser may not buffer, but [now the consumer does
so](https://github.com/CPPAlliance/json/blob/fc7b1c6fd229e884df09fd45c64c61…
(so much for a chunking parser).
As for the topic of this section: can a pull parser support this case? The
answer will be the same as before (to quote myself):
> But as we've seen before, the only difference between pull and push
> parsers is something else. Any features lateral to this classification
> originate from different taxonomies.
The enabling feature here is not the push interface. Rather, it is the
chosen set of generated events for the JSON token stream. If you change the
events notified for the stream, you enable the same feature. For a pull
parser, these are the events it would have to provide (among other valid
choices); a sketch of a consumer follows the list:
- `string` (if non-split)
- `string_run_begin`
- `string_run`
- `string_run_end`
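A minimal sketch of how a consumer would use these events (the
`string_run_*` symbols are hypothetical, as is the reader's exact shape;
this is not an existing API):
```cpp
#include <string>
#include <string_view>

// Accumulate one string value that may arrive either whole ("string") or
// split across buffers ("string_run_begin" .. "string_run_end").
// `Symbol` is the hypothetical event enumeration listed above.
template<class Reader, class Symbol>
std::string read_string(Reader& reader)
{
    std::string result;
    if (reader.symbol() == Symbol::string) {
        result = reader.template value<std::string>();
    } else { // Symbol::string_run_begin
        while (reader.next() && reader.symbol() == Symbol::string_run)
            result += reader.template value<std::string_view>();
        // here reader.symbol() == Symbol::string_run_end
    }
    return result;
}
```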
## Appendix B: Why gawk?
There are a few reasons why I've chosen a (GNU) AWK plug-in to illustrate
the concepts presented in this review.
First, I don't want to invent a new tool. Artificial examples could always
be created to fabricate an argument. AWK is one of the classical UNIX
tools. It's established and old. From [gawk
manual](https://www.gnu.org/software/gawk/manual/html_node/History.html):
> The original version of `awk` was written in 1977 at AT&T Bell
> Laboratories. In 1985, a new version made the programming language more
> powerful [...] This new version became widely available with Unix System
> V Release 3.1 (1987).
We're talking 40 years of history here.
If the tool is established, that's one reason. Another reason why I've
chosen AWK is that AWK has no JSON processing facilities, so it fills a gap
in an existing tool.
The third reason why I've chosen GNU AWK to create this plug-in is that
gawk already has an extension mechanism. Therefore I don't have the freedom
to modify the AWK processor to accommodate the JSON library's
idiosyncrasies. On the contrary, it's the JSON library that has to prove its
flexibility.
Now there's another question to answer here: why take MongoDB as the
inspiration for the gap to be filled in AWK (such as the `$unwind`
operator)? The answer is the same: MongoDB is established; it is one of the
JSON-oriented DBMSes that got popular back in the NoSQL boom days.
The end result tries to fit into AWK's mindset and not the other way
around. AWK is a tool designed for tabular data processing. JSON is in no
way tabular, but its structure allows us to build a tabular view of the
underlying data.
There are two main variables in gawk to control how the fields are
extracted:
- `FS`: defines how to find the field _separators_.
- `FPAT`: defines how to find the field _contents_.
The plug-in just follows this trend and introduces a `JPAT` variable that
also instructs how to find the field _contents_. `JPAT` is an array, and
each element holds a [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) expression, which was designed
exactly for this purpose.
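For readers unfamiliar with JSON Pointer, here is a minimal sketch (not part
of the plug-in) of how such an expression walks a `json::value`; a real
resolver would also handle the `~0`/`~1` escape rules:
```cpp
#include <boost/json.hpp>
#include <cstdlib>
#include <string>
#include <string_view>

namespace json = boost::json;

// Resolve an RFC 6901 pointer such as "/bar/0" against a document.
// Returns nullptr when the pointer doesn't match anything.
json::value const* resolve(json::value const& root, std::string_view pointer)
{
    json::value const* v = &root;
    while (!pointer.empty()) {
        pointer.remove_prefix(1); // skip the leading '/'
        auto const slash = pointer.find('/');
        auto const token = pointer.substr(0, slash);
        if (v->is_object()) {
            v = v->as_object().if_contains(token);
        } else if (v->is_array()) {
            auto const idx = std::strtoull(std::string(token).c_str(), nullptr, 10);
            auto const& a = v->as_array();
            v = idx < a.size() ? &a[idx] : nullptr;
        } else {
            return nullptr;
        }
        if (!v)
            return nullptr;
        pointer = (slash == std::string_view::npos)
            ? std::string_view{} : pointer.substr(slash);
    }
    return v; // an empty pointer refers to the whole document
}
```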
There's one very interesting fact about this outcome. The usage pattern
here removes from the library the step that processes the stream (the step
that would happen with the DOM approach) and moves much of the processing
logic into the parsing phase itself. That's a continuation of the trend we
saw earlier about merging the matching and decoding steps. It's also the
same approach we'd use to build serious JSON processing tools such as
[JMESPath](https://jmespath.org/) (some JMESPath expressions require a
partial DOM, but it's doable to use a hybrid approach anyway, as has already
been demonstrated in the `JUNWIND` code).
## Appendix C: Static decision trees with Boost.Hana
The reader may be interested to know that the decision tree auxiliary to the
parsing process developed for our (GNU) AWK plug-in is not only useful for
AWK but for C++ programs as well. The decision tree strategy may even use
Boost.Hana compile-time structures to fully unroll the code. It can be used
to implement facilities such as:
```cpp
using hana::literals::operator""_s;
int foo;
std::string bar;
json::scanf(input, "{ foo: { bar: ? }, bar: ? }"_s, foo, bar);
```
That's also possible with Boost.JSON's current parser. But the following is
not:
```cpp
using hana::literals::operator""_s;
Foo foo;
std::string bar;
json::scanf(
input,
"{ foo: { bar: ? }, bar: ? }"_s,
[&foo](json::reader& reader) {
// ...
},
bar);
```
As has been explained before, push parsers don't compose. And you aren't
limited to root-level scanning: a `json::partial::scanf()` could act on
subtrees too. A prototype for this idea can be found at
<https://github.com/breese/trial.protocol/pull/43>.
## Review questions
> Please be explicit about your decision (ACCEPT or REJECT).
REJECT.
> How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
I've stalled a project of mine for a few weeks just to participate in this
review process. I think I've spent close to 50 hours.
I wrote a JSON tool in Boost.JSON and the same tool under a different
parser model to highlight important traits that were being dismissed in
previous debates. I've gone through the documentation and source code to
better understand some Boost.JSON details.
I've also been involved in previous Boost.JSON debates in this very mailing
list.
> Are you knowledgeable about the problem domain?
I've tested many JSON parser libraries in C++ up to this point. In my last
job I leveraged a JSON (pull) parser to merge multiple validation layers
into a one-pass operation and later leveraged Boost.Hana to automate the
parser generation. Our API is JSON-based and we apply plenty of
transformations (including DOM-based ones when unavoidable), some similar to
the GNU AWK plug-in demonstrated here (e.g. synthesizing a new field on
routed messages while avoiding DOM-tree-related costs).
I've contributed a few algorithms and subtree parsing ideas back to the
leveraged JSON (pull) parser. I offered the same composability ideas to
Boost.JSON a few months back when the topic appeared here in this very
mailing list.
In the past, I've designed parsers in C++ -- pull and push parsers. I've
played with some tools for parser generation such as re2c, gperf (gperf
might not really be parsing, but it is related), and Boost.Xpressive. I've
also experienced the pain of parsers that try to be too lightweight but
ultimately backfire, not only moving performance costs back onto the
consumer but also greatly increasing complexity.
--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/
This is my review of Boost.JSON by Vinnie Falco and Krystian Stasiowski.
First, a disclaimer: I have contributed to the library, mostly with design
advice, occasionally to the implementation.
Second, let's start with the end. I think that Boost.JSON should be
ACCEPTED. All of its aspects are of Boost quality, its maintainers are
qualified and responsive, the lack of a JSON library in Boost is frankly
embarrassing, and the json::value type serves as a standard variant for
scripting and transferring language-independent scalar and structured
values, which opens many possibilities for further integration with future
parts of the Boost ecosystem (I'll show an example of that later on.)
To evaluate the library (I was already broadly, but not intimately, familiar
with the structure and the interface), I started with the documentation,
specifically with the Quick Look and Usage sections.
It's very well written, and meets and exceeds the usual Boost level. I only
have a few remarks to make.
In Using Numbers, the documentation should explain more clearly and
prominently that v.as_double() does not succeed when v contains an integer,
but throws an exception. This is a common user expectation.
Instead of focusing on number_cast, it should advertise value_to<double>(v)
as the way to obtain a double from v, converting if necessary; and
similarly, value_to<int>(v) as the way to obtain an int, range-checking if
necessary.
number_cast should be considered legacy at this point, and not featured. I
understand that it offers an error_code& overload for those who don't want
exceptions, but the throwing version is entirely superseded by value_to.
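To illustrate the distinction with a small sketch (the literal value here is
mine):

    json::value v = 7;                       // stored as an integer
    double d = json::value_to<double>( v );  // converts: d == 7.0
    double e = v.as_double();                // throws: the stored kind is not double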
In Parsing, the examples using parse_options look like this:
parse_options opt;
opt.allow_comments = true;
opt.allow_trailing_commas = true;
opt.allow_invalid_utf8 = true;
value jv = parse( "[1,2,3,] // comment, extra comma ", storage_ptr(),
opt );
That's fine and is indeed the C++11 standard-compatible way to do things.
However, when designated initializers are available, such as on g++ or
clang++, or on VS2019 with /std:c++latest, the above can be written as
value jv = parse( "[1,2,3,] // comment, extra comma ", {},
  { .allow_comments = true, .allow_trailing_commas = true,
    .allow_invalid_utf8 = true } );
and it might be useful to show this style too.
In Value Conversion, Converting to Foreign Types, it's shown that defining a
constructor
customer::customer( value const& jv );
enables value_to<customer>(jv) to work. This is, in my opinion, a legacy
mechanism made obsolete by tag_invoke and should not be supported. This
example should be removed from the documentation, and only the tag_invoke
way should be featured.
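For reference, the tag_invoke form is roughly as follows (a sketch; the
customer members id, name, and late are assumed from the documentation's
example):

    customer tag_invoke( boost::json::value_to_tag<customer>,
        boost::json::value const& jv )
    {
        boost::json::object const& obj = jv.as_object();
        return customer {
            boost::json::value_to<std::uint64_t>( obj.at( "id" ) ),
            boost::json::value_to<std::string>( obj.at( "name" ) ),
            obj.at( "late" ).as_bool()
        };
    }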
And since we're on this page, it's odd that the StringLike requirement wants
construction from (char*, std::size_t) - pointer to non-const characters. I
don't think that any string type has such a constructor; they all take char
const*.
With the documentation carefully read (as in, quickly scanned), I proceeded
to try to use the library. A previous reviewer (Reiner Deyke) had brought up
the subject of CBOR parsing and serialization, and I had looked into CBOR as
a format. It seemed regular and easy to parse and serialize, so I decided to
write a CBOR serializer and parser for boost::json::value. How hard can that
be, really?
Not that hard. The serializer was quickly up and running, and it seemed
quite performant:
Serializing canada.json to CBOR: 1056200 bytes, 4840 us
Serializing canada.json to JSON: 2306988 bytes, 13036 us
(You can find the code at
https://github.com/pdimov/boost_json_review/blob/master/boost_json_cbor.cpp,
the serializer is the first 140 or so lines.)
As far as using the library went, it was completely painless and
straightforward.
Next up, the parser. I decided to write it using the json::value interface,
like a reference implementation. This, too, went well, and the parser was
soon parsing, and even producing the same value as the one serialized. The
performance, though...
Parsing canada.json from JSON: 10548 us
Parsing canada.json from CBOR: 33941 us, successful roundtrip
This wasn't good. Switching to a binary format such as CBOR is supposed to
be faster, not 3.4 times slower!
Before you say "well your parser is written like you were a college student
who was taught Java", be aware that I actually refactored the parser a
number of times; sometimes it was written in one style, sometimes in
another, functions went from being hand-inlined to extracted and back a few
times, but none of that made a dent on those 34ms.
To find out where the time was going, I looked into way more detail than I
needed to. For instance, I saw that assigning f.ex. 1.0 to a default-constructed
json::value executes much more code than it needs to, and makes non-inlined
calls into libboost_json. One would expect this to take one check and branch
(do we need to deallocate anything? no), and two assignments (kind =
kind_double; dbl_ = 1.0).
Turns out that the efficient way to assign 1.0 is not `v = 1.0;`, but
`v.emplace_double() = 1.0;`. I of course replaced the assignments with this
supposedly efficient form, and it of course made no difference at all.
I then in desperation actually removed all the assignments. And it barely
made any difference.
At this point I decided that the json::value interface is not the right fit
for a performant parser, and switched to using boost::json::value_stack.
This is what Boost.JSON's parser itself uses, so if that's not fast, well, I
don't know what will be.
Updating the parser to use value_stack was basically trivial:
https://github.com/pdimov/boost_json_review/commit/7aa8d2dc8c706c3f4204d6fe…
and it worked. The performance was what I expected:
Parsing canada.json from JSON: 10044 us
Parsing canada.json from CBOR: 6332 us, successful roundtrip
(You can find the value_stack parser at
https://github.com/pdimov/boost_json_review/blob/master/boost_json_cbor_st.…
and the results at
https://github.com/pdimov/boost_json_review/blob/master/boost_json_cbor_st.….)
(Since now changes to the parser code actually affected performance, I added
two optimizations to the array parsing - a fast path for arrays containing
only doubles, and a fast path for arrays containing only integers.)
Is value hopeless for parsing, then? I went back to the other parser.
This is how the part that parses objects looks, with the irrelevant parts
omitted:
boost::json::object & o = v.emplace_object();
o.reserve( n );
for( std::size_t i = 0; i < n; ++i )
{
    // key string
    boost::json::string_view sv( ... );
    // value
    boost::json::value w( v.storage() );
    first = parse_cbor_value( first, last, w );
    o.insert( boost::json::key_value_pair( sv, std::move( w ) ) );
}
The reason I use o.insert here is that I wanted the CBOR parser to match the
behavior of the JSON parser with respect to key order and duplicate keys.
The JSON parser preserves key order, which is both user-friendly and better
from a security perspective (observing key order may leak the seed of the
hash function.)
It also handles duplicates in an unspecified manner. Well, whatever that
unspecified manner was, I wanted to do the same, and using `insert` looked
like the way to do it.
There however was no `insert` taking a string_view and a value, so I had to
create a key_value_pair. There was `insert_or_assign` that did take a
string_view and a value, but it overwrites duplicates, and I didn't want
that.
I consulted Vinnie Falco about the lack of `insert(string_view, value)`, and
he pointed me at `emplace`. This sounded exactly like what my hypothetical
`insert` would do. However, hearing `emplace` gave me an idea. In the same
way `v.emplace_object()` gives me `object&`, `o.emplace(sv)` could have
given me `value&`, which I could then pass directly to `parse_cbor_value`:
first = parse_cbor_value( first, last, o.emplace(sv) );
I shared this brilliant observation with Vinnie, and he replied "obj[sv]".
obj[sv] indeed. Sometimes we forget that C++98 did have a useful feature or
two.
This doesn't exactly work as I wanted it to, because it neither guarantees
insertion order, nor has the "right" duplicate handling. Nevertheless, I did
make the change:
https://github.com/pdimov/boost_json_review/commit/e6baf8f1ce81966534ac93c5…
and guess what?
Parsing canada.json from JSON: 10261 us
Parsing canada.json from CBOR: 6761 us, successful roundtrip
So the problem the entire time was that one single line:
o.insert( boost::json::key_value_pair( sv, std::move( w ) ) );
I haven't looked into it, but I suspect that it for some reason makes a copy
of `w`, even though it's passed using `std::move` and uses the right
`storage_ptr`. Oh well. At least it works now.
So, there is nothing much wrong with the `value` interface, and it's
possible to build values the straightforward way, instead of using a
value_stack, without significant loss in performance.
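For illustration, "the straightforward way" is along these lines (a sketch,
not code from the CBOR parser):

    boost::json::value v;
    boost::json::object & o = v.emplace_object();
    o[ "req_id" ] = 42;
    boost::json::array & a = o[ "result" ].emplace_array();
    a.emplace_back( "ok" );
    a.emplace_back( "done" );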
These CBOR parsers are entirely functional, and only have the following
limitations:
* Binary strings are not supported (as they're not representable in
json::value);
* Infinite sequences are not supported (because I was too lazy);
* 32 bit platforms need to check the string/array/object sizes for possible
size_t overflow (because I was way too lazy.)
I had also looked at, and tried, the value conversion interface, in the
course of developing the examples of an unreleased library of mine,
Describe. The value_from example is
https://pdimov.github.io/describe/doc/html/describe.html#example_to_json
value_to is at
https://pdimov.github.io/describe/doc/html/describe.html#example_from_json
and a simple JSON-RPC example is at
https://pdimov.github.io/describe/doc/html/describe.html#example_json_rpc
These all worked without any trouble, except I had to link with
Boost.Container for one of them, which didn't seem necessary. They are also
a good illustration of the possibilities that having a standard json::value
type presents, in terms of further library development. (One could f.ex.
imagine a Boost.Script interpreter that uses json::value as its data type,
and can be used to script C++ code.)
In conclusion, the library works, is well documented, and should be
ACCEPTED. I have no acceptance conditions. Obviously, the performance
problem concerning object::insert needs to be addressed, and more generally,
the value interface needs a bit more performance scrutiny because it's not
currently exercised by the benchmarks, but I trust that this will be done.
Solid work, thanks to the authors for their submission, and thank you for
reading this far.
I cannot seem to get AppVeyor or Travis CI to test the repositories of
which I am the maintainer (PP, VMD, and TTI) any more. Normally I would
not care much about the CI testing, but the fact that they seem to have
stopped working completely, or maybe just stopped sending me back
results, bothers me. All three have appropriate .yml files. Have these
CI services stopped working? How do I integrate them again with these
Boost libraries?