[gsoc 2008] JSON Archives for Boost.serialization

Hi all, I'm new to this list but have been tracking Boost development for a long time. Never thought I could contribute something new to Boost, but after seeing this year's GSoC Boost proposal, I changed my mind and decided to take a deeper look at the JSON proposal.

I've used the Boost.serialization library extensively and always end up needing something more portable than the binary archive and less heavyweight than the XML archive. Also, I spend a lot of time developing with Boost.Python and a format already supported by Python would be great, so seeing there was interest in a JSON archive was the perfect opportunity to join.

Anyway, I've some points regarding this project before I apply, don't know if I should send them to Jeff (who suggested the idea) or to Robert (who implemented Boost.serialization). Instead of just bothering them, I'll bother all the subscribers to boost@lists.boost.org :-)

- the JSON spec is quite simple, but given that it's a subset of YAML (actually Syck parses JSON as well), should it support it in the future as well?

- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)

- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.

- wading through the Boost mailing list archives, I found a message [2] by Daryle Walker in which he expressed interest in a JSON serializer some while ago (2005), but I think it wasn't discussed any further. Eric Newhuis wrote again about having a JSON archive just a few weeks ago [3]

- which platforms must be supported? I can only provide support to GCC under Linux, but I guess it has to support MSVC and some other platforms. Will it have to be available for all the platforms supported by Boost at the end of the summer or is it more of a process? That is, given that one of the purposes of GSoC is to involve more people into free software/open source projects, whoever (I hope it's me :-)) implements the JSON serializer, will become its maintainer too and will take care of all the tasks related to accept patches, track bugs, add support for future/incomplete platforms, etc.

- I thought of adding JSON support to Boost.Spirit as well. Currently it has support for printing ASTs in XML (see tree_to_xml). A JSON dumper would be quite useful for manipulating ASTs in a high level language, such as Python through Boost.Python

Cheers.

[1] http://jsoncpp.sourceforge.net/
[2] http://lists.boost.org/Archives/boost/2005/09/94438.php
[3] http://lists.boost.org/boost-users/2008/03/34336.php

Hello Esteve! On Thu, Mar 20, 2008 at 10:10:07PM +0100, Esteve Fernandez wrote:
- the JSON spec is quite simple, but given that it's a subset of YAML (actually Syck parses JSON as well), should it support it in the future as well?
Actually JSON is not really a subset of YAML, see also the documentation of the Perl module of Marc Lehmann, who wrote a fully standards compliant JSON parser in C (bound to Perl): http://search.cpan.org/~mlehmann/JSON-XS-2.1/XS.pm Look at the comparison in the section 'YAML and JSON'.
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Yes, the JSON spec states that JSON is Unicode text, encoded in (any) Unicode encoding (usually UTF-8). However, there is one hard part when writing a JSON parser, you have to take care to handle the \uXXXX literals in strings correctly. The JSON spec (RFC 4627, http://www.ietf.org/rfc/rfc4627.txt ) states in section 2.5: To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". So care must be taken not to overlook this small detail.
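The surrogate-pair arithmetic behind that RFC example can be sketched in a few lines. This is only an illustration, assuming a C++11 compiler; the function name combine_surrogates is made up here and does not come from any of the libraries mentioned in this thread.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: combine the two 16-bit values read from a
// "\uXXXX\uXXXX" escape pair into a single Unicode code point.
// `high` must lie in [0xD800, 0xDBFF] and `low` in [0xDC00, 0xDFFF].
inline std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low) {
    if (high < 0xD800 || high > 0xDBFF)
        throw std::invalid_argument("not a high surrogate");
    if (low < 0xDC00 || low > 0xDFFF)
        throw std::invalid_argument("not a low surrogate");
    // Each surrogate carries 10 bits; the result is offset by 0x10000
    // because the pair only encodes code points above the BMP.
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For the G clef example, combine_surrogates(0xD834, 0xDD1E) gives 0x10000 + (0x34 << 10) + 0x11E = 0x1D11E. A parser that treats each \uXXXX escape as an independent character would instead produce two invalid lone surrogates.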
- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.
I wrote a JSON parser in C++, which I haven't released yet and which should be almost 100% compliant with the JSON RFC. However, I defined my own bytebuffer class with special UTF-8 handling, as I only have to deal with UTF-8 in my problem domain. But I could release the code under the Boost license anytime if someone is interested. I guess some work would still have to be done for Boost inclusion, though.
Greetings, Robin Redeker
-- Robin Redeker | Deliantra, the free code+content MORPG elmex@ta-sa.org / r.redeker@gmail.com | http://www.deliantra.net http://www.ta-sa.org/ |

Just for the sake of chiming in, I'd also be very interested in JSON archive support. I haven't reviewed the available options in great detail yet. I know there are several C and C++ libraries available, but I've been leaning towards using YAJL because it looks very clean and simple, and is SAX-based: http://lloydforge.org/projects/yajl/
-Mike
On Mar 20, 2008, at 5:35 PM, Robin Redeker wrote:
Hello Esteve!
On Thu, Mar 20, 2008 at 10:10:07PM +0100, Esteve Fernandez wrote:
- the JSON spec is quite simple, but given that it's a subset of YAML (actually Syck parses JSON as well), should it support it in the future as well?
Actually JSON is not really a subset of YAML, see also the documentation of the Perl module of Marc Lehmann, who wrote a fully standards compliant JSON parser in C (bound to Perl):
http://search.cpan.org/~mlehmann/JSON-XS-2.1/XS.pm
Look at the comparison in the section 'YAML and JSON'.
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Yes, the JSON spec states that JSON is Unicode text, encoded in (any) Unicode encoding (usually UTF-8). However, there is one hard part when writing a JSON parser, you have to take care to handle the \uXXXX literals in strings correctly. The JSON spec (RFC 4627, http://www.ietf.org/rfc/rfc4627.txt ) states in section 2.5:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
So care must be taken not to overlook this small detail.
- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.
I wrote a JSON parser in C++, which I haven't released yet and which should be almost 100% compliant with the JSON RFC. However, I defined my own bytebuffer class with special UTF-8 handling, as I only have to deal with UTF-8 in my problem domain.
But I could release the code under the Boost license anytime if someone is interested. I guess some work would still have to be done for Boost inclusion, though.
Greetings, Robin Redeker

Hi Robin. On Friday, 21 March 2008 01:35:15, Robin Redeker wrote:
Actually JSON is not really a subset of YAML, see also the documentation of the Perl module of Marc Lehmann, who wrote a fully standards compliant JSON parser in C (bound to Perl):
http://search.cpan.org/~mlehmann/JSON-XS-2.1/XS.pm
Look at the comparison in the section 'YAML and JSON'.
Yes, you're right. JSON is not a YAML subset, strictly speaking, but the most widely used YAML parser (Syck) accepts JSON as well. Then again, the most widely used HTML parser (Trident, IE) accepts all sorts of things that are not HTML :-) so it's best not to try to parse YAML for the time being and to stick to a pure JSON parser.
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Yes, the JSON spec states that JSON is Unicode text, encoded in (any) Unicode encoding (usually UTF-8). However, there is one hard part when writing a JSON parser, you have to take care to handle the \uXXXX literals in strings correctly. The JSON spec (RFC 4627, http://www.ietf.org/rfc/rfc4627.txt ) states in section 2.5:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
So care must be taken not to overlook this small detail.
Thanks for pointing this out. This is one of the things that worries me about the JSON parser before I apply to GSoC: whether it has to be fully compliant with the Unicode part of the JSON spec. TinyJSON advertises itself as Unicode compliant, but I don't know whether it takes this part of the JSON spec into account.
- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.
I wrote a JSON parser in C++, which I haven't released yet and which should be almost 100% compliant with the JSON RFC. However, I defined my own bytebuffer class with special UTF-8 handling, as I only have to deal with UTF-8 in my problem domain.
But I could release the code under the Boost license anytime if someone is interested. I guess some work would still have to be done for Boost inclusion, though.
Great! I'll surely ask you for your code :-) Cheers.

Robin Redeker wrote:
On Thu, Mar 20, 2008 at 10:10:07PM +0100, Esteve Fernandez wrote:
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Yes, the JSON spec states that JSON is Unicode text, encoded in (any) Unicode encoding (usually UTF-8). However, there is one hard part when writing a JSON parser, you have to take care to handle the \uXXXX literals in strings correctly. The JSON spec (RFC 4627, http://www.ietf.org/rfc/rfc4627.txt ) states in section 2.5:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
Sorry I'm a bit late to this - only just got around to reading this thread.

I have a JSON string parser that handles this correctly by parsing into a UTF-16 buffer which can then be re-encoded to the required string type and encoding. I described it to Thomas Jensen so he could use it if he wanted to in TinyJSON, but anybody else should feel free to grab it if it's useful too. The parser is in the bottom half of this page: http://www.kirit.com/Blog:/2008-03-31/Thoughts%20on%20TinyJSON

On the same page are some notes about how I store JSON objects. Probably not of interest for Boost.Serialization though.

In any case I think ICU may be able to provide some suitable string encoding functions that the string parser could be parametrised on.

By co-incidence I'd picked exactly the same test case as the standard uses :)

Kirit
-- http://www.kirit.com/
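The two-stage idea described in this message (collect UTF-16 code units first, re-encode afterwards) might look roughly like the sketch below. It is not the code from the linked page: the names hex_value, to_utf16 and utf16_to_utf8 are invented for illustration, C++11 is assumed, and only ASCII input plus \uXXXX escapes are handled to keep it short.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: numeric value of one hexadecimal digit.
inline unsigned hex_value(char c) {
    if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<unsigned>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<unsigned>(c - 'A' + 10);
    throw std::invalid_argument("bad hex digit in \\u escape");
}

// Stage 1: copy ASCII characters and \uXXXX escapes into a UTF-16 buffer.
// Other escapes (\n, \", ...) and raw non-ASCII bytes are omitted for brevity.
inline std::u16string to_utf16(const std::string& body) {
    std::u16string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(body[i]);
        if (c >= 0x80) throw std::invalid_argument("non-ASCII input not handled");
        if (c != '\\') { out.push_back(static_cast<char16_t>(c)); continue; }
        if (i + 5 >= body.size() || body[i + 1] != 'u')
            throw std::invalid_argument("only complete \\uXXXX escapes are handled");
        unsigned unit = 0;
        for (std::size_t k = i + 2; k < i + 6; ++k) unit = unit * 16 + hex_value(body[k]);
        out.push_back(static_cast<char16_t>(unit));
        i += 5;  // skip 'u' and the four hex digits
    }
    return out;
}

// Stage 2: re-encode the UTF-16 buffer as UTF-8, joining surrogate pairs.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With the test case from the RFC, to_utf16 yields the two code units 0xD834 and 0xDD1E, and utf16_to_utf8 turns them into the four bytes F0 9D 84 9E. Keeping the intermediate buffer in UTF-16 means the escape parser never needs to know the final encoding, which is what would let the second stage be swapped for an ICU-based converter.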

Esteve Fernandez wrote:
Hi all, I'm new to this list but have been tracking Boost development for a long time. Never thought I could contribute something new to Boost, but after seeing this year's GSoC Boost proposal, I changed my mind and decided to take a deeper look at the JSON proposal.
I've used the Boost.serialization library extensively and always end up needing something more portable than the binary archive and less heavyweight than the XML archive. Also, I spend a lot of time developing with Boost.Python and a format already supported by Python would be great, so seeing there was interest in a JSON archive was the perfect opportunity to join.
Anyway, I've some points regarding this project before I apply, don't know if I should send them to Jeff (who suggested the idea) or to Robert (who implemented Boost.serialization). Instead of just bothering them, I'll bother all the subscribers to boost@lists.boost.org :-)
Good choice...waiting for a particular person is sometimes slow.
- the JSON spec is quite simple, but given that it's a subset of YAML (actually Syck parses JSON as well), should it support it in the future as well?
I see YAML as another potentially interesting archive target, but I think others have established that it's different.
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Don't know, but this is certainly an issue. There's some utf8 facet stuff floating around in boost (boost/detail/utf8_codecvt_facet.hpp) I think -- I'm not sure if that solves your problem though.
- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.
MIT is compatible, but we're really trying to get everything to be Boost license.
- wading through the Boost mailing list archives, I found a message [2] by Daryle Walker in which he expressed interest in a JSON serializer some while ago (2005), but I think it wasn't discussed any further. Eric Newhuis wrote again about having a JSON archive just a few weeks ago [3]
Others are chiming in...
- which platforms must be supported? I can only provide support to GCC under Linux, but I guess it has to support MSVC and some other platforms. Will it have to be available for all the platforms supported by Boost at the end of the summer or is it more of a process? That is, given that one of the
That's up to you to propose. Given that we here at Boost are interested in cross-platform code it's certainly better to have broader support. In case you aren't aware, you can download a free version of vc8 (called Express). But no matter, even if you don't test directly on any other platforms, if you stick to std c++ you should be ok. The mentors and others on the list will also help you with this issue.
purposes of GSoC is to involve more people into free software/open source projects, whoever (I hope it's me :-)) implements the JSON serializer, will become its maintainer too and will take care of all the tasks related to accept patches, track bugs, add support for future/incomplete platforms, etc.
That's the general idea :-) HTH, Jeff

On Friday, 21 March 2008 14:01:30, Jeff Garland wrote:
- what about Unicode? I know that Boost.Regex supports Unicode if compiled against ICU and the JSON spec states that everything must be in Unicode (correct me if I'm wrong)
Don't know, but this is certainly an issue. There's some utf8 facet stuff floating around in boost (boost/detail/utf8_codecvt_facet.hpp) I think -- I'm not sure if that solves your problem though.
As Robin pointed out, there's some trickery for characters not in the BMP, but I couldn't find any mention of it in the header file. Should I ask the original authors privately or are they on the Boost mailing list?
- TinyJSON and JSON.Spirit both use a MIT-like license (JSON.Spirit is licensed under CPOL). The Boost license is compatible with them but, could it pose a problem? There's JSONcpp [1] as well, which is public domain.
MIT is compatible, but we're really trying to get everything to be Boost license.
IANAL, but I think the MIT license allows relicensing under other free software licenses. However, if I were the author, I would find it bad etiquette for someone to do that without asking me. So I'll mail the author of the chosen parser and see if he agrees to license it under the Boost license. If he doesn't agree... well, playing with Boost.Spirit is always fun and the JSON grammar is not exceptionally complicated.
- wading through the Boost mailing list archives, I found a message [2] by Daryle Walker in which he expressed interest in a JSON serializer some while ago (2005), but I think it wasn't discussed any further. Eric Newhuis wrote again about having a JSON archive just a few weeks ago [3]
Others are chiming in...
Yep, I just pointed that out to find out whether there had been another discussion that I wasn't able to find in the archives. But I guess there hasn't been any.
- which platforms must be supported? I can only provide support to GCC under Linux, but I guess it has to support MSVC and some other platforms. Will it have to be available for all the platforms supported by Boost at the end of the summer or is it more of a process? That is, given that one of the
That's up to you to propose. Given that we here at Boost are interested in cross-platform code it's certainly better to have broader support. In case you aren't aware, you can download a free version of vc8 (called Express). But no matter, even if you don't test directly on any other platforms, if you stick to std c++ you should be ok. The mentors and others on the list will also help you with this issue.
I knew about the Express editions that Microsoft provides for its products, but thought the Visual C++ one was not supported by Boost. If it's ok to use it, then I would be glad to add it to the list of supported platforms of the JSON archives.
purposes of GSoC is to involve more people into free software/open source projects, whoever (I hope it's me :-)) implements the JSON serializer, will become its maintainer too and will take care of all the tasks related to accept patches, track bugs, add support for future/incomplete platforms, etc.
That's the general idea :-)
Well, let's hope I get to take on those tasks :-D Cheers.
participants (5): Esteve Fernandez, Jeff Garland, Kirit Sælensminde, Michael Dickey, Robin Redeker