Idea Suggestion for GSoC'21
Hey! This is Sanyam Bhaskar. I read over the default Ideas provided to us and XML Parser really caught my eye. I would like to contribute to it but don’t know where to start. Also, as per my understanding, XML is relatively outdated when compared to data languages like JSON. So in addition to this being an XML Parser, I think adding a JSON parser alongside it would boost the library’s utility in the modern-day industry. I would appreciate it if someone could tell me how to get started and give some feedback on the suggestion. Looking forward to contributing to the project. Yours, Sanyam Bhaskar
On Wed, Mar 10, 2021 at 14:01, Sanyam Bhaskar via Boost wrote:
Hey! This is Sanyam Bhaskar.
Hi Sanyam,
I read over the default Ideas provided to us and XML Parser really caught my eye.
Glad to know. I'm the potential mentor for this project.
Also, as per my understanding, XML is relatively outdated when compared to data languages like JSON. So in addition to this being an XML Parser, I think adding a JSON parser alongside it would boost the library’s utility in the modern-day industry.
We held a review for a JSON library not long ago and the library got accepted, so we already have a JSON (push) parser. I still see room for a JSON pull parser, but I'd not be willing to spearhead this effort, so unless someone else shows up to mentor it I don't think we'd have such a project. XML is an old, overengineered and hated format (and rightfully so), but industry adoption basically forces us to use it for interoperability with a few services to this day. So that's the value for XML here, interoperability with legacy software. It's not a value to be neglected. I also think it'd be a good project for first-time students as the basics of the format are really well-known and I believe in my skills to gradually point the student to its quirks as the project advances.
I would appreciate it if someone could tell me how to get started and give some feedback on the suggestion. Looking forward to contributing to the project.
I wrote some of the ideas that you saw on the wiki page for Boost GSoC. I didn't know which projects would attract students, so I didn't invest a lot of time detailing each individual project (my bad).

The programming competency test was to write a CSV parser. However, you can negotiate to write a parser for a different format if you think it'd be more interesting to showcase your C++ skills (please choose a simple one-afternoon-to-implement format and negotiate the alternative target beforehand). Once you're done, send the code directly to me (don't post it publicly). I'll then request a change here and there to see how well you manage to change the code, and send other comments as well.

On top of that, you'll need to write a proposal to be submitted through Google's platform during the student application period (March 29 - April 13). If you want, you can send your proposal here and ask for feedback (this time you must not send it to me in private, but must post it publicly on the list). If you don't need early feedback on your proposal, you can also decide not to post it here at all and only send it through Google's platform (then your proposal will only be available to the Boost GSoC team). I can't suggest specific strategies, but I advise you to strive to make a good impression. Early feedback obviously will give you "extra" time to improve your proposal.

Do keep in mind that sending your proposal to this list is not an official submission. You must always send a final proposal through Google's platform during the student application period. Google will eventually announce how many student slots Boost is given, and the accepted students will be announced on May 17.

-- Vinícius dos Santos Oliveira https://vinipsmaker.github.io/
Hi Vinicius, allow me to jump into this discussion with some thoughts. On 2021-03-10 2:16 p.m., Vinícius dos Santos Oliveira via Boost wrote:
XML is an old, overengineered and hated format (and rightfully so), but industry adoption basically forces us to use it for interoperability with a few services to this day. So that's the value for XML here, interoperability with legacy software. It's not a value to be neglected.
I also think it'd be a good project for first-time students as the basics of the format are really well-known and I believe in my skills to gradually point the student to its quirks as the project advances.
I'll give advice very similar to what I shared with the FFT proposals: please consider not re-implementing a full XML library (which is quite a daunting task), but rather focus on the C++ API as an *interface* that can be layered on top of existing XML libraries. The world already has way too many incomplete and buggy XML libraries. Please let's not make it worse.

The approach I had taken (admittedly many years ago) consists of defining a C++ API around one of the more popular (and efficient) implementations at the time: libxml2 (http://www.xmlsoft.org/), with support for a DOM-like API as well as a SAX-like streaming API. Of particular importance is that a fully functional XML API needs to have some support for Unicode, which is sadly still quite difficult to do in C++. My choice was to parametrize the entire API around the character type, letting users pick their own Unicode bindings (a simple trait-like class would be enough to bind to alternative types there).

Anyhow, my code is still online, if anyone wants to have a look: https://github.com/stefanseefeld/boost.xml

While the libxml2 bindings work very nicely (including XPath support and some other nice features), I never felt comfortable proposing my work in its current form for adoption into Boost without having added at least one other XML library backend (Xerces comes to mind), to make sure the API itself is robust enough and doesn't accidentally leak libxml2 design choices.

Best, Stefan -- ...ich hab' noch einen Koffer in Berlin...
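To make the layering idea above concrete, here is a minimal sketch of a front-end API that talks to a backend interface and converts text through a trait. Every name in it (`unicode_traits`, `backend`, `basic_document`, `fixed_backend`, `my_text`) is hypothetical and is not taken from the boost.xml repository linked above. It also simplifies the design described in the message: the trait is keyed on a string type rather than on a character type, and the stand-in backend returns a fixed tree instead of wrapping libxml2 or Xerces.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trait: binds a user-chosen text type to the UTF-8 bytes
// that a backend hands back.
template <typename String>
struct unicode_traits;

template <>
struct unicode_traits<std::string> {
    static std::string from_utf8(const std::string& bytes) { return bytes; }
};

// What a concrete binding (libxml2, Xerces, ...) would have to implement.
struct backend {
    virtual ~backend() = default;
    virtual std::string root_name_utf8() const = 0;
    virtual std::vector<std::string> child_names_utf8() const = 0;
};

// User-facing document type; it never names a concrete XML library.
template <typename String>
class basic_document {
public:
    explicit basic_document(std::unique_ptr<backend> impl)
        : impl_(std::move(impl)) {}

    String root_name() const {
        return unicode_traits<String>::from_utf8(impl_->root_name_utf8());
    }

    std::vector<String> child_names() const {
        std::vector<String> out;
        for (const auto& name : impl_->child_names_utf8())
            out.push_back(unicode_traits<String>::from_utf8(name));
        return out;
    }

private:
    std::unique_ptr<backend> impl_;
};

// Stand-in backend with a fixed tree, used only to exercise the interface.
struct fixed_backend : backend {
    std::string root_name_utf8() const override { return "config"; }
    std::vector<std::string> child_names_utf8() const override {
        return {"server", "client"};
    }
};

// A user-defined text type, bound to the API through the trait alone.
struct my_text {
    std::string utf8;
};

template <>
struct unicode_traits<my_text> {
    static my_text from_utf8(const std::string& bytes) { return my_text{bytes}; }
};

// Usage: the same backend serves two different text types.
void interface_example() {
    basic_document<std::string> plain(std::unique_ptr<backend>(new fixed_backend));
    assert(plain.root_name() == "config");

    basic_document<my_text> custom(std::unique_ptr<backend>(new fixed_backend));
    assert(custom.child_names().size() == 2);
}
```

Adding a second real backend, as the message suggests, would be the actual check that `backend` does not leak one library's design choices; the fixed backend here cannot show that.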
On Wed, Mar 31, 2021 at 10:11 PM Stefan Seefeld via Boost < boost@lists.boost.org> wrote:
allow me to jump into this discussion with some thoughts.
On 2021-03-10 2:16 p.m., Vinícius dos Santos Oliveira via Boost wrote:
XML is an old, overengineered and hated format (and rightfully so), but industry adoption basically forces us to use it for interoperability with a few services to this day. So that's the value for XML here, interoperability with legacy software. It's not a value to be neglected.
I'll give advice very similar to what I shared with the FFT proposals: please consider not re-implementing a full XML library (which is quite a daunting task), but rather focus on the C++ API as an *interface* that can be layered on top of existing XML libraries.
While normally I'd agree with you, by this train of thought, we wouldn't have Boost.JSON accepted in Boost right now.
The world already has way too many incomplete and buggy XML libraries.
True. But different people have different trade-offs. libxml2, Xerces and Expat may be complete, and as close to bug-free as it gets in C/C++ XML, but they are certainly not modern C++, often don't do incremental parsing, and certainly don't allow the kind of allocator support Boost.JSON introduced. Nor are they the fastest.

So a non-wrapper, Boost.JSON-like Boost.XML would be very interesting. Perhaps even, like Boost.JSON and controversially, forgoing SAX and only doing DOM.

The main issue with XML is all the little things to get right, like character entities, entity includes inherited from DTDs, DTDs themselves (for validation and default values), whitespace normalization, namespace support, and related technologies like XSD, XPath, XLink, XInclude, XQuery, etc. Proper PSVI (post-schema-validation infoset) is also often problematic, but that assumes a validating parser (via DTD or XSD) in the first place.

There's definitely space to explore a Boost.JSON-like low-level modern parser building only a DOM with value semantics and allocator support, with a modern API. Much could be built on such a foundation, and that's an interesting GSoC project, even if it never "graduates".

In any case, besides the three mentioned above, there are also rapidxml and pugixml, the latter still actively maintained. Perhaps they are not as complete, but they are definitely quite a bit faster than the "old" ones. --DD
On Thu, Apr 1, 2021 at 05:29, Dominique Devienne via Boost < boost@lists.boost.org> wrote:
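As an illustration of just the first item in that list of "little things", here is a hedged sketch that expands the five entities predefined by XML 1.0 (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`). The function name `decode_predefined_entities` is hypothetical. Numeric character references and DTD-declared entities are deliberately not handled and are reported as errors, which is exactly the kind of gap a complete parser would have to close.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: expands only the five predefined XML entities.
// Anything else after '&' (numeric references such as &#65;, or entities
// declared in a DTD) is rejected with an exception.
std::string decode_predefined_entities(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] != '&') {
            out += in[i++];
            continue;
        }
        const std::size_t end = in.find(';', i);
        if (end == std::string::npos)
            throw std::runtime_error("unterminated entity reference");
        const std::string name = in.substr(i + 1, end - i - 1);
        if (name == "lt") out += '<';
        else if (name == "gt") out += '>';
        else if (name == "amp") out += '&';
        else if (name == "apos") out += '\'';
        else if (name == "quot") out += '"';
        else throw std::runtime_error("unsupported entity: " + name);
        i = end + 1;  // continue after the terminating ';'
    }
    return out;
}

// Usage: text content as it appears in a document, then as an API returns it.
void entity_example() {
    assert(decode_predefined_entities("a &lt; b &amp;&amp; c") == "a < b && c");
}
```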
There's definitely space to explore a Boost.JSON-like low-level modern parser building
Boost.JSON is anything but low-level parsing. You definitely didn't explore its parser. I did. Boost.JSON would definitely not be an inspiration for Boost.XML. Can we please not hijack this thread for propaganda? Thank you. -- Vinícius dos Santos Oliveira https://vinipsmaker.github.io/
On Thu, Apr 1, 2021 at 10:30 AM Dominique Devienne
On Wed, Mar 31, 2021 at 10:11 PM Stefan Seefeld via Boost < boost@lists.boost.org> wrote:
consider not to re-implement a full XML library
So a non-wrapper, Boost.JSON-like Boost.XML would be very interesting. Perhaps even, like Boost.JSON and controversially, forgoing SAX and only doing DOM.
One thing I forgot to mention is that an explicit goal of any Boost.XML API, wrapper or not, should be to replicate Peter's Boost.Describe "data-binding" examples [1] [2], which convert JSON "values" to described C++ structures, but in the XML space. Can your attempt at an XML API do that, Stefan? That would be a very compelling Boost.XML IMHO. Even w/o any SAX support. --DD [1] https://github.com/pdimov/describe/blob/develop/example/to_json.cpp [2] https://github.com/pdimov/describe/blob/develop/example/from_json.cpp
On 2021-04-01 4:43 a.m., Dominique Devienne via Boost wrote:
One thing I forgot to mention is that an explicit goal of any Boost.XML API, wrapper or not, should be to replicate Peter's Boost.Describe "data-binding" examples, which convert JSON "values" to described C++ structures, but in the XML space. Can your attempt at an XML API do that, Stefan?
That's an orthogonal piece of functionality, which can be implemented on top of Boost.XML and Boost.Describe. Stefan -- ...ich hab' noch einen Koffer in Berlin...
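A rough sketch of that layering, under loud assumptions: the code below does not use Boost.Describe or any XML library. The hand-written `for_each_member` overload stands in for the member list a reflection facility would supply, and `to_xml` stands in for a serializer built on top of an XML API. Both names, and the `point` struct, are hypothetical. Member values are streamed as-is with no escaping, so this only suits types such as integers.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Example user type.
struct point {
    int x;
    int y;
};

// Stand-in for reflection data: visits each member as (name, value).
// A reflection library would generate the equivalent of this overload.
template <class F>
void for_each_member(const point& p, F&& f) {
    f("x", p.x);
    f("y", p.y);
}

// Visitor that writes one child element per member. No escaping is done.
struct xml_member_writer {
    std::ostream& os;

    template <class V>
    void operator()(const char* name, const V& value) const {
        os << '<' << name << '>' << value << "</" << name << '>';
    }
};

// Generic binding layer: knows nothing about `point` beyond the fact that
// a for_each_member overload exists for it (found by argument-dependent lookup).
template <class T>
std::string to_xml(const char* tag, const T& obj) {
    std::ostringstream os;
    os << '<' << tag << '>';
    for_each_member(obj, xml_member_writer{os});
    os << "</" << tag << '>';
    return os.str();
}

// Usage: serialize a described struct without touching its definition.
void binding_example() {
    assert(to_xml("point", point{1, 2}) == "<point><x>1</x><y>2</y></point>");
}
```

The point of the sketch is the separation: `to_xml` depends only on the member-visiting hook, so the binding layer and the XML API can evolve independently.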
On 2021-04-01 4:30 a.m., Dominique Devienne via Boost wrote:
But different people have different trade-offs. libxml2, Xerces and Expat may be complete, and as close to bug-free as it gets in C/C++ XML, but they are certainly not modern C++,
Stylistic questions ("modern C++") are secondary to functional correctness.
often not incremental parsing, and certainly don't allow the kind of allocator support Boost.JSON introduced. Nor are they the fastest.
libxml2 offers streaming APIs ("incremental parsing") and is among the fastest implementations you can get. As I said in the FFT thread: thinking that you can match such a library (both in functionality and performance) with a GSoC project is foolish, so it seems wiser to focus on the interface, then bind that to existing implementations.
The main issue with XML is all the little things to get right, like character entities, entity includes inherited from DTDs, DTDs themselves (for validation and default values), whitespace normalization, namespace support, and related technologies like XSD, XPath, XLink, XInclude, XQuery, etc. Proper PSVI (post-schema-validation infoset) is also often problematic, but that assumes a validating parser (via DTD or XSD) in the first place.
Exactly. How are you proposing to handle all the questions above?
There's definitely space to explore a Boost.JSON-like low-level modern parser building only a DOM with value semantics and allocator support, with a modern API. Much could be built on such a foundation, and that's an interesting GSoC project, even if it never "graduates".
In any case, besides the three mentioned above, there are also rapidxml and pugixml, the latter still actively maintained. Perhaps they are not as complete, but they are definitely quite a bit faster than the "old" ones. --DD
This is not about which XML library is better. Quite the opposite, in fact: I want to make an argument for establishing a modern C++ API that can be bound to any such library. We don't need more half-baked partial XML implementations; we need a standard C++ API for XML. Stefan -- ...ich hab' noch einen Koffer in Berlin...
participants (4)
- Dominique Devienne
- Sanyam Bhaskar
- Stefan Seefeld
- Vinícius dos Santos Oliveira