[serialization] Add UTF-8 BOM support to xml_warchive

Hi,

Recently I ran into the problem that the boost::serialization library could not handle XML files that begin with the three UTF-8 BOM (Byte Order Mark) bytes. The serialization library creates XML files without the BOM bytes, but when such a file is saved in an external Windows program, for example XML Notepad, these bytes are added automatically. After that, the boost::serialization library can no longer read the XML file.

According to Wikipedia (http://en.wikipedia.org/wiki/Byte_order_mark) the UTF-8 BOM is optional, so creating XML files without the BOM bytes is all right. However, reading should be extended to handle both kinds of file, with and without the BOM.

I propose to enhance the xml_warchive and text_warchive for reading with support of the BOM bytes. Example:

    namespace {
        const wchar_t g_cchUtf8Bom1 = 0xEF;
        const wchar_t g_cchUtf8Bom2 = 0xBB;
        const wchar_t g_cchUtf8Bom3 = 0xBF;
    }

    void CheckAndCorrectUtf8Bom(std::wifstream* pifs)
    {
        _ASSERT_POINTER(pifs);

        wchar_t chUtf8Bom1 = 0;
        wchar_t chUtf8Bom2 = 0;
        wchar_t chUtf8Bom3 = 0;

        chUtf8Bom1 = pifs->peek();
        if (chUtf8Bom1 == g_cchUtf8Bom1) {
            *pifs >> chUtf8Bom1;
            _ASSERT(chUtf8Bom1 == g_cchUtf8Bom1);
            *pifs >> chUtf8Bom2;
            _ASSERT(chUtf8Bom2 == g_cchUtf8Bom2);
            *pifs >> chUtf8Bom3;
            _ASSERT(chUtf8Bom3 == g_cchUtf8Bom3);
        } else {
            // Reset to start of the stream
            pifs->seekg(0, std::ios_base::beg);
        }
    }

Kind regards,
Tijmen van Voorthuijsen

--
T. van Voorthuijsen
Senior System Engineer
Noldus Information Technology bv

On Friday, April 15, 2011, Tijmen van Voorthuijsen wrote:
> Recently I ran into the problem that the boost::serialization library could not handle XML files which contain the three UTF-8 BOM (Byte Order Mark) bytes.
>
> I propose to enhance the xml_warchive and text_warchive for reading with support of the BOM bytes. Example:
>
>     namespace {
>         const wchar_t g_cchUtf8Bom1 = 0xEF;
>         const wchar_t g_cchUtf8Bom2 = 0xBB;
>         const wchar_t g_cchUtf8Bom3 = 0xBF;
>     }
>
>     void CheckAndCorrectUtf8Bom(std::wifstream* pifs)
>     {
>         _ASSERT_POINTER(pifs);
>
>         wchar_t chUtf8Bom1 = 0;
>         wchar_t chUtf8Bom2 = 0;
>         wchar_t chUtf8Bom3 = 0;
>
>         chUtf8Bom1 = pifs->peek();
>         if (chUtf8Bom1 == g_cchUtf8Bom1) {
>             *pifs >> chUtf8Bom1;
>             _ASSERT(chUtf8Bom1 == g_cchUtf8Bom1);
>             *pifs >> chUtf8Bom2;
>             _ASSERT(chUtf8Bom2 == g_cchUtf8Bom2);
>             *pifs >> chUtf8Bom3;
>             _ASSERT(chUtf8Bom3 == g_cchUtf8Bom3);

This logic seems wrong. Just because the first byte is 0xef doesn't mean it's necessarily a BOM.

On Fri, Apr 15, 2011 at 05:14:03PM -0400, Frank Mori Hess wrote:
> On Friday, April 15, 2011, Tijmen van Voorthuijsen wrote:
>> Recently I ran into the problem that the boost::serialization library could not handle XML files which contain the three UTF-8 BOM (Byte Order Mark) bytes.
>>
>> I propose to enhance the xml_warchive and text_warchive for reading with support of the BOM bytes. Example:
>
> This logic seems wrong. Just because the first byte is 0xef doesn't mean it's necessarily a BOM.

If it's supposed to be well-formed XML, there's nothing in the mandatory 'prolog' production that can have the value 0xef as the first octet.

An XML document in UTF-16 MUST have a BOM, and MAY have a BOM in UTF-8. Unless indicated externally (MIME, other framing), an XML processor MUST be able to handle the presence of BOMs, and MUST be able to process the UTF-8 and UTF-16 families of encodings.

Of course, I may have misread the specification (XML 1.0 5e); feel free to show a well-formed counter-example.

--
Lars Viklund | zao@acc.umu.se
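
The BOM cases mentioned above can be told apart from the first few octets of the document. The sketch below is only an illustration under stated assumptions: the names Bom, detect_bom and bom_length are hypothetical, the code uses C++11 (enum class), and UTF-32 marks are deliberately not handled, so a UTF-32LE document would be reported as UTF-16 little-endian.

```cpp
#include <cstddef>

// Hypothetical classification of a leading byte order mark.
enum class Bom { None, Utf8, Utf16BigEndian, Utf16LittleEndian };

// Inspects the first octets of a buffer and reports which BOM, if any,
// it starts with. UTF-32 is not considered (an assumption of this sketch).
Bom detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BigEndian;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LittleEndian;
    return Bom::None;
}

// Number of octets to skip before the document content starts.
std::size_t bom_length(Bom bom)
{
    switch (bom) {
        case Bom::Utf8:              return 3;
        case Bom::Utf16BigEndian:    return 2;
        case Bom::Utf16LittleEndian: return 2;
        case Bom::None:              return 0;
    }
    return 0;
}
```

A reader could call detect_bom on the first bytes of the file and skip bom_length(...) octets before parsing; a result of Bom::None means the document starts directly with its prolog.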

Participants (3):
- Frank Mori Hess
- Lars Viklund
- Tijmen van Voorthuijsen