Duplicate source files in Boost: utf8_codecvt_facet

We now have at least one set of files more or less duplicated in Boost.
Ron Garcia's utf8_codecvt_facet is included in:
    boost/utf8_codecvt_facet.hpp
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/serialization/src/utf8_codecvt_facet.cpp
    boost/libs/program_options/detail/utf8_codecvt_facet.cpp
    boost/libs/doc/serialization/utf8_codecvt_facet.html
    boost/libs/serialization/test/test_utf8_codecvt_facet.cpp
Leaving aside how this came to be, I would like to do the following:
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/detail // new directory
    boost/libs/detail/src/utf8_codecvt_facet.cpp
    boost/libs/detail/test/test_utf8_codecvt_facet.cpp
    boost/libs/detail/doc/utf8_codecvt_facet.html
This would make both the serialization and program_options libraries that much smaller and serve our needs until a reviewed version of codecvt facets is added to Boost.
Robert Ramey

Robert Ramey wrote:
We now have at least one set of files more or less duplicated in Boost.
Ron Garcia's utf8_codecvt_facet is included in:
    boost/utf8_codecvt_facet.hpp
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/serialization/src/utf8_codecvt_facet.cpp
    boost/libs/program_options/detail/utf8_codecvt_facet.cpp
    boost/libs/doc/serialization/utf8_codecvt_facet.html
    boost/libs/serialization/test/test_utf8_codecvt_facet.cpp
Leaving aside how this came to be, I would like to do the following:
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/detail // new directory
    boost/libs/detail/src/utf8_codecvt_facet.cpp
    boost/libs/detail/test/test_utf8_codecvt_facet.cpp
    boost/libs/detail/doc/utf8_codecvt_facet.html
This would make both the serialization and program_options libraries that much smaller and serve our needs until a reviewed version of codecvt facets is added to Boost.
Dave and I suggested such a move some time ago, when I noticed the duplication while fixing some problems with the facet. It was agreed to by Volodya and others in principle. I don't remember the details of some of the problems raised.
But I believe that making a new "boost/libs/detail" directory would solve the problems I vaguely remember -- I think something about where to put tests and docs for it was a cited problem.
Sorry to be vague... But I can't seem to find the original thread, stupid search engines :-(
--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

As I remember, there was an objection to that because this component wasn't subjected to any kind of review.
Robert Ramey
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001001c4d4b8$8e944850$6401a8c0@pdimov2...
Rene Rivera wrote:
But I believe that making a new "boost/libs/detail" directory would solve the problems I vaguely remember -- I think something about where to put tests and docs for it was a cited problem.
We usually put these fast-tracked components in utility/.

On Nov 27, 2004, at 9:25 PM, Robert Ramey wrote:
As I remember, there was an objection to that because this component wasn't subjected to any kind of review.
Robert Ramey
"Peter Dimov" <pdimov@mmltd.net> wrote in message news:001001c4d4b8$8e944850$6401a8c0@pdimov2...
Rene Rivera wrote:
But I believe that making a new "boost/libs/detail" directory would solve the problems I vaguely remember -- I think something about where to put tests and docs for it was a cited problem.
We usually put these fast-tracked components in utility/.
As I remember, there was an issue with the build system and problems if one compiled library depended on another compiled library.
Matthias

Matthias Troyer wrote:
As I remember there was an issue with the build system and problems if one compiled library depended on another compiled library.
You mean a problem with trying to make a LIB out of two other LIBs? That would not happen in this case. Serialization and program_options would each compile the source for the facet independently.
--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

On Nov 29, 2004, at 12:02 AM, Rene Rivera wrote:
Matthias Troyer wrote:
As I remember there was an issue with the build system and problems if one compiled library depended on another compiled library.
You mean a problem with trying to make a LIB out of two other LIBs? That would not happen in this case. Serialization and program_options would each compile the source for the facet independently.
If each one compiles the sources into the same namespace, there will be a problem with multiply defined symbols when both libraries are linked together.
Matthias

Matthias Troyer wrote:
On Nov 29, 2004, at 12:02 AM, Rene Rivera wrote:
Matthias Troyer wrote:
As I remember there was an issue with the build system and problems if one compiled library depended on another compiled library.
You mean a problem with trying to make a LIB out of two other LIBs? That would not happen in this case. Serialization and program_options would each compile the source for the facet independently.
If each one compiles the sources into the same namespace there will be a problem with multiply defined symbols when both libraries are linked together.
My original proposal was that program_options would contain utf8_codecvt.hpp with this content:
    namespace boost { namespace program_options {
    #include <boost/detail/unicode/utf8_codecvt.hpp>
    }}
Same for .cpp. And yes, there were some problems with dependencies between two compiled libraries. Forgot the details :-(
- Volodya
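
[Editor's sketch] The namespace-wrapping idea in the message above can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the actual Boost sources: to fit in one translation unit, a macro (SKETCH_FACET_BODY, hypothetical) stands in for the shared file that would really be pulled in with #include, and boost_sketch, utf8_facet_sketch, owner() and check_sketch() are made-up names. It shows that the same definitions compiled into two different namespaces get distinct qualified names, which is why the two copies should not clash as multiply defined symbols.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the single shared implementation file.
// In the proposal this text would be brought in with #include instead.
#define SKETCH_FACET_BODY                                   \
    struct utf8_facet_sketch {                              \
        static std::string owner() { return SKETCH_OWNER; } \
    };

// Copy owned by program_options.
namespace boost_sketch { namespace program_options {
#define SKETCH_OWNER "program_options"
SKETCH_FACET_BODY
#undef SKETCH_OWNER
}}

// Copy owned by serialization: same source text, different namespace.
namespace boost_sketch { namespace serialization {
#define SKETCH_OWNER "serialization"
SKETCH_FACET_BODY
#undef SKETCH_OWNER
}}

// Both copies coexist in one program because their qualified names differ.
inline void check_sketch()
{
    assert(boost_sketch::program_options::utf8_facet_sketch::owner() == "program_options");
    assert(boost_sketch::serialization::utf8_facet_sketch::owner() == "serialization");
}
```

One caveat with this technique: any standard headers the shared file needs (here <string>) have to be included before the namespace is opened, otherwise their declarations would end up inside it.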

At 12:52 PM 11/27/2004, Robert Ramey wrote:
We now have at least one set of files more or less duplicated in Boost.
Ron Garcia's utf8_codecvt_facet is included in:
    boost/utf8_codecvt_facet.hpp
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/serialization/src/utf8_codecvt_facet.cpp
    boost/libs/program_options/detail/utf8_codecvt_facet.cpp
    boost/libs/doc/serialization/utf8_codecvt_facet.html
    boost/libs/serialization/test/test_utf8_codecvt_facet.cpp
Leaving aside how this came to be, I would like to do the following:
    boost/detail/utf8_codecvt_facet.hpp
    boost/libs/detail // new directory
    boost/libs/detail/src/utf8_codecvt_facet.cpp
    boost/libs/detail/test/test_utf8_codecvt_facet.cpp
    boost/libs/detail/doc/utf8_codecvt_facet.html
This would make both the serialization and program_options libraries that much smaller and serve our needs until a reviewed version of codecvt facets is added to Boost.
I may also need UTF-8 conversion as an implementation detail for boost::filesystem::wpath on Linux and/or POSIX. So I'd really appreciate it if you move ahead with your plan above.
--Beman

I may also need UTF-8 conversion as an implementation detail for boost::filesystem::wpath on Linux and/or POSIX. So I'd really appreciate it if you move ahead with your plan above.
I'm also starting to do UTF-interconversion inside regex, but with iterators (http://cvs.sourceforge.net/viewcvs.py/boost/boost/boost/regex/pending/Attic/...), if that's any easier for you.
John.

I used Ron Garcia's utf8 codecvt facet, which I downloaded from the Yahoo files section. I just wrote some tests and made some tweaks as test results for various platforms came in. It includes a manual page in standard Boost format.
In the same section (http://groups.yahoo.com/group/boost/files/unicode/) there is another file (utf8_transform_iterator) which wraps the same functionality in a standard iterator. Given my experience with the codecvt facet (it just worked, except for tweaks required for older libraries/compilers), I would strongly recommend that this other package be examined as well. In fact, I'm sure that it would be easy to transfer the portability tweaks from the current package to the iterator-based one.
Finally, now that I'm more familiar with codecvt facets, iterators, iterator adaptors, stream buffers, etc., I would very much like to see something like the following developed.
composable codecvt facets
=========================
a) a codecvt facet which takes an iterator as a type parameter
b) a few basic iterator adaptors
c) the ability to compose iterator adaptors to be used as a codecvt facet
This would leverage iterator adaptors and their children (i.e. dataflow iterators, ranges, views, fusion? or?) to permit one to compose codecvt facets for compression, encryption, and code conversion.
Just an idea.
Robert Ramey
"John Maddock" <john@johnmaddock.co.uk> wrote in message news:04a701c4d551$6b584230$45340252@fuji...
I may also need UTF-8 conversion as an implementation detail for boost::filesystem::wpath on Linux and/or POSIX. So I'd really appreciate it if you move ahead with your plan above.
I'm also starting to do UTF-interconversion inside regex, but with iterators (http://cvs.sourceforge.net/viewcvs.py/boost/boost/boost/regex/pending/Attic/unicode_iterator.hpp), if that's any easier for you.
John.

I used Ron Garcia's utf8 codecvt facet, which I downloaded from the Yahoo files section. I just wrote some tests and made some tweaks as test results for various platforms came in. It includes a manual page in standard Boost format.
In the same section (http://groups.yahoo.com/group/boost/files/unicode/) there is another file (utf8_transform_iterator) which wraps the same functionality in a standard iterator.
I know, he sent me the HTML page a while back ;-)
The main differences between the two are:
A) Ron has written an iterator-generator (via boost::iterator_adaptor), rather than an iterator (via boost::iterator_facade) as I have. Personally I find the latter easier to use, but that may be personal preference.
B) I think my iterators have more checks for invalid code sequences than Ron's (if you don't check everything that you can then some really bad things can happen, more on this later). Even so, there may well be more checks that can be added.
C) Ron has a single adapter (makes a UTF-8 sequence look like a UTF-32 one); I have a whole family of them. Here's the synopsis:
1) Read Only, Input Adapters:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    template <class BaseIterator, class U8Type = ::boost::uint8_t>
    class u32_to_u8_iterator;
Adapts sequence of UTF-32 code points to "look like" a sequence of UTF-8.
    template <class BaseIterator, class U32Type = ::boost::uint32_t>
    class u8_to_u32_iterator;
Adapts sequence of UTF-8 code points to "look like" a sequence of UTF-32.
    template <class BaseIterator, class U16Type = ::boost::uint16_t>
    class u32_to_u16_iterator;
Adapts sequence of UTF-32 code points to "look like" a sequence of UTF-16.
    template <class BaseIterator, class U32Type = ::boost::uint32_t>
    class u16_to_u32_iterator;
Adapts sequence of UTF-16 code points to "look like" a sequence of UTF-32.
2) Single pass output iterator adapters:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    template <class BaseIterator>
    class utf8_output_iterator;
Accepts UTF-32 code points and forwards them on as UTF-8 code points.
    template <class BaseIterator>
    class utf16_output_iterator;
Accepts UTF-32 code points and forwards them on as UTF-16 code points.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D) You will note that there are still two adapters missing - to convert between UTF-8 and UTF-16 - but I haven't needed these, and they could I suppose be composed from the other 4.
E) Ron's code accepts up to 6 octets in a UTF-8 sequence; mine only accepts 4. I think this is the difference between Unicode and ISO-10646-1, and may be a feature or a bug depending upon your point of view. Personally I needed to ensure that only valid UTF-32 sequences were generated.
F) On error checking, I have come to the conclusion that both Ron and I have fallen into the same trap: if the conversion iterator is constructed from a single base-iterator, and that base-iterator does not point to the start of a valid UTF-8 sequence, then it becomes possible for the adapter to increment past the end of a sequence, or decrement past the start of one. (Note that my code does trap invalid UTF-8 sequences provided they are not at the end of a range.) I believe this problem can be solved by constructing a pair of adapters from a pair of base iterators: the end points can then be checked to ensure that nothing bad can happen. We wouldn't want a corrupt UTF-8 sequence to crash your program after all :-) The alternative is to leave it to the user to call a "check_range" or similar function, but this still looks error prone to me. Oh, and the same problem arises when iterating UTF-16 sequences as well (the sequence must not start with a low surrogate, or end with a high surrogate).
Regards, John.
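
[Editor's sketch] Points E and F above can be illustrated with a small range-checked decoder. This is a minimal sketch under stated assumptions, not John's or Ron's code and not a Boost API: decode_one_utf8, decode_utf8 and decode_utf8_example are hypothetical names, and throwing standard exceptions on bad input is just one possible policy. The sketch always works on a pair of iterators so it cannot run past the end of the range, rejects a range that starts with a continuation octet, and accepts at most 4 octets per sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes one code point from [it, end) and advances it.
// The end iterator is checked before every read, so a truncated sequence is
// reported instead of being read past.
template <class Iterator>
std::uint32_t decode_one_utf8(Iterator& it, Iterator end)
{
    if (it == end) throw std::out_of_range("empty range");
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80u) return lead;  // single octet

    int trailing = 0;               // number of continuation octets expected
    std::uint32_t value = 0;        // payload bits collected so far
    std::uint32_t minimum = 0;      // smallest value this length may encode
    if ((lead & 0xE0u) == 0xC0u)      { trailing = 1; value = lead & 0x1Fu; minimum = 0x80u; }
    else if ((lead & 0xF0u) == 0xE0u) { trailing = 2; value = lead & 0x0Fu; minimum = 0x800u; }
    else if ((lead & 0xF8u) == 0xF0u) { trailing = 3; value = lead & 0x07u; minimum = 0x10000u; }
    else throw std::invalid_argument("stray continuation octet or 5/6-octet lead");

    for (int i = 0; i < trailing; ++i) {
        if (it == end) throw std::out_of_range("truncated sequence at end of range");
        const std::uint32_t octet = static_cast<unsigned char>(*it);
        if ((octet & 0xC0u) != 0x80u) throw std::invalid_argument("bad continuation octet");
        ++it;
        value = (value << 6) | (octet & 0x3Fu);
    }
    if (value < minimum) throw std::invalid_argument("overlong encoding");
    if (value > 0x10FFFFu || (value >= 0xD800u && value <= 0xDFFFu))
        throw std::invalid_argument("not a valid UTF-32 code point");
    return value;
}

// Decodes a whole range; both end points are known, so nothing is read
// outside [first, last).
template <class Iterator>
std::vector<std::uint32_t> decode_utf8(Iterator first, Iterator last)
{
    std::vector<std::uint32_t> out;
    while (first != last) out.push_back(decode_one_utf8(first, last));
    return out;
}

// Example use: a three-octet sequence decodes to U+20AC.
inline void decode_utf8_example()
{
    const std::string euro = "\xE2\x82\xAC";
    const std::vector<std::uint32_t> cps = decode_utf8(euro.begin(), euro.end());
    assert(cps.size() == 1 && cps[0] == 0x20ACu);
}
```

A real iterator adapter would presumably carry the same pair of end points and perform these checks on increment and decrement; the free functions here only show the checks themselves.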
participants (7)
- Beman Dawes
- John Maddock
- Matthias Troyer
- Peter Dimov
- Rene Rivera
- Robert Ramey
- Vladimir Prus