
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
That will just require duplicating the tables and algorithms required to process the text correctly.

What algorithms do you mean and why would they need duplication?

Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
Why would these algorithms need duplication? If we have all locale-specific traits and tools, such as collation tables and character classification functions like isspace, isalnum, etc. (along with any new ones that might be needed for Unicode), encapsulated into locale classes, then the essence of the algorithms should be independent of the text encoding.
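A minimal sketch of that idea, assuming the only locale-specific piece the algorithm needs is character classification: the splitting logic below is written once against std::locale, and the same template works for char, wchar_t, or any character type the locale's facets support.

    #include <locale>
    #include <string>
    #include <vector>

    // Encoding-agnostic word splitting: the algorithm is generic, and
    // everything locale-specific is delegated to the std::locale facet
    // (reached here through std::isspace<CharT>).
    template <typename CharT>
    std::vector<std::basic_string<CharT>>
    split_words(const std::basic_string<CharT>& text, const std::locale& loc)
    {
        std::vector<std::basic_string<CharT>> words;
        std::basic_string<CharT> cur;
        for (CharT c : text)
        {
            if (std::isspace(c, loc))  // classification comes from the locale
            {
                if (!cur.empty()) { words.push_back(cur); cur.clear(); }
            }
            else
                cur += c;
        }
        if (!cur.empty()) words.push_back(cur);
        return words;
    }

(Real Unicode word segmentation is considerably more involved, of course; the sketch only shows where the encoding- and locale-dependent parts would live.)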
Using standard data tables and a single algorithm that merely consults the locale-specific tables, you can provide these operations for UTF-16 (and the other Unicode encodings) for essentially all locales. This is what libraries like IBM ICU do. Providing them for other encodings as well, however, would require separate data tables and separate implementations.
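For instance, something along these lines with ICU's C++ API (a sketch; the point is that the collator implementation is the same for every locale, and the icu::Locale argument merely selects which data tables it loads):

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;
        // One collation algorithm; "sv" only selects the Swedish tables.
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale("sv"), status));
        if (U_FAILURE(status))
            return 1;

        icu::UnicodeString zebra = icu::UnicodeString::fromUTF8("zebra");
        icu::UnicodeString oel = icu::UnicodeString::fromUTF8("\xC3\xB6l"); // "öl"

        // In Swedish, 'ö' collates after 'z', so "zebra" < "öl" here,
        // while a German collator would order the two the other way around.
        std::cout << (coll->compare(zebra, oel, status) == UCOL_LESS) << '\n';
        return 0;
    }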
I still can't see why one would need to reimplement algorithms. Their logic is the same regardless of the encoding.
What I was saying is that if we have a UTF-8 encoded string containing both Latin characters and national characters that encode to several octets, it becomes a non-trivial task to extract the i-th character (not octet) from the string. The same problem arises with iteration: the iterator has to analyze the character it points to in order to advance its internal pointer to the beginning of the next character. The same thing will happen with true UTF-16 and UTF-32 support. As an example of the need for such functionality, it is widely used in various text parsers.
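To illustrate the cost, here is a sketch of such an extraction for UTF-8 (the helper names are made up for the example): finding the i-th character means walking the string from the start and decoding every lead octet on the way, i.e. O(n) in octets instead of the O(1) subscript a fixed-width encoding allows.

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Number of octets in a UTF-8 sequence, derived from its lead octet.
    std::size_t u8_seq_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
        throw std::invalid_argument("invalid UTF-8 lead octet");
    }

    // Octet offset of the i-th character: a linear scan, not a subscript.
    std::size_t u8_char_offset(const std::string& s, std::size_t i)
    {
        std::size_t off = 0;
        while (i-- > 0)
        {
            if (off >= s.size())
                throw std::out_of_range("character index past end of string");
            off += u8_seq_len(static_cast<unsigned char>(s[off]));
        }
        return off;
    }

An iterator over such a string has to perform the same lead-octet analysis on every increment.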
I'm still not sure I quite see it. I would think that the most common case in parsing text is to read it in order from the beginning. I suppose in some cases you might be parsing something where you know a field is aligned to, e.g., the 20th character, but such formats tend to assume very simple encodings anyway, because they don't make much sense if you have to support complicated accents and the like.

There may be different parsing techniques, depending on the text format. Sometimes character iteration alone is sufficient, as in forward sequential parsing. Nothing restricts parsing to being sequential, though: the format may provide a table of contents with offsets, or each field to be parsed may be prepended with its length.
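As a sketch of the non-sequential case, assume a hypothetical record format in which every field is prefixed with its length. If the length counts octets, the parser below skips from field to field without decoding anything; had the length counted characters, a UTF-8 parser would have to decode each field merely to find where the next one begins.

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Hypothetical format: each field is "<length>:<payload>", length in octets.
    // Error handling is omitted for brevity.
    std::string next_field(const std::string& buf, std::size_t& pos)
    {
        std::size_t colon = buf.find(':', pos);
        std::size_t len = std::stoul(buf.substr(pos, colon - pos));
        std::string field = buf.substr(colon + 1, len);
        pos = colon + 1 + len;  // jump straight past the payload
        return field;
    }

    int main()
    {
        std::string buf = "5:hello2:\xC3\xA9"; // second field: one character, two octets
        std::size_t pos = 0;
        std::cout << next_field(buf, pos) << '\n'; // "hello"
        std::cout << next_field(buf, pos) << '\n'; // the two-octet character
        return 0;
    }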
If all standard algorithms and classes assume that the text being parsed is in Unicode, a parser loses the opportunity to exploit such a simple fixed-width encoding and work more efficiently: the std::string, regex or stream classes will always have to treat the text as Unicode.
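For contrast, a sketch of the optimization being given up: when the parser knows the text is in a single-byte encoding, the i-th character simply is the i-th octet, so a field at a fixed character position is an O(1) slice, with no decoding loop like the UTF-8 one above.

    #include <string>

    // Hypothetical fixed-layout record: the field of interest occupies
    // characters [20, 30). Valid precisely because 1 character == 1 octet.
    std::string field_at(const std::string& line)
    {
        return line.substr(20, 10);
    }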