
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Parsing techniques may differ depending on the text format. Sometimes character iteration alone is sufficient, as in forward sequential parsing. Nothing prevents non-sequential parsing, though (for example, when there is a table of contents with offsets, or when each field to be parsed is prefixed with its length).
Such a format would then likely not really be text, since it would contain embedded offsets (which would probably not be text themselves).
Why not? See GCC symbol mangling, for example.
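To illustrate the length-prefixed case: the sketch below splits a run of fields in which each field is preceded by its decimal length, similar in spirit to how mangled names encode identifiers. It is only a rough illustration, not GCC's actual demangler; the function name split_length_prefixed is hypothetical, it assumes C++17 for std::string_view, and it does not guard against overflow of absurdly long length prefixes.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits length-prefixed fields such as "3foo3bar"
// into {"foo", "bar"}. Parsing stops at the first character that is not
// a decimal digit, or when a declared length overruns the input.
std::vector<std::string> split_length_prefixed(std::string_view text)
{
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[pos])))
    {
        // Read the decimal length prefix.
        std::size_t len = 0;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos])))
        {
            len = len * 10 + static_cast<std::size_t>(text[pos] - '0');
            ++pos;
        }
        if (len > text.size() - pos)
            break; // malformed: declared length exceeds the remaining input

        // Jump straight over the field body instead of scanning it.
        fields.emplace_back(text.substr(pos, len));
        pos += len;
    }
    return fields;
}
```

Note that the field bodies are never inspected character by character; the parser skips over them using the length, which is the non-sequential access mentioned above, and the input is still plain text throughout.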
If all standard algorithms and classes assume that the text being parsed is Unicode, they give up the chance to process it more efficiently. std::string, regex, and the stream classes would always have to treat the text as Unicode.
Well, since std::string and boost::regex already exist and do not assume Unicode (or even necessarily support it very well; I've seen some references to boost::regex providing Unicode support, but I haven't looked into it), that is not likely to occur.
Actually, std::string (or basic_string) does not support Unicode, since it operates on a per-value_type basis. In other words, it won't recognize code sequences. The same goes for streams. As for Boost.Regex, it has such support, but it is optional (i.e. it still allows processing of fixed-width 1-octet strings). And I believe that is the way to go for the other components we're discussing.
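A small sketch of what "per-value_type" means in practice, assuming the string holds valid UTF-8: size() reports the number of char elements, not the number of code points. The helper count_utf8_code_points is hypothetical and performs no validation; it simply skips UTF-8 continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed
// to hold valid UTF-8. std::string itself knows nothing about this; it
// only sees individual char elements.
std::size_t count_utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
    {
        // Continuation bytes have the form 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the UTF-8 string "caf\xC3\xA9", size() yields 5 while the helper yields 4, because the last character occupies two char elements that basic_string treats as unrelated values.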