
On 20/01/2011 05:38, Patrick Horgan wrote:
On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
My Unicode library works with arbitrary ranges, and you can adapt a range in one encoding into a range in another encoding. This can be used to lazily perform encoding conversion as the range is iterated; such conversions may even be pipelined.

Sounds interesting. Of course ranges could be used with strings of whatever sort. Is the intelligence about the encoding in the ranges?
I've chosen not to attach encoding information to ranges, as this could make my Unicode library quite intrusive. It's design by contract: your input ranges must satisfy certain criteria, such as being in a particular encoding, depending on the function you call. If the criteria are not satisfied, you get either undefined behaviour or an exception, depending on which version of the function you choose to call.
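A minimal sketch of what that contract means in practice, assuming nothing about the library's real function names (count_code_points and checked_count_code_points are purely illustrative):

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Unchecked variant: the caller guarantees the input is well-formed UTF-8;
    // violating that contract is undefined behaviour.
    std::size_t count_code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        for (unsigned char c : utf8)
            if ((c & 0xC0) != 0x80)   // count only lead bytes
                ++n;
        return n;
    }

    // Checked variant: validates the lead/continuation structure and throws on
    // violation (a real validator would also reject overlong forms and surrogates).
    std::size_t checked_count_code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        int pending = 0;                   // continuation bytes still expected
        for (unsigned char c : utf8)
        {
            if ((c & 0xC0) == 0x80)        // continuation byte
            {
                if (pending == 0) throw std::runtime_error("stray continuation byte");
                --pending;
            }
            else
            {
                if (pending != 0) throw std::runtime_error("truncated sequence");
                if      (c < 0x80)           pending = 0;
                else if ((c & 0xE0) == 0xC0) pending = 1;
                else if ((c & 0xF0) == 0xE0) pending = 2;
                else if ((c & 0xF8) == 0xF0) pending = 3;
                else throw std::runtime_error("invalid lead byte");
                ++n;
            }
        }
        if (pending != 0) throw std::runtime_error("truncated sequence at end of input");
        return n;
    }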
As you iterate a range, does it move byte by byte or character by character?
You can adapt a range of code units into a range of code points, or into a range of ranges of code points (combining character sequences, graphemes, words, sentences, etc.).
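For illustration, here is a small self-contained sketch (not the library's API) of walking a UTF-8 code unit range code point by code point; it assumes well-formed UTF-8, and the grapheme-level adaptation works the same way, just with an additional boundary test between code points:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Decode the next code point from a UTF-8 byte iterator and advance it.
    char32_t next_code_point(std::string::const_iterator& it)
    {
        unsigned char lead = *it++;
        if (lead < 0x80) return lead;                   // 1-byte sequence
        int extra = (lead & 0xE0) == 0xC0 ? 1
                  : (lead & 0xF0) == 0xE0 ? 2 : 3;      // 2-, 3- or 4-byte sequence
        char32_t cp = lead & (0x3F >> extra);           // bits from the lead byte
        while (extra--)
            cp = (cp << 6) | (*it++ & 0x3F);            // fold in continuation bytes
        return cp;
    }

    int main()
    {
        std::string s = "a\xC3\xA9\xE2\x82\xAC";        // "a", "é", "€" in UTF-8
        for (auto it = s.cbegin(); it != s.cend(); )    // iterate code point by code point
            std::cout << std::hex << static_cast<std::uint32_t>(next_code_point(it)) << '\n';
    }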
Does it deal with compositions?
It can. My library doesn't really have string algorithms; it's up to you to call those algorithms with the correct adapters. For example, to search for a substring in a string, both of which are in UTF-8, while taking combining characters into account, there are different strategies:

- Decode both to UTF-32, normalize them, segment them into combining character sequences, and perform a substring search on that.
- Decode both to UTF-32, normalize them, re-encode them both in UTF-8, perform a substring search at the byte level, and ignore matches that do not lie on the utf8_combining_boundary (which checks whether we're at a UTF-8 code point boundary, decodes to UTF-32, and checks whether we're at a combining character boundary).

You might want to skip the normalization step if you know your data is already normalized. The second strategy is likely to be quite a bit faster than the first, because you spend most of the time working on chars in actual memory, which can be optimized quite aggressively. Both approaches can be written directly in a couple of lines by combining Boost.StringAlgo and my Unicode library in various ways; and all conversions can happen lazily or not, as one wishes. Boost.StringAlgo isn't that good, however (it only provides naive O(n*m) algorithms, doesn't support right-to-left search well, and certainly can't vectorize the cases where the range is made of built-in types contiguous in memory), so it might eventually have to be replaced.
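A rough, self-contained sketch of the second strategy, with several simplifications: it uses std::search rather than Boost.StringAlgo, both inputs are assumed to be already normalized UTF-8, the names find_all and is_code_point_boundary are hypothetical, and the boundary test only checks UTF-8 code point boundaries rather than the full utf8_combining_boundary described above:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Simplified boundary test: true at the ends of the string and at any
    // position that is not a UTF-8 continuation byte. The real check would
    // additionally decode and reject positions inside a combining sequence.
    bool is_code_point_boundary(const std::string& s, std::size_t pos)
    {
        return pos == 0 || pos == s.size()
            || (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
    }

    // Byte-level search, keeping only matches that start and end on a boundary.
    std::vector<std::size_t> find_all(const std::string& haystack, const std::string& needle)
    {
        std::vector<std::size_t> matches;
        auto it = haystack.begin();
        while (true)
        {
            it = std::search(it, haystack.end(), needle.begin(), needle.end());
            if (it == haystack.end())
                break;
            std::size_t pos = static_cast<std::size_t>(it - haystack.begin());
            if (is_code_point_boundary(haystack, pos)
                && is_code_point_boundary(haystack, pos + needle.size()))
                matches.push_back(pos);
            ++it;                            // continue searching after this match start
        }
        return matches;
    }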
Is it available to read?
Somewhat dated docs are at <http://mathias.gaunard.com/unicode/doc/html/>. A presentation is planned for Boostcon 2011, and a submission for review before that.