
On 20/01/2011 05:38, Patrick Horgan wrote:
On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
My Unicode library works with arbitrary ranges, and you can adapt a range in one encoding into a range in another encoding. This can be used to lazily perform encoding conversion as the range is iterated; such conversions may even be pipelined.

Sounds interesting. Of course ranges could be used with strings of whatever sort. Is the intelligence about the encoding in the ranges?
I've chosen not to attach encoding information to ranges, as this could make my Unicode library quite intrusive. It's design by contract: your input ranges must satisfy certain criteria, such as being in a particular encoding, depending on the function you call. If the criteria are not satisfied, you get either undefined behaviour or an exception, depending on which version of the function you choose to call.
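A minimal sketch of what that contract means in practice, assuming nothing about the library's real function names (count_code_points and checked_count_code_points are purely illustrative):

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Unchecked variant: the caller guarantees the input is well-formed UTF-8;
    // violating that contract is undefined behaviour.
    std::size_t count_code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        for (unsigned char c : utf8)
            if ((c & 0xC0) != 0x80)   // count only lead bytes
                ++n;
        return n;
    }

    // Checked variant: validates the lead/continuation structure and throws on
    // violation (a real validator would also reject overlong forms and surrogates).
    std::size_t checked_count_code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        int pending = 0;                   // continuation bytes still expected
        for (unsigned char c : utf8)
        {
            if ((c & 0xC0) == 0x80)        // continuation byte
            {
                if (pending == 0) throw std::runtime_error("stray continuation byte");
                --pending;
            }
            else
            {
                if (pending != 0) throw std::runtime_error("truncated sequence");
                if      (c < 0x80)           pending = 0;
                else if ((c & 0xE0) == 0xC0) pending = 1;
                else if ((c & 0xF0) == 0xE0) pending = 2;
                else if ((c & 0xF8) == 0xF0) pending = 3;
                else throw std::runtime_error("invalid lead byte");
                ++n;
            }
        }
        if (pending != 0) throw std::runtime_error("truncated sequence at end of input");
        return n;
    }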
As you iterate a range, does it move byte by byte or character by character?
You can adapt a range of code units into a range of code points, or into a range of ranges of code points (combining character sequences, graphemes, words, sentences, etc.).
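For illustration, here is a small self-contained sketch (not the library's API) of walking a UTF-8 code unit range code point by code point; it assumes well-formed UTF-8, and the grapheme-level adaptation works the same way, just with an additional boundary test between code points:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Decode the next code point from a UTF-8 byte iterator and advance it.
    char32_t next_code_point(std::string::const_iterator& it)
    {
        unsigned char lead = *it++;
        if (lead < 0x80) return lead;                   // 1-byte sequence
        int extra = (lead & 0xE0) == 0xC0 ? 1
                  : (lead & 0xF0) == 0xE0 ? 2 : 3;      // 2-, 3- or 4-byte sequence
        char32_t cp = lead & (0x3F >> extra);           // bits from the lead byte
        while (extra--)
            cp = (cp << 6) | (*it++ & 0x3F);            // fold in continuation bytes
        return cp;
    }

    int main()
    {
        std::string s = "a\xC3\xA9\xE2\x82\xAC";        // "a", "é", "€" in UTF-8
        for (auto it = s.cbegin(); it != s.cend(); )    // iterate code point by code point
            std::cout << std::hex << static_cast<std::uint32_t>(next_code_point(it)) << '\n';
    }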
Does it deal with compositions?
It can. My library doesn't really have string algorithms; it's up to you to call those algorithms with the correct adapters. For example, to search for a substring in a string, both of which are in UTF-8, while taking combining characters into account, there are different strategies:

- Decode both to UTF-32, normalize them, segment them into combining character sequences, and perform a substring search on that.
- Decode both to UTF-32, normalize them, re-encode them both in UTF-8, perform a substring search at the byte level, and ignore matches that do not lie on the utf8_combining_boundary (which checks whether we're at a UTF-8 code point boundary, decodes to UTF-32, and checks whether we're at a combining character boundary).

You might want to skip the normalization step if you know your data is already normalized. The second strategy is likely to be quite a bit faster than the first, because you spend most of the time working on chars in actual memory, which can be optimized quite aggressively. Both approaches can be written directly in a couple of lines by combining Boost.StringAlgo and my Unicode library in various ways; and all conversions can happen lazily or not, as one wishes. Boost.StringAlgo isn't that good, however (it only provides naive O(n*m) algorithms, doesn't support right-to-left search well, and certainly can't vectorize the cases where the range is made of built-in types contiguous in memory), so it might eventually have to be replaced.
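A rough, self-contained sketch of the second strategy, with several simplifications: it uses std::search rather than Boost.StringAlgo, both inputs are assumed to be already normalized UTF-8, the names find_all and is_code_point_boundary are hypothetical, and the boundary test only checks UTF-8 code point boundaries rather than the full utf8_combining_boundary described above:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Simplified boundary test: true at the ends of the string and at any
    // position that is not a UTF-8 continuation byte. The real check would
    // additionally decode and reject positions inside a combining sequence.
    bool is_code_point_boundary(const std::string& s, std::size_t pos)
    {
        return pos == 0 || pos == s.size()
            || (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
    }

    // Byte-level search, keeping only matches that start and end on a boundary.
    std::vector<std::size_t> find_all(const std::string& haystack, const std::string& needle)
    {
        std::vector<std::size_t> matches;
        auto it = haystack.begin();
        while (true)
        {
            it = std::search(it, haystack.end(), needle.begin(), needle.end());
            if (it == haystack.end())
                break;
            std::size_t pos = static_cast<std::size_t>(it - haystack.begin());
            if (is_code_point_boundary(haystack, pos)
                && is_code_point_boundary(haystack, pos + needle.size()))
                matches.push_back(pos);
            ++it;                            // continue searching after this match start
        }
        return matches;
    }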
Is it available to read?
Somewhat dated docs are at <http://mathias.gaunard.com/unicode/doc/html/>. A presentation is planned for Boostcon 2011, and a submission for review before that.