Re: [Boost-users] regex with multi-byte characters

21 Jul 2005

      On Thu, July 21, 2005 14:54, John Maddock said:
...
...
What do you think? Could Boost.Regex make use of such a traits class, or
would you rather not include it in the distribution?
I don't know, it depends what it does: how do you plan to handle character
classification in a portable manner for unsigned short?
I plan to do it the same way Xerces-C does it. As I understand it, they put a 2-byte code into the
short and perform various operations on it. I still have to investigate exactly how it is done.
...
...
There are too many developers involved in the process for us to force
everyone to recompile Xerces-C with specific settings. I don't think this
would be an option for us. In our case it could also lead to unpredictable
results if someone replaces Xerces-C with a freshly compiled Xerces-C
without ICU support. I am a little bit sceptical about this.
OK let me try one more time: if you compile regex *only* with ICU support,
and use the iterator based u32regex_match/u32regex_search algorithms (or
their equivalent regex iterators) then it doesn't matter what character type
Xerces or anything else uses as long as:
It's an 8-bit type: then it'll be treated as an [unsigned] UTF-8 encoded
string.
Or: It's a 16-bit type, then it'll be treated as an [unsigned] UTF-16
encoded string.
Or: It's a 32-bit type, then it'll be treated as an [unsigned] UTF-32
encoded string.
Is that generic enough for you? :-)
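
To illustrate the rule above (the width of the character type selects the
encoding), here is a minimal sketch using only the standard library. It is
not how Boost.Regex or ICU implement it; decode_by_width and widths_agree are
hypothetical names made up for this example, and the sketch assumes 8-bit
bytes and a 16-bit unsigned short. It does not reject overlong UTF-8 forms or
unpaired low surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a Boost API): decode a range of code units into
// code points, choosing UTF-8, UTF-16 or UTF-32 from the size of the
// iterator's value type, whatever its signedness.
template <class Iter>
std::vector<std::uint32_t> decode_by_width(Iter first, Iter last)
{
    typedef typename std::iterator_traits<Iter>::value_type unit_type;
    const std::size_t width = sizeof(unit_type);
    std::vector<std::uint32_t> out;
    while (first != last) {
        std::uint32_t u = static_cast<std::uint32_t>(*first++);
        if (width == 1) {                       // 8-bit units: UTF-8
            u &= 0xFFu;                         // drop sign extension
            int extra = u < 0x80 ? 0
                      : (u >> 5) == 0x06 ? 1
                      : (u >> 4) == 0x0E ? 2
                      : (u >> 3) == 0x1E ? 3 : -1;
            if (extra < 0) throw std::runtime_error("invalid UTF-8 lead byte");
            std::uint32_t cp = extra == 0 ? u : (u & (0x3Fu >> extra));
            for (; extra > 0; --extra) {
                if (first == last) throw std::runtime_error("truncated UTF-8");
                const std::uint32_t t = static_cast<std::uint32_t>(*first++) & 0xFFu;
                if ((t & 0xC0u) != 0x80u) throw std::runtime_error("invalid UTF-8 trail byte");
                cp = (cp << 6) | (t & 0x3Fu);
            }
            out.push_back(cp);
        } else if (width == 2) {                // 16-bit units: UTF-16
            u &= 0xFFFFu;
            if (u >= 0xD800u && u <= 0xDBFFu) { // high surrogate: needs a low one
                if (first == last) throw std::runtime_error("truncated UTF-16");
                const std::uint32_t lo = static_cast<std::uint32_t>(*first++) & 0xFFFFu;
                if (lo < 0xDC00u || lo > 0xDFFFu) throw std::runtime_error("unpaired surrogate");
                u = 0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u);
            }
            out.push_back(u);
        } else {                                // 32-bit units: UTF-32
            out.push_back(u);
        }
    }
    return out;
}

// Demo: the same three characters (U+00E9, U+20AC, U+1D11E) stored in each
// width decode to the same code points.
inline bool widths_agree()
{
    assert(sizeof(unsigned short) == 2);        // assumption of this sketch
    const unsigned char  utf8[]  = {0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9D, 0x84, 0x9E};
    const unsigned short utf16[] = {0x00E9, 0x20AC, 0xD834, 0xDD1E};
    const std::uint32_t  utf32[] = {0x00E9, 0x20AC, 0x1D11E};
    const std::vector<std::uint32_t> a = decode_by_width(utf8, utf8 + 9);
    const std::vector<std::uint32_t> b = decode_by_width(utf16, utf16 + 4);
    const std::vector<std::uint32_t> c = decode_by_width(utf32, utf32 + 3);
    return a == b && b == c;
}
```

With this kind of dispatch an unsigned short buffer, such as the one Xerces-C
is described as using earlier in the thread, needs no conversion before
matching, provided it really holds UTF-16.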
Yes, I will do some tests. If they are OK, I will compile regex with ICU support. Otherwise I
will write my own traits class for unsigned short characters.

Thanks a lot for your help.
...
John.
With Kind Regards,

Ovanes

Ovanes Markarian