
Hello,

I was looking at the string algo library, hoping that I can use it with my own string class, and I have some questions:

1. Is the library designed to work with variable-length character encodings (like UTF-8 and UTF-16)? The answer seems to be no, but I wanted to find out for sure. Are there any plans to make these algorithms compatible with this type of sequence?

2. Some algorithms seem to be strangely named and seem to overlap existing functionality in Boost. For example, split() and find_token(): why not use boost::tokenizer? I think the library needs to be reorganized.

3. Internationalization support. I am not sure these algorithms will work properly in all languages. Take to_upper/to_lower, for example. I recall that in some languages, going from uppercase to lowercase (or vice versa) turns one character into two (or two into one). These two algorithms assume that the correspondence is one to one.

Thanks,
Florin.

Florin Trofin wrote:
> Hello,
> I was looking at the string algo library, hoping that I can use it with my own string class, and I have some questions:
> 1. Is the library designed to work with variable-length character encodings (like UTF-8 and UTF-16)? The answer seems to be no, but I wanted to find out for sure. Are there any plans to make these algorithms compatible with this type of sequence?
Generally no. The preamble of the library specifies that "a string" is an arbitrary sequence of characters. If you store a variable-length encoded string in a char* array, it will definitely not work, since your container will store the individual bytes of the encoding, not characters. If you design a UTF-8 encoded string class whose iterators iterate over real characters, then there is a good chance that the string library will be functional. There is an issue with C++ locales, which are not designed to support this kind of encoding; therefore some algorithms will not work unless you extend the locales as well.
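To make the byte-versus-character distinction concrete, here is a minimal sketch using only the standard library. The helper names count_code_points and decode_utf8 are hypothetical (they are not part of string_algo), and both assume well-formed UTF-8 input with no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}

// Hypothetical helper: decodes well-formed UTF-8 into one char32_t per code
// point, so that generic sequence algorithms see real characters.
std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // Sequence length is determined by the high bits of the lead byte.
        const std::size_t len = lead < 0x80          ? 1
                                : (lead >> 5) == 0x6 ? 2
                                : (lead >> 4) == 0xE ? 3
                                                     : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        // Append six payload bits from each continuation byte.
        for (std::size_t j = 1; j < len && i + j < utf8.size(); ++j) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the five-byte string "caf\xC3\xA9" ("café" in UTF-8), size() reports the number of bytes, while the decoded sequence holds four elements. An algorithm applied to the raw std::string works on the former and may split the two-byte character.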
> 2. Some algorithms seem to be strangely named and seem to overlap existing functionality in Boost. For example, split() and find_token(): why not use boost::tokenizer? I think the library needs to be reorganized.
Boost.Tokenizer is a distinct library, not related to the string_algo library. split and the underlying find_iterator functionality provided in string_algo use a different design approach that is built on the facilities in the library. Boost is a collection of libraries, not just one library, and there is nothing in the spirit of Boost that prevents this kind of "concurrency". It is up to you to use the one that suits you better.
> 3. Internationalization support. I am not sure these algorithms will work properly in all languages. Take to_upper/to_lower, for example. I recall that in some languages, going from uppercase to lowercase (or vice versa) turns one character into two (or two into one). These two algorithms assume that the correspondence is one to one.
The string_algo library depends solely on the facilities provided by the standard C++ library, namely the locale facilities. As far as I know, this kind of conversion is not supported there.

Best regards,
Pavol
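The one-to-one limitation can be seen in the shape of the standard interface itself. The sketch below is an illustration, not string_algo's actual implementation; upper_per_char is a hypothetical name. It upper-cases a string by calling std::toupper(ch, loc) from <locale> on each character. Because that function takes one character and returns one character, the result always has the same length as the input, so a mapping such as German "ß" to "SS" cannot be expressed this way.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: upper-cases a wide string one character at a time
// through the given locale. std::toupper(ch, loc) maps a single character to
// a single character, so the output length always equals the input length.
std::wstring upper_per_char(std::wstring text,
                            const std::locale& loc = std::locale::classic()) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [&loc](wchar_t ch) { return std::toupper(ch, loc); });
    return text;
}
```

Calling upper_per_char(L"stra\u00DFe") yields a string of six characters, just like its input, whatever the locale does with the "ß" itself. A one-to-many case conversion needs an interface that maps a character to a string, which the standard locale facets do not offer.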

Pavol Droba wrote:
> Florin Trofin wrote:
>> Hello,
>> I was looking at the string algo library, hoping that I can use it with my own string class, and I have some questions:
>> 1. Is the library designed to work with variable-length character encodings (like UTF-8 and UTF-16)? The answer seems to be no, but I wanted to find out for sure. Are there any plans to make these algorithms compatible with this type of sequence?
> Generally no. The preamble of the library specifies that "a string" is an arbitrary sequence of characters. If you store a variable-length encoded string in a char* array, it will definitely not work, since your container will store the individual bytes of the encoding, not characters. If you design a UTF-8 encoded string class whose iterators iterate over real characters, then there is a good chance that the string library will be functional.
Though I don't know its development status, <boost/regex/pending/unicode_iterator.hpp> seems to work nicely with Boost.StringAlgorithm. I tend to prefer iterators to locales.

Regards,
--
Shunsuke Sogame
participants (3): Florin Trofin, Pavol Droba, shunsuke