
Let us start with my decision: I vote against inclusion of Boost.Locale into Boost in its current state, but strongly encourage the author to resubmit it after some design issues have been reconsidered. In that case, I would be in favour of putting it on the fast track for a mini-review.

Since I am the author of another Unicode library that I'd like to see included in Boost, I hope that my review will not be considered biased. I believe Boost.Locale is a valuable and useful library, and I believe both libraries can coexist.

The main reason for my decision is the interface that the library provides: although it aims to abstract the backend used to implement the facilities, it exposes too much of how those facilities are implemented, which prevents benefiting from future backends that may not have the same limitations as the current ones. Indeed, the Boost.Locale rationale clearly says that it only takes std::basic_string and const char* as input because that's what the backend libraries expect anyway. Yet there are various scenarios where more generic input would be useful:

- I might want to apply some operation to data in a file without loading it all into memory first; Boost.Locale could copy that memory chunk by chunk to apply the operation for backends that have no native support for this. Copying the whole file to memory is a good enough approximation for a first version.
- The input I have may not be directly available in the form of a contiguous buffer, but I wouldn't mind if Boost.Locale copied it.

If we consider the functions from_utf, to_utf and between, they can take either a const char*, two const char*, or a std::basic_string. Clearly, it would be better if those functions took a range, which covers all three of these cases. It's not just a matter of being generic; it makes the library easier to use, since the range is defined by a single argument which can be anything I want, and I do not need to convert it to what the library needs. If the library needs to do a conversion behind the scenes, it can do it just like that, behind the scenes.

Taking a range, however, raises the question of what the return type should be: I think there should be two variants, one which returns a string, and one which appends to a container (like Boost.StringAlgo) or writes to an output iterator (like STL algorithms). Obviously, the former would be defined in terms of the latter.

This, however, causes a big problem: the backends would no longer be interchangeable with a single stable ABI. Maybe some template specializations could be injected at link time, but they would have to be fully concrete specializations. Maybe the idea of backends being interchangeable at the ABI level needs to be rethought. After all, isn't being interchangeable at the API level enough, since you can easily mangle the configuration into a symbol to cause an error when linking with a non-matching backend?

Still talking about conversion between encodings, I would like it if it were possible to statically choose the encodings we're converting between, as some backends could directly expose a specialized and optimized some_encoding_to_some_other_encoding function.

Note that the remarks about conversion between encodings also apply to case folding, collation, etc. Being able to statically specify the normalization form you're normalizing to, and things like that, could be useful just as it is for the character encodings.
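To make the two-variant suggestion above concrete, here is a minimal sketch; the names and the identity "conversion" body are invented for illustration and are not the actual Boost.Locale API:

    #include <string>
    #include <iterator>
    #include <boost/range/begin.hpp>
    #include <boost/range/end.hpp>
    #include <boost/range/iterator.hpp>

    // Variant 1: writes the result through an output iterator; all the real
    // work happens here (an identity copy stands in for backend conversion).
    template<typename Range, typename OutputIterator>
    OutputIterator to_utf8(Range const& input, OutputIterator out)
    {
        typename boost::range_iterator<Range const>::type
            it = boost::begin(input), end = boost::end(input);
        for(; it != end; ++it)
            *out++ = *it; // a backend would convert here
        return out;
    }

    // Variant 2: convenience overload returning a string, defined in terms
    // of variant 1, exactly as suggested above.
    template<typename Range>
    std::string to_utf8(Range const& input)
    {
        std::string result;
        to_utf8(input, std::back_inserter(result));
        return result;
    }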
Perhaps this is not made clear by the docs, but Boost.Locale embeds a second mechanism, besides the functions in the conv namespace, for converting between character encodings: the classes derived from boost::locale::util::base_converter. This is the only interface that can be passed incomplete data, as obtained through buffering. That interface was designed to implement codecvt facets, and unfortunately, it's not much better. The library implements its own UTF-8/UTF-16/UTF-32 conversion logic there regardless of which backend is used; it is a shame it cannot reuse the to_utf or from_utf functions in all cases. This implementation is also quite tricky, since the state management prevents the code from being simple and self-contained. That interface does not allow optimized conversion functions to be used. Compatibility with codecvt facets is a nice feature, but it shouldn't impair the interface of a component that could have other usages.

Furthermore, I do not find the boundary analysis interface satisfying at all. It is not made clear in the docs that doing boundary analysis actually goes through the whole string and returns a vector of positions corresponding to the boundaries. You can use that vector to adapt any range into a range of strings. Each dereference copies the subrange defined by the vector into a new temporary string. This is pretty terrible. I want to be able to find the boundaries as I advance through a string, or to find the closest boundary to an arbitrary position, without having to process the whole string. I do not want to have to copy my whole input range into little strings for no reason. Usage of this feature is also unnecessarily verbose. It should not take more than a function call to turn a range of characters into a range of tagged ranges of characters. This part of the library is the one that needs the most work. Ironically enough, it's also the only one that deals a bit with iterators and ranges, and I just asked for more iterators and ranges at the beginning of this review ;).

Finally, I will not comment on the implementation too much, since I think other Boosters already did that (though I did not read all of it). There is some unnecessary or non-optimal code, some code more complicated than it should be, some compile-time values tested at runtime, some things that could have reused existing utilities from Boost, and some inappropriate use of reinterpret_cast. While most of the backends are separated, that is not the case for encoding/codepage.cpp, which is shared by all backends and contains ifdefs. This is not a very nice extension point. I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places; please prefer using templates when dynamic binding is not required. A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.

I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can get things done for sure, but maybe it's not quite as good as it could be in terms of interface yet. Also, if I want Boost.Unicode to ever be a possible backend, Boost.Locale needs to be able to make the most of it ;).

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
Let us start with my decision: I vote against inclusion of Boost.Locale into Boost in its current state, but strongly encourage the author to reconsider resubmitting it after some design issues have been reconsidered. In which case, I would be in favour of putting it on the fast-track for a mini-review.
Can you please state explicitly which things are conditions for the library to get to a mini-review? There are many points below, but it is not always clear whether you consider them critical or not. I assume not everything can or will be solved, but I would rather see more specific notes like (a) missing XYZ interface for function ABC. It would help me and the review manager.
Since I am the author of another Unicode library that I'd like to see included in Boost, I hope that my review will not be considered biased. I believe Boost.Locale is a valuable and useful library, and I believe both libraries can coexist.
I can only say that I'm looking forward to Boost.Unicode, and I know that the way you look at interfaces is very different. Boost.Locale is built from a very practical point of view - usability first. I don't try to solve all possible corner cases, especially when they are not a central part of the localization task. I would rather spend more time adding specific missing features useful for localization than provide things like iterator-based interfaces for operations that are usually executed on small chunks of text. The localization topic is almost endless, and I have different priorities, which are linguistic and localization centric.
The main reason for my decision is the interface that the library provides: although it aims to abstract the backend used to implement the facilities, it exposes too much of how those facilities are implemented, which prevents benefiting from future backends that may not have the same limitations as the current ones.
Indeed, the Boost.Locale rationale clearly says that it only takes std::basic_string and const char* as input because that's what the backend libraries expect anyway.
Not only that; there are more reasons for this: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_linear_chu...
Yet there are various scenarios where more generic input would be useful:

- I might want to apply some operation to data in a file without loading it all into memory first; Boost.Locale could copy that memory chunk by chunk to apply the operation for backends that have no native support for this. Copying the whole file to memory is a good enough approximation for a first version.
- The input I have may not be directly available in the form of a contiguous buffer, but I wouldn't mind if Boost.Locale copied it.
The point is that, apart from Boost.Unicode, I don't see any candidate backend, now or in the near future, that would require such things. Apart from charset conversion (which already supports streaming via codecvt - see the example in the tutorial), all Unicode/localization APIs known to me work on a single chunk of memory. I don't see myself implementing it any other way. When Boost.Unicode is ready, I'm sure its users will be glad to use its own native interface where needed. So do we need a more generic API that would never be used? I really doubt it.
If we consider the functions from_utf, to_utf and between, they can take either a const char*, two const char*, or a std::basic_string. Clearly, it would be better if those functions took a range, which covers all three of these cases. It's not just a matter of being generic; it makes the library easier to use, since the range is defined by a single argument which can be anything I want, and I do not need to convert it to what the library needs. If the library needs to do a conversion behind the scenes, it can do it just like that, behind the scenes.
Actually, it is a convenience API.
Taking a range, however, raises the question of what the return type should be: I think there should be two variants, one which returns a string, and one which appends to a container (like Boost.StringAlgo) or writes to an output iterator (like STL algorithms). Obviously, the former would be defined in terms of the latter.
This, however, causes a big problem: the backends would no longer be interchangeable with a single stable ABI. Maybe some template specializations could be injected at link time, but they would have to be fully concrete specializations.
To be honest, I don't think that iterators, especially over non-UTF charsets, are a good idea. I'd rather adopt the concept of a stream that you read from and write to:

1. It is buffered.
2. It is fast and can work with big chunks.
3. It can be passed transparently, and ABI-stably, to any backend.

So I would rather see generic algorithms implemented over I/O streams than over iterators.
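A rough sketch of what such a stream-based algorithm could look like; the per-character std::toupper is only a stand-in for real, chunk-aware backend work, and the signature is invented:

    #include <istream>
    #include <ostream>
    #include <locale>

    void to_upper_stream(std::istream& in, std::ostream& out, std::locale const& loc)
    {
        char buf[4096]; // buffered: reads and writes big chunks
        while(in.read(buf, sizeof buf), in.gcount() > 0) {
            std::streamsize n = in.gcount();
            for(std::streamsize i = 0; i < n; ++i)
                buf[i] = std::toupper(buf[i], loc); // a real backend works per chunk, not per char
            out.write(buf, n);
        }
    }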
Maybe the idea of backends being interchangeable at the ABI level needs to be rethought.
After all, isn't being interchangeable at the API level enough, since you can easily mangle the configuration in a symbol to cause an error if linking with the non-matching backend?
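A minimal sketch of this symbol trick; all macro and symbol names are invented for illustration:

    // In a common header: encode the configured backend in a symbol name.
    #if defined(BOOST_LOCALE_USE_ICU)
    #  define BOOST_LOCALE_BACKEND_TAG boost_locale_built_with_icu
    #else
    #  define BOOST_LOCALE_BACKEND_TAG boost_locale_built_with_std
    #endif

    extern "C" int BOOST_LOCALE_BACKEND_TAG; // defined only in the matching library binary

    // Odr-using the symbol from every client TU makes linking against a
    // mismatched backend fail with an undefined-symbol error.
    namespace { int const* const boost_locale_backend_check = &BOOST_LOCALE_BACKEND_TAG; }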
It is actually a central part of the library design, and I think it has had a very good effect on it. I work with localization, actually use the library for localization purposes, and I know how valuable it is to have a generic, stable, backend-independent API with different backends, especially when ICU may be a heavy dependency. Without it I wouldn't have developed the library in the first place.
Still talking about conversion between encodings, I would like it if it were possible to statically choose the encodings we're converting between, as some backends could directly expose a specialized and optimized some_encoding_to_some_other_encoding function.
Actually, almost all backends are optimized for UTF-8 where possible. So without bringing charset tables into Boost.Locale, I can see only very few charset specializations that could be used:

1. UTF-XYZ to/from UTF-ABC
2. Latin1/ISO-8859-1 to/from UTF-XYZ

But I think Boost.Unicode would do this better, as it specializes in handling UTF-XYZ. Note that you will almost never see any reference to code points in the Boost.Locale API, as a code point is meaningless in the context of localization. It is too low-level an abstraction, one I actually prefer developers not even deal with. I don't try to implement tasks that a UTF-8/16/32 library should do.
Note that the remarks about conversion between encodings also apply to case folding, collation, etc. Being able to statically specify the normalization form you're normalizing to, and things like that, could be useful just as it is for the character encodings.
I know this (we talked about it before). However, like almost all Unicode APIs out there, I don't cater to it. Such low-level optimizations are good, but IMHO a corner case. They are good for a lower-level library.
Perhaps this is not made clear by the docs, but Boost.Locale embeds a second mechanism, besides the functions in the conv namespace, for converting between character encodings: the classes derived from boost::locale::util::base_converter. This is the only interface that can be passed incomplete data, as obtained through buffering. That interface was designed to implement codecvt facets,
Exactly and this is its purpose.
and unfortunately, it's not much better. The library implements its own UTF-8/UTF-16/UTF-32 conversion logic there regardless of which backend is used; it is a shame it cannot reuse the to_utf or from_utf functions in all cases. This implementation is also quite tricky since the state management prevents the code from being simple and self-contained.
Actually, the code does reuse it in many places, but there is a subtle point: unlike normal to_utf/from_utf calls, the facets know their encoding and may do things faster; for example, the ICU backend uses this knowledge to perform some tasks better.
That interface does not allow optimized conversion functions to be used. Compatibility with codecvt facets is a nice feature, but it shouldn't impair the interface of a component that could have other usages.
It is not just a nice feature; it is the feature that allows Boost.Filesystem and Boost.ProgramOptions to be used correctly with UTF-8 or other encodings on the Windows platform, with its total encoding mess. Using Boost.Locale and a UTF-8 codecvt facet opens the door for all other libraries to adopt it. What is important is that std::codecvt is part of the C++ standard, and thus any library that uses this standard facility can enjoy the full power of Boost.Locale without even depending on it.
Furthermore, I do not find the boundary analysis interface satisfying at all. It is not made clear in the docs that doing boundary analysis actually goes through the whole string and returns a vector of positions corresponding to the boundaries. You can use that vector to adapt any range into a range of strings. Each dereference copies the subrange defined by the vector into a new temporary string. This is pretty terrible. I want to be able to find the boundaries as I advance through a string, or to find the closest boundary to an arbitrary position, without having to process the whole string. I do not want to have to copy my whole input range into little strings for no reason.
I may agree that it is not perfect, but given the limitations of the current state-of-the-art implementation in ICU, it is still a reasonable one. AFAIK Boost.Unicode implements its own interface for text segmentation, but without ICU's custom locale rules and dictionaries, which are important for text segmentation in some languages.
Usage of this feature is also unnecessarily verbose. It should not take more than a function call to turn a range of characters into a range of tagged ranges of characters. This part of the library is the one that needs the most work. Ironically enough, it's also the only one that deals a bit with iterators and ranges, and I just asked for more iterators and ranges at the beginning of this review ;).
Finally, I will not comment on the implementation too much, since I think other Boosters already did that (though I did not read all of it). There is some unnecessary or non-optimal code
Can you point to it? There was some duplication, but non-optimal code?
some code more complicated than it should be
Can you be more specific?
some compile-time values tested at runtime, some things that could have reused existing utilities from Boost,
More specific points?
and some inappropriate use of reinterpret_cast.
In two places, and it is going to be fixed.
While most of the backends are separated, it is not the case for encoding/codepage.cpp, which is shared by all backends and contains ifdefs. This is not a very nice extension point.
Actually, not codepage only:

- Formatting (locale::format)
- Locale information
- Message formatting (gettext)
- codecvt facets (not the details)

And more... So there are lots of shared parts. Now, unlike most locale-related operations, codepage handling is a basic facility that is used all over the code. So I don't really understand how you envision it might actually work.
I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places; please prefer using templates when dynamic binding is not required.
More specific points?
A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.
I'm sorry, but that is just something that can't happen and never will. This request has no reasonable basis, especially for such a complex topic.
I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can get things done for sure, but maybe it's not quite as good as it could be in terms of interface yet.
The interface is part of the design, and it was designed to be simple and convenient rather than fully generic. But I understand that we have very different points of view.
Also, if I want Boost.Unicode to ever be a possible backend, Boost.Locale needs to be able to make the most of it ;).
Actually, I was thinking about it. Boost.Locale and Boost.Unicode are very different:

- Boost.Locale is locale centric.
- Boost.Unicode is locale agnostic.

A few points where Boost.Unicode could actually serve Boost.Locale:

- Normalization: though I assume it would be better to use Boost.Unicode directly, as it is a really locale-independent procedure; it is part of Boost.Locale for convenience.
- Collation: if it is parametrized so that it can be changed for each locale separately.
- Boundary analysis: for a subset of languages - not Japanese, not Thai and some others - at least given how Boost.Unicode looks at this point.

I rather prefer to look at Boost.Unicode as a library orthogonal to Boost.Locale, as it does very different things. It can't be Boost.Locale's backend because, at least in its current state, it is locale agnostic and misses the most important thing localization actually requires: the CLDR (Common Locale Data Repository).

Thank you for the review.

Artyom

Artyom wrote:
The point is that, apart from Boost.Unicode, I don't see any candidate backend, now or in the near future, that would require such things.
Apart from charset conversion (which already supports streaming via codecvt - see the example in the tutorial), all Unicode/localization APIs known to me work on a single chunk of memory.
I've offered you a set of fast charset conversion specialisations; these will work on ranges of arbitrary iterator types. Regards, Phil

On 18/04/2011 14:24, Artyom wrote:
Can you please state explicitly which things are conditions for the library to get to a mini-review?
I don't know, I just read some more of the other reviews, and no one seems to have the same concerns as me, therefore I do not know if my demands are justified.
Boost.Locale is built from a very practical point of view - usability first. I don't try to solve all possible corner cases, especially when they are not a central part of the localization task.
I would rather spend more time adding specific missing features useful for localization than provide things like iterator-based interfaces for operations that are usually executed on small chunks of text.
I think that for a Boost library, the API can be more important than the features themselves.
The main reason for my decision is the interface that the library provides: although it aims to abstract the backend used to implement the facilities, it exposes too much of how those facilities are implemented, which prevents benefiting from future backends that may not have the same limitations as the current ones.
Indeed, the Boost.Locale rationale clearly says that it only takes std::basic_string and const char* as input because that's what the backend libraries expect anyway.
Not only that; there are more reasons for this:
http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_linear_chu...
I can't find any more reasons than what I said. Am I missing something?
When Boost.Unicode is ready, I'm sure its users will be glad to use its own native interface where needed.
So do we need a more generic API that would never be used?
Maybe they could want another backend for various reasons:

- ICU is more well-tested
- ICU is more feature-rich
- Boost.Unicode's behaviour is not tailored for particular locales
- Boost.Unicode is written by a crazy Frenchman whom they do not trust
- ICU may already be installed while Boost.Unicode is not
- etc.
To be honest, I don't think that iterators, especially over non-UTF charsets, are a good idea.
I'd rather adopt the concept of a stream that you read from and write to:
1. It is buffered.
2. It is fast and can work with big chunks.
3. It can be passed transparently, and ABI-stably, to any backend.
So I would rather see generic algorithms implemented over I/O streams than over iterators.
Ok. I had the feeling that iterators were more popular than streams, but streams are fine too; they're very close concepts.
Without it I wouldn't have developed the library in the first place.
I never said to get rid of the stable API; I was just asking whether you were willing to consider giving up the idea of the library backend binaries being swappable at (potentially runtime) link time, i.e. backends would be selected statically instead of dynamically. If not, then that's fine, but it does place quite a few limitations on what the library can do in terms of templates.
What is important is that std::codecvt is part of the C++ standard, and thus any library that uses this standard facility can enjoy the full power of Boost.Locale without even depending on it.
While I agree that support for std::codecvt is important, why not make a new interface similar to it that doesn't have the same state problems, and tell users to use that instead when possible?
I may agree that it is not perfect, but given the limitations of the current state-of-the-art implementation in ICU, it is still a reasonable one.
ICU does not have such limitations:

- BreakIterator allows finding boundaries as you iterate over the text.
- Even if ICU couldn't do that, there is no reason to return basic_string<ValueType>(begin, end) from the operator* of your token_iterator and copy the data; you could just return iterator_range<IteratorType>(begin, end).

Also, while I'm at it, break_iterator really looks like a special case of boost::filter_iterator.
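The suggested fix, sketched; token_iterator's real members are not shown, only the dereference is at issue:

    #include <boost/range/iterator_range.hpp>
    #include <string>

    template<typename Iterator>
    struct token_iterator_sketch
    {
        // Instead of copying the token into a fresh string on every dereference:
        //   std::basic_string<ValueType> operator*() const
        //   { return std::basic_string<ValueType>(begin_, end_); }
        // return a view over the underlying text:
        boost::iterator_range<Iterator> operator*() const
        {
            return boost::iterator_range<Iterator>(begin_, end_); // no copy
        }
        Iterator begin_, end_;
    };

    // A caller who really wants a string can still copy explicitly:
    //   std::string s = boost::copy_range<std::string>(*it);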
some compile-time values tested at runtime, some things that could have reused existing utilities from Boost,
More specific points?
I saw stuff like

    if(is_linear_iterator<SomeType>::is_linear) ...

Additionally, this particular trait is not canonical (it should be ::value, not ::is_linear), is not rightly named (it's contiguous, not linear), and could have other uses outside of Boost.Locale, so it should be in its own file.
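In canonical form the trait might look like this (name and file location hypothetical):

    // boost/locale/detail/is_contiguous_iterator.hpp (hypothetical location)
    template<typename Iterator>
    struct is_contiguous_iterator
    {
        static const bool value = false; // conservative default
    };

    template<typename T>
    struct is_contiguous_iterator<T*> // raw pointers are contiguous
    {
        static const bool value = true;
    };

    // usage then reads as: if(is_contiguous_iterator<SomeType>::value) ...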
While most of the backends are separated, it is not the case for encoding/codepage.cpp, which is shared by all backends and contains ifdefs. This is not a very nice extension point.
Actually, not codepage only:
- Formatting (locale::format)
- Locale information
- Message formatting (gettext)
- codecvt facets (not the details)
And more...
So there are lots of shared parts.
I meant shared parts which are not really shared, i.e. a file that is shared but that contains completely different code chosen using #ifdefs.
I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places; please prefer using templates when dynamic binding is not required.
More specific points?
Couldn't boost::locale::util::base_converter be a concept rather than an abstract base class?
A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.
I'm sorry, but that is just something that can't happen and never will.
This request has no reasonable basis, especially for such a complex topic.
Using templates instead of inclusion polymorphism would allow avoiding newing the object and using a smart pointer, for example. Plus, the instances of allocation in the boundary stuff (when you build an index vector and when you copy into a new basic_string) appear to be unnecessary. Functions that allocate behind your back are frowned upon, because you have no means of controlling how that memory is allocated or managed. It limits the application range of the library. That could still be considered reasonable if the benefits outweighed the drawbacks; I do not think this is necessarily the case here.
I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can get things done for sure, but maybe it's not quite as good as it could be in terms of interface yet.
The interface is part of the design, and it was designed to be simple and convenient rather than fully generic.
Passing an arbitrary range to the functions is convenient; forcing the user to copy it to a vector and pass the begin and end pointers before he can call the functions isn't. See, genericity is convenient.
It can't be Boost.Locale's backend because, at least in its current state, it is locale agnostic and misses the most important thing localization actually requires: the CLDR (Common Locale Data Repository).
It would have to be a partial backend (used just for conversion between UTF encodings, normalization and boundary analysis), and you could fill the blanks using another backend.

AMDG On 04/19/2011 08:13 AM, Mathias Gaunard wrote:
On 18/04/2011 14:24, Artyom wrote:
some compile-time values tested at runtime, some things that could have reused existing utilities from Boost,
More specific points?
I saw stuff like
if(is_linear_iterator<SomeType>::is_linear) ....
And what exactly is wrong with this? Most compilers are perfectly capable of optimizing away the branch, and it makes the code more straightforward than any compile time selection mechanism. In Christ, Steven Watanabe

On 19/04/2011 17:27, Steven Watanabe wrote:
AMDG
On 04/19/2011 08:13 AM, Mathias Gaunard wrote:
On 18/04/2011 14:24, Artyom wrote:
some compile-time values tested at runtime, some things that could have reused existing utilities from Boost,
More specific points?
I saw stuff like
if(is_linear_iterator<SomeType>::is_linear) ....
And what exactly is wrong with this? Most compilers are perfectly capable of optimizing away the branch, and it makes the code more straightforward than any compile time selection mechanism.
True, it's a style thing, but the fact that there were several issues with this single line caught my eye.

On Apr 19, 2011, at 11:27 AM, Steven Watanabe wrote:
On 04/19/2011 08:13 AM, Mathias Gaunard wrote:
I saw stuff like
if(is_linear_iterator<SomeType>::is_linear) ....
And what exactly is wrong with this? Most compilers are perfectly capable of optimizing away the branch, and it makes the code more straightforward than any compile time selection mechanism.
In the future, maybe that will be spelled if<...> { ... } :-D G

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
Can you please state explicitly which things are conditions for the library to get to a mini-review?
I don't know, I just read some more of the other reviews, and no one seems to have the same concerns as me, therefore I do not know if my demands are justified.
Boost.Locale is built from a very practical point of view - usability first. I don't try to solve all possible corner cases, especially when they are not a central part of the localization task.
I would rather spend more time adding specific missing features useful for localization than provide things like iterator-based interfaces for operations that are usually executed on small chunks of text.
I think that for a Boost library, the API can be more important than the features themselves.
As you know, APIs are subject to change in Boost, and I would say that they are changed too freely and too often. In the case of Boost.Locale, I think more changes will come, but they will extend the functionality. Everything really depends on needs.
Indeed, the Boost.Locale rationale clearly says that it only takes std::basic_string and const char* as input because that's what the backend libraries expect anyway.
Not only that; there are more reasons for this:
http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_linear_chu...
I can't find any more reasons than what I said. Am I missing something?
Yes: most of the supported operations on text, like collation and case handling, usually work on small chunks of text. I can accept that some operations may work better on arbitrary streams, but most of them just don't need it. For example, collation... when exactly do you need to compare arbitrary text streams?
To be honest, I don't think that iterators, especially over non-UTF charsets, are a good idea.
I'd rather adopt the concept of a stream that you read from and write to:
1. It is buffered.
2. It is fast and can work with big chunks.
3. It can be passed transparently, and ABI-stably, to any backend.
So I would rather see generic algorithms implemented over I/O streams than over iterators.
Ok. I had the feeling that iterators were more popular than streams, but streams are fine too; they're very close concepts.
I'll reveal to you my plan: I'm going to provide a stream-like interface for the search algorithm. Unfortunately it is really non-trivial - not because it is not possible, but because there is not enough time to implement it. Basically, what I need is to implement icu::CharacterIterator over a seekable std::istream. The biggest problem is that CharacterIterator requires UTF-16 indexing, and this is a not-so-simple problem. So I left this out of the scope of the current version of Boost.Locale, and I'll add it in the future. I think it is doable, but it requires me to look deep into the ICU sources, because how to do this is not really documented.

Once it is done, it will be possible to implement the same strategy for boundary analysis and probably some other operations as well (normalization, I think). So basically the user would provide I/O streams for this purpose (of course with some nice wrappers for common cases). But it will not happen now, as it requires quite a lot of work and testing.

This is basically what I had said in those lines in the documentation: when a new API is introduced into Boost.Locale in the future, one that likely works on large chunks of text, it will provide an interface for non-linear text handling. However, the only API that really needs this is the search API; for some others, like segmentation, it is nice to have; for others, like collation, it is not needed at all.

So this is a quick glance at the future, which is currently only at the planning stage and is not part of the review. I thought about providing a stream API for charset conversion, but put it on hold as it is not really a central part, especially when codecvt is there.
More specific points?
I saw stuff like
if(is_linear_iterator<SomeType>::is_linear) ....
Additionally, this particular trait is not canonical (it should be ::value, not ::is_linear), is not rightly named (it's contiguous, not linear), and could have other uses outside of Boost.Locale, so it should be in its own file.
Such things should come up in a code review. I really welcome learning how to do things better. The question is whether such changes are critical or not.
While most of the backends are separated, it is not the case for encoding/codepage.cpp, which is shared by all backends and contains ifdefs. This is not a very nice extension point.
Actually, not codepage only:
- Formatting (locale::format)
- Locale information
- Message formatting (gettext)
- codecvt facets (not the details)
And more...
So there are lots of shared parts.
I meant shared parts which are not really shared, i.e. a file that is shared but that contains completely different code chosen using #ifdefs.
Not exactly; it consists of fallbacks: if, for example, iconv does not support windows-XYZ, uconv is used. It allows merging several libraries in a way where they back each other up. Take a deeper look at that section. It is different from backend selection.
I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places; please prefer using templates when dynamic binding is not required.
More specific points?
Couldn't boost::locale::util::base_converter be a concept rather than an abstract base class?
Because there is no need to duplicate complex code via template metaprogramming if a simple function call can be made. Don't overrate template metaprogramming and don't underestimate virtual functions.
A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.
I'm sorry, but that is just something that can't happen and never will.
This request has no reasonable basis, especially for such a complex topic.
Using templates instead of inclusion polymorphism would allow avoiding newing the object and using a smart pointer, for example.
I'm not sure which exact location bothers you, but everywhere (unless I'm missing something) there is minimal data copying, and I rely heavily on RVO. If you see some non-required copying, tell me.
Plus, the instances of allocation in the boundary stuff (when you build an index vector and when you copy into a new basic_string) appear to be unnecessary.
A more specific location? I don't remember such a thing; I just need better pointers to answer.
I hope my review will be well received and that you will consider all of my "feature requests". Your library is practical and can get things done for sure, but maybe it's not quite as good as it could be in terms of interface yet.
The interface is part of the design, and it was designed to be simple and convenient rather than fully generic.
Passing an arbitrary range to the functions is convenient; forcing the user to copy it to a vector and pass the begin and end pointers before he can call the functions isn't.
Where exactly? I need better context, as above. Artyom

On 19/04/2011 20:30, Artyom wrote:
As you know, APIs are subject to change in Boost, and I would say that they are changed too freely and too often.
Right, that's because the API is what makes Boost libraries good. If it's not good enough, it needs to change. And to keep APIs from changing too much in released libraries, the best solution is to make sure the library has the best API *when it is accepted into Boost*. And I think there is still some room for improvement in the Boost.Locale API.
I can accept that some operations may work better on arbitrary streams, but most of them just don't need it.
For example, collation... when exactly do you need to compare arbitrary text streams?
Because the data does not exist in memory, may be computed on the fly, or whatever really. A possible application is simply to chain operations in a pipeline, i.e. without having to apply one operation completely, then the other on that result, etc. (and without doing the intermediate temporary buffer allocations).
So this is a quick glance at the future, which is currently only at the planning stage and is not part of the review.
I thought about providing a stream API for charset conversion, but put it on hold as it is not really a central part, especially when codecvt is there.
I believe it *is* the very central part of any text processing system.
Such things should come up in a code review.
I really welcome learning how to do things better. The question is whether such changes are critical or not.
All those comments at the end of my original message only affect the internals, so I don't care so much. It's mostly a style thing.
Take a deeper look at that section.
It is different from backend selection.
If I want to add a backend, I only want to add a new repository with the implementation for that backend. I do not want to have to hack all shared files by adding some additional ifdefs.
I'm also a bit concerned about the generalized usage of virtual functions everywhere; it seems unnecessary in quite a few places; please prefer using templates when dynamic binding is not required.
More specific points?
Couldn't boost::locale::util::base_converter be a concept rather than an abstract base class?
Because there is no need to duplicate complex code via template metaprogramming if a simple function call can be made.
This sentence doesn't make any sense to me. Template meta-programming is not a means to duplicate code. Nor is normal template usage (which is what I suggested instead of virtual functions) template meta-programming.
Don't overrate template metaprogramming and don't underestimate virtual functions.
You could replace the reference to the base class with a template parameter and get rid of the virtualness of the member functions. The virtualness is already in codecvt; why would you want to put it twice in a row?
A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.
I'm sorry, but that is just something that can't happen and never will.
This request has no reasonable basis, especially for such a complex topic.
Using templates instead of inclusion polymorphism would allow avoiding newing the object and using a smart pointer, for example.
I'm not sure which exact location bothers you, but everywhere (unless I'm missing something) there is minimal data copying, and I rely heavily on RVO.
I didn't say copying; I said allocation and usage of new. grep -r -F "new " * should give you the exact locations.
If you see some non-required copying, tell me.
Plus, the instances of allocation in the boundary stuff (when you build an index vector and when you copy into a new basic_string) appear to be unnecessary.
A more specific location? I don't remember such a thing; I just need better pointers to answer.
I've been very precise. You unnecessarily allocate a new string and copy the contents in the operator* of token_iterator. I've also given you a trivial fix for that particular problem, though it doesn't solve the underlying problem of generating (and allocating) the index vector first. I don't see how I can be more precise than this.
Passing an arbitrary range to the functions is convenient; forcing the user to copy it to a vector and pass the begin and end pointers before he can call the functions isn't.
Where exactly? I need better context, as above.
The interface of your functions is something like

    std::string foo(char* begin, char* end);

Let's say I've got a filter_range<std::vector<char> > as input. To use your function, I need to do

    std::vector<char> v;
    std::copy(boost::begin(input), boost::end(input), std::back_inserter(v));
    std::string s = foo(v.data(), v.data() + v.size());

A more convenient interface would directly allow

    std::string s = foo(input);

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
I can accept that some operations may work better on arbitrary streams, but most of them just don't need it.
For example, collation... when exactly do you need to compare arbitrary text streams?
Because the data does not exist in memory, may be computed on the fly, or whatever really. A possible application is simply to chain operations in a pipeline, i.e. without having to apply one operation completely, then the other on that result, etc. (and without doing the intermediate temporary buffer allocations).
Pipeline and collation? Either I don't get you, or we have too different points of view. Not every programming concept is about stream processing, especially collation, where you sort two units of data and each unit is a whole. But let's leave it behind, because I don't see that we would get anywhere.
I thought about providing a stream API for charset conversion, but put it on hold as it is not really a central part, especially when codecvt is there.
I believe it *is* the very central part of any text processing system.
Text processing, not localization; besides, there already is stream charset conversion...
Take a deeper look at that section.
It is different from backend selection.
If I want to add a backend, I only want to add a new repository with the implementation for that backend. I do not want to have to hack all shared files by adding some additional ifdefs.
A localization backend and a utility that converts one encoding to another are two different things. But I see your point.
Because there is no need to duplicate complex code via template metaprogramming if a simple function call can be made.
This sentence doesn't make any sense to me.
Template meta-programming is not a means to duplicate code. Nor is normal template usage (which is what I suggested instead of virtual functions) template meta-programming.
I mean binary code. When you have

    template<typename Type>
    class foo {
        void bar() { /* something type independent */ }
    };

and then use foo<char> and foo<wchar_t>, bar will eventually be duplicated in the binary code as

    void foo<char>::bar();
    void foo<wchar_t>::bar();

regardless of the fact that it does the same job. And in the end you get huge executables that basically copy the same things around.
A lot of new and vectors too; I'd prefer if, ideally, the library never allocated anything.
I'm sorry, but that is just something that can't happen and never will.
This request has no reasonable basis, especially for such a complex topic.
Using templates instead of inclusion polymorphism would allow avoiding newing the object and using a smart pointer, for example.
I'm not sure which exact location bothers you, but everywhere (unless I'm missing something) there is minimal data copying, and I rely heavily on RVO.
I didn't say copying; I said allocation and usage of new. grep -r -F "new " * should give you the exact locations.
This will not happen. It is not a fancy header-only library that does some small functions character by character. This library uses a dozen different APIs... Do you really think it is possible to do it without a single new? And BTW, most of them are called for generating the locale's facets, basically once, when the locale is initialized... If you had really run this grep and looked at each use of new, you wouldn't even have written this "grep" sentence.
If you see some non-required copying, tell me.
Plus, the instances of allocation in the boundary stuff (when you build an index vector and when you copy into a new basic_string) appear to be unnecessary.
A more specific location? I don't remember such a thing; I just need better pointers to answer.
I've been very precise. You unnecessarily allocate a new string and copy the contents in the operator* of token_iterator.
Yes? So how would you return a string? I don't see there any unexpected allocations.

------------------------------------------------------

I want to say a few words to summarize, because I don't see this going anywhere.

Boost.Locale is not Boost.Unicode; it behaves differently, it thinks differently, and it does many things the way normal localization APIs all over the world do them. Yes, ranges are a nice and important concept for template metaprogramming, but this is not a template library and never will be. You can't expect the library to provide techniques suitable for a template system.

Yes, it is simple to write:

    template<typename Input,typename Output>
    Output bad_to_upper(Input begin,Input end,Output out,std::locale const &l)
    {
        typedef std::ctype<typename Input::value_type> facet_type;
        while(begin!=end) {
            *out++ = std::use_facet<facet_type>(l).toupper(*begin);
            ++begin;
        }
        return out;
    }

But it does not work this way, because to_upper needs the entire chunk and not an arbitrary character at every point. You need to call some virtual function on some range; it does not even know what Iterator is... So you are trying to apply techniques that do not belong here. Why? Because you need either:

    template<typename Input,typename Output>
    Output a_to_upper(Input begin,Input end,Output out,std::locale const &l)
    {
        typedef typename Input::value_type char_type;
        typedef boost::locale::convert<char_type> facet_type;
        std::vector<char_type> input_buf;
        std::copy(begin,end,std::back_inserter(input_buf));
        std::basic_string<char_type> output_buf =
            std::use_facet<facet_type>(l).to_upper(&input_buf[0],input_buf.size());
        return std::copy(output_buf.begin(),output_buf.end(),out);
    }

But it does two allocations!$@R$%#! Not good. So let's create some virtual iterator:

    template<typename CharType>
    class base_iterator {
    public:
        virtual CharType value() { return value_; }
        virtual bool next() = 0;
    protected:
        CharType value_;
    };

    template<typename IteratorType>
    class input_wrapper : public base_iterator<typename IteratorType::value_type> {
    public:
        input_wrapper(IteratorType begin,IteratorType end) : begin_(begin),end_(end) {}
        virtual bool next()
        {
            if(begin_==end_)
                return false;
            this->value_ = *begin_++;
            return true;
        }
    private:
        IteratorType begin_,end_;
    };

And the same for

    template<typename CharType>
    class base_output_iterator { ... };

    template<typename IteratorType>
    class output_wrapper : public base_output_iterator<typename IteratorType::value_type> { ... };

And now we rewrite our function as:

    template<typename Input,typename Output>
    Output b_to_upper(Input begin,Input end,Output out,std::locale const &l)
    {
        typedef typename Input::value_type char_type;
        typedef boost::locale::convert<char_type> facet_type;
        input_wrapper<char_type> input(begin,end);
        output_wrapper<char_type> output(out);
        std::use_facet<facet_type>(l).to_upper(input,output);
        return output.value();
    }

But, hey!#%#$%#4 For each character I call a virtual function; WOW, the cost is too big! $^$%^%@#$^@#%@#$%@#$%

Attempt number three: make the virtual functions more efficient.

    template<typename IteratorType>
    class input_wrapper : public std::basic_istream<typename IteratorType::value_type> { ... };

    template<typename IteratorType>
    class output_wrapper : public std::basic_ostream<typename IteratorType::value_type> { ... };

Now they are buffered, there are no virtual function calls per character, and under the hood it may even work on a single memory chunk...

    template<typename Input,typename Output>
    Output c_to_upper(Input begin,Input end,Output out,std::locale const &l)
    {
        typedef typename Input::value_type char_type;
        typedef boost::locale::convert<char_type> facet_type;
        input_wrapper<char_type> input(begin,end);
        output_wrapper<char_type> output(out);
        std::use_facet<facet_type>(l).to_upper(input,output);
        return output.value();
    }

But hey...
We created two iostream objects because the user wanted to convert a string to upper case... Something is really-really-really wrong here.

------------------------------------------

Template metaprogramming techniques just don't fit here. You may try to force them in as much as you can, but they are and will be ugly. Don't try to make things fancier than they should be, especially when it comes to text, and every string I've ever seen has something like c_str()...

Artyom

On 20.04.2011 0:40, Artyom wrote:
From: Mathias Gaunard<mathias.gaunard@ens-lyon.org>
Template meta-programming is not a means to duplicate code. [...]
I mean binary code.
When you have

    template<typename Type>
    class foo {
        void bar() { /* something type independent */ }
    };

and then use foo<char> and foo<wchar_t>, bar will eventually be duplicated in the binary code as

    void foo<char>::bar();
    void foo<wchar_t>::bar();

regardless of the fact that it does the same job. And in the end you get huge executables that basically copy the same things around.
There are techniques to prevent code bloat: in this case you can inherit from a type-independent base class (with the type-independent code).
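A sketch of that technique, for the record (all names invented):

    #include <cstddef>

    // Type-independent logic compiled once, in a non-template base:
    class converter_base {
    protected:
        void do_convert(void const* in, std::size_t bytes)
        {
            // shared buffering / state handling would live here
            (void)in; (void)bytes;
        }
    };

    // The template layer is a thin, typically inlined shim:
    template<typename CharType>
    class converter : private converter_base {
    public:
        void convert(CharType const* in, std::size_t n)
        {
            do_convert(in, n * sizeof(CharType));
        }
    };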

From: Max Sobolev <macsmr@ya.ru>
Template meta-programming is not a means to duplicate code. [...]
I mean binary code.
[snip]
There are techniques to prevent code bloat: in this case you can inherit from a type-independent base class (with the type-independent code).
That is exactly what is done in my code. Artyom

Artyom wrote:
From: Max Sobolev <macsmr@ya.ru>
Template meta-programming is not a means to duplicate code.
You keep mentioning template meta-programming in contexts without any meta-programming. Don't confuse the terminology.
I mean binary code.
There are techniques to prevent code bloat: in this case you can inherit from a type-independent base class (with the type-independent code).
That is exactly what is done in my code.
I thought Mathias was referring to virtual functions in your code, which are not necessary to eliminate template code bloat.

_____
Rob Stewart                           robert.stewart@sig.com
Software Engineer                     using std::disclaimer;
Dev Tools & Components
Susquehanna International Group, LLP  http://www.sig.com

On 19/04/2011 22:40, Artyom wrote:
Text processing, not localization; besides, there already is stream charset conversion...
What is localization if not text processing?
I mean binary code.
When you have

    template<typename Type>
    class foo {
        void bar() { /* something type independent */ }
    };

and then use foo<char> and foo<wchar_t>, bar will eventually be duplicated in the binary code as

    void foo<char>::bar();
    void foo<wchar_t>::bar();
Except that foo<T>::bar() would mostly just call T::baz() and T::whatamacallit(), so it will be completely different for T and U. Generating very small functions that keep jumping and doing indirections between each other is even worse than duplicating a bit of binary code. Inlining gives very good performance benefits.

It appears that you are reluctant about template usage because you're not comfortable with template techniques and you still believe age-old myths. Some people here have deployed template code in very constrained environments, such as microcontrollers with only a few KBs of memory, and it works just fine. I don't think that kind of template fear is positive for a Boost library. But then, why not; maybe Boost libraries really do overuse templates, as some claim.
This will not happen. It is not a fancy header-only library that does some small functions character by character.
This library uses a dozen different APIs... Do you really think it is possible to do it without a single new?
Yes, no problem at all. As I said, just take an output iterator or a container that you append to. This way the allocation is handled by whatever the user gives for the output.
And BTW, most of them are called for generating the locale's facets, basically once, when the locale is initialized...
Yes, those are required due to your design decision of using the C++ locale facet system; but that may not be the case for all usages. Note that those are just general remarks; this is not a major problem. But it would definitely be best if you could reduce the places where allocations happen to a minimum.
Yes? So how would you return a string? I don't see there any unexpected allocations.
I shall repeat the fix then. Return an iterator_range (similar to a pair of pointers) instead of a basic_string.
Yes, it is simple to write
template<typename Input,typename Output>
Output bad_to_upper(Input begin,Input end,Output out,std::locale const &l)
{
    typedef std::ctype<typename Input::value_type> facet_type;
    while(begin!=end) {
        *out++ = std::use_facet<facet_type>(l).toupper(*begin);
        ++begin;
    }
    return out;
}
But it does not work this way, because to_upper needs the entire chunk and not an arbitrary character at every point.
You need to call some virtual function on some range; it does not even know what Iterator is...
That's a backend limitation, which you may or may not be able to overcome. If the backend really needs a contiguous memory buffer, then just copy the range to a contiguous memory buffer. I don't see where the problem is. I'm just trying to make your library easier to use. I may not have a pointer to it handy, and generating one could cost a lot of lines of code, which would make your library a bit annoying to use.
So you are trying to apply techniques that do not belong here. Why? Because you need either:

    template<typename Input,typename Output>
    Output a_to_upper(Input begin,Input end,Output out,std::locale const &l)
    {
        typedef typename Input::value_type char_type;
        typedef boost::locale::convert<char_type> facet_type;
        std::vector<char_type> input_buf;
        std::copy(begin,end,std::back_inserter(input_buf));
        std::basic_string<char_type> output_buf =
            std::use_facet<facet_type>(l).to_upper(&input_buf[0],input_buf.size());
        return std::copy(output_buf.begin(),output_buf.end(),out);
    }

But it does two allocations!$@R$%#! Not good.
That could be the general case, yes (though it would be best if to_upper could directly take a buffer to write to rather than return a basic_string). Then you would specialize the cases where either input or output are contiguous iterators to avoid those copies. I suppose you could also do something like:

    char_type output_buffer[buffer_size];
    while(input_begin != input_end)
    {
        char_type* output = output_buffer;
        status = use_facet<whatever>(l).to_upper(input_begin, input_end, input_begin,
                                                 output_buffer, output_buffer + buffer_size, output);
        if(status == ...) // do something smart in case of error or incomplete
        std::copy(output_buffer, output, out);
    }

No memory allocation needs to happen in the library itself with that kind of design.
So let's create some virtual iterator:
If you want that kind of thing, use type erasure. This is the same technique used by any and function. It allows a single type to contain any other type and dispatch each of its member functions dynamically, as long as it models a particular concept. For iterators, google for any_iterator.
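For reference, the core of the type-erasure technique, reduced to a minimal character source (illustration only; a real any_iterator models the full iterator concept):

    #include <memory>

    class any_char_source {
        struct concept_t {
            virtual ~concept_t() {}
            virtual bool next(char& c) = 0; // one virtual call per character
        };
        template<typename Iterator>
        struct model_t : concept_t {
            model_t(Iterator b, Iterator e) : cur(b), end(e) {}
            bool next(char& c) { if(cur == end) return false; c = *cur++; return true; }
            Iterator cur, end;
        };
        std::unique_ptr<concept_t> impl_;
    public:
        template<typename Iterator>
        any_char_source(Iterator b, Iterator e) : impl_(new model_t<Iterator>(b, e)) {}
        bool next(char& c) { return impl_->next(c); } // dynamic dispatch hidden here
    };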
But, hey!#%#$%#4
For each character I call a virtual function; WOW, the cost is too big!
Indeed, this is probably why any_iterator never became popular. But note that codecvt facets can have exactly the same problem: at least the Dinkumware STL calls the codecvt facet functions character by character.

From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
Generating very small functions that keep jumping and doing indirections between each other is even worse than duplicating a bit of binary code. Inlining gives very good performance benefits.
Some people (performance freaks) would disagree with you; see the Linux kernel coding style. But that is another story.
It appears that you are reluctant about template usage because you're not comfortable with template techniques and you still believe age-old myths.
Unfortunately it is not a myth. I look at real libraries, like Boost.Asio, and I see the endless compilation times and enormous executable sizes they create, and I'm not happy with this. That is reality, not a myth. But I don't think it is the central part of the discussion.
Some people here have deployed template code in very constrained environments, such as microcontrollers with only a few KBs of memory, and it works just fine.
I agree, template metaprogramming is a very important part of C++, but everything has its place. Sometimes dynamic polymorphism is better than static, and sometimes it is the other way around.
I don't think that kind of template fear is positive for a Boost library. But then, why not; maybe Boost libraries really do overuse templates, as some claim.
It does, sometimes; and sometimes it is great and fully justified.
[...] any unexpected allocations.
I shall repeat the fix then. Return an iterator_range (similar to a pair of pointers) instead of a basic_string.
In order to have ranges, just use break_iterator, which returns boundary points; that is what it was designed for. token_iterator, however, is designed to create tokens. Note that if token_iterator returned a range, you wouldn't be able to write:
std::string s = *it;
I'm just trying to make your library easier to use: I may not have a pointer handy, and generating one could cost a lot of lines of code, which would make your library a bit annoying to use.
I see, but I don't always agree that the interface you suggest simplifies things; sometimes it may, but not always. But this is my point of view.
I suppose you could also do something like:

    char_type output_buffer[buffer_size];
    while(input_begin != input_end)
    {
        char_type* output = output_buffer;
        status = use_facet<whatever>(l).to_upper(input_begin, input_end, input_begin,
                                                 output_buffer, output_buffer + buffer_size, output);
        if(status == ...)
            ; // do something smart in case of error or incomplete input
        std::copy(output_buffer, output, out);
    }
Note: you assume that the "whatever" facet is stateful, and it is not; it is const and stateless, unless I provide some specific state as an additional parameter. So one way or another I'll have to put something into a buffer, or a state, or whatever: on the stack if it is small, and on the heap if it is not. So basically it would have to be:

    allocate a small input buffer on the stack
    try to copy the range into the input buffer
    if it succeeds:
        call to_upper
        if the output buffer is too small, allocate one and call to_upper again
        else copy the output buffer to the output stream
    else:
        allocate an input buffer on the heap
        copy the data from the temporary buffer to the heap buffer
        append the remaining part of the data
        allocate an output buffer
        run to_upper
        if the output buffer is too small, reallocate it and run to_upper again
        copy the output buffer to the output iterator...

Hey, but what if the user wants (in 95% of real cases):

    std::string name = to_upper(name);

Because of this, let's create a convenience wrapper that takes name.begin() and name.end(), creates a std::string back_inserter (which, by the way, string does not have, so we need to write one), calls the function above, and returns the created string.

And now:
a) How long will this code take to compile on the user's side?
b) How many more bugs will there be?
c) How often will a range actually be used?

So far, I'm failing to see the clear advantage.
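For reference, that convenience wrapper would look roughly like this: a sketch assuming the generic a_to_upper from earlier, and using std::back_inserter, which, as corrected later in the thread, works on std::string out of the box:

    #include <iterator>
    #include <locale>
    #include <string>

    // the 95% convenience wrapper over the range-based a_to_upper above:
    // run the generic algorithm and collect the result through a back_inserter
    std::string to_upper(std::string const& name, std::locale const& l)
    {
        std::string result;
        a_to_upper(name.begin(), name.end(), std::back_inserter(result), l);
        return result;
    }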
So let's create some virtual iterator:
For each character I call a virtual function. WOW, the cost is too big!
Indeed, this is probably why any_iterator never became popular.
But note codecvt facets can have exactly that same problem. At least the Dinkumware STL calls codecvt facet functions character by character.
But this is really a "Dinkumware STL" problem, because other, normal libraries do it in big chunks as they should. Best, Artyom

Artyom wrote: [snip]
creates a std::string back_inserter (which, by the way, string does not have, so we need to write one)
Pardon me, what does std::string not have? std::string has push_back, so it is perfectly fine to have a back_inserter on std::string, without having to write any support/adapt code for it. [snip] Thanks, Gevorg
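For what it's worth, this can be checked with a few lines that compile with any conforming standard library:

    #include <algorithm>
    #include <iterator>
    #include <string>

    int main()
    {
        std::string src = "hello";
        std::string dst;
        // back_inserter only requires push_back, which std::string provides
        std::copy(src.begin(), src.end(), std::back_inserter(dst));
        return dst == "hello" ? 0 : 1;
    }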

From: Gevorg Voskanyan <v_gevorg@yahoo.com>
Artyom wrote:
creates a std::string back_inserter (which, by the way, string does not have, so we need to write one)
Pardon me, what does std::string not have? std::string has push_back, so it is perfectly fine to have a back_inserter on std::string, without having to write any support/adapt code for it.
Yes, you are right, I probably confused it with some other container... I don't know why I thought that std::string does not have push_back. Artyom

On 20/04/2011 16:30, Artyom wrote:
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
Generating very small functions that keep jumping and doing indirections between each other is even worse than duplicating a bit of binary code. Inlining gives very good performance benefits.
Some people (performance freaks) would disagree with you; see the Linux kernel coding style.
But this is another story.
High-performance computing is my job, but whatever.
I look at real libraries like Boost.Asio, and I see the endless compilation times and enormous executable sizes it creates, and I'm not happy with this.
That is a reality. Not a myth.
The problem with Asio is not templates, it's that the author likes it to be header-only to avoid having to set up a proper build environment. Some code should definitely be in its own TU. Compilation time is usually much longer when using templates, but that's not really a problem.
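A common way to move such code into its own TU is explicit instantiation: the template definition is compiled once for the character types that matter, and clients only see a declaration. A sketch with hypothetical names:

    // converter.hpp - declaration only
    #include <locale>
    #include <string>

    template<typename Char>
    std::basic_string<Char> to_upper_impl(std::basic_string<Char> const& s,
                                          std::locale const& l);

    // converter.cpp - definition plus explicit instantiations, compiled once
    template<typename Char>
    std::basic_string<Char> to_upper_impl(std::basic_string<Char> const& s,
                                          std::locale const& l)
    {
        // ... actual work using the facets of l ...
        return s;
    }

    template std::basic_string<char>    to_upper_impl(std::basic_string<char> const&, std::locale const&);
    template std::basic_string<wchar_t> to_upper_impl(std::basic_string<wchar_t> const&, std::locale const&);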
I agree, template metaprogramming is a very important part of C++, but everything has its own place.
Why do you insist on confusing templates with template meta-programming? Template meta-programming is a very specific usage of templates.
Sometimes dynamic polymorphism is better than static, and sometimes it is the other way around.
Its only advantage is being dynamic. But maybe you don't need that dynamism?
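To make the distinction concrete (hypothetical names): both versions below express the same operation, but the template binds at compile time and can be inlined, while the virtual call binds at run time, which is only worth paying for when the concrete type genuinely varies at run time:

    // static polymorphism: resolved at compile time, inlinable
    template<typename Converter>
    void run_static(Converter const& c) { c.convert(); }

    // dynamic polymorphism: resolved through a vtable at run time
    struct converter_base
    {
        virtual ~converter_base() {}
        virtual void convert() const = 0;
    };

    void run_dynamic(converter_base const& c) { c.convert(); }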
In order to have ranges, just use break_iterator, which returns boundary points; that is what it was designed for.
break_iterator is an iterator of iterators, not ranges. And a poor one, but that's another issue.
token_iterator, however, is designed to create tokens.
Note that if token_iterator returned a range, you wouldn't be able to write:
std::string s = *it;
Is that a problem? Why would you want to do such a thing in the first place? You already have the data, why copy it? In any case, it's still possible by adding a conversion operator to the range type in question. Maybe one of the Boost.Range maintainers could comment on why iterator_range doesn't have such a conversion operator?

Anyway, the whole boundary part of the library has a horrible design in my opinion. The more I look at it, the less I like it. A complete redesign of that part is the main condition for a future acceptance from me.
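For instance, a token could be exposed as a range over the original text and still support direct conversion, so the copy happens only when the user actually asks for a string (a sketch; all names are hypothetical):

    #include <string>

    // A token seen as a range over the original text; no copy is made
    // until the user explicitly converts it to a string.
    struct token_range
    {
        char const* first;
        char const* last;

        char const* begin() const { return first; }
        char const* end() const { return last; }

        // makes `std::string s = *it;` still work when dereferencing
        // an iterator yields a token_range
        operator std::string() const { return std::string(first, last); }
    };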
I see, but I don't always agree that the interface you suggest simplifies things; sometimes it may, but not always.
But this is my point of view.
I don't see how that kind of simplification can be subject to points of view. Not requiring users to convert to your format is simpler than requiring them to do so; it's a fact. If I wanted to use your library, it would be ironic if the first thing I had to do was write a wrapper around it. I thought it was a library that aimed at being a wrapper of ICU with a nice C++ interface.
You assume that the "whatever" facet is stateful, and it is not; it is const and stateless, unless I provide some specific state as an additional parameter.
No, I assumed it was stateless.
So one way or another I'll have to put something into a buffer, or a state, or whatever: on the stack if it is small, and on the heap if it is not. So basically it would have to be:
Don't allocate anything but small automatic variables. The algorithm itself should be separate from the buffer management.
    allocate a small input buffer on the stack
    try to copy the range into the input buffer
    if it succeeds:
        call to_upper
        if the output buffer is too small, allocate one and call to_upper again
        else copy the output buffer to the output stream
    else:
        allocate an input buffer on the heap
        copy the data from the temporary buffer to the heap buffer
        append the remaining part of the data
        allocate an output buffer
        run to_upper
        if the output buffer is too small, reallocate it and run to_upper again
        copy the output buffer to the output iterator...
Your algorithm seems wrong. If the output buffer is too small, simply don't process the data that would require a larger buffer. Or you could design the code so that the output buffer is always statically guaranteed to be big enough. And don't copy the whole input range to a contiguous buffer (that would need to be heap-allocated), copy it in bits to an automatic buffer.
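A sketch of what "copy it in bits to an automatic buffer" could look like; process_chunk stands in for the backend call and is supplied by the caller, and a real version would also have to handle units that straddle two chunks:

    #include <cstddef>

    // Drive a contiguous-buffer-only backend over an arbitrary input
    // range by copying it chunk by chunk into a small automatic buffer.
    template<typename Iterator, typename Output, typename ChunkOp>
    Output for_each_chunk(Iterator begin, Iterator end, Output out, ChunkOp process_chunk)
    {
        const std::size_t buffer_size = 256;
        char buffer[buffer_size];

        while(begin != end)
        {
            std::size_t n = 0;
            while(begin != end && n < buffer_size)
                buffer[n++] = *begin++;
            out = process_chunk(buffer, buffer + n, out);
        }
        return out;
    }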
But this is really a "Dinkumware STL" problem, because other, normal libraries do it in big chunks as they should.
If your library relies on codecvt facets being efficient and the most popular implementation isn't, then it becomes your library's problem.

Inlining gives very good performance benefits.
Some people (performance freaks) would disagree with you; see the Linux kernel coding style.
But this is another story.
High-performance computing is my job, but whatever.
Mine too :-)
token_iterator, however, is designed to create tokens.
Note that if token_iterator returned a range, you wouldn't be able to write:
std::string s = *it;
Is that a problem? Why would you want to do such a thing in the first place? You already have the data, why copy it?
Why? Because that is what 99% of programmers would actually do if they want a specific token.
I see, but I don't always agree that the interface you suggest simplifies things; sometimes it may, but not always.
But this is my point of view.
I don't see how that kind of simplification can be subject to points of view.
Not requiring users to convert to your format is simpler than requiring them to do so; it's a fact.
If I wanted to use your library, it would be ironic if the first thing I had to do was write a wrapper around it. I thought it was a library that aimed at being a wrapper of ICU with a nice C++ interface.
It really depends on your usage pattern. Here is a small sample where I use collation in a real application to display an index of topics:

    // note: std::locale has
    // bool operator()(std::string const &,std::string const &) const
    typedef std::multimap<std::string,std::string,std::locale> mapping_type;
    mapping_type mapping(context().locale());

    cppdb::result r;
    cppdb::session sql(conn);
    r = sql << "SELECT slug,title FROM pages WHERE lang=?" << locale_name;
    while(r.next()) {
        std::string slug,title;
        r >> slug >> title;
        mapping.insert(std::pair<std::string,std::string>(title,slug));
    }

How easy and simple, isn't it? What would it look like with a ranges API?
You assume that the "whatever" facet is stateful, and it is not; it is const and stateless, unless I provide some specific state as an additional parameter.
No, I assumed it was stateless.
Your sample just took two generic template iterators that can't be passed to virtual functions, which can't know anything about them unless you use any_iterator or something like that. So at least as it was written, it was wrong.
Your algorithm seems wrong. If the output buffer is too small, simply don't process the data that would require a larger buffer.
How would you know what that size would be?
Or you could design the code so that the output buffer is always statically guaranteed to be big enough.
What is the guarantee? Do you have specific suggestions for case handling where the charset changes at runtime, the locale changes at runtime, and so on?
And don't copy the whole input range to a contiguous buffer (that would need to be heap-allocated), copy it in bits to an automatic buffer.
Some operations require the entire text and not a small part of it; you have written some Unicode algorithms, you know it. Now add a charset known only at runtime to make it even more fun.
But this is really a "Dinkumware STL" problem, because other, normal libraries do it in big chunks as they should.
If your library relies on codecvt facets being efficient and the most popular implementation isn't, then it becomes your library's problem.
No, it does not rely on them, but if the current implementation is already bad, how can I do better? Artyom
participants (8)
- Artyom
- Gevorg Voskanyan
- Gordon Woodhull
- Mathias Gaunard
- Max Sobolev
- Phil Endecott
- Steven Watanabe
- Stewart, Robert