
Dear All,

Here is a brief review of the proposed Boost.Locale library. I know little about the internationalisation issues that this library addresses, so I will focus just on the few areas where I do have some experience.

The documentation is generally of good quality but could be expanded in places, as noted below. I appreciate the introduction to std::locale at the beginning; having avoided std::locale for precisely the "common critical problems" mentioned, this introduction was useful. As others have said, a similar introduction to ICU, perhaps even deep links to appropriate parts of its documentation, and perhaps also to Unicode docs, would be useful. I found that it was not sufficiently easy to find which header each function is in; e.g. on the charset_handling.html page the name of the header is not mentioned, and if I click on e.g. from_utf I get to a reference page that also doesn't name the header.

Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.

Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to then compare the performance of a std::map<...,comparator> with a std::map<> storing transformed strings. Or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>. Ideally these collation functions would take character ranges of arbitrary types; however, I guess that the ICU API takes char*, making this unreasonable. Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions, right?

Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define these. At least linking to the definitions of the Unicode normalisation forms would also be useful; preferably, some explanation of how they should be used would be included. I can think of two recent case-conversion problems that I have had, both related to geographic names:

1. I had a list of North American names, in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)

2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know whether (a) the library tries to handle this sort of thing correctly, and (b) it can tell me whether the operation that I want to do is lossy or not. For example, say I have a set of place names that are a mixture of en_US and en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)

Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue!
However, I'm not enthusiastic about this:

    std::string utf8_string = to_utf<char>(latin1_string,"Latin1");

1. It should be possible for the source character set to be a compile-time constant symbol and a template parameter, e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.

2. As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an ICU limitation.

3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction where there are more things that can be unrepresentable, I might want to:
- Skip the character;
- Replace with '?';
- Replace with an unaccented version, or e.g. curly-quotes with plain-quotes;
- Throw an exception that aborts the whole operation;
- Do the conversion up to the point of difficulty and then indicate the problem.

4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters, so individual chunks are potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:

    Iconver<latin1,utf8> latin1_to_utf8;
    while (!end_of_input()) {
      output.append( latin1_to_utf8(input.get_chunk()) );
    }
    assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input

Also: how is the performance of this conversion? If it is being done by ICU, we should have a hook (i.e. a specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.

Here are my answers to the "normal questions":

- What is your evaluation of the design?

Generally it seems like a reasonable design for an ICU wrapper. If it were not an ICU wrapper, no doubt some things could be done better.

- What is your evaluation of the implementation?

I've looked at very little of the source. I decided to look at iconv_codepage.hpp since I have some experience with iconv. I note that a union is being used in what I believe is an undefined way:

    union { char first; uint16_t u16; uint32_t u32; } v;
    v.u16 = 1;
    if(v.first == 1) { return "UTF-16LE"; } else { return "UTF-16BE"; }

In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:

    #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
    // Previously __APPLE__ was included in this list; presumably they have
    // changed their headers. If you have an older system you may need to put
    // it back.
    #define ICONV_ARG2_IS_CONST
    #endif

This doesn't seem to be done here. There seems to be some use of reinterpret_cast which I think should be a static_cast via void*:

    virtual std::string convert(char_type const *ubegin, char_type const *uend)
    {
        ....
        char const *begin = reinterpret_cast<char const *>(ubegin);

"int err" should probably be "errno_t err". There seems to be a "magic constant" of 10 to allow for the difference in size between the input and the output. Some comments would help, for example to describe the meaning of the return values and error codes from iconv(). Based on that limited look, my view is that the code probably works, but it could be improved in many ways.

- What is your evaluation of the documentation?

Good enough.
- What is your evaluation of the potential usefulness of the library?

Useful and overdue.

- Did you try to use the library? With what compiler? Did you have any problems?

No, I've not compiled anything.

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

3 hours reading the docs and looking at a tiny bit of the implementation.

- Are you knowledgeable about the problem domain?

Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed". But then I looked at the source and noted what I think are well-known issues like abuse of union and reinterpret_cast, which have lowered my opinion a bit. And then I've seen some other reviews that have looked at the areas I've not considered (i.e. message translation) and they have been quite critical. Based on all of that, my view has shifted to "no, but this is worthwhile functionality, so please come back when you've resolved some of these concerns". Regards, Phil.

AMDG On 04/16/2011 03:02 PM, Phil Endecott wrote:
I note that a union is being used in what I believe is an undefined way:
    union { char first; uint16_t u16; uint32_t u32; } v;
    v.u16 = 1;
    if(v.first == 1) { return "UTF-16LE"; } else { return "UTF-16BE"; }
char is a special case (3.10): If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined: -- the dynamic type of an object ... -- a char or unsigned char type
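(For reference, the same endianness check can be written with memcpy, which sidesteps the aliasing question entirely. A minimal sketch, not the library's code:)

    #include <stdint.h>
    #include <cstring>

    // Inspect the lowest-addressed byte of a 16-bit value; memcpy into an
    // unsigned char is well-defined under any reading of the aliasing rules.
    inline const char *native_utf16_name()
    {
        uint16_t v = 1;
        unsigned char first;
        std::memcpy(&first, &v, 1);
        return first == 1 ? "UTF-16LE" : "UTF-16BE";
    }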
There seems to be some use of reinterpret_cast which I think should be a static_cast via void*:
    virtual std::string convert(char_type const *ubegin, char_type const *uend)
    {
        ....
        char const *begin = reinterpret_cast<char const *>(ubegin);
Under the new standard reinterpret_cast is correct. Under the current standard it's implementation defined, and correct for all existing implementations. In Christ, Steven Watanabe

AMDG On 04/16/2011 03:02 PM, Phil Endecott wrote:
In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:
    #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
    // Previously __APPLE__ was included in this list; presumably they have
    // changed their headers. If you have an older system you may need to put
    // it back.
    #define ICONV_ARG2_IS_CONST
    #endif
This doesn't seem to be done here.
Actually, the library handles it in a smarter way: it uses overload resolution to detect the type of ::iconv and adds the cast if needed. In Christ, Steven Watanabe
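(For readers curious what that looks like, here is a minimal sketch of the overload-resolution technique; it is simplified, and the library's actual code differs:)

    #include <iconv.h>
    #include <cstddef>

    // Two overloads, one per possible signature of ::iconv. Overload
    // resolution picks whichever one matches the platform's declaration,
    // so the const_cast is applied only where iconv() takes char**.
    inline size_t call_iconv(size_t (*conv)(iconv_t, char **, size_t *, char **, size_t *),
                             iconv_t d, char const **in, size_t *insize,
                             char **out, size_t *outsize)
    {
        return conv(d, const_cast<char **>(in), insize, out, outsize);
    }

    inline size_t call_iconv(size_t (*conv)(iconv_t, char const **, size_t *, char **, size_t *),
                             iconv_t d, char const **in, size_t *insize,
                             char **out, size_t *outsize)
    {
        return conv(d, in, insize, out, outsize);
    }

    inline size_t do_iconv(iconv_t d, char const **in, size_t *insize,
                           char **out, size_t *outsize)
    {
        return call_iconv(::iconv, d, in, insize, out, outsize);
    }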

From: Phil Endecott <spam_from_boost_dev@chezphil.org>
As others have said, a similar introduction to icu, perhaps even deep links to appropriate parts of its documentation, and perhaps also to Unicode docs would be useful.
The point is that Boost.Locale uses ICU as one of its back-ends (even though it is the primary and the best one), so I prefer to decouple Boost.Locale from ICU to some degree, so that users see that Boost.Locale is not just a convenience wrapper around ICU but more than that. Also, Unicode is a very, very large topic, and basically a Google search for ICU and Unicode will give you the best links...
I found that it was not sufficiently easy to find which header each function is in, e.g. on the charset_handling.html page the name of the header is not mentioned; if I click on e.g. from_utf I get to a reference page that also doesn't name the header.
Good point. I'll add it. I'm a little surprised about the missing header name in the reference docs, because that is what Doxygen usually does.
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings, there are even more locales, and what is considered common? Also, not all encodings are supported by all backends. For example, iconv will not handle some Windows codepages, and MultiByteTo... will not handle some other encodings. Locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common? Also, the level of support by different backends may depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale. So should there be hard-coded constants for locales and encodings? I doubt it.

However, I think there may be a useful conversion like:

    std::string utf8_str = utf_to_utf<char>(L"Hello"); // UTF-16/UTF-32 to UTF-8

This was raised by other reviewers, and I am actually considering adding something like that to the code.

Also, a small note about mistyped character encodings: encodings are compared by letters and digits only, and the rest of the characters are ignored. So from the Boost.Locale and ICU point of view, UTF-8, UTF8, Utf_8 and UTF#$%^#$8 are all the same.
Collation: I would be interested to know how fast these comparisons are compared to regular std::string::operator<. If I correctly understand what transform does (more docs needed), it would be interesting to then compare the performance of std::map<...,comparator> with a std::map<> storing transformed strings. Or similarly, to compare std::sort using the comparator with std::sort of a pair<transformed string, original string>.
Transform is something very common for collation purposes; every collation API provides it (ICU, POSIX, C and the Win32 API). I haven't done any specific benchmarks, but std::string::operator< is much faster, as collation normally involves Unicode normalization and then lookups in a locale-specific character database. I doubt that std::sort of transformed strings would be much faster, as the transform costs as well. But if you frequently search for some string in a database, then it has a significant performance impact: you transform the string only once, and then you just run strcmp, which is much faster.
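(To illustrate the "store transformed keys" idea: a sketch assuming a locale built by boost::locale::generator. boost::locale's collator derives from std::collate, so the standard facet interface shown here applies.)

    #include <boost/locale.hpp>
    #include <map>
    #include <string>

    int main()
    {
        boost::locale::generator gen;
        std::locale loc = gen("de_DE.UTF-8");

        std::collate<char> const &col = std::use_facet<std::collate<char> >(loc);

        std::string s = "Gr\xC3\xBC\xC3\x9F" "e"; // "Gruesse" with umlaut/eszett, UTF-8
        // transform() yields a sort key: comparing keys with plain operator<
        // is equivalent to collating the original strings, so the expensive
        // collation work is paid once per string instead of per comparison.
        std::string key = col.transform(s.data(), s.data() + s.size());

        std::map<std::string, std::string> by_key; // keyed by sort keys
        by_key[key] = s;
        return 0;
    }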
Ideally these collation functions would take character ranges of arbitrary types, however I guess that the icu API takes char* making this unreasonable. Apparently there is no option for a NULL-terminated char*, unlike the case conversion functions, right?
Good point. I'll take a look at this.
Case conversions: I don't know what "title case" is or what "fold case" means; the docs should define this. At least linking to the definitions of the Unicode normalisation forms would also be useful; preferably, some explanation of how they should be used would be included.
Noted. Good point.
I can think of two recent case-conversion problems that I have had, both related to geographic names:
1. I had a list of North American names, in "UPPER CASE". I decided to convert them to "Initial Capitals" for better appearance, but later realised that this was broken for accented names in Quebec. I could imagine some sort of "case conversion is lossy" property of a locale that would detect this; does anything like that exist? (I guess not.)
Actually, any case conversion is "lossy" because it does not know the original case. However, the good news is that most code points in the Unicode database have no case at all. Roughly, only European languages actually have case; Semitic, East Asian and many other scripts have no case at all. And BTW, "Initial Capitals" is actually title case.
2. In a similar case I found that " 's " was being wrongly capitalised, e.g. "St. John'S Street". (This was with PostgreSQL's case conversion functions.) Again it would be useful to know whether (a) does the library try to handle this sort of thing correctly, and (b) can it tell me whether the operation that I want to do is lossy or not? For example, say I have a set of place names that are a mixture of en_US and en_CA and fr_CA but I don't know which; can I perform a conversion that works because those locales all have the same rules for e.g. ' and -? (I guess not.)
Case handling actually follows the Unicode algorithms, and it behaves the same for almost all locales, with a few exceptions: Azeri and Turkish, where "i" is converted to "I with dot" and "I" is converted to "i without dot", and AFAIR there are one or two other locale-dependent corner cases. In any case, case handling is always lossy, though it depends on how you define it. To answer (a): yes, it handles this correctly using the ICU backend. If you ask whether it does everything right, then no: different languages may have different capitalization rules, but as far as I can see the Unicode algorithm and the current ICU library ignore them. Still, it is the best we have.
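(For illustration, a sketch of title casing with the documented boost::locale::to_title free function; the locale name is just an example. With the ICU backend the Unicode algorithm preserves accented characters, unlike naive per-byte "initial capitals".)

    #include <boost/locale.hpp>
    #include <iostream>

    int main()
    {
        boost::locale::generator gen;
        std::locale loc = gen("fr_CA.UTF-8");

        // "riviere-du-loup" with a grave accent, in UTF-8
        std::cout << boost::locale::to_title("rivi\xC3\xA8re-du-loup", loc) << "\n";
        return 0;
    }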
Character set conversion: I think it's great that Boost might finally get character set conversion - it's overdue! However, I'm not enthusiastic about this:
std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
1. It should be possible for the source character set to be a compile-time constant symbol, and a template parameter e.g. conv<latin1,utf8>(str). As it is, if I want to convert just a few bytes, it will surely spend more time decoding the charset name string than actually converting.
See the notes above about constants. Also, Boost.Locale does not use its own conversion tables; it relies on 3rd-party APIs like ICU, iconv or the Win32 API. So one way or another you will spend some time calling uconv_open or iconv_open or something else. If you want to do stream conversions, or convert a few characters at a time, you can use the codecvt API; there is actually an example of using codecvt in the docs.
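(For example, a sketch of the codecvt route: the locale generated by boost::locale::generator carries a std::codecvt facet for the locale's encoding, so an imbued wide stream converts on the fly. The file name is an example.)

    #include <boost/locale.hpp>
    #include <fstream>
    #include <string>

    int main()
    {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::wifstream in;
        in.imbue(loc);        // must imbue before open()
        in.open("data.txt");  // bytes are decoded UTF-8 -> wchar_t on the fly
        std::wstring line;
        std::getline(in, line);
        return 0;
    }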
2. As above, ideally it should take a range of arbitrary type and output to a back_insert_iterator; presumably this is an icu limitation.
Actually, an iterator is a bad idiom for charset conversion; it is more an operation on a stream, and this is what codecvt is usable for. Note that not all APIs support stream operations - most noticeably the Win32 MultiByte... API and iconv. I had a very hard time supporting stream conversions, as the APIs were not too friendly in reality. Codepage conversion is generally much more stateful than most of us think; for example, iconv may compose several characters into a single code point.
3. The error handling seems to be basic - it's "all or nothing". Considering from_utf, since this is the direction where there are more things that can be unrepresentable, I might want to:
- Skip the character;
- Replace with '?';
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
- Throw an exception that aborts the whole operation;
- Do the conversion up to the point of difficulty and then indicate the problem.
A few points: Skip and Throw are supported.
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
I'm not aware of any charset API that actually supports such things.
- Replace with '?';
I think I noted this before: it is problematic with the non-ICU backends, as not all of them support such operations, and more than that, some do not even provide a way to know where the operation failed (Win32 API).
- Do the conversion up to the point of difficulty and then indicate the problem.
This is a problem as well, as not all APIs provide this.
4. There's also the case where the input is chunked, and the chunk-breaks might fall within multi-byte characters; so individual chunks are potentially malformed while the overall stream is OK. I think the best way to handle this is with a stateful functor to do the conversion; I wrote one using iconv, like this:

    Iconver<latin1,utf8> latin1_to_utf8;
    while (!end_of_input()) {
      output.append( latin1_to_utf8(input.get_chunk()) );
    }
    assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end of input
There is a codecvt for this, as I mentioned above. Also, codepage conversion is not the primary goal of Boost.Locale, but rather a utility task.
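(To sketch how the codecvt route handles Phil's chunked case, assuming a locale whose codecvt facet decodes the external encoding: the std::mbstate_t carries a partial character across chunk boundaries. The caller value-initializes one state and passes it to every call; a non-initial state at end of input plays the role of Phil's final assert.)

    #include <locale>
    #include <stdexcept>
    #include <string>

    std::wstring decode_chunk(std::locale const &loc, std::mbstate_t &state,
                              std::string const &chunk)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
        cvt_t const &cvt = std::use_facet<cvt_t>(loc);

        if (chunk.empty())
            return std::wstring();

        std::wstring out(chunk.size(), L'\0'); // decoding never grows the count
        char const *from = chunk.data();
        char const *from_end = from + chunk.size();
        char const *from_next = from;
        wchar_t *to = &out[0];
        wchar_t *to_next = to;

        // 'partial' is the expected result when a chunk ends inside a
        // multi-byte character; the tail is remembered in 'state' and
        // completed by the next call. 'error' means genuinely bad input.
        std::codecvt_base::result r =
            cvt.in(state, from, from_end, from_next,
                   to, to + out.size(), to_next);
        if (r == std::codecvt_base::error)
            throw std::runtime_error("invalid byte sequence");

        out.resize(to_next - to);
        return out;
    }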
Also: how is the performance of this conversion? If it is being done by icu, we should have a hook (i.e. specialisation, hence my suggestion of template parameters for the character sets) so that we can provide optimised versions of important conversions.
I have found that iconv and ICU perform similarly. The performance of codecvt for UTF-8 and single-byte character sets is very good; for multi-byte encodings like Shift-JIS it is not so good, but that is rather a limitation of the std::codecvt design.
I note that a union is being used in what I believe is an undefined way:
    union { char first; uint16_t u16; uint32_t u32; } v;
    v.u16 = 1;
    if(v.first == 1) { return "UTF-16LE"; } else { return "UTF-16BE"; }
No problem there (see Steven's response).
In my own iconv wrapper code I need a const_cast because the POSIX spec misses a const on the second argument to iconv(); some platforms correct this, but not all:
Actually, POSIX specifies restrict, which is a little stronger than const.
    #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__sun__)
    // Previously __APPLE__ was included in this list; presumably they have
    // changed their headers. If you have an older system you may need to put
    // it back.
    #define ICONV_ARG2_IS_CONST
    #endif
That is wrong, as it may depend on the version of iconv... Boost.Locale does it in a smarter way; see Steven's response.
This doesn't seem to be done here.
There seems to be some use of reinterpret_cast which I think should be a static_cast via void*:
    virtual std::string convert(char_type const *ubegin, char_type const *uend)
    {
        ....
        char const *begin = reinterpret_cast<char const *>(ubegin);
No problem there as well; see Steven's response.
"int err" should probably be "errno_t err".
No, see http://pubs.opengroup.org/onlinepubs/009695399/functions/errno.html - errno is defined as int by the POSIX specifications.
There seems to be a "magic constant" of 10 to allow for the difference in size between the input and the output.
So? It is just a guess at how much to extend the buffer; it has no other meaning.
Some comments would help. For example, to describe the meaning of the return values and error codes from iconv().
man iconv? Also, EINvalidVALue, E2ooBIG and EILlegalSEQuence are quite self-explanatory, at least for those who are familiar with Unix programming.
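(For what it's worth, here is the kind of commented loop Phil asked for: a sketch around ::iconv, not the library's code. The second argument's constness varies by platform, as discussed above; this sketch assumes the char** form.)

    #include <iconv.h>
    #include <cerrno>
    #include <stdexcept>
    #include <string>

    std::string convert_all(iconv_t cd, std::string in)
    {
        std::string out;
        if (in.empty())
            return out;

        char *inp = &in[0];
        size_t inleft = in.size();
        char buf[256];

        while (inleft > 0) {
            char *outp = buf;
            size_t outleft = sizeof buf;
            // Returns (size_t)-1 on failure and sets errno; otherwise the
            // number of characters converted in a non-reversible way.
            size_t n = ::iconv(cd, &inp, &inleft, &outp, &outleft);
            out.append(buf, outp - buf); // flush whatever was produced
            if (n == (size_t)-1) {
                switch (errno) {
                case E2BIG:  break; // output buffer full: loop again with a fresh buffer
                case EINVAL: throw std::runtime_error("incomplete sequence at end of input");
                case EILSEQ: throw std::runtime_error("invalid multibyte sequence");
                default:     throw std::runtime_error("iconv failed");
                }
            }
        }
        return out;
    }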
- What is your evaluation of the documentation?
Good enough.
- What is your evaluation of the potential usefulness of the library?
Useful and overdue.
- Did you try to use the library? With what compiler? Did you have any problems?
No, I've not compiled anything.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
3 hours reading the docs and looking at a tiny bit of the implementation.
- Are you knowledgeable about the problem domain?
Not beyond the areas that I have discussed above.
Please explicitly state in your review whether the library should be accepted.
It's borderline. I was going to say "yes, but please consider updating at least the charset conversion functions in the way that I proposed".
Some things, like the utf-to-utf conversion, will be added; however, I have doubts about the substitution for the reasons I explained above.
But then I looked at the source and noted what I think are well-known issues like abuse of union and reinterpret_cast, which have lowered my opinion a bit.
There was nothing wrong, and Steven has answered most of the points.
Regards, Phil.
Thanks for the review. Artyom

Hi Artyom,

I don't have time to answer in much detail, but I'll just re-emphasise that I would very much like to see a simple and fast character set conversion like this:

    string l1_name = charset_conv<utf8,latin1>(utf8_name);

In my own charset conversion code, I generated a large enum of all the character set names and aliases based on the IANA registry (http://www.iana.org/assignments/character-sets). I then have a fallback converter using iconv, and fast specialisations for cases like utf-to/from-iso8859. I would happily contribute those specialisations if you had a way to plug them in. Currently you offer either conversion functions that take character set names as strings, which I dislike because of the overhead of the run-time lookup that is required (especially for short strings), or stream processing, which I dislike because I don't want the overhead (runtime and lines-of-code) of copying to and from stringstreams.

Regards, Phil.
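(To make the proposal concrete, a hypothetical sketch: none of these names are Boost.Locale API. It shows character sets as tag types with a specialization hook, using the latin1-to-utf8 direction as the fast case; Latin-1 code units map 1:1 to U+0000..U+00FF.)

    #include <string>

    struct utf8 {};
    struct latin1 {};

    // Primary template: a real library would fall back to iconv/ICU here.
    template <class From, class To>
    struct charset_converter;

    // Fast specialisation for Latin-1 -> UTF-8.
    template <>
    struct charset_converter<latin1, utf8> {
        static std::string convert(std::string const &in)
        {
            std::string out;
            out.reserve(in.size());
            for (std::string::size_type i = 0; i < in.size(); ++i) {
                unsigned char c = static_cast<unsigned char>(in[i]);
                if (c < 0x80) {
                    out += static_cast<char>(c);
                } else { // two-byte UTF-8 sequence for U+0080..U+00FF
                    out += static_cast<char>(0xC0 | (c >> 6));
                    out += static_cast<char>(0x80 | (c & 0x3F));
                }
            }
            return out;
        }
    };

    template <class From, class To>
    std::string charset_conv(std::string const &in)
    {
        return charset_converter<From, To>::convert(in);
    }

    // Usage: std::string u = charset_conv<latin1, utf8>(l1_string);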

AMDG On 04/17/2011 12:25 PM, Phil Endecott wrote:
I don't have time to answer in much detail, but I'll just re-emphasise that I would very much like to see a simple and fast character set conversion like this:
string l1_name = charset_conv<utf8,latin1>(utf8_name);
In my own charset conversion code, I generated a large enum of all the character set names and aliases based on the IANA registry (http://www.iana.org/assignments/character-sets). I then have a fallback converter using iconv, and fast specialisations for cases like utf-to/from-iso8859. I would happily contribute those specialisations if you had a way to plug them in. Currently you offer either conversion functions that take character set names as strings, which I dislike because of the overhead of the run-time lookup that is required (especially for short strings), or stream processing, which I dislike because I don't want the overhead (runtime and lines-of-code) of copying to and from stringstreams.
Of course this only works if the character set is known at compile time. Boost.Locale is all about runtime configuration. Thus, specifying the character set as a string rather than a template parameter is perfectly reasonable. In Christ, Steven Watanabe

Artyom wrote:
From: Phil Endecott <spam_from_boost_dev@chezphil.org>
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings, there are even more locales, and what is considered common?
Also, not all encodings are supported by all backends. For example, iconv will not handle some Windows codepages, and MultiByteTo... will not handle some other encodings.
Locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common?
Also, the level of support by different backends may depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale.
So should there be hard coded constants for locales and encodings?
If there is any runtime cost associated with the string representation, could you use a type to represent the encoding? The idea being that one could instantiate the encoding object from a string and the constructor could throw an exception to indicate an unsupported encoding. Then, one can reuse the encoding object thereafter with no further runtime cost. Thus, APIs would expect an encoding object, not a string, but if the encoding constructor is not explicit, the effect would be the same.

_____ Rob Stewart robert.stewart@sig.com
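(A hypothetical sketch of that suggestion; these names are illustrative, not Boost.Locale API. The normalization mirrors the comparison rule Artyom described, where "UTF-8" and "utf_8" are equal; the supported-encoding check is a placeholder for a backend lookup.)

    #include <cctype>
    #include <stdexcept>
    #include <string>

    class encoding {
    public:
        encoding(std::string const &name)  // intentionally not explicit
            : name_(normalize(name))
        {
            if (!is_supported(name_))
                throw std::runtime_error("unsupported encoding: " + name);
        }
        std::string const &name() const { return name_; }
    private:
        // Keep letters and digits only, upper-cased.
        static std::string normalize(std::string const &s)
        {
            std::string r;
            for (std::string::size_type i = 0; i < s.size(); ++i) {
                unsigned char c = static_cast<unsigned char>(s[i]);
                if (std::isalnum(c))
                    r += static_cast<char>(std::toupper(c));
            }
            return r;
        }
        static bool is_supported(std::string const &s)
        {
            // Placeholder: a real implementation would ask the backend.
            return s == "UTF8" || s == "LATIN1" || s == "ISO88591";
        }
        std::string name_;
    };

    // Users can define their own constants:
    //     encoding const utf8("UTF-8");
    // APIs taking 'encoding' still accept plain strings via the implicit
    // constructor, but validation happens exactly once, in one place.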

From: "Stewart, Robert" <Robert.Stewart@sig.com>
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings, there are even more locales, and what is considered common?
Also, not all encodings are supported by all backends. For example, iconv will not handle some Windows codepages, and MultiByteTo... will not handle some other encodings.
Locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common?
Also, the level of support by different backends may depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale.
So should there be hard coded constants for locales and encodings?
If there is any runtime cost associated with the string representation, could you use a type to represent the encoding? The idea being that one could instantiate the encoding object from a string and the constructor could throw an exception to indicate an unsupported encoding. Then, one can reuse the encoding object thereafter with no further runtime cost. Thus, APIs would expect an encoding object, not a string, but if the encoding constructor is not explicit, the effect would be the same.
I'm not sure I fully understand you, but... Actually, the encoding is constant for each locale object, and the library usually knows how to handle it efficiently. For example, in the ICU backend there is a special class that handles conversions from the locale's encoding to UTF-16 - ICU's internal encoding. In any case, the best practice is to use one encoding across your whole code base (UTF-8), and the library is optimized especially for it. Artyom

Artyom wrote:
From: "Stewart, Robert" <Robert.Stewart@sig.com>
Use of strings rather than symbols: there are a few places where the library uses strings such as "UTF-8" or "de_DE.UTF8" rather than symbols like utf_8 or de_DE_UTF8. This use of strings turns what could be compile-time errors into run-time errors. Mis-typing e.g. "UTF_8" or "de-DE" is too easy. Perhaps in some cases the domain of the parameter is not known at compile time, but even in those cases it should be possible to provide symbols for the most common choices.
What most common choices? There are a few dozen different character encodings, there are even more locales, and what is considered common?
Also, not all encodings are supported by all backends. For example, iconv will not handle some Windows codepages, and MultiByteTo... will not handle some other encodings.
Locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common? Is zh_CN.UTF-8 common?
Also, the level of support by different backends may depend on the actual OS configuration - if some locale is not configured on Windows or Linux, the non-ICU backends fall back to the C/POSIX locale.
So should there be hard coded constants for locales and encodings?
If there is any runtime cost associated with the string representation, could you use a type to represent the encoding? The idea being that one could instantiate the encoding object from a string and the constructor could throw an exception to indicate an unsupported encoding. Then, one can reuse the encoding object thereafter with no further runtime cost.
Thus, APIs would expect an encoding object, not a string, but if the encoding constructor is not explicit, the effect would be the same.
I'm not sure I fully understand you but...
There was a reference to encoding constants and to strings that describe encodings. There was a desire for more constants to represent the "common" encodings. You noted the difficulty of determining which are the common encodings, in order to know which constants to provide. My suggestion was, if it fits the library, to create a class representing the encoding. The constructor would accept a string and could validate that the string represents a valid encoding. Your APIs that need an encoding can then expect an encoding object rather than a string to describe the encoding.

Rereading the section at the top, above, I see that the OP wanted to get compile-time errors. I presume that is just a request that your APIs accept encoding objects rather than strings as arguments, but that doesn't address validation of strings when that flexibility is needed, since such validation must be done at runtime. With my suggestion, encoding validation is done in just one place: the encoding class' constructor. Conversely, the rest of your API need do no validation (presuming it does so now, given that strings are accepted in at least some places).

An encoding class also means that library users can create their own encoding constants by simply creating const instances of the encoding class to use throughout an application, relieving you of the burden of creating the "right" set of "common" encoding constants. The last point I was making is that, if the encoding class' constructor is not explicit, then users can call your APIs with strings and the compiler will implicitly instantiate an encoding object to pass to the called functions.
Actually, the encoding is constant for each locale object, and the library usually knows how to handle it efficiently.
I'm not sure how your locale objects factor into this discussion about encoding constants, but I'll leave that for you to determine. (I have never done i18n or l10n, so I haven't even read the documentation to understand how these may play together. I just thought I might be able to suggest something to address the OP's concern. However, my ignorance may be causing me to speak out of turn. If so, I'm sorry for the noise.)
For example, in the ICU backend there is a special class that handles conversions from the locale's encoding to UTF-16 - ICU's internal encoding.
In any case, the best practice is to use one encoding across your whole code base (UTF-8), and the library is optimized especially for it.
Perhaps this is a terminology thing. Here you revert to "encoding" after your earlier mention of "locale." Perhaps one simply installs a global locale and that establishes a global encoding, and no other encoding is referenced in the normal case. However, that seems at odds with the existence of any encoding constants, so I'm left confused. (Don't worry about easing my confusion; I just mean that it still seems like encoding constants are warranted and that my suggestion might be useful as a consequence. If not, just ignore me and get on with more important things!)

_____ Rob Stewart robert.stewart@sig.com

Rob Stewart wrote:
So should there be hard coded constants for locales and encodings?
If there is any runtime cost associated with the string representation, could you use a type to represent the encoding? [snip]
Thus, APIs would expect an encoding object, not a string, but if the encoding constructor is not explicit, the effect would be the same.
I'm not sure I fully understand you but...
[snip] My suggestion was, if it fits the library, to create a class representing the encoding. With my suggestion, encoding validation is done in just one place: the encoding class' constructor.
In Boost.Locale's scope it works this way:

1. You create a locale:

    boost::locale::generator gen;
    std::locale myloc = gen("de_DE.ISO-8859-15"); // Latin-1 with Euro sign

gen creates the numerous objects that handle operations: dictionaries, collators and so on.

2. You use this locale, and it already has its own encoding:

    std::string a,b;
    // compare if a < b when a,b are in ISO-8859-15
    bool less = myloc(a,b);
    if(less) {
        // std::cout uses ISO-8859-15
        std::cout.imbue(myloc);
        // messages formatted in ISO-8859-15
        std::cout << format(translate("{1} sorted before {2}")) % a % b;
        // gettext creates a message in ISO-8859-15 according to myloc's encoding
        MessageBox(gettext("All sorted out",myloc).c_str());
    }

So basically you almost never use an encoding name directly, unless you do charset conversion from some foreign character set, because the encoding information is kept together with the std::locale object and validated during generation of the locale. That is why there is no need for a specific encoding object in Boost.Locale's scope. Artyom
participants (4)
- Artyom
- Phil Endecott
- Steven Watanabe
- Stewart, Robert