Re: [boost] [string] Realistic API proposal

On 2011-01-28 20:12, Joe Mucchiello <jmucchiello@yahoo.com> wrote:

> // conversion for Windows API
> std::vector<wchar_t> vec;
> vec.resize(count_codepoints<utf8>(mystring.begin(), mystring.end()));
> convert<utf8,utf16>(mystring.begin(), mystring.end(), vec.begin());
I spy with my little eye a potential crash waiting to happen. Code points != code units. vec has room for N code *units*, but 2*N code units may be written to it if mystring contains non-BMP characters. "Corrected" code:

    std::vector<wchar_t> vec;
    vec.resize(count_codeunits<wchar_encoding>(mystring.begin(), mystring.end()));
    convert<wchar_encoding>(mystring.begin(), mystring.end(), vec.begin());

I think a lot of these potential crashes could be prevented if the iterator of the new string type (chain, text, tier, yarn) would only expose (const) code points. Actual code units would be hidden, and only accessed using a facade/adapter view/iterator:

    auto u8v = make_view<utf8_encoding>(mystring);
    auto u16v = make_view<utf16_encoding>(mystring);

    for (auto codepoint : mystring) {...}
    for (auto u8codeunit : u8v) {...}
    for (auto u16codeunit : u16v) {...}

I also think there isn't a reason that the new string type *has* to be UTF-8 internally. It could be UTF-16, UTF-32, SCSU, or CESU-8 internally for that matter. Making a view from the internal encoding to an external encoding should be a no-op when both encodings are the same.

Regards,
Anders Dalvander

-- 
WWFSMD?
participants (1)
-
Anders Dalvander