
On Fri, 22 Oct 2004 12:46:00 -0400 (EDT), Rob Stewart <stewart@sig.com> wrote:
From: Rogier van Dalen <rogiervd@gmail.com>
unicode::string should take a unicode::character for appending. A unicode::character object may be constructed from a single codepoint, which becomes its base character. If this codepoint is invalid, the constructor should throw. If the codepoint is a combining mark, it should also throw. unicode::correct() should convert an invalid codepoint into U+FFFD, and when given a combining mark, it should use U+0020 SPACE as the base character.
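
A rough sketch of the behaviour described above, not the proposed library itself. The namespace unicode_sketch and the names is_valid, is_combining_mark, base, marks and add_mark are hypothetical and exist only for illustration. The combining-mark test is a deliberate simplification that recognises only the Combining Diacritical Marks block (U+0300..U+036F); a real implementation would consult the Unicode Character Database.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace unicode_sketch {  // hypothetical namespace, illustration only

using codepoint = std::uint32_t;

// A codepoint is valid if it is at most U+10FFFF and not a surrogate.
inline bool is_valid(codepoint cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// ASSUMPTION: simplified check covering only U+0300..U+036F.
inline bool is_combining_mark(codepoint cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

class character {
public:
    // Strict construction: throws instead of silently repairing the input.
    explicit character(codepoint base) : base_(base) {
        if (!is_valid(base))
            throw std::invalid_argument("invalid codepoint");
        if (is_combining_mark(base))
            throw std::invalid_argument("combining mark used as base character");
    }

    // Attach a combining mark to this character's base.
    void add_mark(codepoint mark) {
        if (!is_combining_mark(mark))
            throw std::invalid_argument("not a combining mark");
        marks_.push_back(mark);
    }

    codepoint base() const { return base_; }
    const std::vector<codepoint>& marks() const { return marks_; }

private:
    codepoint base_;
    std::vector<codepoint> marks_;
};

// Lossy repair: an invalid codepoint becomes U+FFFD, and a lone combining
// mark is attached to U+0020 SPACE as its base character.
inline character correct(codepoint cp) {
    if (!is_valid(cp))
        return character(0xFFFD);
    if (is_combining_mark(cp)) {
        character c(0x0020);
        c.add_mark(cp);
        return c;
    }
    return character(cp);
}

}  // namespace unicode_sketch
```

In this sketch, unicode_sketch::character(0x0301) throws, while unicode_sketch::correct(0x0301) yields a character with base U+0020 and one mark. Keeping the two entry points separate means the caller has to ask for the lossy path explicitly.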
Why not have unicode::character's ctor invoke unicode::correct()?
unicode::correct() replaces every encoding error in the input with a replacement character. This loses information, and the loss is not recoverable. The combining character bit is only slightly better. When I proposed a policy, I called it workaround_encoding_error; maybe we need a better name than "correct". I agree with Peter Dimov, however, that the default should be to throw rather than to throw away information and pretend nothing happened.

Regards,
Rogier