
On Mon, Jan 24, 2011 at 9:48 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
Consider the following:
    template <class String>
    void needs_utf8(String const & s) {
        view<utf8_encoded> utf8_string(s);
        if (!valid(utf8_string))
            throw invalid_string("I need a UTF-8 string.");
    }

    template <class String>
    void needs_utf16(String const & s) {
        view<utf16_encoded> utf16_string(s);
        if (!valid(utf16_string))
            throw invalid_string("I need a UTF-16 string.");
    }
I would say you have four choices when implementing `view` and `valid`:
1. view converts, and valid is a no-op.
2. view doesn't convert, and valid does the validation on the underlying string.
3. view converts, and valid does the validation on the underlying string.
4. view doesn't convert, but valid checks the validation on the view.
I'm leaning towards #2.
#1 and #3 would be wasteful for cases when the string is already known to have the desired encoding, so they are non-starters.
I'm not sure I understand the distinction or reason for the distinction you imply by #2 versus #4. #2's wording suggests that you mean valid() accesses the underlying string through the view, but why is that better or worse than just using the view as in #4?
In #2, you can have valid be implemented like this:

    template <template <class> class View, class Encoding>
    bool valid(View<Encoding> const & encoded_view) {
        if (!valid_length(encoded_view.raw(), Encoding())) // use static tag-dispatch
            return false;
        // ... do other validity checking based on just the raw data,
        // like BOM checking, a character-by-character check for
        // invalid characters not within range, consider Base64
        // and/or hex encodings aside from just Unicode, etc.
        return true;
    }

Which you really would want to have for performance reasons. Case in point: if the underlying string doesn't have a valid length for UTF-16 or UTF-32, you get a win by just doing some math on the length to check for validity. Some libraries even make these parts compile to vectorized code, use OpenMP, or even do GPU-assisted validation.

For #4, though, validation would be unnecessarily limited by the interface provided by the view. That may mean the only way to write a validator is to get an iterator from the view and essentially wait for a dereference of that iterator to fail through some mechanism, maybe a throw on dereference or something like that.

By doing it through the #2 approach you can write a general validation routine that can even be specialized for a specific encoding. You get the tag-dispatch goodness whenever, for example, you have a specialized validation routine for a given encoding, and you have some room for partial/full specialization, etc.

--
Dean Michael Berris
about.me/deanberris
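
For illustration, the length-check part of approach #2 might be sketched roughly as follows. This is only a sketch under assumptions: the `view` class, the three tag structs and the `valid_length` overloads are hypothetical stand-ins rather than an existing library API, and the raw data is assumed to be bytes held in a std::string. Only the length test is implemented, so a true result means "not rejected by length" and nothing more.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding tags, named after the ones used in this thread.
struct utf8_encoded {};
struct utf16_encoded {};
struct utf32_encoded {};

// Hypothetical non-converting view: it keeps a reference to the
// underlying bytes and exposes them unchanged through raw().
// The referenced string must outlive the view.
template <class Encoding>
class view {
public:
    explicit view(std::string const & s) : raw_(s) {}
    std::string const & raw() const { return raw_; }
private:
    std::string const & raw_;
};

// Length checks, selected at compile time by the tag argument.
inline bool valid_length(std::string const &, utf8_encoded) {
    return true; // one-byte code units: length alone rules nothing out
}
inline bool valid_length(std::string const & s, utf16_encoded) {
    return s.size() % 2 == 0; // two-byte code units
}
inline bool valid_length(std::string const & s, utf32_encoded) {
    return s.size() % 4 == 0; // four-byte code units
}

template <template <class> class View, class Encoding>
bool valid(View<Encoding> const & encoded_view) {
    if (!valid_length(encoded_view.raw(), Encoding())) // static tag-dispatch
        return false;
    // A real validator would continue with per-encoding checks on the
    // raw data here; this sketch stops after the length test.
    return true;
}
```

With these definitions, a 3-byte buffer viewed as UTF-16 is rejected by arithmetic on its size alone, without touching an iterator or decoding a single character.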