[locale] Normalization and transformation

Hi,

The following is a sample program exercising the text conversion features of Boost.Locale (Boost version 1.48.0 on Linux Fedora 17), as seen in the documentation (http://unicode.org/reports/tr15/#Norm_Forms and http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/index.html):
___________________________________________________________
#include <iostream>
#include <string>
#include <boost/locale.hpp>

int main() {
  // Get the global localisation backend
  boost::locale::localization_backend_manager locBEMgr =
    boost::locale::localization_backend_manager::global();

  // Select ICU backend as default
  locBEMgr.select ("icu");

  // Set this backend globally
  boost::locale::localization_backend_manager::global (locBEMgr);

  // Create a generator that uses this backend.
  boost::locale::generator locGen (locBEMgr);

  // Generate the system default locale and install it as the global locale
  std::locale::global (locGen (""));

  // Test string with accents (French word for "side")
  std::string sideFR ("Côté");

  // Test the Boost Locale string conversions
  std::cout
    << "Original: " << sideFR << std::endl
    << "Upper " << boost::locale::to_upper (sideFR) << std::endl
    << "Lower " << boost::locale::to_lower (sideFR) << std::endl
    << "Title " << boost::locale::to_title (sideFR) << std::endl
    << "Fold " << boost::locale::fold_case (sideFR) << std::endl
    << "Normalised - [NFD]: "
    << boost::locale::normalize (sideFR, boost::locale::norm_nfd)
    << "; [NFC]: "
    << boost::locale::normalize (sideFR, boost::locale::norm_nfc)
    << "; [NFKD]: "
    << boost::locale::normalize (sideFR, boost::locale::norm_nfkd)
    << "; [NFKC]: "
    << boost::locale::normalize (sideFR, boost::locale::norm_nfkc)
    << std::endl;

  return 0;
}
___________________________________________________________
The output is:
_____________________________________________________
Original: Côté
Upper CÔTÉ
Lower côté
Title Côté
Fold côté
Normalised - [NFD]: Côté; [NFC]: Côté; [NFKD]: Côté; [NFKC]: Côté
____________________________________________________
Apparently, there is no visible difference between the normalization forms, whichever one is used (NFC, NFD, NFKC, NFKD). I would expect NFD and NFKD to produce a different result. But maybe my (UTF8-based) Linux terminal automatically recomposes the letters and accents, so that we do not see the difference? Or is it a feature of Boost.Locale which I overlooked?

Let me state my goal. I use Xapian (http://xapian.org/docs/) as a full-text matching engine, and I feed it with texts in various languages and scripts, all in UTF8. Xapian will typically match only keywords with the same form and case; in other words, I cannot choose or influence the collation algorithm (http://www.unicode.org/reports/tr10/), which corresponds to the third or fourth level in Xapian (AFAIU). For example, if the word "Côté" has been indexed by Xapian, "cote" will not match it.

So, I would like Xapian to index both forms ("Côté" and "cote"), so that both forms match. When a user gives me a string to match against the index, I will first try to match the string itself, then transform it (http://www.icu-project.org/icu-bin/translit) and try to match the transformed version.

So, my question is: is Boost.Locale capable of transforming Unicode strings, for instance to remove accents? A good example is provided by the first answer to the StackOverflow question: http://stackoverflow.com/questions/144761/how-to-remove-accents-and-tilde-in...

In other words, I would like to apply the "NFD; [:M:] remove; NFC" transformation, as is possible with the ICU library.
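
On the question of whether the terminal hides the difference between the normalization forms: one way to check is to look at the bytes rather than at the rendered glyphs. In UTF-8, the composed (NFC) form of "Côté" takes 6 bytes (43 C3 B4 74 C3 A9), while the decomposed (NFD) form takes 8 bytes (43 6F CC 82 74 65 CC 81). Below is a minimal sketch of such a check; to_hex is a hypothetical helper, not part of Boost.Locale, and it only uses the standard library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Boost.Locale): render each byte of a
// string as two uppercase hex digits, separated by single spaces.
std::string to_hex (const std::string& bytes) {
  std::string out;
  char buf[4];
  for (unsigned char c : bytes) {
    if (!out.empty()) {
      out += ' ';
    }
    std::snprintf (buf, sizeof buf, "%02X", static_cast<unsigned> (c));
    out += buf;
  }
  return out;
}
```

Printing to_hex (boost::locale::normalize (sideFR, boost::locale::norm_nfd)) next to the same call with norm_nfc should then show whether the two strings really differ, independently of how the terminal draws them.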
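The following is a code sample showing how to do that:
____________________________________________________
// ICU
#include <iostream>
#include <memory>
#include <string>
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <unicode/ucnv.h>

int main() {
  // Create a transliterator: decompose, remove the marks, recompose
  UErrorCode status = U_ZERO_ERROR;
  const char* lNormaliserID = "NFD; [:M:] Remove; NFC;";
  std::unique_ptr<icu::Transliterator> lNormaliser (
    icu::Transliterator::createInstance (lNormaliserID, UTRANS_FORWARD,
                                         status));
  if (U_FAILURE (status) || !lNormaliser) {
    std::cerr << "Cannot create the transliterator: "
              << u_errorName (status) << std::endl;
    return 1;
  }

  // No registration is needed to use the instance directly.
  // The source file is assumed to be encoded in UTF-8.
  icu::UnicodeString myString = icu::UnicodeString::fromUTF8 ("Côté");
  lNormaliser->transliterate (myString);

  // Convert back to UTF-8 for display
  std::string lResult;
  myString.toUTF8String (lResult);
  std::cout << "Normalized version without accents: " << lResult
            << std::endl;

  return 0;
}
____________________________________________________
Do not hesitate to reply if you have any suggestions, feedback or workarounds.

Kind regards

-denis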
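
To illustrate what the "[:M:] Remove" step does once a string has been decomposed, here is a library-independent sketch. It is not the ICU implementation: strip_combining_diacriticals is a hypothetical helper that assumes valid UTF-8 input already in NFD, and it only drops code points from the Combining Diacritical Marks block (U+0300 to U+036F), whereas ICU's [:M:] covers every code point in the Mark category.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drop UTF-8 encoded code points of the Combining
// Diacritical Marks block (U+0300..U+036F) from an already decomposed string.
// Assumes valid UTF-8 in NFD; marks outside that block are left untouched.
std::string strip_combining_diacriticals (const std::string& nfd) {
  std::string out;
  for (std::size_t i = 0; i < nfd.size(); ++i) {
    const unsigned char lead = static_cast<unsigned char> (nfd[i]);
    if (i + 1 < nfd.size()) {
      const unsigned char next = static_cast<unsigned char> (nfd[i + 1]);
      // U+0300..U+033F encode as CC 80..CC BF,
      // U+0340..U+036F encode as CD 80..CD AF.
      const bool isMark =
        (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
        (lead == 0xCD && next >= 0x80 && next <= 0xAF);
      if (isMark) {
        ++i;  // skip the second byte of the mark as well
        continue;
      }
    }
    out += nfd[i];
  }
  return out;
}
```

Applied to the decomposed form of "Côté", this yields "Cote". A precomposed (NFC) input passes through unchanged, which is why the decomposition step has to come first.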

The source code for the sample program is available on GitHub: https://github.com/denisarnaud/playground/tree/master/i18n
participants (1)
-
Denis Arnaud