Re: [boost] [gsoc]built-in support for dictionary words

28 Mar 2009

      kannan venkat wrote:
...
I plan to provide
* support for a list of all meaningful words.
* efficient methods to search if a string is a valid English word.
* advanced search options for checking if a given string is a substring
of any valid English word in an efficient manner.
* methods to check if any anagram of a given string is a valid English word.
Could you tell me if this will be a useful contribution & give suggestions
if there can be any other useful feature related to this?

I do not think it's very interesting if you limit that to English. This 
should work with all languages.

Now, with any language, representing and iterating over characters, 
words, and sentences, as well as comparison and collation, are all 
non-trivial operations.
You would, however, need almost all of them to do what you want.

Handling natural language is considerably more complicated than handling 
bytes.

Thankfully, the Unicode standard defines representation and a lot of 
operations. The funny thing is that some languages, such as Thai, 
actually require a dictionary to tell words apart from each other, since 
there are no explicit word boundaries (alternatively, it can be done 
using machine learning algorithms to detect word-like constructs; there 
are quite a few research papers on that topic).
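
To make the dictionary approach concrete, here is a minimal sketch of 
greedy longest-match segmentation. It is only an illustration: the 
function name `segment` and the word list are made up, the text is 
assumed to be already decoded to UTF-32 so that one element is one code 
point, and real segmenters for Thai do considerably more than this.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Greedy longest-match segmentation over code points.
// `dict` is a hypothetical word list supplied by the caller.
std::vector<std::u32string> segment(const std::u32string& text,
                                    const std::set<std::u32string>& dict)
{
    // The longest dictionary entry bounds how far ahead we need to look.
    std::size_t max_len = 1;
    for (const auto& w : dict) max_len = std::max(max_len, w.size());

    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(max_len, text.size() - pos);
        // Shrink the candidate until it is a dictionary word,
        // falling back to a single code point if nothing matches.
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) --len;
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}
```

With the word list {the, cat, cats, sat, at}, the input "thecatsat" 
comes out as "the", "cats", "at" rather than "the", "cat", "sat". That 
is the weakness of a purely greedy match, and one reason the statistical 
approaches exist.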

There has been a lot of demand for some Unicode library within Boost, 
but those demands were never met.
ICU from IBM is a popular Unicode library, though, and several libraries 
within Boost use it.

I would suggest you either base your work on ICU, or re-implement 
the parts of Unicode that you need.

As for features, I suggest you define a dictionary format that allows 
concise definitions of words. Since in English, for example, most words 
are actually some base word with prefixes and suffixes, you could simply 
tell the dictionary to allow various combinations.
I don't know whether that is a good idea, but I have found several 
times that a spell-checker recognized some word, yet rejected it once I 
added a valid suffix to it.
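
As a rough sketch of what such a format could look like in memory, 
assume each base word simply lists the suffixes it accepts. The type 
`AffixDictionary` and its layout are hypothetical, prefixes are left 
out, and plain `std::string` is used for brevity, so this only behaves 
sensibly for ASCII input.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical compact dictionary: base word -> suffixes it may take.
struct AffixDictionary {
    std::map<std::string, std::set<std::string>> entries;

    // True if `word` is a base word, or a base word followed by
    // one of the suffixes listed for that base word.
    bool contains(const std::string& word) const {
        if (entries.count(word)) return true;
        // Try every split point: the head must be a base word and
        // the tail must be a suffix allowed for that base word.
        for (std::size_t i = 1; i < word.size(); ++i) {
            auto it = entries.find(word.substr(0, i));
            if (it != entries.end() && it->second.count(word.substr(i)))
                return true;
        }
        return false;
    }
};
```

Storing "walk" with the suffixes "s", "ed" and "ing" then covers four 
words with one entry, while "quick" with only "ly" accepts "quickly" 
and still rejects "quicked".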

Your project actually gave me an idea: being a student myself, I will 
propose a Unicode string project.
Thank you ;)

Mathias Gaunard