[gsoc]built-in support for dictionary words

Hi, I am a student applying for GSoC 2009. My idea is to provide built-in support for English dictionary words. As far as I have searched, Boost has no support for dictionary words (please correct me if I am wrong and am trying to reinvent the wheel).
In many situations I have wanted a list of all meaningful English words (including place names and common personal names), especially in cryptographic applications such as deciphering ciphers, and in other word-related applications such as finding meaningful words in a random grid of characters. I could not find any built-in support for this, so I had to write my own algorithm to work on an external list. That is why I think it might be useful.
I plan to provide:
* support for a list of all meaningful words;
* efficient methods to check whether a string is a valid English word;
* advanced search options for efficiently checking whether a given string is a substring of any valid English word;
* methods to check whether any anagram of a given string is a valid English word.
Could you tell me whether this would be a useful contribution, and suggest any other useful features related to it?
Bye,
kannan
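To make two of the proposed features concrete, here is a minimal sketch of exact-word lookup and anagram lookup. The class name `word_list` and its members are hypothetical; they are not part of Boost or of the proposal. The sketch assumes plain `std::string` words compared byte by byte, with no case folding or Unicode handling. The anagram check relies on the fact that two strings are anagrams exactly when their sorted letters are equal.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical illustration only: neither the class nor its members
// come from Boost or from the proposal in this thread.
class word_list {
public:
    // Add one word to the set and to the anagram index (duplicates ignored).
    void insert(const std::string& word) {
        if (words_.insert(word).second) {
            by_letters_[sorted_letters(word)].push_back(word);
        }
    }

    // True if `s` is exactly one of the stored words.
    bool contains(const std::string& s) const {
        return words_.count(s) != 0;
    }

    // All stored words that use exactly the letters of `s`.
    std::vector<std::string> anagrams_of(const std::string& s) const {
        auto it = by_letters_.find(sorted_letters(s));
        if (it == by_letters_.end()) {
            return std::vector<std::string>();
        }
        return it->second;
    }

private:
    // Canonical key: the letters of the string in sorted order.
    static std::string sorted_letters(std::string s) {
        std::sort(s.begin(), s.end());
        return s;
    }

    std::unordered_set<std::string> words_;
    std::unordered_map<std::string, std::vector<std::string>> by_letters_;
};
```

With "listen" and "silent" inserted, `anagrams_of("enlist")` returns both words, while `contains("enlist")` is false. Substring search is deliberately left out; it needs a different index.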

Why restrict this to English words? ;) This could work from a properly serialized dictionary file in any language. Besides that, how will it be implemented so that it is interesting?
-- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

On Sat, Mar 28, 2009 at 4:47 AM, kannan venkat <kvenkan@gmail.com> wrote:
That is why I think it might be useful. I plan to provide: * support for a list of all meaningful words; * efficient methods to check whether a string is a valid English word; * advanced search options for efficiently checking whether a given string is a substring of any valid English word; * methods to check whether any anagram of a given string is a valid English word.
Could you tell me whether this would be a useful contribution, and suggest any other useful features related to it?
One of the most likely applications would be a spelling checker, so you might want to consider the needs of spelling checkers in the requirements. Another concern is internationalization, which means the library would need to work on string types other than char. HTH, --Beman

kannan venkat wrote:
Hi, I am a student applying for GSoC 2009. My idea is to provide built-in support for English dictionary words.
There was a thread a few days ago about Bloom filters; did you see it? Several people expressed an interest. I think that a Bloom filter would work well for the problem you describe, i.e. checking password strength. Is this the sort of thing that you have in mind? For substring searching things get more complicated and call for structures such as suffix trees. I guess there would be interest in that too, if it could be done in the available time. Phil.
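As a rough illustration of why a Bloom filter suits word-membership tests, here is a minimal sketch. The class `bloom_filter` and its interface are hypothetical and are not the design discussed in the Bloom filter thread. It assumes `std::hash<std::string>` as the base hash and derives the k bit positions by double hashing, which is one common choice among several. A Bloom filter never reports an inserted word as absent, but it may report a word that was never inserted as present.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch; not the interface from the Bloom filter thread.
class bloom_filter {
public:
    bloom_filter(std::size_t bit_count, std::size_t hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }

    // Set the hash_count_ bits selected by this word.
    void insert(const std::string& word) {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            bits_[index(word, i)] = true;
        }
    }

    // false: the word was definitely never inserted.
    // true:  the word was probably inserted (false positives are possible).
    bool may_contain(const std::string& word) const {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            if (!bits_[index(word, i)]) {
                return false;
            }
        }
        return true;
    }

private:
    // Double hashing: the i-th position is derived from two base hashes.
    // The second hash is forced odd so that it is never zero.
    std::size_t index(const std::string& word, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>()(word);
        const std::size_t h2 = std::hash<std::string>()(word + '\x01') | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hash_count_;
};
```

The filter stores no words, only bits, so it cannot enumerate the dictionary or answer substring and anagram queries. It fits the "is this string a known word" check and nothing more.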

----- Original Message ----- From: "Phil Endecott" <spam_from_boost_dev@chezphil.org> To: <boost@lists.boost.org> Sent: Saturday, March 28, 2009 1:30 PM Subject: Re: [boost] [gsoc]built-in support for dictionary words
There was a thread a few days ago about Bloom filters; did you see it? Several people expressed an interest. I think that a Bloom filter would work well for the problem you describe, i.e. checking password strength. Is this the sort of thing that you have in mind? For substring searching things get more complicated and call for structures such as suffix trees. I guess there would be interest in that too, if it could be done in the available time.
As you know, I would like to see Bloom filters in Boost. It would be great if this dictionary support came with them. Vicente

kannan venkat wrote:
I plan to provide: * support for a list of all meaningful words; * efficient methods to check whether a string is a valid English word; * advanced search options for efficiently checking whether a given string is a substring of any valid English word; * methods to check whether any anagram of a given string is a valid English word.
Could you tell me whether this would be a useful contribution, and suggest any other useful features related to it?
I do not think it is very interesting if you limit it to English. This should work with all languages.
In any language, the representation and iteration of characters, words and sentences, as well as comparison and collation, are all non-trivial operations, and you would need almost all of them to do what you want. Handling natural language is considerably more complicated than handling bytes.
Thankfully, the Unicode standard defines the representation and a lot of the operations. The funny thing is that some languages, such as Thai, actually require a dictionary to tell words apart, since there are no explicit word boundaries (alternatively, it can be done with machine learning algorithms that detect word-like constructs; there are quite a few research papers on that topic).
There has been a lot of demand for a Unicode library within Boost, but that demand has never been met. ICU from IBM is a popular Unicode library, though, and several libraries within Boost use it. I would suggest that you either base your work on ICU or re-implement the parts of Unicode that you need.
As for features, I suggest you define a dictionary format that allows concise definitions of words. In English, for example, most words are some base word with prefixes and suffixes, so you could simply tell the dictionary which combinations to allow. It may or may not be a good idea, I don't know, but I have found several times that spell-checkers recognized a word but not the same word with a valid suffix added.
Your project actually gave me an idea: being a student myself, I will propose a Unicode string project. Thank you ;)
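The base-word-plus-affix idea can be sketched as follows. Everything here is hypothetical: the class `affix_dictionary`, its members, and the notion that each base word simply lists the suffixes it accepts are assumptions made for illustration, not a format defined in this thread. The sketch handles suffixes only and joins them by plain concatenation, so spelling changes such as "make" + "ing" becoming "making" are not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical format: each base word maps to the suffixes it accepts.
class affix_dictionary {
public:
    // Register a base word together with its permitted suffixes.
    void add(const std::string& base, const std::set<std::string>& suffixes) {
        entries_[base].insert(suffixes.begin(), suffixes.end());
    }

    // True if `word` is a base word, or a base word followed by exactly
    // one of the suffixes registered for that base word.
    bool accepts(const std::string& word) const {
        // Try every split point, longest candidate base first.
        for (std::size_t split = word.size(); split > 0; --split) {
            auto it = entries_.find(word.substr(0, split));
            if (it == entries_.end()) {
                continue;
            }
            if (split == word.size()) {
                return true;  // the bare base word
            }
            if (it->second.count(word.substr(split)) != 0) {
                return true;  // base word plus a permitted suffix
            }
        }
        return false;
    }

private:
    std::map<std::string, std::set<std::string>> entries_;
};
```

After `add("walk", {"s", "ed", "ing"})` and `add("cat", {"s"})`, the dictionary accepts "walking" and "cats" but rejects "cating", because "ing" is not registered for "cat". Two stored entries therefore cover six surface forms.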

Mathias Gaunard wrote:
Thankfully, the Unicode standard defines the representation and a lot of the operations. The funny thing is that some languages, such as Thai, actually require a dictionary to tell words apart, since there are no explicit word boundaries (alternatively, it can be done with machine learning algorithms that detect word-like constructs; there are quite a few research papers on that topic).
I would take that further: rather than only supporting spoken languages, generalize the concept to any form of bit groupings, so that it could be useful in other areas of computer science. Furthermore, Bloom filters would be only one aspect of such a library; the underlying data structures would require tries, wide-column stores, etc. This seems more like several GSoC projects.
Arash Partow
________________________________________________________
Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net
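One of the data structures mentioned above, the trie, can be sketched generically so that it works for any character or symbol type rather than only `char`. The class `trie` and its members are hypothetical names chosen for this illustration. The sketch assumes that keys are `std::basic_string<CharT>` and that `CharT` is ordered by `operator<`. It supports exact and prefix lookup only; substring search would need something like a suffix tree.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch of a trie over any character type.
template <typename CharT>
class trie {
public:
    // Walk down from the root, creating missing nodes, then mark the end.
    void insert(const std::basic_string<CharT>& word) {
        node* n = &root_;
        for (CharT c : word) {
            std::unique_ptr<node>& child = n->children[c];
            if (!child) {
                child.reset(new node());
            }
            n = child.get();
        }
        n->is_word = true;
    }

    // True if `word` was inserted as a complete word.
    bool contains(const std::basic_string<CharT>& word) const {
        const node* n = find(word);
        return n != nullptr && n->is_word;
    }

    // True if some inserted word starts with `prefix`.
    // The empty prefix always matches, because it reaches the root.
    bool has_prefix(const std::basic_string<CharT>& prefix) const {
        return find(prefix) != nullptr;
    }

private:
    struct node {
        bool is_word = false;
        std::map<CharT, std::unique_ptr<node>> children;
    };

    // Follow `s` symbol by symbol; nullptr if the path does not exist.
    const node* find(const std::basic_string<CharT>& s) const {
        const node* n = &root_;
        for (CharT c : s) {
            auto it = n->children.find(c);
            if (it == n->children.end()) {
                return nullptr;
            }
            n = it->second.get();
        }
        return n;
    }

    node root_;
};
```

The same template can be instantiated as `trie<char>` or `trie<wchar_t>`. Note that a `wchar_t` or `char32_t` element is still a code unit or code point, not necessarily a user-perceived character, so this alone does not solve the Unicode issues raised earlier in the thread.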
participants (7)
- Arash Partow
- Beman Dawes
- Joel Falcou
- kannan venkat
- Mathias Gaunard
- Phil Endecott
- Vicente Botet