
kannan venkat wrote:
I plan to provide:
* support for a list of all meaningful words.
* efficient methods to check whether a string is a valid English word.
* advanced search options for efficiently checking whether a given string is a substring of any valid English word.
* methods to check whether any anagram of a given string is a valid English word.
Could you tell me whether this would be a useful contribution, and suggest any other useful features related to this?
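
As a point of reference for the anagram check proposed above, one common approach is to index every word under its sorted letters, so that all anagrams share one key. The proposal does not say how it would be done, so this is only a minimal sketch under that assumption; the function names and the sample word list are hypothetical.

```python
from collections import defaultdict


def anagram_key(word):
    """Return a key shared by all anagrams of `word`: its case-folded letters, sorted."""
    return "".join(sorted(word.casefold()))


def build_anagram_index(words):
    """Map each anagram key to the set of dictionary words that share it."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return index


def anagrams_of(text, index):
    """Return the dictionary words that are anagrams of `text` (empty set if none)."""
    return index.get(anagram_key(text), set())


# Hypothetical sample word list; a real dictionary would be loaded from a file.
SAMPLE_WORDS = ["listen", "silent", "enlist", "stone"]
SAMPLE_INDEX = build_anagram_index(SAMPLE_WORDS)
```

Building the index costs one sort per word, after which each query is a single sort plus a hash lookup. Note that sorting code points like this ignores the Unicode issues raised in the reply below (normalization, combining characters).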
I do not think this is very interesting if you limit it to English; it should work with all languages.

With any language, though, representing and iterating over characters, words, and sentences, as well as comparison and collation, are all non-trivial operations, and you would need almost all of them to do what you want. Handling natural language is really considerably more complicated than handling bytes. Thankfully, the Unicode standard defines the representation and many of these operations. The funny thing is that some languages, such as Thai, actually require a dictionary to tell words apart, since there are no explicit word boundaries (alternatively, this can be done with machine learning algorithms that detect word-like constructs; there are quite a few research papers on that topic).

There has been a lot of demand for a Unicode library within Boost, but that demand has never been met. ICU from IBM is a popular Unicode library, though, and several libraries within Boost use it. I would suggest that you either base your work on ICU or re-implement the parts of Unicode that you need.

As for features, I suggest you define a dictionary format that allows concise definitions of words. Since in English, for example, most words are a base word with prefixes and suffixes, you could simply tell the dictionary which combinations to allow. This may or may not be a good idea, but I have found several times that spell-checkers recognize a word yet fail to recognize it once I add a valid suffix.

Your project actually gave me an idea: being a student myself, I will propose a Unicode string project. Thank you ;)
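
P.S. To make the dictionary-format suggestion a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the `AffixDictionary` class, its `add` and `contains` methods, and the idea that affixes attach by plain concatenation (so spelling changes such as "happy" becoming "happiness" are not handled). The normalization step uses Python's standard `unicodedata` module and is only a rough approximation of proper Unicode caseless matching.

```python
import unicodedata


def normalize(word):
    """Case-fold, then normalize to NFC, so equivalent spellings compare equal.

    This is an approximation; full Unicode caseless matching is more involved.
    """
    return unicodedata.normalize("NFC", word.casefold())


class AffixDictionary:
    """Toy dictionary: each base word lists the prefixes and suffixes it accepts."""

    def __init__(self):
        # base word -> (allowed prefixes, allowed suffixes);
        # the empty string in either set stands for "no affix".
        self._entries = {}

    def add(self, base, prefixes=(), suffixes=()):
        """Register `base` together with the affixes it may combine with."""
        self._entries[normalize(base)] = (
            {""} | {normalize(p) for p in prefixes},
            {""} | {normalize(s) for s in suffixes},
        )

    def contains(self, word):
        """Return True if `word` is a base word with an allowed prefix and suffix."""
        word = normalize(word)
        # Linear scan over all entries: fine for a sketch, but a real
        # implementation would want a trie or affix stripping instead.
        for base, (prefixes, suffixes) in self._entries.items():
            start = word.find(base)
            while start != -1:
                prefix = word[:start]
                suffix = word[start + len(base):]
                if prefix in prefixes and suffix in suffixes:
                    return True
                start = word.find(base, start + 1)
        return False
```

With an entry such as `add("do", prefixes=["un", "re"], suffixes=["ing", "es"])`, the single definition covers "do", "undo", "redoing", "undoes" and so on, while still rejecting combinations that were never listed.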