On Fri, Oct 23, 2009 at 10:18 AM, Roland Bock <rbock@eudoxos.de> wrote:
Stjepan Rajko wrote:
On Thu, Oct 22, 2009 at 2:04 AM, Roland Bock <rbock@eudoxos.de> wrote:
Stjepan Rajko wrote:
Hello,
For the past few years I've been working on the AME Patterns library - a generic library for modeling, recognition and synthesis of sequential patterns. ...
Support for Hidden Markov models would be of interest to me. I hope to start some part-of-speech tagging and similar analysis in about half a year.
That's a neat problem that I haven't tried yet. I downloaded the Brown Corpus and will try to get some results on part-of-speech tagging.
Thanks for your interest,
Stjepan
Wow! Keep me posted, please :-)
OK, I just completed a small experiment on the 9 texts of the Brown Corpus categorized as "humor". I used 6 of the texts for training and 3 for testing. I created one submodel per tag (http://kh.aksis.uib.no/icame/manuals/brown/INDEX.HTM#bc6), trained each from the training data, and then connected the submodels into a larger model whose transitions were also trained from the training data.

Here are the results. Out of the 7159 tagged parts of speech (words, symbols, etc.) present in the 3 test texts:

5190 were tagged correctly,
300 were tagged incorrectly, and
1669 were not tagged, because the word or symbol was not present (at least not in verbatim form) in the training data.

So, if you only consider the 7159 - 1669 = 5490 parts that could possibly be tagged based on what the training data covers, you get a 94.5% success rate (5190 / 5490). Using a larger training set should reduce the number of non-tagged parts, and I'm sure there are domain-specific tricks for improving the results further.

BTW, 95% of the work to get this done was putting together the code that reads the corpus, since I already have generic code that runs this kind of experiment.
I hope to join in a few months...
Great! I hope to have things cleaned up and better documented by then.

Best,

Stjepan