
Hello list,
I have worked with Boost on my projects but haven't really thought of using C++ as a language for NLP. The NLP that I have done is in Python and Java, for their built-in string methods. I don't know if this is inexperience or ignorance, but does C++ work well for NLP?
I was thinking of proposing a GSoC project devoted to providing support for NLP, the kind of support that looks really easy in Python. Also, I was looking at some C++ code using Boost/tokenizer.hpp that tokenized some text and it looked a bit scary. Any suggestions or advice?
--
Regards, Sarma Tangirala, Junior - Class of 2012, Department of Information Science and Technology, College of Engineering Guindy - Anna University

Mathias Gaunard wrote:
On 26/03/2011 18:54, Sarma Tangirala wrote:
I have worked with Boost on my projects but haven't really thought of using C++ as a language for NLP.
What's NLP?
Natural Language Processing. This makes me wonder whether such an NLP system would be able to understand the word NLP.

On 26/03/2011 18:54, Sarma Tangirala wrote:
Also, I was looking at some C++ code using Boost/tokenizer.hpp that tokenized some text and it looked a bit scary.
Any suggestions or advice?
Look into iterators, ranges, and the various string manipulation and parsing libraries within Boost (Iterator, Range, StringAlgo, Regex, Spirit, Xpressive). You may also want to look into my Unicode library. I could add word boundaries for non-Thai languages if your project needs that.
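To make the iterator-plus-regex approach above a little less scary, here is a minimal sketch of word tokenization. It uses only the C++11 standard library (std::regex and std::sregex_iterator) rather than Boost.Regex or Boost.Tokenizer, so treat it as an illustration of the idea and not as the Boost API. The function name tokenize_words and the ASCII-only word pattern are assumptions made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split text into word tokens: runs of ASCII letters, digits, or apostrophes.
// Assumption: ASCII-only input. Real NLP text needs Unicode-aware word
// boundaries, which this sketch does not attempt.
std::vector<std::string> tokenize_words(const std::string& text) {
    static const std::regex word_pattern("[A-Za-z0-9']+");
    std::vector<std::string> tokens;
    // A default-constructed sregex_iterator acts as the end-of-matches marker.
    for (std::sregex_iterator it(text.begin(), text.end(), word_pattern), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling tokenize_words on a sentence returns the words in order with the punctuation dropped; the loop is the same "walk an iterator range" pattern that the Boost libraries listed above build on.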

I am so sorry I forgot to mention I was referring to Natural Language Processing. Thanks for your advice!
On 27 March 2011 17:00, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 26/03/2011 18:54, Sarma Tangirala wrote:
Also, I was looking at some C++ code using Boost/tokenizer.hpp that
tokenized some text and it looked a bit scary.
Any suggestions or advice?
Look into iterators, ranges, and the various string manipulation and parsing libraries within Boost (Iterator, Range, StringAlgo, Regex, Spirit, Xpressive). You may also want to look into my Unicode library. I could add word boundaries for non-Thai languages if your project needs that.

Hi,
I have worked with Boost on my projects but haven't really thought of using C++ as a language for NLP.
The NLP that I have done is on Python and Java, for their built-in string methods.
I think this would be an interesting project, but doing it correctly would require *way* more than 3 months of effort. If you were interested in starting to work on an NLP support library, you should focus on designing a small set of tools (WordNet support, stopword removal support, stemmers, etc.),
I don't know if this is inexperience or ignorance, but does C++ work well for NLP?
There's no reason it should not be a great choice for NLP applications. It might be worth pointing out that some of the performance critical components of the Python NLTK (Natural Language Toolkit) are written in C... and we all know that C++ is a better C than C :)
Also, I was looking at some C++ code using Boost/tokenizer.hpp that tokenized some text and it looked a bit scary.
Welcome to Boost. The learning curve can be a bit steep, but don't let that scare you away. Andrew
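As an illustration of how small one of the tools suggested above (stopword removal) can be, here is a minimal standard-library sketch. The names to_lower_ascii and remove_stopwords are hypothetical, the stopword list is supplied by the caller, and the case folding is ASCII-only by assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Lowercase an ASCII string (assumption: no Unicode case folding).
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Return the tokens that are not in the stopword set.
// Tokens are compared case-insensitively; the set is expected to hold
// lowercase entries. Kept tokens retain their original spelling.
std::vector<std::string> remove_stopwords(
        const std::vector<std::string>& tokens,
        const std::unordered_set<std::string>& stopwords) {
    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (stopwords.count(to_lower_ascii(token)) == 0) {
            kept.push_back(token);
        }
    }
    return kept;
}
```

A hash set keeps each lookup cheap, so the filter is a single pass over the tokens. A library version would probably ship per-language stopword lists, which this sketch leaves out.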

Hello,
On 27 March 2011 18:26, Andrew Sutton <asutton.list@gmail.com> wrote:
Hi,
I have worked with Boost on my projects but haven't really thought of using C++ as a language for NLP.
The NLP that I have done is on Python and Java, for their built-in string methods.
I think this would be an interesting project, but doing it correctly would require *way* more than 3 months of effort. If you were interested in starting to work on an NLP support library, you should focus on designing a small set of tools (WordNet support, stopword removal support, stemmers, etc.),
I do realize it will take a lot more than a summer's worth of effort, but as you pointed out, a small library with a couple of basic tools could be an excellent starting point.
I don't know if this is inexperience or ignorance, but does C++ work well for NLP?
There's no reason it should not be a great choice for NLP applications. It might be worth pointing out that some of the performance critical components of the Python NLTK (Natural Language Toolkit) are written in C... and we all know that C++ is a better C than C :)
I have worked with the Python NLTK and absolutely loved it. I did not know that the critical components were written in C/C++. But I must admit, I haven't seen a final application written entirely in C/C++. I am a moderate-to-good programmer, and I think the reason a lot of people prefer Python is the simplicity of the code, in contrast to what one forum user called "the syntactical fluff and non-abstraction" that goes with C++.
Also, I was looking at some C++ code using Boost/tokenizer.hpp that
tokenized some text and it looked a bit scary.
Welcome to Boost. The learning curve can be a bit steep, but don't let that scare you away.
Andrew
Haha. Thanks for the welcome! I do realize the complexity involved in a project such as Boost, but for a noob, I was in total awe! :)

Just a small follow-up.
I was caught in exam week and could not do anything constructive for a while. I did catch up with my advisor, who specializes in AI, and she was also of the opinion that a small set of tools, properly implemented, should keep me busy through the summer.
I am preparing my proposal and should submit it in a while.
I want to know if I have a good chance of being selected. Any advice at this stage would be awesome.
I am looking at tagging, chunking, tokenizing and parsing, stemming and stop-word removal as suggested.
I will be using the O'Reilly NLTK book as a model reference. Any other good reference sources would be helpful!
On 29 March 2011 06:53, Sarma Tangirala <tvssarma.omega9@gmail.com> wrote:
I do realize it will take a lot more than a summer's worth of effort, but as you pointed out, a small library with a couple of basic tools could be an excellent starting point.

I finally finished my proposal and it's here: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/omega9/1
Any comments or suggestions would be helpful. Thanks again for this opportunity!
On 8 April 2011 01:29, Sarma Tangirala <tvssarma.omega9@gmail.com> wrote:
I am preparing my proposal and should submit in a while.
I am looking at tagging, chunking, tokenizing and parsing, stemming and stop-word removal as suggested.

http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/omega9/1
The proposal tells me what you want to do but not how you plan to do it. A good proposal speculates on a design for the proposed work. What classes and algorithms will the project include? How do you think users will interact with elements? What might the interfaces look like? What components of those elements might be made generic (template parameters?). Andrew
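To make the questions above concrete, here is one purely hypothetical shape such an interface could take; it is not taken from the proposal. It is a sketch of a tokenizer that is generic over the character range (iterator template parameters), the delimiter policy (a callable template parameter), and the destination (an output iterator). The names tokenize and IsSpace are assumptions made for this example.

```cpp
#include <cctype>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical generic tokenizer: reads characters from [first, last),
// uses is_delimiter to decide where tokens end, and writes each token
// through the output iterator so the caller chooses the container.
template <typename InputIt, typename IsDelimiter, typename OutputIt>
OutputIt tokenize(InputIt first, InputIt last,
                  IsDelimiter is_delimiter, OutputIt out) {
    std::string current;
    for (; first != last; ++first) {
        if (is_delimiter(*first)) {
            // A delimiter closes the current token, if there is one.
            if (!current.empty()) {
                *out++ = current;
                current.clear();
            }
        } else {
            current.push_back(*first);
        }
    }
    // Flush the final token when the input does not end with a delimiter.
    if (!current.empty()) {
        *out++ = current;
    }
    return out;
}

// Example delimiter policy (assumption: ASCII whitespace only).
struct IsSpace {
    bool operator()(char c) const {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
};
```

A user would call it as tokenize(text.begin(), text.end(), IsSpace(), std::back_inserter(tokens)). Swapping the policy object or the output iterator changes the behaviour without touching the algorithm, which is the kind of design decision a proposal could discuss.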
Participants (4):
- Andrew Sutton
- Mathias Gaunard
- Sarma Tangirala
- Thomas Klimpel