Get All Words Offset in String using Boost regex

Hello,
I wish to determine the "start" and "length" of every complete word in
a string using Boost.Regex.
For example, I have the string "Hello World and Google".
I would like to create a map like this:
1, 5 // first character position and length of the word "Hello"
7, 5 // first character position and length of the word "World"
13, 3 // first character position and length of the word "and"
17, 6 // first character position and length of the word "Google"
This is the code I have written:
void CreateOffSetMap(std::string completeInStr)
{
std::map

S Nagre writes:
std::string escapeChar = "\\" ; std::string bChar = "b"; std::string dotChar = ".";
std::string findWordInStr = escapeChar + bChar + dotChar + escapeChar + bChar;
This ends up with the expression "\\b.\\b", which will only ever match
a single character with word break on either side (so, in your
example, it should match all and only the spaces):
  "Hello World and Google"
        ^     ^   ^
Closer would be "\\b.+?\\b", but that would still match on your spaces:
  "Hello World and Google"
   ^    ^^    ^^  ^^
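To check what a candidate expression really matches, a small helper that lists every match can be handy. The sketch below is only an illustration: it uses std::regex (C++11, ECMAScript syntax) instead of Boost.Regex so that it builds without extra libraries, and the name all_matches is made up for this example. For these two simple patterns I would expect Boost.Regex to behave the same way, but treat that as an assumption.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the text of every non-overlapping match
// of `pattern` in `text`, in order of appearance.
// Note: this uses std::regex, not Boost.Regex.
std::vector<std::string> all_matches(const std::string& text,
                                     const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> out;
    // A default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        out.push_back(it->str());
    }
    return out;
}
```

Calling all_matches("Hello World and Google", "\\b.\\b") should return just the three spaces, while "\\b.+?\\b" should return the four words with the three spaces in between them.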
If you really want words, you are best off deciding what constitutes a
word, and then writing the regex for exactly that purpose. There is
the built-in "\\w" character class, but only you can decide whether
things like apostrophes and hyphens break words. (And that's just in
English; I have no idea what constitutes a word break in most other
languages!) For English, I'd consider something like "[\\w'-]+"
(which should be: all word chars, plus apostrophes, plus hyphens).
And from a personal taste point of view, I'd likely write it exactly
that way. (I do sometimes decompose my regexes, but only if they have
repeated subsections that could better be described as a variable
name.)
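As a rough illustration of that character class, here is a sketch that tokenizes a sentence with "[\\w'-]+". It uses std::sregex_token_iterator from the standard library rather than Boost.Regex, and the function name english_words is invented for the example, so take it as one possible shape and not the only way to do it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` into "words", where a word is a run
// of word characters, apostrophes and hyphens.
// Note: this uses std::regex, not Boost.Regex.
std::vector<std::string> english_words(const std::string& text)
{
    static const std::regex word_re("[\\w'-]+");
    std::vector<std::string> words;
    // With no submatch argument, the token iterator yields each whole match.
    for (std::sregex_token_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it)
    {
        words.push_back(it->str());
    }
    return words;
}
```

With this class, "don't" and "well-known" each stay in one piece, whereas a plain "\\w+" would split them at the apostrophe and the hyphen. The price is that a free-standing hyphen, as in "a - b", also counts as a word, so you may want to tighten the expression if that matters for your data.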
You also had a small logic error, when you wrote this:
OffSetMap[foundPos] = foundLen;
"foundPos" is relative to the start of the last search, not to the
start of the whole string.
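To make the offset problem concrete, here is a minimal sketch of the bookkeeping, written against std::regex instead of Boost.Regex so it has no dependencies. The function name word_offsets is made up for this example. The point is only that the position reported by each search has to be added to the distance already consumed.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: map each word's absolute, 0-based start offset
// to its length. Note: this uses std::regex, not Boost.Regex.
std::map<int, int> word_offsets(const std::string& text)
{
    const std::regex re("[\\w'-]+");
    std::map<int, int> offsets;
    std::smatch m;
    std::string::const_iterator start = text.begin();
    const std::string::const_iterator end = text.end();

    while (std::regex_search(start, end, m, re))
    {
        // m.position() counts from `start`, not from text.begin(),
        // so add how far `start` has already advanced.
        const int base = static_cast<int>(start - text.begin());
        const int pos  = static_cast<int>(m.position());
        const int len  = static_cast<int>(m.length());

        offsets[base + pos] = len;
        start += pos + len;  // resume just past this match
    }
    return offsets;
}
```

For "Hello World and Google" this should give (0, 5), (6, 5), (12, 3) and (16, 6). These offsets are 0-based; add 1 to each key if you want the 1-based positions from your example.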
Here's my version:
| #include <map>
| #include <string>
|
| #include

Many thanks, Anthony. Your version of the code works much better than mine.
Thanks,
Subhash
On Fri, Jan 6, 2012 at 2:25 AM, Anthony Foiani wrote:
Here's my version:
| #include <iostream>
| #include <map>
| #include <string>
|
| #include <boost/foreach.hpp>
| #include <boost/regex.hpp>
|
| typedef int int32;
|
| typedef std::map< int32, int32 > offset_map_t;
|
| void create_offset_map( const std::string & str,
|                         offset_map_t & offset_map )
| {
|     std::cout << "searching '" << str << "'" << std::endl;
|
|     // a "word" is a run of word chars, apostrophes and hyphens
|     boost::regex re( "[\\w'-]+" );
|
|     boost::smatch what;
|
|     std::string::const_iterator start = str.begin();
|     std::string::const_iterator end = str.end();
|
|     while ( boost::regex_search( start, end, what, re ) )
|     {
|         // position is relative to 'start', not to str.begin()
|         int32 pos = what.position();
|         int32 len = what.length();
|
|         std::cout << " found '" << what.str( 0 ) << "'"
|                   << " at pos=" << pos << ", len=" << len << std::endl;
|
|         // move to the match, record its absolute offset, then skip it
|         start += pos;
|         offset_map[ start - str.begin() ] = len;
|         start += len;
|     }
|
|     BOOST_FOREACH( const offset_map_t::value_type & p, offset_map )
|         std::cout << " ( " << p.first << ", "
|                   << p.second << " )" << std::endl;
| }
|
| int main( int argc, char * argv [] )
| {
|     for ( int i = 1; i < argc; ++i )
|     {
|         offset_map_t my_map;
|         create_offset_map( argv[i], my_map );
|     }
|     return 0;
| }
Hope this helps.
Best Regards,
Anthony Foiani
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users