Get All Words Offset in String using Boost regex
Hello,

I wish to determine the "start" and "length" of all the complete words in a string using Boost.Regex.

For example, I have a string "*Hello World and Google*" and I would like to create a map like this:

    1, 5    // first character position and length of the word "Hello"
    7, 5    // first character position and length of the word "World"
    14, 3   // first character position and length of the word "and"
    18, 6   // first character position and length of the word "Google"

This is the code I have written:

    void CreateOffSetMap(std::string completeInStr)
    {
        std::map<int32, int32> OffSetMap;
        std::string escapeChar = "\\" ;
        std::string bChar = "b";
        std::string dotChar = ".";
        std::string findWordInStr = escapeChar + bChar + dotChar + escapeChar + bChar;
        boost::regex regExpression(findWordInStr);
        boost::smatch what;
        std::string::const_iterator start = completeInStr.begin();
        std::string::const_iterator end = completeInStr.end();
        while (boost::regex_search(start, end, what, regExpression))
        {
            int32 foundPos = what.position();
            int32 foundLen = what.length();
            int32 endOffSet = startOffSet + foundLen;
            std::string foundString(start + foundPos, start + foundPos + foundLen);
            OffSetMap[foundPos] = foundLen;
            start += foundPos + foundLen;
        }
    }

But it is not working as expected. Could anyone tell me what I am doing wrong here?

Thanks in advance,
Subhash
S Nagre <snagre.mumbai@gmail.com> writes:
    std::string escapeChar = "\\" ;
    std::string bChar = "b";
    std::string dotChar = ".";
    std::string findWordInStr = escapeChar + bChar + dotChar + escapeChar + bChar;
This ends up with the expression "\\b.\\b", which will only ever match a single character with a word break on either side (so, in your example, it should match all and only the spaces):

    "Hello World and Google"
          ^     ^   ^
Closer would be "\\b.+?\\b", but that would still match on your spaces:

    "Hello World and Google"
     ^    ^^    ^^  ^^
If you really want words, you are best off deciding what constitutes a word, and then writing the regex for exactly that purpose. There is the built-in "\\w" character class, but only you can decide whether things like apostrophes and hyphens break words. (And that's just in English; I have no idea what constitutes a word break in most other languages!) For English, I'd consider something like "[\\w'-]+" (which should be: all word chars, plus apostrophes, plus hyphens).

And from a personal taste point of view, I'd likely write it exactly that way, as a single literal. (I do sometimes decompose my regexes, but only if they have repeated subsections that could better be described as a variable name.)

You also had a small logic error, when you wrote this:

    OffSetMap[foundPos] = foundLen;

"foundPos" is relative to the start of the last search, not to the start of the whole string.

Here's my version:

| #include <iostream>
| #include <map>
| #include <string>
|
| #include <boost/foreach.hpp>
| #include <boost/regex.hpp>
|
| typedef int int32;
|
| typedef std::map< int32, int32 > offset_map_t;
|
| void create_offset_map( const std::string & str,
|                         offset_map_t & offset_map )
| {
|     std::cout << "searching '" << str << "'" << std::endl;
|
|     boost::regex re( "[\\w'-]+" );
|
|     boost::smatch what;
|
|     std::string::const_iterator start = str.begin();
|     std::string::const_iterator end = str.end();
|
|     while ( boost::regex_search( start, end, what, re ) )
|     {
|         // position() is relative to 'start', not to str.begin()
|         int32 pos = what.position();
|         int32 len = what.length();
|
|         std::cout << " found '" << what.str( 0 ) << "'"
|                   << " at pos=" << pos << ", len=" << len << std::endl;
|
|         start += pos;
|         offset_map[ start - str.begin() ] = len;
|         start += len;
|     }
|
|     BOOST_FOREACH( const offset_map_t::value_type & p, offset_map )
|         std::cout << " ( " << p.first << ", "
|                   << p.second << " )" << std::endl;
| }
|
| int main( int argc, char * argv [] )
| {
|     for ( int i = 1; i < argc; ++i )
|     {
|         offset_map_t my_map;
|         create_offset_map( argv[i], my_map );
|     }
|     return 0;
| }

Hope this helps.

Best Regards,
Anthony Foiani
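The pitfall of relative versus absolute positions can also be sidestepped by walking the string with a regex iterator, whose position() is measured from the start of the whole sequence rather than from the last search. What follows is a minimal sketch, not a drop-in replacement: it assumes std::regex (C++11) as a stand-in for boost::regex so that it builds without Boost, word_offsets is a hypothetical name, and the word pattern is simply the one suggested in the message. Boost has a counterpart in boost::sregex_iterator, but this sketch has not been checked against it.

```cpp
#include <map>
#include <regex>
#include <string>

typedef std::map<int, int> offset_map_t;

// Sketch: map each word's 0-based start offset to its length.
// "Word" here means word characters plus apostrophes and hyphens;
// adjust the pattern if your definition of a word differs.
offset_map_t word_offsets(const std::string& str)
{
    static const std::regex re("[\\w'-]+");
    offset_map_t offsets;

    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(str.begin(), str.end(), re), last;
         it != last; ++it)
    {
        // For iterator-driven matches, position() counts from str.begin(),
        // so no running offset has to be maintained by hand.
        offsets[static_cast<int>(it->position())] =
            static_cast<int>(it->length());
    }
    return offsets;
}
```

For "Hello World and Google" this produces the entries (0, 5), (6, 5), (12, 3) and (16, 6). The offsets are 0-based; add 1 to each key if 1-based character positions are wanted.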
Many thanks, Anthony. Your version of the code works much better than mine.

Thanks,
Subhash

On Fri, Jan 6, 2012 at 2:25 AM, Anthony Foiani <tkil@scrye.com> wrote:
[...]
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Participants (2):
- Anthony Foiani
- S Nagre