help extracting TAG with boost::regex

Hi there,

I am working on tag-oriented text with boost::regex. For example, the following pattern might occur in the text:

    <before> <pre><p>Some Text</p></pre> <after> <pre> ddd </pre>

In this case, I would like to extract everything between <pre> and </pre>. Meanwhile, everything outside <pre> </pre> should be left unchanged, except that < is replaced by &lt; and > is replaced by &gt;.

For that purpose, I tried the following code:

    boost::regex regexp("<\s*pre[^>]*>(.*?)<\s*/pre\s*>", boost::regex::icase);
    boost::match_results<std::string::const_iterator> what;
    if (regex_search(sometext, what, regexp, boost::match_default|boost::format_first_only))
    {
        std::string between_tag = std::string(what[1].first, what[2].second) + "\r\n";
        MessageBox(0, between_tag.c_str(), "", 0);
        std::string left_tag(sometext.begin(), what[1].first);
        std::string right_tag(what[1].second, sometext.end());
        replace_all<string, LPCSTR, LPCSTR>( left_tag, "<", "&lt;" );
        replace_all<string, LPCSTR, LPCSTR>( right_tag, ">", "&gt;");
        sometext = left_tag + between_tag + right_tag + "\r\n";
    }

However, the code does not seem to work properly unless there is nothing outside <pre> </pre>. In addition, when the text being handled contains \r\n, the code returns an error regardless of whether left_tag and right_tag are empty.

In a far more complicated case, a nested <pre></pre> might occur, as follows:

    <before> <pre><pre><p>Some Text</p></pre></pre> <after> <pre> ddd </pre>

For this case, I only want to handle the outermost <pre></pre> and keep everything inside it unchanged, i.e., the inner <pre></pre> should be extracted as ordinary text.

Any ideas? Thanks in advance.

llwaeva@21cn.com wrote:
> hi there, I am working with a TAG-oriented text with boost::regex. For example, the following pattern might occur in the text
> <before> <pre><p>Some Text</p></pre> <after> <pre> ddd </pre>
> In this case, I would like to extract everything between <pre> </pre>. Meanwhile, everything outside <pre> </pre> should be unchanged except that < is replaced by &lt; and > is replaced by &gt;
> For that purpose, I tried the following code
I don't see anything obviously wrong at a quick glance, except that \s* should be \\s* (the backslash has to be doubled inside a C++ string literal). If that doesn't fix things, post a self-contained test case and I'll take a look.
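To illustrate the escaping point, here is a minimal sketch of the extract-and-escape step with the backslashes doubled. It uses std::regex instead of boost::regex purely so that it compiles without Boost; that swap is an assumption for illustration, and the escaping rule is the same for both. Because '.' does not match line breaks in std::regex's default grammar, the sketch uses [\s\S] so that text containing \r\n is still matched. The helper names escape_angle and process_first_pre are hypothetical. Following the slicing in the original post, the <pre> tags themselves are treated as "outside" text, and only the first block is handled.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace '<' and '>' with HTML entities.
std::string escape_angle(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '<') out += "&lt;";
        else if (c == '>') out += "&gt;";
        else out += c;
    }
    return out;
}

// Hypothetical helper: keep the body of the first <pre>...</pre> verbatim
// and escape angle brackets in everything before and after that body.
std::string process_first_pre(const std::string& text) {
    // Note the doubled backslashes: the regex engine must receive "\s".
    static const std::regex re(
        "<\\s*pre[^>]*>([\\s\\S]*?)<\\s*/pre\\s*>", std::regex::icase);
    std::smatch what;
    if (!std::regex_search(text, what, re)) {
        return escape_angle(text);  // no <pre> block: escape everything
    }
    // Group 1 is the only capture group, so both ends come from what[1].
    std::string left(text.begin(), what[1].first);
    std::string body(what[1].first, what[1].second);
    std::string right(what[1].second, text.end());
    return escape_angle(left) + body + escape_angle(right);
}
```

Note that the pattern has a single capture group, so the body is delimited by what[1].first and what[1].second; there is no what[2] to read from.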
> In a far more complicated case, a nested <pre></pre> might occur as follow
> <before> <pre><pre><p>Some Text</p></pre></pre> <after> <pre> ddd </pre>
> For this case, I only want to handle the outermost <pre></pre> and keep everything inside it unchanged, i.e., the inner <pre></pre> will be extracted as common text.
Hmmmm, traditional regexes don't handle that all that well. How deep will the nesting go? You can handle a finite number of nested occurrences using something like:

    <\s*pre[^>]*>(<\s*pre[^>]*>.*?</\s*pre\s*>|.)*?</\s*pre\s*>

and so on, but remember to double those \'s if you embed this in a C++ string.

John.
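The one-level pattern above could be embedded in C++ roughly as follows. This is only a sketch under a few assumptions: it uses std::regex rather than boost::regex so that it builds without Boost, it swaps '.' for [\s\S] so that line breaks inside the block are matched, and it adds an outer capture group to pull out the body. The function name outer_pre_body is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the body of the first outermost <pre> block,
// tolerating ONE level of nested <pre>...</pre> inside it.
// Returns an empty string when no block is found.
std::string outer_pre_body(const std::string& text) {
    static const std::regex re(
        "<\\s*pre[^>]*>"                                  // outer opening tag
        "((?:<\\s*pre[^>]*>[\\s\\S]*?</\\s*pre\\s*>"      // a whole inner block...
        "|[\\s\\S])*?)"                                   // ...or any single char
        "</\\s*pre\\s*>",                                 // outer closing tag
        std::regex::icase);
    std::smatch what;
    if (std::regex_search(text, what, re)) {
        return what[1].str();
    }
    return std::string();
}
```

Deeper nesting would need the inner alternative repeated once per extra level, and on long inputs a backtracking pattern like this can become slow, so treat it as a starting point rather than a general solution.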
Participants (2): John Maddock, llwaeva@21cn.com