[regexp] Replace a substring with a regexp

Hi all, I am not really familiar with regexp, and I am facing a problem. I have some strings containing unicode sequences (like "\u****"), and I would like to replace them with html sequences (such that "\u****" becomes "****;"). I think I can do that with boost regexp, but I really do not know how. The major problem is that I do not know in advance what the characters for "****" are. I do know, however, that there are always four of them and that they are alphanumeric. So I have to detect them and also append a ";" after them. Do you have any hint on how to do that?
Best regards, Olivier

I have no experience with Boost.Regex, but these are the notations you need.
Search pattern: "\\u(\w{4})"
Replacement pattern: "\\1;"
Here, "\1" stands for "the match to the first pattern in parentheses", so that's your four digits. You'll have to refer to the Boost.Regex manual to find out how to apply these patterns.
HTH, Julian

Thank you Julian,
2011/3/17 Julian Gonggrijp
Search pattern: "\\u(\w{4})"
It seems that we also have to escape the "\" in "\u". The working regex seems to be: "\\\\u(\w{4})" Best regards, Olivier
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Olivier Tournaire wrote:
2011/3/17 Julian Gonggrijp
Search pattern: "\\u(\w{4})"
It seems that we also have to escape the "\" in "\u". The working regex seems to be:
"\\\\u(\w{4})"
I'm surprised. The first backslash was already there to escape the second. Are there maybe two steps of backslash interpretation at work, one by the C++ compiler and one by Boost.Regex?
In any case, I forgot to escape the backslash in "\w"; you'd probably have to give that one the same treatment, and likewise the ones in "\&" and "\1" in the replacement pattern. But if you get the right result as the patterns stand right now, that's of course also fine. :)
-Julian

2011/3/18 Julian Gonggrijp
I'm surprised. The first backslash was already there to escape the second. Are there maybe two steps of backslash interpretation at work, one by the C++ compiler and one by Boost.Regex?
I forgot to say that I finally used Qt (since I already use it in my project), which has a convenient QString::replace method that handles regexes.
In any case, I forgot to escape the backslash in "\w", you'd probably have to give that one the same treatment. And also the ones in "\&" and "\1" in the replacement pattern.
Yes, you are right, I should have also mentioned them.
But if you get the right result as the patterns stand right now, that's of course also fine. :)
Thank you for pointing me in the right direction! Regards, Olivier

I'm surprised. The first backslash was already there to escape the second. Are there maybe two steps of backslash interpretation at work, one by the C++ compiler and one by Boost.Regex? In any case, I forgot to escape the backslash in "\w", you'd probably have to give that one the same treatment. And also the ones in "\&" and "\1" in the replacement pattern.
Yes, if you want to match a literal '\' then you need to use '\\\\' in the regex - the compiler swallows one set of \'s so the regex engine then sees '\\' which is what you want... HTH, John.
participants (3)
- John Maddock
- Julian Gonggrijp
- Olivier Tournaire