[xpressive] use of Boyer-Moore

newer
Re: [Boost-users] [boost] Mixing a...

older
[Test] minimal test

Lynn Allan

10 Apr 2006 10 Apr '06

6:54 p.m.

Eric N. wrote during 2005

...

do to make Boyer-Moore work with the regex traits interface make Boyer-Moore more trouble than its worth for case-insensitive matches. I haven't yet run the other tests.

Just curious ... does xpressive use "standard" Boyer-Moore or a variant? In my limited experience, the Horspool variant of Boyer-Moore is simpler and tends to be faster. It may be more or less prone to have problems with "pathological" cases than "standard" Boyer-Moore, however.

Attachments:

attachment.html (text/html — 886 bytes)

Show replies by date

Eric Niebler

11 Apr 11 Apr

2:15 a.m.

Lynn Allan wrote:

...

Eric N. wrote during 2005

...
do to make Boyer-Moore work with the regex traits interface make Boyer-Moore more trouble than its worth for case-insensitive matches. I haven't yet run the other tests.

Just curious ... does xpressive use "standard" Boyer-Moore or a variant? In my limited experience, the Horspool variant of Boyer-Moore is simpler and tends to be faster. It may be more or less prone to have problems with "pathological" cases than "standard" Boyer-Moore, however.

Xpressive uses a home-grown variant of Horspool. It's an extension to the algorithm that allows it to perform case-insensitive matches. It also works with large character sets, such as Unicode, which standard Horspool does not, IIUC. It yields potential matches, not exact matches. Only if it finds a tentative match is a full regex match attepted. Since I wrote the above, I have fixed the performance problem with BMH and case-insensitive matches by extended the regex traits class with a function that returns all the case-folded equivalents of a character. This resulted in a significant performance improvement for case-insensitive matches. HTH, -- Eric Niebler Boost Consulting www.boost-consulting.com

Sebastian Redl

15 Apr 15 Apr

3:31 p.m.

Eric Niebler wrote:

...

Since I wrote the above, I have fixed the performance problem with BMH and case-insensitive matches by extended the regex traits class with a function that returns all the case-folded equivalents of a character. This resulted in a significant performance improvement for case-insensitive matches.

How does that work with multiple character case mappings, like the German ß -> SS (the sharp s does not exist in upper case)? Sebastian Redl

Eric Niebler

16 Apr 16 Apr

1:24 a.m.

Sebastian Redl wrote:

...

Eric Niebler wrote:

...
Since I wrote the above, I have fixed the performance problem with BMH and case-insensitive matches by extended the regex traits class with a function that returns all the case-folded equivalents of a character. This resulted in a significant performance improvement for case-insensitive matches.

How does that work with multiple character case mappings, like the German ß -> SS (the sharp s does not exist in upper case)?

It doesn't. :-P Xpressive aims for "Basic Unicode Support," as defined by Unicode TR18 (http://www.unicode.org/reports/tr18/): Some caseless matches may match one character against two: for example, U+00DF "ß" matches the two characters "SS". And case matching may vary by locale. However, because many implementations are not set up to handle this, at Level 1 only simple case matches are necessary. So correct handling of German ß -> SS is only necessary for "Extended Unicode Support," which would be nice but is a more distant goal. Sadly and AFAICT, TR1.Regex doesn't even make accommodation for Basic Unicode Support, since it doesn't provide syntax for character set subtraction and intersection. In short, it's a problem, but there are bigger fish to fry. If you need a regex engine that can handle this today, try ICU. -- Eric Niebler Boost Consulting www.boost-consulting.com

7070

Age (days ago)

7076

Last active (days ago)

List overview

Download

3 comments

3 participants

participants (3)

Eric Niebler
Lynn Allan
Sebastian Redl