Re: [boost] Re: Regex ease-of-use ideas

7 Apr 2004

      ...
Actually, I was asking about initial construction cost, in particular
of an object representing a failed match.  The acceptance of N1610
means that copy costs should be insignificant for cases like this one,
provided that the smatch author puts in the required effort to make it
moveable.  ;-)
Sounds like a hint - maybe if we could make shared_ptr moveable then we
could all delegate the work to that :-)

As for the initial construction cost - yes there is a cost - it has to
allocate memory to store the sub-expression matches.  The matcher needs some
working space, and therefore starts storing the submatches before it knows
whether there will be a match.

Consider your current code:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");
    boost::smatch matches;

    // Test the read itself, so a failed getline never re-processes old data.
    while (std::getline(std::cin, line))
    {
        if (boost::regex_match(line, matches, pat))
            std::cout << matches[2];
    }

The first time regex_match gets called it allocates the storage it needs in
the match_results class, subsequent calls then re-use this storage.  This is
efficient - in fact the cost of a single memory allocation is about 10 times
that of a simple regex_match attempt - so this is very important IMO.  In
fact I've spent a lot of last year eliminating unnecessary memory
allocations from regex, and there are some more I intend to stamp on this
year. Believe me it makes a difference, and other libraries like GRETA and
PCRE have all been through the same process and for the same reasons.  In
contrast if regex_match returns a match_results structure then you
effectively "pessimise" the performance for a small improvement in ease of
use (although I admit that there are options similar to the small-string
optimisation that *might* be applicable here).
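
The storage-reuse pattern described above can be sketched as follows.  This
is only an illustration under stated assumptions: it uses std::regex (whose
interface was derived from Boost.Regex) so that it needs nothing beyond the
standard library, and subjects_reusing_results is a hypothetical helper name.
Whether a given implementation really keeps its allocated sub-match storage
between calls is up to that implementation.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the subject text from every "Subject:" line.
// A single std::smatch is declared outside the loop, which gives the library
// the chance to reuse its sub-match storage on every call.
std::vector<std::string> subjects_reusing_results(std::istream& in)
{
    static const std::regex pat("^Subject: (Re: )?(.*)");
    std::vector<std::string> out;
    std::string line;
    std::smatch matches;                       // constructed once, reused

    while (std::getline(in, line))
    {
        if (std::regex_match(line, matches, pat))
            out.push_back(matches[2].str());   // second capture: the subject
    }
    return out;
}
```

An interface that returns a fresh match_results from every call would have to
construct that object inside the loop instead, which is where the extra
allocation cost would come from.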

BTW, just to be hyper critical, your alternative code:

    std::string line;
    boost::regex pat("^Subject: (Re: )?(.*)");

    // Test the read itself, so a failed getline never re-processes old data.
    while (std::getline(std::cin, line))
    {
        if (boost::smatch m = boost::regex_match(line, pat))
            std::cout << m[2];
    }

contains an assignment inside a condition, which, while "neat", I have often
seen criticised for being potentially error prone; there are even some
compilers that throw out a helpful(!) warning if you do that (along the
lines of "didn't you want to use operator==?").
...
...
One other thing - the current regex_match overload that doesn't take
a match_results as a parameter currently returns bool - the intent
is that if the user doesn't need the info generated in the
match_results, then some time can be saved by not storing it.
Boost.Regex doesn't currently take advantage of that, but I was
planning to in the next revision (basically you can cut out memory
allocation altogether, and that's an order of magnitude saving).
But I do need the match results, when the match succeeds.
I understand that, but there is a group of users who don't - one example is
a (commercial) email spam-filter that uses Boost.Regex.  It only needs a
true/false result "does this message have this pattern or not", and it wants
the answer as fast as possible.  For uses like this even a small change in
performance can make the difference between "coping" and "not coping" with
the email traffic they're seeing these days.
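
A minimal sketch of that true/false style of use, again assuming std::regex
as a stand-in for Boost.Regex; contains_pattern is a hypothetical name, and
nothing here is taken from the spam filter mentioned above.  It uses the
regex_search overload that takes no match_results argument; regex_match has
an overload of the same shape.

```cpp
#include <regex>
#include <string>

// Hypothetical filter predicate: true if the message contains the pattern.
// No match_results object is passed in, so the caller never asks for
// sub-expression positions and the library is free not to record them.
bool contains_pattern(const std::string& message, const std::regex& pat)
{
    return std::regex_search(message, pat);
}
```

How much time this saves in practice depends on whether the library takes
advantage of the missing match_results argument.
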
...
I guess my original suggestion of making it implicitly convertible to
some safe_bool solves that problem.  I guess I prefer that idea,
though Allan probably has more experience with this than I do.
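
One way to picture a result object that can be tested in a condition is
sketched below.  It is an illustration only: match_attempt and subject_of are
hypothetical names, std::regex stands in for Boost.Regex, and C++11's
explicit operator bool is used in place of the safe_bool idiom, which it
later made largely unnecessary.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical wrapper, not part of Boost.Regex or the standard library.
// It stores the match results and converts to bool only in boolean contexts.
// The text passed in must outlive the wrapper, because the stored results
// hold iterators into it.
class match_attempt
{
public:
    match_attempt(const std::string& text, const std::regex& pat)
        : matched_(std::regex_match(text, results_, pat)) {}

    explicit operator bool() const { return matched_; }

    std::string operator[](std::size_t n) const { return results_[n].str(); }

private:
    std::smatch results_;   // declared first, so it exists before matched_ is set
    bool matched_;
};

// Returns the subject text, or an empty string if the line does not match.
std::string subject_of(const std::string& line)
{
    static const std::regex pat("^Subject: (Re: )?(.*)");
    if (match_attempt m{line, pat})     // declaration inside the condition
        return m[2];
    return std::string();
}
```

Note that this wrapper still constructs a fresh results object on every
attempt, so it keeps the allocation cost discussed earlier.
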
OK, let me mull this over, maybe we can find a way to keep everyone happy,
maybe not ...

John.

John Maddock