Hi all,
Recently I found a bug in Boost.Regex that can lead to an access violation.
std::string s(".*?");
boost::regex regEx(s.begin(), s.end()); // Potential AV here
The basic_regex constructor creates a local variable of type traits::string_type and passes the address of its first element and the address one past its last element to the assign() function.
template <class InputIterator>
basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
{
   typedef typename traits::string_type seq_type;
   seq_type a(arg_first, arg_last);
   if(a.size())
      assign(&*a.begin(), &*a.begin() + a.size(), f);
   else
      assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
}
Calling assign() eventually leads to the creation of a basic_regex_parser object and a call to its parse_repeat() function. The parser's m_begin and m_end members are initialized with the values passed to assign().
template <class charT, class traits>
bool basic_regex_parser<charT, traits>::parse_repeat(std::size_t low, std::size_t high)
{
   ...
   // OK we have a perl or emacs regex, check for a '?':
   if(this->m_traits.syntax_type(*m_position) == regex_constants::syntax_question)
   {
      greedy = false;
      ++m_position;
   }
   // for perl regexes only check for pocessive ++ repeats.
   if((0 == (this->flags() & regbase::main_option_type))
      && (this->m_traits.syntax_type(*m_position) == regex_constants::syntax_plus))
   {
      pocessive = true;
      ++m_position;
   }
   ...
}
In parse_repeat(), the m_position member points to '?', so the condition in the first if statement is true and m_position is advanced by ++m_position. It is now equal to m_end.
In the next if statement, *m_position is evaluated without first checking it against m_end, which may cause an access violation.
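To make the problem concrete, here is a minimal sketch of the same two checks with a bounds test before each dereference. It is not Boost code: RepeatSuffix and parse_repeat_suffix are hypothetical names, and plain character comparisons stand in for the traits-based syntax_type() lookups.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the suffix handling in
// parse_repeat(); it is not the Boost implementation.
struct RepeatSuffix {
    bool greedy;
    bool possessive;
    const char* next;  // position after the consumed suffix characters
};

// Consumes an optional '?' and then an optional '+' that follow a repeat
// operator, never reading at or beyond `end`.
inline RepeatSuffix parse_repeat_suffix(const char* pos, const char* end)
{
    assert(pos <= end);
    RepeatSuffix r = { true, false, pos };
    if (r.next != end && *r.next == '?') {
        r.greedy = false;
        ++r.next;
    }
    // Re-check the bound: the previous step may have consumed the last
    // character, in which case r.next == end and must not be dereferenced.
    if (r.next != end && *r.next == '+') {
        r.possessive = true;
        ++r.next;
    }
    return r;
}
```

With a three-character buffer holding ".*?" and no terminator, calling parse_repeat_suffix(buf + 2, buf + 3) consumes the '?' and then stops at end without touching memory past the buffer.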
Actually, the sample code above is unlikely to cause an AV. That is because seq_type in the basic_regex constructor is basic_string, and basic_string's underlying buffer is usually null-terminated. So m_end points to the null terminator, which is always readable, and no AV can happen.
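As a small illustration of why the basic_string case is benign: c_str() is guaranteed to return a null-terminated array, so the element at index size() can be read. The helper name below is hypothetical, and whether &*a.begin() refers to that same buffer depends on the implementation.

```cpp
#include <string>

// Hypothetical helper: reads the element one past the last character
// through c_str(), which always yields a null-terminated array.
inline bool string_end_is_readable(const std::string& s)
{
    const char* p = s.c_str();
    return p[s.size()] == '\0';
}
```

This only shows that the read is well defined for a string accessed through c_str(); it says nothing about containers that do not store a terminator.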
The actual code I used is slightly different but the concept is the same:
boost::u32regex regEx = boost::make_u32regex(L".*?");
u32regex and make_u32regex are provided by Boost.Regex's ICU support (boost/regex/icu.hpp). That is a way to support Unicode strings.
In this case seq_type becomes std::vector, which, unlike basic_string, has no terminator after its last element. Sometimes the underlying buffer of the "a" object was allocated right at the end of a page, so basic_regex_parser::m_end pointed to the start of the next page, which was not allocated. Dereferencing m_position then caused the access violation.
My environment is Visual C++ 9.0 and Boost 1.42.0 (as far as I can see, the code in Boost 1.46.1 is the same).
By the way, when traits::string_type is basic_string, this code in the basic_regex constructor is not portable:
assign(&*a.begin(), &*a.begin() + a.size(), f);
If I remember correctly, the C++03 standard does not guarantee that basic_string's controlled sequence is stored contiguously.
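One way to get a pointer range without relying on &*begin() is to go through data(), which returns a pointer to an array of size() characters. The sketch below is only an illustration: count_question_marks is a hypothetical stand-in for a function that, like assign(), takes a [first, last) pointer range.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a function taking a [first, last) pointer
// range, as assign() does; it simply counts '?' characters.
inline std::size_t count_question_marks(const char* first, const char* last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
        if (*first == '?')
            ++n;
    return n;
}

// Passes the string's contents as a pointer range obtained from data()
// instead of taking the address of *begin().
inline std::size_t count_in_string(const std::string& s)
{
    return count_question_marks(s.data(), s.data() + s.size());
}
```

This also avoids the special case for an empty sequence, because data() may be called on an empty string, whereas *a.begin() may not be evaluated there.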
Just in case, here is the call stack when the AV occurred:
boost::re_detail::basic_regex_parser::parse_repeat+0x87
boost::re_detail::basic_regex_parser::parse_extended+0x182
boost::re_detail::basic_regex_parser::parse+0x133
boost::re_detail::basic_regex_implementation::assign+0x98
boost::basic_regex::do_assign+0x14c
boost::basic_regex::assign+0x16
boost::basic_regex::assign+0xa2
boost::basic_regex::basic_regex >+0xb8
boost::re_detail::do_make_u32regex+0x25
boost::make_u32regex+0x2c
Sergey