u32regex_search crashes


Anjaly

27 Sep 2007

6:06 a.m.

Hello,

I am new to the Boost.Regex library. The library is built with ICU support, and the user interface is created using the wxWidgets library. I want to search a file using u32regex_search, but the program crashes with the error 'Invalid UTF-8 sequence encountered while trying to encode UTF-32 character'. I read the contents of the file into a char array. The program crashes when it encounters a char with hex value 00.

    wxString s=SearchTxt->GetValue();
    const char *start=(const char*)buffer;
    const char *end=(const char*)buffer+len;
    boost::cmatch what;
    boost::u32regex e1= boost::make_u32regex(s.fn_str(), boost::regex::icase);
    while(boost::u32regex_search(start,end, what,e, boost::match_default))
    {
        cout << what[0] << "\n";
    }

______________________________________ Scanned and protected by Email scanner


John Maddock

27 Sep 2007

4:29 p.m.

The current regression test suite tests strings containing embedded NULs (and yes, with UTF-8 and ICU), so it certainly should work OK.

BTW, if your program is "crashing", it's because you're not catching the thrown exception.

If you think it's a bug, please provide the UTF-8 sequence that causes the failure so I have something to debug.

Regards, John Maddock.
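One way to find the byte sequence John asks for is to scan the buffer for the first byte that cannot be part of well-formed UTF-8. The following is a minimal sketch using only the standard library; the helper name `first_invalid_utf8` is hypothetical and is not part of Boost.Regex or ICU.

```cpp
#include <cstddef>

// Hypothetical helper: returns the offset of the lead byte of the first
// ill-formed UTF-8 sequence in [p, p + len), or len if the buffer is
// well formed. An embedded 0x00 byte is valid (it encodes U+0000).
std::size_t first_invalid_utf8(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = p[i];
        if (b <= 0x7F) { ++i; continue; }        // single-byte (ASCII) sequence

        std::size_t need = 0;                    // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range of the first one
        if (b >= 0xC2 && b <= 0xDF)      { need = 1; }
        else if (b == 0xE0)              { need = 2; lo = 0xA0; }  // no overlong forms
        else if (b >= 0xE1 && b <= 0xEC) { need = 2; }
        else if (b == 0xED)              { need = 2; hi = 0x9F; }  // no surrogates
        else if (b == 0xEE || b == 0xEF) { need = 2; }
        else if (b == 0xF0)              { need = 3; lo = 0x90; }  // no overlong forms
        else if (b >= 0xF1 && b <= 0xF3) { need = 3; }
        else if (b == 0xF4)              { need = 3; hi = 0x8F; }  // max U+10FFFF
        else return i;   // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (len - i <= need) return i;           // sequence truncated at end of buffer
        if (p[i + 1] < lo || p[i + 1] > hi) return i;
        for (std::size_t k = 2; k <= need; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return i;
        i += need + 1;
    }
    return len;
}
```

If the returned offset is smaller than the buffer length, the bytes starting at that offset are the ones worth including in a bug report.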

Anjaly

28 Sep 2007

4:33 a.m.

Hi,

Thank you for your response. I now catch the exception. The program no longer crashes, but the search is incomplete. Even if the file is encoded as UTF-16, the exception occurs (I used a message box to show the reason for the exception). Is the problem due to reading the file into a char array, or due to how the regex is constructed? I have attached the file I am searching. I hope you can help me.

Anjaly
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users


John Maddock

9:29 a.m.

The first byte in the file is 0xFF, which is not a valid UTF-8 byte; likewise the second byte, 0xFE, is never used in UTF-8. So there is no way to decode the file as UTF-8 and convert it to UTF-32.

However, if I start reading from the third byte in the file, the search does go through to the end. I can't guarantee that the content was correct, though!

John.

Jens Seidel

9:43 a.m.

On Fri, Sep 28, 2007 at 10:29:09AM +0100, John Maddock wrote:

The first byte in the file is 0xFF which is not a valid UTF8 character, likewise the second byte is 0xFE which is also not used in UTF8: so there's no way to decode the file and convert to UTF32.

That's the valid byte order mark. See e.g. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Jens

John Maddock

9:53 a.m.

Right: it's a byte order mark for UTF-16LE, but the user is trying to read it as a UTF-8 sequence. If the file is indeed UTF-16LE, then it's up to the user to read it into a sequence of valid UTF-16 code points before passing it to Boost.Regex.

HTH, John.
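The byte order mark check discussed here can be sketched with the standard library alone. This is only an illustration: the names `Bom`, `BomInfo` and `detect_bom` are hypothetical, and a file without a BOM would still need its encoding determined some other way.

```cpp
#include <cstddef>

// Hypothetical types for illustration; not part of Boost.Regex or ICU.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomInfo {
    Bom kind;            // which byte order mark was found, if any
    std::size_t length;  // number of bytes the mark occupies
};

// Inspects the first bytes of a buffer for a Unicode byte order mark.
// The UTF-32 marks are tested first because the UTF-32LE mark
// (FF FE 00 00) begins with the UTF-16LE mark (FF FE). A UTF-16LE file
// whose first character is U+0000 is therefore misreported as UTF-32LE.
BomInfo detect_bom(const unsigned char* p, std::size_t len) {
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {Bom::Utf32LE, 4};
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {Bom::Utf32BE, 4};
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {Bom::Utf8, 3};
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {Bom::Utf16LE, 2};
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {Bom::Utf16BE, 2};
    return {Bom::None, 0};
}
```

For the attached file, which starts with 0xFF 0xFE, such a check would report UTF-16LE with a two-byte mark, which matches the observation that reading from the third byte onward lets the search run.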

Anjaly

1 Oct 2007

3:52 a.m.

The regex documentation says that the size of the data type passed to make_u32regex determines the character encoding (UTF-8, UTF-16 or UTF-32). I passed wchar_t (whose size I think is 4) so that u32regex_search would treat the buffer as UTF-8 regardless. Actually, I am trying to do a UTF-8 search.

Anjaly G S


John Maddock

8:42 a.m.

Anjaly wrote:

In the regex document it was said that the size of data type of the variable passed to the make_u32regex that determines character encoding (utf8,utf16 or utf32) .

*For construction of the regex object*. The search algorithms operate independently on any of UTF-8/16/32.

I passed wchar_t (which i think size is 4) so that the buffer encoding is considered as utf8 by u32regex_search irrespectively. Actually i am trying to do a utf8 search.

Except the data file you sent *was not valid UTF-8*! It looks like it's probably UTF-16LE; in that case it's up to you to decode the byte order mark and read the text into something that Boost.Regex can handle (for example, platform-native UTF-16). ICU should have some file I/O routines for doing that kind of thing, for example for loading a file into a UnicodeString type.

HTH, John.
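As an illustration of reading the bytes into platform-native UTF-16, the sketch below converts a UTF-16LE byte buffer into a `std::u16string` by hand. It assumes C++11 and uses only the standard library; the function name `decode_utf16le` is hypothetical, and ICU's own conversion routines, which John points to, are likely the more robust choice in a real program.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts raw UTF-16LE bytes into native 16-bit
// code units, skipping a leading FF FE byte order mark if present.
// Surrogate pairs are copied through unchanged as two code units, and
// a trailing odd byte (a truncated code unit) is ignored.
std::u16string decode_utf16le(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        i = 2;                                   // skip the byte order mark

    std::u16string out;
    out.reserve((len - i) / 2);
    for (; i + 1 < len; i += 2) {
        // Little-endian: low byte first, high byte second.
        out.push_back(static_cast<char16_t>(p[i] | (p[i + 1] << 8)));
    }
    return out;
}
```

Going by John's remark that the search algorithms operate on UTF-8, UTF-16 or UTF-32 input, the begin and end of the resulting string would then be the range to search, in place of the original char pointers.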

Anjaly

9:02 a.m.

I am sorry, my last message contained a mistake. I meant to say that I want to do a search that treats all the data as though it is UTF-32 rather than UTF-8 (as I incorrectly wrote). I don't know whether I am making myself clear (I am not very good at expressing myself). What I really want to do is a Unicode search on the available data.

Anjaly G S


John Maddock

3:10 p.m.

Right, but if that data is in a file, then first you need to read it into memory so that it's in a well-defined in-memory encoding. You didn't say how you were reading the file you sent, but ICU has some APIs here: http://www.icu-project.org/apiref/icu4c/ustdio_8h.html that assist with correctly reading and writing Unicode data to and from files.

John.
