Andrey Semashev wrote:
On 07/10/17 19:57, Phil Endecott via Boost wrote:
Vinnie Falco wrote:
The utf-8 validation of text websocket message payloads is a critical bottleneck.
I thought I'd have a look at that. Your code has a fast-path that attempts to check 8 bytes at a time by masking with 0x8080808080808080. I agree that that is probably a useful optimisation which a compiler is unlikely to apply by itself.
I wouldn't be so quick to agree. Modern compilers tend to be able to vectorize simple loops quite efficiently.
Not this one (I've tried it).
Your implementation ( https://github.com/vinniefalco/Beast/blob/6f88f0182d461e9cabe55032f966c9ca4f... ) does this:
if((*reinterpret_cast<std::size_t const*>(in) & mask) != 0)
So you're reinterpret_casting a char* to a size_t* and then dereferencing it. Isn't that undefined behaviour?
C++ allows chars and unsigned chars (which std::uint8_t presumably is) to alias other types
My understanding is that you can reinterpret_cast FROM other types TO chars, but not FROM chars TO other types. But there are surely people here who understand this stuff better than me.
BTW, the fast loop ends on the first byte >=0x80 and then the validation continues in the slow loop until the end of input. It might be better to resume the fast loop if validation of the "vector" passes. Otherwise, for example, and early comment in the input could ruin the performance.
Yes, I also noticed that; quite a common case in en_GB would be to have mostly-ASCII with occasional currency symbols. Vinnie Falco wrote:
So you're reinterpret_casting a char* to a size_t* and then dereferencing it. Isn't that undefined behaviour?
Which platform do you think this doesn't work on?
Wrong question. It's likely that it works on current compilers on current platforms. But if I'm right that it is undefined behaviour then it might stop working with some future compiler update somewhere.
The reinterpret_cast<> can be trivially changed to std::memcpy:
std::size_t temp; std::memcpy(&temp, in, sizeof(temp if((temp & mask) != 0)
Yes, I believe that's the right thing to do. Regarding the relative performance of different UTF-8 checking code: I've done a quick test by hacking some of my own UTF-8 decoding code to just validate, and I see a factor of about 4 improvement between the byte-at-a-time and an optimised version that checks the top bits of 8 bytes at a time. The code is below. I measure a throughput of about 1.6 GB/second for ASCII-only input, on my ARMv8 test system. I'd be interested to hear how that compares to the other implementations; a very quick test suggests that it might be a bit quicker than the Beast code but I'd like someone else to confirm or deny. (It might also be wrong - I've not tested it with non-ASCII input.) Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is maybe 5 times as long. Code follows. Regards, Phil. template <typename ITER> bool is_valid_utf8(ITER i, ITER end) { // Check if range is valid and complete UTF-8. // (Maybe incomplete input, i.e. ending in the middle of a multi-byte character, needs // to be reported differently?) while (i != end) { // If i is suitably aligned, do a fast word-at-a-time check for ASCII characters. // FIXME this only works if ITER is a contiguous iterator; it needs a "static if". const char* p = &(*i); const char* e = p + (end-i); // I don't think &(*end) is allowed because it appears to dereference end. unsigned long int w; // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor. if (reinterpret_cast<uintptr_t>(p) % sizeof(w) == 0) { while (p+sizeof(w) <= e) { memcpy(&w,p,sizeof(w)); if (w & 0x8080808080808080) break; // If any of the top bits are set, fall back to the // byte-at-a-time code below. // (Is there a better way to write the mask value that would work // for e.g. 128-bit ints? Is that expression OK for 32-bit ints?) p += sizeof(w); i += sizeof(w); } if (p == e) break; } uint8_t b0 = *i++; if ((b0 & 0x80) == 0) continue; // Single byte chars are 0xxxxxxx if (i == end) return false; // Incomplete uint8_t b1 = *i++; if ((b1 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx if ((b0 & 0xe0) == 0xc0) continue; // Two-byte chars start 110xxxxx if (i == end) return false; // Incomplete uint8_t b2 = *i++; if ((b2 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx if ((b0 & 0xf0) == 0xe0) continue; // Three-byte chars start 1110xxxx if (i == end) return false; // Incomplete uint8_t b3 = *i++; if ((b3 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx if ((b0 & 0xf8) == 0xf0) continue; // Four-byte chars start 11110xxx return false; // First byte is invalid, i.e. 10xxxxxx or 11111xxx } return true; }