Re: [boost] [beast] Formal review

11 Jul 2017

      Andrey Semashev wrote:
...
On 07/10/17 19:57, Phil Endecott via Boost wrote:
...
Vinnie Falco wrote:
...
The utf-8 validation of text websocket message payloads is a critical
bottleneck.
I thought I'd have a look at that.  Your code has a fast-path that
attempts to check 8 bytes at a time by masking with 0x8080808080808080.
I agree that that is probably a useful optimisation which a compiler is
unlikely to apply by itself.
I wouldn't be so quick to agree. Modern compilers tend to be able to 
vectorize simple loops quite efficiently.
Not this one (I've tried it).
...
...
Your implementation
( 
https://github.com/vinniefalco/Beast/blob/6f88f0182d461e9cabe55032f966c9ca4f... 
) does this:
if((*reinterpret_cast<std::size_t const*>(in) & mask) != 0)
So you're reinterpret_casting a char* to a size_t* and then dereferencing
it.  Isn't that undefined behaviour?
C++ allows chars and unsigned chars (which std::uint8_t presumably is) 
to alias other types
My understanding is that you can reinterpret_cast FROM other types TO 
chars, but not FROM chars TO other types.  But there are surely people 
here who understand this stuff better than me.
...
BTW, the fast loop ends on the first byte >=0x80 and then the validation 
continues in the slow loop until the end of input. It might be better to 
resume the fast loop if validation of the "vector" passes. Otherwise, 
for example, and early comment in the input could ruin the performance.
Yes, I also noticed that; quite a common case in en_GB would be to have 
mostly-ASCII with occasional currency symbols.

Vinnie Falco wrote:
...
...
So you're reinterpret_casting a char* to a size_t* and then dereferencing
it.  Isn't that undefined behaviour?
Which platform do you think this doesn't work on?
Wrong question.  It's likely that it works on current compilers on current 
platforms.  But if I'm right that it is undefined behaviour then it might 
stop working with some future compiler update somewhere.
...
The reinterpret_cast<> can be trivially changed to std::memcpy:
std::size_t temp;
    std::memcpy(&temp, in, sizeof(temp
    if((temp & mask) != 0)
Yes, I believe that's the right thing to do.

Regarding the relative performance of different UTF-8 checking code:

I've done a quick test by hacking some of my own UTF-8 decoding code to 
just validate, and I see a factor of about 4 improvement between the 
byte-at-a-time and an optimised version that checks the top bits of 8 bytes 
at a time.  The code is below.  I measure a throughput of about 1.6 GB/second 
for ASCII-only input, on my ARMv8 test system.  I'd be interested to hear how 
that compares to the other implementations; a very quick test suggests that 
it might be a bit quicker than the Beast code but I'd like someone else to 
confirm or deny.  (It might also be wrong - I've not tested it with non-ASCII 
input.)

Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is 
maybe 5 times as long.  Code follows.

Regards, Phil.

template <typename ITER>
bool is_valid_utf8(ITER i, ITER end)
{
  // Check if range is valid and complete UTF-8.
  // (Maybe incomplete input, i.e. ending in the middle of a multi-byte character, needs
  // to be reported differently?)

  while (i != end) {

    // If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
    // FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
    const char* p = &(*i);
    const char* e = p + (end-i);  // I don't think &(*end) is allowed because it appears to dereference end.
    unsigned long int w;          // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor.
    if (reinterpret_cast<uintptr_t>(p) % sizeof(w) == 0) {
      while (p+sizeof(w) <= e) {
        memcpy(&w,p,sizeof(w));
        if (w & 0x8080808080808080) break;  // If any of the top bits are set, fall back to the
                                            // byte-at-a-time code below.
                                            // (Is there a better way to write the mask value that would work
                                            // for e.g. 128-bit ints?  Is that expression OK for 32-bit ints?)
        p += sizeof(w);
        i += sizeof(w);
      }
      if (p == e) break;
    }

    uint8_t b0 = *i++;
    if ((b0 & 0x80) == 0) continue;         // Single byte chars are 0xxxxxxx
    if (i == end) return false;             // Incomplete
    uint8_t b1 = *i++;
    if ((b1 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xe0) == 0xc0) continue;      // Two-byte chars start 110xxxxx
    if (i == end) return false;             // Incomplete
    uint8_t b2 = *i++;
    if ((b2 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xf0) == 0xe0) continue;      // Three-byte chars start 1110xxxx
    if (i == end) return false;             // Incomplete
    uint8_t b3 = *i++;
    if ((b3 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xf8) == 0xf0) continue;      // Four-byte chars start 11110xxx
    return false;                           // First byte is invalid, i.e. 10xxxxxx or 11111xxx

  }
  return true;
}