On Thu, Jul 18, 2024 at 2:47 PM Niall Douglas via Boost <boost@lists.boost.org> wrote:
> Instead of over-allocating and wasting a page, I would put the link pointers at the end and slightly reduce the maximum size of the i/o buffer. This kinda is annoying to look at because the max buffer fill is no longer a power of two, but in terms of efficiency it's the right call.
Hey, this is actually a good idea. I had similar thoughts when I was designing it. I can give benchmarking it a shot and see what the results are. What kind of benchmark do you think would be the best test here? I suppose one thing I should try is a multishot recv benchmark with many small buffers and a large amount of traffic to send. Probably just max out the size of a buf_ring, which is only like 32k buffers anyway. Ooh, we can even try page-aligning the buffers too.
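Roughly, the layout I'd benchmark is something like this (just a sketch; the names and the 4 KiB / 64-bit assumptions are illustrative):

    // sketch: keep each i/o buffer plus its intrusive link in one page,
    // so nothing is over-allocated and no page is wasted on the links.
    // assumes 64-bit pointers and 4 KiB pages.
    #include <cstddef>

    inline constexpr std::size_t page_size = 4096;

    struct buf_hook
    {
        buf_hook* next = nullptr;
        buf_hook* prev = nullptr;
    };

    struct io_buf
    {
        // max fill is no longer a power of two, as noted above
        static constexpr std::size_t capacity = page_size - sizeof(buf_hook);

        unsigned char data[capacity];
        buf_hook      hook; // link pointers live at the tail of the page
    };

    static_assert(sizeof(io_buf) == page_size);

The benchmark would then just compare this against the current over-allocating layout.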
> Surely for reading you want io_uring to tell you the buffers, and when you're done, you immediately push them back to io_uring? So no need to keep buffer lists except for the write buffers?
You'd think so, but there's no such thing as a free lunch.

When it comes to borrowing the buffers, to do any meaningful work you'd have to either allocate and memcpy the incoming buffers so you can immediately release them back to the ring, or you risk buffer starvation. This is because not all protocol libraries copy their input; some require the caller to provide stable storage. Beast is like this and I think zlib is too. There's simply no guarantee across protocol libraries that they'll reliably copy your input for you.

The scheme I chose is one where users own the returned buffer sequence, and this enables nice things like in-place TLS decryption, which I use via Botan. This reminds me: I use Botan in order to provide a generally much stronger TLS interface than Asio's.

I've experimented with routines that recycle the owned buffers but honestly, it's faster to just re-allocate the holes in the buf_ring in `recv_awaitable::await_resume()`. Benchmarks show a small hit to perf, but I think it's an acceptable trade-off here as I now have properly working TLS/TCP streams, which is kind of all that matters.
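The hole re-allocation amounts to roughly the following (a sketch against liburing's provided-buffer API; the function and parameter names are illustrative, not the library's actual internals):

    // sketch: after handing N completed buffers to the caller, plug the N
    // holes in the buf_ring with freshly allocated buffers so the kernel
    // never starves for provided buffers
    #include <liburing.h>
    #include <vector>

    void refill_holes(io_uring_buf_ring* br,
                      std::vector<unsigned short> const& completed_bids,
                      unsigned num_bufs, unsigned buf_size)
    {
        int const mask = io_uring_buf_ring_mask(num_bufs);
        int offset = 0;
        for (auto bid : completed_bids) {
            auto* p = new unsigned char[buf_size]; // re-allocate the hole
            io_uring_buf_ring_add(br, p, buf_size, bid, mask, offset++);
        }
        // publish the new buffers to the kernel in one shot
        io_uring_buf_ring_advance(br, offset);
    }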
On Thu, Jul 18, 2024 at 4:28 PM Virgilio Fornazin via Boost <boost@lists.boost.org> wrote:

> The linux kernel code for sendmmsg/recvmmsg is just a for loop, and avoiding the cost of a syscall traversing from ring 3 to ring 0 (ring 1 when virtualized) is something that really pays off in high-performance UDP networking. If you consider something like this, it would be a big win for high packet I/O use in UDP.
As Niall previously noted, you don't need recvmmsg() with io_uring. The point of recvmmsg() was to avoid syscall overhead, which io_uring already solves via bulk submission and bulk reaping of completions, and then via multishot recvmsg(). Multishot recvmsg() will definitely be fast enough, I confidently say while measuring nothing.

I was torn after completing an MVP of TLS/TCP: do I add UDP or file I/O? Unfortunately, I chose file I/O, because what's the point of an io_uring runtime if it doesn't even offer async file I/O? This conversation makes me realize that I should've just chosen UDP lol.
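For reference, arming a single multishot recv against a provided-buffer group looks roughly like this (a sketch of plain liburing usage, not my library's API; error handling elided):

    // sketch: one SQE, many CQEs -- the kernel keeps posting completions
    // until the request is cancelled or the buffer group runs dry
    #include <liburing.h>

    void arm_multishot_recv(io_uring* ring, int sockfd, unsigned short bgid)
    {
        io_uring_sqe* sqe = io_uring_get_sqe(ring);

        io_uring_prep_recv_multishot(sqe, sockfd, nullptr, 0, 0);

        // pull each completion's buffer from the registered buf_ring group
        io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT);
        sqe->buf_group = bgid;

        io_uring_submit(ring);
    }

- Christian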