On 19/07/2024 17:12, Christian Mazakas via Boost wrote:
> On Thu, Jul 18, 2024 at 2:47 PM Niall Douglas via Boost <boost@lists.boost.org> wrote:
>> Instead of over-allocating and wasting a page, I would put the link pointers at the end and slightly reduce the maximum size of the i/o buffer. This is kinda annoying to look at because the max buffer fill is no longer a power of two, but in terms of efficiency it's the right call.
> Hey, this is actually a good idea. I had similar thoughts when I was designing it.
>
> I can give benchmarking it a shot and see what the results are.
It'll be a bit faster due to reduced TLB pressure :)
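Concretely, the sort of layout I mean is below. A sketch only, assuming 4 KiB pages and 64-bit pointers; the names are illustrative, not anything from the library:

```cpp
#include <cstddef>

// Sketch: one page per i/o buffer, with the intrusive link pointers
// stored in the tail of the page rather than in a separately allocated
// node. The usable fill is no longer a power of two, but each buffer
// costs exactly one page and therefore one TLB entry.
struct alignas(4096) io_buffer
{
    static constexpr std::size_t page_size = 4096;

    struct links_t
    {
        io_buffer* prev;
        io_buffer* next;
    };

    static constexpr std::size_t max_fill = page_size - sizeof(links_t);

    unsigned char data[max_fill];  // what gets handed to io_uring
    links_t links;                 // free/pending list linkage
};

static_assert(sizeof(io_buffer) == io_buffer::page_size);
```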
> What kind of benchmark do you think would be the best test here? I suppose one thing I should try is a multishot recv benchmark with many small buffers and a large amount of traffic to send. Probably just max out the size of a buf_ring, which is only like 32k buffers anyway.
>
> Ooh, we can even try page-aligning the buffers too.
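For reference, registering a provided-buffer ring of page-aligned buffers and arming a multishot recv against it looks roughly like the below with a recent liburing. The sizes, group id and names are made up for illustration, and error handling is elided:

```cpp
#include <liburing.h>
#include <cstddef>
#include <cstdlib>

enum : unsigned { BGID = 0, NBUFS = 4096, BUFSZ = 4096 };

// Register a ring of NBUFS page-aligned provided buffers as buffer
// group BGID, then arm a multishot recv that draws from that group.
io_uring_buf_ring* arm_multishot_recv(io_uring* ring, int sockfd,
                                      unsigned char** pool_out)
{
    int err = 0;
    io_uring_buf_ring* br = io_uring_setup_buf_ring(ring, NBUFS, BGID, 0, &err);
    if(br == nullptr)
        return nullptr;

    void* mem = nullptr;
    if(posix_memalign(&mem, 4096, std::size_t(NBUFS) * BUFSZ) != 0)
        return nullptr;
    auto* pool = static_cast<unsigned char*>(mem);

    for(unsigned short bid = 0; bid < NBUFS; bid++)
        io_uring_buf_ring_add(br, pool + std::size_t(bid) * BUFSZ, BUFSZ,
                              bid, io_uring_buf_ring_mask(NBUFS), bid);
    io_uring_buf_ring_advance(br, NBUFS);

    // One SQE, many CQEs: each completion carries a buffer id from the
    // group in cqe->flags.
    io_uring_sqe* sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv_multishot(sqe, sockfd, nullptr, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = BGID;
    io_uring_submit(ring);

    *pool_out = pool;
    return br;
}
```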
The first one I always start with is "how much bandwidth can I transfer using a single kernel thread?" The second is how small a write quantum I can use and still max out bandwidth from a single kernel thread. It's not dissimilar to tuning for file i/o: there is a bandwidth-latency tradeoff, and latency is proportional to the i/o quantum. If you can get the i/o quantum down without overly affecting bandwidth, that has huge beneficial effects on i/o latency, particularly in terms of a nice flat-ish latency distribution.
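As a rough shape for that second measurement, the sort of sweep I mean is below. Illustrative only: plain blocking sockets rather than io_uring, a hypothetical already-connected fd, error handling elided:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <sys/socket.h>

// For each i/o quantum, time how long one thread takes to push a fixed
// number of bytes through a connected socket. Bandwidth should stay
// flat as the quantum shrinks, until it doesn't; the knee is the
// smallest quantum which still maxes out the link.
void sweep_write_quantum(int fd)
{
    constexpr std::size_t total_bytes = std::size_t(1) << 30;  // 1 GiB per run
    for(std::size_t quantum = std::size_t(1) << 20; quantum >= 512; quantum /= 2)
    {
        std::vector<char> buf(quantum, 'x');
        auto begin = std::chrono::steady_clock::now();
        std::size_t sent = 0;
        while(sent < total_bytes)
        {
            ssize_t n = ::send(fd, buf.data(), buf.size(), 0);
            if(n <= 0)
                return;  // error handling elided
            sent += std::size_t(n);
        }
        std::chrono::duration<double> secs =
            std::chrono::steady_clock::now() - begin;
        std::printf("quantum %8zu: %.1f MB/s\n", quantum,
                    double(sent) / 1e6 / secs.count());
    }
}
```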
>> Surely for reading you want io_uring to tell you the buffers, and when you're done, you immediately push them back to io_uring? So no need to keep buffer lists except for the write buffers?
> You'd think so, but there's no such thing as a free lunch.
>
> When it comes to borrowing the buffers, to do any meaningful work you'll have to either allocate and memcpy the incoming buffers, so you can immediately release them back to the ring, or you risk buffer starvation.
>
> This is because not all protocol libraries are designed to copy their input from you; some require the caller to use stable storage. Beast is like this and I think zlib is too. There's no guarantee across protocol libraries that they'll reliably copy your input for you.
>
> The scheme I chose is one where users own the returned buffer sequence, and this enables nice things like in-place TLS decryption, which I use via Botan. This reminds me: I use Botan to provide a generally much stronger TLS interface than Asio's.
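To make that tradeoff concrete, the two options at a multishot recv completion look roughly like the below. A sketch with made-up names, mirroring the constants from the earlier snippet:

```cpp
#include <liburing.h>
#include <cstddef>
#include <vector>

constexpr unsigned NBUFS = 4096;       // same shape as the earlier snippet
constexpr std::size_t BUFSZ = 4096;

// On each multishot recv completion, either copy the bytes out and hand
// the buffer straight back to the ring, or lend the buffer out and do
// not recycle it until the owner releases it.
void on_recv_cqe(io_uring_buf_ring* br, unsigned char* pool,
                 io_uring_cqe* cqe, bool copy_out)
{
    if(!(cqe->flags & IORING_CQE_F_BUFFER) || cqe->res <= 0)
        return;  // no buffer attached, or error/EOF: handling elided

    unsigned short bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
    unsigned char* buf = pool + std::size_t(bid) * BUFSZ;
    std::size_t len = std::size_t(cqe->res);

    if(copy_out)
    {
        // Option 1: memcpy into stable storage, recycle immediately.
        // The ring can never starve, at the cost of a copy per recv.
        std::vector<unsigned char> stable(buf, buf + len);
        io_uring_buf_ring_add(br, buf, BUFSZ, bid,
                              io_uring_buf_ring_mask(NBUFS), 0);
        io_uring_buf_ring_advance(br, 1);
        // ... feed `stable` to Beast/zlib/etc ...
    }
    else
    {
        // Option 2: the caller owns {buf, len} until they release it.
        // Zero-copy parsing and in-place TLS decryption become possible,
        // at the risk of starving the ring if buffers are held too long.
    }
}
```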
Oh okay. io_uring permits 4096 locked i/o buffers per ring. I put together a bit of C++ metaprogramming which encourages users to release i/o buffers as soon as possible, but if they really want to hang onto a buffer, they can. If we run out of buffers, I stall new i/o until new buffers appear. I also keep per-op TSC counts, so if we spend too much time stalling new i/o, the culprits holding onto buffers for too long can easily be identified.

I reckon this is the least worst of the approaches before us: well behaved code gets maximum performance, less well behaved code gets less performance, but everything is reliable.

If you think this model through, the most efficient implementation requires that all work must always be suspend-resumable, because any work can be suspended at any time due to a temporary lack of resources. In other words, completion callbacks won't cut it here.

Niall
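P.S. To illustrate what I mean by suspend-resumable, here is a toy of the stall mechanism using C++20 coroutines. It is not my actual implementation: it assumes a single-threaded event loop and elides cleanup and thread safety.

```cpp
#include <coroutine>
#include <cstddef>
#include <deque>
#include <vector>

// Toy buffer pool: acquire() is awaitable, so an i/o op that finds the
// pool empty suspends (stalls) and is resumed, FIFO, when somebody
// releases a buffer. This only works because every op is a coroutine
// that can be parked at any point; a completion callback has nowhere
// to park.
class buffer_pool
{
    std::vector<std::byte*> free_;                 // available buffers
    std::deque<std::coroutine_handle<>> waiters_;  // stalled ops

public:
    explicit buffer_pool(std::size_t count)
    {
        for(std::size_t n = 0; n < count; n++)
            free_.push_back(new std::byte[4096]);  // never freed: toy only
    }

    auto acquire()
    {
        struct awaitable
        {
            buffer_pool& pool;
            bool await_ready() const noexcept { return !pool.free_.empty(); }
            void await_suspend(std::coroutine_handle<> h)
            {
                pool.waiters_.push_back(h);  // stall this op
            }
            std::byte* await_resume()
            {
                std::byte* b = pool.free_.back();
                pool.free_.pop_back();
                return b;
            }
        };
        return awaitable{*this};
    }

    void release(std::byte* b)
    {
        free_.push_back(b);
        if(!waiters_.empty())
        {
            auto h = waiters_.front();
            waiters_.pop_front();
            h.resume();  // the longest-stalled op wakes and takes the buffer
        }
    }
};
```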