On 17/07/2024 18:17, Christian Mazakas via Boost wrote:
That plus the DMA registered buffers support. ASIO could support the older form, which didn't deliver much speedup, but the new form, where io_uring/the NIC allocates the receive buffers for you ... it's Windows RIO levels of fast. I can certainly saturate a 40 Gbps NIC from a single kernel thread without much effort now, and a 100 Gbps NIC if you can keep the i/o granularity big enough. A few years ago, that was expensive Mellanox userspace-TCP levels of performance.
I'm not sure I know what you're talking about here, to be honest. I know io_uring has registered buffers for file I/O, and I know that you can also use the provided-buffers API for multishot recv() and multishot read() (i.e. `io_uring_register_buffers()` and `io_uring_setup_buf_ring()`).
This is confusing to me because these two functions don't really allocate. _You_ allocate the buffers and then register them with the ring. So I'm curious about this NIC that allocates receive buffers for me.
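Just so we're on the same page, the flow I have in mind is the one from the liburing man pages, roughly this (a minimal sketch; the buffer count, sizes, and group id are arbitrary):

#include <liburing.h>
#include <cstdlib>

io_uring ring;
io_uring_queue_init(64, &ring, 0);

// set up a provided-buffer ring of 8 buffers in group id 0...
int err = 0;
io_uring_buf_ring* br = io_uring_setup_buf_ring(&ring, 8, /*bgid=*/0, 0, &err);

void* bufs[8];
for (unsigned short bid = 0; bid < 8; ++bid) {
    bufs[bid] = std::malloc(4096);  // ...backed by memory *we* allocated...
    io_uring_buf_ring_add(br, bufs[bid], 4096, bid,
                          io_uring_buf_ring_mask(8), bid);
}
io_uring_buf_ring_advance(br, 8);  // ...and hand ownership to the ring

Nowhere in that does the kernel or the NIC hand me memory.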
Fwiw, Fiona does actually use multishot TCP recv(), so it does use the buf_ring stuff. This has interesting API implications because in the epoll world, users are accustomed to:
co_await socket.async_recv(my_buffer);
But in Fiona, you instead have:
auto m_buf_sequence = co_await socket.async_recv();
return std::move(m_buf_sequence).value();
Ownership of the buffers is inverted here, which actually turns out to be quite the API break.
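To make the inversion concrete, a receive loop ends up shaped something like this (a sketch only; everything past async_recv() is illustrative, not necessarily Fiona's real API):

for (;;) {
    // the ring owns the buffers; recv() hands back whichever ones it filled
    auto m_buf_sequence = co_await socket.async_recv();
    if (m_buf_sequence.has_error())
        break;
    for (auto buf : m_buf_sequence.value())
        process(buf);  // hypothetical consumer of the received bytes
    // buffers recycle back to the ring once the sequence is released
}

You never lend the socket a buffer; it lends you one, and you have to give it back.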
Once I get the code into better shape, I'd like to start shilling it, but who knows if it'll ever catch on.
Yes, you're already using the thing I was referring to: the "ring provided buffers" feature, via the API io_uring_register_buf_ring. You're right that its docs present the feature as userspace allocating pages from the kernel, then giving ownership of those pages to io_uring, which fills them with received data as it chooses and hands ownership back to userspace. That's how it appears from userspace, anyway. If I were the kernel, I'd free the backing store for the pages handed to me and repoint the virtual memory addresses at pages coming off the NIC's DMA.

It depends on the NIC: some can address all of memory, some a subset, some barely any at all. High end NICs would be very efficient, occasional memory copying might be needed for prosumer NICs, and for cheap and nasty NICs incapable of more than a 64KB window ... well, you kinda have to copy memory there.

Anyway, the point is that having the kernel tell you which buffers it filled, instead of you telling it which buffers to fill, is the right design. This is why LLFIO's read op allows the buffers filled by a read to be completely different from the buffers supplied, incidentally.

Niall
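P.S. The "kernel tells you" shape falls straight out of the CQE flags. A minimal sketch of consuming a multishot recv with provided buffers, assuming a connected socket sockfd plus the ring and the br/bufs buffer group from the setup snippet earlier in the thread:

// arm a multishot recv that picks its buffers from group id 0
io_uring_sqe* sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv_multishot(sqe, sockfd, nullptr, 0, 0);
sqe->buf_group = 0;
io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT);
io_uring_submit(&ring);

io_uring_cqe* cqe = nullptr;
io_uring_wait_cqe(&ring, &cqe);
if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
    // the kernel tells us which buffer it filled, not vice versa
    unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
    // ... consume cqe->res bytes from bufs[bid], then give it back:
    io_uring_buf_ring_add(br, bufs[bid], 4096, bid,
                          io_uring_buf_ring_mask(8), 0);
    io_uring_buf_ring_advance(br, 1);
}
io_uring_cqe_seen(&ring, cqe);

If IORING_CQE_F_MORE is clear you re-arm the recv; either way the buffers flow kernel-to-userspace-to-kernel, which is exactly the ownership inversion Christian describes.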