On 16/07/2024 21:29, Christian Mazakas via Boost wrote:
On Tue, Jul 16, 2024 at 12:35 PM Niall Douglas via Boost wrote:
But ... I don't agree with hard-coding in C++ coroutines personally. I think Sender-Receiver (before WG21 corrupted it) is a better design choice here, especially as, if you're within a C++ coroutine, you can co_await it and it'll "just work" without any extra effort.
This is interesting. Asio was developed when there was no standardized concurrency primitive in C++. We now have one: C++20 coroutines. To me, the universal completion token stuff was a lot of try-hard and template bloat for a feature that wasn't worth its weight. But at the time, we didn't know better because no one was doing this kind of stuff.
I think in hindsight, the universal completion token was a mistake. Maybe Sender/Receiver abuses all that ADL to avoid introducing templates here, but I'm hesitant to un-hardcode myself from coroutines because, being realistic, I imagine most C++ users really just wanna `co_await some_socket_recv();`.
That's exactly what S&R delivers! WG21 S&R has very severe template bloat. Some people see compile times reminiscent of Boost at its worst in the late 2000s. But non-WG21 S&R can be implemented in a much lighter weight way. I made mine ABI stable, and that forces most of the template bloat to not exist.
I see in your GitHub repo you are benching against ASIO. What kinds of results did you get?
Pretty alright.
I have benchmarks that attempt to measure both latency and throughput and in general, I'm like 1.75x faster than Asio, almost 2x. This includes built-in timeouts, so I use Beast's tcp_stream for this purpose. I guess this affects the latency-based benchmark more, but for the throughput one, io_uring's batched I/O and handling of it really starts to shine.
You're not using the linked op timeout feature of io_uring? It's a bit expensive TBH. I've 'cheated' and set a timeout directly on the socket itself so it errors out after a while. This is nasty, but fast :)
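For anyone following along, one possible reading of "a timeout directly on the socket itself" is the POSIX `SO_RCVTIMEO` socket option. The sketch below is only an illustration under that assumption; the message does not say which option was actually used, and the sketch does not establish how io_uring's own recv operations treat it. The helper names `set_recv_timeout` and `recv_errno_on_idle_socket` are made up for this example, and the demo uses a plain blocking `recv()` on a local socket pair.

```cpp
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper: put a receive timeout on the socket itself.
// Assumes SO_RCVTIMEO is the option meant above (not confirmed by the thread).
bool set_recv_timeout(int fd, long millis) {
    assert(millis >= 0);
    timeval tv{};
    tv.tv_sec = millis / 1000;
    tv.tv_usec = (millis % 1000) * 1000;
    return ::setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Hypothetical demo: a blocking recv() on a socket that never receives data.
// Returns the errno seen by recv(), 0 if recv() unexpectedly succeeded,
// or -1 if the setup failed.
int recv_errno_on_idle_socket(long millis) {
    int fds[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

    int result = -1;
    if (set_recv_timeout(fds[0], millis)) {
        char buf[16];
        ssize_t n = ::recv(fds[0], buf, sizeof(buf), 0);
        result = (n < 0) ? errno : 0;  // capture errno before close() can change it
    }
    ::close(fds[0]);
    ::close(fds[1]);
    return result;
}
```

Calling `recv_errno_on_idle_socket(50)` should come back after roughly 50 ms with `EAGAIN`/`EWOULDBLOCK`, which is the "errors out after a while" behaviour. The trade-off is that the timeout applies to every receive on that socket rather than to a single operation, which is presumably why it counts as nasty.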
Anything where you can use multishot recv() effectively means you're going to shred Asio or other readiness-based models.
That plus the DMA registered buffers support. ASIO could support the older form which didn't deliver much speedup, but the new form where io_uring/the NIC allocates the receive buffers for you ... it's Windows RIO levels of fast. I certainly can saturate a 40 Gbps NIC from a single kernel thread without much effort now, and a 100 Gbps NIC if you can keep the i/o granularity big enough. That was expensive Mellanox userspace TCP type performance a few years ago.

Niall