Re: [boost] Reimplementing ASIOs

16 Jul 2024

On 16/07/2024 21:29, Christian Mazakas via Boost wrote:
...
On Tue, Jul 16, 2024 at 12:35 PM Niall Douglas via Boost
...
But ... I don't agree with hard coding in C++ coroutines personally. I
think Sender-Receiver (before WG21 corrupted it) is a better design
choice here especially as if within a C++ coroutine you can co_await and
it'll "just work" without any extra effort.
This is interesting. Asio was developed when there was no standardized
concurrency primitive in C++. We now have one: C++20 coroutines. To me,
the universal completion token stuff was a lot of try-hard and template
bloat for a feature that wasn't worth its weight. But at the time, we
didn't know better because no one was doing this kind of stuff.
I think in hindsight, the universal completion token was a mistake.
Maybe Sender/Receiver abuses all that ADL to avoid introducing templates
here, but I'm hesitant to un-hardcode myself from coroutines because,
being realistic, I imagine most C++ users really just wanna
`co_await some_socket_recv();`.
That's exactly what S&R delivers!

WG21 S&R has very severe template bloat. Some people see compile times 
reminiscent of Boost at its worst in the late 2000s. But non-WG21 S&R 
can be implemented in a much lighter weight way. I made mine ABI stable, 
and that forces most of the template bloat to not exist.
...
...
I see in your github repo you are benching against ASIO. What kinds of
results did you get?
Pretty alright.
I have benchmarks that attempt to measure both latency and throughput
and in general, I'm like 1.75x faster than Asio, almost 2x. This
includes built-in timeouts, so I use Beast's tcp_stream for this purpose.
I guess this affects the latency-based benchmark more, but for the
throughput one, io_uring's batched I/O and handling of it really starts
to shine.
You're not using the linked op timeout feature of io_uring?

It's a bit expensive TBH. I've 'cheated' and set a timeout directly on 
the socket itself so it errors out after a while. This is nasty, but fast :)
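
A minimal sketch of what "a timeout directly on the socket itself" could
look like, assuming the POSIX SO_RCVTIMEO option (the exact option is not
named above, so this is a guess at the technique, not a description of the
actual code). The helper names are hypothetical, and the demo uses a plain
blocking recv() on a socketpair rather than an io_uring submission.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: set a receive timeout on fd via SO_RCVTIMEO.
// Returns true if setsockopt() succeeded.
bool set_recv_timeout(int fd, long millis) {
    timeval tv{};
    tv.tv_sec = millis / 1000;
    tv.tv_usec = (millis % 1000) * 1000;
    return ::setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) == 0;
}

// Demo: recv() on a connected socket that never gets data.
// Returns the errno observed once the timeout fires, 0 if recv()
// unexpectedly succeeded, or -1 if setup failed.
int recv_errno_after_timeout(long millis) {
    int fds[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
    int err = -1;
    if (set_recv_timeout(fds[0], millis)) {
        char buf[16];
        ssize_t n = ::recv(fds[0], buf, sizeof buf, 0);  // blocks until timeout
        err = (n < 0) ? errno : 0;  // capture errno before close() can change it
    }
    ::close(fds[0]);
    ::close(fds[1]);
    return err;
}
```

With a 50 ms timeout and no peer data, recv() fails with EAGAIN or
EWOULDBLOCK once the timeout expires, so the caller sees an ordinary
error completion rather than needing a separate timer operation.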
...
Anything where you can use multishot recv() effectively means you're
going to shred Asio or other readiness-based models.
That plus the DMA registered buffers support. ASIO could support the
older form, which didn't deliver much speedup, but the new form where
io_uring/the NIC allocates the receive buffers for you ... it's Windows
RIO levels of fast. I certainly can saturate a 40 Gbps NIC from a single
kernel thread without much effort now, and a 100 Gbps NIC if you can keep
the i/o granularity big enough. That was expensive Mellanox userspace
TCP-type performance a few years ago.

Niall