On Wed, 11 May 2022 at 10:19, Dominique Devienne via Boost wrote:

On Tue, May 10, 2022 at 10:54 PM Richard Hodges via Boost wrote:

It is worth remembering that because this is MySQL, nothing can be done with the connection object until the resultset has been fully read.
To be clear, this is what the protocol for a query (or a prepared statement execution) looks like. Every "message" in the diagrams below is a 4-byte header plus a payload.

    client                     server
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one

When sending two queries, you can follow a strictly sequential model. Let's call this model a). It is the one implemented by this library:

    client                     server
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one

You could also initiate the next query without completely reading all the packets sent by the server. That would look like this:

    client                     server
      --- query request 1 --->
      --- query request 2 --->
      <----- query 1 OK ----
      <----- metadata 1 ----
      <----- rows 1 --------
      <----- EOF 1 ---------
      <----- query 2 OK ----
      <----- metadata 2 ----
      <----- rows 2 --------
      <----- EOF 2 ---------

This is possible at the protocol level. Note that the server won't do any special handling here: it will process the two queries sequentially. It will read the first one, process it, send all the response packets, then repeat for the second one.

This second model, b), is *currently not possible* with the library's interface. It would require a batch interface like:

    serializer sr;
    sr.add_query("SELECT * FROM table1");
    sr.add_query("SELECT * FROM table2");
    connection.write(sr);
    resultset r1 = connection.read_query_result();
    // Now you *MUST* read all the rows in this resultset
    // before moving on to the next one
    resultset r2 = connection.read_query_result();
    // Same for r2

Note that this is different from a proper pipeline mode, as described by Postgres. I see two major differences:

1) In a real pipeline mode, the server processes the queries in batch. Here, the server still processes the queries sequentially.
2) Pipeline modes usually specify an option on what to do when a query in the pipeline fails. Here, you don't have that: subsequent queries will be executed regardless of the result of previous ones.

If you think this interface should be provided, please let me know, and I will raise the relevant issues.

More recent versions of MySQL (v8.x) include a plugin with a new version of the protocol, called the X protocol. I'm not an expert on it; AFAIK it was created with the idea of using MySQL as a document store, but it can also be used for regular SQL operations. The documentation is here: https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_mysqlx_protocol.html

This protocol does have a pipeline mode. Please note that this library does *NOT* implement the X protocol. There are two problems with it:

1) It employs Google's protobufs as its message format, which creates licensing conflicts with anything submitted to Boost.
2) It only targets MySQL 8+, not MariaDB or MySQL v5.x.
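To make model a) concrete, here is a minimal synchronous sketch of running two queries back to back. The member names follow the comments in the diagrams above (connection::query, resultset::read_one); the exact signatures, in particular what read_one returns, are my assumptions, not verbatim API:

    // Sketch only. Assume conn is a connected, handshaked
    // boost::mysql::tcp_connection. Assume read_one() yields
    // rows until the trailing EOF packet has been read.
    auto r1 = conn.query("SELECT * FROM table1");  // query request + OK + metadata
    while (auto row = r1.read_one())               // rows ... EOF
    {
        // process *row
    }

    // Only now, with r1 fully read, is the connection free again.
    auto r2 = conn.query("SELECT * FROM table2");
    while (auto row = r2.read_one()) { /* process *row */ }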
In the classic protocol (the one implemented by this library), there are a couple of extra concepts that haven't been implemented yet:

1) Multi-statement. This is a primitive form of pipelining, where you pass several queries to connection::query() as a single string, separated by semicolons. The server sends several resultsets after that. I haven't focused a lot on this because it sounded risky (in terms of security) to me. See the hypothetical snippet below.
2) Multi-resultset. This is used with stored procedures, when a procedure returns more than one resultset.

This issue tracks both: https://github.com/anarthal/mysql/issues/8

MySQL docs on this: https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_protocol_command_phas...
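To illustrate 1): if multi-statement were enabled, a single query() call would carry several semicolon-separated statements and yield several resultsets to read in order. This is hypothetical code, since the library does not expose it today:

    // Hypothetical: multi-statement is NOT currently implemented.
    // One string, two statements, two resultsets to drain in order.
    auto r = conn.query(
        "INSERT INTO audit_log (event) VALUES ('started'); "
        "SELECT * FROM table1");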
Interesting, didn't know that (not being familiar with MySQL).
Note that PostgreSQL's official C client API now offers a *pipeline* mode [1], which was part of the protocol for a while, but not exposed client-side until v14.
The protocol also supports two *encoding* modes, text or binary. Is MySQL's protocol text only? We tested both, and binary is faster (surprise, right), which matters to me/us.
MySQL does define a text and a binary encoding. It uses the text encoding for text queries (i.e. connection::query) and the binary encoding for prepared statements (i.e. prepared_statement::execute). Resultset objects remember where they come from and will use the relevant encoding.
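In code, the encoding falls out of which API you use. A hedged sketch follows; prepare_statement and make_values are assumed names here, not guaranteed signatures:

    // Text protocol: rows arrive encoded as text.
    auto text_rs = conn.query("SELECT id, name FROM employees");

    // Binary protocol: prepared statement execution; rows arrive
    // in the binary encoding, avoiding text parsing.
    auto stmt = conn.prepare_statement(
        "SELECT id, name FROM employees WHERE id = ?");
    auto bin_rs = stmt.execute(boost::mysql::make_values(42));

    // Each resultset remembers which encoding produced it and
    // decodes rows accordingly.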
There's also a special COPY mode, for higher-performance (and incremental IO) bulk row access. Any equivalent on the MySQL side? Again, COPY IN and OUT are faster than regular DML in PostgreSQL.
There are two operations on the MySQL side:

1) The LOAD DATA statement (https://dev.mysql.com/doc/refman/8.0/en/load-data.html#load-data-local). You can use it like "LOAD DATA LOCAL INFILE 'test.csv' INTO TABLE test;" to do bulk loads from the client. This needs library support and is currently *NOT* implemented (I actually wasn't aware of the LOCAL option, thanks for bringing this up). I've raised https://github.com/anarthal/mysql/issues/67 to track it.
2) The SELECT INTO OUTFILE statement (https://dev.mysql.com/doc/refman/8.0/en/select-into.html). Unfortunately, this only writes files on the server host, not to the client, so no special client support is required. I'd say its usefulness is more limited than 1)'s.
A little off-topic for this review, but OTOH answers to the above would help in understanding what kind of async is possible with MySQL in general, and thus with the proposed Boost.MySQL, at the protocol level.
More related to this review: I don't see any performance chapter in the doc's TOC comparing against the official MySQL (C, I guess) API. Doing away with the official client code is a risk, and using ASIO brings in significant complexity (from my POV), so how is the proposed Boost.MySQL better than the official C client, or a good C++ wrapper on top of it?
I've updated https://github.com/anarthal/mysql/issues/50 to include this performance page. Traditionally, the C interface had only synchronous functions, so you would have to spawn a separate thread for each connection you had. With an async API, you can likely achieve much higher throughput. I've noticed that the official client has very recently added non-blocking functions. It's a curious interface: it seems you have to repeatedly call the same function with the same parameters until the operation completes. See https://dev.mysql.com/doc/c-api/8.0/en/c-api-asynchronous-interface-usage.ht... if you're curious.
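For reference, the pattern the linked docs describe looks roughly like this (a sketch against the official C API; I'm going by that page, so treat the details as approximate):

    // The same call is repeated with the same arguments until it
    // stops returning NET_ASYNC_NOT_READY.
    // Assume mysql is an initialized, connected MYSQL* handle.
    net_async_status status;
    do {
        status = mysql_real_query_nonblocking(mysql, "SELECT 1", 8);
        // A real program would poll the socket or do other work
        // here instead of spinning.
    } while (status == NET_ASYNC_NOT_READY);

    if (status == NET_ASYNC_ERROR) {
        // handle the error, e.g. via mysql_error(mysql)
    }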
Or is the target market of this library only existing ASIO users who want to perform MySQL queries/statements?
This was definitely my initial target audience. But I think we can reach more than that.
What's the target use-case? Small (in resultset size) and frequent (thus low duration / overhead) queries? Any wins on large bulk data loads for ETL tools, especially compared to (simpler?) code using the official client?
I've been more focused on that first case, typical in web apps. I'd say ETL loads can be a target in the future, but we may have some missing features (like the LOAD DATA I mentioned above).
Efficiency is listed as a goal, but there's no absolute metric, nor a comparison to the obvious alternative that's the official client API.
This is true. I'd say being able to use an async API can already grant the user some performance benefits over a sync-only API, but I agree that we should have benchmarks. Which benchmarks would you, as a user, find useful? I'm currently thinking of single-query execution time as a measurement of latency, and bulk query execution time as a measurement of throughput. I'm open to suggestions.
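For instance, the latency measurement could be as simple as timing N sequential single-row queries (sketch only; query()/read_one() are the same assumed signatures as above):

    #include <chrono>
    #include <iostream>

    constexpr int N = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
    {
        auto rs = conn.query("SELECT 1");
        while (rs.read_one()) {}  // drain the resultset fully
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - t0).count();
    std::cout << "mean single-query latency: " << (us / N) << " us\n";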
Again, I'm not a MySQL user at this point. But I'm a heavy user of PostgreSQL and SQLite. And also of Oracle OCI in the past. So I have interest and perspective on the domain.
I'm also aware of the Boost.PostgreSQL demo/proto Ruben made in the past, in the same style as Boost.MySQL. Thus IF I were a user of MySQL, the questions above would be first and foremost in my mind about Boost.MySQL. And OTOH, if Boost.MySQL really has benefits, then maybe I can translate those benefits to that Boost.PostgreSQL proto?
Beyond the design and implementation of such an ASIO-based *from-scratch* C++ client, I feel like there's too much missing for me to evaluate the "What's-In-It-For-Me" and the tradeoffs associated with the proposed Boost.MySQL.
Hopefully that makes sense :). Thanks, --DD
[1]: https://www.postgresql.org/docs/current/libpq-pipeline-mode.html