On Wed, 11 May 2022 at 10:19, Dominique Devienne via Boost wrote:

On Tue, May 10, 2022 at 10:54 PM Richard Hodges via Boost wrote:

It is worth remembering that because this is MySQL, nothing can be done with the connection object until the resultset has been fully read.
To be clear, this is what the protocol for a query (or a prepared statement execution) looks like. Every "message" in the diagrams below is a 4-byte header plus a payload.

    client                     server
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one

When sending two queries, you can follow a strictly sequential model. Let's call this model a). It is the one implemented by this library:

    client                     server
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one
      ----- query request --->     // written by connection::query
      <------- query OK ----       // read by connection::query
      <------- metadata ----       // read by connection::query
      <------- rows --------       // read by resultset::read_one
      <------- EOF ---------       // read by resultset::read_one

You could also initiate the next query without completely reading all the packets sent by the server. That would look like this:

    client                     server
      --- query request 1 --->
      --- query request 2 --->
      <----- query 1 OK ----
      <----- metadata 1 ----
      <----- rows 1 --------
      <----- EOF 1 ---------
      <----- query 2 OK ----
      <----- metadata 2 ----
      <----- rows 2 --------
      <----- EOF 2 ---------

This is possible at the protocol level. Note that the server won't do any special handling here: it will process the two queries sequentially. It will read the first one, process it, send all the response packets, then repeat for the second one.

This second model, b), is *currently not possible* with the library's interface. It would require a batch interface like:

    serializer sr;
    sr.add_query("SELECT * FROM table1");
    sr.add_query("SELECT * FROM table2");
    connection.write(sr);
    resultset r1 = connection.read_query_result();
    // Now you *MUST* read all the rows in this resultset
    // before moving on to the next one
    resultset r2 = connection.read_query_result();
    // Same for r2

Note that this is different from a proper pipeline mode, as described by Postgres. I see two major differences:

1) In a real pipeline mode, the server processes the queries in batch. Here, the server still processes the queries sequentially.
2) Pipeline modes usually specify an option on what to do when a query in the pipeline fails. Here, you don't have that: subsequent queries will be executed regardless of the result of previous ones.

If you think this interface should be provided, please let me know, and I will raise the relevant issues.

More recent versions of MySQL (v8.x) include a plugin with a new version of the protocol, called the X protocol. I'm not an expert on it; AFAIK it was created with the idea of using MySQL as a document store, but it can also be used for regular SQL operations. The documentation is here: https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_mysqlx_protocol.html

This protocol does have a pipeline mode. Please note that this library does *NOT* implement the X protocol. There are two problems with it:

1) It employs Google's protobufs as its message format, which creates licensing conflicts with anything submitted to Boost.
2) It only targets MySQL 8+, not MariaDB or MySQL v5.x.
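To make model a) concrete, here is a minimal synchronous sketch of running two queries back to back. The member names follow the comments in the diagrams above (connection::query, resultset::read_one); the exact signatures, in particular what read_one returns, are my assumptions, not verbatim API:

    // Sketch only. Assume conn is a connected, handshaked
    // boost::mysql::tcp_connection. Assume read_one() yields
    // rows until the trailing EOF packet has been read.
    auto r1 = conn.query("SELECT * FROM table1");  // query request + OK + metadata
    while (auto row = r1.read_one())               // rows ... EOF
    {
        // process *row
    }

    // Only now, with r1 fully read, is the connection free again.
    auto r2 = conn.query("SELECT * FROM table2");
    while (auto row = r2.read_one()) { /* process *row */ }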
In the classic protocol (the one implemented by this library), there are a couple of extra concepts that haven't been implemented yet:

1) Multi-statement. This is a primitive form of pipelining, where you pass several queries to connection::query() as a single string, separated by semicolons. The server sends several resultsets after that. I haven't focused a lot on this because it sounded risky (in terms of security) to me. See the hypothetical snippet below.
2) Multi-resultset. This is used with stored procedures, when a procedure returns more than one resultset.

This issue tracks both: https://github.com/anarthal/mysql/issues/8

MySQL docs on this: https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_protocol_command_phas...
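To illustrate 1): if multi-statement were enabled, a single query() call would carry several semicolon-separated statements and yield several resultsets to read in order. This is hypothetical code, since the library does not expose it today:

    // Hypothetical: multi-statement is NOT currently implemented.
    // One string, two statements, two resultsets to drain in order.
    auto r = conn.query(
        "INSERT INTO audit_log (event) VALUES ('started'); "
        "SELECT * FROM table1");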
Interesting, didn't know that (not being familiar with MySQL).
Note that PostgreSQL's official C client API now offers a *pipeline* mode [1], which was part of the protocol for a while, but not exposed client-side until v14.
The protocol also supports two *encoding* modes, text or binary. Is MySQL's protocol text only? We tested both, and binary is faster (surprise, right), which matters to me/us.
MySQL does define a text and a binary encoding. It uses the text encoding for text queries (i.e. connection::query) and the binary encoding for prepared statements (i.e. prepared_statement::execute). Resultset objects remember where they come from and will use the relevant encoding.
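In code, the encoding falls out of which API you use. A hedged sketch follows; prepare_statement and make_values are assumed names here, not guaranteed signatures:

    // Text protocol: rows arrive encoded as text.
    auto text_rs = conn.query("SELECT id, name FROM employees");

    // Binary protocol: prepared statement execution; rows arrive
    // in the binary encoding, avoiding text parsing.
    auto stmt = conn.prepare_statement(
        "SELECT id, name FROM employees WHERE id = ?");
    auto bin_rs = stmt.execute(boost::mysql::make_values(42));

    // Each resultset remembers which encoding produced it and
    // decodes rows accordingly.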
There's also a special COPY mode, for higher-performance (and incremental IO) bulk row access. Any equivalent on the MySQL side? Again, COPY IN and OUT are faster than regular DML in PostgreSQL.
There are two operations on the MySQL side:

1) The LOAD DATA statement (https://dev.mysql.com/doc/refman/8.0/en/load-data.html#load-data-local). You can use it like "LOAD DATA LOCAL INFILE 'test.csv' INTO TABLE test;" to do bulk loads from the client. This needs library support and is currently *NOT* implemented (I actually wasn't aware of the LOCAL option, thanks for bringing this up). I've raised https://github.com/anarthal/mysql/issues/67 to track it.
2) The SELECT INTO OUTFILE statement (https://dev.mysql.com/doc/refman/8.0/en/select-into.html). Unfortunately, this only writes files on the server host, not to the client, so no special client support is required. I'd say its usefulness is more limited than 1)'s.
A little off-topic for this review, but OTOH answers to the above would help in understanding what kind of async is possible with MySQL in general, and thus with the proposed Boost.MySQL, at the protocol level.
More related to this review: I don't see any performance chapter in the doc's TOC comparing against the official MySQL (C, I guess) API. Doing away with the official client code is a risk, and using ASIO brings in significant complexity (from my POV), so how is the proposed Boost.MySQL better than the official C client, or a good C++ wrapper on top of it?
I've updated https://github.com/anarthal/mysql/issues/50 to include this performance page. Traditionally, the C interface had only synchronous functions, so you would have to spawn a separate thread for each connection you had. With an async API, you can likely achieve much higher throughput. I've noticed that the official client has very recently added non-blocking functions. It's a curious interface: it seems you have to repeatedly call the same function with the same parameters until the operation completes. See https://dev.mysql.com/doc/c-api/8.0/en/c-api-asynchronous-interface-usage.ht... if you're curious.
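For reference, the pattern the linked docs describe looks roughly like this (a sketch against the official C API; I'm going by that page, so treat the details as approximate):

    // The same call is repeated with the same arguments until it
    // stops returning NET_ASYNC_NOT_READY.
    // Assume mysql is an initialized, connected MYSQL* handle.
    net_async_status status;
    do {
        status = mysql_real_query_nonblocking(mysql, "SELECT 1", 8);
        // A real program would poll the socket or do other work
        // here instead of spinning.
    } while (status == NET_ASYNC_NOT_READY);

    if (status == NET_ASYNC_ERROR) {
        // handle the error, e.g. via mysql_error(mysql)
    }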
Or is the target market of this library only existing ASIO users who want to perform MySQL queries/statements?
This was definitely my initial target audience. But I think we can reach more than that.
What's the target use-case? Small (in resultset size) and frequent (thus low duration / overhead) queries? Any wins on large bulk data loads for ETL tools, especially compared to (simpler?) code using the official client?
I've been more focused on that first case, typical in web apps. I'd say ETL loads can be a target in the future, but we may have some missing features (like the LOAD DATA I mentioned above).
Efficiency is listed as a goal, but there's no absolute metric, nor a comparison to the obvious alternative that's the official client API.
This is true. I'd say being able to use an async API can already grant the user some performance benefits over a sync-only API, but I agree that we should have benchmarks. Which benchmarks would you, as a user, find useful? I'm currently thinking of single-query execution time as a measurement of latency, and bulk query execution time as a measurement of throughput. I'm open to suggestions.
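For instance, the latency measurement could be as simple as timing N sequential single-row queries (sketch only; query()/read_one() are the same assumed signatures as above):

    #include <chrono>
    #include <iostream>

    constexpr int N = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
    {
        auto rs = conn.query("SELECT 1");
        while (rs.read_one()) {}  // drain the resultset fully
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - t0).count();
    std::cout << "mean single-query latency: " << (us / N) << " us\n";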
Again, I'm not a MySQL user at this point. But I'm a heavy user of PostgreSQL and SQLite. And also of Oracle OCI in the past. So I have interest and perspective on the domain.
I'm also aware of the Boost.PostgreSQL demo/proto Ruben made in the past, in the same style as Boost.MySQL. Thus IF I were a user of MySQL, the questions above would be first and foremost in my mind about Boost.MySQL. And OTOH, if Boost.MySQL really has benefits, then maybe I can translate those benefits to that Boost.PostgreSQL proto?
Beyond the design and implementation of such an ASIO-based *from-scratch* C++ client, I feel like there's too much missing for me to evaluate the "What's-In-It-For-Me" and the tradeoffs associated with the proposed Boost.MySQL.
Hopefully that makes sense :). Thanks, --DD
[1]: https://www.postgresql.org/docs/current/libpq-pipeline-mode.html