ACID transactions/flushing/MySQL InnoDB

hey, this is not a boost question, but it is for the development of a boost library, so maybe a database expert on the list could help me out?

I've been benchmarking some test cases of my library against MySQL InnoDB, which isn't even apples to oranges anymore, but nevertheless gave some insights. what I find surprising is the performance of MySQL for small transactions. while my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely uses any disk.

I've reduced the calls to fsync() for small transactions to 1 call per transaction after some sequential writing, so I'm wondering how MySQL accomplishes this? by definition there must be a flush per transaction, unless you loosen the requirements of the D in ACID. Is that what InnoDB is doing? the only way I see to reduce flushes even more is combining some transactions. but after a power failure some of those transactions can be lost even though they have been reported as successfully committed.

I have not examined their source code yet. the MySQL documentation doesn't say anything about loosening Durability. (although you have to look pretty hard to see that the default isolation level isn't Serializable either.)

thanks in advance,
Stefan

Hi Stefan, On Fri, Dec 11, 2009 at 1:07 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
this is not a boost question, but it is for the development of a boost library, so maybe a database expert on the list could help me out?
I'm not a database expert, but I have dealt with databases for the better part of my development career, so let me share some of the insights I've gained with MySQL and InnoDB.
I've been benchmarking some test cases of my library against MySQL InnoDB, which isn't even apples to oranges anymore, but nevertheless gave some insights.
what I find surprising is the performance of MySQL for small transactions. while my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely uses any disk.
Am I reading this correctly, you're doing something with persistence of tuples to disk, allowing indexed access to this persisted data, and then allowing someone to embed it in their application. This sounds a lot like an embedded database system and looking at SQLite and BDB might be better than comparing with InnoDB. But then that's just my take on it. ;)
I've reduced the calls to fsync() for small transactions to 1 call per transaction after some sequential writing, so I'm wondering how MySQL accomplishes this? by definition there must be a flush per transaction, unless you loosen the requirements of the D in ACID.
Is that what InnoDB is doing? the only way I see to reduce flushes even more is combining some transactions. but after a power failure some of those transactions can be lost even though they have been reported as successfully committed.
This depends completely on how you've configured InnoDB. There's a configuration setting that lets you flush after every transaction commit, a setting that lets you flush every second instead of after every commit, and a setting that writes the log entry at every commit but defers flushing it (and the data) to disk.

The way InnoDB does it is that it keeps the index and the data in its in-memory buffer pool -- the CPU usage is probably due to cache misses when you're doing a full-table scan. When you commit a transaction, the B-Tree index in memory is updated and an entry is added to the log. When the flush happens is completely configurable -- you can have it flushed to disk right after the B-Tree index gets updated, or after a timeout, or have only the log committed and the index+data flushed later on.

Another way they do away with the flush is by using memory-mapped regions in Linux/UNIX, and then delegating the flushing to the operating system. InnoDB also supports raw storage management, in which case it bypasses the normal filesystem flushing and deals with raw storage devices as block devices on its own, using more low-level access to the device.
I have not examined their source code yet. the MySQL documentation doesn't say anything about loosening Durability. (although you have to look pretty hard to see that the default isolation level isn't Serializable either.)
There's a section about InnoDB log file flushing. I would have included a link but I think the information above should already be helpful to you.
thanks in advance,
HTH. BTW, I am interested in what you're doing and maybe using a library that tackles the embedded database space in a modern C++ manner. Looking forward to hearing more about what you're working on. -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

thank you very much for both your answers. but they don't solve the mystery, please see below.
On Thursday 10 December 2009 16:40:12, Dean Michael Berris wrote:
what I find surprising is the performance of MySQL for small transactions. while my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely uses any disk.
Am I reading this correctly, you're doing something with persistence of tuples to disk, allowing indexed access to this persisted data, and
persistence yes, tuples no. persistent objects. in the backend that means: chunks of arbitrarily sized data. there is some indexing going on, but it isn't relevant to this case because small transactions are dumped to the log on commit. the data files are only updated from time to time, to avoid additional disk syncs.
On Thursday 10 December 2009 16:56:27, Jose wrote:
This is the setting where you can change the performance vs durability tradeoff:
If the value of innodb_flush_log_at_trx_commit is
thanks, I didn't know about this. but the value was 1 by default, so MySQL claims to flush for each transaction. but, including all the SQL parsing, running 10000 small transactions takes about 4 seconds on my system. while this little code snippet alone takes 6-7 seconds, with much more disk usage:

#include <fcntl.h>
#include <unistd.h>

int main()
{
    char data = 0;
    int fd = open("test.bin", O_RDWR | O_CREAT | O_NOATIME, 0644);
    for(int c = 0; c < 10000; ++c){
        write(fd, &data, 1);
        fsync(fd);
    }
}

neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a byte make a difference. MySQL also claims to use fsync on the page you linked. do you have an explanation for this?
then allowing someone to embed it in their application. This sounds a lot like an embedded database system and looking at SQLite and BDB
I have looked at SQLite; they seem to use the same method as I do, judging by this FAQ entry, with similar results: http://www.sqlite.org/faq.html#q19 (although I get much more than 60 transactions a second. I guess they have calculated that from the RPM without actually trying it. fsync() returns about 1600 times/s on my old 10 GB disk, presumably due to hardware caching.)
Another way they do away with the flush is by using memory-mapped regions in Linux/UNIX, and then delegating the flushing to the operating system.
I have implemented that, with its own set of problems, see this: http://lists.boost.org/boost-users/2009/12/54449.php
BTW, I am interested in what you're doing and maybe using a library that tackles the embedded database space in a modern C++ manner. Looking forward to hearing more about what you're working on.
I've posted a description for the Libraries Under Construction page (not yet there) in this thread: http://lists.boost.org/Archives/boost/2009/12/159900.php

On Fri, Dec 11, 2009 at 6:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
thank you very much for both your answers. but they don't solve the mystery, please see below.
Let me try one more time. ;)
On Thursday 10 December 2009 16:40:12, Dean Michael Berris wrote:
what I find surprising is the performance of MySQL for small transactions. while my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely uses any disk.
Am I reading this correctly, you're doing something with persistence of tuples to disk, allowing indexed access to this persisted data, and
persistence yes, tuples no. persistent objects. in the backend that means: chunks of arbitrarily sized data. there is some indexing going on, but it isn't relevant to this case because small transactions are dumped to the log on commit. the data files are only updated from time to time, to avoid additional disk syncs.
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it? Apparently this matters and this is how InnoDB does its log file management. What you can try doing is creating a fixed-size file first (initially maybe 10M) then writing from the start of the file and growing the file in fixed size increments.
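To make the pre-growing concrete, here's a rough, untested sketch -- the function name, the 4K chunk, and the 10M figure are just placeholders:

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// zero-fill a log file up to `size` bytes so that later sector writes land on
// already-allocated blocks instead of growing the file on every commit
int preallocate_log(const char *path, std::size_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if(fd < 0) return -1;
    char zeros[4096];
    std::memset(zeros, 0, sizeof zeros);
    for(std::size_t written = 0; written < size; written += sizeof zeros)
        if(write(fd, zeros, sizeof zeros) < 0){ close(fd); return -1; }
    fsync(fd);   // make sure the allocation itself is on disk
    close(fd);
    return 0;
}

// e.g. preallocate_log("test.bin", 10 * 1024 * 1024);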
On Thursday 10 December 2009 16:56:27, Jose wrote:
This is the setting where you can change the performance vs durability tradeoff:
If the value of innodb_flush_log_at_trx_commit is
thanks, I didn't know about this. but the value was 1 by default, so MySQL claims to flush for each transaction.
but, including all the SQL parsing, running 10000 small transactions takes about 4 seconds on my system. while this little code snippet alone takes 6-7 seconds, with much more disk usage:
What InnoDB does is puts everything that fits in memory up there, and keeps it there. Small transactions will write to the log file, but not write all the data directly to disk right away.
char data = 0;
int fd = open("test.bin", O_RDWR | O_CREAT | O_NOATIME, 0644);
for(int c = 0; c < 10000; ++c){
    write(fd, &data, 1);
    fsync(fd);
}
It looks like you're writing to a file that is created but not set to contain enough data to hold all the 10k data elements. Try zeroing out the file first (or growing it to the desired size), then try again. What you want to do to increase performance (on Linux at least):

* Leverage the VFS Cache
* Allow the OS to keep your file in memory unknown to you
* Leverage the buffers at the OS level

Because you're using fsync, you're asking the kernel to do it for you -- and if your file is already in the vfs cache, the chances of fsync returning quicker are higher due to write caching at the OS level.
neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a byte make a difference. MySQL also claims to use fsync on the page you linked.
do you have an explanation for this?
They use fsync on the log file, not the data file which IIRC is still an mmap'ed region. And because their file is already of a certain size and is usually cached effectively by the OS, they get better performance numbers even if they keep using fsync on a descriptor.
then allowing someone to embed it in their application. This sounds a lot like an embedded database system and looking at SQLite and BDB
I have looked at SQLite; they seem to use the same method as I do, judging by this FAQ entry, with similar results: http://www.sqlite.org/faq.html#q19 (although I get much more than 60 transactions a second. I guess they have calculated that from the RPM without actually trying it. fsync() returns about 1600 times/s on my old 10 GB disk, presumably due to hardware caching.)
That's not surprising -- and if you're on Linux you even have vfs caching going for you if your file is already pre-grown.
Another way they do away with the flush is by using memory-mapped regions in Linux/UNIX, and then delegating the flushing to the operating system.
I have implemented that, with its own set of problems, see this: http://lists.boost.org/boost-users/2009/12/54449.php
Yeah, unfortunately those are platform-specific issues you're running into. Ultimately you're at the mercy of the filesystem implementation you're using -- whether it's journalling -- or whether you're accessing a block device directly so that you can implement the caching and flushing on your own. MySQL's InnoDB supports raw block storage although I doubt you want to go to that level of control.
BTW, I am interested in what you're doing and maybe using a library that tackles the embedded database space in a modern C++ manner. Looking forward to hearing more about what you're working on.
I've posted a description for the Libraries Under Construction page (not yet there) in this thread:
Sweet, thanks for the link. I'll look forward to progress soon. :) -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

On Thursday 10 December 2009 21:47:53, Dean Michael Berris wrote:
On Fri, Dec 11, 2009 at 6:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
thank you very much for both your answers. but they don't solve the mystery, please see below.
Let me try one more time. ;)
I was about to write you've got to try one more time, but...see below.
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it?
I had tried that. and I've tried writing a sector instead of 1 byte. and I've tried removing O_CREAT. but you have to actually do ALL THREE, and one more: the sector writes need to be sector-aligned. so, writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions. thanks for helping with this.
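for reference, a variant of the earlier snippet that meets all of those conditions would look roughly like this (illustrative only; assumes the file was pre-grown beforehand):

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main()
{
    char sector[512];
    std::memset(sector, 0, sizeof sector);
    // no O_CREAT: the file already exists and is pre-grown
    // (O_NOATIME needs _GNU_SOURCE on some setups)
    int fd = open("test.bin", O_RDWR | O_NOATIME);
    if(fd < 0) return 1;
    for(int c = 0; c < 10000; ++c){
        write(fd, sector, sizeof sector);  // whole sectors, written sequentially,
        fsync(fd);                         // so every write starts on a 512-byte offset
    }
    close(fd);
}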
What InnoDB does is puts everything that fits in memory up there, and keeps it there. Small transactions will write to the log file, but not write all the data directly to disk right away.
that's almost equal to my approach. I do write to the data files, but only sync them when a large transaction is committed or the log is rolled to a new one. I should probably think about sector-aligning those data writes, too, given the new insights.
Because you're using fsync, you're asking the kernel to do it for you -- and if your file is already in the vfs cache, the chances of fsync returning quicker are higher due to write caching at the OS level.
I don't think the OS uses write caching in the case of fsync. it isn't supposed to, is it?

On Fri, Dec 11, 2009 at 7:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
On Thursday 10 December 2009 21:47:53, Dean Michael Berris wrote:
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it?
I had tried that. and I've tried writing a sector instead of 1 byte. and I've tried removing O_CREAT. but you have to actually do ALL THREE, and one more: the sector writes need to be sector-aligned. so, writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
thanks for helping with this.
Nice. :) You're welcome.
What InnoDB does is puts everything that fits in memory up there, and keeps it there. Small transactions will write to the log file, but not write all the data directly to disk right away.
that's almost equal to my approach. I do write to the data files, but only sync them when a large transaction is committed or the log is rolled to a new one. I should probably think about sector-aligning those data writes, too, given the new insights.
Sounds like a good approach. If you're thinking of multi-threading and having an active object do the flush management, that should be something worth looking into to move the latency from persistence away from "worker" threads to a single serializing writer thread.
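A rough sketch of what I mean -- a single writer thread that batches pending commit records and does one fsync per batch. All the names are invented, it's untested, it's simplified (e.g. the destructor doesn't drain pending records), and it's written with std::thread for brevity where in practice you'd reach for Boost.Thread:

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <unistd.h>

class log_writer
{
public:
    explicit log_writer(int fd) : fd_(fd), thread_([this]{ run(); }) {}
    ~log_writer()
    {
        { std::lock_guard<std::mutex> l(m_); stop_ = true; }
        wake_.notify_one();
        thread_.join();
    }

    // called by a worker thread; blocks until the record has been fsync'ed
    void commit(const std::string &record)
    {
        std::unique_lock<std::mutex> l(m_);
        std::uint64_t my_batch = current_batch_;
        pending_.push_back(record);
        wake_.notify_one();
        synced_.wait(l, [&]{ return flushed_batch_ > my_batch; });
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> l(m_);
        while(!stop_)
        {
            wake_.wait(l, [&]{ return stop_ || !pending_.empty(); });
            std::vector<std::string> batch;
            batch.swap(pending_);
            std::uint64_t batch_no = current_batch_++;
            l.unlock();                            // write + fsync outside the lock
            for(const std::string &r : batch)
                (void)write(fd_, r.data(), r.size());
            fsync(fd_);                            // one flush for the whole batch
            l.lock();
            flushed_batch_ = batch_no + 1;
            synced_.notify_all();                  // wake every committer in the batch
        }
    }

    int fd_;
    bool stop_ = false;
    std::uint64_t current_batch_ = 0, flushed_batch_ = 0;
    std::vector<std::string> pending_;
    std::mutex m_;
    std::condition_variable wake_, synced_;
    std::thread thread_;                           // declared last: starts after the rest
};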
Because you're using fsync, you're asking the kernel to do it for you -- and if your file is already in the vfs cache, the chances of fsync returning quicker are higher due to write caching at the OS level.
I don't think the OS uses write caching in the case of fsync. it isn't supposed to, is it?
It actually has license to "cache" in the sense that it queues the data to be written on a per-fd basis. Even if you're not doing buffered writes, a call to fsync that returns right away doesn't necessarily mean the data has already been written to disk. IIRC, the POSIX standard doesn't really say that after an fsync the data is guaranteed to have been written to disk -- only that the state of the file as the kernel holds it and as userspace sees it through the descriptor are synchronized; this can mean a lot of things, and it doesn't guarantee that it's already written to disk. I may be wrong though, but that is how I understand it. HTH -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

On Thursday 10 December 2009 22:38:54, Dean Michael Berris wrote:
On Fri, Dec 11, 2009 at 7:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
On Thursday 10 December 2009 21:47:53, Dean Michael Berris wrote:
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it?
I had tried that. and I've tried writing a sector instead of 1 byte. and I've tried removing O_CREAT. but you have to actually do ALL THREE, and one more: the sector writes need to be sector-aligned. so, writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
thanks for helping with this.
Nice. :) You're welcome.
it's actually 1.2 seconds. I forgot to remove a call to this_thread::sleep() from some earlier tests ;-) and there's a fifth condition: no lseek() calls in between. a call to lseek() not only has its normal overhead, but brings the whole thing back to the 5-7 seconds range. so I'll have to waste some log space to ensure sector writes without seeking.
that's almost equal to my approach. I do write to the data files, but only sync them when a large transaction is committed or the log is rolled to a new one. I should probably think about sector-aligning those data writes, too, given the new insights.
Sounds like a good approach. If you're thinking of multi-threading and having an active object do the flush management, that should be something worth looking into to move the latency from persistence away from "worker" threads to a single serializing writer thread.
there's already a TODO in the code to separate the sync preparations (flushing the internal buffer) from the actual system call, so the mutex can be unlocked and other threads can continue to write to the log. but it is some effort in my current design, and I'm not sure if it even helps. according to this report, fsync() currently blocks other writes: http://bugzilla.kernel.org/show_bug.cgi?id=9911 not sure about windows.

On Thursday 10 December 2009 22:38:54, Dean Michael Berris wrote:
On Fri, Dec 11, 2009 at 7:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
On Thursday 10 December 2009 21:47:53, Dean Michael Berris wrote:
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it?
I had tried that. and I've tried writing a sector instead of 1 byte. and I've tried removing O_CREAT. but you have to actually do ALL THREE, and one more: the sector writes need to be sector-aligned. so, writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
thanks for helping with this.
in case you're interested, I now get about 5000 synced small transactions/s, with a theoretical maximum of about 7000 -- but the 5000 includes the data writes and the write-ahead zeroing that makes sure the log file doesn't grow on each transaction sync. thanks again.

there are some more weird conditions to get sequential writing performance, e.g. when you write with the following sync intervals, starting from offset 0 and "|" representing syncs:

512 | 2048 | 512 | 512 | 512 | ...

performance is much, much worse than:

512 | [1536] | 2048 | 512 | 512 | ...

with 1536 bytes of garbage inserted so the 2048-byte block can be written aligned to its own size. and yes, the sync after writing the garbage is necessary.
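in code, the padding rule boils down to rounding the current offset up to a multiple of the next block's size, something like:

#include <cassert>
#include <cstddef>

// bytes of filler needed so that a block of block_size bytes written at
// `offset` starts at a multiple of its own size
inline std::size_t padding_for(std::size_t offset, std::size_t block_size)
{
    std::size_t rem = offset % block_size;
    return rem ? block_size - rem : 0;
}

int main()
{
    assert(padding_for(512, 2048) == 1536);  // the example above
    assert(padding_for(4096, 512) == 0);     // the following 512-byte writes
}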
Sounds like a good approach. If you're thinking of multi-threading and having an active object do the flush management, that should be something worth looking into to move the latency from persistence away from "worker" threads to a single serializing writer thread.
I've tried something like that, still synced by the worker threads but the sync is lock-free, so in theory the threads could meet at the sync call and if there were no writes between the sync calls, the OS could decide to perform only one of them. but it doesn't have large effects, about 5200 txs/s with 10 threads (on single CPU). I guess you could improve multithreaded performance a lot if you yield the thread right before the sync and only sync in one thread if threads meet at that point. but I haven't looked into this and I don't plan to right now.

On Sat, Dec 12, 2009 at 8:00 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
in case you're interested, I now get about 5000 synced small transactions/s. with a theoretical maximum of about 7000, but the 5000 are including the data writes and the writing-ahead to zero out the data and make sure the log file doesn't grow on each transaction sync.
That's cool. :)
thanks again.
Glad to be of help.
there are some more weird conditions to get sequential writing performance, e.g. when you write with the following sync intervals, starting from offset 0 and "|" representing syncs:
512 | 2048 | 512 | 512 | 512 | ...
performance is much much worse than:
512 | [1536] | 2048 | 512 | 512 | ...
with 1536 bytes of garbage inserted so the 2048-byte block can be written aligned to its own size. and yes, the sync after writing the garbage is necessary.
One other thing you can do is to pack the data into 512-byte divisible chunks. If you know you're writing 2048 in one go, then you might have to align to 512-byte chunks. Instead of writing "garbage" you can pack null bytes past the 2048 bytes you originally wanted to store. It also helps if you can deal with a memory pool with fixed-chunk recyclable arrays so that you can leverage processor-cache locality for the bytes you are going to eventually pass to the kernel -- which it will eventually copy (unfortunately).
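In code, the packing I have in mind is just something like this (illustrative, untested):

#include <cstddef>
#include <vector>

// pad a record with trailing null bytes so that it occupies a whole number
// of 512-byte sectors before it is handed to write()
std::vector<char> pack_to_sectors(std::vector<char> record, std::size_t sector = 512)
{
    std::size_t rem = record.size() % sector;
    if(rem)
        record.resize(record.size() + (sector - rem), '\0');
    return record;
}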
Sounds like a good approach. If you're thinking of multi-threading and having an active object do the flush management, that should be something worth looking into to move the latency from persistence away from "worker" threads to a single serializing writer thread.
I've tried something like that, still synced by the worker threads but the sync is lock-free, so in theory the threads could meet at the sync call and if there were no writes between the sync calls, the OS could decide to perform only one of them. but it doesn't have large effects, about 5200 txs/s with 10 threads (on single CPU). I guess you could improve multithreaded performance a lot if you yield the thread right before the sync and only sync in one thread if threads meet at that point. but I haven't looked into this and I don't plan to right now.
The buffer-packing algorithm that I describe above could be done on a single thread (the active object's lifetime thread) and the write/sync done appropriately. I imagine the cost of context switching is mitigated by the improved worker-thread free-up, because a worker thread no longer needs to wait for a transaction to be actively committed after it has passed it to the active object. I'm not sure if this is part of the requirements in your design or whether you want client threads to really block on transaction commits (even though the syncing is lock-free). HTH -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

Quoting Dean Michael Berris <mikhailberis@gmail.com>:
512 | 2048 | 512 | 512 | 512 | ...
performance is much much worse than:
512 | [1536] | 2048 | 512 | 512 | ...
with 1536 bytes of garbage inserted so the 2048-byte block can be written aligned to its own size. and yes, the sync after writing the garbage is necessary.
One other thing you can do is to pack the data into 512-byte divisible chunks. If you know you're writing 2048 in one go, then you might have to align to 512-byte chunks. Instead of writing "garbage" you can pack null bytes past the 2048 bytes you originally wanted to store.
I do have sector chunks; that was required simply because a used sector needs to be distinguishable from an unused sector, because of the zeroed-out data at the end of the log. I still insert the "garbage" before the data; I don't think I understood why inserting it at the end would be preferable, if that is what you were trying to explain? if cache locality was your only concern, that doesn't really matter here, since CPU usage is way below 100%. there can be about 40000 un-flushed small transactions/s, so this is really about optimizing the syncs. in a mode that is only safe in case of an application crash, but not in case of a system failure (flush, but no sync) there are about 30000 transactions/s for the same test case.
I'm not sure if this is part of the requirements in your design or whether you want client threads to really block on transaction commits (even though the syncing is lock-free).
the worker threads are the user threads, so they need to block until the transaction is recorded committed (-> log sync). you could only avoid some syncs when 2 or more independent transactions run concurrently and they both need to sync, those could be combined. but that's not my focus right now.

On Sat, Dec 12, 2009 at 11:27 PM, <strasser@uni-bremen.de> wrote:
Quoting Dean Michael Berris <mikhailberis@gmail.com>:
One other thing you can do is to pack the data into 512-byte divisible chunks. If you know you're writing 2048 in one go, then you might have to align to 512-byte chunks. Instead of writing "garbage" you can pack null bytes past the 2048 bytes you originally wanted to store.
I do have sector chunks; that was required simply because a used sector needs to be distinguishable from an unused sector, because of the zeroed-out data at the end of the log.
I see. Do you have a separate index mechanism?
I still insert the "garbage" before the data; I don't think I understood why inserting it at the end would be preferable, if that is what you were trying to explain?
If you had an index mechanism that contains information about where a certain log entry starts (and the size of that entry), then putting it at the beginning allows you to re-use the area after that entry (thus the zeros). Instead of garbage data that you're never going to be able to re-use if it sat at the beginning of a multi-sector entry, you get usable space after the entry. This only works though if you have an index, or at least a "header" as part of each entry that you write. Imagine it this way:

| header <offset, size> | entry |

You can then pack many of these entries into a single sector chunk. Of course you're going to run into buffering and serialized management of the chunk, but this allows you to save the space you otherwise would be using for garbage data. If you maintain a separate index, you can store the "headers" independent of the data file. In that case you can keep updating the index while writing the data to an mmap'ed file that's msync'ed regularly, and have the index be "log style" if it's always append-only anyway.
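In code, the header would be nothing more than something like this (field names made up; a real on-disk format would also need to pin down endianness and padding):

#include <cstdint>

// on disk: [entry_header][payload bytes][entry_header][payload bytes]...
struct entry_header
{
    std::uint64_t offset;  // where the payload starts in the log
    std::uint32_t size;    // payload size in bytes; 0 marks unused space
};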
if cache locality was your only concern, that doesn't really matter here, since CPU usage is way below 100%.
That's for single-threaded code, but when you run into multi-threaded code running on multiple cores and/or hardware threads, then it's going to be one of the killers for performance -- even if CPU usage is way below 100%. Of course tuning that requires profiling data and would be done on a case-by-case basis. Just something worth keeping in mind IMO.
there can be about 40000 un-flushed small transactions/s, so this is really about optimizing the syncs. in a mode that is only safe in case of an application crash, but not in case of a system failure (flush, but no sync) there are about 30000 transactions/s for the same test case.
Alright. In that case sector-aligned writing and a good OS I/O scheduling algorithm would be your best friends. :)
I'm not sure if this is part of the requirements in your design or whether you want client threads to really block on transaction commits (even though the syncing is lock-free).
the worker threads are the user threads, so they need to block until the transaction is recorded committed (-> log sync).
I see. Have you looked at using futures to handle the communication between user threads and the writer thread?
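Something along these lines is what I have in mind -- just a sketch, with the queue left hypothetical, and written with std::promise/std::future for brevity (Boost.Thread's futures would be the equivalent here):

#include <future>
#include <string>

// a commit handed from a user thread to the writer thread
struct commit_request
{
    std::string record;       // the serialized transaction
    std::promise<void> done;  // fulfilled once the log has been fsync'ed
};

// user thread:
//   commit_request req;
//   req.record = ...;                       // fill in the transaction data
//   std::future<void> committed = req.done.get_future();
//   writer_queue.push(std::move(req));      // writer_queue is hypothetical
//   committed.wait();                       // block until the log sync happened
//
// writer thread, after the fsync() of a batch:
//   for(commit_request &r : batch)
//       r.done.set_value();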
you could only avoid some syncs when 2 or more independent transactions run concurrently and they both need to sync, those could be combined. but that's not my focus right now.
Alright. Thanks for indulging me and my suggestions. :) -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

On Saturday 12 December 2009 16:47:58, Dean Michael Berris wrote:
I see. Do you have a separate index mechanism?
no log index. the log is rolled when it reaches a certain size, so it can be read as a whole sequentially on restart.
then putting it in the beginning allows you to re-use the area after that entry
the data needs to be flushed asap as the user thread is waiting for its recorded commit, which is part of that data. so reusing that area is not an option, unless you want to seek(). (you don't.)
you could only avoid some syncs when 2 or more independent transactions run concurrently and they both need to sync, those could be combined. but that's not my focus right now.
Alright. Thanks for indulging me and my suggestions. :)
you are very welcome, you're just thinking 5 steps ahead. if it's not improving performance by 400%, like this case did, it has to wait among the > 100 other optimization TODOs.

On Sun, Dec 13, 2009 at 1:03 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
On Saturday 12 December 2009 16:47:58, Dean Michael Berris wrote:
I see. Do you have a separate index mechanism?
no log index. the log is rolled when it reaches a certain size, so it can be read as a whole sequentially on restart.
I see.
then putting it in the beginning allows you to re-use the area after that entry
the data needs to be flushed asap as the user thread is waiting for its recorded commit, which is part of that data. so reusing that area is not an option, unless you want to seek(). (you don't.)
Right, but I was thinking that if you were packing the data in memory first and then merging flushes, you wouldn't need to seek on write.
you could only avoid some syncs when 2 or more independent transactions run concurrently and they both need to sync, those could be combined. but that's not my focus right now.
Alright. Thanks for indulging me and my suggestions. :)
you are very welcome, you're just thinking 5 steps ahead.
:)
if it's not improving performance by 400%, like this case did, it has to wait among the > 100 other optimization TODOs.
Alright, am looking forward to the resulting library then! :) -- Dean Michael Berris blog.cplusplus-soup.com | twitter.com/mikhailberis linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

Hi Stefan, Stefan Strasser wrote:
writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
I would be interested to know how this compares to mmapping the file and then writing to memory and calling msync(). A while ago I did some work on a transactional persistent object library for "write mostly" data. Since it didn't need to support efficiently erasing or modifying existing data, it could use a pure log structure with transactions implemented just by atomically adjusting the end-of-data offset. I did this all via mmapped files and the performance seemed good enough, but I never benchmarked it against e.g. MySQL. Actually it has been on hold since I discovered how badly it performs over NFS... Phil.

On Friday 11 December 2009 00:04:31, Phil Endecott wrote:
Hi Stefan,
Stefan Strasser wrote:
writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
I would be interested to know how this compares to mmapping the file and then writing to memory and calling msync().
for sequential writing? there is no real difference, with mapping a little slower in my tests, but not by much. the advantage of mapping from the performance viewpoint is random access, since you're avoiding all the lseek() system calls. with mapping you have to meet the same criteria as mentioned above to get physically sequential writing: aligned, sequential sector writes.
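for comparison, an mmap/msync version of the same micro-benchmark would look roughly like this (untested sketch; assumes the file was pre-grown to 10000 * 512 bytes):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

int main()
{
    const std::size_t sector = 512, count = 10000;
    int fd = open("test.bin", O_RDWR);
    if(fd < 0) return 1;

    void *mem = mmap(0, sector * count, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if(mem == MAP_FAILED) return 1;
    char *base = static_cast<char *>(mem);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));

    for(std::size_t c = 0; c < count; ++c)
    {
        std::size_t off = c * sector;
        std::memset(base + off, 0x42, sector);
        // msync wants a page-aligned address, so flush the page(s) containing this sector
        std::size_t page_start = off - off % page;
        msync(base + page_start, off + sector - page_start, MS_SYNC);
    }
    munmap(base, sector * count);
    close(fd);
}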

On Thu, Dec 10, 2009 at 11:02 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
but, including all the SQL parsing, running 10000 small transactions takes about 4 seconds on my system. while this little code snippet alone takes 6-7 seconds, with much more disk usage:
char data = 0;
int fd = open("test.bin", O_RDWR | O_CREAT | O_NOATIME, 0644);
for(int c = 0; c < 10000; ++c){
    write(fd, &data, 1);
    fsync(fd);
}
neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a byte make a difference. MySQL also claims to use fsync on the page you linked.
do you have an explanation for this?
Check this blog post and part II, where the InnoDB developer explains some of the "unknown" details: http://www.mysqlperformanceblog.com/2007/10/26/heikki-tuuri-innodb-answers-p...

On Thu, Dec 10, 2009 at 11:02 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
what I find surprising is the performance of MySQL for small transactions. while my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely uses any disk.
but, including all the SQL parsing, running 10000 small transactions takes about 4 seconds on my system. while this little code snippet alone takes 6-7 seconds, with much more disk usage:
char data = 0;
int fd = open("test.bin", O_RDWR | O_CREAT | O_NOATIME, 0644);
for(int c = 0; c < 10000; ++c){
    write(fd, &data, 1);
    fsync(fd);
}
neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a byte make a difference. MySQL also claims to use fsync on the page you linked.
do you have an explanation for this?
You should be comparing using msync (not fsync). Also, the high CPU usage might be caused by InnoDB gzip compression.

regards

On Thu, Dec 10, 2009 at 6:07 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
Is that what InnoDB is doing? the only way I see to reduce flushes even more is combining some transactions. but after a power failure some of those transactions can be lost even though they have been reported as successfully committed.
Hi Stefan,

This is the setting where you can change the performance vs. durability tradeoff: http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html

innodb_flush_log_at_trx_commit

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1, the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues.

The default value of this variable is 1 (prior to MySQL 4.0.13, the default is 0). A value of 1 is required for ACID compliance. You can achieve better performance by setting the value different from 1, but then you can lose at most one second worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, then only an operating system crash or a power outage can erase the last second of transactions. However, InnoDB's crash recovery is not affected and thus crash recovery does work regardless of the value.
participants (5)
- Dean Michael Berris
- Jose
- Phil Endecott
- Stefan Strasser
- strasser@uni-bremen.de