
On Fri, Dec 11, 2009 at 6:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
Thank you very much for both your answers, but they don't solve the mystery; please see below.
Let me try one more time. ;)
On Thursday 10 December 2009 16:40:12, Dean Michael Berris wrote:
What I find surprising is the performance of MySQL for small transactions. While my library spends 90% of its time waiting for the disk to flush a transaction, InnoDB uses 100% CPU and barely touches the disk.
Am I reading this correctly, you're doing something with persistence of tuples to disk, allowing indexed access to this persisted data, and
Persistence yes, tuples no: persistent objects. In the backend that means chunks of arbitrarily sized data. There is some indexing going on, but it isn't relevant to this case, because small transactions are dumped to the log on commit. The data files are only updated from time to time, to avoid additional disk syncs.
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it? Apparently this matters and this is how InnoDB does its log file management. What you can try doing is creating a fixed-size file first (initially maybe 10M) then writing from the start of the file and growing the file in fixed size increments.
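Something along these lines is what I mean -- an untested sketch with made-up names, not anything taken from InnoDB or from your library:

    #include <fcntl.h>
    #include <unistd.h>

    /* hypothetical helper, just to illustrate the idea */
    int make_log(const char* path, off_t initial_size)  /* e.g. 10 * 1024 * 1024 */
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if(fd < 0) return -1;
        if(posix_fallocate(fd, 0, initial_size) != 0) {  /* or ftruncate + write zeros */
            close(fd);
            return -1;
        }
        fsync(fd);   /* pay the size/metadata cost once, up front */
        return fd;   /* commits then overwrite within the preallocated range */
    }

Each commit afterwards just pwrite()s into the preallocated range and fsync()s, so only dirty data pages ever have to go to disk, never new metadata.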
On Thursday 10 December 2009 16:56:27, Jose wrote:
This is the setting where you can change the performance vs durability tradeoff:
If the value of innodb_flush_log_at_trx_commit is
Thanks, I didn't know about this. But the value was 1 by default, so MySQL claims to flush for each transaction.
But, including all the SQL parsing, running 10000 small transactions takes about 4 seconds on my system, while this little code snippet alone takes 6-7 seconds, with much more disk usage:
What InnoDB does is put everything that fits in memory up there, and keep it there. Small transactions will write to the log file, but not write all the data directly to disk right away.
char data = 0;
int fd = open("test.bin", O_RDWR | O_CREAT | O_NOATIME, 0644);  /* O_CREAT needs a mode argument */
for(int c = 0; c < 10000; ++c) { write(fd, &data, 1); fsync(fd); }
close(fd);
It looks like you're writing to a file that is created but not set to contain enough data to hold all the 10k data elements. Try zeroing out the file first (or growing it to the desired size), then try again. What you want to do to increase performance (on Linux at least):

* Leverage the VFS cache
* Allow the OS to keep your file in memory unknown to you
* Leverage the buffers at the OS level

Because you're using fsync, you're asking the kernel to do it for you -- and if your file is already in the VFS cache, the chances of fsync returning quickly are higher due to write caching at the OS level.
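To make that concrete, here is your loop again with the file pre-grown and the metadata flushed once up front -- untested, and whether it actually helps will depend on your filesystem:

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        char data = 0;
        int fd = open("test.bin", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, 10000);           /* pre-grow, so the size never changes again */
        fsync(fd);                      /* one up-front flush for the metadata */
        for(int c = 0; c < 10000; ++c) {
            pwrite(fd, &data, 1, c);    /* overwrite in place instead of appending */
            fsync(fd);                  /* now only the dirty data page is flushed */
        }
        close(fd);
    }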
Neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a byte makes a difference. MySQL also claims to use fsync on the page you linked.
Do you have an explanation for this?
They use fsync on the log file, not the data file which IIRC is still an mmap'ed region. And because their file is already of a certain size and is usually cached effectively by the OS, they get better performance numbers even if they keep using fsync on a descriptor.
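Very roughly, the split looks something like this -- just a sketch of the pattern, not InnoDB's actual code:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main()
    {
        /* log file: pre-grown, fsync'ed on every commit */
        int log_fd = open("test.log", O_RDWR | O_CREAT, 0644);
        ftruncate(log_fd, 10000);

        /* data file: pre-grown and mmap'ed; the OS flushes it when it likes */
        int data_fd = open("test.data", O_RDWR | O_CREAT, 0644);
        ftruncate(data_fd, 4096);
        char* data = static_cast<char*>(
            mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, data_fd, 0));

        for(int c = 0; c < 10000; ++c) {
            char rec = 1;
            pwrite(log_fd, &rec, 1, c);  /* durability comes from the log... */
            fsync(log_fd);               /* ...so only the log is flushed at commit */
            data[c % 4096] = rec;        /* the data page is merely dirtied in memory */
        }
        msync(data, 4096, MS_ASYNC);     /* optional; the kernel writes it back anyway */
        munmap(data, 4096);
        close(data_fd);
        close(log_fd);
    }

The commit path still fsync()s, but only ever the small, fixed-size log; the bulk of the data rides on the page cache.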
then allowing someone to embed it in their application. This sounds a lot like an embedded database system and looking at SQLite and BDB
I have looked at SQLite; they seem to use the same method as I do, judging by this FAQ entry, with similar results: http://www.sqlite.org/faq.html#q19 (although I get much more than 60 transactions a second. I guess they have calculated that from the RPM without actually trying it. fsync() returns about 1600 times/s on my old 10 GB disk, presumably due to hardware caching).
That's not surprising -- and if you're on Linux you even have vfs caching going for you if your file is already pre-grown.
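If you want to sanity-check that 1600-fsync()s-per-second figure, a quick-and-dirty timing loop like this (untested) gives the raw rate on a pre-existing file:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <cstdio>

    int main()
    {
        int fd = open("test.bin", O_RDWR | O_CREAT, 0644);
        char byte = 0;
        timeval start, end;
        gettimeofday(&start, 0);
        int const n = 1000;
        for(int c = 0; c < n; ++c) {
            pwrite(fd, &byte, 1, 0);   /* rewrite the same byte, no file growth */
            fsync(fd);
        }
        gettimeofday(&end, 0);
        double secs = (end.tv_sec - start.tv_sec)
                    + (end.tv_usec - start.tv_usec) / 1e6;
        std::printf("%.0f fsyncs/sec\n", n / secs);
        close(fd);
    }

If the number comes out much higher than the disk's rotational rate would allow, write caching (in the drive or the controller) is probably on, which also means fsync() isn't necessarily buying you the durability you're paying for.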
Another way they do away with the flush is by using memory-mapped regions on Linux/UNIX, and then delegating the flushing to the operating system.
I have implemented that, with its own set of problems; see this: http://lists.boost.org/boost-users/2009/12/54449.php
Yeah, unfortunately those are platform-specific issues you're running into. Ultimately you're at the mercy of the filesystem implementation you're using -- if it's journalling -- or whether you're accessing a block device directly so that you can implement the caching and flushing on your own. MySQL's InnoDB supports raw block storage, although I doubt you want to go to that level of control.
BTW, I am interested in what you're doing and maybe using a library that tackles the embedded database space in a modern C++ manner. Looking forward to hearing more about what you're working on.
I've posted a description for the Libraries Under Construction page (not yet there) in this thread:
Sweet, thanks for the link. I'll look forward to progress soon. :)

--
Dean Michael Berris
blog.cplusplus-soup.com | twitter.com/mikhailberis
linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com