
On Fri, Dec 11, 2009 at 7:02 AM, Stefan Strasser <strasser@uni-bremen.de> wrote:
On Thursday 10 December 2009 21:47:53, Dean Michael Berris wrote:
When you say logged on commit, are you writing to a file that's already created with enough space, or are you writing to a file which is "grown" every time you add data to it?
I had tried that. and I've tried writing a sector instead of 1 byte. and I've tried removing O_CREAT. but you have to actually do ALL THREE, and one more: the sector writes need to be sector-aligned. so, writing 512 bytes, aligned to 512 bytes, without O_CREAT, when the file already exists, brings the desired result: 2 seconds, with much less disk usage. that's some set of conditions.
thanks for helping with this.
Nice. :) You're welcome.
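The set of conditions described above (file created in advance, opened without O_CREAT, one full 512-byte write at a 512-byte-aligned offset) can be sketched roughly as follows. This is only an illustration, not the code under discussion: the function names `preallocate_log`, `open_existing_log` and `commit_sector` are hypothetical, and the 512-byte sector size is taken from the message above rather than queried from the device.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Sector size assumed from the discussion; real devices may differ.
constexpr std::size_t kSectorSize = 512;

// One-time setup: create the log and fill it with zeroed sectors so that
// later commits never have to grow the file. Returns 0 on success, -1 on error.
inline int preallocate_log(const char* path, std::size_t sectors) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    char zeros[kSectorSize] = {};
    int rc = 0;
    for (std::size_t i = 0; i < sectors && rc == 0; ++i) {
        if (::write(fd, zeros, kSectorSize) != static_cast<ssize_t>(kSectorSize)) rc = -1;
    }
    if (rc == 0 && ::fsync(fd) != 0) rc = -1;
    if (::close(fd) != 0) rc = -1;
    return rc;
}

// Commit path: open WITHOUT O_CREAT, so a missing file is an error
// instead of being silently created. Returns the fd, or -1 on error.
inline int open_existing_log(const char* path) {
    return ::open(path, O_WRONLY);
}

// Write exactly one sector at a sector-aligned offset, then fsync.
// Payloads shorter than a sector are zero-padded; longer ones are rejected.
inline int commit_sector(int fd, std::size_t sector_index,
                         const void* data, std::size_t len) {
    if (len > kSectorSize) return -1;
    char sector[kSectorSize] = {};
    std::memcpy(sector, data, len);
    // sector_index * kSectorSize is a multiple of the sector size by construction.
    off_t offset = static_cast<off_t>(sector_index * kSectorSize);
    if (::pwrite(fd, sector, kSectorSize, offset) != static_cast<ssize_t>(kSectorSize)) return -1;
    return ::fsync(fd) == 0 ? 0 : -1;
}
```

Addressing the log by sector index rather than by byte offset makes a misaligned write impossible to express, which seemed simpler than checking alignment at run time.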
What InnoDB does is put everything that fits in memory up there, and keep it there. Small transactions will write to the log file, but not write all the data directly to disk right away.
that's almost equal to my approach. I do write to the data files, but only sync them when a large transaction is committed or the log is rolled to a new one. I should probably think about sector-aligning those data writes, too, given the new insights.
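The sync policy described above (data files are written on every commit but only synced on a large transaction or a log roll) might look something like this. It is a sketch under assumptions: `SyncPolicy`, its `large_txn_bytes` threshold, and both function names are made up for illustration, and the thread does not say what counts as "large".

```cpp
#include <cstddef>
#include <unistd.h>
#include <vector>

// Hypothetical tuning knob: transactions at or above this size force a sync.
struct SyncPolicy {
    std::size_t large_txn_bytes;
};

// Decide whether the data files must be synced after this commit.
// Small commits rely on the (already synced) log for durability.
inline bool needs_data_sync(const SyncPolicy& policy,
                            std::size_t txn_bytes, bool log_rolled) {
    return log_rolled || txn_bytes >= policy.large_txn_bytes;
}

// fsync every open data file. Returns 0 if all succeeded, -1 otherwise.
// All descriptors are attempted even if an earlier one fails.
inline int sync_data_files(const std::vector<int>& fds) {
    int rc = 0;
    for (int fd : fds) {
        if (::fsync(fd) != 0) rc = -1;
    }
    return rc;
}
```

A commit would then call `needs_data_sync` after its log write and only pay for `sync_data_files` when it returns true.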
Sounds like a good approach. If you're thinking of multi-threading and having an active object do the flush management, that should be something worth looking into to move the latency from persistence away from "worker" threads to a single serializing writer thread.
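As a rough illustration of the active-object idea: worker threads hand their flush work to one writer thread and get back a future, so they only block if they actually need to wait for durability. The class name `WriterThread` and its `submit` interface are hypothetical, chosen for this sketch; nothing here is taken from the library being discussed.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// A single serializing writer: jobs run one at a time, in submission order,
// on a dedicated thread.
class WriterThread {
public:
    WriterThread() : worker_([this] { run(); }) {}

    // Drains any jobs still queued, then joins the writer thread.
    ~WriterThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    WriterThread(const WriterThread&) = delete;
    WriterThread& operator=(const WriterThread&) = delete;

    // Enqueue a job (e.g. a write followed by fsync). The returned future
    // yields the job's result once the writer thread has run it.
    std::future<int> submit(std::function<int()> job) {
        std::packaged_task<int()> task(std::move(job));
        std::future<int> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<int()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left to do
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // run outside the lock so submitters are never blocked by I/O
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<int()>> queue_;
    bool stopping_ = false;
    std::thread worker_;  // declared last: starts after the other members exist
};
```

A worker that must not return before its commit is durable would call `submit(...).get()`; one that can tolerate a delay just drops the future or checks it later.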
Because you're using fsync, you're asking the kernel to do it for you -- and if your file is already in the vfs cache, the chances of fsync returning quicker are higher due to write caching at the OS level.
I don't think the OS uses write caching in the case of fsync. it isn't supposed to, is it?
It actually has license to "cache" in the sense that it queues the data to be written on a per-fd basis. Even if you're not doing buffered writes, a call to fsync that returns right away doesn't necessarily mean the data has already been written to disk. IIRC, the POSIX standard doesn't really say that after an fsync the data is guaranteed to have been written to disk -- only that the state of the file descriptor that the kernel holds and the userspace descriptor are synchronized; this can mean a lot of things and it doesn't guarantee that it's already written to disk. I may be wrong though, but that is how I understand it.

HTH

--
Dean Michael Berris
blog.cplusplus-soup.com | twitter.com/mikhailberis
linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com