Re: [Maria-discuss] fsync necessary for synchronous page flush?
Hi Jan,

Thanks for the clarification. I should have said "synchronous write" instead of "synchronous flush". My point is that for synchronous writes, fsync is called to force the pages to persistent storage, while for AIO pages it is not. My question is why fsync is required for synchronous I/O. Does InnoDB maintain a dirty page table? Is fsync called to guarantee that the page is on persistent storage so that the dirty page table can be updated? If so, when is the dirty page table updated for asynchronous I/O?

P.S. I couldn't find the code for a dirty page table, so I am not sure whether InnoDB maintains one for recovery. Could you please give me a pointer to the related code and resources? Thanks.

Xiaofei

On Wed, May 6, 2015 at 10:12 AM, Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi,
The terminology is a little bit confusing here. Page flushing means that we have done a synchronous write to disk, but that does not mean the write is physically on the device yet; therefore there is a flush to force it to persistent storage.
R: Jan
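As an illustration of the distinction Jan draws, here is a minimal sketch assuming a POSIX system. It is not InnoDB code: the helper name `write_page_durably` and its parameters are hypothetical. It only shows that a completed `write()` and a completed `fsync()` are two separate steps.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not InnoDB code: write one page synchronously and
// then force it to the device. Returns true only if every step succeeded.
bool write_page_durably(const char* path, const void* page, std::size_t len) {
    assert(path != nullptr && page != nullptr);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;

    // A returning write() only means the kernel accepted the data; it may
    // still sit in the OS page cache. A short write is treated as failure.
    ssize_t written = write(fd, page, len);
    bool ok = (written == static_cast<ssize_t>(len));

    // fsync() asks the OS to push the file's data to persistent storage.
    if (ok && fsync(fd) != 0) ok = false;

    if (close(fd) != 0) ok = false;
    return ok;
}
```

Calling the helper without the `fsync()` step would still be a "synchronous write" in the sense used above, but the data would not necessarily be on the device when the call returns.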
On Wed, May 6, 2015 at 12:57 AM, Xiaofei Du <xiaofei.du008@gmail.com> wrote:
Hello,
When a dirty page is flushed synchronously in buf_flush_write_block_low, fsync is called in the following snippet. I am wondering why we need this fsync for a synchronous flush. The record should already be in the log, so recovery should be able to redo it and apply it to disk. Maybe I am missing something here; please let me know if I am wrong. Thanks much!
Xiaofei
    /* When doing single page flushing the IO is done synchronously
    and we flush the changes to disk only for the tablespace we
    are working on. */
    if (sync) {
        ut_ad(flush_type == BUF_FLUSH_SINGLE_PAGE);
        fil_flush(buf_page_get_space(bpage));

        /* true means we want to evict this page from the
        LRU list as well. */
        buf_page_io_complete(bpage, true);
    }
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Xiaofei -
> Does InnoDB maintain a dirty page table?

You must be referring to the buffer pool flush_list.

> Is fsync called to guarantee the page to be on persistent storage so that the dirty page table can be updated? If this is the case, when is the dirty page table updated for asynchronous IOs?

Check buf_flush_write_complete in buf0flu.cc. For async IO it is called from buf_page_io_complete in buf0buf.cc.

--
Laurynas
Hi Laurynas,

On Wed, May 6, 2015 at 9:14 PM, Laurynas Biveinis <laurynas.biveinis@gmail.com> wrote:
> Xiaofei -
>
> > Does InnoDB maintain a dirty page table?
>
> You must be referring to the buffer pool flush_list.

You are right. The flush_list can be used for recovery and checkpointing.

> > Is fsync called to guarantee the page to be on persistent storage so that the dirty page table can be updated? If this is the case, when is the dirty page table updated for asynchronous IOs?
>
> Check buf_flush_write_complete in buf0flu.cc. For async IO it is called from buf_page_io_complete in buf0buf.cc.

You are right that this is the place where the dirty page information is updated. But I still don't understand why fsync is needed for synchronous I/O but not for AIO. Jan Lindström said fsync is also called for other AIO operations, but I could only find that to be true for one of many AIO operations. Or maybe I am still missing something?

> --
> Laurynas
(In reply to Xiaofei Du <xiaofei.du008@gmail.com>, Wed, May 6, 2015 at 11:01 PM)

Jan, Laurynas,

Thank you both for your helpful answers. My general question is: do we need to call fsync to guarantee that a dirty page is written to persistent storage before it is removed from the flush list, and how would this affect the recovery process? Based on the code, it looks like fsync is not called before buf_flush_write_complete is called, so I guess that not calling fsync does not affect recovery. But I also saw that fsync is called after synchronous I/O. So I am confused about whether we should call fsync after flushing a dirty page in order not to break recovery. Thanks.

Xiaofei
(In reply to Xiaofei Du <xiaofei.du008@gmail.com>, 2015-05-07 9:01 GMT+03:00)

Xiaofei -

fsync is performed for all the flush types (LRU, flush list, single page) if it is asked for (innodb_flush_method != O_DIRECT_NO_FSYNC). The apparent difference between sync and async is not due to the sync/async distinction itself, but to the flush type. A single page flush writes one page and requests an fsync for its file. The other flushes write in batches, so they don't have to fsync for each written page individually, but rather sync once at the end.

Doublewrite complicates this further. If it is disabled, the fsync happens in buf_dblwr_sync_datafiles, called from buf_dblwr_flush_buffered_writes, called from buf_flush_common, which is called at the end of either an LRU or a flush list flush. If doublewrite is enabled, the fsync happens in buf_dblwr_update, called from buf_flush_write_complete.

--
Laurynas
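The per-page versus per-batch syncing described in this message can be sketched with a toy model. This is an assumption-laden illustration, not InnoDB code: `PageWrite`, `SyncCounter`, `flush_single_page` and `flush_batch` are hypothetical names, the page writes themselves are elided, and only the number of sync calls is counted.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model: a page is identified by the tablespace (file) it
// lives in and its page number.
struct PageWrite { int space_id; int page_no; };

struct SyncCounter {
    std::size_t fsync_calls = 0;
    void fsync_space(int space_id) {
        assert(space_id >= 0);  // hypothetical: ids are non-negative
        ++fsync_calls;
    }
};

// Single page flush: write one page, then sync its file right away.
void flush_single_page(const PageWrite& p, SyncCounter& io) {
    // (the page write itself would be issued here)
    io.fsync_space(p.space_id);
}

// Batch flush: issue all writes first, then sync each touched file once.
void flush_batch(const std::vector<PageWrite>& batch, SyncCounter& io) {
    std::set<int> touched;
    for (const PageWrite& p : batch) {
        // (the page write itself would be issued here)
        touched.insert(p.space_id);
    }
    for (int space_id : touched) io.fsync_space(space_id);
}
```

In this model, flushing four pages spread over two files costs four syncs when done page by page and two when done as a batch, which is the cost difference at stake in the thread.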
(In reply to Laurynas Biveinis <laurynas.biveinis@gmail.com>, Thu, May 7, 2015 at 9:59 PM)

Laurynas,

This is exactly what I was looking for. I went through these functions before, but I had disabled the doublewrite buffer, so I didn't pay attention to the code under buf_dblwr... The reason I asked is that I didn't know how the recovery process works, so I was wondering whether it is necessary to fsync after each write. It's a performance concern. Anyway, thank you very much!

Jan, thank you for your answer too!

Xiaofei
(In reply to Xiaofei Du <xiaofei.du008@gmail.com>, Fri, May 8, 2015 at 12:43 PM)

Hi,

InnoDB recovery cannot handle torn pages. An fsync is required to ensure that the page is fully written to disk. This is also why the doublewrite buffer is used. Before pages are written to their final location on disk, they are first written sequentially into the doublewrite buffer. This buffer is synced, and then the async page writes can proceed. If the database crashes, the pages that were in flight are rewritten from the doublewrite buffer.

The detection mechanism for torn pages is based on the LSN, which is written at both the top and the bottom of the page. If the LSNs at the top and bottom do not match, the page is torn.

Regards,

--Justin
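The top/bottom LSN comparison can be sketched as follows. The layout here is a deliberate simplification and an assumption: an 8-byte stamp in the first and last 8 bytes of a 16 KiB page, which is not InnoDB's actual page format. `stamp_page` and `is_torn` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout for illustration only: the same LSN stamp is stored
// in the first and in the last 8 bytes of the page.
constexpr std::size_t kPageSize = 16384;
constexpr std::size_t kStampSize = sizeof(std::uint64_t);

// Write the LSN stamp at both ends of the page before it goes to disk.
void stamp_page(std::vector<unsigned char>& page, std::uint64_t lsn) {
    assert(page.size() == kPageSize);
    std::memcpy(page.data(), &lsn, kStampSize);
    std::memcpy(page.data() + kPageSize - kStampSize, &lsn, kStampSize);
}

// A page whose two stamps differ was only partially overwritten.
bool is_torn(const std::vector<unsigned char>& page) {
    assert(page.size() == kPageSize);
    std::uint64_t head = 0;
    std::uint64_t tail = 0;
    std::memcpy(&head, page.data(), kStampSize);
    std::memcpy(&tail, page.data() + kPageSize - kStampSize, kStampSize);
    return head != tail;
}
```

A torn write can be simulated by copying only the first half of a newer page image over an older one: the head stamp then carries the new LSN while the tail still carries the old one.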
(In reply to Justin Swanhart <greenlion@gmail.com>, Fri, May 8, 2015 at 2:06 PM)

Justin,

I was wondering whether fsync is needed after every write. The operations are already in the log, so recovery can always be done from the log. The difference is that, without the fsync, recovery has to go back further in the log and will take longer. But then, I guess it would be hard to coordinate with the kernel flush thread.

Xiaofei
(In reply to Xiaofei Du <xiaofei.du008@gmail.com>, Fri, May 8, 2015 at 6:09 PM)

Hi,

The log does not contain whole pages. Pages must not be torn for the recovery process to work. An fsync is required when a page is written to disk. During recovery, all changes since the last checkpoint are replayed, and then transactions that do not have a commit marker are rolled back. This is called roll forward/roll back recovery.

--Justin
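Roll forward/roll back recovery can be sketched with a toy key-value model. Everything here is a simplifying assumption rather than InnoDB's design: the `LogRecord` layout is hypothetical, each record carries the old value so the sketch can undo it (InnoDB keeps undo information separately), the log is ordered by LSN, and every transaction starts after the checkpoint.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified log record (not InnoDB's redo format).
struct LogRecord {
    std::uint64_t lsn;
    int txn_id;
    bool is_commit;     // true: commit marker; key and values are unused
    std::string key;
    int old_value;      // stored here only so this sketch can undo
    int new_value;
};

using Table = std::map<std::string, int>;

void recover(Table& data, const std::vector<LogRecord>& log,
             std::uint64_t checkpoint_lsn) {
    std::set<int> committed;
    std::uint64_t prev_lsn = 0;

    // Roll forward: replay every change logged after the checkpoint.
    for (const LogRecord& r : log) {
        assert(r.lsn > prev_lsn);  // the sketch assumes LSN order
        prev_lsn = r.lsn;
        if (r.lsn <= checkpoint_lsn) continue;
        if (r.is_commit) committed.insert(r.txn_id);
        else data[r.key] = r.new_value;
    }

    // Roll back: newest first, undo changes of transactions that have
    // no commit marker in the log.
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (it->lsn <= checkpoint_lsn || it->is_commit) continue;
        if (committed.count(it->txn_id) == 0) data[it->key] = it->old_value;
    }
}
```

Note that the replay step applies individual changes to pages that are assumed to be intact; nothing in such a log can rebuild a page from scratch, which is the point made above about the log not containing whole pages.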
(In reply to Justin Swanhart <greenlion@gmail.com>, Fri, May 8, 2015 at 9:24 PM)

Justin,

I think the fsync I was concerned about and the torn page problem are two different things. But now I have a question about the doublewrite buffer. If we can detect a torn page by checking the top and bottom of the page, why do we still need the doublewrite buffer? If the page is consistent, we use it; otherwise, we just discard it. Maybe this is a naive question, but please let me know. Thanks.

Xiaofei
(In reply to Xiaofei Du <xiaofei.du008@gmail.com>, 2015-05-09 8:44 GMT+03:00)

Xiaofei -

We can indeed detect a torn page write without the doublewrite buffer (and WebScaleSQL has a patch utilising this observation). But we need not only to detect the torn page, but to recover it as well. Without the doublewrite buffer, if we discard the page, we have nothing: a half-old, half-new page on disk, and the redo log records for that page are not enough to recover it.

--
Laurynas
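The difference between detecting and recovering a torn page can be sketched like this. It is a toy model under stated assumptions, not InnoDB code: `Page` and `repair_page` are hypothetical, a page is reduced to two LSN stamps and an integer payload, and the doublewrite copy is assumed to have been written and synced before the in-place write started.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical page: LSN stamps at both ends; a mismatch means torn.
struct Page {
    std::uint64_t head_lsn = 0;
    std::uint64_t tail_lsn = 0;
    int payload = 0;
    bool torn() const { return head_lsn != tail_lsn; }
};

// Crash recovery for one page. Returns true if the page is usable
// afterwards. doublewrite_copy may be null (doublewrite disabled).
bool repair_page(Page& on_disk, const Page* doublewrite_copy) {
    if (!on_disk.torn()) return true;               // nothing to repair
    if (doublewrite_copy == nullptr) return false;  // detected, not recoverable
    if (doublewrite_copy->torn()) return false;     // no intact image either
    on_disk = *doublewrite_copy;                    // restore the full image
    assert(!on_disk.torn());
    return true;
}
```

Without a copy, the function can only report the damage; with one, it restores an intact page image, after which redo records could be applied on top of it.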
Laurynas, We cannot recover from a torn page only using redo log. But wouldn't undo log record enough information for recovery in the case of a torn page? Undo log should have old values of affected rows. So shouldn't it be enough to recover a torn page using information from undo log? Xiaofei On Sat, May 9, 2015 at 12:07 AM, Laurynas Biveinis < laurynas.biveinis@gmail.com> wrote:
Xiaofei -
We can indeed detect the torn page write without the doublewrite buffer (and WebScaleSQL has a patch utilising this observation). But we need not only to detect, but to recover the page as well. And without the doublewrite, if we discard the page, we have nothing: a half-old half-new page on the disk and the redo log records for that page are not enough to recover it.
2015-05-09 8:44 GMT+03:00 Xiaofei Du <xiaofei.du008@gmail.com>:
Justin,
I think the fsync I was concerning and the torn page problem are two different things. But now I have a question about double write buffer. If we can detect a torn page by checking the top and bottom of a page, why would we still need double write buffer? If the page is consistent, then we use it, otherwise, we just discard it. Maybe this is a naive question. But please let me know. Thanks.
Xiaofei
On Fri, May 8, 2015 at 9:24 PM, Justin Swanhart <greenlion@gmail.com> wrote:
Hi,
The log does not have whole pages. Pages must not be torn for the recovery process to work. A fsync is required when a page is written to disk. During recovery all changes since the last checkpoint are replayed, then transactions that do not have a commit marker are rolled back. This is called roll forward/roll back recovery.
--Justin
On Fri, May 8, 2015 at 6:09 PM, Xiaofei Du <xiaofei.du008@gmail.com> wrote:
Justin,
I was thinking of if fsync is needed each time after a write. The operations are already in the log. So recovery can always be done from the log. The difference is that during recovery, we need to go back further in the log and it will take longer. But in that way, I guess it would be hard to coordinate with the kernel flush thread.
Xiaofei
On Fri, May 8, 2015 at 2:06 PM, Justin Swanhart <greenlion@gmail.com> wrote:
Hi,
InnoDB recovery can not handle torn pages. An fsync is required to ensure that the page is fully written to disk. This is also why the doublewrite buffer is used. Before pages are written down to disk, they are first written sequentially into the doublewrite buffer. This buffer is synced, then async page writing can proceed. If the database crashes, the pages in flight will be rewritten by the doublewrite buffer. The detection mechanism for torn pages comes from an LSN, which is written into the top and the bottom of the page. If the LSN at the top and bottom do not match the page is torn.
Regards,
--Justin
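The detection mechanism Justin describes, an LSN written at the top and the bottom of the page, can be sketched as follows. The layout here (a 64-byte page with an 8-byte LSN at each end) and the names `stamp_page` and `is_torn` are assumptions made for illustration; the real InnoDB page format differs in its details.

```python
import struct

PAGE_SIZE = 64  # toy page size; the layout below is hypothetical
LSN_BYTES = 8


def stamp_page(payload: bytes, lsn: int) -> bytes:
    """Build a page with the same LSN stamped at the top and at the bottom."""
    assert len(payload) == PAGE_SIZE - 2 * LSN_BYTES
    stamp = struct.pack(">Q", lsn)  # 8-byte big-endian unsigned integer
    return stamp + payload + stamp


def is_torn(page: bytes) -> bool:
    """A page is considered torn when the top and bottom LSN stamps differ."""
    (top,) = struct.unpack(">Q", page[:LSN_BYTES])
    (bottom,) = struct.unpack(">Q", page[-LSN_BYTES:])
    return top != bottom


old = stamp_page(b"\x00" * 48, lsn=100)
new = stamp_page(b"\x01" * 48, lsn=200)
half_written = new[:32] + old[32:]  # only the first half of the new page reached disk
```

Here `is_torn(half_written)` is true while `is_torn(new)` is false. The check only detects the problem; as discussed elsewhere in the thread, repairing the page still needs an intact copy.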
On Fri, May 8, 2015 at 12:43 PM, Xiaofei Du <xiaofei.du008@gmail.com> wrote:
Laurynas,
This is exactly what I was looking for. I went through these functions before. I disabled double write buffer, so I didn't pay attention to code under buf_dblwr... The reason I asked this question is because I didn't know how the recovery process works, so I was wondering if it's necessary to fsync after each write. It's a performance concern. Anyway, thank you very much!
Jan -- Thank you for your answer too!
Xiaofei
On Thu, May 7, 2015 at 9:59 PM, Laurynas Biveinis <laurynas.biveinis@gmail.com> wrote:
Xiaofei -
fsync is performed for all the flush types (LRU, flush, single page) if it is asked for (innodb_flush_method != O_DIRECT_NO_FSYNC). The apparent difference in sync and async is not because of the sync difference itself, but because of the flush type difference. The single page flush flushes one page, and requests a fsync for its file. Other flushes flush in batches, don't have to fsync for each written page individually but rather sync once at the end. Then doublewrite complicates this further. If it is disabled, fsync will happen in buf_dblwr_sync_datafiles called from buf_dblwr_flush_buffered_writes called from buf_flush_common called at the end of either LRU or flush list flush. If doublewrite is enabled, fsync will happen in buf_dblwr_update called from buf_flush_write_complete.
2015-05-07 9:01 GMT+03:00 Xiaofei Du <xiaofei.du008@gmail.com>:
Hi Laurynas,
On Wed, May 6, 2015 at 9:14 PM, Laurynas Biveinis <laurynas.biveinis@gmail.com> wrote:
Xiaofei -
> Does InnoDB maintain a dirty page table?
You must be referring to the buffer pool flush_list.
You are right. The flush_list can be used for recovery and checkpoint.
> Is fsync called to guarantee the page to be on persistent storage so that the dirty page table can be updated? If this is the case, when is the dirty page table updated for asynchronous IOs?
Check buf_flush_write_complete in buf0flu.cc. For async IO it is called from buf_page_io_complete in buf0buf.cc.
You are right that this is the place it updates the dirty page information. But I still don't understand why the fsync is needed for synchronous IOs, but not for the AIOs. Jan Lindstrom said fsync is also called for other AIO operations. But I could only find it true in one of many AIO operations. Or maybe I am missing something still?
-- Laurynas
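The difference Laurynas describes between a single-page flush (one write, one fsync) and a batch flush (many writes, one fsync at the end) can be sketched with plain POSIX file calls. This is a minimal illustration, not InnoDB code: `flush_single_page` and `flush_batch` are hypothetical names, the 4096-byte page size is an assumption, and `os.pwrite`/`os.pread` are available only on Unix-like systems.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size for this sketch


def flush_single_page(fd: int, page_no: int, data: bytes) -> int:
    """Write one page and sync immediately; returns the number of fsync calls."""
    os.pwrite(fd, data, page_no * PAGE_SIZE)
    os.fsync(fd)
    return 1


def flush_batch(fd: int, pages: dict) -> int:
    """Write every page, then sync once at the end; returns the number of fsync calls."""
    for page_no, data in pages.items():
        os.pwrite(fd, data, page_no * PAGE_SIZE)
    os.fsync(fd)
    return 1


fd, path = tempfile.mkstemp()
try:
    pages = {n: bytes([n]) * PAGE_SIZE for n in range(8)}
    # Eight single-page flushes cost eight fsync calls.
    syncs_one_by_one = sum(flush_single_page(fd, n, d) for n, d in pages.items())
    # One batch flush of the same eight pages costs a single fsync call.
    syncs_batched = flush_batch(fd, pages)
    last_page = os.pread(fd, PAGE_SIZE, 7 * PAGE_SIZE)
finally:
    os.close(fd)
    os.remove(path)
```

In both cases every page is covered by an fsync before the flush is considered complete; the batch simply amortises the cost of the sync over many pages.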
I came across some slides by the Percona CEO. https://www.percona.com/live/mysql-conference-2015/sites/default/files/slide... On page 45, it says "Flash can avoid this with little cost due to internal design". Does this mean we can safely disable the doublewrite buffer? Thanks.
Xiaofei
On Sat, May 9, 2015 at 4:57 PM, Xiaofei Du <xiaofei.du008@gmail.com> wrote:
Laurynas,
We cannot recover from a torn page using only the redo log. But wouldn't the undo log record enough information for recovery in the case of a torn page? The undo log should have the old values of the affected rows. So shouldn't it be enough to recover a torn page using information from the undo log?
Xiaofei
If the device and the filesystem provide the guarantees, then yes: http://www.percona.com/doc/percona-server/5.5/performance/atomic_fio.html, but not in the general case.
2015-05-10 9:12 GMT+03:00 Xiaofei Du <xiaofei.du008@gmail.com>:
I came across some slides by the Percona CEO. https://www.percona.com/live/mysql-conference-2015/sites/default/files/slide... On page 45, it says "Flash can avoid this with little cost due to internal design". Does this mean we can safely disable the doublewrite buffer? Thanks.
Xiaofei
Undo logs log only a subset of a database instance. And, since their purpose is different, by the time of crash recovery the undo logs might be purged.
2015-05-10 2:57 GMT+03:00 Xiaofei Du <xiaofei.du008@gmail.com>:
Laurynas,
We cannot recover from a torn page using only the redo log. But wouldn't the undo log record enough information for recovery in the case of a torn page? The undo log should have the old values of the affected rows. So shouldn't it be enough to recover a torn page using information from the undo log?
Xiaofei
Laurynas,
Thank you for your explanation. It helps me a lot. I appreciate your help! Thanks to everyone else for their help as well!
Xiaofei
On Sun, May 10, 2015 at 9:53 PM, Laurynas Biveinis <laurynas.biveinis@gmail.com> wrote:
Undo logs log only a subset of a database instance. And, since their purpose is different, by the time of crash recovery the undo logs might be purged.
Xiaofei -
The fsync is not required for recovery itself. It is required to indicate that recovery will not need to happen for this page, so that 1) the log records in the circular redo log can now be safely overwritten and reused for new writes, and 2) the copy of this page in the doublewrite buffer is no longer required either.
2015-05-09 4:09 GMT+03:00 Xiaofei Du <xiaofei.du008@gmail.com>:
Justin,
I was thinking of if fsync is needed each time after a write. The operations are already in the log. So recovery can always be done from the log. The difference is that during recovery, we need to go back further in the log and it will take longer. But in that way, I guess it would be hard to coordinate with the kernel flush thread.
Xiaofei
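Laurynas's answer, that the fsync is what allows space in the circular redo log to be reused, can be modelled roughly as below. This is a simplified sketch with hypothetical names (`RedoLog`, `advance_checkpoint`, and a `dirty_pages` map from page number to the LSN of its oldest change that is not yet durable); it is not how InnoDB structures its checkpoint code.

```python
class RedoLog:
    """Toy circular redo log: only records older than the checkpoint may be overwritten."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.checkpoint_lsn = 0  # everything before this LSN is safe to overwrite
        self.end_lsn = 0         # LSN just past the newest record

    def free_space(self) -> int:
        return self.capacity - (self.end_lsn - self.checkpoint_lsn)

    def append(self, nbytes: int) -> int:
        if nbytes > self.free_space():
            raise RuntimeError("redo log full: write and fsync dirty pages first")
        self.end_lsn += nbytes
        return self.end_lsn


def advance_checkpoint(log: RedoLog, dirty_pages: dict) -> None:
    """Move the checkpoint up to the oldest change that is still not durable on disk."""
    log.checkpoint_lsn = min(dirty_pages.values(), default=log.end_lsn)


log = RedoLog(capacity=100)
log.append(60)
dirty_pages = {1: 0, 2: 40}   # page 1 holds a change from LSN 0, page 2 from LSN 40
advance_checkpoint(log, dirty_pages)
space_before = log.free_space()

del dirty_pages[1]            # page 1 has been written AND fsynced, so it is durable
advance_checkpoint(log, dirty_pages)
space_after = log.free_space()
```

In this model a page may be dropped from `dirty_pages` only once its write is known to be on persistent storage. Dropping it after a buffered write that has not been synced would let the log overwrite records that a crash could still need.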
Participants (3): Justin Swanhart, Laurynas Biveinis, Xiaofei Du