[Maria-discuss] WL#4925 and multi-source replication
Hi, I'm testing multi-source replication with 6 sources, it works great ! Now I'm at the "HA" state, testing a master multi-source with DRBD. To do a safe switch between masters, I have to enable sync_binlog=1 and then the master delay grow... I see that WL#4925 suggest a solution that could work : is that job in progress in MariaDB 10 trunk ? And if not, how to ask to include it ? Thanks! -- Greg
Hi!
"Greg" == Greg <skygreg@gmail.com> writes:
Greg> Hi, Greg> I'm testing multi-source replication with 6 sources, it works great ! Greg> Now I'm at the "HA" state, testing a master multi-source with DRBD. To do a Greg> safe switch between masters, I have to enable sync_binlog=1 and then the Greg> master delay grow... Greg> I see that WL#4925 suggest a solution that could work : is that job in Greg> progress in MariaDB 10 trunk ? For future reference, here is a link to this: http://dev.mysql.com/worklog/task/?id=4925 I didn't think that sync_binlog=1 is required anymore for safe reliction. We are always using group commit in MariaDB 10.0 for the master, so the binary log will be synced for each group commit, which is safe. The preallocation of binary logs will of course speed up the syncs a bit, but probably not as much as suggested in the worklog task. Greg> And if not, how to ask to include it ? Currently we don't have any plans to implement this ourselves to MariaDB. However if MySQL 5.6 gets this or anyone from the community provides us with a patch, we will get it into MariaDB 10.0 too. This is of course something that a customer can sponsor us, if they think this would be useful for them.... Regards, Monty
Hi, 2013/3/14 Michael Widenius <monty@askmonty.org>
I didn't think that sync_binlog=1 is required anymore for safe reliction.
We are always using group commit in MariaDB 10.0 for the master, so the binary log will be synced for each group commit, which is safe.
I have to use it in a DRBD config. When testing this config, I killed mysqld with SIGKILL and sync_binlog=0 and failover start binary log in old position, that caused duplicates on slaves. With sync_binlog=1, this happens no more. How to configure how often group commits are fdatasynced ?
This is of course something that a customer can sponsor us, if they think this would be useful for them....
We will discuss about that on March 27 ;) -- Greg
Greg <skygreg@gmail.com> writes:
I didn't think that sync_binlog=1 is required anymore for safe reliction.
We are always using group commit in MariaDB 10.0 for the master, so the binary log will be synced for each group commit, which is safe.
I have to use it in a DRBD config. When testing this config, I killed mysqld with SIGKILL and sync_binlog=0 and failover start binary log in old position, that caused duplicates on slaves.
With sync_binlog=1, this happens no more.
How to configure how often group commits are fdatasynced ?
Correct, sync_binlog=1 is still required to ensure consistency between binlog and innodb. With sync_binlog=1, each group commit is fdatasynced (with a single binlog fdatasync per group), no further configuration is needed. You also need innodb_flush_log_at_trx_commit=1, of course. I am a bit surprised that you got duplicates with SIGKILL of mysqld. I would have expected crashing the OS kernel (ie. power failure) would be needed for fdatasync to make any difference? - Kristian.
2013/3/14 Kristian Nielsen <knielsen@knielsen-hq.org>
Correct, sync_binlog=1 is still required to ensure consistency between binlog and innodb. With sync_binlog=1, each group commit is fdatasynced (with a single binlog fdatasync per group), no further configuration is needed.
You also need innodb_flush_log_at_trx_commit=1, of course.
Yes, it's set.
I am a bit surprised that you got duplicates with SIGKILL of mysqld. I would have expected crashing the OS kernel (ie. power failure) would be needed for fdatasync to make any difference?
With "kill -9 mysqld", and sync_binlog=0, I'm not really surprised since mysql will not fdatasync after each commit, right ? To be more precise, actually the master run with sync_binlog=<number of sources>, here 4 sources including main. Some config vars and status : MariaDB [(none)]> SHOW STATUS LIKE 'binlog_%commits'; +----------------------+---------+ | Variable_name | Value | +----------------------+---------+ | Binlog_commits | 3239987 | | Binlog_group_commits | 3210932 | +----------------------+---------+ 2 rows in set (0.00 sec) MariaDB [(none)]> SHOW variables LIKE '%trx%'; +--------------------------------+-------+ | Variable_name | Value | +--------------------------------+-------+ | innodb_flush_log_at_trx_commit | 1 | +--------------------------------+-------+ 1 row in set (0.00 sec) MariaDB [(none)]> SHOW variables LIKE 'sync_binlog'; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | sync_binlog | 4 | +---------------+-------+ 1 row in set (0.00 sec) Seconds_Behind_Master: 1013 Some stats from mytop : MariaDB on localhost (10.0.1-MariaDB-log) up 0+03:35:22 [20:55:58] Queries: 18.9k qps: 1 Slow: 1.0 Se/In/Up/De(%): 33/13881/2497/559 Sorts: 0 qps now: 3 Slow qps: 0.0 Threads: 3 ( 3/ 1) 23/8885/577/315 Handler: (R/W/U/D) 5622/ 236/ 215/ 88 Tmp: R/W/U: 75/ 75/ 0 ISAM Key Efficiency: 91.2% Bps in/out: 127.1k/250.2k Now in/out: 128.2k/241.9k iostat -xm 10 : Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 324.20 3.30 205.80 0.04 2.23 22.29 1.12 5.36 6.06 5.35 4.63 96.80 drbd0 0.00 0.00 3.30 528.90 0.04 2.24 8.76 2.74 5.16 6.06 5.15 1.81 96.24 Do you need more informations ? How can I help more ? -- Greg
Greg <skygreg@gmail.com> writes:
With "kill -9 mysqld", and sync_binlog=0, I'm not really surprised since mysql will not fdatasync after each commit, right ?
Right. So this is mostly just my own academic interest, in practice it is of course real crashes/powerfailures we want to handle, not SIGKILL. If you are interested, this is my thinking: the server always does a write(2) system call on the binlog at (group) commit, even with sync_binlog=0. So even if we SIGKILL the server, the data is still in the kernel buffers (at least on Linux), and will eventually reach disk. However, you are using DRBD. I am guessing that when mysqld on one node dies, a failover is done to the other node, and this looses data in the kernel disk buffers on the first node that have not been fdatasync()'ed. So I learned a bit about DRBD, thanks ;-) - Kristian.
2013/3/15 Kristian Nielsen <knielsen@knielsen-hq.org>
Right. So this is mostly just my own academic interest, in practice it is of course real crashes/powerfailures we want to handle, not SIGKILL.
If you are interested, this is my thinking: the server always does a write(2) system call on the binlog at (group) commit, even with sync_binlog=0. So even if we SIGKILL the server, the data is still in the kernel buffers (at least on Linux), and will eventually reach disk.
I'm interested, and I learn too, thanks!
However, you are using DRBD. I am guessing that when mysqld on one node dies, a failover is done to the other node, and this looses data in the kernel disk buffers on the first node that have not been fdatasync()'ed.
I think you're right. I know that DRBD is not the best choice to have high-availibility on a MariaDB master, but I have my reasons to make this choice.... What could be the best settings in MariaDB and DRBD to handle this situation ? In DRBD, I just set protocol B which should be sufficient, but I can easily test with C. I have test some disk options like no-disk-barrier no-disk-flushes no-md-flushes use-bmbv and yes I tried too no-disk-drain ... In MariaDB I set : innodb_flush_log_at_trx_commit = 1 sync_binlog = 1 innodb_file_per_table = 1 I can test things, patchs, but I don't know what ... I didn't found WL#4925 patch so I can't test it. Any clue ? -- Greg
Greg <skygreg@gmail.com> writes:
I think you're right. I know that DRBD is not the best choice to have high-availibility on a MariaDB master, but I have my reasons to make this
Isn't it? I do not have practical experience with MariaDB/MySQL in a HA setup, but I have always thought that DRBD was one of the best choices, being a simple and robust solution. Please do not be confused with my thoughts on SIGKILL and fdatasync(), as I said it was just academic interest. MariaDB and DRBD are doing the right thing here, as far as I understand.
In DRBD, I just set protocol B which should be sufficient, but I can easily test with C. I have test some disk options
Well, if I understood you correctly you tested with sync_binlog=1 and verified that no inconsistencies occured. So this should be fine (I am not aware of the difference between B and C).
In MariaDB I set : innodb_flush_log_at_trx_commit = 1 sync_binlog = 1 innodb_file_per_table = 1
Yes, this is correct.
I can test things, patchs, but I don't know what ... I didn't found WL#4925 patch so I can't test it.
We do not currently have any patch or plans for pre-allocation of binlog. - Kristian.
2013/3/15 Kristian Nielsen <knielsen@knielsen-hq.org>
Isn't it? I do not have practical experience with MariaDB/MySQL in a HA setup, but I have always thought that DRBD was one of the best choices, being a simple and robust solution.
Yoshinori Matsunobu described available solutions with pros and cons here : http://code.google.com/p/mysql-master-ha/wiki/Other_HA_Solutions
I can test things, patchs, but I don't know what ... I didn't found WL#4925 patch so I can't test it.
We do not currently have any patch or plans for pre-allocation of binlog.
This morning I've removed DRBD to verify that it is not the bottleneck, and .... It's much faster with DRBD ! I didn't found anywere if Bug#13669 (http://bugs.mysql.com/bug.php?id=13669) was fixed, but it's seems to, as I see Binlog_commits =~ Binlog_group_commits. So I still have performance problems and perhaps WL#4925 could help .... or I'll have to buy some better hardware. -- Greg
Greg <skygreg@gmail.com> writes:
This morning I've removed DRBD to verify that it is not the bottleneck, and .... It's much faster with DRBD !
Yes, because with DRBD, an fdatasync() only needs to do a network trip to the other node. On local disk, it needs a physical disk write, which can take 10 milliseconds (on non-SSD and non-battery-backup disk controller cache).
I didn't found anywere if Bug#13669 (http://bugs.mysql.com/bug.php?id=13669) was fixed, but it's seems to,
Yes, it is fixed from MariaDB 5.3.
as I see Binlog_commits =~ Binlog_group_commits. So I still have performance problems and perhaps
If Binlog_commits =~ Binlog_group_commits, then yes, you do not get much benefit from group commit. How many transactions do you have running in parallel? Binlog group commit only helps if you have more than 2 transactions writing concurrently (and the more, the better). Of course, it could also be that the fdatasync() at commit is not a bottlenect, which would also result in Binlog_commits =~ Binlog_group_commits. How many commits per second do you do?
WL#4925 could help .... or I'll have to buy some better hardware.
Binlog pre-allocation can help a little bit. Basically, in MariaDB 5.5, you need three fdatasync() calls per group commit (one in innodb prepare, one in binlog write, one in innodb commit). The one in binlog write is more expensive because the file length changes, so it requires to sync both the data and the metadata. Though on DRBD it is probably not much more costly to send two disk blocks to the other node as compared to one. So binlog pre-allocate would remove the "more expensive" for binlog write, but you would still have 3 fdatasync() calls per group commit. In MariaDB 10.0, I implemented MDEV-181 (https://mariadb.atlassian.net/browse/MDEV-181). This reduces the 3 fdatasync() calls to 2. Another task that would help is MDEV-47 (https://mariadb.atlassian.net/browse/MDEV-47). This would reduce the overhead to a single fdatasync() call. However, like binlog pre-allocation, there are currently no concrete plans for this. - Kristian.
2013/3/15 Kristian Nielsen <knielsen@knielsen-hq.org>
How many transactions do you have running in parallel? Binlog group commit only helps if you have more than 2 transactions writing concurrently (and the more, the better).
I didn't add write queries to this master, so I only have the 4 sources (I removed 2 sources since I enable sync_binlog), so I mean 4 transactions in parallel at max ...
Of course, it could also be that the fdatasync() at commit is not a bottlenect, which would also result in Binlog_commits =~ Binlog_group_commits. How many commits per second do you do?
~120/s (live now :) )
In MariaDB 10.0, I implemented MDEV-181
I'm already using 10.0.1. If interested I found last commit for WL#4925 by Yoshinori here : http://lists.mysql.com/commits/113309 I'm looking for adding a SSD drive to those servers for binlogs. -- Greg
Hi!
"Greg" == Greg <skygreg@gmail.com> writes:
Greg> Hi, Greg> 2013/3/14 Michael Widenius <monty@askmonty.org>
I didn't think that sync_binlog=1 is required anymore for safe reliction.
We are always using group commit in MariaDB 10.0 for the master, so the binary log will be synced for each group commit, which is safe.
Greg> I have to use it in a DRBD config. When testing this config, I killed Greg> mysqld with SIGKILL and sync_binlog=0 and failover start binary log in old Greg> position, that caused duplicates on slaves. Kristian, do you have any comments for the above? Greg> With sync_binlog=1, this happens no more. Greg> How to configure how often group commits are fdatasynced ? They should always be fdatasynced. Kristian ?
This is of course something that a customer can sponsor us, if they think this would be useful for them....
Greg> We will discuss about that on March 27 ;) Greg> -- Greg> Greg
On Wed, Mar 13, 2013 at 4:34 PM, Greg <skygreg@gmail.com> wrote:
Hi,
I'm testing multi-source replication with 6 sources, it works great ! Now I'm at the "HA" state, testing a master multi-source with DRBD. To do a safe switch between masters, I have to enable sync_binlog=1 and then the master delay grow...
Hi Greg I realize you said in this thread that you are already committed to use DRBD, but just to educate the list if nothing else: Is MariaDB Galera Cluster an option for the master database? The difference with Galera in this case would be: - It is not dependent on innodb_flush_log_at_trx_commit needing to be 1, and with some additional configuration/verficiation you could also safely do sync_binlog=0. - Even if you choose to set both of these to 1, I've found that the performance penalty from Galera is much smaller (or nonexistent) compared to DRBD. The difference is that Galera replication happens completely separately from syncing data to disk, whereas DRBD makes syncing data to disk slower and this is unfortunate as it is already the slowest part of committing a transaction. Other than this I agree with Kristian that DRBD in general is a robust and simple HA solution. (It sounds like in your use case the slow failover times are not an issue, which is true for many applications of course.) I just wanted to introduce an alternative. Disclosure: I nowadays work for Codership, although my experience with Galera comes as being an end user. henrik -- henrik.ingo@avoinelama.fi +358-40-8211286 skype: henrik.ingo irc: hingo www.openlife.cc My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559
participants (4)
-
Greg
-
Henrik Ingo
-
Kristian Nielsen
-
Michael Widenius