[Maria-developers] MariaDB multi-source replication testing at Booking.com
Dear MariaDB developers,

I'm Károly Nagy, working for Booking.com, and I am currently testing the multi-source replication functionality of MariaDB. Kristian suggested I reach out to you on this mailing list with my questions.

We're seeing very high and fluctuating mutex contention while replicating from two sources (Oracle MySQL 5.6) to a single MariaDB slave. You can see that on the graphs below. The spin waits [1] are relatively aligned, but the mutex rounds [2] are 5-10 times higher than on the two sources combined, and they are not consistent. The sources have a relatively constant pattern while the target has dips around 2.5k and spikes up to 8k. The OS waits [3] are in a completely different order of magnitude.

The scenario where the values were captured:

* The multi-source target is replicating the full dataset of `source 2` and a subset of `source 1` (the hot data) - MariaDB 10.0.16
* Both sources are MySQL 5.6, being part of their replication chain as slaves with log_slave_updates
* Source 2 is in normal mode - Oracle MySQL 5.6.17
* Source 1 is catching up from a 1 day replication delay - Oracle MySQL 5.6.24
* All the slaves are warm, with the buffer pool fully populated

Is this behavior expected? Could you give us some insight into why we see these results?

If you need any more information, please let me know. Thank you in advance for your help!

Every metric is on a per-10-seconds basis.

[1] Mutex spin waits
[2] Mutex rounds
[3] Mutex OS waits

Best regards,
--
Károly Nagy
System engineer
Booking.com <http://booking.com/> BV
Rembrandt Square Office, Herengracht 597, 1017 CE Amsterdam
Direct +31 (0)20 715 8403
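For readers who want to reproduce per-10-second figures like the ones described above, here is a minimal sketch. It assumes the counters come from the "Mutex spin waits ..., rounds ..., OS waits ..." summary line in the SEMAPHORES section of `SHOW ENGINE INNODB STATUS`; the thread does not say how the graphs were actually produced, and the sample values are made up.

```python
import re

# Matches the SEMAPHORES summary line printed by SHOW ENGINE INNODB STATUS
# on MySQL 5.6-era servers, e.g.
# "Mutex spin waits 100, rounds 2000, OS waits 30".
MUTEX_LINE = re.compile(
    r"Mutex spin waits (\d+), rounds (\d+), OS waits (\d+)"
)


def parse_mutex_counters(status_text):
    """Return cumulative (spin_waits, rounds, os_waits) from a status dump."""
    match = MUTEX_LINE.search(status_text)
    if match is None:
        raise ValueError("no 'Mutex spin waits' line found")
    return tuple(int(group) for group in match.groups())


def per_interval(previous, current):
    """Difference between two cumulative samples taken one interval apart."""
    return tuple(now - before for before, now in zip(previous, current))


# Two hypothetical samples taken 10 seconds apart (illustrative numbers only).
sample_t0 = "Mutex spin waits 1000, rounds 30000, OS waits 400"
sample_t1 = "Mutex spin waits 1250, rounds 38000, OS waits 520"

delta = per_interval(parse_mutex_counters(sample_t0),
                     parse_mutex_counters(sample_t1))
print(delta)  # (250, 8000, 120)
```

The counters are cumulative since server start, so only the difference between two samples is comparable across servers with different uptimes.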
Karoly Nagy <karoly.nagy@booking.com> writes:
Is this behavior expected?
So if I understand correctly, what is compared here is the value of some InnoDB statistics between two MySQL 5.6 servers each running a single replication SQL thread, and a MariaDB 10.0 server running two replication SQL threads (multi-source replication).

I do not have much experience with interpreting InnoDB mutex wait statistics; hopefully someone with more experience on this can contribute. But it does seem somewhat expected that a server with two threads has a much higher potential for mutex contention (mutex rounds and OS waits) than a server using only a single thread, right?

Did you try comparing the numbers when only one thread is running on the MariaDB slave (e.g. stopping first one of the multi-source connections, then the other)?

Did you try comparing the configurations of the three servers for any relevant differences?

What are the corresponding statistics on the original masters generating the load?

Did you try to determine which individual mutexes contribute most to the differences (the total number of mutex waits alone is a somewhat crude statistic which might be hard to interpret)?

Do you have any indication that these differences are causing problems with performance, or are you just curious to understand them?

Hope this helps,

 - Kristian.
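One way to approach the question about individual mutexes is to take two snapshots of `SHOW ENGINE INNODB MUTEX`, which returns rows of (Type, Name, Status) with Status of the form `os_waits=N`, and rank the mutexes by growth. The sketch below assumes the rows have already been fetched into tuples; the mutex names `mutex_a` and `mutex_b` are hypothetical placeholders, not real InnoDB mutex names.

```python
from collections import Counter


def os_waits_by_mutex(rows):
    """Sum os_waits per mutex name from (Type, Name, Status) rows."""
    totals = Counter()
    for _type, name, status in rows:
        key, _, value = status.partition("=")
        if key.strip() == "os_waits":
            totals[name] += int(value)
    return totals


def top_movers(before, after, limit=3):
    """Mutexes whose os_waits grew the most between two snapshots."""
    growth = Counter(
        {name: after[name] - before.get(name, 0) for name in after}
    )
    return [(name, count) for name, count in growth.most_common(limit)
            if count > 0]


# Hypothetical snapshots; the same name can appear on several rows.
rows_before = [
    ("InnoDB", "mutex_a", "os_waits=10"),
    ("InnoDB", "mutex_b", "os_waits=5"),
    ("InnoDB", "mutex_a", "os_waits=2"),
]
rows_after = [
    ("InnoDB", "mutex_a", "os_waits=40"),
    ("InnoDB", "mutex_b", "os_waits=6"),
    ("InnoDB", "mutex_a", "os_waits=2"),
]

movers = top_movers(os_waits_by_mutex(rows_before),
                    os_waits_by_mutex(rows_after))
print(movers)  # [('mutex_a', 30), ('mutex_b', 1)]
```

Running the same comparison on the MariaDB target and on each source would show whether the extra waits concentrate on a few global mutexes or are spread evenly.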
Hi Karoly,

Can you share a 'pstack' result from the multi-source target when the OS waits are high? And what are your options in my.cnf?

As I see it, multi-source replication should not affect InnoDB. But if Source_1 and Source_2 operate on the same table, that may cause some conflicts on the slave, and the slave may then show more OS waits or spin waits.

Thanks,
Lixun
_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp

--
Staff Database Engineer @ Alibaba Cloud Computing
Oracle ACE for MySQL
Phone: +86 18658156856 (Hangzhou)
Blog: http://www.penglixun.com
Hi Lixun!

Thank you for your reply. I gave more details in my response to Kristian's email.

I cannot get the pstack output right now because the server is running another test at the moment, but I will set it up again and see the results. The my.cnf files were the same except for some MariaDB-specific things (the slave repository, for example; we use a table in MySQL). The two threads were writing to different databases, so hypothetically there shouldn't be any locking involved between the two except the global ones.

I cannot share the specific information on an open mailing list, but if it comes to that point we can follow it up in a service request.

Best regards,
--
Károly Nagy
Lixun Peng <mailto:penglixun@gmail.com> 29 Jun 2015 11:48
[...]
Kristian Nielsen <mailto:knielsen@knielsen-hq.org> 29 Jun 2015 11:18
[...]
Karoly Nagy <mailto:karoly.nagy@booking.com> 29 Jun 2015 10:51
[...]
Karoly Nagy <mailto:karoly.nagy@booking.com> 5 Jun 2015 16:10
[...]
Hi Kristian,

Thank you for your quick response. Please see my answers inline.

Kristian Nielsen <mailto:knielsen@knielsen-hq.org> 29 Jun 2015 11:18
> So if I understand correctly, what is compared here is the value of some InnoDB statistics between two MySQL 5.6 servers each running a single replication SQL thread, and a MariaDB 10.0 server running two replication SQL threads (multi-source replication).
>
> I do not have much experience with interpreting InnoDB mutex wait statistics, hopefully some with more experience on this can contribute. But it does seem somewhat expected that a server with two threads has a much higher potential for mutex contention (mutex rounds and os waits) than a server using only a single thread, right?

Yes, that's correct. Two Oracle MySQL 5.6 slaves with log_slave_updates serve as the sources of the "merged" slave running MariaDB. I would expect somewhat higher contention, but I was surprised by the multiplication factor: the two threads are hypothetically running isolated on different databases on the same server, so other than some global mutexes there shouldn't be any contention between the threads.

Please correct me if I'm wrong, but as I understand it the spin waits indicate that roughly the expected number of lock acquisitions happens on the target slave (source1 + source2). An order of magnitude more mutex rounds means it took longer to acquire those locks.

> Did you try comparing the numbers when only one thread is running on the MariaDB slave (eg. stopping first one of the multisource connections, then the other) ?

Yes. We let it run for a while, then stopped one thread and then the other. The graphs below show the status when we stopped one source completely (the server itself). The results are roughly the same: the spin waits are aligned (sometimes even lower) while the mutex rounds and OS waits are multiplied.

> Did you try comparing the configurations of the three servers for any relevant differences?

The server configurations were the same except, of course, for MariaDB-specific things; they are driven by our Puppet configuration.

> What are the corresponding statistics on the original masters generating the load?

I can't really compare the master metrics because the writes happen in parallel there, so the spin waits are much higher, although the OS waits seem roughly as high as what we experienced on the slave.

> Did you try to determine which individual mutexes are mostly contributing to the differences (just total number of mutex waits is a somewhat crude statistics which might be hard to interpret)?

Yes, I did. We don't track that, so I cannot look back at it. The servers are running a different test now, but I can set it up again and see. If I remember correctly, nothing really stood out.

> Do you have any indication that these differences are causing problems with performance, or are you just curious to understand them?

Not really. We saw it being a bit slower in replication, but across the multiple runs there was no statistically significant difference. We had the sources fall behind in replication by 60-3600 seconds and started replication from there. The target was keeping up with both upstreams with only 1-2 seconds of delay. We were only stressing the slave with the upstream slaves catching up.

Do you think it would be worth setting up two true masters and testing with natural writes on them, to see how they replicate to the multi-sourced slave?

Best regards,
--
Károly Nagy
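The reasoning about spin waits versus rounds can be made concrete with a small calculation. This is only a sketch with invented per-interval numbers (the thread's graphs are not available); it compares the rounds spent per spin wait on the target against the two sources combined.

```python
def rounds_per_wait(spin_waits, rounds):
    """Average spin rounds spent per spin wait (0.0 if there were no waits)."""
    return rounds / spin_waits if spin_waits else 0.0


# Hypothetical per-10-second deltas, chosen only to illustrate the pattern
# described in the thread: similar wait counts, far more rounds on the target.
source_1 = {"spin_waits": 400, "rounds": 1200}
source_2 = {"spin_waits": 600, "rounds": 1800}
target = {"spin_waits": 1000, "rounds": 24000}

combined = {
    key: source_1[key] + source_2[key] for key in ("spin_waits", "rounds")
}

combined_ratio = rounds_per_wait(**combined)  # 3000 / 1000 = 3.0
target_ratio = rounds_per_wait(**target)      # 24000 / 1000 = 24.0
inflation = target_ratio / combined_ratio     # 8.0

print(f"sources combined: {combined_ratio:.1f} rounds per wait")
print(f"target:           {target_ratio:.1f} rounds per wait")
print(f"target spins {inflation:.1f}x longer per wait")
```

With equal spin-wait counts, a higher rounds-per-wait ratio means each contended acquisition spun longer before succeeding or falling back to an OS wait, which is the interpretation offered in the reply above.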
participants (3)
- Karoly Nagy
- Kristian Nielsen
- Lixun Peng