[Maria-developers] prospective GSOC 2017 student [MDEV-7502]
Hi,
My name is Ibrar Arshad and I am interested in working on the task of automatic slave provisioning (ticket: MDEV-7502 <https://jira.mariadb.org/browse/MDEV-7502>) during GSOC 2017. I have read the summary on the ticket and have achieved a fair understanding of the problem, and I am working towards ironing out the implementation details. The use case, as I understand it, is that we want the slave to auto-replicate the data from the master once pointed to the master, and we want to do it in such a manner that the binlog events from the current master position as well as the old data chunks are relayed to the slave in a parallel fashion.
I have a few questions related to the proposal:
1. After reading a few pages on replication, my understanding is that after "CHANGE MASTER TO" and "START SLAVE", the master starts sending binlog events from its current position to the slave, which the slave starts applying. The usual replication approach is to get the current binlog position on the master, back up all the data up to this position from master to slave, point the slave to this position (or GTID) via "CHANGE MASTER TO", and START SLAVE to start replicating binlog events from the master. But for MDEV-7502, we want the normal events and old data chunks to be transmitted in parallel. The ticket summary mentions using separate domain_ids to send the new and old data in parallel; does there exist a way to do so currently? How can the domain id be used here? Can we currently point the slave to 2 different binlog positions on a single master and expect the master to send events from both positions? Or will this require some sort of new process/thread implementation on the master to do so?
2. There are at least two other approaches mentioned in the ticket's comments section. It doesn't seem that a single approach has been finalized. This project doesn't seem to have a mentor yet to provide guidance, so which approach should an applicant pursue further?
I would like to discuss the project approaches and implementation in further detail before submitting a proposal, so can somebody please answer my queries and suggest pointers to project-specific material which I can go through to get a deeper understanding? Thanks.
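For reference, the conventional procedure described in question 1 can be sketched as a small Python helper that assembles the statements run on the slave once the backup has been restored. The function name, host, credentials and binlog coordinates are placeholders invented for illustration; the binlog file and position would come from something like `SHOW MASTER STATUS` on the master. Values are interpolated without escaping, so this is only a sketch of the sequence, not something to run against a production server.

```python
def provisioning_statements(host, user, password, log_file, log_pos, port=3306):
    """Build the SQL run on the slave AFTER the backup has been restored.

    Steps that happen before this (not shown, done by hand or by a backup tool):
      1. record the master's current binlog file and position;
      2. copy all data up to that position to the slave.
    All argument values here are hypothetical placeholders.
    """
    change_master = (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', "
        f"MASTER_PORT={port}, "
        f"MASTER_USER='{user}', "
        f"MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos}"
    )
    # 3. point the slave at the recorded position, 4. start replicating.
    return [change_master, "START SLAVE"]


if __name__ == "__main__":
    for stmt in provisioning_statements(
        "master.example.com", "repl", "secret", "master-bin.000001", 4
    ):
        print(stmt + ";")
```

MDEV-7502 aims to make steps 1 and 2 unnecessary, so that only the last two statements remain.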
Hi, ibrar! On Mar 19, ibrar arshad wrote:
Hi,
My name is Ibrar Arshad and I am interested in working on the task of automatic slave provisioning (ticket: MDEV-7502 <https://jira.mariadb.org/browse/MDEV-7502>) during GSOC 2017. I have read the summary on the ticket and have achieved a fair understanding of the problem, and I am working towards ironing out the implementation details. The use case, as I understand it, is that we want the slave to auto-replicate the data from the master once pointed to the master
Yes.
and we want to do it in such a manner that the binlog events from current master position as well as the old data chunks are relayed to the slave in a parallel fashion.
Not necessarily. There could be other approaches too. Maybe even bulk-loading the data would be faster than sending data in chunks and applying events in parallel. Or maybe not.
I have a few questions related to the proposal:
1. After reading a few pages on replication, my understanding is that after "CHANGE MASTER TO" and "START SLAVE", the master starts sending binlog events from its current position to the slave, which the slave starts applying. The usual replication approach is to get the current binlog position on the master, back up all the data up to this position from master to slave, point the slave to this position (or GTID) via "CHANGE MASTER TO", and START SLAVE to start replicating binlog events from the master. But for MDEV-7502, we want the normal events and old data chunks to be transmitted in parallel.
The main thing we want for MDEV-7502 is to avoid the step of "backup all the data... restore on the slave".
The ticket summary mentions using separate domain_ids to send the new and old data in parallel, does there exist a way to do so currently? How can domain id be used here? Can we currently point the slave to 2 different bin positions on a single master and expect the master to send events from both positions? Or will this require some sort of new process/thread implementation on master to do so?
No, it won't. I didn't actually try to connect twice from a slave to the same master, but I suspect it'll either work or can be fixed to work rather easily.
2. There are at least two other approaches mentioned in the ticket's comments section. It doesn't seem that a single approach has been finalized. This project doesn't seem to have a mentor yet to provide guidance, so which approach should an applicant pursue further?
Yes, the project suggests a few different approaches. You can discuss them in your proposal and suggest the one you think is best. There will be a mentor, don't worry. One just hasn't been formally assigned yet.
I would like to discuss the project approaches and implementation in further detail before submitting a proposal, so can somebody please answer my queries and suggest pointers to project-specific material which I can go through to get a deeper understanding? Thanks.
Hmm.. For example, I've mentioned above that it's not clear whether sending all the data first and bulk-loading it will be faster or slower than interleaving data and RBR binlog events. You can test it. Get a big table dump (not huge, but something that takes a noticeable amount of time to load). Then get a bunch of single-row updates and deletes. And try 1) load the dump, then do the updates; 2) do the updates in parallel with the dump. Just take care to enable at least the primary key, and make sure that in both approaches you get the same table content at the end. That's a simple test, no coding involved, but it'll give some understanding as to which approach is faster on the slave side.
Regards,
Sergei
Chief Architect MariaDB
and security@mariadb.org
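The test above needs no code, but the final "same table content" check is easy to get wrong by hand, so a scripted version of the comparison may help. The sketch below is only an illustration of the shape of such a check: it uses Python's built-in `sqlite3` with an in-memory table as a stand-in so that it runs anywhere, which means its timings say nothing about MariaDB. The table `t`, the row counts, the chunk size, and the rule of deferring a change until its row has been loaded are all assumptions made for this example, not something taken from the ticket.

```python
import sqlite3
import time

ROWS = [(i, 0) for i in range(1, 2001)]  # stand-in for the table dump
# Single-row changes addressed by primary key (hypothetical workload).
CHANGES = [("UPDATE t SET val = val + 1 WHERE id = ?", i) for i in range(7, 2001, 7)]
CHANGES += [("DELETE FROM t WHERE id = ?", i) for i in range(50, 2001, 50)]


def new_table():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
    return conn


def contents(conn):
    return conn.execute("SELECT id, val FROM t ORDER BY id").fetchall()


def load_then_change():
    """Approach 1: load the whole dump, then apply the changes."""
    conn = new_table()
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    for sql, key in CHANGES:
        conn.execute(sql, (key,))
    return contents(conn)


def interleaved(chunk=250):
    """Approach 2: load the dump in chunks, applying changes in between.

    Assumption: a change is held back until the row it targets is loaded;
    otherwise it would match nothing and the stale row would arrive later.
    """
    conn = new_table()
    pending = list(CHANGES)
    for start in range(0, len(ROWS), chunk):
        part = ROWS[start:start + chunk]
        conn.executemany("INSERT INTO t VALUES (?, ?)", part)
        loaded_up_to = part[-1][0]
        ready = [c for c in pending if c[1] <= loaded_up_to]
        pending = [c for c in pending if c[1] > loaded_up_to]
        for sql, key in ready:
            conn.execute(sql, (key,))
    return contents(conn)


def timed(fn):
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    a, ta = timed(load_then_change)
    b, tb = timed(interleaved)
    assert a == b, "the two approaches produced different table contents"
    print(f"load-then-change: {ta:.4f}s, interleaved: {tb:.4f}s, rows: {len(a)}")
```

For a real measurement the same two procedures would be run against a MariaDB slave with a realistically sized dump; only the final comparison of table contents carries over unchanged.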
On 19 March 2017 at 18:53, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, ibrar!
On Mar 19, ibrar arshad wrote:
Hi,
My name is Ibrar Arshad and I am interested in working on the task of automatic slave provisioning (ticket: MDEV-7502 <https://jira.mariadb.org/browse/MDEV-7502>) during GSOC 2017. I have read the summary on the ticket and have achieved a fair understanding of the problem, and I am working towards ironing out the implementation details. The use case, as I understand it, is that we want the slave to auto-replicate the data from the master once pointed to the master
Yes.
and we want to do it in such a manner that the binlog events from current master position as well as the old data chunks are relayed to the slave in a parallel fashion.
Not necessarily. There could be other approaches too.
Maybe even bulk-loading the data would be faster than sending data in chunks and applying events in parallel. Or maybe not.
I have a few questions related to the proposal:
1. After reading a few pages on replication, my understanding is that after "CHANGE MASTER TO" and "START SLAVE", the master starts sending binlog events from its current position to the slave, which the slave starts applying. The usual replication approach is to get the current binlog position on the master, back up all the data up to this position from master to slave, point the slave to this position (or GTID) via "CHANGE MASTER TO", and START SLAVE to start replicating binlog events from the master. But for MDEV-7502, we want the normal events and old data chunks to be transmitted in parallel.
The main thing we want for MDEV-7502 is to avoid the step of "backup all the data... restore on the slave".
The ticket summary mentions using separate domain_ids to send the new and old data in parallel, does there exist a way to do so currently? How can domain id be used here? Can we currently point the slave to 2 different bin positions on a single master and expect the master to send events from both positions? Or will this require some sort of new process/thread implementation on master to do so?
No, it won't. I didn't actually try to connect twice from a slave to the same master, but I suspect it'll either work or can be fixed to work rather easily.
2. There are at least two other approaches mentioned in the ticket's comments section. It doesn't seem that a single approach has been finalized. This project doesn't seem to have a mentor yet to provide guidance, so which approach should an applicant pursue further?
Yes, the project suggests a few different approaches. You can discuss them in your proposal and suggest the one you think is best. There will be a mentor, don't worry. One just hasn't been formally assigned yet.
I would like to discuss the project approaches and implementation in further detail before submitting a proposal, so can somebody please answer my queries and suggest pointers to project-specific material which I can go through to get a deeper understanding? Thanks.
Hmm..
For example, I've mentioned above that it's not clear whether sending all the data first and bulk-loading it will be faster or slower than interleaving data and RBR binlog events.
You can test it. Get a big table dump (not huge, but something that takes a noticeable amount of time to load). Then get a bunch of single-row updates and deletes. And try 1) load the dump, then do the updates; 2) do the updates in parallel with the dump. Just take care to enable at least the primary key, and make sure that in both approaches you get the same table content at the end. That's a simple test, no coding involved, but it'll give some understanding as to which approach is faster on the slave side.
I would strongly suggest having a look at https://github.com/maxbube/mydumper before implementing anything; it contains an interesting collection of issues that have already been fixed.
/svar
Stéphane Varoqui, Senior Consultant
Phone: +33 695-926-401, skype: svaroqui
http://www.mariadb.com
Regards, Sergei Chief Architect MariaDB and security@mariadb.org
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
participants (3)
- ibrar arshad
- Sergei Golubchik
- Stephane Varoqui