[Maria-developers] MDEV-6877 - binlog_row_image implementation advice
Hi Sergei! This past week I've done quite a bit of reading regarding the replication code and how the binlog is used by both mysqlbinlog and the slaves in order to do replication. I think I now have a good enough understanding of the flow of the code for both MySQL and MariaDB to actually start coding the feature into MariaDB. I do have a couple of unknowns still. I don't quite understand what's the use for the TABLE_MAP event. My intuition tells me its some sort of id we assign to a table name (test.t1 for example) that gets used afterwards in the binlog row events. We seem to be using a bitmap of columns that we want to log into the binlog. MySQL does extend the TABLE class to flip the bits according to the binlog_row_image variable. That seems like the first place to start with the addition. I could not find any other "hidden" logic that we use to choose how we log row operations in the binlog. What I would like to know is if there are any other pitfalls that I am supposed to watch out for, or if there is anything else I may be missing regarding the logging part. When it comes to actually doing the replication, I concluded that Log_event::do_apply_event is the place where we begin translating the event from the binlog to an SQL statement. Another concern that I have is how to test the implementation. Right now, I run the test, get the binlog file and print it with mysqlbinlog and see if the output is what I expect. Do we have something within mysql-test-run that automates this? Regards, Vicentiu
Vicențiu Ciorbaru <vicentiu@mariadb.org> writes:
do have a couple of unknowns still. I don't quite understand what's the use for the TABLE_MAP event. My intuition tells me its some sort of id we assign to a table name (test.t1 for example) that gets used afterwards in the binlog row events.
Yes. The Table_map_log_event assigns numerical IDs to the tables used in a statement. These IDs are then used in one or more following Rows_log_event to say which table the row operations should be made on.
the addition. I could not find any other "hidden" logic that we use to choose how we log row operations in the binlog. What I would like to know is if there are any other pitfalls that I am supposed to watch out for, or if there is anything else I may be missing regarding the logging part.
I also cannot think of any. (Though experience shows that there will probably be a number of pitfalls anyway).
When it comes to actually doing the replication, I concluded that Log_event::do_apply_event is the place where we begin translating the event from the binlog to an SQL statement.
Right. For row-based replication, it is Rows_log_event::do_apply_event().
Another concern that I have is how to test the implementation. Right now, I run the test, get the binlog file and print it with mysqlbinlog and see if the output is what I expect. Do we have something within mysql-test-run that automates this?
There are some MTR tests that run mysqlbinlog and checks the output, for example mysqlbinlog.test. But the usual way to check replication is all the MTR tests in mysql-test/suite/rpl/t/. They start up a master and a slave server (or more), make some transactions on the master, check that they replicate as expected to the slave. - Kristian.
Hi Kristian,
Another concern that I have is how to test the implementation. Right now, I
run the test, get the binlog file and print it with mysqlbinlog and see if the output is what I expect. Do we have something within mysql-test-run that automates this?
There are some MTR tests that run mysqlbinlog and checks the output, for example mysqlbinlog.test.
But the usual way to check replication is all the MTR tests in mysql-test/suite/rpl/t/. They start up a master and a slave server (or more), make some transactions on the master, check that they replicate as expected to the slave.
Thank you for the reply! I have already started doing the changes and I seem to be making some progress. The mysqlbinlog tests were the ones that I was looking for. I'll let you know if I run into any more issues. This actually turned out to be a very good learning experience for me as I previously had no experience with replication. I'll mark you as a reviewer when I have a patch ready. Vicențiu
Hi Kristian, I've gotten to a "sort of" stable state with binlog_row_image. It's not 100% complete as I did not add any tests for the new use cases. Also there are a couple of changes that I am not 100% sure if they are complete or correct. I would like some input on these changes: https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image The unpack_current_row being a very iffy change for me. I understand why it is needed in the Update_rows_event::do_exec_row code, as the after image has other columns that get written as opposed to the before image. Thing is I've just made it in an attempt to fix a HA_ERR_SAME_KEY error that I kept getting on the slave and it seems to have worked. Do you think the implementation that MySQL chose is the correct one? The changes are mostly the same as the MySQL variant, except they have V2 ROW LOG EVENTS which seem to pass some extra info which we don't use (need?). There are a lot of changes made and I've tried my best to not miss anything. I agree that the git history could be better. I will rewrite it when I get the final code variant. Regards, Vicentiu On Mon, 30 Mar 2015 at 17:22 Vicențiu Ciorbaru <vicentiu@mariadb.org> wrote:
Hi Kristian,
Another concern that I have is how to test the implementation. Right now, I
run the test, get the binlog file and print it with mysqlbinlog and see if the output is what I expect. Do we have something within mysql-test-run that automates this?
There are some MTR tests that run mysqlbinlog and checks the output, for example mysqlbinlog.test.
But the usual way to check replication is all the MTR tests in mysql-test/suite/rpl/t/. They start up a master and a slave server (or more), make some transactions on the master, check that they replicate as expected to the slave.
Thank you for the reply! I have already started doing the changes and I seem to be making some progress. The mysqlbinlog tests were the ones that I was looking for. I'll let you know if I run into any more issues. This actually turned out to be a very good learning experience for me as I previously had no experience with replication. I'll mark you as a reviewer when I have a patch ready.
Vicențiu
I've also added a bunch of inline comments to the github commits to further explain my reasoning. Going through the commits in order chronologically might make the most sense. Regards, Vicențiu On Mon, 6 Apr 2015 at 17:39 Vicențiu Ciorbaru <vicentiu@mariadb.org> wrote:
Hi Kristian,
I've gotten to a "sort of" stable state with binlog_row_image. It's not 100% complete as I did not add any tests for the new use cases. Also there are a couple of changes that I am not 100% sure if they are complete or correct.
I would like some input on these changes: https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image
The unpack_current_row being a very iffy change for me. I understand why it is needed in the Update_rows_event::do_exec_row code, as the after image has other columns that get written as opposed to the before image. Thing is I've just made it in an attempt to fix a HA_ERR_SAME_KEY error that I kept getting on the slave and it seems to have worked.
Do you think the implementation that MySQL chose is the correct one? The changes are mostly the same as the MySQL variant, except they have V2 ROW LOG EVENTS which seem to pass some extra info which we don't use (need?).
There are a lot of changes made and I've tried my best to not miss anything. I agree that the git history could be better. I will rewrite it when I get the final code variant.
Regards, Vicentiu
On Mon, 30 Mar 2015 at 17:22 Vicențiu Ciorbaru <vicentiu@mariadb.org> wrote:
Hi Kristian,
Another concern that I have is how to test the implementation. Right now, I
run the test, get the binlog file and print it with mysqlbinlog and see if the output is what I expect. Do we have something within mysql-test-run that automates this?
There are some MTR tests that run mysqlbinlog and checks the output, for example mysqlbinlog.test.
But the usual way to check replication is all the MTR tests in mysql-test/suite/rpl/t/. They start up a master and a slave server (or more), make some transactions on the master, check that they replicate as expected to the slave.
Thank you for the reply! I have already started doing the changes and I seem to be making some progress. The mysqlbinlog tests were the ones that I was looking for. I'll let you know if I run into any more issues. This actually turned out to be a very good learning experience for me as I previously had no experience with replication. I'll mark you as a reviewer when I have a patch ready.
Vicențiu
Vicențiu Ciorbaru <vicentiu@mariadb.org> writes:
I've gotten to a "sort of" stable state with binlog_row_image. It's not 100% complete as I did not add any tests for the new use cases. Also there are a couple of changes that I am not 100% sure if they are complete or correct.
I would like some input on these changes: https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image
I read through the changes (10 commits). Generally, it seems ok. I assume you have taken the code from the MySQL tree (5.6 or 5.7?), and only changed it as necessary to work with any local MariaDB changes?
The unpack_current_row being a very iffy change for me. I understand why it is needed in the Update_rows_event::do_exec_row code, as the after image has other columns that get written as opposed to the before image. Thing is I've just made it in an attempt to fix a HA_ERR_SAME_KEY error that I kept getting on the slave and it seems to have worked.
I don't understand here. You wrote "I've just made it" - why could you not just use the same code that MySQL uses? But I think you meant something else. Maybe this is caused by some changes done in one of the trees but not the other, and that needs to be merged/adapted? One thing done in MariaDB is that most code now uses rpl_group_info rather than Relay_log_info, due to parallel replication. MySQL may have done something conceptually similar, but different implementation. In general, it would seem best to follow MySQL/Oracle code as closely as possible. We should not have changed the code for applying row events much compared to MySQL. However, there may of course be other things that MySQL did that we have not yet merged.
Do you think the implementation that MySQL chose is the correct one? The
I suppose the MySQL one should be correct. In general, there does not seem to be much going on in the patches (though I'm sure digging up exactly which code to carry over has taken a lot of effort on your part). It seems most of the required functionality is already in the code. Though it is hard for me to see if anything might be missing. Maybe there are some test cases from MySQL that can be taken over? Or maybe they run the existing test suite with --mysqld=--binlog-row-image={NOBLOB,MINIMAL,FULL}. This is something that we would benefit a lot from doing also, I think...
changes are mostly the same as the MySQL variant, except they have V2 ROW LOG EVENTS which seem to pass some extra info which we don't use (need?).
I think the extra info in V2 rows is only used by NDB, something we do not have in our tree. And I think reading V2 events were ported by Serg to MariaDB already (just ignoring the extra info). So I think generally the V2 can just be ignored, except to be able to read those events... Some answers to specific comments you put on github:
This code is not used and prevents compilation when changing the prototype of binlog_write_row. As discussed with Sergei Golubchik, this whole code is up for the chopping block after this feature is complete.
Yes. Binlog injector is for NDB, which we do not have in our tree.
Same of old log events, but I don't know exactly where these are used. I could use a bit of help with this part.
I think log_event_old.{h,cc} is for backwards compatibility with really old versions of row-based replication. It may even be for event format that was never released (only alpha/beta). I think just updating the code without further action should be ok. I am not aware of a way to test this part of the code.
I actually don't understand why we need this. This was copy pasted from MySQL and is on my TODO list to figure out why.
Well, THD::binlog_prepare_row_images() changes the table->read_set to point to table->tmp_set, so I suppose that is why they restore it...
This doesn't seem to make any sense for me. This is what MySQL does but does ha_update_row care?
Yeah, I don't understand either, but I suppose the comment "Temporary fix to find out why it fails [/Matz]" is a clue that it is some debugging code, possibly forgotten (Mats Kindahl is one of the main replication developers). You could try asking him in a mail (mats.kindahl@oracle.com).
There are some things that need to be fixed within that patch, but I'm looking to know if it's going in the right direction or not, before investing any more time into it.
It seems ok to me. Hope this helps, - Kristian.
Hi Kristian, I've done more work on binlog row image and I think it is nearly ready to be merged with 10.1. I still need to import all mysql's test cases, however none of our rpl tests fail. There are 2 test cases: rpl.rpl_not_null_innodb rpl.rpl_not_null_myisam that do have a different output when run with minimal row image. This output however is normal, considering the slave and master have different table specifications. It is a corner case that is not treated in MySQL either. I'd like a review from you for the whole patch, while I work on adding all the relevant tests from MySQL. To me, it seems like a pretty big change with lots of implications. I tried to cover all of them. Let me know if you have any questions or concerns. https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image Regards, Vicențiu On Wed, 29 Apr 2015 at 11:37 Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Vicențiu Ciorbaru <vicentiu@mariadb.org> writes:
I've gotten to a "sort of" stable state with binlog_row_image. It's not 100% complete as I did not add any tests for the new use cases. Also there are a couple of changes that I am not 100% sure if they are complete or correct.
I would like some input on these changes:
https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image
I read through the changes (10 commits). Generally, it seems ok. I assume you have taken the code from the MySQL tree (5.6 or 5.7?), and only changed it as necessary to work with any local MariaDB changes?
The unpack_current_row being a very iffy change for me. I understand why it is needed in the Update_rows_event::do_exec_row code, as the after image has other columns that get written as opposed to the before image. Thing is I've just made it in an attempt to fix a HA_ERR_SAME_KEY error that I kept getting on the slave and it seems to have worked.
I don't understand here. You wrote "I've just made it" - why could you not just use the same code that MySQL uses? But I think you meant something else. Maybe this is caused by some changes done in one of the trees but not the other, and that needs to be merged/adapted?
One thing done in MariaDB is that most code now uses rpl_group_info rather than Relay_log_info, due to parallel replication. MySQL may have done something conceptually similar, but different implementation.
In general, it would seem best to follow MySQL/Oracle code as closely as possible. We should not have changed the code for applying row events much compared to MySQL. However, there may of course be other things that MySQL did that we have not yet merged.
Do you think the implementation that MySQL chose is the correct one? The
I suppose the MySQL one should be correct. In general, there does not seem to be much going on in the patches (though I'm sure digging up exactly which code to carry over has taken a lot of effort on your part). It seems most of the required functionality is already in the code. Though it is hard for me to see if anything might be missing.
Maybe there are some test cases from MySQL that can be taken over? Or maybe they run the existing test suite with --mysqld=--binlog-row-image={NOBLOB,MINIMAL,FULL}. This is something that we would benefit a lot from doing also, I think...
changes are mostly the same as the MySQL variant, except they have V2 ROW LOG EVENTS which seem to pass some extra info which we don't use (need?).
I think the extra info in V2 rows is only used by NDB, something we do not have in our tree. And I think reading V2 events were ported by Serg to MariaDB already (just ignoring the extra info). So I think generally the V2 can just be ignored, except to be able to read those events...
Some answers to specific comments you put on github:
This code is not used and prevents compilation when changing the prototype of binlog_write_row. As discussed with Sergei Golubchik, this whole code is up for the chopping block after this feature is complete.
Yes. Binlog injector is for NDB, which we do not have in our tree.
Same of old log events, but I don't know exactly where these are used. I could use a bit of help with this part.
I think log_event_old.{h,cc} is for backwards compatibility with really old versions of row-based replication. It may even be for event format that was never released (only alpha/beta). I think just updating the code without further action should be ok. I am not aware of a way to test this part of the code.
I actually don't understand why we need this. This was copy pasted from MySQL and is on my TODO list to figure out why.
Well, THD::binlog_prepare_row_images() changes the table->read_set to point to table->tmp_set, so I suppose that is why they restore it...
This doesn't seem to make any sense for me. This is what MySQL does but does ha_update_row care?
Yeah, I don't understand either, but I suppose the comment "Temporary fix to find out why it fails [/Matz]" is a clue that it is some debugging code, possibly forgotten (Mats Kindahl is one of the main replication developers). You could try asking him in a mail (mats.kindahl@oracle.com).
There are some things that need to be fixed within that patch, but I'm looking to know if it's going in the right direction or not, before investing any more time into it.
It seems ok to me.
Hope this helps,
- Kristian.
Vicențiu Ciorbaru <vicentiu@mariadb.org> writes:
I'd like a review from you for the whole patch, while I work on adding all the relevant tests from MySQL. To me, it seems like a pretty big change
https://github.com/MariaDB/server/compare/10.1-MDEV-6877-binlog_row_image
It looks ok. A few comments below. Though generally, it seems best to stay as closely as possible to MySQL code, right? So you should take them only as suggestions, probably. - Kristian.
+ virtual bool read_write_bitmaps_cmp(TABLE *table)
+ case DELETE_ROWS_EVENT:
+ case UPDATE_ROWS_EVENT:
+ case WRITE_ROWS_EVENT:
+ default: + /* + We should just compare bitmaps for Delete, Write + or Update rows events. + */ + DBUG_ASSERT(0);
Why is this a virtual method? There is nothing in the patch that overrides it, at it asserts if called with arbitrary Log_event, so seems it should just be a normal method, maybe even in class Rows_log_event() with a static_cast<> at the call site?
+ // Unpack the current row into m_table->record[0], but with + // a different columns bitmap. + int unpack_current_row(rpl_group_info *rgi, MY_BITMAP const *cols)
The style guide uses /*\n ...\n*/ for multi-line comments I think.
@@ -6217,24 +6237,96 @@ int THD::binlog_delete_row(TABLE* table, bool is_trans,
uchar *row_data= memory.slot(0);
- size_t const len= pack_row(table, cols, row_data, record); + DBUG_DUMP("table->read_set", (uchar*) table->read_set->bitmap, (table->s->fields + 7) / 8);
Overlong line (style guide).
participants (2)
-
Kristian Nielsen
-
Vicențiu Ciorbaru