[Maria-developers] set_time() on slave and unversioned -> versioned replication
Hello, Sergei!

In the unversioned -> versioned scenario, in the code below it first gets to Set time 4 and creates some records (on the slave), and then the seconds on the slave increase (X+1) while on the master the seconds have not yet increased (X). Then we get to Set time 3 and reset the time on the slave to X.0, therefore we go back in time and all stored records with timestamp X.n will be in the future. 'n' came from ++system_time.sec_part in Set time 4.

Why did you decide to use such logic of getting seconds from the master and microseconds from the slave? Since the microseconds sooner or later reset to 0, it's no better than just assigning some random number. What is sending microseconds from the master only conditionally good for? And for RBR, since you don't send microseconds at all, I see no good solution for this.

inline void set_time(my_time_t t, ulong sec_part)
{
  if (opt_secure_timestamp > (slave_thread ? SECTIME_REPL : SECTIME_SUPER))
  {
    set_time();                 // note that BINLOG itself requires SUPER
    DBUG_EXECUTE("time", print_start_time("Set time 1"););
  }
  else
  {
    if (sec_part <= TIME_MAX_SECOND_PART)
    {
      start_time= system_time.sec= t;
      start_time_sec_part= system_time.sec_part= sec_part;
      DBUG_EXECUTE("time", print_start_time("Set time 2"););
    }
    else if (t != system_time.sec)
    {
      DBUG_EXECUTE("time", print_system_time("System time"););
      start_time= system_time.sec= t;
      start_time_sec_part= system_time.sec_part= 0;
      DBUG_EXECUTE("time", print_start_time("Set time 3"););
    }
    else
    {
      start_time= t;
      start_time_sec_part= ++system_time.sec_part;
      DBUG_EXECUTE("time", print_start_time("Set time 4"););
    }
    user_time.val= hrtime_from_time(start_time) + start_time_sec_part;
    PSI_CALL_set_thread_start_time(start_time);
    start_utime= utime_after_lock= microsecond_interval_timer();
  }
}

-- All the best, Aleksey Midenkov @midenok
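For illustration, a standalone simulation of the scenario described above (made-up values and simplified state, not the server code):

#include <cstdio>

int main()
{
  unsigned long sec= 100, sec_part= 0;    // master second X = 100

  // Three more events arrive with t == sec: the "Set time 4" branch
  // increments the counter, storing rows at 100.1, 100.2, 100.3.
  for (int i= 0; i < 3; i++)
    sec_part++;
  printf("rows stored up to %lu.%06lu\n", sec, sec_part);

  // The slave's own clock sync meanwhile moves system_time.sec to X+1.
  sec= 101;

  // The next master event still carries t = X, so t != system_time.sec
  // and the "Set time 3" branch resets to X.0, before the rows above.
  unsigned long t= 100;
  if (t != sec)
  {
    sec= t;
    sec_part= 0;
  }
  printf("next row at %lu.%06lu, earlier than the history above\n",
         sec, sec_part);
  return 0;
}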
Hi, Aleksey! On Apr 04, Aleksey Midenkov wrote:
Hello, Sergei!
In the unversioned -> versioned scenario, in the code below it first gets to Set time 4 and creates some records (on the slave), and then the seconds on the slave increase (X+1) while on the master the seconds have not yet increased (X). Then we get to Set time 3 and reset the time on the slave to X.0, therefore we go back in time and all stored records with timestamp X.n will be in the future. 'n' came from ++system_time.sec_part in Set time 4.
That's not how it is supposed to work. As long as the master sends events with seconds=X, the slave will generate microseconds: X.0, X.1, X.2, etc. When the master sends an event with a new timestamp, Y, the slave goes back to Y.0, and continues with Y.1, etc.
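A minimal sketch of that intended counter (illustrative names only, not the server code): seconds are taken from the master, and the microsecond part is just a per-second counter on the slave.

#include <cstdio>

int main()
{
  unsigned long cur_sec= 0, cur_usec= 0;              // slave-side state
  unsigned long master_secs[]= {100, 100, 100, 101, 101};
  for (unsigned long t : master_secs)                 // events as they arrive
  {
    if (t != cur_sec)                                 // new second: reset
    {
      cur_sec= t;
      cur_usec= 0;
    }
    else
      cur_usec++;                                     // same second: count up
    printf("%lu.%06lu\n", cur_sec, cur_usec);         // X.0, X.1, X.2, Y.0, Y.1
  }
  return 0;
}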
Why did you decide to use such logic of getting seconds from the master and microseconds from the slave? Since the microseconds sooner or later reset to 0, it's no better than just assigning some random number. What is sending microseconds from the master only conditionally good for?
Because the master was sending microseconds conditionally, since 5.3. The slave had to cope with that somehow anyway. And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication. Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic. But now I see some issues with that. One can freeze the time on the master with 'SET TIMESTAMP' and send an arbitrary number of events with the same timestamp. Or one can generate a query event that includes microseconds, to force the slave to count not from X.0, but from, say, X.999998. So, a wraparound is possible and we need some fix for it. Regards, Sergei Chief Architect MariaDB and security@mariadb.org
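A sketch of that wraparound (the constant mirrors TIME_MAX_SECOND_PART; this is an illustration, not the server code): an event carrying usec=999998 takes the "Set time 2" branch, and the blind increments in "Set time 4" then run out of the valid microsecond range.

#include <cstdio>

const unsigned long MAX_SEC_PART= 999999;     // like TIME_MAX_SECOND_PART

int main()
{
  unsigned long sec= 100;
  unsigned long sec_part= 999998;             // forced by an event with usec
  for (int i= 0; i < 3; i++)
  {
    sec_part++;                               // "Set time 4" increments blindly
    if (sec_part > MAX_SEC_PART)
      printf("%lu.%lu is out of range, a wraparound fix is needed\n",
             sec, sec_part);
    else
      printf("%lu.%06lu\n", sec, sec_part);
  }
  return 0;
}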
On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Apr 04, Aleksey Midenkov wrote:
Hello, Sergei!
In the unversioned -> versioned scenario, in the code below it first gets to Set time 4 and creates some records (on the slave), and then the seconds on the slave increase (X+1) while on the master the seconds have not yet increased (X). Then we get to Set time 3 and reset the time on the slave to X.0, therefore we go back in time and all stored records with timestamp X.n will be in the future. 'n' came from ++system_time.sec_part in Set time 4.
That's not how it is supposed to work. As long as the master sends events with seconds=X, the slave will generate microseconds: X.0, X.1, X.2, etc.
When the master sends an event with a new timestamp, Y, the slave goes back to Y.0, and continues with Y.1, etc.
Why did you decide to use such logic of getting seconds from the master and microseconds from the slave? Since the microseconds sooner or later reset to 0, it's no better than just assigning some random number. What is sending microseconds from the master only conditionally good for?
Because the master was sending microseconds conditionally, since 5.3. The slave had to cope with that somehow anyway.
If there is an installation going from unversioned 5.3 to versioned 10.3, we can warn the user about lost microseconds. This is a minor issue, since such setups are rare, I guess. But what are microseconds sent for in 5.3?
And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication.
If it's critical, this can be configured. But is it really a performance issue?
Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic.
Now it fails with just several events. I guess that's because system_time.sec_part is not reset to 0 initially.
But now I see some issues with that. One can freeze the time on the master with 'SET TIMESTAMP' and send an arbitrary number of events with the same timestamp.
Or one can generate a query event that includes microseconds, to force the slave to count not from X.0, but from, say, X.999998.
So, a wraparound is possible and we need some fix for it.
Looks like a lot of complications for a minor issue. Other DBMSes don't use microseconds for System Versioning at all. I guess the user should either cope with microseconds being sent unconditionally (for both SBR and RBR) or not use microsecond-precise System Versioning on the slave.
Regards, Sergei Chief Architect MariaDB and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! On Apr 04, Aleksey Midenkov wrote:
On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik <serg@mariadb.org> wrote:
If there is an installation going from unversioned 5.3 to versioned 10.3, we can warn the user about lost microseconds. This is a minor issue, since such setups are rare, I guess. But what are microseconds sent for in 5.3?
When it's needed for replication. Say, in INSERT t1 VALUES (NOW(6)); thd->query_start_sec_part() sets the flag.
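A hedged sketch of that conditional send (names and structure are illustrative stand-ins, not the real binlog code): the master includes the microsecond field in the event only when the statement actually read sub-second time.

#include <cstdio>

struct FakeThd
{
  unsigned long start_usec;
  bool used_usec;            // the flag mentioned above
};

// Illustrative stand-in for thd->query_start_sec_part(): reading the
// sub-second start time marks the event as needing microseconds.
unsigned long query_start_sec_part(FakeThd *thd)
{
  thd->used_usec= true;
  return thd->start_usec;
}

void write_query_event(const FakeThd *thd)
{
  if (thd->used_usec)
    printf("event with a 3-byte microsecond field\n");
  else
    printf("event without microseconds\n");
}

int main()
{
  FakeThd thd= {123456, false};
  write_query_event(&thd);        // plain statement: no microseconds sent
  query_start_sec_part(&thd);     // e.g. NOW(6) was evaluated
  write_query_event(&thd);        // now the field is included
  return 0;
}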
And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication.
If it's critical, this can be configured. But is it really a performance issue?
A small performance and storage issue. At least 3 bytes per event. And it's a general principle - there will definitely be fewer than 1% of users who will use this. Fewer than 0.1%, too. Most probably fewer than 0.01%. So the remaining 99.99% should not pay the price for it.
Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic.
Now it fails with just several events. I guess that's because system_time.sec_part is not reset to 0 initially.
You said yourself that it is reset:

start_time_sec_part= system_time.sec_part= 0;

Initially it's reset in THD::THD() too:

system_time.start.val= system_time.sec= system_time.sec_part= 0;
But now I see some issues with that. One can freeze the time on the master with 'SET TIMESTAMP' and send an arbitrary number of events with the same timestamp.
Or one can generate a query event that includes microseconds, to force the slave to count not from X.0, but from, say, X.999998.
So, a wraparound is possible and we need some fix for it.
Looks like a lot of complications for a minor issue. Other DBMSes don't use microseconds for System Versioning at all. I guess the user should either cope with microseconds being sent unconditionally (for both SBR and RBR) or not use microsecond-precise System Versioning on the slave.
One cannot possibly use microsecond-precise System Versioning on the slave if the server does not send microseconds. The slave merely tries to distinguish between different statements; it doesn't try to be precise. Regards, Sergei Chief Architect MariaDB and security@mariadb.org
On Thu, Apr 4, 2019 at 5:08 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Apr 04, Aleksey Midenkov wrote:
On Thu, Apr 4, 2019 at 2:43 PM Sergei Golubchik <serg@mariadb.org> wrote:
If there is an installation going from unversioned 5.3 to versioned 10.3, we can warn the user about lost microseconds. This is a minor issue, since such setups are rare, I guess. But what are microseconds sent for in 5.3?
When it's needed for replication. Say, in
INSERT t1 VALUES (NOW(6));
thd->query_start_sec_part() sets the flag.
And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication.
If it's critical, this can be configured. But is it really a performance issue?
A small performance and storage issue. At least 3 bytes per event.
But is it really an issue: do you know setups where replication communication is a bottleneck?
And it's a general principle - there will definitely be fewer than 1% of users who will use this. Fewer than 0.1%, too. Most probably fewer than 0.01%. So the remaining 99.99% should not pay the price for it.
Btw, it would be good to see the stats. We have some feedback plugin that does the job, don't we?
Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic.
Now it fails with just several events. I guess that's because system_time.sec_part is not reset to 0 initially.
You said yourself that it is reset:
start_time_sec_part= system_time.sec_part= 0;
Initially it's reset in THD::THD() too:
system_time.start.val= system_time.sec= system_time.sec_part= 0;
It is synchronized with the hardware clock on each set_start_time().
But now I see some issues with that. One can freeze the time on the master with 'SET TIMESTAMP' and send an arbitrary number of events with the same timestamp.
Or one can generate a query event that includes microseconds, to force the slave to count not from X.0, but from, say, X.999998.
So, a wraparound is possible and we need some fix for it.
Looks like a lot of complications for a minor issue. Other DBMSes don't use microseconds for System Versioning at all. I guess the user should either cope with microseconds being sent unconditionally (for both SBR and RBR) or not use microsecond-precise System Versioning on the slave.
One cannot possibly use microsecond-precise System Versioning on the slave if the server does not send microseconds. The slave merely tries to distinguish between different statements; it doesn't try to be precise.
But it can't recover the correct statement order anyway. The statements come from many master threads to a single slave thread in some arbitrary order. What is the point of ordering them at the slave end?
Regards, Sergei Chief Architect MariaDB and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
On Thu, Apr 4, 2019 at 5:54 PM Aleksey Midenkov <midenok@gmail.com> wrote:
But it can't recover the correct statement order anyway. The statements come from many master threads to a single slave thread in some arbitrary order. What is the point of ordering them at the slave end?
This is probably incorrect. The order is predefined, so the last event is the most important. We just need to overwrite history in case of a conflict. -- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! On Apr 04, Aleksey Midenkov wrote:
On Thu, Apr 4, 2019 at 5:08 PM Sergei Golubchik <serg@mariadb.org> wrote:
And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication.
But is it really an issue: do you know setups where replication communication is a bottleneck?
It's a few percent increase in binlog size. Not much.
And it's a general principle - there will definitely be fewer than 1% of users who will use this. Fewer than 0.1%, too. Most probably fewer than 0.01%. So the remaining 99.99% should not pay the price for it.
Btw, it would be good to see the stats. We have some feedback plugin that does the job, don't we?
Yes.
Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic.
Now it fails with just several events. I guess that's because system_time.sec_part is not reset to 0 initially.
You said yourself that it is reset:
start_time_sec_part= system_time.sec_part= 0;
Initially it's reset in THD::THD() too:
system_time.start.val= system_time.sec= system_time.sec_part= 0;
It is synchronized with the hardware clock on each set_start_time().
It must be a bug. The hardware clock shouldn't overwrite the counter, as it comes from the slave. Regards, Sergei Chief Architect MariaDB and security@mariadb.org
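One possible shape of such a fix, purely as a hedged sketch (assumed field and function names, not a patch): let the replication path own the counter, so a clock re-read does not clobber it.

struct SimSystemTime
{
  unsigned long sec, sec_part;
  bool owned_by_repl;       // assumed flag: counter set from replication
};

// Illustrative clock sync: skip overwriting state that replication owns.
void sync_with_clock(SimSystemTime *st,
                     unsigned long hw_sec, unsigned long hw_usec)
{
  if (st->owned_by_repl)
    return;                 // keep the slave-side counter intact
  st->sec= hw_sec;
  st->sec_part= hw_usec;
}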
Hi Sergei! On Thu, Apr 4, 2019 at 8:59 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Apr 04, Aleksey Midenkov wrote:
On Thu, Apr 4, 2019 at 5:08 PM Sergei Golubchik <serg@mariadb.org> wrote:
And I didn't want to force the master to include microseconds in every single event for every single user just in case someone would decide to do unversioned->versioned replication.
But is it really an issue: do you know setups where replication communication is a bottleneck?
It's a few percent increase in binlog size. Not much.
And it's a general principle - there will definitely be fewer than 1% of users who will use this. Fewer than 0.1%, too. Most probably fewer than 0.01%. So the remaining 99.99% should not pay the price for it.
Btw, it would be good to see the stats. We have some feedback plugin that does the job, don't we?
Yes.
Also, I thought that processing of 1000000 Query_log_event's in a second is not realistic.
Now it fails with just several events. I guess that's because system_time.sec_part is not reset to 0 initially.
You said yourself that it is reset:
start_time_sec_part= system_time.sec_part= 0;
Initially it's reset in THD::THD() too:
system_time.start.val= system_time.sec= system_time.sec_part= 0;
It is synchronized with the hardware clock on each set_start_time().
It must be a bug. The hardware clock shouldn't overwrite the counter, as it comes from the slave.
Yes. And there are more complications: for the replication log we can check thd->slave_thread, because it is replayed, well, in a slave thread. But executing it from a client (which is the original MDEV-16370 bug) does not execute it in a slave thread.
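For illustration, the shape of that complication (hypothetical helper, not the server code): a thd->slave_thread check covers events replayed by the SQL thread, but a BINLOG statement from a regular client takes the same set_time(t, sec_part) path with slave_thread == false.

// Hypothetical predicate: would the counter logic be applied?
bool counter_applies(bool slave_thread)
{
  if (slave_thread)
    return true;    // replication replay: covered by the check
  return false;     // client-issued BINLOG '...': missed (MDEV-16370 case)
}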
Regards, Sergei Chief Architect MariaDB and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok