Sachin Setiya <sachin.setiya@mariadb.com> writes:
I am thinking of implementing true lock=none for slaves. Currently
Can you explain what you mean by "true LOCK=NONE"? It is not clear from your description. I think you mean something that will allow to run an ALTER TABLE on the slave in parallel with the same statement on the master?
Alter command is only executed on slave when master commits it. I(thanks marko for suggesting this idea) am thinking of writing BEGIN_DDL_EVENT in binlog after master hits ha_prepare_inplace_alter_table. Then master will write QUERY_EVENT on binlog with actual alter query . On commit/rollback master will write COMMIT_DDL_EVENT/ROLLBACK_DDL_EVENT.
There will be a number of things to consider with this. Previously, all transactions in the binlog have been atomic; this introduces related events that can be arbitrarily separated in the binlog. For example, suppose the slave stops in the middle between BEGIN_DDL_EVENT and COMMIT_DDL_EVENT/ROLLBACK_DDL_EVENT. What is then the slave's binlog position / GTID position? You should probably make a list of all relevant issues, along with possible solutions.
On slave there will be pool of threads(configurable global variable), which
Hopefully, the exiting pool of slave worker threads (currently used for parallel replication) can be used for this as well? Doesn't seem right to introduce a new kind of thread pool...
will apply these DDLs. On receiving BEGIN_DDL_EVENT slave thread will pass the QUERY_EVENT to one of the worker thread. Worker thread will execute until ha_inplace_alter_table. Actual commit_inplace_alter will be called by sql thread.
Won't you need some mechanism to prevent following events to access the table before ALTER TABLE in a worker thread can acquire the metadata lock? There are some nasty issues to be aware of with potential deadlocks related to FLUSH TABLES WITH GLOBAL READ LOCK and such, when multiple threads are trying to coordinate waiting and locking; there were already some quite hard issues with this for parallel replication.
If sql thread recieve some kind of rollback event , then it will somehow signal worker thread to stop executing alter. If none of the worker threads are available then event will be en-queued, then If we received rollback event the we will simply discard event from queue, If we received commit event then SQL thread will synchronously process DDL event.
Another mechanism that currently exists for improving replication speed of ALTER TABLE (and other long-running queries) is out-of-order parallel replication, using gtid_domain_id. This allows a DBA to run the ALTER TABLE in a separate GTID domain, which will allow the ALTER to run on the slave in parallel with any other queries. But unlike your suggestion, it does not allow the ALTER on the slave to run in parallel with the one on the master, and it is not fully automatic; the DBA needs to SET SESSION gtid_domain_id and ensure that no conflicting transactions are run for the duration of the ALTER. Hope this helps, - Kristian.