Hello,

This comes from MDEV-16242.

== Symptoms ==

When one runs a parallel slave (mode=conservative) and replicates DML
for a MyRocks table without a Primary Key, replication may stop with an
ER_KEY_NOT_FOUND error. This may happen even if the queries were run on
the master sequentially.

== A detail about conservative replication ==

Suppose the master runs these two statements, sequentially:

  INSERT INTO t1 VALUES (5),(6),(7);
  DELETE FROM t1;

The parallel slave may schedule the INSERT to Thread1, and the DELETE to
Thread2. In conservative parallel replication, the DELETE "will wait" to
be executed after the INSERT.

One may expect that "will wait" here means that DELETE execution does
not start until the INSERT has committed. But it's more subtle than
that: actually, DELETE execution will start as soon as the INSERT is
ready to do the prepare/commit steps. I assume this was done to increase
parallelism: the initial phases of the DELETE can be run in parallel
with the prepare/commit steps of the INSERT.

This is safe to do:

1. The INSERT has acquired locks for all rows it is about to touch. The
   DELETE will not be able to prevent the INSERT from committing.

2. The DELETE starts its execution on a database that doesn't include
   the results of the INSERT (the INSERT has not committed yet!). But
   this is fine, because the DELETE locks the rows it is about to
   modify. If it attempts to access a row that the INSERT is about to
   insert, it will block on a lock until the INSERT finishes.

== Applying this to MyRocks ==

The critical part is #2. It works if the DELETE command acquires row
locks for the rows it is about to delete. This is normally true:

- If the storage engine supports gap locks, attempting to read a locked
  gap will cause the DELETE to wait for the INSERT.
- If the storage engine doesn't support gap locks, it will use
  Row-Based Replication. RBR's Delete_rows_event includes the Primary
  Key value of the row that should be deleted. Doing a point lookup on
  the PK will cause the DELETE to block on the row lock.

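
To make the difference between the two lookup styles concrete, here is
a toy Python model. It is not MariaDB or MyRocks code: the class
ToyEngine and all of its methods are made up for this sketch. It assumes
an engine with row locks but no gap locks, where an uncommitted INSERT
holds locks on its keys while its rows are still invisible to other
transactions.

```python
class ToyEngine:
    """Toy model: row locks exist, gap locks do not (names are hypothetical)."""

    def __init__(self):
        self.committed = {}  # key -> row, visible to every reader
        self.pending = {}    # key -> row, locked by a writer that has not committed

    def begin_insert(self, rows):
        # The writer takes a lock on every key it is about to insert.
        self.pending.update(rows)

    def commit(self):
        # Rows become visible and their locks are released.
        self.committed.update(self.pending)
        self.pending.clear()

    def point_lookup(self, key):
        # A lookup by key runs into the lock held by the uncommitted writer.
        if key in self.pending:
            return "blocked"
        return "found" if key in self.committed else "not found"

    def scan_for(self, row):
        # A scan walks committed rows only; it never touches the pending keys.
        return "found" if row in self.committed.values() else "not found"


engine = ToyEngine()
engine.begin_insert({5: (5,), 6: (6,), 7: (7,)})  # INSERT is ready to commit

before = (engine.point_lookup(5), engine.scan_for((5,)))
engine.commit()
after = (engine.point_lookup(5), engine.scan_for((5,)))

print(before)  # ('blocked', 'not found')
print(after)   # ('found', 'found')
```

Before the commit, the lookup by key has something to wait on, while
the scan simply gets an answer from the committed data and moves on.
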
The remaining case is

- a storage engine that doesn't support gap locks
- Row-Based Replication of a table without a primary key

In this case, finding the row to delete is done by scanning the table
until we find a row where all columns match. The scan does not see the
rows that the INSERT is about to insert (they are not committed yet), so
nothing makes the DELETE wait, and it stops with an error, "we could not
find the row to delete".

== Possible solutions ==

1. Implement gap locking in MyRocks. This will take some time.

2. Change the parallel slave to wait *for commit*. This should only be
   done if the tables that are updated do not support Gap Locking. This
   is hard; it will require making risky changes in the SQL layer.

3. Disallow parallel slave execution for tables without Gap Locking and
   without a primary key. (This looks most promising, but I am not sure
   if it is sufficient.)

BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog