Hi y'all!

I've just completed an analysis of isolation levels in MySQL. The report focuses on MySQL, but the test suite runs against MariaDB as well. MariaDB exhibits all the same behaviors--"Repeatable Read" allows a whole bunch of anomalies that should be prohibited under Repeatable Read, including G2-item, G-single (even without predicates), non-repeatable reads, violations of Monotonic Atomic View, etc. Figured this might be of interest to the list.

https://jepsen.io/analyses/mysql-8.0.34

I'm also curious whether this is widely-understood behavior in the MariaDB community, or if it's news to most people. The extant literature doesn't really seem to discuss these behaviors much, but maybe there's more tacit knowledge among experts?

Sincerely,

--Kyle
Kyle Kingsbury via discuss <discuss@lists.mariadb.org> writes:
I've just completed an analysis of isolation levels in MySQL. The report focuses on MySQL, but the test suite runs against MariaDB as well. MariaDB exhibits all the same behaviors--"Repeatable Read" allows a whole bunch of anomalies that should be prohibited under Repeatable Read, including G2-item, G-single (even without predicates), non-repeatable reads, violations of Monotonic Atomic View, etc. Figured this might be of interest to the list.
https://jepsen.io/analyses/mysql-8.0.34
I'm also curious whether this is widely-understood behavior in the MariaDB community, or if it's news to most people. The extant literature doesn't really seem to discuss these behaviors much, but maybe there's more tacit knowledge among experts?
From a quick read of the link, the main point seems to be about what happens to INSERT/UPDATE/DELETE ("updates") in a REPEATABLE READ transaction. As is mentioned, the basic idea in InnoDB REPEATABLE READ is to use a logical snapshot at the time of transaction start to satisfy SELECTs. I think it was always obvious to me that this cannot work for updates. Updates have to modify the current version of the database, not some old version. So I have always thought of REPEATABLE READ to be "repeatable" for read-only transactions spanning multiple selects.

The primary purpose of the way locking is done in InnoDB is to support statement-based asynchronous replication. The idea is that taking the update statements and re-running them transaction by transaction, in commit order, will produce identical data. Similar to SERIALIZABLE, but for the INSERT/UPDATE/DELETE statements, not necessarily the SELECTs.

The ability to replicate using statements is very important for performance. Queries that modify a large number of rows otherwise have to write huge amounts of replication data ("binlogs" in MySQL/MariaDB terms) recording the contents of every single changed database row. As is also remarked in the link, using statements for replication is also very complicated to do correctly in all corner cases, and in practice quite fragile, unfortunately.

In the end, it is an evolutionary compromise between getting flexibility and performance on the one hand, and sufficiently correct behaviour on the other. My personal opinion is that the success of the database is in large part due to the constant focus on ensuring the flexibility and performance needed by practical use. But the resulting complexity and fragility of replication is a very real problem for users, and needs improvement. So this kind of deep analysis that you have done is very valuable I think, to understand the current situation and point out current weaknesses. As to concrete suggestions, getting what REPEATABLE READ actually does documented would be very valuable, I think.

As to the SQL standard, I am not intimately familiar with it, but the inadequacy of the isolation levels as defined by the standard has been "folklore" in the community for decades. Unlike something like the C/C++ standard, the SQL standard seems to be far removed from the needs of users, at least in the applications that typically employ MySQL/MariaDB. And from my experience, people are resigned rather than hopeful that this could ever change. This is sad, as a good standard could help improve the portability of applications between different SQL implementations.

- Kristian.
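To make the binlog-size point concrete, here is a made-up example (the table name and date are purely illustrative):

UPDATE orders SET archived=1 WHERE created < '2020-01-01';

With statement-based replication, the binlog essentially records just this one statement, to be re-executed on the replica in commit order. With row-based replication, it records a before/after image of every affected row, so a statement touching millions of rows can produce orders of magnitude more binlog data for the same logical change.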
Hi Kyle, On Tue, Dec 19, 2023 at 6:03 PM Kyle Kingsbury via discuss <discuss@lists.mariadb.org> wrote:
I've just completed an analysis of isolation levels in MySQL. The report focuses on MySQL, but the test suite runs against MariaDB as well. MariaDB exhibits all the same behaviors--"Repeatable Read" allows a whole bunch of anomalies that should be prohibited under Repeatable Read, including G2-item, G-single (even without predicates), non-repeatable reads, violations of Monotonic Atomic View, etc. Figured this might be of interest to the list.
I really appreciate your efforts.

Heikki Tuuri implemented locking reads in a peculiar way: They are actually always READ COMMITTED, and the read view is being ignored. I do not remember it clearly, but I heard from a former member of the MySQL Falcon storage engine team that they called some of the InnoDB anomalies "write committed". It could be exactly this.

Also, in InnoDB, in any read view, the changes of the current transaction will be visible. InnoDB’s READ COMMITTED is like REPEATABLE READ, but it will suppress some next-key locking, and the read view (the set of transactions that will be considered as committed) will be refreshed at statement boundaries.

We have a few bug reports that I think could be related to your findings:

https://jira.mariadb.org/browse/MDEV-32898
https://jira.mariadb.org/browse/MDEV-26642
https://jira.mariadb.org/browse/MDEV-26643

In https://jira.mariadb.org/browse/MDEV-14589 there is some discussion related to the problems of using SERIALIZABLE isolation level for some but not all read/write transactions.

I hope that this helps you identify the exact root causes of the anomalies that you found. I think that it could make sense to implement some additional isolation levels. A starting point for that would be a minimal patch that would make the test pass (never mind that it could break compatibility with other applications).

Locking can be really tricky. https://jira.mariadb.org/browse/MDEV-27025 and https://jira.mariadb.org/browse/MDEV-27992 are a good example. InnoDB would treat an UPDATE of the PRIMARY KEY as a DELETE and an UPDATE. The regression scenario involved updating a PRIMARY KEY to a smaller value. There are also race conditions, such as https://jira.mariadb.org/browse/MDEV-15326 which I think could still be open in MySQL.

With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
On 12/19/23 17:00, Kristian Nielsen wrote:
From a quick read of the link, the main point seems to be about what happens to INSERT/UPDATE/DELETE ("updates") in a REPEATABLE READ transaction.
That's part of it, definitely! But on top of that, there's a serious read safety issue--if you perform a mutation in a transaction, it breaks read isolation too! You can observe state from the middle of a separate transaction, or see some of a transaction's effects, then go on to *fail* to see other effects of the same transaction.
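Concretely, here's a minimal sketch of the kind of thing I mean (table t(id, v) and the values are made up; id is the primary key, rows 1 and 2 both start at v=10, and both sessions run at REPEATABLE READ):

A: BEGIN
A: SELECT v FROM t WHERE id=1     # 10; establishes A's snapshot
B: BEGIN
B: UPDATE t SET v=20 WHERE id=1
B: UPDATE t SET v=20 WHERE id=2
B: COMMIT
A: UPDATE t SET v=v+1 WHERE id=1  # acts on B's committed 20 and writes 21
A: SELECT v FROM t WHERE id=1     # 21: A now observes B's effect on row 1...
A: SELECT v FROM t WHERE id=2     # 10: ...but still misses B's effect on row 2
A: COMMIT

A sees some of B's effects but not others--exactly the kind of fractured read I'm talking about.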
As is mentioned, the basic idea in InnoDB REPEATABLE READ is to use a logical snapshot at the time of transaction start to satisfy SELECTs. I think it was always obvious to me that this cannot work for updates. Updates have to modify the current version of the database, not some old version.
That's certainly one option, yes. Other databases choose other paths. For instance, you can apply writes to a transaction-local view of the database, and buffer their effects until commit time. That's the SI approach Postgres takes.
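For comparison, here's roughly how Postgres handles the analogous schedule under its REPEATABLE READ (which is snapshot isolation): instead of letting a write act on a row version newer than the snapshot, it aborts the transaction. (Same made-up table as above; error text quoted from memory.)

T2: BEGIN
T2: SELECT v FROM t WHERE id=1    # snapshot read
T3: BEGIN
T3: UPDATE t SET v=20 WHERE id=1
T3: COMMIT
T2: UPDATE t SET v=30 WHERE id=1  # conflicts with T3's committed version
# ERROR: could not serialize access due to concurrent update
T2: ROLLBACK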
So I have always thought of REPEATABLE READ to be "repeatable" for read-only transactions spanning multiple selects.
That property does seem to hold! I can't seem to find any actual documentation to this effect though--you're the first person who's suggested it. Any chance you've got a source handy?
As is also remarked in the link, using statements for replication is also very complicated to do correctly in all corner cases, and in practise quite fragile, unfortunately.
Ah, I should clarify--all the behaviors I mentioned occur in single-node MySQL & MariaDB. No replication required. :-)
As to the SQL standard, I am not intimately familiar with it, but the inadequacy of the isolation levels as defined by the standard has been "folklore" in the community for decades. Unlike something like the C/C++ standard, the SQL standard seems to be far removed from the needs of users, at least in the applications that typically employ MySQL/MariaDB. And from my experience, people are resigned rather than hopeful that this could ever change. This is sad, as a good standard could help improve the portability of applications between different SQL implementations.
Very much agreed! Thanks for your thoughts. :-) --Kyle
On 12/21/23 05:14, Marko Mäkelä wrote:
Heikki Tuuri implemented locking reads in a peculiar way: They are actually always READ COMMITTED, and the read view is being ignored. I do not remember it clearly, but I heard from a former member of the MySQL Falcon storage engine team that they called some of the InnoDB anomalies "write committed". It could be exactly this.
Oh that's *very* interesting! I hadn't heard of this--I'll have to take a look.
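If I'm reading that right, a plain SELECT and a locking SELECT in the same REPEATABLE READ transaction can return different data. Something like this (made-up table t(id, v), row 1 starting at v=10; behavior per your description, not something I've re-verified):

A: BEGIN
A: SELECT v FROM t WHERE id=1             # 10, from the read view
B: UPDATE t SET v=20 WHERE id=1           # autocommits
A: SELECT v FROM t WHERE id=1             # still 10: the read view is unchanged
A: SELECT v FROM t WHERE id=1 FOR UPDATE  # 20: the locking read ignores the read view
A: COMMIT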
Also, in InnoDB, in any read view, the changes of the current transaction will be visible.
That's fine! Internal consistency requires transactions observe their own prior effects. One of the problems here is that InnoDB goes *too* far--not only do you observe your own transaction's previous effects, but you also observe new (some, but not necessarily all) effects from *other* transactions.
We have a few bug reports that I think could be related to your findings:

https://jira.mariadb.org/browse/MDEV-32898
https://jira.mariadb.org/browse/MDEV-26642
https://jira.mariadb.org/browse/MDEV-26643
Oh that's *very* interesting! MDEV-26642 in particular--I'm guessing you're right, it's something related to the behavior Alvaro and I identified where updates break snapshots in some way. I'll bet you we might have similar cases lurking in our histories, but maybe it depends on predicates instead of kv writes.
In https://jira.mariadb.org/browse/MDEV-14589 there is some discussion related to the problems of using SERIALIZABLE isolation level for some but not all read/write transactions.
Ah, that does look tricky! I'm afraid I've skimmed some of the literature on isolation level mixing but I do *not* understand it, haha. Perhaps someday. :) Thanks so much, Marko! --Kyle
Hi Kyle, On Fri, Dec 22, 2023 at 6:36 PM Kyle Kingsbury via discuss <discuss@lists.mariadb.org> wrote:
On 12/21/23 05:14, Marko Mäkelä wrote:
Heikki Tuuri implemented locking reads in a peculiar way: They are actually always READ COMMITTED, and the read view is being ignored. I do not remember it clearly, but I heard from a former member of the MySQL Falcon storage engine team that they called some of the InnoDB anomalies "write committed". It could be exactly this.
Oh that's *very* interesting! I hadn't heard of this--I'll have to take a look.
In row0sel.cc, inside the function row_search_mvcc() or its callees, you will find calls to lock_clust_rec_read_check_and_lock() and lock_sec_rec_read_check_and_lock(). After those calls, we fail to check whether the DB_TRX_ID of the successfully locked record is visible in our read view. If it is not, we could throw an error. That could fix these anomalies, hopefully all of them.

I would like to see an experimental patch to do that. Once we have that, it should be possible to evaluate the impact on other applications, and then maybe introduce a new isolation level REPEATABLE LOCKING READ or something.

The code is also duplicated in the function row_sel(), which is for the InnoDB internal SQL parser that is used for things like updating some internal tables for FULLTEXT INDEX, updating the InnoDB persistent statistics tables, or DDL operations.

With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Kyle Kingsbury via discuss <discuss@lists.mariadb.org> writes:
So I have always thought of REPEATABLE READ to be "repeatable" for read-only transactions spanning multiple selects.
That property does seem to hold! I can't seem to find any actual documentation to this effect though--you're the first person who's suggested it. Any chance you've got a source handy?
I don't think this is based on any specific source. Rather, it's starting from how InnoDB takes read/write locks and thinking through what this means in terms of behaviour:

READ COMMITTED - no read locks, only write locks, new snapshot each statement.
REPEATABLE READ - no read locks, but keep the snapshot (for SELECTs) over the transaction.
SERIALIZABLE - both read and write locks.

For me, this is a good starting point for understanding how things need to work, together with the requirement that update transactions can be replayed in sequence and get identical data (to support replication). But this ignores many tricky details, and is probably too crude to be of much use at the level of a thorough test like yours.

- Kristian.
On 12/22/23 10:53, Kristian Nielsen wrote:
I don't think this is based on any specific source. Rather, it's starting from how InnoDB takes read/write locks and thinking through what this means in terms of behaviour:
READ COMMITTED - no read locks, only write locks, new snapshot each statement.
REPEATABLE READ - no read locks, but keep the snapshot (for SELECTs) over the transaction.
SERIALIZABLE - both read and write locks.
Ah, this is a good intuition, and roughly how I understood things too. The thing is that RR's behavior isn't consistent with *any* kind of snapshot behavior I'm familiar with. Like the point of a snapshot is generally that it's not affected by other transactions' writes, but RR "snapshots" *are* modified by other transactions' writes. Even worse, you can wind up with "impossible" snapshots like this (say x and y both start at 0):

T1: w(x, 1), w(y, 2)
T2: ... r(y, 2), r(x, 0)

Here, T2's "snapshot" was clearly taken after T1's write of y=2, but did *not* reflect T1's write of x=1. Violates both atomicity and monotonicity!

--Kyle
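In SQL terms, one way that history can arise--assuming a made-up table kv(k, v) with both rows starting at v=0, and assuming the r(y, 2) comes from a locking read, per Marko's note earlier in the thread:

T2: BEGIN
T2: SELECT v FROM kv WHERE k='x'             # 0: establishes T2's read view
T1: BEGIN
T1: UPDATE kv SET v=1 WHERE k='x'
T1: UPDATE kv SET v=2 WHERE k='y'
T1: COMMIT
T2: SELECT v FROM kv WHERE k='y' FOR UPDATE  # 2: observes T1's write to y
T2: SELECT v FROM kv WHERE k='x'             # 0: fails to observe T1's write to x
T2: COMMIT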
In the end, it is an evolutionary compromise between getting flexibility and performance on the one hand, and sufficiently correct behaviour on the other. My personal opinion is that the success of the database is in large part due to the constant focus on ensuring the flexibility and performance needed by practical use
This makes sense to me. However, something puzzles me. If MySQL places more importance on flexibility and performance, aren't there customers (especially in financial fields) complaining loudly about the subtle incorrectness of MySQL, leading to money loss and business failure? If there are, is that enough to push the MySQL team to correct some of the incorrectness that Kyle found?
Hi Lewis,

Personally I'm a big fan of Kyle's work with Jepsen, but we have to keep in mind that this is analytical work, and that in real life we are able to work around those inefficiencies or constraints.

MySQL and its derivatives are DB systems used mainly by websites, small or big. Although I did have some customers in payment systems back in the day at MariaDB, they were generally aware of the complexity and able to retrace any errors; financial institutions will often use commercial database systems and/or have some transaction processing and checking in place.

I have been involved personally in many cases of using and supporting MySQL and MariaDB in e-commerce situations, and even in cases where incorrectness of data could be observed, those have remained extremely rare and have never led to business issues.

You could compare this situation to pharmaceutical studies. Medicines are rarely 100% effective, or have side effects, but still, people do use them :)

On Mon, Jan 1, 2024 at 05:55, Lewis Chan via discuss <discuss@lists.mariadb.org> wrote:
In the end, it is an evolutionary compromise between getting flexibility and performance on the one hand, and sufficiently correct behaviour on the other. My personal opinion is that the success of the database is in large part due to the constant focus on ensuring the flexibility and performance needed by practical use
This makes sense to me. However, something puzzles me. If MySQL places more importance on flexibility and performance, aren't there customers (especially in financial fields) complaining loudly about the subtle incorrectness of MySQL, leading to money loss and business failure? If there are, is that enough to push the MySQL team to correct some of the incorrectness that Kyle found?
On Mon, Jan 1, 2024 at 10:55 AM Guillaume Lefranc via discuss <discuss@lists.mariadb.org> wrote:
MySQL and its derivatives are DB systems used mainly by websites, small or big. Although I did have some customers in payment systems back in the day at MariaDB, they were generally aware of the complexity and able to retrace any errors; financial institutions will often use commercial database systems and/or have some transaction processing and checking in place.
That may have been true 20 years ago, but I don't think it is still true today. Over the past decade I have worked with many financial institutions who use MySQL and MariaDB in production environments to handle all kinds of non-website workloads, including payment processing.
On Monday 01 January 2024 at 09:55:08, Guillaume Lefranc via discuss wrote:
I have been involved personally in many cases of using and supporting MySQL and MariaDB in e-commerce situations, and even in cases where incorrectness of data could be observed, those have remained extremely rare and have never led to business issues.
You could compare this situation to pharmaceutical studies. Medicines are rarely 100% effective, or have side effects, but still, people do use them :)
I think this is a completely inappropriate analogy.

Pharmaceuticals affect different people in different ways, simply due to variations in genetics and biology. Two people who earn the same money and spend the same amount do not expect their bank balances to vary simply because one is 50kg heavier than the other, or has a predisposition to lactose intolerance.

Data corruption is an error which is correctable. Medical side effects are inevitable consequences of people's biology being different.

Antony.

--
"Have you been drinking brake fluid again? I think you're addicted to the stuff."
"No, no, it's alright - I can stop any time I want to."

Please reply to the list; please *don't* CC me.
Lewis Chan via discuss <discuss@lists.mariadb.org> writes:
more importance on flexibility and performance, aren't there customers (especially in financial fields) complaining loudly about the subtle incorrectness of MySQL, leading to money loss and business failure? If there are, is that enough to push the MySQL team to correct some of the incorrectness that Kyle found?
I think there are two different meanings of "incorrectness" being mixed here. One is deviations from what the SQL standard says. In general, I don't think deviating from the SQL standard's definitions of isolation levels is directly correlated to "money loss and business failures". Another is "returning different results than intended/documented". And while "intended/documented" is often somewhat vaguely defined, this kind of incorrectness _is_ complained about and _is_ taken very seriously by the developers to fix. As long as the queries are returning the results expected by the application developers, the applications will perform correctly.

The SQL standard in practice plays a much weaker role in the SQL space compared to something like the C/C++ standard. As Kyle notes: "ANSI SQL’s isolation levels are bad, but some levels have caused more problems than others ... there is less agreement on the semantics of Repeatable Read."

Snapshot isolation provides a nice solution for, informally, getting SELECTs to be repeatable. Each SELECT query in a transaction reads the same snapshot of the database, so always sees the same data, without requiring any row or table locks. But that does not carry over to writes, since we want our transactions to change the current state of the database, not some old snapshot.

Consider a simple example like this, where two transactions T2 and T3 update the same row in parallel:

T1: BEGIN
T1: INSERT INTO t1 VALUES (1,1), (2,2), (3,3)
T1: COMMIT

T2: BEGIN
T2: SELECT * FROM t1 # S1: (2,2) ?

T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2
T3: COMMIT

T2: SELECT * FROM t1 WHERE a=2 # S2: (2,2) ? (2,12) ?
T2: UPDATE t1 SET b=-1 WHERE b>=10 # U1: update the row?
T2: SELECT * FROM t1 WHERE a=2 # S3: (2,2) ? (2,-1) ?
T2: COMMIT

From a usability point of view, what should the result of transaction T2 be, assuming "REPEATABLE READ" mode? The first select (S1) should return the row (2,2) obviously. But what about (S2)? If SELECTs are supposed to be repeatable, it is reasonable to expect it to return (2,2) also. But the update (U1) needs to see the current row (2,12) to change it, as would be reasonably expected by the user (and if not it would break statement-based replication). And the last select (S3) should reasonably be expected to return the row (2,-1), the result of the update (U1). Basically, if reads and writes are mixed, something has to give.

With InnoDB, if the user wants the selects (S1) and (S2) to be consistent with the update (U1), the form SELECT ... FOR UPDATE can be used. Under the hood, this will take read locks on the rows, and block T3 from running in parallel with T2, possibly raising a deadlock error.

From my (limited) reading of Kyle's analysis, the discrepancies he finds all seem to involve this basic kind of read/write mix. It would be interesting to know if using FOR UPDATE on the SELECTs would remove these discrepancies, or if there are more subtle issues (/bugs) remaining around this.

Should REPEATABLE READ in MariaDB/InnoDB strictly obey the SQL standard's definition of isolation levels? Ideally, yes. But in contrast, is it useful in practice to have the choice between higher consistency and more concurrency using FOR UPDATE? Probably, also yes.

- Kristian.
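For illustration, the FOR UPDATE variant of Kristian's schedule might run like this (a sketch of the expected interleaving, not a verified trace; only the row with a=2 is locked here, to keep it short):

T2: BEGIN
T2: SELECT * FROM t1 WHERE a=2 FOR UPDATE  # S1: (2,2); T2 now holds a lock on the row
T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2           # blocks, waiting for T2's lock
T2: SELECT * FROM t1 WHERE a=2 FOR UPDATE  # S2: (2,2)
T2: UPDATE t1 SET b=-1 WHERE b>=10         # U1: matches nothing; the row still has b=2
T2: SELECT * FROM t1 WHERE a=2             # S3: (2,2)
T2: COMMIT                                 # T3's UPDATE can now proceed
T3: COMMIT                                 # final state: (2,12)

Here the selects and the update agree with each other, at the cost of T3 waiting for T2 (or, in less friendly interleavings, a lock wait timeout or deadlock error).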
On 1/1/24 11:55, Kristian Nielsen via discuss wrote:
But that does not carry over to writes, since we want our transactions to change the current state of the database, not some old snapshot.
I do want to emphasize that doing this in the primary-key workloads I've discussed so far violates both Snapshot Isolation *and* Repeatable Read. It's not how these consistency models work in abstract, or how other major databases work in practice. Whether MySQL users *want* this weird behavior is, well... I guess that's up to y'all, haha. ;-)
Consider a simple example like this, where two transactions T2 and T3 update the same row in parallel:
T1: BEGIN
T1: INSERT INTO t1 VALUES (1,1), (2,2), (3,3)
T1: COMMIT

T2: BEGIN
T2: SELECT * FROM t1 # S1: (2,2) ?

T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2
T3: COMMIT

T2: SELECT * FROM t1 WHERE a=2 # S2: (2,2) ? (2,12) ?
T2: UPDATE t1 SET b=-1 WHERE b>=10 # U1: update the row?
T2: SELECT * FROM t1 WHERE a=2 # S3: (2,2) ? (2,-1) ?
T2: COMMIT
From a usability point of view, what should the result of transaction T2 be, assuming "REPEATABLE READ" mode?
If you interpret MySQL REPEATABLE READ as SI, T2 must observe (2, 2). Its snapshot must have been taken before T3 committed, since it performed a read before T3 started. Its update affects nothing, it selects (2,2), and commits successfully.

If you interpret MySQL REPEATABLE READ as RR, and interpret `a` as a primary key... this might come down to how you interpret the results of selects *not* by primary key: whether they're purely a predicate read, or also manifest as item reads.

If we go with the item read interpretation, T2's first select should observe (1, 1), (2, 2), (3, 3)--that seems obvious. This means T1 wr T2 and T2 rw T3, since T3 writes the second version of the tuple (2, 12) and T2 observed the previous version. These are item dependencies, not just predicate dependencies, since we actually read the tuple values.

T2's second select is a primary-key read, and it must select (2, 2). If it observes (2, 12), that would imply T3 wr T2, and we'd have a G2-item cycle between T2 and T3: T2 both fails to observe, and also observes, T3. You might also count that as a fractured read, which would violate Read Atomic.

T2's update I'm less sure of. A predicate write without any actual item reads is unconstrained in RR, so it's free to either update or not update the row--I think either is legal. The following SELECT should, I think, observe either (2,2) if the update fails (otherwise we have G2-item), or (2, -1) if the update succeeds (otherwise we violate internal consistency).
From my (limited) reading of Kyle's analysis, the discrepancies he finds all seem to involve this basic kind of read/write mix. It would be interesting to know if using FOR UPDATE on the SELECTs would remove these discrepancies, or if there are more subtle issues (/bugs) remaining around this.
Ah, there's a key difference here. Your example involves predicates, and RR explicitly does not constrain predicates. My report involves only access by primary key, and there RR is quite clear: MySQL's behavior is illegal. --Kyle
Hi Kyle, On Mon, Jan 1, 2024 at 10:02 PM Kyle Kingsbury via discuss <discuss@lists.mariadb.org> wrote:
I do want to emphasize that doing this in the primary-key workloads I've discussed so far violates both Snapshot Isolation *and* Repeatable Read. It's not how these consistency models work in abstract, or how other major databases work in practice. Whether MySQL users *want* this weird behavior is, well... I guess that's up to y'all, haha. ;-)
Consider a simple example like this, where two transactions T2 and T3 update the same row in parallel:
T1: BEGIN
T1: INSERT INTO t1 VALUES (1,1), (2,2), (3,3)
T1: COMMIT

T2: BEGIN
T2: SELECT * FROM t1 # S1: (2,2) ?

T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2
T3: COMMIT

T2: SELECT * FROM t1 WHERE a=2 # S2: (2,2) ? (2,12) ?
T2: UPDATE t1 SET b=-1 WHERE b>=10 # U1: update the row?
T2: SELECT * FROM t1 WHERE a=2 # S3: (2,2) ? (2,-1) ?
T2: COMMIT
From a usability point of view, what should the result of transaction T2 be, assuming "REPEATABLE READ" mode?
If you interpret MySQL REPEATABLE READ as SI, T2 must observe (2, 2). Its snapshot must have been taken before T3 committed, since it performed a read before T3 started. Its update affects nothing, it selects (2,2), and commits successfully.
If the WHERE clause in T2 had been b=2 instead of b>=10, what is the expected outcome? Does MySQL or MariaDB even have an error code for that? I think that this would be a perfect use of https://dbfiddle.uk.

This example reminds me of the "semi-consistent read" that we implemented a long time ago in order to fix https://bugs.mysql.com/bug.php?id=3300. If your example behaves as you claimed, it would seem to be inconsistent with the intention of that fix. (If we defer the COMMIT of T3 until after the UPDATE in T2, then I guess the UPDATE should do nothing.) I will have to dig deeper into this. Thank you for the simple example.

With secondary indexes, it gets trickier, because InnoDB does not have a concept of "table row locks". Instead, it has a concept of "index record locks", and therefore some locking behaviour may depend on the choice of indexes that are being accessed by locking reads.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
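Spelled out, the b=2 variant of the schedule is (same table as Kristian's example; the comments mark the open questions, not claims about current behaviour):

T2: BEGIN
T2: SELECT * FROM t1 WHERE a=2    # (2,2), from the read view
T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2
T3: COMMIT
T2: UPDATE t1 SET b=-1 WHERE b=2  # matches the snapshot version (2,2) but not the
                                  # current row (2,12): skip it? update it? raise an error?
T2: SELECT * FROM t1 WHERE a=2    # depends on the answer above
T2: COMMIT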
On 1/2/24 00:07, Marko Mäkelä wrote:
Consider a simple example like this, where two transactions T2 and T3 update the same row in parallel:
T1: BEGIN
T1: INSERT INTO t1 VALUES (1,1), (2,2), (3,3)
T1: COMMIT

T2: BEGIN
T2: SELECT * FROM t1 # S1: (2,2) ?

T3: BEGIN
T3: UPDATE t1 SET b=12 where a=2
T3: COMMIT

T2: SELECT * FROM t1 WHERE a=2 # S2: (2,2) ? (2,12) ?
T2: UPDATE t1 SET b=-1 WHERE b>=10 # U1: update the row?
T2: SELECT * FROM t1 WHERE a=2 # S3: (2,2) ? (2,-1) ?
T2: COMMIT
If the WHERE clause in T2 had been b=2 instead of b>=10, what is the expected outcome? Does MySQL or MariaDB even have an error code for that? I think that this would be a perfect use of https://dbfiddle.uk.
Under SI, T2 always operates on the snapshot it took before T3 began, and observes none of T3's effects. T2 updates row 2 to (2, -1), and aborts. Why? First-committer-wins identifies that a new version of row 2 was committed in the interval between T2's snapshot and commit timestamp. T2 must abort to prevent lost update.

Under RR, I thiiiink there's the same freedom of choice--T2 is executing a predicate write, and those are (broadly speaking) unconstrained in RR. There might be something here about predicate wr/ww dependencies vs predicate rw anti-dependencies; I'd have to stare at this one for a while. Predicates are *weird*.

Just to take predicate confusion out of the mix, here's the same thing with primary-key access (assuming a is the primary key).

T1: BEGIN
T1: INSERT INTO t1 VALUES (1,1), (2,2), (3,3)
T1: COMMIT

T2: BEGIN
T2: SELECT * FROM t1 WHERE a=2    # S1: (2,2)

T3: BEGIN
T3: UPDATE t1 SET b=12 WHERE a=2
T3: COMMIT

T2: SELECT * FROM t1 WHERE a=2    # S2: (2,2)
T2: UPDATE t1 SET b=-1 WHERE a=2  # U1: Either update or abort
T2: SELECT * FROM t1 WHERE a=2    # S3: (2,-1)
T2: COMMIT                        # Must abort, if it hasn't already

This is what would happen under both snapshot isolation and repeatable read. Under RR, you have a choice of aborting as soon as T2 updates, or at commit time, but ultimately T2 *must not commit*. Doing so would cause lost update.

Here's what MySQL does:

T2: SELECT * FROM t1 WHERE a=2    # S2: (2,2)
T2: UPDATE t1 SET b=-1 WHERE a=2  # U1: Updates row to (2, -1)
T2: SELECT * FROM t1 WHERE a=2    # S3: (2,-1)
T2: COMMIT                        # Commits

This is the canonical definition of lost update--it's forbidden by both RR and SI, but MySQL's REPEATABLE READ lets it happen!

--Kyle
Hi Kyle,

Thank you for the simplified example. It is showing a different problem than the one that I initially had in mind. I converted it to the following .test file and invoked ./mtr --rr on the test for more convenient debugging with rr replay:

--source include/have_innodb.inc

CREATE TABLE t1 (a INT PRIMARY KEY, b INT NOT NULL) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,1),(2,2),(3,3);

BEGIN;
SELECT * FROM t1 WHERE a=2;
--connect t3,localhost,root
UPDATE t1 SET b=12 WHERE a=2;
--disconnect t3

--connection default
SELECT * FROM t1 WHERE a=2;
UPDATE t1 SET b=-1 WHERE a=2;
SELECT * FROM t1 WHERE a=2;
COMMIT;

DROP TABLE t1;

During the last UPDATE, I set a breakpoint on row_search_mvcc and then watch -l prebuilt->trx.lock.n_rec_locks to catch where the record lock is created. Here is a MariaDB Server 10.6 stack trace:

#0 lock_rec_set_nth_bit (lock=0x7f4e583e5c80, i=3) at /mariadb/10.6/storage/innobase/include/lock0priv.inl:101
#1 0x0000564e48969508 in lock_rec_create_low (c_lock=0x0, type_mode=1027, page_id=..., page=0x7f4e53cd0000 "", heap_no=3, index=0x7f4e2c115a18, trx=0x7f4e583e5b80, holds_trx_mutex=false) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1270
#2 0x0000564e4896b704 in lock_rec_lock (impl=false, mode=1027, block=0x7f4e5380f490, heap_no=3, index=0x7f4e2c115a18, thr=0x7f4e2c203118) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1692
#3 0x0000564e4897d320 in lock_clust_rec_read_check_and_lock (flags=0, block=0x7f4e5380f490, rec=0x7f4e53cd0097 "\200", index=0x7f4e2c115a18, offsets=0x7f4e5814bb40, mode=LOCK_X, gap_mode=1024, thr=0x7f4e2c203118) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:5927
#4 0x0000564e48ac85b7 in sel_set_rec_lock (pcur=0x7f4e2c202978, rec=0x7f4e53cd0097 "\200", index=0x7f4e2c115a18, offsets=0x7f4e5814bb40, mode=3, type=1024, thr=0x7f4e2c203118, mtr=0x7f4e5814bf20) at /mariadb/10.6/storage/innobase/row/row0sel.cc:1349
#5 0x0000564e48ad3dfe in row_search_mvcc (buf=0x7f4e2c2012f8 "\377", mode=PAGE_CUR_GE, prebuilt=0x7f4e2c2027a8, match_mode=1, direction=0) at /mariadb/10.6/storage/innobase/row/row0sel.cc:5234

I tried a quick and dirty patch (attached) to fix this particular case. Note: it only works with a fixed-length PRIMARY KEY, and I did not run the regression test suite, only this particular test. We would also need something similar for secondary indexes. But it is a start:

mysqltest: At line 9: query 'UPDATE t1 SET b=12 WHERE a=2' failed: ER_LOCK_DEADLOCK (1213): Deadlock found when trying to get lock; try restarting transaction

Would you be interested to help fix all this? It would be an iterative process: you provide tests and we would update a branch accordingly. You could even download precompiled packages from our CI system if that is more convenient. Once we have it sorted out in some way (such as "faking" deadlock errors), then it would be time to polish it, to introduce new error messages or a new transaction isolation level, to avoid risk of breaking any applications that would expect the current buggy behaviour. First we would have to get the logic right.

Best regards,

Marko
On 1/4/24 09:52, Marko Mäkelä wrote:
I tried a quick and dirty patch (attached) to fix this particular case. Note: it only works with a fixed-length PRIMARY KEY, and I did not run the regression test suite, only this particular test. We would also need something similar for secondary indexes. But it is a start:
Nicely done! BTW (and this is tentative, I'm in a years-long process of wrapping my head around this as well) you may be able to get away without providing the same degree of safety for secondary indices--RR doesn't constrain predicate deps in the same way that Serializable does. I'm still working on building tests for that.
Would you be interested to help fix all this? It would be an iterative process: you provide tests and we would update a branch accordingly. You could even download precompiled packages from our CI system if that is more convenient. Once we have it sorted out in some way (such as "faking" deadlock errors), then it would be time to polish it, to introduce new error messages or a new transaction isolation level, to avoid risk of breaking any applications that would expect the current buggy behaviour. First we would have to get the logic right.
Yeah! The existing Jepsen MySQL test suite should be able to do this. Right now it installs mariadb-server from the standard Debian Bookworm repositories, but modifying it to install a custom .deb file from your build system wouldn't be too hard. Or we could do tarballs with a little more work. Feel free to email me on or off-list, if you'd like me to try out a particular build. :-) --Kyle
Hi Kyle,

I am sorry for the delay. It turns out that https://jira.mariadb.org/browse/MDEV-32898 had been filed for a scenario that does not involve any indexes other than the PRIMARY KEY. I revised my crude initial patch and got the regression test suite to pass. 6 tests will need adjustments; I adjusted 4 of them and disabled 2 for now. I think that we may have to remove those tests because they would no longer be meaningful.
modifying it to install a custom .deb file from your build system wouldn't be too hard
I pushed a tentative fix to a specially named branch, so that packages for this build should become available at https://ci.mariadb.org/42468/ within an hour or two. The branch includes some other work-in-progress changes, but I do not think that they should affect any Jepsen tests.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Hi again, It looks like https://jira.mariadb.org/browse/MDEV-32898 could be an exact duplicate of https://jira.mariadb.org/browse/MDEV-26642. The case https://jira.mariadb.org/browse/MDEV-26643 is about the READ UNCOMMITTED isolation level. I would appreciate an expert opinion on whether that should be considered a bug at all. Best regards, Marko
participants (7):
- Antony Stone
- baiwfg2@gmail.com
- Gordan Bobic
- Guillaume Lefranc
- Kristian Nielsen
- Kyle Kingsbury
- Marko Mäkelä