Re: [Maria-developers] 489a7fba324: MDEV-29322 ASAN heap-use-after-free in Query_log_event::do_apply_event
Hi, Andrei,

I still don't understand this.

1. Why is rgi->options_to_bin_log only set for GTID_EVENT? Is there a
   guarantee that there will always be a GTID_EVENT before a QUERY_EVENT?
   What if gtids aren't enabled?

2. How do you guarantee that all query events for a previous value of
   options_written_to_bin_log have been applied before updating it in the
   rgi?

3. And if you do guarantee it, why is it atomic? No query events = nobody
   reads options_written_to_bin_log concurrently.

And why not do a simple patch, like the one I've attached? (Note, it's an
intentionally minimal patch to show the fix; practically I'd accompany it
with a small cleanup patch.)

Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org

On Sep 02, Andrei Elkin wrote:
> Query_log_event cannot just randomly use some options_written_to_bin_log, it must use options_written_to_bin_log from the Format_description_log_event that came from the binlog file where the Query_log_event was read from.
[ Not to engage you in a separate discussion, but still: why 'randomly'? Point 1 of the quoted description says the Query_log_event uses `options_written_to_bin_log` from the immediately preceding-in-binlog FD. ]
> If rli->relay_log.description_event_for_exec is not that Format_description_log_event, and assuming that the correct Format_description_log_event is not stored anywhere, the Query_log_event has to remember options_written_to_bin_log itself,
It's an option, but the chosen one avoids such remembering for each Query event. Instead ...
> there's not much we can do about it.
>
> Or it can apply the mask in the constructor and store not flags2 but the new value of thd->variables.option_bits.
>
> Or it can extend flags2 to cover all OPTIONS_WRITTEN_TO_BINLOG.
... of taking care of individual events, the patch basically goes along the lines of using a placeholder that A. lasts (unlike the FD instance) for the whole slave session time, and B. guarantees that the stored `options_written_to_binlog` is a copy of the most recent FD before the Query.
To help you understand the method, here is a sequence diagram. On the top line are the slave SQL thread, its Relay-log-info object, the Worker's similar instance (its rpl_group_info), and a Worker thread.
SQL thread          RLI/RGI            Worker RGI          Worker
    |                  .                   .                  .
--->+ read FD          .                   .                  .
    |                  .                   .                  .
    +----------------> stores options      .                  .
--->+ read Query       |                   .                  .
    +------------------------------------->+ finds Worker RGI to store
    |                  .                   |  there a copy of FD.options
    +------------------------------------------------------->+ schedules Query
    |                  .                   |                  |  to Worker
    |                  .                   +<-----------------+ reads options
    |                  .                   .                  |
    |                  .                   .                  | applies Query ...
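In C++ terms, the scheme in the diagram could be sketched roughly like this (a simplified model; struct and function names here are placeholders, not the actual patch):

  #include <cstdint>

  // Stand-ins for the server classes, reduced to the one field of interest.
  struct FD_sketch                        // Format_description_log_event
  {
    uint64_t options_written_to_bin_log;  // which option bits the master wrote
  };

  struct RGI_sketch                       // per-worker rpl_group_info
  {
    uint64_t options_written_to_bin_log;  // copy taken at scheduling time
  };

  // SQL driver thread: when a Query (event group) is handed to a worker, the
  // mask of the most recently read FD is copied into that worker's rgi before
  // the worker can touch the event.
  void schedule_to_worker(const FD_sketch &current_fd, RGI_sketch *worker_rgi)
  {
    worker_rgi->options_written_to_bin_log= current_fd.options_written_to_bin_log;
    // ... enqueue the event group for the worker here ...
  }

  // Worker thread: Query_log_event::do_apply_event() reads the copy from its
  // own rgi instead of dereferencing the shared, driver-owned
  // rli->relay_log.description_event_for_exec.
  uint64_t options_mask(const RGI_sketch &rgi)
  {
    return rgi.options_written_to_bin_log;
  }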
It must be clear that A holds. To prove B, suppose that after the Query a second FD, FD^2, is read by the SQL thread, and that FD^2 comes from a server of a different version. The SQL thread always has a wait-for-workers-to-become-idle condition in that case, so execution would go like this:
--->+ read Query       .                   .                  .
    +------------------------------------->+ finds Worker RGI to store
    |                  .                   |  there a copy of FD.options
    |                  .                   |                  | applying of
--->+ read FD^2        .                   |                  | Query ...
    |   ... wait for Worker to complete Query ...             |
    +<========================================================+
    +----------------> stores FD^2         .                  .
    |                  options             .                  .
  ...
So FD^2 can't affect queries from earlier binlogs.
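The same ordering, as I understand it, in a toy code model (not server code; the wait call stands for the existing wait-for-workers-to-become-idle step mentioned above):

  #include <cstdint>

  // Driver-thread-only state; in this scheme it is never read by workers.
  struct RLI_sketch
  {
    uint64_t options_written_to_bin_log;
  };

  // On reading FD^2 the driver first drains the workers, so every Query that
  // was scheduled with a copy of the old mask has already finished; only then
  // is the new mask installed where later Query events will be copied from.
  void on_format_description(RLI_sketch *rli, uint64_t fd2_mask,
                             void (*wait_for_workers_to_be_idle)())
  {
    wait_for_workers_to_be_idle();              // the existing serialization point
    rli->options_written_to_bin_log= fd2_mask;  // FD^2 visible only from here on
  }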
I am all ears for your feedback.
Cheers,
Andrei
Hi Sergei, Andrei,

The commit message had me confused at first - because there are mechanisms in parallel replication to ensure that things stay alive as long as needed; and because if description_event_for_exec points to invalid (freed) memory, then it would seem to indicate a deeper problem than can be solved by just copying a single value out.

So I looked into it a bit (thanks Andrei for doing a testcase BTW!). And the real problem is not use-after-free, it is that rli->relay_log.description_event_for_exec is owned by the SQL driver thread and updated without any thread synchronisation; it is wrong to access it from a worker thread.

So I checked and found that there is actually another case of the same problem not fixed in the patch, in Execute_load_log_event::do_apply_event():

  int Execute_load_log_event::do_apply_event(rpl_group_info *rgi)

    if (!(lev= (Load_log_event*) Log_event::read_log_event(&file,
                 rli->relay_log.description_event_for_exec,
                 opt_slave_sql_verify_checksum)) ||

It looks to me that this will have the same problem - when run in the worker thread, the value of description_event_for_exec will be for whatever point in the log the SQL thread is currently reading, not the one corresponding to the LOAD_EVENT. So Load_log_event will need to take a copy of the description_event_for_exec, just like Query_log_event needed the options_written_to_binlog value.

An alternative approach to fix could be to use the inuse_relaylog. Parallel replication keeps a list of these (one for each relay log having workers active), and keeps each element alive until all workers are done with that relay log. Currently inuse_relaylog mostly has the name of the relay log file, used for transaction retry in optimistic parallel replication. But it could be extended with a copy of description_event_for_exec which would be available to do_apply_event() in rgi->relaylog (a rough sketch of this idea is at the end of this mail). That is perhaps the "correct" fix. But as long as it's only copying a 32-bit value to the query event (plus a similar fix for LOAD_EVENT), maybe the fix in the patch is fine as well.

Some other smaller comments:

The commit message is misleading. The problem is that worker threads are accessing an FD owned by and updated by the SQL driver thread. This FD can be from any random binlog following the current query of the worker thread, as Sergei pointed out. The FD being freed while the worker thread is using its copy is just one symptom; more common will be to use a too-new FD, which is also what triggers an error in Andrei's testcase.

And it would be good to put a comment in the code on description_event_for_exec that it must _not_ be accessed from worker threads. The name is somewhat misleading ("exec" suggests exactly worker thread operation), maybe it should be renamed to something like "description_event_for_sql_thread", and description_event_for_queue could be called "description_event_for_io_thread", or something.

BTW, the testcase only fails sporadically (without the fix), because it depends on whether the SQL thread has had time to read ahead to a new FD when the CREATE TABLE t1 runs in the worker thread. There's a wait in the testcase which seems to be intended to ensure this condition:

  # Wait for the last event recieve before to unlock t2
  --let $wait_condition= SELECT COUNT(*) > 0 FROM information_schema.processlist
    WHERE state = "Waiting for worker threads to be idle"

But apparently it's ineffective, I had to run the test ~50 times for the error to trigger. Just mentioning it, maybe it's worth spending time on.
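To make the inuse_relaylog idea above a bit more concrete, a rough sketch (type and member names are placeholders/assumptions, not the existing layout in the server):

  #include <cstdint>

  // Copy of the FD data that worker-side do_apply_event() implementations
  // need, kept per relay log rather than per event.
  struct FD_copy_sketch
  {
    uint64_t options_written_to_bin_log;
    // ... any other FD fields a worker needs ...
  };

  // Each inuse_relaylog element already outlives every worker still executing
  // events read from that relay log, so a copy stored here could be read from
  // worker threads without extra synchronisation.
  struct inuse_relaylog_sketch
  {
    char name[512];           // existing use: relay log file name (for retry)
    FD_copy_sketch fd_copy;   // proposed addition
  };

  struct rpl_group_info_sketch
  {
    inuse_relaylog_sketch *relaylog;   // the rgi->relaylog mentioned above
  };

  // Worker side, usable from Query_log_event::do_apply_event() and
  // Execute_load_log_event::do_apply_event() alike:
  uint64_t options_mask_for(const rpl_group_info_sketch *rgi)
  {
    return rgi->relaylog->fd_copy.options_written_to_bin_log;
  }

The point is only that the copy lives exactly as long as the workers that may read it, which is what inuse_relaylog already guarantees for its other members.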
Hope this helps,

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
> BTW, the testcase only fails sporadically (without the fix), because it
> depends on whether the SQL thread has had time to read ahead to a new FD
> when the CREATE TABLE t1 runs in the worker thread. There's a wait in the
> testcase which seems to be intended to ensure this condition:
>
>   # Wait for the last event recieve before to unlock t2
>   --let $wait_condition= SELECT COUNT(*) > 0 FROM information_schema.processlist
>     WHERE state = "Waiting for worker threads to be idle"
>
> But apparently it's ineffective, I had to run the test ~50 times for the
> error to trigger. Just mentioning it, maybe it's worth spending time on.
Erhm - I meant: "maybe it's *not* worth spending time on".

 - Kristian.