On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition? If we're able to handle top-level NOT IN, why can't we handle subquery in any context?
Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
Thanks for the pointer. It seems to me from this post that the server follows a standard 3-value logic (3vl) where `null op foo` evals to the unknown value for comparison operator op. So indeed the transformation cannot handle NOT IN without some surgery. I studied the exists2in transformation more and 3vl and came to the conclusion that the fix there for the same situation could help with our case too, that is, replacing the equalities with `inner_field is not null`, and from outside AND the whole thing with `outer_expr is not null`. More specifically, consider the simple example of --8<---------------cut here---------------start------------->8--- create table t1 (a1 int, b1 int); create table t2 (a2 int, b2 int); insert into t1 values (0, null), (1, 1); insert into t2 values (0, 1), (1, null); select * from t1 where a1 in (select a2 from t2 where b1 = b2); --8<---------------cut here---------------end--------------->8--- Focusing on the where condition of the outer select, basically what we are trying to do is (\V denotes "OR over" i.e. ⋁) transforming the lhs to the rhs from the following identity: \V_{i: b1 = b2i} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i) This identity holds in classical logic, but not 3vl which adds an unknown (abbreviated u) to t and f. In mariadb, unknown seems to be an alias to null, but it has a different semantics so in the following I write u as an unknown logic value (where we can use comparison with other logical expressions) and null as a sql missing value. The 3vl version as in my broken implementation is \V_{i: (b1 = b2i) = t} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i) # WRONG! This identity holds when lhs is t or u, but not f. It breaks, i.e. lhs = f and rhs = u if and only if both of the following hold - there exists i s.t. (b1 = b2i) = u and (a1 = a2i) <> f - forall i s.t. (b1 = b2i) = t, (a1 = a2i) = f One can verify that the following modified version holds \V_{i: (b1 = b2i) = t} (a1 = a2i) = (\V_{i: b2i is not null} (a1 = a2i and b1 = b2i)) and (b1 is not null) Translating back to the query language, we have select * from t1 where (b1 is not null and (a1, b1) in (select a2, b2 from t2 where b2 is not null)); With this modification, theoretically we don't have to care about whether the IN subselect is top level, however, there are two issues: 1. When the subquery is indeed top level, do the extra terms lower performance? 2. When the subquery is not top level, how do we substitute `this` which is an Item_in_subselect with the Item_func_and? I suspect this is why the exists2in transformation only does the modification when there is an upper_not, and the transformation is skipped when the subquery is not toplevel EXISTS nor toplevel NOT EXISTS. More to follow. Best, Yuchen