Re: [Maria-developers] 734aee3cd3c: MDEV-27018 IF and COALESCE lose "json" property
Hi, Alexander! On Dec 28, Alexander Barkov wrote:
revision-id: 734aee3cd3c (mariadb-10.5.13-33-g734aee3cd3c) parent(s): 2776635cb98 author: Alexander Barkov committer: Alexander Barkov timestamp: 2021-12-27 17:54:31 +0400 message:
MDEV-27018 IF and COALESCE lose "json" property
Hybrid functions (IF, COALESCE, etc) did not preserve the JSON property from their arguments. The same problem was repeatable for single row subselects.
The problem happened because the method Item::is_json_type() was inconsistently implemented across the Item hierarchy. For example, Item_hybrid_func and Item_singlerow_subselect did not override is_json_type().
Okay. The approach is fine, but see some questions below:
diff --git a/sql/field.cc b/sql/field.cc index 2c768527ced..8fa3bbd538c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -7277,6 +7277,19 @@ bool Field_longstr::send(Protocol *protocol) }
+const Type_handler *Field_string::type_handler() const +{ + if (is_var_string()) + return &type_handler_var_string; + /* + This is a temporary solution and will be fixed soon (in 10.9?). + Type_handler_string_json will provide its own Field_string_json.
why?
+ */ + if (Type_handler_json_common::has_json_valid_constraint(this))
could you please use names that don't depend on how exactly we detect json? especially if the goal is to move to FORMAT JSON syntax
+ return &type_handler_string_json; + return &type_handler_string; +} + /* Copy a string and fill with space */
int Field_string::store(const char *from, size_t length,CHARSET_INFO *cs) diff --git a/sql/sql_type.cc b/sql/sql_type.cc index c1801c1ae3e..3b2753d80a6 100644 --- a/sql/sql_type.cc +++ b/sql/sql_type.cc @@ -1755,19 +1755,110 @@ const Type_handler *Type_handler_typelib::cast_to_int_type_handler() const
/***************************************************************************/
+class Recursive_type_pair_iterator +{ + const Type_handler *m_a; + const Type_handler *m_b; + uint m_switched_to_base_count; +public: + Recursive_type_pair_iterator(const Type_handler *a, + const Type_handler *b, + uint switched_to_base_count= 0) + :m_a(a), m_b(b), m_switched_to_base_count(switched_to_base_count) + { } + const Type_handler *a() const { return m_a; } + const Type_handler *b() const { return m_b; } + Recursive_type_pair_iterator base() const + { + Recursive_type_pair_iterator res(m_a->type_handler_base(), + m_b->type_handler_base()); + res.m_switched_to_base_count= (res.m_a != NULL) + (res.m_b != NULL); + if (res.m_a == NULL) + res.m_a= m_a; + if (res.m_b == NULL) + res.m_b= m_b; + return res; + } + bool done() const + { + switch (m_switched_to_base_count) + { + case 2: + /* + Expression: + COALESCE(TYPE1, TYPE2) + + where both type handlers derive from some other handlers, e.g. + VARCHAR -> TYPE1 + TEXT -> TYPE2 + + Continue aggregation as: + SELECT COALESCE(VARCHAR, TEXT) + + Note, we now have only one level of data type inheritance, + e.g. VARCHAR -> VARCHAR/JSON + + This code branch will change when we have longer inheritance chains: + A0 -> A1 -> A2 -> A3 + B0 -> B1 -> B2 + + Functions like COALESE(A2, B2) will need to do full overload resolution: + - iterate through all pairs from the below matrix + - find all existing candidates + - resolve ambiguities in case multiple candidates, etc.
and I don't see how you're going to do that. you need to try m_a+m_b.base(), m_a.base()+m_b, and m_a.base()+m_b.base(). And that's not what you're doing. I suggest to add an assert that m_switched_to_base_count < 2 and stop trying to solve complex cases until we have at least one.
+ + A0 A1 A2 A3 + B0 . . . . + B1 . . . . + B2 . . . . + */ + return false; + + case 1: + /* + Expression: + COALESCE(TYPE1, TEXT) + + If only one handler derives from some other handler: + VARCHAR-> TYPE1 + + Continue aggregation as: + SELECT COALESCE(VARCHAR, TEXT) + */ + return false; + + case 0: + default: + /* + Non of the two handlers are derived from other handlers. + Nothing left to try. + */ + return true; + } + } +};
diff --git a/sql/sql_type_json.h b/sql/sql_type_json.h index 6c4ee8cb2eb..4a394809a06 100644 --- a/sql/sql_type_json.h +++ b/sql/sql_type_json.h @@ -21,18 +21,145 @@ ... + bool Item_append_extended_type_info(Send_field_extended_metadata *to, + const Item *item) const + { + return set_format_name(to); // Send "format=json" in the protocol + } + + bool Item_hybrid_func_fix_attributes(THD *thd, + const char *name, + Type_handler_hybrid_field_type *hybrid, + Type_all_attributes *attr, + Item **items, uint nitems) + const override + { + if (BASE::Item_hybrid_func_fix_attributes(thd, name, hybrid, attr, + items, nitems)) + return true; + /* + The above call can change the type handler on "hybrid", e.g. + choose a proper BLOB type handler according to the calculated max_length. + Convert general purpose string type handler to its JSON counterpart. + This makes hybrid functions preserve JSON data types, e.g.: + COALESCE(json_expr1, json_expr2) -> JSON + */ + hybrid->set_handler(json_type_handler_from_generic(hybrid->type_handler()));
I don't get it. It's supposed to "fix attributes", like signed/unsigned, max length, etc. The correct type handler should've been set already, otherwise this type handler method wouldn't even be called.
+ return false; + } };
+ +class Type_handler_string_json: + public Type_handler_general_purpose_string_to_json<Type_handler_string, + type_handler_string> +{ }; diff --git a/sql/sql_type_json.cc b/sql/sql_type_json.cc index a804366ec03..a84f720bcc9 100644 --- a/sql/sql_type_json.cc +++ b/sql/sql_type_json.cc @@ -20,17 +20,111 @@ #include "sql_class.h"
-Type_handler_json_longtext type_handler_json_longtext; +Named_type_handler<Type_handler_string_json> + type_handler_string_json("char/json");
do you mean these names are not shown anywhere?
+Named_type_handler<Type_handler_varchar_json> + type_handler_varchar_json("varchar/json"); + +Named_type_handler<Type_handler_tiny_blob_json> + type_handler_tiny_blob_json("tinyblob/json"); + +Named_type_handler<Type_handler_blob_json> + type_handler_blob_json("blob/json"); + +Named_type_handler<Type_handler_medium_blob_json> + type_handler_medium_blob_json("mediumblob/json"); + +Named_type_handler<Type_handler_long_blob_json> + type_handler_long_blob_json("longblob/json"); ... +/* + This method ressembles what Field_blob::type_handler()
resembles (also below)
+ does for general purpose BLOB type handlers. +*/ +const Type_handler * +Type_handler_json_common::json_blob_type_handler_by_length_bytes(uint len) +{ + switch (len) { + case 1: return &type_handler_tiny_blob_json; + case 2: return &type_handler_blob_json; + case 3: return &type_handler_medium_blob_json; + } + return &type_handler_long_blob_json; +}
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Thanks for your review! Please find comments below: On 12/28/21 10:10 PM, Sergei Golubchik wrote:
Hi, Alexander!
On Dec 28, Alexander Barkov wrote:
revision-id: 734aee3cd3c (mariadb-10.5.13-33-g734aee3cd3c) parent(s): 2776635cb98 author: Alexander Barkov committer: Alexander Barkov timestamp: 2021-12-27 17:54:31 +0400 message:
MDEV-27018 IF and COALESCE lose "json" property
Hybrid functions (IF, COALESCE, etc) did not preserve the JSON property from their arguments. The same problem was repeatable for single row subselects.
The problem happened because the method Item::is_json_type() was inconsistently implemented across the Item hierarchy. For example, Item_hybrid_func and Item_singlerow_subselect did not override is_json_type().
Okay. The approach is fine, but see some questions below:
diff --git a/sql/field.cc b/sql/field.cc index 2c768527ced..8fa3bbd538c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -7277,6 +7277,19 @@ bool Field_longstr::send(Protocol *protocol) }
+const Type_handler *Field_string::type_handler() const +{ + if (is_var_string()) + return &type_handler_var_string; + /* + This is a temporary solution and will be fixed soon (in 10.9?). + Type_handler_string_json will provide its own Field_string_json.
why?
I'd like to split JSON off eventually: - To turn JSON into a real pluggable data type, so it can be compiled as a dynamic plugin. - For performance purposes, to avoid this constraint tests for "normal" CHAR/VARCHAR/TEXT fields. - We'll probably have more FORMATs in the future. Adding all format tests inside "normal" Field_string/Field_varchar/Field_blob is not a good idea.
+ */ + if (Type_handler_json_common::has_json_valid_constraint(this))
could you please use names that don't depend on how exactly we detect json? especially if the goal is to move to FORMAT JSON syntax
For now this name reflects what it actually does. When we split JSON off and fix the code to have JSON put extra data type information into FRM, this test will be gone from Field_xxx. But it will move to the code opening FRM, to be able open old tables where JSON is detected only by a CHECK constraint. And there it will reflect what it actually does again. I propose to keep it as is.
+ return &type_handler_string_json; + return &type_handler_string; +} + /* Copy a string and fill with space */
int Field_string::store(const char *from, size_t length,CHARSET_INFO *cs) diff --git a/sql/sql_type.cc b/sql/sql_type.cc index c1801c1ae3e..3b2753d80a6 100644 --- a/sql/sql_type.cc +++ b/sql/sql_type.cc @@ -1755,19 +1755,110 @@ const Type_handler *Type_handler_typelib::cast_to_int_type_handler() const
/***************************************************************************/
+class Recursive_type_pair_iterator +{ + const Type_handler *m_a; + const Type_handler *m_b; + uint m_switched_to_base_count; +public: + Recursive_type_pair_iterator(const Type_handler *a, + const Type_handler *b, + uint switched_to_base_count= 0) + :m_a(a), m_b(b), m_switched_to_base_count(switched_to_base_count) + { } + const Type_handler *a() const { return m_a; } + const Type_handler *b() const { return m_b; } + Recursive_type_pair_iterator base() const + { + Recursive_type_pair_iterator res(m_a->type_handler_base(), + m_b->type_handler_base()); + res.m_switched_to_base_count= (res.m_a != NULL) + (res.m_b != NULL); + if (res.m_a == NULL) + res.m_a= m_a; + if (res.m_b == NULL) + res.m_b= m_b; + return res; + } + bool done() const + { + switch (m_switched_to_base_count) + { + case 2: + /* + Expression: + COALESCE(TYPE1, TYPE2) + + where both type handlers derive from some other handlers, e.g. + VARCHAR -> TYPE1 + TEXT -> TYPE2 + + Continue aggregation as: + SELECT COALESCE(VARCHAR, TEXT) + + Note, we now have only one level of data type inheritance, + e.g. VARCHAR -> VARCHAR/JSON + + This code branch will change when we have longer inheritance chains: + A0 -> A1 -> A2 -> A3 + B0 -> B1 -> B2 + + Functions like COALESE(A2, B2) will need to do full overload resolution: + - iterate through all pairs from the below matrix + - find all existing candidates + - resolve ambiguities in case multiple candidates, etc.
and I don't see how you're going to do that. you need to try m_a+m_b.base(), m_a.base()+m_b, and m_a.base()+m_b.base(). And that's not what you're doing.
I suggest to add an assert that m_switched_to_base_count < 2 and stop trying to solve complex cases until we have at least one.
Yes, we'll need to do all these combinations eventually. But I'm not trying to solve a more complex case than it's really necessary now: If the SQL query is: SELECT varchar_json_expr + text_json_expr FROM t1; it works as follows: - It tries VARCHAR/JSON+TEXT/JSON in Type_collection_json, nothing is found - Then it tries directly VARCHAR+TEXT, which is m_a.base()+m_b.base(), and this combination is found in Type_collection_std, so it effectively performs: SELECT varchar_expr + text_expr FROM t1; In this case m_switched_to_base_count==2. The assert m_switched_to_base_count < 2 won't be correct. For now it does not test these pairs: m_a + m_b.base(), i.e. VARCHAR/JSON + TEXT m_a.base() + m_b, i.e. VARCHAR + TEXT/JSON I agree, we can start testing these pairs when we have a real more complex case. I believe the code is fine for now.
+ + A0 A1 A2 A3 + B0 . . . . + B1 . . . . + B2 . . . . + */ + return false; + + case 1: + /* + Expression: + COALESCE(TYPE1, TEXT) + + If only one handler derives from some other handler: + VARCHAR-> TYPE1 + + Continue aggregation as: + SELECT COALESCE(VARCHAR, TEXT) + */ + return false; + + case 0: + default: + /* + Non of the two handlers are derived from other handlers. + Nothing left to try. + */ + return true; + } + } +};
diff --git a/sql/sql_type_json.h b/sql/sql_type_json.h index 6c4ee8cb2eb..4a394809a06 100644 --- a/sql/sql_type_json.h +++ b/sql/sql_type_json.h @@ -21,18 +21,145 @@ ... + bool Item_append_extended_type_info(Send_field_extended_metadata *to, + const Item *item) const + { + return set_format_name(to); // Send "format=json" in the protocol + } + + bool Item_hybrid_func_fix_attributes(THD *thd, + const char *name, + Type_handler_hybrid_field_type *hybrid, + Type_all_attributes *attr, + Item **items, uint nitems) + const override + { + if (BASE::Item_hybrid_func_fix_attributes(thd, name, hybrid, attr, + items, nitems)) + return true; + /* + The above call can change the type handler on "hybrid", e.g. + choose a proper BLOB type handler according to the calculated max_length. + Convert general purpose string type handler to its JSON counterpart. + This makes hybrid functions preserve JSON data types, e.g.: + COALESCE(json_expr1, json_expr2) -> JSON + */ + hybrid->set_handler(json_type_handler_from_generic(hybrid->type_handler()));
I don't get it. It's supposed to "fix attributes", like signed/unsigned, max length, etc. The correct type handler should've been set already, otherwise this type handler method wouldn't even be called.
That true, it looks confusing. I also would prefer it did not change the data type after fixing attributes. But unfortunately, the handler can change between BLOB variants in the end. For "normal" BLOBs it works as follows - First, the upper level code detects that the result is going to be some BLOB. But it does not know yet the exact BLOB variant, because it does not know max_length. So it chooses the largest one - Type_handler_long_blob. - Then Type_handler_blob_common::Item_hybrid_func_fix_attributes is called, which calculates max_length, and then switches to a proper BLOB variant in the end, the smallest possible: bool Type_handler_blob_common:: Item_hybrid_func_fix_attributes(THD *thd, const LEX_CSTRING &func_name, Type_handler_hybrid_field_type *handler, Type_all_attributes *func, Item **items, uint nitems) const { if (func->aggregate_attributes_string(func_name, items, nitems)) return true; handler->set_handler(blob_type_handler(func->max_length)); return false; } For JSON I just call the inherited Item_hybrid_func_fix_attributes, and then translate the optionally changed "normal" BLOB variant to the corresponding BLOB/JSON variant. Possibly, we'll eventually get rid of separate four handlers: - Type_handler_tiny_blob - Type_handler_blob - Type_handler_medium_blob - Type_handler_long_blob and have just one handing all BLOB variants (and just one handling all BLOB/JSON variants). I seems it can be less confusing and more natural.
+ return false; + } };
+ +class Type_handler_string_json: + public Type_handler_general_purpose_string_to_json<Type_handler_string, + type_handler_string> +{ }; diff --git a/sql/sql_type_json.cc b/sql/sql_type_json.cc index a804366ec03..a84f720bcc9 100644 --- a/sql/sql_type_json.cc +++ b/sql/sql_type_json.cc @@ -20,17 +20,111 @@ #include "sql_class.h"
-Type_handler_json_longtext type_handler_json_longtext; +Named_type_handler<Type_handler_string_json> + type_handler_string_json("char/json");
do you mean these names are not shown anywhere?
They are shown in error messages, e.g.: Illegal parameter data types inet6 and longblob/json for operation '+'
+Named_type_handler<Type_handler_varchar_json> + type_handler_varchar_json("varchar/json"); + +Named_type_handler<Type_handler_tiny_blob_json> + type_handler_tiny_blob_json("tinyblob/json"); + +Named_type_handler<Type_handler_blob_json> + type_handler_blob_json("blob/json"); + +Named_type_handler<Type_handler_medium_blob_json> + type_handler_medium_blob_json("mediumblob/json"); + +Named_type_handler<Type_handler_long_blob_json> + type_handler_long_blob_json("longblob/json"); ... +/* + This method ressembles what Field_blob::type_handler()
resembles (also below)
Thanks. Will fix.
+ does for general purpose BLOB type handlers. +*/ +const Type_handler * +Type_handler_json_common::json_blob_type_handler_by_length_bytes(uint len) +{ + switch (len) { + case 1: return &type_handler_tiny_blob_json; + case 2: return &type_handler_blob_json; + case 3: return &type_handler_medium_blob_json; + } + return &type_handler_long_blob_json; +}
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Alexander!
- We'll probably have more FORMATs in the future.
Like what? As far as I can see, standard doesn't support FORMAT in the column definition at all.
Adding all format tests inside "normal" Field_string/Field_varchar/Field_blob is not a good idea.
+ */ + if (Type_handler_json_common::has_json_valid_constraint(this))
could you please use names that don't depend on how exactly we detect json? especially if the goal is to move to FORMAT JSON syntax
For now this name reflects what it actually does.
When we split JSON off and fix the code to have JSON put extra data type information into FRM, this test will be gone from Field_xxx.
But it will move to the code opening FRM, to be able open old tables where JSON is detected only by a CHECK constraint. And there it will reflect what it actually does again.
I propose to keep it as is.
ok
diff --git a/sql/sql_type.cc b/sql/sql_type.cc index c1801c1ae3e..3b2753d80a6 100644 --- a/sql/sql_type.cc +++ b/sql/sql_type.cc @@ -1755,19 +1755,110 @@ const Type_handler *Type_handler_typelib::cast_to_int_type_handler() const
/***************************************************************************/
+class Recursive_type_pair_iterator +{ + const Type_handler *m_a; + const Type_handler *m_b; + uint m_switched_to_base_count; +public: + Recursive_type_pair_iterator(const Type_handler *a, + const Type_handler *b, + uint switched_to_base_count= 0) + :m_a(a), m_b(b), m_switched_to_base_count(switched_to_base_count) + { } + const Type_handler *a() const { return m_a; } + const Type_handler *b() const { return m_b; } + Recursive_type_pair_iterator base() const + { + Recursive_type_pair_iterator res(m_a->type_handler_base(), + m_b->type_handler_base()); + res.m_switched_to_base_count= (res.m_a != NULL) + (res.m_b != NULL); + if (res.m_a == NULL) + res.m_a= m_a; + if (res.m_b == NULL) + res.m_b= m_b; + return res; + } + bool done() const + { + switch (m_switched_to_base_count) + { + case 2: + /* + Expression: + COALESCE(TYPE1, TYPE2) + + where both type handlers derive from some other handlers, e.g. + VARCHAR -> TYPE1 + TEXT -> TYPE2 + + Continue aggregation as: + SELECT COALESCE(VARCHAR, TEXT) + + Note, we now have only one level of data type inheritance, + e.g. VARCHAR -> VARCHAR/JSON + + This code branch will change when we have longer inheritance chains: + A0 -> A1 -> A2 -> A3 + B0 -> B1 -> B2 + + Functions like COALESE(A2, B2) will need to do full overload resolution: + - iterate through all pairs from the below matrix + - find all existing candidates + - resolve ambiguities in case multiple candidates, etc.
and I don't see how you're going to do that. you need to try m_a+m_b.base(), m_a.base()+m_b, and m_a.base()+m_b.base(). And that's not what you're doing.
I suggest to add an assert that m_switched_to_base_count < 2 and stop trying to solve complex cases until we have at least one.
Yes, we'll need to do all these combinations eventually. But I'm not trying to solve a more complex case than it's really necessary now:
If the SQL query is:
SELECT varchar_json_expr + text_json_expr FROM t1;
it works as follows:
- It tries VARCHAR/JSON+TEXT/JSON in Type_collection_json, nothing is found
- Then it tries directly VARCHAR+TEXT, which is m_a.base()+m_b.base(), and this combination is found in Type_collection_std,
I'd expect it to find VARCHAR/JSON+TEXT/JSON right away. It would've naturally preserved the json property. Otherwise you'd need to do an extra json propagation. Here, of course, I mean not Item_func_plus, but, say, COALESCE. Or UNION.
diff --git a/sql/sql_type_json.cc b/sql/sql_type_json.cc index a804366ec03..a84f720bcc9 100644 --- a/sql/sql_type_json.cc +++ b/sql/sql_type_json.cc @@ -20,17 +20,111 @@ #include "sql_class.h"
-Type_handler_json_longtext type_handler_json_longtext; +Named_type_handler<Type_handler_string_json> + type_handler_string_json("char/json");
do you mean these names are not shown anywhere?
They are shown in error messages, e.g.:
Illegal parameter data types inet6 and longblob/json for operation '+'
eh, okay. may be we should change it later to "longblob format json". But not now. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Please find a new patch version here: https://github.com/MariaDB/server/commit/99e2a49acfc621072a8455db772a5fe342b... On 1/3/22 5:26 PM, Sergei Golubchik wrote:
Hi, Alexander!
- We'll probably have more FORMATs in the future.
Like what? As far as I can see, standard doesn't support FORMAT in the column definition at all.
There will be no new formats in the near future probably, unless the next standard introduces it :) Still, even if we have only JSON, my point here was that it's not a good idea to have JSON specific tests per row inside Field_string/varstring/blob for non-JSON "normal" CHAR/VARCHAR/BLOB fields. So JSON deserves its own Field_string_json, Field_varstring_json and Field_blob_json. I want to add them later. But not in this patch. <cut>
I'd expect it to find VARCHAR/JSON+TEXT/JSON right away. It would've naturally preserved the json property.
Otherwise you'd need to do an extra json propagation.
Here, of course, I mean not Item_func_plus, but, say, COALESCE. Or UNION.
Fixed according to our slack discussion. Type_collection_json::aggregate_json_for_result() now performs aggregation of two JSON types by aggregating their base types and then converting a general purpose string type to the corresponding JSON type.
diff --git a/sql/sql_type_json.cc b/sql/sql_type_json.cc index a804366ec03..a84f720bcc9 100644 --- a/sql/sql_type_json.cc +++ b/sql/sql_type_json.cc @@ -20,17 +20,111 @@ #include "sql_class.h"
-Type_handler_json_longtext type_handler_json_longtext; +Named_type_handler<Type_handler_string_json> + type_handler_string_json("char/json");
do you mean these names are not shown anywhere?
They are shown in error messages, e.g.:
Illegal parameter data types inet6 and longblob/json for operation '+'
eh, okay. may be we should change it later to "longblob format json". But not now.
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
participants (2)
-
Alexander Barkov
-
Sergei Golubchik