Hi, To answer some of my own idiocy, I've been looking at the system installed headers for the below structs, and there is already variance. MariaDB (libmariadb/include/mariadb_com.h, and installed onto my system by mariadb-connector-c): typedef struct st_udf_args { unsigned int arg_count; /* Number of arguments */ enum Item_result *arg_type; /* Pointer to item_results */ char **args; /* Pointer to argument */ unsigned long *lengths; /* Length of string arguments */ char *maybe_null; /* Set to 1 for all maybe_null args */ } UDF_ARGS; MySQL (specifically as installed by mysql-connector-c, apparently): typedef struct UDF_ARGS { unsigned int arg_count; /**< Number of arguments */ enum Item_result *arg_type; /**< Pointer to item_results */ char **args; /**< Pointer to argument */ unsigned long *lengths; /**< Length of string arguments */ char *maybe_null; /**< Set to 1 for all maybe_null args */ char **attributes; /**< Pointer to attribute name */ unsigned long *attribute_lengths; /**< Length of attribute arguments */ void *extension; } UDF_ARGS; Since these are not installed by the two server components I'm already wondering about the safety of the relevant changes I'm suggesting. Or are UDFs somehow handled by the connector packages? And is it 100% accident that ->attributes get set to the function argument strings, eg, for uls_inet6_network_address('::ffff:192.168.1.123', 120); attributes[0] = "'::ffff:192.168.1.123'" and attributes[1]="120" Thanks for any pointers and guidance as to how to approach this, in the meantime I've managed to work around this by invoking the functions as CAST(uls_inet6... AS CHAR) which does the trick, but IMHO is nasty, so I'd still prefer to sort the below. Given the above, I suspect that's why MySQL introduced the "wrapper functions" to access data, which I reckon to a large degree makes sense. I'd need to check the MySQL codebase as well, unless someone perhaps have some insight? Kind regards, Jaco On 2024/06/24 17:13, Jaco Kroon via developers wrote:
Hi,
It's a known issue that strings to and from UDF functions are a pain (https://dev.mysql.com/blog-archive/a-tale-of-udfs-with-character-sets/, amongst others).
Most recently I ran into an issue trying to create a UDF to assist with IPv6 manipulations (in it's current, not working as ex form at https://github.com/jkroonza/uls-mysql/blob/master/src/uls_inet6.c).
The two UDF functions there is intended to find the "network" (first) and "broadcast" (last) address of a range, with the intention to add functions to perform other manipulations (like adding some offset to an IPv6, or even add two IPv6's together, eg, to find the "next" network for 2c0f:f720::/40 you can add 0:0:100:: to 2c0f:f720::).
The referenced works great when you get values from inet6 column to pass to the function (it gets transformed into textual presentation before being passed to the UDF) eg:
MariaDB [test]> select uls_inet6_network_address('::ffff:192.168.1.123', 120); +--------------------------------------------------------+ | uls_inet6_network_address('::ffff:192.168.1.123', 120) | +--------------------------------------------------------+ | ::ffff:192.168.1.0 | +--------------------------------------------------------+ 1 row in set (0.001 sec)
However, given this:
MariaDB [test]> create table t(t inet6 not null, primary key(t)); Query OK, 0 rows affected (0.024 sec)
MariaDB [test]> insert into t values('::ffff:192.168.1.15'); Query OK, 1 row affected (0.014 sec)
MariaDB [test]> select * from t where uls_inet6_network_address('::ffff:192.168.1.123', 120) < t; Empty set, 1 warning (0.009 sec)
MariaDB [test]> show warnings; +---------+------+---------------------------------------------+ | Level | Code | Message | +---------+------+---------------------------------------------+ | Warning | 1292 | Incorrect inet6 value: '::ffff:192.168.1.0' | +---------+------+---------------------------------------------+ 1 row in set (0.000 sec)
I started to dig, to find this:
MariaDB [test]> select charset(uls_inet6_network_address('::ffff:192.168.1.123', 120)); +-----------------------------------------------------------------+ | charset(uls_inet6_network_address('::ffff:192.168.1.123', 120)) | +-----------------------------------------------------------------+ | binary | +-----------------------------------------------------------------+ 1 row in set (0.000 sec)
Then I played a bit and found that if I can rig it to the returned string has 16 characters I don't get an error; eg:
MariaDB [test]> select * from t where uls_inet6_network_address('::ff:f:ffff:ffff', 120) > t; +---------------------+ | t | +---------------------+ | ::ffff:192.168.1.15 | +---------------------+ 1 row in set (0.001 sec)
Based on this I'm inferring that:
1. Contextually the difference between BINARY() and CHAR() types is the character set (ie, binary is implemented as a kind of characters set for strings, or perhaps strings are binaries with a specific character set). 2. The inet6 storage column (which I now realize is simply BINARY(16), with a transform applied to transform strings to and from the BINARY() format, I think the technical term in the code is "fixed binary storage storage"). 3. When string (character) data gets sent *to* an inet6 column a "character set conversion" is performed, and as is the case with UDFs, it's already BINARY, so no conversion is actually performed, and since the BINARY length is now *not* 16 bytes, we get the above warning, and thus comparisons are as if with a NULL value, and always false.
I've been contemplating possible "solutions" to the problem without breaking backwards compatibility, and it's tricky. The below aims at a simple way of specifying the return "character set" of a string return in more detail. And possibly even looking at the character sets for parameters.
MySQL have started down a path (as per the tale above) whereby it's using some other mechanism to give indications about character sets. I don't like the particular mechanism, but it could work, looks like it does only character set though, in which case for my example above, I could just force both input and output to "latin1" so that MySQL/MariaDB is forced to convert between that and INET6 again. I think we can do better. I'm not convinced their change will permit backwards compatibility - ie, any UDF that uses those functions will require them and cannot opt to operate in a degraded mode. Not sure that's even a desirable thing to have though ... so perhaps copying what they're doing here is the way to go.
My ideas:
Option 1. Simply permit specifying a more precise string type during function creation.
For example, instead of:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS STRING SONAME 'uls_inet6.so';
Do this:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS INET6 SONAME 'uls_inet6.so';
And this is permitted because INET6 is a specialization of BINARY, which is really what MySQL passes to/from UDF functions anyway.
The downside is this will only work for returns. This would, work for my use case, since from INET6 => UDF the decoder is called anyway, so I do get the text presentation on function entry, and now I can now I just need to return the binary form of INET6. If sent back to the client it gets converted by the server to
This eliminates at least *one* of the two pointless conversions to/from binary format.
Option 2. A mechanism to pass the character set/binary type upon entry and exit.
What I'm thinking here is a non-backwards compatible change, potentially. Or use the *extension pointer somehow. I'm still having some trouble navigating the codebase to figure out exactly how the void *extension pointers in UDF_ARGS and UDF_INIT structs are used. But I'm thinking they can be (ab)used to somehow pass an additional structure, which may be how 3 is achieved perhaps?
To be more precise, the current UDF_ARGS looks like:
47 typedef struct UDF_ARGS { 48 unsigned int arg_count; /**< Number of arguments */ 49 enum Item_result *arg_type; /**< Pointer to item_results */ 50 char **args; /**< Pointer to argument */ 51 unsigned long *lengths; /**< Length of string arguments */ 52 char *maybe_null; /**< Set to 1 for all maybe_null args */ 53 char **attributes; /**< Pointer to attribute name */ 54 unsigned long *attribute_lengths; /**< Length of attribute arguments */ 55 void *extension; 56 } UDF_ARGS;
I can't find any source files that references both the UDF_ARGS type and references this extension field. As such I'm hoping I can safely assume that in current implementations this will always be NULL.
If so it becomes possible to turn this into a pointer-sized version field, or struct size field, ie something like:
union { void *extension; size_t argsize; /* one alternative */ long api_version; /* another alternative */ };
= sizeof(UDF_ARGS) against which the UDF was compiled. This *feels* flakey, but should be fine. If a user wants more fine-grained compile backwards compat, then the user can check that all fields (further
for argsize this way it becomes possible to check that this field is than those listed) are positioned inside the structure.
Another option is simply to have extension be a pointer to some UDF_ARGS_EXPANDED struct, which really is a pointer back to UDF_ARGS which is really something like:
typedef struct UDF2_ARGS { ... everything from above, except void* extension becomes: void *UDF2_ARGS extension; /* this points to self */ size_t argsize; /* or long api_version */ bool udf2_supported; /* set to false by the server, if the UDF supports this, set to true */ const char ** string_arg_types; /* in _init this will be passed as the raw type, eg, INET6 if the input is an INET6 field, can be set in _init to have the server cast the type, ie, perform some conversion */ ... additional fields;
} UDF2_ARGS;
So in my example, if udf2_supported comes back as false after _init, then current behaviour is maintained, and INET6 will be converted to string, and return strings are assumed to be BINARY (current behaviour). If however, udf2_supported, then my code could set string_args_types[0] = "inet6"; - which the engine knows is a BINARY(16) type, and use the INET6 conversion code to convert to and from BINARY(16) as needed.
Option 3. Some other more formal UDF API versioning/api definition/scheme. So call this UDFv2, and make this extensible so that features can be "negotiated", or that the UDF itself can even "fail to load" in case where it requires support from the running server that's not provided.
The idea here would be that a symbol like "mariadb_udf_init" is exported, passing it some structure that defines the capabilities of the server it's running on, the module can then decline to load, or proceed, but limit itself based on functionality available kind of thing.
I'm guessing option 2 is also a way of achieving this. And may be sufficient. I do however like the idea of having a {func}_load and {func}_unload call (some of our PBKDF functions would also benefit from this in that we can load the openssl algorithms once at startup instead of every time on _init. There may be other complications though, but still, this could be one suggestion for v2.
I'm happy to write a more formal specification based off of the informal specification, and I'm inclined to attempt option 2 above. I just honestly have no idea what's all REALLY involved, and where to find the various bits and pieces of code. So I guess the initial steps would be:
1. Add a _load function which is invoked at *load* time, passing server and version information. Not sure what else could be useful, intension is to initialise the UDF for use as needed. UDFs that want to maintain backwards compatible would need to track if this was indeed called or not, and if not, act accordingly, or if it can't be backwards compatible without this, error out in _init. Suggestions as to parameters to pass would be great, I suggest we use a struct again with a size field as it's first field to ensure that we can add extra fields at a later stage, safely. Eg:
struct UDF_LOADINFO { size_t load_size; const char* server; struct { uint16_t major, minor, patch; /* 10, 6, 17 - for example what I've got runing currently */ const char* extra; /* MariaDB-log, the stuff after the dash from SELECT version(); */ } version; } UDF_LOADINFO;
And so we will call "bool (*func_load)(const UDF_LOADINFO*)" IFF it exist.
2. Add an _unload function "void (*func_unload)()", purpose of which is to clean up after _load.
Note: It's possible to use __attribute__((constructor)) and destructor functions, but my experience is that they're messy, and if you have multiple functions in a UDF you can't know which ones you're being loaded for.
3. Expand for UDF2_ARGS as explained in option 2 above, as well as UDF2_INIT.
4. Ability to flag functions as *deterministic* (same return for same arguments) rather than merely const or not const. This way SQL can optimize calls in the same manner as for stored functions. I'm not going to dig into that, but happy to add the flag.
5. Document and flag this as UDFv2 as and when it hits formal release.
Kind regards, Jaco
_______________________________________________ developers mailing list -- developers@lists.mariadb.org To unsubscribe send an email to developers-leave@lists.mariadb.org