Hi Sergei, On 01/09/2016 11:54 PM, Sergei Golubchik wrote:
Hi, Alexander!
On Jan 09, Alexander Barkov wrote:
I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code.
It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".
I wouldn't sweat over it. No matter what you do with LEX_STRING it won't decrease server memory requirements in any noticeable way.
Monty proposes to collect some of the bool flags in Items into bit mask, to save a few bytes, because he thinks it's important. You propose to lose 4 or 6 bytes, because it does not affect server memory in any noticeable ways and thus is not important.
I agree with Monty here. Saving a few bytes is at least not harmful and should affect performance positively. A little bit, but still positively.
I am not afraid of having more string classes.
It is harmful in a sense that it adds more concepts to the source code and a developer needs to keep them in mind when writing new code. These new string classes make code base more complex. Sometimes the added complexity is unavoidable. Sometimes it's unnecessary.
Anyway, you say that this change improves the performance? I don't think it does. May even make the performance worse - access to unaligned memory can be slower (presuming you've packed your data, otherwise the compiler can align them, which would destroy all your "space savings"). So, the final proof is to benchmark it and see.
Attached. I run the test as follows: g++ -O3 s.cc && ./a.out Ignore the first iteration (I separated it from the other iterations with "------"), as it warms up the processor, so the very first result is always the worst. As you can see, performance is the same. But: - The size of the one with "unsigned int length" is 16. - The size of the one with "size_t length" is 24. Note, I did not use any packing. The compiler is smart enough to use the unused 4 bytes to store the extra members of the child class when inheriting, like m_repertoire in this example.
When column metadata is sent to the client side, there is a character conversion from utf8 to character_set_client. If we remember the column name metadata inside the Name class, we can save on conversion and use faster non-converting methods instead.
What kind of metadata? The repertoire?
Yes, at least repertoire. But maybe something else in the future. For example, store short identifiers directly in the memory used by "const char *ptr", so it does not even need any buffer on memroot (or elsewhere). If we do this, there will be a need for a flag telling that there is no an external buffer when the data is inside. Btw, the same thing could be useful in String. IIRC, some OSes do this trick with std::string.
I've checked, LEX_STRING is only used in the plugin API that is versioned and we can safely change LEX_STRING in 10.2 to use, say, uint16, if needed. I propose not to change LEX_STRING.
Fine :)
Moreover, it's not needed for server. Only needed for the plugin API.
I think that:
1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient.
2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes.
Precisely. So, to know what different strings are there in the server, I want to list them all. Then we'll see whether how string classes fit in.
1. char* - zero terminated string. I don't see why one should ever use it (in the server). 2. uchar* - byte array, not zero terminated, no length. Even worse. Never ever. 3. LEX_STRING - a string and a length, no charset. Used in C with a preallocated buffer. 4. LEX_CSTRING - constant string, no charset. 5. LEX_CUSTRING - constant byte array? weird, but possible. 6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to, reallocates automatically. No charset. 7. CSET_STRING - C++ LEX_STRING with a charset. 8. String - swiss army knife C++ string. Uses a preallocated buffer or heap. Grows, shrinks, convert charsets, can copy, move, initialize from anything, print hex, make coffee, escape, append anything, and so on. 9. StringBuffer - helper template to initialize a String from a local fixed buffer (very common usage pattern).
And I'm pretty sure I've missed a few. Two questions:
- what did I miss?
10. ConnectSE strings, which re-implement String with a different allocator, but otherwise repeat String functionality.
Ok, so you're basically saying that String is not generic enough and sometimes one needs a String that uses a different allocator. I agree with that.
Excellent. Now guess how happier Olivier and I could be if you said this in October 2014, when I sent the patch implementing this :) https://mariadb.atlassian.net/browse/MDEV-7063
11. There are also places where we have just pairs
const char *name; uint/size_t name_length;
inside bigger structures.
That's basically my #1, we should never do it. They should be replaced with LEX_STRINGs asap.
Right, or to some class LEX_STRING-alike class. (but let's wait for what we will end up with Name).
- how do your new classes fits into it?
They do not fit.
Various strings have these properties: 1. maximum possible length 2. pointer signness 3. "const" qualifier in the pointer 4. memory allocation/realloction possibility. 5. Needed in the pure C API
Right. So
LEX_STRING: signed/noconst/no/C LEX_CSTRING: signed/const/no/C LEX_CUSTRING: unsigned/const/no/C DYNAMIC_STRING: signed/noconst/yes/C CSET_STRING: signed/noconst/no/C++ String: signed/noconst/yes/C++
hmm, apparently, you've forgot
6. charset and other metadata
last two classes have that, other do not. And, as you can see, I've omitted "maximum possible length" part, as I don't think it matters much.
It matters only for the proper type for the "length" member.
A column name is a string with these properties: 1. Not longer than 64K (perhaps even not longer than 255) 2. Signed pointer 3. Constant pointer 4. No [re]allocation 5. Not needed in C API
Non of the currently existing classes or structures perfectly fit.
it's signed/const/no/C++ which is pretty much LEX_CSTRING, just as you've originally had in MDEV. If you add
6. knows its charset (and repertoire)
then it's CSET_STRING or a (new) CSET_CSTRING, if you want the 'const' qualifier (although, I suspect, that const CSET_STRING might work too).
Name knows its charset. But it's always utf8. So no needs for a CHARSET_INFO pointer. But having repertoire could help to avoid conversion for pure ASCII identifiers (which is the majority).
please add 10, 11, etc to the list as you see fit.
12. Name, with "uint16 length" :)
I do not think we need it.
What about "uint length"? I.e. 4 bytes for length, thus save 4 bytes for something extra, like repertoire and other flags.
Note, some of these strings could share some code if:
1. We use a template, with abstract pointer and length types. So we can pass say "const char *" as PTR and uint16 as LENGTH, to get Name. String, Name, CSET_STRING can have a common template root.
Yes.
2. We accept:
MDEV-7063 Split String into logical components
Unfortunately, nobody was willing to review this patch when I tried to help Olivier not to re-implement String.
This is assigned to Monty now.
This is very very very pity and discouraging. I'm crying in the nights and eat anti-depressants :) Well, not really yet, but getting very close to this :)
Sorry :(
Let's speed up the discussion and take this string-class question on the next maria call?
Well, at this point I'd prefer to postpone "MDEV-7063 Split String into logical components" because Olivier has already copied everything he needed from String to his own class. So this is not urgent now. But I'd like to return to it at some point. Now we need to decide about Name, as it is really urgent. Thanks.
Regards, Sergei Chief Architect MariaDB and security@mariadb.org