[Maria-developers] Please review MDEV-8092 Change Create_field::field from "const char *" to LEX_CSTRING

older
[Maria-developers] Log warnings...

Alexander Barkov

27 Nov 2015 27 Nov '15

11:29 a.m.

Hi Serg, Please review a patch for MDEV-8092. Thanks. Perhaps something interesting, about #3 in the task description. I added a debug output and collected some statistics for "mtr --suite=main", to check how these changes can potentially affect another task: MDEV-7954 my_strcasecmp_utf8() takes 0.19% in OLTP RO Here are the results: my_strcasecmp_utf8() is now called 18,709,905 times total during "mtr --suite=main" 9,200,498 of those are called to compare Field/Create_field names in Column_name::eq_name(). And 3,855,751 calls of my_strcasecmp_utf8() are now optimized away, because eq_name() immediately return "false" if the two name lengths are different. So before this change there were 18,709,905 + 3,855,751 = 22,565,656 calls of my_strcasecmp_utf8(). Looks roughly like 17% improvement in name comparison.

Attachments:

mdev-8092.diff (text/x-patch — 192.4 KB)

Show replies by date

Sergei Golubchik

7 Jan 7 Jan

4:53 p.m.

Hi, Alexander! On Nov 27, Alexander Barkov wrote:

...

Hi Serg,

Please review a patch for MDEV-8092.

I didn't review a lot of it, sorry. Because my first two comments - if we agree on them - might cause quite a few changes (search/replace kind of changes, but still). So I'd better look at the whole patch after that.

...

diff --git a/sql/field.h b/sql/field.h index 0591914..a40e307 100644 --- a/sql/field.h +++ b/sql/field.h @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) }

+/** + @class Name + An arbitrary SQL name, consisting of a NULL-terminated string + and its length. + Later it can be reused for table and schema names. +*/ +class Name +{ + const char *m_str; + uint m_length;

Sorry, we have too many "string" classes already. Either use LEX_STRING or create a Lex_string which is inherited from LEX_STRING, binary compatible with it and only adds C++ methods (and can be casted back and forth automatically). So that one could use Lex_string where a LEX_STRING is expected and vice versa.

...

@@ -602,7 +711,7 @@ class Virtual_column_info: public Sql_alloc } };

-class Field: public Value_source +class Field: public Value_source, public Column_name

I don't think so, this is logically wrong. A field *has* a name, that's right. But a field *is not* a name. The Field class should include the name, not be inherited from it.

...

{ Field(const Item &); /* Prevent use of these */ void operator=(Field &);

Regards, Sergei Chief Architect MariaDB and security@mariadb.org -- Vote for my Percona Live 2016 talks: https://www.percona.com/live/data-performance-conference-2016/sessions/maria... https://www.percona.com/live/data-performance-conference-2016/sessions/maria...

Alexander Barkov

10:06 p.m.

Hi Sergei, On 01/07/2016 08:53 PM, Sergei Golubchik wrote:

...

Hi, Alexander!

On Nov 27, Alexander Barkov wrote:

...
Hi Serg,

Please review a patch for MDEV-8092.

I didn't review a lot of it, sorry. Because my first two comments - if we agree on them - might cause quite a few changes (search/replace kind of changes, but still). So I'd better look at the whole patch after that.

Sure. Thanks. Please see below.

...

...
diff --git a/sql/field.h b/sql/field.h index 0591914..a40e307 100644 --- a/sql/field.h +++ b/sql/field.h @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) }

+/** + @class Name + An arbitrary SQL name, consisting of a NULL-terminated string + and its length. + Later it can be reused for table and schema names. +*/ +class Name +{ + const char *m_str; + uint m_length;

Sorry, we have too many "string" classes already. Either use LEX_STRING or create a Lex_string which is inherited from LEX_STRING, binary compatible with it and only adds C++ methods (and can be casted back and forth automatically). So that one could use Lex_string where a LEX_STRING is expected and vice versa

Sorry for a long reply. This is a very convenient moment of time when we can improve a few things. I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code. It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t". Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future. I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead. - keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now. Please note, in the current patch version the Name class has constructors that can accept LEX_STRING and LEX_CSTRING. This means that you can pass LEX_STRING and LEX_CSTRING to any function or method that is designed to accept Name, and the patch already uses this. So this switch will go very smoothly. As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :) I don't agree with this statement. I think that: 1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient. 2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes. For example, table names and column names are different enough. They are compared differently (case sensitivity). It's very natural that they have to have different classes behind, with different comparison rules. So I really hope that there will be Table_name at some point in the future, as I don't like tons of similar: if (lower_case_table_names) do_something; This has to go inside Table_name. Like I moved column name comparison inside Name::eq() in the MDEV-8092 patch.

...

...
@@ -602,7 +711,7 @@ class Virtual_column_info: public Sql_alloc } };

-class Field: public Value_source +class Field: public Value_source, public Column_name

I don't think so,

Ok. I agree to use a composition instead of inheritance here. Btw, do we have a guidance when to use an inheritance vs a composition? Sometimes it's not clear which is better. Inheritance generally does have some benefits: - it seems to provide more polymorphism and thus to reuse more code. - it can fill the unused aligned memory of the parent class to store small members of the children class (composition cannot do that). So a result of inheritance is generally smaller than the result of a composition.

...

this is logically wrong. A field *has* a name, that's right. But a field *is not* a name. The Field class should include the name, not be inherited from it.

Can I spend a few more seconds of you time, to have some fun? This reminds me the phrase: A rabbit IS not only costly fur, but IS also 3-4 kilograms of dietetic meat. Looks like this phrase is also logically wrong. Or should I rather say it *has* a logically wrong phrasing? :) Thanks.

...

...
{ Field(const Item &); /* Prevent use of these */ void operator=(Field &);

Regards, Sergei Chief Architect MariaDB and security@mariadb.org

Vicențiu Ciorbaru

10:48 p.m.

Hi Alexander, Sergei, On Fri, 8 Jan 2016 at 00:10 Alexander Barkov <bar@mariadb.org> wrote:

...

Hi Sergei,

On 01/07/2016 08:53 PM, Sergei Golubchik wrote:

...
Hi, Alexander!

On Nov 27, Alexander Barkov wrote:

...
Hi Serg,

Please review a patch for MDEV-8092.

I didn't review a lot of it, sorry. Because my first two comments - if we agree on them - might cause quite a few changes (search/replace kind of changes, but still). So I'd better look at the whole patch after that.

Sure. Thanks.

Please see below.

...
...
diff --git a/sql/field.h b/sql/field.h index 0591914..a40e307 100644 --- a/sql/field.h +++ b/sql/field.h @@ -537,6 +537,115 @@ inline bool

is_temporal_type_with_time(enum_field_types type)

...
...
}

+/** + @class Name + An arbitrary SQL name, consisting of a NULL-terminated string + and its length. + Later it can be reused for table and schema names. +*/ +class Name +{ + const char *m_str; + uint m_length;

Sorry, we have too many "string" classes already. Either use LEX_STRING or create a Lex_string which is inherited from LEX_STRING, binary compatible with it and only adds C++ methods (and can be casted back and forth automatically). So that one could use Lex_string where a LEX_STRING is expected and vice versa

Sorry for a long reply.

This is a very convenient moment of time when we can improve a few things. I disagree to go the easier way with LEX_STRING.

LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future.

I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead. - keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now.

Please note, in the current patch version the Name class has constructors that can accept LEX_STRING and LEX_CSTRING. This means that you can pass LEX_STRING and LEX_CSTRING to any function or method that is designed to accept Name, and the patch already uses this. So this switch will go very smoothly.

As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :) I don't agree with this statement.

Hi Alexander, Sergei! Since this discussion might be somewhat similar to a difficulty I'm facing (see recent email on dev list "Use of std::string"), I figured I'd join in. In that email I suggested making use of std::string instead of regular raw pointers and length fields. From looking at the patch (mostly just the name classes, did not investigate everything), the std::string class would work just fine for the Name class and be very clear cut parent class for the Column_name. Implementing an equality operator perhaps, instead of eq_name function would improve clarity also. The advantages of using the standard library would be that compilers provide specific optimizations for them. At the same time, we would not be adding an extra string API. Regards, Vicentiu

Alexander Barkov

8 Jan 8 Jan

7:36 a.m.

Hi Vicențiu, On 01/08/2016 02:48 AM, Vicențiu Ciorbaru wrote:

...

Hi Alexander, Sergei,

On Fri, 8 Jan 2016 at 00:10 Alexander Barkov <bar@mariadb.org <mailto:bar@mariadb.org>> wrote:

Hi Sergei,

On 01/07/2016 08:53 PM, Sergei Golubchik wrote: > Hi, Alexander! > > On Nov 27, Alexander Barkov wrote: >> Hi Serg, >> >> Please review a patch for MDEV-8092. > > I didn't review a lot of it, sorry. Because my first two comments - if > we agree on them - might cause quite a few changes (search/replace kind > of changes, but still). So I'd better look at the whole patch after > that.

Sure. Thanks.

Please see below.

> >> diff --git a/sql/field.h b/sql/field.h >> index 0591914..a40e307 100644 >> --- a/sql/field.h >> +++ b/sql/field.h >> @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) >> } >> >> >> +/** >> + @class Name >> + An arbitrary SQL name, consisting of a NULL-terminated string >> + and its length. >> + Later it can be reused for table and schema names. >> +*/ >> +class Name >> +{ >> + const char *m_str; >> + uint m_length; > > Sorry, we have too many "string" classes already. > Either use LEX_STRING or create a Lex_string > which is inherited from LEX_STRING, binary compatible with it > and only adds C++ methods (and can be casted back and forth automatically). > So that one could use Lex_string where a LEX_STRING is expected and vice > versa

Sorry for a long reply.

This is a very convenient moment of time when we can improve a few things. I disagree to go the easier way with LEX_STRING.

LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future.

I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead. - keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now.

Please note, in the current patch version the Name class has constructors that can accept LEX_STRING and LEX_CSTRING. This means that you can pass LEX_STRING and LEX_CSTRING to any function or method that is designed to accept Name, and the patch already uses this. So this switch will go very smoothly.

As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :) I don't agree with this statement.

Hi Alexander, Sergei!

Since this discussion might be somewhat similar to a difficulty I'm facing (see recent email on dev list "Use of std::string"), I figured I'd join in.

In that email I suggested making use of std::string instead of regular raw pointers and length fields. From looking at the patch (mostly just the name classes, did not investigate everything), the std::string class would work just fine for the Name class and be very clear cut parent class for the Column_name. Implementing an equality operator perhaps, instead of eq_name function would improve clarity also.

The advantages of using the standard library would be that compilers provide specific optimizations for them. At the same time, we would not be adding an extra string API.

I think std::string would be a non-optimal storage for column, table, database names. There are at least two serious reasons not to use std::string: 1. std::string is a true dynamic string. It supports reallocation. Internally it consists of a structure which is effectively similar to: { size_t m_length; size_t m_alloced_length; int m_reference_count; char *m_ptr; } Note, sizeof(std::string) reports only 8 bytes, but this is just because of implementation tricks. The real minimum size is in fact 32 bytes. Column names are not dynamic string. Once they're created, they never change. So they don't need reallocation. They only need 10 bytes for a structure like this: { char *m_ptr; // 8 bytes uint16 m_length; // 2 bytes // we can use the remaining unaligned 6 bytes for some other purposes // e.g. for string metadata, to avoid character set conversion in many cases } 10 vs 32 is better. 2. Another, and the most important thing, is that names are created on MEM_ROOT, not directly on heap. This is true both the structure itself, and the char array buffer. std::string must be freed via destructor or delete, otherwise you'll get memory leaks. Names are automatically freed whenever the containing MEM_ROOT is freed. If we switch to std::string, we'll have to maintain the list of all created names, to free them properly at the end. This will make the things more complicated.

...

Regards, Vicentiu

Vicențiu Ciorbaru

10:27 a.m.

Hi Alexander, Thanks for the explanation. I agree with your points regarding std::string. Although it is possible to control allocation using an allocator, it does indeed make sense that std::string is not easily applicable in this case. I am personally for the introduction of an extra string class if it serves our purpose better. However we should strive to make the interfaces for our classes more uniform, as it quickly becomes confusing which operations are supported and which are not. The classes at hand seem reasonable in this regard. I think I would override the equality operator also. Vicentiu On Fri, 8 Jan 2016 at 09:36, Alexander Barkov <bar@mariadb.org> wrote:

...

Hi Vicențiu,

On 01/08/2016 02:48 AM, Vicențiu Ciorbaru wrote:

...
Hi Alexander, Sergei,

On Fri, 8 Jan 2016 at 00:10 Alexander Barkov <bar@mariadb.org <mailto:bar@mariadb.org>> wrote:

Hi Sergei,

On 01/07/2016 08:53 PM, Sergei Golubchik wrote: > Hi, Alexander! > > On Nov 27, Alexander Barkov wrote: >> Hi Serg, >> >> Please review a patch for MDEV-8092. > > I didn't review a lot of it, sorry. Because my first two comments - if > we agree on them - might cause quite a few changes (search/replace kind > of changes, but still). So I'd better look at the whole patch after > that.

Sure. Thanks.

Please see below.

> >> diff --git a/sql/field.h b/sql/field.h >> index 0591914..a40e307 100644 >> --- a/sql/field.h >> +++ b/sql/field.h >> @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) >> } >> >> >> +/** >> + @class Name >> + An arbitrary SQL name, consisting of a NULL-terminated string >> + and its length. >> + Later it can be reused for table and schema names. >> +*/ >> +class Name >> +{ >> + const char *m_str; >> + uint m_length; > > Sorry, we have too many "string" classes already. > Either use LEX_STRING or create a Lex_string > which is inherited from LEX_STRING, binary compatible with it > and only adds C++ methods (and can be casted back and forth automatically). > So that one could use Lex_string where a LEX_STRING is expected and vice > versa

Sorry for a long reply.

This is a very convenient moment of time when we can improve a few things. I disagree to go the easier way with LEX_STRING.

LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future.

I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead. - keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now.

Please note, in the current patch version the Name class has constructors that can accept LEX_STRING and LEX_CSTRING. This means that you can pass LEX_STRING and LEX_CSTRING to any function or method that is designed to accept Name, and the patch already uses this. So this switch will go very smoothly.

As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :) I don't agree with this statement.

Hi Alexander, Sergei!

Since this discussion might be somewhat similar to a difficulty I'm facing (see recent email on dev list "Use of std::string"), I figured I'd join in.

In that email I suggested making use of std::string instead of regular raw pointers and length fields. From looking at the patch (mostly just the name classes, did not investigate everything), the std::string class would work just fine for the Name class and be very clear cut parent class for the Column_name. Implementing an equality operator perhaps, instead of eq_name function would improve clarity also.

The advantages of using the standard library would be that compilers provide specific optimizations for them. At the same time, we would not be adding an extra string API.

I think std::string would be a non-optimal storage for column, table, database names.

There are at least two serious reasons not to use std::string:

1. std::string is a true dynamic string. It supports reallocation. Internally it consists of a structure which is effectively similar to:

{ size_t m_length; size_t m_alloced_length; int m_reference_count; char *m_ptr; }

Note, sizeof(std::string) reports only 8 bytes, but this is just because of implementation tricks. The real minimum size is in fact 32 bytes.

Column names are not dynamic string. Once they're created, they never change. So they don't need reallocation. They only need 10 bytes for a structure like this:

{ char *m_ptr; // 8 bytes uint16 m_length; // 2 bytes // we can use the remaining unaligned 6 bytes for some other purposes // e.g. for string metadata, to avoid character set conversion in many cases }

10 vs 32 is better.

2. Another, and the most important thing, is that names are created on MEM_ROOT, not directly on heap. This is true both the structure itself, and the char array buffer.

std::string must be freed via destructor or delete, otherwise you'll get memory leaks.

Names are automatically freed whenever the containing MEM_ROOT is freed. If we switch to std::string, we'll have to maintain the list of all created names, to free them properly at the end. This will make the things more complicated.

...
Regards, Vicentiu

Alexander Barkov

9:47 p.m.

Hi Vicențiu, On 01/08/2016 02:27 PM, Vicențiu Ciorbaru wrote:

...

Hi Alexander,

Thanks for the explanation. I agree with your points regarding std::string. Although it is possible to control allocation using an allocator, it does indeed make sense that std::string is not easily applicable in this case.

I am personally for the introduction of an extra string class if it serves our purpose better. However we should strive to make the interfaces for our classes more uniform, as it quickly becomes confusing which operations are supported and which are not. The classes at hand seem reasonable in this regard. I think I would override the equality operator also.

It should be fine to override the equality operator. But on the other hand, it's much easier to search for a code using "grep eq_name". So I'm not sure what is better.

...

Vicentiu

On Fri, 8 Jan 2016 at 09:36, Alexander Barkov <bar@mariadb.org <mailto:bar@mariadb.org>> wrote:

Hi Vicențiu,

On 01/08/2016 02:48 AM, Vicențiu Ciorbaru wrote: > Hi Alexander, Sergei, > > On Fri, 8 Jan 2016 at 00:10 Alexander Barkov <bar@mariadb.org <mailto:bar@mariadb.org> > <mailto:bar@mariadb.org <mailto:bar@mariadb.org>>> wrote: > > Hi Sergei, > > > On 01/07/2016 08:53 PM, Sergei Golubchik wrote: > > Hi, Alexander! > > > > On Nov 27, Alexander Barkov wrote: > >> Hi Serg, > >> > >> Please review a patch for MDEV-8092. > > > > I didn't review a lot of it, sorry. Because my first two comments > - if > > we agree on them - might cause quite a few changes > (search/replace kind > > of changes, but still). So I'd better look at the whole patch after > > that. > > Sure. Thanks. > > Please see below. > > > > >> diff --git a/sql/field.h b/sql/field.h > >> index 0591914..a40e307 100644 > >> --- a/sql/field.h > >> +++ b/sql/field.h > >> @@ -537,6 +537,115 @@ inline bool > is_temporal_type_with_time(enum_field_types type) > >> } > >> > >> > >> +/** > >> + @class Name > >> + An arbitrary SQL name, consisting of a NULL-terminated string > >> + and its length. > >> + Later it can be reused for table and schema names. > >> +*/ > >> +class Name > >> +{ > >> + const char *m_str; > >> + uint m_length; > > > > Sorry, we have too many "string" classes already. > > Either use LEX_STRING or create a Lex_string > > which is inherited from LEX_STRING, binary compatible with it > > and only adds C++ methods (and can be casted back and forth > automatically). > > So that one could use Lex_string where a LEX_STRING is expected > and vice > > versa > > Sorry for a long reply. > > This is a very convenient moment of time when > we can improve a few things. I disagree to go the easier way > with LEX_STRING. > > > LEX_STRING is a very unfortunately designed thing for the server > side code. > > It was a kind of Ok on 32 bit platforms. > But now on 64bit platforms it's just a waste of memory. > Length can never be large enough to use the entire "size_t". > > Btw, it can even never be long enough to use "uint". It should be enough > to use uint16 for length and use the available remainder (6 bytes) to > store extra useful things, like string metadata. > It will help to avoid some operations, like character set conversion > in the client-server protocol, and I planned to add some optimization > in the future. > > > I propose to do the other way around: > - gradually get rid of using LEX_STRING and LEX_CSTRING for names > in the server side and use more suitable Name-alike classes instead. > - keep LEX_XXX only for compatibility with the existing pure C API, > where it already presents now. > > > Please note, in the current patch version the Name class has > constructors that can accept LEX_STRING and LEX_CSTRING. > This means that you can pass LEX_STRING and LEX_CSTRING to any > function or method that is designed to accept Name, > and the patch already uses this. So this switch will go very smoothly. > > > As for "too many string classes", I have heard this a few times already, > both in MariaDB and earlier in Oracle :) > I don't agree with this statement. > > > Hi Alexander, Sergei! > > Since this discussion might be somewhat similar to a difficulty I'm > facing (see recent email on dev list "Use of std::string"), I figured > I'd join in. > > In that email I suggested making use of std::string instead of regular > raw pointers and length fields. From looking at the patch (mostly just > the name classes, did not investigate everything), the std::string class > would work just fine for the Name class and be very clear cut parent > class for the Column_name. Implementing an equality operator perhaps, > instead of eq_name function would improve clarity also. > > The advantages of using the standard library would be that compilers > provide specific optimizations for them. At the same time, we would not > be adding an extra string API.

I think std::string would be a non-optimal storage for column, table, database names.

There are at least two serious reasons not to use std::string:

1. std::string is a true dynamic string. It supports reallocation. Internally it consists of a structure which is effectively similar to:

{ size_t m_length; size_t m_alloced_length; int m_reference_count; char *m_ptr; }

Note, sizeof(std::string) reports only 8 bytes, but this is just because of implementation tricks. The real minimum size is in fact 32 bytes.

Column names are not dynamic string. Once they're created, they never change. So they don't need reallocation. They only need 10 bytes for a structure like this:

{ char *m_ptr; // 8 bytes uint16 m_length; // 2 bytes // we can use the remaining unaligned 6 bytes for some other purposes // e.g. for string metadata, to avoid character set conversion in many cases }

10 vs 32 is better.

2. Another, and the most important thing, is that names are created on MEM_ROOT, not directly on heap. This is true both the structure itself, and the char array buffer.

std::string must be freed via destructor or delete, otherwise you'll get memory leaks.

Names are automatically freed whenever the containing MEM_ROOT is freed. If we switch to std::string, we'll have to maintain the list of all created names, to free them properly at the end. This will make the things more complicated.

> > Regards, > Vicentiu

Sergei Golubchik

11:48 a.m.

Hi, Alexander! On Jan 08, Alexander Barkov wrote:

...

...
...
diff --git a/sql/field.h b/sql/field.h index 0591914..a40e307 100644 --- a/sql/field.h +++ b/sql/field.h @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) }

+/** + @class Name + An arbitrary SQL name, consisting of a NULL-terminated string + and its length. + Later it can be reused for table and schema names. +*/ +class Name +{ + const char *m_str; + uint m_length;

Sorry, we have too many "string" classes already. Either use LEX_STRING or create a Lex_string which is inherited from LEX_STRING, binary compatible with it and only adds C++ methods (and can be casted back and forth automatically). So that one could use Lex_string where a LEX_STRING is expected and vice versa

Sorry for a long reply.

This is a very convenient moment of time when we can improve a few things.

I agree with this, to a certain extent. Cleanups and refactoring is good. But I'm starting to feel that you do too many too big changes and that makes the task MDEV much larger than it was supposed to be. And much delayed too. Wouldn't it be a good time now to actually start implementing a feature?

...

I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't decrease server memory requirements in any noticeable way.

...

Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future.

I could say that one "should not use LEX_STRING where character sets are involved", but let's postpone that till the end of the email.

...

I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead.

More suitable for *what*? See below.

...

- keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now.

I've checked, LEX_STRING is only used in the plugin API that is versioned and we can safely change LEX_STRING in 10.2 to use, say, uint16, if needed.

...

As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :)

Perhaps, because it's true? :)

...

I don't agree with this statement.

I think that:

1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient.

2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I want to list them all. Then we'll see whether how string classes fit in. 1. char* - zero terminated string. I don't see why one should ever use it (in the server). 2. uchar* - byte array, not zero terminated, no length. Even worse. Never ever. 3. LEX_STRING - a string and a length, no charset. Used in C with a preallocated buffer. 4. LEX_CSTRING - constant string, no charset. 5. LEX_CUSTRING - constant byte array? weird, but possible. 6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to, reallocates automatically. No charset. 7. CSET_STRING - C++ LEX_STRING with a charset. 8. String - swiss army knife C++ string. Uses a preallocated buffer or heap. Grows, shrinks, convert charsets, can copy, move, initialize from anything, print hex, make coffee, escape, append anything, and so on. 9. StringBuffer - helper template to initialize a String from a local fixed buffer (very common usage pattern). And I'm pretty sure I've missed a few. Two questions: - what did I miss? - how do your new classes fits into it? please add 10, 11, etc to the list as you see fit.

...

...
this is logically wrong. A field *has* a name, that's right. But a field *is not* a name. The Field class should include the name, not be inherited from it.

Can I spend a few more seconds of you time, to have some fun?

This reminds me the phrase:

A rabbit IS not only costly fur, but IS also 3-4 kilograms of dietetic meat.

Looks like this phrase is also logically wrong. Or should I rather say it *has* a logically wrong phrasing? :)

Yeah, human languages are often (most of the time?) use words loosely and rely on implicit meaning. Like "can you pass the salt, please?" - the expected answer is to pass the salt, not to say "yes, I can". (and, strictly speaking, human languages don't use words and don't rely on the meaning. Humans do :) Regards, Sergei Chief Architect MariaDB and security@mariadb.org -- Vote for my Percona Live 2016 talks: https://www.percona.com/live/data-performance-conference-2016/sessions/maria... https://www.percona.com/live/data-performance-conference-2016/sessions/maria...

Alexander Barkov

9 Jan 9 Jan

6:07 a.m.

Hi Sergei, On 01/08/2016 03:48 PM, Sergei Golubchik wrote:

...

Hi, Alexander!

On Jan 08, Alexander Barkov wrote:

...
...
...
diff --git a/sql/field.h b/sql/field.h index 0591914..a40e307 100644 --- a/sql/field.h +++ b/sql/field.h @@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type) }

+/** + @class Name + An arbitrary SQL name, consisting of a NULL-terminated string + and its length. + Later it can be reused for table and schema names. +*/ +class Name +{ + const char *m_str; + uint m_length;

Sorry, we have too many "string" classes already. Either use LEX_STRING or create a Lex_string which is inherited from LEX_STRING, binary compatible with it and only adds C++ methods (and can be casted back and forth automatically). So that one could use Lex_string where a LEX_STRING is expected and vice versa

Sorry for a long reply.

This is a very convenient moment of time when we can improve a few things.

I agree with this, to a certain extent. Cleanups and refactoring is good. But I'm starting to feel that you do too many too big changes and that makes the task MDEV much larger than it was supposed to be. And much delayed too.

Wouldn't it be a good time now to actually start implementing a feature?

...
I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't decrease server memory requirements in any noticeable way.

Monty proposes to collect some of the bool flags in Items into bit mask, to save a few bytes, because he thinks it's important. You propose to lose 4 or 6 bytes, because it does not affect server memory in any noticeable ways and thus is not important. I agree with Monty here. Saving a few bytes is at least not harmful and should affect performance positively. A little bit, but still positively. I am not afraid of having more string classes.

...

...
Btw, it can even never be long enough to use "uint". It should be enough to use uint16 for length and use the available remainder (6 bytes) to store extra useful things, like string metadata. It will help to avoid some operations, like character set conversion in the client-server protocol, and I planned to add some optimization in the future.

I could say that one "should not use LEX_STRING where character sets are involved", but let's postpone that till the end of the email.

When column metadata is sent to the client side, there is a character conversion from utf8 to character_set_client. If we remember the column name metadata inside the Name class, we can save on conversion and use faster non-converting methods instead.

...

...
I propose to do the other way around: - gradually get rid of using LEX_STRING and LEX_CSTRING for names in the server side and use more suitable Name-alike classes instead.

More suitable for *what*? See below.

More suitable to store names: - store lengths, which never need more than 2 bytes, in something smaller than size_t - use the remaining unaligned memory for metadata in the future (this can be done after pluggable data types)

...

...
- keep LEX_XXX only for compatibility with the existing pure C API, where it already presents now.

I've checked, LEX_STRING is only used in the plugin API that is versioned and we can safely change LEX_STRING in 10.2 to use, say, uint16, if needed.

LEX_STRING is not always used for names. It is also used to store the entire query, which is limited by max_allowed_packed and can be up to 1G. So there should be two separate things, with - uint16 length, for names. - uint32 length, for longer entities, like the entire query. We currently do not seem to need size_t for lengths. But in the 64-bit world it's very likely that we'll introduce >4Gb blobs sooner or later, and will support >4Gb queries. I propose not to change LEX_STRING.

...

...
As for "too many string classes", I have heard this a few times already, both in MariaDB and earlier in Oracle :)

Perhaps, because it's true? :)

Sorry, I'll stay on my point of view here. Let's measure the code not by amount of classes, but in how these classes fit to store the data.

...

...
I don't agree with this statement.

I think that:

1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient.

2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I want to list them all. Then we'll see whether how string classes fit in.

1. char* - zero terminated string. I don't see why one should ever use it (in the server). 2. uchar* - byte array, not zero terminated, no length. Even worse. Never ever. 3. LEX_STRING - a string and a length, no charset. Used in C with a preallocated buffer. 4. LEX_CSTRING - constant string, no charset. 5. LEX_CUSTRING - constant byte array? weird, but possible. 6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to, reallocates automatically. No charset. 7. CSET_STRING - C++ LEX_STRING with a charset. 8. String - swiss army knife C++ string. Uses a preallocated buffer or heap. Grows, shrinks, convert charsets, can copy, move, initialize from anything, print hex, make coffee, escape, append anything, and so on. 9. StringBuffer - helper template to initialize a String from a local fixed buffer (very common usage pattern).

And I'm pretty sure I've missed a few. Two questions:

- what did I miss?

10. ConnectSE strings, which re-implement String with a different allocator, but otherwise repeat String functionality. 11. There are also places where we have just pairs const char *name; uint/size_t name_length; inside bigger structures.

...

- how do your new classes fits into it?

They do not fit. Various strings have these properties: 1. maximum possible length 2. pointer signness 3. "const" qualifier in the pointer 4. memory allocation/realloction possibility. 5. Needed in the pure C API A column name is a string with these properties: 1. Not longer than 64K (perhaps even not longer than 255) 2. Signed pointer 3. Constant pointer 4. No [re]allocation 5. Not needed in C API Non of the currently existing classes or structures perfectly fit.

...

please add 10, 11, etc to the list as you see fit.

12. Name, with "uint16 length" :) Note, some of these strings could share some code if: 1. We use a template, with abstract pointer and length types. So we can pass say "const char *" as PTR and uint16 as LENGTH, to get Name. String, Name, CSET_STRING can have a common template root. 2. We accept: MDEV-7063 Split String into logical components Unfortunately, nobody was willing to review this patch when I tried to help Olivier not to re-implement String. This is very very very pity and discouraging. I'm crying in the nights and eat anti-depressants :) Well, not really yet, but getting very close to this :)

...

...
...
this is logically wrong. A field *has* a name, that's right. But a field *is not* a name. The Field class should include the name, not be inherited from it.

Can I spend a few more seconds of you time, to have some fun?

This reminds me the phrase:

A rabbit IS not only costly fur, but IS also 3-4 kilograms of dietetic meat.

Looks like this phrase is also logically wrong. Or should I rather say it *has* a logically wrong phrasing? :)

Yeah, human languages are often (most of the time?) use words loosely and rely on implicit meaning. Like "can you pass the salt, please?" - the expected answer is to pass the salt, not to say "yes, I can". (and, strictly speaking, human languages don't use words and don't rely on the meaning. Humans do :)

Regards, Sergei Chief Architect MariaDB and security@mariadb.org

Sergei Golubchik

7:54 p.m.

Hi, Alexander! On Jan 09, Alexander Barkov wrote:

...

...
...
I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't decrease server memory requirements in any noticeable way.

Monty proposes to collect some of the bool flags in Items into bit mask, to save a few bytes, because he thinks it's important. You propose to lose 4 or 6 bytes, because it does not affect server memory in any noticeable ways and thus is not important.

I agree with Monty here. Saving a few bytes is at least not harmful and should affect performance positively. A little bit, but still positively.

I am not afraid of having more string classes.

It is harmful in a sense that it adds more concepts to the source code and a developer needs to keep them in mind when writing new code. These new string classes make code base more complex. Sometimes the added complexity is unavoidable. Sometimes it's unnecessary. Anyway, you say that this change improves the performance? I don't think it does. May even make the performance worse - access to unaligned memory can be slower (presuming you've packed your data, otherwise the compiler can align them, which would destroy all your "space savings"). So, the final proof is to benchmark it and see.

...

When column metadata is sent to the client side, there is a character conversion from utf8 to character_set_client. If we remember the column name metadata inside the Name class, we can save on conversion and use faster non-converting methods instead.

What kind of metadata? The repertoire?

...

...
I've checked, LEX_STRING is only used in the plugin API that is versioned and we can safely change LEX_STRING in 10.2 to use, say, uint16, if needed. I propose not to change LEX_STRING.

Fine :)

...

...
...
I think that:

1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient.

2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I want to list them all. Then we'll see whether how string classes fit in.

1. char* - zero terminated string. I don't see why one should ever use it (in the server). 2. uchar* - byte array, not zero terminated, no length. Even worse. Never ever. 3. LEX_STRING - a string and a length, no charset. Used in C with a preallocated buffer. 4. LEX_CSTRING - constant string, no charset. 5. LEX_CUSTRING - constant byte array? weird, but possible. 6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to, reallocates automatically. No charset. 7. CSET_STRING - C++ LEX_STRING with a charset. 8. String - swiss army knife C++ string. Uses a preallocated buffer or heap. Grows, shrinks, convert charsets, can copy, move, initialize from anything, print hex, make coffee, escape, append anything, and so on. 9. StringBuffer - helper template to initialize a String from a local fixed buffer (very common usage pattern).

And I'm pretty sure I've missed a few. Two questions:

- what did I miss?

10. ConnectSE strings, which re-implement String with a different allocator, but otherwise repeat String functionality.

Ok, so you're basically saying that String is not generic enough and sometimes one needs a String that uses a different allocator. I agree with that.

...

11. There are also places where we have just pairs

const char *name; uint/size_t name_length;

inside bigger structures.

That's basically my #1, we should never do it. They should be replaced with LEX_STRINGs asap.

...

...
- how do your new classes fits into it?

They do not fit.

Various strings have these properties: 1. maximum possible length 2. pointer signness 3. "const" qualifier in the pointer 4. memory allocation/realloction possibility. 5. Needed in the pure C API

Right. So LEX_STRING: signed/noconst/no/C LEX_CSTRING: signed/const/no/C LEX_CUSTRING: unsigned/const/no/C DYNAMIC_STRING: signed/noconst/yes/C CSET_STRING: signed/noconst/no/C++ String: signed/noconst/yes/C++ hmm, apparently, you've forgot 6. charset and other metadata last two classes have that, other do not. And, as you can see, I've omitted "maximum possible length" part, as I don't think it matters much.

...

A column name is a string with these properties: 1. Not longer than 64K (perhaps even not longer than 255) 2. Signed pointer 3. Constant pointer 4. No [re]allocation 5. Not needed in C API

Non of the currently existing classes or structures perfectly fit.

it's signed/const/no/C++ which is pretty much LEX_CSTRING, just as you've originally had in MDEV. If you add 6. knows its charset (and repertoire) then it's CSET_STRING or a (new) CSET_CSTRING, if you want the 'const' qualifier (although, I suspect, that const CSET_STRING might work too).

...

...
please add 10, 11, etc to the list as you see fit.

12. Name, with "uint16 length" :)

I do not think we need it.

...

Note, some of these strings could share some code if:

1. We use a template, with abstract pointer and length types. So we can pass say "const char *" as PTR and uint16 as LENGTH, to get Name. String, Name, CSET_STRING can have a common template root.

Yes.

...

2. We accept:

MDEV-7063 Split String into logical components

Unfortunately, nobody was willing to review this patch when I tried to help Olivier not to re-implement String.

This is assigned to Monty now.

...

This is very very very pity and discouraging. I'm crying in the nights and eat anti-depressants :) Well, not really yet, but getting very close to this :)

Sorry :( Let's speed up the discussion and take this string-class question on the next maria call? Regards, Sergei Chief Architect MariaDB and security@mariadb.org -- Vote for my Percona Live 2016 talks: https://www.percona.com/live/data-performance-conference-2016/sessions/maria... https://www.percona.com/live/data-performance-conference-2016/sessions/maria...

Alexander Barkov

11 Jan 11 Jan

12:58 p.m.

Hi Sergei, On 01/09/2016 11:54 PM, Sergei Golubchik wrote:

...

Hi, Alexander!

On Jan 09, Alexander Barkov wrote:

...
...
...
I disagree to go the easier way with LEX_STRING. LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms. But now on 64bit platforms it's just a waste of memory. Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't decrease server memory requirements in any noticeable way.

Monty proposes to collect some of the bool flags in Items into bit mask, to save a few bytes, because he thinks it's important. You propose to lose 4 or 6 bytes, because it does not affect server memory in any noticeable ways and thus is not important.

I agree with Monty here. Saving a few bytes is at least not harmful and should affect performance positively. A little bit, but still positively.

I am not afraid of having more string classes.

It is harmful in a sense that it adds more concepts to the source code and a developer needs to keep them in mind when writing new code. These new string classes make code base more complex. Sometimes the added complexity is unavoidable. Sometimes it's unnecessary.

Anyway, you say that this change improves the performance? I don't think it does. May even make the performance worse - access to unaligned memory can be slower (presuming you've packed your data, otherwise the compiler can align them, which would destroy all your "space savings"). So, the final proof is to benchmark it and see.

Attached. I run the test as follows: g++ -O3 s.cc && ./a.out Ignore the first iteration (I separated it from the other iterations with "------"), as it warms up the processor, so the very first result is always the worst. As you can see, performance is the same. But: - The size of the one with "unsigned int length" is 16. - The size of the one with "size_t length" is 24. Note, I did not use any packing. The compiler is smart enough to use the unused 4 bytes to store the extra members of the child class when inheriting, like m_repertoire in this example.

...

...
When column metadata is sent to the client side, there is a character conversion from utf8 to character_set_client. If we remember the column name metadata inside the Name class, we can save on conversion and use faster non-converting methods instead.

What kind of metadata? The repertoire?

Yes, at least repertoire. But maybe something else in the future. For example, store short identifiers directly in the memory used by "const char *ptr", so it does not even need any buffer on memroot (or elsewhere). If we do this, there will be a need for a flag telling that there is no an external buffer when the data is inside. Btw, the same thing could be useful in String. IIRC, some OSes do this trick with std::string.

...

...
...
I've checked, LEX_STRING is only used in the plugin API that is versioned and we can safely change LEX_STRING in 10.2 to use, say, uint16, if needed. I propose not to change LEX_STRING.

Fine :)

Moreover, it's not needed for server. Only needed for the plugin API.

...

...
...
...
I think that:

1. Trying to reuse classes that are not fully suitable for a task is more harmful than having more classes, because this is simply not efficient.

2. Nobody objects to have many Item or Field classes that suite a particular functionality better. But some people are very afraid of having more strings. Why? Various strings that we have in the server code *are* different enough. Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I want to list them all. Then we'll see whether how string classes fit in.

1. char* - zero terminated string. I don't see why one should ever use it (in the server). 2. uchar* - byte array, not zero terminated, no length. Even worse. Never ever. 3. LEX_STRING - a string and a length, no charset. Used in C with a preallocated buffer. 4. LEX_CSTRING - constant string, no charset. 5. LEX_CUSTRING - constant byte array? weird, but possible. 6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to, reallocates automatically. No charset. 7. CSET_STRING - C++ LEX_STRING with a charset. 8. String - swiss army knife C++ string. Uses a preallocated buffer or heap. Grows, shrinks, convert charsets, can copy, move, initialize from anything, print hex, make coffee, escape, append anything, and so on. 9. StringBuffer - helper template to initialize a String from a local fixed buffer (very common usage pattern).

And I'm pretty sure I've missed a few. Two questions:

- what did I miss?

10. ConnectSE strings, which re-implement String with a different allocator, but otherwise repeat String functionality.

Ok, so you're basically saying that String is not generic enough and sometimes one needs a String that uses a different allocator. I agree with that.

Excellent. Now guess how happier Olivier and I could be if you said this in October 2014, when I sent the patch implementing this :) https://mariadb.atlassian.net/browse/MDEV-7063

...

...
11. There are also places where we have just pairs

const char *name; uint/size_t name_length;

inside bigger structures.

That's basically my #1, we should never do it. They should be replaced with LEX_STRINGs asap.

Right, or to some class LEX_STRING-alike class. (but let's wait for what we will end up with Name).

...

...
...
- how do your new classes fits into it?

They do not fit.

Various strings have these properties: 1. maximum possible length 2. pointer signness 3. "const" qualifier in the pointer 4. memory allocation/realloction possibility. 5. Needed in the pure C API

Right. So

LEX_STRING: signed/noconst/no/C LEX_CSTRING: signed/const/no/C LEX_CUSTRING: unsigned/const/no/C DYNAMIC_STRING: signed/noconst/yes/C CSET_STRING: signed/noconst/no/C++ String: signed/noconst/yes/C++

hmm, apparently, you've forgot

6. charset and other metadata

last two classes have that, other do not. And, as you can see, I've omitted "maximum possible length" part, as I don't think it matters much.

It matters only for the proper type for the "length" member.

...

...
A column name is a string with these properties: 1. Not longer than 64K (perhaps even not longer than 255) 2. Signed pointer 3. Constant pointer 4. No [re]allocation 5. Not needed in C API

Non of the currently existing classes or structures perfectly fit.

it's signed/const/no/C++ which is pretty much LEX_CSTRING, just as you've originally had in MDEV. If you add

6. knows its charset (and repertoire)

then it's CSET_STRING or a (new) CSET_CSTRING, if you want the 'const' qualifier (although, I suspect, that const CSET_STRING might work too).

Name knows its charset. But it's always utf8. So no needs for a CHARSET_INFO pointer. But having repertoire could help to avoid conversion for pure ASCII identifiers (which is the majority).

...

...
...
please add 10, 11, etc to the list as you see fit.

12. Name, with "uint16 length" :)

I do not think we need it.

What about "uint length"? I.e. 4 bytes for length, thus save 4 bytes for something extra, like repertoire and other flags.

...

...
Note, some of these strings could share some code if:

1. We use a template, with abstract pointer and length types. So we can pass say "const char *" as PTR and uint16 as LENGTH, to get Name. String, Name, CSET_STRING can have a common template root.

Yes.

...
2. We accept:

MDEV-7063 Split String into logical components

Unfortunately, nobody was willing to review this patch when I tried to help Olivier not to re-implement String.

This is assigned to Monty now.

...
This is very very very pity and discouraging. I'm crying in the nights and eat anti-depressants :) Well, not really yet, but getting very close to this :)

Sorry :(

Let's speed up the discussion and take this string-class question on the next maria call?

Well, at this point I'd prefer to postpone "MDEV-7063 Split String into logical components" because Olivier has already copied everything he needed from String to his own class. So this is not urgent now. But I'd like to return to it at some point. Now we need to decide about Name, as it is really urgent. Thanks.

...

Regards, Sergei Chief Architect MariaDB and security@mariadb.org

3442

Age (days ago)

3487

Last active (days ago)

List overview

10 comments

3 participants

participants (3)

Alexander Barkov
Sergei Golubchik
Vicențiu Ciorbaru