Re: [Maria-developers] 2f9bbf392072: MDEV-18734 optimization by taking ownership via String::swap()
Hi, Aleksey! Don't be confused, the commit header and the comment come from your last commit, but the diff below includes all three commits that mention MDEV-18734. On Nov 02, Aleksey Midenkov wrote:
revision-id: 2f9bbf392072 (mariadb-10.2.31-543-g2f9bbf392072) parent(s): 1f4960e3f2ac author: Aleksey Midenkov <midenok@gmail.com> committer: Aleksey Midenkov <midenok@gmail.com> timestamp: 2020-11-02 14:26:04 +0300 message:
MDEV-18734 optimization by taking ownership via String::swap()
ha_partition stores records in array of m_ordered_rec_buffer and uses it for prio queue in ordered index scan. When the records are restored from the array the blob buffers may be already freed or rewritten. This can happen when blob buffers are cached in Field_blob::value or Field_blob::read_value.
Previous solution worked via copying blob buffer into mem_root-allocated buffer. That was not optimal as it requires more memory and processing time.
This patch avoids memory copy by taking temporary ownership of cached blob buffers via String::swap(). It also fixes problem when value.str_length is wrong after Field_blob::store().
diff --git a/mysql-test/suite/federated/federated_partition.test b/mysql-test/suite/federated/federated_partition.test index ef1e27ec5054..33ee025442f4 100644 --- a/mysql-test/suite/federated/federated_partition.test +++ b/mysql-test/suite/federated/federated_partition.test @@ -50,4 +50,30 @@ drop table federated.t1_2;
--echo End of 5.1 tests
+--echo # +--echo # MDEV-18734 ASAN heap-use-after-free upon sorting by blob column from partitioned table +--echo # +connection slave; +use federated; +create table t1_1 (x int, b text, key(x)); +create table t1_2 (x int, b text, key(x)); +connection master; +--replace_result $SLAVE_MYPORT SLAVE_PORT +eval create table t1 (x int, b text, key(x)) engine=federated + partition by range columns (x) ( + partition p1 values less than (40) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_1', + partition pn values less than (maxvalue) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_2' +); +insert t1 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8); +insert t1 select x + 8, x + 8 from t1; +insert t1 select x + 16, x + 16 from t1; +insert t1 select x + 49, repeat(x + 49, 100) from t1; +--echo # This produces wrong result before MDEV-17573 +select * from t1;
What do you mean, simple `select * from t1` produces incorrect result?
+flush tables; +select x, b from t1 where x > 30 and x < 60 order by b;
Is this ASAN-only test?
+drop table t1; +connection slave; +drop table t1_1, t1_2; + source include/federated_cleanup.inc; diff --git a/mysql-test/suite/vcol/t/partition.test b/mysql-test/suite/vcol/t/partition.test index 889724fb1c5c..f10ee0491ccc 100644 --- a/mysql-test/suite/vcol/t/partition.test +++ b/mysql-test/suite/vcol/t/partition.test @@ -30,3 +30,28 @@ subpartition by hash(v) subpartitions 3 ( insert t1 set i= 0; set statement sql_mode= '' for update t1 set i= 1, v= 2; drop table t1; + +--echo # +--echo # MDEV-18734 ASAN heap-use-after-free in my_strnxfrm_simple_internal upon update on versioned partitioned table +--echo # +--echo # Cover queue_fix() in ha_partition::handle_ordered_index_scan() +create table t1 ( + x int auto_increment primary key, + b text, v mediumtext as (b) virtual, + index (v(10)) +) partition by range columns (x) ( + partition p1 values less than (3), + partition p2 values less than (6), + partition pn values less than (maxvalue)); +insert into t1 (b) values ('q'), ('a'), ('b'); +update t1 set b= 'bar' where v > 'a'; +drop table t1;
This test didn't fail for me in an ASAN build on the vanilla 10.5 without your fix. That is, I'm not sure it actually tests the fix.
+ +--echo # Cover return_top_record() in ha_partition::handle_ordered_index_scan() +create table t1 (x int primary key, b tinytext, v text as (b) virtual) +partition by range columns (x) ( + partition p1 values less than (4), + partition pn values less than (maxvalue)); +insert into t1 (x, b) values (1, ''), (2, ''), (3, 'a'), (4, 'b'); +update t1 set b= 'bar' where x > 0 order by v limit 2;
This fails without a fix all right. Still, I wonder whether it's possible to create a test that'll fail in a normal optimized build w/o ASAN
+drop table t1; diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Field_blob::store_length(copy_length); bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once. But here it seems that you only allocate once. You could just make m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage) (plus necessary alignment).
DBUG_RETURN(true);
/* @@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
+ reverse_order= FALSE; break; } @@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() }
+void ha_partition::swap_blobs(uchar * rec_buf, Ordered_blob_storage ** storage, bool restore) +{ + uint *ptr, *end; + uint blob_n= 0; + table->move_fields(table->field, rec_buf, table->record[0]); + for (ptr= table->s->blob_field, end= ptr + table->s->blob_fields; + ptr != end; ++ptr, ++blob_n) + { + DBUG_ASSERT(*ptr < table->s->fields); + Field_blob *blob= (Field_blob*) table->field[*ptr]; + DBUG_ASSERT(blob->flags & BLOB_FLAG); + DBUG_ASSERT(blob->field_index == *ptr); + if (!bitmap_is_set(table->read_set, *ptr) || blob->is_null()) + continue; + + Ordered_blob_storage &s= *storage[blob_n]; + + if (restore) + { + if (!s.blob.is_empty()) + blob->swap(s.blob, s.set_read_value);
looks somewhat risky, why blob.length == 0 means that you don't need to restore the value?
+ } + else + { + bool set_read_value; + String *cached= blob->cached(set_read_value); + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ? When happens what?
+ } + } + } + table->move_fields(table->field, table->record[0], rec_buf); +} + + /* Common routine to handle index_next with ordered results
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Sergei, the updated patches are here: https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28... On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
Don't be confused, the commit header and the comment come from your last commit, but the diff below includes all three commits that mention
MDEV-18734.
On Nov 02, Aleksey Midenkov wrote:
revision-id: 2f9bbf392072 (mariadb-10.2.31-543-g2f9bbf392072) parent(s): 1f4960e3f2ac author: Aleksey Midenkov <midenok@gmail.com> committer: Aleksey Midenkov <midenok@gmail.com> timestamp: 2020-11-02 14:26:04 +0300 message:
MDEV-18734 optimization by taking ownership via String::swap()
ha_partition stores records in array of m_ordered_rec_buffer and uses it for prio queue in ordered index scan. When the records are restored from the array the blob buffers may be already freed or rewritten. This can happen when blob buffers are cached in Field_blob::value or Field_blob::read_value.
Previous solution worked via copying blob buffer into mem_root-allocated buffer. That was not optimal as it requires more memory and processing time.
This patch avoids memory copy by taking temporary ownership of cached blob buffers via String::swap(). It also fixes problem when value.str_length is wrong after Field_blob::store().
diff --git a/mysql-test/suite/federated/federated_partition.test
b/mysql-test/suite/federated/federated_partition.test
index ef1e27ec5054..33ee025442f4 100644 --- a/mysql-test/suite/federated/federated_partition.test +++ b/mysql-test/suite/federated/federated_partition.test @@ -50,4 +50,30 @@ drop table federated.t1_2;
--echo End of 5.1 tests
+--echo # +--echo # MDEV-18734 ASAN heap-use-after-free upon sorting by blob
column from partitioned table > > +--echo # > > +connection slave; > > +use federated; > > +create table t1_1 (x int, b text, key(x)); > > +create table t1_2 (x int, b text, key(x)); > > +connection master; > > +--replace_result $SLAVE_MYPORT SLAVE_PORT > > +eval create table t1 (x int, b text, key(x)) engine=federated > > + partition by range columns (x) ( > > + partition p1 values less than (40) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_1', > > + partition pn values less than (maxvalue) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_2' > > +); > > +insert t1 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8); > > +insert t1 select x + 8, x + 8 from t1; > > +insert t1 select x + 16, x + 16 from t1; > > +insert t1 select x + 49, repeat(x + 49, 100) from t1; > > +--echo # This produces wrong result before MDEV-17573 > > +select * from t1; > > What do you mean, simple `select * from t1` produces incorrect result?
Removed that garbage line.
+flush tables; +select x, b from t1 where x > 30 and x < 60 order by b;
Is this ASAN-only test?
No, it produces wrong result in 10.2 for me.
+drop table t1; +connection slave; +drop table t1_1, t1_2; + source include/federated_cleanup.inc; diff --git a/mysql-test/suite/vcol/t/partition.test
b/mysql-test/suite/vcol/t/partition.test
index 889724fb1c5c..f10ee0491ccc 100644 --- a/mysql-test/suite/vcol/t/partition.test +++ b/mysql-test/suite/vcol/t/partition.test @@ -30,3 +30,28 @@ subpartition by hash(v) subpartitions 3 ( insert t1 set i= 0; set statement sql_mode= '' for update t1 set i= 1, v= 2; drop table t1; + +--echo # +--echo # MDEV-18734 ASAN heap-use-after-free in my_strnxfrm_simple_internal upon update on versioned partitioned table +--echo # +--echo # Cover queue_fix() in ha_partition::handle_ordered_index_scan() +create table t1 ( + x int auto_increment primary key, + b text, v mediumtext as (b) virtual, + index (v(10)) +) partition by range columns (x) ( + partition p1 values less than (3), + partition p2 values less than (6), + partition pn values less than (maxvalue)); +insert into t1 (b) values ('q'), ('a'), ('b'); +update t1 set b= 'bar' where v > 'a'; +drop table t1;
This test didn't fail for me in an ASAN build on the vanilla 10.5 without your fix. That is, I'm not sure it actually tests the fix.
It failed on 10.3 for me some time ago. Ok, replaced that with the bigger version which fails on non-ASAN optimized all right.
+ +--echo # Cover return_top_record() in
ha_partition::handle_ordered_index_scan()
+create table t1 (x int primary key, b tinytext, v text as (b) virtual) +partition by range columns (x) ( + partition p1 values less than (4), + partition pn values less than (maxvalue)); +insert into t1 (x, b) values (1, ''), (2, ''), (3, 'a'), (4, 'b'); +update t1 set b= 'bar' where x > 0 order by v limit 2;
This fails without a fix all right.
Still, I wonder whether it's possible to create a test that'll fail in a normal optimized build w/o ASAN
The new case fails on non-ASAN optimized as well.
+drop table t1; diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint
length,CHARSET_INFO *cs)
copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
Field_blob::store_length(copy_length); bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len,
MYF(MY_WME))))
+ if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocation with presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't wnat this data serialized. It is inconvenient to read the dump of the record. Why we should make it harder to code, read the code and debug while we can make it easier? If it is only for guaranteed 1 malloc, I don't think it's worth it.
(plus necessary alignment).
DBUG_RETURN(true);
/* @@ -6178,7 +6203,11 @@ int
ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
*/ error= file->read_range_first(m_start_key.key? &m_start_key:
NULL,
end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
+ reverse_order= FALSE; break; } @@ -6310,6 +6349,43 @@ int
ha_partition::handle_ordered_index_scan_key_not_found()
}
+void ha_partition::swap_blobs(uchar * rec_buf, Ordered_blob_storage ** storage, bool restore) +{ + uint *ptr, *end; + uint blob_n= 0; + table->move_fields(table->field, rec_buf, table->record[0]); + for (ptr= table->s->blob_field, end= ptr + table->s->blob_fields; + ptr != end; ++ptr, ++blob_n) + { + DBUG_ASSERT(*ptr < table->s->fields); + Field_blob *blob= (Field_blob*) table->field[*ptr]; + DBUG_ASSERT(blob->flags & BLOB_FLAG); + DBUG_ASSERT(blob->field_index == *ptr); + if (!bitmap_is_set(table->read_set, *ptr) || blob->is_null()) + continue; + + Ordered_blob_storage &s= *storage[blob_n]; + + if (restore) + { + if (!s.blob.is_empty()) + blob->swap(s.blob, s.set_read_value);
looks somewhat risky, why blob.length == 0 means that you don't need to restore the value?
We store only non-empty blobs therefore we restore non-empty blobs. We can swap empty blobs too but why?
+ } + else + { + bool set_read_value; + String *cached= blob->cached(set_read_value); + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
+ } + } + } + table->move_fields(table->field, table->record[0], rec_buf); +} + + /* Common routine to handle index_next with ordered results
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
Hi Sergei! The updated patches are here: https://github.com/MariaDB/server/commits/2c2e79487a9189e02fce8f884a9ab2b42b... On Tue, Nov 3, 2020 at 8:28 PM Aleksey Midenkov <midenok@gmail.com> wrote:
Sergei,
the updated patches are here:
https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28...
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
Don't be confused, the commit header and the comment come from your last commit, but the diff below includes all three commits that mention
...
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocation with
MDEV-18734. presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't want this data serialized. It is inconvenient to read the dump of
the record. Why we should make it harder to code, read the code and debug while we can make it easier? If it is only for guaranteed 1 malloc, I don't think it's worth it.
Added another patch that does 1 malloc in m_ordered_root. -- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! On Nov 03, Aleksey Midenkov wrote:
the updated patches are here:
https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28...
I started writing my comments there, but realized that it'd be clearer if I reply here, more context. That is, see below. One comment about the new test case, you have 16K lines in the result file, are all 16K necessary? Can you do something like select x, left(b,10), left(v,10) ? And please, pretty please, don't use non-constant references as function arguments.
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit. If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
Field_blob::store_length(copy_length); bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocation with presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
memroot wasn't created for that. There are approaches with much less overhead
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't wnat this data serialized. It is inconvenient to read the dump of the record. Why we should make it harder to code, read the code and debug while we can make it easier? If it is only for guaranteed 1 malloc, I don't think it's worth it.
What do you mean "serialized"? What do you mean "read the dump"? Note that you can also use my_multi_malloc, that allocates many buffers together and hides address arithmetcs from the caller. Like my_multi_malloc(MYF(MY_WME), &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS, &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage), NULL) (but with error checking). And then, perhaps, with placement new[] you could construct all Ordered_blob_storage objects in one line.
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that memcopy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite, making the common, no-error code path faster, fewer conditionals, in particular. At the cost of making the error path slower, if needed.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see. This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
+ } + } + } + table->move_fields(table->field, table->record[0], rec_buf); +}
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Nov 03, Aleksey Midenkov wrote:
the updated patches are here:
https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28...
I started writing my comments there, but realized that it'd be clearer if I reply here, more context. That is, see below.
One comment about the new test case, you have 16K lines in the result file, are all 16K necessary? Can you do something like
select x, left(b,10), left(v,10)
?
Yes, these shorter result cases reproduce as well. Updated.
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit.
If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
It is tested by federated.federated_partition test case. Why do we have to extract this side-fix to a separate commit? Looks anyway as a bureaucracy to me. The main bug is critical, the test case tests both bugs nicely. If the partitioning code is refactored, the test case still tests that side-fix.
Field_blob::store_length(copy_length); bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocation with presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
memroot wasn't created for that. There are approaches with much less overhead
Using memory pools like that is a nice design pattern for non-critical sections. At least it is nicer than manual pointer management. I believe I've seen the similar pattern in InnoDB with its memory pool mem_heap_t even without preallocation! It would be nice to have a special class that preallocates and manages pointers, but I do not see the reason to develop that for this specific use case. I don't believe in *much* overhead: mem_root is a simplest algorithm, a couple of ifs and one loop. Initialization call init_record_priority_queue() is not a critical section, the major execution is mostly busy by a data copy in handle_ordered_index_scan(), handle_ordered_next(). Let's see gprof stats collected for 100k SELECTs attached in stats.txt init_record_priority_queue() takes 0.3%; handle_ordered_index_scan() takes 2.3%; handle_ordered_next() takes 12.3%. Let's see unpatched stats for the same 100k SELECTs in stats_vanilla.txt init_record_priority_queue() takes 0.1%; handle_ordered_index_scan() takes 2.2%; handle_ordered_next() takes 11.4%. My version of init_record_priority_queue() lost 0.2%. From details we see that mem_root took 6 out of 13 which is 0.15%. One should keep in mind this is an artificial usage of the prio queue. In diversified real-world usage the share of such kind SELECT is say 0.1%. But even if it is 1% we divide our stats by 100. So, do you really care about 0.0015%?
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't wnat this data serialized. It is inconvenient to read the dump of the record. Why we should make it harder to code, read the code and debug while we can make it easier? If it is only for guaranteed 1 malloc, I don't think it's worth it.
What do you mean "serialized"?
A sort of memory organization when data is placed into a single contiguous buffer.
What do you mean "read the dump"?
x command in GDB.
Note that you can also use my_multi_malloc, that allocates many buffers together and hides address arithmetcs from the caller. Like
my_multi_malloc(MYF(MY_WME), &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS, &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage), NULL)
(but with error checking). And then, perhaps, with placement new[] you could construct all Ordered_blob_storage objects in one line.
Sorry, I don't understand how my version is much worse.
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that memcopy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite, making the common, no-error code path faster, fewer conditionals, in particular. At the cost of making the error path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is close to 0, it is not even 0.0001%, it is much lower. OTOH memcpy() slowdown is much higher and this is not a fully exceptional code path: HA_ERR_KEY_NOT_FOUND continues the loop.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
+ } + } + } + table->move_fields(table->field, table->record[0], rec_buf); +}
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
Hi Sergei! Users could benefit from this fix of MDEV-18734 since the last changes. Do you really think the open questions left are so important for this bug to be still not pushed? On Tue, Jan 26, 2021 at 1:51 AM Aleksey Midenkov <midenok@gmail.com> wrote:
Hi Sergei,
On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Nov 03, Aleksey Midenkov wrote:
the updated patches are here:
https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28...
I started writing my comments there, but realized that it'd be clearer if I reply here, more context. That is, see below.
One comment about the new test case, you have 16K lines in the result file, are all 16K necessary? Can you do something like
select x, left(b,10), left(v,10)
?
Yes, these shorter result cases reproduce as well. Updated.
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit.
If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
It is tested by federated.federated_partition test case. Why do we have to extract this side-fix to a separate commit? Looks anyway as a bureaucracy to me. The main bug is critical, the test case tests both bugs nicely. If the partitioning code is refactored, the test case still tests that side-fix.
Field_blob::store_length(copy_length); bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc index 9d82df0235c7..7c658082d397 100644 --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocation with presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
memroot wasn't created for that. There are approaches with much less overhead
Using memory pools like that is a nice design pattern for non-critical sections. At least it is nicer than manual pointer management. I believe I've seen the similar pattern in InnoDB with its memory pool mem_heap_t even without preallocation! It would be nice to have a special class that preallocates and manages pointers, but I do not see the reason to develop that for this specific use case.
I don't believe in *much* overhead: mem_root is a simplest algorithm, a couple of ifs and one loop. Initialization call init_record_priority_queue() is not a critical section, the major execution is mostly busy by a data copy in handle_ordered_index_scan(), handle_ordered_next().
Let's see gprof stats collected for 100k SELECTs attached in stats.txt
init_record_priority_queue() takes 0.3%; handle_ordered_index_scan() takes 2.3%; handle_ordered_next() takes 12.3%.
Let's see unpatched stats for the same 100k SELECTs in stats_vanilla.txt
init_record_priority_queue() takes 0.1%; handle_ordered_index_scan() takes 2.2%; handle_ordered_next() takes 11.4%.
My version of init_record_priority_queue() lost 0.2%. From details we see that mem_root took 6 out of 13 which is 0.15%.
One should keep in mind this is an artificial usage of the prio queue. In diversified real-world usage the share of such kind SELECT is say 0.1%. But even if it is 1% we divide our stats by 100. So, do you really care about 0.0015%?
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't wnat this data serialized. It is inconvenient to read the dump of the record. Why we should make it harder to code, read the code and debug while we can make it easier? If it is only for guaranteed 1 malloc, I don't think it's worth it.
What do you mean "serialized"?
A sort of memory organization when data is placed into a single contiguous buffer.
What do you mean "read the dump"?
x command in GDB.
Note that you can also use my_multi_malloc, that allocates many buffers together and hides address arithmetcs from the caller. Like
my_multi_malloc(MYF(MY_WME), &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS, &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage), NULL)
(but with error checking). And then, perhaps, with placement new[] you could construct all Ordered_blob_storage objects in one line.
Sorry, I don't understand how my version is much worse.
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that memcopy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite, making the common, no-error code path faster, fewer conditionals, in particular. At the cost of making the error path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is close to 0, it is not even 0.0001%, it is much lower. OTOH memcpy() slowdown is much higher and this is not a fully exceptional code path: HA_ERR_KEY_NOT_FOUND continues the loop.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
+ } + } + } + table->move_fields(table->field, table->record[0], rec_buf); +}
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
-- All the best,
Aleksey Midenkov @midenok
-- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! I think it's better to wrap it up quickly, agree on changes and push the fix before the next release. On Jun 01, Aleksey Midenkov wrote:
Hi Sergei!
Users could benefit from this fix of MDEV-18734 since the last changes. Do you really think the open questions left are so important for this bug to be still not pushed?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei! Do you plan to fix everything you just said and push yourself or let me push it as is? Btw, there were many allocations over one loop: that's why I used mem_root. multi_malloc() won't do the job there. That could be done either with explicit moving pointer or some utility class holding such pointer. I decided to upgrade mem_root for such pattern so it could do the double job and we had a cleaner code. On Thu, Jun 3, 2021 at 2:00 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
I think it's better to wrap it up quickly, agree on changes and push the fix before the next release.
On Jun 01, Aleksey Midenkov wrote:
Hi Sergei!
Users could benefit from this fix of MDEV-18734 since the last changes. Do you really think the open questions left are so important for this bug to be still not pushed?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! On Jun 03, Aleksey Midenkov wrote:
Hi Sergei!
Do you plan to fix everything you just said and push yourself or let me push it as is?
Neither. As a reviewer I plan to reply "ok to push" when the code is changed according to the review.
Btw, there were many allocations over one loop: that's why I used mem_root. multi_malloc() won't do the job there. That could be done either with explicit moving pointer or some utility class holding such pointer. I decided to upgrade mem_root for such pattern so it could do the double job and we had a cleaner code.
Yes, I mean my_multi_malloc(MYF(MY_WME), &m_ordered_rec_buffer, alloc_len, &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage *), &ptr, table->s->blob_fields * sizeof(Ordered_blob_storage), NULL); for (uint j= 0; j < table->s->blob_fields; ++j, ++ptr) blob_storage[j]= new (ptr) Ordered_blob_storage; Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Aleksey! On Jan 26, Aleksey Midenkov wrote:
On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
On Nov 03, Aleksey Midenkov wrote:
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
Because this is the policy. If you convince Monty that non-constant references are fine - feel free to use them.
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit.
If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
It is tested by federated.federated_partition test case.
Do you mean, that if I remove this line, federated.federated_partition test will fail? If yes - good, this is the test I meant, no need to create another one.
Why do we have to extract this side-fix to a separate commit?
Because later when someone looks at the code it helps to be able to understand what a commit is actually fixing.
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
One should keep in mind this is an artificial usage of the prio queue. In diversified real-world usage the share of such kind SELECT is say 0.1%. But even if it is 1% we divide our stats by 100. So, do you really care about 0.0015%?
A right tool for a job. If you need to allocate few buffers at the same time - use multi_alloc, if you do many allocations over time and want to fee them at once - use memroot.
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that memcopy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite, making the common, no-error code path faster, fewer conditionals, in particular. At the cost of making the error path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is close to 0, it is not even 0.0001%, it is much lower. OTOH memcpy() slowdown is much higher and this is not a fully exceptional code path: HA_ERR_KEY_NOT_FOUND continues the loop.
This is a wrong code pattern, you should optimize for the normal code path, not for errors. Errors are exceptions, they're much more rare than non-errors.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
Uhm, yes? This would be a notable code simplification, and it won't take long, a low-hanging fruit. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei! On Thu, Jun 3, 2021 at 1:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Jan 26, Aleksey Midenkov wrote:
On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
On Nov 03, Aleksey Midenkov wrote:
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
Because this is the policy. If you convince Monty that non-constant references are fine - feel free to use them.
If this is the policy, why there is void String::swap(String &s) since 2004? Are you asking me to pass the pointer and then dereference it and pass it as a reference into String class? That would look weird to me. I'm going to agree if you insist, but really?
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc index bdaaecc20269..0fd40c979d2c 100644 --- a/sql/field.cc +++ b/sql/field.cc @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs) copy_length= copier.well_formed_copy(field_charset, (char*) value.ptr(), new_length, cs, from, length); + value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from the management people and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit.
If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
It is tested by federated.federated_partition test case.
Do you mean, that if I remove this line, federated.federated_partition test will fail? If yes - good, this is the test I meant, no need to create another one.
Yes, the test will fail.
Why do we have to extract this side-fix to a separate commit?
Because later when someone looks at the code it helps to be able to understand what a commit is actually fixing.
As long as both fixes are important and there is no sense to apply one without the other I see separate commits as a measure to satisfy someone's curiosity which I do not share (and there is nothing to be curious about that one-liner). Though I'm going to agree if you ask me to do that.
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc --- a/sql/ha_partition.cc +++ b/sql/ha_partition.cc @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue() uint alloc_len; uint used_parts= bitmap_bits_set(&m_part_info->read_partitions); /* Allocate record buffer for each used partition. */ - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS; + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET; if (!m_using_extended_keys) m_priority_queue_rec_len += m_file[0]->ref_length; alloc_len= used_parts * m_priority_queue_rec_len; /* Allocate a key for temporary use when setting up the scan. */ alloc_len+= table_share->max_key_length;
- if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME)))) + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. memroot is needed when you allocate later at difeerent points in time some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
One should keep in mind this is an artificial usage of the prio queue. In diversified real-world usage the share of such kind SELECT is say 0.1%. But even if it is 1% we divide our stats by 100. So, do you really care about 0.0015%?
A right tool for a job. If you need to allocate few buffers at the same time - use multi_alloc, if you do many allocations over time and want to fee them at once - use memroot.
Changed to my_multi_alloc().
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order) */ error= file->read_range_first(m_start_key.key? &m_start_key: NULL, end_range, eq_range, TRUE); - memcpy(rec_buf_ptr, table->record[0], m_rec_length); + if (!error) + { + memcpy(rec_buf_ptr, table->record[0], m_rec_length); + }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that memcopy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite, making the common, no-error code path faster, fewer conditionals, in particular. At the cost of making the error path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is close to 0, it is not even 0.0001%, it is much lower. OTOH memcpy() slowdown is much higher and this is not a fully exceptional code path: HA_ERR_KEY_NOT_FOUND continues the loop.
This is a wrong code pattern, you should optimize for the normal code path, not for errors. Errors are exceptions, they're much more rare than non-errors.
Okay. Looks like no sense to argue that anymore. Reverted.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() + if (cached) + { + cached->swap(s.blob); + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
Uhm, yes? This would be a notable code simplification, and it won't take long, a low-hanging fruit.
I'm not sure if this is a "fruit". Can you please elaborate the change? Use so-called Ordered_blob_storage instead of read_value and do swap_blobs() on store_record()? And have 2 arrays of Ordered_blob_storage for each TABLE::record like that: diff --git a/sql/table.h b/sql/table.h index 69bd14b2834..40042dc1df0 100644 --- a/sql/table.h +++ b/sql/table.h @@ -1151,6 +1151,8 @@ struct TABLE THD *in_use; /* Which thread uses this */ uchar *record[2]; /* Pointer to records */ + Ordered_blob_storage **blob_storage[2]; /* NULL-terminated arrays of pointers + to blob-storages */ uchar *write_row_record; /* Used as optimisation in THD::write_row */ uchar *insert_values; /* used by INSERT ... UPDATE */
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
-- All the best, Aleksey Midenkov @midenok
Hi, Aleksey! On Jun 10, Aleksey Midenkov wrote:
On Thu, Jun 3, 2021 at 1:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Jan 26, Aleksey Midenkov wrote:
On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
On Nov 03, Aleksey Midenkov wrote:
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
Because this is the policy. If you convince Monty that non-constant references are fine - feel free to use them.
If this is the policy, why there is
void String::swap(String &s)
since 2004?
because, of course, we've never had a consistent written policy that everyone would've followed. And because Monty tends to revert new commits, not 17yr old ones.
Are you asking me to pass the pointer and then dereference it and pass it as a reference into String class? That would look weird to me. I'm going to agree if you insist, but really?
Oh, right. Your swap() is just calling String::swap(), so it copies its prototype. Makes sense. Okay, let's keep it like String::swap() Still, your String * cached(bool &set_read_value) isn't "inheriting" the prototype from anything, better use a pointer there.
It is tested by federated.federated_partition test case.
Do you mean, that if I remove this line, federated.federated_partition test will fail? If yes - good, this is the test I meant, no need to create another one.
Yes, the test will fail.
Great, so no new test is needed.
Why do we have to extract this side-fix to a separate commit?
Because later when someone looks at the code it helps to be able to understand what a commit is actually fixing.
As long as both fixes are important and there is no sense to apply one without the other I see separate commits as a measure to satisfy someone's curiosity which I do not share (and there is nothing to be curious about that one-liner). Though I'm going to agree if you ask me to do that.
Generally, yes, it's *hugely* helpful to look at the line and see why it was added. For example, this commit: https://github.com/MariaDB/server/commit/aba91154264 Here it's very clear what's happening, one doesn't need even to read the commit comment or the test. As a counter example, MySQL commit (they have a policy for doing that, as far as I understand, our tree doesn't have such monsters): https://github.com/mysql/mysql-server/commit/67c3c70e489 447 files changed, 28239 insertions(+), 21952 deletions(-) many are tests, but in sql alone: 126 files changed, 8975 insertions(+), 6318 deletions(-) But in this particular case of your commit. If both changes, indeed, make no sense without each other, then they could be in the same commit.
> @@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found() > + if (cached) > + { > + cached->swap(s.blob); > + s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When happens what?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing, perhaps read_value should go away and both bugs should have a one unified solution that allows to "park" a record somewhere out of record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
Uhm, yes? This would be a notable code simplification, and it won't take long, a low-hanging fruit.
I'm not sure if this is a "fruit". Can you please elaborate the change? Use so-called Ordered_blob_storage instead of read_value and do swap_blobs() on store_record()? And have 2 arrays of Ordered_blob_storage for each TABLE::record like that:
May be it isn't a as low-hanging as it looked, indeed. Let's get your fix first, then I'll check again. Just don't like to different approaches to essentially the same problem Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
participants (2)
-
Aleksey Midenkov
-
Sergei Golubchik