Hi Sergei, On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Aleksey!
On Nov 03, Aleksey Midenkov wrote:
the updated patches are here:
https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28...
I started writing my comments there, but realized that it would be clearer if I reply here, with more context. That is, see below.
One comment about the new test case: you have 16K lines in the result file. Are all 16K necessary? Can you do something like
select x, left(b,10), left(v,10)
?
Yes, these shorter result cases reproduce as well. Updated.
And please, pretty please, don't use non-constant references as function arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@mariadb.org> wrote:
diff --git a/sql/field.cc b/sql/field.cc
index bdaaecc20269..0fd40c979d2c 100644
--- a/sql/field.cc
+++ b/sql/field.cc
@@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs)
   copy_length= copier.well_formed_copy(field_charset,
                                        (char*) value.ptr(), new_length,
                                        cs, from, length);
+  value.length(copy_length);
good! Could this be a distinct bug with its own test case?
Do we need to spend time on that bureaucracy? There are real problems to solve... Tickets drain time from management and generally from anyone who sees them.
I didn't mean opening an MDEV. Only writing a test case and extracting it into a separate commit.
If this is a genuine bug fix and you don't create a test case - how can you be sure the fix will stay? It may disappear in a merge or a rebase, manually or automatically. Or in a refactoring that "looks fine, all tests pass". If you can create a test case - please, do.
It is tested by the federated.federated_partition test case. Why do we have to extract this side-fix into a separate commit? That still looks like bureaucracy to me. The main bug is critical, and the test case tests both bugs nicely. If the partitioning code is refactored, the test case will still cover the side-fix.
  Field_blob::store_length(copy_length);
  bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
index 9d82df0235c7..7c658082d397 100644
--- a/sql/ha_partition.cc
+++ b/sql/ha_partition.cc
@@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue()
   uint alloc_len;
   uint used_parts= bitmap_bits_set(&m_part_info->read_partitions);
   /* Allocate record buffer for each used partition. */
-  m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS;
+  m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET;
   if (!m_using_extended_keys)
     m_priority_queue_rec_len += m_file[0]->ref_length;
   alloc_len= used_parts * m_priority_queue_rec_len;
   /* Allocate a key for temporary use when setting up the scan. */
   alloc_len+= table_share->max_key_length;
-  if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME))))
+  if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
I don't see why you need a memroot here. A memroot is needed when you later allocate, at different points in time, some initially unpredictable amount of memory that has the same lifetime and needs to be freed all at once.
I know... This is just another convenient way of allocating, with a presumably minimized number of mallocs (probably pre_alloc_size should be tweaked to a better value).
memroot wasn't created for that. There are approaches with much less overhead.
Using memory pools like that is a nice design pattern for non-critical sections. At least it is nicer than manual pointer management. I believe I've seen a similar pattern in InnoDB with its memory pool mem_heap_t, even without preallocation! It would be nice to have a special class that preallocates and manages pointers, but I do not see a reason to develop that for this specific use case.

I don't believe in *much* overhead: mem_root is the simplest of algorithms, a couple of ifs and one loop. The initialization call init_record_priority_queue() is not a critical section; the bulk of execution is busy with data copies in handle_ordered_index_scan() and handle_ordered_next().

Let's look at the gprof stats collected for 100k SELECTs, attached in stats.txt: init_record_priority_queue() takes 0.3%; handle_ordered_index_scan() takes 2.3%; handle_ordered_next() takes 12.3%. Now the unpatched stats for the same 100k SELECTs in stats_vanilla.txt: init_record_priority_queue() takes 0.1%; handle_ordered_index_scan() takes 2.2%; handle_ordered_next() takes 11.4%. My version of init_record_priority_queue() lost 0.2%. From the details we see that mem_root took 6 out of 13, which is 0.15%.

One should keep in mind that this is an artificial usage of the prio queue. In diversified real-world usage the share of such SELECTs is, say, 0.1%. But even if it is 1%, we divide our stats by 100. So, do you really care about 0.0015%?
But here it seems that you only allocate once. You could just make
m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS + table->s->blob_fields * sizeof(Ordered_blob_storage)
I don't want this data serialized. It is inconvenient to read the dump of the record. Why should we make it harder to code, read, and debug when we can make it easier? If it is only for a guaranteed single malloc, I don't think it's worth it.
What do you mean "serialized"?
A sort of memory organization where data is placed into a single contiguous buffer.
What do you mean "read the dump"?
The x command in GDB.
Note that you can also use my_multi_malloc, which allocates many buffers together and hides the address arithmetic from the caller. Like
my_multi_malloc(MYF(MY_WME),
                &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS,
                &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage),
                NULL)
(but with error checking). And then, perhaps, with placement new[] you could construct all Ordered_blob_storage objects in one line.
Sorry, I don't understand how my version is much worse.
@@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
   */
   error= file->read_range_first(m_start_key.key? &m_start_key: NULL,
                                 end_range, eq_range, TRUE);
-  memcpy(rec_buf_ptr, table->record[0], m_rec_length);
+  if (!error)
+  {
+    memcpy(rec_buf_ptr, table->record[0], m_rec_length);
+  }
did you have a bug because of that? something didn't work?
No, I just saw worthless memcpy and avoided it.
You're right that the memcpy is unnecessary in case of an error. But most of the time there is no error. You made an exceptional code path faster by making a common code path slower. I think it's generally better to do the opposite: make the common, no-error code path faster (fewer conditionals, in particular), at the cost of making the error path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is close to 0; it is not even 0.0001%, it is much lower. OTOH, the memcpy() cost is much higher, and this is not a fully exceptional code path: HA_ERR_KEY_NOT_FOUND continues the loop.
@@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found()
+      if (cached)
+      {
+        cached->swap(s.blob);
+        s.set_read_value= set_read_value;
is it indeed possible here for a value to be either in `value` or in `read_value` ?
When does what happen?
Yes, it uses `value` for plain fields and `read_value` for virtual fields.
Oh, I see.
This read_value was introduced in ea1b25046c81db8fdf to solve the case of UPDATE, where a row is moved with store_record(table, record[1]) and blob pointers aren't updated. This is very similar to what you're fixing; perhaps read_value should go away, and both bugs should get one unified solution that allows "parking" a record somewhere outside record[0] and restoring it later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of *much* more important tasks.
+      }
+    }
+  }
+  table->move_fields(table->field, table->record[0], rec_buf);
+}
Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org
--
All the best,
Aleksey Midenkov
@midenok