[Maria-developers] InnoDB: Instant ADD COLUMN that works on old data files
I would like to seek technical review and feedback for an idea. In the long term, I would like InnoDB to support instant ALTER TABLE for a larger set of operations: https://jira.mariadb.org/browse/MDEV-11424 Instant ALTER TABLE of failure-free record format changes Probably one of the most common operations is ADD COLUMN. For that, a ticket had already been filed and some patches exist: https://jira.mariadb.org/browse/MDEV-11369 Instant add column for InnoDB Tencent and Alibaba have ADD COLUMN patches that basically introduce new InnoDB record formats so that each record will carry the number of columns, similar to how ROW_FORMAT=REDUNDANT (the original InnoDB row format) works. A problem with that approach is that we would only be able to support instant ADD COLUMN on data files that already are in a suitable format. The whole purpose of instant ALTER TABLE is to avoid conversions of huge data files. Requiring a conversion before the feature can be used would in my opinion defeat the purpose. I came up with a solution that only requires some changes to the InnoDB page format, not to the record formats: Require each record in a clustered index leaf page to contain the same number of fields (on write, convert the whole page). We can repurpose the previously unused (always 0) field PAGE_MAX_TRX_ID on clustered index leaf pages so that the number of fields in each leaf page can be determined. This is the idea that I posted in MDEV-11369 last week and edited today. As far as I can tell, it should work on any existing data file, and it would be a step towards the full-blown MDEV-11424 in a distant future version of MariaDB. One challenge with the page-at-a-time conversion is that inserting or updating records into a leaf page may require the page to be split into several pages, especially in the case where multiple CHAR(xxx) NOT NULL columns would need to be materialized as a result of the insert or update. Does this seem to make sense to you? I am looking forward to your feedback. Best regards, Marko -- DON’T MISS M|17 April 11 - 12, 2017 The Conrad Hotel New York City https://m17.mariadb.com/ Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
Hi, Number of the fields indeed would solve the ADD COLUMN default case but has some limitations ADD COLUMN m default xxx; would be still slow, xxx = NULL maybe possible, in other cases could we store default on SYS_COLUMNS? DROP COLUMN would be still slow Problem naturally is that we do not have suitable metadata on every page where to identify what columns are stored. I do not see way out of this problem without introducing yet a another file forma (we have too many of them already). R: Jan On Tue, Jan 3, 2017 at 8:13 PM, Marko Mäkelä <marko.makela@mariadb.com> wrote:
I would like to seek technical review and feedback for an idea.
In the long term, I would like InnoDB to support instant ALTER TABLE for a larger set of operations:
https://jira.mariadb.org/browse/MDEV-11424 Instant ALTER TABLE of failure-free record format changes
Probably one of the most common operations is ADD COLUMN. For that, a ticket had already been filed and some patches exist:
https://jira.mariadb.org/browse/MDEV-11369 Instant add column for InnoDB
Tencent and Alibaba have ADD COLUMN patches that basically introduce new InnoDB record formats so that each record will carry the number of columns, similar to how ROW_FORMAT=REDUNDANT (the original InnoDB row format) works. A problem with that approach is that we would only be able to support instant ADD COLUMN on data files that already are in a suitable format.
The whole purpose of instant ALTER TABLE is to avoid conversions of huge data files. Requiring a conversion before the feature can be used would in my opinion defeat the purpose.
I came up with a solution that only requires some changes to the InnoDB page format, not to the record formats: Require each record in a clustered index leaf page to contain the same number of fields (on write, convert the whole page). We can repurpose the previously unused (always 0) field PAGE_MAX_TRX_ID on clustered index leaf pages so that the number of fields in each leaf page can be determined. This is the idea that I posted in MDEV-11369 last week and edited today. As far as I can tell, it should work on any existing data file, and it would be a step towards the full-blown MDEV-11424 in a distant future version of MariaDB.
One challenge with the page-at-a-time conversion is that inserting or updating records into a leaf page may require the page to be split into several pages, especially in the case where multiple CHAR(xxx) NOT NULL columns would need to be materialized as a result of the insert or update.
Does this seem to make sense to you? I am looking forward to your feedback.
Best regards,
Marko -- DON’T MISS M|17 April 11 - 12, 2017 The Conrad Hotel New York City https://m17.mariadb.com/
Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
Hi Jan, On Wed, Jan 11, 2017 at 9:07 AM, Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Number of the fields indeed would solve the ADD COLUMN default case but has some limitations
ADD COLUMN m default xxx; would be still slow, xxx = NULL maybe possible, in other cases could we store default on SYS_COLUMNS?
The instant ADD COLUMN described in MDEV-11369 is just a start. Changing the DEFAULT of an instantly added column would require a rebuild, but specifying the DEFAULT at the time of adding the column can be done instantly. Changing a previously NOT NULL column to allow NULLs would be in the realm of https://jira.mariadb.org/browse/MDEV-11424 Instant ALTER TABLE of failure-free record format changes. Currently the DEFAULT values of columns are stored into the *.frm files by the SQL layer, not at all in InnoDB. If we allow instant changes of the DEFAULT values of instantly added columns, we must make sure that any rows that were inserted before the change to the DEFAULT would use the old default value. I believe that for this, we would effectively have to store all default values that ever existed, and we would have to identify when each record or page was created with respect to the instant ALTER operations. MDEV-11424 does cover all that. Monty has expressed a long-term goal of allowing the server to discover InnoDB tables after *.frm and *.ibd files were copied to the data directory. I think that we should avoid extending the InnoDB SYS_* tables any further, because that would seem to move us further away from the goal. I think that any metadata related to the instant ALTER has to be stored in the data file itself, and the high-level data dictionary should only reflect the latest committed data definition. I hope that we can some day deliver a transactional data dictionary in MDEV-11655 and remove all SYS_* tables.
DROP COLUMN would be still slow
Yes, DROP COLUMN (as well as changing the order of columns, or adding a column somewhere in the middle) would require a table rebuild until MDEV-11424 is implemented.
Problem naturally is that we do not have suitable metadata on every page where to identify what columns are stored. I do not see way out of this problem without introducing yet a another file forma (we have too many of them already).
Yes, that is the case. I agree with you that we must be careful when introducing new file formats. Luckily we do have 8 previously unused bytes (always written as 0 until now) in each clustered index page, the PAGE_MAX_TRX_ID field. (MDEV-6076 in MariaDB 10.2.4 will repurpose the field in clustered index root pages for a persistent AUTO_INCREMENT field.) In MDEV-11369 we repurpose a subset 1..1023 of the values, and in MDEV-11424 we can repurpose the remaining values to indicate the format of the records in the page. MDEV-11424 would also require a metadata history (or record conversion recipes) to be stored somewhere in the data file, separately from the clustered index tree. Marko -- DON’T MISS M|17 April 11 - 12, 2017 The Conrad Hotel New York City https://m17.mariadb.com/ Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
participants (2)
-
Jan Lindström
-
Marko Mäkelä