Odp: Re: Re: Rocksdb as a replacement for toku?
Or you could just put it on ZFS and let it do all of the difficult stuff for you. :-)

On Thu, Aug 24, 2023 at 6:40 PM pslawek83 <pslawek83@o2.pl> wrote:
Hi Marko, thanks for the response ... so for the DOCS part
https://mariadb.com/kb/en/innodb-page-compression/

When InnoDB page compression is used, InnoDB may still write the compressed page to the tablespace file with the original size of the uncompressed page, which would be equivalent to the value of the innodb_page_size system variable. This is done by design, because when InnoDB's I/O code needs to read the page from disk, it can only read the full page size. However, this is obviously not optimal. This punch hole technique allows InnoDB to read the compressed page from disk as the full page size, even though the compressed page really takes up less space on the file system.
So as far as I understand, if compression is enabled the server will write, e.g., a 40% utilized page which will later be decompressed when read, but this compression/decompression won't have any advantage if there's no punch hole support. The only difference will be that the block will be filled with zeros for the empty space at the end. Or is it possible for PAGE_COMPRESSION to still be used without punch hole (just not as effective, for example)?
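For context, the punch hole being discussed here is the Linux fallocate() flag FALLOC_FL_PUNCH_HOLE. The following is a minimal stand-alone sketch, not InnoDB code; the file name and the 6554-byte compressed length are made up for illustration. It shows how a page is still written at its full logical size and the unused tail is then deallocated, which is exactly the saving that is lost when hole punching is unsupported.

    /* Minimal sketch: write a 16 KiB page slot of which only the first part
     * holds compressed data, then punch a hole over the unused tail so the
     * file system can release those blocks.  Not InnoDB's actual code. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t page_size = 16384;   /* innodb_page_size */
        const off_t used      = 6554;    /* hypothetical compressed length (~40%) */
        char page[16384];

        memset(page, 0, sizeof page);
        memcpy(page, "compressed payload would go here", 32);

        int fd = open("demo.ibd", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* The page is still written at its full logical size... */
        if (pwrite(fd, page, page_size, 0) != page_size) {
            perror("pwrite"); close(fd); return 1;
        }

        /* ...but the unused tail is deallocated; the logical file size stays
         * the same.  Fails with EOPNOTSUPP if the file system cannot do it. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      used, page_size - used) != 0)
            perror("fallocate(PUNCH_HOLE)");

        close(fd);
        return 0;
    }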
Also tested this row compression, and yes page compression seems much better. Is it possible to get rid of punch hole by just re-mapping the pages inside these innodb files? I was thinking about this... if it's possible to change the data structure to divide the file into 512b blocks, where the beginning of each block marks the page number it belongs to (0xffffffff for blocks that are a continuation of the previous page). To grow a page we'd add another block with the same page number at the end of the file; later, contiguous recycled blocks could be added to a free list and used to store smaller pages. To make this atomic, the whole block would first be allocated and written with 0xfffffffe as a dummy id, and then this placeholder could be replaced with the real page number.
The problem with this could be that the file would get 100% fragmented over time (but it would probably work OK on NVMe). The mapping could be rebuilt by reading the whole data file during server start, and it'd need about 125 MB of memory per 1 TB of compressed data to store the mapping in memory for fast access.
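A rough sketch of the block layout and startup scan described above, assuming a uint32 page-number header at the start of each 512-byte block. The struct, constant and file names are invented; this only illustrates the idea and is not existing InnoDB code.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE   512u
    #define CONTINUATION 0xffffffffu  /* block continues the previous page      */
    #define PLACEHOLDER  0xfffffffeu  /* block allocated but not yet assigned   */

    struct block_header {
        uint32_t page_no;             /* owning page, or one of the markers above */
    };

    /* Scan the data file once and record the file offset of the first block
     * of each page.  map[] is indexed by page number. */
    static void build_mapping(FILE *f, uint64_t *map, uint32_t n_pages)
    {
        unsigned char block[BLOCK_SIZE];
        uint64_t offset = 0;

        while (fread(block, 1, BLOCK_SIZE, f) == BLOCK_SIZE) {
            struct block_header h;
            memcpy(&h, block, sizeof h);
            if (h.page_no != CONTINUATION && h.page_no != PLACEHOLDER
                && h.page_no < n_pages)
                map[h.page_no] = offset;   /* remember where this page starts */
            offset += BLOCK_SIZE;
        }
    }

    int main(void)
    {
        const uint32_t n_pages = 1024;            /* arbitrary for the demo */
        uint64_t *map = calloc(n_pages, sizeof *map);
        FILE *f = fopen("demo.dat", "rb");        /* placeholder data file  */

        if (f && map) {
            build_mapping(f, map, n_pages);
            fclose(f);
        }
        free(map);
        return 0;
    }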
What do you think? Would it be easy to add it on top of the current code (if there's some API, maybe, so that for example I could do this myself) and would it even make sense?
Best, Slawomir.
On 21 August 2023 at 07:08, Marko Mäkelä <marko.makela@mariadb.com> wrote:
On Sun, Aug 20, 2023 at 6:12 PM pslawek83 via discuss <discuss@lists.mariadb.org> wrote:
Thanks everyone for the answers. Tested this PAGE_COMPRESSION a bit and did some reading. Not sure if I fully understand how it works... is this correct? 1. It's impossible without punch hole, because if we don't have it the server will just write partially used pages which will take the full page size anyway. But the docs seem to describe its use with punch hole disabled. What's the reason to still have page compression without punch hole?
Can you please point to the misleading part of the documentation so that it can be corrected?
2. After enabling compression each page needs to be trimmed using punch hole, so if we have 1 TB database and 64k blocks, the database files will have over 15 million fragments (1000000000000/64000)
That is correct.
So if we get, say, 50% compression and are then able to put another 1 TB of data into these holes, we'll be having close to 31 million file fragments. I'm not sure whether under such conditions the FS will even be able to allocate space efficiently anymore, copy these files, etc. Or am I not getting something?
For what it is worth, around the same time when the page_compressed was being developed for MariaDB Server 10.1, something similar was being worked on at Oracle, I think for MySQL 5.7. To my understanding, the same hardware partner worked with both companies. The obvious answer to https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.h... would have been https://en.wikipedia.org/wiki/Fusion-io_NVMFS if it had worked without kernel panics or existed in a mainline kernel. I don’t even know if it ever was open sourced.
3. I was testing compression of a ~1.5 GB table with nice results on the default 16 KB page size. One thing I noticed is that I was always getting exactly the same compressed size using `du`. However, the table read/copy times varied from 40 s for uncompressed to 80-400 seconds depending on the algorithm. Is it possible that this is because the page size is too low?
As far as I remember, most compression algorithms that are derivatives of the LZ77 family have an input buffer size of 32 KiB. From that point of view, it would seem to make sense to use innodb_page_size=32k or innodb_page_size=64k instead of the default. Furthermore, I think that it was a mistake to implement bzip2 support at all. That algorithm would work with much larger buffer sizes.
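A hypothetical my.cnf fragment matching this suggestion; the choice of zlib is only an example, and note that innodb_page_size takes effect only when the data directory is first initialized, so an existing instance would have to be rebuilt to change it.

    [mysqld]
    innodb_page_size             = 64k
    innodb_compression_algorithm = zlib   # lz4, lzma, etc. also exist, build permitting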
4. MyRocks seems to be very slow compared to compressed InnoDB. I was also reading the docs related to it. For its use case, is it more suitable for storing large rows rather than many small rows?
To my understanding, the LSM trees that MyRocks is built on work best in an insert-only workload. On top of LSM trees, MyRocks implements some caches and Bloom filters. If you have few updates or deletes and most reads are accessing recently inserted data, it could work.
5. I remember reading that some InnoDB compression is going to be made obsolete. Is that row compression? Or are there no plans to obsolete any compression model?
I wanted to deprecate and remove the ROW_FORMAT=COMPRESSED that I designed and implemented in the InnoDB Plugin for MySQL 5.1 based on some high level ideas of Heikki Tuuri between 2005 and 2007. https://jira.mariadb.org/browse/MDEV-22367 was the first step towards that. Based on community feedback, this won’t be happening.

Even after some data structure cleanup that took place in MariaDB Server 10.6, supporting the format incurs an overhead of some 32 to 40 bytes per buffer page descriptor, which is not insignificant. Removing both ROW_FORMAT=COMPRESSED and the adaptive hash index would shrink the block descriptor to 80 bytes from its current 10.6 size of 160 bytes. The write performance with ROW_FORMAT=COMPRESSED should have been degraded when I implemented the easier-to-parse log record format in MDEV-12353 (MariaDB Server 10.5). There are open bugs about ROW_FORMAT=COMPRESSED, such as https://jira.mariadb.org/browse/MDEV-31574.

I would say that the InnoDB ROW_FORMAT=COMPRESSED seemed like a good idea back when it was implemented, when HDD storage was the only practical option. Just like the InnoDB change buffer, which I removed in MariaDB Server 11.0 after having to deal with too many cases of corruption: https://fosdem.org/2023/schedule/event/innodb_change_buffer/

With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
On Thu, Aug 24, 2023 at 6:40 PM pslawek83 <pslawek83@o2.pl> wrote:
Hi Marko, thanks for the response ... so for the DOCS part
https://mariadb.com/kb/en/innodb-page-compression/

When InnoDB page compression is used, InnoDB may still write the compressed page to the tablespace file with the original size of the uncompressed page, which would be equivalent to the value of the innodb_page_size system variable. This is done by design, because when InnoDB's I/O code needs to read the page from disk, it can only read the full page size. However, this is obviously not optimal. This punch hole technique allows InnoDB to read the compressed page from disk as the full page size, even though the compressed page really takes up less space on the file system.
So as far as I understand, if compression is enabled the server will write, e.g., a 40% utilized page which will later be decompressed when read, but this compression/decompression won't have any advantage if there's no punch hole support. The only difference will be that the block will be filled with zeros for the empty space at the end. Or is it possible for PAGE_COMPRESSION to still be used without punch hole (just not as effective, for example)?
The actual savings will depend on the implementation details and configuration parameters of the underlying file system. The file system that I have most everyday experience with is Linux ext4fs.

If you had suitable storage that can support atomic 16KiB writes (or whatever your innodb_page_size is), you might want to do mke2fs -b 16384 and use innodb_flush_method=O_DIRECT, to safely set innodb_doublewrite=OFF. But, with this setting, I am pretty sure that you would not be able to get any savings from using PAGE_COMPRESSED, because any FALLOC_FL_PUNCH_HOLE request that is smaller than the file system block size would simply be ignored.

The ext4 file system block size must be a power of 2 between 1 and 64 kilobytes. 40% of a 16384-byte page would be 6554 bytes. If the file system was created with the minimum 1024-byte block size, then the page would occupy 7 KiB or 43.75% of the logical size. With any other ext4fs block size, the allocation should be 50%.

I do not know how much space the fragmentation of files occupies in the file system metadata, but it does slow down file access, as you can see based on reports like https://jira.mariadb.org/browse/MDEV-19554 (for ext4fs). I think that there has been a similar report for NTFS as well.

I do not know if there are any Linux file systems that would categorically reject FALLOC_FL_PUNCH_HOLE. My common sense would suggest that file systems that support snapshots or are implemented as copy-on-write will have some limitations regarding sparse files. In fact, there is the open bug https://jira.mariadb.org/browse/MDEV-25734 where a difference between ext4fs and xfs was demonstrated.
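A sketch of the ext4 setup described above, with placeholder device and mount paths; a 16384-byte ext4 block size is only mountable where the kernel and architecture allow it, and innodb_doublewrite=OFF is only safe if the storage really guarantees atomic 16 KiB writes.

    # Create and mount the file system (paths are placeholders):
    mke2fs -t ext4 -b 16384 /dev/nvme0n1p1
    mount /dev/nvme0n1p1 /var/lib/mysql

    # Corresponding my.cnf fragment:
    [mysqld]
    innodb_flush_method = O_DIRECT
    innodb_doublewrite  = OFF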
Also tested this row compression, and yes page compression seems much better.
Actually, the InnoDB ROW_FORMAT=COMPRESSED that I implemented between 2005 and 2007 does not compress individual rows or records, but pages of them. The name is somewhat misleading. Because at that time the SQL parser was outside our control, we had to reuse some existing syntax, which I think had been used for read-only MyISAM files that had been prepared with some tool.

A design constraint of that format was that the fill factor of the uncompressed B-tree pages (which are in a format equivalent to ROW_FORMAT=DYNAMIC) will be limited in order to avoid overflows of the corresponding compressed pages.

I wanted to remove the code to deal with ROW_FORMAT=COMPRESSED, because it complicates the buffer pool data structures. In fact, if both the adaptive hash index and ROW_FORMAT=COMPRESSED could be removed, the size of buffer pool page descriptors could be shrunk to 100 or 104 bytes, from the current size of 160 bytes on 64-bit ISAs. Also, recently I learned about cases where some original design decisions turned out to be wrong, causing overflow of the compressed pages. The way I resolved https://jira.mariadb.org/browse/MDEV-30882 is only a partial fix that is not guaranteed to work in all cases.

There are users who seem to be happy with this format. Maybe because there are so many issues with PAGE_COMPRESSED at so many levels, and because compressed BLOBs (https://jira.mariadb.org/browse/MDEV-11371), which the InnoDB ROW_FORMAT=COMPRESSED would handle in a better way, have not been stable enough for production use.
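For reference, the two formats being compared here correspond to the following table options; the table and column names are invented for illustration.

    -- ROW_FORMAT=COMPRESSED: pages are kept compressed in the tablespace at a
    -- fixed KEY_BLOCK_SIZE (in KiB), and the buffer pool may hold both the
    -- compressed and the uncompressed copy of a page.
    CREATE TABLE t_row_compressed (id INT PRIMARY KEY, payload TEXT)
        ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

    -- PAGE_COMPRESSED: pages are compressed only when written out, and the
    -- unused remainder of each page is released with a punch hole where the
    -- file system supports it.
    CREATE TABLE t_page_compressed (id INT PRIMARY KEY, payload TEXT)
        ENGINE=InnoDB PAGE_COMPRESSED=1;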
Is it possible to get rid of punch hole by just re-mapping the pages inside these innodb files?
Back in 2004 or 2005, I was searching for information on how or whether Stacker or equivalent tools had solved this problem. In https://en.wikipedia.org/wiki/DriveSpace some corruption related to fragmented files is mentioned. Maybe it was not designed for any update-in-place usage.

I concluded back then that compression algorithms would also have a hard time supporting any update-in-place operations. Even if such algorithms existed, they would likely have to maintain some ephemeral data structures, which can be much larger than the uncompressed input or the compressed output. If I remember correctly, zlib could require more than 256 kilobytes of memory for a single compression context, even though its input buffer size is only 32 kilobytes. If there are many actively modified pages, you would need much more RAM for the compression algorithm than for the actual uncompressed or compressed data. One thing has not changed over the years: RAM is much more expensive than persistent storage. Some people are using SSDs as a less expensive alternative to RAM.

To avoid the need to recompress pages on every small modification (which would be extremely slow), I came up with two ideas. Some fields in the uncompressed page (page header, parts of record headers, transaction IDs, BLOB pointers, child page numbers) are stored uncompressed also in the compressed page, so that they can simply be updated in place. Modifications to the rest of the data may be covered by a log that will be stored in an otherwise unused area of the compressed page. Only if the modification log runs out of space will we recompress the entire page.
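A much simplified sketch of that second idea; the struct, field and function names are invented and the real InnoDB code is far more involved, but it shows the "append to the modification log until it no longer fits, then recompress" decision.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct zip_page {
        unsigned char data[16384];  /* compressed stream followed by the mod log */
        size_t comp_len;            /* bytes used by the compressed stream       */
        size_t mlog_len;            /* bytes used by the modification log        */
    };

    /* Returns false when the caller must recompress the whole page from its
     * uncompressed copy instead of logging the change. */
    static bool zip_apply_change(struct zip_page *zp,
                                 const unsigned char *rec, size_t rec_len)
    {
        size_t free_space = sizeof zp->data - zp->comp_len - zp->mlog_len;

        if (rec_len > free_space)
            return false;                    /* modification log is full */

        /* Append the change to the log kept in the unused tail of the page. */
        memcpy(zp->data + zp->comp_len + zp->mlog_len, rec, rec_len);
        zp->mlog_len += rec_len;
        return true;
    }

    int main(void)
    {
        struct zip_page zp = { .comp_len = 6000, .mlog_len = 0 };
        unsigned char rec[] = "delete-mark record 42";

        return zip_apply_change(&zp, rec, sizeof rec) ? 0 : 1;
    }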
I was thinking about this... if it's possible to change the data structure to divide the file into 512b blocks
While 512-byte blocks are still rather common, the physical block size may often be 4096 bytes. Users of MariaDB Server 10.11 should benefit from MDEV-14425, which finally allows the redo log to be written efficiently on such storage.

I think that it would be better to design a format that wastes less space to begin with. At the same time, operations on the keys and access to records should remain reasonably fast, because we would not want to waste expensive RAM by introducing an additional layer of a record cache.

Like Gordan has said in this thread, you might just let the file system handle compression if you need it. But, there is no free lunch. I suppose that ZFS would not support O_DIRECT. In any case, with file system compression, page writes would become more than a simple matter of sending the data to a DMA controller. You could also let the storage layer handle compression. I was really impressed by the performance of ScaleFlux when we tested it some years ago.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
On Fri, Aug 25, 2023 at 9:40 AM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Like Gordan has said in this thread, you might just let the file system handle compression if you need it. But, there is no free lunch. I suppose that ZFS would not support O_DIRECT.
Doesn't support it YET. But if the concern is double-caching things, then ZFS has a solution for that: setting primarycache=metadata. Obviously buffered writes require an extra memory copy, so things will get faster when the O_DIRECT implementation lands, but this is generally not where the significant bottlenecks are at the moment, especially when you can set sync=disabled and still preserve write ordering (which makes it safer than disabling innodb_flush_log_at_trx_commit, sync_binlog, sync_master_info, and similar).
In any case, with file system compression, page writes would become more than a simple matter of sending the data to a DMA controller. You could also let the storage layer handle compression. I was really impressed by the performance of ScaleFlux when we tested it some years ago.
ZFS compression performance is fast enough that it isn't really a problem. In many cases (spinning rust, slower SSDs), it often makes things faster because disk throughput is a bigger bottleneck than the compression cost.
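For reference, the ZFS properties mentioned above would be set along these lines; the dataset name is a placeholder, and compression=lz4 plus recordsize=16k are common suggestions rather than anything stated in this thread.

    zfs set primarycache=metadata tank/mysql   # avoid caching data both in ARC and in the buffer pool
    zfs set sync=disabled tank/mysql           # weigh against the durability trade-offs discussed above
    zfs set compression=lz4 tank/mysql
    zfs set recordsize=16k tank/mysql          # match innodb_page_size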
participants (3):
- Gordan Bobic
- Marko Mäkelä
- pslawek83