On Thu, Aug 24, 2023 at 6:40 PM pslawek83 <pslawek83@o2.pl> wrote:
Hi Marko, thanks for the response ... so for the DOCS part
https://mariadb.com/kb/en/innodb-page-compression/ When InnoDB page compression is used, InnoDB may still write the compressed page to the tablespace file with the original size of the uncompressed page, which would be equivalent to the value of the innodb_page_size system variable. This is done by design, because when InnoDB's I/O code needs to read the page from disk, it can only read the full page size. However, this is obviously not optimal. This punch hole technique allows InnoDB to read the compressed page from disk as the full page size, even though the compressed page really takes up less space on the file system.
So as far as i understand if compression is enabled the server will write eg. 40% utilized page which will later be decompressed when read, but this compression/decompression won't have any advangage if there's no punch hole support. The only difference will be that the block will be filled with zeros for the empty space at the end. Or is it possible for PAGE_COMPRESSION to still be used without punch hole (just not that effective for example)?
The actual savings will depend on the implementation details and configuration parameters of the underlying file system. The file system that I have most everyday experience with is Linux ext4fs. If you had suitable storage that can support atomic 16KiB writes (or whatever your innodb_page_size is), you might want to do mke2fs -b 16384 and use innodb_flush_method=O_DIRECT, to safely set innodb_doublewrite=OFF. But, with this setting, I am pretty sure that you would not be able to get any savings from using PAGE_COMPRESSED, because any FALLOC_FL_PUNCH_HOLE request that is smaller than the file system block size would simply be ignored. The ext4 file system block size must be a power of 2 between 1 and 64 kilobytes. 40% of a 16384-byte page would be 6554 bytes. If the file system was created with the minimum 1024-byte block size, then the page would occupy 7 KiB or 43.75% of the logical size. With any other ext4fs block size, the allocation should be 50%. I do not know how much space the fragmentation of files occupies in the file system metadata, but it does slow down file access, as you can see based on reports like https://jira.mariadb.org/browse/MDEV-19554 (for ext4fs). I think that there has been a similar report for NTFS as well. I do not know if there are any Linux file systems that would categorically reject FALLOC_FL_PUNCH_HOLE. My common sense would suggest that file systems that support snapshots or are implemented as copy-on-write will have some limitations regarding sparse files. In fact, there is the open bug https://jira.mariadb.org/browse/MDEV-25734 where a difference between ext4fs and xfs was demonstrated.
Also tested this row compression, and yes page compression seems much better.
Actually, the InnoDB ROW_FORMAT=COMPRESSED that I implemented between 2005 and 2007 does not compress individual rows or records, but pages of them. The name is somewhat misleading. Because at that time the SQL parser was outside our control, we had to reuse some existing syntax, which I think had been used for read-only MyISAM files that had been prepared with some tool. A design constraint of that format was that the fill factor of the uncompressed B-tree pages (which are in a format equivalent to ROW_FORMAT=DYNAMIC) will be limited in order to avoid overflows of the corresponding compressed pages. I wanted to remove the code to deal with ROW_FORMAT=COMPRESSED, because it complicates the buffer pool data structures. In fact, if both the adaptive hash index and ROW_FORMAT=COMPRESSED could be removed, the size of buffer pool page descriptors could be shrunk to 100 or 104 bytes, from the current size of 160 bytes on 64-bit ISAs. Also, recently I learned about cases where some original design decisions turned out to be wrong, causing overflow of the compressed pages. The way I resolved https://jira.mariadb.org/browse/MDEV-30882 is only a partial fix that is not guaranteed to work in all cases. There are users who seem to be happy with this format. Maybe because there are so many issues with PAGE_COMPRESSED at so many levels, and because the compressed BLOBs (https://jira.mariadb.org/browse/MDEV-11371) which the InnoDB ROW_FORMAT=COMPRESSED would handle in a better way has not been stable enough for production use.
Is it possible to get rid of punch hole by just re-mapping the pages inside these innodb files?
Back in 2004 or 2005, I was searching for information on how or whether Stacker or equivalent tools had solved this problem. In https://en.wikipedia.org/wiki/DriveSpace some corruption related to fragmented files is mentioned. Maybe it was not designed for any update-in-place usage. I concluded back then that also compression algorithms would have a hard time to support any update-in-place operations. Even if such algorithms existed, they would likely have to maintain some ephemeral data structures, which can be much larger than the uncompressed input or the compressed output. If I remember correctly, zlib could require more than 256 kilobytes of memory for a single compression context, even though its input buffer size is only 32 kilobytes. If there are many actively modified pages, you would need much more RAM for the compression algorithm than the actual uncompressed or compressed data. One thing has not changed over the years: RAM is much more expensive than persistent storage. Some people are using SSDs as a less expensive alternative to RAM. To avoid the need to recompress pages on every small modification (which would be extremely slow), I came up with two ideas. Some fields in the uncompressed page (page header, parts of record headers, transaction IDs, BLOB pointers, child page numbers) are stored uncompressed also in the compressed page, so that they can simply be updated in place. Modifications to the rest of the data may be covered by a log that will be stored in an otherwise unused area of the compressed page. Only if the modification log runs out of space, we will recompress the entire page.
I was thinking about this... if it's possible to change the data structure to divide the file into 512b blocks
While 512-byte blocks are still rather common, the physical block size may often be 4096 bytes. Users of MariaDB Server 10.11 should benefit from MDEV-14425, which finally allows the redo log to be written efficiently on such storage. I think that it would be better to design a format that wastes less space to begin with. At the same time, operations on the keys and access to records should remain reasonably fast, because we would not want to waste expensive RAM by introducing an additional layer of a record cache. Like Gordan has said in this thread, you might just let the file system handle compression if you need it. But, there is no free lunch. I suppose that ZFS would not support O_DIRECT. In any case, with file system compression, page writes would become more than a simple matter of sending the data to a DMA controller. You could also let the storage layer handle compression. I was really impressed by the performance of ScaleFlux when we tested it some years ago. Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc