Hi Marko, thanks for the response ... so for the DOCS part

> When InnoDB page compression is used, InnoDB may still write the compressed page to the tablespace file with the original size of the uncompressed page, which would be equivalent to the value of the innodb_page_size system variable. This is done by design, because when InnoDB's I/O code needs to read the page from disk, it can only read the full page size. However, this is obviously not optimal.
> This punch hole technique allows InnoDB to read the compressed page from disk as the full page size, even though the compressed page really takes up less space on the file system.

So as far as I understand, if compression is enabled the server will write e.g. a 40% utilized page, which will later be decompressed when read, but this compression/decompression won't have any advantage if there's no punch hole support. The only difference will be that the block will be filled with zeros for the empty space at the end. Or is it possible for PAGE_COMPRESSION to still be useful without punch hole (just not that effective, for example)?
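
Just to check I'm reading it right, this is roughly what I imagine the punch hole step looks like on Linux. This is only my own sketch using fallocate(2), not the actual InnoDB code, and the function name and parameters are made up:

    #include <fcntl.h>      // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
    #include <unistd.h>     // pwrite
    #include <cstdint>

    // Write one compressed page at its usual page_size-aligned offset. The
    // buffer is still page_size bytes long, with zeros after the compressed
    // payload. Error handling is mostly omitted.
    bool write_compressed_page(int fd, const char *buf, uint64_t offset,
                               uint64_t page_size, uint64_t compressed_len)
    {
      if (pwrite(fd, buf, page_size, offset) != (ssize_t) page_size)
        return false;
      // With punch hole support the zero-filled tail is deallocated, so the
      // file system stops storing it while the logical file size and the
      // page offsets stay the same; a read can still fetch the full page.
      // Without punch hole support this step is skipped and the trailing
      // zeros simply stay allocated on disk.
      if (compressed_len < page_size)
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset + compressed_len, page_size - compressed_len);
      return true;
    }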

Also tested the row compression, and yes, page compression seems much better. Would it be possible to get rid of punch hole by just re-mapping the pages inside these InnoDB files? I was thinking about whether it's possible to change the data layout to divide the file into 512-byte blocks, where the beginning of each block marks the page number it belongs to (0xffffffff for blocks which are continuations of the previous page). To grow a page we'd write a new copy of it, with the same page number, at the end of the file; later the contiguous recycled blocks can be added to a free list and used to store smaller pages. To make this atomic, the whole block would first be allocated and written with 0xfffffffe as a dummy id, and then this placeholder id could be replaced with the real page number.
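
Very rough sketch of the layout I have in mind (all hypothetical, the names and constants are just mine, nothing from the current code):

    #include <cstdint>
    #include <vector>

    constexpr uint32_t BLOCK_SIZE   = 512;
    constexpr uint32_t CONTINUATION = 0xffffffff;  // block continues the previous page
    constexpr uint32_t PLACEHOLDER  = 0xfffffffe;  // allocated but not yet assigned

    // Hypothetical 512-byte on-disk block: a 4-byte page number followed by
    // a slice of the compressed page data.
    struct block {
      uint32_t page_no;                  // owning page, CONTINUATION or PLACEHOLDER
      char     payload[BLOCK_SIZE - 4];
    };

    // In-memory mapping rebuilt at startup:
    // page number -> index of the first block of that page in the data file.
    using page_map = std::vector<uint64_t>;

Growing a page would then mean appending the new copy as PLACEHOLDER blocks first and flipping them to the real page number only after they're fully written, so a crash in between just leaves reusable placeholder blocks, while the old blocks go on the free list.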

The problem with this could be that the file would get 100% fragmented over time (but it would probably still work OK with NVMe). The mapping could be read back by scanning the whole datafile during server start, and it would need about 125 MB of memory per 1 TB of compressed data to keep the mapping in memory for fast access.
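
The startup scan I mean would be something like this (continuing the sketch above, again only an illustration):

    #include <cstdio>
    #include <cstdint>

    // Read the block headers sequentially and remember where each page
    // starts; CONTINUATION and PLACEHOLDER blocks are skipped.
    page_map rebuild_mapping(std::FILE *f, uint64_t n_blocks, uint64_t n_pages)
    {
      page_map map(n_pages, UINT64_MAX);  // UINT64_MAX = page not present
      block b;
      for (uint64_t i = 0; i < n_blocks; i++) {
        if (std::fread(&b, sizeof b, 1, f) != 1)
          break;
        if (b.page_no != CONTINUATION && b.page_no != PLACEHOLDER
            && b.page_no < n_pages)
          map[b.page_no] = i;
      }
      return map;
    }

The 125 MB is assuming 64 KiB pages: 1 TB / 64 KiB is roughly 15-16 million pages, and at about 8 bytes of mapping data per page that comes out around 125 MB.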

What do you think? Would it be easy to add this on top of the current code (if there's some API maybe, so I could do it myself, for example), and would it even make sense?

Best,
Slawomir.

On 21 August 2023 at 07:08, Marko Mäkelä <marko.makela@mariadb.com> wrote:
On Sun, Aug 20, 2023 at 6:12 PM pslawek83 via discuss
<discuss@lists.mariadb.org> wrote:
Thanks everyone for the answers. Tested this PAGE_COMPRESSION a bit and did some reading. Not sure if I fully understand how that works... is the following correct?
1. It seems impossible without punch hole, because if we don't have it the server will just write partially used pages which take the full page size anyway. But the docs seem to describe its use with punch hole disabled. What's the reason to still have page compression without punch hole?
Can you please point to the misleading part of the documentation so
that it can be corrected?
2. After enabling compression each page needs to be trimmed using punch hole, so if we have a 1 TB database and 64k blocks, the database files will have over 15 million fragments (1000000000000/64000).
That is correct.
So if we get like 50% compression and then we'll be able to put another 1 TB of data into these holes, we'll be having close to 31 million file fragments. I'm not sure if under such conditions the FS will even be able to allocate space efficiently anymore, copy these files, etc. Or am I not getting something?
For what it is worth, around the same time when the page_compressed
was being developed for MariaDB Server 10.1, something similar was
being worked on at Oracle, I think for MySQL 5.7. To my understanding,
the same hardware partner worked with both companies. The obvious difference is that their approach relied on a special file system which, to my knowledge, never had worked without kernel panics or existed in a mainline kernel. I don’t even know if it ever was open sourced.
3. I was testing compression of a ~1.5 GB table with nice results on the default 16 KB page size. One thing I noticed is that I was always getting exactly the same compressed size using `du`. However, the table read/copy times varied from 40 s for uncompressed to 80-400 seconds depending on the algorithm. Is it possible that this is because the page size is too low?
As far as I remember, most compression algorithms that are derivatives
of the LZ77 family have an input buffer size of 32 KiB. From that
point of view, it would seem to make sense to use innodb_page_size=32k
or innodb_page_size=64k instead of the default. Furthermore, I think
that it was a mistake to implement bzip2 support at all. That
algorithm would work with much larger buffer sizes.
4. MyRocks seems to be very slow compared to compressed InnoDB. I was also reading the docs related to it. For its use case, is it more suitable for storing large rows rather than many small rows?
To my understanding, the LSM trees that MyRocks is built on work best
in an insert-only workload. On top of LSM trees, MyRocks implements
some caches and Bloom filters. If you have few updates or deletes and
most reads are accessing recently inserted data, it could work.
5. I remember reading that some InnoDB compression is going to be made obsolete. Is that row compression? Or are there no plans to obsolete any compression model?
I wanted to deprecate and remove the ROW_FORMAT=COMPRESSED that I
designed and implemented in the InnoDB Plugin for MySQL 5.1 based on
some high level ideas of Heikki Tuuri between 2005 and 2007.
Based on community feedback, this won’t be happening. Even after
some data structure cleanup that took place in MariaDB Server 10.6,
supporting the format incurs an overhead of some 32 to 40 bytes per
buffer page descriptor, which is not insignificant. Removing both
ROW_FORMAT=COMPRESSED and the adaptive hash index would shrink the
block descriptor to 80 bytes from its current 10.6 size of 160 bytes.
The write performance with ROW_FORMAT=COMPRESSED should have been
degraded when I implemented the easier-to-parse log record format in
MDEV-12353 (MariaDB Server 10.5). There are open bugs about
ROW_FORMAT=COMPRESSED, such as
InnoDB ROW_FORMAT=COMPRESSED seemed like a good idea back when it was
implemented, when HDD storage was the only practical option. Just like
the InnoDB change buffer, which I removed in MariaDB Server 11.0 after
having to deal with too many cases of corruption:
With best regards,
Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc