Hi Lukas,

On Thu, Jun 29, 2017 at 10:18 AM, Lukas Lehner <weblehner@gmail.com> wrote:
Hey


This approach differs significantly from legacy InnoDB compressed tables using row_format=compressed, where both uncompressed and compressed pages can be in the buffer pool.

Any details why this is legacy? Should not be used? With a search engine I found a very old blog posting "InnoDB compression woes" from Percona.

I am the principal author of the ROW_FORMAT=COMPRESSED in InnoDB. Actually, when I was hired for Innobase Oy in September 2003, my first task was to reduce the disk space usage of InnoDB tables. The first step of that was ROW_FORMAT=COMPACT, which was introduced in MySQL 5.0.3.

I started the development of ROW_FORMAT=COMPRESSED shortly after Oracle Corporation acquired my employer Innobase Oy in October 2005. The code was originally published as the InnoDB Plugin for MySQL 5.1, in 2007 or 2008.

I authored all the revised ROW_FORMATs of InnoDB. ROW_FORMAT=DYNAMIC is a minor variant of ROW_FORMAT=COMPACT: for columns that are stored off-page, no prefix is stored in the clustered index leaf page records. ROW_FORMAT=REDUNDANT (the original format developed by Heikki Tuuri) and ROW_FORMAT=COMPACT store a prefix of 768 bytes locally, for easier implementation of column prefix indexes (which before MySQL 5.5 were limited to 767 bytes). With ROW_FORMAT=DYNAMIC, some column prefix information is written to the undo log, so that the purge of history can access it from there.

row_format=compressed is attractive when you pay your DBaaS per GB (storage based usage). Any insides?

The high-level idea of keeping both uncompressed and compressed pages in the buffer pool came from the founder of Innobase Oy, Heikki Tuuri. The low-level design was mine. There are 3 kinds of pages in ROW_FORMAT=COMPRESSED:

  • Fully uncompressed pages (page allocation bitmap, change buffer bitmap, other allocation data structures). These use user-specified the physical page size.
  • Compressed B-tree pages. Each page uncompresses in memory to a page that is equivalent to ROW_FORMAT=DYNAMIC. The compressed page consists of a compressed zlib stream, some uncompressed fields (for facilitating in-place updates without recompressing the page), and an uncompressed "page modification log" (to avoid updates without recompressing the page). This format was my invention.
  • Compressed BLOB pages. The contents of off-page columns is compressed as a single zlib stream, and there are a little fewer header fields than on uncompressed BLOB pages. Either way, BLOBs are stored as singly-linked lists of pages in InnoDB.
I implemented a rudimentary buffer pool interface and also the binary buddy allocator (buf0buddy.cc) that uses the InnoDB buffer pool for allocating smaller pages for the compressed page frames.
The buffer pool eviction mechanism (unzip_LRU) was jointly designed by Heikki Tuuri and Inaam Rana. Oracle wanted them to file a patent for it, and it was granted: United States Patent 8,375,178 (Memory page eviction based on present system operation).

Back in 2005, there were no SSDs, and Heikki was worried about fragmentation. So, a scheme like Stacker or Doublespace or NTFS compression was out of the question. My feeling is that all these systems essentially work like the PAGE_COMPRESSED=YES in MariaDB 10.1: allocating variable-size compressed blocks for fixed-size uncompressed blocks. The property of ROW_FORMAT=COMPRESSED that the compressed block size is fixed and specified by the user was kind of a design constraint.

I think that is fair to say that ROW_FORMAT=COMPRESSED is complicating the buffer pool management quite a bit. But on the other hand, PAGE_COMPRESSED=YES depends on file system specifics and is causing fragmentation there. Even on SSD, this could slow down some operations, such as DROP TABLE.

BLOB storage is more efficient with ROW_FORMAT=COMPRESSED, because each off-page column is compressed as a single zlib stream. With PAGE_COMPRESSED=YES, each BLOB page snippet is compressed individually, which could be much more wasteful in the worst case. Artificial example: REPEAT(almost innodb_page_size uncompressible bytes',N) should compress to 1 or 2 pages with ROW_FORMAT=COMPRESSED (storing the string and then compressing each repetition), but N pages in PAGE_COMPRESSED=YES, because it would divide the BLOB into parts smaller than innodb_page_size and then attempt to compress these pages individually.

Conclusion: ROW_FORMAT=COMPRESSED was a monolithic piece of development that was not properly performance-tested before it was completed. PAGE_COMPRESSED=YES was much simpler to implement and is easier to understand. It is also easier to understand the buffer pool metrics when not using ROW_FORMAT=COMPRESSED.

Maybe some day we can improve the ROW_FORMAT=COMPRESSED. But I think that LSM-trees such as MyRocks are difficult to beat when it comes to disk space usage. This of course depends on the workload.

Best regards,

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation