Data Organization in InnoDB

From: https://blogs.oracle.com/mysqlinnodb/entry/data_organization_in_innodb

Introduction

This article explains how data is organized in the InnoDB storage engine. First we look at the various files that InnoDB creates, then at the logical data organization: tablespaces, pages, segments and extents. We explore each of them in some detail and discuss how they relate to one another. By the end of this article, the reader will have a high-level view of the data layout within the InnoDB storage engine.

The Files

MySQL stores all data within the data directory. The data directory can be specified using the command-line option --datadir or in the configuration file as datadir. Refer to the Server Command Options for complete details.

By default, when InnoDB is initialized, it creates three important files in the data directory: ibdata1, ib_logfile0 and ib_logfile1. ibdata1 is the data file in which system and user data are stored. ib_logfile0 and ib_logfile1 are the redo log files. The location and size of these files are configurable. Refer to Configuring InnoDB for more details.

The data file ibdata1 belongs to the system tablespace, whose tablespace id (space_id) is 0. As of MySQL 5.6, only the system tablespace can contain more than one data file; every other tablespace consists of exactly one data file. Likewise, only the system tablespace can contain more than one table; every other tablespace contains exactly one table.

The data files and the redo log files are represented in memory by the C structure fil_node_t.

Tablespaces

By default, InnoDB contains only one tablespace called the system tablespace whose identifier is 0. More tablespaces can be created indirectly using the innodb_file_per_table configuration parameter. In MySQL 5.6, this configuration parameter is ON by default. When it is ON, each table will be created in its own tablespace in a separate data file.

The relationship between the tablespace and data files is explained in the InnoDB source code comment (storage/innobase/fil/fil0fil.cc) which is quoted here for reference:

“A tablespace consists of a chain of files. The size of the files does not have to be divisible by the database block size, because we may just leave the last incomplete block unused. When a new file is appended to the tablespace, the maximum size of the file is also specified. At the moment, we think that it is best to extend the file to its maximum size already at the creation of the file, because then we can avoid dynamically extending the file when more space is needed for the tablespace.”

The last statement, about avoiding dynamic extension, applies only to the redo log files and not to the data files. Data files are extended dynamically, whereas redo log files are pre-allocated. As mentioned earlier, only the system tablespace can have more than one data file.

It is also clearly mentioned that even though the tablespace can have multiple files, they are thought of as one single large file concatenated together. So the order of files within the tablespace is important.

Pages

A data file is logically divided into equal-sized pages. The first page of the first data file has page number 0, the next page has page number 1, and so on. A page within a tablespace is uniquely identified by its page number (page_no), and each tablespace is uniquely identified by its tablespace identifier (space_id). So a page is uniquely identified throughout InnoDB by the (space_id, page_no) combination, and any location within InnoDB can be uniquely identified by the (space_id, page_no, page_offset) combination, where page_offset is the byte offset within the given page.
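The addressing scheme can be sketched in C as follows. This is only an illustration: the type page_addr_t and the function addr_to_byte_offset() are hypothetical names, not InnoDB definitions, and the page size is passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type; InnoDB does not define this struct. */
typedef struct {
    uint32_t space_id;    /* tablespace identifier */
    uint32_t page_no;     /* page number within the tablespace */
    uint32_t page_offset; /* byte offset within the page */
} page_addr_t;

/* Byte position of an address, counted from the start of its tablespace
   (with all data files of the tablespace treated as one large file). */
static uint64_t addr_to_byte_offset(page_addr_t addr, uint32_t page_size)
{
    assert(addr.page_offset < page_size);
    return (uint64_t) addr.page_no * page_size + addr.page_offset;
}
```

For example, with 16K pages the address (space_id 0, page_no 3, page_offset 38) lies 3 * 16384 + 38 bytes from the start of the system tablespace.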

How the pages from different data files relate to one another is explained in another source code comment: “A block’s position in the tablespace is specified with a 32-bit unsigned integer. The files in the chain are thought to be catenated, and the block corresponding to an address n is the nth block in the catenated file (where the first block is named the 0th block, and the incomplete block fragments at the end of files are not taken into account). A tablespace can be extended by appending a new file at the end of the chain.” This means that not every data file starts with page_no 0 (zero): only the first page of the first data file in a tablespace has page_no 0, and numbering continues from one file to the next.
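A minimal sketch of how a page_no could be resolved against a chain of data files, assuming the number of complete pages in each file is already known. The function locate_page() is a hypothetical helper written for this article, not part of InnoDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an InnoDB function.
   file_pages[i] is the number of complete pages in the i-th file of the
   chain. Returns the index of the file that holds page_no and stores the
   page's position inside that file in *local_page. Returns -1 if page_no
   lies beyond the end of the tablespace. */
static int locate_page(const uint32_t *file_pages, size_t n_files,
                       uint32_t page_no, uint32_t *local_page)
{
    assert(file_pages != NULL && local_page != NULL);
    for (size_t i = 0; i < n_files; i++) {
        if (page_no < file_pages[i]) {
            *local_page = page_no;
            return (int) i;
        }
        page_no -= file_pages[i]; /* skip past this file */
    }
    return -1;
}
```

With two files of 768 and 512 pages, page_no 768 resolves to the first page of the second file, which shows why only the first file of a tablespace starts at page_no 0.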

The comment above also mentions that the page_no is a 32-bit unsigned integer. This is the size of the page_no when stored on disk.

Every page has a page header (page_header_t). For more details, please refer to Jeremy Cole’s blog post “The basics of InnoDB space file layout.”

Extents

An extent is 1MB of consecutive pages. The size of one extent is defined as follows (1048576 bytes = 1MB):

#define FSP_EXTENT_SIZE (1048576U / UNIV_PAGE_SIZE)

The macro UNIV_PAGE_SIZE used to be a compile-time constant. From MySQL 5.6 onwards it is a global variable. The number of pages in an extent depends on the page size used. If the page size is 16K (the default), then an extent contains 64 pages.
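The arithmetic behind the macro can be sketched as follows. The helper names pages_per_extent() and extent_of_page() are hypothetical, and the sketch assumes a page size that divides 1MB evenly.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_BYTES 1048576U /* 1MB, the constant used in FSP_EXTENT_SIZE */

/* Hypothetical helper: number of pages in one extent for a given page
   size in bytes. */
static uint32_t pages_per_extent(uint32_t page_size)
{
    assert(page_size > 0 && EXTENT_BYTES % page_size == 0);
    return EXTENT_BYTES / page_size;
}

/* Hypothetical helper: index of the extent that contains page_no. */
static uint32_t extent_of_page(uint32_t page_no, uint32_t page_size)
{
    return page_no / pages_per_extent(page_size);
}
```

With 16K pages this gives 64 pages per extent, so page 130 falls in extent 2 (pages 128 to 191); with 8K pages an extent holds 128 pages, and with 4K pages it holds 256.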

Page Types

One page can be used for many purposes. The page type will identify the purpose for which the page is being used. The page type of each page will be stored in the page header. The page types are available in the header file storage/innobase/include/fil0fil.h. The following table provides a brief description of the page types.

Page Type	Description
FIL_PAGE_INDEX	The page is a B-tree node
FIL_PAGE_UNDO_LOG	The page stores undo logs
FIL_PAGE_INODE	The page contains an array of fseg_inode_t objects
FIL_PAGE_IBUF_FREE_LIST	The page is in the free list of the insert buffer or change buffer
FIL_PAGE_TYPE_ALLOCATED	Freshly allocated page.
FIL_PAGE_IBUF_BITMAP	Insert buffer or change buffer bitmap
FIL_PAGE_TYPE_SYS	System page
FIL_PAGE_TYPE_TRX_SYS	Transaction system data
FIL_PAGE_TYPE_FSP_HDR	File space header
FIL_PAGE_TYPE_XDES	Extent Descriptor Page
FIL_PAGE_TYPE_BLOB	Uncompressed BLOB page
FIL_PAGE_TYPE_ZBLOB	First compressed BLOB page
FIL_PAGE_TYPE_ZBLOB2	Subsequent compressed BLOB page

Each page type is used for a different purpose. It is beyond the scope of this article to explore each page type. For now, it is sufficient to know that all pages have a page header (page_header_t) that stores the page type, and that the page type determines the contents and layout of the page.

Tablespace Header

Each tablespace has a header of type fsp_header_t. This data structure is stored in the first page of the tablespace and contains the following information:

The table space identifier (space_id)
Current size of the table space in pages.
List of free extents
List of full extents not belonging to any segment.
List of partially full/free extents not belonging to any segment.
List of pages containing segment headers, where all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE)
List of pages containing segment headers, where not all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE).

Figure: InnoDB Tablespace Header Structure

From the tablespace header, we can access the list of segments available in the tablespace. The total space occupied by the tablespace header is given by the macro FSP_HEADER_SIZE, which is equal to 16*7 = 112 bytes.

Reserved Pages of Tablespace

As mentioned earlier, InnoDB always contains one tablespace called the system tablespace, with identifier 0. This is a special tablespace and is kept open for as long as the MySQL server is running. The first few pages of the system tablespace are reserved for internal use. This information can be obtained from the header storage/innobase/include/fsp0types.h. The reserved pages are listed below with a short description.

Page Number	The Page Name	Description
0	FSP_XDES_OFFSET	The extent descriptor page.
1	FSP_IBUF_BITMAP_OFFSET	The insert buffer bitmap page.
2	FSP_FIRST_INODE_PAGE_NO	The first inode page number.
3	FSP_IBUF_HEADER_PAGE_NO	Insert buffer header page in system tablespace.
4	FSP_IBUF_TREE_ROOT_PAGE_NO	Insert buffer B-tree root page in system tablespace.
5	FSP_TRX_SYS_PAGE_NO	Transaction system header in system tablespace.
6	FSP_FIRST_RSEG_PAGE_NO	First rollback segment page, in system tablespace.
7	FSP_DICT_HDR_PAGE_NO	Data dictionary header page in system tablespace.

As can be seen from the table above, the first three pages are present in every tablespace, while the last five pages are reserved only in the system tablespace. In all other tablespaces, only three pages are reserved.
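The rule can be sketched as a small C check. The constants mirror the names and page numbers listed in the table; the function is_reserved_page() is a hypothetical helper and considers only these first few pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Page numbers as listed in the table above. */
enum {
    FSP_XDES_OFFSET            = 0,
    FSP_IBUF_BITMAP_OFFSET     = 1,
    FSP_FIRST_INODE_PAGE_NO    = 2,
    FSP_IBUF_HEADER_PAGE_NO    = 3,
    FSP_IBUF_TREE_ROOT_PAGE_NO = 4,
    FSP_TRX_SYS_PAGE_NO        = 5,
    FSP_FIRST_RSEG_PAGE_NO     = 6,
    FSP_DICT_HDR_PAGE_NO       = 7
};

#define SYSTEM_SPACE_ID 0U /* the system tablespace has space_id 0 */

/* Hypothetical helper: is page_no one of the reserved pages at the start
   of the given tablespace? */
static bool is_reserved_page(uint32_t space_id, uint32_t page_no)
{
    if (space_id == SYSTEM_SPACE_ID) {
        return page_no <= FSP_DICT_HDR_PAGE_NO; /* pages 0 to 7 */
    }
    return page_no <= FSP_FIRST_INODE_PAGE_NO;  /* pages 0 to 2 */
}
```

Page 3 is therefore reserved in the system tablespace but available for ordinary use in a single-table tablespace.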

When the option innodb_file_per_table is enabled, a separate tablespace with one data file is created for each table. The source code comment in the function dict_build_table_def_step() states the following:

                /* We create a new single-table tablespace for the table. 
                We initially let it be 4 pages: 
                - page 0 is the fsp header and an extent descriptor page, 
                - page 1 is an ibuf bitmap page, 
                - page 2 is the first inode page, 
                - page 3 will contain the root of the clustered index of the 
                table we create here. */

File Segments

A tablespace can contain many file segments. A file segment (or just segment) is a logical entity. Each segment has a segment header (fseg_header_t), which points to the inode (fseg_inode_t) describing the file segment. The file segment header contains the following information:

The space to which the inode belongs
The page_no of the inode
The byte offset of the inode
The length of the file segment header (in bytes).

Note: The source code would be more readable if fseg_header_t and fseg_inode_t were defined as proper C-style structures.

The fseg_inode_t object contains the following information:

The segment id to which it belongs.
List of full extents.
List of free extents of this segment.
List of partially full/free extents.
Array of individual pages belonging to this segment. The number of slots in this array is half the number of pages in an extent (32 slots with 16K pages).

When a segment wants to grow, it will get free extents or pages from the tablespace to which it belongs.

Table

In InnoDB, when a table is created, a clustered index (B-tree) is created internally. This B-tree contains two file segments, one for the non-leaf pages and the other for the leaf pages. From the source code documentation:

“In the root node of a B-tree there are two file segment headers. The leaf pages of a tree are allocated from one file segment, to make them consecutive on disk if possible. From the other file segment we allocate pages for the non-leaf levels of the tree.”

For a given table, the root page of a B-tree is obtained from the data dictionary. So in InnoDB, each table exists within a tablespace and contains one B-tree (the clustered index), which in turn contains two file segments. Each file segment can contain many extents, and each extent contains 1MB of consecutive pages.

Conclusion

This article discussed how data is organized within InnoDB. We first looked at the files created by InnoDB, and then discussed the various logical entities: tablespaces, pages, page types, extents, segments and tables. We also looked at how they relate to one another.