(This page was last modified on 2003/11/25 23:48:57 UTC)
SQLite is implemented in layers. (See the architecture description.) The format of database files is determined by three different layers in the architecture.
We wil describe each layer beginning with the bottom (pager) layer and working upwards.
An SQLite database consists of "pages" of data. Each page is 1024 bytes in size. Pages are numbered beginning with 1. A page number of 0 is used to indicate "no such page" in the B-Tree and Schema layers.
The pager layer is responsible for implementing transactions with atomic commit and rollback. It does this using a separate journal file. Whenever a new transaction is started, a journal file is created that records the original state of the database. If the program terminates before completing the transaction, the next process to open the database can use the journal file to restore the database to its original state.
The journal file is located in the same directory as the database file and has the same name as the database file but with the characters "-journal" appended.
The pager layer does not impose any content restrictions on the main database file. As far as the pager is concerned, each page contains 1024 bytes of arbitrary data. But there is structure to the journal file.
A journal file begins with 8 bytes as follows: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, and 0xd6. Processes that are attempting to rollback a journal use these 8 bytes as a sanity check to make sure the file they think is a journal really is a valid journal. Prior version of SQLite used different journal file formats. The magic numbers for these prior formats are different so that if a new version of the library attempts to rollback a journal created by an earlier version, it can detect that the journal uses an obsolete format and make the necessary adjustments. This article describes only the newest journal format - supported as of version 2.8.0.
Following the 8 byte prefix is a three 4-byte integers that tell us the number of pages that have been committed to the journal, a magic number used for sanity checking each page, and the original size of the main database file before the transaction was started. The number of committed pages is used to limit how far into the journal to read. The use of the checksum magic number is described below. The original size of the database is used to restore the database file back to its original size. The size is expressed in pages (1024 bytes per page).
All three integers in the journal header and all other multi-byte numbers used in the journal file are big-endian. That means that the most significant byte occurs first. That way, a journal file that is originally created on one machine can be rolled back by another machine that uses a different byte order. So, for example, a transaction that failed to complete on your big-endian SparcStation can still be rolled back on your little-endian Linux box.
After the 8-byte prefix and the three 4-byte integers, the journal file consists of zero or more page records. Each page record is a 4-byte (big-endian) page number followed by 1024 bytes of data and a 4-byte checksum. The data is the original content of the database page before the transaction was started. So to roll back the transaction, the data is simply written into the corresponding page of the main database file. Pages can appear in the journal in any order, but they are guaranteed to appear only once. All page numbers will be between 1 and the maximum specified by the page size integer that appeared at the beginning of the journal.
The so-called checksum at the end of each record is not really a checksum - it is the sum of the page number and the magic number which was the second integer in the journal header. The purpose of this value is to try to detect journal corruption that might have occurred because of a power loss or OS crash that occurred which the journal file was being written to disk. It could have been the case that the meta-data for the journal file, specifically the size of the file, had been written to the disk so that when the machine reboots it appears that file is large enough to hold the current record. But even though the file size has changed, the data for the file might not have made it to the disk surface at the time of the OS crash or power loss. This means that after reboot, the end of the journal file will contain quasi-random garbage data. The checksum is an attempt to detect such corruption. If the checksum does not match, that page of the journal is not rolled back.
Here is a summary of the journal file format:
The B-Tree layer builds on top of the pager layer to implement one or more separate b-trees all in the same disk file. The algorithms used are taken from Knuth's The Art Of Computer Programming.
Page 1 of a database contains a header string used for sanity checking, a few 32-bit words of configuration data, and a pointer to the beginning of a list of unused pages in the database. All other pages in the database are either pages of a b-tree, overflow pages, or unused pages on the freelist.
Each b-tree page contains zero or more database entries. Each entry has an unique key of one or more bytes and data of zero or more bytes. Both the key and data are arbitrary byte sequences. The combination of key and data are collectively known as "payload". The current implementation limits the amount of payload in a single entry to 1048576 bytes. This limit can be raised to 16777216 by adjusting a single #define in the source code and recompiling. But most entries contain less than a hundred bytes of payload so a megabyte limit seems more than enough.
Up to 238 bytes of payload for an entry can be held directly on a b-tree page. Any additional payload is contained on a linked list of overflow pages. This limit on the amount of payload held directly on b-tree pages guarantees that each b-tree page can hold at least 4 entries. In practice, most entries are smaller than 238 bytes and thus most pages can hold more than 4 entries.
A single database file can hold any number of separate, independent b-trees. Each b-tree is identified by its root page, which never changes. Child pages of the b-tree may change as entries are added and removed and pages split and combine. But the root page always stays the same. The b-tree itself does not record which pages are root pages and which are not. That information is handled entirely at the schema layer.
Page 1 begins with the following 48-byte string:
** This file contains an SQLite 2.1 database **
If you count the number of characters in the string above, you will see that there are only 47. A '\000' terminator byte is added to bring the total to 48.
A frequent question is why the string says version 2.1 when (as of this writing) we are up to version 2.7.0 of SQLite and any change to the second digit of the version is suppose to represent a database format change. The answer to this is that the B-tree layer has not changed any since version 2.1. There have been database format changes since version 2.1 but those changes have all been in the schema layer. Because the format of the b-tree layer is unchanged since version 2.1.0, the header string still says version 2.1.
After the format string is a 4-byte integer used to determine the byte-order of the database. The integer has a value of 0xdae37528. If this number is expressed as 0xda, 0xe3, 0x75, 0x28, then the database is in a big-endian format and all 16 and 32-bit integers elsewhere in the b-tree layer are also big-endian. If the number is expressed as 0x28, 0x75, 0xe3, and 0xda, then the database is in a little-endian format and all other multi-byte numbers in the b-tree layer are also little-endian. Prior to version 2.6.3, the SQLite engine was only able to read databases that used the same byte order as the processor they were running on. But beginning with 2.6.3, SQLite can read or write databases in any byte order.
After the byte-order code are six 4-byte integers. Each integer is in the byte order determined by the byte-order code. The first integer is the page number for the first page of the freelist. If there are no unused pages in the database, then this integer is 0. The second integer is the number of unused pages in the database. The last 4 integers are not used by the b-tree layer. These are the so-called "meta" values that are passed up to the schema layer and used there for configuration and format version information. All bytes of page 1 past beyond the meta-value integers are unused and are initialized to zero.
Here is a summary of the information contained on page 1 in the b-tree layer:
Conceptually, a b-tree page contains N database entries and N+1 pointers to other b-tree pages.
Ptr
0Entry
0Ptr
1Entry
1... Ptr
N-1Entry
N-1Ptr
N
The entries are arranged in increasing order. That is, the key to Entry 0 is less than the key to Entry 1, and the key to Entry 1 is less than the key of Entry 2, and so forth. The pointers point to pages containing additional entries that have keys in between the entries on either side. So Ptr 0 points to another b-tree page that contains entries that all have keys less than Key 0, and Ptr 1 points to a b-tree pages where all entries have keys greater than Key 0 but less than Key 1, and so forth.
Each b-tree page in SQLite consists of a header, zero or more "cells" each holding a single entry and pointer, and zero or more "free blocks" that represent unused space on the page.
The header on a b-tree page is the first 8 bytes of the page. The header contains the value of the right-most pointer (Ptr N) and the byte offset into the page of the first cell and the first free block. The pointer is a 32-bit value and the offsets are each 16-bit values. We have:
0 1 2 3 4 5 6 7 Ptr N Cell 0 Freeblock 0
The 1016 bytes of a b-tree page that come after the header contain cells and freeblocks. All 1016 bytes are covered by either a cell or a freeblock.
The cells are connected in a linked list. Cell 0 contains Ptr 0 and Entry 0. Bytes 4 and 5 of the header point to Cell 0. Cell 0 then points to Cell 1 which contains Ptr 1 and Entry 1. And so forth. Cells vary in size. Every cell has a 12-byte header and at least 4 bytes of payload space. Space is allocated to payload in increments of 4 bytes. Thus the minimum size of a cell is 16 bytes and up to 63 cells can fit on a single page. The size of a cell is always a multiple of 4 bytes. A cell can have up to 238 bytes of payload space. If the payload is more than 238 bytes, then an additional 4 byte page number is appended to the cell which is the page number of the first overflow page containing the additional payload. The maximum size of a cell is thus 254 bytes, meaning that a least 4 cells can fit into the 1016 bytes of space available on a b-tree page. An average cell is usually around 52 to 100 bytes in size with about 10 or 20 cells to a page.
The data layout of a cell looks like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 ... 249 250 251 252 253 Ptr Keysize
(low)Next Ksz
(hi)Dsz
(hi)Datasize
(low)Payload Overflow
Pointer
The first four bytes are the pointer. The size of the key is a 24-bit where the upper 8 bits are taken from byte 8 and the lower 16 bits are taken from bytes 4 and 5 (or bytes 5 and 4 on little-endian machines.) The size of the data is another 24-bit value where the upper 8 bits are taken from byte 9 and the lower 16 bits are taken from bytes 10 and 11 or 11 and 10, depending on the byte order. Bytes 6 and 7 are the offset to the next cell in the linked list of all cells on the current page. This offset is 0 for the last cell on the page.
The payload itself can be any number of bytes between 1 and 1048576. But space to hold the payload is allocated in 4-byte chunks up to 238 bytes. If the entry contains more than 238 bytes of payload, then additional payload data is stored on a linked list of overflow pages. A 4 byte page number is appended to the cell that contains the first page of this linked list.
Each overflow page begins with a 4-byte value which is the page number of the next overflow page in the list. This value is 0 for the last page in the list. The remaining 1020 bytes of the overflow page are available for storing payload. Note that a full page is allocated regardless of the number of overflow bytes stored. Thus, if the total payload for an entry is 239 bytes, the first 238 are stored in the cell and the overflow page stores just one byte.
The structure of an overflow page looks like this:
0 1 2 3 4 ... 1023 Next Page Overflow Data
All space on a b-tree page which is not used by the header or by cells is filled by freeblocks. Freeblocks, like cells, are variable in size. The size of a freeblock is at least 4 bytes and is always a multiple of 4 bytes. The first 4 bytes contain a header and the remaining bytes are unused. The structure of the freeblock is as follows:
0 1 2 3 4 ... 1015 Size Next Unused
Freeblocks are stored in a linked list in increasing order. That is to say, the first freeblock occurs at a lower index into the page than the second free block, and so forth. The first 2 bytes of the header are an integer which is the total number of bytes in the freeblock. The second 2 bytes are the index into the page of the next freeblock in the list. The last freeblock has a Next value of 0.
When a new b-tree is created in a database, the root page of the b-tree consist of a header and a single 1016 byte freeblock. As entries are added, space is carved off of that freeblock and used to make cells. When b-tree entries are deleted, the space used by their cells is converted into freeblocks. Adjacent freeblocks are merged, but the page can still become fragmented. The b-tree code will occasionally try to defragment the page by moving all cells to the beginning and constructing a single freeblock at the end to take up all remaining space.
When information is removed from an SQLite database such that one or more pages are no longer needed, those pages are added to a list of free pages so that they can be reused later when new information is added. This subsection describes the structure of this freelist.
The 32-bit integer beginning at byte-offset 52 in page 1 of the database contains the address of the first page in a linked list of free pages. If there are no free pages available, this integer has a value of 0. The 32-bit integer at byte-offset 56 in page 1 contains the number of free pages on the freelist.
The freelist contains a trunk and many branches. The trunk of the freelist is composed of overflow pages. That is to say, each page contains a single 32-bit integer at byte offset 0 which is the page number of the next page on the freelist trunk. The payload area of each trunk page is used to record pointers to branch pages. The first 32-bit integer in the payload area of a trunk page is the number of branch pages to follow (between 0 and 254) and each subsequent 32-bit integer is a page number for a branch page. The following diagram shows the structure of a trunk freelist page:
0 1 2 3 4 5 6 7 8 ... 1023 Next trunk page # of branch pages Page numbers for branch pages
It is important to note that only the pages on the trunk of the freelist contain pointers to other pages. The branch pages contain no data whatsoever. The fact that the branch pages are completely blank allows for an important optimization in the paging layer. When a branch page is removed from the freelist to be reused, it is not necessary to write the original content of that page into the rollback journal. The branch page contained no data to begin with, so there is no need to restore the page in the event of a rollback. Similarly, when a page is not longer needed and is added to the freelist as a branch page, it is not necessary to write the content of that page into the database file. Again, the page contains no real data so it is not necessary to record the content of that page. By reducing the amount of disk I/O required, these two optimizations allow some database operations to go four to six times faster than they would otherwise.
The schema layer implements an SQL database on top of one or more b-trees and keeps track of the root page numbers for all b-trees. Where the b-tree layer provides only unformatted data storage with a unique key, the schema layer allows each entry to contain multiple columns. The schema layer also allows indices and non-unique key values.
The schema layer implements two separate data storage abstractions: tables and indices. Each table and each index uses its own b-tree but they use the b-tree capabilities in different ways. For a table, the b-tree key is a unique 4-byte integer and the b-tree data is the content of the table row, encoded so that columns can be separately extracted. For indices, the b-tree key varies in size depending on the size of the fields being indexed and the b-tree data is empty.
Each row of an SQL table is stored in a single b-tree entry. The b-tree key is a 4-byte big-endian integer that is the ROWID or INTEGER PRIMARY KEY for that table row. The key is stored in a big-endian format so that keys will sort in numerical order using memcmp() function.
The content of a table row is stored in the data portion of the corresponding b-tree table. The content is encoded to allow individual columns of the row to be extracted as necessary. Assuming that the table has N columns, the content is encoded as N+1 offsets followed by N column values, as follows:
offset 0 offset 1 ... offset N-1 offset N value 0 value 1 ... value N-1
The offsets can be either 8-bit, 16-bit, or 24-bit integers depending on how much data is to be stored. If the total size of the content is less than 256 bytes then 8-bit offsets are used. If the total size of the b-tree data is less than 65536 then 16-bit offsets are used. 24-bit offsets are used otherwise. Offsets are always little-endian, which means that the least significant byte occurs first.
Data is stored as a nul-terminated string. Any empty string consists of just the nul terminator. A NULL value is an empty string with no nul-terminator. Thus a NULL value occupies zero bytes and an empty string occupies 1 byte.
Column values are stored in the order that they appear in the CREATE TABLE statement. The offsets at the beginning of the record contain the byte index of the corresponding column value. Thus, Offset 0 contains the byte index for Value 0, Offset 1 contains the byte offset of Value 1, and so forth. The number of bytes in a column value can always be found by subtracting offsets. This allows NULLs to be recovered from the record unabiguously.
Most columns are stored in the b-tree data as described above. The one exception is column that has type INTEGER PRIMARY KEY. INTEGER PRIMARY KEY columns correspond to the 4-byte b-tree key. When an SQL statement attempts to read the INTEGER PRIMARY KEY, the 4-byte b-tree key is read rather than information out of the b-tree data. But there is still an Offset associated with the INTEGER PRIMARY KEY, just like any other column. But the Value associated with that offset is always NULL.
SQL indices are implement using a b-tree in which the key is used but the data is always empty. The purpose of an index is to map one or more column values into the ROWID for the table entry that contains those column values.
Each b-tree in an index consists of one or more column values followed by a 4-byte ROWID. Each column value is nul-terminated (even NULL values) and begins with a single character that indicates the datatype for that column value. Only three datatypes are supported: NULL, Number, and Text. NULL values are encoded as the character 'a' followed by the nul terminator. Numbers are encoded as the character 'b' followed by a string that has been crafted so that sorting the string using memcmp() will sort the corresponding numbers in numerical order. (See the sqliteRealToSortable() function in util.c of the SQLite sources for additional information on this encoding.) Numbers are also nul-terminated. Text values consists of the character 'c' followed by a copy of the text string and a nul-terminator. These encoding rules result in NULLs being sorted first, followed by numerical values in numerical order, followed by text values in lexigraphical order.
The database schema is stored in the database in a special tabled named "sqlite_master" and which always has a root b-tree page number of 2. This table contains the original CREATE TABLE, CREATE INDEX, CREATE VIEW, and CREATE TRIGGER statements used to define the database to begin with. Whenever an SQLite database is opened, the sqlite_master table is scanned from beginning to end and all the original CREATE statements are played back through the parser in order to reconstruct an in-memory representation of the database schema for use in subsequent command parsing. For each CREATE TABLE and CREATE INDEX statement, the root page number for the corresponding b-tree is also recorded in the sqlite_master table so that SQLite will know where to look for the appropriate b-tree.
SQLite users can query the sqlite_master table just like any other table in the database. But the sqlite_master table cannot be directly written. The sqlite_master table is automatically updated in response to CREATE and DROP statements but it cannot be changed using INSERT, UPDATE, or DELETE statements as that would risk corrupting the database.
SQLite stores temporary tables and indices in a separate file from the main database file. The temporary table database file is the same structure as the main database file. The schema table for the temporary tables is stored on page 2 just as in the main database. But the schema table for the temporary database named "sqlite_temp_master" instead of "sqlite_master". Other than the name change, it works exactly the same.
The nine 32-bit integers that are stored beginning at byte offset 60 of Page 1 in the b-tree layer are passed up into the schema layer and used for versioning and configuration information. The meaning of the first four integers is shown below. The other five are currently unused.
The first meta-value, the schema version number, is used to detect when the schema of the database is changed by a CREATE or DROP statement. Recall that when a database is first opened the sqlite_master table is scanned and an internal representation of the tables, indices, views, and triggers for the database is built in memory. This internal representation is used for all subsequent SQL command parsing and execution. But what if another process were to change the schema by adding or removing a table, index, view, or trigger? If the original process were to continue using the old schema, it could potentially corrupt the database by writing to a table that no longer exists. To avoid this problem, the schema version number is changed whenever a CREATE or DROP statement is executed. Before each command is executed, the current schema version number for the database file is compared against the schema version number from when the sqlite_master table was last read. If those numbers are different, the internal schema representation is erased and the sqlite_master table is reread to reconstruct the internal schema representation. (Calls to sqlite_exec() generally return SQLITE_SCHEMA when this happens.)
The second meta-value is the schema format version number. This number tells what version of the schema layer should be used to interpret the file. There have been changes to the schema layer over time and this number is used to detect when an older database file is being processed by a newer version of the library. As of this writing (SQLite version 2.7.0) the current format version is "4".
The third meta-value is the recommended pager cache size as set by the DEFAULT_CACHE_SIZE pragma. If the value is positive it means that synchronous behavior is enable (via the DEFAULT_SYNCHRONOUS pragma) and if negative it means that synchronous behavior is disabled.
The fourth meta-value is safety level added in version 2.8.0. A value of 1 corresponds to a SYNCHRONOUS setting of OFF. In other words, SQLite does not pause to wait for journal data to reach the disk surface before overwriting pages of the database. A value of 2 corresponds to a SYNCHRONOUS setting of NORMAL. A value of 3 corresponds to a SYNCHRONOUS setting of FULL. If the value is 0, that means it has not been initialized so the default synchronous setting of NORMAL is used.