Relational database: MySQL is a relational database. In a relational database, all data is organized into tables with rows and columns. The first row can be thought of as the header. As in an everyday table, the header describes what information is present for each row; it forms the schema of the table. Each row is a data point; each column is a property of that data point.
SQL language: With the SQL language, you can extract the rows/columns you want. It's like applying filters to the data, but it's more powerful than that. More advanced SQL usage includes GROUP BY: you can group the rows by a certain property, and rows with the same property value are put into the same group. Aggregate functions (MAX, MIN, AVG) can then be computed within each group. In MySQL, you can also combine the results of multiple SELECT statements into a single query with UNION.
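For example, a minimal sketch (the `orders` and `refunds` tables and their columns are made up for illustration):

```sql
-- Group rows by customer; compute aggregates within each group.
SELECT customer_id,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount,
       MAX(amount) AS max_amount
FROM orders
GROUP BY customer_id;

-- Combine two SELECT result sets into one with UNION
-- (UNION removes duplicate rows; UNION ALL keeps them).
SELECT customer_id FROM orders WHERE amount > 100
UNION
SELECT customer_id FROM refunds;
```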
Data modeling: You may need multiple related tables to describe your data. One motivation is to reduce duplication (think of 3NF, Third Normal Form). A row in one table can reference one or more rows of another table through a foreign key. Different tables can be joined to form a larger table that contains all the properties of the original tables.
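A minimal sketch, with made-up `customers`/`orders` tables:

```sql
CREATE TABLE customers (
  id   BIGINT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
  id          BIGINT PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  amount      DECIMAL(10, 2) NOT NULL,
  -- Each order references exactly one customer; the customer's name
  -- is stored once instead of being duplicated on every order row.
  FOREIGN KEY (customer_id) REFERENCES customers (id)
);

-- Join the two tables back into one wider result set.
SELECT o.id, c.name, o.amount
FROM orders o
JOIN customers c ON c.id = o.customer_id;
```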
InnoDB maintains a B+ tree on disk where each tree node is a page. At the filesystem level, these pages are stored in InnoDB data files (.ibd files).
What's a B+ tree? It's a balanced search tree where internal nodes store only keys (for navigation) and all records live in the leaf nodes, which are linked together so range scans can walk the leaves sequentially.
Redo log: The redo log records changes to pages that have not yet been flushed to the data files on disk (dirty pages). It's critical for crash recovery.
One thing to remember is that the state of the database is fully represented by what's on the disk. Things in memory do not count; they can be lost at any time. Memory is for speed optimization and disk is for persistence. There's no way around a WAL (redo log).
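As an illustration of this trade-off, the timing of redo log syncs is controlled by `innodb_flush_log_at_trx_commit` (a real InnoDB setting; the value semantics below are the documented ones):

```sql
-- 1 (default): write and fsync the redo log at every commit (full durability).
-- 2: write at commit, fsync roughly once per second (an OS crash can lose ~1s).
-- 0: write and fsync roughly once per second (a mysqld crash can lose ~1s).
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```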
MVCC: InnoDB implements MVCC. MVCC allows data to be accessed concurrently without locking the entire table. Each row has hidden columns recording its version information, which readers use to reconstruct the version they are allowed to see. Old versions that are no longer referenced by any transaction are cleaned up by a background thread.
(Note: While InnoDB doesn't take this approach, some optimization/implementation of MVCC for the B+ tree involves using copy-on-write for data pages. When a page is modified, it's not modified in place; a new page is created, updating all parenting pages up until the root. This forms a new B+ tree where most of the pages are reused. Each write creates a new root, representing a consistent snapshot of the database.)
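A small sketch of the observable behavior, assuming a made-up `accounts` table with a row `id = 1` and the default REPEATABLE READ isolation level (the "session" labels are just comments):

```sql
-- Session A
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;  -- sees, say, 100

-- Session B (runs concurrently; not blocked by A's read)
UPDATE accounts SET balance = 50 WHERE id = 1;
COMMIT;

-- Session A, still inside the same transaction
SELECT balance FROM accounts WHERE id = 1;  -- still sees 100 (its snapshot)
COMMIT;

-- Session A, new transaction
SELECT balance FROM accounts WHERE id = 1;  -- now sees 50
```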
Buffer pool: The InnoDB engine uses a buffer pool to cache pages in memory. To avoid double caching (caching the same page in both user space and the kernel page cache), O_DIRECT can be used.
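For reference, a sketch of how to check this (note that `innodb_flush_method` is not dynamic, so it has to be set at server startup, e.g. in my.cnf, rather than with SET GLOBAL):

```sql
-- Check whether the server was started with O_DIRECT.
SHOW VARIABLES LIKE 'innodb_flush_method';
-- Check how much memory is dedicated to the buffer pool.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```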
B+ tree size
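A common back-of-envelope estimate (the exact numbers depend on key and row sizes, so treat these as illustrative assumptions): with the default 16 KB page size and a BIGINT primary key, an internal page holds roughly 16 KB / (8-byte key + 6-byte child pointer) ≈ 1000+ child pointers, and a leaf page holds around 15 rows if a row is about 1 KB. A three-level tree can therefore index on the order of 1000 × 1000 × 15 ≈ tens of millions of rows, so a primary key lookup touches only a handful of pages.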
GTID: A GTID is a globally unique identifier for a transaction. It has the format source_id:transaction_id, where source_id is normally the originating server's server_uuid and the transaction id is a sequence number assigned in commit order.
Each server records the set of GTIDs that have been executed. If a GTID transaction has been applied, any further transaction with the same GTID is ignored. GTID is useful for checking the replication progress of a follower.
Traditionally (before MySQL 5.6), MySQL replication was position-based: the follower tracked a binlog file name and an offset within it. GTID makes the process easier. With auto-positioning, a follower doesn't have to know the binlog file and position; it only needs to know the set of GTIDs it has already executed.
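A sketch of configuring a follower with GTID auto-positioning (MySQL 8.0 syntax; the host and credentials are placeholders):

```sql
-- Requires gtid_mode = ON (and enforce_gtid_consistency) on both servers.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'leader.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;   -- no binlog file/position needed
START REPLICA;

-- The executed GTID set can be inspected on either server.
SELECT @@GLOBAL.gtid_executed;
```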
Binlog records the data changes to the database (e.g. write transactions). That sounds like the redo log? The difference is that the binlog tracks logical changes at the transaction level, while the redo log tracks physical changes at the page level.
Binlog is useful for point-in-time recovery. You don't have to enable the binlog if there's no follower attached and you don't need point-in-time recovery.
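For reference, a couple of statements for inspecting the binlog (the file name below is a placeholder):

```sql
SHOW BINARY LOGS;                                 -- list binlog files on the server
SHOW BINLOG EVENTS IN 'binlog.000042' LIMIT 10;   -- peek at the recorded events
```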
The leader runs a dump thread that reads from the binlog and sends it to the follower. The follower has an IO thread for receiving the binlog and stores it as the relay log. The SQL thread on the follower reads the relay log and applies the transactions.
The bottleneck is usually not the CPU but I/O (how fast the transactions can be applied). MySQL is multi-threaded: a leader can handle requests from many concurrent connections, but a follower applying the relay log may not be able to keep up, which shows up as replication lag.
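A sketch of how lag is typically observed on the follower (MySQL 8.0 naming; `\G` is the mysql client's vertical-output terminator):

```sql
-- Replica_IO_Running / Replica_SQL_Running show the two threads' states;
-- Seconds_Behind_Source is a rough measure of replication lag.
SHOW REPLICA STATUS\G
```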
| Step | Description | redo log | binlog | relay log |
|---|---|---|---|---|
| 1 | The client sends a transaction to the MySQL leader. | | | |
| 2 | The MySQL leader modifies the in-memory state. | | | |
| 3 | The MySQL leader writes the transaction to the InnoDB redo log buffer. The redo log buffer is periodically synced to disk. If the server/OS crashes now, the transaction may or may not be persisted to disk; if persisted, it is an uncommitted transaction that will be rolled back during crash recovery. | txn? | | |
| 4 | The MySQL leader writes the transaction to the binlog buffer. If the server/OS crashes now, the transaction may or may not be persisted to disk. | txn? | txn? | |
| 5 | The MySQL leader's dump thread sends the transaction to the MySQL follower. The IO thread on the follower writes it to the relay log. | txn? | txn? | txn |
| 6 | Once the follower acknowledges that the transaction has been persisted on the follower, the leader commits the transaction by writing a COMMIT record to the redo log and binlog. (Waiting for this acknowledgment is the semi-synchronous replication case; with asynchronous replication the leader commits without waiting.) | txn, commit | txn, commit | txn |
| 7 | The leader acknowledges to the client that the transaction has been committed. | | | |
A leader may group multiple commits and flush them to disk in one operation. In this case, they are written and synced to the binlog together (a group commit), and the follower can apply those transactions in parallel.
replica_preserve_commit_order ensures that transactions are committed on the follower in the same order as on the leader. Note that the scheduling (i.e. the start time) of each transaction can still be out of order.
binlog_group_commit_sync_delay introduces a delay before commits are synced on the leader so that more transactions fall into the same group and can be applied in parallel on the followers. This is meant to reduce replication lag, at the cost of slightly higher commit latency on the leader.
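A sketch of the related settings on a MySQL 8.0 follower and leader (the variable names are the real ones; the values are illustrative):

```sql
-- On the follower: apply transactions with multiple worker threads,
-- but commit them in the leader's order.
-- (Changing these typically requires stopping the applier first.)
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA SQL_THREAD;

-- On the leader: wait a little (microseconds) before syncing the binlog
-- so more transactions share a group commit.
SET GLOBAL binlog_group_commit_sync_delay = 1000;
```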
Backups are not only useful for protecting data but also for duplicating it to create a new follower.
How Xtrabackup works: it copies the InnoDB data files (.ibd files) while continuously copying the redo log, and then performs crash recovery on the copy. The data files are large, so different files are copied at different times; the redo log captures all data file changes made since those times.
The redo log has to be tracked from the moment the first data file starts copying until the last data file finishes, which can take hours. After crash recovery, the state of the backup corresponds to the end of the copy, not the start.
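As a rough sketch of the workflow (Percona XtraBackup CLI; the paths are placeholders):

```bash
# Copy the data files and the redo log stream into a backup directory.
xtrabackup --backup --target-dir=/backups/base

# Run crash recovery on the copy so it is consistent as of the end of the copy.
xtrabackup --prepare --target-dir=/backups/base
```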
pt-table-checksum (PTTC) checks that the data on the leader is the same as on the followers. It's useful for detecting data corruption.
It works on one table at a time. It divides the table into chunks of rows and calculates a checksum for each chunk. The checksum is computed and inserted into a checksum table by a REPLACE ... SELECT query. The key is that the query is replicated statement-based, so the follower re-runs the same computation against its own data. If the data is inconsistent, the checksums will differ.
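The mechanism can be sketched with plain SQL. This is a simplified illustration of the idea, not pt-table-checksum's actual queries; the `checksums` table and the `orders` columns are made up:

```sql
-- Hypothetical, simplified checksum table (the real tool's table has more columns).
CREATE TABLE IF NOT EXISTS checksums (
  db       VARCHAR(64) NOT NULL,
  tbl      VARCHAR(64) NOT NULL,
  chunk    INT         NOT NULL,
  this_cnt INT,
  this_crc CHAR(40),
  PRIMARY KEY (db, tbl, chunk)
);

-- Checksum one chunk of rows. Because this statement is replicated
-- statement-based, the follower re-executes the SELECT against its own data;
-- a corrupted follower produces a different this_crc for the same chunk.
REPLACE INTO checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'mydb', 'orders', 1, COUNT(*),
       LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id, amount, status)) AS UNSIGNED)), 10, 16))
FROM mydb.orders
WHERE id BETWEEN 1 AND 10000;  -- chunk boundaries
```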