Measure Replication Delay with Percona Toolkit Heartbeat (pt-heartbeat)
Replication lag occurs when the slave cannot keep up with the changes being made on the master instance. The reasons for a lag can be network congestion, poor performance on the slave or service interruptions of the slave mechanism, leading to a large pile of work to be catched up by the slave. When you useSHOW SLAVE STATUS you can monitor the current delay, which is indicated in the Seconds_Behind_Master: 0column. Ideally, this value is zero, but the smallest granularity you get for this delay counter is one second.
Percona offers a tool that allows measuring the delay more accurately, by writing timestamps in the master database and calculate the difference when they arrive at the slave: [pt-heartbeat][1]
On the Master
Create a dedicated database schema called percona and add the following table.
CREATE TABLE IF NOT EXISTS `percona`.`heartbeat` (
ts varchar(26) NOT NULL,
server_id int unsigned NOT NULL PRIMARY KEY,
file varchar(255) DEFAULT NULL, -- SHOW MASTER STATUS
position bigint unsigned DEFAULT NULL, -- SHOW MASTER STATUS
relay_master_log_file varchar(255) DEFAULT NULL, -- SHOW SLAVE STATUS
exec_master_log_pos bigint unsigned DEFAULT NULL -- SHOW SLAVE STATUS
);```
Then you can launch the script on the master, where it will write new timestamps periodically as long as the script is running.
pt-heartbeat -D percona –update –utc h=127.0.0.1,u=root,p=SECRET,P=3306```
The -D flag specifies the database schema where the table for the heartbeat data is stored. The –update command is needed on the master for updating the table and the last part is the DSN, which specifies host address, user name and password. Very important is the UTC flag –utc, that ensures that the timestamps will be interpreted as UTC, regardless of the timezone setting.
On the Slave
Create a user for reading the replicated heartbeat table like this:
GRANT REPLICATION SLAVE,PROCESS,SELECT ON percona.* TO `percona_checksum`@'10.20.30.40' IDENTIFIED BY 'SECRET';```
Then you can run the script and point it to the slave. It will output precise delay counts in fractions of seconds
pt-heartbeat h=127.0.0.1,u=percona_checksum,p=SECRET,P=3306 -D percona –monitor –utc –master-server-id 1```
Notice the different DSN, the –monitor flag and the master-server id, which needs to be the one of your master of course. You need this because the tool supports hierarchies of masters and therefore you would need to know which one is to be considered.
Results
The results will look similar to this
0.09s [ 0.00s, 0.00s, 0.00s ]
0.02s [ 0.20s, 0.00s, 0.00s ]
0.09s [ 0.00s, 0.00s, 0.00s ]
0.03s [ 0.02s, 0.00s, 0.00s ]
0.09s [ 0.01s, 0.00s, 0.00s ]
0.09s [ 0.01s, 0.00s, 0.00s ]
0.09s [ 0.01s, 0.00s, 0.00s ]
0.08s [ 0.01s, 0.00s, 0.00s ]
0.08s [ 0.01s, 0.00s, 0.00s ]
0.10s [ 0.01s, 0.00s, 0.00s ]
0.12s [ 0.02s, 0.00s, 0.00s ]
0.08s [ 0.01s, 0.00s, 0.00s ]
0.11s [ 0.02s, 0.00s, 0.00s ]
0.08s [ 0.02s, 0.00s, 0.00s ]
0.09s [ 0.02s, 0.00s, 0.00s ]
0.08s [ 0.02s, 0.00s, 0.00s ]
0.08s [ 0.03s, 0.01s, 0.00s ]```
The output is the current delay followed by moving averages over 1m, 5m and 15m, as you might know from your favorite OS already.
Have a look at the [official documentation][1], there is a lot of options available.
<div class="twttr_buttons">
<div class="twttr_twitter">
<a href="http://twitter.com/share?text=Measure+Replication+Delay+with+Percona+Toolkit+Heartbeat+%28pt-heartbeat%29" class="twitter-share-button" data-via="" data-hashtags="" data-size="default" data-url="https://blog.stefanproell.at/2018/12/06/measure-replication-delay-with-percona-toolkit-heartbeat-pt-heartbeat/" data-related="" target="_blank">Tweet</a>
</div>
<div class="twttr_followme">
<a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default" data-show-screen-name="false" target="_blank">Follow me</a>
</div>
</div>
[1]: https://www.percona.com/doc/percona-toolkit/LATEST/pt-heartbeat.html