Why does the Slave report LOST_EVENTS and what to do?
Discussion
During MySQL Cluster geographical replication, the Slave SQL Node may report a LOST_EVENTS incident. This happens when the Master SQL Node has had a problem and writes the incident to make sure that the Slave stops replicating. Without it, the Slave could silently go out of sync.
As an example, suppose a MySQL server responsible for logging all updates executed in MySQL Cluster is restarted. For some period it is unable to log any changes made in the cluster. Changes made through another SQL or API node during that period will not be in its binary log and consequently will not be replicated. Therefore, a LOST_EVENTS incident is written to the binary log and handled as an error by the Slave. Replication stops because data might be missing from the logs.
Below is an example of how the error is reported by the Slave (edited for clarity):
    SHOW SLAVE STATUS \G
    ..
    Last_Errno: 1590
    Last_Error: The incident LOST_EVENTS occured on the master. Message: mysqld startup
    ..
There are two cases in which the Master writes a LOST_EVENTS incident to its binary log. The Last_Error reported by the Slave carries one of the following messages:
- mysqld startup: Master was started.
- cluster disconnect: Master lost connection to its data nodes.
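When the Slave status is collected by a monitoring script, the message can be pulled out of Last_Error and mapped to the two causes listed above. The following is a minimal sketch, not part of MySQL itself: the names `INCIDENT_RE`, `CAUSES` and `explain_incident` are hypothetical, and the pattern is an assumption based only on the sample error text shown in this article.

```python
import re

# Pattern for the Slave's Last_Error text shown above. The misspelling
# "occured" is accepted because that is how the sample output prints it.
INCIDENT_RE = re.compile(
    r"The incident (?P<incident>\w+) occurr?ed on the master\.\s*"
    r"Message: (?P<message>.+)"
)

# The two messages named in this article, with the cause given for each.
CAUSES = {
    "mysqld startup": "Master was started.",
    "cluster disconnect": "Master lost connection to its data nodes.",
}


def explain_incident(last_error):
    """Return (incident, message, cause), or None if the text does not match."""
    match = INCIDENT_RE.search(last_error)
    if match is None:
        return None
    message = match.group("message").strip()
    return match.group("incident"), message, CAUSES.get(message, "unknown")


if __name__ == "__main__":
    text = "The incident LOST_EVENTS occured on the master. Message: mysqld startup"
    print(explain_incident(text))
```

A message that is not one of the two listed here is reported with the cause "unknown" rather than guessed at.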
Solution
If the Slave reports the LOST_EVENTS incident, the administrator will probably have to do one of two things: switch to a backup replication channel, using the other SQL Node on the Master side that was not affected by the restart; or re-initialize the Slave MySQL Cluster, because the two clusters can no longer be guaranteed to be in sync.
When the Slave reports the message cluster disconnect, check what happened on the Master. If the Master Cluster's data nodes were shut down and could not take any updates, you can safely skip the error on the Slave (see below). If instead the Master's SQL Node had network problems and was disconnected from the data nodes, you will probably have to take one of the two steps above.
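The decision described above can be written down as a small lookup, which may help when documenting a runbook. This is only a sketch of the reasoning in this article: `recommend_action`, `SKIP` and `RECOVER` are hypothetical names, and whether the data nodes were shut down is an input the administrator has to establish by checking the Master.

```python
SKIP = "skip the incident on the Slave"
RECOVER = "switch to the backup replication channel or re-initialize the Slave"


def recommend_action(message, data_nodes_were_shut_down=False):
    """Map a LOST_EVENTS message to the action described in the text.

    data_nodes_were_shut_down must be established by checking the Master;
    it cannot be read from the Slave's error message.
    """
    if message == "mysqld startup":
        return RECOVER
    if message == "cluster disconnect":
        # Only safe to skip when the Master Cluster could not take updates.
        return SKIP if data_nodes_were_shut_down else RECOVER
    raise ValueError("unexpected LOST_EVENTS message: %r" % message)


if __name__ == "__main__":
    print(recommend_action("cluster disconnect", data_nodes_were_shut_down=True))
```

The default of `False` is deliberate: without evidence that the data nodes were down, the sketch falls back to the more conservative recovery path.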
To skip the error on the Slave, when you are sure it is OK to ignore it (e.g. during tests), you can use the following commands on the Slave:
    SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
    START SLAVE SQL_THREAD;