Restoring MySQL Master-Master Replication on Production Dashboard VMs

Introduction

MySQL is setup in Master-Master replication technique on prod-newdboard01.geant.net, and prod-newdboard02.geant.net. Sometimes, though very rarely, the replication might break. This note provides solutions to apply when such a situation occurs.

Procedure

A Nagios check setup on gnmonitoring.geant.net checks for the replication status periodically. If it fails, then it sends an email to Richard Sanigar and Trupti Kulkarni. Log in to Nagios on gnmonitoring and find out the server on which replication has broken down; it will be one of prod-newdboard01.geant.net, or prod-newdboard02.geant.net

Method 1 - Use this method in the first instance; if it doesn't work, then it means that replication has broken down completely and needs to be setup from scratch, in which case move on to Method 2 below.

Login to the offending Dashboard server on which replication has broken down using your GEANT credentials
Login to MySQL as root user.

sudo mysql-as-root

Please act with extra caution as this is a production server!
Run the following command to check slave status:

SHOW SLAVE STATUS \G

Either or both of Slave_IO_Running and Slave_SQL_Running will be 'No', and Seconds_Behind_Master will be NULL. Time to act.

Run the following commands in order in the MySQL shell:

STOP SLAVE;

FLUSH LOGS;

RESET SLAVE;

START SLAVE;

After this, run the command to see slave status again to see if it has helped:

SHOW SLAVE STATUS \G

Slave_IO_Running or Slave_SQL_Running should display 'Yes', and Seconds_Behind_Master should come down to 0 (after a short while, if not immediately).

If yes, then congratulations, you have restored the replication!

If no, then proceed to Method 2.

Method 2 - Use this method when Method 1 doesn't help, and replication is still broken.

Run steps 1-4 as in Method 1 above.
Prepare to restore replication from scratch on the offending server.
Login to the non-offending Dashboard VM using your GEANT credentials and take a fresh database dump as follows:

mysqldump -u newdashboardBiz -p --opt alarms > /tmp/alarms.sql

When prompted, enter password, which can be found in Systems KeePass on \\chfile01\Systems

Transfer this file (using scp) to the offending Dashboard VM (on which replication has broken) on /tmp directory.

Login to MySQL as root user on the non-offending Dashboard VM:

sudo mysql-as-root

And run the following command:

SHOW MASTER STATUS;

Make a note of the File and Position values. Quit the MySQL shell.

Login to the offending Dashboard VM using your GEANT credentials.
Login to MySQL as root:

sudo mysql-as-root

Stop the slave in order to restore replication:

STOP SLAVE;

Quit the MySQL shell.

Go the /tmp directory where you have placed the database dump received/copied from non-offending Dashboard VM, as in step 3 in this section above.
Copy the sql file to MySQL on this (offending) VM, as follows:

mysql -u newdashboardBiz -p alarms < /tmp/alarms.sql

Then, once again login to MySQL as root user as follows:

sudo mysql-as-root

And restore replication using the following commands:

CHANGE MASTER TO MASTER_HOST='host of partner (non-offending) server', MASTER_USER='replicator', MASTER_PASSWORD='<please check Systems KeePass>', MASTER_LOG_FILE='<partner's log file name>', MASTER_LOG_POS=<partner's log file position>;

where 'partner's log file name' and 'partner's log file position' are the values recorded in step 4 in this section above.

Now, start the slave:

START SLAVE;

Check if replication has been setup successfully:

SHOW SLAVE STATUS;

Both Slave_IO_Running or Slave_SQL_Running should display 'Yes', and Seconds_Behind_Master should come down to 0 (after a short while, if not immediately).

That’s all! You have successfully restored replication on production Dashboard VMs! If you still see an error, please check mysqld log file and investigate.

Page tree

Restoring MySQL Master-Master Replication on Production Dashboard VMs

Introduction

Procedure

Method 1 - Use this method in the first instance; if it doesn't work, then it means that replication has broken down completely and needs to be setup from scratch, in which case move on to Method 2 below.

Method 2 - Use this method when Method 1 doesn't help, and replication is still broken.