Read Only Friday - The MySQL Crash

Fridays in the summer are nice for us. The students are gone. We can usually keep changes to a minimum. Best of all, everyone goes home for the day at noon. Not this guy. Not today.... When I finally forced myself to sit up and check my phone, dread set in.

Website: Critical - Nagios (30)

Apparently Nagios had been trying to get my attention for a couple hours. I recently bought a new phone that has a "Do Not Disturb" option. We don't pay overtime, or on-call time, or recall time at work, so I now leave the do not disturb mode on from 12a - 8a on weekdays. It is going to be a fun day. "No messages from the boss or co-workers... maybe Nagios is just confused." I check the homepage on my phone. Nope. "Error establishing connection to database."

My girlfriend gets out of bed to use the bathroom. "I was just about to do that" I think to myself. I sigh and angrily bring up the ssh client on my phone. The db server I need to get to is not in the dmz. I have to tunnel through some public facing servers in order to finally land at the command prompt for the db server. By the time my girlfriend returns from her quest to the bathroom, I have just barely managed to wearily log in to my database server. She wants attention and grabs at me as she climbs back into bed. I brush her off in my frustration and head to the bathroom. My phone let's me know that the first helpdesk ticket about this has been assigned to me.

On the toilet, phone still in hand I take a guess at what is going on. Years ago we built a bunch of servers with 10GB hard drives. We figured we could just add disk and mount/expand as needed. So our servers capping out is somewhat common.

root@dbserv-001> df -h
/                   /dev/sda1   100% use
/var/lib/mysqld.    /dev/sda2.  20% use
root@dbserv-001> yum clean all
root@dbserv-001> /etc/init.d/mysqld reatart

Cleaning the yum cache should give me time to finish my business here in the bathroom, get dressed, and drive over there. I respond to the ticket, "database server hdd full." With things working again... and my morning bathroom routine dealt with, I return to the bedroom to get dressed and collect my things. The girlfriend is up and wants attention. Having collected all of my things, I feign leaving before turning around to give her the goodbye kisses she wants. After driving to campus, I check the website to ensure everything is up as I walk from my car to my office. I stop by the other system administrator's office and give the short version of the solution and the issue.

With the formalities out of the way, I sit at my pc, glaring at the terminal with the ssh session to my database server. "I am going to have to take it offline to expand the root volume anyway... may as well do those upgrades that we have been putting off" Those upgrades include moving from Oracle Linux 6.7 to 7.1 and from MySQL 5.1 to 5.6. The OS upgrade can't happen in place.

After cloning a new Oracle Linux 7.1 VM from template, I decide to give this server a 40GB hdd and let it all sit on the root volume. This means I will have to boot the system from disk and go through the xfs expansion process. I can never remember the lvm commands. Once I'd found the vgcreate and lvcreate commands and got the lvm volumes all setup, I had to figure out the xfs_grow command. Xfs is new for us. Historically, we ran normal ext4/3 partitions. With the volume expanded, I bring up the vm.

After networking it and installing OS patches I set to work trying to install MariaDB from yum. It will be a big move, going from MySQL to MariaDB... They are supposed to be compatible and oracle is leaving MariaDB in its own repositories despite owning MySQL. Yum install mariadb mariadb-server fails because of some mariadb-commons-libs dependency.  After trying a couple things, I enable the MySQL-56 repo in /etc/yum.repos.d/ MySQL-5.6 installs flawlessly. I get it started and tell the web services team that they will have to stop making changes while I do a migration. I ask them to let me know when they have saved their changes.

Once they sign off,  a simple "mysqldump -u root -p --all-databases" on the old server and a "mysql -u root < mysqldump.sql" on the new server and we are just about done. I shut down the old server and reconfigure the new one with the ip address of the old one. One more restart of mysqld on the new server and everything should be working.. .. things are going pretty well. The site loads fine. I tell the web team they can get back to work.

The Problems Start

It only takes a few minutes for the Web team to report back, "a page has disappeared." Maybe it is a one off error. I restore the dev version of the page the coworker was editing from the live production version.... It happens again. This is old CMS software that was custom built. No one has been maintaining it since I got moved off the web team years ago. There is a slow push to migrate from this old CMS to Wordpress. I mention to the web team, "maybe you will have to migrate that page to Wordpress." I turn on query logging on the database and start debugging. I try to edit the disappearing page after restoring it. It disappears. I try manually running the "insert" command that the CMS generated through the database. I get an error about a field having no default value. Apparently a "Text" field got imported with no default value but also as "NOT NULL". After I modify the field in the 4 tables it exists in to allow it to be null, everything starts to work. An office politics email comes in, "We need to fix legacy authentication on the old active directory Domain and the website." None of this is really my responsibility any more... I am a programmer for our ERP system..

I am having trouble authenticating to the new MySQL database. We still have MySQL 4.x style passwords laying around. After figuring out how to add new passwords in the correct style, and deleting the old ones, I feel good. At least we can close that ticket from 8 years ago in the old ticketing system that we decommissioned 5 years ago... After fixing some unrelated template bugs, I set to work fixing the online directory listing update tool. It needs aN sql query rewritten in php. It was using the legacy mysqli libraries and fields we're not properly escaped. It takes about 20 minutes to rewrite the thing as a parametrized PDO query.... I am a little rusty in my PHP -> MySQL these days... Having cleared out about 10 long standing issues in the morning I glare at the clock 2:00pm.

I decide I may as well deal with the authentication issue. After tracing the relevant code bases for about an hour, I make changes in two files and test. I make one more change to fix a regression in username handling and I am done. That was code that I wrote as a systems administrator for the web developer at the time. The missing parts were just the few entry points I had to modify today.... I have a coworker test an issue that has been plaguing him for years. It was an authentication issue unrelated to the changes I made. The changes just sidestepped the issue. Everything works. I send an email out stating as much and leave.... a month worth of work done in a day...  

Submitted by daniel on Sun, 08/09/2015 - 22:26