The investigation of a site crash

When I woke up this morning I noticed my site had crashed with the following error:

Error establishing a database connection.

Awesome. What a helpful error message. WordPress is basically saying:

shrug

Thanks, WordPress. I ssh’d into the server to see if I could figure out what was wrong myself. Since the error complained about a database connection, my first step was to check whether MySQL was still listening:

$ sudo netstat -tap | grep mysql

Nothing. Ruh Roh! Time to attempt a restart…

$ sudo /etc/init.d/mysql restart

… and the site came back up. Phew! That was easy.

But why did it crash in the first place?

Since MySQL was the culprit, it made sense to check the MySQL logs first. I opened those up and found nothing. On Google's recommendation I then searched the syslog, specifically for memory:

$ sudo grep memory /var/log/syslog
  Aug 19 10:56:12 localhost kernel: [10664646.817182]  [<ffffffff811429d4>] out_of_memory+0x414/0x450
  Aug 19 10:56:12 localhost kernel: [10664646.819979] Out of memory: Kill process 4803 (mysqld) score 104 or sacrifice child
  Aug 19 10:56:12 localhost kernel: [10664646.831686]  [<ffffffff811429d4>] out_of_memory+0x414/0x450
  Aug 19 10:56:12 localhost kernel: [10664646.833365] Out of memory: Kill process 4826 (mysqld) score 104 or sacrifice child

Aha! It looks like the server ran out of memory and the kernel's OOM killer picked mysqld as its victim. Okay, now on to the next question… Why?
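As an aside, grepping for "memory" also pulls in stack-trace noise like the out_of_memory+0x414 lines. A rough sketch of a narrower search that only counts the kill lines per process is below. The function name oom_victims is made up for this post, and the default log path assumes a Debian/Ubuntu-style syslog:

```shell
# Count OOM-killer victims by process name in a syslog-style file.
# oom_victims is a hypothetical helper; the default path is an assumption,
# so pass another file as the first argument if your logs live elsewhere.
oom_victims() {
    local logfile="${1:-/var/log/syslog}"
    # Match lines such as:
    #   Out of memory: Kill process 4803 (mysqld) score 104 or sacrifice child
    # then keep only the process name inside the parentheses.
    grep -E 'Out of memory: Kill process [0-9]+ \(' "$logfile" \
        | sed -E 's/.*Kill process [0-9]+ \(([^)]+)\).*/\1/' \
        | sort | uniq -c | sort -rn
}
```

Running `sudo` with a shell function is awkward, so in practice I would paste the pipeline into a root shell or point the function at a copy of the log that my user can read.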

Well… it ran out of memory, that’s why. (Duh.) One way to alleviate this is to create a swap file, which it turns out I forgot to do when I originally configured this server. Without a swap file the kernel had nowhere to page memory out to when RAM filled up, so it killed MySQL instead. I created and enabled a swap file:

$ sudo dd if=/dev/zero of=/swapfile bs=1024 count=1024k
$ sudo mkswap /swapfile
    Setting up swapspace version 1, size = 262140 KiB
    no label, UUID=XXXXX
$ sudo swapon /swapfile

Since creating the swap file, everything has been running great (so far).
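One caveat: swapon on its own only lasts until the next reboot. To make the swap file permanent it needs an entry in /etc/fstab. Here is a minimal sketch of how that could be scripted without adding the line twice. The helper name ensure_swap_entry is hypothetical, and the second argument exists only so the function can be tried against a scratch file first:

```shell
# Append a swap entry to an fstab-style file unless one already exists.
# ensure_swap_entry is a hypothetical helper name, not a standard tool.
ensure_swap_entry() {
    local swapfile="$1"
    local fstab="${2:-/etc/fstab}"
    # Skip if some line already names this swap file in its first field
    if awk -v f="$swapfile" '$1 == f { found = 1 } END { exit !found }' "$fstab"; then
        return 0
    fi
    # Fields: device, mount point, type, options, dump, pass
    printf '%s none swap sw 0 0\n' "$swapfile" >> "$fstab"
}

# Typical use as root (a swap file should also not be world-readable):
#   chmod 600 /swapfile
#   ensure_swap_entry /swapfile
```

Afterwards, `swapon -s` or `free -m` should list the swap space, both right away and after a reboot.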

Next Steps

Had I not been a narcissist and checked my own website, I probably wouldn’t have noticed it was down for hours, perhaps even days. My next step is to look into monitoring software: something that alerts me when there’s a problem, or even flags a potential problem before it happens. One tool I have found that does just that is Nagios, or its fork, Icinga.
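Until a proper monitoring tool is in place, a stopgap could be a small check run from cron. The sketch below is only that, a sketch: site_looks_healthy is a name I made up, and the URL and mail address in the usage comment are placeholders. It reads a page body on stdin and fails if the body is empty or contains the WordPress database error:

```shell
# Succeed only if the page body on stdin looks like a working site.
# site_looks_healthy is a hypothetical helper name.
site_looks_healthy() {
    local body
    body=$(cat)
    # An empty body usually means the fetch itself failed
    [ -n "$body" ] || return 1
    case "$body" in
        *'Error establishing a database connection'*) return 1 ;;
    esac
    return 0
}

# Example use from cron (URL and address are placeholders, and this
# assumes curl and a working mail command are installed):
#   curl -sS --max-time 10 https://example.com/ | site_looks_healthy \
#       || echo "site check failed" | mail -s "Site down?" you@example.com
```

This only catches the exact failure from this morning, which is why something like Nagios or Icinga is still the real next step.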