In case some of you weren’t aware, I was /.ed last night. For almost two hours after the posting, my site was unavailable, intermittently available, or very slow. After 24 hours this post constitutes a brief analysis of what went wrong and remedial steps I took — while the oncoming hordes were banging at the gate — to get my server back up and running. I also note some lessons learned for those of you wanting to drive lots of traffic to your sites, get popular fast, or just experience the Slashdot effect for yourself. I also note some implications this experience has for Bad Behavior 2 development.
First a few raw numbers: In the first 24 hours, Apache recorded 30,787 visits from slashdot.org, 4,882 more visits from people with their referrer blocked, and 5,582 visits from other places. Of those hits, only 45 came prior to the post going live. (Subscribers to /. can see articles shortly prior to their publication time.) I had 27 minutes from the first referral from /. at 1:24 am to the time the post went live there at 1:51 am. (All times are UTC.)
Of those pageviews, 544 came between 1:51 and 2:01. There were 634 pageviews between 2:01 and 2:11, 658 between 2:11 and 2:21, and so on. Clearly the server should have been able to handle much more than one pageview per second. And this only counts pageviews served successfully; it doesn’t count images, CSS, MP3 files, etc. It also doesn’t count the innumerable requests dropped on the floor, or people who just gave up waiting.
The server started showing signs of trouble fast. By 1:55 am the load average had passed 35. By 2:00 it had passed 50. At one point I saw the load average as high as 112, and I was over 500MB into swap on a box with 1GB of RAM. I noted that accesses were going VERY slowly and realized that neither Apache nor MySQL had had much performance tuning, and could do a lot better than this.
Unfortunately, I am an idiot, and neglected to take any measures BEFORE the barbarians showed up at the gate! I’m doubly an idiot, because I called them by submitting the story to /. in the first place! So, here’s what went wrong, how I made it right, and actually managed to get the box serving requests again.
First off, the server platform is Fedora Core 5. It includes Apache 2.2.0, MySQL 5.0.21 and PHP 5.1.4. What can I say, I like the bleeding edge. Anyway, take the distro wars elsewhere. The point is, Apache and MySQL on this platform aren’t particularly well tuned for high performance, high traffic sites. So off to Google I went to try to get things under control.
I quickly determined that MySQL was spending way too much time creating threads to serve incoming requests. It also wasn’t dropping old connections quickly enough. So I set the following two variables:
set global thread_cache_size = 150;
set global wait_timeout = 10;
That got MySQL behaving mostly okay. Then I turned to Apache. It turned out to be a bit more of a problem, as at first glance, it already looked like it was fairly well tuned:
So I twiddled the values for a while, with little result, all the while getting hammered with load averages hovering between 50 and 60, but at least I wasn’t swapping anymore. Then after over an hour and a half, I finally realized I had a big problem:
Wait, OFF? That means Apache’s getting hit about 10 times as hard as it needs to be, as it will have to spawn a new process for every inline image, the CSS file, etc. So I changed it fast:
And when I restarted Apache, now around 3:30 am, the load average quickly dropped from 45 to about 10, and requests started coming through at something approaching tolerable speed again.
Then I installed PHP eAccelerator, which has a nice Fedora package (php-eaccelerator) and works out of the box. When I restarted Apache after installing it, the load average dropped by half again, and the site was as fast as it usually is with nobody on it.
I’ve since installed WP-Cache 2 with the required WordPress 2.0/PHP 5 fix and Mark Jaquith’s gzip compression patch. I tested it to make sure it works, but I’m leaving it off until the next time the barbarians show up at the gate, as it screws with some dynamic code I have and I haven’t figured out how to get it to execute the code every time. Yet.
For those of you on shared hosting providers, you won’t be able to make most of these changes yourself. Only WP-Cache 2 is user-installable. If you use a VPS or dedicated server, though, you can do all of this.
There’s plenty of further performance tuning I can do, especially with MySQL, and I plan to do it in the very near future, just in case one of my posts actually gets greenlit on Fark.com or something.
One of the things I did while the server was being hammered was to disable Bad Behavior, to determine if it was putting too much load on the system while it was being hit with 50-100 requests a second. I’ve determined that at those levels it does hit the database pretty hard, and I plan to redesign all of Bad Behavior’s database usage to try to accommodate this sort of situation.
P.S. I can tell you that many /. users really do click on ads. Yesterday’s take on Google AdSense was $40.91, and so far today I’m above $66. Not bad. I get paid to learn fast about tuning my server.
There’s much more work to be done, though, so I’ll most likely have a follow-up to this.