One Step Backwards, Two Steps Forward
Regular readers of Ordinary Times have noticed problems with site performance over the last few weeks. In particular, lots of people were receiving those annoying error messages about the server exceeding assorted mysterious resource limits. Those appear to be, at least indirectly, my fault. My bad. Mea culpa. My intentions were good. On a more positive note, I believe that the problem has been fixed. Writing about how and why things got broken will make me feel better about it. Other people may find it useful as well, although the probability of that does seem low.
State of the Discussion is one of the features of Ordinary Times, available through the Community menu item in the list just below the logo at the top of each page. In my experience, SotD is unique. CK McLeod’s take on a comment-centric view of a blog is a brilliantly concise way of watching what’s happening overall. I made a fuller argument here. When CK moved on, entropy began to take its usual toll. Slowly but surely, updates to the platform the site runs on broke some of the SotD features. When Will offered me a chance to restore it as part of moving the site from its second incarnation to its third, I jumped at it.
I had to learn some PHP, the programming language in which most of WordPress, the product the site is built on, is written. I had to learn how WordPress actually uses that PHP to generate the web pages that get handed out. And I had to learn at least some of CK’s coding idiosyncrasies. That’s not a knock on CK. Every programmer, like every other kind of writer left to their own devices, develops style habits. If I had been writing SotD from scratch, the style would have been different. Not necessarily better, just more suited to the way I like things to be structured.
Here’s an unpleasant truth about the web today: once a site has become modestly popular, the large majority of the requests the site’s server has to field are not from people interested in the content. Most requests are either people trying to break in or web crawlers. A web crawler downloads a page, extracts any links embedded in the page, adds those to a list of links to be visited, and goes on to the next link in the list. The crawler scans the page content for indexing – that’s what Google and Bing do – or for other purposes. The crawlers’ goal is to visit all the interesting parts of the web. Some crawlers are quite sophisticated and can deal with all of the cross-linked pages in a complex web site. Some… aren’t, and don’t recognize when they’re wasting their time.
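For the curious, that loop fits in a few lines of PHP. This is nobody’s actual crawler, just the bare algorithm described above; real crawlers add politeness delays, robots.txt checks, URL normalization, and much smarter link handling:

```
<?php
// The bare crawl loop: fetch a page, pull out its links, queue the new ones.
// Illustrative skeleton only.
$queue   = ['https://example.com/'];   // links waiting to be visited
$visited = [];                         // links already fetched

while ($queue) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;                      // already saw this one
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);  // download the page
    if ($html === false) {
        continue;                      // unreachable, move on
    }

    // ... hand $html to the indexer (or whatever the crawler is for) ...

    // extract embedded links and queue the ones we have not seen yet
    if (preg_match_all('/href="([^"#]+)"/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
}
```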
As I was working on SotD, I was concerned with getting tidy pages properly generated to provide a specific user experience. All of the pages require multiple dips into the site’s database, which takes time. Some of the pages, if the links are followed, will produce chains tens of thousands of links long. A simple-minded web crawler, once it gets started on that collection, will cheerfully hammer away at the site for days. Maybe weeks. I didn’t think about that. And those crawlers generating all those (relatively complicated) database dips appear to be the reason the server was running out of resources.
Why doesn’t every complex web site grind to a halt under that kind of load? One of the reasons is that all well-behaved web crawlers look for a special file at the top level of the site’s file hierarchy named robots.txt. The file contains a list of simple rules telling the crawler where it is welcome to go and where it should stay out. Even crawlers being run by bad guys – searching for stashes of personal information, perhaps – have an interest in being well-behaved. It doesn’t do them any good if it becomes impossible for people to operate web sites. WordPress provides, as the default, a very small robots.txt file that basically points crawlers away from an old way of doing site administration.
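For the curious, the default set of rules WordPress hands out is only a few lines. From memory, it looks something like this – enough to steer crawlers away from the administrative pages while leaving the one endpoint they may legitimately need:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```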
Ordinary Times now has a robots.txt file that includes “stay out of SotD” rules. Over the course of a few days the crawlers all seem to have found the rule changes and are honoring them. (As I write this, zero of the last 1,000 visits to the site were crawlers accessing SotD.) The crawlers still seem to be visiting the rest of the site, so people’s posts and comments are all being indexed, which is a good thing. The bad guys are still trying to steal user IDs and passwords, which is a largely unavoidable bad thing. Resource usage seems to have returned to reasonable levels. The site’s response time seems to be more consistent. State of the Discussion is available again. If you want to use it, you may need to flush your browser’s cache to get the formatting right. The commenter archive pages that are part of SotD are not available yet. Those were the bigger problem, and I want to make sure other things are stable before trying them again.
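The new rules amount to a couple of extra Disallow lines on top of the default. The paths below are placeholders rather than the site’s real URLs, but they show the shape of the thing:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Keep crawlers off State of the Discussion and the commenter archives
Disallow: /state-of-the-discussion/
Disallow: /commenter-archive/
```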
Enjoy the improved performance.
Thanks for the writeup, and the work. SOTD is one of my favorite site features, one I use a lot.
Your explanation makes perfect sense, but I’m curious how you diagnosed the problem.
Will had largely pinned the problem down to too many simultaneous PHP threads trying to run. In my mind there were basically two situations that could cause that. One is a small number of very long-running threads that eventually accumulate. The other is a much larger number of moderately long-running threads with variability in load. Add some positive feedback once errors start occurring, and either could produce serious problems that clear only gradually. Will suggested that the first could be caused by code errors, e.g., infinite loops. I didn’t want to look for that in SotD*, so I investigated the other possibility. The crawlers hammering at SotD, particularly the commenter archive parts, were an obvious candidate once I recognized them. Given the improvement since blocking them, that was probably the underlying problem (this time).
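For anyone who wants to do a similar sanity check, a few lines of PHP against the access log is enough. This is not the script I used; the log location and path fragments below are placeholders, and the bot test is only as good as the crawlers’ self-reported user agents:

```
<?php
// Rough sketch of the kind of access-log check that points a finger at
// crawlers. Assumes the usual combined log format (request path plus
// user agent on each line); paths and log location are placeholders.
$lines = @file('/var/log/apache2/access.log') ?: [];

$sotdBotHits = 0;
foreach ($lines as $line) {
    // Count requests for SotD or commenter-archive pages made by
    // self-identified bots.
    if (preg_match('#/(state-of-the-discussion|commenter-archive)/#i', $line)
        && preg_match('/bot|crawler|spider/i', $line)) {
        $sotdBotHits++;
    }
}

printf("%d of %d logged requests were crawlers hitting SotD pages\n",
       $sotdBotHits, count($lines));
```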
* Turning SotD into honest-to-Knuth production code would, IMO, require a real rewrite. The current code is largely free of error handling and data checking. Consider what it takes to handle errors in code that runs on WordPress. Now that PHP has try-catch exception handling, some WordPress core functions throw exceptions. Some core functions that predate try-catch return WP_Error objects. Some core functions that return WP_Error objects do so only if you ask properly. And some core functions simply don’t indicate errors at all.
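Here’s a sketch of what that mix looks like in a few calls. The functions shown are ones I believe behave as described – wp_remote_get() hands back a WP_Error on failure, wp_insert_post() returns one only if you pass the extra flag – but check the current documentation before trusting any of it:

```
<?php
// Illustrative only; assumes this runs inside WordPress, where these
// functions and is_wp_error() are defined.

// Style 1: failure comes back as a WP_Error object.
$response = wp_remote_get('https://example.com/feed/');
if (is_wp_error($response)) {
    error_log('Fetch failed: ' . $response->get_error_message());
}

// Style 2: a WP_Error only if you ask properly; otherwise failure is a 0.
$post    = ['post_title' => 'Test post', 'post_status' => 'draft'];
$post_id = wp_insert_post($post, true);   // second argument requests a WP_Error
if (is_wp_error($post_id)) {
    error_log('Insert failed: ' . $post_id->get_error_message());
}

// Style 3: some newer code paths throw exceptions instead.
try {
    $when = new DateTimeImmutable('not a date');   // plain PHP, but the same pattern
} catch (Exception $e) {
    error_log('Bad date: ' . $e->getMessage());
}
```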
Thank you so much Michael. I don’t know how our lil community can honor you folks who make the bits boink and keep the lights on but you (and CK McLeod before you) deserve every praise.
And Will! Will does a lot of heavy lifting to keep things running.
Oh my gosh, this is awesome.
Cannot thank you enough for all the hard work you do.
Most excellent!
On Feb 9, the site was “attacked”, with page requests exceeding 20 per second at times. That’s about ten times the normal peak load. The hosting arrangement is simply not adequate to deal with that.