
Internal server errors

Started by reinder, July 16, 2007, 08:34:08 PM

reinder

Is anyone else getting lots of 500 (Internal Server Error) responses lately? I am, and it's affecting both my Perl-based Movable Type installation and my PHP/MySQL-based CMS for the comic.

At first, we (Mithandir, who wrote the CMS, and I) blamed it on a persistent spam attack on the comic, but that seems to have died down.

Xepher

I'm looking through the error logs for the past 24 hours... it looks like it's still being attacked, or at least was a few hours ago. At one point I'm seeing 20 requests/second from the same IP for your front page, with no referrer or other info (meaning a bot, not a real browser). I'm seeing the same sort of pattern for requests to mt-tb.cgi. The server has resource limits in place to prevent flood attacks like that from bringing down the whole place. It only lets so many processes run at the same time... once that number is reached, you'll get those 500 "Internal Error" messages. The limits are per-account, so it doesn't matter whether they're flooding the MT install or the PHP stuff; it will affect the whole site.

I'm not seeing anything in the past couple of hours, and I'm not getting any errors myself as I browse your site. Are you still having problems? If it starts happening again, try running "top" in SSH/shell. It'll show all running processes. Hit "u" and then type your username "rocr" and it will filter to show only your processes... Most web-serving processes should only pop up for a second or two as they serve a page and then go away... if it's being flooded, though, you'll see a bunch of processes there. If that's the case, get the exact time (on the server) and let me know. Top shows the time in the upper left corner. (The server runs on UTC.) I can look at the log files for that precise time and see what was happening and where it was coming from. Maybe there's a particular IP or two that we can block or something. I've also got an option where I can (temporarily) log requests separately, just for your site, so you can look through them yourself more easily. Let me know.

EDIT: Also, I just noticed a lot of hits to mt-blahblah.cgi... Didn't you rename that in order to disable it from being run? It's still getting requested/run.

Databits

Actually, Xepher, if the file was renamed, I'd strongly suggest looking it over or making it non-executable for now. It's possible the script has been compromised, and it would be unsafe to let it run until that's been investigated.

Xepher

No, it's not that bad... it was the mt-comments script, IIRC; he renamed it so that spam bots wouldn't find it directly. I don't remember whether it was still meant to be used or not, though. The script itself is "safe" in that it's an official part of the Movable Type package; it's just prone to being hit by comment spam. If it's not supposed to be used at all, then yeah, we'll disable it entirely. But I think he may have changed the links so it's still accessible/usable and spambots just don't see it directly.

reinder

The status of my mt-tb and mt-blahblah scripts is that they're disabled from within Movable Type, so no comments can get posted right now. The spam protection in the latest Movable Type seemed adequate at first but didn't hold up for long, so I went to moderated comments after three months and switched comments off entirely again after four. Presumably the script still runs when a spammer tries to hit it, only it rejects everything. Since this doesn't cause Movable Type to do the file rebuilds of doom, server overhead should be slight, except in the case of a severe bot attack. Or so I thought.

Unfortunately, if I remember correctly, these scripts are also needed to allow users to read existing comments and trackbacks, so I can't get rid of mt-blahblah entirely... well, not without a whole lot of other work. Mt-tb can be safely lost as my weblog never got many trackbacks.

I'm a bit puzzled by the bot hits to my front page. I don't see them in my CMS's logs, at least not in those numbers. I got a large number of bot hits on Sunday and early on Monday (early being before 10 AM, CET, when I left home after Mithandir had upgraded the anti-spam system). After that, everything seemed normal.

Xepher

The string of hits I was seeing in the logs was all resulting in 500 (internal error) responses, meaning the CMS itself probably never registered them. The overhead for the mt-blahblah scripts isn't bad in terms of server load, but when they're called, they run for a second or two and take up a process slot. When things are being bombarded, they count against the total number of processes allowed to run, even though they're using barely any CPU time. I wish I could've seen what was running at the time the errors were cropping up. I can look at the environment of a process and see exactly what command was run, what IP it came from, and any query/post strings. I've upped the limits a bit, since the processes being run all seem to use very little memory, but generally speaking, something has to be going wrong for the site to be bumping into those limits in the first place.

If you look at the stats http://xepher.net/stats/rocr/ and go down to "processing time", you'll notice there are more than 10,000 requests (since the start of the month) that took over 10 full seconds to process. Most requests don't take more than a fraction of a second. Those long-running requests are a lot of what causes the site to bump up against the process limit. I wish that stats program gave a better analysis of things... specifically which scripts/queries/etc. were the worst offenders for processing time.

Anyway, I don't see any big problem right now, unless you're still running into errors. Like I said before, if/when you do have problems, check "top" in shell and see what's running and how long processes are sticking around. If there are a lot of them coming and going, then it's just really busy and there's not really an "error." In that case, we can look at whether it's legit traffic or some sort of bot attack, and maybe up the limits or block some IPs. If individual processes are hanging around for more than a few seconds, though, then something is wrong, and maybe there's an exploit we're not aware of.

Xepher

Okay, I think I may have found the root of the problem... You're maxing out the database! The table "Willow_Logs" is 1.1 gigabytes in size. I noticed a few minutes ago that everything was going slow, and looking at the process table, you had a ton of index.php processes running. I looked at the environments of several, and they all looked legit. Watching the processes, I saw that several had been running for 5 or even 7 minutes. The database server was running at full CPU load, and that was causing all those processes to wait on it... then they all hit their timeouts and died off, and things went back to normal. I couldn't imagine what SQL query could cause that kind of lag, until I saw that you had a single table with 4.7 million records in it. I can't even get phpMyAdmin to take a look at it. I'm gonna go poke around on the command line and see if I can tune or optimize this.
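For reference, this is roughly what I'll be poking at from the mysql command line (just a sketch; the exact table name and the output will vary):

-- Size, row count, and unreclaimed space for the log table
SHOW TABLE STATUS LIKE 'Willow_Logs';

-- What's hitting the database right now, and how long each query has been running
SHOW FULL PROCESSLIST;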

Databits

4.7 million records really isn't much for a MySQL (5+) server.

We've got stores with more records than that, as well as database image storage, and no big performance hits like that...

...but that also depends on if the table was structured intelligently.

reinder

I've already sent this in email because the forum wasn't working, but:

I know what the big database table is, probably. It's most likely the WillowCMS logfiles. We've had to move a large chunk of them out of the active database into an archived version before, and since then, they've only been growing faster. I've asked Mithandir to do this again, and will specify that he should remove more than he did last time. If you want, I can also ask him to keep fewer logs. While we're both big nerds about being able to drill down into our logs, I'm more than willing to give that up if it means better service to the readers.
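Roughly, what I'm asking him to do would look something like this (just a sketch; the archive table name and the cutoff timestamp are placeholders, and he knows the real schema):

-- Copy log rows older than some cutoff into the archive table
-- (assumes the archive table already exists with the same structure)
INSERT INTO WillowLogs_archive SELECT * FROM WillowLogs WHERE tstamp < 1180000000;
-- ...then remove them from the live table
DELETE FROM WillowLogs WHERE tstamp < 1180000000;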

However, to be able to do those things, he needs to be able to get into the CMS, which is affected by the maintenance status, and the shell, which doesn't seem to be working (I've tried with both my new and my old password). Does the maintenance status include the shell?

Xepher

It does... I apologize for the lock, but it was necessary while I brought the database down and back up. I defragged the physical files backing your database, and MySQL is running its own optimization (basically an internal defrag and reindex) on the table right now. I had to lock things because any database access during the optimization results in that process hanging (since the table is locked by MySQL while it's being optimized). I tried it without the lock, and your site immediately had 25 threads hanging... I figured a maintenance notice was better than 500 errors for your users. As Databits noted, MySQL shouldn't have a problem handling things this size... it was just a bad combination of a nearly full disk and a very busy table... fragmentation was through the roof.
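For the record, the optimization boils down to something like this (run from the mysql command line; table name as above):

-- Rebuilds the table and its indexes, reclaiming fragmented space;
-- MySQL locks the table for the duration, hence the maintenance notice
OPTIMIZE TABLE Willow_Logs;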

It hopefully shouldn't be much longer. I'm not going anywhere until I get you back up and running.

Xepher

Alright, optimization is done and the lock is removed. Site looks to be working just fine again... I'll keep an eye on it until I go to bed, but it certainly can't be any WORSE than before.

Just using the database seems a LOT faster... I can actually browse it and such now, and it responds quickly. The physical data file was reduced in size by about 250MB too, so that's the amount of cruft we were talking about... probably 90% of that in small fragments across the disk too. There's only a couple gigs left free in /home, so what's going in there right now gets pretty fragged. Not a big deal for 500kb images, but gigabyte database files take a huge performance hit.

(Note to self... get the new server in place already!)

Xepher

I spoke too soon it seems.

I visit http://rocr.net/index.php?home and it loads 90% of the page right away... but the "loading" indicator just keeps going... for several minutes. I'm monitoring it from my browser and from the server side. It took 6 full minutes to give up and let the thread on the server die. Gonna do more testing to see whether it's SQL related or not.


EDIT: Okay, it's SQL related, but it's not really a server problem so much as a coding one (I think). I enabled the slow query log, and this is what it shows...

UPDATE WillowLogs SET crawler = 255 WHERE ipv4hex='184a3560' AND tstamp > 1184745591;

There are a dozen more like it from just the few minutes I was watching. The log gives query and lock times, and both were creeping up into the minutes as I let it go. When I give things a fresh start, I can browse your site as fast as I want, with no lag... but as soon as something starts lagging, queries of the above sort start showing up in the slow query log.

I notice the table structure has both "ipv4" (an int10 field) and "ipv4hex" (a varchar field). There's an index on the first one, but not on the hex one. Would it be possible to run the above query against the indexed field instead of the hex one? Or to add an index to the hex field? (Maybe on tstamp too.)
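Something along these lines is what I have in mind (just a sketch; the index name is a placeholder, and on a table this size the ALTER will lock things up for a while as it builds):

-- Add an index covering the columns the spam-update query filters on
ALTER TABLE WillowLogs ADD INDEX idx_ipv4hex_tstamp (ipv4hex, tstamp);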

Mithandir

That's ... strange ... because I commented that line out when I noticed it slowed things down too much...

*checks*

Ah ... right ... I uploaded the updated file to the wrong place again (I keep uploading files to the root for some reason when I'm in a rush, instead of to public_html). And yeah, it should be using the indexed field of course ... serves me right for updating code without refreshing my memory.

Mea Maxima Culpa, changed it now.

Mithandir

Oh, by the way... that query is only executed when a spam comment is encountered. If you're seeing a lot of them, it means the site is under a pretty hefty spam attack.

Xepher

Heh... oh well, at least we figured out what it was, and it was an easy fix (I assume, since you thought you'd already commented it out.) I know I've made much more foolish mistakes when coding.

And that would explain why it worked just fine when I was speed-reading through pages to do my own tests... nothing slowed down... then suddenly a few other visitors would happen along, and it would all bog down. If it only triggers on spam, then nothing bogs down until spam shows up, then EVERYONE bogs down, and the ensuing snowball destroys northern Europe. :-P

Oh, and I completely understand the desire to keep detailed logs. I'm a statistics-whore myself at times. Aside from that one query, the logging doesn't seem to be causing any problems, so feel free to keep it up if you're confident the fix is in place.