
Explanation for the March 31st outage

Started by Xepher, April 01, 2010, 05:03:34 AM


Xepher

Well, I feel I owe everyone an explanation for the day-long outage yesterday. Let's start from the top.

For the past year, xepher.net was hosted on a virtual machine at a company I used to work for, because I got a discount there, paying $63 for what should've been $250 normally. They finally decided I've been gone long enough that the discount has to go. So I look for new hosting. I find a place (call it option A), and it's $160/month for ostensibly the exact same level of service. I put in an order, and don't hear anything for a couple days, despite emails to their support and such. This isn't very promising, so I look at other options. I find another place (option B), really cheap, $55/month, with a slightly less powerful setup, but more bandwidth and disk space, so I figure, why not? They get me set up pretty much right away, and I spend a day or so moving stuff over there.

I switch all the services on, and it runs great for a few hours, some hiccups, but not a big deal. After a couple days, I figure, eh, this'll work for now, and my contract is up in a day or two anyway on the original service, so I have to move, shutting down the old host and closing that account. I go ahead and let the order for A stay in place though, since I'm curious to compare the two. It comes through about 48 hours ago, 6 full days after I actually ordered it, and by then I'm noticing more and more performance issues with B. The problem is that the CPU is powerful enough, but they have too many users on one machine, sharing the same disk, so anything needing a file gets delayed really badly. Since we have about 150 users here, all with websites, this shows up as a system load of over 100... i.e. there's 100 people waiting in line for a file. At this point, I get a letter from B, saying other customers are complaining because I'm using all the CPU. This isn't the case, as it's the disk that's lagging, and it's not my fault. Other customers may not have 100 people in line, but... well, picture it this way. I have 100 people in line with buckets, to collect their files as they trickle out of the spigot, so my "load" is 100-in-line. The other customers may not have 100 waiting, but they're bringing dumptrucks instead... big file transfers, but since only one person is waiting, their load is only "5" or some such. I tested this by shutting down all services for a bit in the middle of the night, and the load was still sky high, even when my server was doing nothing. Anyway, besides the point... what happens next is the fun part.
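
For the curious, here's a rough sketch of how one could check the "it's the disk, not the CPU" claim on a Linux box. It compares two readings of the aggregate "cpu" line in /proc/stat and reports how much of the elapsed time went to real CPU work versus waiting on disk I/O. It's an illustration only, not the test described above (that was just shutting the services down and watching the load), and the function names are made up for the example.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:]]


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters; the aggregate 'cpu' line is the first line."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def cpu_shares(before, after):
    """Return (busy, iowait) as fractions of the ticks elapsed between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user/nice, so stop after steal.
    total = sum(deltas[:8])
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle, iowait = deltas[3], deltas[4]
    busy = total - idle - iowait
    return busy / total, iowait / total
```

To use it, call read_cpu_counters() twice a few seconds apart and pass both results to cpu_shares(). A sky-high load with a small busy share and a large iowait share means the line is forming at the disk, not at the CPU.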

The admin of B emails me, with the complaints... but instead of giving me a chance to respond or work on it... he just shuts down my server. And in the email he tells me he's installed a script that will automatically shut it down again anytime load goes over 10. Keep in mind, there's already a line of 20 dumptrucks (each from a different customer) in line, but if I so much as get 10 people with buckets at the end of the line, I'm shut down again. I can reboot the thing through the control panel, but it obviously mucks with stuff to just reboot every few minutes. I write out a reply, explaining all this, but he apparently went to bed, because I don't get any response for 8 hours. In the meantime, I do the only thing I can. I shut down the web service and try to migrate stuff as quickly as possible from host B to host A.

And then it gets more fun... host A is even slower on the disk response than B. As in, I can literally download it from the internet faster than it can save it to disk, so it's taking like 18 hours to move all this stuff. I just kept at it for 6 hours or so, but realized this place will probably be even worse, and for $160 there's got to be a better option. Finally I went with ThePlanet, and got a dedicated server. No sharing it with anyone... no VMs, no VPS, pure hardware all for me! And it's $125.

I order it, but one catch, it doesn't come with Gentoo, the OS I use. So either I rewrite all my management scripts and all the custom programming, or I find a way to install Gentoo myself. The thing has two disks, so it should work, but... well... OS install via the internet. Yay. It's slow going, and I make some headway, but finally, I reboot and it won't come back up. I give in, and tack on another $30/month for a KVM (virtual keyboard/monitor access) to let me fix things. That takes another couple hours for them to set up.

So now I have KVM, and I work through stuff for a good solid 30 minutes, then it refuses to boot because I screwed up a boot menu option. No biggy, reset it and pick another thing from the menu. FAIL! I can't reset it with the ctrl+alt+del, since it's frozen. So I use their panel to power it off and back on... but my KVM goes dead and stops responding as soon as I do this. Later, I find out the KVM is powered by USB, and takes like 60 seconds to come up. So, by the time it powers up, the computer has already booted, and once again gone with the broken default!! So I put in a ticket for them to go reboot it manually, and wait some more.

They're actually pretty good on this, get it done in like 20 minutes, and I finally get a working, bootable system. Then I start copying all data off of B, because that's the last place a complete copy of xepher.net ever actually made it to. (Well, I have a full backup of everything on my desktop, but uploading from a home connection would take 3 days or more.) The attempted copy to A has been going for the entire day, and it's still not done either! I nix that, and just focus on moving everything into the physical machine. It finally finished about two hours ago. I've been up since 5pm on Tuesday, and it's now the wee hours of the morning on Thursday. 31 hours awake, and 25 of them trying to sort all this out.

There are a thousand more little things that went wrong of course, but I'm too tired to rant anymore. Bottom line is, while it wasn't my fault, I do apologize. I never like to get caught with my pants down, so to speak, and all this hit when my options/infrastructure was weakest. Almost any other time I have a backup plan that's better than this, but having to move only 2 days after you just finished one... and with a script rebooting your only functional system anytime you so much as look at it funny... Oi!

Well, with all that said, I'm off to bed. If you run into any problems or snags, it's entirely possible I missed 1 (or 100) things in this, so do let me know. Also, any of you hosting here... feel free to tell your readers/visitors what happened, and send 'em here, anyone is welcome to ask questions if you want clarification or whatnot... it's not just for members here.

Somehow, I keep thinking all this must've been 24 hours early, because I would swear this much fail would have to be an April Fool's Day prank!

tapewolf

Thanks, both for your efforts in restoring service and also for the explanation.  It made for some interesting reading.


"The main difficulty is getting [Qa'Dar] out of his cage.
Far and away the most reliable method I have found is mass-murder." -- The IT-HE guide to Morrowind

amuletts

Thank you for working so hard to get our websites back up. It must have been incredibly frustrating working remotely with those servers, and having to be so patient only to have things just... not... work!!
Doubtless my readers will *think* it was an April fool and laugh heartily.

Turnsky

Everything works, not a smouldering crater in sight, good job :)

Xepher

Also, thanks to everyone for being so patient... I got quite a few IMs about it being down of course, but I don't think a single one was actually rude or angry, and lots of y'all were actively encouraging. It's always nice to have a userbase that really appreciates things, which is quite rare, especially when it comes to free services on the internet.

Kaspalian

Thanks for your hard work sorting all this out!
So, yeah...

Aetre

*patpat*

I admire the dedication, and thank you for getting it all back up. :)
"Not even the Human can stop me now..."