As you’ve undoubtedly noticed, OSAlert has been down most of the day. We’re back up now, running like we always do. We rarely have this kind of extensive downtime, so in the interest of full disclosure, here’s what happened. If you’re expecting some sexy story – think again.
After I notified our sysadmin that OSAlert was dishing out 324s, he got back to me to say they were already working on it. We had the same issue for a really short while a few days ago, but it got sorted then. Our sysadmin, Jon Jensen over at Endpoint, checked remotely what the heck was going on, and confirmed it wasn’t actually a software issue – it was a hardware issue, more specifically, a RAID issue. Had the RAID controller failed?
Since this obviously couldn’t be addressed remotely, Jon contacted the datacentre, and when they got back to him, the true extent of the problem was revealed. The quote from the datacenter tech? “We have fixed the cabling issue with your server and your server is now back online.”
Believe it or not, the connections of the disk cables to the controller had come loose. After the first outage happened a few days ago, they were apparently still somewhat connected, only to fully come loose today. Quite a feat, for an immobile server in an immobile rack.
The funny coincidence is that this all happened while we were also working to set up a playground subdomain for the next version of OSAlert we’re currently working on. Maybe it’s a sign…
I dunno… it was kind of sexy…
Thom hid the truth. IT WAS THE ANONYMOUS!!!!
So it wasn’t a cease and desist order for patent infringement!
Glad everything’s back <3
My router has been acting wacky lately (left it on a radiator). Resetting, rebooting and such usually get it going. However, this afternoon I couldn’t get it to work again.
The thing was, I kept testing the connection with “ping osnews.com” :$
EDIT: Wasn’t meant as a reply to bombuzal.
Edited 2012-02-24 19:30 UTC
Loose cables? I was hoping for this headline:
BREAKING NEWS: INFINITELY FAST QUANTUM COMPUTER BREAKS LAWS OF PHYSICS!
This is a very common problem with SATA cables (as seen on the Internet). It has happened to me personally 3 times. Every time I was about to pull my hair out. When the computer restarts all of a sudden for no apparent reason (even with Linux, yes), or if your BIOS complains about your disk drive, or if no disk is found, suspect the SATA cable!
I think the OP was referring to the recent news item that “a loose fiber-optic cable may be behind measurements that seemed to show neutrinos outpacing the speed of light” at CERN, where the original results, were they correct, would have actually broken widely-accepted laws of physics.
(Even though it appears that there may be more to it: http://news.sciencemag.org/scienceinsider/2012/02/official-word-on-… )
Whoever came up with the non-locking SATA connector needs to be taken out back and shot.
IDE connectors were nice and tight, and could not “wiggle loose” on their own.
P-SCSI connectors were nice and tight, and could not “wiggle loose” on their own.
Multi-lane SAS connectors all have locking clips, so they can’t “wiggle loose”.
Some vendors provide locking SATA connectors, but they’re fairly rare and add a bit of cost.
Why, oh why, oh why did the standard SATA connector make it out of the lab? These things are flimsy, non-locking, and will “wiggle loose” with very little provocation (similar to HDMI cables). Especially when used with 10K/15K RPM drives, or multiple 7.2K RPM drives in a case. You get a lot of vibration in a multi-U rackmount case stuffed full of drives.
When I bought my Gigabyte mobo for my quad core, it came with several latching SATA cables, while most disk drives come with the loose-fitting ones. The latching cables sometimes don’t fit some female sockets, but otherwise I far prefer them to the old IDE ribbons or SCSI cables, which have plenty of their own mechanical issues.
I’ve had a RAID degrade for exactly the same reason, but it’s not just the crap non-locking connectors. Cheap SATA cables themselves are a problem: buy a new drive and then lose data and be plagued with instability due to a damn rubbish cable, and that’s before it falls out completely.
I’ve had the same problem… somehow when routing my cables in a build I’ve managed to “break” them and have had to go out and buy more. I now tend to automatically order “good” cables with new drives or mobos.
I work as a data center administrator for a major internet hosting company, can’t say who, but we’re in the top 3 hosting companies…..
The cables may have worked loose due to other servers being deprovisioned and removed from the rack, or being pulled out for memory or other hardware upgrades. It sometimes happens that when we pull a server from the rack, other servers in the cabinet slide forward with it. It would take a while for anything to work that loose though, but that’s a possible reason for it happening. Just wanted to pass on what I know; hope it helps.
Is it the same company that hosts your website?
nope I checked, we don’t host osnews
I think he meant yours.
ah, sorry I’m a bit on the slow side today, been fighting a cold. Yeah my site is hosted on my company supplied personal server
Wow, I have bookmarked your site. BIG darkwave fan and I’ve always preferred the Cruxshadows to The Last Dance!
A simpler explanation is just the vibration of all the hard drives running in a rack, or in a server. Especially if you have multi-U systems with 20+ drives in them.
It can also be the cause of non-optimal performance:
http://www.youtube.com/watch?v=tDacjrSCeq4
Vibration is a real problem for HDDs.
“If you’re expecting some sexy story – ”
Yes, my first thought was.. Isn’t Thom dating Heidi Klum?! .. maybe he lured her back to the server room and condensation from the steamy geeky conversation was causing electrical problems..
hate when that happens.. and i wake up ..
I do not wish to envision any scenario that could possibly involve the words “Thom” and “sexy” together, thank you very much.
Only joking Thom. Anyway, it could have been worse: a co-worker and I once had a very important and old server fall out of the rack (from about U20 or so) when we were moving it. Needless to say, it didn’t really work once we tried powering it back on.
*come hither look*
That is only slightly worse than my experience. I had a server I was taking out of a rack for service get stuck on one rail. Then I lost my grip. Needless to say that the one rail didn’t hold it in place. Down it came with a spray of ball bearings everywhere. Boy that was a bad day.
There was a time in far off Soviet computing that the chips used to jump out of their sockets.
The sockets were made locally to metric dimensions, while the chips were imported from the US, so the 25 to 25.4 ratio eventually made them pop out by themselves.
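For anyone curious how much a 25 to 25.4 ratio adds up to, here is a rough back-of-the-envelope sketch. It assumes the ratio means a 2.5 mm socket pitch against the 2.54 mm (0.1 inch) pin pitch of an imported DIP chip; the function name and the pin counts are made up for illustration and are not from the comment above.

```python
def cumulative_offset_mm(pins_per_row, socket_pitch_mm=2.5, chip_pitch_mm=2.54):
    """Misalignment at the last pin of a row when the first pin is lined up.

    Assumption: pitch values are the 2.5 mm vs 2.54 mm reading of the
    "25 to 25.4" ratio; the error grows by one pitch difference per gap.
    """
    gaps = pins_per_row - 1
    return gaps * (chip_pitch_mm - socket_pitch_mm)


# Hypothetical DIP sizes, two rows of pins each
for total_pins in (14, 16, 24, 40):
    pins_per_row = total_pins // 2
    offset = cumulative_offset_mm(pins_per_row)
    print(f"{total_pins}-pin DIP: last pin off by about {offset:.2f} mm")
```

The per-pin difference is tiny, but it accumulates along the row, so the longer the package, the more the end pins are bent sideways in the socket.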
Those occurrences were not limited to the USSR. The CEO of some fruity USA computer company (whose logo wasn’t a pear…) insisted the 3rd major design of their computer should not contain any fans. Even though the engineers thought it needed one. So over time the repeated heating and cooling worked chips loose from their sockets.
The official remedy was to drop it on the desk from the height of a few inches…
/troll mode ON
Microsoft, copying everything from Not-A-Pear down to the Xbox 360’s overheating problems.
*takes shelter*
I must have been extremely lucky, I have a second-hand XBox 360 Elite that has never overheated or had graphics issues. My best friend’s original white XBox 360 glitches like crazy, and I’m going to attempt to redo the heatsink for him sometime soon.
That said, mine won’t play any discs so it’s far from a perfect unit. Downloaded games and streaming content are excellent though!
IIRC, the original XBox 360 had RROD failure rates of about 30%, so the majority still worked – though with glitches
I still remember the noise when turning one on though. It felt like someone brought a jet engine to my friend’s room…
Why use physical hardware for hosting when virtualization and clustering would be cheaper and more reliable? Our servers all have N+1 PSUs and RAID 1 SAS drives with battery-backed RAID controllers. And all servers work in a cluster: when some hardware goes down, the VMs just move to other cluster nodes, or in case of total destruction we just restore the latest backup to another node. Of course there are multiple internet connections available and network cards bonded together. High availability options are available if needed: you’ll get multiple virtual machines mirroring each other.
There are lots of ways that virtualization can go wrong; there are many failure modes.
With a dedicated server the customer knows exactly what they are getting, it is very predictable.
Especially if the customer already needs more than one server of each service anyway (multiple webservers and multiple database servers for example).
With virtualization the customer depends on the provider getting all of it right. The customer can’t see what they are doing (wrong).
It depends on the requirements of the customer.
Edited 2012-02-25 12:46 UTC
Isn’t there some way to blame Apple for this? :-p
Seeing Apple’s lock-in strategies and their willingness to sue others over patent infringement, most producers have decided on excluding locking mechanisms for their SATA cables.
Edited 2012-02-24 23:54 UTC
“Oh crap! The new guy fumbled and unplugged the cable by mistake. Quick, make something up”.
“It’s my first day!”