The outage on Microsoft’s Windows Azure cloud computing platform that caused the government’s G-Cloud service to go offline was the result of a calculation error caused by the extra day in February due to the leap year.
Writing on the Azure blog, the firm’s corporate vice president for service and cloud, Bill Laing, said that while the firm had yet to fully determine the cause of the issue, the extra day in the month appeared to be the most likely cause.
Only every 4 years, right? (even less on average – not in 2100 for example)
Edited 2012-03-01 23:28 UTC
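For anyone who wants the "not in 2100" part spelled out, here is a minimal Python sketch of the Gregorian rule. The function name `is_leap_year` is just mine for illustration; the check against `calendar.isleap` is there so you don't have to take my word for it.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for y in (1900, 2000, 2012, 2100):
    # Cross-check against the standard library's own implementation.
    assert is_leap_year(y) == calendar.isleap(y)
    print(y, is_leap_year(y))
```

So 2012 and 2000 are leap years, while 1900 and 2100 are not.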
Reason #48341830923 that you shouldn’t keep your data in the cloud.
I found a typo in your post and fixed it for you:
Reason #48341830923 that you shouldn’t keep your data in the *Microsoft* cloud.
There is NO OTHER cloud-system in the whole galaxy which has such kinds of problems again and again. They have problems without doubt but NOT such problems.
Let’s face it. It may not be a problem to reboot your Windows desktop if there is a problem but that does NOT WORK for servers.
Anyone remember the London stock-exchange disaster last year (a whole day down thanks to Windows server technology)?
Yes, but knowing how many consulting companies work from the inside, I am sure that developers were to blame and not the technology.
If the world’s (arguably) premier software company, with all the lessons learned and experience gained during decades of development, could have had a disastrous outage caused by an extra day,
then all those who bitch that all the money and effort spent on the Y2K fixes were a waste and that we were hoodwinked by a bunch of grouchy, grimy COBOL programmers looking for a last big payout can just shut the FUCK up.
Depends on how you define Y2K.
If Y2K was about how a single flaw in a single system somewhere would kill all civilization, then no, this is not that.
If Y2K was about how some developers fail to think a few years into the future with localized bad results, then yes. This is that.
It went way beyond “some developers” although most of the blame can be laid at the feet of the decision-makers.
Bob Bemer started petitioning everyone from programmers to politicians in the early 60s about the problems with 2-digit dates – they didn’t listen and he wasted several decades trying to convince them.
Everyone assumed that “Y2K” trouble was only about the year 2000, but most computers actually had different date boundaries.
The year 2000 was only a problem for those who stored dates in ASCII/EBCDIC form. Mainframes seem to be unusual in their use of BCD and nine’s complement within their VSAM files, which is why they were especially susceptible to the two-digit overflow.
Binary time representations such as those in *nix have different limits, but they’re also approaching.
http://en.wikipedia.org/wiki/Year_2038_problem
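To make the binary limit concrete, here is a rough Python sketch, assuming a signed 32-bit count of seconds since 1970-01-01 UTC (the classic *nix `time_t` layout). The helper names `last_32bit_moment` and `wrap_int32` are made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def last_32bit_moment() -> datetime:
    """Latest instant representable as a signed 32-bit seconds-since-epoch value."""
    return EPOCH + timedelta(seconds=INT32_MAX)


def wrap_int32(n: int) -> int:
    """Simulate two's-complement wraparound of a 32-bit signed integer."""
    return (n + 2**31) % 2**32 - 2**31


print("last good second:", last_32bit_moment())

# One second later the counter wraps to a large negative number,
# which decodes to a date in 1901.
wrapped = wrap_int32(INT32_MAX + 1)
print("after overflow:  ", EPOCH + timedelta(seconds=wrapped))
```

The counter runs out on 19 January 2038, and a naive wraparound lands back in December 1901.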
But weren’t ICBMs among the concerns (admittedly, one of the most silly ones) thrown around?
The reason is that they didn’t think about leap years? In 2012, this is the error they made? It’s not like it’s some unexpected event we didn’t see coming.
You know, I would have found this acceptable in someone’s pet OSS project but not in a global service from MS that you probably pay an arm and a leg for.
If I was the guy who was responsible for this in “the government” I would have been having a serious talk with my account rep already and it would not have been easy for them to convince me to continue using their product.
Agreed, but sadly the reason the British government likes expensive and often vastly over-priced contracts with Microsoft, IBM and Oracle is simply that they take liability away from the government.
If MS fsck up and take a government service offline, then IT managers within the government just say “not our fault, it’s one of our service providers”. For the government, contracts like this are just another form of outsourcing and thus it would take something monumental and hugely publicly embarrassing before any government body would even consider switching providers – let alone bring the services back in house where they really belong.
This is just my experiences when I worked for the British government. Things might be different for the rest of the EU or western world (for their sake, I hope so).
Edited 2012-03-02 09:15 UTC
Not only Governments but also quite a lot of organisations (I worked in a large charity for 15 months and this was rampant). The higher up you get the more you gotta watch your own backside.
Everybody likes to outsource responsibility. Certainly in some Central European places one can see a strong “nobody got fired for using Microsoft or Oracle” of sorts…
…and even when the projects, waaaaay down the line, largely prove to be practical failures – those initially pushing and implementing them moved on, several times already, each time adding another “success” to their CV – and the more expensive, the more lucrative such “successes” are, the better they look on the CV, it seems.
Agreed, that’s just embarrassing. But…
…you would only complain and try to get some monetary recognition out of it, but you wouldn’t quit using the service. And you know why. This is not just picking up your ball and going, it’s picking up the goal posts, the fences, the benches, the lawn and the parking lot, too. I don’t claim to know how large the gov’s data is on Azure, but I’m sure it is somewhere in the region where you don’t move on a whim.
And on top of that 1 day in 366 is probably well within agreed outage levels (I’d guess they have 99.9%, so they would be covered.)
In the short run you’re probably right but the contract will be renegotiated at some point and I would make damn sure there was a viable alternative at that point. Of course, I would probably not have bought into Azure in the first place so it’s a bit moot.
Could be but on the other hand, isn’t the cloud all about NOT having these kinds of problems? You know, scalability, redundancy and all that jazz that the sales rep probably fed the gov’t.
It just means someone else, who is dedicated to the task, is doing that kind of work. That doesn’t mean you get fewer problems.
It might mean you get more problems, because doing things at a large scale isn’t easier.
Right, but I’m sure the MS sales rep told them that if they used Azure they’d never have downtime and it would all be redundant and scalable and blah blah blah. If I had been told that and then the whole thing (and I mean the whole thing, not just a few of my VMs) goes down because they forgot about leap years I’d be mighty pissed.
I think I wasn’t clear. The FU is reprimandable, no doubt.
My line of thinking was that at some point in deployment you pass a point of no return where you are effectively locked into the cloud of someone else, because moving becomes very expensive, even more expensive than putting up with a FU.
I guess what I’m really trying to say is that cloud services lock in your data and you will suffer the consequences and like it. Beware of the cloud, seriously.
“And on top of that 1 day in 366 is probably well within agreed outage levels (I’d guess they have 99.9%, so they would be covered.)”
Let me show you some magic:
100-1/366*100 => 99.73%
99.73>=99.9 => false
By Jove! Please, civilized man, teach me your mathmagics!
I guess, what I’m trying to say is: what were you thinking when you decided to snark instead of simply correcting my mistake?
let me show you some marketingmath:
100-(1/(365+365+365+366))*100 = 99.93
99.93 > 99.9, so no problem for the uptime guarantee
(and even if the SLA were for 99.99 and this month will only be 96.55 that probably only means you will get a refund of 99.99-96.55=3.44% of what you pay per month)
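If anyone wants to replay the uptime arithmetic from the last few comments, here is a small Python sketch. The 99.9% threshold is the guess made above, not a figure from any actual contract, and `uptime_percent` is a name I made up.

```python
def uptime_percent(days_down: float, days_total: float) -> float:
    """Uptime as a percentage of the measurement window."""
    return 100.0 * (1.0 - days_down / days_total)


SLA_TARGET = 99.9  # assumed target, taken from the guess in the thread

windows = {
    "one leap year (366 days)": 366,
    "four years (1461 days)": 365 + 365 + 365 + 366,
    "February 2012 (29 days)": 29,
}

for label, days in windows.items():
    pct = uptime_percent(1, days)
    print(f"{label}: {pct:.2f}% -> meets {SLA_TARGET}%? {pct >= SLA_TARGET}")
```

The same one-day outage passes or fails the target depending purely on how long a window you average over, which is the whole trick behind the "marketingmath".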
I trust Microsoft cloud services the same now, because I already didn’t trust Microsoft cloud services.
Even a top flight software design company can be caught out by one of these random insertions of extra days by those damn wizards, soothsayers or stargazers. Who would have thought that someone would “chuck in” a spare day out of the blue like that?
And at such short notice too!
Remember when a lot of zune players died the last leap year?
http://www.computerworld.com/s/article/9124638/Zune_chokes_on_leap_…
Microsoft says it will issue a bug fix for the device so that this problem won’t occur again in 2012, the next leap year.
I guess they should have shared that knowledge with the azure department.
Microsoft doesn’t work that way. Each department is a fiefdom unto itself, and must hoard its knowledge and bug fixes to give itself a leg up on its departmental enemies.
Now the Zune devs can sit back and laugh, gloat, and toast to the pain and suffering of their evil Azure dev-enemies.
We’ve been having leap years since long before computers were invented. We have one every four years. None of my Android devices had problems on February 29 or March 1st. Even Windows didn’t have problems with the extra day. How can Microsoft’s Azure division drop the ball so miserably with something so simple, for which there’s plenty of sample source code on how to handle it?!
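To be clear, the article says the cause hadn't been fully determined, so this is not a claim about what Azure's code does. It is only a Python sketch of one classic way a leap-day calculation goes wrong: computing "same date next year" by bumping the year field. Both function names are hypothetical, and falling back to 28 February is just one possible policy.

```python
from datetime import date


def add_one_year_naive(d: date) -> date:
    # Raises ValueError for 29 Feb, because the following year has no such day.
    return d.replace(year=d.year + 1)


def add_one_year_safe(d: date) -> date:
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only 29 Feb can fail here; fall back to 28 Feb of the next year.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    add_one_year_naive(leap_day)
except ValueError as exc:
    print("naive version failed:", exc)

print("safe version:", add_one_year_safe(leap_day))
```

The naive version works fine on 365 days out of every 366, which is exactly why this sort of thing slips through testing.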
I’m running windows 7 and I definitely had problems yesterday: the time on my PC was off by one hour, but the timezone was set correctly.
I checked the configuration, and it said that it had synced in the morning from time.microsoft.com. That service seemed not to be responding very well (maybe it’s hosted on Azure now?). I’d put the blame on it rather than the OS itself, but still…