Maintenance error caused Facebook to crash for 6 hours, company says
October 5 (Reuters) – An error during routine maintenance on Facebook’s data center network caused its global systems to go down for more than six hours on Monday, setting off a cascade of problems that delayed repairs, the company said on Tuesday.
The outage was the largest ever seen by Downdetector, a web monitoring company. It blocked access to apps for billions of users of Facebook (FB.O), Instagram and WhatsApp, further intensifying weeks of scrutiny of the nearly $1 trillion company.
At a U.S. Senate hearing on Tuesday, a former employee turned whistleblower accused Facebook of putting profit before people’s safety, which the company denies.
In a blog post, Facebook vice president of engineering Santosh Janardhan explained that the company’s engineers had issued a command that unintentionally disconnected Facebook’s data centers from the rest of the world.
Facebook’s systems are designed to audit commands to prevent mistakes, but the audit tool had a bug and failed to stop the command that caused the outage, the company said.
The outage was not caused by malicious activity, Janardhan added.
While users lost access to one of the world’s most popular messaging apps – WhatsApp has more than 2 billion users – employees were also locked out of internal tools.
The outage knocked out the tools engineers would normally use to investigate and repair such failures, making the task even more difficult, Facebook said.
The company said it sent a team of engineers to its data centers to try to debug and restart the systems.
However, it took extra time to get the engineers inside to work on the servers because of the tight physical and system security in place.
Even after network connectivity to the data centers was restored, Facebook said it was concerned that a surge in traffic could cause its websites and apps to crash.
But because the company had run drills to prepare for such situations, its services came back online relatively quickly.
“Every failure like this is an opportunity to learn and improve,” Janardhan wrote. “From now on, our job is to… make events like this happen as infrequently as possible.”
Reporting by Sheila Dang in Dallas; Editing by Sonya Hepinstall, Grant McCool and Richard Pullin