...or how we almost locked ourselves out of our managed servers!

Thursday, 19 March 2020 starts as a "normal" day in the home office. The whole team at nine has been working from home since the beginning of the week and has meanwhile adjusted to the new situation to some extent. We use Slack or Hangouts for our daily stand-ups and meeting video calls, and discuss on Slack which of the two is the more stable system.

11:00 We realize that there is a problem - one of our customers can't connect via VPN, and our monitoring alerts us that some servers are down. There are more and more critical warnings from internal and external monitoring systems, and individual employees report via Slack that they are unable to connect to servers and that VPN connections are being rejected. But the exact extent of the problem is still unknown - is it a small outage or a crisis? Fact is - we must react immediately! A first analysis takes place within the team, and it is obvious that something is wrong with the network, perhaps with the DNS resolvers. We have a crisis!

11:20 The officer of the day, Tajno, starts a dedicated Slack channel. Stefan and Patrick from the IT / Managed Services team take over the coordination in the teams, and Kyon from the Customer Service Desk (CSD) takes over the external communication. Of course, our customers have noticed that their websites and the mail system are no longer working, and the CSD has its hands full answering questions from customers. The question is mostly the same: "How long will it take until my website is back or until I can send important emails again?" We don't know the answers either, as we are just as much in the dark at this point.

At least we know by now that the cause is not the network. This finding does not come easily, because the monitoring systems are not accessible either. But we quickly notice that there are still open SSH connections to our servers and that only new connections are rejected. We no longer believe that our DNS resolvers are the root problem. But that is just another symptom, because we can't even connect over a direct IP to some of our servers. We are effectively locked out from our own servers!

12:10 Lukas drives to the data center; Team Managed Services - especially André - switches to hacker mode. Without access to our own servers and without DNS we are effectively stuck. Now there are two options - physical access to the servers, or access via a management system, which we also use in our daily business to repair servers that no longer boot. André finds a working management server through our emergency system. On this server he uses Google DNS to find out the IP of our resolver, and with that IP he can connect via SSH. We are getting closer to the cause of the problem. This trick is also used by our root customers: they bring their servers back online themselves by switching from the nine resolvers to other working DNS resolvers.

It quickly becomes clear that the firewall rules, which otherwise prevent unauthorized access to the systems, have been changed. They are set to a default that does not even allow nine employees to log in from the management system - suddenly we are asked for passwords. Similarly, access to our monitoring and to services such as web servers, databases, DNS etc. is not possible. We observe the same behavior on other servers. The rules are quickly adjusted manually, and the first system is working again.

12:35 André reports via Slack that the first resolver is available again. Now that we effectively have access to our internal systems via DNS again, we start to restore the customer systems with an SLA. The task force, which now consists of all Managed Services team members, starts to restore the remaining nine systems and coordinates via Slack. But we potentially still have thousands of affected systems.

But then the unexpected happens! Already "fixed" systems are broken again. Take a breath - there is only one reason why something like this can happen: Puppet, our automation software, which manages every other aspect of a server - software installation, security and upgrades. Systems where Puppet is disabled remain fixed. Is it a faulty commit that tells Puppet to make this change? Josi gives the all-clear - there are no suspicious commits in the Git history. Why are you doing this to us, Puppet?
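The resolver trick used during the incident - asking a public resolver for an IP, then pointing the server at working nameservers - can be sketched roughly like this. The hostname, IPs and file path are illustrative and not taken from the post, and the demo writes to a local file instead of `/etc/resolv.conf` so it can be tried without root:

```shell
#!/bin/sh
# 1) Our own resolvers are down, so ask Google's public resolver (8.8.8.8)
#    directly for the IP of the host we need to reach. The hostname is
#    hypothetical; '+short' prints only the address. '|| true' lets the
#    sketch continue on machines without network access or without dig.
dig @8.8.8.8 +short resolver1.example.net || true

# 2) Switch the server to working public resolvers until the incident is
#    over. On a real server the target would be /etc/resolv.conf; here we
#    write a local demo file so no root privileges are needed.
conf=./resolv.conf.demo
cat > "$conf" <<'EOF'
nameserver 8.8.8.8
nameserver 1.1.1.1
EOF

cat "$conf"
```

With the IP from step 1 in hand, an SSH connection to that address no longer depends on the broken resolvers at all.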