We are currently facing a major network outage at one of our cloud providers.
08:25 Update from Cloudscale: "In order to ease the load on the CPU we have disabled a control protocol (BFD) completely hoping to stabilize the situation again. Investigating"
08:34 Update from Cloudscale: "Situation is stabilizing. Several services need to be taken care of due to split brain situations. We will keep you posted"
08:44 Update from Cloudscale: "All services are healthy again. We will keep monitoring the situation and will follow up with next steps."
09:11 Update: Currently the situation has stabilized and we're making good progress recovering services. We continue to monitor the situation and are in close communication with Cloudscale.
10:18 Update from Cloudscale: "We are facing another outage."
10:20 Update: Our ticketing system, email servers and customer control panel are also affected by the current networking outage.
12:56 Update: Since the issue at Cloudscale is ongoing, we are planning and preparing the migration of our core infrastructure service to another cloud provider. In parallel, we're coordinating with customers to take recovery actions and migrate services to other cloud providers wherever possible. We still observe high packet loss to all services running at Cloudscale RMA; intermittent connections are possible.
13:55 Update from Cloudscale: "Our networking infrastructure is currently working normally. However, it is still possible for performance or availability issues to recur. We continue investigating the issue and will take further measures as needed."
16:00 Update: At the time of this writing, 16:00 CET, most systems are stable. Some are still recovering, but they are doing so at a steady pace.
VSHN strongly recommends against making any hasty decisions at this moment. We are in full emergency mode, and we will be monitoring and restoring systems as needed over the weekend to ensure the best possible level of service to all of our customers.
In parallel, we are currently preparing and executing separate plans for affected customers, details of which we will be sharing on a case-by-case basis.
We are extremely sorry for the trouble this issue has created, and we want to assure you that all hands are on deck to solve it even as these words are written. We will provide updates on our status pages at 17:30 CET. Please check the APPUiO Status Page and the VSHN Status Page. We take downtime very seriously and we strive to maintain high availability.
18:00 Right now all systems are operational and stable, but we are still keeping a close eye on them. A number of staff members will continue to work on this issue over the weekend, and we have made sure that several VSHNeers are on standby should the situation require more immediate action.
Saturday Nov 23rd 11:00 We kept monitoring the systems overnight, and all systems are operational and stable. Our monitoring did not record any packet loss or other network issues. We will keep monitoring the situation closely over the weekend; VSHNeers are still on standby in case the situation changes and requires immediate attention.
14:00 Update from Cloudscale: They are convinced they have found the root cause and have measures in place to prevent further incidents. See https://cloudscale-status.net/incident/110 for more details.