Sunday, June 11, 2017

On Saturday, June 10th and Sunday, June 11th, our services suffered two major incidents. The Saturday incident was caused by a network problem in Helsinki, and the Sunday incident was related to our disk drives in France. This is the largest incident in our company's history.

Saturday 10th June

What happened?

Around 16.00 GMT+2 we noticed a severe degradation in our Helsinki network. Our core router was unavailable for no apparent reason. This led to serious downtime in our Helsinki network.

Solution

We immediately contacted the datacenter and had an engineer restart the router. This was to no avail. The router appeared to have some kind of hardware problem that the on-site engineer could not solve. We sent one of our own staff to the datacenter, a 2 hour drive away.

The router was up and operational at 19.00 GMT+2. This was achieved by removing an incorrect connection on the router.

Cause

Part of the problem was caused by an incorrect connection on the router, which halted the bootup of the operating system in this situation. However, the incorrect connection would not have caused this by itself.

Something unexpected had happened with the router's operating system, leading to an unscheduled reboot of the router. The incorrect connection on one of its ports caused the router to think that a person had manually intervened in the bootup. The router was waiting for manual input before loading its operating system.

Because these two issues occurred at the same time, a person had to go to the datacenter, remove the incorrect connection and restart the router manually.

Sunday 11th June

What happened?

An entire disk array of ours in France was lost for no apparent reason at 9.00 GMT+2. This caused a severe outage for a large part of our services located in France.

Solution

Customer data was recovered from backups made on 10th and 11th of June. Minecraft servers with IP addresses 151.80.78.207, 151.80.78.211 and 151.80.78.215 were affected. The Net9.fi site lost the data from the preceding 24 hours, because the latest backup was from 10th June 9.00 GMT+2. All transactions, orders and other changes made between 10th June 9.00 GMT+2 and 11th June 9.00 GMT+2 were lost. We are currently recovering this lost information by hand.
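For readers who want to check the size of the loss window, the arithmetic can be sketched as follows. This is only an illustration: the two timestamps are the ones stated above, while the function name data_loss_window is a hypothetical helper and not part of any tooling we run.

```python
from datetime import datetime, timedelta, timezone

# GMT+2, the offset used for every timestamp in this report.
GMT_PLUS_2 = timezone(timedelta(hours=2))

# Times taken from the report: latest Net9.fi backup and loss of the disk array.
last_backup = datetime(2017, 6, 10, 9, 0, tzinfo=GMT_PLUS_2)
array_lost = datetime(2017, 6, 11, 9, 0, tzinfo=GMT_PLUS_2)


def data_loss_window(backup_time, failure_time):
    """Return the span of time whose changes are missing from the backup."""
    if failure_time < backup_time:
        raise ValueError("failure cannot precede the backup")
    return failure_time - backup_time


window = data_loss_window(last_backup, array_lost)
print(f"Unrecoverable window: {window.total_seconds() / 3600:.0f} hours")
```

With a single daily backup, the worst-case window is always close to a full day, which is what happened here.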

Cause

The cause of this incident is unknown. We are currently collaborating with the manufacturer Dell and the provider Online.net to find the cause. It is expected to be a hardware problem of rare occurrence.

Mitigation

In the future, we will use replication for our MySQL databases to reduce downtime and data loss. A few times a year we will perform a disaster recovery exercise in order to learn how to recover from such situations safely and in as short a time as possible.
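Replication only helps if the replica is actually keeping up, so it needs to be monitored. The following is a minimal sketch of such a check, not a description of our final setup. It assumes one row of MySQL's SHOW SLAVE STATUS output has already been fetched into a dict; the function name replication_is_healthy, the 60 second threshold and the sample rows are hypothetical.

```python
def replication_is_healthy(status, max_lag_seconds=60):
    """Judge one row of SHOW SLAVE STATUS, passed in as a dict."""
    # Both replication threads must be running on the replica.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    # Seconds_Behind_Master is NULL (None here) when the lag is unknown.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds


# Hypothetical sample rows for illustration, not taken from our systems.
healthy_row = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 3,
}
stalled_row = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Seconds_Behind_Master": None,
}

print(replication_is_healthy(healthy_row))
print(replication_is_healthy(stalled_row))
```

A check like this could run on a schedule and raise an alert when it returns False, so that a stalled replica is noticed before it is needed for recovery.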