Dear community, as some of you may have noticed, we've suffered several infrastructure outages over recent weeks. These caused our website, Minecraft servers and internal tooling to be unavailable. We want to be transparent with everyone about what is going on, why these outages happened and how we plan to tackle them moving forward. The post below involves some technical details; you can skip to the summary at the bottom for a short, simplified version.
Storage
Within the hosting industry, and especially cloud hosting, storage comes in many shapes and sizes, each with its own pros and cons. Some setups value simplicity, some value scalability, others value performance or integrity. When setting up our infrastructure, we had to choose between these trade-offs and went with a storage solution that prioritized integrity and scalability. As such, we run an internal block storage solution called Longhorn, with volume replication across all of our servers in geographically separate data centers. This means that all machines are constantly replicating each other's volumes. This is great for keeping data intact and fault-tolerant, and it lets us quickly move software deployments such as databases to new machines without having to wait for file transfers.
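For the technically inclined, here is a minimal sketch of what requesting a replicated Longhorn volume through Kubernetes can look like, using the official Python client. The storage class name, replica count, parameters and namespace below are illustrative assumptions, not our actual production configuration.

```python
# Minimal sketch (not our production config): request a Longhorn-backed volume
# that keeps multiple replicas, via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

storage_api = client.StorageV1Api()
core_api = client.CoreV1Api()

# A StorageClass backed by Longhorn's CSI driver; "numberOfReplicas" tells
# Longhorn how many copies of each volume to keep on different nodes.
storage_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="longhorn-replicated"),  # assumed name
    provisioner="driver.longhorn.io",
    parameters={"numberOfReplicas": "3", "staleReplicaTimeout": "2880"},
)
storage_api.create_storage_class(storage_class)

# A claim against that class; any pod mounting it gets a replicated volume.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="example-database-data"),  # assumed name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="longhorn-replicated",
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)
core_api.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```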
That sounds great, so why bring it up? Everything works well as long as everything remains connected. As soon as a node loses connection with the others, the entire machine is marked as unhealthy. That by itself is fine, because we can then deploy software on different machines and recover quickly. The problem, however, is that a node becoming unhealthy causes the storage solution to shut itself down on that machine. The replica status is lost and no health can be communicated anymore, so the data replica on the unhealthy node is immediately considered degraded. This doesn't mean data loss, nor does it necessarily cause any damage, but it does cause the storage solution to rebuild the entire replica on that node from one that remained healthy. As some of you may know, we replicate data across our American and European servers, and rebuilding a volume across the Atlantic Ocean comes with a big drop in network capacity. To save budget and reserve extra capacity for alpha and beta development, excess hardware has been removed, so we cannot reliably place the replicas for these volumes across multiple machines on the same continent and are forced to replicate across continents. A full replica rebuild therefore takes time, and doing this for dozens of volumes puts considerable strain on the connection between the two continents. In some cases, volumes cannot be attached to workloads before they are fully healthy. This was the cause of the lengthy outages of the past few weeks. Why machines lost connection in the first place is covered later in this post.
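To make the cost of a cross-continent rebuild concrete, here is a back-of-the-envelope sketch. The volume sizes and the effective transatlantic throughput are made-up illustrative numbers, not measurements from our infrastructure.

```python
# Rough, illustrative estimate of how long full replica rebuilds take when
# they have to cross the Atlantic. All numbers below are hypothetical.
def rebuild_hours(volume_gib: float, effective_mbit_per_s: float) -> float:
    """Hours needed to copy one full replica at the given effective throughput."""
    bits = volume_gib * 1024**3 * 8
    return bits / (effective_mbit_per_s * 1_000_000) / 3600

intercontinental_mbit = 200          # assumed usable transatlantic throughput
volumes_gib = [100] * 24             # assumed: two dozen 100 GiB volumes

per_volume = rebuild_hours(100, intercontinental_mbit)
total = sum(rebuild_hours(v, intercontinental_mbit) for v in volumes_gib)

print(f"one 100 GiB replica: ~{per_volume:.1f} h")     # ~1.2 h
print(f"24 volumes sharing the link: ~{total:.1f} h")  # ~28.6 h if done one by one
```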
The above wouldn't be a concern if a short connection drop didn't immediately kill the replica instance on the machine that loses connection. The developers of the storage solution agree, and a bug/feature ticket has been created on their GitHub as a result. We were aware of this issue and have been tracking it for some time. Unfortunately, its priority wasn't high enough initially, and a resolution has been pushed back to the next major update. Since the outage last weekend, we have tweaked several settings in the storage solution and are monitoring the results. We have an alternative storage plan drawn up in case this solution remains problematic.
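For those curious what "tweaking settings" can mean in practice, the sketch below shows one way such a knob could be adjusted programmatically. It assumes Longhorn's Setting custom resources under the longhorn.io API group and the replica-replenishment-wait-interval setting, which delays rebuilding replicas after a node goes offline; the API version and the chosen value are assumptions that depend on the Longhorn release, so treat this as an illustration rather than our exact change.

```python
# Sketch only: raise the time Longhorn waits before rebuilding replicas of a
# degraded volume, so a brief connection drop does not trigger a full rebuild.
# API group/version, setting name and value depend on the Longhorn release.
from kubernetes import client, config

config.load_kube_config()
crd_api = client.CustomObjectsApi()

crd_api.patch_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta1",               # newer Longhorn releases use v1beta2
    namespace="longhorn-system",
    plural="settings",
    name="replica-replenishment-wait-interval",
    body={"value": "1800"},          # assumed value: wait 30 min before rebuilding
)
```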
Nodes losing connection
As discussed above, issues occur when nodes lose connection. This by itself has several causes, and we'll share three of them here.
First, a worldwide OVH outage on the 13th of October caused a connection loss between all of our servers. Exact details are missing, but the general understanding is that a routine large-scale BGP update at OVH went wrong, causing all routes to disappear. The same happened during the worldwide Facebook outage not long before that. In cases like this we are powerless, and with outages like this the Minecraft industry as a whole takes a beating. OVH resolved the outage within a reasonable time, but it caused our storage solution to panic, and all volume replicas had to be rebuilt as a result.
Secondly, we only recently came to understand the magnitude of this bug (or at least, limitation) in our storage solution. We've performed several rolling updates on our orchestration platform in production to keep everything up to date. This can normally be done without any downtime, but because of the storage limitation it does cause downtime. We weren't aware of this until it had happened a few times.
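One mitigation we are looking at is making rolling updates storage-aware: only move on to the next node once every volume reports itself healthy again. A rough sketch of such a check follows; it assumes Longhorn's Volume custom resources expose a status.robustness field, and the API version and polling interval are illustrative.

```python
# Sketch: before draining the next node in a rolling update, wait until no
# Longhorn volume is still degraded. Field names assume Longhorn's Volume CRD.
import time
from kubernetes import client, config

config.load_kube_config()
crd_api = client.CustomObjectsApi()

def no_volume_degraded() -> bool:
    volumes = crd_api.list_namespaced_custom_object(
        group="longhorn.io", version="v1beta1",
        namespace="longhorn-system", plural="volumes",
    )
    return all(
        v.get("status", {}).get("robustness") != "degraded"
        for v in volumes["items"]
    )

def wait_until_safe_to_drain(poll_seconds: int = 30) -> None:
    while not no_volume_degraded():
        print("degraded volumes present, delaying node drain...")
        time.sleep(poll_seconds)

wait_until_safe_to_drain()
print("all replicas healthy, safe to drain the next node")
```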
Thirdly, the most recent cause was a problem with our CNI (Container Network Interface) plugin. We are still unsure of the exact cause, but the plugin suddenly crashed on one of our machines, resulting in a loss of connectivity between processes on that machine. Rebooting and reinstalling parts of the system did not help. We could not trace the root cause, but eventually resolved the issue by updating the kernel and operating system on the machine. We suspect an unknown, undiscovered bug triggered by an edge case in our specific combination of kernel, operating system and CNI plugin versions.
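For context, a crash like this shows up as pods on the affected machine no longer being able to reach each other or cluster services. A simple probe along the lines below, with placeholder addresses, is roughly how such a loss of connectivity can be confirmed from the suspect node.

```python
# Sketch: quick TCP reachability probe to confirm loss of pod-to-pod or
# pod-to-service connectivity on a suspect node. Endpoints are placeholders.
import socket

ENDPOINTS = [
    ("10.43.0.1", 443),      # e.g. the in-cluster Kubernetes API service
    ("10.42.3.17", 25565),   # e.g. a Minecraft server pod on another node
    ("10.42.1.5", 5432),     # e.g. a database pod
]

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in ENDPOINTS:
    status = "ok" if reachable(host, port) else "UNREACHABLE"
    print(f"{host}:{port} -> {status}")
```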
Summary
In short, a combination of sudden connectivity drops and a current limitation of our internal storage solution caused several unexpected, lengthy outages. Moving forward, we are adjusting settings to get the desired behavior from our systems and are continuously monitoring the results. An alternative storage plan is ready in case this setup remains problematic.
We sincerely apologize for the outages. We are still actively learning from all of the feedback we've received since launch, on both a functional and a technical level. Thank you for your continued understanding and patience.