The UK Air Traffic Control outage highlights the impact that data corruption can have, and database experts are urging businesses to review their disaster recovery solutions to mitigate the risk of similar disasters in their own industries.
Upon hearing the news that UK air traffic had been halted and that I was now stranded at Fiumicino Airport in Rome for five days, I immediately began to question whether NATS had done everything possible to mitigate the risks that could cause such chaos for the aviation infrastructure: a system operating at 98% capacity, with very little margin for disruption.
Air traffic control is responsible for directing aircraft safely through airspace, preventing collisions and organising the flow of traffic. With the automated system down, flight plans were being processed manually, and many aircraft were not flying at all. I was among the 200,000 passengers affected by delays; though many of us were unconcerned with the physics behind Air Traffic Control, we all understood the significance of so important a system failing, yet few of us anticipated the disruption that would follow from an outage of roughly four hours.
It was now too late to find an alternative route back to London and return to the office, so I instead made use of five days stuck in a hotel room by asking my colleagues (who happen to be database experts): “What happened to Air Traffic Control? And how would WellData have responded?”
Although statements released by authorities have been vague, we now understand that at around 8:30 am on Bank Holiday Monday, controllers noticed that squawks were no longer appearing on radar screens. Following that, supervisors identified that flight plan data was not being uploaded automatically, and so they switched to manual input.
The technology behind the flight planning system had failed. Typically, at this point a backup system would kick in with enough stored data to maintain service for four hours; once those four hours of data had been consumed, every flight plan would need to be entered manually, requiring significant reductions in air traffic.
In this case, NATS did not fix the glitch until 3:30pm on Monday, by which time 799 outbound and 786 inbound flights had been cancelled, accounting for 27% of UK air traffic. This excludes the further cancellations and delays experienced since then as a result of the issue.
Following the fix, NATS revealed that a single piece of flight plan data had been incorrectly entered by an airline, and that this data caused the automatic processing system to suspend itself so that the incorrect safety-related information could not affect any other elements of the Air Traffic Control (ATC) system.
With very little information available regarding the ‘several layers of backup’ which, according to chief executive Martin Rolfe, exist, WellData is unable to comment on whether the techniques outlined below are in place. However, we share this article as a thought-provoking exercise for others considering disaster recovery solutions, in the hope of helping prevent disasters of this significance in other industries.
I knew the specialists that I work alongside would have an answer to the issues faced by Air Traffic Control before I even asked the question.
Although we have little insight into the practices in place at UK ATC, Phill Clayton remarked: “There have been various hints within press releases that lead me to suspect their backup system runs as a high availability standby to the primary system, meaning the standby is maintained as an exact replica of the primary, constantly updated with new and changing data in real time. Where physical technology fails, this solution is very appropriate. On this occasion, however, the incoming data could not be processed by the primary system, which suspended itself; that same data will then have been replicated to the backup system, which experienced the same error and suspended itself too.”
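To make that failure mode concrete, here is a minimal sketch of how such a real-time standby might be configured, using PostgreSQL streaming replication purely as an illustration (we have no knowledge of the technology NATS actually runs, and the hostname below is hypothetical). Note that nothing here delays the replay of incoming changes:

```ini
# postgresql.conf on the standby server (illustrative example only)
# Connection back to the primary; host and user are hypothetical.
primary_conninfo = 'host=primary.example.com user=replicator'
# Allow read-only queries while in standby mode.
hot_standby = on
# No recovery_min_apply_delay is set, so every change from the
# primary -- including any bad data -- is replayed almost instantly.
```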
James Newton-Brady supported this theory and added: “While disaster recovery scenarios are unique and planning should be done for each individual case, there is an opportunity for businesses to reconsider their Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO), which should account not only for physical infrastructure failures but also for data error and corruption failures. In the case of data errors, it may not be prudent to keep your backup system exactly up to date. During an incident, a delay gives operations staff time to stop the backup system from ingesting the bad data, determine the point to which the backup should be recovered, and perform a database ‘point in time recovery’. The backup system can then be brought back online and services resumed. When planning Recovery Point Objectives, it is useful to assess the appropriateness of a ‘Delayed Disaster Recovery’.”
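Under a ‘Delayed Disaster Recovery’ approach, the same standby can be held deliberately behind the primary. In PostgreSQL, again purely as an illustration, a single setting introduces that window (the four-hour figure simply mirrors the ATC example; the appropriate delay should come from your own RPO/RTO analysis):

```ini
# postgresql.conf on a delayed standby (illustrative example only)
primary_conninfo = 'host=primary.example.com user=replicator'
hot_standby = on
# Hold back the replay of changes by four hours. Bad data written to
# the primary does not reach this replica until the delay elapses,
# giving operators time to pause replay with
#   SELECT pg_wal_replay_pause();
# before it is ever applied here.
recovery_min_apply_delay = '4h'
```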
Phill Clayton added: “Many would assume that a delayed backup necessarily requires a long catch-up period, of the order of the delay itself; however, this is not the case, as replaying the accumulated changes is much faster than the original processing. Recovering data from the point of loss is far less detrimental to business operations than trying to restore a system which contains bad or corrupt data. Delayed disaster recovery is therefore an appropriate first step for businesses to take before resorting to recovering their system from offline backups.”
Following our conversation, we knew we had to share our insights with companies who could find themselves facing an extraordinary disaster scenario that may not yet feature in their recovery plans. UK Air Traffic Control confirmed that incidents such as the one experienced last week are extremely rare: in the last ten years of aviation there have been only a handful of incidents causing disruption, and none approaching the significance of the most recent event. So, while we cannot predict every disaster scenario, we must do more to mitigate the risks of those involving data corruption.
With this in mind, we debated the topic of ‘delayed disaster recovery’. One statement held true throughout our discussion. As James Newton-Brady put it: “Often blame lies with the physical infrastructure that a system runs on, but we fail to acknowledge that data can also be at fault.”
“Sometimes we don’t want to restore all the data from the latest backups because the data itself has caused the issues. We are aware of scenarios in which data has caused issues in the past and in such instances, we have recovered to a point before such data exists.”
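In PostgreSQL terms, once more purely as an illustration, such a point-in-time recovery restores the last base backup and replays the archived write-ahead log only up to a moment before the bad data was written (the archive path and timestamp below are hypothetical):

```ini
# postgresql.conf during a point-in-time recovery (illustrative only)
# Fetch archived WAL segments; the archive path is hypothetical.
restore_command = 'cp /archive/%f %p'
# Stop replaying just before the bad data entered the system.
recovery_target_time = '2023-08-28 08:25:00'
# Pause at the target so the state can be inspected before promotion.
recovery_target_action = 'pause'
```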
It is also important for companies to “be aware that an RPO and RTO that works for one company may not do so for another”. We therefore recommend that planning involves the key stakeholders as well as a database consultant who can put the requirements in place for a comprehensive disaster recovery solution.
We concluded that had UK Air Traffic Control’s secondary system been run as a delayed disaster recovery, chances are the bad data could have been skipped and the significant disruption caused by a single piece of corrupt data avoided.
So, now we ask you: is delayed DR a solution for your business?
If you are unsure whether to reconsider pure high availability and trade perfectly up-to-date data for continued function, speak to the experts in our team and let us weigh up the pros and cons for you.
We provide advice on an individual basis, giving bespoke recommendations and action plans, backed by demonstrated knowledge across industries and technologies, and we take the time to run discovery workshops and get to know your business inside out.
But you should never risk one line of corrupt data bringing down your systems. It boils down to how valuable perfectly up-to-date data is versus the value of your business operating correctly. In NATS’s case, the choice to run a real-time backup of data is estimated to have cost £100 million through the impact on the functioning of the business.