11.14.2022 - GovFTP - Outage 09.28.2022 RCA

This is the RCA provided to FTP Today regarding the incident that occurred on 09.28.2022.

Root Cause Analysis (RCA)

Effective Date: Monday, November 14, 2022 5:40:12 PM (UTC)

Post-Incident Conclusion
The RCA provided by Databank is a final assessment of a critical/major event. This document can include information from third-party providers. Databank works to ensure that all information has been reviewed and verified. If new information is discovered, an update to this RCA will be provided.

Incident Information
Date / Time of Incident Start: Thursday, September 29, 2022 1:38:00 AM (UTC)
Duration of Incident: Thursday, September 29, 2022 1:38:00 AM (UTC) - Thursday, September 29, 2022 6:11:00 AM (UTC), a total of 4 hours 33 minutes
Scope of Incident: GovFTP customers


Root Cause of Incident

What Happened:
During scheduled maintenance in the datacenter on 9/28/2022 at 2 PM Central time, an onsite engineer was replacing a failed disk and verifying the seating of a cable on the back of the storage array. The cable seating verification resulted in a fault on the card to which the cable was connected. Shortly thereafter, the remaining nodes in the array lost their connection to some of the disk enclosures supporting the array. When this happened, the system was no longer able to destage written data coming into the system, and the array shut itself down to preserve the integrity of the data. This caused servers dependent on this array to freeze, resulting in an outage for the connected hosts.


Incident Resolution
What Databank did to restore service:
Databank engineering and the storage vendor were both immediately aware of the situation, since they were already onsite performing routine maintenance. The issue was quickly escalated to upper management at both Databank and the storage vendor. After careful evaluation by the vendor to maintain data integrity, a reactivation of the system was started. There were complications with the node involved in the initial cable seating check, and the determination was made to start the system without that node. After some troubleshooting with the management services, the array was back up and serving data.


Corrective Actions
Vendor Corrective Actions:
Databank, in partnership with its drive vendor, is researching ways to remediate this unique and rare condition in the vendor's code.