In a further clarification about the glitch on 24 February, which led to a trading halt of more than three hours, the National Stock Exchange (NSE) issued a statement on Monday saying that a Storage Area Network (SAN) system failure led to the incident. It said that the SAN system at the primary data centre stopped functioning, which was completely unexpected. The exchange added that it is exploring alternative solutions to reduce the dependency of critical applications on a single storage device.
“On 24 February, post link failure, we saw unexpected behaviour of the SAN system, with the primary SAN becoming inaccessible to the host servers. This resulted in the risk management system of NSE Clearing and other systems such as clearing and settlement, index and surveillance systems becoming unavailable,” NSE said in an official statement.
The SAN is a fault-tolerant system designed to function seamlessly even in the event of telecom link failures between the primary and Near Disaster Recovery (NDR) copies. One of the features of the SAN deployed in October 2020 was designed to provide not just zero data loss but also zero downtime. Before deployment, the system was tested against various scenarios, including link failures, and functioned properly, NSE said.
Subsequent incident analysis showed that the problem was caused by failover logic implemented by the vendor, which did not conform to NSE’s stated design requirements, coupled with issues in the configuration done by the SAN vendor that triggered the failover logic. “We note that the specific failure logic used by the vendor is not documented, was not communicated to NSE, and was not appropriate for NSE’s setup. The resultant SAN failure led to the incident on 24 February,” the statement said.
It added that while there was no impact on the trading system itself, allowing trading to continue without the risk management system posed an unacceptable risk, and hence trading had to be halted.
NSE’s primary data centre is in BKC (Mumbai), a Near Disaster Recovery (NDR) site is maintained in Kurla, and the disaster recovery (DR) site is in Chennai. The statement said that there is synchronous data replication between the primary site in BKC and the NDR site to ensure no data loss in case of a primary site failure, and asynchronous replication to the DR site in Chennai, which is designed to take over with zero data loss in case of a disaster at the primary site.
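To make the distinction between the two replication modes described above concrete, the following Python sketch is a purely hypothetical illustration, not NSE’s actual implementation; the class and function names are invented. It shows how a synchronous write is acknowledged only after the replica has stored it, while an asynchronous write is acknowledged immediately and shipped to the replica later.

```python
# Illustrative sketch only (assumed model, not NSE's system): synchronous
# replication to an NDR-style site vs. asynchronous replication to a DR-style site.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = []

    def store(self, record):
        self.data.append(record)

def synchronous_write(primary, replica, record):
    primary.store(record)
    replica.store(record)  # wait for the replica before acknowledging the write
    return "acknowledged after replica confirmed (no data loss on failover)"

def asynchronous_write(primary, replica, record, pending):
    primary.store(record)
    pending.append((replica, record))  # replicate later, in the background
    return "acknowledged immediately (replica may lag behind the primary)"

if __name__ == "__main__":
    primary, ndr, dr = Replica("primary"), Replica("NDR"), Replica("DR")
    pending = []
    print(synchronous_write(primary, ndr, "trade-1"))
    print(asynchronous_write(primary, dr, "trade-1", pending))
    for replica, record in pending:  # drain the asynchronous queue later
        replica.store(record)
```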
“Between our primary and NDR sites, NSE has multiple telecom links with two service providers to ensure redundancy. On 24 February, we had instability in links from both service providers, primarily due to digging and construction activity along the path between the two sites. The replication to NDR is designed such that in the event of the links between primary and NDR getting cut, the primary continues operations without any direct effect. Post earlier link failures in February 2021, operations continued without any interruption,” NSE said.
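The design intent quoted above, that a link failure alone should never stop the primary, can be contrasted with the behaviour seen on 24 February. The sketch below is an assumption-laden illustration in Python (not the vendor’s code; both function names are invented) of an intended policy versus a faulty one that reacts to the link loss itself.

```python
# Illustrative sketch only: intended link-failure handling vs. behaviour
# consistent with the incident described in the statement.

def intended_link_failure_policy(primary_healthy: bool) -> str:
    """Design intent: a link failure alone never takes the primary offline."""
    if primary_healthy:
        return "continue on primary; resynchronise NDR when links are restored"
    return "fail over to NDR"

def faulty_link_failure_policy(primary_healthy: bool) -> str:
    """Behaviour consistent with the incident: the failover logic reacts to the
    link loss and leaves the primary SAN inaccessible to the host servers."""
    return "primary SAN inaccessible to host servers"

if __name__ == "__main__":
    print("design intent :", intended_link_failure_policy(primary_healthy=True))
    print("24 Feb outcome:", faulty_link_failure_policy(primary_healthy=True))
```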
NSE said that various steps have already been taken, and others are under implementation, to address the SAN and telecom link issues. “We had already placed orders in January for two additional telecom provider links and have removed the SAN software that caused the incident. We are also exploring alternate solutions to de-risk dependency of critical applications to a single storage device,” NSE said.
Vendors such as Cisco, HP, Dell, Hitachi, Checkpoint, Palo Alto and Oracle support NSE’s fault-tolerant technology infrastructure, aided by technology service providers such as TCS, Cognizant and Wipro.