Wednesday, April 5, 2017

A Data Center Nightmare: Single Point of Failure (1)

Every facility executive responsible for data centers can recount at least one nightmare scenario. Some come from direct personal experience; others are data center legends. All of these stories show how hard it is to keep a data center from failing. Every data center is unique: each design is a custom solution shaped by the experience of the engineer and the facility executive.

One example comes from the colocation business, which is made up of real estate companies that offer tenants space not in office buildings but in data centers. The occupants are servers, not people. A data center real estate company brands its services on a promise of non-stop climate control and power reliability. A single moment without cooling or power harms not only the tenant, which stands to lose revenue to downtime and recovery time, but also the colocation company’s business model, which rests on its Service Level Agreement (SLA).

A construction error that exposes a design miscalculation and a commissioning flaw can bring down a data center. In one nightmare scenario, the cabling between the generators and the paralleling gear was damaged during construction. As the cables were pulled through the conduits, their insulation was nicked and scraped. The damage was too slight to be detected by normal meggering (a test of insulation resistance), but it was enough to create a weak link in the mission-critical power chain.




If everything works as designed, the loss of a single cable should not be an issue. The design engineer had foreseen the potential for generator system failure and had specified paralleling gear with a programmable logic controller (PLC) programmed to handle this kind of fault. When the fault occurred, however, the PLC began shutting down the entire generator bank. Once the system was in a cascading failure, the PLC was unable to intervene.
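The intended behavior, isolating the faulted feeder while the remaining generators stay on the bus, can be sketched in a few lines. This is a hypothetical illustration in Python, not the site's actual PLC program; the names `Generator` and `handle_cable_fault`, the four-unit bank, and the capacity and load figures are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    name: str
    capacity_kw: float
    breaker_closed: bool = True   # connected to the paralleling bus
    cable_faulted: bool = False   # feeder cable fault detected


def handle_cable_fault(bank, load_kw):
    """Open only the breakers on faulted feeders.

    Returns (online capacity in kW, whether that capacity covers the load).
    """
    for gen in bank:
        if gen.cable_faulted and gen.breaker_closed:
            gen.breaker_closed = False  # isolate the affected unit only
    online_kw = sum(g.capacity_kw for g in bank if g.breaker_closed)
    return online_kw, online_kw >= load_kw


# Hypothetical N+1 bank: four 1000 kW units serving a 3000 kW load.
bank = [Generator(f"G{i}", 1000.0) for i in range(1, 5)]
bank[1].cable_faulted = True  # the damaged cable on G2 finally fails

online_kw, load_carried = handle_cable_fault(bank, load_kw=3000.0)
print(online_kw, load_carried)  # prints: 3000.0 True
```

In this sketch one breaker opens and the load stays on the three remaining units. In the incident described here, by contrast, the PLC began shutting down the whole bank, leaving no generator capacity on the bus.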




When the shutdown was complete and the paralleling switchgear was cold, the entire site transferred to battery power. Within the 15-minute design runtime, the batteries were depleted and every customer’s computers went offline. The data center had failed, and the colocation company’s branding promise had been seriously compromised.

Why did this happen? Was it a construction error? A commissioning oversight? Could it be pinned on the owner’s design manager, the one who devised the paralleling scheme in the first place? What about the engineering design team?

There were multiple causes for the failure. In this instance, a construction craftsmanship issue revealed a design shortfall.


Source of the Problem


It is clear that more rigorous testing was needed before commissioning. The failure also indicated that the PLC had not been programmed correctly to clear this fault condition, which means it had never been commissioned against this fault scenario. The sequence should also have been part of the preventive maintenance program, a change that was made after the disaster.

The design/commissioning team had not anticipated the exact failure sequence. The project would have benefited from more involvement during the design phase by a commissioning agent with specific experience in PLC programming. A third-party reviewer with relevant design and operating experience would also have added value had one been brought into the design process.

Every data center is one of a kind. The better the commissioning team can simulate real-life scenarios, the more reliable the data center will be.
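One way a commissioning team might check how many real-life scenarios it has actually simulated is to enumerate fault combinations and compare them with the test log. The sketch below is a hypothetical illustration only: the component names, the `tested` set, and the `untested_scenarios` helper are assumptions, not taken from the project described above.

```python
from itertools import combinations

# Hypothetical component list for a generator/paralleling system.
components = ["gen_1", "gen_2", "gen_cable_1", "gen_cable_2", "plc", "bus_breaker"]

# Fault scenarios the commissioning script has already exercised (assumed).
tested = {
    frozenset(["gen_1"]),
    frozenset(["gen_2"]),
    frozenset(["plc"]),
    frozenset(["gen_1", "gen_2"]),
}


def untested_scenarios(components, tested, max_simultaneous=2):
    """Return every fault combination of up to max_simultaneous components
    that does not appear in the tested set."""
    missing = []
    for size in range(1, max_simultaneous + 1):
        for combo in combinations(components, size):
            if frozenset(combo) not in tested:
                missing.append(combo)
    return missing


missing = untested_scenarios(components, tested)
print(len(missing), "scenarios not yet commissioned")  # prints: 17 scenarios not yet commissioned
print(("gen_cable_1",) in missing)                     # prints: True
```

With these made-up inputs, a generator cable fault shows up among the untested scenarios, which is the kind of gap that only surfaced here after the failure. A list like this does not replace engineering judgment about which sequences matter, but it makes omissions visible before handover.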



Continued in: A Data Center Nightmare: Single Point of Failure (2)



About the Blog


Strategic Media Asia (SMA) is one of the approved CPD course providers of the Chartered Institution of Building Services Engineers (CIBSE) UK. The team provides an interactive environment and opportunities for members of the ICT industry and facilities engineers to exchange professional views and experience.

SMA connects IT, Facilities and Design. For data center design considerations, please visit:


(1) Site Selection,
(2) Space Planning,
(3) Cooling,
(4) Redundancy,
(5) Fire Suppression,
(6) Meet Me Rooms,
(7) UPS Selection, and
(8) Raised Floor

All topics focus on key components and provide technical advice and recommendations for designing a data center and critical facilities.


