
3 steps to better data center risk management

Half a decade ago, in the wake of the rise of cloud computing, some IT evangelists, CIOs, and large tech research firms foretold the imminent death of the data center as we know it. My co-columnist at the time, Mark Settle, advised caution in writing off data centers and envisaged how they would continue to grow based on the evolution of – you guessed it – data.

Today, data centers not only survive but thrive alongside hybrid and multicloud systems in new avatars such as on-prem as a service. What's more, they are poised to meet the growing demand for services tied to emerging tech such as edge computing, IoT, and 5G.

As a result of these new applications and emerging needs in end-user computing (EUC) and mobility solutions, data centers are becoming increasingly complex, leading to more internal and external risks. Downtime is a persistent risk, with losses from a single event topping $11,000 per minute.
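At that per-minute rate, even short outages add up fast. A back-of-the-envelope sketch (the rate is the industry average cited above, not a constant for any particular business):

```python
# Rough downtime-cost estimate using the ~$11,000/minute industry
# average cited above. Actual loss rates vary widely by business.
DOWNTIME_COST_PER_MINUTE = 11_000  # USD, approximate average

def downtime_cost(minutes: float, rate: float = DOWNTIME_COST_PER_MINUTE) -> float:
    """Estimated revenue loss for an outage of the given duration."""
    return minutes * rate

# A 90-minute outage at the average rate:
print(f"${downtime_cost(90):,.0f}")  # roughly a million dollars
```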


Here’s what enterprises can do to identify and mitigate risks in data center operations.

1. Have an integrated approach to risk management

The record-setting winter storm and subsequent power outage in Texas in February 2021 proved to be a reality check for data centers in the state. Although there were no large-scale failures, there were significant issues with electrical failover systems.

From a functional point of view, data centers are physical facilities that run business-critical applications, while from the business perspective, they are pieces of real estate or capital assets that need to be budgeted for and managed.

The point is, a single point of failure can (and frequently does) cause a huge disruption to operations and, consequently, revenue loss. That is why you need a pervasive risk management plan and policy that apply to the whole organization.

This is where Integrated Risk Management (IRM) comes in. Gartner defines IRM as “a set of practices and processes supported by a risk-aware culture and enabling technologies that improve decision making and performance through an integrated view of how well an organization manages its unique set of risks.”

In the post-pandemic world, businesses implementing remote work, BYOD, CYOD, and other changes to workplace practices are blending their digital transformation strategy with IT infrastructure upgrades to identify, tolerate, and mitigate risks arising from natural disasters, supply chain disruptions, and data processing failures, as well as those inherent to their business model.

If you’re in the middle of a digital transformation, you need to monitor every process and factor – external or internal – that can affect your data center and be prepared to deal with multiple risks arising from a single event or from several events happening simultaneously.

Digital transformation is not just for the enterprise or organizations that bank heavily on data or technology – it applies just as much to SMBs in the post-pandemic workplace, including those that have started out with the public cloud as a substitute for the data center.

Even the federal government is taking digital transformation seriously – transforming data center infrastructure to take advantage of cloud technology is one of its two central objectives (improving the online user experience is the other).

“Data center optimization is a key measurement for scorecarding in the Federal Information Technology Acquisition Reform Act. This measurement is in part a reflection of how well the agency infrastructure takes advantage of the cloud,” says Jeff Shupack, a digital transformation expert with 15 years’ practice in reducing risk for global capital initiatives with Lean-Agile implementations.

Organizations are realizing that agile methodologies, big data analytics, mobility solutions, and DevOps work in tandem with a reliable and upgraded data center for efficient risk prevention, adequate risk response, and quick disaster recovery. As a result, they’re turning to frameworks that enable these best practices to be implemented in hybrid IT infrastructures to ensure business continuity, reduce OPEX, and improve digital customer experience.

2. Know your risks

No matter how comprehensive your risk management plan, it can never evolve faster than technology. And new tech and new work practices are creating more complexities than ever. Let’s take a quick look at the different types of risks data centers face.

Inadequate IT security

Arguably the biggest risk that data centers face today, cybersecurity breaches can range from DoS attacks to social engineering to data theft. The average data breach cost $4.24 million in 2021 – the highest in 17 years.

Application and system failures also have an impact on the physical security front, resulting in situations where ID cards can’t be verified, CCTV connections are lost, or authorized personnel are denied entry to certain areas.

System failure

Without a resilient architecture and continuous, redundant, and high-bandwidth connectivity, a data center is doomed. Servers, network devices, and associated equipment all need features such as clustering, mirroring, and duplication to reduce the chances of downtime.

Sometimes applications or software (such as hypervisors) act up and take down entire servers or networks with them. You need to make sure all apps work seamlessly across a hybrid infrastructure and talk to cloud-native apps as well.

Power failure

Although extremely rare, power failure can and does happen – primarily as a consequence of natural disasters. You need to provide UPS- or generator-backed power routes to all racks and cooling systems in your data center. A direct connection to a multi-substation power grid helps hedge against an outage at the local substation.

Water leakage

Flooding or water seepage can spell doom for data center equipment. At the same time, water can’t simply be kept out: well-maintained water pathways and drainage are crucial for fire control and cooling systems.

High-decibel noise

One lesser-known but significant risk to data centers is prolonged exposure to loud and high-frequency sound vibrations, which can lower the efficacy of storage systems, reduce read/write performance, and ultimately affect data integrity. Data centers should be built far away from arenas, fire stations, airports, and the like, and housed within buildings that use acoustic suppression technology.


Fire

Electrical power spikes and short circuits are common causes of fire in data centers. If not contained quickly, fires can raze thousands of dollars’ worth of hardware in minutes. Ironically, air conditioning and cooling systems dissipate smoke and make it harder to detect a fire in the early stages. Use smoke detection systems with photoelectric sensors to continuously monitor the air in your data center for signs of smoke.

Poor disaster-recovery planning

While data backup is a pretty simple procedure these days, recovery is what matters: data centers are preferred over the public cloud for a combination of security and performance reasons, and you’d expect immediate recovery of transactional data in the event of a system failure.

Of course, this depends on factors such as the nature of the business and the regulatory framework it falls under. All the more reason to have a clear-cut recovery plan for each type of failure event, and for each compute, storage, and networking resource.

The most pre-emptive disaster-recovery plans have monitoring systems in place that track risk factors affecting data centers and send out alerts when critical thresholds are crossed.
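Such a monitor can be as simple as a loop that compares sensor readings against critical limits and raises an alert when one is crossed. A minimal sketch – the metric names and thresholds below are illustrative assumptions, not values from any particular product:

```python
# Minimal threshold-alert sketch. Metric names and limits are
# hypothetical; a real deployment would pull readings from sensors
# or DCIM/monitoring software rather than a hard-coded dict.
CRITICAL_THRESHOLDS = {
    "inlet_temp_c": 27.0,   # upper bound of a common recommended range
    "humidity_pct": 80.0,
    "ups_load_pct": 90.0,
}

def check_readings(readings: dict) -> list:
    """Return an alert message for every metric over its threshold."""
    alerts = []
    for metric, limit in CRITICAL_THRESHOLDS.items():
        value = readings.get(metric)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {metric}={value} exceeds {limit}")
    return alerts

print(check_readings({"inlet_temp_c": 31.5, "ups_load_pct": 72.0}))
```

In practice you would run checks on a schedule and route alerts to an on-call system instead of printing them.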

3. Assess risk before you manage it

Not all risks – like not all businesses – are created equal. While data centers face their own distinctive risks, especially across different verticals, the risk-mitigation techniques you end up using need not necessarily be tailored to a data center environment.

Therefore, you need a risk management plan that lists out every imaginable risk your data center faces and specifies responses to every type of incident. Before it happens.

Start by carrying out a risk audit – a comprehensive assessment of all your owned and operated facilities. Evaluate factors that affect facility design, IT infrastructure, and operational processes.

If there have been major incidents or outages in the past, do a root-cause analysis (if still possible) to address any gaps you haven’t covered. What can you do to ensure downtime won’t recur should similar circumstances arise again?

Further, if you operate a hybrid architecture with multiple data centers and cloud systems, audit each one on its own as well as the data paths and connections between all of them.

If you operate in a highly regulated industry such as finance or healthcare, you need to make periodic data center risk assessments and disaster testing a part of your routine operations.

As with everything else, creating a framework, policy, or cheat-sheet (at the very least) provides a ready reference of the categories of risk that apply to you, the systems that each category affects, the estimated damage and recovery costs, and the protocol to be followed in case of an incident or disaster.
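The cheat-sheet described above can be captured in a simple risk register. A sketch with illustrative entries – the categories, systems, and cost figures are made-up examples, not audit data:

```python
from dataclasses import dataclass

# Bare-bones risk register along the lines described above.
# All entries are illustrative examples.
@dataclass
class RiskEntry:
    category: str            # e.g. "Power failure", "Water leakage"
    affected_systems: list   # systems the risk category touches
    est_recovery_cost: int   # USD, rough estimate
    response_protocol: str   # pointer to the runbook to follow

register = [
    RiskEntry("Power failure", ["racks", "cooling"], 250_000,
              "runbook: fail over to UPS/generator, verify transfer switch"),
    RiskEntry("Water leakage", ["fire control", "cooling loops"], 80_000,
              "runbook: isolate supply line, inspect drainage"),
]

# Sort by estimated recovery cost to prioritize mitigation spend.
for entry in sorted(register, key=lambda r: r.est_recovery_cost, reverse=True):
    print(entry.category, entry.est_recovery_cost)
```

Even a spreadsheet with these same columns serves the purpose; the point is having the categories, affected systems, costs, and protocols written down before an incident.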

For example, IT Consulting company Capgemini employs an evolving approach to risk management that identifies and quantifies risks along with their mitigation costs. “We have put in place a monthly risk management system that logs all risks and issues with containment and action plans. An investment budget is made available if changes are required,” said Kevin Read, Senior Delivery Center Manager at Capgemini.

Killing downtime

A data center – or even the entire IT infrastructure of a company – never functions in isolation. There are umpteen components and factors that keep data centers running around the clock.

Risk mitigation with IT infrastructure is a shared responsibility, not just the CIO’s or CTO’s. You need to have an adequate number of IT staff trained and willing to do what it takes to stay on top of data center operations.

I’ll leave you with a piece of advice from Gavin Millard, VP of Product Marketing at Tenable: “Conflicting goals can be hard to address, but one of the most effective methods of doing so is to have a highly efficient process for continuously identifying where a risk resides. You also need a predictable, reliable method of updating systems without impacting the overarching business goals of the organization.”
