
Automating AI Data Centers Highlights Data Management Challenges

AI is playing a growing role in the future prospects of many industries and businesses. As a result, companies are making big bets on data centers stuffed with GPUs. These costly investments need to drive outsized ROI, and there’s no way to ensure that without also investing in automation. But as enterprises reinvest in on-prem environments, they’re discovering that infrastructure automation in traditional data centers is not only critical, it’s also rather difficult.

The devil is in the details

A lot of infrastructure automation’s recent development has come about because of cloud adoption. Cloud services are built with automation in mind, making it easy to deploy and update configurations for individual cloud offerings. That’s because cloud providers abstract the underlying infrastructure and present APIs that are automation-ready. As a result, infrastructure engineers and IT teams don’t have to worry about things like hardware ports in a cloud API; it’s all expressed as a service. On top of that, tools like Terraform support more advanced cloud infrastructure automation.
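
To make “automation-ready” concrete, here’s a minimal sketch of provisioning GPU compute through a cloud API, assuming AWS and the boto3 SDK; the region, AMI ID, and instance type are placeholders, not recommendations:

```python
# Minimal sketch: requesting compute through a cloud API (assumes AWS + boto3).
# The region, AMI ID, and instance type below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The hardware behind this request (racks, NICs, cabling, switch ports) is
# entirely abstracted away; the caller only describes the service they want.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="p4d.24xlarge",      # example GPU instance type
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```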

But on-prem infrastructure automation is quite a different story. Whereas the cloud abstracts away the complexity of the lower layers for customers, on-prem infrastructure is not natively abstracted. You have to deal with far more underlying infrastructure detail, and you ultimately need a way for all that infrastructure to present itself as a catalog of consumable, rapidly deployable services.

When I refer above to lots of infrastructure detail, remember that AI data centers are often gigantic. We’re talking about facilities the size of football fields that can cost hundreds of millions of dollars to build, equip, and operate. There are countless endpoints: a typical environment might have thousands of networking devices, tens of thousands of servers, and hundreds of thousands of cables linking everything together. So the level of detail and the size of the automation challenge are seriously non-trivial.

It takes multiple dedicated engineers simply to understand how to abstract and automate all these systems. It can take even more manpower to execute and oversee those automation efforts.

A data management challenge, times a million

When you’re dealing with on-premises infrastructure at this level, the core problem of automation is data management. Of course, data is essential to any automation. If you’re using a standalone Python script and you feed it the wrong data, you won’t get the right result. Automating an AI data center takes that problem and multiplies it a million times.

Since all the infrastructure sitting in your AI data center isn’t abstracted, turning it into services means that a huge proportion of the automation work comes down to managing complex sets of data:

  • Business logic: You need to capture the business logic for the services you want to deliver. For example, in an AI data center, services might be defined as small, medium, and large infrastructure “pods” that consist of GPUs, CPUs, and storage interconnected with networking infrastructure. The design of these pods must include the logic of how all these components interconnect and how they present together as a consumable service.
  • Service translation: That high-level business logic or service design must be translated into the specific infrastructure in the data center. Keep in mind that this is easiest when everything is brand new and at its most homogeneous. That doesn’t last long: over time, translating design into actual infrastructure has to account for changing components, new vendors, new operating system versions, and so on.
  • Transformation to drive action: To actually deploy a pod, that translated design must then be transformed into specific configuration data that can be read and pushed by a tool like Ansible or Nornir. (A minimal sketch of all three steps follows this list.)
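
To make those three steps concrete, here’s a minimal sketch in Python. The names and sizing values are hypothetical and the logic is deliberately simplified; a real pod design carries far more detail:

```python
# Minimal sketch of the three data steps above (hypothetical names and values).

# 1. Business logic: the service catalog, expressed as data.
POD_SIZES = {
    "small":  {"gpu_nodes": 4,  "leaf_switches": 2},
    "medium": {"gpu_nodes": 16, "leaf_switches": 4},
    "large":  {"gpu_nodes": 64, "leaf_switches": 8},
}

# 2. Service translation: map the design onto concrete inventory in the data center.
def translate(pod_name: str, size: str) -> dict:
    design = POD_SIZES[size]
    return {
        "pod": pod_name,
        "gpu_nodes": [f"{pod_name}-gpu{i:03d}" for i in range(design["gpu_nodes"])],
        "leaves": [f"{pod_name}-leaf{i + 1}" for i in range(design["leaf_switches"])],
    }

# 3. Transformation to drive action: render per-device variables that a tool
#    like Ansible or Nornir could read and push.
def to_host_vars(pod: dict) -> dict:
    return {
        leaf: {"description": f"{pod['pod']} leaf", "downlinks": len(pod["gpu_nodes"])}
        for leaf in pod["leaves"]
    }

pod = translate("pod01", "small")
print(to_host_vars(pod))
```

Even in this toy form, every step is a data problem: the catalog, the inventory naming, and the rendered variables all have to stay consistent with one another.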

You have to automate compute, networking, security, and storage components. You need to manage logical resources like IP addresses and deal with virtual constructs like VXLANs and BGP routing. The combination of physical infrastructure and logical or virtual infrastructure detail means that on-prem automation data management is 4-5x bigger and more complex than what’s required for the cloud.
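
As one small illustration of the logical-resource side, here’s a sketch that carves point-to-point subnets out of an underlay pool and assigns a VXLAN VNI per pod. It uses only the Python standard library, and the supernet and VNI range are made up for the example:

```python
# Sketch: allocating logical resources for a pod (hypothetical pools, stdlib only).
import ipaddress

FABRIC_SUPERNET = ipaddress.ip_network("10.0.0.0/16")  # made-up underlay pool
VNI_BASE = 10000                                       # made-up VNI range start

def allocate(pod_index: int, links: int) -> dict:
    """Hand out /31 point-to-point subnets sequentially, plus one VNI per pod."""
    subnets = list(FABRIC_SUPERNET.subnets(new_prefix=31))
    start = pod_index * links
    return {
        "p2p_links": [str(s) for s in subnets[start:start + links]],
        "vni": VNI_BASE + pod_index,
    }

print(allocate(pod_index=0, links=4))
```

Multiply that by every pod, VRF, and fabric in the facility, and simply keeping the allocations consistent becomes a data management problem in its own right.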

Traditional Data Management Approaches Aren’t Enough

There have been two primary approaches to infrastructure automation: GitOps and traditional infrastructure management. However, both fall short on data management for different reasons.

GitOps offers version control and continuous integration (CI) validation, which provides a healthy lifecycle and quality control for consistent automation. However, GitOps doesn’t offer a structured data model that can handle the volume of data needed for on-premises infrastructure.
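
In practice, the GitOps approach to that data often looks like the sketch below: infrastructure data kept as YAML files in a repo, with a hand-rolled CI check enforcing whatever structure the team has agreed on. The repo layout and field names here are assumptions for the example, and PyYAML is assumed to be installed. The version control and validation are real; the data model is only whatever the scripts happen to check.

```python
# Sketch of an ad hoc GitOps-style CI data check (assumes PyYAML and a
# hypothetical repo layout with one YAML file per device under devices/).
import sys
from pathlib import Path

import yaml

# Team-defined convention, not enforced by any schema or database.
REQUIRED_FIELDS = {"hostname", "role", "site", "mgmt_ip"}

def main() -> int:
    errors = []
    for path in Path("devices").glob("*.yml"):
        data = yaml.safe_load(path.read_text()) or {}
        missing = REQUIRED_FIELDS - set(data)
        if missing:
            errors.append(f"{path}: missing {sorted(missing)}")
    for error in errors:
        print(error)
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(main())
```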

Traditional infrastructure management tools are built around structured databases that manage infrastructure details, but they don’t include version control or CI. Ironically, they are also constrained by fixed database schemas that prevent organizations from managing automation data in a way that fits their business or any unique factors in their infrastructure.

A Third Way for Infrastructure Data Management

As AI adoption sends organizations back to building on-prem data centers, they need a fresh approach to infrastructure automation: one that handles complex data management in a structured yet flexible way while embracing modern versioning and CI concepts from GitOps.

That’s why Infrahub is designed to serve as an infrastructure data management platform at the center of your automation stack. Infrahub combines the version control and branch management of Git with the flexible data model and UI of a graph database. Infrahub offers a user-defined schema plus a set of sophisticated data management capabilities, so you can design services, express your infrastructure, translate high-level designs onto that infrastructure, and then transform intent data into rendered, deployable data. Version control, peer review, and continuous integration are native and integral rather than add-ons.

We’re seeing interest from customers who are working on AI data center projects because Infrahub addresses the core data management challenge that will make or break how they build and sustain infrastructure automation for these expensive data center investments.

Learn more

Want to learn more? Check out our blog, documentation, labs, and sandbox.

Eager to try it out hands-on? Visit our GitHub and join our Discord community.

Ready to get your organization moving with Infrahub? Request a demo.

Alex Henthorn-Iwane

April 16, 2025
