Automating AI Data Centers Highlights Data Management Challenges

AI is playing a growing role in the future prospects of many industries and businesses. As a result, companies are making big bets on data centers stuffed with GPUs. These costly investments need to drive outsized ROI, and there’s no way to ensure that without also investing in automation. But as enterprises reinvest in on-prem environments, they’re discovering that infrastructure automation in traditional data centers is not only critical, it’s also rather difficult.

The devil is in the details

Much of infrastructure automation’s recent development has been driven by cloud adoption. Cloud services are built for automation, making it easy to deploy and update configurations for individual cloud service offerings. This is because cloud providers abstract the underlying infrastructure and present APIs that are automation-ready. As a result, infrastructure engineers and IT teams don’t have to worry about details like hardware ports; everything is expressed as a service. In addition, technologies like Terraform support more advanced cloud infrastructure automation.

But on-prem infrastructure automation is quite a different story. Whereas the cloud abstracts all the complexity of the lower technical layer for customers, on-prem infrastructure is not natively abstracted. You need to deal with a lot more underlying infrastructure detail and you need to ultimately have a way for all that infrastructure to present itself as a catalog of consumable, rapidly deployable services.

When I refer above to lots of infrastructure details, remember that AI data centers are often gigantic. We’re talking about facilities the size of football fields that can cost hundreds of millions of dollars to build, equip, and operate. There are countless endpoints: a typical environment might have thousands of networking devices, tens of thousands of servers, and hundreds of thousands of cables linking everything together. So the level of detail and the size of the challenge to automate it all are seriously non-trivial.

It takes multiple dedicated engineers simply to understand how to abstract and automate all these systems. It can take even more manpower to execute and oversee those automation efforts.

A data management challenge, times a million

When you’re dealing with on-premises infrastructure at this level, the core problem of automation is data management. Of course, data is essential to any automation. If you’re using a standalone Python script and you feed it the wrong data, you won’t get the right result. Automating an AI data center takes that problem and multiplies it a million times.

Since all the infrastructure sitting in your AI data center isn’t abstracted, in order to turn it all into services, a huge proportion of the work of automation has to do with dealing with complex sets of data:

  • Business logic: You need to capture the business logic for the services you want to deliver. For example, in an AI data center, services might be defined as small, medium, and large infrastructure “pods” that consist of GPUs, CPUs, and storage interconnected with networking infrastructure. The design of these pods must include the logic of how all these components interconnect and how they present together as a consumable service.
  • Service translation: That high-level business logic or service design must be translated into the specific infrastructure in the data center. Keep in mind that this is easiest when everything is brand new and in its most homogeneous state. But that doesn’t last long. Over time, translating designs into the actual infrastructure will need to deal with changing components, new vendors, new operating systems, and so on.
  • Transformation to drive action: To actually deploy a pod, that translated design must then be transformed into specific configuration data which can be read and pushed by a tool like Ansible or Nornir.
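To make the final transformation step concrete, here is a minimal sketch in Python using Jinja2. The pod structure, field names, and template are hypothetical examples, not Infrahub’s actual pipeline; the idea is simply that translated intent data becomes per-device configuration text that a tool like Ansible or Nornir could then push.

```python
# Illustrative only: transform translated pod intent data into deployable
# device configurations. Data structure and template are hypothetical.
from jinja2 import Template

# Translated service design for one "small" pod (hypothetical structure)
pod_intent = {
    "name": "pod-small-01",
    "leaf_switches": [
        {"hostname": "leaf1", "mgmt_ip": "10.0.0.11", "asn": 65001},
        {"hostname": "leaf2", "mgmt_ip": "10.0.0.12", "asn": 65002},
    ],
}

# A minimal per-device template; real templates cover interfaces, BGP
# neighbors, VLANs, and much more.
CONFIG_TEMPLATE = Template(
    "hostname {{ device.hostname }}\n"
    "interface Management1\n"
    "  ip address {{ device.mgmt_ip }}/24\n"
    "router bgp {{ device.asn }}\n"
)

# Render deployable configuration data for every device in the pod
rendered = {
    device["hostname"]: CONFIG_TEMPLATE.render(device=device)
    for device in pod_intent["leaf_switches"]
}

print(rendered["leaf1"])
```

The rendered strings are exactly the kind of configuration data an Ansible playbook or Nornir task would read and push to devices.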

You have to automate computing, networking, security, and storage components. You need to manage logical resources like IP addresses and deal with virtualized infrastructure components like VXLANs and BGP routing. The combination of hard infrastructure and logical or virtual infrastructure details means that the size and complexity of on-prem automation data management is 4-5x bigger than what’s required for the cloud.

Traditional Data Management Approaches Aren’t Enough

There have been two primary approaches to infrastructure automation: GitOps and traditional infrastructure management. However, both fall short on data management for different reasons.

GitOps offers version control and continuous integration (CI) validation, which provides a healthy lifecycle and quality control for consistent automation. However, GitOps doesn’t offer a structured data model that can handle the volumes of data needed for on-premises infrastructure.

Traditional infrastructure management tools are built around structured databases that manage infrastructure details, but they don’t include version control or CI. Ironically, they are also constrained by fixed database schemas that prevent organizations from managing automation data in a way that fits their business or any unique factors in their infrastructure.

A Third Way for Infrastructure Data Management

As AI adoption sends organizations back to building on-prem data centers, they need a fresh approach to infrastructure automation—one that provides a structured yet flexible approach to complex data management needs while embracing modern versioning and CI concepts from GitOps.

That’s why Infrahub is designed to serve as an infrastructure data management platform at the center of your automation stack. Infrahub combines the version control and branch management of Git with the flexible data model and UI of a graph database. Infrahub offers a user-defined schema, plus a number of sophisticated and powerful data management capabilities, so you can design services, express your infrastructure, translate high-level designs to that infrastructure, then transform intent data into rendered, deployable data. Version control, peer review, and continuous integration are native and integral rather than add-ons.

We’re seeing interest from customers who are working on AI data center projects because Infrahub addresses the core data management challenge that will make or break how they build and sustain infrastructure automation for these expensive data center investments.

Learn more

Want to learn more? Check out our blog, documentation, labs, and sandbox.

Eager to try it out hands-on? Visit our GitHub and join our Discord community.

Ready to get your organization moving with Infrahub? Request a demo.

Webinar Round Up: Using GraphQL, Transformations, and Jinja2 Templates to drive Arista Network Automation

In this recorded webinar, Alex Gittings, a member of the customer success team at OpsMill, takes you through how to use the Infrahub platform to generate and deliver configurations to devices to create a functional network. The webinar follows a tutorial (that you can sign up for here), which helps you understand how to develop against Infrahub. Alex also looks at different tooling around Infrahub and how it makes network engineers’ lives better.

Alex uses a network setup involving Arista switches connected via OSPF.

Topics covered include:

  • An overview of Infrahub
  • How to extract data from Infrahub using GraphQL
  • Creating network artifacts using transformations
  • Using Jinja2 templates to render configurations
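As a taste of the GraphQL extraction step covered in the webinar, here is a small Python sketch that posts a query to an Infrahub instance. The endpoint URL, the `InfraDevice` node kind, and the queried fields are assumptions for illustration; adapt the query to your own schema.

```python
# Illustrative sketch: pull device data out of Infrahub via its GraphQL API.
# Endpoint and query shape are assumptions for this example.
import json
import urllib.request

INFRAHUB_URL = "http://localhost:8000/graphql"  # assumed local instance

QUERY = """
query {
  InfraDevice {
    edges {
      node {
        name { value }
        role { value }
      }
    }
  }
}
"""

def fetch_devices(url: str = INFRAHUB_URL) -> list[dict]:
    """POST the GraphQL query and return the list of device nodes."""
    payload = json.dumps({"query": QUERY}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [edge["node"] for edge in data["data"]["InfraDevice"]["edges"]]

if __name__ == "__main__":
    for device in fetch_devices():
        print(device["name"]["value"], device["role"]["value"])
```

The returned structured data is what then feeds transformations and Jinja2 templates downstream.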

The webinar recording also includes an extensive Q&A session.

From Scripting to Strategic Automation: A Candid Q&A with Wim Van Deun

We had a chance to catch up with Solution Architect Wim Van Deun about his journey in network automation. In this conversation, Wim opens up about his early “ad hoc” scripting days, how he stumbled into data challenges, and why a flexible, version-controlled source of truth—like our platform Infrahub—would have been a game-changer in his past roles. Plus, check out a bonus video where Wim shares his first experience leading a strategic network automation initiative at a major enterprise.

Question: You’ve mentioned before that you started with automation long before it had that label. Can you walk us through your early steps?

Wim: Early on, I was the go-to person whenever there was a networking puzzle no one else wanted to deal with. I’d write Perl scripts—really rough stuff—to log in to devices, check for weird metrics, and figure out if we’d hit specific issues. Honestly, I didn’t think of it as “automation”; I just knew scripting was faster than doing everything by hand. For example, we once had a Cisco firewall module that didn’t expose certain metrics. My script would hop in via SSH, run a command, and compare the output. If it saw a specific counter jump, we knew something was up. These mini-projects solved problems our monitoring tools couldn’t handle. That’s how I ended up learning that scripts could really streamline day-to-day tasks.

Question: So how did you go from those quick Perl scripts to more serious automation?

Wim: The real turning point was at a large organization in the healthcare space. They had a huge network, and my boss recognized we were burning too many hours and dollars doing everything manually. Since I was already known as the “script guy,” I got the nod to build a more formal automation strategy. That’s when I realized just how chaotic our data was. We had no centralized inventory, no clear ownership, and every device move or team change seemed to introduce new glitches. I learned pretty quickly that if you want to automate at scale, you have to tackle data management first.

Question: You’ve told me before that data management was your biggest hurdle. Can you explain why?

Wim: Picture an environment with thousands of devices, all at different sites, each with its own specs. If you don’t know exactly which devices exist or how they connect, you’re in for a world of frustration. We tried using spreadsheets, but that quickly became a nightmare—multiple versions, random naming conventions, missing fields. Eventually, we adopted tools like NetBox and, later, Nautobot, and that helped a bit. But we kept running into restrictions with how the data was modeled. We’d have to create plugins or store critical info in random “tags,” which felt like a hack. Plus, if we wanted a new relationship—like connecting a device to a custom design spec—that usually required more tinkering than it should.

Question: Let’s talk about Infrahub. In your view, what’s the big difference between that and other source-of-truth platforms?

Wim: Two things stand out: schema flexibility and built-in version control. Take the schema part. With some tools, you’re forced to fit your network data into a model that might not match your reality. Maybe you’ve got specific relationships—like “these branches share a unique WAN design” or “this device needs these custom attributes”—but the tool doesn’t support it, so you do a ton of workarounds. Infrahub flips that and says, “Model your data however you want.” That means I can adapt the schema to my actual environment, instead of bending my environment to fit a cookie-cutter approach.

And version control? That’s huge. We often needed to plan a future state for a network refresh while keeping production data intact. Doing that with older platforms meant awkward duplication or separate staging instances that never stayed in sync. Infrahub’s branching system lets me spin up a separate version, test it, and merge it back when everything’s good. No guesswork. No accidental overwrites.

Question: Sounds like version control is especially helpful for big projects, like hardware rollouts. Did you face that sort of challenge often?

Wim: Absolutely. Enterprise networks can’t just stop for a day while you swap out old hardware for new. You might have 200 sites that need a refreshed design, but you still have to maintain daily operations. With a branching approach, you can create a “future” environment of your site, plan out all your changes (like new VLANs, IP addresses, and device details), then merge it in one go. It’s essentially the same workflow developers use for code, only now you’re doing it with network configurations and inventory data. Before, we’d basically keep a second spreadsheet and pray no one messed with the original. Infrahub’s approach is drastically more reliable.
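The branch-and-merge workflow Wim describes can be illustrated with a toy model. This is not Infrahub’s implementation—just an in-memory dictionary standing in for the inventory—to show why branching beats keeping a second spreadsheet: the future state is isolated until it merges.

```python
# Toy model of a branch-and-merge data workflow (not Infrahub internals):
# branch the inventory, stage future-state changes, merge back when validated.
import copy

production = {
    "site-nyc": {"vlans": [10, 20], "wan_design": "legacy"},
}

branches: dict[str, dict] = {}

def create_branch(name: str) -> dict:
    """Branch off the current production data (deep copy, like a Git branch)."""
    branches[name] = copy.deepcopy(production)
    return branches[name]

def merge_branch(name: str) -> None:
    """Merge a validated branch back into production in one step."""
    production.update(branches.pop(name))

# Plan a site refresh on a branch while production stays untouched
future = create_branch("site-refresh")
future["site-nyc"]["vlans"].append(30)
future["site-nyc"]["wan_design"] = "sdwan"

assert production["site-nyc"]["wan_design"] == "legacy"  # prod untouched
merge_branch("site-refresh")
assert production["site-nyc"]["wan_design"] == "sdwan"   # merged in one go
```

The same pattern, applied to real inventory and configuration data with peer review and CI checks in between, is what the branching workflow provides at scale.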

Question: Any final words of wisdom for folks trying to decide if they should invest in better automation and a centralized data platform?

Wim: The best advice I can give is: don’t overlook the data problem. Automation is only as good as the data that drives it. If you’re dealing with incomplete, out-of-date, or scattered info, you’ll spend more time fixing scripts than enjoying the benefits. Before you build fancy workflows, nail down a solid source of truth. And look for a platform with flexibility in how you store information and the ability to version-control big changes. That’s a massive boost to productivity and reliability.

That’s a wrap on this Q&A. Wim’s experiences paint a clear picture: network automation works best when you have a solid handle on your data and the right platform to accommodate changes. Infrahub provides that flexibility and branching capability.

Please Welcome Infrahub Enterprise!

Today, we’re announcing the enterprise version of our Infrahub platform. Infrahub Enterprise delivers SLA-backed support, advanced integrations, and enhanced performance and high-availability to enable customers to gain the benefits of mature infrastructure automation with greater velocity and quality assurance. With Infrahub Enterprise, IT and networking teams can easily turn any infrastructure into consumable, automated services that deliver greater business agility, efficiency, and security.

Infrahub Enterprise comes hot on the heels of the 1.0 release of Infrahub Open Source. An open-source approach has always been the bedrock of OpsMill’s technology. But Infrahub was designed from the beginning to solve enterprise-class infrastructure automation challenges.

Furthermore, most organizations require comprehensive support and other advanced features to enable them to deploy innovative software at scale and effectively change the way they’re managing hybrid infrastructure. Infrahub Enterprise is here to meet the needs of those customers, while maintaining our commitment to transparent open source software.

Our ambition with Infrahub is to go far beyond the limited scope, constrained data models, and inflexibility of the infrastructure automation products that have been the mainstay of the landscape until now. These solutions have a role to play, and the independent and open-source products in particular have served many well in getting started with infrastructure automation. However, due to their constraints, they hit a ceiling when businesses want to journey past lower-level automation and build a full lifecycle to turn all their infrastructure into a consumable service catalog.

We had the chance to share our latest updates including Infrahub Enterprise with some industry experts. Jim Frey, principal analyst of networking at Enterprise Strategy Group, had this to say: “Enterprises have faced long-standing challenges when trying to automate their infrastructure due to the long tail of legacy technology and the spread of systems across multiple clouds, private cloud data centers, and on-premises office, manufacturing, and retail infrastructure. OpsMill has created a DevOps approach to infrastructure that seeks to solve root causes of automation blockers. Addressing a full cycle of automation from design through deployment is ambitious, but ambitious solutions are needed to bring the state of the art forward.”

Enterprises Want to Turn All Their Infrastructure into Services–Not Just the Cloud

Cloud is where most automation has succeeded so far. Yet the fact is that enterprises are still investing significantly in on-premises and hybrid infrastructure and networking to support critical applications. And don’t forget that there is a lot of IT infrastructure that lives in offices, retail, warehouse, and manufacturing locations. What enterprise IT leaders want from automation is to make their own infrastructure function as consumable services, like the cloud. That may sound like an overplayed note, but it’s true nonetheless.

While the movement toward cloud has enabled repeatable and scalable infrastructure automation via Terraform and GitOps, hybrid infrastructure – especially on-premises infrastructure that includes networking assets – has been left behind. The fact of the matter is that most infrastructure teams don’t use Git, and Git isn’t a great fit for many types of on-premises infrastructure, especially when networking is involved, because it has no structured data management capabilities.

So far, the state of the art in non-cloud infrastructure automation for most organizations has been achieving some structured way of automating configuration deploys, graduating from ad hoc script usage. But that’s pretty low-level. Getting to service-driven automation takes a lot more, and that’s where so much of the problem starts.

Fragmented Data and Processes

We’ve spoken with enterprises that can identify more than two dozen sources of truth for infrastructure data in their IT domain. That’s a great illustration of data fragmentation. What are all these sources of truth? Well, they include CMDBs, network sources of truth, multiple vendor-specific configuration management databases, home-grown databases, and more. But that’s just the infrastructure data.

Before you ever deploy a service, you have to define it. Where does service definition data live? Mostly in static documents–PDFs and Visio diagrams. Last I checked, these are not terribly automation-friendly formats. Joking aside, these design documents generally get written once, then archived, never to see anything close to a “lifecycle” again. That’s a big problem.

None of the above data sets live inside a continuous integration lifecycle. But that’s not all that belongs there. Typically, when configuration-level data gets rendered, it is used and then thrown away. That’s not right: deployable config data needs to be persistent, versioned, validated, and reusable.

Design data that never enters the automation lifecycle, diverse and fragmented sources of truth, and rendered config data that lives outside any CI process: these gaps are massive barriers to the ambition of turning infrastructure into services. That is exactly the problem that Infrahub is built to solve.

Infrahub Enterprise unifies fragmented data with complete flexibility in its user-defined data model. You can synchronize Infrahub’s dataset with any external data source via our APIs. There’s no predefined constraint on what services you can envision and design, or what type of infrastructure data you can manage. The data model can accommodate service designs, plus all infrastructure data from all the sources of truth, without needing to be canonical. As a platform, Infrahub brings all of that together, along with the translation logic between service models and infrastructure and intent data, to generate reliable configurations that also live in the platform. Everything is part of a CI lifecycle, which means that everything is versioned from design through to deployment.
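To make “user-defined data model” concrete, here is a sketch of what a custom schema node might look like in Infrahub’s YAML schema format. The node, attribute, and relationship names are illustrative, not a prescribed model; consult the schema documentation for the exact syntax.

```yaml
# Illustrative only: a user-defined "pod" service node modeled the way
# your business thinks about it, linked to the infrastructure backing it.
version: "1.0"
nodes:
  - name: Pod                # a service object, not just a device record
    namespace: Service
    attributes:
      - name: name
        kind: Text
        unique: true
      - name: size           # e.g. small / medium / large service tiers
        kind: Text
    relationships:
      - name: devices        # link the service to its infrastructure
        peer: InfraDevice
        cardinality: many
```

Because the schema is yours, service designs and infrastructure data can live side by side instead of being forced into a fixed vendor model.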

With Infrahub, you can turn any infrastructure into services.

A Little More About Infrahub Enterprise

What sets Infrahub Enterprise apart? As of its first iteration, it includes:

  • Advanced role-based access controls (RBAC) integrations
  • High-performance enhancements
  • Production-grade Neo4j Enterprise database with high availability
  • SLA-backed support
  • Hardened production-ready releases

But remember that Infrahub Enterprise is always built on the foundation of transparent open-source software.

What Can You Do with Infrahub and Infrahub Enterprise?

By delivering hybrid infrastructure automation as a consumable service, Infrahub Enterprise supports diverse use cases across a variety of industries. Here are a few examples:

  • Hybrid and Multi-Cloud Platform Engineering: More enterprises are building or repatriating infrastructure to on-premises data centers due to cost, data sensitivity, and regulatory requirements. Furthermore, many applications function on a hybrid cloud basis, with heavy dependencies on both on-premises and cloud-based infrastructure. With Infrahub Enterprise, platform engineering teams can now create infrastructure automation self-service catalogs that allow developers and operational teams to spin up end-to-end infrastructure that includes cloud resources, cloud interconnect, and on-premises networking, security policies, and servers.
  • AI GPU Data Centers: Financial institutions and other enterprises are building large-scale GPU data centers to develop their own AI and ML models. Traditional network inventory software isn’t flexible enough to model InfiniBand and AI-specific infrastructure. Infrahub Enterprise uniquely possesses the service creation, data management, versioning, and CI capabilities to automate such large-scale infrastructures as a set of service offerings, allowing enterprise IT leaders to maximize return on investment in AI data center infrastructure.
  • Retail Location Infrastructure-as-a-Service: Retail stores’ IT infrastructure needs have grown steadily, increasing the complexity of deployment. When businesses expand and build new stores, or revise infrastructure to meet the demands of digital transformation initiatives, the only way to keep pace with the business is to apply repeatable automation processes. Using Infrahub Enterprise, retailers can create retail infrastructure service automation catalogs to roll out or upgrade stores, even accommodating different store sizes, regions, and other variables.

Learn More

Infrahub Enterprise and Infrahub 1.0 are available now. Book a demo to learn more about how Infrahub can help you up-level your infrastructure automation and make a big positive impact on your business.

Infrahub and Netpicker Integration

When we built Infrahub, we envisioned that it would provide a core set of integrated infrastructure automation capabilities, including a source of truth (SoT) built around a user-defined, extensible schema with versioning and integrated continuous integration, among other things.

Yet we were equally committed to the idea that much of the way that Infrahub would take action would be through properly abstracted integrations. For example, Infrahub has Ansible and Nornir integrations for configuration pushes.

And that’s why we’re very pleased to share that we have completed an integration between Infrahub and Netpicker to enable production validation in network CI/CD.

The Holy Grail of Network Automation Testing

What would an ideal network continuous integration (CI) validation process look like? First, you would be able to easily version the intended state of the network with proposed changes, and perform internal consistency checks and production pre-checks. Infrahub has you covered there.

But then, once you push changes to the network, you would be able to validate that the production state matches the intended state in a Continuous Deployment (CD) process. Infrahub doesn’t do this on its own. In fact, this picture of network CI/CD has long been out of reach for most network teams.

But no longer. With the integration of Infrahub and Netpicker, network teams can now run an end-to-end branching and CI/CD process for their network services and device configurations.
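Conceptually, the post-push validation step boils down to diffing intended state against collected production state. Here is a minimal Python sketch of that idea (the configs are made-up examples, and this stands in for what the Infrahub + Netpicker integration does with real device data):

```python
# Illustrative post-push validation: compare the intended configuration
# rendered from the source of truth against the state collected from a
# device, and fail CI if they diverge. Config strings are made up.
import difflib

intended = "hostname leaf1\nrouter bgp 65001\n"
collected = "hostname leaf1\nrouter bgp 65099\n"  # drifted in production

def validate(intended_cfg: str, collected_cfg: str) -> list[str]:
    """Return a unified diff; an empty list means production matches intent."""
    return list(
        difflib.unified_diff(
            intended_cfg.splitlines(),
            collected_cfg.splitlines(),
            fromfile="intended",
            tofile="production",
            lineterm="",
        )
    )

drift = validate(intended, collected)
if drift:
    print("\n".join(drift))  # surface the drift in the CI job output
```

In a real pipeline, the intended side comes from versioned data in the source of truth and the production side from live device collection, so drift is caught automatically after every push.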

Learn more in our joint webinar

If you’d like to learn more about how this works, check out our joint webinar with Netpicker. During the webinar, you’ll:

  • Learn about historical challenges that prevented networking CI/CD
  • Gain an understanding of Infrahub and Netpicker and how they enable CI/CD including pre-push and post-push production validation
  • See a live demo of Infrahub and Netpicker in action

You can register/watch here.

