Every new customer conversation follows a familiar arc. It starts with the day-to-day reality of running infrastructure at scale, and ends with a very reasonable question: Can you actually do this in production?
This post walks through that arc, from the operational pain that brings teams to Infrahub, to the architectural and production-readiness questions that follow.
The operational reality
The teams we talk to aren’t browsing for new tools out of curiosity. They’re dealing with real, compounding pressure.
Network inventory is scattered across spreadsheets, legacy CMDBs, Slack threads, and tribal knowledge. Nobody fully trusts the data. Automation exists, but it’s brittle. Scripts break when the underlying data drifts, and engineers spend more time fixing automation than using it. Every change carries production risk, and the review processes meant to catch problems are informal, inconsistent, or missing.
Meanwhile, the business expects network resources on demand: VLANs, IPs, firewall rules, new services. Ticket queues are no longer acceptable. Device counts and service complexity keep growing, but team sizes stay flat. The gap between what teams manage and how they manage it widens every quarter.
This isn’t an automation problem. It’s a data problem. CMDBs weren’t built for automation. Git repos with YAML files can’t model real infrastructure relationships. Custom in-house builds fit perfectly…until the author leaves and the whole thing becomes unmaintainable.
Teams need a platform that combines a trusted data foundation, DevOps-grade change workflows, and the flexibility to model infrastructure the way it actually works. That’s what Infrahub is.
What Infrahub changes
Infrahub is a unified data management platform with built-in version control, purpose-built for infrastructure automation. It gives teams:

- A trusted, schema-enforced source of truth with graph-native relationships
- Branch-based proposed changes with peer review, CI validation, and full audit trails
- Generators that turn trusted data into configs, cable plans, and deployment artefacts
- A GraphQL API and integration layer that connects to any orchestration or deployment tool in your stack
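To make the API concrete, here is a minimal sketch of querying Infrahub over its GraphQL endpoint. The node kind (`InfraDevice`), the fields selected, and the token header are illustrative assumptions; the kinds available depend on the schema you have loaded.

```python
# Minimal sketch: query Infrahub's GraphQL API over HTTP.
# InfraDevice, the fields, and the auth header are illustrative assumptions.
import requests

INFRAHUB_URL = "http://localhost:8000"   # assumed local instance
API_TOKEN = "replace-with-a-real-token"  # see "managing API tokens"

# Attribute values live under { value }; select only what you need.
QUERY = """
query {
  InfraDevice {
    edges {
      node {
        name { value }
        role { value }
      }
    }
  }
}
"""

response = requests.post(
    f"{INFRAHUB_URL}/graphql",          # default-branch endpoint
    json={"query": QUERY},
    headers={"X-INFRAHUB-KEY": API_TOKEN},
    timeout=30,
)
response.raise_for_status()

for edge in response.json()["data"]["InfraDevice"]["edges"]:
    print(edge["node"]["name"]["value"])
```

The same query can be run against a branch rather than the default data, which is how branch-based proposed changes are validated before merge.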
The result is a shift in operating model: from fragmented data and manual process to trusted, validated, repeatable delivery. One European service provider reduced service deployment time from 120 hours to under 15 minutes after moving to schema-driven automation with Infrahub. Another enterprise expanded from one to five data centres, managing multi-vendor deployments through a single trusted data platform.
But once teams see the vision and start building in the lab, the conversation naturally shifts. The next questions are about production.
“Can it support our infrastructure?”
This is usually the first production question, and rightly so. If you’re managing thousands of devices across multiple regions, you need to know the platform can handle the dataset and query patterns your automation depends on.
Infrahub is a single platform with a common core. Community edition is a good fit for teams getting started, running labs, or operating at moderate scale. As deployments grow, particularly when organisations need clustering for high availability, horizontal scaling for concurrent pipeline reads, or workflow-based approval processes for change governance, Enterprise extends that same core with the features and support needed at larger scale. The Community vs. Enterprise comparison provides a detailed breakdown.
Our sizing tiers define resource allocations across deployment profiles, from moderate environments up to tens of thousands of device-class objects. Production deployments today manage datasets well into the hundreds of thousands of managed objects on a single instance, so most organisations approaching Infrahub are well within proven capacity.
“How do you architect for scale?”
Our default recommendation is a single Infrahub instance with full high availability: database clustering with automatic leader election, cache failover, and mirrored message queues for bus continuity. A single instance keeps operations simple: one place to manage schema, one resource allocation authority for IPs, prefixes, and identifiers, and one target for monitoring and upgrades. As noted above, production deployments already manage hundreds of thousands of objects on a single instance, so for most organisations this is more than sufficient. Recommended resource baselines for each sizing tier are detailed in the deployment sizing documentation.
That said, some organisations have operational or regulatory reasons to run regionally independent infrastructure. For those cases, Infrahub fully supports a multi-instance deployment model — one instance per region, each with its own HA topology, sharing no state between them. We have customers running this pattern in production today, including one with a formal multi-instance HA deployment managed via Helm.
The multi-instance model adds a second layer of resilience: regional failure isolation. A problem in one region’s Infrahub — whether a maintenance operation, a failed merge, or a service restart — has zero impact on the others. The trade-off is that each region manages its own resource allocation independently, which is usually a natural fit for organisations that already have regional operational boundaries.
“What about performance?”
Performance is not a milestone we hit and move on from — it’s a continuous investment. Our customers operate at scales where every second of pipeline execution and every merge operation matters, and those demands drive our engineering roadmap.
A recent example: we moved core branch-merge logic into the database layer, resulting in roughly a 4x improvement in merge times across datasets ranging from tens of thousands to hundreds of thousands of objects. That kind of deep architectural optimisation reflects how we approach performance — not surface-level tuning, but rethinking where work happens to unlock step-change improvements.
On query and Generator performance, we’ve found across deployments that Generator execution, not GraphQL query latency, is typically the build pipeline bottleneck. By working closely with customers to tune how their Generators query Infrahub (selecting only the fields they need instead of eagerly loading relationships, and fetching larger pages), one customer achieved a significant reduction in overall build time, with some individual Generators running up to 10x faster. These are the same query patterns any downstream build system would use.
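As an illustration of that tuning pattern, here is a hedged before/after sketch using a hypothetical `InfraDevice` kind with an `interfaces` relationship; the exact node kinds, relationship names, and pagination arguments depend on your schema and Infrahub version.

```python
# Illustrative contrast between an eager Generator query and a tuned one.
# InfraDevice and interfaces are hypothetical schema names.

# Eager: loads the full interface relationship for every device,
# even if the Generator only needs device names.
EAGER_QUERY = """
query {
  InfraDevice {
    edges {
      node {
        name { value }
        interfaces {
          edges { node { name { value } description { value } } }
        }
      }
    }
  }
}
"""

# Tuned: targeted field selection plus larger pages means smaller
# payloads and fewer round trips per Generator run.
TUNED_QUERY = """
query {
  InfraDevice(limit: 500, offset: 0) {
    edges {
      node {
        name { value }
      }
    }
  }
}
"""
```

The design point is simple: a Generator that asks for exactly the fields it consumes, in as few requests as possible, scales with the dataset rather than with the shape of the graph around it.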
This is ongoing work. As our customers grow and their datasets and concurrency requirements increase, we continue to invest in profiling, benchmarking, and optimising Infrahub at every layer of the stack.
“What about HA, DR, and upgrades?”
These are the questions that separate a proof of concept from a production deployment, and we take them seriously.
- Resilience. A single HA instance provides component-level resilience out of the box: database clustering, cache failover, and message queue mirroring. For organisations running the multi-instance model, regional failure isolation adds another layer: a problem in one region has zero impact on the others. In either case, when data-plane agents consume a compiled desired state rather than querying Infrahub directly, an Infrahub outage means only that the build pipeline cannot retrieve updated intent. Hosts continue enforcing their last-known desired state (a minimal sketch of this pattern follows this list).
- Backup and restore. Database backup is a routine, automated operation that integrates with S3-compatible cloud storage (see the database backup documentation). In large production deployments, compressed backups typically range from single-digit to tens of gigabytes, compressing substantially relative to the raw database size.
- DR targets. Based on production experience, with hourly backups RPO is one hour. RTO for restoring a large instance is around one hour, dominated by restoring the task manager database. In architectures where host agents enforce their last-known desired state, the effective data-plane RTO is zero regardless of Infrahub availability.
- Upgrades. The infrahub upgrade command handles database migrations (see the Upgrade Guide). For production HA environments, we recommend Kubernetes and Helm as the deployment platform (see the installation documentation): it provides the cleanest upgrade path, with the upgrade command running as a Helm job during the release update while Helm manages rolling pod updates. For VM-based HA deployments, we’ve developed formal upgrade procedures that codify the stop/upgrade/restart sequence with safety checks. Organisations running the multi-instance model gain an additional benefit: upgrades naturally follow a canary pattern of upgrading one region, validating, then moving to the next. In all cases, we recommend a full database backup before every upgrade and validation in a staging environment that mirrors production topology.
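Returning to the resilience point above, here is a minimal sketch of the last-known-desired-state pattern. The cache path and the fetch function are illustrative assumptions, not part of Infrahub itself; the point is only that enforcement survives a control-plane outage.

```python
# Sketch of an agent that keeps enforcing its last-known desired state
# when the build pipeline cannot reach Infrahub. Paths and the fetch
# function are hypothetical stand-ins.
import json
import pathlib

STATE_FILE = pathlib.Path("/var/lib/agent/desired-state.json")  # assumed cache

def fetch_desired_state() -> dict:
    """Stand-in for the pipeline call that compiles intent from Infrahub."""
    raise ConnectionError("Infrahub unreachable")  # simulate an outage

def load_desired_state() -> dict:
    try:
        state = fetch_desired_state()
        STATE_FILE.write_text(json.dumps(state))  # refresh cache on success
        return state
    except ConnectionError:
        # Outage: fall back to the cached, last-known desired state.
        # (Bootstrapping the cache on first run is out of scope here.)
        return json.loads(STATE_FILE.read_text())

desired = load_desired_state()
# ... enforcement logic applies `desired` to the host exactly as before ...
```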
Running Infrahub in production: Monitoring, security, and operations
Once the architecture and resilience questions are covered, teams want to know what day-to-day operations look like.
- Monitoring. Infrahub ships with built-in observability. OpenTelemetry distributed tracing is supported natively, with configurable push-based metrics export to any OTLP-compatible backend, including Jaeger, Zipkin, and Datadog. Prometheus integration is also available via a pull-based /metrics endpoint (see the observability configuration guide). From our operational experience across deployments, the critical metrics to watch are API request latency, task manager queue depth, database connection pool usage, and disk utilisation. A short probe sketch follows this list.
- Security. Infrahub supports OAuth2/OIDC SSO (see the SSO guide) and token-based API authentication (see managing API tokens). The RBAC model provides pre-configured roles (Admin, Standard User, Anonymous User), custom role creation, global permission types including super admin, account management, and schema management, and namespace-scoped object-level permissions with configurable decision types (see the permissions reference). Enterprise adds workflow-based approval processes with configurable approval thresholds and enforced approval gates (see the change approval workflow guide). All changes are tracked immutably with actor attribution. A production hardening guide is also available.
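As a quick illustration of the monitoring point above, the sketch below probes the pull-based /metrics endpoint. The address and the metric-name prefixes are assumptions; check your deployment’s actual exposition for the exact metric families.

```python
# Sketch: probe the Prometheus-style /metrics endpoint and surface the
# families that map to the critical signals above. Names are illustrative.
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed instance address

resp = requests.get(METRICS_URL, timeout=10)
resp.raise_for_status()

WATCH_PREFIXES = ("http_request_duration", "db_connection")  # assumed names

for line in resp.text.splitlines():
    if line.startswith(WATCH_PREFIXES):
        print(line)
```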
Growing with Enterprise
Infrahub Community and Enterprise share the same core engine, the same data model, and the same APIs. Many teams start with Community — building their schema, integrating data sources, developing Generators, and validating workflows in a lab or early production environment. When the time comes to scale up, moving to Enterprise is seamless. It’s the same platform. Your schema, your data, your integrations, your Generators — everything carries over with no data loss, no migration tooling, and no re-architecture. You pick up exactly where you left off, with Enterprise capabilities enabled on top.
Enterprise extends Infrahub with clustering for high availability, horizontal scaling for concurrent reads from build pipelines, advanced database tuning for larger datasets, workflow-based approval processes for change governance, and SLA-backed support. For organisations operating at scale with production uptime requirements, these are the capabilities that make the architecture viable. A full breakdown of what each edition includes is available in the Community vs. Enterprise comparison.
OpsMill also offers an optional dedicated Solutions Architecture engagement—covering deployment planning, performance tuning, and operational procedures—that can be added to any Enterprise licence. It’s the fastest path from architecture planning to production confidence.
What comes next
We don’t ask teams to take our word for it. Before production, we’re happy to work with you to run benchmarks using a dataset modelled on your actual schema, so you have concrete numbers for the metrics that matter most to your deployment.
And if it would be helpful, we can facilitate conversations with production customers running at similar scale. There’s no better way to pressure-test production-readiness claims than hearing directly from teams who have already done it.
If any of this resonates, reach out. We’re always happy to engage.