If you're a network engineer stepping into the world of automation, you've probably encountered the term schema more times than you can count.
But what exactly is a data schema, and why should you care?
The short answer: A schema specifies how data is organized and interpreted in databases or data models. Since network automation is only as good as the data it relies on, a solid understanding of schemas and how they model data is essential for automation engineers.
What is a data schema?
A schema defines the structure, format, and constraints of data.
A data schema specifies:
- What types of data exist
- What attributes each type has
- How different pieces of data relate to each other
- What rules the data must follow
Schemas play an incredibly important role in network automation: they ensure data consistency and integrity, and they facilitate communication between different systems by providing a shared understanding of the data's structure.
There's no perfect data schema
Everything in technology brings trade-offs. It's no different with schema types.
You will never find the perfect schema. But if you understand the different schema types, and the strengths and weaknesses of each one, you can find the best schema for your particular use case.
With schemas, the biggest trade-off is always flexibility vs data integrity.

When schemas are loosely defined or unenforced, you get flexibility. It's incredibly easy to add new fields, change data structures, and move fast.
This works beautifully for quick scripts and proof-of-concepts, small projects with one or two contributors, or rapidly evolving data models where you're still figuring out what you actually need.
But as your system scales, flexibility becomes a liability. Without schema enforcement, you end up with inconsistent data.
Some devices have a "site" field, others have "location," and nobody's quite sure which one is correct. You get silent failures when your automation script expects an integer but gets a string. Integration becomes a nightmare because each consumer of your data has to handle all the edge cases differently.
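To make those failure modes concrete, here's a minimal sketch of application-level validation in Python. The field names and rules are illustrative, not from any particular tool:

```python
def validate_device(record: dict) -> list[str]:
    """Return a list of problems found in a device record."""
    errors = []
    # Type check: automation code downstream expects an integer here,
    # so catch the string before it fails silently.
    if not isinstance(record.get("vlan_id"), int):
        errors.append("'vlan_id' must be an integer, got "
                      + type(record.get("vlan_id")).__name__)
    # Naming drift: catch the 'site' vs 'location' inconsistency early.
    if "location" in record and "site" not in record:
        errors.append("use 'site', not 'location'")
    return errors

# A record that would fail silently without these checks:
print(validate_device({"name": "sw-nyc-01", "vlan_id": "100", "location": "NYC"}))
```

Checks like these are exactly what a formal, enforced schema gives you for free.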
When schemas are formally defined and enforced, you trade some flexibility for integrity and consistency. But that integrity is essential if you want data that will support network automation in production.
Where a data schema can be defined
There are three places, or levels, where a schema can be defined and implemented. It's important to understand the differences between the levels because where the schema is defined has a big impact on those trade-offs we just discussed.
- Storage level: Schema is enforced at the database level. You get solid data structure and integrity.
- Application level: Schema is enforced through code at the application layer. You get more flexibility but your data integrity depends entirely on the quality of your code.
- User level: Schema is not formally enforced at all. Users are simply expected to follow conventions documented somewhere (hopefully). You get maximum flexibility but minimum guarantees.
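Here's a small sketch of storage-level enforcement, using Python's built-in SQLite. The table and constraints are illustrative:

```python
import sqlite3

# Storage-level enforcement: the database itself rejects invalid rows,
# no matter which application wrote them.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE devices (
        name    TEXT NOT NULL UNIQUE,
        vlan_id INTEGER CHECK (vlan_id BETWEEN 1 AND 4094)
    )
""")
conn.execute("INSERT INTO devices VALUES ('sw-nyc-01', 100)")  # accepted
try:
    conn.execute("INSERT INTO devices VALUES ('sw-nyc-02', 9999)")  # out of range
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

With the same rules at the application level, every program touching the data would have to reimplement (and remember) those checks.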

There's always a data schema
Here's something important to understand: Everything has a schema, whether you realize it or not.
You might have heard terms like "schema-less" databases or "schema-free" configuration management. These are misnomers.
Even the loosest possible schema in the world—no consistency, no defined types, any kind of data in any field—is still a schema. If data can be stored and interpreted at all, some structure is being assumed.

Systems marketed as "schema-less" simply mean the schema isn't enforced at the database level. Instead, the schema lives implicitly in the code (at the application level) or in the minds of engineers (at the user level).
The 3 components of every data schema
No matter the type, every data schema has three fundamental components: structure, relationships, and constraints.
1. Structure
Structure defines what types of things exist in your data model and what attributes they have. Think about modeling a network device. You might have a name (string), a model (string), a serial number (string), a management IP address, an install date, and a flag indicating whether it's active (boolean).
Each attribute has a type, and types matter because they determine what operations are valid. You can do math on integers but not on strings. You can validate IP address formats but not arbitrary text.
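As a sketch, here's that device structure expressed with one Python type per attribute (the names and defaults are illustrative):

```python
from dataclasses import dataclass
from datetime import date
from ipaddress import IPv4Address

# Each attribute gets an explicit type, which determines what
# operations are valid on it.
@dataclass
class Device:
    name: str
    model: str
    serial_number: str
    management_ip: IPv4Address  # a real IP type, not arbitrary text
    install_date: date
    active: bool = True

dev = Device("sw-nyc-01", "Cat9300", "FCW1234A01B",
             IPv4Address("10.0.0.1"), date(2024, 1, 15))
print(dev.active)
```

Using `IPv4Address` instead of a plain string means a malformed address fails at construction time rather than deep inside an automation run.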
2. Relationships
Relationships define how different objects relate to each other, and this is where schemas really start to reflect how we think about infrastructure.
The most common relationship pattern is a one-to-many relationship. One device has many interfaces. One site contains many devices. These hierarchical relationships are everywhere in network modeling.
You'll also encounter many-to-many relationships. A device can have multiple tags, and each tag can apply to multiple devices. A prefix can be allocated to multiple sites in a multi-site deployment. These are trickier to model but crucial for real-world scenarios.
One-to-one relationships are less common but do exist. A device might have one primary management interface that you model separately from the rest.

The type of relationship matters for network automation. When you query "Show me all interfaces on devices in the NYC datacenter," you're traversing relationships. You find the site object for NYC, then find all device objects related to that site, then find all interface objects related to those devices. Your schema defines these relationships, and your database uses them to answer queries efficiently.
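That traversal can be sketched with a toy in-memory model of the one-to-many relationships above (the data is illustrative):

```python
# One site has many devices; one device has many interfaces.
devices = {
    "sw1": {"site": "nyc", "interfaces": ["eth0", "eth1"]},
    "sw2": {"site": "nyc", "interfaces": ["eth0"]},
    "sw3": {"site": "sfo", "interfaces": ["eth0"]},
}

# "Show me all interfaces on devices in the NYC datacenter":
# traverse site -> devices -> interfaces.
nyc_interfaces = [
    (name, iface)
    for name, dev in devices.items()
    if dev["site"] == "nyc"
    for iface in dev["interfaces"]
]
print(sorted(nyc_interfaces))
```

A real database does the same traversal, but uses the schema's relationship definitions (and indexes) to do it efficiently at scale.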
3. Constraints and validation
Constraints are the rules that keep your data clean and consistent. They prevent impossible or invalid states.
Think about the constraints that make sense for network data. Examples:
- Required fields: Every device must have a name.
- Uniqueness constraints: No two devices can have the same name, and no two interfaces on the same device can share a name.
- Format validation: IP addresses must be valid IPv4 or IPv6 addresses, and MAC addresses must match the standard format.
- Range constraints: Interface speed must be a positive integer, and VLAN IDs must be between 1 and 4094.
Then there's referential integrity, which is all about maintaining consistency across relationships. An interface can't be assigned to a device that doesn't exist. When you delete a device, what happens to its interfaces? Do they get deleted too, or does the system prevent you from deleting a device that still has interfaces attached?
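The delete question can be answered in the schema itself. Here's a sketch using Python's built-in SQLite, with an illustrative two-table layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.executescript("""
    CREATE TABLE devices (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE interfaces (
        id        INTEGER PRIMARY KEY,
        device_id INTEGER NOT NULL
                  REFERENCES devices(id) ON DELETE CASCADE,
        name      TEXT NOT NULL,
        UNIQUE (device_id, name)  -- no duplicate interface names per device
    );
    INSERT INTO devices VALUES (1, 'sw-nyc-01');
    INSERT INTO interfaces VALUES (1, 1, 'eth0'), (2, 1, 'eth1');
""")

# An interface can't point at a device that doesn't exist:
try:
    conn.execute("INSERT INTO interfaces VALUES (3, 99, 'eth0')")
except sqlite3.IntegrityError:
    print("orphan interface rejected")

# With ON DELETE CASCADE, deleting the device deletes its interfaces too:
conn.execute("DELETE FROM devices WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM interfaces").fetchone()[0])
```

Swapping `ON DELETE CASCADE` for `ON DELETE RESTRICT` gives the other behavior: the database refuses to delete a device that still has interfaces attached.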
Without constraints, your data becomes a mess, and it can't be relied on to accurately drive automation.
Data schema languages
Now that you understand what schemas are and what they do, let's look at the different languages and tools used to define them. Each has its strengths and weaknesses.
SQL: The original and still dominant
SQL has been around since the 1970s and is the language of relational databases. It's a language for both defining data schemas and querying data.
Here's what a simple SQL schema looks like:
CREATE TABLE devices (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    model VARCHAR(100),
    serial_number VARCHAR(100),
    site_id INTEGER REFERENCES sites(id)
);
The strengths of SQL are hard to beat. It's mature, well-understood, and universally supported. The schema is mandatory and enforced at the database level, so you can't accidentally violate it. The query capabilities are powerful, and you get strong consistency with ACID transaction guarantees.
SQL shines in traditional applications that require strict data integrity, transactional workloads where consistency is critical, and systems where the data model is relatively stable over time.
JSON Schema: Modern and flexible
JSON Schema came along in 2010 as a way to define the structure of JSON documents. It's widely used for API validation, configuration files, and NoSQL databases.
Here's a sample JSON document, followed by the JSON Schema that validates it:
{
  "title": "New Blog Post",
  "content": "content of the blog...",
  "publishedDate": "2023-08-25T15:00:00Z",
  "author": {
    "username": "authoruser",
    "email": "[email protected]"
  },
  "tags": ["Technology", "Programming"]
}
{
  "$id": "https://example.com/blog-post.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "A representation of a blog post",
  "type": "object",
  "required": ["title", "content", "author"],
  "properties": {
    "title": {
      "type": "string"
    },
    "content": {
      "type": "string"
    },
    "publishedDate": {
      "type": "string",
      "format": "date-time"
    },
    "author": {
      "$ref": "https://example.com/user-profile.schema.json"
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  }
}
What makes JSON Schema appealing is that it's human-readable and easy to understand. The tooling for validation and code generation is excellent. It supports external references and schema composition, which means you can build complex schemas from reusable pieces. And since it's designed for JSON, it's a natural fit for JSON-based APIs and documents.
JSON Schema is useful when you're building APIs and need to validate requests and responses, when you're working with configuration files that need validation, or when you want to document data structures in your API documentation.
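Full-featured JSON Schema validators exist for most languages (the Python `jsonschema` package is a common choice). To keep things self-contained, here's a toy validator in plain Python that handles just the `required` and scalar `type` keywords from the schema above:

```python
import json

schema = {
    "type": "object",
    "required": ["title", "content", "author"],
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "tags": {"type": "array"},
    },
}

# Map JSON Schema type names to Python types.
TYPE_MAP = {"object": dict, "string": str, "array": list}

def check(doc: dict, schema: dict) -> list[str]:
    """Toy validator: only handles 'required' and scalar 'type' keywords."""
    errors = [f"missing required field: {f}"
              for f in schema.get("required", []) if f not in doc]
    for field, rules in schema.get("properties", {}).items():
        if field in doc and not isinstance(doc[field], TYPE_MAP[rules["type"]]):
            errors.append(f"{field}: expected {rules['type']}")
    return errors

doc = json.loads('{"title": "New Blog Post", "content": "...", "tags": "oops"}')
print(check(doc, schema))
```

A real validator also handles `$ref`, `format`, nested objects, and the rest of the keyword vocabulary, but the principle is the same: the schema is data, and validation is mechanical.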
YANG: Built for networks
YANG was introduced in 2010 and designed specifically for modeling network configuration and operational data. If you're working with network devices, this is the industry standard.
A YANG model looks like this:
module example-network {
  namespace "http://example.com/network";
  prefix "ex";

  import ietf-inet-types {
    prefix inet;
  }

  container device {
    leaf hostname {
      type string;
    }
    leaf ip-address {
      type inet:ipv4-address;
    }
    leaf model {
      type string;
    }
    list interfaces {
      key "name";
      leaf name {
        type string;
      }
      leaf enabled {
        type boolean;
      }
    }
  }
}
YANG's strength is that it's purpose-built for network management. It integrates seamlessly with NETCONF, RESTCONF, and gNMI protocols. It has strong typing and validation for network-specific data types, and it's an industry standard with extensive vendor support across major equipment manufacturers.
YANG is useful when managing network device configurations, modeling operational state, or integrating with standard network management protocols.
GraphQL: Schema + query language
GraphQL, developed at Facebook in 2012 and open-sourced in 2015, takes a different approach. It combines a schema definition language with a query language, giving you a complete system for building modern APIs.
Here's part of a GraphQL schema, showing a Mutation type and its input types (the Post, User, and Comment object types it references would be defined elsewhere in the schema):
type Mutation {
  createPost(input: CreatePostInput!): Post
  createUser(input: CreateUserInput!): User
  addComment(input: AddCommentInput!): Comment
}

input CreatePostInput {
  title: String!
  content: String!
  authorId: ID!
}

input CreateUserInput {
  name: String!
  email: String!
}

input AddCommentInput {
  postId: ID!
  content: String!
  authorId: ID!
}
The power of GraphQL is that a single query can fetch exactly the data you need, with no over-fetching or under-fetching. It's strongly typed with excellent developer tooling. The schema serves as both documentation and a contract between your API and its consumers. And it has built-in support for mutations, which are how you change data.
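As a sketch of how a client would call one of those mutations, here's a request that creates a post and asks for only two fields back (assuming the Post type exposes id and title):

```graphql
mutation {
  createPost(input: { title: "Hello", content: "First post", authorId: "1" }) {
    id
    title
  }
}
```

The selection set after the mutation call is what prevents over-fetching: the server returns exactly those fields and nothing more.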
GraphQL is ideal for APIs where clients need flexible querying capabilities, applications with complex nested data requirements, and teams that want type safety across both frontend and backend code.
Infrahub schema: Infrastructure-specific
Infrahub takes a modern approach to data modeling with a schema optimized specifically for networking and infrastructure management.
An Infrahub schema looks like this:
---
nodes:
  - name: Device
    namespace: Dcim
    label: Network Device
    icon: clarity:network-switch-solid
    inherit_from:
      - DcimGenericDevice
      - DcimPhysicalDevice
    attributes:
      - name: name
        kind: Text
        unique: true
        order_weight: 1000
      - name: height
        label: Height (U)
        optional: false
        default_value: 1
        kind: Number
        order_weight: 1400
    relationships:
      - name: platform
        peer: DcimPlatform
        cardinality: one
        kind: Attribute
        order_weight: 1300
What sets Infrahub apart is its native support for infrastructure concepts like hierarchies, inheritance, and relationships. And the system automatically generates a GraphQL API from your schema, so you don't have to write and maintain that layer yourself.
The Infrahub schema excels as the foundation for a network source of truth.
Choosing the right data schema to work with
So which schema approach should you use? Here's a practical framework based on what you're actually trying to accomplish.
- If you're working with network device configuration, YANG is the industry standard. Start here if you're working with NETCONF/RESTCONF or building device configuration templates.
- For API development, look at GraphQL or JSON Schema. Choose GraphQL if you want flexible querying capabilities. Go with JSON Schema if you need straightforward validation and documentation.
- For a network source of truth, consider a graph-oriented schema like the one in Infrahub.
- If your team is already comfortable with relational databases and your data model fits naturally into tables and rows, you might stick with traditional SQL.
For production systems at scale, you'll need to invest in formal schemas with enforcement. Your future self, and your team, will thank you when you're not debugging mysterious data inconsistencies at 2 am.
Practical data schema takeaways
Here's what you need to remember as you start thinking about schemas in your own work building and evolving network automation.
- Everything has a schema. The question is whether it's explicit and enforced, or implicit and living in someone's head.
- There are always trade-offs between flexibility and data integrity, speed and consistency. Choose based on your actual requirements, not industry hype or what seems trendy.
- All schemas share three core components: structure (objects and their attributes), relationships (how things connect), and constraints (validation rules). Understanding these fundamentals helps you evaluate any schema language or approach.
- Different tools exist for different jobs. SQL excels at transactional systems. JSON Schema shines for APIs. YANG is purpose-built for network devices. GraphQL enables flexible querying. Infrahub specializes in network data. Match the tool to the problem.
- Finally, it's perfectly okay to start simple and plan for growth. Beginning with loose schemas for prototypes is fine, but build in enforcement as your system matures and the stakes get higher.
Understanding schemas is the foundation for making smart decisions about how you model and store your network data. But once you've defined your schema, you need somewhere to actually store that data… which brings us to databases.
Next, we'll explore different database types, their performance characteristics and trade-offs, and why graph databases are the perfect fit for network automation.