Subnet planning looks easy when your AI cluster is still a whiteboard sketch. Then the racks arrive, the first training window is booked, Kubernetes shows up, and your addressing plan decides whether you launch smoothly or spend a weekend renumbering nodes.
This guide is for AI infrastructure teams and cybersecurity-aware tech leaders who also care about proxy-driven realities (controlled egress, logging, IP reputation isolation, public IPv4 capacity). The big idea is simple:
Your IP plan should start with host bits, not with CIDR notation.
- • Why subnet planning for AI workloads matters right now
- • What most subnet guides miss
- • Start with host bits, then write the CIDR
- • A host-bits-first method for subnet planning for AI workloads
- • How subnet size impacts GPU training clusters and performance
- • A worked example of subnet planning for AI workloads
- • Common mistakes
- • Security-first checklist
- • FAQ
Why subnet planning for AI workloads matters right now
Subnet planning for AI workloads is the process of sizing and segmenting IP ranges so GPU nodes, Kubernetes pods and services, storage endpoints, and security tooling can scale without disruptive readdressing.
AI compute is getting dense. NVIDIA positions the GB200 NVL72 as an “exascale computer in a single rack,” with 72 Blackwell GPUs in one NVLink domain providing 130 TB/s of low-latency GPU communications. When compute stacks this tight, networking has less forgiveness, and your network addressing strategy has to be clean from day one.
Connectivity demand is rising right alongside that compute density. Data Center Knowledge reported that bandwidth purchased for data center connectivity surged by nearly 330% from 2020 to 2024, driven by hyperscale expansion and AI. If your fabric is growing at that pace, subnet planning for AI workloads cannot be “we’ll fix it later.”
And if you operate proxy-driven workflows (scraping/data acquisition, verification, fraud detection, model serving with partner allowlists), your egress architecture depends on clean segmentation and predictable IP allocation.
Here is the workflow that makes an IP addressing plan predictable:
- • Count real IP consumers
- • Choose the host bits you need, plus headroom
- • Translate host bits into CIDRs and security zones
- • Map IPs so on-call humans can move fast
What most subnet guides miss
A lot of subnet articles are written for classic web tiers. Their advice is not wrong, but it does not warn you where AI clusters actually break.
Here are the usual blind spots:
- • They treat “hosts” as only servers. Subnet planning for AI workloads must count pods, services, agents, load balancers, and security appliances too.
- • They separate networking from security. A good addressing strategy bakes segmentation and auditability into the first draft.
- • They assume CIDRs are easy to change later. In real clusters, that is not always true. Oracle’s guidance for large Kubernetes clusters warns that subnet changes can be disruptive, and that you cannot change the pod CIDR after creation. That is a big reason subnet planning for AI workloads should start with host bits and headroom.
Proxy-driven teams hit an additional failure mode:
- • They treat egress as “just NAT.” In practice, egress design is governance: routing, DNS policy, logging, allowlists, and reputation isolation all depend on it.
Start with host bits, then write the CIDR
CIDR is how we write the answer. Host bits are how we choose the answer.
When you say “/24,” you are really saying “8 host bits,” which means 2^8 total addresses. Subnet planning for AI workloads gets simpler when you ask one question first:
How many addresses do we truly need, including growth and churn?
Basic host-bit math for AI cluster IP planning:
- • /24 = 8 host bits = 256 total addresses
- • /22 = 10 host bits = 1,024
- • /20 = 12 host bits = 4,096
- • /19 = 13 host bits = 8,192
One important note: cloud providers can reserve addresses in ways that reduce what you can actually use. So subnet planning for AI workloads should always validate usable counts in your environment.
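Here is a minimal sketch of that translation in Python, assuming all you need is the smallest prefix that covers a raw address count; the reserved-address caveat (AWS, for example, reserves five addresses in every subnet) still has to be validated against your own provider.

```python
import math

def prefix_for(addresses_needed: int) -> int:
    """Smallest IPv4 prefix whose host bits cover the required address count."""
    host_bits = max(math.ceil(math.log2(addresses_needed)), 1)
    return 32 - host_bits

# Providers reserve a few addresses per subnet (AWS reserves five),
# so always validate usable counts in your own environment.
for need in (200, 900, 4000, 6656):
    prefix = prefix_for(need)
    print(f"{need:>5} addresses -> /{prefix} ({2 ** (32 - prefix)} total)")
```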
A host-bits-first method for subnet planning for AI workloads
Step 1: Inventory every IP consumer
If you only count GPU nodes, your plan will look “perfect” right up until Kubernetes and security tooling show up. Subnet planning for AI workloads starts with a full IP budget across layers:
Compute layer
- • GPU nodes (bare metal or VMs)
- • login or jump nodes
- • schedulers and controllers (if not managed)
Kubernetes and platform layer
- • node IPs
- • pod IPs
- • service ClusterIP range
- • ingress or load balancer addresses
This is where address space design for AI workloads often surprises teams.
- • On Amazon EKS, AWS states the VPC CNI assigns each pod an IP address from the VPC’s CIDR(s), and notes this can consume a substantial number of IPs.
- • On Google Kubernetes Engine, pods are allocated from a dedicated secondary range in VPC-native setups (so that secondary range must be sized deliberately).
Data layer
- • storage endpoints, gateways, metadata services
- • object storage or caching services
- • data ingestion workers
Operations and security layer
- • BMC or out-of-band management addresses
- • monitoring and logging collectors
- • vulnerability scanners, sensors, inspection points
- • proxies and egress controls (gateways, NAT, egress LBs, DNS policy points)
In practice, subnet planning for AI workloads fails most often when the platform and security layers were never included in the initial count.
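As a rough illustration of that budget, here is a sketch with made-up counts per layer; the categories follow the lists above, but every number is an assumption you would replace with your own inventory.

```python
# Illustrative IP budget; every count below is an assumption, not a recommendation.
ip_budget = {
    "compute":      {"gpu_nodes": 64, "login_jump": 4, "schedulers": 4},
    "kubernetes":   {"node_ips": 64, "pod_ips": 5120, "service_ips": 500, "ingress_lb": 8},
    "data":         {"storage_endpoints": 24, "object_cache": 8, "ingestion_workers": 16},
    "ops_security": {"bmc": 64, "monitoring": 12, "scanners": 6, "egress_proxies": 8},
}

for layer, consumers in ip_budget.items():
    print(f"{layer:>12}: {sum(consumers.values()):>5} addresses")
total = sum(sum(c.values()) for c in ip_budget.values())
print(f"{'total':>12}: {total:>5} addresses before headroom")
```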
Step 2: Add growth and churn on purpose
AI teams do not sit still. They add nodes. They spin up experiments. They run bursty pipelines. They tear down and rebuild pieces of the platform. That is normal.
So in IP allocation planning for AI clusters, headroom is not “extra.” Headroom is stability.
A practical approach:
- • Add 30% headroom for predictable growth
- • Add another 20% if you autoscale frequently or run many short-lived pods
If that sounds large, compare it to renumbering a live training fleet. Subnet planning for AI workloads is cheap on paper and expensive under pressure.
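A sketch of that headroom rule, assuming the 30% and 20% buffers simply add together; tune the factors to your own growth history.

```python
import math

def with_headroom(count: int, growth: float = 0.30, churn: float = 0.20,
                  autoscaling: bool = False) -> int:
    """Apply the 30% growth buffer, plus 20% more for autoscaling churn."""
    factor = 1 + growth + (churn if autoscaling else 0)
    return math.ceil(count * factor)

print(with_headroom(76))                      # 99: steady-state training plane
print(with_headroom(5120, autoscaling=True))  # 7680: bursty pod fleet
```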
Step 3: Segment by planes and trust zones
Security teams do not want one flat “AI subnet.” They want clear boundaries and enforceable policy. Subnet planning for AI workloads should match trust zones:
- • Training plane: GPU nodes, distributed training traffic, job runtime services
- • Storage plane: file systems, object storage, data movers
- • Control plane: schedulers, APIs, identity, CI runners
- • Management plane: BMCs, bastions, config management, monitoring
- • Egress plane: controlled outbound via proxy or NAT with logging and DNS policy
This makes policy reviews faster and reduces lateral movement paths. It also gives you a smaller blast radius when something goes wrong. This network segmentation plan is not just capacity planning; it is security architecture.
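One way to make those zones concrete is to carve each plane out of a single supernet and keep the map in version control. This is a minimal sketch; the 10.20.0.0/16 supernet and all plane sizes are hypothetical.

```python
import ipaddress

supernet = ipaddress.ip_network("10.20.0.0/16")  # hypothetical RFC1918 supernet
planes = {
    "training":   ipaddress.ip_network("10.20.0.0/24"),
    "storage":    ipaddress.ip_network("10.20.1.0/25"),
    "control":    ipaddress.ip_network("10.20.2.0/24"),
    "management": ipaddress.ip_network("10.20.3.0/24"),
    "egress":     ipaddress.ip_network("10.20.4.0/26"),
}

# Every plane must sit inside the supernet; firewall policy follows these boundaries.
for name, net in planes.items():
    assert net.subnet_of(supernet), f"{name} plane is outside {supernet}"
    print(f"{name:>10}: {net} ({net.num_addresses} addresses)")
```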
Step 4: Map addresses to the physical cluster
In a busy incident, nobody wants to reverse engineer spreadsheets. Subnet planning for AI workloads becomes easier to operate when the IP itself gives you a hint.
A simple convention:
- • one subnet per rack for training nodes
- • a dedicated subnet for storage services
- • a protected subnet for control-plane services
- • a stable subnet for management and BMC
- • a dedicated subnet for egress gateways/proxies
This turns your cluster addressing plan into a troubleshooting tool. When training starts stalling, you can quickly ask: “Is this rack-specific or plane-specific?” When a partner blocks your service, you can quickly ask: “Which egress pool did this exit from?”
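A small sketch of that convention, assuming the hypothetical training plane from the previous sketch (10.20.0.0/24) is split into four per-rack /26 blocks, so the address itself answers "which rack?":

```python
import ipaddress

training_plane = ipaddress.ip_network("10.20.0.0/24")   # hypothetical training plane
racks = {f"rack-{i + 1}": net
         for i, net in enumerate(training_plane.subnets(new_prefix=26))}

def rack_for(ip: str) -> str:
    """Answer 'which rack is this IP in?' straight from the address."""
    addr = ipaddress.ip_address(ip)
    for name, net in racks.items():
        if addr in net:
            return name
    return "not in the training plane"

print(rack_for("10.20.0.70"))   # rack-2, because 10.20.0.64/26 is the second rack
```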
Step 5: Lock pod and service ranges early
This is where many teams get burned.
Oracle warns that you cannot change the pod CIDR you initially specify after cluster creation, and that subnet changes in a running cluster can be disruptive. Kubernetes documents how service IP ranges (ServiceCIDR) are assigned and reconfigured, which is another reminder that service ranges are a first-class configuration, not an afterthought.
So subnet planning for AI workloads should not be “pick VPC subnets now, figure out pods later.” Do the pod math first, then decide the rest.
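A sketch of that pre-creation check, assuming the node, pod, and service ranges below are the candidates you intend to lock in; the point is to prove they are disjoint (and that the VPC-bound ranges actually fit the VPC) before the cluster exists, because some of them cannot be changed afterwards.

```python
import ipaddress

vpc          = ipaddress.ip_network("10.20.0.0/16")     # hypothetical VPC range
node_subnets = ipaddress.ip_network("10.20.0.0/24")
pod_cidr     = ipaddress.ip_network("10.20.64.0/19")    # 8,192 addresses
service_cidr = ipaddress.ip_network("10.96.0.0/22")     # placement is platform-dependent

# Pod and service ranges are effectively immutable, so validate before creation.
assert node_subnets.subnet_of(vpc) and pod_cidr.subnet_of(vpc)
ranges = [("nodes", node_subnets), ("pods", pod_cidr), ("services", service_cidr)]
for i, (a, net_a) in enumerate(ranges):
    for b, net_b in ranges[i + 1:]:
        assert not net_a.overlaps(net_b), f"{a} and {b} ranges overlap"
print("node, pod, and service ranges are disjoint and safe to lock in")
```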
Step 6: Avoid private IP collisions and plan public IP needs
Most AI network addressing uses RFC1918 space: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. The collision risk is real when you have VPN users, M&A, or peered networks. 192.168.0.0/16 is especially conflict-prone because it is common in home networks.
Also, even private training clusters often need controlled outbound and sometimes public endpoints:
- • dataset pulls
- • software updates
- • license servers
- • model serving endpoints
- • partner integrations
- • proxy-driven data workflows that require stable or segmented public egress
Subnet planning for AI workloads that ignores external IP strategy often ends up with NAT sprawl, inconsistent logging, and reputation spillover across tenants or workloads.
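Before committing to a supernet, a quick collision check against ranges already in use (VPN clients, peered VPCs, partner links) avoids painful renumbering later; all ranges below are assumptions for illustration.

```python
import ipaddress

# Assumed ranges already used by VPN clients, peered networks, and offices.
in_use = [ipaddress.ip_network(n) for n in
          ("10.0.0.0/14", "172.16.0.0/16", "192.168.0.0/16")]
candidate = ipaddress.ip_network("10.20.0.0/16")   # hypothetical cluster supernet

conflicts = [str(n) for n in in_use if candidate.overlaps(n)]
print(f"{candidate}: " + ("clear" if not conflicts else f"collides with {conflicts}"))
```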
How subnet size impacts GPU training clusters and performance
Subnet planning for AI workloads does not set your bandwidth, but a bad plan forces complexity that hurts performance and security.
- • Too small: teams bolt on extra NAT, overlays, odd routes, and debugging becomes slower and riskier.
- • Too flat: you create a large failure domain and widen the lateral movement surface.
Meanwhile, network speeds are racing forward to keep up with AI traffic. Keysight notes the industry is transitioning to 800G and 1.6T Ethernet interconnects to support modern AI workloads at scale. That is another reason your IP range design should favor simple, low-hop, low-surprise designs.
A worked example of subnet planning for AI workloads
Let’s size a mid-size environment using host bits. We will keep the math realistic and security-friendly.
Scenario
- • 512 GPUs
- • 64 nodes with 8 GPUs each
- • Kubernetes scheduling
- • Dedicated storage layer
- • Segmentation by plane and by rack
- • Controlled egress via proxy/NAT with logging
1) Training plane
Start by counting:
- • 64 training nodes
- • 4 utility nodes (login, build, admin)
- • 8 fabric services or appliances (varies by design)
That is 76 addresses before headroom. If you want per-rack segmentation with 4 racks of 16 nodes:
- • Each rack subnet must cover 16 nodes plus a few services.
- • A /27 (32 total) is tight.
- • A /26 (64 total) is comfortable.
So choose four /26 subnets, one per rack. This is subnet planning for AI workloads that stays calm when “one more service” shows up.
2) Management plane
Count:
- • 64 BMC addresses
- • monitoring and logging nodes
- • security tools and scanners
Call it 80 addresses, then add headroom. Use a /24 so management stays stable as the cluster grows. Stable management is a quiet win in address space planning for AI workloads.
3) Storage plane
Count:
- • 40 endpoints and service addresses
- • 20 addresses of headroom
A /25 (128 total) gives space while keeping a smaller blast radius.
4) Kubernetes pod IPs
This is the part many subnet articles skip, but it often drives the whole plan. In subnet planning for AI workloads, pods can be the biggest consumer.
Assume peak pods per node:
- • 80 pods per node
- • 64 nodes × 80 = 5,120 pod IPs
- • +30% headroom = 6,656 pod IPs
Now choose host bits:
- • 2^12 = 4,096 is not enough
- • 2^13 = 8,192 covers 6,656 comfortably
So your cluster IP plan suggests a pod range equivalent to a /19 (8,192 total), depending on your CNI and how you allocate ranges.
This is not hypothetical. AWS warns pod IP consumption can be substantial in VPC CNI models, and GKE uses dedicated secondary ranges for pods in VPC-native clusters; in both models, under-sizing accelerates IP exhaustion.
5) Kubernetes service IPs
Services are often smaller but easy to underestimate.
- • 500 services today
- • headroom to 1,000
A /22 (1,024 total) is a clean fit. Kubernetes treats service IP ranges as a core configuration, so size it deliberately during subnet planning for AI workloads.
Quick sizing formula
Pod IPs needed = nodes × peak pods per node × (1 + headroom)
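That formula, applied to the worked example above (64 nodes, 80 peak pods per node, 30% headroom), is easy to automate; the prefix conversion is the same host-bits step shown earlier.

```python
import math

def pod_ip_plan(nodes: int, peak_pods_per_node: int, headroom: float = 0.30):
    """Pod IPs needed = nodes x peak pods per node x (1 + headroom), plus the prefix that fits."""
    needed = math.ceil(nodes * peak_pods_per_node * (1 + headroom))
    host_bits = max((needed - 1).bit_length(), 1)   # smallest 2^n >= needed
    return needed, f"/{32 - host_bits}"

print(pod_ip_plan(64, 80))   # (6656, '/19'): 6,656 pod IPs fit in a /19
```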
Common mistakes
- • Sizing for nodes and forgetting pods and services
- • Mixing training and management into one flat range
- • Forgetting BMC, scanners, and logging in the IP budget
- • Overlapping RFC1918 space with VPN or peering networks
- • Deferring CNI and CIDR decisions until late, then discovering changes are disruptive
- • Treating egress as “just NAT” instead of a controlled, auditable proxy/egress plane
Security-first checklist
Before you freeze the plan, run this checklist:
- • Each plane has its own range, and the default stance is deny-first.
- • Management access goes through bastions, not directly from user subnets.
- • Egress is centralized so logs, DNS policy, and threat controls stay consistent.
- • Pod and service ranges are documented, monitored, and covered by utilization alerts.
- • IP naming and allocation rules are written down so your addressing plan stays consistent across teams.
- • You can answer, fast: “what rack is this IP in” and “what plane is it in.”
- • In proxy-driven environments, you can also answer, fast: "which egress pool did this traffic exit from?" and "was this IP shared?"
That last point sounds small, but it is the difference between calm troubleshooting and chaos. Good subnet planning for AI workloads gives your team fewer mysteries when training is running hot.
Most AI cluster address planning is private IP strategy. But AI teams still need public IP capacity for controlled egress, model serving endpoints, partner integrations, and security monitoring.
PubConcierge helps infrastructure teams secure procurement-ready IPv4 allocations with operational support, so your external addressing is as clean, segmented, and auditable as your internal planes.
FAQ
Q1: What is the fastest way to size subnets for a new training cluster?
• Subnet planning for AI workloads goes fastest when you start with host bits, count every IP consumer (including pods and services), add headroom, then map it into CIDRs.
Q2: Does subnet planning for AI workloads change if I use InfiniBand or RoCE?
• You still need subnet planning for AI workloads for management, storage, orchestration, and controlled egress. The high-speed fabric does not remove the need for clean IP segmentation.
Q3: How do I avoid running out of pod IPs later?
• Use peak pod counts, not averages, and match your CNI behavior. AWS and Google both document models where pods consume VPC ranges, which can accelerate IP exhaustion if you under-size.
Q4: Can I change pod and service CIDRs later?
• Sometimes, but it can be disruptive and is often avoided in production. Oracle explicitly warns pod CIDR cannot be changed after creation, and Kubernetes provides guided processes for service range changes. Treat these as early decisions in subnet planning for AI workloads.
References and further reading
- • NVIDIA GB200 NVL72 product page: https://www.nvidia.com/en-gb/data-center/gb200-nvl72
- • Data Center Knowledge coverage of Zayo Bandwidth Report: https://www.datacenterknowledge.com/networking/data-center-bandwidth-soars-330-driven-by-ai-demand
- • AWS EKS best practice (IP optimization / VPC CNI pod IP usage): https://docs.aws.amazon.com/eks/latest/best-practices/ip-opt.html
- • Google GKE VPC-native clusters / alias IPs (secondary range for pods): https://docs.cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
- • Oracle large cluster best practices (CIDR/subnet changes, pod CIDR constraints): https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengbestpractices_topic-Large-Scale-Clusters-best-practices.htm
- • Kubernetes docs (reconfigure default ServiceCIDR): https://kubernetes.io/docs/tasks/network/reconfigure-default-service-ip-ranges
- • Keysight (800G / 1.6T AI data center networking challenges): https://www.keysight.com/blogs/en/inds/ai/top-3-ai-data-center-challenges-at-1-6t-and-how-to-solve-them
Legal and compliance disclaimer
Subnet planning for AI workloads can touch regulated data and critical systems. Follow your organization’s security policies and all applicable laws and regulations, including privacy, security, and export controls that may apply to AI infrastructure and datasets. This article provides general technical information and is not legal advice. Consult qualified counsel for jurisdiction-specific requirements.
Stay up to date on growth infrastructure, email best practices, and startup scaling strategies by following PubConcierge on LinkedIn.