InfiniBand Networking for NVIDIA H100/H200 GPU Clusters
- Apr 30
Updated: May 3
Designing clusters with NVIDIA H100 Tensor Core GPUs and NVIDIA H200 Tensor Core GPUs quickly stops being about GPUs. At 16, 32, or 64 nodes, performance is defined by how efficiently nodes exchange data during synchronized operations.
That makes InfiniBand design—ports, rails, switches, and especially cabling—the deciding factor.
In this article, “clusters” means clusters built from 8× GPU baseboard servers, such as:
Dell PowerEdge XE9680
HPE Cray XD670
Supermicro AS-8125GS-TNHR
InfiniBand for NVIDIA H100/H200 GPU Clusters: Traffic Behavior & Scaling
Distributed training relies on NCCL collectives such as AllReduce and AllGather. These generate synchronized, many-to-many traffic where all GPUs communicate at the same time, in cycles.
NCCL uses all available network paths automatically. With 4 or 8 ports per server, traffic is striped across them, so performance depends on balanced paths, not just peak bandwidth.
A single weaker path can slow down the entire cluster, for example:
longer cable
additional network hop
misaligned rails
GPUs communicate internally over NVLink/NVSwitch at much higher bandwidth than InfiniBand, so if inter-node network bandwidth is insufficient, GPUs will idle waiting for communication instead of computing.
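The effect of one weaker path can be illustrated with a deliberately simplified model. The sketch below assumes a message is split evenly across rails and that the transfer completes only when the slowest rail finishes; the function name and the example rail speeds are illustrative, not measurements, and real NCCL scheduling is more involved.

```python
def striped_transfer_time(message_bytes, rail_gbps):
    """Seconds to send a message striped evenly across the given rails.

    Simplifying assumption: every rail carries an equal share, and the
    transfer is done only when the slowest rail has finished its share.
    """
    share_bits = message_bytes * 8 / len(rail_gbps)
    return max(share_bits / (gbps * 1e9) for gbps in rail_gbps)


# Eight healthy 400G rails versus seven healthy rails plus one running at 200G.
balanced = striped_transfer_time(1 << 30, [400] * 8)
degraded = striped_transfer_time(1 << 30, [400] * 7 + [200])
print(f"balanced: {balanced * 1e3:.2f} ms, one slow rail: {degraded * 1e3:.2f} ms")
```

Under this model, a single rail at half speed doubles the completion time of the whole striped transfer, even though seven of the eight rails are unaffected.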
InfiniBand for NVIDIA H100/H200 GPU Clusters: Port & NIC Layout
The starting point is the number of ports per server.
Typical 8× GPU systems support:
2× 400G ports
4× 400G ports
8× 400G ports
These are implemented using NVIDIA ConnectX-7.
Most high-end HGX-based systems (including DGX-class designs) use 8× single-port NICs, often consolidated through 4 OSFP cages, where each OSFP port internally carries two 400G connections (e.g., via DensiLink-style internal cabling).
This allows clean 1:1 GPU-to-rail mapping, which simplifies topology and improves balance.
Alternative implementations using dual-port NICs exist and are electrically equivalent, but consolidated OSFP designs are more common in reference architectures.
Scaling ports improves performance, but with trade-offs:
2 → 4 ports: clear and immediate gain
4 → 8 ports: better scaling for 32+ node clusters
higher port counts are limited by PCIe bandwidth: each 400G NIC needs a full PCIe Gen5 x16 link, and shared PCIe paths inside the server can reduce effective throughput even when the port count is sufficient
NICs must be evenly distributed across CPU sockets (NUMA nodes). If traffic crosses sockets via UPI/Infinity Fabric, latency increases and bandwidth drops.
Not all “400G-capable” configurations are equivalent in practice. Common NIC variants include:
Single-port 400G (NDR) – 1× 400G per card
Dual-port 400G – 2× 400G per card (higher PCIe pressure)
Dual-port 200G (HDR) – 2× 200G, often mistaken for 400G capability
A dual-port 200G NIC is not the same as a single-port 400G NIC. It results in half the rail bandwidth. For 8× GPU servers, the target is 8× 400G rails, ideally with single-port NICs for clean mapping.
GPU-to-NIC mapping (ideally, each of the 8 GPUs connects to its own network rail):
8 NIC ports → 1:1 GPU-to-rail mapping
fewer NICs → multiple GPUs share the same rail
When rails are shared, communication contention increases, reducing efficiency during NCCL collectives such as AllReduce.
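The mapping rule above can be sketched in a few lines. This is an illustration only: it assumes GPUs are grouped onto rails in contiguous index order, whereas a real server follows its PCIe topology, and both function names are hypothetical.

```python
from collections import defaultdict


def map_gpus_to_rails(num_gpus=8, num_ports=8):
    """Group GPUs onto rails; contiguous GPU indices share a rail (assumption)."""
    if num_ports <= 0 or num_gpus % num_ports != 0:
        raise ValueError("num_gpus must be a positive multiple of num_ports")
    gpus_per_rail = num_gpus // num_ports
    rails = defaultdict(list)
    for gpu in range(num_gpus):
        rails[gpu // gpus_per_rail].append(gpu)
    return dict(rails)


def per_gpu_gbps(num_gpus=8, num_ports=8, port_gbps=400):
    """Network bandwidth available per GPU when all GPUs transmit at once."""
    return num_ports * port_gbps / num_gpus


for ports in (8, 4, 2):
    print(ports, "ports:", map_gpus_to_rails(8, ports), per_gpu_gbps(8, ports), "Gb/s per GPU")
```

With 8 ports every GPU gets a dedicated 400G rail; with 4 ports two GPUs share each rail, so the per-GPU share under simultaneous traffic drops to 200 Gb/s.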
InfiniBand for NVIDIA H100/H200 GPU Clusters: Rail-Optimized Topology
Each port represents a separate communication path, known as a rail. At scale, these rails must be explicitly designed.
A rail-optimized topology assigns each rail consistently across the fabric. For example, “port 1” from every server connects to the same logical leaf group, forming a low-diameter path for that rail. This aligns with NVIDIA’s recommended design for AI fabrics.
The benefits:
fewer hops per communication
predictable latency
better load distribution
Without this, traffic spreads across mixed paths and additional hops, increasing congestion and reducing efficiency during synchronized NCCL operations.
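A rail-optimized wiring plan can be expressed as a simple rule: the leaf a port connects to depends on the port's rail number and on the server's group, never on anything else. The sketch below assumes each leaf dedicates 32 ports to servers (half of a 64-port switch); that figure, and the function names, are assumptions for illustration.

```python
def rail_optimized_leaf(server, port, servers_per_leaf=32):
    """Return (rail, leaf_index_within_rail) for one server port.

    Port p of every server in the same group lands on the same leaf,
    so same-rail traffic inside a group stays on a single switch.
    """
    return (port, server // servers_per_leaf)


def wiring_plan(num_servers, ports_per_server, servers_per_leaf=32):
    """Map every (server, port) pair to its rail leaf."""
    return {
        (server, port): rail_optimized_leaf(server, port, servers_per_leaf)
        for server in range(num_servers)
        for port in range(ports_per_server)
    }


plan = wiring_plan(num_servers=64, ports_per_server=8)
print(plan[(0, 3)], plan[(31, 3)], plan[(32, 3)])
print("distinct leaves:", len(set(plan.values())))
```

In this example, port 3 of servers 0 and 31 shares a leaf, while port 3 of server 32 lands on the next leaf of the same rail and is reached through the spine layer.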
InfiniBand for NVIDIA H100/H200 GPU Clusters: Switch Architecture & Port Usage
Switch design must match the port strategy.
Using NVIDIA Quantum-2 QM9700:
64 × 400G NDR ports
implemented via 32 OSFP cages (800G each)
each cage carries two 400G NDR ports, used either as a breakout (2× 400G) or together as one 800G link
breakout and uplink usage must be balanced to avoid oversubscription
aggregate throughput: 51.2 Tb/s bidirectional, non-blocking
This allows flexible usage:
breakout for server connectivity
full-bandwidth links for inter-switch connections
For larger clusters, newer platforms such as NVIDIA Quantum-X800 provide:
native 800G ports
significantly higher radix (e.g., 100+ ports depending on chassis)
better scaling for very large AI environments
In practice, switch ports are divided into:
server-facing connections (downlinks)
inter-switch connections (uplinks)
For AI workloads, the goal is near 1:1 aggregate bandwidth between these layers to maintain non-blocking behavior under burst traffic.
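The 1:1 target can be checked with a one-line ratio of aggregate downlink to aggregate uplink bandwidth. The sketch below is a minimal helper with a hypothetical name; the port splits in the example are illustrative choices on a 64-port leaf, not recommendations.

```python
def oversubscription(downlink_ports, uplink_ports, down_gbps=400, up_gbps=400):
    """Ratio of server-facing to inter-switch bandwidth on one leaf.

    1.0 means non-blocking; values above 1.0 mean the uplinks are
    oversubscribed by that factor.
    """
    if uplink_ports <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return (downlink_ports * down_gbps) / (uplink_ports * up_gbps)


# A 64-port leaf split evenly, and the same leaf with more ports given to servers.
print(oversubscription(32, 32))  # even split
print(oversubscription(48, 16))  # server-heavy split
```

An even 32/32 split gives a ratio of 1.0, while a 48/16 split gives 3.0, meaning servers can offer three times more traffic than the uplinks can carry.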
InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Types, Function & Selection
Cables define how the network is physically realized.
Breakout AOC cables (800G → 2×400G) - Used for switch-to-server connections.
Interface compatibility (OSFP vs QSFP112) must be verified, as mismatches can prevent links from coming up.
Function:
splits one switch port into two server connections
maximizes port utilization
Key considerations:
maintain rail consistency
avoid mixing rails within a breakout
keep cable lengths consistent per rail
800G AOC cables (direct) - Used for switch-to-switch connections.
Function:
carries full bandwidth between switches
forms the fabric backbone
Key considerations:
aggregate inter-switch bandwidth should match aggregate server-side bandwidth
should be evenly distributed across rails
required for predictable Clos behavior
DAC cables - Used only within racks.
Function:
low-latency short connections
Limitations:
distance
cable thickness at scale
Cable behavior and impact on performance
At 400G/800G speeds, cables influence:
latency
signal integrity
path symmetry
At these speeds, AOC cables are preferred over DAC beyond short distances due to better signal integrity and easier cable management.
Even small inconsistencies—like uneven cable lengths—can create measurable imbalance between rails. NCCL assumes uniform paths and cannot compensate for this.
InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Quantities & Planning
Define:
N = number of servers
P = ports per server
Breakout cables = (N × P) / 2
Inter-switch cables depend on design, but for near non-blocking fabrics:
typically 0.8–1.2× breakout count
Example: 32-node cluster, 8 ports
256 server ports
128 breakout cables
~100–155 inter-switch cables (0.8–1.2× of 128)
Example: 64-node cluster, 8 ports
512 server ports
256 breakout cables
~205–305 inter-switch cables (0.8–1.2× of 256)
Exact numbers depend on:
oversubscription ratio
redundancy design
required bisection bandwidth
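The formulas in this section can be turned into a small planning helper. The sketch below applies the stated breakout formula and the 0.8–1.2× inter-switch ratio as a rough range; the function name and the rounding choices (floor for the lower bound, ceiling for the upper bound) are assumptions, and a real bill of materials depends on the factors listed above.

```python
import math


def cable_plan(servers, ports_per_server, ratio_low=0.8, ratio_high=1.2):
    """Estimate cable counts from N servers and P ports per server."""
    server_ports = servers * ports_per_server
    if server_ports % 2:
        raise ValueError("breakout cables serve two ports each; need an even port count")
    breakout = server_ports // 2  # one 800G -> 2x400G cable per two server ports
    return {
        "server_ports": server_ports,
        "breakout_cables": breakout,
        "inter_switch_min": math.floor(breakout * ratio_low),
        "inter_switch_max": math.ceil(breakout * ratio_high),
    }


for n in (32, 64):
    print(n, "servers:", cable_plan(n, 8))
```

Treat the inter-switch range as a starting estimate only: the ratio shifts with the oversubscription target and with how many switch tiers the fabric needs.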
Cable planning considerations
At scale:
symmetry is critical
cables must be labeled and traceable
routing must preserve rail separation
airflow must not be blocked
serviceability must be maintained
Many large clusters fail operationally due to cable complexity, not bandwidth limits.
InfiniBand for NVIDIA H100/H200 GPU Clusters: Clos Topology & Scaling
Clos (spine–leaf) topology is standard beyond a single switch.
It provides:
multiple equal-cost paths
support for adaptive routing
scalability across racks
Quantum-2 includes:
adaptive routing
SHARP (in-network reduction)
These improve efficiency but do not replace correct topology or rail symmetry. Oversubscription must also be controlled: NCCL-based AI workloads assume near 1:1 bandwidth, and even moderate oversubscription can create bottlenecks.
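To see how a two-tier Clos fabric grows, the switch counts can be estimated from the server count. The sketch below assumes 64-port switches counted in 400G ports, a non-blocking design where each leaf uses half its ports as uplinks, and no allowance for redundancy or rail grouping; the function name is hypothetical.

```python
import math


def two_tier_clos(servers, ports_per_server, radix=64):
    """Size a non-blocking two-tier leaf-spine fabric (counts in 400G ports)."""
    down_per_leaf = radix // 2                    # half the ports face servers
    endpoints = servers * ports_per_server
    leaves = math.ceil(endpoints / down_per_leaf)
    if leaves > radix:
        # A spine has only `radix` ports, so it cannot reach more leaves than that.
        raise ValueError("too many leaves for two tiers; a third tier is needed")
    uplinks = leaves * (radix - down_per_leaf)    # 1:1 uplink-to-downlink ratio
    spines = math.ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines, "uplinks": uplinks}


for n in (32, 64, 256):
    print(n, "servers:", two_tier_clos(n, 8))
```

Under these assumptions, two tiers top out at 256 servers with 8 ports each (64 leaves, 32 spines); beyond that the function raises an error, which corresponds to adding another switch tier.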
InfiniBand for NVIDIA H100/H200 GPU Clusters: Monitoring & Validation
Monitoring is essential.
Fabric-level:
NVIDIA UFM → congestion, routing, utilization
Application-level:
NCCL_DEBUG=INFO
Validation should include:
verifying all rails are used
checking topology mapping
running NCCL benchmarks (e.g., all_reduce_perf)
A well-designed system should achieve 85–95% of theoretical bandwidth, depending on message size and cluster scale.
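The 85–95% target can be turned into a quick pass/fail check. The sketch below assumes the theoretical figure is the aggregate NIC bandwidth of one server (ports × 400 Gb/s, converted to GB/s) and that the measured value is a bandwidth in GB/s taken from a benchmark run; the function names and the sample measurement are hypothetical, and which benchmark column to compare against is left to the reader.

```python
def theoretical_node_gbytes(ports, port_gbps=400):
    """Aggregate NIC bandwidth of one server in GB/s (8 bits per byte)."""
    return ports * port_gbps / 8


def bandwidth_efficiency(measured_gbytes, ports, port_gbps=400):
    """Fraction of the theoretical aggregate bandwidth actually achieved."""
    return measured_gbytes / theoretical_node_gbytes(ports, port_gbps)


def meets_target(measured_gbytes, ports, low=0.85, port_gbps=400):
    """True if the measurement reaches the lower bound of the target range."""
    return bandwidth_efficiency(measured_gbytes, ports, port_gbps) >= low


measured = 360.0  # hypothetical benchmark result in GB/s, not a real measurement
print(f"theoretical: {theoretical_node_gbytes(8):.0f} GB/s")
print(f"efficiency:  {bandwidth_efficiency(measured, 8):.0%}")
print("meets 85% target:", meets_target(measured, 8))
```

A result well below the lower bound is a prompt to check rail usage and topology mapping before blaming the workload.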
InfiniBand for NVIDIA H100/H200 GPU Clusters: Practical Limitations
Key constraints:
PCIe bandwidth vs GPU traffic
NUMA placement
cable density and routing
cost scaling with ports
diminishing returns at higher port counts
In practice, clusters often underperform due to issues like wrong NIC types, mixed cable lengths, uneven NUMA placement, or oversubscribed topology.
InfiniBand for NVIDIA H100/H200 GPU Clusters: Design Patterns
Requirements change as clusters grow. Below are practical guidelines for clusters built from 8-GPU servers:
16-Node GPU Cluster (128 GPUs):
single switch (up to 64 × 400G ports, e.g., 16 servers × 4 ports) or small leaf–spine setup
rail optimization usually not needed
simple cabling
32-Node GPU Cluster (256 GPUs):
rail-optimized topology recommended
consistent rail mapping across servers
leaf–spine layout begins
small path differences impact performance
64-Node GPU Cluster (512 GPUs):
full leaf–spine (Clos) topology required
strict rail symmetry
NUMA-aware NIC placement
uneven paths reduce NCCL efficiency
128-Node GPU Cluster (1024 GPUs):
multi-tier Clos topology
balanced links between layers
rails consistent across racks
multiple spine switches required
256-Node GPU Cluster (2048 GPUs):
large-scale fabric design
strict rail consistency across the cluster
larger spine layer or extra switch tier
inter-switch bandwidth must match server traffic
512-Node GPU Cluster (4096+ GPUs):
multi-cluster scaling
additional core layer between clusters
careful inter-cluster traffic planning
higher complexity in routing, monitoring, and cabling
GPU cluster performance at scale depends on aligning ports, rails, switches, and cables to ensure efficient communication; otherwise, the hardware's potential is wasted.
FAQ – InfiniBand for NVIDIA H100/H200 GPU Clusters
How many InfiniBand ports are needed for NVIDIA H100/H200 GPU clusters?
Most NVIDIA H100/H200 GPU clusters use 4–8× 400G InfiniBand ports per server to ensure sufficient bandwidth for NCCL communication.
What is the best network topology for H100/H200 GPU clusters?
A rail-optimized Clos (spine–leaf) topology is the standard design for scalable NVIDIA H100/H200 GPU clusters.
How many cables are required for an InfiniBand GPU cluster?
Cable count scales with ports, typically (servers × ports ÷ 2) breakout cables plus a similar number of inter-switch cables.
What cable types are used in InfiniBand H100/H200 clusters?
Most deployments use 800G to 2×400G breakout AOC cables for servers and 800G AOC cables for switch-to-switch connections.
Why use InfiniBand instead of Ethernet for H100/H200 GPU clusters?
InfiniBand provides lower latency, native RDMA, and more predictable NCCL performance compared to Ethernet-based solutions.





