InfiniBand Networking for NVIDIA H100/H200 GPU Clusters

Apr 30

Updated: May 3

Designing clusters with NVIDIA H100 Tensor Core GPUs and NVIDIA H200 Tensor Core GPUs quickly stops being about GPUs. At 16, 32, or 64 nodes, performance is defined by how efficiently nodes exchange data during synchronized operations.

That makes InfiniBand design—ports, rails, switches, and especially cabling—the deciding factor.


In this article, when we refer to “clusters,” we are specifically talking about 8× GPU baseboard servers, such as:

Dell PowerEdge XE9680
HPE Cray XD670
Supermicro AS-8125GS-TNHR


InfiniBand for NVIDIA H100/H200 GPU Clusters: Traffic Behavior & Scaling

Distributed training relies on NCCL collectives such as AllReduce and AllGather. These generate synchronized, many-to-many traffic where all GPUs communicate at the same time, in cycles.

NCCL uses all available network paths automatically. With 4 or 8 ports per server, traffic is striped across them, so performance depends on balanced paths, not just peak bandwidth.

A single weaker path can slow down the entire cluster, for example:

longer cable
additional network hop
misaligned rails
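A simplified way to see why one weak path matters: if a message is split evenly across all rails, the transfer only finishes when the slowest rail finishes. The sketch below models exactly that. The even-striping assumption and the function name `striped_effective_gbps` are illustrative and do not describe NCCL's actual scheduling.

```python
def striped_effective_gbps(rail_gbps):
    """Effective aggregate rate when a message is split evenly across rails.

    Assumption (illustrative): every rail carries an equal share, so the
    transfer completes only when the slowest rail is done.
    """
    if not rail_gbps:
        raise ValueError("at least one rail is required")
    return len(rail_gbps) * min(rail_gbps)


balanced = striped_effective_gbps([400] * 8)          # eight healthy rails
one_slow = striped_effective_gbps([400] * 7 + [300])  # one degraded rail
print(balanced, one_slow)
```

Under this model, a single rail running at 75% of its rate pulls the whole server down to 75%, even though seven of eight rails are healthy.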

GPUs communicate internally over NVLink/NVSwitch at much higher bandwidth than InfiniBand, so if inter-node network bandwidth is insufficient, GPUs will idle waiting for communication instead of computing.

InfiniBand for NVIDIA H100/H200 GPU Clusters: Port & NIC Layout

The starting point is the number of ports per server.

Typical 8× GPU systems support:

2× 400G ports
4× 400G ports
8× 400G ports

These are implemented using NVIDIA ConnectX-7.

Most high-end HGX-based systems (including DGX-class designs) use 8× single-port NICs, often consolidated through 4 OSFP cages, where each OSFP port internally carries two 400G connections (e.g., via DensiLink-style internal cabling).

This allows clean 1:1 GPU-to-rail mapping, which simplifies topology and improves balance.

Alternative implementations using dual-port NICs exist and are electrically equivalent, but consolidated OSFP designs are more common in reference architectures.

Scaling ports improves performance, but with trade-offs:

2 → 4 ports: clear and immediate gain
4 → 8 ports: better scaling for 32+ node clusters
higher port counts: limited by PCIe bandwidth. Each 400G NIC requires a full PCIe Gen5 x16 link, and shared PCIe paths inside the server can reduce effective throughput even when the port count is sufficient.
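The PCIe constraint can be checked with simple arithmetic. The sketch below computes the raw per-direction bandwidth of a PCIe link from its transfer rate, lane count, and 128b/130b line encoding, then compares it with the NIC's port bandwidth. It ignores protocol overhead, so real throughput is lower; the function names are illustrative.

```python
def pcie_raw_gbps(gt_per_s=32.0, lanes=16, encoding=128 / 130):
    """Raw per-direction PCIe bandwidth in Gb/s after line encoding.

    Defaults assume PCIe Gen5 x16 (32 GT/s per lane, 128b/130b encoding).
    Packet headers and flow control are ignored, so usable throughput
    is somewhat lower than this figure.
    """
    return gt_per_s * lanes * encoding


def slot_can_feed(port_gbps, ports, **pcie):
    """True if the slot's raw bandwidth covers all ports at line rate."""
    return pcie_raw_gbps(**pcie) >= port_gbps * ports


print(round(pcie_raw_gbps(), 1))            # Gen5 x16, in Gb/s
print(slot_can_feed(400, 1))                # single-port 400G on Gen5 x16
print(slot_can_feed(400, 2))                # dual-port 400G on one Gen5 x16 slot
print(slot_can_feed(400, 1, gt_per_s=16))   # single-port 400G on Gen4 x16
```

A Gen5 x16 slot offers roughly 504 Gb/s raw, which covers one 400G port with modest headroom but not two, and a Gen4 x16 slot cannot carry even one 400G port at line rate.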

NICs must be evenly distributed across CPU sockets (NUMA nodes). If traffic crosses sockets via UPI/Infinity Fabric, latency increases and bandwidth drops.
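NIC placement can be inspected on a Linux host through sysfs, where each RDMA device exposes the NUMA node of its PCIe slot. The sketch below reads those values and checks that NICs are spread evenly across sockets. It assumes a two-socket server by default, and the device names in the example are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_nic_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to its NUMA node, read from Linux sysfs.

    Returns an empty dict if the directory does not exist. A value of -1
    means the kernel reported no NUMA affinity for that device.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for dev in sorted(root.iterdir()):
        numa_file = dev / "device" / "numa_node"
        if numa_file.is_file():
            nodes[dev.name] = int(numa_file.read_text().strip())
    return nodes


def is_evenly_spread(nic_to_numa, numa_nodes=2):
    """True if every NUMA node 0..numa_nodes-1 hosts the same number of NICs."""
    counts = Counter(nic_to_numa.values())
    per_node = [counts.get(node, 0) for node in range(numa_nodes)]
    # The sum check rejects NICs on unexpected nodes (for example -1).
    return len(set(per_node)) == 1 and sum(per_node) == len(nic_to_numa)


# Hypothetical layout: 8 NICs, 4 per socket.
example = {f"mlx5_{i}": 0 if i < 4 else 1 for i in range(8)}
print(is_evenly_spread(example))
```

On a real server, `is_evenly_spread(read_nic_numa_nodes())` gives a quick first check before looking at the full PCIe topology.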

Not all “400G-capable” configurations are equivalent in practice. Common NIC variants include:

Single-port 400G (NDR) – 1× 400G per card
Dual-port 400G – 2× 400G per card (higher PCIe pressure: a single PCIe Gen5 x16 link cannot feed both ports at full rate)
Dual-port 200G (NDR200, or HDR on older cards) – 2× 200G, often mistaken for 400G capability

A dual-port 200G NIC is not the same as a single-port 400G NIC. It results in half the rail bandwidth. For 8× GPU servers, the target is 8× 400G rails, ideally with single-port NICs for clean mapping.

GPU-to-NIC mapping (each of the 8 GPUs connects to its own network rail):

8 NIC ports → 1:1 GPU-to-rail mapping
fewer NICs → multiple GPUs share the same rail

When rails are shared, communication contention increases, reducing efficiency during NCCL collectives such as AllReduce.
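The effect of sharing rails can be put into numbers. The sketch below assigns GPUs to rails in contiguous groups and computes the network bandwidth left per GPU when all GPUs send at once. The contiguous grouping is an illustrative assumption; on a real server the assignment follows the PCIe topology.

```python
def gpu_to_rail(num_gpus=8, num_rails=8):
    """Assign GPUs to rails in contiguous groups (illustrative only)."""
    if num_rails <= 0 or num_gpus % num_rails:
        raise ValueError("GPUs must divide evenly across rails")
    gpus_per_rail = num_gpus // num_rails
    return {gpu: gpu // gpus_per_rail for gpu in range(num_gpus)}


def per_gpu_gbps(num_gpus=8, num_rails=8, rail_gbps=400):
    """Network bandwidth per GPU when all GPUs transmit simultaneously."""
    return rail_gbps * num_rails / num_gpus


for rails in (8, 4, 2):
    print(rails, gpu_to_rail(8, rails), per_gpu_gbps(8, rails))
```

With 8 rails each GPU keeps a full 400G path; with 4 rails two GPUs share each rail and the per-GPU share halves.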

InfiniBand for NVIDIA H100/H200 GPU Clusters: Rail-Optimized Topology

Each port represents a separate communication path, known as a rail. At scale, these rails must be explicitly designed.

A rail-optimized topology assigns each rail consistently across the fabric. For example, “port 1” from every server connects to the same logical leaf group, forming a low-diameter path for that rail. This aligns with NVIDIA’s recommended design for AI fabrics.

The benefits:

fewer hops per communication
predictable latency
better load distribution

Without this, traffic spreads across mixed paths and additional hops, increasing congestion and reducing efficiency during synchronized NCCL operations.
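As a rough sketch of the rail-optimized idea, the mapping below sends port r of every server to leaf group r. It assumes 32 server-facing ports per leaf (half of a 64-port switch); the function names and the fill order of servers within a rail group are hypothetical.

```python
def rail_optimized_leaf(server, port, servers_per_leaf=32):
    """Return (rail, leaf index within that rail group) for a server port.

    Port r of every server lands in leaf group r. Within a group, servers
    fill leaves in order, servers_per_leaf at a time.
    """
    return port, server // servers_per_leaf


def same_leaf(a, b, servers_per_leaf=32):
    """True if two (server, port) endpoints attach to the same leaf switch."""
    return (rail_optimized_leaf(*a, servers_per_leaf)
            == rail_optimized_leaf(*b, servers_per_leaf))


print(same_leaf((0, 3), (17, 3)))   # same rail, same leaf: one switch hop
print(same_leaf((0, 3), (17, 4)))   # different rails: different leaf groups
print(same_leaf((0, 3), (40, 3)))   # same rail, next leaf: crosses a spine
```

Traffic between the same port index on two servers stays on a single leaf as long as both servers fit in that leaf; this is the low-diameter path that the rail-optimized layout is built around.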

InfiniBand for NVIDIA H100/H200 GPU Clusters: Switch Architecture & Port Usage

Switch design must match the port strategy.

Using NVIDIA Quantum-2 QM9700:

64 × 400G NDR ports
implemented via 32 OSFP cages (2× 400G each, 800G aggregate per cage)
each cage can be broken out to two 400G devices or used as a single 800G (2× 400G) link to another switch, so breakout and uplink usage must be balanced to avoid oversubscription
aggregate throughput: 51.2 Tb/s bidirectional, non-blocking

This allows flexible usage:

breakout for server connectivity
full-bandwidth links for inter-switch connections

For larger clusters, newer platforms such as NVIDIA Quantum-X800 provide:

native 800G ports
significantly higher radix (e.g., 100+ ports depending on chassis)
better scaling for very large AI environments

In practice, switch ports are divided into:

server-facing connections (downlinks)
inter-switch connections (uplinks)

For AI workloads, the goal is near 1:1 aggregate bandwidth between these layers to maintain non-blocking behavior under burst traffic.
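The downlink-to-uplink balance of a leaf switch can be expressed as an oversubscription ratio. The sketch below is a minimal calculation of that ratio; the function name and the example port splits are illustrative.

```python
from fractions import Fraction


def oversubscription(downlink_ports, uplink_ports, down_gbps=400, up_gbps=400):
    """Ratio of server-facing to inter-switch bandwidth on one leaf.

    1 means non-blocking; values above 1 mean the leaf is oversubscribed.
    """
    if uplink_ports <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return Fraction(downlink_ports * down_gbps, uplink_ports * up_gbps)


print(oversubscription(32, 32))               # even split of 64 ports
print(oversubscription(48, 16))               # more servers, fewer uplinks
print(oversubscription(32, 16, up_gbps=800))  # 800G-aggregate uplinks
```

An even 32/32 split of a 64-port switch gives a ratio of 1, while 48 downlinks over 16 uplinks gives 3, meaning only a third of the server traffic can leave the leaf at once.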

InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Types, Function & Selection

Cables define how the network is physically realized.

Breakout AOC cables (800G → 2×400G) - Used for switch-to-server connections.

Interface compatibility (OSFP vs QSFP112) must be verified, as mismatches can prevent links from coming up.

Function:

splits one switch port into two server connections
maximizes port utilization

Key considerations:

maintain rail consistency
avoid mixing rails within a breakout
keep cable lengths consistent per rail

800G AOC cables (direct) - Used for switch-to-switch connections.

Function:

carries full bandwidth between switches
forms the fabric backbone

Key considerations:

must match server-side bandwidth
should be evenly distributed across rails
required for predictable Clos behavior

DAC cables - Used only within racks.

Function:

low-latency short connections

Limitations:

distance
cable thickness at scale

Cable behavior and impact on performance

At 400G/800G speeds, cables influence:

latency
signal integrity
path symmetry

At these speeds, AOC cables are preferred over DAC beyond short distances due to better signal integrity and easier cable management.

Even small inconsistencies—like uneven cable lengths—can create measurable imbalance between rails. NCCL assumes uniform paths and cannot compensate for this.

InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Quantities & Planning

Define:

N = number of servers
P = ports per server

Breakout cables = (N × P) / 2

Inter-switch cables depend on design, but for near non-blocking fabrics:

typically 0.8–1.2× breakout count

Example: 32-node cluster, 8 ports

256 server ports
128 breakout cables
~100–155 inter-switch cables (0.8–1.2× of 128)

Example: 64-node cluster, 8 ports

512 server ports
256 breakout cables
~205–310 inter-switch cables (0.8–1.2× of 256)

Exact numbers depend on:

oversubscription ratio
redundancy design
required bisection bandwidth
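The planning formula is easy to script for other cluster sizes. The sketch below applies breakout cables = (N × P) / 2 and the 0.8–1.2× rule of thumb for inter-switch cables. The function name and the rounding choices are assumptions, and the result is a planning range, not a bill of materials.

```python
import math


def cable_plan(servers, ports_per_server, low=0.8, high=1.2):
    """Estimate cable counts from server count (N) and ports per server (P)."""
    server_ports = servers * ports_per_server
    breakout = math.ceil(server_ports / 2)  # one breakout cable serves two ports
    return {
        "server_ports": server_ports,
        "breakout_cables": breakout,
        "inter_switch_min": math.floor(breakout * low),
        "inter_switch_max": math.ceil(breakout * high),
    }


for nodes in (32, 64):
    print(nodes, cable_plan(nodes, 8))
```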

Cable planning considerations

At scale:

symmetry is critical
cables must be labeled and traceable
routing must preserve rail separation
airflow must not be blocked
serviceability must be maintained

Many large clusters fail operationally due to cable complexity, not bandwidth limits.

InfiniBand for NVIDIA H100/H200 GPU Clusters: Clos Topology & Scaling

Clos (spine–leaf) topology is standard beyond a single switch.

It provides:

multiple equal-cost paths
support for adaptive routing
scalability across racks

Quantum-2 includes:

adaptive routing
SHARP (in-network reduction)

These improve efficiency but do not replace correct topology or rail symmetry. Oversubscription must also be controlled: NCCL-based AI workloads assume near 1:1 bandwidth, and even moderate oversubscription can create bottlenecks.
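To get a feel for switch counts at 1:1 bandwidth, the sketch below sizes a non-blocking two-tier leaf-spine fabric from the number of 400G endpoints and the switch radix. It assumes half of each leaf's ports face servers and half face spines, and it returns minimum counts only. Rail grouping, redundancy, and spare ports are not considered, and fewer than 64 endpoints would fit on a single 64-port switch anyway.

```python
import math


def two_tier_clos(endpoints, radix=64):
    """Minimum leaf and spine counts for a non-blocking two-tier fabric."""
    down_per_leaf = radix // 2               # half the leaf ports face servers
    max_endpoints = radix * down_per_leaf    # limit of a two-tier design
    if endpoints > max_endpoints:
        raise ValueError(
            f"{endpoints} endpoints exceed the two-tier limit of {max_endpoints}"
        )
    leaves = math.ceil(endpoints / down_per_leaf)
    uplinks = leaves * (radix - down_per_leaf)   # remaining leaf ports go up
    spines = math.ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines}


for nodes in (32, 64, 128, 256):
    print(nodes, two_tier_clos(nodes * 8))
```

With 64-port switches, the two-tier limit under these assumptions is 2048 endpoints, which corresponds to 256 servers with 8 ports each; beyond that, another switch tier is needed.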

InfiniBand for NVIDIA H100/H200 GPU Clusters: Monitoring & Validation

Monitoring is essential.

Fabric-level:

NVIDIA UFM → congestion, routing, utilization

Application-level:

NCCL_DEBUG=INFO

Validation should include:

verifying all rails are used
checking topology mapping
running NCCL benchmarks (e.g., all_reduce_perf)

A well-designed system should achieve 85–95% of theoretical bandwidth, depending on message size and cluster scale.
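To compare a benchmark result against that target, the bus bandwidth reported by the nccl-tests tools can be set against the per-GPU network peak. The sketch below uses the nccl-tests convention for AllReduce, bus bandwidth = algorithm bandwidth × 2(n−1)/n. It assumes one 400G rail per GPU as the theoretical peak (50 GB/s), and the sample measurement is made up for illustration.

```python
def allreduce_bus_bw(size_bytes, seconds, ranks):
    """Bus bandwidth in GB/s for an AllReduce, nccl-tests convention."""
    alg_bw = size_bytes / seconds / 1e9      # algorithm bandwidth, GB/s
    return alg_bw * 2 * (ranks - 1) / ranks


def efficiency(bus_bw_gbs, rail_gbps=400):
    """Fraction of the per-GPU network peak (assumes one rail per GPU)."""
    peak_gbs = rail_gbps / 8                 # Gb/s to GB/s
    return bus_bw_gbs / peak_gbs


# Made-up measurement: 8 GB AllReduce across 256 ranks in 0.36 s.
bus_bw = allreduce_bus_bw(8e9, 0.36, 256)
print(f"bus bandwidth: {bus_bw:.1f} GB/s, efficiency: {efficiency(bus_bw):.0%}")
```

Efficiency measured this way varies with message size, so it is worth evaluating at the large message sizes where the network is the limiting factor.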


InfiniBand for NVIDIA H100/H200 GPU Clusters: Practical Limitations

Key constraints:

PCIe bandwidth vs GPU traffic
NUMA placement
cable density and routing
cost scaling with ports
diminishing returns at higher port counts

In practice, clusters often underperform due to issues like wrong NIC types, mixed cable lengths, uneven NUMA placement, or oversubscribed topology.

InfiniBand for NVIDIA H100/H200 GPU Clusters: Design Patterns

Requirements change as clusters grow. Below are practical guidelines for servers with 8 GPUs per node:

16-Node GPU Cluster (128 GPUs):

single switch (a 64-port QM9700 fits 16 servers only at up to 4 ports each) or small leaf–spine setup
rail optimization usually not needed
simple cabling

32-Node GPU Cluster (256 GPUs):

rail-optimized topology recommended
consistent rail mapping across servers
leaf–spine layout begins
small path differences impact performance

64-Node GPU Cluster (512 GPUs):

full leaf–spine (Clos) topology required
strict rail symmetry
NUMA-aware NIC placement
uneven paths reduce NCCL efficiency

128-Node GPU Cluster (1024 GPUs):

multi-tier Clos topology
balanced links between layers
rails consistent across racks
multiple spine switches required

256-Node GPU Cluster (2048 GPUs):

large-scale fabric design
strict rail consistency across the cluster
larger spine layer or extra switch tier
inter-switch bandwidth must match server traffic

512-Node GPU Cluster (4096+ GPUs):

multi-cluster scaling
additional core layer between clusters
careful inter-cluster traffic planning
higher complexity in routing, monitoring, and cabling


GPU cluster performance at scale depends on aligning ports, rails, switches, and cables to ensure efficient communication. Otherwise, hardware potential is wasted.

FAQ – InfiniBand for NVIDIA H100/H200 GPU Clusters

How many InfiniBand ports are needed for NVIDIA H100/H200 GPU clusters?

Most NVIDIA H100/H200 GPU clusters use 4–8× 400G InfiniBand ports per server to ensure sufficient bandwidth for NCCL communication.

What is the best network topology for H100/H200 GPU clusters?

A rail-optimized Clos (spine–leaf) topology is the standard design for scalable NVIDIA H100/H200 GPU clusters.

How many cables are required for an InfiniBand GPU cluster?

Cable count scales with ports, typically (servers × ports ÷ 2) breakout cables plus a similar number of inter-switch cables.

What cable types are used in InfiniBand H100/H200 clusters?

Most deployments use 800G to 2×400G breakout AOC cables for servers and 800G AOC cables for switch-to-switch connections.

Why use InfiniBand instead of Ethernet for H100/H200 GPU clusters?

InfiniBand provides lower latency, native RDMA, and more predictable NCCL performance compared to Ethernet-based solutions.

Sources - InfiniBand for NVIDIA H100/H200 GPU Clusters

NVIDIA – InfiniBand & Quantum-2 overview:
https://www.nvidia.com/en-us/networking/infiniband-switching/
NVIDIA – Quantum-2 platform details:
https://www.nvidia.com/en-us/networking/quantum2/
NVIDIA – NCCL documentation:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
NVIDIA – DGX / AI infrastructure design:
https://docs.nvidia.com/dgx/
WEKA – InfiniBand in AI workloads:
https://www.weka.io/learn/enterprise-technology/nvidia-infiniband/
