
server-parts.eu Blog

NVIDIA Blackwell Ultra B300: Full Specs, 288GB HBM3e Memory, 15 PFLOPS FP4, Architecture & GB300 Platform


The NVIDIA B300 GPU, also called Blackwell Ultra, is the newest version of NVIDIA’s Blackwell architecture built for hyperscale AI infrastructure, large language models (LLMs), and AI inference workloads.





It improves on the NVIDIA B200 with higher GPU memory capacity and better inference performance, and works with the Grace-Blackwell GB300 platform that combines NVIDIA Grace CPUs and Blackwell GPUs to run trillion-parameter AI models in rack-scale AI systems.


[Image: NVIDIA B300 Blackwell Ultra GPU architecture for hyperscale AI infrastructure, LLM training and inference workloads]

This article covers the NVIDIA B300 and GB300 architecture, including GPU microarchitecture, compute units, HBM memory, NVLink interconnect, AI server platforms, rack-scale AI infrastructure, and the software stack used in modern AI data centers.


NVIDIA B300 Blackwell Ultra Architecture Overview


The NVIDIA B300 GPU is based on the Blackwell architecture, designed for AI training and inference workloads. It builds on the original Blackwell platform with higher memory capacity and improved inference throughput.


The architecture includes several major components:

  • Streaming Multiprocessors (SMs)

  • CUDA cores for general computation

  • 5th-generation Tensor Cores

  • Transformer Engine for AI workloads

  • HBM3e high-bandwidth memory

  • NVLink 5 GPU interconnect

  • NVSwitch GPU fabric

  • SXM module packaging


Together, these components create a GPU architecture capable of delivering extremely high performance for transformer-based models and large-scale distributed AI workloads.



NVIDIA B300 GPU Microarchitecture


At the core of the NVIDIA B300 GPU are Streaming Multiprocessors (SMs), the main compute units of the Blackwell architecture. These units execute thousands of threads in parallel using NVIDIA’s SIMT (Single Instruction Multiple Thread) execution model, which is essential for AI workloads and scientific computing.


Key B300 Compute Architecture

  • Architecture: Blackwell

  • Streaming Multiprocessors: ~160 SMs

  • GPU Design: Dual-die architecture

  • Execution Model: SIMT (Single Instruction, Multiple Thread)

  • Warp Size: 32 threads per warp

  • Transistor Count: ~208 billion transistors


Each SM (Streaming Multiprocessor) in the B300 contains several key compute components:

  • CUDA cores for general GPU compute

  • Tensor Cores for AI and deep learning operations

  • Warp schedulers managing thread execution

  • Register files for fast thread data access

  • Shared memory for inter-thread communication

  • Special Function Units (SFUs) for complex math operations


Threads run in warps of 32, with warp schedulers distributing instructions across the execution pipelines to sustain high AI training and inference throughput. The B300's ~208 billion transistors are similar in count to the B200's; its gains come instead from architectural optimizations and a higher power envelope.
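To make the warp arithmetic concrete, here is a minimal Python sketch. The SM count is the approximate figure from the table above; the per-SM resident-thread limit is an illustrative assumption, not a published B300 specification:

```python
# Rough thread-capacity arithmetic for a Blackwell-class GPU.
WARP_SIZE = 32             # threads per warp (fixed across CUDA GPUs)
SM_COUNT = 160             # approximate B300 SM count (from the table above)
MAX_THREADS_PER_SM = 2048  # assumed resident-thread limit, for illustration

warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE
resident_threads = SM_COUNT * MAX_THREADS_PER_SM

print(f"warps per SM: {warps_per_sm}")            # 64
print(f"resident threads: {resident_threads:,}")  # 327,680
```

Hundreds of thousands of resident threads is what lets the scheduler hide memory latency: while some warps wait on HBM, others execute.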



CUDA Cores and Compute Pipelines - NVIDIA B300


CUDA cores handle general-purpose GPU computation and execute floating-point and integer instructions used in machine learning, HPC, and data processing workloads.

  • CUDA Cores: execute general GPU compute operations

  • Floating-Point Pipelines: handle FP calculations used in AI and HPC

  • Integer Pipelines: process integer operations and data tasks

  • Tensor Compute Pipelines: accelerate AI and deep learning workloads

  • Memory Pipelines: manage data movement across the GPU

These execution pipelines run concurrently, allowing the GPU to perform massive numbers of operations in parallel.



Tensor Cores and AI Acceleration - NVIDIA B300


The NVIDIA B300 uses 5th-generation Tensor Cores, specialized compute units designed for accelerating deep learning workloads. Tensor cores perform matrix multiplication operations that are fundamental to neural network training and inference.


Supported precision formats include:

  • FP4

  • FP8

  • BF16

  • FP16

  • TF32

  • FP32 accumulation


These precision modes allow the GPU to balance performance and numerical accuracy depending on the AI workload being executed.
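The accuracy side of that trade-off can be seen with a small numpy experiment. numpy has no FP8 or FP4 types, so FP16 stands in here as the reduced-precision format:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Round-trip through half precision to measure the rounding error that
# mixed-precision schemes must manage (e.g. via FP32 accumulation).
x_fp16 = x.astype(np.float16).astype(np.float32)
rel_err = np.abs(x - x_fp16) / (np.abs(x) + 1e-12)

print(f"mean relative error at FP16: {rel_err.mean():.2e}")
```

The error is tiny but nonzero, which is why accumulation is typically kept in FP32 even when multiplications run at lower precision.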



NVFP4 Precision and AI Performance - NVIDIA B300


One of the most important improvements in the B300 generation is inference performance using FP4 precision. NVIDIA refers to its optimized FP4 format as NVFP4, which enables extremely high throughput while maintaining model accuracy.


Approximate AI compute performance:

  • NVIDIA B200 GPU: ~9 PFLOPS dense FP4

  • NVIDIA B300 GPU: ~14–15 PFLOPS dense FP4


This represents roughly a 55–67% performance improvement in inference-heavy workloads.


System-level performance examples include:

  • NVIDIA DGX B300 (8 GPUs): approximately 108–144 PFLOPS FP4

  • NVIDIA GB300 NVL72 rack: roughly 1.1–1.44 exaFLOPS FP4 compute performance


These improvements are designed specifically for AI reasoning models, mixture-of-experts architectures, and large language model inference.
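The quoted figures can be cross-checked with simple arithmetic, using the approximate dense-FP4 numbers above (actual throughput depends on clocks and system configuration):

```python
B200_FP4 = 9.0   # PFLOPS dense FP4, approximate
B300_FP4 = 15.0  # PFLOPS dense FP4, upper end of the approximate range

speedup = B300_FP4 / B200_FP4 - 1.0  # per-GPU improvement
dgx_b300 = 8 * B300_FP4              # 8-GPU DGX B300
nvl72_ef = 72 * B300_FP4 / 1000      # 72-GPU rack, in exaFLOPS

print(f"B300 vs B200: +{speedup:.0%}")        # +67%
print(f"DGX B300: {dgx_b300:.0f} PFLOPS")     # 120 PFLOPS
print(f"GB300 NVL72: {nvl72_ef:.2f} EFLOPS")  # 1.08 EFLOPS
```

Both system totals land inside the ranges quoted above, confirming they are simple multiples of the per-GPU figure.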



Transformer Engine - NVIDIA B300


Blackwell GPUs, including the NVIDIA B300, include a Transformer Engine that speeds up transformer-based AI models (such as LLMs). It automatically adjusts precision (for example FP4 or FP8) to increase performance while keeping model accuracy.

  • Dynamic Precision Scaling: automatically selects FP4 or FP8 for optimal performance

  • Optimized Transformer Operations: accelerates transformer-based neural networks

  • Inference Efficiency: improves AI inference throughput

  • Memory Optimization: reduces memory consumption

These capabilities are critical for large AI models with billions or trillions of parameters.
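Conceptually, dynamic precision selection resembles the toy heuristic below: use fewer bits when a tensor's dynamic range is narrow. This is only an illustration of the idea, not NVIDIA's actual Transformer Engine logic:

```python
import numpy as np

def pick_precision(t: np.ndarray) -> str:
    """Toy heuristic: tensors with a narrow dynamic range tolerate
    fewer bits. Illustrative only, not the real Transformer Engine."""
    nonzero = np.abs(t[t != 0])
    dyn_range = nonzero.max() / nonzero.min()
    return "FP4" if dyn_range < 1e2 else "FP8"

print(pick_precision(np.array([0.5, 1.0, 2.0])))     # FP4 (range = 4)
print(pick_precision(np.array([1e-4, 1.0, 100.0])))  # FP8 (range = 1e6)
```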



Multi-Die GPU Design - NVIDIA B300


Blackwell GPUs, including the NVIDIA B300, use a multi-die architecture instead of a single monolithic silicon die. The B300 consists of two large compute dies connected by a high-bandwidth internal interconnect, enabling larger GPU designs.

  • GPU Design: multi-die architecture

  • Compute Dies: two large compute dies

  • Interconnect: high-bandwidth internal interconnect

  • Memory Behavior: unified compute and memory system


Advantages of the dual-die architecture:

  • Enables GPUs larger than single silicon die limits

  • Supports higher transistor counts

  • Improves manufacturing yield

  • Maintains unified GPU compute and memory behavior



Memory Architecture - NVIDIA B300


The NVIDIA B300 GPU uses HBM3e (High Bandwidth Memory), a stacked memory technology designed for data-intensive AI workloads. Compared to the B200, the B300 significantly increases memory capacity to support larger AI models.


NVIDIA B200 vs B300 Memory Capacity

  • NVIDIA B200: 192 GB HBM3e

  • NVIDIA B300: 288 GB HBM3e

The increase is achieved using 12-high HBM3e stacks, compared to 8-high stacks in B200 GPUs.


NVIDIA B300 Memory Specifications

  • HBM Memory Stacks: 8 stacks

  • Memory Interface: ~8192-bit

  • Memory Bandwidth: ~8 TB/s

The 50% memory increase allows much larger AI models to run on a single GPU and reduces the need for model sharding across multiple GPUs.
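A quick way to see what the 288 GB capacity buys is to check whether a model's weights alone fit on one GPU. The parameter counts below are illustrative, and KV cache, activations, and framework overhead are ignored:

```python
def weights_fit(params_billions: float, bytes_per_param: float,
                hbm_gb: float = 288) -> bool:
    """1 billion parameters at 1 byte each is roughly 1 GB of weights."""
    return params_billions * bytes_per_param <= hbm_gb

print(weights_fit(70, 2.0))    # 70B model in FP16/BF16 -> 140 GB -> True
print(weights_fit(405, 1.0))   # 405B model in FP8      -> 405 GB -> False
print(weights_fit(405, 0.5))   # 405B model in FP4      -> ~203 GB -> True
```

The last two lines show why FP4 and the larger HBM capacity work together: a model that would need sharding at FP8 can fit on a single B300 at FP4.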



Cache Hierarchy - NVIDIA B300


The GPU includes multiple cache levels that reduce memory latency and improve performance.

  • L1 Cache (inside each Streaming Multiprocessor): stores frequently used data and supports shared memory between threads

  • L2 Cache (shared across the entire GPU): caches global memory accesses and reduces traffic to HBM memory

Note: Exact cache sizes for the B300 GPU have not been publicly disclosed.



NVLink 5 Interconnect - NVIDIA B300


  • NVLink Generation: NVLink 5

  • Connections per GPU: 18 NVLink links

  • Total Bandwidth: ~1.8 TB/s bidirectional


What this means in practice

  • GPUs communicate directly with each other, without sending data through the CPU.

  • This enables faster scaling in multi-GPU systems.

  • It is critical for distributed AI training and large-scale inference workloads.
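To put ~1.8 TB/s in perspective, here is a rough transfer-time comparison against a PCIe 5.0 x16 link (~128 GB/s bidirectional, used for reference). Real transfers also pay latency and protocol overhead:

```python
NVLINK5_BW = 1.8e12   # bytes/s, approx. bidirectional per GPU (from the table)
PCIE5_X16_BW = 128e9  # bytes/s, approx. bidirectional, for comparison

payload = 100e9  # an illustrative 100 GB slice of model state

for name, bw in (("NVLink 5", NVLINK5_BW), ("PCIe 5.0 x16", PCIE5_X16_BW)):
    print(f"{name}: {payload / bw * 1000:.0f} ms")
```

Roughly an order of magnitude difference per hop, which compounds quickly when gradients or activations are exchanged every training step.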



NVSwitch Fabric - NVIDIA B300


NVSwitch connects many GPUs into a high-speed switching fabric so that every GPU can communicate with every other GPU at very low latency.

Main benefits:

  • Full GPU mesh connectivity – every GPU can talk directly to the others

  • High bandwidth communication – fast data exchange between GPUs

  • Efficient model parallelism – large AI models can run across multiple GPUs

Example system: the NVIDIA GB300 NVL72 provides roughly 130 TB/s of total NVLink fabric bandwidth.



Grace-Blackwell GB300 Platform - NVIDIA B300


The NVIDIA GB300 platform integrates NVIDIA’s Grace CPU with Blackwell GPUs.


A GB300 superchip contains:

  • one Grace CPU

  • two B300 GPUs


The CPU and GPUs are connected through NVLink-C2C, a coherent interconnect that allows direct memory sharing between CPU and GPU. This architecture significantly reduces data transfer latency compared to traditional PCIe connections.



Grace CPU Architecture - NVIDIA B300


The Grace CPU is an ARM-based processor designed specifically for AI and high-performance computing environments.


Key characteristics include:

  • large number of ARM cores

  • high memory bandwidth

  • optimized data processing for AI pipelines


Grace CPUs coordinate data movement, system management, and workload scheduling across GPU clusters.



Server Platforms - NVIDIA B300


NVIDIA B300 GPUs are deployed in several server platforms.


NVIDIA HGX B300 systems typically contain eight GPUs connected through NVLink and NVSwitch. These platforms are used by OEM server manufacturers.


NVIDIA DGX B300 servers are NVIDIA’s integrated AI infrastructure systems that combine GPUs, networking, and optimized cooling into a single platform.



Rack-Scale AI Systems - NVIDIA B300


One of the largest deployments of the architecture is the GB300 NVL72 rack system.


A single NVL72 rack includes:

  • 72 Blackwell GPUs

  • 36 Grace CPUs

  • large NVLink interconnect fabric


The rack behaves like a unified AI accelerator capable of training or running extremely large models.


Networking Infrastructure - NVIDIA B300


Large GPU clusters rely on high-performance networking technologies such as:

  • InfiniBand

  • Spectrum-X Ethernet

  • ConnectX SuperNICs

  • BlueField DPUs


These networks allow clusters to scale across thousands of GPUs in hyperscale AI data centers.



GPU Virtualization and Multi-Instance GPU - NVIDIA B300


Blackwell GPUs support Multi-Instance GPU (MIG) technology, which allows a single GPU to be partitioned into multiple logical GPUs.


This enables multiple users or workloads to share GPU resources securely while maintaining performance isolation.
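As a back-of-the-envelope illustration: recent NVIDIA data-center GPUs support up to 7 MIG instances, and assuming that limit carries over to the B300 with a naive even memory split (real MIG profiles do not divide memory exactly evenly):

```python
MIG_MAX_INSTANCES = 7  # max partitions on recent NVIDIA GPUs (assumed for B300)
HBM_GB = 288

per_instance_gb = HBM_GB / MIG_MAX_INSTANCES
print(f"~{per_instance_gb:.0f} GB per instance at a naive even split")  # ~41 GB
```

Even the smallest slice of a 288 GB GPU would carry more memory than many previous-generation full GPUs.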



Power and Cooling - NVIDIA B300


The B300 GPU operates at significantly higher power levels than previous generations.


Typical power envelope:

  • NVIDIA B200: ~1000–1200 W

  • NVIDIA B300: ~1000–1400 W configurable TDP


The higher power envelope allows higher sustained Tensor Core performance and improved inference throughput. Because of this power level, most B300-based servers rely on liquid cooling systems and high-density rack power infrastructure.


In large rack-scale systems such as GB300 NVL72, NVIDIA also implements power smoothing technologies that can reduce peak power spikes by up to 30%.
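The rack-level implication is easy to estimate from the figures above, taking the upper configurable TDP (GPUs only; Grace CPUs, NVSwitch trays, and cooling add on top):

```python
GPU_TDP_W = 1400    # upper configurable B300 TDP (from the text)
GPUS_PER_RACK = 72  # GB300 NVL72

gpu_power_kw = GPUS_PER_RACK * GPU_TDP_W / 1000
print(f"GPU power alone: {gpu_power_kw:.1f} kW per NVL72 rack")  # 100.8 kW
```

Over 100 kW of GPU load in a single rack is well beyond what air cooling can remove, which is why liquid cooling is the default for these systems.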



AI Workloads - NVIDIA B300


The B300 architecture is designed primarily for AI workloads including:

  • large language model training

  • large-scale inference

  • mixture-of-experts models

  • generative AI

  • reasoning systems


These workloads benefit from large GPU memory, extremely high memory bandwidth, and high-speed GPU interconnects.



Software Ecosystem - NVIDIA B300


Blackwell GPUs operate within NVIDIA’s AI software ecosystem.


Key components include:

  • CUDA

  • cuDNN

  • TensorRT

  • NCCL

  • Triton Inference Server

  • NeMo AI framework


These tools provide the infrastructure needed for training and deploying AI models across large GPU clusters.


The NVIDIA B300 GPU and GB300 platform, built on the Blackwell architecture, combine larger GPU memory, higher inference performance, and high-speed interconnects like NVLink and NVSwitch to power rack-scale AI systems (such as GB300 NVL72) that run extremely large models across dozens or even hundreds of GPUs.


FAQ - NVIDIA B300


What is the NVIDIA B300?

The NVIDIA B300 is a next-generation AI accelerator based on the Blackwell architecture, designed for large-scale AI training and inference workloads in data centers and hyperscale environments.


What is the difference between NVIDIA B300 and B200?

The B300 is an upgraded Blackwell GPU that offers more memory (up to 288 GB HBM3e vs 192 GB on B200), higher inference throughput, and improved performance for large AI models.


What is the difference between NVIDIA B300 and GB300?

The B300 is the GPU itself, while GB300 refers to a full AI computing platform that combines B300 GPUs with NVIDIA Grace CPUs, high-speed NVLink connectivity, and rack-scale infrastructure.


How many GPUs are in NVIDIA B300 systems?

A B300 is a single GPU, but it is typically deployed in multi-GPU systems where dozens of GPUs are connected using NVLink and NVSwitch technologies.


How much memory does the NVIDIA B300 have?

The NVIDIA B300 GPU supports up to 288 GB of HBM3e memory, enabling large AI models and high-performance AI workloads to run efficiently on a single GPU.


