NVIDIA Blackwell Ultra B300: Full Specs, 288GB HBM3e Memory, 15 PFLOPS FP4, Architecture & GB300 Platform
The NVIDIA B300 GPU, also called Blackwell Ultra, is the newest version of NVIDIA’s Blackwell architecture built for hyperscale AI infrastructure, large language models (LLMs), and AI inference workloads.
It improves on the NVIDIA B200 with higher GPU memory capacity and better inference performance, and works with the Grace-Blackwell GB300 platform that combines NVIDIA Grace CPUs and Blackwell GPUs to run trillion-parameter AI models in rack-scale AI systems.
This article covers the NVIDIA B300 and GB300 architecture, including GPU microarchitecture, compute units, HBM memory, NVLink interconnect, AI server platforms, rack-scale AI infrastructure, and the software stack used in modern AI data centers.
NVIDIA B300 Blackwell Ultra Architecture Overview
The NVIDIA B300 GPU is based on the Blackwell architecture, designed for AI training and inference workloads. It builds on the original Blackwell platform with higher memory capacity and improved inference throughput.
The architecture includes several major components:
Streaming Multiprocessors (SMs)
CUDA cores for general computation
5th-generation Tensor Cores
Transformer Engine for AI workloads
HBM3e high-bandwidth memory
NVLink 5 GPU interconnect
NVSwitch GPU fabric
SXM module packaging
Together, these components create a GPU architecture capable of delivering extremely high performance for transformer-based models and large-scale distributed AI workloads.
NVIDIA B300 GPU Microarchitecture
At the core of the NVIDIA B300 GPU are Streaming Multiprocessors (SMs), the main compute units of the Blackwell architecture. These units execute thousands of threads in parallel using NVIDIA’s SIMT (Single Instruction Multiple Thread) execution model, which is essential for AI workloads and scientific computing.
Key B300 Compute Architecture
| Component | Details |
| --- | --- |
| Architecture | Blackwell |
| Streaming Multiprocessors | ~160 SMs |
| GPU Design | Dual-die architecture |
| Execution Model | SIMT (Single Instruction Multiple Thread) |
| Warp Size | 32 threads per warp |
| Transistor Count | ~208 billion transistors |
Each SM (Streaming Multiprocessor) in the B300 contains several key compute components:
CUDA cores for general GPU compute
Tensor Cores for AI and deep learning operations
Warp schedulers managing thread execution
Register files for fast thread data access
Shared memory for inter-thread communication
Special Function Units (SFUs) for complex math operations
Threads run in warps of 32, with warp schedulers distributing instructions across execution pipelines to sustain high AI training and inference throughput. The B300's ~208 billion transistors are similar to the B200's; its gains come from architectural optimizations and higher power headroom rather than a larger transistor budget.
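As a back-of-envelope illustration of the warp model described above, the number of warps a kernel launch produces is simply its thread count divided by the warp size (the ~160 SM count is the estimate quoted in the table, not a confirmed figure):

```python
import math

WARP_SIZE = 32   # threads per warp (see table above)
NUM_SMS = 160    # approximate SM count quoted for the B300

def warps_for(threads: int) -> int:
    """Number of 32-thread warps needed to cover a thread count."""
    return math.ceil(threads / WARP_SIZE)

# A launch of 1 million threads maps to 31,250 warps, which the
# hardware scheduler spreads across the SMs.
total_warps = warps_for(1_000_000)
print(total_warps)                       # 31250
print(math.ceil(total_warps / NUM_SMS))  # 196 warps per SM on average
```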
CUDA Cores and Compute Pipelines - NVIDIA B300
CUDA cores handle general-purpose GPU computation and execute floating-point and integer instructions used in machine learning, HPC, and data processing workloads.
| Component | Function |
| --- | --- |
| CUDA Cores | Execute general GPU compute operations |
| Floating-Point Pipelines | Handle FP calculations used in AI and HPC |
| Integer Pipelines | Process integer operations and data tasks |
| Tensor Compute Pipelines | Accelerate AI and deep learning workloads |
| Memory Pipelines | Manage data movement across the GPU |
These execution pipelines run concurrently, allowing the GPU to perform massive numbers of operations in parallel.
Tensor Cores and AI Acceleration - NVIDIA B300
The NVIDIA B300 uses 5th-generation Tensor Cores, specialized compute units designed to accelerate deep learning workloads. Tensor Cores perform the matrix multiplication operations that are fundamental to neural network training and inference.
Supported precision formats include:
FP4
FP8
BF16
FP16
TF32
FP32 accumulation
These precision modes allow the GPU to balance performance and numerical accuracy depending on the AI workload being executed.
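To make the low end of that precision list concrete, the sketch below enumerates the values a 4-bit E2M1 float can represent (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1). NVFP4 pairs low-precision elements like these with per-block scale factors; whether it uses exactly this element layout is an assumption based on public descriptions of FP4, not a confirmed NVIDIA detail:

```python
def e2m1_values() -> list[float]:
    """Enumerate the non-negative magnitudes of a 4-bit E2M1 float."""
    vals = set()
    for exp in range(4):        # 2 exponent bits
        for man in range(2):    # 1 mantissa bit
            if exp == 0:        # subnormal: man/2 * 2^(1 - bias)
                vals.add(man * 0.5)
            else:               # normal: (1 + man/2) * 2^(exp - bias)
                vals.add((1 + man / 2) * 2 ** (exp - 1))
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Only eight magnitudes exist per sign, which is why the scale factors and FP32 accumulation matter so much for preserving accuracy.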
NVFP4 Precision and AI Performance - NVIDIA B300
One of the most important improvements in the B300 generation is inference performance using FP4 precision. NVIDIA refers to its optimized FP4 format as NVFP4, which enables extremely high throughput while maintaining model accuracy.
Approximate AI compute performance:
NVIDIA B200 GPU: ~9 PFLOPS dense FP4
NVIDIA B300 GPU: ~14–15 PFLOPS dense FP4
This represents roughly a 55–67% performance improvement in inference-heavy workloads.
System-level performance examples include:
NVIDIA DGX B300 (8 GPUs): approximately 108–144 PFLOPS FP4
NVIDIA GB300 NVL72 rack: roughly 1.1–1.44 exaFLOPS FP4 compute performance
These improvements are designed specifically for AI reasoning models, mixture-of-experts architectures, and large language model inference.
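The figures above can be sanity-checked with simple arithmetic, using the approximate per-GPU numbers quoted in this article:

```python
B200_FP4_PFLOPS = 9.0
B300_FP4_PFLOPS = 15.0   # upper end of the ~14-15 PFLOPS range above

# Per-GPU improvement: 15/9 - 1 = 67% (the upper end of the 55-67% range)
improvement = B300_FP4_PFLOPS / B200_FP4_PFLOPS - 1
print(f"{improvement:.0%}")

# 8-GPU DGX B300: 8 x 15 = 120 PFLOPS, within the 108-144 range quoted
print(8 * B300_FP4_PFLOPS)

# 72-GPU GB300 NVL72: 72 x 15 = 1080 PFLOPS = 1.08 exaFLOPS dense FP4
print(72 * B300_FP4_PFLOPS / 1000)
```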
Transformer Engine - NVIDIA B300
Blackwell GPUs, including the NVIDIA B300, include a Transformer Engine that speeds up transformer-based AI models (such as LLMs). It automatically adjusts precision (for example FP4 or FP8) to increase performance while keeping model accuracy.
| Feature | Purpose |
| --- | --- |
| Dynamic Precision Scaling | Automatically selects FP4 or FP8 for optimal performance |
| Optimized Transformer Operations | Accelerates transformer-based neural networks |
| Inference Efficiency | Improves AI inference throughput |
| Memory Optimization | Reduces memory consumption |
These capabilities are critical for large AI models with billions or trillions of parameters.
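The dynamic scaling idea can be sketched in a few lines: choose a scale so the tensor's largest value maps to the top of the low-precision range, quantize, then dequantize. This is a minimal illustration only; the uniform quantization grid and the 6.0 format maximum are simplifying assumptions, not the Transformer Engine's actual algorithm:

```python
def quantize_dequantize(xs, fmt_max=6.0, steps=16):
    """Scale a tensor into a low-precision range, quantize, dequantize."""
    amax = max(abs(x) for x in xs)
    scale = amax / fmt_max if amax else 1.0   # map largest value to fmt_max
    grid = fmt_max / (steps / 2)              # crude uniform-grid stand-in
    out = []
    for x in xs:
        q = round((x / scale) / grid) * grid  # quantize in scaled space
        out.append(q * scale)                 # dequantize back
    return out

print(quantize_dequantize([0.01, -0.2, 0.5, -1.0]))
```

Note how small values collapse to zero while large ones survive almost exactly; per-tensor (or per-block) scaling keeps the usable dynamic range centered on the data.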
Multi-Die GPU Design - NVIDIA B300
Blackwell GPUs, including the NVIDIA B300, use a multi-die architecture instead of a single monolithic silicon die. The B300 consists of two large compute dies connected by a high-bandwidth internal interconnect, enabling larger GPU designs.
| Component | Description |
| --- | --- |
| GPU Design | Multi-die architecture |
| Compute Dies | Two large compute dies |
| Interconnect | High-bandwidth internal interconnect |
| Memory Behavior | Unified compute and memory system |
Advantages of the dual-die architecture:
Enables GPUs larger than single silicon die limits
Supports higher transistor counts
Improves manufacturing yield
Maintains unified GPU compute and memory behavior
Memory Architecture - NVIDIA B300
The NVIDIA B300 GPU uses HBM3e (High Bandwidth Memory), a stacked memory technology designed for data-intensive AI workloads. Compared to the B200, the B300 significantly increases memory capacity to support larger AI models.
NVIDIA B200 vs B300 Memory Capacity
| GPU | Memory Type | Capacity |
| --- | --- | --- |
| NVIDIA B200 | HBM3e | 192 GB |
| NVIDIA B300 | HBM3e | 288 GB |
The increase is achieved using 12-high HBM3e stacks, compared to 8-high stacks in B200 GPUs.
NVIDIA B300 Memory Specifications
| Component | Specification |
| --- | --- |
| HBM Memory Stacks | 8 stacks |
| Memory Interface | ~8192-bit |
| Memory Bandwidth | ~8 TB/s |
The 50% memory increase allows much larger AI models to run on a single GPU and reduces the need for model sharding across multiple GPUs.
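A rough way to see what the extra capacity buys is to estimate how many GPUs are needed just to hold a model's weights. This ignores KV cache, activations, and framework overhead, and the 400B-parameter model is a hypothetical example, not a specific benchmark:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                capacity_gb: float) -> int:
    """GPUs needed to hold the weights alone (1B params x 1 byte = 1 GB)."""
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / capacity_gb)

# Hypothetical 400B-parameter model stored in FP8 (1 byte per parameter):
print(gpus_needed(400, 1, 288))  # 2 B300 GPUs
print(gpus_needed(400, 1, 192))  # 3 B200 GPUs
```

Fewer shards means less inter-GPU communication per token, which compounds with the bandwidth improvements described below.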
Cache Hierarchy - NVIDIA B300
The GPU includes multiple cache levels that reduce memory latency and improve performance.
| Cache Level | Location | Purpose |
| --- | --- | --- |
| L1 Cache | Inside each Streaming Multiprocessor (SM) | Stores frequently used data and supports shared memory between threads |
| L2 Cache | Shared across the entire GPU | Caches global memory accesses and reduces traffic to HBM memory |
Note: Exact cache sizes for the B300 GPU have not been publicly disclosed.
NVLink 5 Interconnect - NVIDIA B300
| Feature | Details |
| --- | --- |
| NVLink Generation | NVLink 5 |
| Connections per GPU | 18 NVLink links |
| Total Bandwidth | ~1.8 TB/s bidirectional |
What this means in practice
GPUs communicate directly with each other, without sending data through the CPU.
This enables faster scaling in multi-GPU systems.
It is critical for distributed AI training and large-scale inference workloads.
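The per-link share of that bandwidth follows directly from the table above:

```python
TOTAL_BIDIR_TB_S = 1.8   # per-GPU NVLink 5 bandwidth (table above)
NUM_LINKS = 18           # NVLink links per GPU

per_link_gb_s = TOTAL_BIDIR_TB_S * 1000 / NUM_LINKS
print(per_link_gb_s)  # 100.0 GB/s bidirectional per link
```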
NVSwitch Fabric - NVIDIA B300
| Component | Function |
| --- | --- |
| NVSwitch | Connects many GPUs into a high-speed switching fabric so every GPU can communicate with every other GPU with very low latency |
Main benefits:
Full GPU mesh connectivity – every GPU can talk directly to the others
High bandwidth communication – fast data exchange between GPUs
Efficient model parallelism – large AI models can run across multiple GPUs
| Example System | NVLink Fabric Bandwidth |
| --- | --- |
| NVIDIA GB300 NVL72 | ~130 TB/s |
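Multiplying the per-GPU NVLink bandwidth by the GPU count reproduces the fabric figure above:

```python
GPUS = 72
PER_GPU_TB_S = 1.8   # NVLink 5 bidirectional bandwidth per GPU

fabric_tb_s = round(GPUS * PER_GPU_TB_S, 1)
print(fabric_tb_s)   # 129.6 TB/s, matching the ~130 TB/s figure
```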
Grace-Blackwell GB300 Platform - NVIDIA B300
The NVIDIA GB300 platform integrates NVIDIA’s Grace CPU with Blackwell GPUs.
A GB300 superchip contains:
one Grace CPU
two B300 GPUs
The CPU and GPUs are connected through NVLink-C2C, a coherent interconnect that allows direct memory sharing between CPU and GPU. This architecture significantly reduces data transfer latency compared to traditional PCIe connections.
Grace CPU Architecture - NVIDIA B300
The Grace CPU is an ARM-based processor designed specifically for AI and high-performance computing environments.
Key characteristics include:
large number of ARM cores
high memory bandwidth
optimized data processing for AI pipelines
Grace CPUs coordinate data movement, system management, and workload scheduling across GPU clusters.
Server Platforms - NVIDIA B300
NVIDIA B300 GPUs are deployed in several server platforms.
NVIDIA HGX B300 systems typically contain eight GPUs connected through NVLink and NVSwitch. These platforms are used by OEM server manufacturers.
NVIDIA DGX B300 servers are NVIDIA’s integrated AI infrastructure systems that combine GPUs, networking, and optimized cooling into a single platform.
Rack-Scale AI Systems - NVIDIA B300
One of the largest deployments of the architecture is the GB300 NVL72 rack system.
A single NVL72 rack includes:
72 Blackwell GPUs
36 Grace CPUs
large NVLink interconnect fabric
The rack behaves like a unified AI accelerator capable of training or running extremely large models.
Networking Infrastructure - NVIDIA B300
Large GPU clusters rely on high-performance networking technologies such as:
InfiniBand
Spectrum-X Ethernet
ConnectX SuperNICs
BlueField DPUs
These networks allow clusters to scale across thousands of GPUs in hyperscale AI data centers.
GPU Virtualization and Multi-Instance GPU - NVIDIA B300
Blackwell GPUs support Multi-Instance GPU (MIG) technology, which allows a single GPU to be partitioned into multiple logical GPUs.
This enables multiple users or workloads to share GPU resources securely while maintaining performance isolation.
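For sizing purposes, MIG partitioning can be sketched as a simple division of GPU memory across instances. MIG profile counts for the B300 have not been published, so the 7-instance ceiling below is carried over from prior data-center generations (A100/H100) as an assumption:

```python
TOTAL_MEM_GB = 288   # B300 HBM3e capacity

def per_instance_memory(instances: int) -> float:
    """Memory available to each instance under an even MIG split."""
    return TOTAL_MEM_GB / instances

# Hypothetical even splits into 2, 4, and 7 instances:
for n in (2, 4, 7):
    print(n, round(per_instance_memory(n), 1))
```

Even a 7-way split leaves each instance with roughly 41 GB, more than many previous full GPUs offered.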
Power and Cooling - NVIDIA B300
The B300 GPU operates at significantly higher power levels than previous generations.
Typical power envelope:
NVIDIA B200: ~1000–1200 W
NVIDIA B300: ~1000–1400 W configurable TDP
The higher power envelope allows higher sustained Tensor Core performance and improved inference throughput. Because of this power level, most B300-based servers rely on liquid cooling systems and high-density rack power infrastructure.
In large rack-scale systems such as GB300 NVL72, NVIDIA also implements power smoothing technologies that can reduce peak power spikes by up to 30%.
AI Workloads - NVIDIA B300
The B300 architecture is designed primarily for AI workloads including:
large language model training
large-scale inference
mixture-of-experts models
generative AI
reasoning systems
These workloads benefit from large GPU memory, extremely high memory bandwidth, and high-speed GPU interconnects.
Software Ecosystem - NVIDIA B300
Blackwell GPUs operate within NVIDIA’s AI software ecosystem.
Key components include:
CUDA
cuDNN
TensorRT
NCCL
Triton Inference Server
NeMo AI framework
These tools provide the infrastructure needed for training and deploying AI models across large GPU clusters.
The NVIDIA B300 GPU and GB300 platform, built on the Blackwell architecture, combine larger GPU memory, higher inference performance, and high-speed interconnects like NVLink and NVSwitch to power rack-scale AI systems (such as GB300 NVL72) that run extremely large models across dozens or even hundreds of GPUs.
FAQ - NVIDIA B300
What is the NVIDIA B300?
The NVIDIA B300 is a next-generation AI accelerator based on the Blackwell architecture, designed for large-scale AI training and inference workloads in data centers and hyperscale environments.
What is the difference between NVIDIA B300 and B200?
The B300 is an upgraded Blackwell GPU that offers more memory (up to 288 GB HBM3e vs 192 GB on B200), higher inference throughput, and improved performance for large AI models.
What is the difference between NVIDIA B300 and GB300?
The B300 is the GPU itself, while GB300 refers to a full AI computing platform that combines B300 GPUs with NVIDIA Grace CPUs, high-speed NVLink connectivity, and rack-scale infrastructure.
How many GPUs are in NVIDIA B300 systems?
A B300 is a single GPU, but it is typically deployed in multi-GPU systems where dozens of GPUs are connected using NVLink and NVSwitch technologies.
How much memory does the NVIDIA B300 have?
The NVIDIA B300 GPU supports up to 288 GB of HBM3e memory, enabling large AI models and high-performance AI workloads to run efficiently on a single GPU.
Sources - NVIDIA B300
NVIDIA Blackwell Architecture – Official overview of the Blackwell GPU architecture, including NVLink, NVSwitch, and AI infrastructure design:
https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
NVIDIA DGX B300 System Documentation – Technical documentation describing DGX B300 servers with 8 B300 GPUs and up to 288 GB HBM3e memory per GPU:
https://docs.nvidia.com/dgx/dgxb300-user-guide/introduction-to-dgxb300.html
NVIDIA GB300 NVL72 Platform – Official page explaining the GB300 rack-scale AI system with 72 Blackwell GPUs and NVLink interconnect fabric:
https://www.nvidia.com/en-us/data-center/gb300-nvl72/
NVIDIA NVLink Technology – Explanation of NVLink GPU interconnect technology used to scale GPU clusters and enable high-bandwidth communication:
https://www.nvidia.com/en-us/data-center/nvlink/
Inside NVIDIA Blackwell Ultra – NVIDIA developer blog explaining Blackwell Ultra GPUs, including the 288 GB HBM3e memory used in B300: