Accurate Benchmarking on GPU devices

When a GPU (Graphics Processing Unit) operation is executed, it executes asynchronously. This means the CPU (Central Processing Unit) queues up work for the GPU and immediately continues to the next line of code - it doesn’t wait for the GPU to finish. Accurate measurement of code execution on GPUs involves the insertion of synchronization barriers to ensure no pending GPU tasks are executing before and after the timing measurements are made.

Illustration

Without Synchronization

Here you can see that the timing statements are measuring the duration up till the end of the final kernel launch. The GPU computation hasn’t completed yet, which means the timing measurement is not accurate and would affect any future inference based on this information.

With Synchronization

Here you can see that a device synchronization call is made before executing the code, this ensures that the CPU waits for any pending GPU tasks to finish before starting the timer. After the launch of the final kernel, another device synchronization call is made which ensures all pending GPU tasks are finished before measuring the runtime.

Pytorch Example

Execute the following code in your Python Interpreter to get the kernel launch time (Replace cuda with mps everywhere to run on your Mac).

import torch
import time
device = "cuda"
x = torch.randn(8192, 8192, device=device)
y = torch.randn(8192, 8192, device=device)
t0 = time.perf_counter_ns()
z = torch.matmul(x, y)
t1 = time.perf_counter_ns()
print(f"Without synchronize: {(t1 - t0) / 1e6:.3f} ms")

Now, Restart your interpreter and execute the following code to get the kernel execution time (Replace cuda with mps everywhere to run on your Mac).

import torch
import time
device = "cuda"
x = torch.randn(8192, 8192, device=device)
y = torch.randn(8192, 8192, device=device)
torch.cuda.synchronize()  # clear any pending work
t0 = time.perf_counter_ns()
z = torch.matmul(x, y)
torch.cuda.synchronize()  # wait for GPU to finish
t1 = time.perf_counter_ns()
print(f"With synchronize: {(t1 - t0) / 1e6:.3f} ms")

Expected Output on CUDA

Without synchronize: 69.157 ms
With synchronize: 152.277 ms

How Codeflash measures execution time involving GPUs

Codeflash automatically inserts synchronization barriers before measuring performance. It currently supports GPU code written in Pytorch, Tensorflow and JAX for NVIDIA GPUs (CUDA) and MacOS Metal Performance Shaders (MPS).

PyTorch: Uses torch.cuda.synchronize() (CUDA) or torch.mps.synchronize() (MPS) depending on the device.
JAX: Uses jax.block_until_ready() to wait for computation to complete. It works for both CUDA and MPS devices.
TensorFlow: Uses tf.test.experimental.sync_devices() for device synchronization. It works for both CUDA and MPS devices.

Overview

Getting Started

Using Codeflash

Configuration

Core Concepts

IDE Extension

How Codeflash Measures Code Runtime on GPUs

Accurate Benchmarking on GPU devices

Illustration

Without Synchronization

With Synchronization

Pytorch Example

How Codeflash measures execution time involving GPUs

Overview

Getting Started

Using Codeflash

Configuration

Core Concepts

IDE Extension

​Accurate Benchmarking on GPU devices

​Illustration

​Without Synchronization

​With Synchronization

​Pytorch Example

​How Codeflash measures execution time involving GPUs

Accurate Benchmarking on GPU devices

Illustration

Without Synchronization

With Synchronization

Pytorch Example

How Codeflash measures execution time involving GPUs