Maximizing AI Performance: How GEMM Tuning Enhanced AMD Instinct MI300X AI Accelerator by 7x

Ashton Clark


AMD’s Instinct MI300X AI Throughput Performance & Latency Improved By 7x With GEMM Tuning

Nscale has tested AMD's flagship Instinct MI300X AI accelerator utilizing the GEMM tuning framework, achieving 7x faster performance.

Nscale's Newest AMD MI300X Benchmarking Reveals That GEMM Tuning Has Brought In Significant Performance Bumps

[Press Release]: In Nscale's latest technical deep dive, we explore a critical aspect of AI model optimization: throughput benchmarking, performance tuning, and latency reduction using GEMM (General Matrix Multiplication) tuning.

Maximizing the performance of GPU-accelerated tasks involves more than just raw speed. Optimizing GEMM ensures efficient processing, higher throughput, and the ability to handle complex models and datasets effectively.

In this blog, we benchmark vLLM throughput across multiple models and examine the impact of GEMM tuning. Two libraries are instrumental in this process: rocBLAS (ROCm Basic Linear Algebra Subprograms) and hipBLASLt (Heterogeneous-Compute Interface for Portability, Basic Linear Algebra Subprograms).

What is GEMM Tuning?

GEMM tuning is a powerful technique for enhancing the performance of matrix-multiplication operations. This process includes selecting the most appropriate algorithm based on factors such as memory, cache, and compute capabilities.

By fine-tuning parameters and selecting optimal algorithms, we ensure the GEMM operation maximizes efficiency when using available computing resources. This translates to significant speed improvements for AI and machine learning models.

Metrics Compared


Our analysis compared several key performance metrics between the two benchmark runs, one with GEMM tuning enabled and one without.

  • Generation Speed (tokens per second): Gauges the efficiency of token generation for both input and output processing.
  • Requests per Second: Indicates the system's ability to manage multiple concurrent requests.
  • Cumulative Performance (units processed per second): Combines generation speed and request handling to give a holistic view of the system's efficiency under varied setups.
  • Average Latency (seconds): Measures the time taken to generate a response.
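
As a rough illustration of how metrics like these can be derived from raw timings, the sketch below computes them from a list of per-batch measurements. The exact formulas used in the benchmark are not given in this article, so the definitions here (and the `BatchResult` and `summarize` names) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    num_requests: int    # requests processed together in this batch
    input_tokens: int    # prompt tokens per request
    output_tokens: int   # generated tokens per request
    elapsed_s: float     # wall-clock time for the whole batch


def summarize(results):
    """Aggregate per-batch measurements into the four headline metrics.

    Assumed definitions (not taken from the benchmark itself):
      - generation speed counts generated tokens only
      - cumulative performance counts prompt plus generated tokens
      - average latency is the mean wall-clock time per batch
    """
    total_time = sum(r.elapsed_s for r in results)
    total_requests = sum(r.num_requests for r in results)
    generated = sum(r.num_requests * r.output_tokens for r in results)
    all_tokens = sum(
        r.num_requests * (r.input_tokens + r.output_tokens) for r in results
    )
    return {
        "generation_tokens_per_s": generated / total_time,
        "requests_per_s": total_requests / total_time,
        "total_tokens_per_s": all_tokens / total_time,
        "avg_latency_s": total_time / len(results),
    }


if __name__ == "__main__":
    # Made-up timings, used only to show the call shape.
    demo = [BatchResult(4, 256, 256, 2.0), BatchResult(4, 256, 256, 2.0)]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.2f}")
```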

Settings for Benchmark Runs

We configured each benchmark run with the following settings:

  • Input Prompt Length for Each Request: 256 tokens
  • Output Length for Each Request: 256 tokens
  • Tensor Parallel Size: 1 (utilizing a single GPU, specifically the MI300X)
  • Batch Sizes: 1, 2, and 4
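
The settings above define a small grid of runs: three batch sizes, each measured with and without GEMM tuning. A minimal sketch of that grid follows; the `BenchmarkRun` class and `build_runs` helper are hypothetical names for this example and are not part of vLLM or the benchmark scripts.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkRun:
    input_len: int = 256            # prompt tokens per request
    output_len: int = 256           # generated tokens per request
    tensor_parallel_size: int = 1   # single MI300X GPU
    batch_size: int = 1
    gemm_tuning: bool = False

    @property
    def tokens_per_batch(self):
        """Prompt plus generated tokens handled in one batch."""
        return self.batch_size * (self.input_len + self.output_len)


def build_runs(batch_sizes=(1, 2, 4)):
    """Enumerate every (tuning on/off, batch size) combination."""
    return [
        BenchmarkRun(batch_size=b, gemm_tuning=t)
        for t, b in product((False, True), batch_sizes)
    ]


if __name__ == "__main__":
    for run in build_runs():
        print(run, "tokens per batch:", run.tokens_per_batch)
```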

Key Observations

Let’s look at the gains achieved through GEMM tuning on LLMs such as LLaMA, Mistral, Mixtral, and Falcon. The graphs and data below show the impact of tuned GEMM on the performance and efficiency of these models.

The graph shows a significant increase in generation speed when GEMM tuning is enabled on the AMD Instinct MI300X AI accelerator.

  • GEMM Tuning Impact: Enabling GEMM tuning increases throughput by up to 7.2x, as seen with the LLaMA-2-70B model.
  • Model Size: Larger models like LLaMA-2-70B and LLaMA-3-70B show the most significant improvements in throughput, with increases of 7.2x and 5.9x, respectively.
  • Batch Size: Higher batch sizes generally lead to greater throughput, amplified by GEMM tuning. For instance, throughput for the Falcon 7B model rises from 244.74 tokens/second at batch size 1 to 952.38 tokens/second at batch size 4 without GEMM tuning. With tuning, it climbs further to 2736.58 tokens/second.
  • Comparison Across Models: Among the models tested, LLaMA-2-70B and LLaMA-3-70B show the largest throughput gains from tuning, which reflects their complexity and size. Smaller models like Qwen 1.5 4B and Falcon 7B, by contrast, deliver relatively higher throughput, indicating more efficient processing for less complex models.
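
The Falcon 7B figures quoted above can be split into two separate effects: the gain from a larger batch and the gain from tuning. The short sketch below does that arithmetic. It assumes the tuned figure of 2736.58 tokens/second refers to batch size 4, which the text implies but does not state outright.

```python
# Falcon 7B throughput in tokens/second, as quoted in the text.
untuned_bs1 = 244.74
untuned_bs4 = 952.38
tuned_bs4 = 2736.58   # assumed to be measured at batch size 4


def speedup(new, old):
    """Ratio of new throughput to old throughput."""
    return new / old


batch_scaling = speedup(untuned_bs4, untuned_bs1)   # batch 1 -> 4, no tuning
tuning_gain = speedup(tuned_bs4, untuned_bs4)       # tuning at batch size 4
combined = speedup(tuned_bs4, untuned_bs1)          # both effects together

if __name__ == "__main__":
    print(f"batch scaling: {batch_scaling:.2f}x")
    print(f"tuning gain:   {tuning_gain:.2f}x")
    print(f"combined:      {combined:.2f}x")
```

Under that assumption, batching accounts for roughly a 3.9x gain and tuning for a further 2.9x or so on this model.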



The graph shows the consistent reduction in latency achieved through GEMM tuning.

  • GEMM Tuning Impact: Latency falls significantly across all models. For instance, latency for the LLaMA-2-7B model drops from 1.00 to 0.35 seconds. During testing, we observed that with GEMM tuning enabled, the latency of the LLaMA-2-7B model at a batch size of 1 dropped by 66.5%, from 1.97 seconds to 0.66 seconds. This pattern holds up to a batch size of 4, highlighting the significant performance enhancement GEMM tuning offers.
  • Model Size: Larger models inherently exhibit higher latency. The LLaMA-2-70B model, for example, shows a latency of 1.00 seconds without GEMM tuning and 0.14 seconds with tuning enabled. In comparison, smaller models like LLaMA-2-7B show much lower latency under similar conditions. This trend is consistent across batch sizes, emphasizing that model size directly impacts processing time.
  • Batch Size: While larger batch sizes typically increase latency, GEMM tuning mitigates this, maintaining lower latency. In our testing of the LLaMA-2-7B model without GEMM tuning, the latency rises from 1.97 seconds at batch size 1 to 2.11 seconds at batch size 4. With GEMM tuning enabled, the increase is from 0.66 seconds to 0.77 seconds. This suggests that while GEMM tuning mitigates the latency increase to some extent, processing larger batches naturally requires more computational effort and time.
  • Comparison Across Models: Models like Qwen 1.5 4B and Falcon 7B also show reduced latency, emphasizing the effectiveness of GEMM tuning across different complexities.
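
The percentages for the LLaMA-2-7B measurements can be checked with a few lines of arithmetic. The sketch below uses only the latency figures quoted above; the helper name `percent_reduction` is ours.

```python
# LLaMA-2-7B latency in seconds: batch size -> (untuned, tuned), from the text.
latency_s = {1: (1.97, 0.66), 4: (2.11, 0.77)}


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


reduction = {bs: percent_reduction(u, t) for bs, (u, t) in latency_s.items()}

# How much latency grows when going from batch size 1 to batch size 4.
growth_untuned_s = latency_s[4][0] - latency_s[1][0]
growth_tuned_s = latency_s[4][1] - latency_s[1][1]

if __name__ == "__main__":
    for bs, pct in reduction.items():
        print(f"batch size {bs}: latency reduced by {pct:.1f}%")
    print(f"untuned growth, batch 1 -> 4: {growth_untuned_s:.2f} s")
    print(f"tuned growth, batch 1 -> 4:   {growth_tuned_s:.2f} s")
```

This reproduces the 66.5% figure at batch size 1 and gives a reduction of about 63.5% at batch size 4, so the tuned run stays well under half the untuned latency in both cases.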


Our benchmarking study of AMD MI300X GPUs with GEMM tuning shows improvements in both throughput and latency, with gains of up to 7.2x in specific models. By optimizing GEMM operations using the rocBLAS and hipBLASLt libraries, we significantly enhanced the performance and efficiency of various large language models, including LLaMA, Mistral, Mixtral, and Falcon.

News Source: Nscale