4th Gen AMD EPYC Delivers Market-Leading Performance & Efficiency versus NVIDIA Grace Superchip

By Raghu Nambiar, Corporate Vice President at AMD.

Industry benchmarks show single- and dual-socket 4th Gen AMD EPYC™ systems delivering between ~2.00x and ~3.70x higher performance and more than double the energy efficiency of a 2P NVIDIA Grace CPU Superchip system.

My previous blogs highlighted how 4th Gen AMD EPYC™ processors outperform both the latest 5th Gen Intel® Xeon® Platinum processors and Ampere® Altra® Max M128-30 processors for important workloads. Today, I’m going to delve into how 4th Gen AMD EPYC™ processors compare against NVIDIA Grace™ CPU Superchip processors in terms of both performance and energy efficiency.

4th Gen AMD EPYC processors continue setting new standards in datacenter performance, power efficiency, security, and total cost of ownership, driven by continuous innovation. The 4th Gen AMD EPYC™ processor portfolio offers cutting-edge on-premises and cloud-based solutions to address today’s demanding, highly varied workload requirements.

The extensive AMD EPYC ecosystem comprises over 250 distinct server designs and supports more than 800 unique cloud instances. AMD EPYC processors also hold over 300 world records for performance across a broad spectrum of benchmarks, including business applications, technical computing, data management, data analytics, digital services, media and entertainment, and infrastructure solutions.

NVIDIA recently launched their Grace™ CPU Superchip alongside some eye-opening performance comparisons. It’s essential to carefully evaluate these claims because the benchmark results they published are minimal and lack important system configuration details. By contrast, AMD consistently showcases superior performance and power efficiency using extensive industry-standard benchmark publications. For example, currently, there are 5927 official SPEC CPU® 2017 publications for AMD EPYC processors, compared to zero for the NVIDIA Grace. As you will soon see, 4th Gen AMD EPYC processors deliver commanding energy efficiency and performance uplifts versus the NVIDIA processor.

Please be aware that I am only covering a subset of the tested workloads in this blog. The x86 processor architecture supports a broad range of enterprise, cloud-native, and HPC applications, but compatibility issues restrict the number of workloads that can be executed on the ARM-based NVIDIA Grace.

AMD compared several systems powered by single-socket and dual-socket configurations of AMD EPYC 9754 processors (code name “Bergamo,” featuring 128 cores and 256 threads/vCPUs), dual-socket AMD EPYC 9654 (code name “Genoa,” with 96 cores and 192 threads/vCPUs), and a dual-socket system utilizing the NVIDIA Grace processor (featuring 144 cores/vCPUs in a “2 x 72 cores” format). Each AMD EPYC system was equipped with 12 x 64GB of DDR5-4800 memory per socket, unless otherwise specified in the following sections. The NVIDIA system included the maximum 480 GB of LPDDR5X-8532 memory currently supported by server vendors.
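
As a quick sanity check on the configurations above, here is a minimal Python sketch that totals cores, threads, and memory per system. The per-socket figures come from the paragraph above; the helper name and dictionary layout are my own illustration, and the memory total only applies where the default 12 x 64GB population was used.

```python
# Totals implied by the default configuration described in the text.
# system_totals is an illustrative helper, not part of any benchmark kit.

def system_totals(sockets, cores_per_socket, threads_per_core,
                  dimms_per_socket, dimm_gb):
    cores = sockets * cores_per_socket
    return {
        "cores": cores,
        "threads": cores * threads_per_core,
        "memory_gb": sockets * dimms_per_socket * dimm_gb,
    }

systems = {
    "1P EPYC 9754": system_totals(1, 128, 2, 12, 64),
    "2P EPYC 9754": system_totals(2, 128, 2, 12, 64),
    "2P EPYC 9654": system_totals(2, 96, 2, 12, 64),
    # Grace figures are stated directly in the text: 2 x 72 cores, 480 GB.
    "2P NVIDIA Grace": {"cores": 144, "threads": 144, "memory_gb": 480},
}

for name, totals in systems.items():
    print(f"{name}: {totals['cores']} cores, "
          f"{totals['threads']} threads, {totals['memory_gb']} GB")
```

With the default DIMM population, a dual-socket AMD EPYC system works out to 1536 GB, which matches the 1.5 TB quoted for the HPC tests later in this blog.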

Power Efficiency

Modern data centers are striving to meet rapidly growing demands while optimizing power usage to control costs and achieve sustainability goals. The SPECpower_ssj® 2008 benchmark from the Standard Performance Evaluation Corporation (SPEC®) provides a comparative measure of the energy efficiency of volume server class computers by evaluating both the power and performance characteristics of the System Under Test (SUT).
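
For readers unfamiliar with how the headline SPECpower_ssj 2008 number is formed, the sketch below shows the overall ssj_ops/watt calculation: the sum of ssj_ops over the target load levels divided by the sum of average power over those levels plus active idle. The load and power values are invented purely for illustration and are not measurements from any of the tested systems.

```python
# Overall SPECpower_ssj 2008 metric: sum(ssj_ops) / sum(average watts),
# where the power sum also includes the active idle interval.

def overall_ssj_ops_per_watt(load_levels, active_idle_watts):
    """load_levels: list of (ssj_ops, average_watts), one per target load."""
    total_ops = sum(ops for ops, _ in load_levels)
    total_watts = sum(watts for _, watts in load_levels) + active_idle_watts
    return total_ops / total_watts

# Hypothetical system: 100% down to 10% target load in 10% steps.
sample_levels = [(10_000 * pct, 200 + 3 * pct) for pct in range(100, 0, -10)]
sample_score = overall_ssj_ops_per_watt(sample_levels, active_idle_watts=150)

print(f"overall ssj_ops/watt: {sample_score:.1f}")
```

Because power at partial load and at idle sits in the denominator, a system that scales its power down well at low utilization scores better than peak throughput alone would suggest.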

Figure 1 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~2.50x and ~2.75x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.27x on the same tests.[1][2][3]

Figure 1: SPECpower_ssj® 2008 relative performance

General Purpose Computing

The Standard Performance Evaluation Corporation (SPEC) developed the SPEC CPU® 2017 benchmark suite to evaluate the performance of computer systems. SPECrate® 2017_int_base scores assess integer performance and are widely acknowledged as a premier industry standard for evaluating the capabilities of general-purpose computing infrastructure.
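
The overall SPECrate 2017_int_base score is the geometric mean of the per-benchmark rate ratios (the integer suite has ten benchmarks). The sketch below illustrates that calculation with three made-up ratios per system to keep it short; none of these values are real results.

```python
import math

def geometric_mean(ratios):
    # Average the logarithms, then exponentiate.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-benchmark rate ratios for two systems.
system_a = [900.0, 1200.0, 1500.0]
system_b = [450.0, 400.0, 750.0]

uplift_of_scores = geometric_mean(system_a) / geometric_mean(system_b)
score_of_uplifts = geometric_mean([a / b for a, b in zip(system_a, system_b)])

print(f"relative performance: {uplift_of_scores:.2f}x")
```

A useful property of the geometric mean is that the ratio of two overall scores equals the geometric mean of the per-benchmark uplifts, so a relative-performance figure is not dominated by any single benchmark.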

Figure 2 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by an estimated ~1.33x and ~2.64x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by an estimated ~2.43x on the same tests.[4][5][6]

Figure 2: SPECrate® 2017_int_base relative performance (estimated)

Server-Side Java

4th Gen AMD EPYC processors deliver performance, efficiency, and compatibility and allow you to deploy cloud native workloads with no compromises and without expensive architectural transitions. Java® has become a universal language across enterprise and cloud environments. The SPECjbb®2015 benchmark evaluates Java-based application performance on server-class hardware by modeling an e-commerce company with an IT infrastructure that handles a mix of point-of-sale requests, online purchases, and data-mining operations.

Figure 3 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~1.81x and ~3.58x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~3.36x on the same tests when comparing SPECjbb2015-MultiJVM max-jOPS.[7][8][9]

Figure 3: Server-Side Java relative performance

The AMD EPYC systems ran SUSE Linux Enterprise Server 15 SP4 (kernel v5.14.21) with Java SE 21.0 for AMD EPYC 9654 and Java SE 17.0 LTS for AMD EPYC 9754. The NVIDIA Grace system ran Ubuntu 22.04.4 (kernel v5.15.0-105-generic) with the latest Java SE 22.0.

Transactional Databases

MySQL is extensively utilized as an open-source relational database system across enterprise and cloud environments. AMD used the HammerDB TPROC-C benchmark to evaluate Online Transaction Processing (OLTP). The HammerDB TPROC-C workload is derived from the TPC-C Benchmark™ Standard, and as such is not comparable to published TPC-C™ results, as the results do not comply with the TPC-C Benchmark Standard.

Figure 4 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~1.58x and ~2.16x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.17x on the same tests.[10][11][12]

Figure 4: MySQL TPROC-C relative performance

The test systems were set up with Ubuntu® version 22.04, MySQL™ version 8.0.37, and HammerDB version 4.4. Each system hosted multiple virtual machines (VMs), each having 16 cores. Median values from three test runs were combined for comparative analysis.
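
To make the aggregation concrete, here is a minimal sketch of one way to turn per-VM results from three runs into a relative-performance figure. It assumes per-VM throughput is summed within a run and the median is then taken across runs; the exact combination used for the published figures is not spelled out here, and all numbers below are hypothetical.

```python
from statistics import median

def system_score(runs):
    """runs: one list of per-VM throughput values per test run."""
    # Assumption: add up the VMs within a run, then take the median run.
    return median(sum(vm_scores) for vm_scores in runs)

def relative_performance(candidate_runs, baseline_runs):
    return system_score(candidate_runs) / system_score(baseline_runs)

# Hypothetical throughput values (arbitrary units), three runs each.
baseline_runs = [[100, 100], [110, 90], [95, 100]]
candidate_runs = [[150, 150, 150], [160, 150, 150], [140, 150, 150]]

uplift = relative_performance(candidate_runs, baseline_runs)
print(f"relative performance: ~{uplift:.2f}x")
```

Using the median rather than the mean keeps a single unusually fast or slow run from skewing the comparison.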

Decision Support Systems

MySQL is widely used in Decision Support Systems deployments as well. AMD used HammerDB TPROC-H to evaluate Decision Support System performance. The HammerDB TPROC-H workload is derived from the TPC-H Benchmark™ Standard, and as such is not comparable to published TPC-H™ results, as the results do not comply with the TPC-H Benchmark Standard.

Figure 5 illustrates that single- and dual-socket AMD EPYC 9754 systems configured as shown above outperformed an NVIDIA Grace system by ~1.42x and ~2.98x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.62x on the same tests.[13][14][15]

Figure 5: MySQL TPROC-H relative performance

The test systems were set up with Ubuntu® version 22.04, MySQL™ version 8.0.37, and HammerDB version 4.4. Each system hosted multiple virtual machines (VMs), each having 16 cores. Median values from three test runs were combined for comparative analysis.

Web Server

NGINX™ is a flexible web server known for its ability to act as a reverse proxy, load balancer, mail proxy, and HTTP cache. It’s designed to efficiently handle client requests and deliver web content. NGINX can operate as a standalone web server or boost performance and security by serving as a reverse proxy for other servers. AMD used the popular WRK tool to evaluate performance by generating significant HTTP loads during benchmarking.

Figure 6 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~1.27x and ~2.56x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~1.89x on the same tests.[16][17][18]

Figure 6: NGINX relative performance

This benchmark test placed both the server and client on the same system to reduce network latency and focus on CPU processing power measurement. The systems were configured with Ubuntu® v22.04 and NGINX v1.18.0. Each system hosted multiple instances, each equipped with 8 cores. The workload was tested for 90 seconds in each run, and the median requests per second (rps) values from 3 runs per platform were aggregated to compare relative performance.
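
For anyone who wants to try a similar measurement, the sketch below builds a 90-second wrk invocation and extracts the requests-per-second figure from its text report. The thread count, connection count, and URL are placeholders I chose for illustration, and the sample report is made up; only the 90-second duration and the median-of-three approach come from the description above.

```python
import re
from statistics import median

def wrk_command(url, threads, connections, seconds=90):
    # -t threads, -c open connections, -d test duration.
    return ["wrk", "-t", str(threads), "-c", str(connections),
            "-d", f"{seconds}s", url]

def parse_requests_per_sec(report):
    match = re.search(r"Requests/sec:\s+([\d.]+)", report)
    if match is None:
        raise ValueError("no Requests/sec line found in wrk output")
    return float(match.group(1))

# Made-up report text in the shape wrk prints at the end of a run.
sample_report = """Running 90s test @ http://127.0.0.1:8080/
  8 threads and 64 connections
Requests/sec:  52341.27
Transfer/sec:     42.10MB
"""

command = wrk_command("http://127.0.0.1:8080/", threads=8, connections=64)
run_rps = [parse_requests_per_sec(sample_report), 51980.10, 52877.65]
print(" ".join(command))
print("median rps:", median(run_rps))
```

In a real harness the command list would be passed to subprocess.run once per instance and per run, with the parsed values fed into the median.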

In-Memory Analytics

Redis™ is a powerful in-memory data structure store that serves as a distributed, key–value database, cache, and message broker, with optional durability features. AMD used the widely recognized, purpose-built redis-benchmark tool to measure Redis server performance.

Figure 7 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~1.15x and ~2.29x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~1.54x on the same tests.[19][20][21]

Figure 7: Redis relative performance

The systems were set up with Ubuntu® v22.04 and Redis v7.0.11, along with the redis-benchmark v7.2.3 client. Each client created 512 connections for GET/SET operations with key sizes of 1000 bytes to its respective Redis server. Each system hosted multiple instances, each equipped with 8 cores. The workload test executed 10 million requests on each system, and the median requests per second (rps) values from three runs were aggregated to compare relative performance.
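
The sketch below shows how a multi-instance layout like this could be scripted. It is only an illustration under several assumptions: one Redis instance per 8 cores with consecutive ports starting at 6379, the 1000-byte size mapped to the redis-benchmark -d (data size) option, and 10 million requests issued per client process. The actual scripts and instance counts used for the published results are not described here.

```python
def instance_layout(total_cores, cores_per_instance=8, base_port=6379):
    """Return one (port, first_core, last_core) tuple per Redis instance."""
    count = total_cores // cores_per_instance
    return [
        (base_port + i, i * cores_per_instance,
         (i + 1) * cores_per_instance - 1)
        for i in range(count)
    ]

def redis_benchmark_command(port, requests=10_000_000, clients=512,
                            payload_bytes=1000):
    return [
        "redis-benchmark",
        "-p", str(port),           # server port for this instance
        "-c", str(clients),        # parallel connections
        "-n", str(requests),       # total requests
        "-d", str(payload_bytes),  # data size of SET/GET values in bytes
        "-t", "get,set",           # run only the GET and SET tests
        "-q",                      # quiet: print only requests per second
    ]

layout = instance_layout(144)  # e.g. a 144-core system
commands = [redis_benchmark_command(port) for port, _, _ in layout]
print(len(commands), "instances; first:", " ".join(commands[0]))
```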

Cache Tier

Memcached™ is a high-performance, distributed in-memory caching system designed to store key-value pairs for small chunks of arbitrary data such as strings or objects. It typically caches results from database or API calls, as well as rendered pages. AMD selected the widely utilized memtier benchmarking tool to assess latency and throughput uplifts.

Figure 8 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace™ system by ~1.16x and ~2.26x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~1.97x on the same tests.[22][23][24]

Figure 8: Memcached relative performance

The systems were configured with Ubuntu® v22.04, Memcached v1.6.14, and memtier v1.4.0. Each memtier client established 10 connections, utilized 8 pipelines, and maintained a 1:10 SET/GET ratio with its corresponding Memcached server. Each system hosted multiple instances, each equipped with 8 physical cores. The workload test processed 10 million requests on each system, and the median requests per second (rps) values from three runs per platform were aggregated to assess relative performance.
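
One detail worth knowing if you try to reproduce a setup like this: memtier_benchmark counts --requests per client connection, so a total request target has to be divided across connections. The sketch below builds such a command line. The server address, port, single client thread, and the choice to apply the 10 million total to one memtier process are my assumptions for illustration.

```python
import math

def per_client_requests(total_requests, threads, clients_per_thread):
    # memtier_benchmark applies --requests to every client connection.
    connections = threads * clients_per_thread
    return math.ceil(total_requests / connections)

def memtier_command(port, total_requests=10_000_000, threads=1, clients=10):
    requests = per_client_requests(total_requests, threads, clients)
    return [
        "memtier_benchmark",
        "--server=127.0.0.1",
        f"--port={port}",
        "--protocol=memcache_text",
        f"--threads={threads}",
        f"--clients={clients}",   # connections per thread
        "--pipeline=8",           # concurrent pipelined requests
        "--ratio=1:10",           # SET:GET ratio
        f"--requests={requests}",
    ]

memtier_cmd = memtier_command(port=11211)
print(" ".join(memtier_cmd))
```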

High Performance Computing (HPC)

HPC influences every aspect of our lives, where performance is crucial from manufacturing to life sciences. ARM processors handle lighter workloads well but face challenges with critical data-centric and HPC workloads. Many applications have not been adapted for ARM, and those that have often lack the advanced features found in x86 processors that boost HPC performance. Memory capacity is another significant consideration: 4th Gen AMD EPYC processors support up to 3 TB of memory, whereas ARM-based NVIDIA Grace systems are limited to 480 GB.

Compiling HPC applications to run on ARM processors and debugging runtime failures is a nontrivial undertaking, to say the least. AMD engineers ran into the following issues attempting to run common open-source HPC workloads for testing that required quick turn-around times:

  • NAMD throws runtime errors despite source code changes.
  • GROMACS fails to compile and requires manual intervention at source level.
  • OpenRadioss fails to compile and requires both changes and a pull request to enable it for ARM instructions.
  • OpenFOAM® and WRF® dependencies fail to compile.

AMD engineers were able to compile and test the following open-source HPC workloads without considerable hurdles:

  • HPL requires minor changes in the build system (cmake) to compile and run with the ARM performance math library.
  • Quantum ESPRESSO compiles and runs with the ARM performance math library, but scalapack is not available in this library.

These challenges illustrate the importance of the x86 processor architecture compatibility offered by all AMD EPYC processors across all current generations. Let’s compare the relative performance of these two workloads.

High Performance Linpack

High Performance Linpack (HPL) is a software package and benchmark that evaluates the floating-point performance of HPC clusters. This portable implementation uses Gaussian elimination to test the ability of the cluster to solve a dense system of linear equations of a given order. AMD ran the HPLinpack 2.3 matrix benchmark.
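
For a feel of the underlying math, here is a toy, single-threaded Python sketch of Gaussian elimination with partial pivoting on a small dense system. It is not HPL: the real benchmark uses a blocked, distributed LU factorization tuned for the hardware, and the 3 x 3 system below is just a textbook example.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = solve_dense(a, b)

# Residual check: a x should reproduce b.
residual = max(abs(sum(a[i][j] * x[j] for j in range(3)) - b[i])
               for i in range(3))
print("solution:", [round(v, 6) for v in x], "max residual:", residual)
```

The work grows with the cube of the matrix order, which is why the benchmark stresses floating-point throughput and memory bandwidth so heavily at scale.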

Figure 9 illustrates that the dual-socket AMD EPYC 9754 system outperformed the NVIDIA Grace system by ~2.34x. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~1.97x on the same tests.[25][26]

Figure 9: HPL relative performance

The AMD EPYC 9754 and 9654 systems feature 1.5 TB of DDR5-4800 memory and run Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.16.1.el9_4.x86_64, while the NVIDIA Grace system is equipped with 480GB of LPDDR5X-8532 memory and operates on Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.18.1.el9_4.aarch64+64k.

Quantum ESPRESSO

Quantum ESPRESSO is an open-source suite that performs nanoscale electronic-structure calculations and materials modeling based on density-functional theory, plane waves, and pseudopotentials. The Quantum ESPRESSO 7.0 ausurf benchmark was used to compare the performance of the systems.

Figure 10 illustrates that the dual-socket AMD EPYC 9754 system outperformed the NVIDIA Grace system by ~4.08x. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~3.46x on the same tests.[27][28]

Figure 10: Quantum ESPRESSO relative performance

The AMD EPYC 9754 and 9654 systems feature 1.5 TB of DDR5-4800 memory and run Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.16.1.el9_4.x86_64, while the NVIDIA Grace system is equipped with 480GB of LPDDR5X-8532 memory and operates on Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.18.1.el9_4.aarch64+64k.

Video Encoding

FFmpeg is an exceptionally versatile multimedia framework renowned for its ability to encode, decode, transcode, stream, filter, and play nearly any type of video format, from traditional to cutting-edge technologies. It is widely adopted for its comprehensive capabilities in video and audio processing, including format conversion, video scaling, editing, and seamless streaming.

Figure 11 illustrates that single- and dual-socket AMD EPYC 9754 systems outperformed an NVIDIA Grace system by ~1.60x and ~2.90x, respectively. Further, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.38x on the same tests.[29][30][31]

Figure 11: FFmpeg relative encoding performance

The systems were set up with Ubuntu® v22.04 and FFmpeg v4.4.2. Each instance of FFmpeg transcoded a single input file with 4K resolution in raw video format using the VP9 codec. Each system hosted multiple instances, each configured with 8 cores. Performance on each system was evaluated based on the median total frames processed per hour across three test runs.
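
Here is a small sketch of how a raw 4K to VP9 transcode and the frames-per-hour metric could be scripted. The pixel format, frame rate, file names, and thread count are placeholders I picked for illustration, and summing instances within a run before taking the median across runs is an assumption; the timing data is hypothetical.

```python
from statistics import median

def ffmpeg_vp9_command(src, dst, threads=8):
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",            # headerless raw input
        "-pixel_format", "yuv420p",  # assumed input pixel format
        "-video_size", "3840x2160",  # 4K resolution
        "-framerate", "30",          # assumed input frame rate
        "-i", src,
        "-c:v", "libvpx-vp9",        # VP9 encoder
        "-threads", str(threads),
        dst,
    ]

def frames_per_hour(frames, elapsed_seconds):
    return frames / elapsed_seconds * 3600.0

def system_frames_per_hour(runs):
    """runs: per run, a list of (frames, seconds) tuples, one per instance."""
    return median(sum(frames_per_hour(f, s) for f, s in run) for run in runs)

# Hypothetical timings: two instances, three runs.
sample_runs = [
    [(1800, 120.0), (1800, 125.0)],
    [(1800, 118.0), (1800, 120.0)],
    [(1800, 120.0), (1800, 120.0)],
]

print(" ".join(ffmpeg_vp9_command("input.yuv", "output.webm")))
print("frames per hour:", system_frames_per_hour(sample_runs))
```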

Conclusion

This blog highlighted the leadership performance of both general-purpose AMD EPYC 9654 and, most importantly, cloud-native AMD EPYC 9754 processors versus NVIDIA Grace across eleven key foundational, database, and HPC workloads. This, combined with the data I shared in my blog last June, demonstrates that 4th Gen AMD EPYC processors retain their market-leading performance and power efficiency position.

We’re not stopping there: AMD is poised to extend our commanding performance and energy efficiency leads even further with the forthcoming debut of 5th Gen AMD EPYC processors (codenamed “Turin”) with up to 192 cores per processor and the fully compatible x86 architecture that seamlessly runs your existing and future workloads. I’ll say more when the time comes; meanwhile, let’s keep this shrouded in a little mystery.

Competition is the key to a thriving technology industry because it spurs ongoing innovation and advances that have the potential to solve humanity’s most pressing challenges and meet emerging needs head on. I am thrilled to be able to say that AMD EPYC processors are leading the pack. I can also assure you that AMD is relentlessly focused on maintaining our leadership in current and future generations of AMD EPYC processors.

Raghu Nambiar is a Corporate Vice President of Data Center Ecosystems and Solutions for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.