Nvidia GTC 2024 wrap-up: Blackwell not the only big news

News
Mar 29, 2024 | 5 mins
CPUs and Processors, Data Center, High-Performance Computing

More happened at the Nvidia GTC conference than the Blackwell announcement, including the launch of two new high-speed network platforms.

Nvidia CEO Jensen Huang GB200 Grace Blackwell Superchips
Credit: Nvidia

Nvidia’s GTC conference is in our rearview mirror, and there was plenty of news beyond the major announcement of the Blackwell architecture and the massive new DGX systems powered by it. Here’s a rundown of some of the announcements you might have missed:

High-speed networking platforms

Nvidia is as much of a networking company as it is a GPU company, although it doesn’t focus on the networking side as much. But Nvidia took the ball with Mellanox and ran with it. The company announced two new high-speed network platforms with throughput speeds of up to 800 Gb/s, geared for AI systems.

The first platform is the Quantum-X800 InfiniBand, which consists of two components: the Quantum Q3400 switch and the ConnectX-8 SuperNIC. Compared to the previous generation, it provides five times the bandwidth capacity and nine times the in-network computing, which comes out to 14.4 teraflops of in-network compute.

The second platform is the Spectrum-X800 Ethernet, which utilizes the Spectrum SN5600 800Gbps switch and the Nvidia BlueField-3 SuperNIC. It is designed for multi-tenant generative AI clouds and large enterprises.

Cloud service providers are lining up for both Quantum InfiniBand and Spectrum-X Ethernet, including Microsoft Azure, Oracle Cloud Infrastructure, and CoreWeave.

Inferencing microservices

Microservices have not traditionally been associated with AI because they are small, lightweight programs designed to perform one or a few functions. They are compute-stingy, the antithesis of AI. But Nvidia has introduced microservices for inferencing on large language models (LLMs).

Dubbed Nvidia Inference Microservices (NIM), the software is part of the Nvidia AI Enterprise software package. It consists of optimized inference engines, industry-standard APIs, and support for AI models, all bundled into containers for easy deployment. NIM provides prebuilt models and also allows organizations to add their own proprietary data and models.

One thing you can say about this NIM technology is that Nvidia did not work in a vacuum. The company worked with many major software vendors, including SAP, Adobe, Cadence, CrowdStrike, and ServiceNow, as well as data platform vendors, including Box, Cohesity, Cloudera, Databricks, Datastax, and NetApp.

NIM offers inference processing on many of the popular AI models from Google, Meta, Hugging Face, Microsoft, Mistral AI and Stability AI. The NIM microservices will be available from Amazon Web Services, Google Kubernetes Engine, and Microsoft Azure AI.

Getting into storage validation

Storage is a key component of AI processing, because AI is nothing without copious amounts of data. To that end, Nvidia started a storage partner validation program designed to help businesses find the right storage solutions by offering certification for AI and graphics-intensive workloads. The program is called Nvidia OVX, a similar naming scheme to the DGX compute servers. The first batch of companies seeking OVX storage validation are DDN, Dell PowerScale, NetApp, Pure Storage and WEKA.

Nvidia OVX servers combine high-performance, GPU-accelerated compute with high-speed storage access and low-latency networking to address a range of complex AI and graphics-intensive workloads. The program provides a standardized process for partners to validate their storage appliances.

Server makers jump on Blackwell

All of the major OEMs announced new Blackwell-based offerings.

  • Dell Technologies announced that the PowerEdge XE9680 servers – its flagship eight-way GPU accelerated server for generative AI training, model customization and large-scale AI inferencing – will be updated to the new Blackwell generation.
  • Lenovo announced new 8-GPU AI servers – the ThinkSystem SR680a V3, SR685a V3, and SR780a V3 GPU systems – using Blackwell to support AI, high-performance computing (HPC), and graphical and simulation workloads across various industries.
  • Hewlett Packard Enterprise announced that the supercomputing products it announced last November at SC23 are now available to order for organizations seeking a preconfigured and pretested full-stack solution for the development and training of large AI models. The servers are a purpose-built turnkey solution to help customers accelerate genAI and deep learning projects, and can support up to 168 GH200 Grace Hopper Superchips. In addition to hardware, HPE Services is offering assistance to enterprises to help design, deploy, and manage the solution.
  • Supermicro unveiled a range of servers at GTC 2024, with new systems featuring the GB200 Grace Blackwell Superchip as well as the B200 and B100 GPUs. The company also said its existing Nvidia HGX H100 and H200 systems are being made “drop-in ready” for the new GPUs, which means customers can swap out the Hopper-based hardware for Blackwell when it is available. Supermicro claims it will be the first server company to launch HGX B200 8-GPU and HGX B100 8-GPU systems later this year.

Nvidia/AWS supercomputer gets a Blackwell upgrade

Nvidia and Amazon teamed up last year to build what was going to be one of the fastest supercomputers in the world, called Project Ceiba. With the announcement of the Blackwell processor, Project Ceiba will get an upgrade that will make it up to six times faster than originally planned.

Project Ceiba as initially described was an absolute beast, with more than 16,000 H100 Hopper AI processors and offering 65 exaFLOPS of AI processing power when complete. By way of perspective, the current fastest supercomputer is the US Department of Energy’s Frontier, which can hit 1.1 exaFLOPS.

Nvidia and Amazon are going to upgrade Project Ceiba with 10,368 GB200 Grace Blackwell Superchips. The GB200 consists of one Grace CPU and two Blackwell B200 GPUs stitched together, which means a total of 20,736 GPUs. Nvidia claims this machine could hit an incredible 414 exaFLOPS.
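
For readers who want to check the math, here is a minimal Python sketch that works backward from the figures quoted above: the 20,736-GPU total, two Blackwell GPUs per superchip, and the 65 and 414 exaFLOPS claims. The function and constant names are illustrative only, and the performance numbers are vendor claims for AI workloads rather than measured results.

```python
# Figures quoted in the article; the exaFLOPS values are vendor claims.
TOTAL_GPUS = 20_736          # Blackwell GPUs in the upgraded Project Ceiba
GPUS_PER_SUPERCHIP = 2       # each superchip pairs one Grace CPU with two GPUs
ORIGINAL_AI_EXAFLOPS = 65    # Project Ceiba as first described
UPGRADED_AI_EXAFLOPS = 414   # Project Ceiba with Blackwell


def superchip_count(total_gpus: int, gpus_per_superchip: int) -> int:
    """Return how many superchips are needed to supply total_gpus."""
    if total_gpus % gpus_per_superchip != 0:
        raise ValueError("GPU total is not a whole number of superchips")
    return total_gpus // gpus_per_superchip


def speedup(new_exaflops: float, old_exaflops: float) -> float:
    """Return the ratio of the new performance claim to the old one."""
    return new_exaflops / old_exaflops


superchips = superchip_count(TOTAL_GPUS, GPUS_PER_SUPERCHIP)
ratio = speedup(UPGRADED_AI_EXAFLOPS, ORIGINAL_AI_EXAFLOPS)

print(f"{superchips:,} superchips, each with one Grace CPU")
print(f"Claimed speedup over the original plan: {ratio:.1f}x")
```

The GPU total works out to 10,368 superchips, and the ratio of roughly 6.4 lines up with the "up to six times faster" claim. A similar ratio against Frontier's 1.1 exaFLOPS would be only a rough indicator, since AI performance figures are typically quoted at much lower numeric precision than traditional supercomputer benchmarks.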