NVIDIA's Eos supercomputer just broke its own AI training benchmark record
Date: 2024-04-05

Depending on the hardware you're using, training a large language model of any significant size can take weeks, months, even years to complete. That's no way to do business — nobody has the electricity and time to be waiting that long. On Wednesday, NVIDIA unveiled the newest iteration of its Eos supercomputer, one powered by more than 10,000 H100 Tensor Core GPUs and capable of training a 175 billion-parameter GPT-3 model on 1 billion tokens in under four minutes. That's three times faster than the previous benchmark on the MLPerf AI industry standard, which NVIDIA set just six months ago.
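
As a rough sanity check on that headline figure, here is the arithmetic behind it sketched in Python. The token count and runtime are the ones reported above; the throughput number is simply derived from them, not something stated in the article.

```python
# Back-of-envelope throughput for the reported GPT-3 benchmark run.
# Figures are the ones quoted in the article; the arithmetic is illustrative only.
tokens = 1e9               # tokens processed in the benchmark run
run_time_s = 3.9 * 60      # 3.9 minutes, in seconds

throughput = tokens / run_time_s
print(f"~{throughput / 1e6:.1f} million tokens/sec across the whole cluster")  # ~4.3
```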


Eos represents an enormous amount of compute. It leverages 10,752 GPUs strung together using NVIDIA's Infiniband networking (moving a petabyte of data a second) and 860 terabytes of high-bandwidth memory (36 PB/sec of aggregate bandwidth and 1.1 PB/sec of interconnect) to deliver 40 exaflops of AI processing power. The entire cloud architecture comprises 1,344 nodes — individual servers that companies can rent access to for around $37,000 a month to expand their AI capabilities without building out their own infrastructure.
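
To put those totals in per-machine terms, here is a quick sketch of the implied breakdown. The per-node and per-GPU figures below are derived from the totals quoted above rather than stated anywhere in the article.

```python
# Rough per-node and per-GPU breakdown of the Eos numbers quoted above.
total_gpus = 10_752
total_nodes = 1_344
total_ai_exaflops = 40          # low-precision "AI" exaflops, as reported

gpus_per_node = total_gpus / total_nodes                 # 8.0
pflops_per_gpu = total_ai_exaflops * 1_000 / total_gpus  # exaflops -> petaflops

print(f"{gpus_per_node:.0f} GPUs per node")
print(f"~{pflops_per_gpu:.1f} AI petaflops per GPU")     # ~3.7, in the ballpark of an H100's FP8 peak
```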


In all, NVIDIA set six records in nine benchmark tests: the 3.9-minute notch for GPT-3, a 2.5-minute mark to train a Stable Diffusion model using 1,024 Hopper GPUs, a minute even to train DLRM, 55.2 seconds for RetinaNet, 46 seconds for 3D U-Net, and just 7.2 seconds to train the BERT-Large model.


NVIDIA was quick to note that the 175 billion-parameter version of GPT-3 used in the benchmarking was not put through a full-sized training run (neither was the Stable Diffusion model). A complete GPT-3 training run works through around 3.7 trillion tokens and is just flat-out too big and unwieldy for use as a benchmarking test. For example, it would take 18 months to train on the older A100-based system with 512 GPUs, whereas Eos needs just eight days.


So instead, NVIDIA and MLCommons, which administers the MLPerf standard, leverage a more compact version that uses 1 billion tokens (the smallest denominator unit of data that generative AI systems understand). This test uses a GPT-3 version with the same number of potential switches to flip as the full-size model (those 175 billion parameters), just a much more manageable data set to run through it (a billion tokens vs. 3.7 trillion).
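
One way to see why trimming the token count matters so much is the common rule of thumb that training a dense transformer costs roughly six floating-point operations per parameter per token. That heuristic is our assumption, not something NVIDIA or MLCommons state here, but it makes the gap concrete:

```python
# Rough training-cost comparison using the common ~6 * N * D FLOPs heuristic
# (N = parameters, D = training tokens). The heuristic is an assumption on our
# part; the parameter and token counts are the ones quoted above.
params = 175e9

for label, tokens in [("benchmark subset", 1e9), ("full-scale training run", 3.7e12)]:
    flops = 6 * params * tokens
    print(f"{label}: ~{flops:.2e} FLOPs")

# benchmark subset:        ~1.05e+21 FLOPs
# full-scale training run: ~3.89e+24 FLOPs  (roughly 3,700x more work at the same model size)
```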


The impressive improvement in performance, granted, came from the fact that this recent round of tests employed 10,752 H100 GPUs compared to the 3,584 Hopper GPUs the company used in June's benchmarking trials. However, NVIDIA explains that despite tripling the number of GPUs, it managed to maintain 2.8x scaling in performance — a 93 percent efficiency rate — through the generous use of software optimization.
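
That 93 percent figure follows directly from the numbers NVIDIA reported, since the GPU count grew exactly threefold while throughput grew 2.8x. A minimal check of the arithmetic:

```python
# Scaling-efficiency check from the reported GPU counts and speedup.
gpus_june = 3_584
gpus_now = 10_752
speedup = 2.8                      # reported performance scaling

gpu_ratio = gpus_now / gpus_june   # exactly 3.0
efficiency = speedup / gpu_ratio   # ~0.93

print(f"GPU count grew {gpu_ratio:.1f}x; performance grew {speedup:.1f}x")
print(f"Scaling efficiency: {efficiency:.0%}")   # 93%
```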


"Scaling is a wonderful thing," Salvator said."But with scaling, you're talking about more infrastructure, which can also mean things like more cost. An efficiently scaled increase means users are "making the best use of your of your infrastructure so that you can basically just get your work done as fast [as possible] and get the most value out of the investment that your organization has made."


This article is reprinted from the Internet. If there is any issue, such as copyright, please contact lmy01@gdchico.cn and it will be removed.


https://www.engadget.com/nvidias-eos-supercomputer-just-broke-its-own-ai-training-benchmark-record-170042546.html

