
Performance of Nvidia’s Chinese A800 GPU Revealed
The strong demand for Nvidia's high-performance compute hardware in China has revealed the performance of Nvidia's mysterious A800 compute GPU, made specifically for the Chinese market. According to MyDrivers, the A800 complies with the strict US export rules that restrict how much processing power Nvidia can sell, operating at about 70% of the speed of the A100.
Now three years old, Nvidia's A100 is quite capable: it offers 9.7 FP64 / 19.5 FP64 Tensor TFLOPS for HPC and up to 624 BF16/FP16 TFLOPS (with sparsity) for AI workloads. Even reduced by around 30%, those numbers still look impressive: 6.8 FP64 / 13.7 FP64 Tensor TFLOPS, plus 437 BF16/FP16 TFLOPS (sparse).
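The reported A800 figures are consistent with a flat ~70% scaling of the A100's throughput. A quick sanity check of that inference (the 0.70 factor is an estimate derived from the article's numbers, not an official Nvidia figure):

```python
# A100 spec figures as cited in the article (TFLOPS).
a100 = {
    "FP64": 9.7,
    "FP64 Tensor": 19.5,
    "BF16/FP16 Tensor (sparse)": 624.0,
}

# A800 figures as reported by MyDrivers (TFLOPS).
reported_a800 = {
    "FP64": 6.8,
    "FP64 Tensor": 13.7,
    "BF16/FP16 Tensor (sparse)": 437.0,
}

SCALE = 0.70  # assumed A800-to-A100 performance ratio (~30% cut)

for metric in a100:
    estimate = a100[metric] * SCALE
    # Each reported A800 figure matches the 70% estimate to within rounding.
    assert abs(estimate - reported_a800[metric]) < 0.5, metric
    print(f"{metric}: {a100[metric]} -> {estimate:.2f} "
          f"(reported {reported_a800[metric]})")
```

Every reported figure lands within rounding distance of the 70% estimate, which supports the "runs at 70% of the A100" characterization.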
Despite its "castration" (performance limits), as MyDrivers put it, Nvidia's A800 remains highly competitive in compute capabilities with the all-new BR104 and BR100 compute GPUs from China-based Biren. Meanwhile, Nvidia's compute GPUs and CUDA architecture are broadly supported by the applications its customers run, whereas Biren's processors have yet to be adopted. And even Biren cannot ship its full-fledged compute GPUs in China because of the latest regulations.
| | Biren BR104 | Nvidia A800 | Nvidia A100 | Nvidia H100 |
| --- | --- | --- | --- | --- |
| Form factor | FHFL card | FHFL card (?) | SXM4 | SXM5 |
| Transistor count | ? | 54.2 billion | 54.2 billion | 80 billion |
| Process node | N7 | N7 | N7 | 4N |
| Power | 300W | ? | 400W | 700W |
| FP32 TFLOPS | 128 | 13.7 (?) | 19.5 | 60 |
| TF32+ TFLOPS | 256 | ? | ? | ? |
| TF32 TFLOPS | ? | 109/218* (?) | 156/312* | 500/1000* |
| FP16 TFLOPS | ? | 56 (?) | 78 | 120 |
| FP16 Tensor TFLOPS | ? | 218/437* | 312/624* | 1000/2000* |
| BF16 TFLOPS | 512 | 27 | 39 | 120 |
| BF16 Tensor TFLOPS | ? | 218/437* | 312/624* | 1000/2000* |
| INT8 TOPS | 1024 | ? | ? | ? |
| INT8 Tensor TOPS | ? | 437/874* | 624/1248* | 2000/4000* |

\* with sparsity
Export rules imposed by the US in October 2022 prohibit exporting to China American technologies that enable supercomputers exceeding 100 FP64 PetaFLOPS or 200 FP32 PetaFLOPS within a volume of 41,600 cubic feet (1,178 cubic meters) or less. While the export restrictions do not specifically cap the performance of each individual compute GPU sold to a China-based entity, they do place limits on transfer speeds and scalability.
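To put the density threshold in perspective, a bit of back-of-the-envelope arithmetic (illustrative only, ignoring interconnect and packaging constraints, using the FP64 Tensor figures cited in this article) shows how many GPUs' worth of throughput the 100 FP64 PetaFLOPS ceiling represents:

```python
# Export-rule ceiling: 100 FP64 PetaFLOPS, expressed in TFLOPS.
CAP_FP64_TFLOPS = 100 * 1000

# Per-GPU FP64 Tensor throughput (TFLOPS), per the article's table.
fp64_tensor_tflops = {
    "A100": 19.5,
    "A800 (estimated)": 13.7,
}

for gpu, tflops in fp64_tensor_tflops.items():
    count = CAP_FP64_TFLOPS / tflops
    print(f"{gpu}: ~{count:,.0f} GPUs of aggregate FP64 Tensor "
          f"throughput to reach the 100 PFLOPS cap")
```

On the order of five thousand A100s (or seven thousand A800s) would be needed to hit the cap in aggregate, which is why the rules bite on interconnect speed and scalability rather than on single-GPU performance.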
After the new rules went into effect, Nvidia lost the ability to sell its ultra-high-end A100 and H100 compute GPUs to customers in China without an export license, which was hard to obtain. To meet the performance demands of Chinese hyperscalers, the company introduced a cut-down version of its A100 GPU called the A800. Until now, it was unclear just how capable this GPU was.
As the use of AI grows among both consumers and businesses, so does the popularity of high-performance hardware that can handle the corresponding workloads. Nvidia is among the biggest beneficiaries of the AI megatrend, which is why its GPUs are in such high demand that even the cut-down A800 has sold out in China.
Biren's BR100 is offered in an OAM form factor and can consume up to 550W of power. The chip supports the company's proprietary 8-way BLink technology, which allows installing up to eight BR100 GPUs per system. In contrast, the 300W BR104 will ship in an FHFL dual-wide PCIe card form factor and will support 3-way multi-GPU configurations. Both chips use a PCIe 5.0 x16 interface with the CXL protocol on top for accelerators, reports EETrend (via VideoCardz).
Biren says both of its chips are made using TSMC's 7nm-class fabrication process (without detailing whether it uses N7, N7+, or N7P). The larger BR100 packs 77 billion transistors, which exceeds the 54.2 billion of Nvidia's A100, also made on one of TSMC's N7 nodes. The company also says it had to use a chiplet design and the foundry's CoWoS 2.5D packaging technology to overcome reticle size limitations; this makes sense, as Nvidia's A100 approaches the reticle limit and the BR100 should be on par with it, or larger, considering its higher transistor count.
Considering the specifications, we can guess that the BR100 essentially uses two BR104s, although the developer has not officially confirmed this.
Biren worked with Inspur on an 8-way AI server that will sample from Q4 2022 to commercialize the BR100 OAM accelerator. Baidu and China Mobile will be among the first customers to use Biren's compute GPUs.