2017
Keywords: dnn; mlp; cnn; rnn; lstm; neural network; deep learning; domain-specific architecture; accelerator; tensorflow; tpu; gpu; transmission line matrix methods; graphics processing units; artificial neural networks; central processing unit; tensile stress; training; hardware

In-Datacenter Performance Analysis of a Tensor Processing Unit

N. P. Jouppi; C. Young; N. Patil; D. Patterson; G. Agrawal; R. Bajwa; S. Bates; S. Bhatia; N. Boden; A. Borchers; R. Boyle; P.-L. Cantin; C. Chao; C. Clark; J. Coriell; M. Daley; M. Dau; J. Dean; B. Gelb; T. V. Ghaemmaghami; R. Gottipati; W. Gulland; R. Hagmann; C. R. Ho; D. Hogberg; J. Hu; R. Hundt; D. Hurt; J. Ibarz; A. Jaffey; A. Jaworski; A. Kaplan; H. Khaitan; D. Killebrew; A. Koch; N. Kumar; S. Lacy; J. Laudon; J. Law; D. Le; C. Leary; Z. Liu; K. Lucke; A. Lundin; G. MacKean; A. Maggiore; M. Mahony; K. Miller; R. Nagarajan; R. Narayanaswami; R. Ni; K. Nix; T. Norrie; M. Omernick; N. Penukonda; A. Phelps; J. Ross; M. Ross; A. Salek; E. Samadiani; C. Severn; G. Sizikov; M. Snelham; J. Souter; D. Steinberg; A. Swing; M. Tan; G. Thorson; B. Tian; H. Toma; E. Tuttle; V. Vasudevan; R. Walter; W. Wang; E. Wilcox; D. H. Yoon

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X–80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
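For intuition about the two headline numbers, here is a minimal sketch, not the TPU's actual datapath: a NumPy emulation of the 8-bit multiply-accumulate arithmetic the matrix unit performs, followed by the back-of-envelope arithmetic that reproduces the 92 TOPS peak. The 256x256 array shape and 700 MHz clock come from the paper itself (though not from this abstract); everything else is illustrative.

```python
import numpy as np

def int8_matmul(a_q: np.ndarray, b_q: np.ndarray) -> np.ndarray:
    """Quantized matmul: int8 inputs, int32 accumulation, as a hardware
    MAC array would compute it (illustrative, not the real datapath)."""
    assert a_q.dtype == np.int8 and b_q.dtype == np.int8
    # Widen before accumulating so int8 x int8 products (at most 2^14 in
    # magnitude) can be summed across a 256-deep column without overflow.
    return a_q.astype(np.int32) @ b_q.astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
b = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
c = int8_matmul(a, b)  # one pass through a 256x256 MAC array

# Peak-throughput arithmetic: 256 * 256 = 65,536 MACs, each counted as
# 2 ops (multiply + add) per cycle, at the TPU's 700 MHz clock.
ops_per_second = 256 * 256 * 2 * 700e6
print(f"peak = {ops_per_second / 1e12:.1f} TOPS")  # ~91.8, i.e. "92 TOPS"
```

The large software-managed on-chip memory exists to keep that array fed, and the abstract's final sentence is a roofline observation: for the memory-bound applications in the workload, achieved TOPS is limited by memory bandwidth rather than by the MAC array, so faster memory raises achieved throughput without touching the compute units.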

Added 2026-04-21