
Machine learning (Esperanto, Enflame, Qualcomm)


11:08 AM EDT – Welcome to Hot Chips! This is the annual conference on the latest, biggest, and upcoming silicon that excites us all. Follow our regular AnandTech live blogs on Monday and Tuesday.

11:08 AM EDT – The event starts at 8:30 a.m. Pacific time, so in about 22 minutes

11:25 AM EDT – I’ll start here in about 5 minutes

11:30 AM EDT – The first talk is from Esperanto Technologies

11:31 AM EDT – AI accelerator – 1000+ RISC-V cores on a chip

11:32 AM EDT – 1088 RISC-V cores

11:32 AM EDT – ET-Minion with tensor units

11:33 AM EDT – 160 MB of SRAM on the chip

11:33 AM EDT – PCIe x8 Gen 4

11:33 AM EDT – Up to 200 TeraOps (TOPS)

11:33 AM EDT – Under 20 watts for inference

11:33 AM EDT – focus on recommendation models

11:34 AM EDT – recommendation models traditionally run on x86 servers

11:34 AM EDT – These servers can take additional accelerator cards

11:34 AM EDT – Low power consumption per card

11:34 AM EDT – Support for multiple types of data

11:34 AM EDT – dense and sparse workloads

11:34 AM EDT – must be programmable

11:35 AM EDT – minimize memory references that aren’t needed

11:36 AM EDT – Fixed function hardware can become obsolete quickly

11:37 AM EDT – Thousands of threads

11:38 AM EDT – limited parallelism with individual large cores

11:38 AM EDT – Esperanto instead uses 1000s of RISC-V cores

11:38 AM EDT – Big chips draw a lot of power

11:38 AM EDT – Esperanto splits the design across multiple chips

11:38 AM EDT – allowing lower voltage, increasing efficiency

11:38 AM EDT – Highest recommendation performance within 120 W using six chips

11:40 AM EDT – TSMC 7nm FinFET

11:40 AM EDT – reduce the voltage per core

11:40 AM EDT – Reducing dynamic capacitance (C) is difficult

11:41 AM EDT – Energy efficiency versus voltage – 0.34 V is the best point

11:42 AM EDT – Inferences per second per watt

11:42 AM EDT – One chip could use 275W at peak

11:42 AM EDT – At 0.75 volts it is 164 W per chip

11:43 AM EDT – The most efficient point is at 8.5 W – 2.5 times better performance per watt than at 0.9 volts
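The voltage-scaling argument behind these numbers can be sketched with the standard CMOS dynamic-power model, P ≈ C·V²·f. The specific voltages and frequencies below are illustrative, not Esperanto's figures; only the trend matters:

```python
# Illustrative CMOS dynamic-power scaling: P ~ C * V^2 * f.
# Numbers are made up for illustration, not from the Esperanto talk.

def rel_power(v, f, v0=0.9, f0=1.0):
    """Dynamic power relative to the (v0, f0) operating point."""
    return (v / v0) ** 2 * (f / f0)

def rel_efficiency(v, f, v0=0.9, f0=1.0):
    """Throughput per watt relative to (v0, f0); f scales throughput."""
    return (f / f0) / rel_power(v, f, v0, f0)

# Dropping from 0.9 V to ~0.4 V (with a proportional frequency drop)
# improves perf/W by roughly (0.9/0.4)^2 ~= 5x in this toy model,
# since the frequency term cancels out of the efficiency ratio.
print(rel_efficiency(0.4, 0.5))  # ~5.06
```

This is why running many small cores at low voltage can beat a few cores at high voltage on performance per watt, even though each core is slower.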

11:44 AM EDT – 64-bit RISC-V processor, software-configurable L1 data cache

11:44 AM EDT – custom pipeline

11:44 AM EDT – SMT2

11:45 AM EDT – 300 MHz to 2 GHz

11:45 AM EDT – can perform 64 operations per tensor instruction

11:45 AM EDT – up to 64,000 operations from a single instruction

11:45 AM EDT – 512-bit wide integer per cycle, 256-bit wide FP per cycle, per core
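A back-of-envelope check of how those per-core figures reach the "up to 200 TOPS" headline. The clock speed and the MAC-counting convention here are assumptions, not official numbers:

```python
# Back-of-envelope peak INT8 throughput (assumed figures, not official):
# a 512-bit integer datapath gives 64 INT8 lanes per core per cycle,
# and each multiply-accumulate is conventionally counted as 2 ops.
cores = 1088
int8_lanes = 512 // 8          # 64 lanes per core per cycle
ops_per_mac = 2                # multiply + add
clock_ghz = 1.4                # assumed; the spec range is 0.3-2.0 GHz

tera_ops = cores * int8_lanes * ops_per_mac * clock_ghz * 1e9 / 1e12
print(f"{tera_ops:.0f} TOPS")  # ~195 TOPS, consistent with 'up to 200 TOPS'
```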

11:46 AM EDT – 8 cores on the chip make up a ‘neighborhood’

11:46 AM EDT – before wire lengths become an issue

11:46 AM EDT – 8 minions share one large instruction cache

11:46 AM EDT – far more efficient than each core having its own I-cache

11:47 AM EDT – cooperative workloads

11:47 AM EDT – customized instructions

11:47 AM EDT – 4 neighborhoods make up a ‘shire’

11:47 AM EDT – with 4 MB of shared SRAM

11:48 AM EDT – mesh connections on each chip

11:48 AM EDT – SRAM banks can be partitioned into private L2 or shared L3

11:48 AM EDT – The mesh networks cross the whole chip

11:48 AM EDT – 16 LPDDR4X controllers

11:49 AM EDT – 256-bit LPDDR4X

11:49 AM EDT – Six chips and 24 LPDDR4X chips on a PCIe card, behind a PCIe switch

11:49 AM EDT – 192 GB of accelerator memory

11:49 AM EDT – 822 GB/s total memory bandwidth per PCIe card
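A quick cross-check of that per-card bandwidth figure from the per-chip numbers above. The LPDDR4X data rate is an assumption (LPDDR4X-4266 is the common top speed grade), not something stated in the talk:

```python
# Cross-check of the quoted card bandwidth. The data rate is assumed.
chips_per_card = 6
bus_bits = 256                  # LPDDR4X bus width per chip
data_rate_mtps = 4266           # assumed LPDDR4X-4266

per_chip_gbs = bus_bits / 8 * data_rate_mtps / 1000   # GB/s per chip
card_gbs = chips_per_card * per_chip_gbs
print(f"{per_chip_gbs:.1f} GB/s per chip, {card_gbs:.0f} GB/s per card")
```

That lands at about 819 GB/s, right in line with the quoted 822 GB/s per card.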

11:50 AM EDT – OCP form-factor versions

11:50 AM EDT – How to deploy on a larger scale

11:50 AM EDT – the 6 chips share one cooling solution

11:51 AM EDT – Software through many interfaces

11:52 AM EDT – Esperanto projected performance

11:54 AM EDT – Four high-performance ET-Maxion cores

11:54 AM EDT – Complete RV64GC ISA

11:54 AM EDT – 24 billion transistors, 570 mm2, 89 mask layers

11:54 AM EDT – First silicon is in

11:55 AM EDT – A0 silicon in testing

11:55 AM EDT – The highest-performance commercial RISC-V chip so far

11:55 AM EDT – Early access to qualified users later in 2021

11:56 AM EDT – P * Once

11:58 AM EDT – Q: External memory and IO power would add power above 20 W? A: IO is included – the 20 W includes DRAM and other components

12:00 PM EDT – Q: Why not BF16? A: We do support it, but BF16 would be extended to FP32 for compute and converted back to BF16 for storage. We do inference – the customer wants inference and doesn’t need BF16

12:01 PM EDT – Q: Why that general-purpose data cache size? A: With 1000 cores on a die, the L1/L2 split across levels matters. Special circuits keep operation very robust at low voltage – large SRAMs are needed for low voltage. A 4 KB L1 paired with the L2 gave a good performance result

12:02 PM EDT – The next talk is from Enflame

12:02 PM EDT – First generation

12:02 PM EDT – Designed in 2018, launched in 2019

12:03 PM EDT – DTU 1.0

12:03 PM EDT – 80 TF BF16, 12 nm FinFET, 14.1 billion transistors, 200 GB/s interconnect

12:04 PM EDT – 16 lanes PCIe 4.0

12:04 PM EDT – 300W

12:05 PM EDT – 2 HBM2 stacks at 512 GB/s

12:05 PM EDT – 32 AI compute cores

12:05 PM EDT – on-chip network IP

12:05 PM EDT – 4 groups of 8 tensor units

12:06 PM EDT – 40 data-transfer engines

12:06 PM EDT – on-chip network

12:06 PM EDT – VLIW programmable

12:06 PM EDT – 1024-bit wide bus

12:06 PM EDT – 256 KB of L1 data

12:06 PM EDT – DMA engine with 1 KB interface

12:07 PM EDT – GCU-CARE 1.0

12:07 PM EDT – 256 tensor compute cores

12:07 PM EDT – Each core supports 1× 32-bit MAC or 4× 16-bit/8-bit MACs. All cores handle all precisions

12:08 PM EDT – Exploits sparsity to save power

12:08 PM EDT – can skip an instruction entirely if zero-valued operands are detected

12:09 PM EDT – 2 Kbit per cycle for stores, 1 Kbit per cycle for loads

12:09 PM EDT – Vector and scalar operation support

12:10 PM EDT – hardware can add padding elements for best efficiency, combined with zero-value detection
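The zero-skipping idea described above can be sketched in scalar form. This is a toy software model of the technique, not Enflame's hardware, and it also shows why zero padding combines cheaply with zero-value detection:

```python
def sparse_dot(weights, activations):
    """Toy zero-skipping MAC loop: skip any product with a zero operand.
    The hardware does this per instruction; this only models the op count.
    Padding elements added for alignment are zeros, so they cost nothing."""
    acc = 0
    macs = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue            # zero detected: the MAC is skipped entirely
        acc += w * a
        macs += 1
    return acc, macs

acc, macs = sparse_dot([1, 0, 3, 0], [2, 5, 0, 7])
print(acc, macs)  # 2 1 -- only the first operand pair is fully non-zero
```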

12:11 PM EDT – 256 cores support convolution operations

12:12 PM EDT – Supports various tensor shapes

12:12 PM EDT – padding may be needed on the tensor borders

12:13 PM EDT – L0 cache with 10 TB/s bandwidth

12:13 PM EDT – Asynchronous data movement and compute

12:14 PM EDT – 4D tensors

12:14 PM EDT – Supports dimension reshaping

12:15 PM EDT – 200 GB/s bidirectional IO per card

12:15 PM EDT – custom protocol with sub-microsecond latency

12:15 PM EDT – Cabling between racks without DMA

12:16 PM EDT – AIC and OAM form factors

12:17 PM EDT – Scales up in a 2D torus topology

12:18 PM EDT – Performance scaling shown up to 160 cards

12:20 PM EDT – Next product ready soon

12:20 PM EDT – Questions and answers

12:21 PM EDT – Q: Are there targeted training workloads? A: Training and natural language processing are supported. The first customer used MLPs

12:21 PM EDT – Q: Why design your own chip-to-chip protocol? Is it cached? A: Not cached – a mailbox is used for data synchronization. We wanted a simpler protocol with better latency

12:22 PM EDT – Q: Will you sell in the West? A: Current customers are in Asia, but if you have interest, come to Enflame

12:22 PM EDT – The next talk is on the Qualcomm Cloud AI 100

12:23 PM EDT – 12 TOPS/W

12:23 PM EDT – high performance and efficient accelerator

12:23 PM EDT – Another introduction to what drives AI

12:24 PM EDT – Qualcomm at the forefront of AI research, currently in the 6th generation

12:25 PM EDT – two form factors – high performance in PCIe HHHL, and a lower-power dual M.2

12:25 PM EDT – top SoC slide

12:26 PM EDT – customized high performance architecture

12:26 PM EDT – 400+ INT8 TOPS

12:26 PM EDT – 8 lanes PCIe 4.0

12:26 PM EDT – 16 GB of LPDDR4

12:26 PM EDT – can store all weights on the SoC with 144 MB of on-die memory
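Keeping all weights on-die is plausible for inference-sized networks once they are quantized. A rough check, using approximate public parameter counts (these models and numbers are illustrative, not from Qualcomm's talk):

```python
# Rough check: do quantized weights fit in 144 MB of on-die SRAM?
# Parameter counts are approximate public figures, not Qualcomm data.
ON_DIE_MB = 144

models = {
    "ResNet-50": 25.6e6,    # ~25.6M parameters
    "BERT-base": 110e6,     # ~110M parameters
}

for name, params in models.items():
    int8_mb = params / 1e6           # 1 byte per weight at INT8
    fits = int8_mb <= ON_DIE_MB
    print(f"{name}: {int8_mb:.0f} MB at INT8 -> fits on-die: {fits}")
```

At INT8 both example networks fit comfortably, which is the case where the DDR is only needed for activations and batching rather than weight streaming.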

12:27 PM EDT – The dual M.2 card is powered through the connector

12:27 PM EDT – power management controller

12:27 PM EDT – 4-way VLIW

12:27 PM EDT – 1800+ instructions

12:27 PM EDT – SMT scalar core

12:27 PM EDT – FP32 / FP16 and INT16 / INT8

12:28 PM EDT – 1 MB L2 cache

12:28 PM EDT – Vector unit, Tensor unit

12:28 PM EDT – 8 MB of tightly-coupled vector memory shared between all units

12:28 PM EDT – Almost everything

12:29 PM EDT – Can work at different power levels

12:29 PM EDT – 12W for edge, 20W for ADAS, 70W High Perf mode

12:29 PM EDT – 7 nm

12:30 PM EDT – The tensor unit is 5 times more power efficient than the vector unit

12:30 PM EDT – 16 AI cores

12:30 PM EDT – 5 TOPS/W at high performance

12:31 PM EDT – The full inference software stack

12:33 PM EDT – Compiler supports mixed precision

12:36 PM EDT – Low power optimizations

12:36 PM EDT – minimize access to DDR and improve performance

12:36 PM EDT – Reuse data as much as you can before you start fetching more
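That data-reuse point can be quantified with the standard tiling argument for matrix multiply. This is a generic sketch of the technique, not Qualcomm's actual scheduler; it just counts how much DRAM traffic tiling avoids:

```python
# DRAM traffic for C = A @ B (n x n matrices), counted in elements read.
# Naive: each output element streams a full row of A and column of B.
# Tiled: each t x t tile of A and B is loaded once per tile-pair and
# reused t times from on-chip memory before fetching more.

def naive_traffic(n):
    return 2 * n ** 3            # n^2 outputs, 2n element reads each

def tiled_traffic(n, t):
    tiles = n // t               # assumes t divides n evenly
    # each of tiles^3 tile-multiplies loads one t*t tile of A and of B
    return tiles ** 3 * 2 * t * t

n, t = 1024, 128
print(naive_traffic(n) / tiled_traffic(n, t))  # reuse factor = t = 128.0
```

The reuse factor equals the tile size t, so the bigger the on-chip buffer (here, the 8 MB vector memory), the fewer DDR accesses per useful operation.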

12:39 PM EDT – Divide the network into multiple AI100 cards

12:39 PM EDT – up to 16 cards per system

12:39 PM EDT – PCIe peer-to-peer through a switch

12:41 PM EDT – Performance on INT8 and Mixed, all inference

12:42 PM EDT – ‘industry leading results’

12:42 PM EDT – Performance in relation to batch size

12:44 PM EDT – AIMET can perform in-flight compression for inference

12:44 PM EDT – 15% increase in ResNet-50 perf for only 1.1% decrease in accuracy

12:45 PM EDT – Edge implementation vs server implementation

12:45 PM EDT – DM.2e = dual M.2

12:45 PM EDT – 15 W TDP in that dual M.2

12:46 PM EDT – Scalable solution for 5G, ADAS, infrastructure

12:46 PM EDT – Time for questions and answers

12:47 PM EDT – Q: Are the power points static, or do they auto-adjust? A: The chip has DVFS – it can adjust based on power. The TDP can be set in firmware depending on the solution

12:47 PM EDT – Q: Is the 12 TOPS/W at the board level or the chip level? A: Chip

12:49 PM EDT – Q: What are the main drivers for achieving that TOPS/W? A: Good building blocks – this is the 6th generation; we have been doing this in mobile for a long time, especially inference. The basic block is efficient. VLIW – the compiler is quite advanced, which keeps the hardware simpler. The same approach applies at the SoC level. No caches – it is managed via the compiler

12:51 PM EDT – Q: Trade-offs between VLIW and RISC? A: ML maps very well onto VLIW, and we have insight there. We know how to make very efficient VLIW cores, and the workload is well suited to VLIW. We did an evaluation and found this to be the best approach.

12:51 PM EDT – Q: NoC details? Mesh, crossbar? A: Hybrid, mostly linear with routers

12:53 PM EDT – Q: Systolic array? A: No.

12:53 PM EDT – Q: Is the scalar core RISC-V? A: Proprietary VLIW

12:55 PM EDT – And that’s a wrap




Naveen Kumar

