Graphics (Intel, AMD, Google, Xilinx)

5:28 PM EDT – Welcome to Hot Chips! This is the annual conference on the latest, greatest, and upcoming big silicon that excites us all. Follow our regular AnandTech live blogs on Monday and Tuesday.

5:31 PM EDT – Stream starts! We have Intel, AMD, Google, Xilinx

5:32 PM EDT – One of the most complex projects at Intel

5:33 PM EDT – Target 500x compared to Intel’s previous best GPU

5:33 PM EDT – The scale is very important

5:33 PM EDT – Four variants of Xe

5:34 PM EDT – Scope of market needs

5:34 PM EDT – a wide set of data types

5:34 PM EDT – Xe-Core

5:34 PM EDT – No more EU – Xe Cores now

5:35 PM EDT – Each Xe-core in HPC has 8x 512-bit vector engines, 8x 4096-bit matrix engines, 8-deep systolic arrays

5:35 PM EDT – Large 512 KB L1 cache per Xe-core

5:35 PM EDT – Software-configurable shared scratch memory

5:36 PM EDT – 8192 x INT8 per Xe-core
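One way to reconcile the matrix-engine width above with the 8192 figure (our arithmetic, not Intel's stated breakdown): slice the eight 4096-bit engines into 8-bit lanes and count each multiply-accumulate as two operations per clock.

```python
# Back-of-envelope check of the per-Xe-core INT8 figure.
# Assumption (ours): eight 4096-bit matrix engines sliced into 8-bit
# lanes, with each multiply-accumulate counted as two ops per clock.

def int8_ops_per_clock(engines: int = 8, engine_bits: int = 4096,
                       elem_bits: int = 8, ops_per_mac: int = 2) -> int:
    lanes = engines * engine_bits // elem_bits  # 8 * 4096 / 8 = 4096 lanes
    return lanes * ops_per_mac                  # 4096 * 2 = 8192 ops/clock

print(int8_ops_per_clock())  # 8192, matching the figure above
```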

5:36 PM EDT – One slice has 16 Xe cores, 16 RT units, 1 hardware context

5:36 PM EDT – ProVis and content creation

5:37 PM EDT – There are four slices in a stack

5:37 PM EDT – Per stack: 64 Xe cores, 64 RT units, 4 hardware contexts, L2 cache, 4 HBM2e controllers

5:37 PM EDT – 8 Xe Link connections

5:37 PM EDT – Supports 2 stacks

5:38 PM EDT – Stacks are directly connected through the packaging

5:38 PM EDT – GPU communication

5:38 PM EDT – 8 fully connected GPUs via an embedded switch

5:38 PM EDT – Not for CPU-to-GPU

5:39 PM EDT – 8 GPUs in OAM

5:39 PM EDT – OCP accelerator module

5:39 PM EDT – Over 1 million INT8 ops per clock in one system
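The 1-million figure falls out of the per-Xe-core numbers quoted earlier; a quick sketch, assuming it counts ops per clock across one two-stack package (our reading of the slide, not an Intel-stated formula):

```python
# Per-clock INT8 throughput across a 2-stack Ponte Vecchio package,
# using the per-Xe-core figure quoted earlier (8192 INT8 ops/clock).
# The 2-stack assumption is ours, matching the "supports 2 stacks" entry.
stacks = 2
xe_cores_per_stack = 64
int8_ops_per_core = 8192

ops_per_clock = stacks * xe_cores_per_stack * int8_ops_per_core
print(ops_per_clock)  # 1048576 -- just over 1 million INT8 ops per clock
```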

5:40 PM EDT – Advanced packaging

5:41 PM EDT – Lots of new stuff

5:41 PM EDT – EMIB + Foveros

5:41 PM EDT – 5 different process nodes

5:42 PM EDT – MDFI interconnect

5:42 PM EDT – Lots of challenges

5:42 PM EDT – I learned a lot

5:43 PM EDT – Floor plan locked very early

5:43 PM EDT – Ran Foveros at 1.5x the originally planned frequency to minimize the number of Foveros connections

5:43 PM EDT – Booted up a few days after first silicon came back

5:44 PM EDT – An order of magnitude more Foveros connections than previous designs

5:44 PM EDT – Compute tiles built on TSMC N5

5:45 PM EDT – 640 mm2 base tile, built on Intel 7

5:46 PM EDT – Xe Link tile built in less than a year

5:47 PM EDT – OneAPI support

5:47 PM EDT – 45 TFLOPs of sustained perf

5:48 PM EDT – Customers early next year

5:48 PM EDT – Questions and answers

5:50 PM EDT – Q: PV compute is 45 TF FP32 – is it also 45 TF FP64? A: Yes

5:51 PM EDT – Q: More insight into the hardware contexts – does 8x PV appear monolithic or as 8 instances? A: It looks like a single logical device; independent applications can run in isolation at the context level

5:53 PM EDT – Q: Does Xe Link support CXL, and if so, which revision? A: It has nothing to do with CXL

5:54 PM EDT – Q: Does the GPU connect to the CPU via PCIe or CXL? A: PCIe

5:54 PM EDT – Q: Xe Link bandwidth? A: 90G SerDes

5:55 PM EDT – Q: Peak power / TDP? A: Not disclosed – no specific product numbers

5:55 PM EDT – The next talk is AMD – RDNA2

5:57 PM EDT – CDNA for compute versus RDNA for gaming

5:57 PM EDT – Each is focused on compute for its respective market

5:58 PM EDT – Flexible and customizable design

5:58 PM EDT – 18 months after the first RDNA product

5:59 PM EDT – 128 MB Infinity cache

5:59 PM EDT – increase the frequency

5:59 PM EDT – RDNA was not a from-scratch design – it built on GCN foundations

5:59 PM EDT – Perf / W is a key metric

5:59 PM EDT – minimize energy consumption

6:00 PM EDT – DX12 Ultimate support, DirectStorage support

6:00 PM EDT – Next-generation consoles helped develop the feature set

6:01 PM EDT – +30% frequency at iso-power, or less than half the power at iso-frequency
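Those two framings of the same efficiency gain can be restated as perf-per-watt ratios; a quick sanity check using only the numbers in the claim above:

```python
# Two framings of the RDNA2 efficiency claim, as perf/W ratios.
# iso-power: +30% frequency (taken as +30% perf) at the same power.
iso_power_gain = 1.30 / 1.0   # 1.3x perf/W

# iso-frequency: same performance at less than half the power.
iso_freq_gain = 1.0 / 0.5     # at least 2x perf/W at the 50% bound

print(iso_power_gain, iso_freq_gain)
```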

6:02 PM EDT – Everything is done without changing the process node

6:03 PM EDT – RX5000 – RDNA1 – high bandwidth, but low hit rates

6:04 PM EDT – They try to avoid going out to GDDR to save power – so increase the cache!

6:04 PM EDT – GPU cache hit rates

6:04 PM EDT – graphics used to be one-way

6:05 PM EDT – Large L3 cache

6:07 PM EDT – lower energy per bit – only 1.3 pJ / bit in cache versus 7-8 pJ / bit for GDDR6
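At a given bandwidth, those pJ/bit numbers translate directly into watts. A sketch with an illustrative 512 GB/s transfer rate (our number, not one AMD quoted):

```python
# Power cost of moving data at the quoted pJ/bit figures.
# The 512 GB/s bandwidth is illustrative, not an AMD-quoted number.
def watts(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    bits_per_s = bandwidth_gb_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

print(watts(512, 1.3))  # ~5.3 W from on-die cache at 1.3 pJ/bit
print(watts(512, 7.5))  # ~30.7 W from GDDR6 (midpoint of 7-8 pJ/bit)
```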

6:08 PM EDT – Average memory latency on the RX 6800 is 34% lower than on the RX 5700
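A simple hit-rate model shows how a large on-die cache can produce a reduction like the 34% above; the hit rate and latencies below are illustrative placeholders, not AMD's figures:

```python
# Average-latency model: avg = hit * cache_latency + miss * dram_latency.
# All three inputs are illustrative, chosen only to show the mechanism.
def avg_latency_ns(hit_rate: float, cache_ns: float, dram_ns: float) -> float:
    return hit_rate * cache_ns + (1 - hit_rate) * dram_ns

before = avg_latency_ns(0.0, 0, 250)   # no big cache: every access ~DRAM
after = avg_latency_ns(0.5, 80, 250)   # half the accesses hit the cache
print(round(1 - after / before, 2))    # 0.34 -> a 34% reduction
```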

6:10 PM EDT – Ray tracing in RDNA2

6:10 PM EDT – Variable speed shading

6:10 PM EDT – Sampler feedback

6:10 PM EDT – Mesh Shaders

6:11 PM EDT – RT aimed to be efficient without adding lots of dedicated hardware

6:12 PM EDT – tightly integrated into the shader architecture

6:12 PM EDT – Simplified implementation

6:13 PM EDT – VRS uses fine-grained selection on 8×8 pixel tiles

6:13 PM EDT – VRS up to 2×2 within an 8×8 grid
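The shading-work savings implied above are easy to quantify: at a 2×2 rate, one shader invocation covers four pixels of the 8×8 tile. A quick sketch:

```python
# Shader-invocation count for VRS within one 8x8 tile.
def invocations(tile_w: int = 8, tile_h: int = 8,
                rate_x: int = 1, rate_y: int = 1) -> int:
    # One invocation per rate_x * rate_y block of pixels.
    return (tile_w // rate_x) * (tile_h // rate_y)

full = invocations()                       # 64: one invocation per pixel
coarse = invocations(rate_x=2, rate_y=2)   # 16: one invocation per 2x2 block
print(full, coarse, 1 - coarse / full)     # 75% fewer invocations
```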

6:16 PM EDT – Questions and answers

6:19 PM EDT – Q: Why Infinity Cache vs stacked V-Cache? A: Not discussed – they only evaluated on-die cache

6:22 PM EDT – Q: What % of TDP goes to the CUs vs everything else? A: It depends on the workload – under heavy load most power is in the CUs, which are the largest consumer and can exceed 50%; next come the common GPU blocks and the DRAM interface. Infinity Cache shuffles the 2nd/3rd positions; Infinity Cache itself is fourth

6:23 PM EDT – Q: Van Gogh and the Steam Deck? A: No comment

6:29 PM EDT – Google VCU talk

6:30 PM EDT – Video is >60% of global internet traffic

6:30 PM EDT – need better algorithms

6:30 PM EDT – Video compression is moving to hardware

6:31 PM EDT – AV1 takes 200x more encoding time in software than H.264

6:31 PM EDT – Pixels per second increased 8000x from H.264

6:32 PM EDT – Most consumer hardware is optimized for price, not performance or efficiency

6:32 PM EDT – They couldn't find everything they needed off the shelf

6:32 PM EDT – Encode 10s of versions from one input

6:33 PM EDT – Full access to configuration tools is required

6:34 PM EDT – Dedicated VP9 encoding and decoding

6:36 PM EDT – enabling sw/hw co-design

6:38 PM EDT – HLS let them test many architecture variations for features and performance

6:39 PM EDT – Accelerators should scale to the warehouse

6:40 PM EDT – tolerate errors at the chip and core level – reliability is a higher level function

6:40 PM EDT – Should support 48 encodes per decode (MOT)

6:40 PM EDT – Upload one video, encode multiple versions

6:41 PM EDT – The chip-level cache was not efficient

6:41 PM EDT – Core memory supports the large MOTs

6:41 PM EDT – LPDDR4 for byte bandwidth

6:41 PM EDT – ECC is used on the chip's memory

6:42 PM EDT – Conservative NoC design

6:43 PM EDT – One decoded frame can be used multiple times – one decode feeds multiple encodes
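The benefit of the decode-once, encode-many (MOT) flow above can be modeled with simple cost arithmetic; the decode/encode costs below are arbitrary illustrative units, not measured VCU numbers:

```python
# Compute cost of N output encodes when decoding once and reusing frames,
# versus decoding separately per output. Costs are arbitrary units.
def naive_cost(n_outputs: int, decode_cost: float, encode_cost: float) -> float:
    return n_outputs * (decode_cost + encode_cost)

def mot_cost(n_outputs: int, decode_cost: float, encode_cost: float) -> float:
    return decode_cost + n_outputs * encode_cost

n, d, e = 48, 10, 10   # 48 outputs per decode, as quoted above
print(naive_cost(n, d, e), mot_cost(n, d, e))  # 960 vs 490 units
```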

6:43 PM EDT – Parallel pipelines for high availability

6:44 PM EDT – 2 ASICs per board, 5 boards per chassis, 2 chassis per host
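That packaging hierarchy multiplies out to the per-host VCU count:

```python
# VCUs per host from the packaging hierarchy quoted above.
asics_per_board = 2
boards_per_chassis = 5
chassis_per_host = 2

vcus_per_host = asics_per_board * boards_per_chassis * chassis_per_host
print(vcus_per_host)  # 20 VCUs in one host
```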

6:44 PM EDT – As many machines per rack as space and power allow

6:44 PM EDT – Performance compared against dual-socket Skylake

6:44 PM EDT – 100x VP9 vs H.264

6:45 PM EDT – One 20-VCU machine replaces racks of CPUs

6:46 PM EDT – Quality improved after deployment

6:47 PM EDT – Time for questions and answers

6:49 PM EDT – Q: Can the VCU work in tandem with another ASIC? A: Not possible – there is no I/O in the middle; it is a tightly coupled design

6:50 PM EDT – Q: What is the profile of the PCIe card – bandwidth/TDP? A: An in-house bifurcated format; otherwise FHFL dual-slot, with silicon below 100 W

6:50 PM EDT – Q: Will the VCU be offered in GCP? A: They are always looking at what is unique for GCP, but there are no announcements

6:52 PM EDT – Q: Can HLS reach parity with RTL? A: Yes

6:54 PM EDT – Q: SECDED ECC on the caches? A: SECDED where possible; some SRAMs in the encoder are detect-only – if an error does occur, the job can be restarted

6:54 PM EDT – Q: Can one VCU do 8K60? A: Bandwidth-wise, yes, but there is no VP9 profile for it.
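For context on what 8K60 demands, a quick pixel-rate calculation using the standard 8K UHD dimensions (our arithmetic, not a figure from the talk):

```python
# Raw pixel rate of an 8K60 stream (8K UHD = 7680 x 4320 pixels).
width, height, fps = 7680, 4320, 60
pixels_per_second = width * height * fps
print(pixels_per_second)  # 1990656000 -- about 2 Gpixels/s
```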

6:55 PM EDT – Q: Other codecs? A: No comment on future formats, but they are heavily involved in AV1's next-generation successor, AV2

6:55 PM EDT – Q: Audio streams? A: The stream is demuxed into video/audio tracks; audio can be split off and processed elsewhere. The VCU does not touch audio

6:58 PM EDT – The last talk is Xilinx

6:59 PM EDT – Xilinx Versal AI Edge

6:59 PM EDT – 7 nm

6:59 PM EDT – AIE-ML architecture optimized for inference

7:00 PM EDT – What is ML used for?

7:00 PM EDT – All applications require a lot of artificial intelligence at low latency and low power

7:02 PM EDT – The lowest-end and highest-end devices are highlighted today

7:03 PM EDT – From 10s of ML TOPS up to 100s of TOPS

7:04 PM EDT – Many different form factors to support

7:05 PM EDT – Architecture details

7:05 PM EDT – Memory tiles, optimized compute cores

7:06 PM EDT – Native support for INT4 and BF16
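For readers unfamiliar with BF16: it is simply the top 16 bits of an IEEE-754 float32 (1 sign, 8 exponent, 7 mantissa bits), which is part of why hardware support is cheap. A minimal sketch of truncation-based conversion (real BF16 hardware may round rather than truncate):

```python
import struct

# bfloat16 keeps the top 16 bits of an IEEE-754 float32:
# 1 sign bit, 8 exponent bits (same range as float32), 7 mantissa bits.
def float_to_bf16_bits(x: float) -> int:
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16          # drop the low 16 mantissa bits (truncation)

def bf16_bits_to_float(b: int) -> float:
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

roundtrip = bf16_bits_to_float(float_to_bf16_bits(3.14159))
print(roundtrip)  # 3.140625 -- only ~2-3 decimal digits survive
```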

7:07 PM EDT – SRAM is protected by ECC

7:07 PM EDT – The number of memory tiles depends on the device – the mid-range has about 38 MB of memory tile capacity

7:10 PM EDT – New ML-targeted tiles on these mid-range products

7:10 PM EDT – The high end still uses the original AIE because 5G needs it

7:10 PM EDT – VLIW vector processor

7:10 PM EDT – Non-blocking interconnect

7:10 PM EDT – micro-DMA

7:15 PM EDT – device-level data movement

7:15 PM EDT – Tiles can read directly from DDR – no intermediate stages required

7:16 PM EDT – DDR traffic supports inline compression

7:20 PM EDT – Memory is explicitly allocated – no data replication, no cache misses

7:23 PM EDT – Software stack coming soon

7:23 PM EDT – You don't have to program in C++ – PyTorch, TensorFlow, Caffe, TVM are supported

7:24 PM EDT – use cases

7:25 PM EDT – How a full Versal AI Edge can be used as a standalone processor

7:31 PM EDT – That's a wrap!

Naveen Kumar