Enable Generative AI Everywhere with Ubiquitous Hardware and Open Software – Guobing Chen, Intel
Generative AI models such as Large Language Models (LLMs) typically require massive memory and compute resources due to their ever-growing model sizes. However, our comprehensive analysis shows a set of optimization opportunities that apply to most LLMs and can greatly reduce inference latency, including low-precision inference via bfloat16/INT8/INT4, Flash Attention and Efficient Attention in scaled dot product attention (SDPA), optimized KV cache access, kernel fusion such as fused RoPE, and scaling model inference up/out across multiple devices with Tensor Parallelism.
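As a minimal sketch of two of the techniques named above, the snippet below runs bfloat16 attention through PyTorch's fused SDPA entry point. The tensor shapes are illustrative assumptions, not figures from the talk.

```python
import torch
import torch.nn.functional as F

# Illustrative decoder-attention shapes (assumed for this example).
batch, heads, seq_len, head_dim = 1, 16, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)
k = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)
v = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)

with torch.no_grad():
    # SDPA dispatches to a fused Flash-Attention-style kernel when one is
    # available for the input dtype/shape, so the full seq_len x seq_len
    # attention matrix is never materialized; is_causal=True applies the
    # decoder (causal) mask without building it explicitly.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape, out.dtype)  # torch.Size([1, 16, 1024, 64]) torch.bfloat16
```

Using bfloat16 end to end halves memory traffic versus float32 and maps onto native bf16 hardware such as Intel AMX on 4th Gen Xeon.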
We implemented these optimizations in PyTorch and Intel Extension for PyTorch, and our experiments on a typical CPU server with two 4th Generation Intel Xeon Scalable Processors show that we can achieve significant reductions in inference latency.
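The sketch below shows one plausible way to apply Intel Extension for PyTorch (IPEX) to a Hugging Face causal LLM for bfloat16 CPU inference. The `ipex.llm.optimize` entry point and the model checkpoint name are assumptions for illustration (the exact API surface varies by IPEX version) and are not quoted from the talk.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.llm.optimize layers LLM-specific optimizations of the kind described
# above (fused RoPE, optimized KV cache access, operator fusion) on top of
# weight-layout and kernel optimizations for Xeon CPUs.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

prompt = "Generative AI on CPUs"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```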