Enable Generative AI Everywhere with Ubiquitous Hardware and Open Software – Guobing Chen, Intel
Generative AI models such as Large Language Models (LLMs) typically require massive memory and compute resources due to their ever-growing model sizes. However, our comprehensive analysis shows a set of optimization opportunities that apply to most LLMs and can greatly reduce inference latency, including low-precision inference via bfloat16/INT8/INT4, Flash Attention and Efficient Attention in scaled dot product attention (SDPA), optimized KV cache access, kernel fusion such as fused RoPE, and scaling model inference up/out across multiple devices with Tensor Parallelism.
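As a minimal sketch of two of the techniques named above, the snippet below runs bfloat16 attention through PyTorch's fused SDPA entry point. The tensor shapes are illustrative assumptions, not figures from the talk.

```python
import torch
import torch.nn.functional as F

# Illustrative decoder-attention shapes (assumed for this example).
batch, heads, seq_len, head_dim = 1, 16, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)
k = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)
v = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.bfloat16)

with torch.no_grad():
    # SDPA dispatches to a fused Flash-Attention-style kernel when one is
    # available for the input dtype/shape, so the full seq_len x seq_len
    # attention matrix is never materialized; is_causal=True applies the
    # decoder (causal) mask without building it explicitly.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape, out.dtype)  # torch.Size([1, 16, 1024, 64]) torch.bfloat16
```

Using bfloat16 end to end halves memory traffic versus float32 and maps onto native bf16 hardware such as Intel AMX on 4th Gen Xeon.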
We implemented these optimizations in PyTorch and Intel Extension for PyTorch, and our experiments on a typical CPU server with two 4th Generation Intel Xeon Scalable Processors show that we can achieve significant reductions in inference latency.
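The sketch below shows one plausible way to apply Intel Extension for PyTorch (IPEX) to a Hugging Face causal LLM for bfloat16 CPU inference. The `ipex.llm.optimize` entry point and the model checkpoint name are assumptions for illustration (the exact API surface varies by IPEX version) and are not quoted from the talk.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.llm.optimize layers LLM-specific optimizations of the kind described
# above (fused RoPE, optimized KV cache access, operator fusion) on top of
# weight-layout and kernel optimizations for Xeon CPUs.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

prompt = "Generative AI on CPUs"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```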