
Serve 100s of Fine-tuned LLMs for the Cost of Serving One with LoRAX

Predibase recently released LoRA Exchange (LoRAX), a new serving architecture that gives developers an efficient, cost-effective way to train and serve smaller, task-specific LLMs on any GPU.

Until now, fine-tuning and serving a large collection of models for production applications has required dedicated GPU resources for each deployed model, which quickly becomes cost-prohibitive.

LoRAX is a modular LLM serving architecture that allows users to dynamically serve 100+ fine-tuned models from a single GPU, effectively letting you serve them all for the price of one.
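
A quick back-of-envelope calculation shows why this works: the base model dominates GPU memory, while each LoRA adapter is tiny by comparison. The numbers below assume Llama-2-7b with rank-8 adapters on the attention query and value projections only; your adapter sizes will vary with rank and target modules.

```python
# Rough sizing: why 100 adapters fit where 100 full models would not.
# Assumes Llama-2-7b (32 layers, hidden size 4096) with rank-8 LoRA
# applied to q_proj and v_proj only; adjust for your configuration.
base_params = 7e9                         # full model: ~7B parameters
lora_params = 32 * 2 * 8 * (4096 + 4096)  # layers * matrices * r * (d_in + d_out)

print(f"base model (fp16):   {base_params * 2 / 1e9:.1f} GB")       # ~14.0 GB
print(f"one adapter (fp16):  {lora_params * 2 / 1e6:.1f} MB")       # ~8.4 MB
print(f"100 adapters (fp16): {100 * lora_params * 2 / 1e9:.2f} GB") # ~0.84 GB
```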

In this session, you will learn:

• Parameter-efficient fine-tuning with LoRA (first sketch below)
• Just-in-time dynamic loading of LoRA adapters (second sketch below)
• How to avoid out-of-memory (OOM) errors with tiered weight caching (also covered by the second sketch)
• Optimizing for high aggregate throughput (third sketch below)
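
On the first point: LoRA freezes the pretrained weight matrix W and learns a low-rank update, so a linear layer computes y = Wx + (alpha/r) * BAx, where A and B together hold far fewer parameters than W. The sketch below is a minimal, generic PyTorch illustration of that idea, not Predibase's implementation; the class name and default hyperparameters are our own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x)).
    The pretrained weight W stays frozen; only A and B are trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Usage: wrap an existing projection, then train only the adapter weights.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
```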
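
On the second and third points: just-in-time loading means an adapter's weights are fetched only when a request actually targets it, and tiered caching demotes idle adapters from GPU memory to CPU RAM (with disk as the final tier) instead of letting the GPU run out of memory. The sketch below is a hypothetical illustration of that policy; the class, tier sizes, and file layout are illustrative assumptions, not LoRAX internals.

```python
from collections import OrderedDict
import torch

class TieredAdapterCache:
    """Hypothetical sketch: hot adapters live on the GPU, recently used
    ones in CPU RAM, and everything else stays on disk until requested."""

    def __init__(self, max_gpu: int = 8, max_cpu: int = 32):
        self.gpu = OrderedDict()   # adapter_id -> tensors on GPU (hottest tier)
        self.cpu = OrderedDict()   # adapter_id -> tensors in host RAM
        self.max_gpu, self.max_cpu = max_gpu, max_cpu

    def get(self, adapter_id: str) -> dict:
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)          # mark most recently used
            return self.gpu[adapter_id]
        if adapter_id in self.cpu:                    # warm hit: promote CPU -> GPU
            weights = self.cpu.pop(adapter_id)
        else:                                         # cold miss: load from disk
            weights = torch.load(f"{adapter_id}.pt", map_location="cpu")
        self._make_room()
        self.gpu[adapter_id] = {k: v.cuda() for k, v in weights.items()}
        return self.gpu[adapter_id]

    def _make_room(self):
        # Demote the least recently used GPU adapter instead of OOMing.
        if len(self.gpu) >= self.max_gpu:
            victim_id, victim = self.gpu.popitem(last=False)
            self.cpu[victim_id] = {k: v.cpu() for k, v in victim.items()}
            if len(self.cpu) > self.max_cpu:
                self.cpu.popitem(last=False)          # its disk copy remains
```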
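
And on the fourth point: aggregate throughput comes from batching requests for different adapters into a single forward pass, so the expensive base-model computation is shared across the whole batch and each adapter's low-rank correction is applied only to its own rows. The loop below is a deliberately simplified sketch of that math; a production system would fuse this into GPU kernels rather than iterate in Python.

```python
import torch

def batched_lora_forward(x, base_weight, adapters, adapter_ids):
    """Sketch of multi-adapter batching.
    x: (batch, d_in) inputs, one row per request
    base_weight: (d_out, d_in) shared frozen weight
    adapters: adapter_id -> (A, B) with A: (r, d_in), B: (d_out, r)
    adapter_ids: one adapter id per row of x
    """
    y = x @ base_weight.T                  # shared base matmul for every request
    for aid in set(adapter_ids):
        rows = [i for i, a in enumerate(adapter_ids) if a == aid]
        A, B = adapters[aid]
        y[rows] += (x[rows] @ A.T) @ B.T   # low-rank delta, only for this adapter's rows
    return y
```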

The session also includes a live demo showing how to use our new Python SDK to fine-tune and query Llama-2-7b with LoRAX through the Predibase 2-week free trial.
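
To give a flavor of what serving looks like, a deployed adapter is selected per request by passing an adapter ID alongside the prompt. The snippet below talks to a LoRAX endpoint over plain HTTP with `requests`; the URL, adapter name, and payload fields are assumptions for illustration, so follow the notebook linked below for the exact SDK calls.

```python
import requests

# Hypothetical endpoint and adapter name; substitute your own deployment.
ENDPOINT = "http://localhost:8080/generate"

resp = requests.post(
    ENDPOINT,
    json={
        "inputs": "What is parameter-efficient fine-tuning?",
        "parameters": {
            "adapter_id": "my-org/llama-2-7b-support",  # which LoRA adapter to apply
            "max_new_tokens": 128,
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```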

• LoRAX follow-along notebook: https://pbase.ai/loraxcolab
• Slides from the session: https://pbase.ai/loraxwebinarslides
• Fine-tune and serve LLMs for free with our trial: https://pbase.ai/getstarted


by Predibase

