
How to Self-Host Your Own Private AI Stack

In this tutorial we’ll walk through my local, private, self-hosted AI stack so that you can run it too.

If you’re looking for the overview of this stack, you can check out the video here: https://www.youtube.com/watch?v=GrLpdfhTwLg

Video Notes: https://technotim.live/posts/ai-stack-tutorial/
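
The core of the stack is Ollama serving models plus Open WebUI as the chat front end, with Traefik, SearXNG, ComfyUI, and Whisper running alongside under Docker Compose. As a rough sketch only (not the full compose file from the video notes, and assuming an NVIDIA GPU with the container toolkit installed), the heart of it looks something like:

  # Minimal sketch: Ollama + Open WebUI only
  services:
    ollama:
      image: ollama/ollama:latest
      volumes:
        - ./ollama:/root/.ollama
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]

    open-webui:
      image: ghcr.io/open-webui/open-webui:main
      depends_on:
        - ollama
      ports:
        - "3000:8080"   # UI at http://localhost:3000
      environment:
        - OLLAMA_BASE_URL=http://ollama:11434
      volumes:
        - ./open-webui:/app/backend/data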

Products:
GPUs
12GB is probably “good enough” for small models.

3000 Series
👇 My value pick: the 3060. 12GB of VRAM, affordable, and it can play games too.
– MSI Gaming GeForce RTX 3060 12GB: https://amzn.to/3WerOUI
– GIGABYTE GeForce RTX 3060 Gaming OC 12G: https://amzn.to/3WdqHV9
– NVIDIA GeForce RTX 3090 Founders Edition Graphics Card Renewed 24GB: https://amzn.to/4cTIS7Q
– Zotac Gaming GeForce RTX 3090 Trinity OC, 24GB: https://amzn.to/3WdhJqV

4000 Series
– ASUS TUF Gaming GeForce RTX™ 4090 OG OC Edition Gaming Graphics Card 24GB: https://amzn.to/463LHRF
– MSI Gaming GeForce RTX 4060 8GB: https://amzn.to/4eVxrON
– ASUS Dual GeForce RTX™ 4070 White OC Edition 12GB: https://amzn.to/460rqN0
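
Whichever card you go with, it’s easy to sanity-check that a model actually fits in VRAM once things are running:

  # Watch GPU memory while a model is loaded (NVIDIA cards)
  nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

  # Ask Ollama what's loaded and how much memory it's using
  ollama ps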

CPUs
You’ll want a modern CPU. If you’re going desktop class, here are a few I would choose:
– Intel Core i7-12700K Gaming Desktop Processor: https://amzn.to/3XTj1Zu
– Intel Core i7-13700K Gaming Desktop Processor: https://amzn.to/3Lg7M5V
– Intel Core i7-14700K Gaming Desktop Processor: https://amzn.to/3WftehU

Storage
For flash storage, I always go with these SSDs:
– Samsung 870 EVO SATA: https://amzn.to/3XTW6gA
– SAMSUNG 990 PRO SSD NVMe: https://amzn.to/4cW1xA9

(Affiliate links may be included in this description. I may receive a small commission at no cost to you.)

Support me on Patreon: https://www.patreon.com/technotim
Sponsor me on GitHub: https://github.com/sponsors/timothystewart6
Subscribe on Twitch: https://www.twitch.tv/technotim
Merch Shop 🛍️: https://l.technotim.live/shop
Gear Recommendations: https://l.technotim.live/gear
Get Help in Our Discord Community: https://l.technotim.live/discord
Main channel: https://www.youtube.com/@TechnoTim
Talks Channel: https://www.youtube.com/@TechnoTimTalks

00:00 – Intro
01:14 – Hardware Specs
02:23 – GPU Considerations
03:10 – GPU Perspective
04:54 – Proxmox
07:05 – Server OS
07:25 – Drivers
08:31 – NVIDIA Container Toolkit
09:44 – Docker
10:25 – Folder Layout
11:20 – Stack Overview
12:15 – Traefik & Docker Networking Considerations
12:55 – Ollama
17:29 – Open WebUI
20:55 – Starting the Stack
23:08 – Ollama Models
25:50 – Chatting with Ollama & Performance
32:05 – searXNG
33:33 – Stable Diffusion & ComfyUI
39:54 – Stable Diffusion Models
44:42 – ComfyUI Workflows
47:00 – Verifying Model Checksum
48:48 – Integrating ComfyUI into Open WebUI
51:59 – Whisper
56:22 – Home Assistant Assist – Chat, Voice, and Text to Voice
01:04:15 – Code Suggestions and Chat Assistant in VSCode (Free Co-pilot)

Thank you for watching!


by Techno Tim Tinkers


26 thoughts on “How to Self-Host Your Own Private AI Stack”

  • You could run all the AI stuff in an LXC. You can pass the GPUs through to the LXC. The way that I figured out how to do it is mapping the devices in the LXC config and then install (the same!!) drivers on the host and in the LXC. You can actually share the GPUs this way if you wanted.
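
    For the curious, the device mapping in the Proxmox LXC config looks roughly like this (a sketch only: the device numbers vary per host, and the driver version has to match on host and container):

      # /etc/pve/lxc/<CTID>.conf (illustrative; adjust device numbers for your host)
      lxc.cgroup2.devices.allow: c 195:* rwm
      lxc.cgroup2.devices.allow: c 509:* rwm
      lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
      lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
      lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
      lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file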

  • Yeah I run 6x RTX A6000 (Ampere generation, same as the 30 series consumer cards) in 2 GPU nodes in my homelab. I don't train, but I do have a bunch of agents, and some automations that run a lot, so parallel compute of AI models is important enough to have spent a mid-sized car's worth on GPUs. EDIT: I'd also suggest gemma2:27b from ollama, it's a great model, better than llama3:8b in my testing (and in some, better than llama3:70b … i can run both).

  • Well there's not a chance that I'll recreate this beast of a configuration but it was a fun watch! Thank you

  • ollama/ollama:rocm docker container works great on my RX6700 XT with any model that fits within VRAM that I tested so far
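
    For reference, the ROCm image is started roughly like this per the Ollama docs (some cards, the RX 6700 XT included, may also need an HSA_OVERRIDE_GFX_VERSION override to be detected):

      # Run Ollama's ROCm build against an AMD GPU
      docker run -d \
        --device /dev/kfd \
        --device /dev/dri \
        -v ollama:/root/.ollama \
        -p 11434:11434 \
        --name ollama \
        ollama/ollama:rocm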

  • Cool and all, but still something it looks like I can't do with the hardware available to me. My best GPU is in my Gaming PC, a 6700XT…a 3 year old card. My server downstairs is from 11 years ago. Works great for the tasks I currently use it for.

  • Thanks for covering this so in-depth Tim. I've been working on a similar project based on a Dell R730 with 2x Tesla P40 cards…much more power hungry than your setup, but will spit out answers from 70b and complex Stable Diffusion workloads as fast as the small models generate on a 3090. I was running into some issues with getting things integrated in the same stack, so great to see how you've put this all together….I have some rebuilding to do.

  • Great video! How do you update the stack? git pull and docker compose pull or?
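
    The usual flow for updating a compose-based stack like this is roughly:

      # Pull updated compose files/configs if the stack lives in git
      git pull

      # Grab newer images and recreate only the containers that changed
      docker compose pull
      docker compose up -d

      # Optionally reclaim space from old images
      docker image prune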

  • Thank you for the video, a very nice guide.

    One potential thing to play around with is quantization of the models: you can find a variant that is less quantized but still fits in memory. For example, with gemma2 27b you get `q4` by default, which takes about 16 GB, but you could pull `27b-instruct-q6_K`, which takes ~21 GB and perhaps gives slightly better results. Of course, that leaves less space to host the models for other services like Stable Diffusion or Whisper. You need to click on Tags when picking the model size on the Ollama website to see the full list.

    Another nice potential addition to the stack could be `openedai-speech` to handle text-to-speech. It can be integrated with Open WebUI. Not a must-have, but it complements the stack nicely IMHO.
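
    In practice the quantization tip just means pulling a specific tag instead of the default, e.g.:

      # Default pull is the q4 quantization (~16 GB for gemma2 27b)
      ollama pull gemma2:27b

      # Less-quantized variant that still fits in 24 GB of VRAM (~21 GB)
      ollama pull gemma2:27b-instruct-q6_K

      # Run it, then `ollama ps` from another shell shows the actual memory use
      ollama run gemma2:27b-instruct-q6_K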

  • I'm considering redeploying my private AI using a Docker stack 😩 Thanks a lot, Tim! 🤓

  • I ran into an error with starting Open WebUI

    HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'

    I solved it through some Googling by setting the environment variable RAG_EMBEDDING_MODEL_AUTO_UPDATE=True. YouTube doesn't like links, so I can't seem to share the GitHub issue I found it under.

    Whishper also doesn't seem to work, but I will probably just stand up my own API for faster-whisper. Other than the one bug with open webui, everything ran great on a 2060 w/ 12GB of VRAM.
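
    In compose terms that workaround is a single extra environment variable on the Open WebUI service, e.g.:

      # Snippet only: add the variable to the Open WebUI service's environment
      services:
        open-webui:
          image: ghcr.io/open-webui/open-webui:main
          environment:
            - OLLAMA_BASE_URL=http://ollama:11434
            - RAG_EMBEDDING_MODEL_AUTO_UPDATE=True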

  • is it possible to run this in windows and with self signed certificate for local use only?

  • Awesome video. I am trying to run Ollama in Kubernetes, but now I think it will be easier to run it as Docker Swarm.

  • An overview of what a local ai can actually do would be great lol. Because I don't want to jump in it without knowing the final goals 😂

  • I'm always surprised when everything just works… Kudos & thanks for the support notes… Great job

  • Do you think you'll be adding RouteLLM? I just watched a few videos about it. Apparently it has accuracy up to 95% of GPT-4 with an 85% cost reduction

  • @TechnoTimTinkers Permission denied (tailscale) help fixed

  • I have a 4090 for gaming and local LAN game streaming, and I've considered using it as an AI workhorse, but the energy it takes to run it on an always-on server is keeping me from doing so.

  • This guy just knows it all! I've been watching his videos for almost two years, and finally networking and hypervisors are getting easy. I just needed this detailed discussion about personalised AI. ❤

  • Every time you called the RTX 3090 a "GTX 3090" it just hurt a bit inside. But otherwise, cool ideas, thank you.

Comments are closed.