I don’t consider myself very technical. I’ve never taken a computer science course and don’t know Python. I’ve learned some things like Linux, the command line, Docker, and networking/pfSense because I value my privacy. My point is that anyone can do this, even if you aren’t technical.

I tried both LM Studio and Ollama, and I prefer Ollama. You then download models and use them as your own private, personal GPT. I access it on my local machine through the command line, but I also installed Open WebUI in a Docker container so I can reach it from any device on my local network (I don’t expose services to the internet).
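
Roughly what that looks like, in case anyone wants to follow along. The Docker flags below follow Open WebUI’s quick-start docs, so double-check them against the current README; the model name, port, and volume are just my choices:

    # Pull a model and chat with it from the terminal
    ollama pull llama3
    ollama run llama3

    # Run Open WebUI in Docker, pointed at Ollama on the host
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:main

After that, Open WebUI is reachable at http://your-machine:3000 from the local network.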

Having a private AI/GPT is pretty cool. You can download and test new models, and it is private. Yes, there are ethical concerns about how these models were trained, and I’m not minimizing those concerns. But if you want your own AI/GPT assistant, give it a try. I set it up in a couple of hours, and as I said… I’m not even that technical.

  • @HumanPerson@sh.itjust.works · 9 months ago

    Yeah, I like it too. My only issue is Ollama’s lack of Intel support; I have been following issue 1590 on their GitHub. For now I have a 1050 Ti in a cardboard-box PC whose other hardware is 10+ years old, with a mixed set of RAM totalling 12 GB. It also has a 100 Mbit NIC, so I can’t take advantage of my full internet speed when downloading models. The worst part is that they could support Intel but haven’t merged the fix because of an issue with the Windows Intel drivers. Linux is fine, but I can’t have it. I wasn’t planning to rant, but I already typed it, so… enjoy?

    • @chagall@lemmy.world (OP) · 9 months ago

      Yeah, I have an NVIDIA GPU and it is magic. The best part: while you are using Ollama, open a second terminal window, enter the command watch -n 0.5 nvidia-smi, and you can watch your GPU usage go up and down in real time as you ask the GPT questions. Pretty cool.
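
      If you’d rather not watch the whole table redraw, nvidia-smi can also log just the interesting fields on a loop (the field list here is just one possible pick):

          # GPU utilization and memory use, printed once per second
          nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1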

      Hopefully they get the Intel Arc folks up and running soon.

  • @Goodtoknow@lemmy.ca · 9 months ago

    Have you found much practical use for small models yet? I love the idea that even the 1.1B TinyLlama model can run on my phone, but I haven’t found much real-world use for it yet. Llama 3 8B feels better, but not much better; it’s a bit dumb even for emails.

    • @chagall@lemmy.world (OP) · 9 months ago

      I use my phone all the time, but I just use a WireGuard VPN to tunnel into the Open WebUI container at home. Then I can interact with my desktop machine and its NVIDIA GPU. I’m currently testing mistral-nemo. It’s pretty great, but it gets a bit verbose sometimes.
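
      The phone side is just a stock WireGuard client config along these lines (keys, addresses, and the endpoint are placeholders; yours will differ):

          [Interface]
          # The phone’s key and its address inside the tunnel
          PrivateKey = <phone-private-key>
          Address = 10.0.0.2/32

          [Peer]
          # The WireGuard endpoint at home
          PublicKey = <server-public-key>
          Endpoint = my.home.example:51820
          # Route only the home LAN through the tunnel
          AllowedIPs = 192.168.1.0/24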

      • @kureta@lemmy.ml · 9 months ago

        I am also using Open WebUI. Most LLMs are too verbose for me, so I created a model in Open WebUI with the system prompt “Do not repeat the questions. Avoid giving lists as answers. Do not summarize the answer at the end. If asked a follow-up question, respond with only new information, do not repeat previously stated information.” and named it No Nonsense.
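
        If you live in the terminal, roughly the same thing can be done with an Ollama Modelfile (the base model here is just an example):

            # no-nonsense.Modelfile — bake the system prompt into a named model
            FROM llama3
            SYSTEM """Do not repeat the questions. Avoid giving lists as answers. Do not summarize the answer at the end. If asked a follow-up question, respond with only new information, do not repeat previously stated information."""

        Then build and run it with ollama create no-nonsense -f no-nonsense.Modelfile and ollama run no-nonsense.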

        • @chagall@lemmy.world (OP) · 9 months ago

          That’s really smart. I just found out about fabric yesterday, and it is helping me with exactly the kind of thing you describe. Prompt engineering is a huge thing.

    • @coffee_with_cream@sh.itjust.works · 9 months ago

      Imo it’s worthwhile to just run the biggest model available and rent expensive GPU time. It still amounts to very little overall, and you get much better results. Project-dependent, of course.

      • @NotMyOldRedditName@lemmy.world · 9 months ago

        Kinda defeats the purpose of doing it private and local.

        I wouldn’t trust any claims a 3rd party service makes with regards to being private.

      • @31337@sh.itjust.works · 9 months ago

        IDK, it looks like 48 GB cloud pricing would be about $0.35/hr, which comes to roughly $255/month running around the clock ($0.35 × ~730 hr). Used 3090s go for $700, and two 3090s would give you 48 GB of VRAM for $1400 (I’m assuming you can do “model parallel” with Llama; I’ve never tried running an LLM, but it should be possible and work well). So the break-even point would be under 6 months ($1400 ÷ $255/month ≈ 5.5 months). Hmm, but if serverless works well, that could be pretty cheap. It would probably take a few minutes to process and load a ~48 GB model on every cold start, though?

        • ffhein · 9 months ago

          Assuming they already own a PC: if someone buys two 3090s for it, they’ll probably also have to upgrade their PSU, so that might be worth including in the budget. But it’s definitely a relatively low-cost way to get more VRAM; there are people who run three or four RTX 3090s, too.

    • @Swedneck@discuss.tchncs.de · 9 months ago

      You hear that said about AI because companies are desperately throwing more and more resources at it to get 0.3% better results, and people are collectively running an insane number of prompts all the time.

      But on a personal level it’s not really any different from any other computation; people render videos all the time and no one complains about the resource usage from that, because companies aren’t trying to sell bloated video-rendering services to gardening businesses.

  • @chasingtheflow@lemmy.world · 9 months ago

    Very cool! You can use something like Tailscale to access your local services remotely without exposing them to the internet.
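
    Roughly, assuming Tailscale is installed on both the server and your phone (hostname and port here are made up):

        # On the machine running Open WebUI
        sudo tailscale up

        # Then, from any device on your tailnet, browse to the
        # machine's Tailscale hostname; nothing is opened to the
        # public internet, e.g. http://homebox:3000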

    • @Appoxo@lemmy.dbzer0.com · 9 months ago

      “Very technical” vs. not can be very subjective.
      It could be a 50-year-old sysadmin vs. Adam, whom I pulled off the street, or a greybeard Linux admin vs. a beginner sysadmin who is only in it for the career instead of the passion (those can be very non-technical but good problem-solver folks).

      I know my comparison is flawed.