Selfhosted & AI

curbstickle@anarchist.nexus · 24 days ago

Scrubbles@poptalk.scrubbles.tech · 24 days ago

Yeah it’s heresy on Lemmy, but I do find it genuinely useful. My only regret is that I have to use Claude/Anthropic more than I’d like, which is why I have a vested interest in selfhosting. I’d rather figure out how to run the larger models myself and cut them off completely, but you even begin to mention that here and you’ll get downvoted to hell.

brucethemoose@lemmy.world · edit-2 23 days ago

You don’t even need Claude anymore. GLM 5.2 API is good enough for 95% of the same things and vastly cheaper.

MiMo 2.5 Pro and Kimi are also very good. And then there’s Cerebras API if you just want simple things done quick.

The thing with self hosting, while awesome, is that it requires a lot of hardware and considerable time investment for what’s essentially a “base tier model,” or at best one step down for what’s still a very cheap API. I still love it, especially the privacy and control aspect, but you aren’t running Claude at home unless you’ve got a threadripper or server hardware collecting dust.

…Hence I can understand why people don’t pursue it. Especially since a cursory Google search will lead you to trying the Deepseek distillation on Ollama (which is awful).

SuspiciousCarrot78@aussie.zone · edit-2 23 days ago

What Ollama did with that distill is shameful.

For those not in the know: they took a small, 8B model with a Deepseek fine-tune (Qwen3-8B iirc) and claimed it was the 400+B param Deepseek.

They essentially tricked folks into thinking they were running a near-peer SOTA model at home when in fact they were running a small language model (SLM) with crippled settings (again, iirc, ctx 4096 by default).

Lying via obfuscation is still lying.

brucethemoose@lemmy.world · edit-2 23 days ago

All while hiding any attribution to the underlying engine, just to start:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that article isn’t comprehensive. A book could be written on the damage and drama they’ve caused.

SuspiciousCarrot78@aussie.zone · 23 days ago

I think it’s actually a pretty interesting case study of how something from the open source community can get co-opted and fucked over.

That article is a good read.

As always, the game plan seems to be “disrupt, own the market, enshittify”.

brucethemoose@lemmy.world · edit-2 23 days ago

As always, the game plan seems to be “disrupt, own the market, enshittify”.

But with a slimy veneer of SEO/engagement spamming as the primary business strategy.

Scrubbles@poptalk.scrubbles.tech · 23 days ago

That’s where I am: okay with hardware, but I can’t seem to fit the models on my 3090. I have dreams of something like an A100 someday, but not until a ton of used ones hit the market. What do you use for your hardware?

brucethemoose@lemmy.world · edit-2 23 days ago

I have a single 3090!

That’s the dream GPU, these days.

And I have 128GB CPU RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.

…But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (a high quality “trellis” quantization type, but very slow on CPU), with a gigantic model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough to read as it streams, barely fits in memory” for this model.
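For anyone wondering how a 300B model "barely fits" in 128GB RAM plus a 24GB 3090, here is a rough back-of-the-envelope sketch. The ~3.125 bits per weight for IQ3_KT and the 20 GB overhead for KV cache and buffers are assumptions for illustration, not measured values, and the helper names are hypothetical.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float,
         ram_gb: float, vram_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights plus an overhead allowance fit in RAM + VRAM combined."""
    needed = quantized_size_gb(n_params, bits_per_weight) + overhead_gb
    return needed <= ram_gb + vram_gb


if __name__ == "__main__":
    # Assumed: 300B parameters at ~3.125 bits per weight.
    size = quantized_size_gb(300e9, 3.125)
    print(f"weights: {size:.1f} GB")
    # Assumed: 20 GB set aside for KV cache, compute buffers and the OS.
    print("fits in 128 GB RAM + 24 GB VRAM:",
          fits(300e9, 3.125, ram_gb=128, vram_gb=24, overhead_gb=20))
```

The estimate ignores the GB vs GiB difference and the fact that mixed quants use more bits for some tensors, so treat it as a first sanity check only.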

For speed, or prompts with lots of thinking or context (like agentic use), I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.

…This is what I meant to emphasize.

It’s not just the hardware. You kinda have to be part developer, part enthusiast to even follow this stuff, set it up optimally, and keep it up-to-date. If you just try to Google “best LLM for 3090,” you will get absolute garbage.

SuspiciousCarrot78@aussie.zone · edit-2 23 days ago

I’m still impressed you got any MiMo to work at home, at 10 tok/s.

For those trying to visualise that -

https://mikeveerman.github.io/tokenspeed/?rate=10&mode=agent&think=10

Is it a constant 10 or does it (it must do, right?) drop off as context increases?

I imagine you must have compaction or something to mitigate that.

brucethemoose@lemmy.world · edit-2 23 days ago

It drops off, but not as much as you’d think.

MiMo uses 5:1 SWA, so its long-context compute doesn’t increase as catastrophically as older models. That, and most of the “slowness” comes from the MoE layers being on CPU (whereas the attention layers that get heavier at high context are all on the 3090).
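To illustrate why a 5:1 sliding-window layout softens the long-context hit, here is a minimal sketch that counts how many cached positions one new token attends to, summed over layers. The layer count (48) and window size (4096) are made-up placeholders, not MiMo's real configuration; only the 5:1 ratio comes from the comment above.

```python
def attended_positions(ctx_len: int, n_layers: int,
                       swa_per_global: int = 5, window: int = 4096) -> int:
    """Positions one new token attends to, summed over all layers.

    Every (swa_per_global + 1)-th layer is assumed to use full attention;
    the rest only look at the last `window` positions.
    """
    n_global = n_layers // (swa_per_global + 1)
    n_swa = n_layers - n_global
    return n_global * ctx_len + n_swa * min(ctx_len, window)


def full_attention_positions(ctx_len: int, n_layers: int) -> int:
    """Same count for a model where every layer uses full attention."""
    return n_layers * ctx_len


if __name__ == "__main__":
    for ctx in (1_000, 16_000, 85_000):
        hybrid = attended_positions(ctx, n_layers=48)
        full = full_attention_positions(ctx, n_layers=48)
        print(f"ctx={ctx}: hybrid/full = {hybrid / full:.2f}")
```

Under these assumed numbers the hybrid layout does the same attention work as full attention at short context, and a small fraction of it near 85K.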

That’s the beauty of these MoEs: they’re just the right size for the “compute-lite” parts to stay in CPU RAM.
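The split described above can be sketched as a simple name-based placement rule: routed expert tensors go to CPU RAM, everything else (attention, shared experts, norms) goes to the GPU. The tensor names follow the GGUF naming style used by llama.cpp, but the sizes are invented for illustration and this is not how any backend actually implements placement.

```python
import re

# Routed MoE expert tensors, e.g. "blk.3.ffn_down_exps.weight".
EXPERT_PATTERN = re.compile(r"ffn_(up|gate|down)_exps")


def place(tensor_name: str) -> str:
    """Return 'CPU' for routed expert tensors, 'GPU' for everything else."""
    return "CPU" if EXPERT_PATTERN.search(tensor_name) else "GPU"


def split_by_device(tensor_sizes_gb: dict[str, float]) -> dict[str, float]:
    """Sum tensor sizes (in GB) per device under the placement rule."""
    totals = {"CPU": 0.0, "GPU": 0.0}
    for name, size in tensor_sizes_gb.items():
        totals[place(name)] += size
    return totals


if __name__ == "__main__":
    # Hypothetical sizes for one layer, for illustration only.
    layer = {
        "blk.0.attn_q.weight": 0.10,
        "blk.0.attn_k.weight": 0.02,
        "blk.0.ffn_down_shexp.weight": 0.05,
        "blk.0.ffn_up_exps.weight": 0.80,
        "blk.0.ffn_gate_exps.weight": 0.80,
        "blk.0.ffn_down_exps.weight": 0.80,
    }
    print(split_by_device(layer))
```

The point is that the bulk of the bytes live in the expert tensors, while the attention tensors that get heavier with context stay small enough for 24GB of VRAM.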

I will measure it tomorrow. It is a constant ~9-10TPS for short queries, but definitely slower near my current max context of 85K.

And do you mean prompt compaction? I don’t automate that; when I use that particular model, I tend to use it in Mikupad, aka “raw” notepad mode, and manipulate the context directly. This is so I can do things like chop out conversations, pick different tokens from the logprobs, or edit its own replies/thinking and continue mid reply.

I like manually handling this because, being a local model, prompts are cached. Streaming starts quickly if most of the prompt stays cached, which is actually a really nice advantage over APIs.
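A simplified way to picture the caching benefit: only the tokens after the longest shared prefix between the cached prompt and the new prompt need to be processed again. This is a toy model with made-up token IDs and hypothetical function names, not any backend's actual cache implementation.

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached: list[int], new: list[int]) -> int:
    """Tokens of the new prompt that fall outside the reusable prefix."""
    return len(new) - common_prefix_len(cached, new)


if __name__ == "__main__":
    cached = list(range(1000))                   # previous prompt, 1000 tokens
    edit_near_end = cached[:990] + [7, 7, 7]     # tweak the last reply
    edit_near_start = [7] + cached[1:]           # change the very first token
    print(tokens_to_reprocess(cached, edit_near_end))
    print(tokens_to_reprocess(cached, edit_near_start))
```

Edits near the end of the context keep almost everything cached, while a change near the start forces nearly the whole prompt to be processed again, which is worth keeping in mind when chopping up a conversation by hand.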

SuspiciousCarrot78@aussie.zone · edit-2 23 days ago

Oh, it’s a MoE? That makes sense.

If you’re getting MiMo at -ctx 85K … you’re within spitting distance of SOTA. You can do real work with that.

I take it MiMo doesn’t do the Qwen “hyperventilate into a paper bag” loop as --ctx increases. Qwen models seem to be really sensitive to that at lower quants.

I’m using 27B via OR API and I swear the diff providers use entirely diff quants. Sometimes you get a genius and other times a drooling mess.

brucethemoose@lemmy.world · edit-2 23 days ago

They 100% do. They’re probably serving “naive” FP8 via vLLM, which is worse than you’d think, especially if they flip on the awful FP8 KV cache.

In a local quant, you can stop quantized models from falling apart at higher CTX by leaving the attention heads at a higher precision. As an example, with MiMo 2.5, I have all the MoE MLP layers at IQ3_KT, the dense experts at Q6_K, but all the attention layers at Q8_0.

For Qwen 27B, I’m still experimenting, but leaning towards IQ4_KT for the MLPs, Q6_K for attention, and Q8_0 for the small, very sensitive KV heads. Or a similar scheme as an exl3 quant.
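As a rough illustration of why keeping attention at Q8_0 costs so little in a big MoE, here is a sketch that computes the effective bits per weight of a mixed scheme like the MiMo one above. The parameter shares per tensor group are invented, and the IQ3_KT figure (~3.125 bpw) is an assumption; Q8_0 at 8.5 bpw and Q6_K at 6.5625 bpw are the usual GGUF block sizes.

```python
# Bits per weight per quant type. IQ3_KT is an assumed approximate value.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.5625, "IQ3_KT": 3.125}


def effective_bpw(param_share: dict[str, float], scheme: dict[str, str]) -> float:
    """Average bits per weight, weighted by each tensor group's parameter share."""
    total = sum(param_share.values())
    bits = sum(param_share[group] * BITS_PER_WEIGHT[scheme[group]]
               for group in param_share)
    return bits / total


if __name__ == "__main__":
    # Hypothetical split: the routed experts hold most of the parameters.
    share = {"moe_experts": 0.94, "dense_experts": 0.03, "attention": 0.03}
    mixed = {"moe_experts": "IQ3_KT", "dense_experts": "Q6_K", "attention": "Q8_0"}
    uniform = {group: "IQ3_KT" for group in share}
    print(f"mixed:   {effective_bpw(share, mixed):.3f} bpw")
    print(f"uniform: {effective_bpw(share, uniform):.3f} bpw")
```

With shares like these, bumping the attention and dense tensors up to much higher precision only adds a fraction of a bit per weight overall.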

That being said, sometimes even unquantized models fall apart in certain long context scenarios because the max advertised context is a lie. You just have to test them and see, but Qwen has certainly done this in the past.

Scrubbles@poptalk.scrubbles.tech · 22 days ago

I’ll have to play around with mine then, because I haven’t had great luck with it; the results have been very disappointing. The CPU offloading is fairly slow, but maybe I should try tweaking more.

brucethemoose@lemmy.world · edit-2 22 days ago

Be sure to try the ik_llama.cpp fork. Basically, it specializes in MoE CPU offloading on Nvidia cards, and offers more efficient quantization types than mainline llama.cpp:

https://github.com/ikawrakow/ik_llama.cpp/

And see this repo for specific 3090 configs: https://github.com/noonghunna/club-3090

Honestly I should just write up my setup in this community too.