Yes, but 200 GB is probably already with 4-bit quantization; the weights in fp16 would be more like 800 GB.
IDK if it’s even possible to quantize much further. If it is, you’re probably better off going with a smaller model anyway.
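For anyone who wants to check those numbers, here is a rough back-of-the-envelope sketch. It assumes roughly 405 billion parameters and counts only the weights (no KV cache, activations, or per-block quantization metadata), so real files will be a bit larger. The function name is just something I made up for illustration.

```python
# Approximate parameter count for Llama 405B (assumption: ~405 billion).
PARAMS = 405e9

def weight_size_gb(n_params, bits_per_weight):
    """Size of the raw weights in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_size_gb(PARAMS, bits):.1f} GB")
```

So fp16 lands around 810 GB and 4-bit around 200 GB, which matches the figures above.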
Why, of course! People on here saying it’s impossible, smh
Let me introduce you to the wonderful world of thrashing. What is thrashing? It’s what happens when you run out of RAM. Luckily, most computers these days have something like swap space: they just treat your SSD as extra, extra-slow RAM.
Your computer still locks up when it genuinely doesn’t have enough RAM, though. It moves some RAM out to disk, puts what it needs right now back into RAM, executes a bit of processing, and then the program says it actually needs some of what just got shelved on disk. And it does this super fast, so it’s dropping the thing it needs hundreds of times a second - technology is truly remarkable
Depending on how the software handles it, it might just crash… or it might just take literal hours.
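If you want to see why this gets so bad, here is a toy simulation. It assumes a plain least-recently-used eviction policy and a program that walks through all of its pages in order over and over, which is roughly what reading every weight for every token looks like. Real kernels use fancier approximations, so treat it as a sketch; `count_page_faults` is a made-up name.

```python
from collections import OrderedDict

def count_page_faults(num_pages, ram_frames, passes=3):
    """Count page faults for repeated sequential passes under LRU eviction."""
    ram = OrderedDict()  # page -> True, ordered from least to most recently used
    faults = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in ram:
                ram.move_to_end(page)        # hit: mark as most recently used
            else:
                faults += 1                  # miss: page must come from disk
                if len(ram) >= ram_frames:
                    ram.popitem(last=False)  # evict the least recently used page
                ram[page] = True
    return faults

fits = count_page_faults(num_pages=8, ram_frames=8)       # everything fits in RAM
too_big = count_page_faults(num_pages=10, ram_frames=8)   # slightly too big
print(fits, too_big)
```

When everything fits you only pay for the first load. When the working set is even a little bigger than RAM, every single access in this model goes to disk, because the page you need next is always the one that just got evicted.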
Are there other options that use less RAM?
There’s quantization, which basically compresses the model by using a smaller data type for each weight. It reduces memory requirements by half or even more.
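To give a feel for the idea, here is a minimal sketch of one simple scheme (symmetric "absmax" rounding to the int8 range). This is just an illustration I wrote in plain Python, not what any particular library does; real quantizers work per block or per channel and are a lot smarter. The function names are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zeros
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each weight now needs 1 byte instead of 2 (fp16), at the cost of a small rounding error of at most half a step of `scale` per weight.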
There’s also AirLLM, which loads one part of the model into memory, runs those calculations, unloads that part, loads the next part, etc… It’s a nice option, but with all that loading/unloading the performance is never going to be great, especially on a huge model like Llama 405B.
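The general pattern looks something like the toy sketch below. To be clear, this is not AirLLM’s actual API: the "layers" here are just tiny JSON files holding a made-up scale and bias, and the helper names are hypothetical. It only shows the load, compute, discard loop.

```python
import json
import os
import tempfile

def save_layers(layers, directory):
    """Write each toy layer to its own file and return the file paths."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths

def run_layer_by_layer(paths, x):
    """Apply layers in order while holding only one of them in memory."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)  # load just this layer from disk
        x = [v * layer["scale"] + layer["bias"] for v in x]
        del layer                 # drop it before loading the next one
    return x

with tempfile.TemporaryDirectory() as d:
    toy_layers = [{"scale": 2.0, "bias": 1.0}, {"scale": 0.5, "bias": 0.0}]
    paths = save_layers(toy_layers, d)
    result = run_layer_by_layer(paths, [1.0, 2.0])
print(result)
```

Peak memory is one layer plus the activations, but every forward pass re-reads the whole model from disk, which is where the slowness comes from.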
Then there are some neat projects for distributing models across multiple computers, like exo and Petals. They’re more targeted at a p2p-style random collection of computers. I’ve run Petals in a small cluster and it works reasonably well.