episode 3

running large models on potato pcs

as you set out to become the best deep learning engineer (or whatever else you plan on doing in the field), you'll eventually want to run these large models on your own pc. but if you're unlucky like me and stuck with a terrible graphics card, you might not be able to. your hopes of running llama3-405b just went down the gutter :P

however, there is a solution. you might not have an A100 collecting dust in your attic, but what you can do is quantize the models you're trying to run and pray that it works.

but wait, what is quantization?

quantization actually comes from digital signal processing, where we map a large (often continuous) range of values onto a much smaller set of discrete values.

so how do we apply it here?

for that, we have to briefly understand how model weights are actually stored. so let's delve into some cool math.

cool math time.

model weights are typically stored in single-precision (FP32) or, nowadays, half-precision (FP16) floating point. these take 4 bytes and 2 bytes per parameter respectively. so let's do some math.

for example, let's take some generic-70b model. during computation, the weights of all 70 billion parameters have to sit in gpu memory (which is probably why people use multiple gpus, but that's out of scope here), so:

70,000,000,000 (70 billion) params x 4 bytes per param = 280,000,000,000 bytes, or roughly 280 GB of gpu memory/vram.
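if you want to sanity-check that arithmetic (or try other model sizes), here's a tiny python sketch. the helper name is mine and it only counts the raw weight bytes from above, nothing framework-specific (activations, kv cache, etc. are ignored):

```python
# rough vram needed just to hold the weights
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9  # decimal GB

params_70b = 70_000_000_000

print(f"fp32: {weight_memory_gb(params_70b, 4):.0f} GB")  # 280 GB
print(f"fp16: {weight_memory_gb(params_70b, 2):.0f} GB")  # 140 GB
```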

280 GB is a HUGE amount. good luck finding the money to fund that. but this is where quantization comes into the picture.

with quantization, we basically trade off a bit of accuracy for a much smaller memory footprint: we can convert the weights from single-precision to int8, which takes only 1 byte per parameter compared to the original 4 bytes. so then:

70,000,000,000 (70 billion) params x 1 byte per param = 70,000,000,000 bytes, or 70 GB of memory. that's way better! obviously, you still need a high-end gpu (or a few) to run this, but it's a lot more reachable than the unquantized model.
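to make that fp32 → int8 step concrete, here's a minimal sketch of symmetric (absmax) quantization with numpy. the function names are mine, and real libraries do this per-channel or per-block with more care, but the core idea is exactly this scale-round-clip mapping:

```python
import numpy as np

def quantize_absmax(weights: np.ndarray):
    """map fp32 values to int8 using a single symmetric scale."""
    scale = np.abs(weights).max() / 127.0               # largest value maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """recover approximate fp32 values from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)            # pretend this is a weight matrix
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())         # small, but not zero -> that's the accuracy cost
```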

tradeoff.

it's important to understand that quantization costs you accuracy: the more aggressive the quantization, the more accuracy you lose. there are ways to mitigate this, but keep it in mind. broadly, we have two kinds of quantization:

post-training quantization (ptq), as the name suggests, quantizes the model after training is complete. this is the cheapest way to quantize a model, but it's also the most prone to accuracy loss.

quantization-aware training (qat) bakes quantization into training itself: the forward pass simulates the quantized weights, so the loss already accounts for the quantization error and the model learns to compensate for it. by the end of training, the weights can be quantized with minimal loss in accuracy.

qat generally preserves accuracy better than ptq, since the model trains with the quantization error in the loop, but it costs extra training time and compute. ptq, on the other hand, is something you can do to an already-trained model in minutes.
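to show how cheap the ptq route can be, here's a minimal sketch using pytorch's dynamic quantization (one flavour of post-training quantization). the toy model is made up, but `torch.ao.quantization.quantize_dynamic` is the real api call:

```python
import torch
import torch.nn as nn

# a toy stand-in for "some generic model"
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# post-training dynamic quantization: weights of the listed layer types
# are converted to int8, no retraining needed
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # still works, just smaller and faster on cpu
```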

other methods.

there are other quantization methods and formats as well, such as qlora, gptq, gguf (formerly ggml), awq, etc. briefly: gptq and awq are post-training methods that choose the quantized weights more carefully (gptq goes layer by layer using second-order information, awq protects the weights that matter most based on activation statistics), qlora quantizes a base model to 4-bit and fine-tunes small lora adapters on top of it, and gguf is the file format (with its own family of quantization schemes) that llama.cpp uses to run models on ordinary cpus and gpus.

notes and resources.

so now that you know a little bit about how quantization works, you could try quantizing a model yourself and be happy about the fact that you can run larger models on your computer without spending thousands on a gpu.
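if you want to try it, here's a rough sketch of the easiest route i know of: loading a model in 4-bit through huggingface transformers + bitsandbytes (the same nf4 scheme qlora uses). the model id is just a placeholder; swap in whatever fits your vram and that you have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder: pick any model you can actually download

# 4-bit nf4 quantization applied at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across whatever gpu/cpu memory you have
)

inputs = tokenizer("quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```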