episode 3

running large models on potato pcs

as you set out to become the best deep learning engineer (or whatever else you plan on doing in the field), you'll eventually want to run these large models on your own pc. but if you're unlucky like me and stuck with a terrible graphics card, you might not be able to. your hopes of running llama3-405b just went down the gutter :P

however, there is a solution. you might not have an A100 collecting dust in your attic, but what you can do is quantize the models you're trying to run and pray that it works.

but wait, what is quantization?

quantization actually comes from digital signal processing, where we map a large (often continuous) range of values onto a much smaller set of discrete values.

so how do we apply it here?

for that, we have to briefly understand how model weights are actually stored. so let's delve into some cool math.

cool math time.

model weights are typically stored in single-precision (FP32) or, nowadays, half-precision (FP16) floating point. these take 4 bytes and 2 bytes per parameter respectively. so let's do some math.

for example, let's take some generic-70b model. during computation, the weights of all 70 billion parameters have to sit in gpu memory (which is probably why people use multiple gpus, but that's out of scope here), so:

70,000,000,000 (70 billion) params x 4 bytes per param = 280,000,000,000 bytes, or roughly 280 GB of gpu memory/vram.
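if you want to sanity-check that arithmetic (or try other model sizes), here's a tiny python sketch. the helper name is mine and it only counts the raw weight bytes from above, nothing framework-specific (activations, kv cache, etc. are ignored):

```python
# rough vram needed just to hold the weights
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9  # decimal GB

params_70b = 70_000_000_000

print(f"fp32: {weight_memory_gb(params_70b, 4):.0f} GB")  # 280 GB
print(f"fp16: {weight_memory_gb(params_70b, 2):.0f} GB")  # 140 GB
```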

280 GB is a HUGE amount. good luck finding the money to fund that. but this is where quantization comes into the picture.

with quantization, we basically trade off a bit of accuracy for a much smaller memory footprint: we can convert the weights from single-precision to int8, which takes only 1 byte per parameter compared to the original 4 bytes. so then:

70,000,000,000 (70 billion) params x 1 byte per param = 70,000,000,000 bytes, or 70 GB of memory. that's way better! obviously, you still need a high-end gpu (or a few) to run this, but it's a lot more reachable than the unquantized model.
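to make that fp32 → int8 step concrete, here's a minimal sketch of symmetric (absmax) quantization with numpy. the function names are mine, and real libraries do this per-channel or per-block with more care, but the core idea is exactly this scale-round-clip mapping:

```python
import numpy as np

def quantize_absmax(weights: np.ndarray):
    """map fp32 values to int8 using a single symmetric scale."""
    scale = np.abs(weights).max() / 127.0               # largest value maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """recover approximate fp32 values from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)            # pretend this is a weight matrix
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())         # small, but not zero -> that's the accuracy cost
```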

tradeoff.

it's important to understand that quantization costs you accuracy: the more aggressive the quantization, the more accuracy you lose. there are ways to mitigate this, but keep it in mind. broadly, we have two kinds of quantization:

post-training quantization (ptq), as the name suggests, quantizes the model after training is complete. this is the cheapest way to quantize a model, but it's also the most prone to accuracy loss.

quantization-aware training (qat) bakes quantization into training itself: the forward pass simulates the quantized weights, so the loss already accounts for the quantization error and the model learns to compensate for it. by the end of training, the weights can be quantized with minimal loss in accuracy.

qat generally preserves accuracy better than ptq, since the model trains with the quantization error in the loop, but it costs extra training time and compute. ptq, on the other hand, is something you can do to an already-trained model in minutes.
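to show how cheap the ptq route can be, here's a minimal sketch using pytorch's dynamic quantization (one flavour of post-training quantization). the toy model is made up, but `torch.ao.quantization.quantize_dynamic` is the real api call:

```python
import torch
import torch.nn as nn

# a toy stand-in for "some generic model"
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# post-training dynamic quantization: weights of the listed layer types
# are converted to int8, no retraining needed
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # still works, just smaller and faster on cpu
```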

other methods.

there are other quantization methods and formats as well, such as qlora, gptq, gguf (formerly ggml), awq, etc. briefly: gptq and awq are post-training methods that choose the quantized weights more carefully (gptq goes layer by layer using second-order information, awq protects the weights that matter most based on activation statistics), qlora quantizes a base model to 4-bit and fine-tunes small lora adapters on top of it, and gguf is the file format (with its own family of quantization schemes) that llama.cpp uses to run models on ordinary cpus and gpus.

notes and resources.

so now that you know a little bit about how quantization works, you could try quantizing a model yourself and be happy about the fact that you can run larger models on your computer without spending thousands on a gpu.
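if you want to try it, here's a rough sketch of the easiest route i know of: loading a model in 4-bit through huggingface transformers + bitsandbytes (the same nf4 scheme qlora uses). the model id is just a placeholder; swap in whatever fits your vram and that you have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder: pick any model you can actually download

# 4-bit nf4 quantization applied at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across whatever gpu/cpu memory you have
)

inputs = tokenizer("quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```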