as you set out to become the best deep learning engineer (or whatever you plan on doing in the field), there will be a desire to run these large
models on your pc. but if you're unlucky like me, stuck with a terrible graphics card,
you might not be able to. your hopes of running llama3-405b just went down the
gutter :P
however, there is a solution. though you might not have an A100 collecting
dust in your attic, what you can do is quantize the models you're
trying to run and pray that it works.
quantization actually comes from digital signal processing, where we map
a large range of values onto a smaller, discrete set of values.
so how do we apply it here?
for that, we have to briefly understand how model weights are actually stored. so let's
delve into some cool math.
model weights are typically stored in single-precision (FP32) or, nowadays, half-precision (FP16)
format. these take 4 bytes and 2 bytes per parameter, respectively. so let's do some math.
for example: let us take some generic-70b model. now, during computation, the weights
of all 70 billion parameters are held on the gpu (probably why they use multiple gpus, but
that's not the scope here), so:
70,000,000,000 (70 billion) x 4 bytes per param = 280,000,000,000 bytes, or 280 GB of gpu memory/vram.
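the arithmetic above is easy to wrap in a tiny helper, so you can eyeball the vram needed for any model size and precision (this counts weights only; activations, kv cache, etc. would add more on top):

```python
# rough vram estimate for holding the model weights alone
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9  # 1 GB ~ 1e9 bytes here

params = 70_000_000_000  # our generic-70b model

print(weight_memory_gb(params, 4))  # fp32: 280.0 GB
print(weight_memory_gb(params, 2))  # fp16: 140.0 GB
print(weight_memory_gb(params, 1))  # int8: 70.0 GB
```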
this is a HUGE amount. good luck finding the money to fund that yourself. but this is where
quantization comes into the picture.
with quantization, we basically trade off a bit of precision for memory, converting the weights from
single-precision to int8 format, which takes only 1 byte per parameter compared to the 4 bytes initially. so then:
70,000,000,000 (70 billion) x 1 byte per param = 70,000,000,000 bytes, or 70 GB of memory. that's way
better! obviously you still need a high-end gpu to run this, but it's a lot cheaper than the unquantized
model.
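to make the fp32-to-int8 conversion concrete, here's a minimal sketch of absmax quantization in pure python (real quantizers work per-tensor or per-channel on actual gpu tensors, and the weight values below are just made-up numbers for illustration):

```python
# absmax int8 quantization: scale so the largest |weight| maps to 127,
# then round every weight to the nearest integer in [-127, 127]
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

# to use the weights again, multiply back by the scale
def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.2, 0.03, 2.4]      # made-up fp32 weights
q, s = quantize_int8(w)          # each value now fits in 1 byte
w_back = dequantize(q, s)        # close to w, but with rounding error
```

notice that `w_back` is not exactly `w` — the rounding step is exactly where the accuracy loss comes from.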
it's important to understand that through quantization, you lose accuracy. the more aggressive the quantization, the more accuracy you lose. there are ways to mitigate this, but it should be kept in mind. we also have different kinds of quantization, each with its own tradeoffs.
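you can see the "more aggressive = more loss" effect with a toy experiment (made-up weight values again): fewer bits means fewer representable levels, so the rounding grid gets coarser and the round-trip error grows.

```python
# worst-case round-trip error when quantizing to a given number of
# positive integer levels (127 for int8, 7 for int4 with absmax scaling)
def round_trip_error(weights, levels):
    scale = max(abs(w) for w in weights) / levels
    return max(abs(w - round(w / scale) * scale) for w in weights)

w = [0.5, -1.2, 0.03, 2.4]
err8 = round_trip_error(w, 127)  # int8 grid: fine steps
err4 = round_trip_error(w, 7)    # int4 grid: much coarser steps
# err4 comes out noticeably larger than err8
```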
there are other quantization methods and formats as well, such as qlora, gptq, ggml/gguf, awq, etc., each taking its own approach to keeping accuracy while shrinking the model.
so now that you know a little about how quantization works, you could try quantizing a model yourself and be happy about the fact that you can run larger models on your computer without spending thousands on a gpu.