
attention, attention!

if you had thought this blog was supposed to give you the attention that you didn't receive as a child, then i'm afraid you're in the wrong place :D but i do want to talk about attention, just in a different context. the paper "Attention Is All You Need" is probably the most important deep learning paper to date. models like GPT-4, BERT, etc. all use some variation of the self-attention mechanism it originally described.

but what is self-attention? what is attention really?

attention, in the context of deep learning, is the idea that if we can work out which parts of the data are relevant and which parts are not, the model can improve its performance by focusing on the relevant information alone. for example, you can take a look at the visualization here. a warning: this might look complicated, but i will try to explain it as simply as possible.



for each word in a sentence, say "I ate an apple.", we calculate how that word interacts with every other word in the sentence, producing an "attention score" for each interaction.
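
to make this concrete, here's a tiny numpy sketch (the words and the 3-dimensional embedding numbers are completely made up for illustration, they are not real model values): every word's vector is compared with every other word's vector, and a softmax turns each row of raw scores into attention weights that sum to 1.

import numpy as np

# made-up 3-dimensional embeddings for the four words in "I ate an apple."
words = ["I", "ate", "an", "apple"]
embeddings = np.array([
    [0.2, 0.9, 0.1],   # I
    [0.8, 0.3, 0.5],   # ate
    [0.1, 0.1, 0.9],   # an
    [0.7, 0.6, 0.2],   # apple
])

# one raw score per (word, word) pair -- a 4x4 grid of interactions
raw_scores = embeddings @ embeddings.T

# softmax turns each row of scores into attention weights that sum to 1
exp_scores = np.exp(raw_scores - raw_scores.max(axis=-1, keepdims=True))
attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

for word, row in zip(words, attention_weights):
    print(word, np.round(row, 2))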

that's it? what makes it so cool then?

the mathematics behind the attention mechanism is astoundingly simple, yet almost mindblowing. but first we have to understand a few terms: "query", "key" and "value".

  • the query is roughly what a word is looking for, the keys are what every word offers to be matched against, and the values are the information each word actually carries. the attention mechanism finds the attention scores by measuring the interaction between the query and the keys. let's take a look at the equation

    Attention(Q, K, V) = softmax(QKᵀ / √d_model) V

    here, it is to be noted that Q, K and V are matrices of dimensions seq_length × d_model, where seq_length is the number of words (tokens) in the sentence and d_model is the size of each word's embedding vector.
  • what are the different types of attention?

    mainly, two different categories exist: self-attention and cross-attention.

    self-attention is when the matrices Q, K and V all come from the same input, so the attention mechanism is basically the input interacting with itself, whereas cross-attention is when at least one of them comes from a different input.

    here we will talk about multi-head attention, which sounds complicated but really just means there are many different sets of query, key and value matrices; we compute the attention for each set and concatenate the results at the end (a rough numpy sketch of this is at the very end of this post). the equation changes ever so slightly here:

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

    there are other variations of the attention mechanism as well, but the most commonly used one is multi-head attention.

    how do i code it on my own?

    it's very simple. the code i will be showing is from my repository: from-scratch.

    only numpy was used, purely for understanding purposes. for self-attention, one can code it like so:

                    
    import numpy as np

    class SelfAttention:
        """
        This class implements the self-attention mechanism for a single head.

        Args:
            input_matrix (ndarray): The input matrix of shape (seq_length, d_model).
            seq_length (int, optional): The sequence length. Defaults to 6.
            d_model (int, optional): The dimensionality of the model. Defaults to 512.

        Attributes:
            seq_length (int): The sequence length.
            d_model (int): The dimensionality of the model.
            input_matrix (ndarray): The input matrix of shape (seq_length, d_model).

        Methods:
            forward(): Performs the forward pass of the self-attention mechanism.
                Returns:
                    ndarray: The output matrix of shape (seq_length, d_model).
        """
        def __init__(self, input_matrix: np.ndarray, seq_length: int = 6, d_model: int = 512):
            self.seq_length = seq_length
            self.d_model = d_model
            self.input_matrix = input_matrix

        def forward(self):
            # in self-attention, query, key and value all come from the same input
            query, key, value = self.input_matrix, self.input_matrix, self.input_matrix

            # scaled dot-product: (Q . K^T) / sqrt(d_model) -> (seq_length, seq_length) scores
            scores = np.dot(query, key.T) / np.sqrt(self.d_model)

            # numerically stable softmax over the last axis turns scores into weights
            prob_num = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
            prob_denom = prob_num.sum(axis=-1, keepdims=True)
            attention_weights = prob_num / prob_denom

            # weighted sum of the values -> (seq_length, d_model)
            out = np.dot(attention_weights, value)

            return out
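
    to try the class above, you can feed it any matrix of shape (seq_length, d_model); here the random input is just a stand-in for real word embeddings:

    x = np.random.rand(6, 512)                      # stand-in for 6 word embeddings
    attention = SelfAttention(input_matrix=x, seq_length=6, d_model=512)
    out = attention.forward()
    print(out.shape)                                # (6, 512)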
                    
                
    you can take a look at my other code in the repository linked above.
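
    and in case you're wondering how the multi-head variant from earlier might look in the same bare-numpy style, here's a rough sketch (this one is not from the repository, it's just an illustration: the projection matrices are randomly initialised and untrained, and the little softmax helper is defined inline):

    import numpy as np

    def softmax(x):
        # numerically stable softmax over the last axis
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(x, num_heads=8, d_model=512, seed=0):
        # x has shape (seq_length, d_model); each head works in a smaller space of size d_k = d_model / num_heads
        d_k = d_model // num_heads
        rng = np.random.default_rng(seed)

        heads = []
        for _ in range(num_heads):
            # every head gets its own (randomly initialised, untrained) projection matrices
            w_q = rng.standard_normal((d_model, d_k))
            w_k = rng.standard_normal((d_model, d_k))
            w_v = rng.standard_normal((d_model, d_k))

            q, k, v = x @ w_q, x @ w_k, x @ w_v    # (seq_length, d_k) each
            scores = (q @ k.T) / np.sqrt(d_k)      # (seq_length, seq_length)
            heads.append(softmax(scores) @ v)      # (seq_length, d_k)

        # concatenate every head's output and mix them with a final output projection
        concat = np.concatenate(heads, axis=-1)    # (seq_length, d_model)
        w_o = rng.standard_normal((d_model, d_model))
        return concat @ w_o                        # (seq_length, d_model)

    print(multi_head_attention(np.random.rand(6, 512)).shape)  # (6, 512)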