Positional & Input Embeddings (Data Preprocessing)

πŸ“˜ 3. Positional Embeddings

🧠 Why Do We Need Positional Embeddings?

In the embedding layer, the same token is always mapped to the same vector, no matter where it appears in the sequence. That means the model has no built-in notion of token order.

For example, consider these two sentences:

  • βœ… Dog bites man
  • ❌ Man bites dog

Same words, totally different meaning β€” because of word order.

So the model must somehow know which token came first, second, third, etc.

That’s exactly what positional embeddings do. βœ…

πŸ‘‰ Positional embeddings solve this by adding information about the position of each token in the sequence β€” letting the model know who comes first, second, etc.


🧩 Types of Positional Embeddings

Positional embeddings come in two main types:

1. Absolute Positional Embeddings

Each token position (e.g., 0, 1, 2, …) gets its own learned embedding vector.

Example:

| Position | Embedding Vector (simplified) |
|----------|-------------------------------|
| 0        | [0.12, -0.45, 0.33, …]        |
| 1        | [0.56, 0.11, -0.07, …]        |
| 2        | [-0.22, 0.43, 0.88, …]        |

So the embedding for position 0 always means β€œfirst token,” position 1 means β€œsecond token,” and so on.

πŸ“Œ Used in: GPT models (GPT-1, GPT-2, GPT-3)
In these models, absolute position embeddings are learned along with token embeddings β€” they are trainable parameters optimized during model training.

🧠 Limitation:
They only work for sequence lengths seen during training.
If a model was trained on 512 tokens, it can’t easily generalize to 1024-token sequences because positions beyond 512 have no embeddings.
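To make this concrete, here is a minimal sketch of a learned absolute positional embedding layer in PyTorch. The context_length and embedding_dim values are illustrative assumptions, not taken from any real model:

import torch

context_length = 4   # assumed maximum sequence length seen during training
embedding_dim = 8    # small illustrative embedding dimension

# One learned vector per absolute position 0 .. context_length - 1
abs_pos_embedding = torch.nn.Embedding(context_length, embedding_dim)

positions = torch.arange(context_length)      # tensor([0, 1, 2, 3])
print(abs_pos_embedding(positions).shape)     # torch.Size([4, 8])

# Position 4 was never trained, so there is simply no row for it:
# abs_pos_embedding(torch.tensor([4]))        # raises IndexError (index out of range)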

2. Relative Positional Embeddings

Instead of representing absolute position, they encode the distance between tokens.

For example:

β€œThis token is 3 steps after that token.”

So, even if the sequence gets longer, the relative distances stay meaningful.

βœ… Advantage: Works for variable-length or longer sequences not seen during training.
πŸ’‘ Used in: Transformer-XL, T5, DeBERTa, and modern architectures like GPT-NeoX or Llama (with rotary embeddings).
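As a rough sketch of the idea (not the exact mechanism of any one of these models), the quantity relative schemes work with is the pairwise distance between token positions, which stays meaningful no matter how long the sequence grows:

import torch

seq_len = 5  # illustrative sequence length
positions = torch.arange(seq_len)

# relative_distance[i, j] = j - i, i.e. how many steps token j is from token i
relative_distance = positions[None, :] - positions[:, None]
print(relative_distance)
# tensor([[ 0,  1,  2,  3,  4],
#         [-1,  0,  1,  2,  3],
#         [-2, -1,  0,  1,  2],
#         [-3, -2, -1,  0,  1],
#         [-4, -3, -2, -1,  0]])

Models like T5 map these distances to learned biases inside attention, while rotary embeddings encode them by rotating the query and key vectors.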


βš™οΈ Dimension of Positional Vectors

The dimension of positional embeddings is always the same as that of the token embeddings.

This allows simple addition:

input_embeddings = token_embeddings + positional_embeddings

Input Embeddings

In modern transformer-based models, the final input embedding is formed by combining two components:

**Input Embedding = Token Embedding + Positional Embedding**

πŸ§ͺ Implementing Positional & Input Embeddings (Hands-On)

  • 🧩 Batch size = number of sequences processed in parallel (e.g., 8).
  • πŸ“ Context length = number of tokens per sequence (e.g., 4).
  • πŸ”’ Embedding dimension = size of each token or position vector (e.g., 256).

So, before adding positional embeddings:

Input shape (token IDs): [8, 4]

  • Each of the 8 sequences contains 4 tokens.
  • Each token is mapped to a 256-dimensional vector by the token embedding layer:

Token embeddings shape: [8, 4, 256]

  • Positional embeddings have shape [4, 256] (one 256-dimensional vector per position).
    When added, they are broadcast across the batch, i.e., added to every one of the 8 sequences.

After combining token + positional embeddings:

Final input embedding shape: [8, 4, 256]

βœ… So the final 3D tensor fed into the Transformer model has:

8 sequences per batch Γ— 4 tokens per sequence Γ— 256 embedding dimensions

This 3D structure is what the Transformer processes in its attention layers.
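Before walking through the full pipeline, here is a minimal sketch of just the broadcasting step, using random tensors with the shapes described above:

import torch

token_embeddings = torch.randn(8, 4, 256)  # [batch_size, context_length, embedding_dim]
pos_embeddings = torch.randn(4, 256)       # [context_length, embedding_dim]

# [4, 256] is broadcast across the batch dimension and added to every sequence
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)              # torch.Size([8, 4, 256])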

In **CODE**:

import tiktoken  # GPT-2 tokenizer.
import torch  # Python deep learning framework.
from torch.utils.data import Dataset, DataLoader  # PyTorch base classes for building datasets.
# Dataset β†’ defines how data is stored and accessed.
# DataLoader β†’ defines how data is batched, shuffled, and fed into the model.

class GPTDatasetV1(Dataset):  # the brackets mean GPTDatasetV1 inherits from the Dataset class.
    def __init__(self, txt, tokenizer, max_length, stride):
        # max_length - how long each training sequence should be.
        # stride - how far to move the sliding window between chunks (controls overlap).
        self.input_ids = []
        self.target_ids = []  # two lists initialised first.

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        # tokenizer is an object; encode() is a method from the tiktoken library.
        # The tokenizer itself is created in create_dataloader_v1 below.

        # Use a sliding window to chunk the text into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            # We stop at len(token_ids) - max_length so that both the input chunk and the
            # target chunk (shifted by one token) are always full max_length slices.
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  # tells PyTorch how many samples are in the dataset.
        return len(self.input_ids)

    def __getitem__(self, idx):  # when you ask for one training example (dataset[idx]),
        # it returns a pair: (input_tokens, target_tokens)
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4,  # how many sequences are stacked into one batch
                         max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0  # number of subprocesses for data loading (0 = main process)
                         ):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")  # loads the same tokenizer used by the GPT-2 model.

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,  # the DataLoader calls __getitem__ on the dataset and returns (input, target) pairs.
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()


vocab_size = 50257  # number of token IDs in the GPT-2 vocabulary
output_dim = 256    # embedding dimension

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
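Assuming "the-verdict.txt" contains enough text to fill at least one batch, the printed shapes match the walkthrough above: the token-ID batch is torch.Size([8, 4]), the token embeddings are torch.Size([8, 4, 256]), the positional embeddings are torch.Size([4, 256]), and the final input embeddings are torch.Size([8, 4, 256]), the 3D tensor that gets fed into the Transformer.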