Positional & Input Embeddings (Data Preprocessing)

3. Positional Embeddings
Why Do We Need Positional Embeddings?
In the token embedding layer, the same token is always mapped to the same vector representation, no matter where it appears in the sequence.
That means the model naturally has no idea about token order.
For example, consider these two sentences:
- Dog bites man
- Man bites dog
Same words, totally different meaning, purely because of word order.
So the model must somehow know which token came first, second, third, etc.
That's exactly what positional embeddings do.
Positional embeddings solve this by adding information about the position of each token in the sequence, letting the model know which token comes first, second, and so on.
Types of Positional Embeddings
Positional embeddings come in two main types:
1. Absolute Positional Embeddings
Each token position (e.g., 0, 1, 2, …) gets its own learned embedding vector.
Example:
| Position | Embedding Vector (simplified) |
|---|---|
| 0 | [0.12, -0.45, 0.33, …] |
| 1 | [0.56, 0.11, -0.07, …] |
| 2 | [-0.22, 0.43, 0.88, …] |
So the embedding for position 0 always means βfirst token,β position 1 means βsecond token,β and so on.
Used in: GPT models (GPT-1, GPT-2, GPT-3)
In these models, absolute position embeddings are learned along with the token embeddings: they are trainable parameters optimized during training.
Limitation:
They only work for sequence lengths seen during training.
If a model was trained on 512-token sequences, it can't easily generalize to 1024-token sequences, because positions beyond 512 have no embeddings.
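To make this limitation concrete, here is a minimal sketch using PyTorch's `torch.nn.Embedding` (the same layer the hands-on code below uses); the context length of 512 and dimension of 256 are just illustrative values:

```python
import torch

# A learned absolute positional embedding table for an illustrative context length of 512.
context_length = 512
emb_dim = 256
pos_embedding_layer = torch.nn.Embedding(context_length, emb_dim)

# Positions 0..511 each get their own trainable vector.
print(pos_embedding_layer(torch.arange(context_length)).shape)  # torch.Size([512, 256])

# Position 512 (or anything beyond) has no row in the table, so the lookup fails:
# the model simply has no representation for positions it never saw during training.
try:
    pos_embedding_layer(torch.tensor([512]))
except IndexError as err:
    print("No embedding for position 512:", err)
```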
2. Relative Positional Embeddings
Instead of representing absolute position, they encode the distance between tokens.
For example:
"This token is 3 steps after that token."
So, even if the sequence gets longer, the relative distances stay meaningful.
Advantage: works for variable-length or longer sequences not seen during training.
Used in: Transformer-XL, T5, DeBERTa, and modern architectures such as GPT-NeoX or Llama (via rotary position embeddings).
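As a rough illustration of the idea (a simplified sketch, not the exact relative-position scheme of T5 or DeBERTa): instead of one embedding per absolute position, you look up one embedding per pairwise distance between positions:

```python
import torch

seq_len = 5
emb_dim = 256

# Pairwise distances: entry (i, j) = j - i, i.e., how many steps token j is from token i.
positions = torch.arange(seq_len)
rel_dist = positions[None, :] - positions[:, None]   # shape [5, 5], values in [-4, 4]

# Shift distances into a non-negative range so they can index an embedding table.
max_dist = seq_len - 1
rel_embedding = torch.nn.Embedding(2 * max_dist + 1, emb_dim)  # one vector per possible distance
rel_vectors = rel_embedding(rel_dist + max_dist)               # shape [5, 5, 256]
print(rel_vectors.shape)
```

Because the lookup depends only on distances, a longer sequence just produces new (or clipped) distance values rather than absolute positions the model has never seen.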
Dimension of Positional Vectors
The dimension of the positional embeddings is always the same as that of the token embeddings.
This allows simple addition:
```python
input_embeddings = token_embeddings + positional_embeddings
```
Input Embeddings
In modern transformer-based models, the final input embedding is formed by combining two components:
**Input Embedding = Token Embedding + Positional Embedding**
Implementing Positional & Input Embeddings (Hands-On)
- Batch size = number of sequences processed in parallel (e.g., 8).
- Context length = number of tokens per sequence (e.g., 4).
- Embedding dimension = size of each token or position vector (e.g., 256).
So, before adding positional embeddings:
Input shape (token IDs): [8, 4]
- Each of the 8 sequences contains 4 tokens.
- Each token is mapped to a 256-dimensional vector by the token embedding layer:
Token embeddings shape: [8, 4, 256]
- Positional embeddings have the shape [4, 256] (one vector per position).
When added, they are broadcast across the batch, i.e., the same position vectors are added to every one of the 8 sequences.
After combining token + positional embeddings:
Final input embedding shape: [8, 4, 256]
So the final 3D tensor fed into the Transformer model has:
8 sequences × 4 tokens per sequence × 256 embedding dimensions
This 3D structure is what the Transformer processes in its attention layers.
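A quick shape-only sanity check of that broadcasting, using random tensors purely for illustration (the full pipeline with real text follows below):

```python
import torch

token_embeddings = torch.randn(8, 4, 256)  # [batch_size, context_length, emb_dim]
pos_embeddings = torch.randn(4, 256)       # [context_length, emb_dim]

# PyTorch broadcasts the positional embeddings across the batch dimension,
# so the same position vectors are added to every sequence in the batch.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)  # torch.Size([8, 4, 256])
```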
In **CODE**:
```python
import tiktoken  # GPT-2 tokenizer (BPE).
import torch  # Python deep learning framework.
from torch.utils.data import Dataset, DataLoader  # PyTorch base classes for building datasets.
# Dataset    -> defines how data is stored and accessed.
# DataLoader -> defines how data is batched, shuffled, and fed into the model.


class GPTDatasetV1(Dataset):  # GPTDatasetV1 inherits from PyTorch's Dataset class.
    def __init__(self, txt, tokenizer, max_length, stride):
        # max_length - how long each training sequence should be.
        # stride     - how far to move the sliding window between chunks (controls overlap).
        self.input_ids = []
        self.target_ids = []  # two lists initialised first.

        # Tokenize the entire text.
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        # tokenizer is the tiktoken encoder created in create_dataloader_v1 below;
        # encode() converts the raw text into a list of token IDs.

        # Use a sliding window to chunk the text into overlapping sequences of max_length.
        for i in range(0, len(token_ids) - max_length, stride):
            # We stop at len(token_ids) - max_length because otherwise the target slice
            # i + 1 : i + max_length + 1 would run past the end of the list.
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]  # targets = inputs shifted by one token.
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  # tells PyTorch how many samples the dataset contains.
        return len(self.input_ids)

    def __getitem__(self, idx):  # asking for one training example (dataset[idx]) returns the pair
        # (input_tokens, target_tokens).
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4,   # number of sequences per batch.
                         max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0        # number of worker processes for data loading (0 = main process).
                         ):

    # Initialize the tokenizer.
    tokenizer = tiktoken.get_encoding("gpt2")  # loads the same tokenizer used by the GPT-2 model.

    # Create dataset.
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader.
    dataloader = DataLoader(
        dataset,  # the DataLoader calls the dataset's __getitem__ to fetch (input, target) pairs.
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()


vocab_size = 50257  # GPT-2 vocabulary size (number of token IDs).
output_dim = 256    # embedding dimension.

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
```
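Assuming "the-verdict.txt" is available, the printed shapes should come out as torch.Size([8, 4]) for the token IDs, torch.Size([8, 4, 256]) for the token embeddings, torch.Size([4, 256]) for the positional embeddings, and torch.Size([8, 4, 256]) for the final input embeddings, matching the walkthrough above (the token ID values themselves depend on the text).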