
LLM Data Preprocessing
Preprocessing text data is one of the most important steps in training Large Language Models (LLMs). The first step in this process is tokenization: the act of converting raw text into smaller, manageable units called tokens.
1. Tokenization
Tokenization can be broadly categorized into three …
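As a rough sketch of what that first step looks like in code, here is a hypothetical word-level tokenizer; the actual schemes the post categorizes are not shown here:

```python
import re

def tokenize(text: str) -> list[str]:
    # Split raw text into word and punctuation tokens (whitespace is dropped).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Preprocessing text data is important."))
# ['Preprocessing', 'text', 'data', 'is', 'important', '.']
```

Real LLM pipelines typically use subword tokenizers rather than whole words, but the basic idea of mapping raw text to a list of discrete units is the same.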
📘 3. Positional Embeddings
🧠 Why Do We Need Positional Embeddings?
In the embedding layer, the same tokens get mapped to the same vector representation. That means the model naturally has no idea about token order. For example, consider these two sentences:
✅ Dog bites man
❌ Man bites dog
Same words, totally different …
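A minimal PyTorch sketch of this point, with made-up token IDs (dog = 1, bites = 2, man = 3) and learned positional embeddings as an assumption:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

tok_emb = nn.Embedding(10, 4)   # toy vocabulary of 10 tokens, 4-dim vectors
pos_emb = nn.Embedding(8, 4)    # learned embeddings for up to 8 positions

a = torch.tensor([1, 2, 3])     # "Dog bites man"
b = torch.tensor([3, 2, 1])     # "Man bites dog"
pos = torch.arange(3)

# Token embeddings alone: "dog" looks identical at position 0 and position 2.
print(torch.equal(tok_emb(a)[0], tok_emb(b)[2]))               # True

# Token + positional embeddings: the same token now carries where it occurs.
print(torch.equal((tok_emb(a) + pos_emb(pos))[0],
                  (tok_emb(b) + pos_emb(pos))[2]))             # False
```

Whether the positions are learned vectors or a fixed scheme, the role is the same: inject order information that the embedding lookup alone cannot provide.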
🧠 2. Token Embeddings in LLMs
They’re also often called vector embeddings or word embeddings.
🔤 From Tokens to Token Embeddings
Before a model can “read” anything, it first breaks text into smaller chunks called tokens. Each token is then assigned a unique token ID, a simple integer from the vocabulary. For example: Text …
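A quick sketch of that lookup in PyTorch, with hypothetical token IDs and toy sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim = 50_000, 256            # toy sizes, not real model values
embedding = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([2301, 917, 5604])  # made-up IDs from the tokenizer
vectors = embedding(token_ids)               # one learned vector per token ID

print(vectors.shape)   # torch.Size([3, 256])
```

The embedding table is just a trainable matrix of shape (vocab_size, emb_dim); looking up a token ID returns its row, and those rows are updated during training like any other weights.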