LLM Concepts Deep Dive: The Stuff I Wish Someone Explained Simply

Date: February 20, 2026
Purpose: Personal reference + blog draft for anyone starting out with AI/LLM concepts
When I first started learning about AI and LLMs, I hit a wall of jargon. Tokens, embeddings, attention, temperature, context windows, RAG, fine-tuning -- every article assumed you already knew the last article. I'd read something, nod along, and realize 10 minutes later I had no idea what I just read.
What actually helped was getting my hands dirty. Running local models, breaking things, building agentic workflows, messing with parameters until something clicked. Then I'd go back to those same articles and research papers and it was all "aha" moments. Suddenly the jargon made sense because I'd seen it in action.
This post is my attempt to simplify these concepts for anyone who's just starting out. No PhD required. If you already know this stuff, cool -- skip ahead. And honestly, this is also a reminder for myself to come back to whenever I forget how something works.
The Basics
What Even Is an LLM?
An LLM (Large Language Model) is a program that predicts the next word. That's it. Everything else is built on top of that one trick.
You type: "The capital of France is"
It predicts: "Paris"
It does this by having read an enormous amount of text during training and learning statistical patterns about which words tend to follow which other words. It doesn't "know" things the way you know things. It's really good at pattern matching.
The Transformer Architecture
Almost every modern LLM is built on the transformer architecture (the T in GPT). Before transformers, we had models that read text one word at a time, left to right. Transformers can look at the entire input at once and figure out which parts matter most for each word.
Think of it like reading a book:
Old approach (RNN): Read word by word, try to remember everything
Transformer: Scan the whole page, highlight what matters, then write your response
The key innovation is the attention mechanism -- more on that below.
Tokens
How LLMs Read
LLMs don't read words. They read tokens -- chunks of text that might be a word, part of a word, or even a single character.
"Hello, how are you?" = ["Hello", ",", " how", " are", " you", "?"]
= 6 tokens
"Anthropic" = ["Anthrop", "ic"]
= 2 tokens
"I'm" = ["I", "'m"]
= 2 tokens
Why Tokens Matter
| Concept | Why It Matters |
|---|---|
| Cost | API pricing is per token (input + output) |
| Speed | More tokens = slower generation |
| Context window | Measured in tokens, not words |
| Rough conversion | ~1 token = ~0.75 words (English) |
So when someone says "128k context window" they mean 128,000 tokens, which is roughly 96,000 words or about 300 pages of text.
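The conversions above are easy to sketch in code. This is a heuristic only (roughly 4 characters per token for English, ~0.75 words per token) -- real counts vary by model and come from the model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text (~4 characters per token).

    A heuristic only -- actual counts come from the model's tokenizer."""
    return max(1, round(len(text) / 4))

def estimate_pages(num_tokens: int, words_per_page: int = 300) -> float:
    """Convert a token budget to a rough page count (~0.75 words per token)."""
    return num_tokens * 0.75 / words_per_page

print(estimate_tokens("Hello, how are you?"))  # ~5
print(estimate_pages(128_000))                 # ~320 pages
```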
Tokenization
Different models use different tokenizers. The same sentence might be 10 tokens in one model and 12 in another. This is why you can't directly compare token counts across models.
Common tokenizers:
BPE (Byte-Pair Encoding): Used by GPT models, Claude
SentencePiece: Used by Llama, Mistral
WordPiece: Used by BERT
You don't need to memorize these. Just know that tokenization is the first step -- raw text goes in, tokens come out, and the model works with tokens from that point on.
Embeddings
How LLMs Understand
Once text is split into tokens, each token gets converted into a vector -- a list of numbers that represents its meaning in a high-dimensional space.
"king" = [0.2, 0.8, -0.3, 0.5, ...] (hundreds of dimensions)
"queen" = [0.2, 0.7, -0.3, 0.6, ...] (similar! close in space)
"car" = [-0.5, 0.1, 0.9, -0.2, ...] (very different)
The Famous Example
The classic embedding arithmetic:
king - man + woman = queen
This works because embeddings capture semantic relationships as directions in space. "Male to female" is a direction. "Singular to plural" is a direction. The model learns these during training.
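You can check the arithmetic with toy vectors and cosine similarity, the standard "how close are two embeddings" measure. The numbers below are made up for illustration -- real embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors, invented for illustration.
king  = [0.9, 0.8, 0.1, 0.2]
man   = [0.8, 0.1, 0.1, 0.1]
woman = [0.8, 0.1, 0.9, 0.1]
queen = [0.9, 0.8, 0.9, 0.2]

# king - man + woman should land near queen
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(cosine_similarity(result, queen))  # close to 1.0
print(cosine_similarity(result, man))    # noticeably lower
```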
Why Embeddings Matter
Similarity search: Find documents that are semantically similar (not just keyword matching)
RAG: Store embeddings of your documents, search by meaning
Clustering: Group similar concepts together automatically
When people talk about "vector databases" (Pinecone, Chroma, Weaviate), they're storing embeddings and searching through them efficiently.
Attention
How LLMs Focus
Attention is the mechanism that lets the model decide which parts of the input matter most for generating each output token.
When the model sees: "The cat sat on the mat because it was tired"
It needs to figure out what "it" refers to. The attention mechanism assigns weights:
"it" pays attention to:
"cat" -> 0.85 (high! "it" = "the cat")
"mat" -> 0.05 (low)
"sat" -> 0.03 (low)
"The" -> 0.02 (low)
...
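Here's a toy version of how those weights get computed -- scaled dot-product attention for a single query, with made-up 2-d vectors chosen so "it" lines up with "cat":

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention for a single query vector:
    score each key against the query, scale by sqrt(dimension), softmax."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Invented 2-d vectors for illustration.
tokens   = ["The", "cat", "sat", "mat"]
keys     = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.3, 0.2]]
it_query = [0.9, 0.7]

for token, w in zip(tokens, attention_weights(it_query, keys)):
    print(f"{token}: {w:.2f}")  # "cat" gets the largest weight
```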
Self-Attention vs Cross-Attention
Self-attention: The input looks at itself (each token looks at every other token in the same sequence)
Cross-attention: The output looks at the input (used in encoder-decoder models like the original transformer)
Most modern LLMs (GPT, Claude, Llama) are decoder-only and use self-attention exclusively.
Multi-Head Attention
The model doesn't just have one attention pattern -- it has multiple "heads" that each learn to focus on different things:
Head 1 might track grammatical relationships
Head 2 might track semantic meaning
Head 3 might track position/distance
Head 4 might track some pattern we can't even name
A model with 32 attention heads is looking at the input 32 different ways simultaneously.
KV Cache
When generating text token by token, the model doesn't want to recompute attention from scratch each time. The KV (Key-Value) cache stores the attention computations for previous tokens so only the new token needs full computation.
This is why:
Long contexts use a lot of VRAM (the KV cache grows with context length)
The first token is slow but subsequent tokens are fast (the whole prompt is processed once to fill the cache; each new token then reuses it)
Some quantization methods target the KV cache to reduce memory
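You can sanity-check the VRAM claim with a back-of-envelope formula: keys plus values, for every layer, every KV head, every head dimension, every position. The model shape below is my assumption (loosely Llama-8B-like), not a spec:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """Back-of-envelope KV cache size: keys + values (the leading 2)
    for every layer, KV head, and position. bytes_per_value=2 assumes FP16."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative numbers: 32 layers, 8 KV heads (grouped-query attention),
# head_dim 128, a 32k-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_000)
print(f"{size / 1024**3:.1f} GiB")  # grows linearly with context length
```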
Context Window
Short-Term Memory
The context window is how much text the model can "see" at once. Everything outside the window doesn't exist to the model.
| Model | Context Window | Roughly |
|---|---|---|
| GPT-3 (original) | 2k tokens | ~3 pages |
| GPT-4 Turbo | 128k tokens | ~300 pages |
| Claude 3.5 Sonnet | 200k tokens | ~500 pages |
| Gemini 1.5 Pro | 2M tokens | ~5,000 pages |
The "Lost in the Middle" Problem
Models tend to pay more attention to the beginning and end of their context window. Information buried in the middle can get overlooked. This is a known limitation.
Practical implications:
Put important instructions at the beginning (system prompt)
Put the immediate question/task at the end
Don't rely on the model perfectly recalling a detail from page 200 of a 500-page context
Context Window vs Memory
The context window is not memory in the human sense. When the conversation exceeds the window:
Old messages get dropped (or summarized, depending on implementation)
The model has zero knowledge of what was discussed before the window
There is no persistent storage between sessions unless you build it
This is why agentic systems need external memory (files, databases, knowledge graphs).
Temperature and Sampling
Creativity Controls
When the model predicts the next token, it doesn't just pick one -- it calculates probabilities for every possible token and then samples from that distribution.
Temperature controls how "creative" vs "predictable" the output is:
Prompt: "The sky is"
Temperature 0.0 (deterministic):
"blue" -> always picks the highest probability
Temperature 0.7 (balanced):
"blue" (60%), "clear" (20%), "beautiful" (10%), "dark" (5%), ...
Might pick any of these
Temperature 1.5 (wild):
"blue" (30%), "clear" (15%), "screaming" (8%), "purple" (7%), ...
Much more random, might say weird things
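Temperature is easy to see in code: divide the log-probabilities by T, re-normalize, and watch the distribution sharpen or flatten. The probabilities below are invented for the "The sky is" example:

```python
import math

def apply_temperature(probs, temperature):
    """Reshape a probability distribution with temperature.

    Divide log-probabilities by T, then re-normalize.
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-token distribution (remaining mass omitted for simplicity):
words = ["blue", "clear", "beautiful", "dark"]
probs = [0.60, 0.20, 0.10, 0.05]

for t in (0.2, 0.7, 1.5):
    reshaped = apply_temperature(probs, t)
    print(t, [f"{w}:{p:.2f}" for w, p in zip(words, reshaped)])
```

At T=0.2 nearly all the mass piles onto "blue"; at T=1.5 the alternatives become genuinely likely picks.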
Other Sampling Parameters
| Parameter | What It Does | Practical Use |
|---|---|---|
| Temperature | Controls randomness | 0 = deterministic, 1+ = creative |
| Top-P (nucleus) | Only consider the smallest set of tokens whose probabilities sum to P | 0.9 = drop the unlikely tail |
| Top-K | Only consider the K most likely tokens | 40 = only top 40 choices |
| Repetition penalty | Penalize tokens that already appeared | Prevents loops and repetition |
| Max tokens | Hard cap on output length | Prevents runaway generation |
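Top-P is the least intuitive of these, so here's a sketch: keep the most likely tokens until their cumulative probability reaches P, drop the tail, renormalize. The distribution is invented for illustration:

```python
def top_p_filter(token_probs, p=0.9):
    """Nucleus sampling filter: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize the survivors."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"blue": 0.60, "clear": 0.20, "beautiful": 0.10,
         "dark": 0.05, "screaming": 0.05}
print(top_p_filter(probs, p=0.9))  # "dark" and "screaming" are dropped
```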
For most practical work:
Coding: Temperature 0-0.3 (you want deterministic, correct code)
Creative writing: Temperature 0.7-1.0
General chat: Temperature 0.5-0.7
Training vs Inference
Learning vs Using
These are two completely different phases:
Training (Learning)
Happens once (expensive, takes weeks/months on thousands of GPUs)
The model reads massive amounts of text
Adjusts its weights to get better at predicting the next token
Costs millions of dollars for frontier models
You (probably) don't do this
Inference (Using)
Happens every time you chat with the model
The model uses its trained weights to generate text
Can run on your laptop with quantized models
Costs per-token via API, or free if running locally
This is what you do every day
Training Phases
Most LLMs go through multiple training phases:
Pre-training: Read the internet, learn language patterns (the expensive part)
Supervised Fine-Tuning (SFT): Train on curated instruction-response pairs to follow directions
RLHF/RLAIF: Reinforcement Learning from Human (or AI) Feedback -- learn what good vs bad responses look like
Safety training: Learn to refuse harmful requests, stay within guidelines
The base model after pre-training is like a very knowledgeable but chaotic entity. SFT and RLHF turn it into something that actually follows instructions and has conversations.
Fine-Tuning
Teaching New Tricks
Fine-tuning takes a pre-trained model and trains it further on specific data. Instead of training from scratch (billions of dollars), you're adjusting an existing model (maybe a few hundred dollars).
Types of Fine-Tuning
Full Fine-Tuning:
Update all model weights
Expensive, needs lots of VRAM
Best results but overkill for most use cases
LoRA (Low-Rank Adaptation):
Only train a small adapter on top of the frozen base model
10-100x cheaper than full fine-tuning
The adapter is tiny (MBs vs GBs)
Can stack multiple LoRAs on one base model
QLoRA:
LoRA but on a quantized base model
Even cheaper -- fine-tune a 70B model on a single GPU
Slight quality trade-off
When to Fine-Tune vs Not
| Use Case | Better Approach |
|---|---|
| "I want the model to know about my company's docs" | RAG (not fine-tuning) |
| "I want the model to write in a specific style" | Fine-tuning |
| "I want the model to follow a specific output format" | Prompt engineering first, fine-tune if that fails |
| "I want domain-specific knowledge (medical, legal)" | Fine-tuning + RAG |
| "I want the model to use my API" | Tool use / function calling (not fine-tuning) |
The honest take: most people who think they need fine-tuning actually need better prompts or RAG. Fine-tuning is for when you've exhausted the other options.
RAG
Giving LLMs a Cheat Sheet
RAG (Retrieval-Augmented Generation) is a simple but powerful idea: before asking the model to answer, first search your own data for relevant information and stuff it into the prompt.
Without RAG:
User: "What's our refund policy?"
Model: "I don't know your specific refund policy." (or makes something up)
With RAG:
1. Search your documents for "refund policy"
2. Find the relevant policy document
3. Stuff it into the prompt:
"Based on this document: [refund policy text]
Answer the user's question: What's our refund policy?"
4. Model gives an accurate answer grounded in your data
RAG Pipeline
User Question
|
v
[Embed the question] -> vector
|
v
[Search vector database] -> find similar document chunks
|
v
[Stuff top results into prompt]
|
v
[LLM generates answer using the retrieved context]
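The pipeline above, as a minimal sketch. Real systems score chunks with vector embeddings; to keep this runnable I'm using plain word overlap as a stand-in for embedding similarity, and the documents are invented:

```python
import string

def words(text):
    """Lowercase, strip punctuation, split into a set of words."""
    clean = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(clean.split())

def score(question, chunk):
    """Stand-in for embedding similarity: fraction of question words
    that appear in the chunk."""
    q = words(question)
    return len(q & words(chunk)) / len(q)

def retrieve(question, chunks, top_k=1):
    """Return the top_k most relevant chunks."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:top_k]

def build_prompt(question, chunks, top_k=1):
    """Stuff the retrieved chunks into the prompt, question last."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return f"Based on these documents:\n{context}\n\nAnswer the question: {question}"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping takes 3 to 5 business days for domestic orders.",
    "The office is open Monday to Friday, 9am to 5pm.",
]
print(build_prompt("What is your refund policy?", docs))
```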
Chunking Strategies
Your documents need to be split into chunks before embedding. How you chunk matters:
Fixed size: Split every 500 tokens (simple but might break mid-sentence)
Semantic: Split at paragraph/section boundaries (better context preservation)
Recursive: Try large chunks first, split further if too big
Document-aware: Respect headers, code blocks, tables
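The fixed-size strategy with overlap looks like this (sizes in characters for simplicity -- real pipelines count tokens). The overlap keeps context that would otherwise be cut exactly at a chunk boundary:

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 500  # placeholder document
pieces = chunk_fixed(text, chunk_size=200, overlap=50)
print([len(p) for p in pieces])  # [200, 200, 200]
```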
When RAG Goes Wrong
Bad chunks: Split in the middle of important context
Bad embeddings: The search doesn't find relevant documents
Too much context: Stuffing 50 documents confuses the model
Stale data: Your vector database is outdated
Prompt Engineering
Talking to the Machine
Prompt engineering is the art of giving LLMs instructions that actually produce what you want. It sounds simple but makes a massive difference.
Key Techniques
System Prompts: The hidden instruction that sets the model's behavior. Every good system prompt includes role, constraints, and output format.
Few-Shot Examples: Show the model what you want by giving examples:
Convert to JSON:
Input: "John is 30 years old"
Output: {"name": "John", "age": 30}
Input: "Alice lives in London"
Output: {"name": "Alice", "city": "London"}
Input: "Bob is an engineer at Google"
Output:
The model picks up the pattern and continues it.
Chain of Thought (CoT): Ask the model to think step by step. This genuinely improves reasoning:
Bad: "What's 17 * 24?"
Good: "What's 17 * 24? Think through it step by step."
Structured Output: Tell the model exactly what format you want:
"Respond in this exact JSON format:
{
"summary": "...",
"sentiment": "positive|negative|neutral",
"confidence": 0.0-1.0
}"
Context Engineering
The Real Game
This is where it gets interesting. Prompt engineering is about crafting a single prompt. Context engineering is about designing the entire information environment the model operates in.
Think of it as the difference between writing a good email (prompt engineering) vs designing the entire briefing package for a decision maker (context engineering).
What Goes Into Context
[System prompt] <- who the model is, rules, format
[Tool definitions] <- what the model can do
[Retrieved documents] <- RAG results
[Conversation history] <- what was said before
[Working memory] <- scratchpad, intermediate results
[User message] <- the actual request
Every one of these affects output quality. Context engineering is about optimizing all of them together.
Practical Context Engineering
Prune irrelevant history: Don't send 50 turns of chat if only the last 5 matter
Summarize, don't truncate: When context gets long, summarize old messages instead of cutting them off
Order matters: Important stuff at the top and bottom, less important in the middle
Be specific about tools: Clear tool descriptions mean the model picks the right one
Dynamic system prompts: Change the system prompt based on what the user is doing
This is what separates a basic chatbot from a well-built agentic system.
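Here's what pruning history can look like in practice -- a sketch that keeps the system prompt plus the last few turns and marks the gap. A real system would summarize the dropped turns with another LLM call rather than just noting them:

```python
def prune_history(messages, keep_last=5):
    """Keep the system prompt and the last `keep_last` turns;
    replace everything in between with a one-line placeholder."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return system + rest
    note = {"role": "system",
            "content": f"[{len(rest) - keep_last} earlier messages omitted]"}
    return system + [note] + rest[-keep_last:]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user" if i % 2 == 0 else "assistant",
             "content": f"turn {i}"} for i in range(12)]

pruned = prune_history(history, keep_last=4)
print([m["content"] for m in pruned])
```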
Agents
LLMs That Do Things
A plain LLM just generates text. An agent is an LLM that can take actions -- read files, search the web, run code, call APIs.
The Agent Loop
1. User gives a task
2. LLM thinks about what to do
3. LLM picks a tool and calls it
4. Tool returns a result
5. LLM looks at the result
6. Go back to step 2 (or respond if done)
This loop is what makes agents powerful. The model can chain multiple actions together, adapt based on results, and handle tasks that require multiple steps.
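The loop fits in a few lines. `fake_llm` is a stand-in for a real model call and the tools are toys, but the shape (think, act, observe, repeat) is the real thing:

```python
def calculator(expression):
    # eval is fine for a toy; never do this with real model output.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_llm(task, observations):
    """Pretend model: decides the next action from what it has seen so far."""
    if not observations:
        return ("tool", "calculator", "17 * 24")
    return ("answer", f"The result is {observations[-1]}.")

def run_agent(task, max_steps=5):
    observations = []
    for _ in range(max_steps):           # the agent loop
        decision = fake_llm(task, observations)
        if decision[0] == "answer":      # done -- respond to the user
            return decision[1]
        _, tool_name, tool_input = decision
        observations.append(TOOLS[tool_name](tool_input))  # act, observe
    return "Gave up after too many steps."

print(run_agent("What is 17 * 24?"))
```

The `max_steps` cap matters: without it, a confused model can loop forever calling tools.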
Key Components
| Component | What It Does |
|---|---|
| LLM | The brain -- decides what to do next |
| Tools | The hands -- functions the LLM can call |
| Memory | Short-term (context) + long-term (files, DBs) |
| Orchestration | The loop that connects everything |
ReAct Pattern
Most agents follow the ReAct (Reasoning + Acting) pattern:
Thought: I need to find the user's config file
Action: search_files("config.json")
Observation: Found at /home/user/.config/app/config.json
Thought: Now I need to read it
Action: read_file("/home/user/.config/app/config.json")
Observation: {"theme": "dark", "language": "en"}
Thought: I have the information, I can answer now
Response: "Your config uses dark theme and English language."
The model explicitly reasons about what to do before doing it.
MCP
Giving Agents Hands
MCP (Model Context Protocol) is a standard for connecting LLMs to external tools and data sources. Think of it as USB for AI -- a universal way to plug in capabilities.
Before MCP
Every tool integration was custom:
OpenAI had function calling (their format)
Anthropic had tool use (their format)
Every app built their own integration layer
With MCP
One standard protocol. Build an MCP server once, any MCP client can use it.
| MCP Server (provides tools) | MCP Client (uses tools) |
|---|---|
| File system access | Claude Code |
| Database queries | Cursor |
| API integrations | Any MCP-compatible app |
| Web browsing | |
MCP Components
Server: Exposes tools, resources, and prompts
Client: Connects to servers, makes their tools available to the LLM
Transport: How they communicate (stdio, HTTP/SSE)
Why MCP Matters
If you're building agentic workflows, MCP means you write your tool integration once and it works everywhere. You don't rebuild the same database connector for every AI app.
Hallucinations
When LLMs Make Stuff Up
LLMs hallucinate. This is not a bug that will be fixed in the next version -- it's a fundamental property of how they work. They generate statistically plausible text, and sometimes plausible != true.
Types of Hallucination
| Type | Example |
|---|---|
| Factual | "The Eiffel Tower was built in 1920" (it was 1889) |
| Citation | "According to Smith et al. (2019)..." (paper doesn't exist) |
| Confident nonsense | A detailed but completely wrong technical explanation |
| Subtle errors | A mostly correct answer with one wrong detail buried in it |
Reducing Hallucinations
RAG: Ground responses in actual documents
Low temperature: Less creative = less hallucination
Ask for sources: "Cite your sources" (model might still hallucinate sources though)
Structured output: Force the model into a format that's easier to verify
Multiple passes: Ask the model to verify its own answer
Tool use: Let the model look things up instead of guessing
The honest truth: you cannot fully eliminate hallucinations. Always verify critical information.
Benchmarks
How We Measure
Benchmarks try to measure how "good" a model is. Take all of them with a grain of salt.
Common Benchmarks
| Benchmark | What It Tests |
|---|---|
| MMLU | General knowledge across 57 subjects |
| HumanEval | Code generation (writing Python functions) |
| MATH | Mathematical reasoning |
| GSM8K | Grade school math word problems |
| ARC | Science reasoning |
| HellaSwag | Common sense reasoning |
| TruthfulQA | Resistance to common misconceptions |
| MT-Bench | Multi-turn conversation quality |
Why Benchmarks Are Tricky
Teaching to the test: Models can be optimized for specific benchmarks
Contamination: If benchmark questions appear in training data, scores are inflated
Real-world gap: High benchmark scores don't always mean the model is better for your use case
Cherry picking: Companies show the benchmarks where they win
What Actually Matters
For practical work, the best benchmark is: does the model do what I need it to do? Try it on your actual tasks. A model that scores 2% lower on MMLU but is faster and cheaper might be the better choice for your use case.
Quick Reference Card
| Term | One-Line Explanation |
|---|---|
| Token | A chunk of text (~0.75 words) |
| Embedding | A number-list representing meaning |
| Attention | How the model decides what's important |
| Context window | How much text the model can see at once |
| Temperature | Randomness dial (0 = predictable, 1+ = creative) |
| Inference | Running the model to get output |
| Fine-tuning | Further training on specific data |
| LoRA | Cheap fine-tuning (small adapter, frozen base) |
| RAG | Search your docs, stuff into prompt |
| Agent | LLM + tools + loop |
| MCP | Model Context Protocol -- universal tool protocol for AI |
| Hallucination | Model generating plausible but false info |
| KV Cache | Stored attention computations for speed |
| RLHF | Training with human preference feedback |
| MoE | Multiple expert networks, only some active |
| Quantization | Compress model weights to use less memory |
This is a living document. I'll keep adding to it as I learn more and inevitably forget things again.