LLM Concepts Deep Dive: The Stuff I Wish Someone Explained Simply

Date: February 20, 2026
Purpose: Personal reference + blog draft for anyone starting out with AI/LLM concepts
When I first started learning about AI and LLMs, I hit a wall of jargon. Tokens, embeddings, attention, temperature, context windows, RAG, fine-tuning -- every article assumed you already knew the last article. I'd read something, nod along, and realize 10 minutes later I had no idea what I just read.
What actually helped was getting my hands dirty. Running local models, breaking things, building agentic workflows, messing with parameters until something clicked. Then I'd go back to those same articles and research papers and it was all "aha" moments. Suddenly the jargon made sense because I'd seen it in action.
This post is my attempt to simplify these concepts for anyone who's just starting out. No PhD required. If you already know this stuff, cool -- skip ahead. And honestly, this is also a reminder for myself to come back to whenever I forget how something works.
The Basics
What Even Is an LLM?
An LLM (Large Language Model) is a program that predicts the next word. That's it. Everything else is built on top of that one trick.
You type: "The capital of France is"
It predicts: "Paris"
It does this by having read an enormous amount of text during training and learning statistical patterns about which words tend to follow which other words. It doesn't "know" things the way you know things. It's really good at pattern matching.
The Transformer Architecture
Almost every modern LLM is built on the transformer architecture (the T in GPT). Before transformers, we had models that read text one word at a time, left to right. Transformers can look at the entire input at once and figure out which parts matter most for each word.
Think of it like reading a book:
Old approach (RNN): Read word by word, try to remember everything
Transformer: Scan the whole page, highlight what matters, then write your response
The key innovation is the attention mechanism -- more on that below.
Tokens
How LLMs Read
LLMs don't read words. They read tokens -- chunks of text that might be a word, part of a word, or even a single character.
"Hello, how are you?" = ["Hello", ",", " how", " are", " you", "?"]
= 6 tokens
"Anthropic" = ["Anthrop", "ic"]
= 2 tokens
"I'm" = ["I", "'m"]
= 2 tokens
Why Tokens Matter
| Concept | Why It Matters |
|---|---|
| Cost | API pricing is per token (input + output) |
| Speed | More tokens = slower generation |
| Context window | Measured in tokens, not words |
| Rough conversion | ~1 token = ~0.75 words (English) |
So when someone says "128k context window" they mean 128,000 tokens, which is roughly 96,000 words or about 300 pages of text.
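The conversions above are easy to sketch in code. This is a heuristic only (roughly 4 characters per token for English, ~0.75 words per token) -- real counts vary by model and come from the model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text (~4 characters per token).

    A heuristic only -- actual counts come from the model's tokenizer."""
    return max(1, round(len(text) / 4))

def estimate_pages(num_tokens: int, words_per_page: int = 300) -> float:
    """Convert a token budget to a rough page count (~0.75 words per token)."""
    return num_tokens * 0.75 / words_per_page

print(estimate_tokens("Hello, how are you?"))  # ~5
print(estimate_pages(128_000))                 # ~320 pages
```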
Tokenization
Different models use different tokenizers. The same sentence might be 10 tokens in one model and 12 in another. This is why you can't directly compare token counts across models.
Common tokenizers:
BPE (Byte-Pair Encoding): Used by GPT models, Claude
SentencePiece: Used by Llama, Mistral
WordPiece: Used by BERT
You don't need to memorize these. Just know that tokenization is the first step -- raw text goes in, tokens come out, and the model works with tokens from that point on.
Embeddings
How LLMs Understand
Once text is split into tokens, each token gets converted into a vector -- a list of numbers that represents its meaning in a high-dimensional space.
"king" = [0.2, 0.8, -0.3, 0.5, ...] (hundreds of dimensions)
"queen" = [0.2, 0.7, -0.3, 0.6, ...] (similar! close in space)
"car" = [-0.5, 0.1, 0.9, -0.2, ...] (very different)
The Famous Example
The classic embedding arithmetic:
king - man + woman = queen
This works because embeddings capture semantic relationships as directions in space. "Male to female" is a direction. "Singular to plural" is a direction. The model learns these during training.
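You can check the arithmetic with toy vectors and cosine similarity, the standard "how close are two embeddings" measure. The numbers below are made up for illustration -- real embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors, invented for illustration.
king  = [0.9, 0.8, 0.1, 0.2]
man   = [0.8, 0.1, 0.1, 0.1]
woman = [0.8, 0.1, 0.9, 0.1]
queen = [0.9, 0.8, 0.9, 0.2]

# king - man + woman should land near queen
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(cosine_similarity(result, queen))  # close to 1.0
print(cosine_similarity(result, man))    # noticeably lower
```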
Why Embeddings Matter
Similarity search: Find documents that are semantically similar (not just keyword matching)
RAG: Store embeddings of your documents, search by meaning
Clustering: Group similar concepts together automatically
When people talk about "vector databases" (Pinecone, Chroma, Weaviate), they're storing embeddings and searching through them efficiently.
Attention
How LLMs Focus
Attention is the mechanism that lets the model decide which parts of the input matter most for generating each output token.
When the model sees: "The cat sat on the mat because it was tired"
It needs to figure out what "it" refers to. The attention mechanism assigns weights:
"it" pays attention to:
"cat" -> 0.85 (high! "it" = "the cat")
"mat" -> 0.05 (low)
"sat" -> 0.03 (low)
"The" -> 0.02 (low)
...
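Here's a toy version of how those weights get computed -- scaled dot-product attention for a single query, with made-up 2-d vectors chosen so "it" lines up with "cat":

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention for a single query vector:
    score each key against the query, scale by sqrt(dimension), softmax."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Invented 2-d vectors for illustration.
tokens   = ["The", "cat", "sat", "mat"]
keys     = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.3, 0.2]]
it_query = [0.9, 0.7]

for token, w in zip(tokens, attention_weights(it_query, keys)):
    print(f"{token}: {w:.2f}")  # "cat" gets the largest weight
```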
Self-Attention vs Cross-Attention
Self-attention: The input looks at itself (each token looks at every other token in the same sequence)
Cross-attention: The output looks at the input (used in encoder-decoder models like the original transformer)
Most modern LLMs (GPT, Claude, Llama) are decoder-only and use self-attention exclusively.
Multi-Head Attention
The model doesn't just have one attention pattern -- it has multiple "heads" that each learn to focus on different things:
Head 1 might track grammatical relationships
Head 2 might track semantic meaning
Head 3 might track position/distance
Head 4 might track some pattern we can't even name
A model with 32 attention heads is looking at the input 32 different ways simultaneously.
KV Cache
When generating text token by token, the model doesn't want to recompute attention from scratch each time. The KV (Key-Value) cache stores the attention computations for previous tokens so only the new token needs full computation.
This is why:
Long contexts use a lot of VRAM (the KV cache grows with context length)
The first token is slow but subsequent tokens are fast (the whole prompt is processed once to fill the cache; each new token then reuses it)
Some quantization methods target the KV cache to reduce memory
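You can sanity-check the VRAM claim with a back-of-envelope formula: keys plus values, for every layer, every KV head, every head dimension, every position. The model shape below is my assumption (loosely Llama-8B-like), not a spec:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """Back-of-envelope KV cache size: keys + values (the leading 2)
    for every layer, KV head, and position. bytes_per_value=2 assumes FP16."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative numbers: 32 layers, 8 KV heads (grouped-query attention),
# head_dim 128, a 32k-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_000)
print(f"{size / 1024**3:.1f} GiB")  # grows linearly with context length
```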
Context Window
Short-Term Memory
The context window is how much text the model can "see" at once. Everything outside the window doesn't exist to the model.
| Model | Context Window | Roughly |
|---|---|---|
| GPT-3 (original) | 2k tokens | ~3 pages |
| GPT-4 Turbo | 128k tokens | ~300 pages |
| Claude 3.5 Sonnet | 200k tokens | ~500 pages |
| Gemini 1.5 Pro | 2M tokens | ~5,000 pages |
The "Lost in the Middle" Problem
Models tend to pay more attention to the beginning and end of their context window. Information buried in the middle can get overlooked. This is a known limitation.
Practical implications:
Put important instructions at the beginning (system prompt)
Put the immediate question/task at the end
Don't rely on the model perfectly recalling a detail from page 200 of a 500-page context
Context Window vs Memory
The context window is not memory in the human sense. When the conversation exceeds the window:
Old messages get dropped (or summarized, depending on implementation)
The model has zero knowledge of what was discussed before the window
There is no persistent storage between sessions unless you build it
This is why agentic systems need external memory (files, databases, knowledge graphs).
Temperature and Sampling
Creativity Controls
When the model predicts the next token, it doesn't just pick one -- it calculates probabilities for every possible token and then samples from that distribution.
Temperature controls how "creative" vs "predictable" the output is:
Prompt: "The sky is"
Temperature 0.0 (deterministic):
"blue" -> always picks the highest probability
Temperature 0.7 (balanced):
"blue" (60%), "clear" (20%), "beautiful" (10%), "dark" (5%), ...
Might pick any of these
Temperature 1.5 (wild):
"blue" (30%), "clear" (15%), "screaming" (8%), "purple" (7%), ...
Much more random, might say weird things
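Temperature is easy to see in code: divide the log-probabilities by T, re-normalize, and watch the distribution sharpen or flatten. The probabilities below are invented for the "The sky is" example:

```python
import math

def apply_temperature(probs, temperature):
    """Reshape a probability distribution with temperature.

    Divide log-probabilities by T, then re-normalize.
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-token distribution (remaining mass omitted for simplicity):
words = ["blue", "clear", "beautiful", "dark"]
probs = [0.60, 0.20, 0.10, 0.05]

for t in (0.2, 0.7, 1.5):
    reshaped = apply_temperature(probs, t)
    print(t, [f"{w}:{p:.2f}" for w, p in zip(words, reshaped)])
```

At T=0.2 nearly all the mass piles onto "blue"; at T=1.5 the alternatives become genuinely likely picks.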
Other Sampling Parameters
| Parameter | What It Does | Practical Use |
|---|---|---|
| Temperature | Controls randomness | 0 = deterministic, 1+ = creative |
| Top-P (nucleus) | Only consider the smallest set of tokens whose probabilities sum to P | 0.9 = drop the unlikely tail |
| Top-K | Only consider the K most likely tokens | 40 = only top 40 choices |
| Repetition penalty | Penalize tokens that already appeared | Prevents loops and repetition |
| Max tokens | Hard cap on output length | Prevents runaway generation |
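Top-P is the least intuitive of these, so here's a sketch: keep the most likely tokens until their cumulative probability reaches P, drop the tail, renormalize. The distribution is invented for illustration:

```python
def top_p_filter(token_probs, p=0.9):
    """Nucleus sampling filter: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize the survivors."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"blue": 0.60, "clear": 0.20, "beautiful": 0.10,
         "dark": 0.05, "screaming": 0.05}
print(top_p_filter(probs, p=0.9))  # "dark" and "screaming" are dropped
```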
For most practical work:
Coding: Temperature 0-0.3 (you want deterministic, correct code)
Creative writing: Temperature 0.7-1.0
General chat: Temperature 0.5-0.7
Training vs Inference
Learning vs Using
These are two completely different phases:
Training (Learning)
Happens once (expensive, takes weeks/months on thousands of GPUs)
The model reads massive amounts of text
Adjusts its weights to get better at predicting the next token
Costs millions of dollars for frontier models
You (probably) don't do this
Inference (Using)
Happens every time you chat with the model
The model uses its trained weights to generate text
Can run on your laptop with quantized models
Costs per-token via API, or free if running locally
This is what you do every day
Training Phases
Most LLMs go through multiple training phases:
Pre-training: Read the internet, learn language patterns (the expensive part)
Supervised Fine-Tuning (SFT): Train on curated instruction-response pairs to follow directions
RLHF/RLAIF: Reinforcement Learning from Human (or AI) Feedback -- learn what good vs bad responses look like
Safety training: Learn to refuse harmful requests, stay within guidelines
The base model after pre-training is like a very knowledgeable but chaotic entity. SFT and RLHF turn it into something that actually follows instructions and has conversations.
Fine-Tuning
Teaching New Tricks
Fine-tuning takes a pre-trained model and trains it further on specific data. Instead of training from scratch (billions of dollars), you're adjusting an existing model (maybe a few hundred dollars).
Types of Fine-Tuning
Full Fine-Tuning:
Update all model weights
Expensive, needs lots of VRAM
Best results but overkill for most use cases
LoRA (Low-Rank Adaptation):
Only train a small adapter on top of the frozen base model
10-100x cheaper than full fine-tuning
The adapter is tiny (MBs vs GBs)
Can stack multiple LoRAs on one base model
QLoRA:
LoRA but on a quantized base model
Even cheaper -- fine-tune a 70B model on a single GPU
Slight quality trade-off
When to Fine-Tune vs Not
| Use Case | Better Approach |
|---|---|
| "I want the model to know about my company's docs" | RAG (not fine-tuning) |
| "I want the model to write in a specific style" | Fine-tuning |
| "I want the model to follow a specific output format" | Prompt engineering first, fine-tune if that fails |
| "I want domain-specific knowledge (medical, legal)" | Fine-tuning + RAG |
| "I want the model to use my API" | Tool use / function calling (not fine-tuning) |
The honest take: most people who think they need fine-tuning actually need better prompts or RAG. Fine-tuning is for when you've exhausted the other options.
RAG
Giving LLMs a Cheat Sheet
RAG (Retrieval-Augmented Generation) is a simple but powerful idea: before asking the model to answer, first search your own data for relevant information and stuff it into the prompt.
Without RAG:
User: "What's our refund policy?"
Model: "I don't know your specific refund policy." (or makes something up)
With RAG:
1. Search your documents for "refund policy"
2. Find the relevant policy document
3. Stuff it into the prompt:
"Based on this document: [refund policy text]
Answer the user's question: What's our refund policy?"
4. Model gives an accurate answer grounded in your data
RAG Pipeline
User Question
|
v
[Embed the question] -> vector
|
v
[Search vector database] -> find similar document chunks
|
v
[Stuff top results into prompt]
|
v
[LLM generates answer using the retrieved context]
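The pipeline above, as a minimal sketch. Real systems score chunks with vector embeddings; to keep this runnable I'm using plain word overlap as a stand-in for embedding similarity, and the documents are invented:

```python
import string

def words(text):
    """Lowercase, strip punctuation, split into a set of words."""
    clean = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(clean.split())

def score(question, chunk):
    """Stand-in for embedding similarity: fraction of question words
    that appear in the chunk."""
    q = words(question)
    return len(q & words(chunk)) / len(q)

def retrieve(question, chunks, top_k=1):
    """Return the top_k most relevant chunks."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:top_k]

def build_prompt(question, chunks, top_k=1):
    """Stuff the retrieved chunks into the prompt, question last."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return f"Based on these documents:\n{context}\n\nAnswer the question: {question}"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping takes 3 to 5 business days for domestic orders.",
    "The office is open Monday to Friday, 9am to 5pm.",
]
print(build_prompt("What is your refund policy?", docs))
```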
Chunking Strategies
Your documents need to be split into chunks before embedding. How you chunk matters:
Fixed size: Split every 500 tokens (simple but might break mid-sentence)
Semantic: Split at paragraph/section boundaries (better context preservation)
Recursive: Try large chunks first, split further if too big
Document-aware: Respect headers, code blocks, tables
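The fixed-size strategy with overlap looks like this (sizes in characters for simplicity -- real pipelines count tokens). The overlap keeps context that would otherwise be cut exactly at a chunk boundary:

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 500  # placeholder document
pieces = chunk_fixed(text, chunk_size=200, overlap=50)
print([len(p) for p in pieces])  # [200, 200, 200]
```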
When RAG Goes Wrong
Bad chunks: Split in the middle of important context
Bad embeddings: The search doesn't find relevant documents
Too much context: Stuffing 50 documents confuses the model
Stale data: Your vector database is outdated
Prompt Engineering
Talking to the Machine
Prompt engineering is the art of giving LLMs instructions that actually produce what you want. It sounds simple but makes a massive difference.
Key Techniques
System Prompts: The hidden instruction that sets the model's behavior. Every good system prompt includes role, constraints, and output format.
Few-Shot Examples: Show the model what you want by giving examples:
Convert to JSON:
Input: "John is 30 years old"
Output: {"name": "John", "age": 30}
Input: "Alice lives in London"
Output: {"name": "Alice", "city": "London"}
Input: "Bob is an engineer at Google"
Output:
The model picks up the pattern and continues it.
Chain of Thought (CoT): Ask the model to think step by step. This genuinely improves reasoning:
Bad: "What's 17 * 24?"
Good: "What's 17 * 24? Think through it step by step."
Structured Output: Tell the model exactly what format you want:
"Respond in this exact JSON format:
{
"summary": "...",
"sentiment": "positive|negative|neutral",
"confidence": 0.0-1.0
}"
Context Engineering
The Real Game
This is where it gets interesting. Prompt engineering is about crafting a single prompt. Context engineering is about designing the entire information environment the model operates in.
Think of it as the difference between writing a good email (prompt engineering) vs designing the entire briefing package for a decision maker (context engineering).
What Goes Into Context
[System prompt] <- who the model is, rules, format
[Tool definitions] <- what the model can do
[Retrieved documents] <- RAG results
[Conversation history] <- what was said before
[Working memory] <- scratchpad, intermediate results
[User message] <- the actual request
Every one of these affects output quality. Context engineering is about optimizing all of them together.
Practical Context Engineering
Prune irrelevant history: Don't send 50 turns of chat if only the last 5 matter
Summarize, don't truncate: When context gets long, summarize old messages instead of cutting them off
Order matters: Important stuff at the top and bottom, less important in the middle
Be specific about tools: Clear tool descriptions mean the model picks the right one
Dynamic system prompts: Change the system prompt based on what the user is doing
This is what separates a basic chatbot from a well-built agentic system.
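Here's what pruning history can look like in practice -- a sketch that keeps the system prompt plus the last few turns and marks the gap. A real system would summarize the dropped turns with another LLM call rather than just noting them:

```python
def prune_history(messages, keep_last=5):
    """Keep the system prompt and the last `keep_last` turns;
    replace everything in between with a one-line placeholder."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return system + rest
    note = {"role": "system",
            "content": f"[{len(rest) - keep_last} earlier messages omitted]"}
    return system + [note] + rest[-keep_last:]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user" if i % 2 == 0 else "assistant",
             "content": f"turn {i}"} for i in range(12)]

pruned = prune_history(history, keep_last=4)
print([m["content"] for m in pruned])
```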
Agents
LLMs That Do Things
A plain LLM just generates text. An agent is an LLM that can take actions -- read files, search the web, run code, call APIs.
The Agent Loop
1. User gives a task
2. LLM thinks about what to do
3. LLM picks a tool and calls it
4. Tool returns a result
5. LLM looks at the result
6. Go back to step 2 (or respond if done)
This loop is what makes agents powerful. The model can chain multiple actions together, adapt based on results, and handle tasks that require multiple steps.
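The loop fits in a few lines. `fake_llm` is a stand-in for a real model call and the tools are toys, but the shape (think, act, observe, repeat) is the real thing:

```python
def calculator(expression):
    # eval is fine for a toy; never do this with real model output.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_llm(task, observations):
    """Pretend model: decides the next action from what it has seen so far."""
    if not observations:
        return ("tool", "calculator", "17 * 24")
    return ("answer", f"The result is {observations[-1]}.")

def run_agent(task, max_steps=5):
    observations = []
    for _ in range(max_steps):           # the agent loop
        decision = fake_llm(task, observations)
        if decision[0] == "answer":      # done -- respond to the user
            return decision[1]
        _, tool_name, tool_input = decision
        observations.append(TOOLS[tool_name](tool_input))  # act, observe
    return "Gave up after too many steps."

print(run_agent("What is 17 * 24?"))
```

The `max_steps` cap matters: without it, a confused model can loop forever calling tools.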
Key Components
| Component | What It Does |
|---|---|
| LLM | The brain -- decides what to do next |
| Tools | The hands -- functions the LLM can call |
| Memory | Short-term (context) + long-term (files, DBs) |
| Orchestration | The loop that connects everything |
ReAct Pattern
Most agents follow the ReAct (Reasoning + Acting) pattern:
Thought: I need to find the user's config file
Action: search_files("config.json")
Observation: Found at /home/user/.config/app/config.json
Thought: Now I need to read it
Action: read_file("/home/user/.config/app/config.json")
Observation: {"theme": "dark", "language": "en"}
Thought: I have the information, I can answer now
Response: "Your config uses dark theme and English language."
The model explicitly reasons about what to do before doing it.
MCP
Giving Agents Hands
MCP (Model Context Protocol) is a standard for connecting LLMs to external tools and data sources. Think of it as USB for AI -- a universal way to plug in capabilities.
Before MCP
Every tool integration was custom:
OpenAI had function calling (their format)
Anthropic had tool use (their format)
Every app built their own integration layer
With MCP
One standard protocol. Build an MCP server once, any MCP client can use it.
| MCP Server (provides tools) | MCP Client (uses tools) |
|---|---|
| File system access | Claude Code |
| Database queries | Cursor |
| API integrations | Any MCP-compatible app |
| Web browsing | |
MCP Components
Server: Exposes tools, resources, and prompts
Client: Connects to servers, makes their tools available to the LLM
Transport: How they communicate (stdio, HTTP/SSE)
Why MCP Matters
If you're building agentic workflows, MCP means you write your tool integration once and it works everywhere. You don't rebuild the same database connector for every AI app.
Hallucinations
When LLMs Make Stuff Up
LLMs hallucinate. This is not a bug that will be fixed in the next version -- it's a fundamental property of how they work. They generate statistically plausible text, and sometimes plausible != true.
Types of Hallucination
| Type | Example |
|---|---|
| Factual | "The Eiffel Tower was built in 1920" (it was 1889) |
| Citation | "According to Smith et al. (2019)..." (paper doesn't exist) |
| Confident nonsense | A detailed but completely wrong technical explanation |
| Subtle errors | A mostly correct answer with one wrong detail buried in it |
Reducing Hallucinations
RAG: Ground responses in actual documents
Low temperature: Less creative = less hallucination
Ask for sources: "Cite your sources" (model might still hallucinate sources though)
Structured output: Force the model into a format that's easier to verify
Multiple passes: Ask the model to verify its own answer
Tool use: Let the model look things up instead of guessing
The honest truth: you cannot fully eliminate hallucinations. Always verify critical information.
Benchmarks
How We Measure
Benchmarks try to measure how "good" a model is. Take all of them with a grain of salt.
Common Benchmarks
| Benchmark | What It Tests |
|---|---|
| MMLU | General knowledge across 57 subjects |
| HumanEval | Code generation (writing Python functions) |
| MATH | Mathematical reasoning |
| GSM8K | Grade school math word problems |
| ARC | Science reasoning |
| HellaSwag | Common sense reasoning |
| TruthfulQA | Resistance to common misconceptions |
| MT-Bench | Multi-turn conversation quality |
Why Benchmarks Are Tricky
Teaching to the test: Models can be optimized for specific benchmarks
Contamination: If benchmark questions appear in training data, scores are inflated
Real-world gap: High benchmark scores don't always mean the model is better for your use case
Cherry picking: Companies show the benchmarks where they win
What Actually Matters
For practical work, the best benchmark is: does the model do what I need it to do? Try it on your actual tasks. A model that scores 2% lower on MMLU but is faster and cheaper might be the better choice for your use case.
Quick Reference Card
| Term | One-Line Explanation |
|---|---|
| Token | A chunk of text (~0.75 words) |
| Embedding | A number-list representing meaning |
| Attention | How the model decides what's important |
| Context window | How much text the model can see at once |
| Temperature | Randomness dial (0 = predictable, 1+ = creative) |
| Inference | Running the model to get output |
| Fine-tuning | Further training on specific data |
| LoRA | Cheap fine-tuning (small adapter, frozen base) |
| RAG | Search your docs, stuff into prompt |
| Agent | LLM + tools + loop |
| MCP | Model Context Protocol -- universal tool protocol for AI |
| Hallucination | Model generating plausible but false info |
| KV Cache | Stored attention computations for speed |
| RLHF | Training with human preference feedback |
| MoE | Multiple expert networks, only some active |
| Quantization | Compress model weights to use less memory |
This is a living document. I'll keep adding to it as I learn more and inevitably forget things again.