
LLM Concepts Deep Dive: The Stuff I Wish Someone Explained Simply


Date: February 20, 2026
Purpose: Personal reference + blog draft for anyone starting out with AI/LLM concepts


When I first started learning about AI and LLMs, I hit a wall of jargon. Tokens, embeddings, attention, temperature, context windows, RAG, fine-tuning -- every article assumed you already knew the last article. I'd read something, nod along, and realize 10 minutes later I had no idea what I just read.

What actually helped was getting my hands dirty. Running local models, breaking things, building agentic workflows, messing with parameters until something clicked. Then I'd go back to those same articles and research papers and it was all "aha" moments. Suddenly the jargon made sense because I'd seen it in action.

This post is my attempt to simplify these concepts for anyone who's just starting out. No PhD required. If you already know this stuff, cool -- skip ahead. And honestly, this is also a reminder for myself to come back to whenever I forget how something works.


Table of Contents

  1. The Basics: What Even Is an LLM?

  2. Tokens: How LLMs Read

  3. Embeddings: How LLMs Understand

  4. Attention: How LLMs Focus

  5. Context Window: Short-Term Memory

  6. Temperature & Sampling: Creativity Controls

  7. Training vs Inference: Learning vs Using

  8. Fine-Tuning: Teaching New Tricks

  9. RAG: Giving LLMs a Cheat Sheet

  10. Prompt Engineering: Talking to the Machine

  11. Context Engineering: The Real Game

  12. Agents: LLMs That Do Things

  13. MCP: Giving Agents Hands

  14. Hallucinations: When LLMs Make Stuff Up

  15. Benchmarks: How We Measure


The Basics

What Even Is an LLM?

An LLM (Large Language Model) is a program that predicts the next word. That's it. Everything else is built on top of that one trick.

You type: "The capital of France is"
It predicts: "Paris"

It does this by having read an enormous amount of text during training and learning statistical patterns about which words tend to follow which other words. It doesn't "know" things the way you know things. It's really good at pattern matching.
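The "predict the next word from statistical patterns" idea can be shrunk down to a toy you can actually run: count which word follows which in a tiny corpus, then predict the most frequent follower. Real LLMs learn billions of soft weights instead of raw counts, but the shape of the task is the same. The corpus and names below are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus: the "training data". The model's entire knowledge is
# which word tends to follow which.
corpus = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of spain is madrid ."
).split()

# Count bigrams: for each word, how often each next word follows it.
follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))      # "capital" -- always followed "the" here
print(predict_next("capital"))  # "of"
```

Scale the counts up to billions of parameters and soften them into probabilities, and you have the core of an LLM.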

The Transformer Architecture

Almost every modern LLM is built on the transformer architecture (the T in GPT). Before transformers, we had models that read text one word at a time, left to right. Transformers can look at the entire input at once and figure out which parts matter most for each word.

Think of it like reading a book:

  • Old approach (RNN): Read word by word, try to remember everything

  • Transformer: Scan the whole page, highlight what matters, then write your response

The key innovation is the attention mechanism -- more on that below.


Tokens

How LLMs Read

LLMs don't read words. They read tokens -- chunks of text that might be a word, part of a word, or even a single character.

"Hello, how are you?" = ["Hello", ",", " how", " are", " you", "?"]
                       = 6 tokens

"Anthropic" = ["Anthrop", "ic"]
            = 2 tokens

"I'm" = ["I", "'m"]
      = 2 tokens

Why Tokens Matter

| Concept | Why It Matters |
| --- | --- |
| Cost | API pricing is per token (input + output) |
| Speed | More tokens = slower generation |
| Context window | Measured in tokens, not words |
| Rough conversion | ~1 token = ~0.75 words (English) |

So when someone says "128k context window" they mean 128,000 tokens, which is roughly 96,000 words or about 300 pages of text.

Tokenization

Different models use different tokenizers. The same sentence might be 10 tokens in one model and 12 in another. This is why you can't directly compare token counts across models.

Common tokenizers:

  • BPE (Byte-Pair Encoding): Used by GPT models, Claude

  • SentencePiece: Used by Llama, Mistral

  • WordPiece: Used by BERT

You don't need to memorize these. Just know that tokenization is the first step -- raw text goes in, tokens come out, and the model works with tokens from that point on.
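A real BPE tokenizer is built from learned merge rules, but the matching step can be sketched as greedy longest-match against a vocabulary. The tiny vocabulary below is invented for illustration; it is not any model's actual vocabulary.

```python
# Invented toy vocabulary -- not a real BPE merge table.
VOCAB = {"Anthrop", "ic", "Hello", "I", "'m", ",", " how", " are", " you", "?"}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match: at each position, take the longest vocab entry."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])         # unknown character = its own token
            i += 1
    return tokens

print(tokenize("Anthropic"))  # ['Anthrop', 'ic']
print(tokenize("I'm"))        # ['I', "'m"]
```

Swap in a different vocabulary and the same sentence splits differently, which is exactly why token counts aren't comparable across models.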


Embeddings

How LLMs Understand

Once text is split into tokens, each token gets converted into a vector -- a list of numbers that represents its meaning in a high-dimensional space.

"king"  = [0.2, 0.8, -0.3, 0.5, ...]   (hundreds of dimensions)
"queen" = [0.2, 0.7, -0.3, 0.6, ...]    (similar! close in space)
"car"   = [-0.5, 0.1, 0.9, -0.2, ...]   (very different)

The Famous Example

The classic embedding arithmetic:

king - man + woman = queen

This works because embeddings capture semantic relationships as directions in space. "Male to female" is a direction. "Singular to plural" is a direction. The model learns these during training.
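The arithmetic can be demonstrated with hand-made 4-dimensional vectors (real embeddings are learned and have hundreds of dimensions; these are chosen by hand purely to show the geometry). Similarity between vectors is measured with cosine similarity:

```python
import math

# Hand-made toy vectors: dim 0 = "male", dim 1 = "female",
# dim 2 = "royalty", dim 3 = "vehicle". Real embeddings are learned.
vecs = {
    "man":   [1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0],
    "king":  [1.0, 0.0, 1.0, 0.0],
    "queen": [0.0, 1.0, 1.0, 0.0],
    "car":   [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

nearest = max(vecs, key=lambda word: cosine(target, vecs[word]))
print(nearest)  # "queen"
```

The "male to female" direction here is literally the move from dimension 0 to dimension 1; trained models discover analogous directions on their own.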

Why Embeddings Matter

  • Similarity search: Find documents that are semantically similar (not just keyword matching)

  • RAG: Store embeddings of your documents, search by meaning

  • Clustering: Group similar concepts together automatically

When people talk about "vector databases" (Pinecone, Chroma, Weaviate), they're storing embeddings and searching through them efficiently.


Attention

How LLMs Focus

Attention is the mechanism that lets the model decide which parts of the input matter most for generating each output token.

When the model sees: "The cat sat on the mat because it was tired"

It needs to figure out what "it" refers to. The attention mechanism assigns weights:

"it" pays attention to:
  "cat"  -> 0.85 (high! "it" = "the cat")
  "mat"  -> 0.05 (low)
  "sat"  -> 0.03 (low)
  "The"  -> 0.02 (low)
  ...
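The weights above come from a softmax over query-key dot products. Here is a minimal sketch of scaled dot-product attention for a single query, with hand-made 2-d vectors (in a real model, queries and keys are learned projections of the token embeddings):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy vectors chosen so "it" points the same way as "cat".
tokens = ["The", "cat", "sat", "mat"]
keys = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.1, 0.3]]
it_query = [1.0, 1.0]

weights = attention_weights(it_query, keys)
for tok, w in zip(tokens, weights):
    print(f"{tok:>4}: {w:.2f}")  # "cat" gets the largest weight
```

The weights always sum to 1; the model then takes a weighted average of the value vectors using them.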

Self-Attention vs Cross-Attention

  • Self-attention: The input looks at itself (each token looks at every other token in the same sequence)

  • Cross-attention: The output looks at the input (used in encoder-decoder models like the original transformer)

Most modern LLMs (GPT, Claude, Llama) are decoder-only and use self-attention exclusively.

Multi-Head Attention

The model doesn't just have one attention pattern -- it has multiple "heads" that each learn to focus on different things:

  • Head 1 might track grammatical relationships

  • Head 2 might track semantic meaning

  • Head 3 might track position/distance

  • Head 4 might track some pattern we can't even name

A model with 32 attention heads is looking at the input 32 different ways simultaneously.

KV Cache

When generating text token by token, the model doesn't want to recompute attention from scratch each time. The KV (Key-Value) cache stores the attention computations for previous tokens so only the new token needs full computation.

This is why:

  • Long contexts use a lot of VRAM (the KV cache grows with context length)

  • First token is slow, subsequent tokens are faster (the whole prompt must be processed to build the cache before generation starts)

  • Some quantization methods target the KV cache to reduce memory
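A back-of-envelope calculation for the first point: the cache stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical (in the ballpark of a 7B-class dense model, not any specific architecture):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one vector of
    head_dim per KV head per token, 2 bytes per value for fp16/bf16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 32-layer model, 32 KV heads of dim 128, at a 128k context:
gb = kv_cache_bytes(32, 32, 128, seq_len=128_000) / 1024**3
print(f"{gb:.1f} GB")  # 62.5 GB -- grows linearly with context length
```

This is why techniques like grouped-query attention (fewer KV heads) and KV-cache quantization exist: each directly shrinks a factor in that product.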


Context Window

Short-Term Memory

The context window is how much text the model can "see" at once. Everything outside the window doesn't exist to the model.

| Model | Context Window | Roughly |
| --- | --- | --- |
| GPT-3 (original) | 2k tokens | ~3 pages |
| GPT-4 | 128k tokens | ~300 pages |
| Claude 3.5 Sonnet | 200k tokens | ~500 pages |
| Gemini 1.5 Pro | 2M tokens | ~5,000 pages |

The "Lost in the Middle" Problem

Models tend to pay more attention to the beginning and end of their context window. Information buried in the middle can get overlooked. This is a known limitation.

Practical implications:

  • Put important instructions at the beginning (system prompt)

  • Put the immediate question/task at the end

  • Don't rely on the model perfectly recalling a detail from page 200 of a 500-page context

Context Window vs Memory

The context window is not memory in the human sense. When the conversation exceeds the window:

  • Old messages get dropped (or summarized, depending on implementation)

  • The model has zero knowledge of what was discussed before the window

  • There is no persistent storage between sessions unless you build it

This is why agentic systems need external memory (files, databases, knowledge graphs).


Temperature and Sampling

Creativity Controls

When the model predicts the next token, it doesn't just pick one -- it calculates probabilities for every possible token and then samples from that distribution.

Temperature controls how "creative" vs "predictable" the output is:

Prompt: "The sky is"

Temperature 0.0 (deterministic):
  "blue" -> always picks the highest probability

Temperature 0.7 (balanced):
  "blue" (60%), "clear" (20%), "beautiful" (10%), "dark" (5%), ...
  Might pick any of these

Temperature 1.5 (wild):
  "blue" (30%), "clear" (15%), "screaming" (8%), "purple" (7%), ...
  Much more random, might say weird things
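Under the hood, temperature divides the model's raw scores (logits) before the softmax: low temperature sharpens the distribution toward the top token, high temperature flattens it. The logits below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature scales the logits."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(x - max(scaled)) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for four candidate next tokens after "The sky is".
tokens = ["blue", "clear", "dark", "screaming"]
logits = [4.0, 3.0, 2.0, 0.5]

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 2) for tok, p in zip(tokens, probs)})
```

At temperature 0.2, "blue" takes nearly all the probability mass; at 1.5, even "screaming" gets a real chance of being sampled.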

Other Sampling Parameters

| Parameter | What It Does | Practical Use |
| --- | --- | --- |
| Temperature | Controls randomness | 0 = deterministic, 1+ = creative |
| Top-P (nucleus) | Only consider the smallest set of tokens whose cumulative probability reaches P | 0.9 = ignore the unlikely tail |
| Top-K | Only consider the K most likely tokens | 40 = only top 40 choices |
| Repetition penalty | Penalize tokens that already appeared | Prevents loops and repetition |
| Max tokens | Hard cap on output length | Prevents runaway generation |
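Top-P is worth seeing concretely: rank the tokens, keep the smallest prefix whose cumulative probability reaches P, renormalize, and sample from what's left. A sketch, using the made-up probabilities from the temperature example:

```python
def top_p_filter(probs, p=0.9):
    """Nucleus filtering: keep the smallest set of tokens whose cumulative
    probability reaches p, then renormalize. probs: dict token -> prob."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"blue": 0.60, "clear": 0.20, "beautiful": 0.10,
         "dark": 0.05, "screaming": 0.05}
print(top_p_filter(probs, p=0.9))  # "dark" and "screaming" are cut
```

Unlike top-K, the number of surviving tokens adapts: a confident distribution keeps few candidates, an uncertain one keeps many.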

For most practical work:

  • Coding: Temperature 0-0.3 (you want deterministic, correct code)

  • Creative writing: Temperature 0.7-1.0

  • General chat: Temperature 0.5-0.7


Training vs Inference

Learning vs Using

These are two completely different phases:

Training (Learning)

  • Happens once (expensive, takes weeks/months on thousands of GPUs)

  • The model reads massive amounts of text

  • Adjusts its weights to get better at predicting the next token

  • Costs millions of dollars for frontier models

  • You (probably) don't do this

Inference (Using)

  • Happens every time you chat with the model

  • The model uses its trained weights to generate text

  • Can run on your laptop with quantized models

  • Costs per-token via API, or free if running locally

  • This is what you do every day

Training Phases

Most LLMs go through multiple training phases:

  1. Pre-training: Read the internet, learn language patterns (the expensive part)

  2. Supervised Fine-Tuning (SFT): Train on curated instruction-response pairs to follow directions

  3. RLHF/RLAIF: Reinforcement Learning from Human (or AI) Feedback -- learn what good vs bad responses look like

  4. Safety training: Learn to refuse harmful requests, stay within guidelines

The base model after pre-training is like a very knowledgeable but chaotic entity. SFT and RLHF turn it into something that actually follows instructions and has conversations.


Fine-Tuning

Teaching New Tricks

Fine-tuning takes a pre-trained model and trains it further on specific data. Instead of training from scratch (millions of dollars), you're adjusting an existing model (maybe a few hundred dollars).

Types of Fine-Tuning

Full Fine-Tuning:

  • Update all model weights

  • Expensive, needs lots of VRAM

  • Best results but overkill for most use cases

LoRA (Low-Rank Adaptation):

  • Only train a small adapter on top of the frozen base model

  • 10-100x cheaper than full fine-tuning

  • The adapter is tiny (MBs vs GBs)

  • Can stack multiple LoRAs on one base model

QLoRA:

  • LoRA but on a quantized base model

  • Even cheaper -- fine-tune a 70B model on a single GPU

  • Slight quality trade-off
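The "10-100x cheaper" claim falls out of simple arithmetic: LoRA replaces the update to a d_in x d_out weight matrix with two low-rank factors, A (d_in x rank) and B (rank x d_out). The layer size below is hypothetical, chosen only to make the ratio concrete:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters: full update vs. rank-r LoRA factors A and B."""
    full = d_in * d_out
    lora = d_in * rank + rank * d_out
    return full, lora

# One hypothetical 4096x4096 attention projection with rank-8 adapters:
full, lora = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Here the adapter has 256x fewer trainable parameters than the full matrix, which is why the saved adapter is megabytes while the base model is gigabytes.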

When to Fine-Tune vs Not

| Use Case | Better Approach |
| --- | --- |
| "I want the model to know about my company's docs" | RAG (not fine-tuning) |
| "I want the model to write in a specific style" | Fine-tuning |
| "I want the model to follow a specific output format" | Prompt engineering first, fine-tune if that fails |
| "I want domain-specific knowledge (medical, legal)" | Fine-tuning + RAG |
| "I want the model to use my API" | Tool use / function calling (not fine-tuning) |

The honest take: most people who think they need fine-tuning actually need better prompts or RAG. Fine-tuning is for when you've exhausted the other options.


RAG

Giving LLMs a Cheat Sheet

RAG (Retrieval-Augmented Generation) is a simple but powerful idea: before asking the model to answer, first search your own data for relevant information and stuff it into the prompt.

Without RAG:
  User: "What's our refund policy?"
  Model: "I don't know your specific refund policy." (or makes something up)

With RAG:
  1. Search your documents for "refund policy"
  2. Find the relevant policy document
  3. Stuff it into the prompt:
     "Based on this document: [refund policy text]
      Answer the user's question: What's our refund policy?"
  4. Model gives an accurate answer grounded in your data

RAG Pipeline

User Question
     |
     v
[Embed the question] -> vector
     |
     v
[Search vector database] -> find similar document chunks
     |
     v
[Stuff top results into prompt]
     |
     v
[LLM generates answer using the retrieved context]
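The whole pipeline fits in a few lines. This sketch fakes the embedding step with bag-of-words counts so it runs without a model; a real system would use a learned embedding model and a vector database, and the documents here are invented:

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in 'embedding': bag-of-words counts (real systems use a model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Refund policy: full refunds within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Careers: we are hiring engineers in Berlin.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector database"

def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

question = "What's our refund policy?"
context = retrieve(question)[0]
prompt = f"Based on this document: {context}\nAnswer the user's question: {question}"
print(prompt)
```

Swap `embed` for a real embedding model and `index` for a vector store, and the shape of the code barely changes.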

Chunking Strategies

Your documents need to be split into chunks before embedding. How you chunk matters:

  • Fixed size: Split every 500 tokens (simple but might break mid-sentence)

  • Semantic: Split at paragraph/section boundaries (better context preservation)

  • Recursive: Try large chunks first, split further if too big

  • Document-aware: Respect headers, code blocks, tables
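A common refinement of fixed-size chunking is adding overlap, so a sentence cut at one boundary still appears whole in the neighboring chunk. A sketch that chunks by word count (real pipelines usually count tokens):

```python
def chunk_fixed(words, chunk_size=100, overlap=20):
    """Fixed-size chunks over a list of words, with overlap between
    neighbors so boundary context appears in both chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(250)]
chunks = chunk_fixed(words, chunk_size=100, overlap=20)
print(len(chunks), [len(c) for c in chunks])  # 3 chunks: 100, 100, 90
```

The last 20 words of each chunk reappear as the first 20 of the next; that redundancy costs a little storage but saves retrieval quality at the seams.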

When RAG Goes Wrong

  • Bad chunks: Split in the middle of important context

  • Bad embeddings: The search doesn't find relevant documents

  • Too much context: Stuffing 50 documents confuses the model

  • Stale data: Your vector database is outdated


Prompt Engineering

Talking to the Machine

Prompt engineering is the art of giving LLMs instructions that actually produce what you want. It sounds simple but makes a massive difference.

Key Techniques

System Prompts: The hidden instruction that sets the model's behavior. Every good system prompt includes role, constraints, and output format.

Few-Shot Examples: Show the model what you want by giving examples:

Convert to JSON:
Input: "John is 30 years old"
Output: {"name": "John", "age": 30}

Input: "Alice lives in London"
Output: {"name": "Alice", "city": "London"}

Input: "Bob is an engineer at Google"
Output:

The model picks up the pattern and continues it.

Chain of Thought (CoT): Ask the model to think step by step. This genuinely improves reasoning:

Bad:  "What's 17 * 24?"
Good: "What's 17 * 24? Think through it step by step."

Structured Output: Tell the model exactly what format you want:

"Respond in this exact JSON format:
{
  "summary": "...",
  "sentiment": "positive|negative|neutral",
  "confidence": 0.0-1.0
}"
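In practice you send these patterns through a chat API as a list of role-tagged messages. The `{"role", "content"}` shape below is the common convention; exact field names vary by provider, so treat this as a sketch:

```python
def few_shot_messages(system, examples, user_input):
    """Build a chat-style message list: system prompt, then alternating
    user/assistant few-shot examples, then the real request."""
    messages = [{"role": "system", "content": system}]
    for example_in, example_out in examples:
        messages.append({"role": "user", "content": example_in})
        messages.append({"role": "assistant", "content": example_out})
    messages.append({"role": "user", "content": user_input})
    return messages

messages = few_shot_messages(
    system="Convert each sentence to JSON.",
    examples=[
        ('John is 30 years old', '{"name": "John", "age": 30}'),
        ('Alice lives in London', '{"name": "Alice", "city": "London"}'),
    ],
    user_input="Bob is an engineer at Google",
)
for m in messages:
    print(m["role"], ":", m["content"])
```

Putting the examples in as fake prior turns works because, to the model, its "own" earlier answers are just more pattern to continue.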

Context Engineering

The Real Game

This is where it gets interesting. Prompt engineering is about crafting a single prompt. Context engineering is about designing the entire information environment the model operates in.

Think of it as the difference between writing a good email (prompt engineering) vs designing the entire briefing package for a decision maker (context engineering).

What Goes Into Context

[System prompt]           <- who the model is, rules, format
[Tool definitions]        <- what the model can do
[Retrieved documents]     <- RAG results
[Conversation history]    <- what was said before
[Working memory]          <- scratchpad, intermediate results
[User message]            <- the actual request

Every one of these affects output quality. Context engineering is about optimizing all of them together.
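The layers above are ultimately just assembled into one big input. A minimal sketch of that assembly, with history pruning built in (real systems also count tokens and summarize rather than slice; all the names and content here are made up):

```python
def build_context(system, tools, retrieved, history, user_message, max_turns=5):
    """Assemble context layers top to bottom, keeping only recent history."""
    parts = [system]
    if tools:
        parts.append("Available tools:\n" + "\n".join(tools))
    if retrieved:
        parts.append("Relevant documents:\n" + "\n".join(retrieved))
    parts.extend(history[-max_turns:])  # prune: keep the last few turns only
    parts.append(user_message)
    return "\n\n".join(parts)

history = [f"turn {i}" for i in range(50)]
ctx = build_context(
    system="You are a support assistant.",
    tools=["search_orders(query)"],
    retrieved=["Refund policy: 30 days."],
    history=history,
    user_message="Can I return my order?",
)
print("turn 49" in ctx, "turn 0" in ctx)  # recent turn kept, old turn pruned
```

Every design decision in this section -- what to prune, what to summarize, what order to use -- happens inside a function like this one.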

Practical Context Engineering

  • Prune irrelevant history: Don't send 50 turns of chat if only the last 5 matter

  • Summarize, don't truncate: When context gets long, summarize old messages instead of cutting them off

  • Order matters: Important stuff at the top and bottom, less important in the middle

  • Be specific about tools: Clear tool descriptions mean the model picks the right one

  • Dynamic system prompts: Change the system prompt based on what the user is doing

This is what separates a basic chatbot from a well-built agentic system.


Agents

LLMs That Do Things

A plain LLM just generates text. An agent is an LLM that can take actions -- read files, search the web, run code, call APIs.

The Agent Loop

1. User gives a task
2. LLM thinks about what to do
3. LLM picks a tool and calls it
4. Tool returns a result
5. LLM looks at the result
6. Go back to step 2 (or respond if done)

This loop is what makes agents powerful. The model can chain multiple actions together, adapt based on results, and handle tasks that require multiple steps.
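The loop itself is short. This sketch replaces the LLM with a canned decision function and the tools with stubs so it runs standalone; every name and return value here is invented for illustration:

```python
# Stub tools: real ones would touch the file system, web, or APIs.
TOOLS = {
    "search_files": lambda q: "/home/user/.config/app/config.json",
    "read_file": lambda path: '{"theme": "dark", "language": "en"}',
}

def fake_llm(task, observations):
    """Stands in for the model: pick the next action from what we've seen.
    A real agent would send task + observations to an LLM here."""
    if not observations:
        return ("search_files", "config.json")
    if len(observations) == 1:
        return ("read_file", observations[0])
    return ("respond", f"Config contents: {observations[-1]}")

def agent_loop(task, max_steps=5):
    observations = []
    for _ in range(max_steps):
        action, arg = fake_llm(task, observations)
        if action == "respond":           # done: return the final answer
            return arg
        observations.append(TOOLS[action](arg))  # call tool, record result
    return "Gave up after max_steps."

print(agent_loop("What theme does my config use?"))
```

Note the `max_steps` cap: without it, a confused model can loop forever, so production agent frameworks always bound the loop somehow.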

Key Components

| Component | What It Does |
| --- | --- |
| LLM | The brain -- decides what to do next |
| Tools | The hands -- functions the LLM can call |
| Memory | Short-term (context) + long-term (files, DBs) |
| Orchestration | The loop that connects everything |

ReAct Pattern

Most agents follow the ReAct (Reasoning + Acting) pattern:

Thought: I need to find the user's config file
Action: search_files("config.json")
Observation: Found at /home/user/.config/app/config.json
Thought: Now I need to read it
Action: read_file("/home/user/.config/app/config.json")
Observation: {"theme": "dark", "language": "en"}
Thought: I have the information, I can answer now
Response: "Your config uses dark theme and English language."

The model explicitly reasons about what to do before doing it.


MCP

Giving Agents Hands

MCP (Model Context Protocol) is a standard for connecting LLMs to external tools and data sources. Think of it as USB for AI -- a universal way to plug in capabilities.

Before MCP

Every tool integration was custom:

  • OpenAI had function calling (their format)

  • Anthropic had tool use (their format)

  • Every app built their own integration layer

With MCP

One standard protocol. Build an MCP server once, any MCP client can use it.

MCP Server (provides tools)        MCP Client (uses tools)
  - File system access        <->    Claude Code
  - Database queries          <->    Cursor
  - API integrations          <->    Any MCP-compatible app
  - Web browsing              <->

MCP Components

  • Server: Exposes tools, resources, and prompts

  • Client: Connects to servers, makes tools available to the LLM

  • Transport: How they communicate (stdio, HTTP/SSE)

Why MCP Matters

If you're building agentic workflows, MCP means you write your tool integration once and it works everywhere. You don't rebuild the same database connector for every AI app.


Hallucinations

When LLMs Make Stuff Up

LLMs hallucinate. This is not a bug that will be fixed in the next version -- it's a fundamental property of how they work. They generate statistically plausible text, and sometimes plausible != true.

Types of Hallucination

| Type | Example |
| --- | --- |
| Factual | "The Eiffel Tower was built in 1920" (it was 1889) |
| Citation | "According to Smith et al. (2019)..." (paper doesn't exist) |
| Confident nonsense | A detailed but completely wrong technical explanation |
| Subtle errors | A mostly correct answer with one wrong detail buried in it |

Reducing Hallucinations

  • RAG: Ground responses in actual documents

  • Low temperature: Less creative = less hallucination

  • Ask for sources: "Cite your sources" (model might still hallucinate sources though)

  • Structured output: Force the model into a format that's easier to verify

  • Multiple passes: Ask the model to verify its own answer

  • Tool use: Let the model look things up instead of guessing

The honest truth: you cannot fully eliminate hallucinations. Always verify critical information.


Benchmarks

How We Measure

Benchmarks try to measure how "good" a model is. Take all of them with a grain of salt.

Common Benchmarks

| Benchmark | What It Tests |
| --- | --- |
| MMLU | General knowledge across 57 subjects |
| HumanEval | Code generation (writing Python functions) |
| MATH | Mathematical reasoning |
| GSM8K | Grade school math word problems |
| ARC | Science reasoning |
| HellaSwag | Common sense reasoning |
| TruthfulQA | Resistance to common misconceptions |
| MT-Bench | Multi-turn conversation quality |

Why Benchmarks Are Tricky

  • Teaching to the test: Models can be optimized for specific benchmarks

  • Contamination: If benchmark questions appear in training data, scores are inflated

  • Real-world gap: High benchmark scores don't always mean the model is better for your use case

  • Cherry picking: Companies show the benchmarks where they win

What Actually Matters

For practical work, the best benchmark is: does the model do what I need it to do? Try it on your actual tasks. A model that scores 2% lower on MMLU but is faster and cheaper might be the better choice for your use case.


Quick Reference Card

| Term | One-Line Explanation |
| --- | --- |
| Token | A chunk of text (~0.75 words) |
| Embedding | A list of numbers representing meaning |
| Attention | How the model decides what's important |
| Context window | How much text the model can see at once |
| Temperature | Randomness dial (0 = predictable, 1+ = creative) |
| Inference | Running the model to get output |
| Fine-tuning | Further training on specific data |
| LoRA | Cheap fine-tuning (small adapter, frozen base) |
| RAG | Search your docs, stuff results into the prompt |
| Agent | LLM + tools + loop |
| MCP | Model Context Protocol -- universal tool protocol for AI |
| Hallucination | Model generating plausible but false info |
| KV cache | Stored attention computations for speed |
| RLHF | Training with human preference feedback |
| MoE | Multiple expert networks, only some active per token |
| Quantization | Compress model weights to use less memory |


This is a living document. I'll keep adding to it as I learn more and inevitably forget things again.