
Eight Layers of AI: LLM, Token, Agent, and How They Connect

Aaron

Introduction

I recently took the time to systematically review the core concepts behind modern AI, and I realized that while I had encountered each one individually, spelling out how they actually depend on each other took more effort than I expected. The biggest insight after putting it all together: these concepts aren’t isolated. They’re stacked layer by layer, with each one filling a gap left by the one below it. This post is my attempt to trace that stack from the ground up.

LLM: Where It All Begins

LLM stands for Large Language Model. Pretty much every major model today is built on the Transformer architecture1, first proposed by Google in the 2017 paper Attention Is All You Need. The ironic part: Google lit the spark, but OpenAI was the one who turned it into a wildfire. GPT-3.5 dropped in late 2022, and just months later GPT-4 pushed the ceiling of what AI could do to a whole new level. The GPT family remains an industry benchmark today, though Claude, Gemini, and others are competing hard in their respective strengths.

The underlying mechanism is surprisingly straightforward: it’s essentially a word-prediction game. You give it some input, it predicts the most probable next word, appends that word to the input, predicts the next one, and so on until it emits a special end-of-sequence token.

That’s also why ChatGPT streams its responses word by word. It’s not a network issue; it’s just how the model fundamentally works.
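The prediction loop described above can be sketched in a few lines. This is a toy sketch: the hypothetical `next_token` function (a hard-coded lookup table here) stands in for the actual model, which would compute a probability distribution over its vocabulary.

```python
# Toy sketch of autoregressive decoding. Assumption: a greedy loop with a
# hypothetical next_token() standing in for the real model.
END = "<eos>"  # special end-of-sequence token

def next_token(tokens):
    # Stand-in for the model: a hard-coded continuation table.
    table = {
        ("The",): "sky",
        ("The", "sky"): "is",
        ("The", "sky", "is"): "blue",
        ("The", "sky", "is", "blue"): END,
    }
    return table[tuple(tokens)]

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while True:
        tok = next_token(tokens)   # predict the most probable next token
        if tok == END:             # stop at the end-of-sequence token
            break
        tokens.append(tok)         # append it and predict again
    return tokens

print(generate(["The"]))  # ['The', 'sky', 'is', 'blue']
```

Each pass through the loop emits one token, which is exactly why output can be streamed to the user as it is produced.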

Token: The Atom of Language Processing

Calling it a “word-prediction game” is a simplification. In reality, an LLM is a massive mathematical function running matrix operations. It only understands numbers, not text. Between you and the model sits a translation layer called the Tokenizer.

The Tokenizer does two things:

  1. Split: Break the input text into the smallest possible fragments, each called a Token
  2. Map: Convert each Token to a number (Token ID), since the model only processes numbers

For instance, the sentence “Today’s weather is great” might be split into [“Today”, “’s”, “weather”, “is”, “great”], each mapped to a distinct number. On the output side, the process runs in reverse: numbers get mapped back to text.
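The split-and-map steps can be illustrated with a toy vocabulary. This is only a sketch: real tokenizers (e.g. BPE-based ones) learn their splitting rules from data rather than using a hand-built table.

```python
# Toy tokenizer sketch. Assumption: a tiny hand-built vocabulary; real
# tokenizers learn subword splits from a training corpus.
vocab = {"Today": 0, "'s": 1, "weather": 2, "is": 3, "great": 4}
inv_vocab = {i: t for t, i in vocab.items()}

def encode(tokens):
    return [vocab[t] for t in tokens]      # Map: Token -> Token ID

def decode(ids):
    return [inv_vocab[i] for i in ids]     # Reverse: Token ID -> Token

ids = encode(["Today", "'s", "weather", "is", "great"])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # ['Today', "'s", 'weather', 'is', 'great']
```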

But here’s the catch: a Token is not the same thing as a “word.” The English word “unbelievable” might get split into “un”, “believ”, and “able”. A single emoji can require three or more Tokens. Tokenization is a set of rules the model learns during training, and it doesn’t correspond neatly to how we think about words.

As a rough estimate, one Token equals about 0.75 English words or 1.5 to 2 Chinese characters. 400K Tokens is roughly 300K English words2.

Context and Context Window

Ever wondered how an LLM “remembers” what you said earlier? It doesn’t actually have memory.

The trick is that every time you send a message, the application behind the scenes bundles your entire conversation history along with your new question and sends it all to the model. The model sees the full picture every time, which is how it knows what came before.
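A minimal sketch of this bundling, assuming an OpenAI-style `messages` list (the structure, not the API, is the point; the stub "model" here just reports how many messages it was shown):

```python
# Sketch of how a chat app rebuilds the model's "memory" on every turn.
# Assumption: an OpenAI-style list of role/content messages.
history = []

def ask(user_message, call_model):
    history.append({"role": "user", "content": user_message})
    # The FULL history is sent every time; the model itself stores nothing.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# A stub model that just reports how much context it received.
stub = lambda msgs: f"I can see {len(msgs)} message(s)"
print(ask("Hello", stub))     # I can see 1 message(s)
print(ask("And now?", stub))  # I can see 3 message(s)
```

Notice that the second call sees three messages: the growing history is what creates the illusion of memory, and it is also what eventually fills up the Context Window.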

This brings us to Context: everything the model receives when processing a request, including the user’s message, conversation history, system instructions, tool lists, and more. Think of it as the model’s short-term memory.

How large can this memory be? That’s where Context Window comes in. It defines the maximum number of Tokens a Context can hold. Current mainstream models offer impressive Context Windows: GPT-5.4 at 1.05M Tokens, Gemini 3.1 Pro and Claude Opus 4.6 at 1M Tokens each. 1M Tokens is roughly 750K English words; you could fit the entire Harry Potter series.

But Context Window isn’t a silver bullet. If you have a thousand-page product manual and want the model to answer user questions based on it, stuffing the whole thing into context is technically possible but financially impractical. That’s where RAG (Retrieval-Augmented Generation)3 shines: it retrieves only the most relevant snippets from the manual and sends just those to the model, sidestepping both Context Window limits and cost concerns.
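The retrieval idea can be sketched with a crude stand-in for similarity search. Assumption: word-overlap scoring replaces the vector similarity matching a real RAG pipeline uses, but the flow is the same, i.e. rank chunks, keep the top few, and send only those as Context.

```python
import re

# Toy RAG retrieval sketch. Assumption: word overlap stands in for
# vector similarity; the chunks are an illustrative mini "manual".
chunks = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Battery life is about eight hours under normal use.",
]

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, chunks, k=1):
    # Score each chunk by shared words with the question; only the
    # top-k snippets are sent to the model, not the whole manual.
    ranked = sorted(chunks, key=lambda c: len(words(question) & words(c)),
                    reverse=True)
    return ranked[:k]

print(retrieve("How do I reset the device?", chunks))
# ['To reset the device, hold the power button for ten seconds.']
```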

Prompt: Telling the Model What to Do

A Prompt is simply the question or instruction you give the model. “Write me a poem” is a Prompt. Sounds trivial, but how you phrase it directly shapes the output. “Write me a poem” might get you a haiku, a sonnet, or a limerick, because the model doesn’t know what you actually want. Something like “Write a Shakespearean sonnet about autumn leaves in a melancholic tone” narrows things down considerably.

This is the domain of Prompt Engineering. It used to be a hot topic; these days, not so much. Partly because the bar is low (it’s essentially “be clear”), and partly because modern models are good enough to infer your intent even from vague prompts.

Prompts come in two flavors:

  • User Prompt: what you type into the chat box
  • System Prompt: instructions configured by the developer behind the scenes, invisible to the user but consistently shaping the model’s behavior

For example, say you’re building a customer service bot and don’t want it to freely promise refunds. You set the System Prompt to: “You are a post-sale support agent for an e-commerce platform. When handling complaints, first acknowledge the customer’s frustration, then investigate the issue. Do not authorize refunds or compensation on your own. For cases requiring escalation, guide the user to submit a ticket.” When a user types “My item arrived broken, I want a refund,” the model responds with “I’m sorry about the experience. Could you share a photo of the damage? I’ll help get this sorted once I can see the situation,” rather than jumping straight to “Sure, refunding now.”
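In a request payload, the two prompt layers typically sit side by side. A sketch, assuming an OpenAI-style messages array (the role labels are the standard convention; the content is abridged from the example above):

```python
# How the two prompt flavors combine in a single request.
# Assumption: an OpenAI-style role/content message array.
messages = [
    {"role": "system",  # set by the developer, invisible to the user
     "content": ("You are a post-sale support agent for an e-commerce "
                 "platform. Do not authorize refunds on your own.")},
    {"role": "user",    # what the user typed into the chat box
     "content": "My item arrived broken, I want a refund."},
]

print([m["role"] for m in messages])  # ['system', 'user']
```

The model receives both, but weighs the system message as standing instructions that frame every reply.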

Tool: Giving the Model Eyes and Hands

LLMs have an obvious blind spot: they can’t perceive the outside world. Ask “How’s the weather in Beijing today?” and it’ll tell you straight up that it can’t access real-time information, because its knowledge is frozen at the training cutoff date.

Tools fill this gap. A Tool is essentially a function: give it input, get output. A weather tool, for example, takes a city and date, calls a weather API internally, and returns the forecast.

The key thing to understand: the model cannot call tools itself. Its only ability is generating text. So the full workflow requires a middleman (usually called the platform) to orchestrate:

  1. The user’s question and the list of available tools are sent to the model together
  2. The model analyzes the request and outputs a tool-call instruction (specifying the tool name and parameters)
  3. The platform receives the instruction and actually calls the corresponding function
  4. The tool returns a result, and the platform forwards it back to the model
  5. The model synthesizes the result into natural language and replies to the user

Each role has a clear responsibility: the model selects tools and summarizes results, the tool executes the action, and the platform ties the whole pipeline together.
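The five-step workflow above can be sketched end to end. Assumptions: the "model" is a scripted stub that emits a JSON tool-call instruction as plain text (real platforms parse a structured tool-call field from the model's response), and `get_weather` is a hypothetical tool.

```python
import json

# Sketch of the tool-calling loop: model proposes, platform executes.

def get_weather(city, date):
    # Stand-in for a real weather API call.
    return {"city": city, "forecast": "sunny, 22°C"}

TOOLS = {"get_weather": get_weather}

def model(messages):
    # The model only GENERATES TEXT. Step 2: it emits a tool call.
    if messages[-1]["role"] == "user":
        return json.dumps({"tool": "get_weather",
                           "args": {"city": "Beijing", "date": "today"}})
    # Step 5: after seeing the tool result, it answers in natural language.
    result = json.loads(messages[-1]["content"])
    return f"The weather in {result['city']} is {result['forecast']}."

def platform(user_question):
    messages = [{"role": "user", "content": user_question}]       # Step 1
    call = json.loads(model(messages))                            # Step 2
    result = TOOLS[call["tool"]](**call["args"])                  # Step 3
    messages.append({"role": "tool",
                     "content": json.dumps(result)})              # Step 4
    return model(messages)                                        # Step 5

print(platform("How's the weather in Beijing today?"))
# The weather in Beijing is sunny, 22°C.
```

Note that the model never touches the network: the platform is the only component that actually runs the function.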

MCP: A Universal Standard for Tools

Tools solve the capability problem, but they introduce a new engineering headache: no standard integration format.

Building a tool for ChatGPT means writing integration code to OpenAI’s spec. Doing the same for Claude means following Anthropic’s spec. And again for Gemini with Google’s spec. Same tool, three implementations, because every platform has its own format.

MCP (Model Context Protocol)4 was created to solve exactly this. It defines a single, unified standard for tool integration. Developers build a tool once following the MCP spec, and it works across every platform that supports MCP. Think of it like the USB-C standard: accessory makers don’t need a separate charger for every phone brand anymore.

Agent: The Autonomous Planner

With Tools and MCP, LLMs can already do a lot. But for more complex requests, a single tool call won’t cut it.

Imagine a user saying: “I’m planning to go for a run this afternoon. Can you check if it’s a good idea?” This requires multiple tool calls where each step depends on the previous one: first get the user’s location coordinates, then use those to check both the weather and air quality, and finally evaluate all the data to decide whether outdoor exercise makes sense. The model needs to reason step by step about the current state and figure out what to do next.

A system that can autonomously plan, call tools, and keep working until the task is done is called an Agent. Products like Claude Code, Codex, and Gemini CLI are all Agents under the hood. They use different architectural patterns, with ReAct and Plan and Execute being among the most well-known5.
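The running example above can be sketched as an agent loop. Assumption: a scripted `decide` function stands in for the LLM's reasoning step, and the three tools return canned data; a real agent would let the model inspect the state and choose the next action itself.

```python
# Minimal agent-loop sketch: observe state, pick the next tool, repeat.
# Assumption: scripted decision logic stands in for LLM reasoning.
def get_location():            return {"lat": 39.9, "lon": 116.4}
def get_weather(lat, lon):     return {"condition": "clear", "temp_c": 18}
def get_air_quality(lat, lon): return {"aqi": 60}

TOOLS = {"get_location": get_location, "get_weather": get_weather,
         "get_air_quality": get_air_quality}

def decide(state):
    # Each step depends on the previous one's result.
    if "location" not in state:
        return "location", "get_location", {}
    if "weather" not in state:
        return "weather", "get_weather", state["location"]
    if "air" not in state:
        return "air", "get_air_quality", state["location"]
    return None  # enough information gathered: stop and answer

def run_agent():
    state = {}
    while (step := decide(state)) is not None:
        key, name, args = step
        state[key] = TOOLS[name](**args)   # call the chosen tool
    ok = state["weather"]["condition"] == "clear" and state["air"]["aqi"] < 100
    return "Good conditions for a run." if ok else "Better to stay indoors."

print(run_agent())  # Good conditions for a run.
```

The loop structure (decide, act, fold the result back into state, repeat until done) is the common core that patterns like ReAct build on.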

Agent Skill: The Agent’s Instruction Manual

Agents sound powerful, and they are. But in day-to-day use, you quickly hit a pain point: personalization rules have to be repeated every single time.

Say you want an Agent as your fitness advisor. Before each workout, it should assess your physical condition and the day’s environmental factors. You have your own preferences: skip squats when your knees are acting up, move indoors if the AQI exceeds 150, reduce intensity above 35°C, always remind you to stretch afterward. And you want a fixed output format: an overall score first, then specific recommendations.

Without any preset, the Agent will check the data but won’t know your body’s quirks or your formatting preferences. It’ll give generic advice. So you end up appending a wall of requirements to every single prompt. Copy, paste, repeat. Not fun.

Agent Skill exists to fix this. It’s essentially a pre-written instruction document (in Markdown) that tells the Agent how to behave in a specific scenario. It has two layers:

  • Metadata layer: the cover page, specifying the Skill’s name and description (e.g., name Fitness Advisor, description “Pre-workout assessment and recommendations”)
  • Instruction layer: the actual rules, including goals, execution steps, judgment criteria, output format, and examples

Once defined, you save it to a designated location. In Claude Code, that’s ~/.claude/skills/. The folder name must match the Skill name, and the file inside must be named SKILL.md.
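A plausible sketch of such a file for the fitness-advisor scenario, assuming YAML front matter carries the metadata layer (the exact field set may vary):

```markdown
---
name: fitness-advisor
description: Pre-workout assessment and recommendations
---

# Goal
Assess my condition and the day's environment before each workout.

# Rules
- Skip squats when my knees are acting up.
- Move indoors if the AQI exceeds 150.
- Reduce intensity above 35°C.
- Always remind me to stretch afterward.

# Output format
1. Overall score
2. Specific recommendations
```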

At runtime, the Agent loads all Skill metadata on startup. When a user’s question matches a Skill’s description, only then does the Agent read that Skill’s full instruction layer and execute accordingly. This progressive disclosure mechanism (loading full content only when needed) keeps Token usage efficient6.

The Big Picture

String these concepts together and the layered structure of the AI stack becomes clear:

| Concept | Role | Problem Solved |
| --- | --- | --- |
| LLM | Foundation | Core text generation capability |
| Token | Data unit | Text-to-number translation between humans and models |
| Context | Memory | Giving the model short-term recall |
| Prompt | Instruction | Telling the model what to do and how |
| Tool | Capability extension | Letting the model perceive and affect the outside world |
| MCP | Protocol standard | Unifying tool integration formats |
| Agent | Autonomous system | Multi-step planning and tool use for complex tasks |
| Agent Skill | Customization | Making Agents follow your rules and formats |

Each layer builds on the one below and addresses its shortcomings. Once you grasp this hierarchy, products like Claude Code, Codex, and OpenClaw all start to look like variations on the same framework. The buzzwords keep multiplying, but the underlying logic stays the same.


  1. The internals of the Transformer architecture are beyond the scope of this post. For a deep dive, the original paper Attention Is All You Need is the definitive reference. ↩︎

  2. Token-to-text ratios vary across models since each uses its own Tokenizer implementation. The numbers here are rough estimates. ↩︎

  3. RAG works by “retrieve first, generate second”: it uses vector similarity matching to find the most relevant document chunks and sends only those as Context to the model. ↩︎

  4. MCP was released by Anthropic in late 2024 and has since been adopted by multiple platforms and tools. ↩︎

  5. For a detailed breakdown of Agent architecture patterns including ReAct and Plan and Execute, check out dedicated articles on the topic. ↩︎

  6. Agent Skills also support advanced features like running code and referencing external resources. This post covers only the core usage. ↩︎