LearnGPT
Technical Deep Dive

Tokenization: How AI Reads Your Text

Tokenization is how AI chops your words into bite-sized pieces called tokens. It's why 'ChatGPT' becomes two pieces and why your costs go up with longer messages. Understanding tokens helps you write better prompts and save money.

The simple version: Tokens are the "atoms" of AI text. Just like atoms make up everything physical, tokens make up everything the AI reads and writes. More tokens = more processing = more cost.

What Is Tokenization?

What is tokenization?

Tokenization is how AI chops up your text into smaller pieces called "tokens." Think of it like cutting a sentence into puzzle pieces. The AI can't read whole sentences — it reads these tiny chunks, one by one.

What exactly is a token?

A token is usually a word, part of a word, or a punctuation mark. "Hello" is one token. "Unbelievable" might be split into "Un" + "believ" + "able" — three tokens. Common words stay whole; rare words get broken apart.
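The splitting above can be sketched in a few lines. This is a toy greedy longest-match splitter over a hand-picked vocabulary — a simplification of what real tokenizers do (the vocabulary and algorithm here are illustrative, not any actual model's):

```python
# Toy vocabulary; real tokenizers learn tens of thousands of entries from data.
VOCAB = {"un", "believ", "able", "hello", "the", "is"}

def split_word(word, vocab):
    """Greedy longest-match split of a word into subword tokens.
    A simplification of how real tokenizers break up rare words."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(split_word("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

Because every single character can fall back to being its own token, there is no such thing as an "unknown word" — only longer or shorter splits.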

Why should I care about tokens?

Tokens = money. AI pricing is based on tokens. Also, AI can only process so many tokens at once (the "context window"). Understanding tokens helps you write better prompts and avoid hitting limits.

See Tokens in Action

Real examples of how text gets split into tokens

Hello world

2 tokens

Common words stay whole. Notice the space is part of "world".

ChatGPT is amazing!

5 tokens

"ChatGPT" splits into two tokens. Punctuation is separate.

Supercalifragilisticexpialidocious

9 tokens

Rare words get chopped into many pieces.

🎉🎂🎁

3 tokens

Common emojis are often 1 token each, though complex ones can take several.

Token Rules of Thumb

Patterns to help you estimate token counts

Common words = 1 token

Example: hello, the, is, good, you

Spaces attach to the next word

Example: " the" not "the" + " "

Rare/long words = multiple tokens

Example: cryptocurrency, photosynthesis

Punctuation = usually 1 token

Example: . , ! ? : ;

Numbers vary

Example: "42" = 1 token, "123456789" = multiple

Non-English uses more tokens

Example: Japanese, Chinese, Arabic text
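These rules of thumb add up to the widely quoted estimate of roughly 4 characters per token for English text. A minimal sketch of that heuristic (the function name and the ~4-character ratio are just the rule of thumb, not any tokenizer's real behavior):

```python
def estimate_tokens(text):
    """Very rough estimate: English averages ~4 characters per token.
    Real counts depend on the tokenizer; use this only for ballparks."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello world"))  # 3 (the real count above was 2)
print(estimate_tokens("Supercalifragilisticexpialidocious"))  # 8 (real count: 9)
```

The estimate lands in the right neighborhood but misses the exact count both times — which is exactly why the tips below recommend checking with a real tokenizer tool.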

Why Do AI Models Use Tokens?

It's not arbitrary — there are smart reasons

Handle Any Word

By breaking words into pieces, AI can understand words it's never seen before. "Cryptocurrency" might be new, but "Crypto" + "currency" aren't.

Keep Vocabulary Small

Instead of memorizing millions of words, AI learns ~50,000-100,000 tokens and combines them. More efficient and flexible.

Work Across Languages

The same tokenizer can handle English, French, code, and emojis. It just learns the common patterns in each.

Types of Tokenizers

Different AI models use different tokenization methods

BPE (Byte Pair Encoding)

Starts with characters, then merges the most common pairs. "th" + "e" → "the". Learns what to merge from training data.

Used by: GPT-3, GPT-4, Claude

  • Handles any text
  • Good balance of size
  • No unknown words
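The merge step at the heart of BPE can be shown in miniature. This sketch counts adjacent symbol pairs across a tiny made-up corpus and merges the most frequent pair, twice (the corpus and merge count are invented for illustration):

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of `pair` with its merged symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# Tiny corpus: each word as a tuple of characters, with a frequency.
vocab = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}

merges = []
for _ in range(2):
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = apply_merge(best, vocab)

print(merges)  # [('t', 'h'), ('th', 'e')]
```

Note how "th" gets merged first because it appears in every word — frequent patterns become single tokens, which is why common words like "the" end up as 1 token.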

WordPiece

Similar to BPE but uses a slightly different merging algorithm based on likelihood.

Used by: BERT, some Google models

  • Great for search
  • Efficient for classification
  • Well-studied

SentencePiece

Language-independent tokenizer that works directly on raw text. No pre-tokenization needed.

Used by: T5, LLaMA, many multilingual models

  • Truly language-agnostic
  • Handles spaces uniformly
  • Good for multilingual

What Do Tokens Cost?

Representative pricing for popular AI models (rates change often — check each provider's pricing page)

GPT-4o

Input: $2.50 / 1M tokens | Output: $10.00 / 1M tokens | Context: 128K

GPT-4o mini

Input: $0.15 / 1M tokens | Output: $0.60 / 1M tokens | Context: 128K

Claude 3.5 Sonnet

Input: $3.00 / 1M tokens | Output: $15.00 / 1M tokens | Context: 200K

Gemini 1.5 Pro

Input: $1.25 / 1M tokens | Output: $5.00 / 1M tokens | Context: 2M
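Turning those per-million-token rates into a per-request cost is simple arithmetic. A quick sketch, using the GPT-4o mini rates from the table (the request sizes are made-up examples):

```python
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request, given per-1M-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# GPT-4o mini rates from the table above: $0.15 input, $0.60 output per 1M tokens.
cost = request_cost(2_000, 500, 0.15, 0.60)
print(f"${cost:.4f}")  # $0.0006
```

Notice that 500 output tokens cost as much as 2,000 input tokens here — output is 4x the price, which is why the tips below suggest asking for concise responses.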

Tips to Optimize Token Usage

Save money and stay within limits

Be concise in prompts

Fewer words = fewer tokens = lower cost. Get to the point.

Use common words

"Use" instead of "utilize." Common words are usually 1 token.

Remove unnecessary context

Don't paste your whole codebase. Include only what's needed.

Test with a tokenizer tool

OpenAI's Tokenizer tool shows exactly how your text splits.

Common Token Mistakes

Thinking 1 word = 1 token

✗ Don't

Long or rare words often become 2-5+ tokens.

✓ Do

Test your text with a tokenizer to see actual counts.

Why: Token counts affect both cost and context limits.

Ignoring system prompts

✗ Don't

System prompts count toward your token limit too!

✓ Do

Keep system prompts concise and efficient.

Why: Hidden tokens add up fast.

Forgetting about output tokens

✗ Don't

You pay for AI's response too, often at higher rates.

✓ Do

Ask for concise responses when you don't need long answers.

Why: Output tokens are usually more expensive.

Pasting huge documents

✗ Don't

A 10-page doc might use 3,000+ tokens instantly.

✓ Do

Extract only the relevant sections for your question.

Why: Smaller context = faster, cheaper, often better.
