Charsets

UTF-8 vs ASCII for URL encoding.

Why character sets matter for URL encoding, what UTF-8 does that ASCII can’t, and how to decode legacy data that’s in a different charset. Also: when (and when not) to use the ASCII toggle on our decoder.

The short version

URL encoding works on bytes, not characters. Before percent-encoding can happen, your text has to be converted to a sequence of bytes — and that conversion is what character sets are for.
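The two steps can be sketched in Python using the standard library's `urllib.parse.quote`. This is a minimal illustration rather than a description of how any particular tool works; the sample word is an arbitrary choice.

```python
from urllib.parse import quote

text = "caf\u00e9"                      # "café", with é as the single code point U+00E9

# Step 1: characters -> bytes. The charset decides which bytes you get.
utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")   # b'caf\xe9'

# Step 2: bytes -> percent-encoded ASCII. This step never looks at characters.
utf8_url = quote(utf8_bytes, safe="")
latin1_url = quote(latin1_bytes, safe="")

print(utf8_url)    # caf%C3%A9
print(latin1_url)  # caf%E9
```

The same four characters produce two different encoded strings, which is why the decoder has to know which charset was used in step 1.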

For modern web work, always use UTF-8. It’s the universal default, supports every language on earth, and is required by the WHATWG URL standard that browsers follow. The only time you’d pick anything else is when you’re decoding data from an old system that used a different encoding.

What ASCII actually encodes

ASCII covers exactly 128 characters: 95 printable ones (digits, uppercase and lowercase English letters, common punctuation, and the space) plus 33 control characters. That’s it. Nothing else.

If your text is pure ASCII (only English letters, digits, spaces, and basic punctuation), URL-encoding it gives the same result whether you pick ASCII or UTF-8. The byte values for these characters are identical in both encodings.

What can’t be encoded as ASCII: accented Latin letters (é, ñ, ü), Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, emoji, mathematical symbols, currency symbols beyond $. If your input contains any of these and you try to encode as ASCII, you’ll either get an error or the characters will be silently dropped/replaced.
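Both behaviours are easy to reproduce with Python's `urllib.parse.quote`, which accepts `encoding` and `errors` arguments. The helper name `try_ascii` is made up for this sketch; other languages and tools may fail differently.

```python
from urllib.parse import quote

plain = "hello world 123"
# Pure ASCII input: the charset choice makes no difference.
same = quote(plain, encoding="ascii") == quote(plain, encoding="utf-8")
print(same)  # True

def try_ascii(text: str) -> str:
    """Percent-encode as strict ASCII, reporting failure instead of raising."""
    try:
        return quote(text, encoding="ascii")
    except UnicodeEncodeError as exc:
        return f"error: {exc.reason}"

word = "caf\u00e9"  # "café"
print(try_ascii(word))                                   # strict mode: an error
print(quote(word, encoding="ascii", errors="replace"))   # é silently becomes "?" (%3F)
print(quote(word, encoding="ascii", errors="ignore"))    # é silently dropped
```

The last two calls are the dangerous ones: they succeed, but the encoded URL no longer represents the original text.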

What UTF-8 encodes

UTF-8 encodes every Unicode character — over 140,000 in the current standard. It does this with a variable-length scheme:

ASCII characters (A-Z, 0-9, etc.) take 1 byte — identical to ASCII. The letter A is byte 0x41 in both.

Latin Extended characters like é, ñ, ü take 2 bytes. é is 0xC3 0xA9 in UTF-8, which percent-encodes to %C3%A9.

Most Asian scripts (Chinese, Japanese, Korean) take 3 bytes. The Chinese character 你 is 0xE4 0xBD 0xA0 in UTF-8, which percent-encodes to %E4%BD%A0.

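Emoji and some rare scripts take 4 bytes. 😀 is 0xF0 0x9F 0x98 0x80, which percent-encodes to %F0%9F%98%80. Not every emoji-like symbol is that long, though: the heart ❤ (U+2764) sits in the Basic Multilingual Plane and takes 3 bytes, 0xE2 0x9D 0xA4, which percent-encodes to %E2%9D%A4.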
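A quick way to check these byte counts yourself is to encode one character from each range and look at the result. The sketch below uses Python's standard library; the sample characters are written as escapes so the exact code points are unambiguous. Note that the plain heart U+2764 comes out at 3 bytes, while U+1F600 needs 4.

```python
from urllib.parse import quote

# A, é, 你, ❤ (plain heart, no variation selector), 😀
samples = ["A", "\u00e9", "\u4f60", "\u2764", "\U0001F600"]

table = {}
for ch in samples:
    raw = ch.encode("utf-8")
    table[ch] = (len(raw), quote(ch, safe=""))
    print(f"U+{ord(ch):04X}  {len(raw)} byte(s)  {table[ch][1]}")
```

Unreserved ASCII characters such as `A` pass through `quote` unchanged, so only the non-ASCII rows show percent codes.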

The toggle on our decoder

The decoder above offers a quick switch: UTF-8 (the default) and ASCII (the strict mode). Picking ASCII tells the decoder to expect only ASCII bytes — any byte with a value of 128 or higher will produce a warning. This is useful in two specific situations:

Validating ASCII-only data. If you’re consuming a system that promises pure ASCII (some legacy APIs), the toggle catches non-ASCII bytes that would otherwise slip through silently.

Debugging garbled output. If a UTF-8 decode gives nonsense, switching to ASCII shows you the raw bytes — useful for spotting whether the source data is actually in a different encoding entirely.

For everyday use, leave it on UTF-8.
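If you want the same kind of strict-ASCII check in your own code, one possible approach is to percent-decode to raw bytes and flag anything at 128 or above. This is only a sketch of the idea, not how the decoder on this page is implemented; `find_non_ascii_bytes` is a hypothetical helper name.

```python
from urllib.parse import unquote_to_bytes

def find_non_ascii_bytes(encoded: str) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every decoded byte >= 0x80."""
    raw = unquote_to_bytes(encoded)
    return [(i, b) for i, b in enumerate(raw) if b >= 0x80]

print(find_non_ascii_bytes("hello%20world"))  # [] means pure ASCII
for offset, value in find_non_ascii_bytes("caf%C3%A9"):
    print(f"offset {offset}: 0x{value:02X}")
```

Seeing the raw byte values is often enough to tell the encodings apart: a lone 0xE9 suggests a single-byte charset, while the pair 0xC3 0xA9 suggests UTF-8.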

What goes wrong with the wrong charset

The classic symptom: you decode and see things like cafÃ©, caf�, or caf%E9 instead of the expected café. Three different causes:

cafÃ© means the original bytes were two-byte UTF-8 (0xC3 0xA9) but the decoder interpreted those two bytes individually as ISO-8859-1, giving Ã (0xC3) and © (0xA9). The data is UTF-8 but being decoded as ISO-8859-1.

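caf� (the replacement character U+FFFD) means the decoder hit a byte it could not map. The usual cause is the reverse mismatch: the source was a single-byte encoding such as ISO-8859-1 or Windows-1252, where é is the lone byte 0xE9, but it was decoded as UTF-8, where 0xE9 is only valid as the start of a three-byte sequence. It can also happen when UTF-8 data is decoded as Windows-1252 and a byte lands on one of the five positions that charset leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D).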

caf%E9 with raw percent codes means the decoder gave up — the input contained bytes that aren’t valid in the chosen charset.

Fix: switch the destination character set on the decoder to match the source.
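The mismatches are easy to reproduce with Python's `urllib.parse.unquote`, which takes the destination charset as its `encoding` argument and substitutes U+FFFD for undecodable bytes by default. A minimal sketch, with variable names chosen for illustration:

```python
from urllib.parse import unquote

utf8_encoded = "caf%C3%A9"    # é sent as UTF-8 (two bytes)
latin1_encoded = "caf%E9"     # é sent as ISO-8859-1 (one byte)

right = unquote(utf8_encoded, encoding="utf-8")        # matching charset
mojibake = unquote(utf8_encoded, encoding="latin-1")   # UTF-8 bytes read as ISO-8859-1
replaced = unquote(latin1_encoded, encoding="utf-8")   # lone 0xE9 is invalid UTF-8
fixed = unquote(latin1_encoded, encoding="latin-1")    # matching charset again

for label, value in [("right", right), ("mojibake", mojibake),
                     ("replaced", replaced), ("fixed", fixed)]:
    print(f"{label:9} {value!r}")
```

In both broken cases the bytes themselves are intact; only the interpretation is wrong, which is why changing the destination charset is enough to recover the text.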

How to know which charset the source used

You usually don’t — URLs don’t carry character-set metadata. Best heuristics:

It’s probably UTF-8. Modern systems, web pages, and APIs all default to UTF-8. If you don’t know, try UTF-8 first.

If UTF-8 gives garbled accented characters, try Windows-1252. Used by old Windows applications and pre-2010 web pages from Western Europe and Latin America. The most common alternative to UTF-8 in legacy URLs.

If UTF-8 gives garbled Cyrillic, try Windows-1251 or KOI8-R. Old Russian systems.

If UTF-8 gives garbled Japanese, try Shift_JIS or EUC-JP. Old Japanese systems.

If UTF-8 gives garbled Chinese, try GBK or Big5. Simplified and Traditional Chinese respectively.

The decoder above supports 30+ character sets including all of these.
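The "try UTF-8 first, then fall back" heuristic can be sketched as a strict-decode loop. `guess_decode` and its default candidate list are assumptions made for this example, not part of any library. One caveat is baked in: single-byte charsets accept almost any byte sequence, so the first one in the list tends to win, and you still have to pick the legacy candidates that fit the data's likely origin.

```python
from urllib.parse import unquote_to_bytes

def guess_decode(encoded: str, candidates=("utf-8", "cp1252")) -> tuple[str, str]:
    """Return (charset, text) for the first candidate that decodes without error."""
    raw = unquote_to_bytes(encoded)
    for name in candidates:
        try:
            return name, raw.decode(name)   # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a character, so it works as a last resort.
    return "latin-1", raw.decode("latin-1")

print(guess_decode("caf%C3%A9"))   # valid UTF-8, so UTF-8 wins
print(guess_decode("caf%E9"))      # invalid UTF-8, falls back to cp1252
# Data suspected to come from an old Russian system: pass Cyrillic candidates.
print(guess_decode("%EF%F0%E8%E2%E5%F2", candidates=("utf-8", "cp1251")))
```

Putting UTF-8 first works well because random legacy bytes rarely form valid UTF-8 sequences, so a clean strict decode is strong evidence that the guess is right.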

The IRI standard (RFC 3987)

Strictly speaking, RFC 3986 (the URI standard) defines a URI as a sequence of ASCII characters; anything else has to be percent-encoded. The convention of percent-encoding UTF-8 bytes for non-ASCII characters was formalized in RFC 3987, the IRI (Internationalized Resource Identifier) standard, published alongside RFC 3986 in January 2005. RFC 3987 says: to put a Unicode character in a URI, encode it as UTF-8, then percent-encode each byte. This is now the universal convention, followed by browsers and by standards-compliant URL parsers and libraries.
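The core rule (encode as UTF-8, then percent-encode each byte) fits in a few lines. The sketch below is deliberately simplified and is not a full RFC 3987 implementation: it leaves every ASCII character untouched and does not give host names the separate treatment that internationalized domain names need. The function name and the example.com URL are placeholders.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Replace each non-ASCII character with its percent-encoded UTF-8 bytes."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)                   # ASCII passes through unchanged
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode_non_ascii("https://example.com/caf\u00e9?q=\u4f60"))
# https://example.com/caf%C3%A9?q=%E4%BD%A0
```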

Bottom line

Use UTF-8. The ASCII toggle exists for validation and debugging — not because you’d normally choose ASCII as your encoding.

Common questions

About this topic.

Should I use UTF-8 or ASCII for URL encoding?

UTF-8, always, for new code. It handles every language and every emoji, is the universal default for modern web standards, and is required by the WHATWG URL specification. ASCII is only relevant when consuming or generating data for legacy systems that explicitly require ASCII-only.

Why does my decoded text show characters like Ã© instead of é?

Charset mismatch. The original bytes were UTF-8 but you decoded them as ISO-8859-1 or Windows-1252. Switch the destination charset on the decoder to UTF-8 and the result will be correct.

Which charset do browsers use for URLs?

Modern browsers use UTF-8 to encode any non-ASCII characters in URLs. This is mandated by the WHATWG URL Living Standard, the spec all major browsers implement. When you type a URL containing é or 中 into the address bar, the browser UTF-8-encodes those characters before sending the request.

How are emoji and Chinese characters URL-encoded?

Same scheme: each character is converted to its UTF-8 bytes, then each byte is percent-encoded. Emoji generally take 4 bytes in UTF-8 (so 12 chars after encoding, like %F0%9F%98%80), Chinese characters generally take 3 bytes (9 chars after encoding, like %E4%BD%A0).

Can I use URL encoding for binary data?

Technically yes, but Base64 is a much better choice. URL encoding inflates binary data by up to 3×: every byte outside the small unreserved set (letters, digits, and - . _ ~) becomes 3 ASCII characters. Base64 inflates by only about 33% (every 3 bytes become 4 ASCII characters). For embedding binary in URLs, Base64URL (a URL-safe variant using - and _ instead of + and /) is the standard.
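To see the size difference concretely, you can encode the same payload both ways and compare lengths. The sketch below uses Python's standard library; the payload is an arbitrary test pattern containing every byte value equally often. Percent-encoding costs up to three characters per byte, so the exact ratio depends on how many bytes happen to be unreserved ASCII.

```python
import base64
from urllib.parse import quote

data = bytes(range(256)) * 4    # 1024 bytes of test data

pct = quote(data, safe="")      # percent-encoding
b64 = base64.urlsafe_b64encode(data).decode("ascii")   # Base64URL, with padding

print(f"raw:     {len(data)} bytes")
print(f"percent: {len(pct)} chars ({len(pct) / len(data):.2f}x)")
print(f"base64:  {len(b64)} chars ({len(b64) / len(data):.2f}x)")

# Worst case for percent-encoding: no byte is unreserved, so exactly 3x.
worst = quote(b"\xff" * 100, safe="")
print(len(worst))  # 300
```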
