URL Encoding for Emoji and Unicode: The UTF-8 Mechanism

Modern URLs can carry any Unicode character — Chinese characters, Cyrillic, Arabic, mathematical symbols, even emoji. They get there via the UTF-8 + percent-encoding combo. The mechanism is simple once you see it, and modern browsers handle it transparently. But there are some surprising edge cases.

This article covers how Unicode in URLs works, what the encoding looks like for different scripts, and the common issues.

The two-step process

To encode a non-ASCII character in a URL:

1. Convert the character to its UTF-8 byte sequence.
2. Percent-encode each byte as %XX.

That’s it. The same rule applies to every Unicode character. The differences come from how many bytes each character takes in UTF-8.
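The two steps above can be sketched in JavaScript. This is a minimal illustration, not a production encoder: percentEncodeUtf8 is a hypothetical helper name, and it deliberately encodes every byte (ASCII included) so the mechanism stays visible. Built-in encoders such as encodeURIComponent() leave unreserved ASCII characters alone.

```javascript
// Hypothetical helper for illustration only (not a standard API).
// Step 1: turn the string into UTF-8 bytes. Step 2: write each byte as %XX.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) =>
    "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeUtf8("\u00E9"));    // é  -> %C3%A9
console.log(percentEncodeUtf8("\u{1F600}")); // 😀 -> %F0%9F%98%80
```

For non-ASCII input the result should match what encodeURIComponent() produces, which is a quick way to sanity-check the sketch.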

Byte counts by script

Latin letters with diacritics (beyond 1-byte ASCII): Letters like é, ñ, ü, ø take 2 bytes.

é = 0xC3 0xA9 = %C3%A9
ñ = 0xC3 0xB1 = %C3%B1
ü = 0xC3 0xBC = %C3%BC

Cyrillic, Greek, Hebrew, Arabic: The basic blocks of these scripts (everything up to U+07FF) also take 2 bytes per character in UTF-8.

А (Cyrillic A) = 0xD0 0x90 = %D0%90
Привет        = %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82

CJK (Chinese, Japanese, Korean): 3 bytes per character for the commonly used characters. Rare ideographs above U+FFFF take 4.

你 = 0xE4 0xBD 0xA0 = %E4%BD%A0
你好  = %E4%BD%A0%E5%A5%BD
こんにちは = %E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF

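Emoji and supplementary characters: 4 bytes per character for anything above U+FFFF, which covers most emoji. A few older symbols that double as emoji take 3.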

😀  = 0xF0 0x9F 0x98 0x80 = %F0%9F%98%80
🎉  = 0xF0 0x9F 0x8E 0x89 = %F0%9F%8E%89
❤  = 0xE2 0x9D 0xA4 = %E2%9D%A4    (3 bytes — it’s an older symbol)
❤️ = 0xE2 0x9D 0xA4 0xEF 0xB8 0x8F = %E2%9D%A4%EF%B8%8F   (with variation selector)
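To check the byte counts above for yourself, a short sketch like the following may help. utf8ByteCount is a hypothetical helper name; the sample characters are written as escapes so the exact code points are unambiguous.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for a string.
function utf8ByteCount(text) {
  return new TextEncoder().encode(text).length;
}

const samples = [
  "\u00E9",    // é  (Latin with diacritic)
  "\u0410",    // А  (Cyrillic A)
  "\u4F60",    // 你 (CJK)
  "\u{1F600}", // 😀 (supplementary-plane emoji)
  "\u2764",    // ❤  (older symbol, no variation selector)
];

for (const ch of samples) {
  // Prints the character, its UTF-8 byte count, and its percent-encoded form.
  console.log(ch, utf8ByteCount(ch), encodeURIComponent(ch));
}
```

Each percent-encoded form is three characters per byte, so the encoded length is always the byte count times three.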

Why ❤ and ❤️ encode differently

The heart character has two forms in Unicode. ❤ is the basic character (U+2764). ❤️ is the same character followed by U+FE0F, a "variation selector" that tells the renderer to use the colorful emoji style instead of the monochrome text style. The variation selector is a separate Unicode codepoint, so it adds its own 3 bytes (%EF%B8%8F) to the encoding.

You’ll see this for many emoji. The "with VS-16" form is what modern keyboards produce; the bare form is what older systems produce.
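The difference between the two hearts is easier to see when the code points are listed explicitly. In this sketch, codePoints is a hypothetical helper name, not a built-in.

```javascript
const bareHeart = "\u2764";        // U+2764 on its own
const emojiHeart = "\u2764\uFE0F"; // U+2764 followed by VS-16 (U+FE0F)

// Hypothetical helper: list a string's code points as "U+XXXX" tokens.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

console.log(codePoints(bareHeart));           // U+2764
console.log(codePoints(emojiHeart));          // U+2764 U+FE0F
console.log(encodeURIComponent(bareHeart));   // %E2%9D%A4
console.log(encodeURIComponent(emojiHeart));  // %E2%9D%A4%EF%B8%8F
```

Because the two strings differ by a real code point, a plain string comparison between them fails even though they may look alike on screen.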

What browsers actually display

Modern browsers (Chrome, Firefox, Safari, Edge) display Unicode in URLs natively when possible:

You see:      https://example.com/café
URL is:       https://example.com/caf%C3%A9

The browser displays the readable form but the actual byte sequence is the encoded form. Copy-paste from the address bar usually gives you the encoded form (or, in some browsers, the decoded form — behavior varies).
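The same split between display form and wire form can be reproduced in code. The sketch below assumes a runtime that provides the standard URL class (browsers and Node.js both do).

```javascript
// The URL parser percent-encodes non-ASCII path characters as UTF-8.
const url = new URL("https://example.com/caf\u00E9"); // path is "café"

console.log(url.href);            // https://example.com/caf%C3%A9
console.log(decodeURI(url.href)); // https://example.com/café
```

decodeURI() is used here only to recover a readable form for display; the encoded href is what should be sent or stored.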

IDN: Unicode in domain names

Domain names have a separate encoding scheme called Punycode (RFC 3492), applied as part of IDNA (Internationalized Domain Names in Applications). A domain like 例え.jp becomes xn--r8jz45g.jp on the wire. The path and query parts of the URL still use percent-encoded UTF-8; only the hostname uses Punycode.

Display: https://例え.jp/path
Wire:    https://xn--r8jz45g.jp/path

This is why we don’t cover IDN in detail in this article — it’s a separate spec and a separate problem domain. Your URL encoder doesn’t handle hostnames; it handles the path and query.
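For completeness, here is a small sketch of how the two schemes coexist in one URL. It assumes a runtime whose URL class implements IDNA hostname conversion, which is the case in browsers and in standard Node.js builds.

```javascript
// Hostname "例え.jp" and path "/café", written with escapes for clarity.
const url = new URL("https://\u4F8B\u3048.jp/caf\u00E9");

console.log(url.hostname); // xn--r8jz45g.jp  (Punycode, via IDNA)
console.log(url.pathname); // /caf%C3%A9      (percent-encoded UTF-8)
```

The parser applies each scheme to its own part of the URL, so application code rarely needs to call a Punycode routine directly.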

Emoji in query strings — examples

Heart in a search query:

https://example.com/search?q=I%20%E2%9D%A4%EF%B8%8F%20you
Decoded: ?q=I ❤️ you

Multi-emoji message:

https://example.com/api/message?text=%F0%9F%8E%89%20Party%21%20%F0%9F%8D%B0
Decoded: ?text=🎉 Party! 🍰

Family emoji (compound, multiple codepoints):

👨‍👩‍👧‍👦 is seven codepoints: four emoji joined by three ZWJs (zero-width joiners):
👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦

In UTF-8 percent-encoded:
%F0%9F%91%A8%E2%80%8D%F0%9F%91%A9%E2%80%8D%F0%9F%91%A7%E2%80%8D%F0%9F%91%A6

One displayed character, 25 bytes of UTF-8, 75 characters after percent-encoding. This is why emoji in URLs can be surprisingly long.
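The arithmetic for the family emoji can be checked with a few lines of JavaScript. The sequence is assembled from escapes so the joiners are visible in the source.

```javascript
const ZWJ = "\u200D"; // zero-width joiner
// man, woman, girl, boy joined by ZWJ
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);

const codePointCount = Array.from(family).length;          // 4 emoji + 3 ZWJ = 7
const utf8Bytes = new TextEncoder().encode(family).length; // 4*4 + 3*3 = 25
const encoded = encodeURIComponent(family);

console.log(codePointCount, utf8Bytes, encoded.length);    // 7 25 75
console.log(encoded);
```

Note that family.length reports 11 here, because JavaScript strings count UTF-16 code units rather than code points or bytes.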

The character set selector matters here

Our decoder defaults to UTF-8. That’s right for modern data. But legacy systems sometimes used other encodings for non-ASCII characters:

Windows-1252 (old Western European):

é encoded as ISO-8859-1 / Windows-1252:
%E9   (single byte)

vs UTF-8:
%C3%A9   (two bytes)

Shift_JIS (old Japanese):

あ encoded as Shift_JIS:
%82%A0   (different bytes)

vs UTF-8:
%E3%81%82   (three bytes)

If you decode legacy data with UTF-8, you get garbage. Switch the destination charset on our decoder to match the source — there are 30+ options.
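If you need to do the same thing in code, one possible approach is to collect the raw bytes first and only then choose a charset. The sketch below is an illustration, not the implementation behind the decoder mentioned above: percentDecodeWith is a hypothetical helper, it assumes any unescaped characters in the input are plain ASCII, and it relies on the runtime's TextDecoder supporting legacy labels such as windows-1252 and shift_jis (browsers do; Node.js needs full ICU, which standard builds include).

```javascript
// Hypothetical helper: percent-decode to bytes, then interpret the bytes
// using the given charset label.
function percentDecodeWith(encoded, charset) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    const hex = encoded.slice(i + 1, i + 3);
    if (encoded[i] === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // %XX -> one byte
      i += 2;
    } else {
      bytes.push(encoded.charCodeAt(i)); // assumption: literal chars are ASCII
    }
  }
  return new TextDecoder(charset).decode(Uint8Array.from(bytes));
}

console.log(percentDecodeWith("caf%E9", "windows-1252")); // café
console.log(percentDecodeWith("caf%C3%A9", "utf-8"));     // café
console.log(percentDecodeWith("%82%A0", "shift_jis"));    // あ
console.log(percentDecodeWith("caf%E9", "utf-8"));        // caf + U+FFFD (wrong charset)
```

The last call shows the failure mode: the Latin-1 byte E9 is not valid UTF-8 on its own, so the decoder substitutes the replacement character.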

Common emoji/Unicode URL problems

1. Garbled display after decoding

Symptom: cafÃ© or ä¸æ–‡ instead of the expected café or 中文.

Cause: charset mismatch. The data was UTF-8 but you decoded as Latin-1, or vice versa.

Fix: change the destination charset.
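The garbling is easy to reproduce, which can help confirm the diagnosis. This sketch takes the UTF-8 bytes for "café" and decodes them twice, once with the wrong charset and once with the right one.

```javascript
const utf8Bytes = new TextEncoder().encode("caf\u00E9"); // bytes 63 61 66 C3 A9

// Wrong: each of the two UTF-8 bytes for é is read as its own character.
const wrong = new TextDecoder("latin1").decode(utf8Bytes);
// Right: the two bytes are read together as one character.
const right = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(wrong); // cafÃ©
console.log(right); // café
```

If your garbled output matches the "wrong" pattern (one accented character turning into two), the data is most likely UTF-8 being read as a single-byte charset.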

2. Emoji disappear when copied

Symptom: copy a URL with emoji, paste it elsewhere, the emoji are gone or show as boxes.

Cause: the destination context doesn’t support UTF-8 (rare in 2026) or the font lacks emoji glyphs.

Fix: the URL itself is fine. The display is a font issue.

3. URL contains literal Unicode without encoding

Symptom: someone shared a URL with actual café in it, and your code rejects it.

Cause: not all clients/servers tolerate raw Unicode in URLs. RFC 3986 allows only ASCII characters in a URI, so anything else has to be percent-encoded.

Fix: encode it. In JavaScript, encodeURI() encodes a whole URL and leaves delimiters such as /, ?, & and = intact; encodeURIComponent() encodes a single value, delimiters included.
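A short comparison of the two functions, using made-up example values:

```javascript
// Whole URL: encodeURI keeps the structural characters (: / ? =) as they are.
const rawUrl = "https://example.com/caf\u00E9?q=a b";
console.log(encodeURI(rawUrl)); // https://example.com/caf%C3%A9?q=a%20b

// Single value: encodeURIComponent also escapes & so it cannot split the query.
const value = "caf\u00E9 & cr\u00E8me"; // "café & crème"
console.log(encodeURIComponent(value)); // caf%C3%A9%20%26%20cr%C3%A8me
console.log(encodeURI(value));          // caf%C3%A9%20&%20cr%C3%A8me

const searchUrl = "https://example.com/search?q=" + encodeURIComponent(value);
console.log(searchUrl);
```

Using encodeURI() on a value is a common mistake: the unescaped & in the third line of output would be read as the start of a new query parameter.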

4. URL way longer than expected

Symptom: a URL with a few emoji ends up surprisingly long.

Cause: each emoji is 4 bytes UTF-8, which becomes 12 chars after percent-encoding. Compound emoji (ZWJ sequences) can be much longer.

Fix: if you’re hitting URL length limits, consider sending the data in a POST body instead.

The escape() function is broken for Unicode

JavaScript’s legacy escape() function produces nonstandard output for non-ASCII characters:

escape("café")          // "caf%E9"  — wrong! That’s Latin-1, not UTF-8
encodeURI("café")        // "caf%C3%A9"  — correct UTF-8 encoding
encodeURIComponent("café") // "caf%C3%A9"  — correct

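Never use escape(). It’s deprecated and wrong for every non-ASCII character: from U+0080 to U+00FF it emits a single Latin-1 byte (%E9 instead of %C3%A9), and above U+00FF it emits a nonstandard %uXXXX form that standard URL decoders do not understand.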
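The behavior for characters above U+00FF is worth seeing once. This sketch assumes a runtime that still ships the legacy escape() global (browsers and Node.js do).

```javascript
const ni = "\u4F60";      // 你
const grin = "\u{1F600}"; // 😀

console.log(escape(ni));               // %u4F60 (nonstandard %uXXXX form)
console.log(encodeURIComponent(ni));   // %E4%BD%A0
console.log(escape(grin));             // %uD83D%uDE00 (one token per UTF-16 unit)
console.log(encodeURIComponent(grin)); // %F0%9F%98%80

// Standard decoders reject the %uXXXX form outright.
let rejected = false;
try {
  decodeURIComponent(escape(ni));
} catch (err) {
  rejected = err instanceof URIError;
}
console.log(rejected); // true
```

The emoji case shows a second problem: escape() works on UTF-16 code units, so a single emoji is split into its two surrogate halves.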

Bottom line

Unicode in URLs is mechanically simple: UTF-8 the character, percent-encode each byte. Prefer your language’s standard library over hand-rolled encoding; most modern ones use UTF-8 by default, though legacy functions such as JavaScript’s escape() do not. The only complications are emoji that combine multiple codepoints (longer than expected) and legacy data in non-UTF-8 encodings (use the charset selector on our decoder).

Found this useful? Try the URL decoder, the URL encoder, or browse all tools.
