Background

What encoding actually is

Text on disk is just bytes; an encoding is the key that maps those bytes to characters. UTF-8 is the modern default and can represent every Unicode character, so it covers every language; legacy files are often Windows-1252 or Latin-1, which cover only Western European characters. Decode bytes with the wrong key and you get garbage.
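To make the "key" idea concrete, here is a minimal Python sketch that encodes one string under two encodings and compares the bytes. The sample word is illustrative; `cp1252` is Python's codec name for Windows-1252.

```python
# One string, two encodings: the characters are the same, the bytes differ.
text = "café"

utf8_bytes = text.encode("utf-8")      # é becomes two bytes: C3 A9
cp1252_bytes = text.encode("cp1252")   # é becomes one byte: E9

print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9
print(cp1252_bytes.hex(" "))  # 63 61 66 e9

# Decoding with the matching key round-trips cleanly.
assert utf8_bytes.decode("utf-8") == text
assert cp1252_bytes.decode("cp1252") == text
```

The ASCII letters are byte-for-byte identical in both encodings, which is why a mismatch only shows up on accented or non-Latin characters.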

There is no marker inside most files announcing their encoding, so programs guess — and a wrong guess is the root of every problem below.

Symptom 1

café → cafÃ© (mojibake)

When a UTF-8 file is read as Windows-1252, each multi-byte character splits into several visible ones, one per byte: é becomes Ã©, ü becomes Ã¼. This is called mojibake, and it is the single most common CSV encoding bug.

The fix is to re-open the file declaring UTF-8, not to hand-edit the broken characters: there are usually too many, and editing only masks the underlying mismatch. In Excel: Data → From Text/CSV, then set File Origin to 65001: Unicode (UTF-8). In code: specify encoding="utf-8" when you read it.
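A minimal Python sketch of the bug and the fix, using only the standard library. The file name `people.csv`, its contents, and the helper `read_rows` are made up for illustration.

```python
import csv
import os
import tempfile

# Write a small UTF-8 CSV to a temporary directory (illustrative data).
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(
        [["name", "city"], ["Zoë", "Zürich"], ["René", "Orléans"]]
    )

def read_rows(path, encoding):
    """Read every row of a CSV, decoding the bytes with the given encoding."""
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))

wrong = read_rows(path, "cp1252")  # wrong key: mojibake
right = read_rows(path, "utf-8")   # declared correctly: clean text

print(wrong[1])  # ['ZoÃ«', 'ZÃ¼rich']
print(right[1])  # ['Zoë', 'Zürich']
```

Nothing in the file changed between the two reads; only the declared encoding did, which is why fixing the read repairs every character at once.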

→ Detect the encoding

Symptom 2

The replacement character �

A black-diamond question mark (�, U+FFFD) appears where a character should be. That's the decoder hitting a byte sequence that is invalid in the encoding it assumed: often a UTF-8 file opened as ASCII, or a genuinely corrupted byte.

Re-open as UTF-8 first; if the � persists in the same spot, that byte is actually damaged in the source file and needs to be re-exported from the original system.
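Both causes can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sample bytes are invented, the "damaged" case is simulated by dropping the second byte of a UTF-8 é, and `first_invalid_offset` is a hypothetical helper for locating the bad spot.

```python
# Case 1: valid UTF-8 decoded as ASCII. Each non-ASCII byte becomes U+FFFD.
good = "café".encode("utf-8")
print(ascii(good.decode("ascii", errors="replace")))  # 'caf\ufffd\ufffd'

# Case 2: a damaged file. The é (C3 A9) has lost its second byte.
damaged = b"id,name\n1,caf\xc3\n"
print(ascii(damaged.decode("utf-8", errors="replace")))  # 'id,name\n1,caf\ufffd\n'

def first_invalid_offset(data, encoding="utf-8"):
    """Return the byte offset of the first undecodable byte, or None if clean."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None

print(first_invalid_offset(good))     # None
print(first_invalid_offset(damaged))  # 13
```

In case 1, re-opening as UTF-8 makes the � disappear. In case 2 it stays at the same offset, which is the signal that the byte itself is bad.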

Symptom 3

ï»¿ on the first header

Your first column header reads ï»¿id instead of id, or a strict parser complains about an unexpected character at position zero. That's a byte order mark: an invisible three-byte UTF-8 prefix (EF BB BF) that some apps, Excel included, add when saving.

A good parser strips the BOM automatically. If yours doesn't, save the file as UTF-8 without BOM, or run it through a tool that normalizes it.
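In Python, one way to handle this is the built-in `utf-8-sig` codec, which drops a leading BOM if there is one. The sketch below uses invented sample bytes, and `has_utf8_bom` is a hypothetical helper name.

```python
import codecs

# EF BB BF prefix followed by ordinary CSV text, as some apps write it.
raw = codecs.BOM_UTF8 + b"id,name\n1,Ana\n"

# Read as Windows-1252, the three BOM bytes show up as visible characters.
print(raw[:3].decode("cp1252"))  # ï»¿

# Plain "utf-8" keeps the BOM as an invisible U+FEFF glued to the first header.
print(repr(raw.decode("utf-8").splitlines()[0]))      # '\ufeffid,name'

# "utf-8-sig" drops a leading BOM if present and is harmless if absent.
print(repr(raw.decode("utf-8-sig").splitlines()[0]))  # 'id,name'

def has_utf8_bom(data):
    """True if the bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

print(has_utf8_bom(raw))  # True
```

The same codec name works when opening files, for example `open(path, encoding="utf-8-sig", newline="")`.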

→ Check for a BOM

The durable fix

Standardize on UTF-8

Save every CSV as UTF-8 unless a specific downstream tool demands otherwise — it is the one encoding that handles every character and that virtually everything can read. Only add a BOM if a consumer (some Excel workflows) actually needs it.
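Re-saving a legacy file as UTF-8 can be scripted. The following is a rough sketch, not a full migration tool: `convert_to_utf8`, the file names, and the sample data are all hypothetical, and it assumes you already know the source encoding.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding, with_bom=False):
    """Rewrite a text file as UTF-8, decoding the source with a declared encoding."""
    out_encoding = "utf-8-sig" if with_bom else "utf-8"
    # newline="" passes line endings through untouched, which CSV files rely on.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=out_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# Usage with an illustrative Windows-1252 file.
folder = tempfile.mkdtemp()
legacy = os.path.join(folder, "legacy.csv")
clean = os.path.join(folder, "clean.csv")
with open(legacy, "wb") as f:
    f.write("name\r\nJosé\r\n".encode("cp1252"))

convert_to_utf8(legacy, clean, src_encoding="cp1252")
with open(clean, "rb") as f:
    print(f.read())  # b'name\r\nJos\xc3\xa9\r\n'
```

Pass `with_bom=True` only for a consumer that needs the BOM. If the source is UTF-8 that already carries one, passing `src_encoding="utf-8-sig"` drops it on the way through.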

When you import a file you didn't create, declare its encoding instead of letting the program guess. If you're not sure what it is, run it through the validator — it flags the encoding and the BOM, locally, without uploading anything.
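If you want a quick check in code, a common heuristic is to look for a BOM, then try a strict UTF-8 decode, and only then fall back to a legacy encoding. The sketch below illustrates that heuristic; it is not a description of how the validator works. `sniff_encoding` is a hypothetical name, and the Windows-1252 fallback is an assumption that suits Western European data only.

```python
import codecs

def sniff_encoding(data):
    """Guess (encoding, has_bom) for raw bytes. A heuristic, not a guarantee."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", True
    try:
        # Strict UTF-8 decoding rarely succeeds by accident on legacy text
        # that contains non-ASCII bytes.
        data.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        # Assumed fallback: Western European legacy data.
        return "cp1252", False

print(sniff_encoding(b"\xef\xbb\xbfid,name\n"))  # ('utf-8', True)
print(sniff_encoding("café".encode("utf-8")))    # ('utf-8', False)
print(sniff_encoding("café".encode("cp1252")))   # ('cp1252', False)
```

Pure ASCII input is reported as UTF-8, which is safe because ASCII is a subset of UTF-8. The result is still a guess, so declare the encoding explicitly once you know it.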

→ Validate encoding, locally

Common questions
  • How do I know which encoding my file is?

    If non-English text looks garbled, it's an encoding mismatch — usually a UTF-8 file read as Latin-1/Windows-1252 or vice versa. The validator inspects the bytes and reports the likely encoding and whether a BOM is present.

  • Why can't I just find-and-replace the bad characters?

    Because there are usually hundreds of them and the replacements depend on context. Fixing the encoding at read time repairs them all at once and correctly; manual replacement only papers over it.

  • Should I save with or without a BOM?

    Without, by default — it's the most compatible. Add a UTF-8 BOM only if a specific tool (often older Excel import paths) needs it to detect UTF-8.

  • Is my file uploaded to check its encoding?

    No. The validator reads the bytes entirely in your browser; nothing is sent to a server.
