Text on disk is just bytes; an encoding is the key that maps those bytes to characters. UTF-8 is the modern default and covers every language; legacy files are often Windows-1252 or Latin-1, which only cover Western European characters. Decode bytes with the wrong key and you get garbage.
There is no marker inside most files announcing their encoding, so programs guess — and a wrong guess is the root of every problem below.
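A minimal sketch of the "wrong key" effect, using nothing but Python's built-in codecs. The sample word is an illustration, not taken from any real file.

```python
# The same text, stored as UTF-8 bytes (é takes two bytes: C3 A9).
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # right key
as_cp1252 = raw.decode("cp1252")   # wrong key: each byte becomes its own character
as_latin1 = raw.decode("latin-1")  # wrong key, same result for these two bytes

print(raw)        # b'caf\xc3\xa9'
print(as_utf8)    # café
print(as_cp1252)  # cafÃ©
```

The bytes never change; only the key used to read them does.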
The fix is to re-open the file declaring UTF-8, not to hand-edit the broken characters: there are usually too many, and editing only masks the underlying mismatch. In Excel: Data → From Text/CSV, then set File Origin to 65001: Unicode (UTF-8). In code: specify encoding="utf-8" when you read it.
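In Python, declaring the encoding at read time might look like the sketch below. The sample file, its contents, and the helper name read_rows are made up for illustration; the only point is the encoding argument.

```python
import csv
import os
import tempfile

# Hypothetical sample file: one data row with accented text, saved as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write("name,city\nZoë,Málaga\n".encode("utf-8"))

def read_rows(path, encoding):
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))

declared = read_rows(path, "utf-8")   # accents intact
guessed = read_rows(path, "cp1252")   # same bytes, wrong key: accents garbled

print(declared[1])
print(guessed[1])
```

Nothing in the file was edited between the two reads; the second call simply used the wrong key.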
A black-diamond question mark (�, U+FFFD) appears where a character should be. That's the decoder hitting a byte sequence that is invalid in the encoding it assumed — often a UTF-8 file opened as ASCII, or a genuinely corrupted byte.
Re-open as UTF-8 first; if the � persists in the same spot, that byte is actually damaged in the source file and needs to be re-exported from the original system.
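To tell a damaged byte from a mere mismatch, it can help to find exactly where strict UTF-8 decoding fails. The sketch below is one way to do that in Python; the function name invalid_utf8_offsets and the sample bytes are hypothetical, and the repeated slicing makes it suitable only for small files.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return the byte offsets where strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            offsets.append(pos + err.start)
            pos += err.end  # skip the bad byte(s) and keep scanning
    return offsets

# Hypothetical bytes: a lone E9 (é in Latin-1) inside otherwise valid UTF-8.
data = b"id,name\n1,Jos\xe9\n2,Zo\xc3\xab\n"

print(invalid_utf8_offsets(data))             # [13]
print(data.decode("utf-8", errors="replace")) # the bad byte shows as U+FFFD
```

If the reported offsets stay the same after a fresh export, the bytes are probably wrong at the source rather than in your reader.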
Symptom 3
ï»¿ on the first header
Your first column header reads ï»¿id instead of id, or a strict parser complains about an unexpected character at position zero. That's a byte order mark (BOM): an invisible three-byte UTF-8 prefix (EF BB BF) that some apps, Excel included, add when saving.
A good parser strips the BOM automatically. If yours doesn't, save the file as UTF-8 without BOM, or run it through a tool that normalizes it.
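In Python, one way to get that behavior is the standard "utf-8-sig" codec, which drops a leading BOM if one is present and otherwise reads plain UTF-8. The sample bytes and the helper first_header below are illustrative assumptions.

```python
import codecs
import csv
import io

# Bytes as a BOM-writing app might save them: EF BB BF, then the CSV text.
raw = codecs.BOM_UTF8 + b"id,name\n1,Ana\n"

def first_header(data: bytes, encoding: str) -> str:
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))[0]

with_bom = first_header(raw, "utf-8")         # '\ufeffid': the BOM rides along
without_bom = first_header(raw, "utf-8-sig")  # 'id': the codec drops it

print(repr(with_bom), repr(without_bom))
```

Because "utf-8-sig" also reads BOM-less UTF-8 unchanged, it is a reasonable choice when you cannot know in advance whether the file has a BOM.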
Save every CSV as UTF-8 unless a specific downstream tool demands otherwise — it is the one encoding that handles every character and that virtually everything can read. Only add a BOM if a consumer (some Excel workflows) actually needs it.
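A sketch of that saving rule in Python: plain UTF-8 by default, with a BOM only on request. The function write_csv, its bom flag, and the file names are hypothetical.

```python
import csv
import os
import tempfile

def write_csv(path, rows, bom=False):
    # "utf-8-sig" prepends EF BB BF on write; plain "utf-8" does not.
    encoding = "utf-8-sig" if bom else "utf-8"
    with open(path, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)

rows = [["id", "name"], ["1", "Zoë"]]
folder = tempfile.mkdtemp()
plain = os.path.join(folder, "plain.csv")
for_excel = os.path.join(folder, "for_excel.csv")

write_csv(plain, rows)                # default: UTF-8, no BOM
write_csv(for_excel, rows, bom=True)  # only for a consumer that needs the BOM
```

Keeping the BOM behind an explicit flag means every file is BOM-free unless someone deliberately asked for one.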
When you import a file you didn't create, declare its encoding instead of letting the program guess. If you're not sure what it is, run it through the validator — it flags the encoding and the BOM, locally, without uploading anything.
If non-English text looks garbled, it's an encoding mismatch — usually a UTF-8 file read as Latin-1/Windows-1252 or vice versa. The validator inspects the bytes and reports the likely encoding and whether a BOM is present.
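For a rough idea of what "inspecting the bytes" can mean, here is a small heuristic in Python. It is not the validator's actual method; the function name sniff and its labels are invented for this sketch, and the Windows-1252 result is only a fallback guess, since almost any byte sequence decodes under a single-byte encoding.

```python
import codecs

def sniff(data: bytes) -> tuple:
    """Return (likely_encoding, has_bom) for raw file bytes. A heuristic only."""
    if data.startswith(codecs.BOM_UTF8):
        return ("utf-8", True)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8. Windows-1252 is a guess here, not a detection.
        return ("windows-1252 (guess)", False)
    if data.isascii():
        return ("ascii (also valid UTF-8)", False)
    return ("utf-8", False)

print(sniff(codecs.BOM_UTF8 + b"id,name\n"))
print(sniff("Zoë".encode("utf-8")))
print(sniff("Zoë".encode("cp1252")))
```

The useful asymmetry is that valid UTF-8 is hard to produce by accident, so a clean strict decode is strong evidence, while a failed one only tells you what the file is not.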
Why can't I just find-and-replace the bad characters?
Because there are usually hundreds of them and the replacements depend on context. Fixing the encoding at read time repairs them all at once and correctly; manual replacement only papers over it.
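To see why one decoding step beats many replacements, the sketch below garbles a sample sentence the way a wrong read would, then repairs every character with a single re-decode. The sentence is made up, and the round trip assumes the text was UTF-8 misread as Windows-1252; Python's cp1252 codec leaves five byte values unassigned, so some characters will not survive this trick, which is one more reason to fix the read itself when you still have the original file.

```python
original = "José bought crème brûlée in Málaga"

# What a wrong read produces: UTF-8 bytes decoded with the Windows-1252 key.
garbled = original.encode("utf-8").decode("cp1252")

# One step undoes all of it: back to the original bytes, then the right key.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

A find-and-replace would need a separate rule for each of the garbled pairs here, and more for every new accented letter that turns up.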
Should I save with or without a BOM?
Without, by default — it's the most compatible. Add a UTF-8 BOM only if a specific tool (often older Excel import paths) needs it to detect UTF-8.
Is my file uploaded to check its encoding?
No. The validator reads the bytes entirely in your browser; nothing is sent to a server.