ISO-8859-1 Mojibake Decoder

ISO-8859-1, as most systems implement it, maps every single-byte value (0x00 to 0xFF) to a code point, so a system that treats it as a transparent 8-bit container never rejects or discards a byte, whatever encoding the data was actually written in.
In other words, you can treat any foreign byte stream as if it were ISO-8859-1 and still get the original bytes back by re-encoding. This is how older MySQL versions, which defaulted to Latin-1, ended up holding text in all sorts of encodings: when no conversion is involved, the server simply stores the bytes it receives. ASCII is a 7-bit bucket; ISO-8859-1 is an 8-bit bucket.
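
The "8-bit bucket" property is easy to check. The following is a minimal sketch in Python, assuming Python's built-in `iso-8859-1` codec; the variable names are illustrative only.

```python
# Every possible single-byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as ISO-8859-1 never fails: byte 0xNN becomes code point U+00NN.
as_text = all_bytes.decode("iso-8859-1")

# Encoding the text back yields exactly the original bytes.
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text))                # 256
print(round_tripped == all_bytes)  # True
```

The same round trip fails with a codec that leaves some byte values undefined, which is why the trick is specific to ISO-8859-1.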

Is it OK to store Chinese in Latin-1? You can, but you shouldn’t. Once you do, the database no longer knows anything about character semantics: sorting, comparison, length calculations and string functions all operate on the individual bytes and return nonsense. For example, the UTF-8 sequence for “中” is 0xE4B8AD (three bytes). Inserted into a Latin-1 column, MySQL does not see one Chinese character; it sees three separate Latin-1 characters (0xE4, 0xB8, 0xAD), so a character-length function reports 3 instead of 1. The on-disk value is still 0xE4B8AD, so nothing is lost, but you must remember to interpret those bytes as UTF-8 when you read them back, or you will only see mojibake.
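
To make the “three characters instead of one” effect concrete, here is a small Python sketch. It only imitates what a Latin-1 column would see; it does not talk to MySQL, and the variable names are illustrative.

```python
original = "中"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex())  # e4b8ad

# What a Latin-1 column "sees": one character per byte.
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(len(original))   # 1
print(len(as_latin1))  # 3
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00E4', 'U+00B8', 'U+00AD']

# The bytes themselves are unchanged, so nothing has been lost yet.
print(as_latin1.encode("iso-8859-1") == utf8_bytes)  # True
```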

This tool reverses the mistake: it first turns the “Latin” characters back into raw bytes (shown as hex), then lets you pick the correct charset to decode the original text. Take the character “中”: its UTF-8 bytes are 0xE4B8AD; if those bytes are mistakenly rendered as ISO-8859-1, you see ä¸ followed by an invisible soft hyphen (U+00AD, from byte 0xAD). To restore it, encode all three characters as ISO-8859-1 to recover E4B8AD, then decode those bytes as UTF-8 to get the original “中”. If the invisible third character is lost in copy and paste, only E4B8 comes back and the UTF-8 decode fails.

C# snippet for fixing mojibake:

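using System;
using System.Text;
// "ä¸" plus the invisible soft hyphen U+00AD, written as escapes so the third character cannot get lost.
byte[] raw = Encoding.GetEncoding("ISO-8859-1").GetBytes("\u00E4\u00B8\u00AD"); // E4 B8 AD
string result = Encoding.UTF8.GetString(raw); // decode the recovered bytes as UTF-8
Console.WriteLine(result); // 中, if the console can display it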
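
The same two-step flow, including the intermediate hex view and a choice of target charset, can be sketched in Python. The function name `recover` and its parameters are hypothetical and only approximate what the tool does; they are not its actual implementation.

```python
from typing import Tuple

def recover(garbled: str, actual_encoding: str = "utf-8") -> Tuple[str, str]:
    """Return (hex of the recovered bytes, text decoded with actual_encoding)."""
    # Step 1: turn the "Latin" characters back into the raw bytes.
    raw = garbled.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding they were really written in.
    return raw.hex().upper(), raw.decode(actual_encoding)

# UTF-8 bytes of "中" misread as ISO-8859-1 (the last character is a soft hyphen).
print(recover("\u00e4\u00b8\u00ad"))    # ('E4B8AD', '中')

# GBK bytes of "中" (0xD6D0) misread as ISO-8859-1 show up as "ÖÐ".
print(recover("\u00d6\u00d0", "gbk"))   # ('D6D0', '中')
```

Writing the garbled input with escapes avoids depending on whether an editor or clipboard preserves the invisible soft hyphen.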

Mojibake in a nutshell

What you call “ISO-8859-1 garbage” is almost never real ISO-8859-1; it is a UTF-8 (or GBK) byte stream that was decoded as ISO-8859-1 by your terminal, browser or database.
To fix it, reverse the error: encode the text back to bytes using ISO-8859-1, then decode those bytes with the actual encoding.
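As long as the raw bytes are still intact, the process is lossless. If any step along the way performed a lossy conversion, for example replacing characters it could not map with “?” or dropping invisible ones, the original bytes cannot be reconstructed and the data may be gone for good.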
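
Whether the bytes are still intact can be tested before trusting a repair. The sketch below is one possible check in Python; `try_repair` is a hypothetical helper that returns `None` when the text cannot be a clean ISO-8859-1 misreading of the chosen encoding.

```python
from typing import Optional

def try_repair(garbled: str, actual_encoding: str = "utf-8") -> Optional[str]:
    try:
        # Fails if the text contains characters outside U+0000..U+00FF,
        # i.e. it was not produced by a plain ISO-8859-1 decode.
        raw = garbled.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None
    try:
        # Fails if the bytes were damaged or the guessed encoding is wrong.
        return raw.decode(actual_encoding)
    except UnicodeDecodeError:
        return None

print(try_repair("\u00e4\u00b8\u00ad"))  # 中   (all three bytes intact)
print(try_repair("\u00e4\u00b8"))        # None (soft hyphen was dropped)
print(try_repair("\u00e4??"))            # None (bytes were replaced by "?")
print(try_repair("\u4e2d"))              # None (already proper text, not Latin-1)
```

A `None` result does not say which step destroyed the data, only that this particular reversal is not possible.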
