A journey into character decoding
Disclaimer: This post isn’t a complete or technically rigorous explanation of UTF-8 decoding. It’s more of a “here’s how I started to wrap my head around it” kind of thing. The Lua code is intentionally simplified — it skips over edge cases like invalid byte sequences, overlong encodings, and error handling. If you’re looking for a formal spec, check out RFC 3629 or the Unicode Standard.
This deeper dive into character encoding came about because of an issue where Git claimed a change in an edited file even though the before and after lines looked identical in the GitHub UI, and even locally, comparing the diff, there was no visible difference in my IDE.
Viewing the source code as hex showed a line feed at the end of the file, which was not showing in text editors. The original copy did not end with a line feed.
The line feed should be there, but it's old code, the system isn't well tested and is possibly fragile, and I didn't want to break something accidentally. So I ran truncate -s -1 file_name, removing the last byte from the file, and merged my changes without the added line feed.
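To see bytes that a text editor hides, a file can be dumped as hex. Here's a minimal Lua sketch of the idea (the path passed to it is whatever file you want to inspect; no error handling beyond the assert):

```lua
-- Print every byte of a file as two-digit hex, 16 bytes per line,
-- to spot invisible bytes like a trailing line feed (0a).
local function hex_dump(path)
  local f = assert(io.open(path, "rb"))
  local data = f:read("a")
  f:close()
  local out = {}
  for i = 1, #data do
    out[#out + 1] = string.format("%02x", string.byte(data, i))
    out[#out + 1] = (i % 16 == 0) and "\n" or " "
  end
  return table.concat(out)
end

-- hex_dump("file_name") -- a file ending in a line feed ends in "0a"
```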
This is what sent me down the rabbit hole of how character encoding/decoding works more than I understood already.
I watched a video on YouTube, and my understanding became this:
There was 7-bit ASCII, which was okay but limited the number of characters available. With more bits came more character sets, and, as with most things without a standard, it got pretty messy. Consequently, the Unicode Consortium put together a reasonable standard assigning a code point to every character. Then, after some iterations, UTF-8 was created as an encoding of Unicode that is backwards compatible with ASCII.
I wanted to look at character encoding at a lower level of detail, and I've been learning Lua, so I decided to try decoding some ASCII in Lua. It was pretty easy and looked something like this.
-- Writes ASCII characters to file. UNCOMMENT TO CREATE THE FILE
-- local file = io.open("ascii", "w")
-- file:write("Hello, world!\n")
-- file:close()
local file = io.open("ascii", "r")
if file == nil then
print("Unable to open file ascii")
os.exit(-1)
end
local file_contents = file:read()
if file_contents == nil then os.exit() end
local ascii_map = {
[72] = "H",
[101] = "e",
[108] = "l",
[44] = ",",
[32] = " ",
[119] = "w",
[111] = "o",
[114] = "r",
[100] = "d",
[33] = "!",
[10] = "\n",
}
for i = 1, #file_contents do
local byte = string.byte(file_contents, i)
io.write(ascii_map[byte])
end
io.write(ascii_map[10]) -- write LF (byte 10, 0x0A)
Above, I created a table/map in Lua of the ASCII characters I would try to decode.
I read in the file contents, Hello, world!
(all ASCII characters), access the map with each byte, writing the characters to stdout along the way; voila, it works.
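As an aside, the lookup table isn't strictly necessary for ASCII: Lua's built-in string.char is the inverse of string.byte, so each byte converts straight back to its character:

```lua
-- string.char turns a byte value back into its one-character string,
-- the inverse of string.byte -- no handwritten table needed.
local line = "Hello, world!"
for i = 1, #line do
  io.write(string.char(string.byte(line, i)))
end
io.write("\n") -- prints: Hello, world!
```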
Then, I wanted to extend this into a UTF-8 implementation that would handle multibyte characters.
How UTF-8 works.
ASCII characters are 7-bit, so as long as the leftmost bit of a byte is a zero, it is an ASCII character; otherwise, it's part of a multibyte character. The leading byte announces the length: if it starts with the bits 110, it's a two-byte character; if it starts with 1110, it's a three-byte character; and if it starts with 11110, it's a four-byte character. Every following byte of a multibyte character is a continuation byte starting with the bits 10.
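The rules above can be sketched as a small Lua function (Lua 5.3+ for the bitwise operators); each mask checks the fixed high bits of one leading-byte pattern:

```lua
-- Returns the sequence length a leading byte announces:
-- 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
local function sequence_length(byte)
  if byte & 0x80 == 0x00 then return 1      -- 0xxxxxxx: ASCII
  elseif byte & 0xE0 == 0xC0 then return 2  -- 110xxxxx
  elseif byte & 0xF0 == 0xE0 then return 3  -- 1110xxxx
  elseif byte & 0xF8 == 0xF0 then return 4  -- 11110xxx
  end
  return nil -- continuation byte (10xxxxxx) or invalid leader
end

print(sequence_length(0x48)) -- 1: 'H'
print(sequence_length(0xC2)) -- 2: leading byte of "¢"
print(sequence_length(0xF0)) -- 4: leading byte of an emoji
```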
I started with a tree-type structure containing the multibyte sequences so I could walk over each byte until I reached a character, print it, and repeat for each character.
Which ended up like this:
-- Writes extended part of UTF-8 and original ASCII characters to file. UNCOMMENT TO CREATE THE FILE
--local file = io.open("utf8", "w")
--file:write("¢harmeleon")
--file:close()
if arg[1] == nil or arg[1] == "" or arg[1] == " " then
print("Usage: " .. arg[0] .. " <file>")
os.exit(-1)
end
local file = io.open(arg[1], "r")
if file == nil then
print("Unable to open file ".. arg[1])
os.exit(-1)
end
local file_contents = file:read("a") -- read the whole file, not just the first line
if file_contents == nil then os.exit() end
-- UTF-8 map (ASCII + extended multibyte UTF-8)
local multi_byte_map = {
-- ASCII characters (0x00 to 0x7F)
[0x00] = "\0", [0x01] = "\1", [0x02] = "\2", [0x03] = "\3",
[0x04] = "\4", [0x05] = "\5", [0x06] = "\6", [0x07] = "\a",
[0x08] = "\b", [0x09] = "\t", [0x0A] = "\n", [0x0B] = "\v",
[0x0C] = "\f", [0x0D] = "\r", [0x0E] = "\14", [0x0F] = "\15",
[0x10] = "\16", [0x11] = "\17", [0x12] = "\18", [0x13] = "\19",
[0x14] = "\20", [0x15] = "\21", [0x16] = "\22", [0x17] = "\23",
[0x18] = "\24", [0x19] = "\25", [0x1A] = "\26", [0x1B] = "\27",
[0x1C] = "\28", [0x1D] = "\29", [0x1E] = "\30", [0x1F] = "\31",
[0x20] = " ", [0x21] = "!", [0x22] = "\"", [0x23] = "#",
[0x24] = "$", [0x25] = "%", [0x26] = "&", [0x27] = "'",
[0x28] = "(", [0x29] = ")", [0x2A] = "*", [0x2B] = "+",
[0x2C] = ",", [0x2D] = "-", [0x2E] = ".", [0x2F] = "/",
[0x30] = "0", [0x31] = "1", [0x32] = "2", [0x33] = "3",
[0x34] = "4", [0x35] = "5", [0x36] = "6", [0x37] = "7",
[0x38] = "8", [0x39] = "9", [0x3A] = ":", [0x3B] = ";",
[0x3C] = "<", [0x3D] = "=", [0x3E] = ">", [0x3F] = "?",
[0x40] = "@", [0x41] = "A", [0x42] = "B", [0x43] = "C",
[0x44] = "D", [0x45] = "E", [0x46] = "F", [0x47] = "G",
[0x48] = "H", [0x49] = "I", [0x4A] = "J", [0x4B] = "K",
[0x4C] = "L", [0x4D] = "M", [0x4E] = "N", [0x4F] = "O",
[0x50] = "P", [0x51] = "Q", [0x52] = "R", [0x53] = "S",
[0x54] = "T", [0x55] = "U", [0x56] = "V", [0x57] = "W",
[0x58] = "X", [0x59] = "Y", [0x5A] = "Z", [0x5B] = "[",
[0x5C] = "\\", [0x5D] = "]", [0x5E] = "^", [0x5F] = "_",
[0x60] = "`", [0x61] = "a", [0x62] = "b", [0x63] = "c",
[0x64] = "d", [0x65] = "e", [0x66] = "f", [0x67] = "g",
[0x68] = "h", [0x69] = "i", [0x6A] = "j", [0x6B] = "k",
[0x6C] = "l", [0x6D] = "m", [0x6E] = "n", [0x6F] = "o",
[0x70] = "p", [0x71] = "q", [0x72] = "r", [0x73] = "s",
[0x74] = "t", [0x75] = "u", [0x76] = "v", [0x77] = "w",
[0x78] = "x", [0x79] = "y", [0x7A] = "z", [0x7B] = "{",
[0x7C] = "|", [0x7D] = "}", [0x7E] = "~", [0x7F] = "\127", -- DEL
-- 2-byte sequences
[0xC2] = {
[0xA2] = "¢"
},
[0xD0] = {
[0x9F] = "П",
[0xB8] = "и",
[0xB2] = "в",
[0xB5] = "е",
},
[0xD1] = {
[0x80] = "р",
[0x82] = "т",
},
-- 3-byte sequences
[0xE0] = {
[0xA4] = {
[0xA8] = "न",
[0xAE] = "म",
[0xB8] = "स",
[0x95] = "क",
[0xBE] = "ा",
[0xB0] = "र",
},
[0xA5] = {
[0x8D] = "्"
}
},
[0xE2] = {
[0x9C] = {
[0x94] = "✔"
},
[0x99] = {
[0xAA] = "♪"
}
},
[0xE3] = {
[0x81] = {
[0x93] = "こ",
[0xAB] = "に",
[0xA1] = "ち",
[0xAF] = "は"
},
[0x82] = {
[0x93] = "ん"
}
},
[0xE4] = {
[0xB8] = {
[0xAD] = "中"
}
},
[0xE6] = {
[0x96] = {
[0x87] = "文"
}
},
-- 2-byte sequences (continued)
[0xD9] = {
[0x85] = "م",
[0x84] = "ل",
},
[0xD8] = {
[0xB1] = "ر",
[0xAD] = "ح",
[0xA8] = "ب",
[0xA7] = "ا",
},
[0xD7] = {
[0xA9] = "ש",
[0x9C] = "ל",
[0x95] = "ו",
[0x9D] = "ם"
},
-- 4-byte sequences
[0xF0] = {
[0x9F] = {
[0x98] = {
[0x81] = "😁" -- emoji
}
}
}
}
local function get_bytes_from_string(str, bytes, i)
if i > string.len(str) then
return bytes
end
local byte = string.byte(str, i)
table.insert(bytes, byte)
return get_bytes_from_string(str, bytes, i + 1)
end
local bytes = get_bytes_from_string(file_contents, {}, 1)
local function map_bytes_as_utf8(bytes, map_to_return, bytes_of_char, i)
if i > #bytes then
return map_to_return
end
local curr_byte = bytes[i]
table.insert(bytes_of_char, curr_byte)
local how_many_bytes_in_char = 1
if bytes_of_char[1] & 0xF0 == 0xF0 then
how_many_bytes_in_char = 4
elseif bytes_of_char[1] & 0xE0 == 0xE0 then
how_many_bytes_in_char = 3
elseif bytes_of_char[1] & 0xC0 == 0xC0 then
how_many_bytes_in_char = 2
end
if #bytes_of_char < how_many_bytes_in_char then
return map_bytes_as_utf8(bytes, map_to_return, bytes_of_char, i + 1)
end
table.insert(map_to_return, bytes_of_char)
bytes_of_char = {}
return map_bytes_as_utf8(bytes, map_to_return, bytes_of_char, i + 1)
end
local utf8_chars = map_bytes_as_utf8(bytes, {}, {}, 1)
-- get character at nth byte
-- The naming is bad, but I wanted it not to be too wide in print_chars(...), or too high, for the post.
local function gcnb(char, n)
return char[n]
end
local function print_chars(chars, i)
if i > #chars then return end
local c = chars[i]
if #chars[i] == 1 then
io.write(multi_byte_map[gcnb(c, 1)])
elseif #chars[i] == 2 then
io.write(multi_byte_map[gcnb(c, 1)][gcnb(c, 2)])
elseif #chars[i] == 3 then
io.write(multi_byte_map[gcnb(c, 1)][gcnb(c, 2)][gcnb(c, 3)])
elseif #chars[i] == 4 then
io.write(multi_byte_map[gcnb(c, 1)][gcnb(c, 2)][gcnb(c, 3)][gcnb(c, 4)])
end
print_chars(chars, i + 1)
end
print_chars(utf8_chars, 1)
Several functions are called in succession.
get_bytes_from_string
: creates an array of bytes from a string.
map_bytes_as_utf8
: recursively traverses an array of bytes, creating an array of byte arrays grouped by char.
print_chars
: recursively traverses an array of byte arrays, using gcnb to index into the global multi_byte_map, and prints each character.
By extending multi_byte_map
to hold all of the UTF-8 characters, the script can take any UTF-8 file, read it in as a byte array, and print the characters to stdout.
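An alternative to a giant lookup table is to compute the code point arithmetically: strip the marker bits from the leading byte, then shift in the low six bits of each continuation byte. Lua 5.3+ also ships a built-in utf8 library whose utf8.char turns a code point back into a UTF-8 string. A sketch of that approach (like the rest of this post, it assumes well-formed input and does no validation):

```lua
-- Decode one UTF-8 sequence (a table of its bytes) into a code point.
-- Assumes a well-formed sequence; no validation, like the post's decoder.
local function decode_sequence(bytes)
  -- payload bits available in the leading byte, by sequence length
  local masks = { 0x7F, 0x1F, 0x0F, 0x07 }
  local code_point = bytes[1] & masks[#bytes]
  for i = 2, #bytes do
    -- each continuation byte contributes its low six bits
    code_point = (code_point << 6) | (bytes[i] & 0x3F)
  end
  return code_point
end

print(decode_sequence({ 0xC2, 0xA2 }))            -- 162 (U+00A2)
print(utf8.char(decode_sequence({ 0xC2, 0xA2 }))) -- ¢
```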