A journey into character decoding
Disclaimer: This post isn’t a complete or technically rigorous explanation of UTF-8 decoding. It’s more of a “here’s how I started to wrap my head around it” kind of thing. The Lua code is intentionally simplified — it skips over edge cases like invalid byte sequences, overlong encodings, and error handling. If you’re looking for a formal spec, check out RFC 3629 or the Unicode Standard.
This deeper dive came about because a file showed a change in a PR where the before and after looked identical in the GitHub UI, and even locally there was no visible difference in my IDE.
Viewing the source code as hex showed a line feed at the end of the file which was not showing in text editors. The original copy did not end with a line feed.
Really, the line feed should be there, but I didn’t want to get hassled in code review. Actually, it’s more that I didn’t want to accidentally break something: it’s old code, and the system’s possibly fragile. So I ran truncate -s -1 file_name, removing the last byte from the file.
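For what it’s worth, you can inspect that last byte from Lua too. A minimal sketch, with file_name standing in for the real file:
-- Read the whole file in binary mode and print the value of its last byte.
local f = assert(io.open("file_name", "rb"))
local data = f:read("a") -- "a" reads the entire file (Lua 5.3+)
f:close()
print(string.format("last byte: 0x%02x", string.byte(data, #data))) -- 0x0a if it ends with LF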
That little incident is essentially what sent me down the rabbit hole of understanding character encoding/decoding a bit better than I already did.
I watched a video here: YouTube
And my understanding became this:
There was 7-bit ASCII, which was OK, but it offered a limited number of characters. With more bits came more character sets, and as with most things without a standard, it got pretty messy. Consequently, the Unicode Consortium put together a reasonable standard, which was backwards compatible with ASCII. Then UTF-8 was created as an encoding of Unicode, which is, put simply, somewhat elegant.
So, then I wanted to make sure I understood it in a bit more low-level detail, and since I’ve been learning Lua, I decided I would try to decode some ASCII in Lua. It was pretty easy and looked something like this.
-- Writes a plain ASCII string to a file. UNCOMMENT TO CREATE THE FILE
-- local file = io.open("ascii", "w")
-- file:write("Hello, world!\n")
-- file:close()
local file = io.open("ascii", "r")
if file == nil then
    print("Unable to open file: ascii")
    os.exit(-1)
end
local file_contents = file:read()
if file_contents == nil then os.exit() end
local ascii_map = {
    [72] = "H",
    [101] = "e",
    [108] = "l",
    [44] = ",",
    [32] = " ",
    [119] = "w",
    [111] = "o",
    [114] = "r",
    [100] = "d",
    [33] = "!",
    [10] = "\n",
}
for i = 1, #file_contents do
    local byte = string.byte(file_contents, i)
    io.write(ascii_map[byte])
end
io.write(ascii_map[10]) -- write LF
Above, I created a table/map in Lua of the ASCII characters I was going to decode. I read in the file contents Hello, world! (all ASCII characters), access the map with each byte, writing each character to stdout along the way, and voila, it works. (One detail: file:read() with no arguments reads a single line and strips the trailing newline, which is why the script writes the LF manually at the end.)
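As a quick sanity check (my addition, not part of the original script), the hand-written map can be verified against Lua’s built-in string.char by appending this:
-- Every entry in the map should agree with Lua's own byte-to-character conversion.
for byte, char in pairs(ascii_map) do
    assert(string.char(byte) == char, "map disagrees at byte " .. byte)
end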
Then I wanted to extend this to UTF-8 and implement a second part that would handle multi-byte characters. This is where things started to get interesting; it’s the elegant part I mentioned earlier: how UTF-8 works.
ASCII characters are 7-bit, so we can say that as long as a byte starts with a zero it is an ASCII character; otherwise it’s from the extended part and belongs to a multi-byte character.
Each byte of a multi-byte character begins with an active bit, i.e. the left-most bit of the byte is on/1/active. By peeking at the left-most bit of the next byte, we can tell whether the multi-byte character continues or ends.
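To make that concrete, here are the raw bytes of the two-byte character ¢ followed by an ASCII h. This is a quick check you can run on its own, not part of the decoder below:
-- "¢" (U+00A2) is two bytes in UTF-8; "h" is plain ASCII.
-- 0xC2 = 11000010 (first byte of the multi-byte character, left-most bit set)
-- 0xA2 = 10100010 (second byte, left-most bit also set)
-- 0x68 = 01101000 ("h", left-most bit clear, so the multi-byte character has ended)
local s = "¢h"
for i = 1, #s do
    local b = string.byte(s, i)
    print(string.format("byte %d: 0x%02X, left-most bit = %d", i, b, (b >> 7) & 1))
end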
To implement this, I decided to start with a tree-like structure containing the extended parts, which I could walk sequentially with each byte until I reached a character, print the character, and continue. It ended up like this:
-- Writes extended part of UTF-8 and original ASCII characters to file. UNCOMMENT TO CREATE THE FILE
-- local file = io.open("utf8", "w")
-- file:write("¢harmele")
-- file:close()
local file = io.open("utf8", "r")
if file == nil then
    print("Unable to open file: utf8")
    os.exit(-1)
end
local file_contents = file:read()
if file_contents == nil then os.exit() end
local multi_byte_map = {
    [72] = "H",
    [101] = "e",
    [108] = "l",
    [44] = ",",
    [32] = " ",
    [119] = "w",
    [111] = "o",
    [114] = "r",
    [100] = "d",
    [33] = "!",
    [10] = "\n",
    [104] = "h",
    [97] = "a",
    [109] = "m",
    [110] = "n",
    [194] = {
        [162] = "¢",
    },
}
-- Recursively walk over a list (table) of bytes to get a multi-byte char
local function get_char(bytes, map)
    if type(map) == "string" then return map end
    local first_byte = bytes[1]
    table.remove(bytes, 1)
    return get_char(bytes, map[first_byte])
end
local multi_byte_char = {}
for i = 1, #file_contents do
    local byte = string.byte(file_contents, i)
    local left_most_bit = (byte >> 7) & 1
    if left_most_bit == 1 then
        -- part of a multi-byte character: collect it
        table.insert(multi_byte_char, byte)
        -- peek at the next byte; nil means end of input, which also ends the character
        local peek = string.byte(file_contents, i + 1)
        if peek == nil or (peek >> 7) & 1 == 0 then
            io.write(get_char(multi_byte_char, multi_byte_map))
            multi_byte_char = {}
        end
    else
        -- plain ASCII byte
        io.write(get_char({byte}, multi_byte_map))
    end
end
io.write(multi_byte_map[10]) -- write LF
As we go over the bytes sequentially inside the for loop, if the left-most bit of the current byte is 1, we know it is part of a multi-byte character and start appending bytes to a list (table). We know we have reached the end of the multi-byte character by peeking at the next byte. We then pass that list of bytes to a recursive function, which walks the multi-byte map until it reaches a character. We write the character to stdout and continue to the next one.
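To see the recursive walk in isolation, here is a trace you could append to the script above (it relies on the get_char and multi_byte_map definitions already in scope):
-- get_char({194, 162}, multi_byte_map):
--   map[194] is a table, so recurse with the remaining bytes {162}
--   map[162] in that nested table is the string "¢", so the recursion stops
print(get_char({194, 162}, multi_byte_map)) -- ¢
print(get_char({104}, multi_byte_map))      -- h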
Notes:
- string.byte returns nil when you peek past the end of the string, so without the nil check in the loop the script would crash if the input ended with a multi-byte character.
- I only implemented the characters I was going to decode rather than the full set, so there are probably more bugs here as well.
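One more limitation worth naming: peeking only at the left-most bit can’t separate two adjacent multi-byte characters, since every byte of a multi-byte character starts with a 1. Real UTF-8 avoids this by encoding the sequence length in the first byte: 110xxxxx means two bytes, 1110xxxx three, 11110xxx four, and continuation bytes always start with 10. A minimal sketch of that check, which isn’t part of my original script:
-- Determine how many bytes a UTF-8 sequence occupies from its lead byte.
-- Returns nil for continuation bytes (10xxxxxx) and invalid lead bytes.
local function sequence_length(lead)
    if lead < 0x80 then
        return 1 -- 0xxxxxxx: plain ASCII
    elseif lead >= 0xC0 and lead < 0xE0 then
        return 2 -- 110xxxxx
    elseif lead >= 0xE0 and lead < 0xF0 then
        return 3 -- 1110xxxx
    elseif lead >= 0xF0 and lead < 0xF8 then
        return 4 -- 11110xxx
    end
    return nil
end

print(sequence_length(0xC2)) -- 2 (lead byte of ¢)
print(sequence_length(0x68)) -- 1 ("h")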