Disclaimer: This post isn’t a complete or technically rigorous explanation of UTF-8 decoding. It’s more of a “here’s how I started to wrap my head around it” kind of thing. The Lua code is intentionally simplified — it skips over edge cases like invalid byte sequences, overlong encodings, and error handling. If you’re looking for a formal spec, check out RFC 3629 or the Unicode Standard.

This deeper dive came about because a file showed a change in a PR where the before and after looked identical in the GitHub UI, and even locally there was no visible difference in my IDE.

Viewing the file as hex showed a line feed at the end of the file which was not visible in text editors. The original copy did not end with a line feed.

Really, the line feed should be there, but I didn’t want to get hassled in code review. Actually, it’s more that I didn’t want to accidentally break something: it’s old code, and the system’s possibly fragile. So I ran truncate -s -1 file_name, removing the last byte from the file.

This is essentially what sent me down the rabbit hole of learning how character encoding/decoding works in a little more depth than I already understood.

I watched a video about it on YouTube.

And my understanding became this:

There was 7-bit ASCII, which was OK, but limited in the number of characters available. With more bits came more character sets, and as with most things without a standard, it got pretty messy. Consequently, the Unicode Consortium put together a reasonable standard, which was backwards compatible with ASCII. Then UTF-8 was created as an encoding of Unicode, and it is, put simply, somewhat elegant.

So, then I wanted to make sure I understood it in a little more low-level detail, and since I’ve been learning Lua, I decided I would try to decode some ASCII in Lua. It was pretty easy and looked something like this.

-- Writes ASCII characters to a file. UNCOMMENT TO CREATE THE FILE
-- local file = io.open("ascii", "w")
-- file:write("Hello, world!\n")
-- file:close()

local file = io.open("ascii", "r")
if file == nil then
  print("Unable to open file ".. arg[1])
  os.exit(-1)
end

local file_contents = file:read()
if file_contents == nil then os.exit() end

local ascii_map = {
  [72]  = "H",
  [101] = "e",
  [108] = "l",
  [44]  = ",",
  [32]  = " ",
  [119] = "w",
  [111] = "o",
  [114] = "r",
  [100] = "d",
  [33]  = "!",
  [10]  = "\n",
}

for i = 1, #file_contents do
  local byte = string.byte(file_contents, i)
  io.write(ascii_map[byte])
end
io.write(ascii_map[10]) -- write LF (file:read() strips the trailing newline)

Above, I created a table/map in Lua of the ASCII characters I was going to decode. I read in the file contents Hello, world! (all ASCII characters), access the map with each byte, writing each character to stdout along the way, and voilà, it works.

Then I wanted to extend this to UTF-8 and implement a second part, which would handle multi-byte characters. This is where things started to get interesting; it’s the elegant part I mentioned earlier: how UTF-8 works.

ASCII characters are 7-bit, so we say: as long as a byte starts with a zero, it is an ASCII character; otherwise, it’s from the extended part and it’s a multi-byte character.

Each byte of a multi-byte character begins with an active bit, i.e. the left-most bit of the byte is on/1/active. By peeking at the left-most bit of the next byte we can tell whether the multi-byte character continues or ends.
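
In real UTF-8 there’s a bit more structure than that single flag bit: the leading byte of a multi-byte character starts with 11 and every continuation byte starts with 10, which is how a decoder tells them apart. Here’s a small sketch of classifying a byte by its high bits (my own illustration, not from the video; the decoder below only peeks at the single left-most bit):

-- Classify a UTF-8 byte by its high bits
local function classify(byte)
  if byte >> 7 == 0 then            -- 0xxxxxxx
    return "ASCII"
  elseif byte >> 6 == 2 then        -- 10xxxxxx
    return "continuation byte"
  else                              -- 11xxxxxx
    return "leading byte of a multi-byte char"
  end
end

-- "¢" is encoded as the two bytes 194 (0xC2) and 162 (0xA2)
print(classify(72))  --> ASCII
print(classify(194)) --> leading byte of a multi-byte char
print(classify(162)) --> continuation byte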

To implement this, I decided to start with a tree-type structure containing the extended parts, which I could walk sequentially with each byte until I reach a character, print the character, and continue.

Which ended up like this:

-- Writes extended part of UTF-8 and original ASCII characters to file. UNCOMMENT TO CREATE THE FILE
-- local file = io.open("utf8", "w")
-- file:write("¢harmele")
-- file:close()

local file = io.open("utf8", "r")
if file == nil then
  print("Unable to open file ".. arg[1])
  os.exit(-1)
end

local file_contents = file:read()
if file_contents == nil then os.exit() end

local multi_byte_map = {
  [72]  = "H",
  [101] = "e",
  [108] = "l",
  [44]  = ",",
  [32]  = " ",
  [119] = "w",
  [111] = "o",
  [114] = "r",
  [100] = "d",
  [33]  = "!",
  [10]  = "\n",
  [104] = "h",
  [97]  = "a",
  [109] = "m",
  [110] = "n",
  [194] = {
    [162] = "¢",
  },
}

-- Recursively walk over list (table) of bytes to get multi byte char
local function get_char(bytes, map)
  if type(map) == "string" then return map end
  local first_byte = bytes[1]
  table.remove(bytes, 1)
  return get_char(
    bytes,
    map[first_byte]
  )
end

local multi_byte_char = {}
for i = 1, #file_contents do
  local byte = string.byte(file_contents, i)
  local left_most_bit = (byte >> 7) & 1
  if left_most_bit == 1 then
    table.insert(multi_byte_char, byte)

    local peek = string.byte(file_contents, i + 1)
    local peeked_left_most_bit = (peek >> 7) & 1

    if peeked_left_most_bit == 0 then
      io.write(get_char(multi_byte_char, multi_byte_map))
      multi_byte_char = {}
    end
  else
    io.write(get_char({byte}, multi_byte_map))
  end
end
io.write(multi_byte_map[10]) -- write LF

As we go over the bytes sequentially inside the for loop, if the left-most bit of the current byte is 1 we know it is part of a multi-byte char, so we start appending bytes to a list (table). We know we have reached the end of a multi-byte char by peeking at the next byte. We then take the list of bytes and pass it to a recursive function which walks the multi-byte map until it reaches a character. We write the character to stdout and then continue to the next char.
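
As an aside: Lua 5.3 and later ship a built-in utf8 library, which is handy for sanity-checking a hand-rolled decoder against the real thing. Something like this iterates the actual code points:

-- Decode with the standard utf8 library (Lua 5.3+) for comparison
for _, codepoint in utf8.codes("¢harmele") do
  io.write(utf8.char(codepoint)) -- re-encode the code point and print it
end
io.write("\n")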

Notes:

  • There is likely a bug if the last character is a multi-byte character, as when it tries to peek at the left-most bit of the next byte, that next byte is non-existent (a possible guard is sketched just after this list).
  • I only implemented the characters I was going to decode, rather than the full set, so maybe there are more bugs here as well.
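
For that first note, a minimal guard (a sketch, untested beyond the happy path) is to treat “no next byte” the same as a next byte whose left-most bit is 0:

    local peek = string.byte(file_contents, i + 1)
    -- string.byte returns nil past the end of the string,
    -- so end the multi-byte char there too
    if peek == nil or (peek >> 7) & 1 == 0 then
      io.write(get_char(multi_byte_char, multi_byte_map))
      multi_byte_char = {}
    end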
