Writing My Own Programming Language
Why I Wrote My Own Programming Language
Last summer, I spent some time learning C. I liked the low-level control it provided but I was frustrated by the lack of clarity in the code as my projects grew more complex. And then I'd switch back to Python for some scripting, and while it was simple and familiar, I quickly realized that I missed the safety and certainty of static typing in C that just couldn't be supplemented by type annotations.
So I was basically stuck between Python which was familiar yet limiting, and C which was unrestrictive yet convoluted (at that time). This led me to think about what I really wanted altogether in one language, thus giving me the idea of creating my own called Phase -- a statically-typed bytecode-interpreted programming language that was designed to combine the expressiveness of high-level languages (like Python) with the explicitness of lower-level languages (like C). I chose the name 'Phase' because it represents an active shift between language types, and pretty much every other distinct name I came up with was already taken.
As for the toolchain, even though there are various powerful compiler tools that already exist like LLVM and YACC, I decided to fully handwrite this project myself because it let me shape the language exactly how I envisioned it. And it really forced me to learn proper compiler/interpreter theory; I challenged myself to do things the hard way so I could develop my skills the most.
Also, another important aspect of Phase that I put much thought into was the error system. Since this was, in some way, my ideal language, I realized that it had to also solve my frustrations with vague errors that I've had in other language -- for example, C's segfault and Python's convoluted messages. So I spent some time creating an informative and visually appealing diagnostics manager which was partially inspired by Rust's errors. Here's an example:
┏ Fatal Error [102]: Expected ')'.
┃ --> ../tests/missing_paren.phase:2:19-19
┃
┃ 2 | out("a, b, c:"
┃ | ^
┃
┣ Help: Add ')' here.
┃ Suggestion:
┃ - out("a, b, c:"
┃ + out("a, b, c:")
This message tells you what went wrong, where it went wrong, why it went wrong, and how it could be fixed -- all in one message, and essentially everything you need to solve the issue.
How the Interpreter Works
Phase's interpreter has several stages:
$$ \def\arraystretch{1.28} \newcommand{\vdown}{\big\downarrow} \begin{array}{c} \text{SOURCE CODE} \\ \vdown \\ \text{Frontend} \\[0.15em] \boxed{\begin{array}{c} \text{Lexer} \\ \vdown \\ \text{Parser} \\[-1.3em] \hphantom{\text{Bytecode Generator}} \end{array}} \\ \vdown \\ \text{Backend} \\[0.15em] \boxed{\begin{array}{c} \text{Type Checker} \\ \vdown \\ \text{Bytecode Generator} \\ \vdown \\ \text{Virtual Machine} \end{array}} \\ \vdown \\ \text{OUTPUT} \end{array} $$To demonstrate the process, we'll follow this basic line of Phase code from beginning to end:
out("Hello world!")
1. Lexer
The lexer (short for lexical analyzer) is the very first stage, taking in raw source code as an arbitrary character string, and breaking it down into tokens that represent keywords, operators, types, and literals. So our line of source code will be translated into this form:
OUT
LPAREN
STRING_LIT 'Hello world!'
RPAREN
NEWLINE
Now we have these organized tokens that separate each component of the string, making it much easier to process than the original line of code. Apart from that, the lexer also handles other cases like skipping whitespace, distinguishing between keywords and identifiers, and string literals with escape sequences. In concept, the lexer's pretty simple since it's just processing strings from a file, which is what I thought until I found that it was probably the most tedious and boring part of the whole project to design and actually make it work.
2. Parser
Next, the parser takes the stream of tokens the lexer produced and constructs an Abstract Syntax Tree (AST) -- the hierarchical structure of the program file. I specifically made a recursive-descent parser, where each grammar rule is designated to a specific function, and the parser starts at the top (program) level while going through each sublevel until there's none left. So parsing our tokens gives us this linear tree of nodes that can be easily processed:
STATEMENT (OUT)
╰ EXPRESSION (STRING) ["Hello world!"]
The parser is also where syntax errors are caught and raised. For example, if you forget a closing parenthesis in your code, the parser catches this based on the previous tokens it encounters -- in this case, an opening parenthesis, so it expects a closing one on the same line.
3. Type Checker
This is where Phase's static typing is enforced through semantic verification, where the type checker 'walks' along the AST and verifies that all operations are correct, so you can't mismatch variable types in assignment or arithmetic. The type checker is different from the lexer because it checks semantics while the lexer checks syntax; an analogy would be like arranging words in a sentence (syntax) versus what the sentence actually means (semantics), so for example, while the sentence "colourless green ideas sleep furiously" is correct with its word ordering (good syntax), it doesn't actually mean anything (bad semantics). To demonstrate, this line of code is syntactically correct:
let x: int = "Hi"
Yet we still get an error because the code is semantically wrong (the types are mismatched):
┏ Fatal Error [108]: Type mismatch.
┃ --> ../tests/type_mismatch.phase:3:5-14
┃
┃ 3 | let x: int = "Hi"
┃ | ^^^^^^^^^^
┃
┣ Help: Variable 'x' expects int but got str.
┃ Suggestion:
┃ - let x: int = "Hi"
┃ + let x: int = 0
But as a note, the type checker doesn't check every single thing -- for example our 'Hello world!' code we're following is a statement that accepts any expression type as an argument because it will always be converted to a string, so the type checker emits it as-is.
4. Bytecode Generator
Moving on, the type-checked AST is now taken by the bytecode generator and compiled into a custom Assembly-like instruction set called bytecode, which is much quicker and simpler to execute than reading the AST itself. For Phase specifically, I wrote in a stack-based architecture, meaning that operations push and pop values from storage in the order of Last In, First Out (LIFO). I designed the actual instruction set to contain about 25 different opcodes, which was quite fun since I was making my own mini ASM codes with total virtual low-level control. Anyways, inputting our AST now gives us this hexadecimal bytecode:
00 00 00
01
18
Which represents these instructions of opcodes and constant values:
OP_PUSH_CONST 0 ; Push constant 0 ('Hello world!') onto the stack
OP_PRINT ; Print it
OP_HALT ; Terminate the program
5. Virtual Machine
Finally, the pipeline ends with the virtual machine, which directly executes the bytecode sequence. Specifically, it maintains an instruction pointer tracking the next opcode to execute, a stack for temporary values and computations, and a global environment for holding variables. It basically works like a very simple CPU running a fetch-decode-execute cycle -- it reads an instruction, decides what operation to perform, runs it, and moves onto the next instruction. And like the bytecode generator, it was also pretty enjoyable to design and write the VM because I was finally performing the actual execution of my source code after only having debug info to look at for months. So we finally get our code's output:
Hello world!
What Design Decisions I Made
I probably spent more time dealing with Phase's design and tradeoffs than actually programming it, so here are some of the more important decisions I went through.
Interpreter versus transpiler. Originally, Phase was meant to be a transpiler that converted source code to C code, which was actually the first working version of Phase that I wrote. But as I started to test my first Phase programs, the build-run pipeline was so tedious that I had to replace the backend's code generator with a bytecode generator with a VM, even though I was planning to show more low-level knowledge via the transpiler.
Static typing versus dynamic typing. Static typing makes things harder because you have to implement type checking and more sophisticated error handling, but I felt that the benefits outweighed dynamic typing with bug-catching and explicitness in code. I originally implemented C-style variable type declarations because they were simple and I was used to them, but after some feedback from a person online, I replaced them with Rust-style declarations to future-proof parsing for the updates that came soon after.
Stack-based VM versus register-based VM. I chose a stack-based architecture over register-based because it's much simpler to implement, and while it's slower because of more stack operations, there wouldn't be any noticeable difference between the two architectures for a language this small.
Python versus C. I created the first prototype of Phase (called Luma at that time) in Python since I was much more familiar with it than C, and I got it working in a few days with some very basic features. But writing it in Python made me feel as if I were cheating in a way, so I decided to challenge myself by writing it fully in C so I could speed up my learning by forcing me to confront various concepts I didn't know about.
Wrapping Things Up
Phase is my very first complete and polished long-term project. It's got enough features as a programming language to write various small yet practical programs -- for example generating fibonacci sequences:
func fibonacci(n: int): void {
let (a, b): int = (0, 1)
let (next, count): int = (b, 1)
while count <= n {
out(next)
count += 1
a = b
b = next
next = a + b
}
}
entry {
fibonacci(10)
}
And while the project is finished, if I had to revisit it, I would probably work on standard input, arena memory allocation, and JIT bytecode compilation. Overall though, Phase taught me many things like deep compiler/interpreter theory, how to properly design and implement a large project in a low-level language like C, and how to think in a systems-oriented way so I consider all my architectural design decisions. It also proved the effectiveness of project-based learning (something I really value) since everything I learnt was valuable in that it helped me get closer to my goal; I couldn't have developed my software engineering skills so much if I hadn't actually tried to create something like this that was out of my reach at the time.
You can check out Phase's repository on GitHub.