How To Make a Fast Dynamic Language Interpreter
From 35x Slower to Nearly Competitive: How I Optimized a Simple Zef Interpreter by 16.6x
When most developers think about making language implementations fast, they imagine complex techniques like JIT compilation or sophisticated garbage collectors. But what if you’re starting from scratch with a simple AST-walking interpreter? This post reveals how I transformed an extremely basic Zef interpreter from being 35x slower than CPython to nearly competitive with Lua and QuickJS—all without writing a JIT or optimizing the garbage collector.
The Starting Point: A Naïve but Clean Implementation
My journey began with a Zef interpreter that made almost no performance-conscious decisions. It was written in Fil-C++ (which typically costs 4x performance), used recursive AST walking, relied heavily on std::string lookups, and performed countless hashtable operations. Despite these limitations, the interpreter was remarkably simple and readable.
The initial benchmark results were sobering: 35x slower than CPython 3.10, 80x slower than Lua 5.4.7, and 23x slower than QuickJS-ng 0.14.0. But through systematic optimization, I achieved a 16.6x speedup, bringing Zef into the same performance ballpark as these established interpreters.
The Optimization Journey
1. Direct Operator Nodes (17.5% Speedup)
Instead of parsing a + b as a string-based method call DotCall(a, "add"), I modified the parser to generate distinct AST nodes for each operator. This eliminated string lookups for every math operation.
2. Direct RMW Operators (3.7% Speedup)
I extended the optimization to include compound assignment operators like +=, creating specialized nodes for each RMW case to avoid string-based dispatch.
3. Avoiding IntObject Virtual Calls (1% Speedup)
I eliminated unnecessary virtual calls in the value representation by handling IntObject cases directly rather than routing everything through Value.
4. Symbol-Based Lookups (18% Speedup)
Replacing std::string with hash-consed Symbol pointers for all variable and field lookups dramatically reduced hashing and comparison overhead.
5. Function Inlining (2.8% Speedup)
I moved hot Value methods into a separate valueinlines.h header so they could be inlined even though their bodies depend on headers that themselves require value.h, an include cycle that had previously forced out-of-line definitions.
6. Object Model and Inline Caches (455% Speedup)
This massive change introduced:
- Storage-based object model: Replacing hashtable-heavy contexts with pre-allocated storage at compile-time determined offsets
- Inline caching: Remembering the last type and offset for property accesses, compiling specialized AST nodes on the fly
- Watchpoints: Handling cases where cached assumptions might become invalid
This single change delivered a 4.55x speedup, bringing Zef within striking distance of the competition.
7. Optimized Argument Passing (33% Speedup)
I replaced std::optional<std::vector> argument passing with a specialized Arguments type that matched the callee’s expected argument structure, halving allocation overhead.
8-9. Specialized Getters and Setters (9% Total Speedup)
By inferring simple getter and setter patterns during function analysis, I eliminated AST evaluation overhead for these common cases.
10. Inlined Critical Functions (3.2% Speedup)
A one-line change to inline an important method call yielded measurable improvements.
11. Hashtable Optimization (15% Speedup)
I introduced a global hashtable to bypass hierarchical lookups for method calls, reducing O(hierarchy depth) operations to single lookups.
12. Avoiding Optional Allocations (1.7% Speedup)
I worked around a Fil-C++ pathology where std::optional caused heap allocations by changing function signatures.
13. Specialized Argument Types (3.8% Speedup)
I introduced ZeroArguments, OneArgument, and TwoArguments types to eliminate Arguments object allocation for built-in functions and inferred setters.
14. Value Slow Path Optimization (10% Speedup)
I changed slow path methods from taking implicit const Value* arguments to taking Value by value, eliminating stack allocation overhead in Fil-C++.
15-19. Targeted Specializations (13.8% Total Speedup)
I added specialized handling for:
- sqrt method calls
- toString conversions with reduced allocations
- Constant array literals
- Call operator slow paths
20. Build Configuration (1.8% Speedup)
Disabling RTTI and libc++ hardening provided a small but measurable improvement.
The Yolo-C++ Experiment: 67x Total Speedup
The final experiment involved compiling with Yolo-C++, which yielded a 4x additional speedup by replacing GC allocations with calloc. While this approach is unsound (memory is never freed) and suboptimal (real GC allocators are faster), it demonstrated the potential of the optimizations. With Yolo-C++, Zef became 1.9x faster than CPython, only 1.2x slower than Lua, and 3x faster than QuickJS.
Key Takeaways
- Start with good value representation: The foundation of any interpreter's performance is how you represent values. Tagged values are essential.
- Inline caching works in interpreters: You don't need a JIT to benefit from inline caching; it's effective even in AST-walking interpreters.
- Object model matters enormously: The shift from hashtable-heavy contexts to storage-based objects was the single biggest win.
- Small optimizations compound: While individual changes might seem modest (1-3% each), they add up to massive improvements.
- Sometimes you need big changes: The object model and inline cache implementation was a massive patch, but it delivered disproportionate results.
- Language choice has limits: Fil-C++ provided safety and rapid development but imposed a 4x performance ceiling that Yolo-C++ broke through.
This journey proves that even with a simple interpreter architecture, you can achieve competitive performance through systematic optimization of fundamental operations. The techniques demonstrated here—value representation, inline caching, object model redesign, and targeted specializations—form the foundation of high-performance dynamic language implementations.