I have always wanted the Cone compiler to be fast. Faster build times make it easier and more pleasant for the programmer to iterate and experiment quickly. Compilers built for speed, such as Turbo Pascal, D and Go, win praise and loyalty. Languages that struggle more with this, like C++ or Rust, get regular complaints and requests to speed up build times.
Compiler speed is also important to Jonathan Blow. In his videos, he demonstrates the ability to rebuild a large Jai program in seconds and claims the Jai compiler can build 100kloc in a second. I would like Cone to match that velocity, so long as doing so does not interfere with other priorities.
To that end, I designed the Cone compiler for performance, following the principles enumerated in Premature Optimization. In this post, let me share with you the coding choices I made and some recent performance measurements that evaluate whether those choices were sound. These numbers lead us inevitably to some fascinating questions about the elephant in the room: LLVM’s performance profile.
Architecting for Performance
Let’s begin with the Cone compiler design choices made for performance:
C. Choosing any systems language (C++, Rust, D) will likely result in measurably faster programs. From my standpoint, choosing C improves the odds further, as it makes it easier to get close to the metal. To keep an even tighter grip on performance, the compiler uses only one external library. Although some might argue C has slowed my productivity, bloated the code base, and made it susceptible to all manner of safety bugs, that has not been my experience. My only regret is that I wish I could have self-hosted it in Cone instead.
Arena memory management. Compilers allocate many tiny pieces of data, largely for AST/IR nodes and symbol tables. To nearly eliminate the non-trivial runtime cost of malloc(), free() and liveness tracking for each data piece, the Cone compiler uses a very simple bump-pointer arena allocator. Allocation takes just a few instructions to extract the next bite out of a large, malloced arena. References are never tracked nor freed, and yet are 100% safe to use. The theoretical downside of this strategy is that memory consumption grows non-stop. However, in practice, I don’t fear the compiler ever running out of memory. On average, the Cone compiler consumes about 35 bytes of memory for every source code byte. How big does your source code need to be before you would consume all memory?
Data is rarely copied. Nearly all data structures, including the IR and global name table, are mutable and reusable. Furthermore, the design of the IR means that most IR nodes survive largely intact all the way from parse to gen. Even temporary, working data stacks are allocated once and reused. Additionally, string copies of loaded source file tokens happen at most once, when interning names and when capturing string literals. Compilers that rewrite their AST/IR nodes or use immutable data structures will incur a measurable allocation and copying performance cost for doing so.
Names are interned. The lexer is responsible for interning all names into the global name table. As a result, the rest of the compiler (including parsing and name resolution) is never slowed down by any string comparison or hash table look-up. Correlating a name to an IR node is nearly as simple as dereferencing a pointer. Between lexing and generation, the compiler pays no attention to strings in any way, other than for error message generation.
Ad hoc, built-in heuristics. Semantic analysis traverses the full IR only three times. No “solvers” are used for type inference or borrow-checking, as cascading chains of conditionals suffice. Additionally, none of the semantic analysis happens in abstraction layers (e.g., via core library templates or generics).
I make no claim to any unimpeachable evidence that these strategies are indeed the best performing choices I could make. If any vetted research exists that validates these choices, I am unaware of it. I can point to Walter Bright’s article Increasing Compiler Speed By Over 75% which quantifies the performance benefit of arena-based memory allocation. I welcome any other hard data and/or recommendations regarding these (or alternative) performance-based design choices.
Measuring Performance of the Cone Compiler
The Cone compiler outputs a message summarizing elapsed time and memory usage for every compile. Running on my server, as part of the Playground, small programs typically compile in 4-11 milliseconds.
Although content with those numbers, I recently got curious about what’s happening under the covers, so I ran the Visual Studio profiler. One surprising result was that most compilation steps (like parsing and semantic analysis) were not even showing up on the log.
My hopeful guess was that these stages were happening more quickly than the millisecond probe speed of the profiler could notice. To validate that guess, I instrumented the compiler code to use QueryPerformanceCounter, which measures elapsed time at a precision of a tenth of a microsecond.
For a small program of 200 LOC (3800 bytes), here are the average elapsed-times on my laptop for each front-end compiler stage:
| μSec | Stage |
| ---: | :--- |
| 300 | Load. Completely load source program(s) from SSD disk to memory |
| 100 | Lex. Transform source UTF-8 characters to ready-to-use, interned tokens |
| 300 | Parse. Convert tokens into high-level IR nodes |
| 100 | Analysis. All 3 semantic passes: name resolve, type check, data flow |
One should not read too much into these numbers. Several stages will almost certainly slow down over time, particularly when metaprogramming support is added to Cone. Furthermore, we don’t know how these results will scale to larger programs. Even if the nature of the logic suggests scaling will likely be linear, tests will need to be performed that confirm this. The tests were run on my laptop competing with other programs for the CPU, and the results varied ±50% from one run to another.
All caveats aside, these results are breathtaking. The untuned front-end compiler is chewing through 250Kloc per second, despite nearly 40% of the time being spent on disk I/O. These better-than-expected results improve my confidence that Cone’s upfront design choices were largely the right ones.
I do wonder why parsing is slower than lexing and analysis. Two possibilities come to mind. First, every token is likely to be trial-matched many times by the recursive descent parser, across multiple functions (which is less true of the lexer’s handling of source code characters). More intriguingly, it may have more to do with the cost of building the Cone IR nodes: compared to lexing and analysis, parsing does by far the most memory mutation. Would extensive memory mutation slow things down this much (e.g., because of cache invalidation)?
An astute reader will have noticed that I did not give a number for the code generation stage. Here, unfortunately, the performance news is much less stellar. Generating an optimized object file from the analyzed IR takes about 115,500 μSec. That’s 99.3% of the total compile time! To understand this disappointing result, we must talk about LLVM.
LLVM and Compiler Performance
Like many recent compilers, Cone uses LLVM to generate optimized native code. Essentially, the compiler’s code generation stage builds binary LLVM IR from the Cone IR, asks LLVM to verify and optimize it, and then has LLVM generate an object file for the appropriate target CPU architecture.
My experience using LLVM has been extremely positive. I doubt I would have tackled building a compiler for Cone without the existence of LLVM. My misgivings about LLVM’s performance are the only significant concern I have about using LLVM in Cone.
I want to be very careful when talking about LLVM and performance. Although my intent is to offer helpful feedback and intelligence on what to expect from LLVM, precedent suggests my words could be misconstrued as rude or unfair to the many people who have invested hundreds of man-years of effort contributing to this amazing tool. It would be a shame if people felt a stronger need to vigorously (and needlessly) defend LLVM rather than engaging in an insightful dialogue that explores what we can learn.
Clearly, it would be silly to compare too closely the performance results of two differently-sized code bases, especially when they perform different tasks. Indeed, one should expect that a native backend will take longer to run than the front-end stages, for several reasons:
- The backend often performs at least two significant lowering stages to more complex representations (e.g., to LLVM IR and then to machine language).
- Some forms of optimization are computationally expensive.
- Engines like LLVM need architectural abstractions that improve flexibility, such as the ability to generate code across a wide range of different architectures and the ability to define custom optimization passes.
However, given that LLVM’s performance plays a dominant role in compiler build times, it is reasonable to question whether a backend truly needs to consume almost 200x more of the CPU than a front-end processing essentially the same semantic intent. Might it be possible to achieve similarly stellar outcomes more quickly by using an LLVM-comparable tool that was built using a more performance-based architecture?
Unfortunately, we cannot arrive at a definitive answer by just evaluating the current state-of-the-art in compilation tools. The evidence we have is circumstantial and can be interpreted either way:
- The fastest compilers do not use LLVM. I would not be surprised if the compile-speed advantages of (say) tcc, D and Jai are due, in no small portion, to the fact that they do not use LLVM.
- A valid counterargument is that these faster compilers generate less optimal runtime code. Furthermore, gcc (which does not use LLVM and which does compete well with LLVM-based Clang on runtime performance) offers no significant compiler speed advantage.
Given our lack of certainty as to whether it is possible to create a faster LLVM, it is worth noting that this performance challenge is important enough that work on alternative backends (e.g., Cranelift) is ongoing and is heavily motivated by improving compile-time performance. Even a partial solution (faster debug builds) is better than no improvement at all.
LLVM Performance Measurements
Several people have strongly claimed that the performance issues around LLVM are largely attributable to the extensive optimization work it performs. Let’s see whether detailed performance metrics support this theory.
Here are elapsed-time measurements for various LLVM backend stages processing the same small Cone program that I measured earlier:
| μSec | Stage |
| ---: | :--- |
| 2,500 | Setup. Initialize LLVM’s target machine, data layout and global context |
| 11,000 | Gen LLVM IR. Build LLVM IR by traversing the Cone IR nodes |
| 3,000 | Verify LLVM IR. Verify the built LLVM IR |
| 15,000 | Optimize. Optimize the LLVM IR |
| 84,000 | Gen obj. Convert LLVM IR to an object file and save it |
Notice how much larger all these numbers are than the numbers shown for the front-end stages! Let’s explore what these results mean by offering some context about the work each stage performs:
Setup. It is interesting that LLVM initialization takes 3x longer than it takes the Cone front-end to fully process a 200 LOC program. Even though this initializes code across dozens of possible targets, it is still reasonable to wonder if LLVM initialization really needs to be this expensive. The one good thing we can say about this 2.5ms cost is that it won’t scale upward as our source program sizes grow.
Gen LLVM IR. This is the only stage that mixes together the performance of Cone and LLVM code. Essentially, the Cone logic traverses the Cone IR much as any one of the earlier semantic analysis passes does. Based on the nature of the work being performed, we can reasonably guess that the Cone-side work is unlikely to take longer than 100 μSec (the cost of all three semantic analysis passes). If true, this means the LLVM-based logic to build the equivalent LLVM IR nodes takes 100x as long. Again, we would expect it to take more time, as the lowering process here is multiplicative: a single Cone IR node often translates into multiple LLVM IR SSA-based nodes (in this case, a 200 LOC source file turns into a 1,000 LOC LLVM IR file). Additionally, a conversion from Cone type info to LLVM type info is regularly needed. But is a 100x jump a reasonable cost for what is simply a straightforward lowering step?
Verify LLVM IR. This step is effectively semantic analysis for LLVM IR. It ensures the module’s IR is well-formed, including type checking. It is a minor step, and possibly unnecessary if the compiler knows it is building valid LLVM IR. Even so, it is 30x more expensive than the semantic analysis for Cone, whose rules are likely more complex (though it likely traverses fewer nodes).
Optimize LLVM IR. This runs six LLVM optimization passes: converting stack variables to registers, function inlining, simple peephole/bit-twiddling optimizations, expression re-association, common subexpression elimination, and control-flow graph simplification. This reads in the LLVM IR and likely rewrites it six times. Notably, it is only somewhat more expensive than building the original LLVM IR; optimization takes up only about 13% of LLVM’s total processing time.
Gen obj. This step lowers LLVM IR to the target’s machine language and stores it out to disk as a target-specific object file. This final LLVM stage takes up 73% of LLVM’s running time. Some of this can be explained by the fact that this second lowering step is almost certain to take more time than the first, due to the complexity of the translation and the further multiplying factor of generated CPU instructions to LLVM IR nodes (an 800 LOC LLVM IR file turns into 1,900 LOC of assembler). Furthermore, this lowering stage likely requires further optimization logic specific to the target architecture’s instruction set. However, it is once again fair to wonder whether that warrants taking well over 100x more time than the full front-end work.
Based on this data, there is no smoking gun clearly demonstrating that LLVM’s aggressive optimization capability is the primary cause of its performance impact. On the contrary, every stage of its processing seems to be marked by significantly larger performance multipliers over what we might have expected by comparing with loosely similar front-end steps processing roughly the same semantic content. These results would be consistent with a theory that LLVM’s CPU consumption results to some degree from its software architecture choices.
It is impossible to reach any firm conclusions about Cone’s or LLVM’s performance architecture based purely on the data and analysis shared in this post. The only way to reach more definitive conclusions is through careful benchmarks that compare the performance of different design approaches accomplishing the exact same task. I don’t have the time to do that work now. I hope others will (or have).
From my perspective, some information is still better than none. On the basis of this analysis so far, I am taking away several useful lessons to guide my future choices:
- Designing for performance upfront can work out really well, contrary to the warnings from Knuth-inspired skeptics. Good decisions for performance can be made that improve programmer productivity and result in understandable code. I want to keep getting better at this. I also want to design the Cone language to make this easier for its programmers.
- I really don’t need to focus on performance tuning the Cone compiler’s front-end. I am really quite pleased with its current speed. After adding macro and generics support, it would probably be valuable to revisit these numbers.
- LLVM is going to be a boat anchor on rapid Cone build times. Near term, I need to explore ways to speed up my use of LLVM, if possible. For example, is there a way to build the LLVM IR without allocating each LLVM IR node’s space one at a time? This might improve data locality and speed up future activity.
- Longer term, I should probably explore alternatives to LLVM, particularly for doing debug builds where the generation of optimized runtimes is not a priority. Since I am not eager to tackle the work of building a custom back-end, I will certainly take a closer look at ways to exploit back-ends used by compilers that boast much faster code generation (e.g., tcc or Jai).
As always, I am interested to learn from other people’s experiences and insights regarding compiler and LLVM/backend performance, whether on your own projects or based on results you have seen published by other teams that must surely be wrestling with similar challenges.