How Codeflash measures code runtime
Codeflash reports benchmarking results that look like this:- How we measure the runtime of code
- How we determine if an optimization is actually faster
- Why we measure the timing as best of N runs
- How we measure the runtime when we run on a wide variety of test cases.
Goals of Codeflash auto-benchmarking
A core principle of Codeflash is that it makes no assumptions about which optimizations might be faster. Instead, it generates multiple possible optimizations with LLMs and automatically benchmarks the code on a variety of inputs to empirically verify if the optimization is actually faster. The goals of Codeflash auto-benchmarking are:- Accurately measure the runtime of code
- Measure runtime for a wide variety of code
- Measure runtime on a variety of inputs
- Do all the above on a real machine, where other processes might be running and causing timing measurement noise
- Finally make a binary decision whether an optimization is faster or not
Racing Trains as an analogy
Imagine you’re a boss at a train company choosing between two trains to runs between San Francisco and Los Angeles. You want to determine which train is faster. You can measure their by timing how long each takes to travel between the two cities. However, real-life factors affect train speeds: rail traffic, unfavorable weather, hills, and other obstacles. These can slow them down. To settle the contest, you have a driver race the two trains at maximum possible speed. You measure the travel times between the two cities for each train. Train A took 5% less time than Train B. But the driver points out that Train B encountered poor weather, making it impossible to draw firm conclusions. Since it’s crucial to know which train is truly faster, you need more data. You ask the driver to repeat the race multiple times. In this scenario, since they have plenty of time, they repeat the race 50 times. This gives us timing data (in hours) that looks like the following.
How Codeflash benchmarks code
This principle of measuring peak performance while minimizing external noise is exactly how Codeflash measures code runtime. Computer processors face various sources of noise that can increase function runtime:- Hardware: cache misses, CPU frequency scaling, etc.
- Operating system: context switches, memory allocation, etc.
- Programming language: garbage collection, thread scheduling, etc.
What happens when there are multiple inputs to a function?
While this approach works well for single inputs, what about multiple inputs? Now the race runs through multiple stations: Seattle to San Francisco to Los Angeles to San Diego. We still need to determine the faster train for this route. We can only measure times between adjacent stations. Here is how the timing data looks like (in hours):