The first loop accesses the memory sequentially from the beginning to the end of the array. The second one skips N steps between 2 iterations. Which one is faster?
int a[N * N];
// Loop #1
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
sum += a[i * N + j];
// Loop #2
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
sum += a[j * N + i];
I have run it and here is the result (you can find the code on Github - it is not exactly the same but should be close enough for this discussion)
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_LoopNo1/16 119 ns 119 ns 5831556
BM_LoopNo1/64 1852 ns 1854 ns 377141
BM_LoopNo1/512 118845 ns 118953 ns 5832
BM_LoopNo1/4096 9011987 ns 9019854 ns 80
BM_LoopNo1/8192 36637359 ns 36666974 ns 18
BM_LoopNo2/16 124 ns 124 ns 5577102
BM_LoopNo2/64 2011 ns 2012 ns 349585
BM_LoopNo2/512 851109 ns 851729 ns 810
BM_LoopNo2/4096 153951705 ns 154055943 ns 4
BM_LoopNo2/8192 801699400 ns 802239396 ns 1
You can clearly see that loop #1 is faster, and by a large margin. Now, let’s find out why that is the case.
NOTES:
- In this blog, I use Google Benchmark Library to run benchmark code and perf to collect performance counters from kernel and CPU. You can check them out if you are not familiar.
- All examples are run under Linux and Intel CPU. Other platforms might show different results.
Page fault ๐
Page fault happens when we try to access memory that is not in the main memory. The kernel need to delay the program and load the memory from storage like hard drive, which is very expensive. Since loop #2 jumps back and forth between different pages, it could trigger more page fault than #1 if N is big enough.
To check on the number of page faults, you can use below command
$ perf stat -e page-faults,minor-faults,major-faults <program>
Where minor-faults
is when the page is in memory but is not allocated to the process or not registed to the memory management unit (e.g. when you first access the memory), major-faults
is when the page is not in main memory (that is very costly operation since we might need to read the memory from disk), and page-faults
is sum of them. Below is one possible result,
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo1Long':
785,074 page-faults
33,743 major-faults
751,331 minor-faults
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo2Long':
579,736 page-faults
157,724 major-faults
422,012 minor-faults
NOTES:
- I wrote a small utility to lock abitrary amount of memory in RAM, to simulate the low memory scenario.
TLB misses ๐
The address refered by the application is virtual address. To access the memory, we actually need to translate them into physical address, by walking through the page table (depsite the name, it is actually like a tree, hence “walking”). Since page table can be very large (say we have 16GB of RAM, and each page is 4KB, you do the math), we have something called Translation Lookaside Buffer or TLB to cache the page table. TLB miss happen when we try to access a page which is not in TLB.
Again, going back and forth between different pages won’t help in this case either.
To see number of TLB misses, you can run below commands:
$ perf stat -e dTLB-loads,dTLB-load-misses <program>
where dTLB
is data TLB (there is iTLB
for instruction), and load-misses
means number of lookup that miss TLB.
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo1/8192':
2,494,435 dTLB-load-misses # 0.06% of all dTLB cache hits
4,338,977,389 dTLB-loads
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo2/8192':
49,024,649 dTLB-load-misses # 25.15% of all dTLB cache hits
194,919,514 dTLB-loads
NOTES:
- The absolute value is not very useful on this case, because Google Benchmark run it multiple times until it reachs statistical significiant. You should look into percentage of dTLB miss in this case.
CPU Cache ๐
Fetching memory from RAM is actually very slow, it could cost hundreds of clock cycles. That’s why we have CPU cache, which caceh the memory inside the CPU, making subsequence accesses faster. There are multiple levels of CPU cache (e.g. L1, L2, L3), with increasing latency and capacity.
But since we only access each element once in this example, how can cache help? Turn out we don’t load 1 element to the cache at one time, we load a cache line, which usually are 64 bytes. So everytime we access an element, it will load 15 other elements (assuming int is 32-bit) to the cache, speeding up the next 15 memory access if we iterate them in sequential order.
You can check the L1 cache miss number with below commands:
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads <program>
where L1-dcache-loads
is number of time we try to load data from L1 cache, and L1-dcache-load-misses
is the number of time it fails to do so.
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo1Long':
36,969,006 L1-dcache-load-misses # 4.77% of all L1-dcache hits
774,500,777 L1-dcache-loads
'./bazel-bin/bm_loop --benchmark_filter=BM_LoopNo2Long':
781,431,687 L1-dcache-load-misses # 100.77% of all L1-dcache hits
775,442,911 L1-dcache-loads
NOTES:
- You might notice that the original benchmark code has some flaws. The cache line between runs is not flushed, so it could affect the benchmark result a little bit. In this example, I only run the the code once to make sure it won’t affect the result.
L1-dcache-load-misses
are bigger thanL1-dcache-modes
in the second examples, which seems quite weird. Turn outs, those cache related parameters can be a bit misleading sometimes. This blog might shed some lights into this.
Cache Prefetching ๐
You might realize something is off by looking at above number. How can number of cache misses are so small compared to 1:15 ratio? Again, CPU is one step ahead of us. It has something called prefetcher which will try to find out the memory access pattern to predict the next memory access, fetching them into the cache before we even access them.
To test this, you can make each array element 64 bytes, which is equal to the cache line, to see if the miss rate is the same between loop#1 and loop#2. perf
also has some performance counter related to that; e.g. LLC-prefetches, L1-dcache-prefetches, etc; but I haven’t been able to test on a CPU that support those counters yet (TODO)
Compiler Optimization ๐
Iterating through the data sequentially also makes the compiler’s job easier. One important optimization is auto vectorization, which use vector instructions. These instructions operate at multiple elements at a time, potentially speeding up the program multiple times. You can see it in action in the below examples:
In GCC, loop #1 sum multiple elements at once (see the instructions with xmm
registers), while loop#2 add to the sum one by one.
What else? ๐
Is there anything else that could cause the speed differences? That is a common problem that I encounter a lot. Unfortunately, I don’t know for if there is an easy way to determine that. One thing we can do is to eliminate all the above problems and then see if the speed of these 2 loops are the same.
Here are what we can do:
- Donwload more RAM to prevent major page fault.
- Prefault the page with
explicit_bzero
to prevent minor page fault. - Use huge page to reduce TLB misses.
- Add padding bytes to make sure each element is in separated cache line.
- Flush all caches between runs
- Disable all CPU prefetch. You can do that via BIOS (in my machine, I disabled Adjacent Cache Line Prefetch, and Hardware Prefetch) or MSR if supported.
- Disable vectorization optimization. In GCC, you can use
__attribute__((optimize("no-tree-vectorize")))
.
After all of that, you can have something like this. Let’s see the result
$ ./bazel-bin/bm_loop_what_else --benchmark_filter=BM_Loop.*/64
2022-04-02T20:11:07+08:00
Running ./bazel-bin/bm_loop_what_else
Run on (6 X 4359.9 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.78, 0.63, 0.47
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_LoopNo1/64/manual_time 6894 ns 47600 ns 101546
BM_LoopNo2/64/manual_time 8360 ns 49097 ns 84305
Loop #2 is still slower and I have no idea why ๐ข. If you know why that is the case, let me know in the comment below.