c - Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?





Your results contain a lot of noise. I re-ran the benchmark on a Xeon E3-1230 V2 @ 3.30GHz running Debian 7, doing 12 runs (discarding the first to account for virtual memory noise) over an array of 200,000,000 elements, with 10 iterations of the i loop inside each benchmark function, an explicit noinline on the functions you provided, and each of your three benchmarks running in isolation: https://gist.github.com/creichen/7690369

This was with gcc 4.7.2.

The noinline ensured that the first benchmark wasn't optimised out.
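As a rough illustration of what such an out-of-line benchmark kernel can look like, here is a minimal sketch. It is not the code from the linked gist: the function names sum_load_ps and sum_loadu_ps and the summing loop are hypothetical, and __attribute__((noinline)) is the GCC/Clang spelling of the attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_add_ps */

/* Hypothetical kernel: sums n floats, four per step, using the aligned load.
   noinline keeps the compiler from inlining it into the timing loop. */
__attribute__((noinline))
float sum_load_ps(const float *data, size_t n)
{
    assert(((uintptr_t)data % 16) == 0);  /* _mm_load_ps needs 16-byte alignment */
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Same kernel with the unaligned load; data may have any alignment. */
__attribute__((noinline))
float sum_loadu_ps(const float *data, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling sum_loadu_ps(buf, n) on a 16-byte-aligned buffer and sum_loadu_ps(buf + 1, n) on the same buffer gives the "loadu_ps aligned" and "loadu_ps unaligned" cases. Returning the sum and using it in the caller gives the compiler a reason to keep the loads.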

The exact call was

./a.out 200000000 10 12 $n 

for $n from 0 to 2.

Here are the results:

load_ps aligned

min:    0.040655
median: 0.040656
max:    0.040658

loadu_ps aligned

min:    0.040653
median: 0.040655
max:    0.040657

loadu_ps unaligned

min:    0.042349
median: 0.042351
max:    0.042352

As you can see, the bounds are very tight, and they show that loadu_ps is slower on unaligned access (a slowdown of about 4%) but not on aligned access. On that particular machine, loadu_ps pays no measurable penalty on aligned memory access.

Looking at the assembly, the only differences between the load_ps and loadu_ps versions are that the latter includes a movups instruction, re-orders some other instructions to compensate, and uses slightly different register names. The register naming is probably completely irrelevant, and the movups can get optimised out during microcode translation.

Now, it's hard to tell (without being an Intel engineer with access to more detailed information) whether/how the movups instruction gets optimised out, but considering that the CPU silicon would pay little penalty for simply using the aligned data path if the lower bits in the load address are zero and the unaligned data path otherwise, that seems plausible to me.

I tried the same on my Core i7 laptop and got very similar results.

In conclusion, I would say that yes, you do pay a penalty for unaligned memory access, but it is small enough that it can get swamped by other effects. In the runs you reported there seems to be enough noise to allow for the hypothesis that it is slower for you too (note that you should ignore the first run, since your very first trial will pay a price for warming up the page table and caches.)
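The "discard the first run, then report min/median/max" step can be sketched as follows. This is an illustrative helper, not code from the gist; the names run_stats and summarize_runs are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double min, median, max; } run_stats;

/* Three-way comparison for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarises timings[1..n-1]; timings[0] is treated as the warm-up run
   and ignored. Sorts the kept samples in place. Requires n >= 2. */
run_stats summarize_runs(double *timings, size_t n)
{
    assert(n >= 2);
    double *kept = timings + 1;   /* skip the warm-up run */
    size_t m = n - 1;
    qsort(kept, m, sizeof *kept, cmp_double);

    run_stats s;
    s.min = kept[0];
    s.max = kept[m - 1];
    /* middle sample for odd m, mean of the two middle samples for even m */
    s.median = (m % 2) ? kept[m / 2]
                       : 0.5 * (kept[m / 2 - 1] + kept[m / 2]);
    return s;
}
```

With 12 runs this leaves 11 samples, so the median is simply the sixth smallest timing.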



There are two questions here: Are unaligned loads slower than aligned loads given the same aligned addresses? And are loads with unaligned addresses slower than loads with aligned addresses?

Older Intel CPUs (“older” in this case means just a few years ago) did have slight performance penalties for using unaligned load instructions with aligned addresses, compared to aligned loads with the same addresses. Newer CPUs tend not to have this issue.

Both older and newer Intel CPUs have performance penalties for loading from unaligned addresses, notably when cache lines are crossed.
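To make "crossing a cache line" concrete, here is a small sketch that checks whether a load of a given size starting at a given address touches two lines. The helper name crosses_cache_line is hypothetical, and the 64-byte line size is an assumption that holds for recent Intel CPUs but should be checked for the target processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64u

/* True if a load of `size` bytes starting at `addr` spans two cache lines. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    return (addr % CACHE_LINE_BYTES) + size > CACHE_LINE_BYTES;
}
```

For a stream of 16-byte loads over a float array that starts 4 bytes past a line boundary, the offsets within each line are 4, 20, 36 and 52, so under this assumption one load in four (the one at offset 52) splits a line.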

Since the details vary from processor model to processor model, you would have to check each one individually for details.

Sometimes performance issues can be masked. Simple sequences of instructions used for measurement might not reveal that unaligned-load instructions are keeping the load-store units busier than aligned-load instructions would, so that there would be a performance degradation if certain additional operations were attempted in the former case but not in the latter.



See "§ Efficient Handling of Alignment Hazards" in Intel® 64 and IA-32 Architectures Optimization Reference Manual:

"The cache and memory subsystems handles a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impact for memory and cache operations. For example, 1-cycle throughput of L1 (see Table 2-25) generally applies to naturally-aligned loads from L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amount of delays depending on specific microarchitectures and alignment scenarios."

I couldn't copy the table here; it basically shows that aligned and unaligned L1 loads take 1 cycle, while a load that splits a cache-line boundary takes ~4.5 cycles.



This is architecture-dependent, and recent generations have improved things significantly. On the older Core 2 architecture, on the other hand:

$ gcc -O3 -fno-inline foo2.c -o a; ./a 1000000

Array Size: 3.815 MB

Trial 1
_mm_load_ps with aligned memory:    0.003983
_mm_loadu_ps with aligned memory:   0.003889
_mm_loadu_ps with unaligned memory: 0.008085

Trial 2
_mm_load_ps with aligned memory:    0.002553
_mm_loadu_ps with aligned memory:   0.002567
_mm_loadu_ps with unaligned memory: 0.006444

Trial 3
_mm_load_ps with aligned memory:    0.002557
_mm_loadu_ps with aligned memory:   0.002552
_mm_loadu_ps with unaligned memory: 0.006430

Trial 4
_mm_load_ps with aligned memory:    0.002563
_mm_loadu_ps with aligned memory:   0.002568
_mm_loadu_ps with unaligned memory: 0.006436

Trial 5
_mm_load_ps with aligned memory:    0.002543
_mm_loadu_ps with aligned memory:   0.002565
_mm_loadu_ps with unaligned memory: 0.006400

