You have a lot of noise in your results. I re-ran this on a Xeon E3-1230 V2 @ 3.30GHz running Debian 7, doing 12 runs (discarding the first to account for virtual memory noise) over a 200000000-element array, with 10 iterations for i within the benchmark functions, explicit noinline for the functions you provided, and each of your three benchmarks running in isolation: https://gist.github.com/creichen/7690369
This was with gcc 4.7.2.
noinline ensured that the first benchmark wasn't optimised out.
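To illustrate what the noinline attribute protects, here is a minimal sketch of two load kernels. It is not the code from the gist: the function names sum_load_ps and sum_loadu_ps are hypothetical, and the sketch assumes gcc on x86 with SSE, n a multiple of 4, and (for the aligned variant) a 16-byte-aligned pointer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical kernel: sums n floats using aligned loads.
 * Assumes data is 16-byte aligned and n is a multiple of 4.
 * noinline stops gcc from inlining the call into the timing loop,
 * where it could otherwise simplify or remove the work. */
__attribute__((noinline))
float sum_load_ps(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    size_t i;

    for (i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));   /* aligned load */

    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Same kernel with unaligned loads; data may have any alignment. */
__attribute__((noinline))
float sum_loadu_ps(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    size_t i;

    for (i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  /* unaligned load */

    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling sum_loadu_ps once with an aligned pointer and once with the pointer offset by one float gives the aligned and unaligned loadu_ps cases; returning the sum gives the caller a value to consume, so the loads cannot be discarded as dead code.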
The exact call was:
./a.out 200000000 10 12 $n
Here are the results:
min: 0.040655 median: 0.040656 max: 0.040658
min: 0.040653 median: 0.040655 max: 0.040657
min: 0.042349 median: 0.042351 max: 0.042352
As you can see, the bounds are very tight, and they show that loadu_ps is slower on unaligned access (the third row, a slowdown of about 4%) but not on aligned access. On that particular machine, loadu_ps clearly pays no penalty for aligned memory access.
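For reference, a min/median/max summary that discards the warm-up run, and the relative slowdown between two medians, could be computed roughly as follows. This is a sketch under my own assumptions, not the gist's code; run_stats, summarise_runs and relative_slowdown are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct run_stats { double min, median, max; };

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarise times[1..n-1], skipping times[0] as the warm-up run.
 * Sorts the kept part of the array in place. Requires n >= 2. */
struct run_stats summarise_runs(double *times, size_t n)
{
    struct run_stats s;
    double *kept = times + 1;
    size_t m;

    assert(n >= 2);
    m = n - 1;
    qsort(kept, m, sizeof *kept, cmp_double);

    s.min = kept[0];
    s.max = kept[m - 1];
    s.median = (m % 2) ? kept[m / 2]
                       : 0.5 * (kept[m / 2 - 1] + kept[m / 2]);
    return s;
}

/* Relative slowdown of `other` compared with `base`, as a fraction. */
double relative_slowdown(double base, double other)
{
    return (other - base) / base;
}
```

With the medians reported above, relative_slowdown(0.040656, 0.042351) comes out a little over 0.04, which is where the slowdown figure for the unaligned case comes from.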
Looking at the assembly, the only difference between the load_ps and loadu_ps versions is that the latter includes a movups instruction, re-orders some other instructions to compensate, and uses slightly different register names. The register naming is probably completely irrelevant, and the movups can get optimised out during microcode translation.
Now, it's hard to tell (without being an Intel engineer with access to more detailed information) whether/how the movups instruction gets optimised out, but considering that the CPU silicon would pay little penalty for simply using the aligned data path if the lower bits in the load address are zero and the unaligned data path otherwise, that seems plausible to me.
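To make the "lower bits are zero" condition concrete: an address is 16-byte aligned exactly when its low four bits are clear. The following sketch shows the equivalent check in software; is_aligned_16 is a hypothetical helper, and it only models the condition, not how the hardware actually selects a data path.

```c
#include <stdint.h>

/* Returns nonzero if p is 16-byte aligned, i.e. the low four
 * bits of the address are all zero. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check like this is a single mask-and-compare, which is why it seems cheap to perform on every load.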
I tried the same on my Core i7 laptop and got very similar results.
In conclusion, I would say that yes, you do pay a penalty for unaligned memory access, but it is small enough that it can get swamped by other effects. The runs you reported seem noisy enough to allow for the hypothesis that it is slower for you too. Note that you should ignore the first run, since your very first trial pays a price for warming up the page table and caches.