Update scalar_4x64_impl.h #421

pull Alex-GR wants to merge 1 commit into bitcoin-core:master from Alex-GR:patch-1, changing 1 file with +13 −13
  1. Alex-GR commented at 9:34 pm on October 8, 2016: contributor

    XOR reg,reg instead of MOV 0 to reg. It should be at least equal in all architectures and faster in some.

    Triple-test everything, because I’m a noob at coding and GitHub alike.

  2. Update scalar_4x64_impl.h
    XOR reg,reg instead of MOV 0 to reg. It should be at least equal in all architectures and faster in some.
    9d67afad96
  3. afk11 commented at 7:54 pm on October 9, 2016: contributor
    Can you elaborate on why this change is necessary?
  4. Alex-GR commented at 8:09 pm on October 9, 2016: contributor

    It’s not by any means necessary. It just reduces the bytes used (the xor instruction translates to fewer bytes than a mov), makes it faster on some architectures, and it’s good for consistency, since the lines below 598 already xor registers instead of moving zeroes into them.

    If moves are used for a reason (?) that I ignore, let them be.
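
    For readers following along, a minimal before/after sketch of the kind of change this PR makes (illustrative register choice, not the exact lines from scalar_4x64_impl.h):

      "movq $0, %%r10\n"      /* before: 7-byte instruction (REX.W + C7 /0 + imm32)  */
      "xorq %%r10, %%r10\n"   /* after: 3-byte zeroing idiom the decoder recognizes  */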

  5. laanwj commented at 2:55 pm on October 11, 2016: member
    Can you provide bench output to show what the performance difference is? (preferably on different types of CPUs)
  6. Alex-GR commented at 7:58 pm on October 11, 2016: contributor

    For my CPU (Q8200 / Core 2 Quad), any tiny differences seem to be within noise levels: https://s3.postimg.org/5em86bakj/comparison.jpg

    That’s to be expected, though; at most it will save a few cycles and bytes, because this isn’t the CPU-intensive part. As for other CPUs, most of the hardware I have access to is similar to this one.

  7. sipa commented at 9:36 am on October 12, 2016: contributor
    Sounds expected. I think that xoring a register with itself is the best-practice way to zero it, so on that basis, concept ACK.
  8. sipa commented at 9:16 pm on October 26, 2016: contributor
    utACK
  9. sipa commented at 9:24 pm on October 26, 2016: contributor

    From http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf:

    Assembly/Compiler Coding Rule 36. (M impact, ML generality) Use
    dependency-breaking-idiom instructions to set a register to 0, or to
    break a false dependence chain resulting from re-use of registers.
    In contexts where the condition codes must be preserved, move 0 into
    the register instead. This requires more code space than using XOR
    and SUB, but avoids setting the condition codes.
    
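    The condition-codes caveat in that rule is the practical constraint for this file: a minimal sketch (illustrative registers, not the exact lines from scalar_4x64_impl.h) of why the zeroing idiom must sit outside any add/adc carry chain.

      "xorq %%r10, %%r10\n"  /* zero r10 here, while no flags are live */
      "addq %%r8, %%r12\n"   /* sets CF */
      "adcq %%r9, %%r13\n"   /* consumes CF; an xorq placed between the
                                addq and the adcq would clear CF, which is
                                why the rule says to use movq $0 when the
                                condition codes must be preserved */
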
  10. sipa merged this on Oct 26, 2016
  11. sipa closed this on Oct 26, 2016

  12. sipa referenced this in commit 40c8d7e8bf on Oct 26, 2016
  13. Alex-GR deleted the branch on Oct 26, 2016
  14. Alex-GR commented at 2:39 pm on November 18, 2016: contributor

    I had some time “playing around” with this asm, and I’ll share my findings here.

    1. In case anyone wonders what the speed gains would be in reduce512 if all steps were merged - without each step going through memory: https://github.com/Alex-GR/secp256k1/commit/40947acf75b13faad2f86cdce7564a417bb00e77

    I think this could go further if the final reduce function (in C) were also in asm and then merged in as the “4th step” - so that the registers keep feeding the next stage of final reduction directly instead of going through memory.

    Obviously, right now the code is far more readable about what it does (my asm isn’t that readable…).
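
    To illustrate the idea with a toy sketch of mine (not the real reduction steps): merging means the intermediate limbs stay in registers across steps instead of being stored and reloaded between them.

      #include <stdint.h>

      /* Staged: each step forces its result through memory, as when
       * separate asm blocks communicate via a temporary array.
       * (volatile keeps the compiler from optimizing the round trips away.) */
      uint64_t staged(uint64_t x, volatile uint64_t *tmp) {
          tmp[0] = x * 3;        /* step 1 stores          */
          tmp[0] = tmp[0] + 7;   /* step 2 reloads, stores */
          return tmp[0] ^ 1;     /* step 3 reloads         */
      }

      /* Merged: the whole chain stays register-to-register. */
      uint64_t merged(uint64_t x) {
          uint64_t t = x * 3;
          t += 7;
          return t ^ 1;
      }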

    2. I also found some tiny room for optimization here (a few fewer moves, plus registers used instead of 2 temp variables): https://github.com/Alex-GR/secp256k1/commit/b205b05b6ab53d96bff29d8b23c17fff6d2f0e04

    3. While playing around, I experimented with using XMM registers for temporary storage instead of the stack. When I benchmark reg-xmm-reg vs. reg-stack-reg, reg-xmm-reg wins by a long shot in terms of time, but I couldn’t find any performance benefit when doing it in the sequences above; I have no idea why. In some cases it can still be a benefit, though, because you get 1-2 extra registers that you’d normally not have. If, say, rsp can’t get pushed (= segfault) to be repurposed for math or carrying a value, you can move rsp => xmm, use rsp temporarily as a normal extra register (as long as you don’t need stack operations), and then move the xmm back to rsp when the code needs it (see the sketch after the timings below). Obviously this only works in operations where the xmm register leaking the stack address does not create a security issue.

    From a small loop benchmark doing movs between registers on my Q8200:

    Reg64-to-Reg64 mov:        1075 msecs.
    Reg64-Reg64-Reg64 mov:     1818 msecs.
    Reg64-XMM Reg-Reg64 movq:  1861 msecs.
    Reg64-RAM-Reg64 mov:       2689 msecs.
    Push & Pop Reg64s:         2715 msecs.
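
    Here is a sketch of the rsp-stash trick described in this item (the surrounding context is assumed and illustrative; movq between a general-purpose and an XMM register is the real instruction behind the reg-xmm-reg timings above):

      "movq %%rsp, %%xmm7\n"   /* stash the stack pointer in an XMM register */
      /* ... %rsp is now usable as an extra general-purpose register,
       *     as long as nothing touches the stack (no push/pop/call) ... */
      "movq %%xmm7, %%rsp\n"   /* restore %rsp before any stack operation */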

    While I was at it, I also tested mov $0,reg vs. xor reg,reg vs. sub reg,reg on my architecture: XOR beats mov $0 and lea 0, and is equal to sub reg,reg (the decoder understands that it means “set to zero”). Again, this is for a Q8200 Intel quad core.

    Time elapsed for 1 billion loops x10 MOV 0 same register:                 3767 msecs.
    Time elapsed for 1 billion loops x10 MOV 0 on 10 different registers:    3762 msecs.
    Time elapsed for 1 billion loops x10 XOR same register:                  2149 msecs.
    Time elapsed for 1 billion loops x10 XOR on 10 different registers:      2149 msecs.
    Time elapsed for 1 billion loops x10 SUB same register:                  2149 msecs.
    Time elapsed for 1 billion loops x10 SUB on 10 different registers:      2149 msecs.
    Time elapsed for 1 billion loops x10 LEA [0] same register:              5394 msecs.
    Time elapsed for 1 billion loops x10 LEA [0] on 10 different registers:  5374 msecs.
    Time elapsed for 1 billion loops x10 PXOR same SSE register:             2150 msecs.
    Time elapsed for 1 billion loops x10 PXOR on 10 different SSE registers: 2149 msecs.
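
    For reproducibility, a minimal sketch of the kind of loop benchmark that can produce numbers like these (my own construction, not the harness used above; assumes Linux/glibc for clock_gettime):

      #include <stdio.h>
      #include <time.h>

      int main(void) {
          struct timespec t0, t1;
          long i, ms;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < 1000000000L; i++) {
              /* Ten zeroing instructions per iteration (the "x10" above);
               * swap in "movq $0, %%rax" etc. to time the other idioms. */
              __asm__ volatile (
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  "xorq %%rax, %%rax\n"
                  : : : "rax", "cc");
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);
          ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_nsec - t0.tv_nsec) / 1000000;
          printf("Time elapsed: %ld msecs.\n", ms);
          return 0;
      }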

    4. I also noticed performance fluctuations depending on the ordering of code. Sometimes even changing the order of some lines (without affecting the logic) can bring speed increases or slowdowns, and even throwing a nop here or there can produce the same. I thought it was a code-alignment issue, so I tried .align 1-2-4-8-16 (after every instruction line) to see if it makes any difference, but it just got worse. It might be that 2-byte/4-byte opcodes are “naturally” preferred over 3-byte opcodes. I noticed this when trying to replace the adcq $0’s with an adcq of a recently xor’ed register. For example:

      /* Extract r1 */
      "movq %%r11, 8(%q2)\n"
      "xorq %%r11, %%r11\n"
      /* (r9,r8) += p4 */
      "addq %%r9, %%r13\n"
      "adcq $0, %%r11\n"

    becomes =>

      "adcq %%r11, %%r11\n" /* it was already zeroed above, so we get a minus-1-byte opcode */

    My thinking was that if I replaced the ~20 adcq $0’s with adcq of a zeroed register, I could save some opcode bytes for secondary benefits (smaller code = other functions can remain in the fast L1…). In benchmarking with a test example, it makes no speed difference whether you adc with an immediate or with a zeroed register; the only difference is less opcode. However, when I do it in the asm sequence above, it does give me a slight decrease in speed - and I suspect it’s due to 3-byte opcodes rather than 4-byte ones.

    Something similar happens when replacing the 10-byte movabs of the large immediates (as seen in the disassembled output) with a 3-byte mov reg,reg. While the opcode savings are big, it usually ends up sliiiiightly slower - and the only cause I can think of is odd byte addresses for the instructions. Perhaps it’s different on processors more modern than my Q8200.
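
    For concreteness, the standard x86-64 encodings behind those size claims (the byte counts are my own annotation, not from the thread):

      49 83 D3 00               adcq $0, %r11       /* 4 bytes: adc r/m64, imm8 */
      4D 11 D3                  adcq %r10, %r11     /* 3 bytes: adc r/m64, r64  */
      48 B8 <8-byte immediate>  movabs $imm64, %rax /* 10 bytes                 */
      4C 89 D0                  movq %r10, %rax     /* 3 bytes                  */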

    5. Note 1: bench_verify behaves differently as of late. It used to show the benchmarked value and terminate; now it shows the value but keeps running without terminating.

    6. Note 2: I discovered that some benchmarks from bench_internal can be somewhat misleading in how they compare gcc and clang. Clang was showing me better speeds in some of them, and I wondered what it did differently… well, it was inlining the function call inside the benchmark - so it showed better performance because it didn’t need to call anything (or because the loop jumps were shorter with the function inlined into the benchmark function). However, that does not necessarily mean the function itself was compiled more efficiently (which is what one would think based on the benchmark result). Perhaps it’d be useful to also have a real-world benchmark, like processing 20 Bitcoin blocks’ worth of signatures and measuring the actual time taken. A simple guard against the inlining pitfall is sketched after this note.
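
    One common guard against that pitfall (my suggestion, not something from the thread) is to pin the benchmarked function out of the inliner’s reach, so gcc and clang measure the same thing:

      #include <stdint.h>

      /* Force an out-of-line call so the benchmark times the call and the
       * function body identically under both gcc and clang, instead of
       * letting one compiler inline the work into the timing loop. */
      __attribute__((noinline)) uint64_t bench_target(uint64_t x) {
          return x * 0x9E3779B97F4A7C15ULL;  /* stand-in for the real work */
      }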

    Anyway, just thought I’d share these thoughts and ideas…

