XOR reg,reg instead of MOV 0 to reg. It should be at least equal in all architectures and faster in some.
Triple test everything because I’m a noob in coding & github alike.
It’s not by any means necessary. It just reduces the bytes used (the xor instruction translates to fewer bytes than a mov), makes it faster on some architectures, and finally it’s good for consistency, as the lines below 598 already use xor instead of mov to zero registers.
If movs are used for some reason that I’m not aware of, let them be.
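For reference, a minimal byte-count comparison of the two zeroing idioms (typical x86-64 encodings, AT&T syntax; the register choice is just an example):

xorq %r10, %r10    /* 3 bytes; recognized by the decoder as a zeroing idiom, clobbers the flags */
movq $0, %r10      /* 7 bytes (sign-extended imm32); preserves the flags */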
For my cpu (q8200 / core2 quad) any tiny differences seem to be within noise levels: https://s3.postimg.org/5em86bakj/comparison.jpg
That’s to be expected though; at most it will save a few cycles and bytes, because this isn’t the cpu-intensive part. As for other cpus, most of the hardware I have access to is similar to this one.
Assembly/Compiler Coding Rule 36. (M impact, ML generality) Use dependency-breaking-idiom instructions to set a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the condition codes must be preserved, move 0 into the register instead. This requires more code space than using XOR and SUB, but avoids setting the condition codes.
I had some time “playing around” with this asm, and I’ll share my findings here.
I think this could go further: if the final reduce function (in C) were also in asm and then merged as the “4th step”, the registers could keep feeding the next stage of the final reduction directly instead of going through memory.
Obviously, right now the C version is far more readable in what it does (my asm isn’t that readable…)
I also found some tiny room for optimization here (a few fewer movs, plus registers used instead of 2 tmp variables): https://github.com/Alex-GR/secp256k1/commit/b205b05b6ab53d96bff29d8b23c17fff6d2f0e04
While playing around, I experimented with using XMM registers for temp storage instead of the stack. In isolated benchmarks of reg-xmm-reg vs reg-stack-reg, the reg-xmm-reg path wins by a long shot in terms of time, but I couldn’t find any performance benefit when doing it in the sequences above. I have no idea why. In some cases it can still be a benefit though, because you get 1-2 more registers that you’d normally not have. If, say, rsp can’t be pushed (=segfault) so that it can be repurposed for math or carrying a value, you can move rsp=>xmm, use rsp temporarily as a normal extra register (as long as you don’t need stack operations), and then move the xmm back to rsp when the code needs it again. Obviously this is only for operations where the xmm register leaking the stack address does not create a security issue.
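A rough sketch of that rsp-stashing idea, just to illustrate it (AT&T syntax; the load and the surrounding registers are hypothetical, and nothing may touch the stack while rsp is repurposed):

movq %rsp, %xmm7      /* stash the real stack pointer in an otherwise unused xmm register */
movq 0(%rsi), %rsp    /* hypothetical: repurpose rsp as an extra scratch register */
addq %rsp, %r8        /* ...use the value held in rsp for some math... */
adcq $0, %r9
movq %xmm7, %rsp      /* restore the stack pointer before any stack operation */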
From a small loop benchmark doing movs between registers in my q8200:
Reg64-to-Reg64 mov: 1075 msecs.
Reg64-Reg64-Reg64 mov: 1818 msecs.
Reg64-XMM Reg-Reg64 movq: 1861 msecs.
Reg64-RAM-Reg64 mov: 2689 msecs.
Push & Pop Reg64s: 2715 msecs.
While at it, I also did a test of mov 0, reg VS xor reg,reg VS sub reg,reg for my architecture: XOR beats mov 0 and lea 0, and is equal to sub reg,reg (the decoder understands it means “set to zero”). Again, this is for a q8200 intel quad core.
Time elapsed for 1 billion loops x10 MOV 0 same register: 3767 msecs.
Time elapsed for 1 billion loops x10 MOV 0 on 10 different registers: 3762 msecs.
Time elapsed for 1 billion loops x10 XOR same register: 2149 msecs.
Time elapsed for 1 billion loops x10 XOR on 10 different registers: 2149 msecs.
Time elapsed for 1 billion loops x10 SUB same register: 2149 msecs.
Time elapsed for 1 billion loops x10 SUB on 10 different registers: 2149 msecs.
Time elapsed for 1 billion loops x10 LEA [0] same register: 5394 msecs.
Time elapsed for 1 billion loops x10 LEA [0] on 10 different registers: 5374 msecs.
Time elapsed for 1 billion loops x10 PXOR same SSE register: 2150 msecs.
Time elapsed for 1 billion loops x10 PXOR on 10 different SSE registers: 2149 msecs.
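These numbers come from simple tight loops. A minimal sketch of that kind of micro-benchmark (illustrative only, assuming GCC-style inline asm and clock_gettime; not the exact harness behind the numbers above):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 1000000000L; i++) {
        /* x10 unrolled zeroing via xor; "cc" because xor writes the flags */
        __asm__ volatile (
            "xorq %%r10, %%r10\n xorq %%r10, %%r10\n xorq %%r10, %%r10\n"
            "xorq %%r10, %%r10\n xorq %%r10, %%r10\n xorq %%r10, %%r10\n"
            "xorq %%r10, %%r10\n xorq %%r10, %%r10\n xorq %%r10, %%r10\n"
            "xorq %%r10, %%r10\n" ::: "r10", "cc");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("XOR same register: %ld msecs\n",
           (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_nsec - t0.tv_nsec) / 1000000);
    return 0;
}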
/* Extract r1 */
"movq %%r11, 8(%q2)\n"
"xorq %%r11, %%r11\n"
/* (r9,r8) += p4 */
"addq %%r9, %%r13\n"

"adcq $0, %%r11\n"   becomes =>   "adcq %%r11, %%r11\n" /* it was already zeroed above so we get minus 1 byte opcode */
My thinking was that if I replace ~20x adcq $0 with adcq of a zeroed register, I can save some opcode bytes for secondary benefits (saving opcode = other functions can remain in the fast L1 cache…). In benchmarking with a test example, it doesn’t make any speed difference whether you adc with an immediate or a zeroed register; the only difference is less opcode. However, when I do it in the asm sequence above, it does give me a slight decrease in speed - and I suspect it’s due to the opcodes now being 3 bytes rather than 4 (probably instruction alignment).
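The encodings in question (typical x86-64, AT&T syntax; exact bytes depend on the registers used):

adcq $0, %r13       /* 4 bytes: REX.W prefix, opcode 83 /2, ModRM, imm8 */
adcq %r11, %r13     /* 3 bytes: REX.W prefix, opcode 11 /r, ModRM */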
Something similar happens when replacing the 10-byte movabs (in the disassembled output) for the large immediates with a 3-byte mov reg,reg. While the opcode savings are big, it usually ends up sliiiiightly slower - and the only thing I can think of is odd byte addresses for the instructions. Perhaps it’s different on more modern processors than my q8200.
Note1: bench_verify behaves differently as of late. It used to show the benchmarked value and terminate. Now it shows the value but goes on without terminating.
Note2: I discovered that some benchmarks from bench_internal can be somewhat misleading when comparing gcc and clang. Clang was showing me better speeds in some of them and I was wondering what it does differently… well, it was inlining the function call inside the benchmark loop - so it showed better performance because it didn’t need to call anything (or the loop jumps were shorter with the function inlined into the benchmark function). However, that does not necessarily mean the function itself was compiled more efficiently (which is what one would assume from the benchmark result). Perhaps it’d be useful to also have a real-world benchmark, like processing 20 bitcoin blocks’ worth of signatures and measuring the actual time taken.
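One way to take inlining out of the equation when comparing compilers would be to force the benchmarked call to stay out-of-line. A minimal sketch (bench_op and run_bench are placeholder names, not the actual bench_internal code):

#include <stdint.h>

/* Placeholder for whatever is being benchmarked; noinline forces a real call
   so gcc and clang both pay the same call overhead. */
__attribute__((noinline)) static uint64_t bench_op(uint64_t x) {
    return x * 0x9E3779B97F4A7C15ULL;  /* stand-in workload */
}

uint64_t run_bench(uint64_t iters) {
    volatile uint64_t sink = 0;  /* volatile so the result isn't optimized away */
    for (uint64_t i = 0; i < iters; i++) {
        sink = bench_op(i);
    }
    return sink;
}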
Anyway, just thought I’d share these thoughts and ideas…