ARM assembly implementation of field

laanwj commented at 4:15 PM on December 26, 2014: member

--with-field=32bit --with-scalar=32bit --with-asm=no --enable-benchmark --host=arm-linux-gnueabihf

bench_verify:
min 2661.978us / avg 2662.845us / max 2665.311us

--with-field=32bit --with-scalar=32bit --with-asm=arm --enable-benchmark --host=arm-linux-gnueabihf

bench_verify:
min 1732.304us / avg 1732.896us / max 1733.773us

For a 35% speed-up in total.

Measured on HummingBoard-i1 w. i.MX6 Solo (Cortex A9).
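
The 35% figure is the reduction in time per verification. A minimal sketch of that arithmetic, using only the two bench_verify averages quoted above (the function names are illustrative, not part of the benchmark tooling):

```python
# Average bench_verify times quoted above, in microseconds.
C_AVG_US = 2662.845    # --with-asm=no
ASM_AVG_US = 1732.896  # --with-asm=arm


def time_reduction(before, after):
    """Fraction of the original run time that was saved."""
    return (before - after) / before


def throughput_gain(before, after):
    """How many times more operations fit in the same wall-clock time."""
    return before / after


if __name__ == "__main__":
    print(f"time saved: {time_reduction(C_AVG_US, ASM_AVG_US):.1%}")
    print(f"throughput: {throughput_gain(C_AVG_US, ASM_AVG_US):.2f}x")
```

The same run can be quoted either as about 35% less time per verification or as roughly 1.5 times the throughput, so it helps to say which one is meant when comparing boards.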

gmaxwell commented at 6:09 PM on December 26, 2014: contributor

On a8 with endomorphism:

From: min 3744.119us / avg 3744.770us / max 3745.406us to: min 1886.788us / avg 1887.016us / max 1887.225us

Quite impressive.

peterdettman commented at 5:22 AM on December 27, 2014: contributor

Excellent! I gather this is without any NEON instructions so far?

There's an alternative way to write the multiplication that can save ~40 limb muls for 10x26 (not Karatsuba, but using the same underlying principle - see http://eprint.iacr.org/2014/852.pdf to get the general idea, though our field is not as suitable). I have sample C code for 5x52 and could whip up a 10x26 version. It's not faster in C because it adds a lot of additions, but they are well-organised for vectorization, so with SIMD instruction sets it may be viable. @laanwj Do you think you'd be able to give it a shot with NEON (I'll supply the C)?
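
For context on the "limb muls" count, here is a rough sketch of a plain schoolbook product over 10 limbs of 26 bits. It is not the alternative method mentioned above and not the library's code: it omits the reduction modulo the field prime, and all names are made up for illustration. It only shows where the baseline of 100 limb multiplications comes from.

```python
LIMBS = 10
BITS = 26
MASK = (1 << BITS) - 1


def to_limbs(x):
    """Split a non-negative integer below 2**260 into 10 limbs, least significant first."""
    return [(x >> (BITS * i)) & MASK for i in range(LIMBS)]


def schoolbook_mul(a, b):
    """Return (column sums of the unreduced product, number of limb multiplications)."""
    cols = [0] * (2 * LIMBS - 1)
    muls = 0
    for i in range(LIMBS):
        for j in range(LIMBS):
            cols[i + j] += a[i] * b[j]  # one 26x26-bit limb product
            muls += 1
    return cols, muls


def from_columns(cols):
    """Recombine column sums into a single integer (no modular reduction)."""
    return sum(c << (BITS * k) for k, c in enumerate(cols))


if __name__ == "__main__":
    x = (1 << 256) - 12345
    y = (1 << 255) + 6789
    cols, muls = schoolbook_mul(to_limbs(x), to_limbs(y))
    assert from_columns(cols) == x * y
    print("limb multiplications:", muls)
```

Saving around 40 of those products at the cost of extra additions is the trade-off being proposed; whether it pays off depends on how cheaply the additions vectorize.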

peterdettman commented at 5:51 AM on December 27, 2014: contributor

Also, I'm curious how much cache is on these boards you guys are testing with. Have you tried reducing WINDOW_G (ecmult_impl.h)?

laanwj commented at 9:53 AM on December 27, 2014: member

@peterdettman No NEON, I just wanted to see if I could beat the compiler with plain ARM assembly, as not all SoCs support NEON and this gives a baseline.

I'm certainly interested in that more efficient C version. I still intend to do a NEON version. NEON can do 2x 32×32→64 add or mul or mad at the same time, and has 32 64-bit (2×32) registers, so it would be interesting to see what different method can be used there.

Re: cache, IMX6 info states 32 KB instruction and data L1 caches and 256 KB to 1 MB of L2 cache. This is the bottom-of-the-line model so likely only 256 KB. I have not tried changing WINDOW_G.

@gmaxwell That's indeed impressive, thanks for benchmarking. Somehow ASM optimizations work better for that board :) Maybe that's because of a difference in memory speed: the manual implementation removes a lot of loads and stores from/to the stack compared to what gcc generates.

sipa commented at 4:51 PM on December 29, 2014: contributor

@gmaxwell You feel you can review this?

gmaxwell commented at 9:47 AM on December 30, 2014: contributor

Yep. Just backlogged a bit. (also figure it should spend some time cooking on the tests before merging in any case.)

laanwj commented at 9:40 AM on January 21, 2015: member

As this code has no loops, and only basic arithmetic and bit operations, would it be viable to use symbolic execution to check the computed result is equivalent to e.g. gcc's output? There's some work in s2e to do symbolic execution of ARM binary code. Conceptually it sounds simple but as usual I may be forgetting about a state explosion or two.

laanwj commented at 3:38 PM on January 26, 2015: member

Using miasm I generated IR expressions from the secp256k1_fe_*_inner assembly

https://gist.github.com/laanwj/1b7730796aa94f5bfa87

Next step would be to find out how to symbolically execute it, and whether it is possible to make something useful from the result.

laanwj commented at 8:22 AM on January 30, 2015: member

Made some progress on this. Using symbolic execution I verified that for both sqr and mul:

The code writes only to memory in a sequential area on the stack init_SP-xxx .. init_SP-4, and the output init_R0+0 .. init_R0+36
The code only inputs from memory for the input arguments, e.g. init_R1+0 .. init_R1+36 and init_R2+0 .. init_R2+36

There is no dependence of the result on initial state of registers besides the memory addressed by R0,R1,R2. E.g. no other information leaks into the expression.

I also generated a few images

sqr_a: sqr_inner from my assembly code

sqr_b: sqr_inner as generated by gcc

The outputs (all 10 of them) are on the left, inputs (green) on the right. Legend:

Red: Multiply
Yellow: Add
Cyan: Other operations such as bitshifts, and, or
Blue: Bit slices and composes
Green: Memory reads

Yes, the graph layout kind of sucks, too many intersections. I'm thinking of using simulated annealing to clean it up. Although this naive DAG approach still manages to show the structure better than anything I could get Gephi to produce. A bit like the Eiffel tower on its side.

I don't think comparing them is going to be particularly easy; at the least it's going to take a lot of expression rewriting.

gmaxwell commented at 2:06 AM on January 31, 2015: contributor

@laanwj So a validation strategy is to first prove via range analysis that the calculation can never overflow. Having done that, it should be possible to convert the asm to an algebraic statement (e.g. first convert it to an SSA form, then just substitute in regular operations). Then the algebraic statement could be simplified with a CAS and compared to the ideal representation of the function (or a conversion from the asm generated by GCC).

laanwj commented at 2:13 PM on January 31, 2015: member

For the leaf multiplications and additions it'd be quite easy to prove that no overflow happens. But I expect the least to be learned there, as the assembly code is a straightforward implementation of the C operations with umlal/umull.

However the most annoying are the carry computations (for ADC+ADDS 64 bit addition) higher up. Miasm's evaluator creates an expression based on cf = (((op1 ^ op2) ^ res) ^ ((op1 ^ res) & (~(op1 ^ op2)))).msb to compute it, which can be simplified somewhat by using a 33-bit addition then taking the upper bit. Maybe it'd be possible to recognize 64-bit additions and substitute them back, getting rid of the carry logic completely.
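
A small sketch to sanity-check the claim that the bitwise carry expression can be replaced by a wider addition. It compares the quoted formula against bit 32 of the exact sum, and builds a 64-bit add from a low ADDS-style add and a high ADC-style add. The function names are illustrative and the code only models the arithmetic, not Miasm's IR.

```python
import random

MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def carry_bitwise(op1, op2):
    """Carry-out of a 32-bit add, using the bitwise expression quoted above."""
    res = (op1 + op2) & MASK32
    cf = ((op1 ^ op2) ^ res) ^ ((op1 ^ res) & (~(op1 ^ op2) & MASK32))
    return (cf >> 31) & 1  # .msb


def carry_33bit(op1, op2):
    """Carry-out taken as bit 32 of the exact 33-bit sum."""
    return ((op1 + op2) >> 32) & 1


def add64_via_adds_adc(a, b):
    """64-bit add from two 32-bit adds: low half sets carry, high half consumes it."""
    lo = (a & MASK32) + (b & MASK32)                  # ADDS
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32     # ADC
    return (hi << 32) | (lo & MASK32)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(100000):
        x, y = rng.getrandbits(32), rng.getrandbits(32)
        assert carry_bitwise(x, y) == carry_33bit(x, y)
        a, b = rng.getrandbits(64), rng.getrandbits(64)
        assert add64_via_adds_adc(a, b) == (a + b) & MASK64
    print("carry expressions agree")
```

Random testing like this is only a plausibility check, not a proof, but it suggests that recognizing the ADDS/ADC pair and substituting a single 64-bit addition is a sound rewrite.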

(Another is the 64-bit shift, where y = x>>S gets assembled to y.h = x.h >> S, y.l = (x.l >> S) | (x.h << (32-S)). By rewriting all shifts to bit compose/slicing and recognizing bitwise OR of disjoint composes, this could be reassembled into one operation. Maybe this isn't needed though; I'm not sure how much the expression needs to be simplified at all, just to match it...)
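
As a sketch of the shift decomposition, the following models a 64-bit logical right shift on two 32-bit halves, with the low result built from the low input half and the bits shifted out of the high input half. Names are illustrative, and it assumes 0 < S < 32, the only range where this two-instruction form applies.

```python
import random

MASK32 = 0xFFFFFFFF


def shr64_split(x_hi, x_lo, s):
    """Logical right shift of the 64-bit value (x_hi:x_lo) by s, for 0 < s < 32."""
    assert 0 < s < 32
    y_hi = x_hi >> s
    low_part = x_lo >> s                        # occupies bits 0 .. 31-s
    high_part = (x_hi << (32 - s)) & MASK32     # occupies bits 32-s .. 31
    assert low_part & high_part == 0            # disjoint, so OR acts as a bit compose
    return y_hi, low_part | high_part


if __name__ == "__main__":
    rng = random.Random(2)
    for _ in range(100000):
        x = rng.getrandbits(64)
        s = rng.randrange(1, 32)
        hi, lo = shr64_split(x >> 32, x & MASK32, s)
        assert (hi << 32) | lo == x >> s
    print("split shift matches 64-bit shift")
```

Because the two OR operands never overlap, a rewriter can treat the OR as a concatenation of bit slices and fold the pair back into one 64-bit shift.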

gmaxwell commented at 2:33 PM on January 31, 2015: contributor

::nods:: The purpose of proving no overflow is not that it's very useful in and of itself, but that the rest of the proving can then be done by replacing everything with plain integer operations (instead of finite machine words) and using plain algebra, since if nothing overflows the operations are the same.
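
A minimal sketch of that argument, under stated assumptions: a 64-bit multiply-accumulate is modelled as arithmetic modulo 2**64, and a simple bound on a column of limb products stands in for range analysis. The 26-bit limb bound is an assumption for illustration; the bounds the real code must tolerate on entry may be larger. All names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def umlal(acc, a, b):
    """Model a 32x32->64 multiply-accumulate: the result wraps modulo 2**64."""
    return (acc + a * b) & MASK64


def column_machine(a, b, k):
    """Column k of the limb product, computed with wrapping 64-bit accumulation."""
    acc = 0
    for i in range(len(a)):
        j = k - i
        if 0 <= j < len(b):
            acc = umlal(acc, a[i], b[j])
    return acc


def column_exact(a, b, k):
    """The same column computed over unbounded integers."""
    return sum(a[i] * b[k - i] for i in range(len(a)) if 0 <= k - i < len(b))


def column_upper_bound(limb_bound, terms):
    """Range analysis: largest possible sum of `terms` products of limbs <= limb_bound."""
    return terms * limb_bound * limb_bound


if __name__ == "__main__":
    bound = (1 << 26) - 1
    assert column_upper_bound(bound, 10) <= MASK64  # proven: no wrap-around possible
    worst = [bound] * 10
    assert all(column_machine(worst, worst, k) == column_exact(worst, worst, k)
               for k in range(19))
    print("within the bound, machine and integer arithmetic agree")
```

If the bound check fails (for example with full 32-bit limbs and ten terms), the wrapping and exact results can differ, which is why the range proof has to come first.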

ARM assembly implementation of field_10x26 inner 1a619fefc9

laanwj force-pushed on Mar 28, 2015

gmaxwell commented at 1:26 AM on April 18, 2015: contributor

FWIW, it doesn't autodetect for me on my Novena; it needs a manual flag. Is it supposed to?

laanwj commented at 10:57 AM on April 18, 2015: member

It's not supposed to autodetect. This is experimental, after all.

sipa commented at 12:16 PM on April 27, 2015: contributor

I do feel that holding this up is a bit unfair - I'm probably demanding a level of review here that wasn't demanded for the x86_64 assembly, especially as I'd really like to see this in. Still, I'd like to see someone confirm they have reviewed it...

luke-jr commented at 8:36 AM on July 29, 2015: member

Benchmarking on USB Armory:

With c33307495b3a6658e602e14067dd594136d4690a
configure: Using assembly optimizations: no
configure: Using field implementation: 32bit
configure: Using bignum implementation: no
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: no

$ bench_internal
scalar_add: min 0.268us / avg 0.268us / max 0.269us
scalar_negate: min 0.160us / avg 0.160us / max 0.161us
scalar_sqr: min 2.04us / avg 2.04us / max 2.04us
scalar_mul: min 1.93us / avg 1.93us / max 1.93us
scalar_inverse: min 606us / avg 606us / max 606us
scalar_inverse_var: min 606us / avg 606us / max 606us
field_normalize: min 0.0801us / avg 0.0802us / max 0.0807us
field_normalize_weak: min 0.0488us / avg 0.0489us / max 0.0494us
field_sqr: min 1.25us / avg 1.26us / max 1.26us
field_mul: min 1.88us / avg 1.88us / max 1.89us
field_inverse: min 349us / avg 349us / max 350us
field_inverse_var: min 349us / avg 349us / max 350us
field_sqrt_var: min 344us / avg 345us / max 345us
group_double_var: min 11.0us / avg 11.0us / max 11.0us
group_add_var: min 28.0us / avg 28.0us / max 28.0us
group_add_affine: min 20.9us / avg 20.9us / max 20.9us
group_add_affine_var: min 19.4us / avg 19.4us / max 19.4us
ecmult_wnaf: min 4.96us / avg 4.97us / max 5.02us
hash_sha256: min 2.53us / avg 2.54us / max 2.56us
hash_hmac_sha256: min 10.2us / avg 10.2us / max 10.2us
hash_rfc6979_hmac_sha256: min 55.9us / avg 56.0us / max 56.0us

$ bench_recover
ecdsa_recover: min 5522us / avg 5522us / max 5522us

$ bench_sign
ecdsa_sign: min 2521us / avg 2521us / max 2522us

$ bench_verify
ecdsa_verify: min 5167us / avg 5167us / max 5167us

With c33307495b3a6658e602e14067dd594136d4690a+1a619fefc90e29d04c9f740af8e86142a40e1d5a:
configure: Using assembly optimizations: arm
configure: Using field implementation: 32bit
configure: Using bignum implementation: no
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: no

$ bench_internal
scalar_add: min 0.268us / avg 0.268us / max 0.269us
scalar_negate: min 0.158us / avg 0.158us / max 0.159us
scalar_sqr: min 2.04us / avg 2.04us / max 2.04us
scalar_mul: min 1.93us / avg 1.93us / max 1.93us
scalar_inverse: min 606us / avg 606us / max 606us
scalar_inverse_var: min 606us / avg 606us / max 607us
field_normalize: min 0.0801us / avg 0.0802us / max 0.0807us
field_normalize_weak: min 0.0488us / avg 0.0489us / max 0.0494us
field_sqr: min 0.597us / avg 0.598us / max 0.603us
field_mul: min 0.810us / avg 0.811us / max 0.816us
field_inverse: min 165us / avg 165us / max 165us
field_inverse_var: min 165us / avg 165us / max 165us
field_sqrt_var: min 163us / avg 163us / max 163us
group_double_var: min 5.22us / avg 5.22us / max 5.22us
group_add_var: min 12.5us / avg 12.5us / max 12.5us
group_add_affine: min 10.1us / avg 10.1us / max 10.1us
group_add_affine_var: min 8.80us / avg 8.80us / max 8.80us
ecmult_wnaf: min 4.95us / avg 4.96us / max 5.01us
hash_sha256: min 2.54us / avg 2.54us / max 2.55us
hash_hmac_sha256: min 10.2us / avg 10.2us / max 10.2us
hash_rfc6979_hmac_sha256: min 55.6us / avg 55.6us / max 55.7us

$ bench_recover
ecdsa_recover: min 2932us / avg 2932us / max 2932us

$ bench_sign
ecdsa_sign: min 1650us / avg 1650us / max 1652us

$ bench_verify
ecdsa_verify: min 2769us / avg 2769us / max 2770us

luke-jr commented at 6:49 AM on September 20, 2015: member

Benchmarking on Nokia N900:

With 85e3a2cc087993973a2195849c652005b0be7ddd
CFLAGS='-mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=hard -O2'
configure: Using assembly optimizations: no
configure: Using field implementation: 32bit
configure: Using bignum implementation: gmp
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: yes
configure: Building ECDH module: yes
configure: Building Schnorr signatures module: yes
configure: Building ECDSA pubkey recovery module: yes

$ tests
test count = 64
random seed = f3f2447cb0df6420fe4a7ac0af8307ee
random run = ca0a22a61ef8f80fc89fece8e28c6378

$ bench_internal
scalar_add: min 0.294us / avg 0.300us / max 0.320us
scalar_negate: min 0.125us / avg 0.126us / max 0.128us
scalar_sqr: min 2.52us / avg 2.56us / max 2.70us
scalar_mul: min 2.31us / avg 2.36us / max 2.66us
scalar_split: min 10.1us / avg 10.1us / max 10.3us
scalar_inverse: min 746us / avg 753us / max 786us
scalar_inverse_var: min 30.1us / avg 30.4us / max 31.8us
field_normalize: min 0.125us / avg 0.126us / max 0.128us
field_normalize_weak: min 0.0685us / avg 0.0689us / max 0.0705us
field_sqr: min 0.792us / avg 0.798us / max 0.814us
field_mul: min 1.08us / avg 1.09us / max 1.12us
field_inverse: min 220us / avg 222us / max 227us
field_inverse_var: min 39.7us / avg 40.2us / max 42.5us
field_sqrt_var: min 217us / avg 218us / max 224us
group_double_var: min 6.76us / avg 6.84us / max 7.15us
group_add_var: min 16.5us / avg 16.6us / max 17.0us
group_add_affine: min 13.2us / avg 13.3us / max 13.4us
group_add_affine_var: min 11.6us / avg 11.6us / max 11.9us
wnaf_const: min 3.15us / avg 3.18us / max 3.33us
ecmult_wnaf: min 6.58us / avg 6.65us / max 7.15us
hash_sha256: min 2.55us / avg 2.58us / max 2.70us
hash_hmac_sha256: min 10.6us / avg 10.8us / max 11.1us
hash_rfc6979_hmac_sha256: min 58.4us / avg 59.1us / max 61.1us
context_verify: min 327948us / avg 330529us / max 333565us
context_sign: min 1160us / avg 1169us / max 1211us

$ bench_recover
ecdsa_recover: min 2097us / avg 2101us / max 2109us

$ bench_sign
ecdsa_sign: min 2128us / avg 2131us / max 2137us

$ bench_verify
ecdsa_verify: min 2052us / avg 2058us / max 2064us

$ bench_ecdh
ecdh: min 2506us / avg 2519us / max 2579us

$ bench_schnorr_verify
schnorr_verify: min 2036us / avg 2039us / max 2042us

With 85e3a2cc087993973a2195849c652005b0be7ddd+1a619fefc90e29d04c9f740af8e86142a40e1d5a
CFLAGS='-mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=hard -O2'
configure: Using assembly optimizations: arm
configure: Using field implementation: 32bit
configure: Using bignum implementation: gmp
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: yes
configure: Building ECDH module: yes
configure: Building Schnorr signatures module: yes
configure: Building ECDSA pubkey recovery module: yes

$ tests
test count = 64
random seed = dd63c4e6e5d1b385ba297417ab7db622
random run = dc334c84b1b820836d957e83d1580f89

$ bench_internal
scalar_add: min 0.294us / avg 0.298us / max 0.312us
scalar_negate: min 0.125us / avg 0.126us / max 0.130us
scalar_sqr: min 2.52us / avg 2.56us / max 2.73us
scalar_mul: min 2.31us / avg 2.33us / max 2.36us
scalar_split: min 10.1us / avg 10.1us / max 10.5us
scalar_inverse: min 746us / avg 752us / max 773us
scalar_inverse_var: min 30.1us / avg 30.6us / max 33.1us
field_normalize: min 0.125us / avg 0.126us / max 0.130us
field_normalize_weak: min 0.0685us / avg 0.0693us / max 0.0725us
field_sqr: min 0.792us / avg 0.818us / max 1.02us
field_mul: min 1.08us / avg 1.08us / max 1.13us
field_inverse: min 220us / avg 221us / max 223us
field_inverse_var: min 39.7us / avg 40.2us / max 42.0us
field_sqrt_var: min 217us / avg 218us / max 221us
group_double_var: min 6.75us / avg 6.82us / max 7.03us
group_add_var: min 16.5us / avg 16.6us / max 16.8us
group_add_affine: min 13.2us / avg 13.3us / max 13.5us
group_add_affine_var: min 11.6us / avg 11.6us / max 11.8us
wnaf_const: min 3.15us / avg 3.20us / max 3.42us
ecmult_wnaf: min 6.58us / avg 6.70us / max 7.31us
hash_sha256: min 2.55us / avg 2.60us / max 2.84us
hash_hmac_sha256: min 10.7us / avg 10.8us / max 11.4us
hash_rfc6979_hmac_sha256: min 58.4us / avg 59.2us / max 61.4us
context_verify: min 327219us / avg 328688us / max 331987us
context_sign: min 1158us / avg 1168us / max 1208us

$ bench_recover
ecdsa_recover: min 2097us / avg 2101us / max 2106us

$ bench_sign
ecdsa_sign: min 2120us / avg 2123us / max 2126us

$ bench_verify
ecdsa_verify: min 2052us / avg 2055us / max 2059us

$ bench_ecdh
ecdh: min 2508us / avg 2557us / max 2769us

$ bench_schnorr_verify
schnorr_verify: min 2035us / avg 2043us / max 2050us

With 85e3a2cc087993973a2195849c652005b0be7ddd
CFLAGS='-mthumb -mno-thumb-interwork -mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=hard -O2 -Wa,-mthumb'
configure: Using assembly optimizations: no
configure: Using field implementation: 32bit
configure: Using bignum implementation: gmp
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: yes
configure: Building ECDH module: yes
configure: Building Schnorr signatures module: yes
configure: Building ECDSA pubkey recovery module: yes

$ tests
test count = 64
random seed = 61edcfb05a0a4846402d3471b26c00c2
random run = 590a9e113ba71a79db92ca985e503664

$ bench_internal
scalar_add: min 0.312us / avg 0.316us / max 0.331us
scalar_negate: min 0.139us / avg 0.139us / max 0.144us
scalar_sqr: min 2.57us / avg 2.59us / max 2.65us
scalar_mul: min 2.44us / avg 2.45us / max 2.49us
scalar_split: min 10.8us / avg 10.9us / max 11.5us
scalar_inverse: min 764us / avg 770us / max 792us
scalar_inverse_var: min 29.6us / avg 30.4us / max 33.1us
field_normalize: min 0.134us / avg 0.134us / max 0.138us
field_normalize_weak: min 0.0618us / avg 0.0627us / max 0.0659us
field_sqr: min 1.68us / avg 1.69us / max 1.72us
field_mul: min 2.06us / avg 2.11us / max 2.43us
field_inverse: min 421us / avg 423us / max 425us
field_inverse_var: min 40.7us / avg 41.0us / max 41.6us
field_sqrt_var: min 415us / avg 417us / max 419us
group_double_var: min 12.8us / avg 12.9us / max 13.0us
group_add_var: min 31.3us / avg 31.4us / max 31.6us
group_add_affine: min 23.8us / avg 24.0us / max 24.2us
group_add_affine_var: min 21.7us / avg 21.8us / max 22.0us
wnaf_const: min 3.71us / avg 3.76us / max 4.02us
ecmult_wnaf: min 6.65us / avg 6.74us / max 7.22us
hash_sha256: min 2.75us / avg 2.82us / max 3.04us
hash_hmac_sha256: min 11.4us / avg 11.5us / max 11.8us
hash_rfc6979_hmac_sha256: min 62.6us / avg 63.3us / max 65.3us
context_verify: min 573358us / avg 576329us / max 578496us
context_sign: min 1852us / avg 1866us / max 1906us

$ bench_recover
ecdsa_recover: min 3816us / avg 3820us / max 3825us

$ bench_sign
ecdsa_sign: min 3021us / avg 3025us / max 3028us

$ bench_verify
ecdsa_verify: min 3771us / avg 3775us / max 3778us

$ bench_ecdh
ecdh: min 4536us / avg 4540us / max 4548us

$ bench_schnorr_verify
schnorr_verify: min 3735us / avg 3737us / max 3741us

With 85e3a2cc087993973a2195849c652005b0be7ddd+1a619fefc90e29d04c9f740af8e86142a40e1d5a
CFLAGS='-mthumb -mno-thumb-interwork -mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=hard -O2 -Wa,-mthumb'
configure: Using assembly optimizations: arm
configure: Using field implementation: 32bit
configure: Using bignum implementation: gmp
configure: Using scalar implementation: 32bit
configure: Using endomorphism optimizations: yes
configure: Building ECDH module: yes
configure: Building Schnorr signatures module: yes
configure: Building ECDSA pubkey recovery module: yes

$ tests
test count = 64
random seed = f4d64b19fa6b8ed3112e13f9917e0bfb
random run = 0fdb669452614944fa8a5fc42d522ff1

$ bench_internal
scalar_add: min 0.312us / avg 0.323us / max 0.404us
scalar_negate: min 0.138us / avg 0.139us / max 0.144us
scalar_sqr: min 2.57us / avg 2.61us / max 2.80us
scalar_mul: min 2.44us / avg 2.46us / max 2.49us
scalar_split: min 10.8us / avg 10.9us / max 11.2us
scalar_inverse: min 764us / avg 773us / max 795us
scalar_inverse_var: min 29.8us / avg 30.7us / max 32.6us
field_normalize: min 0.134us / avg 0.135us / max 0.145us
field_normalize_weak: min 0.0618us / avg 0.0626us / max 0.0659us
field_sqr: min 0.793us / avg 0.829us / max 0.924us
field_mul: min 1.08us / avg 1.08us / max 1.12us
field_inverse: min 220us / avg 221us / max 226us
field_inverse_var: min 39.2us / avg 39.5us / max 39.9us
field_sqrt_var: min 217us / avg 218us / max 219us
group_double_var: min 6.78us / avg 6.83us / max 7.03us
group_add_var: min 16.6us / avg 16.7us / max 17.0us
group_add_affine: min 13.3us / avg 13.4us / max 13.6us
group_add_affine_var: min 11.6us / avg 11.7us / max 11.9us
wnaf_const: min 3.71us / avg 3.75us / max 3.97us
ecmult_wnaf: min 6.71us / avg 6.79us / max 7.13us
hash_sha256: min 2.76us / avg 2.81us / max 3.07us
hash_hmac_sha256: min 11.4us / avg 11.5us / max 11.8us
hash_rfc6979_hmac_sha256: min 62.7us / avg 63.3us / max 64.9us
context_verify: min 328046us / avg 329295us / max 330910us
context_sign: min 1164us / avg 1173us / max 1218us

$ bench_recover
ecdsa_recover: min 2098us / avg 2101us / max 2105us

$ bench_sign
ecdsa_sign: min 2136us / avg 2140us / max 2156us

$ bench_verify
ecdsa_verify: min 2054us / avg 2057us / max 2060us

$ bench_ecdh
ecdh: min 2508us / avg 2515us / max 2528us

$ bench_schnorr_verify
schnorr_verify: min 2036us / avg 2038us / max 2041us

sipa commented at 5:36 PM on September 22, 2015: contributor

Needs rebase. I'd still like to see this in, but someone reviewing it would be nice...

sipa referenced this in commit 7b0fb18b75 on May 25, 2016

laanwj commented at 3:12 PM on October 11, 2016: member

This should be closed as it was merged.

laanwj closed this on Oct 11, 2016

ARM assembly implementation of field_10x26 inner #173