It seems secp256k1_ec_pubkey_create() does not benefit from being called from multiple cores in parallel. I measure its speed at around 25,000 calls/sec on my Apple M1 Max CPU.
I am curious whether you use precomputed points for the private keys 1, 2, 4, 8, …, 2^255, and then up to 256 point additions to construct the public key from the private key?
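To make the question concrete, here is a minimal Python sketch of the scheme I have in mind. It is only an illustration of the idea, not a claim about how libsecp256k1 actually implements key generation. The names `point_add`, `build_table`, `TABLE` and `pubkey_from_privkey` are my own, it uses plain affine arithmetic, and it is not constant-time, so it is unsuitable for real keys. It needs Python 3.8+ for `pow(x, -1, P)`.

```python
# secp256k1 domain parameters (standard values from SEC 2).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (
    0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
    0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8,
)


def point_add(a, b):
    """Add two affine points on y^2 = x^3 + 7; None is the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a) = infinity
    if a == b:
        # Tangent slope for doubling (curve coefficient a = 0).
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P
    else:
        # Chord slope for adding two distinct points.
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def build_table():
    """Precompute TABLE[i] = 2^i * G for i = 0..255 by repeated doubling."""
    table = []
    pt = G
    for _ in range(256):
        table.append(pt)
        pt = point_add(pt, pt)
    return table


TABLE = build_table()


def pubkey_from_privkey(d):
    """Sum the cached points for every set bit of d: at most 256 additions."""
    if not 1 <= d < N:
        raise ValueError("private key must be in [1, N-1]")
    acc = None
    for i in range(256):
        if (d >> i) & 1:
            acc = point_add(acc, TABLE[i])
    return acc
```

With this layout the table is built once, and each key then costs one point addition per set bit of the private key, about 128 on average for a random key.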