[WIP] Replace OpenSSL PRNG with built-in Fortuna implementation

sipa commented at 3:48 pm on March 12, 2015: member

This replaces the OpenSSL-based random number generation code with a built-in implementation of Fortuna.

It is seeded from operating system entropy (/dev/urandom etc), a variety of environment entropy (which is strengthened), and very precise CPU timings. Expensive (more than a few millisecond) seedings are only done at startup, others occur every few minutes. Data from RPC, processing and network events is additionally fed into the entropy pool. This should give us fairly safe operation even in case operating system entropy code is broken.

Furtermore, it unifies the insecure and secure random number generator, and makes the private key generation code always request OS entropy (which is likely better than we can do, but we still want to protect against it being broken).

TODO:

Better environment information on OSX and Windows
Seeding from GUI events into the pool.
Unit tests for the Fortuna implementation.

Wipe hash object states after usage 417532c8ac

Switch addrman key from vector to uint256 608cc68ce2

Explicitly initialize and destruct addrman object d4d8d17bfb

Make the seeds in params actually constant 6614ff6cd5

Add AES implementation cfcf0de666

Add Fortuna PRNG 5a0c1b6857

sipa force-pushed on Mar 12, 2015

Switch to Fortuna-based PRNG

This changes the GetRand-like functions in the code to use an internal Fortuna
instance, rather than the OpenSSL PRNG.

The code for acquiring entropy is based on an implementation by Greg Maxwell.

ce2a03a459

Merge secure and insecure PRNG code 08b1a5674c

sipa force-pushed on Mar 12, 2015

gmaxwell commented at 8:50 pm on March 12, 2015: contributor

I’m happy about this approach.

We’ve seen operating system RNG’s fail silently in really frightening ways several times over the last few years, a belt-and-suspenders approach where silent failure at least gets a best effort bit of entropy-snake-oil (or maybe not so snakeoil: after going and writing a bunch of entropy collecting code I’m more impressed with the performance than I expected) seems to be a clear improvement. OpenSSL’s (and libressl’s) system entropy randomness generator is also pretty scary (it can fail silently back to snakeoil entropy much weaker than what this code does).

We’ll want to take good care to make sure GetOSEntropy is portable and safe. Maybe we should introduce a startup time test to get GetOSEntropy that just checks that it’s not outputting a constant. I wouldn’t expect an actual OS failure to do that, but e.g. if some massive screwup has replaced /dev/urandom with a file (easier than you might think…) it could be helpful.

Right now this code has no fork safety. A fork() will end up with a clone of the pool state. That StrongRandom always goes to the OS helps add confidence that total doom is unlikely, but we might want to deal with this case even though we never fork just in case someone adds something that forks later and doesn’t think about it. A PID vs last_pid test in the random calls that triggers a GetSystemRand(slow=0) would I think be all that it would take.

Beyond the nits I am adding the only big obvious gaps were that there is no facility to roll entropy from the pool forward across restarts. We should probably GetStrongRandom() on clean shutdown and write out a entropy file that gets fed back in at startup. Perhaps system rand can take an argument, and thats how the config environment could also be passed in?

There are plenty of missed opportunities to collect extra entropy (e.g. from UI events; UPNP real-IP, other hosts provided ping nonces) that don’t fit into line level patches, as those files aren’t even touched.

laanwj added the label Improvement on Mar 13, 2015

laanwj commented at 12:21 pm on March 13, 2015: member

Concept ACK, although as discussed I’d prefer it as a subtree’d external library like secp256k1 and leveldb - so that other (wallet) software which needs a good random source can use it too, and it receives more testing and review than an internal module in Bitcoin Core.

Krellan commented at 8:58 am on March 15, 2015: contributor

Good idea. I thought about this when implementing random numbers for the ping nonce. They don’t need to be cryptographically secure, just random enough to serve their purpose. At the time, thought about calling the OpenSSL insecure rand, but its API was not orthogonal to secure rand, unfortunately. It had side effects I wasn’t happy with, so just stuck with the regular OpenSSL secure rand, even though I wasn’t happy about it needlessly consuming the entropy pool, which is a somewhat precious resource. Would it be worth changing the ping nonce to use insecure rand instead, or keep it secure and await this fine new implementation?

gmaxwell commented at 9:04 am on March 15, 2015: contributor

@Krellan There is nothing “consumed” in the entropy pool.

laanwj commented at 12:14 pm on March 18, 2015: member

@Krellan After this pull the insecure random is certainly good enough to use for ping nonces.

gavinandresen commented at 7:02 pm on March 18, 2015: contributor

Running the Fortuna implementation’s output through http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html would be spiffy. Or maybe the Python port of that: http://gerhardt.ch/random.php

theuni commented at 5:52 am on March 20, 2015: member

@sipa Very nice work.

Taking advantage of the aes classes introduced here, I’ve added cbc functionality on top to replace our usage of openssl for wallet encryption/decryption and passphrase usage. See here: https://github.com/theuni/bitcoin/commits/aes-keys. This is a WIP; it still needs loads more testing.

I’ll submit a PR to discuss that after this is merged, but in the meantime you may want https://github.com/theuni/bitcoin/commit/a22fcf2d2835d4f50e357ca0125d55b54a0ccb15 and https://github.com/theuni/bitcoin/commit/a4d109e33cc247430054fe1a81bdc668456338c8.

I’m not sure if my test failures were due to other local changes or not, but either way I don’t think it would hurt.

Side-node: I think that (along with the fortuna stuff here, @laanwj’s libevent stuff, and libsecp256k1) is the last of the openssl stuff that needed to be re-implemented.

mikehearn commented at 11:24 am on April 10, 2015: contributor

This sounds good, but I have to play devils advocate here. The problems with OS randomness on Android was caused by exactly this kind of logic: “we can do better than the kernel with a spiffy userspace RNG”. And then they screwed it up.

Now, I trust sipa and gmaxwell to not screw things up. I am much less sure about the faceless unknowable individuals who will come later in the years after sipa has moved on, or who will patch Bitcoin as it gets packaged into distributions, etc. I do not trust them.

I especially do not trust that crypto code that is Windows specific will be well tested and will never break, given the extreme bias towards Linux and Mac that exists in the Bitcoin developer community.

If there is no Bitcoin Core RNG, then people will not be tempted to “improve” it and break it in the process. If there is, they will.

Randomness theory says we cannot beat the kernel. The kernel has all the sources of entropy available to it that a userspace app does, plus more that we don’t have access to. Kernel RNGs are also rare and carefully monitored: I trust the Linux, Darwin and Windows kernel teams a lot more than I trust userspace RNG developers. Kernels are unlikely to ship in a broken state. So, reading from /dev/urandom or CryptoAPI should be all that is required. Why add more complexity on top?

I am especially unkeen on things like “let’s mix in the current time or the value of CPUID”, because mixing in uselessly low value pieces of entropy is how both the Debian and Android randomness flaws evaded detection for so long - the output looked random but only had ~16 bits of entropy in it from the PID counter. Everything else broke but the PID kept the illusion alive for longer than should have been possible.

So - why go further than /dev/urandom?

gmaxwell commented at 2:34 pm on April 10, 2015: contributor

The primary reason is because the kernel has repeatedly screwed it up too on various systems, both due to OS bugs (e.g. recent Netbsd and Freebsd examples) and because of things like running in virtual machines where the environment was just too deterministic. These are not hypothetical problems, but have been observed in practice and recently too.

Kernels are unlikely to ship in a broken state.

Kernels have shipped in a broken state, and they’ve been “broken in the field” by poor use, like use in a VPS that denied them adequate access to randomness; by hypervisor checkpoint and restore, and by other issue.

or who will patch Bitcoin as it gets packaged into distributions, etc. I do not trust them

I don’t trust distributors either, (though I do trust future contributions as much as ourselves, all bets are off if they’re malicious or even more foolish than we are). It is obvious that the utmost care is required in structuring things not just so its correct, but so that the result is easy to audit, testable, and hard to break; for sure. This includes the above comment about incorporating the system RNG in key generation in the final step in a transparent and hard to break way. If someone wants to twiddle the rest, then great. The potential for harm is bounded when whatever gets done is just getting xored with /dev/urandom.

(Though unlike OpenSSL we would not consider using undefined behavior … almost guaranteeing the result would get broken, if not by a helpful packager than by a future spec compliant compiler)

“Just use /dev/random” is also fraught with peril on its own, it requires a file descriptor, if one isn’t available you cannot get to the device. It’s also slow and if you use the blocking form it will reliably block on Linux systems which are often entropy starved (due to a campaign of removing low quality entropy sources in the kernel in the last that Linux hasn’t yet recovered from); the blocking is fine for our highly critical long term key generation, but other places where we might need non-determinism it isn’t really. If the non-blocking /dev/urandom is used, you are exposed to getting non-random bits even when the kernel knows better (e.g. near bootup time). There is no escaping peril, we can only seek to understand and control the risks.

the output looked random but

Sadly this problem also exists with the kernel itself. Our setup must be such that if the kernel fails in a detectable way, we fail. Regardless; all the rest is for the cases where the kernel has failed undetectably.

Randomness theory says we cannot beat the kernel

I’m not aware of any such theory. What I am aware of is that we have entirely different constraints; when the kernel RNGs have failed none of them have managed to still emit output which was machine specific, though its not hard to do so in userspace. In some abstract sense the kernel has access to more data, but operates under different constraints and so we know it doesn’t actually make use of that data (at least in Linux and the BSDs; the windows source is not generally available, so I don’t know what it does internally). As another example, we’re also able to use a huge amount of hardening; so even if the kernel has failed in an undetectable way and there is only a moderate amount of non-determinism from other sources there is a chance that the user will be secure regardless. Likewise, we can roll state forward between runs, so even if the system state is only intermittently random the user can be practically secure; something else that is not generally available in the kernel.

For some time it looked like Linux /dev/u?random was going to be replaced with rdrand, and I think our users wouldn’t have appreciated that. I think it’s fair to say that when we’re generating keys our demands as we’re creating long term keys for public use which thousands to millions of dollars can be immediately lost if they’re insecure, is unlike 99.999% of the kernel random use.

Likewise, if our ability to read OS entropy silently breaks on windows then those users are in trouble regardless of what else we’ve done. So we must assure we don’t, and that we detect and fail if we do if detection is even possible. This is true regardless of whatever else we’re doing.

We also know from the Bitcoin space (web wallet entropy failures) that users still lose (lots) money if the randomness returned is the same as other users. So, trying to make undetectable failures more user transparent is already known to not really work.

A further consideration is that OpenSSL already does this, and so it would be arguably unwise in the “what were you thinking??!?? sense to replace it something with more known total failure modes. The OpenSSL implementation is inherently more fragile than anything we’d consider using (including silent failure when it knows the OS rng is failing for it); OpenSSL’s “more secure fork” LibreSSL replaces the code but also includes a similar fallback. Both do this in OS specific code, so it’s not likely they’re defending against unknown operating systems.

mikehearn commented at 2:54 pm on April 10, 2015: contributor

Minor aside: in Linux there is a getrandom() syscall to avoid the problems with fd exhaustion, or /dev/urandom being a regular file. The non-blocking variant is fine of course, it’s just an iterated hash over the entropy pool: http://man7.org/linux/man-pages/man2/getrandom.2.html

I guess this boils down to which you consider to be more fragile: Bitcoin Core or the users kernel. Combining them so both have to fail simultaneously sounds good, but that was the theory behind all the other RNGs and that failed, so it seems harder to pull off than I would have expected.

Regardless, my playing of devil’s advocate has come to an end, so carry on :)

gmaxwell commented at 3:53 pm on April 10, 2015: contributor

thanks, I hope that when we’re up to having something worth merging you’ll scrutinize the heck out of the implementation for its future riskyness. :) (WRT blocking; the issue there is that the non-blocking will give non-entropy when the kernel isn’t seeded yet; though getrandom has a mode that will handle that)

laanwj commented at 7:57 pm on April 10, 2015: member

@mikehearn That’s the kind of argument I’ve given to not make this part of Bitcoin Core, but make it a library like secp256k1 (can be subtree’ed). If the code is shared with other software - e.g. wallets, or other crypto software that doesn’t want to depend on OpenSSL - that needs a secure Random Number Generator, more eyes will be on the source code and it will be tested more, and this initiative will also help other projects.

gmaxwell commented at 11:15 pm on April 10, 2015: contributor

(FWIW, I think sipa and I were onboard with that)

jgarzik commented at 11:22 pm on April 10, 2015: contributor

/me is onboard too.

FWIW, you probably want to non-block + fail if entropy is absent.

sipa commented at 8:32 am on April 11, 2015: member

All very reasonable comments here.

I agree with making this available separately (though the API needs some rethinking). There is definitely a fair amount of complexity here, which - assuming the OS randomness works fine - is totally irrelevant. The whole point of course is belt-and-suspenders: what if somehow the OS randomness is not fine. For that reason, I also agree the result should be trivially correct assuming OS randomness is fine (i.e., for queries that require strong randomness, we should do a call to the OS, and mix in our own randomness afterwards). And yes, if the OS randomness call fails, the entire request should fail. We shouldn’t knowingly reduce security - it’s only intended to protect against unknown failure.

jwilkins commented at 4:27 pm on April 17, 2015: none

You want to add extra assurance for your entropy source because constant testing for poor entropy is a horrible path to go down. You can never distinguish good randomness from properly secured bad randomness (eg attacker replaces your /dev/random with stream cipher output for which only they have the key). This is code that needs (and I am sure will get) plenty of attention and auditing.

As for justifications, there is a tendency for the OS to silently break things or fail in spectacular ways. RDRAND has been the sole source of randomness multiple times, contrary to popular opinion. http://comments.gmane.org/gmane.comp.security.cryptography.randombit/4689 Just xoring in more stuff isn’t the answer. djb points out attack scenarios with a malicious entropy source: http://blog.cr.yp.to/20140205-entropy.html

EC is brittle when it comes to poor randomness. While using deterministic modes is clearly the preferred answer, entropy for key generation is still needed and providing a sane interface to a decent library is a major improvement to the entire bitcoin ecosystem. http://blog.cryptographyengineering.com/2012/03/surviving-bad-rng.html

Also, heroku instance under syn scan (not syn flood, just a port scan): https://gist.github.com/jwilkins/5997306 https://gist.github.com/jwilkins/5997296

This is an excellent (though java focused) guide to RNG concepts, part 3’s discussion of degrees of freedom is particularly valuable for people in the bitcoin space: http://blog.uncommons.org/2008/04/03/a-java-programmers-guide-to-random-numbers-part-1-beyond-javautilrandom/ http://blog.uncommons.org/2008/04/06/a-java-programmers-guide-to-random-numbers-part-2-not-just-coins-and-dice/ http://blog.uncommons.org/2008/04/10/a-java-programmers-guide-to-random-numbers-part-3-seeding/

jwilkins commented at 4:52 pm on April 17, 2015: none

Forgot to point out http://en.wikipedia.org/wiki/Blum_Blum_Shub

gmaxwell commented at 5:53 pm on April 18, 2015: contributor

On the points raised by DJB, (FWIW, I’m familiar with that post; and it influenced our work. I’m also familiar with it in another way: on account of that post the “watch that basket” argument is the go-to trope in the IETF used to argue against any non-TLS crypto)

Fortuna itself designed to be robust against malicious inputs. If we assume the kernel is maliciously observing our process (e.g. to give us bad randomness that xor cancels the next fortuna output) we’re kind of all bets off there– simply because it could more directly undermine our behavior. We could combine them with a cryptographic hash instead, which would be more secure against malicious input, I’m just wary about having any avoidable complexity in that one part, because its the one part that requires the most careful review.

Back on DJB’s post, before anyone takes up that argument…

But if the attacker sees the RNG state that was used to generate your long-term SSL keys, long-term PGP keys, etc., then what exactly are we gaining by coming up with unpredictable random numbers in the future?

The security of future keys. Sometimes there aren’t any future keys, and indeed, additional inputs are not helpful, sometimes (as is the case for us) they are.

Imagine, someone manages to heartbleed out the state of your RNG. Thats bad, but its even worse if it forever predicts your output.

The argument he gives applies more broadly then some might think, by that argument you’d generate a random state once and store it on disk, and just use that state… but this is clearly crazy because it’s obvious that the data on disk has a different security exposure than data elsewhere. The same is otherwise true. We do already avoid using ’new randomness’ where it makes sense to avoid it because it can’t buy us anything– e.g. in signing.

Another place where forward secure randomness is helpful is against process replay. Consider the OpenSSL state cloning bug on fork() on android: Processes were launched by forking(), the new tasks got a copy of the rng state and replayed randomness. If they each independently had been adding additional (even low entropy, like the address of their stack) randomness then the problem would have been avoided or reduced (e.g. if it didn’t reliably add before the copy was used). The same can also play out with process and VM snapshotting: if you only have a deterministic state, someone might resume an old copy without realizing the consequences, then use it to do different things.

mikehearn commented at 2:48 pm on April 19, 2015: contributor

Processes were launched by forking(), the new tasks got a copy of the rng state and replayed randomness. If they each independently had been adding additional (even low entropy, like the address of their stack) randomness then the problem would have been avoided

I have to disagree with this. I think it’s the opposite. Android was mixing in additional low entropy (the pid counter) and this is why the bug made it into production. If wallets had been making the exact same sequence of random numbers across process restarts then that would have been noticed almost immediately, as it’d have shown up in the UI and other such things. By mixing in the pid they ensured that the outputs only cycled occasionally - long after the apps had been shipped to production and money was stolen.

The Android bug was especially catastrophic because the mixin of the persisted operating system entropy happened after the zygote process started, so all you had to do to compromise thousands of wallets was buy a phone, measure the initial randomness for every possible PID and you’re done. That’s why I insisted on a simultaneous crash rollout of the fix: if anyone had realised the true nature of the bug, the Bitcoin ledger could have been hopelessly corrupted by theft to an extent that it’d have set the project back by years.

This is why this kind of RNG work is so dangerous. The history of industrial randomness is a long story of programmers thinking that they would be better than all the other schlubs, that their RNG would be simple and elegant and give better performance or robustness. After all, how hard can a RNG possibly be? And then they discover they made some impossible to detect mistake, or someone else patched their code, and it all came unravelled.

jwilkins commented at 8:14 pm on June 4, 2015: none

The problem is people unfamiliar with the research thinking they know better. Not the case here, gmaxwell and sipa are very knowledgable and prudent. It is only with an abundance of caution that they are suggesting this and inviting extensive external review as we have all repeatedly observed the underlying layers fail.

jgarzik commented at 6:08 pm on September 15, 2015: contributor

Re-reviewing this, @laanwj probably has the right approach.

I think this is probably easier to roll out into bitcoin via a libsecp256k1-like separate library.

sipa commented at 6:09 pm on September 15, 2015: member

Agree.

jgarzik commented at 6:19 pm on September 15, 2015: contributor

OK, closing. Let’s re-open this same Pull Request (to preserve the history) when libspiffyrng git subtree appears.

jgarzik closed this on Sep 15, 2015

laanwj commented at 11:55 am on December 3, 2015: member

Slated for 0.13 (with the goal of getting rid of OpenSSL dependency by then), opened #7162 to track this

DrahtBot locked this on Sep 8, 2021

[WIP] Replace OpenSSL PRNG with built-in Fortuna implementation #5885