RFC: Assumeutxo and large forks and reorgs #30288

ryanofsky opened this issue on June 14, 2024
  1. ryanofsky commented at 1:14 pm on June 14, 2024: contributor

    (tl;dr This is mainly a question about whether, when an assumeutxo snapshot is loaded, it makes more sense for the original chainstate to continue downloading and attaching blocks in the normal way, or whether it should only download and attach blocks leading up to the snapshot block.)


    It seems unclear how validation code should behave when an assumeutxo snapshot is loaded, and then new headers are received pointing to a forked chain with more proof of work than any known chain including the snapshot block. This question came up in #29519 (review).

    This is not an urgent question, because if this situation arose on the network there would be bigger issues to confront than assumeutxo behavior, but it’s worth raising because there seem to be different ways of dealing with it that impact the design of validation code, and maybe also have implications in other cases like eclipse attacks.

    Background

    When an assumeutxo snapshot is loaded, a new chainstate object is created. A chainstate is just a UTXO database and a pointer to the last block added to the database.

    So immediately after the assumeutxo snapshot is loaded there are two chainstates:

    • The original chainstate pointing at the most-work block which has been locally validated. (This block should normally be an ancestor of the snapshot block, but doesn’t have to be.)

    • A new snapshot chainstate pointing at the snapshot block, which has not been locally validated and probably not downloaded, but is assumed-valid.

    After this point, because of missing block and undo data before the snapshot, the snapshot chainstate is constrained to only sync to chains including the snapshot block. If headers for chains with more work not including the snapshot block are found, the snapshot chainstate needs to ignore them, because even if it downloaded blocks on those chains, it would lack the undo data needed to reorg and actually validate them.
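
    As a minimal sketch (hypothetical, simplified types, not the actual Bitcoin Core classes), the constraint amounts to only considering header chains whose tip has the snapshot block as an ancestor:

    ```cpp
    // Hypothetical, simplified block index for illustration only.
    struct BlockIndex {
        int height = 0;
        const BlockIndex* parent = nullptr;

        // Walk back to the ancestor at the given height (nullptr if none).
        const BlockIndex* Ancestor(int target_height) const {
            const BlockIndex* index = this;
            while (index && index->height > target_height) index = index->parent;
            return (index && index->height == target_height) ? index : nullptr;
        }
    };

    // True if the snapshot chainstate may sync toward candidate_tip: the
    // snapshot block must be an ancestor of the candidate tip, because undo
    // data below the snapshot was never downloaded, so a reorg past it could
    // not be validated.
    bool SnapshotChainstateCanTarget(const BlockIndex& candidate_tip,
                                     const BlockIndex& snapshot_block)
    {
        return candidate_tip.Ancestor(snapshot_block.height) == &snapshot_block;
    }
    ```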

    It is less clear whether the original chainstate should also ignore chains not including the snapshot block.

    Possible behavior: Original chainstate targets the most-work chain

    This is not what currently happens, but the simplest approach might be for the original chainstate to be unaffected by the snapshot chainstate, and to continue to download and attach the same blocks it otherwise would have if no snapshot were loaded. It would just do it more slowly due to a reduced cache size and lower priority for block requests compared to the snapshot chainstate.

    Current behavior: Original chainstate targets the snapshot block

    Currently, instead of the original chainstate being unaffected by the snapshot chainstate, it’s constrained to only download and attach blocks that are ancestors of the snapshot block, and to ignore other chains.
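
    To make the contrast concrete, here is a hedged sketch (the function and enum names are hypothetical; `CBlockIndex` is Bitcoin Core’s real block index type, forward-declared so the sketch stands alone) of how the background chainstate’s download target would differ under the two policies:

    ```cpp
    class CBlockIndex;  // Bitcoin Core's block index entry, forward-declared for the sketch.

    enum class BackgroundSyncPolicy {
        TargetMostWorkChain,  // possible behavior: ignore the snapshot entirely
        TargetSnapshotBlock,  // current behavior: only ancestors of the snapshot block
    };

    // Pick the block the original (background) chainstate tries to sync toward.
    const CBlockIndex* BackgroundSyncTarget(BackgroundSyncPolicy policy,
                                            const CBlockIndex* most_work_header,
                                            const CBlockIndex* snapshot_block)
    {
        if (policy == BackgroundSyncPolicy::TargetMostWorkChain) {
            // Behave exactly as if no snapshot were loaded.
            return most_work_header;
        }
        // Stop downloading and attaching once the snapshot block is reached.
        return snapshot_block;
    }
    ```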

    Tradeoffs

    Possible advantages of original chainstate targeting the most work chain:

    • Probably simpler implementation. If the original chainstate behaves the same way whether or not a snapshot is loaded, fewer special cases need to exist when a snapshot is loaded.
    • Would sync to the most-work chain faster if the most-work chain did not include the snapshot block and turned out to be valid (i.e. not a hard fork).
    • Maybe more philosophically neutral, because the node continues normal behavior of syncing to the most-work chain, instead of ignoring any chain not including the snapshot block.

    Possible advantages of original chainstate targeting the snapshot block:

    • This is what the code currently does, so it would not require further changes.
    • Maybe could provide resilience against hard forks that contain more work? If loading a snapshot makes a node temporarily ignore any chain not containing a snapshot block, maybe that is a useful feature if the chain with more work turns out to be invalid.
    • Could help in an eclipse attack? If headers or blocks after the snapshot block were withheld, this could temporarily stop the node from syncing to an undesirable fork excluding the snapshot block that seemed to have more work.
    • [Maybe other reasons? Personally I’m more inclined towards the first approach, and have a weak understanding of things like forks and eclipse attacks, so I’m struggling to think of advantages to this approach.]

    Questions

    As long as the most-work header chain includes the snapshot block, there should be no real differences in behavior between the two approaches described above. But if the most-work header chain doesn’t include the snapshot block, it raises questions about which approach might be preferable. It also raises questions about what other behaviors we should consider implementing if this state is detected, like warning the user, shutting down, changing sync behavior, or maybe adjusting the relative priorities of the two chainstates. The main question is whether we should be doing anything different than we are doing now.

  2. mzumsande commented at 5:10 pm on June 14, 2024: contributor

    This is not what currently happens, but the simplest approach might be for the original chainstate to be unaffected by the snapshot chainstate, and to continue to download and attach the same blocks it otherwise would have if no snapshot were loaded. It would just do it more slowly due to a reduced cache size and lower priority for block requests compared to the snapshot chainstate.

    I think that the concept of the Active Chainstate / Active Tip is important for this discussion. Currently, it is tied to the snapshot chainstate until the background sync has finished. If we keep that logic, having the background chainstate target the most-work block doesn’t really achieve anything - if that chain turns out to be valid, we would still never use it for anything meaningful because the snapshot chainstate will remain the active one. So in order for targeting the most-work chain to make any sense, we’d also need to introduce the possibility of switching the active chainstate to it without the requirement that the background sync has to finish. Which would lead to other questions: If we don’t prioritize syncing the background chainstate towards it anymore, should we keep the now-unused snapshot chainstate around indefinitely and just ignore that it exists? Should we be able to switch the active chainstate back to the snapshot chain in case we receive more blocks building on top of it, so that it becomes the most-work chain again?

    Considering that AssumeUtxo sync is meant to be an optional and temporary optimization, and that large reorgs should be very infrequent, it could also make sense to abandon the AssumeUtxo sync, delete the snapshot chainstate and revert to normal sync as soon as we accept a header on a different chain that has more work than the best header of the snapshot chain.

    I think that pragmatically we shouldn’t accept a snapshot in the first place if it’s not an ancestor of the most-work header m_best_header (I plan on opening a PR for that soon), but that doesn’t solve the problem completely because we might only learn about another chain after having loaded the snapshot successfully.
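
    A hedged sketch of that check (the helper name is hypothetical and the actual change would live in the snapshot loading path; `CBlockIndex::GetAncestor()` and `nHeight` are the real block index members, and the include assumes Bitcoin Core’s source tree):

    ```cpp
    #include <chain.h>  // CBlockIndex (Bitcoin Core)

    // Reject a snapshot up front unless its base block is an ancestor of the
    // most-work header we know about (m_best_header).
    bool SnapshotBaseOnBestHeaderChain(const CBlockIndex* snapshot_base,
                                       const CBlockIndex* best_header)
    {
        if (!snapshot_base || !best_header) return false;
        return best_header->GetAncestor(snapshot_base->nHeight) == snapshot_base;
    }
    ```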

  3. ryanofsky commented at 8:23 pm on June 14, 2024: contributor

    I think that the concept of the Active Chainstate / Active Tip is important for this discussion

    Agree it’s worth mentioning. I think decisions about how chainstates are synced are mostly separate from decisions about how they are used and prioritized. But for completeness, I was assuming that if the original chainstate ever had more work than the snapshot chainstate, the snapshot chainstate would be unused and could be deleted, and the original chainstate would become the “active” chainstate again. Also, cache and download priority would be shifted to whichever chainstate had headers showing the most work.

    (Somewhat related to this: I don’t think the concept of an “active” chainstate is useful, and in #30214 I eliminate many uses of that designation. Right now when a snapshot is loaded, indexes treat the original chainstate as active, while wallets treat the snapshot chainstate as active. RPCs mostly treat the snapshot chainstate as active, but sometimes show information about both chainstates. I think it’s better to refer to chainstates as current vs historical, or validated vs. assumed-valid instead of referring more nebulously to an “active” chainstate.)

    it could also make sense to abandon the AssumeUtxo sync, delete the snapshot chainstate and revert to normal sync as soon as we accept a header on a different chain that has more work than the best header of the snapshot chain.

    That would be a third approach. Keeping the “Original chainstate targets the snapshot block” logic, but then abandoning the snapshot and switching back to “Original chainstate targets the most-work chain” logic when some condition is detected. I’m not sure this approach has advantages over always targeting the most-work chain, but it could, depending on the implementation details.

    I’m also not sure just the existence of headers with the most work not including the snapshot block is a good enough reason to refuse loading the snapshot, or to delete the snapshot chainstate after a snapshot is loaded. It could be weird if the other chain that seemed to have more work turned out to be invalid, or if more headers were received later that actually included the snapshot block, allowing the snapshot to be loaded again after it had previously been discarded or refused.

    In general, just letting the original chainstate sync to the most-work chain, regardless of whether a snapshot chainstate is loaded, seems like the simplest approach with the fewest special cases, and doesn’t seem to have significant drawbacks?

  4. sdaftuar commented at 1:07 pm on June 15, 2024: member

    Good discussion question, thanks for raising!

    Considering that AssumeUtxo sync is meant to be an optional and temporary optimization, and that large reorgs should be very infrequent, it could also make sense to abandon the AssumeUtxo sync, delete the snapshot chainstate and revert to normal sync as soon as we accept a header on a different chain that has more work than the best header of the snapshot chain.

    I think something like this makes more sense than to change how the background sync works. In my view, the purpose of the assumeutxo optimization is to offer users a different trust model where they can still reasonably safely get online and start using the network prior to the background sync finishing. In the event that there is a competing tip with more work than a tip built on the snapshot, it’s impossible for the software to determine which tip is the right one without doing a lot of work, and I think our choices are:

    1. Proceed under the assumption that the assumeutxo snapshot is correct, and effectively “checkpoint” the assumeutxo block hash (so that the snapshot chainstate doesn’t try to reorg to the most work chain), allowing the user to continue using the network in the meantime.
    2. Abandon the assumeutxo optimization in this scenario and fall back to current sync behavior.

    Option 1 is of course incredibly risky to a user if that assumption is wrong (ie acting as though that is the most valid work chain can result in funds loss). Moreover, the scenario that we’d be optimizing for is a highly unusual one, as we’d be optimizing for faster startup time for assumeutxo users in the event that there’s a more-work but consensus invalid fork of the chain. In that unusual scenario, we could always write new code to optimize for that case – or do simpler things like invalidating the block hash of the first consensus-invalid block header on that chain to avoid processing it (we could hard code this in future software versions, and perhaps instruct users on how to use invalidateblock as a temporary workaround).

    So I think option 2 makes more sense for now: let’s change our code so that if we ever detect the potential for an assumeutxo snapshot to not be on the most-work chain, just abandon the optimization and proceed without it. Exactly how we handle that in our software still requires some thought: if a user has already started transacting based on an assumeutxo snapshot, and then we learn about some more-work headers chain that forks before the snapshot height, what exactly do we do? Perhaps we should shut down and throw the biggest warning/error message we can, requiring a restart that will discard the snapshot chainstate? Or we could just treat it as a reorg down to a less-work tip (ie from the snapshot’s tip back to the fully-validated tip), which, while not normally allowed, is already possible in our code if invalidateblock is used, so maybe that works ok too?
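
    A hedged sketch of the trigger for option 2 (the helper is hypothetical, not the actual implementation; `nChainWork`, `nHeight` and `GetAncestor()` are the real `CBlockIndex` members, and the include assumes Bitcoin Core’s source tree):

    ```cpp
    #include <chain.h>  // CBlockIndex (Bitcoin Core)

    // True if the assumeutxo optimization should be abandoned: the newly
    // accepted header has more cumulative work than anything on the snapshot
    // chain and forks off below the snapshot block.
    bool ShouldAbandonSnapshot(const CBlockIndex* new_header,
                               const CBlockIndex* snapshot_base,
                               const CBlockIndex* snapshot_best_header)
    {
        if (!new_header || !snapshot_base || !snapshot_best_header) return false;
        const bool more_work =
            new_header->nChainWork > snapshot_best_header->nChainWork;
        const bool descends_from_snapshot =
            new_header->GetAncestor(snapshot_base->nHeight) == snapshot_base;
        return more_work && !descends_from_snapshot;
    }
    ```

    How the node reacts when this fires (a loud warning and shutdown, or a controlled reorg back to the fully validated tip) is exactly the open question above.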

  5. ryanofsky commented at 1:10 pm on June 16, 2024: contributor

    If I can summarize and clarify, neither of you think the current behavior of locking in the snapshot block, and temporarily refusing to consider chains that don’t include it, is a good idea? The list of reasons above trying to justify the current behavior is basically B.S.? (The “Possible advantages of original chainstate targeting the snapshot block” section about network hard forks and eclipse attacks)

    Instead, the behavior you both seem to prefer is just: when a snapshot is loaded, as soon as a new header is received showing that the chain with the most work does not include the snapshot block, we should immediately stop using the snapshot chainstate, and maybe delete it, and maybe warn the user and shut down?

    This is a little different, but not very different, than the approach I was envisioning where if the snapshot block was not on the most-work header chain, instead of abandoning the snapshot chainstate, we would continue to let it sync in the background with lower priority, in case the other chain with more work turned out to be invalid. Later, if the chain with more work turned out to be valid, the snapshot chainstate would be deleted. Specifically this would happen when the tip of the snapshot chainstate had less work than the tip of the original chainstate. This way, wallets would always use the chain that had the most work AND was valid; they wouldn’t be affected by headers for another chain that might not be valid (other than by syncing more slowly).
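
    A hedged sketch of that discard condition (the helper is hypothetical; `nChainWork` is the real `CBlockIndex` member holding cumulative chain work, and the include assumes Bitcoin Core’s source tree):

    ```cpp
    #include <chain.h>  // CBlockIndex (Bitcoin Core)

    // True once the fully validated original chainstate has overtaken the
    // snapshot chainstate, at which point the snapshot chainstate is unused
    // and can be deleted.
    bool SnapshotChainstateObsolete(const CBlockIndex* original_tip,
                                    const CBlockIndex* snapshot_tip)
    {
        if (!original_tip || !snapshot_tip) return false;
        return original_tip->nChainWork > snapshot_tip->nChainWork;
    }
    ```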

    In any case, it sounds like we want to abandon the idea of locking in the snapshot block, and ignoring chains with more work that don’t include it. If so, it sounds like special case code targeting the snapshot block could be removed either way.

  6. sdaftuar commented at 6:08 pm on June 16, 2024: member

    If I can summarize and clarify, neither of you think the current behavior of locking in the snapshot block, and temporarily refusing to consider chains that don’t include it, is a good idea?

    I’m not sure I’m following this point exactly: my recollection is that the current observable behavior in this scenario would be to crash, because even though the original/fully validated chainstate locks in ancestors of the snapshot block to be possible tips, we’d still try to reorg the snapshot chainstate to that more-work chain, and before that more-work chain could be validated we would crash when trying to disconnect a block for which we don’t have undo data.

    I actually think that apart from the terrible UX, this is essentially the right behavior, in that we’re trying to maximally protect users from thinking everything is fine in the event that our assumptions around how this feature is meant to be used are violated (and we haven’t fully validated things for ourselves to know what’s going on).

    Another way of putting it: the intent of assumeutxo is to optimize start-up time for a new node, not introduce a checkpointing system for which chains are considered valid or affect consensus in an observable way. We could debate the idea of adding more checkpoints to our software (whether shipped with our code or configurable by users with a command line option), but I think we should separate these two concerns, because assumeutxo ought to be generally useful without introducing new checkpoints.

    Also, the logic where we limit the tips that we consider for the original chainstate to only be ones that are ancestors of the snapshot block (when a snapshot chainstate is present) is just designed to simplify the consensus logic so that it’s clear what is supposed to be happening. I also think that this is just an implementation detail and not really something that should be externally observable, and that the code could be designed differently without affecting the overall behavior, so if others feel differently about how this logic is implemented then we could change it so that it’s more easily understandable/robust/etc.

    Possible advantages of original chainstate targeting the snapshot block: … Maybe could provide resilience against hard forks that contain more work? If loading a snapshot makes a node temporarily ignore any chain not containing a snapshot block, maybe that is a useful feature if the chain with more work turns out to be invalid.

    I think if this were to happen, then it’d be legitimate to want the assumeutxo feature to still work to help users who opt-in be able to sync a new node more quickly and start using the network. However, I think the existence of a more-work hard fork that is several months old is something that (a) would be well known to us, and (b) would require us to solve many other problems as well in order to support our users generally (regardless of the assumeutxo feature). For instance, we’d want to try to partition the p2p network so that we prevent our new users from being immediately eclipsed by nodes on an invalid chain (which by the way would preclude even learning the block headers needed for assumeutxo to work in the first place), and we’d want to prevent our users from trying to download all the blocks leading up to the alternate chain’s tip (which might require a lot of block download in order to have all the blocks needed to attempt the big reorg, which would be a huge bandwidth waste). The simplest way to do that might just be to explicitly hardcode the block hash of the alternate chain which forks from ours as invalid, which would immediately address both of these additional problems in addition to making assumeutxo work in this scenario.

    Of course we could consider other options as well, such as user-configurable checkpoints, or an option to explicitly checkpoint the assumeutxo blockhash – but again I think it’s helpful to separate the idea of changes to which tips we consider as valid from the startup optimization that I think assumeutxo is really designed to be.

    Could help in an eclipse attack? If headers or blocks after the snapshot block were withheld, this could temporarily stop the node from syncing to an undesirable fork excluding the snapshot block that seemed to have more work.

    I think that eclipse attacks are mostly orthogonal to assumeutxo: if an attacker has eclipsed you, then they can bypass whatever protection assumeutxo might give in this regard by simply presenting a chain that forks after the assumeutxo blockhash. (Moreover I think that shutting down is a safer option for users in this kind of scenario than pretending everything is ok.)

    Instead, the behavior you both seem to prefer is just: when a snapshot is loaded, as soon as a new header is received showing that the chain with the most work does not include the snapshot block, we should immediately stop using the snapshot chainstate, and maybe delete it, and maybe warn the user and shut down?

    Yes this is what I’m thinking (coupled with @mzumsande’s suggestion to just fail to load a snapshot if it’s not an ancestor of the most work header we have).

    In any case, it sounds like we want to abandon the idea of locking in the snapshot block, and ignoring chains with more work that don’t include it. If so, it sounds like special case code targeting the snapshot block could be removed either way.

    I’m actually not sure I would agree with this (assuming I’m understanding your suggestion correctly), but I may be biased as I think I authored the code you’re probably talking about, after finding the more general logic that someone else had written to be confusing! That said I’m open to alternate ways of implementing the desired behavior and if there’s a simpler version that is easier to understand/more robust/etc then that’s fine with me too.

  7. ryanofsky commented at 2:33 pm on June 17, 2024: contributor

    If I can summarize and clarify, neither of you think the current behavior of locking in the snapshot block, and temporarily refusing to consider chains that don’t include it, is a good idea?

    I’m not sure I’m following this point exactly: my recollection is that the current observable behavior in this scenario would be to crash [and …] apart from the terrible UX, this is essentially the right behavior

    Whether a node ignores chains that have the most work and don’t include the snapshot block (behavior with #29519), or whether the node crashes while trying to reorg to those chains (behavior without #29519), I’d call it “refusing to consider” those chains. So I think my summary was wrong, or no longer reflects your current opinion, and it is good to have that clarification.

    Thanks for going into more depth about how you think about long term forks and eclipse attacks. That all makes sense, and is educational, and addresses the possible concerns I was trying to raise with not refusing chains that don’t include the snapshot block.

    If so, it sounds like special case code targeting the snapshot block could be removed either way.

    I’m actually not sure I would agree with this (assuming I’m understanding your suggestion correctly)

    I think I have a pretty clear idea of how this code could be simplified by just eliminating unnecessary special cases, that would not raise concerns. I’m not trying to bring back the “more general” code you eliminated in #27746 that was just broken.

    Before this discussion, I was thinking the node should not refuse to deal with valid snapshots just because headers existed pointing somewhere else that seemed to have more work. I thought it would make more sense to keep the snapshot chainstate until the somewhere-else chain actually turned out to be valid, and only discard the snapshot chainstate at that point. Before that point, I thought it would be better to let wallets use the snapshot chainstate and let it continue to sync at lower priority in case the blocks on the other chain turned out to be invalid or unavailable. I still don’t think this is a bad approach, but I can also see the benefits of throwing up some roadblocks, and don’t think having to mark another chain invalid in order to use a snapshot on a chain that seems like it has less work is too big of a roadblock.

  8. mzumsande commented at 10:50 pm on June 17, 2024: contributor

    The problem I see with allowing the background sync to target the most-work chain is that it doesn’t seem to fit well into the design philosophy that has been merged so far. The background sync is treated as a low-priority task that is mostly done as an extra precaution after having synced the snapshot to the tip. At some point in the future (I know this would be contentious) we might even want to give users the opt-in option to skip it completely to save bandwidth. But as it is, the background sync is given a lower priority in p2p, doesn’t log its progress much, etc. - if it would now gain the ability of leading us to the actual best chain, much of this would have to change.

    Of course all of these things could be changed, and maybe I’m overestimating the amount of work necessary, but I have doubts that the benefits would justify the efforts, when just disallowing loading the snapshot is a trivial alternative that would prevent the issue in most situations, and discarding the snapshot chainstate at a later time doesn’t seem too complicated either - even if the user experience might be a little worse that way, that doesn’t seem too bad given how rare this situation should be.

    It could be weird if the other chain that seemed to have more work turned out to be invalid, or if more headers were received later that actually included the snapshot block, allowing the snapshot to be loaded again after it had previously been discarded or refused.

    I don’t think of this as weird. In a situation where we know about a chain with more work but unknown validity, the logical thing is to explore that chain first. Loading a snapshot for another chain and prioritizing downloading blocks for that chain would be detrimental towards this goal, so forbidding it makes sense at this point in time. So we prioritize the most-work chain, and once we find out that this chain is invalid, the situation has changed: It now makes sense to re-allow loading the snapshot, because this chain is now the most-work one not known to be invalid.

  9. fjahr commented at 12:33 pm on June 20, 2024: contributor

    I first tried to approach this question from a user’s point of view and ask which action would make more sense for them. But I think there are many possible scenarios, and depending on the mindset of the user one or the other action may make sense to them.

    Considering that AssumeUtxo sync is meant to be an optional and temporary optimization, and that large reorgs should be very infrequent, it could also make sense to abandon the AssumeUtxo sync, delete the snapshot chainstate and revert to normal sync as soon as we accept a header on a different chain that has more work than the best header of the snapshot chain.

    That is basically my view at the moment. If the scenario we discuss here happens the one thing we can say for certain is that something weird is going on. The safest option should be to fall back to normal sync and (without checking) this should be among the easier-to-implement solutions. We don’t have to break our brains about more special cases and we should be able to trust that the normal sync deals with the new situation in a safe way and the users are prevented from doing something unsafe just like they normally can’t do much until their node is synced.

    I think that pragmatically we shouldn’t accept a snapshot in the first place if it’s not an ancestor of the most-work header m_best_header (I plan on opening a PR for that soon), but that doesn’t solve the problem completely because we might only learn about another chain after having loaded the snapshot successfully.

    I actually thought we did this already some time ago and got confused about that, see #29996 (review). Is this what you had in mind as well @mzumsande?

  10. mzumsande commented at 4:08 pm on June 21, 2024: contributor

    I actually thought we did this already some time ago and got confused about that, see #29996 (comment). Is this what you had in mind as well @mzumsande?

    I opened #30320 with my suggestion - I think we have to use m_best_header for this and can’t use the active chain which only has blocks that were connected (and therefore fully available at that point).

