I’m not sure I completely understand what you’re asking, but the topic is more complicated than just an assumption around whether the initial headers sync logic is robust:
- Ideally, we would only download the full headers chain from a single peer when we are starting up, because it saves bandwidth to do so.
- It’s possible that the peer we pick for initial headers sync could be (a) slow, (b) on a chain that is not the main chain, (c) adversarial, or some other terrible combination of those factors. So we cannot just have our logic rely on the initial peer to serve us the honest chain in a reasonable amount of time.
- We currently have two behaviors that help protect us from choosing a bad initial peer. The main protection is that when a block INV is received, we send a getheaders to every peer that announces the block, so we get the main chain with high probability. However, this wastes bandwidth if many peers send us an INV at the same time, which is probably the common case when our initial peer is slow.
- The second protection we have is that after about 20 minutes, we’ll evict our initial headers-sync peer if our tip’s timestamp isn’t within a day of the current time. This could kick in if we have a bad initial peer and no blocks are found for a while.
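The two protections described above can be sketched roughly as follows. This is only an illustration of the logic as stated in this comment, not the actual implementation: the `Peer` class, the function names, and the constants are hypothetical, and the 20-minute and one-day values are the approximate figures quoted above.

```python
from dataclasses import dataclass

# Approximate values taken from the description above (assumptions, not real constants).
HEADERS_SYNC_TIMEOUT = 20 * 60      # "about 20 minutes", in seconds
MAX_TIP_AGE = 24 * 60 * 60          # tip must be "within a day" of now, in seconds


@dataclass
class Peer:
    """Hypothetical stand-in for a connected peer."""
    peer_id: int
    getheaders_sent: int = 0


def on_block_inv(announcers):
    """First protection: ask every peer that announced the block for headers.

    Returns the number of getheaders requests sent, which is what makes this
    costly when many peers announce the same block at once.
    """
    for peer in announcers:
        peer.getheaders_sent += 1
    return len(announcers)


def should_evict_sync_peer(sync_started, tip_timestamp, now):
    """Second protection: evict the initial sync peer once the timeout has
    passed and our tip is still more than a day behind the current time."""
    timed_out = now - sync_started >= HEADERS_SYNC_TIMEOUT
    tip_is_stale = now - tip_timestamp > MAX_TIP_AGE
    return timed_out and tip_is_stale
```

With eight peers announcing the same block, `on_block_inv` issues eight getheaders requests, each of which may trigger a download of the same headers chain.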
I think we could do a variety of things to improve the current situation on master. Adding (say) one additional headers-sync peer on some kind of timer (maybe every 5 or 10 minutes) could make sense. I think that choosing a random peer among the set of peers announcing a block is probably better peer selection than choosing a random peer (or random outbound peer) on a timer, because a peer that sends an INV gives us more reason to believe it is responsive and will be helpful in getting us the chain. Some combination of both would probably be even better.
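As a rough sketch of the announcement-based selection idea, assuming peers are identified by simple ids: the function name `pick_extra_sync_peer` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def pick_extra_sync_peer(announcers, current_sync_peers, rng=random):
    """Pick one additional headers-sync peer at random from the peers that
    announced a block, skipping any peer we are already syncing headers from.

    Returns None when no suitable announcer exists.
    """
    candidates = [p for p in announcers if p not in current_sync_peers]
    if not candidates:
        return None
    return rng.choice(candidates)
```

Compared with sending a getheaders to every announcer, this issues at most one extra request per block announcement, while still favoring peers that have just shown they are responsive.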
However, the complexity I ran into when thinking about other strategies for initial sync has to do with the eviction logic. Right now, I think it’s mostly good that we evict our (single) initial headers-sync peer if it can’t get us to a recent chain tip within 20 minutes. Triggering that logic on all our peers at the same time seems over the top to me, though. There are edge-case scenarios (such as: no blocks have been found on the network for a day, or the honest chain is some kind of billion-block timewarp chain that takes more than 20 minutes to download) where such logic could behave badly for the network, because we could end up with no peers or fall out of consensus.
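To make the first edge case concrete, here is a small worked example under assumed numbers: no block has been found for 25 hours, so even honest peers can only give us a tip that is more than a day old. The helper `peers_to_evict` and the constants are hypothetical and only mirror the rule as described in this comment.

```python
SYNC_TIMEOUT = 20 * 60          # assumed: roughly 20 minutes, in seconds
MAX_TIP_AGE = 24 * 60 * 60      # assumed: one day, in seconds


def peers_to_evict(sync_peers, sync_started, tip_timestamp, now):
    """Return the sync peers that the timeout rule would evict if it were
    applied to every peer in sync_peers."""
    if now - sync_started < SYNC_TIMEOUT:
        return []
    if now - tip_timestamp <= MAX_TIP_AGE:
        return []
    return list(sync_peers)


now = 1_000_000
stale_tip = now - 25 * 3600     # the honest tip itself is 25 hours old
started = now - 21 * 60         # sync began 21 minutes ago
all_peers = ["a", "b", "c"]

# Rule applied to the single initial sync peer only: we lose one peer.
evicted_single = peers_to_evict(all_peers[:1], started, stale_tip, now)
remaining_single = [p for p in all_peers if p not in evicted_single]

# Rule applied to every peer at once: nobody is left to sync from.
evicted_all = peers_to_evict(all_peers, started, stale_tip, now)
remaining_all = [p for p in all_peers if p not in evicted_all]
```

In this scenario the single-peer rule leaves two peers connected, while the all-peers rule leaves none, even though every peer was honest.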
I think what I’m proposing in this patch is a narrow change that exactly addresses the bandwidth problem, and maximizes the chance we find a good peer quickly, without making our behavior in those edge-case scenarios any worse. Nevertheless, a bigger overhaul of this logic that carefully considers these things could certainly be an improvement and make this whole thing easier to think about.