Unbounded growth of scheduler queue #14289

sipa opened this issue on September 21, 2018
  1. sipa commented at 9:18 pm on September 21, 2018: member

    In validation.cpp, both ConnectTip and DisconnectTip invoke a GetMainSignals()... callback, which schedules a future event through the CScheduler interface.

    These functions are called in 3 places:

    • ActivateBestChainStep
    • InvalidateBlock
    • RewindBlockIndex

    The first of these 3 prevents the scheduler queue from growing unboundedly, by limiting the size to 10 in ActivateBestChain. The other two however do not, and @Sjors discovered that doing a giant invalidateblock RPC call will in fact blow up the node’s memory usage (which turns out to be caused by the queue events holding all disconnected blocks in memory).
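    For illustration, a minimal sketch of the back-pressure idea (hypothetical names, not the actual validation interface): the producer blocks until the background thread has drained the queue below a limit before queuing more events.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical bounded callback queue: Push() is called from the block
// connect/disconnect path, DrainOne() from the background scheduler thread.
class BoundedCallbackQueue {
    std::mutex m_mutex;
    std::condition_variable m_drained;
    std::deque<std::function<void()>> m_callbacks;
    static constexpr std::size_t MAX_PENDING{10}; // the "limit the size to 10" above

public:
    void Push(std::function<void()> cb)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Must be called without cs_main held, otherwise this wait can
        // deadlock against callbacks that themselves need cs_main.
        m_drained.wait(lock, [this] { return m_callbacks.size() < MAX_PENDING; });
        m_callbacks.push_back(std::move(cb));
    }

    void DrainOne()
    {
        std::function<void()> cb;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_callbacks.empty()) return;
            cb = std::move(m_callbacks.front());
            m_callbacks.pop_front();
        }
        cb(); // run outside m_mutex so new events can be queued meanwhile
        m_drained.notify_all();
    }
};
```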

    I believe there are several improvements necessary.

    1. (short term symptom fix) If this issue also appears for RewindBlockIndex, we must fix it before 0.17, as it may prevent nodes on pre-0.13 code from upgrading. I think this is easy, as this function is called prior to normal node operation, so it can easily be reworked to release cs_main intermittently and drain the queue.

    2. (long term fix) The use of background queues is a means to increase parallelism, not a way to break lock dependencies (as @gmaxwell pointed out privately to me, whenever a background queue is used to break a dependency, attempts to limit its size can turn into a deadlock). One approach I think would help is a debug mode where the background scheduler immediately runs all scheduled actions synchronously, forcing all add-to-queue calls to happen without locks held (see the sketch after this list). Once everything works in such a mode, we can safely add actual queue size limits, as opposed to ad-hoc checks, to prevent the queue from growing too large.

    3. (orthogonal memory usage reduction) It’s unfortunate that our DisconnectBlock events hold a shared pointer to the block being disconnected. Ideally there would be a shared layer that only keeps the last few blocks in memory, and releases and reloads them from disk when needed otherwise. This wouldn’t fix the unbounded growth issue above, but it would reduce the impact. It’s also independently useful for reducing memory during very large reorgs (where we can’t run callbacks until the entire thing has completed).
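    A sketch of the debug mode proposed in (2), with made-up names (this is not the actual CScheduler interface): when the synchronous flag is set, every scheduled action runs immediately on the calling thread, so a caller that still holds a lock the callback needs is exposed at the call site instead of being hidden behind the queue.

```cpp
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical wrapper around the background scheduler queue.
class ValidationCallbackQueue {
    const bool m_synchronous_debug; // e.g. enabled by a debug/test build flag
    std::mutex m_mutex;
    std::queue<std::function<void()>> m_callbacks;

public:
    explicit ValidationCallbackQueue(bool synchronous_debug)
        : m_synchronous_debug{synchronous_debug} {}

    void AddToQueue(std::function<void()> cb)
    {
        if (m_synchronous_debug) {
            cb(); // debug mode: run in the caller's context, right now
            return;
        }
        std::lock_guard<std::mutex> lock(m_mutex);
        m_callbacks.push(std::move(cb));
    }

    // Normal mode: drained by the scheduler thread.
    void ProcessQueue()
    {
        for (;;) {
            std::function<void()> cb;
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (m_callbacks.empty()) return;
                cb = std::move(m_callbacks.front());
                m_callbacks.pop();
            }
            cb();
        }
    }
};
```

    Once everything passes with the synchronous mode enabled, a hard size limit on the normal-mode queue becomes safe to enforce.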

  2. sipa added the label Bug on Sep 21, 2018
  3. sipa added the label Resource usage on Sep 21, 2018
  4. MarcoFalke added this to the milestone 0.17.0 on Sep 21, 2018
  5. MarcoFalke commented at 10:56 pm on September 21, 2018: member
    See also “Unbounded reorg memory usage” #9027 about resource usage of blocks kept in memory and the wallet lagging behind too much.
  6. TheBlueMatt commented at 2:19 pm on September 22, 2018: member

    Are you sure (1) is a regression?

    The discussion in (2) isn’t entirely true - if you actually fully break lockdeps (which I think is a long-term goal here, especially across validation/wallet, as is being worked towards in #12934), that isn’t an issue. It’s only when you have circular deps and cs_main that it becomes a problem.

    (3) has been discussed a number of times and even had a TODO for it at some point, though I can’t find it anymore. Also appears to be a dup of #9027.

  7. Sjors commented at 5:30 pm on September 22, 2018: member
    Even if it’s not a regression, the deeper SegWit activation is buried, the more strain is put on RewindBlockIndex. Perhaps at some depth the node should not automatically upgrade, but prompt the user whether they want to do that, or if they prefer to delete and resync (probably faster), or even ignore SegWit until some later moment.
  8. gmaxwell commented at 9:00 pm on September 22, 2018: contributor

    It’s a regression. At least before, a walletless, txindexless node could happily invalidate down to 0 without causing extreme memory usage. It doesn’t now. The memory usage never goes down, even after returning back to the tip. 64 GB of usage is hit after invalidating only 40k blocks, too.

    The discussion in (2) isn’t entirely true - if you actually fully break lockdeps

    Sipa wrote “whenever a background queue is used to break a dependency”.

    The use of this to break locking dependencies is just wrong. It should not be used to break locking dependencies. Any uses of the queues for the exclusive purpose of breaking locking dependencies should probably be removed, because by definition every one of those cases results in “unbounded” memory usage. (Technically there may be a “bound”, but the resulting bound is impossible to reason about without considering the total behaviour of every lock touched by every piece of code that takes the lock(s) in question, e.g. the entire codebase.) Effectively, proving that the memory usage isn’t unbounded is equivalent to proving that multiuse contention of the lock without the unbounded queue couldn’t deadlock.
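    To illustrate why (made-up names, not actual Bitcoin Core code): if the producer enqueues while holding the lock and the bounded queue is full, while the consumer needs that same lock to run the callbacks that would free space, neither side can make progress.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

std::mutex g_state_lock;                         // stand-in for cs_main
std::mutex g_queue_mutex;
std::condition_variable g_queue_cv;
std::queue<std::function<void()>> g_callbacks;
constexpr std::size_t QUEUE_LIMIT{10};

// Producer: called with g_state_lock already held by the caller.
void QueueCallback(std::function<void()> cb)
{
    std::unique_lock<std::mutex> lock(g_queue_mutex);
    // Waits for space, but space is only freed by RunOneCallback() below,
    // which is itself blocked waiting for g_state_lock: deadlock.
    g_queue_cv.wait(lock, [] { return g_callbacks.size() < QUEUE_LIMIT; });
    g_callbacks.push(std::move(cb));
}

// Consumer (background thread): the callbacks it runs need g_state_lock.
void RunOneCallback()
{
    std::function<void()> cb;
    {
        std::lock_guard<std::mutex> lock(g_queue_mutex);
        if (g_callbacks.empty()) return;
        cb = std::move(g_callbacks.front());
        g_callbacks.pop();
    }
    std::lock_guard<std::mutex> state(g_state_lock); // blocks while the producer holds it
    cb();
    g_queue_cv.notify_one();
}
```

    Whether such a cycle exists depends on every code path that can hold g_state_lock while enqueuing, which is why the bound is so hard to reason about.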

    Also, as we’ve learned elsewhere the queue is actually pretty slow. So it may not be a suitable tool for improving performance through concurrency in many applications. It certainly shouldn’t be deployed as such without benchmarks that show that it actually improves performance.

  9. Sjors commented at 2:27 pm on September 24, 2018: member
    I’m syncing a v0.13.0 node now on AWS (i3.2xlarge) and plan to upgrade to v0.17.0rc4 once it’s near or at the tip, to see how memory behaves. (update: syncing it again because I turned it off without realizing the storage is ephemeral)
  10. Sjors commented at 9:37 pm on September 25, 2018: member

    After syncing v0.13 I upgraded to v0.17.0rc. It first upgraded the UTXO database as expected and then started rolling back blocks as expected.

    Memory usage blows up… I have a copy of the v0.13.0 database, so I will try with 0.16 later, and I can try fixes.

    Cache 1.5 GB, RAM 12.9 GB at height 533K: [screenshot]

    Cache 2.4 GB, RAM 20 GB at height 527K: [screenshot]

    There is no stopping this gracefully either: [screenshot]

  11. Sjors commented at 2:45 pm on September 26, 2018: member

    Using a backup of the fully synced v0.13.0 database, I tried v0.16.3. It also first upgraded the UTXO database as expected and then started rolling back blocks as expected.

    Cache 1.4 GB, RAM 11.2 GB at height 533K: [screenshot]

  12. Sjors commented at 6:41 pm on September 26, 2018: member
    v0.15.2 does not have this bug. It rolled back to 533K with only 3.2 GB of RAM usage (it loaded the full UTXO database during the database upgrade). It’s still pretty slow, so I don’t think automatic rollback is a great user experience.
  13. MarcoFalke commented at 7:21 pm on September 26, 2018: member

    Using a backup of the fully synced v0.13.0 database, I tried v0.16.3. It also first upgraded the UTXO database as expected and then started rolling back blocks as expected. Cache 1.4 GB, RAM 11.2 GB at height 533K:

    Is that with wallet and txindex disabled?

  14. Sjors commented at 7:40 pm on September 26, 2018: member
    No wallet, no indexes.
  15. MarcoFalke commented at 7:56 pm on September 26, 2018: member
    Which weakly indicates that this is not a regression in 0.17.0
  16. Sjors commented at 8:55 pm on September 26, 2018: member
    @marcofalke it’s a regression in 0.16, based on the above tests. See also IRC.
  17. MarcoFalke removed this from the milestone 0.17.0 on Sep 30, 2018
  18. MarcoFalke added this to the milestone 0.17.1 on Sep 30, 2018
  19. MarcoFalke commented at 1:09 am on September 30, 2018: member
    Moved to 0.17.1, as this is documented as a known issue in the release notes: https://github.com/bitcoin-core/bitcoin-devwiki/wiki/0.17.0-Release-notes/_history
  20. MarcoFalke removed this from the milestone 0.17.1 on Nov 28, 2018
  21. MarcoFalke added this to the milestone 0.18.0 on Nov 28, 2018
  22. laanwj commented at 12:00 pm on March 4, 2019: member
    It’s not realistic to solve this entirely before the 0.18 release, and also not required as this is not a regression for 0.18—so I’m moving the milestone. Any fixes in this regard should of course be backported to the 0.18 branch.
  23. laanwj removed this from the milestone 0.18.0 on Mar 4, 2019
  24. laanwj added this to the milestone 0.19.0 on Mar 4, 2019
  25. laanwj referenced this in commit 3db0cc3947 on Mar 7, 2019
  26. MarcoFalke commented at 6:10 pm on May 30, 2019: member
    Could someone update the current status of this issue? I think it is hard to follow, since some of it is already solved, but other parts weren’t yet addressed.
  27. Sjors commented at 2:35 pm on July 5, 2019: member
    The specific instance I found with invalidateblock, and rewinding the chain in general, is solved.
  28. laanwj commented at 11:21 am on September 26, 2019: member
    To what extent is this a blocker for 0.19? Should we push forward the milestone again?
  29. MarcoFalke commented at 11:53 am on September 26, 2019: member
    What exactly is left to be done here?
  30. laanwj removed this from the milestone 0.19.0 on Sep 26, 2019
  31. MarcoFalke removed the label Bug on Oct 11, 2019
  32. MarcoFalke added the label Brainstorming on Oct 11, 2019
  33. promag commented at 0:40 am on April 27, 2020: member
    Ping.
  34. UdjinM6 referenced this in commit 6f0646a262 on Oct 29, 2021
  35. pravblockc referenced this in commit fda1d429af on Nov 18, 2021
