Mempool + Validation Code Organization

glozow commented at 9:23 am on July 11, 2022: member

From a commit message in #25487:

In the long run, I think we should probably invert the validation <-> mempool relationship. Instead of mempool not depending on validation, we should have validation not depend on mempool.

Background: We used to have a circular dependency between txmempool <-> validation. This was cut in #22677, so now we have validation -> txmempool.

Didn’t want to pollute the review comments there, so opening this to discuss removing mempool-related things from validation.{h,cpp} and what the ideal organization might look like.

glozow added the label Validation on Jul 11, 2022

glozow added the label Mempool on Jul 11, 2022

glozow commented at 9:23 am on July 11, 2022: member

My thoughts:

Defining “validation”:

If thinking about “validation” as management of chainstate, applying consensus rules to blocks and block transactions, etc., there is definitely no reason for validation to even know what a mempool is.
If thinking about “validation” as what is currently housed in validation.{h, cpp}, it’s really block + mempool validation. It houses MemPoolAccept, ChainstateManager has a reference to the mempool to call ATMP, etc. So it naturally has a dependency on txmempool.

Defining “txmempool”: Similarly, I think there’s a difference between what we think of conceptually vs where code is located. I think of these as tightly related but distinct mempool “modules”:

CTxMemPool + CTxMemPoolEntry data structures which store transactions but aren’t responsible for validation, policy, etc.
Policy logic and constants (e.g. RBF signals, standard script policy flags, default ancestor limits) and runtime settings (e.g. configured min relay feerate)
Mempool validation, i.e. MemPoolAccept
Fee estimator, which needs to know when transactions enter/leave the mempool
Miner BlockAssembler (i.e. the ancestor feerate algo which operates on the contents of the mempool, not the PoW part)
Others?

I’m not sure if this should all be in txmempool.{h,cpp} (my philosophy in #22677 was that it should just have CTxMemPool + CTxMemPoolEntry data structures, and thought it was weird to have removeForReorg() calling CheckSequenceLocks(), which is why I went with the direction of passing in a callable). In the future, perhaps we want to move towards having a MempoolManager responsible for mempool validation, policy settings, fee estimator, miner/block template building, etc. I imagine replacing ChainstateManager and PeerManager’s references to CTxMemPool with a reference to MempoolManager.

Spoke to @dongcarl about mempool vs validation previously (in the context of what it would look like to remove mempool from libbitcoinkernel), and a few ideas were discussed:

Have mempool be a client of CValidationInterface. This would require quite a few changes, including separating TransactionAddedToMempool() and TransactionRemovedFromMempool into a separate interface and iirc change the interface for transactions removed for block. I think #11775 does this somewhat?
As a less invasive step in the direction of making validation not require txmempool, pass callbacks to validation functions doing mempool things synchronously.

ryanofsky commented at 4:16 pm on July 11, 2022: contributor

I’m not sure it is possible to come up with a grand plan here, but doesn’t hurt to try. Validation code is a big monolith, and mempool code is somewhat monolithic too, so I think any time you can pull pieces of validation code or mempool code out of their respective monoliths into smaller modules which are coherent (block managers, chain states, fee estimators, utxo databases, etc), you are probably doing a good thing. When we see the opportunities to pull things out of src/validation.cpp and src/txmempool.cpp and move them into smaller files in src/kernel/[1], we should generally take them, and I think if we do that, the questions about what should depend on what will be simpler. It’s easier to figure out what dependencies should be between small focused modules than big blobby modules.

[1] Before introduction of the kernel library, it would have made sense to put new modules in src/node/ not src/kernel/, but now most things in src/node/ should actually move to src/kernel/.

darosior commented at 10:45 am on July 14, 2022: member

Regarding #11775, i’m not sure. After #25380, we can’t just make the fee estimator only a client of the validation interface: it needs an access to CTxMemPool. I also think it’s natural for the fee estimator to be connected to CTxMemPool, and we’ll need it if we want to give accurate estimates for packages.

glozow commented at 1:43 pm on July 15, 2022: member

I also think it’s natural for the fee estimator to be connected to CTxMemPool, and we’ll need it if we want to give accurate estimates for packages.

I agree, but my point is that the relationship should be fee estimator depending on txmempool, and txmempool not depending on fee estimator (currently we have circular). It could be a client of validation interface and have access to mempool.

ariard commented at 0:44 am on July 20, 2022: contributor

If thinking about "validation" as management of chainstate, applying consensus rules to blocks and block transactions, etc., there is definitely no reason for validation to even know what a mempool is.
If thinking about "validation" as what is currently housed in validation.{h, cpp}, it's really block + mempool validation. It houses MemPoolAccept, ChainstateManager has a reference to the mempool to call ATMP, etc. So it naturally has a dependency on txmempool.

I would lean for the first thinking, defining “validation” in term of a conceptual unit. However, I’m not sure it’s that’s simple as we might have a conceptual dependency between the two in the sense of g_scriptExecutionCache fulfilled by ConsensusScriptChecks() present in the mempool acceptance paths. In the eventuality we would like to disentangle “mempool” from “validation” code to introduce in between a src/interfaces/validation.h to enable process separation, g_scriptExecutionCache could be a bottleneck as the validation process might have to expose internal writeable state to the mempool process (and I guess you might have to batch the script cache writes otherwise the processes context-switch would likely swallow the perf opti of CuckooCache).

CTxMemPool + CTxMemPoolEntry data structures which store transactions but aren’t responsible for validation, policy, etc. Policy logic and constants (e.g. RBF signals, standard script policy flags, default ancestor limits) and runtime settings (e.g. configured min relay feerate) Mempool validation, i.e. MemPoolAccept Fee estimator, which needs to know when transactions enter/leave the mempool Miner BlockAssembler (i.e. the ancestor feerate algo which operates on the contents of the mempool, not the PoW part) Others?

Yes, I think that’s a good list of mempool “modules”. note, I believe what we define as policy logic isn’t that straightforward. E.g, we have AVG_FEEFILTER_BROADCAST_INTERVAL and MAX_FEEFILTER_CHANGE_DELAY in net_processing.cpp, do they belong to the transaction-relay policy scope or the mempool policy one as ultimately they influence your mempool content ? I don’t know if we have heuristics there, though I think it’s worthy to have clear view of the sets of policy constants to ease the understanding of implications on network behavior if we change any of them. There is also the fact that the wallet could have its own default view of the mempool, decoupled from the effective mempool settings, i.e to increase the propagation odds of your wallet-issued chain of transactions, I think the default run by the other nodes matter more and if your local settings are more permissive you should stick to Core default imo. For now, I think the wallet policy view is coupled to the default settings one (see relayMinFee() or checkChainLimits()).

I’m not sure if this should all be in txmempool.{h,cpp} (my philosophy in #22677 was that it should just have CTxMemPool + CTxMemPoolEntry data structures, and thought it was weird to have removeForReorg()> calling CheckSequenceLocks(), which is why I went with the direction of passing in a callable). In the future, perhaps we want to move towards having a MempoolManager responsible for mempool validation, policy settings, fee estimator, miner/block template building, etc. I imagine replacing ChainstateManager and PeerManager’s references to CTxMemPool with a reference to MempoolManager.

Yeah sounds a good direction to encapsulate slowly everything in MemPoolManager. I don’t have opinion on the path forward to remove the validation dependending on mempool current state of things.

ismaelsadeeq commented at 6:34 pm on September 14, 2024: member

If I understand correctly as is right now

Validation houses to all of validation logic(block, transaction, and mempool)
txmempool defines the data structure of mempool transactions managing removal and addition.

0graph TD
1    A[Validation] --> B[Block Validation]
2    A[Validation] --> C[Mempool Validation]
3    A[Validation] --> D[Chainstate Manager]
4    A[Validation] --> E[CTxMempool]
5    D --> E

I agree, but my point is that the relationship should be fee estimator depending on txmempool, and txmempool not depending on fee estimator (currently we have circular). It could be a client of validation interface and have access to mempool.

This was implemented in #28368 and creation of a standalone fee estimator Manager in progress in #30157, hence I removed the mempool dependency on fee estimator.

From the discussion it seems there is consensus on the following:

Refactoring the mempool validation logic in validation.cpp to move it into the mempool manager.
Refactoring validation from a large monolithic structure into a series of standalone modules. i) BlockManager
ii) Chainstate and ChainstateManager
iii) UTXO Databases

0graph TD
1    A[BlockManager] --> B[Block Validation]
2    C[ChainstateManager] --> D[Chainstate]
3    E[UtxoDatabases]
4    F[MemPoolManager] --> G[Mempool Validation]
5    F[MemPoolManager] --> H[CTxMemPool]
6    C[ChainstateManager] --> F[MemPoolManager]
7    I[Fee Estimator Manager] --> J[Policy Estimator]
8    I[Fee Estimator Manager] --> F[MemPoolManager]
9    I[Fee Estimator Manager] --> K[Other Estimator]

I also think it will also be better to not to add the mempool validation logic in to txmempool, moreever it should also be slimmed to only house mempool transactions with getters and setters API.

side note: There are some consensus related checks e.g in CScriptCheck, ValidationCache which I am not sure where they belong, but I think they should also be on their own.

glozow commented at 0:09 am on September 16, 2024: member

closing as (1) my main gripe with the fee estimator has been resolved, and (2) there isn’t 1 correct actionable plan here. There are some general ideas of separation that we should incorporate into e.g. cluster mempool stuff, but I don’t think there’s any point in keeping this issue open.

glozow closed this on Sep 16, 2024

Mempool + Validation Code Organization #25584