refactor: Guard TxRequestTracker by its own lock instead of cs_main #26151

pull dergoegge wants to merge 3 commits into bitcoin:master from dergoegge:2022-09-txrequest-cs_main-split changing 3 files +72 −41
  1. dergoegge commented at 8:15 pm on September 21, 2022: member

    I don’t see the need to have the TxRequestTracker guarded by cs_main which would also be more in line with our developer docs.

    From developer-notes.md:

    0Re-architecting the core code so there are better-defined interfaces between
    1the various components is a goal, with any necessary locking done by the
    2components (e.g. see the self-contained FillableSigningProvider class and its
    3cs_KeyStore lock for example).
    

    This PR gives TxRequestTracker its own mutex, thereby removing the need to guard PeerManagerImpl::m_txrequest using cs_main.

  2. fanquake added the label P2P on Sep 21, 2022
  3. fanquake requested review from ajtowns on Sep 21, 2022
  4. fanquake requested review from vasild on Sep 21, 2022
  5. DrahtBot commented at 11:31 pm on September 21, 2022: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #26551 (net_processing: Track orphans by who provided them by ajtowns)
    • #26295 (Replace global g_cs_orphans lock with local by ajtowns)
    • #25880 (p2p: Make stalling timeout adaptive during IBD by mzumsande)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  6. jnewbery commented at 8:12 am on September 22, 2022: contributor
    Concept ACK. Originally suggested in #19988 (review)
  7. in src/txrequest.cpp:543 in 7f9a893687 outdated
    537@@ -534,8 +538,10 @@ class TxRequestTracker::Impl {
    538     Impl(const Impl&) = delete;
    539     Impl& operator=(const Impl&) = delete;
    540 
    541-    void DisconnectedPeer(NodeId peer)
    542+    void DisconnectedPeer(NodeId peer) EXCLUSIVE_LOCKS_REQUIRED(!m_txrequest_mutex)
    543     {
    544+		LOCK(m_txrequest_mutex);
    


    fanquake commented at 8:27 am on September 22, 2022:
    linter is sad about tabs being used here
  8. fanquake commented at 8:28 am on September 22, 2022: member
    concept ack - restarted the *SAN job as that was an unrelated apt failure.
  9. in src/net_processing.cpp:1399 in 7f9a893687 outdated
    1398@@ -1399,6 +1399,7 @@ void PeerManagerImpl::PushNodeVersion(CNode& pnode, const Peer& peer)
    1399 void PeerManagerImpl::AddTxAnnouncement(const CNode& node, const GenTxid& gtxid, std::chrono::microseconds current_time)
    1400 {
    1401     AssertLockHeld(::cs_main); // For m_txrequest
    1402+
    


    ajtowns commented at 9:01 am on September 22, 2022:
    Comment is no longer correct; it’s now needed for CNodeState* state = State(nodeid) but not m_txrequest, I think.

    dergoegge commented at 3:32 pm on September 22, 2022:
    Removed the comment
  10. in src/net_processing.cpp:1823 in 7f9a893687 outdated
    1817@@ -1817,7 +1818,6 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
    1818         }
    1819     }
    1820     {
    1821-        LOCK(cs_main);
    


    ajtowns commented at 9:04 am on September 22, 2022:

    Enclosing { .. } are no longer needed.

    This change gives decreases the length of time cs_main is held.


    dergoegge commented at 3:32 pm on September 22, 2022:
    Done.
  11. in src/txrequest.cpp:312 in 7f9a893687 outdated
    306@@ -307,6 +307,8 @@ GenTxid ToGenTxid(const Announcement& ann)
    307 
    308 /** Actual implementation for TxRequestTracker's data structure. */
    309 class TxRequestTracker::Impl {
    310+    mutable Mutex m_txrequest_mutex;
    


    ajtowns commented at 9:08 am on September 22, 2022:
    Not this PR’s fault, but { should be on a new line, and maybe add private: to make it clear these are private members?

    dergoegge commented at 3:33 pm on September 22, 2022:
    Done.
  12. in src/txrequest.cpp:314 in 7f9a893687 outdated
    306@@ -307,6 +307,8 @@ GenTxid ToGenTxid(const Announcement& ann)
    307 
    308 /** Actual implementation for TxRequestTracker's data structure. */
    309 class TxRequestTracker::Impl {
    310+    mutable Mutex m_txrequest_mutex;
    311+
    312     //! The current sequence number. Increases for every announcement. This is used to sort txhashes returned by
    313     //! GetRequestable in announcement order.
    314     SequenceNumber m_current_sequence{0};
    


    ajtowns commented at 9:09 am on September 22, 2022:
    If you’re adding a Mutex, I think all these members should also be GUARDED_BY that mutex.
  13. dergoegge force-pushed on Sep 22, 2022
  14. ajtowns commented at 10:59 am on September 22, 2022: contributor

    If there’s to be a separate lock, I’m not convinced it makes sense to push the lock all the way into the Impl class – having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

    I think the arguments from the original PR still apply though – this doesn’t meaningfully allow any more parallelism / reduce blocking as far as I can see.

  15. dergoegge commented at 3:01 pm on September 22, 2022: member

    Thanks @jnewbery for linking the previous discussion! I was looking through the old PRs yesterday and didn’t come across that comment.

    If there’s to be a separate lock, I’m not convinced it makes sense to push the lock all the way into the Impl class – having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration. @ajtowns I’m not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

    • Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. AddrManImpl::cs)
    • Should our developer docs be updated? (I think our lock conventions are a bit all over the place, similar to general coding style, so maybe it helps to clarify our preferences?)

    I think the arguments from the original PR still apply though – this doesn’t meaningfully allow any more parallelism / reduce blocking as far as I can see.

    I have been looking at reducing cs_main usage within net processing because I don’t think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

  16. dergoegge force-pushed on Sep 22, 2022
  17. ajtowns commented at 4:34 am on September 23, 2022: contributor

    @ajtowns I’m not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

    * Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. `AddrManImpl::cs`)
    

    In general, I think it’s better to leave things alone if we’re not making things substantially more reliable/simple/efficient.

    For locking, I think the best approach on all of those axes is “having a reference to an object means you can manipulate it; if you can’t manipulate it, you don’t have a reference to in the first place”; but that’s often hard to achieve.

    Maybe compare with what @vasild’s proposing in #25390 – with that you’d just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock’s held. Meanwhile, if you’re not doing threading (like in the unit/fuzz tests) you just allocate a TxRequestTracker directly, and don’t worry about locks at all. (Unfortunately, I don’t think the implementation there quite works right/clang isn’t clever enough to properly understand it, and I haven’t been able to come up with a better one. Err, except, maybe…)

    I have been looking at reducing cs_main usage within net processing because I don’t think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

    Yeah… Kind of feel like it’s probably better spending coding/review time on the hard steps though? For cs_main, having dedicated, non-global, locks for blockstorage, block indexes, and coin states, so that you don’t need cs_main there, seems like the priority. For net, getting rid of CNodeState entirely and passing around Peer& objects seems worthwhile, and I think maybe adding an opaque PeerRef to CNode would allow avoiding the extra map and ensure the Peer object doesn’t get deallocated before the CNode object does… We could also perhaps have more things under “only accessed by the message processing thread” by having that thread make read-only copies of the data available to other threads in advance, which would then reduce contention…

  18. hebasto commented at 10:27 am on September 23, 2022: member
    Concept ACK.
  19. hebasto commented at 10:43 am on September 23, 2022: member

    Maybe compare with what @vasild’s proposing in #25390 – with that you’d just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock’s held.

    It makes me think that such an issue should be resolved at TxRequestTracker class’s API level. An object which self maintains its state in multi-threading environment is preferable (less error prone, easy to reason about etc).

  20. in src/txrequest.cpp:312 in 2ac77c4b5e outdated
    305@@ -306,23 +306,28 @@ GenTxid ToGenTxid(const Announcement& ann)
    306 }  // namespace
    307 
    308 /** Actual implementation for TxRequestTracker's data structure. */
    309-class TxRequestTracker::Impl {
    310+class TxRequestTracker::Impl
    311+{
    312+private:
    313+    mutable Mutex m_txrequest_mutex;
    


    vasild commented at 12:49 pm on September 28, 2022:
    nit: “txrequest” is already in the name of the class and looks redundant to have it also in the name of the variable. I think m_mutex is ok too.

    dergoegge commented at 4:48 pm on October 18, 2022:

    Renamed.

    I have seen nits on other PRs suggesting the exact reverse, which is why i named it this way in the first place, hoping to avoid the nit 😅 But i actually kind of agree that its redundant, so i shouldn’t have preemptively tried to avoid the nit.

  21. vasild approved
  22. vasild commented at 1:24 pm on September 28, 2022: contributor

    ACK 2ac77c4b5ef095846924be3c91e02967e8177abf

    This PR adds an internal lock and acquires it inside the methods that touch the relevant variables. This would allow finer grained control - e.g. to lock the mutex only for some part of a method, to improve concurrency. In this PR, however, we don’t do that - we lock the mutex for the entire duration of the methods, which is the same as having an external mutex outside, like @ajtowns mentioned above:

    … or having m_txrequest GUARDED_BY(m_txrequest_mutex) … allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

    Which way is better would depend on how this is used by multiple threads. Because it is not, both approaches look equally good (or equally bad) now.

    Thanks, @ajtowns, for mentioning #25390. This PR can be achieved by just:

    0-    TxRequestTracker m_txrequest;
    1+    Synced<TxRequestTracker> m_txrequest;
    

    and mechanically replacing . with -> (we don’t need to hold the lock across multiple method calls). That is virtually one line of change. See this comment: #25390 (comment) looks like this PR is doing the “Lots of repetitions” case.

  23. in src/txrequest.cpp:309 in 2ac77c4b5e outdated
    305@@ -306,23 +306,28 @@ GenTxid ToGenTxid(const Announcement& ann)
    306 }  // namespace
    307 
    308 /** Actual implementation for TxRequestTracker's data structure. */
    309-class TxRequestTracker::Impl {
    310+class TxRequestTracker::Impl
    


    Riahiamirreza commented at 7:44 pm on September 29, 2022:
    I know this question is not in the scope of this PR but appreciate the answer. Why there is a separation between TxRequestTracker and it’s implementation? The comments says: Avoid littering this header file with implementation details. Is clarity and readability of the code, the whole point of this separation? If yes, is this a common design pattern?

    ajtowns commented at 0:03 am on September 30, 2022:
    That’s exactly the point. It’s called the “pimpl pattern” (pointer to impl class) – https://cpppatterns.com/patterns/pimpl.html

    dergoegge commented at 8:52 am on September 30, 2022:
    We also had a review club on this a while ago, if you are interested: https://bitcoincore.reviews/22950

    vasild commented at 9:00 am on September 30, 2022:

    Yeah, also here: https://en.cppreference.com/w/cpp/language/pimpl “This technique is used to construct C++ library interfaces with stable ABI and to reduce compile-time dependencies.”

    What it does for us in practice is to reduce re-compilation times - if everything is compiled and some private member is changed then only the cpp file will be recompiled instead of all files that include the h file.

    Improving clarity and readability is not mentioned anywhere in any of the pimpl articles I checked.

  24. maflcko commented at 11:10 am on October 4, 2022: member
    Could also move it out of the cs_main scope in FinalizeNode?
  25. DrahtBot added the label Needs rebase on Oct 17, 2022
  26. [net processing] Guard TxRequestTracker internally using its own mutex a51d282966
  27. [net processing] Reduce cs_main scope for m_txrequest
    m_txrequest is now guarded by TxRequestTracker's internal mutex.
    5a031f27ff
  28. dergoegge force-pushed on Oct 18, 2022
  29. dergoegge commented at 4:50 pm on October 18, 2022: member
    Rebased, added a method to remove multiple transactions from the tracker at once (to avoid locking the internal lock over and over again), and renamed m_txrequest_mutex -> m_mutex.
  30. [txrequest] Add method for removing multiple transaction requests at once 1d72d4206d
  31. dergoegge force-pushed on Oct 18, 2022
  32. DrahtBot removed the label Needs rebase on Oct 18, 2022
  33. in src/net_processing.cpp:1824 in 1d72d4206d
    1826-            m_txrequest.ForgetTxHash(ptx->GetHash());
    1827-            m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
    1828-        }
    1829-    }
    1830+
    1831+    m_txrequest.ForgetTxs(Span{pblock->vtx});
    


    vasild commented at 12:12 pm on October 21, 2022:

    nit:

    0    m_txrequest.ForgetTxs(pblock->vtx);
    
  34. vasild commented at 12:12 pm on October 21, 2022: contributor
    ACK 1d72d4206df579483f8297acd163611fc294378e See below.
  35. vasild commented at 12:29 pm on October 21, 2022: contributor

    Could also move it out of the cs_main scope in FinalizeNode?

    I am not sure about it. In PeerManagerImpl::FinalizeNode():

    0LOCK(cs_main);
    1...
    2m_txrequest.DisconnectedPeer(nodeid);
    3...
    4assert(m_txrequest.Size() == 0);
    

    hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere :bomb: :fire:

    Maybe return the size from DisconnectedPeer() after the deletion and later assert that the returned size was 0 in FinalizeNode()?

  36. dergoegge commented at 12:59 pm on October 21, 2022: member

    hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere

    I don’t think it is currently but it almost is! The assert(m_txrequest.Size() == 0); is gated by if (m_node_state.empty()) so no other peer can modify (add txs) m_txrequest.

    I will still change this as that seems a little brittle wrt future changes.

  37. vasild commented at 7:55 am on October 22, 2022: contributor

    How? It seems that the assumption in FinalizeNode() is that m_node_state is modified together/atomically with m_txrequest.

    Yes, the current code is fine, but then it is fine even without this PR. The aim is to make it future proof.

    Returning the size from DisconnectedPeer() like I suggested above seems to have its own (theoretical) flaw - it could return 1, afterwards m_node_state could become empty by another thread, we would enter the if and the assert would fail because 1 != 0.

    Maybe just delete the assert? Or expose the mutex of m_txrequest, lock it before DisconnectedPeer() and unlock it after the assert :question:

  38. in src/net_processing.cpp:1398 in 5a031f27ff outdated
    1394@@ -1395,7 +1395,8 @@ void PeerManagerImpl::PushNodeVersion(CNode& pnode, const Peer& peer)
    1395 
    1396 void PeerManagerImpl::AddTxAnnouncement(const CNode& node, const GenTxid& gtxid, std::chrono::microseconds current_time)
    1397 {
    1398-    AssertLockHeld(::cs_main); // For m_txrequest
    1399+    AssertLockHeld(::cs_main);
    


    maflcko commented at 11:43 am on October 24, 2022:
    Not sure if this is correct. While m_txrequest has an internal mutex to guard against UB, the internal mutex does nothing to guard the processing logic. A mutex different from the internal one is still needed here to guard against several threads calling into this function at the same time.

    dergoegge commented at 1:41 pm on October 24, 2022:
    Hm I think you’re right. Also seems similar to what Vasil mentioned, when we call methods on m_txrequest but expect the internal state between those calls not to change, then the internal mutex won’t help us. afaict the changes here don’t break anything because cs_main is still held in these places but I am trying to seperate m_txrequest from cs_main so cs_main should not be needed for any of this after the PR. Will mark as draft until I figure out how to address this. (Seems like the only way to address this is to change the TxRequestTracker interface)

    maflcko commented at 1:54 pm on October 24, 2022:
    Yeah, an alternative would be to replace cs_main with g_msgproc_mutex?

    vasild commented at 2:31 pm on October 25, 2022:
    Or expose the mutex outside of TxRequestTracker and lock for longer duration in the net processing code. I.e. have TxRequestTracker m_txrequest GUARDED_BY(m_its_own_mutex); as suggested by @ajtowns in #26151#pullrequestreview-1116661944
  39. dergoegge marked this as a draft on Oct 24, 2022
  40. DrahtBot added the label Needs rebase on Nov 28, 2022
  41. DrahtBot commented at 1:14 pm on November 28, 2022: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

  42. DrahtBot commented at 0:50 am on May 19, 2023: contributor

    There hasn’t been much activity lately and the patch still needs rebase. What is the status here?

    • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
    • Is it no longer relevant? ➡️ Please close.
    • Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.
  43. dergoegge commented at 2:56 pm on May 30, 2023: member

    Closing this for now, can be marked up for grabs.

    #26151 (review) needs to be addressed for this to move forward. Imo, the interface of TxRequestTracker should change to internally enforce the MAX_PEER_TX_ANNOUNCEMENTS and MAX_PEER_TX_REQUEST_IN_FLIGHT limits.

  44. dergoegge closed this on May 30, 2023

  45. fanquake added the label Up for grabs on May 30, 2023
  46. bitcoin locked this on May 29, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-01 10:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me