test: add addpeeraddress “tried”, test addrman checks on restart with asmap #22831

pull jonatack wants to merge 3 commits into bitcoin:master from jonatack:addrman-asmap-init-order-test changing 4 files +65 −14
  1. jonatack commented at 7:26 pm on August 29, 2021: member

    This pull adds a tried argument to RPC addpeeraddress and a regression test for the recent addrman/asmap changes and issue.

    PR #22697 introduced a reproducible bug in commit 181a1207 that fails addrman consistency checks and causes it to significantly lose peer entries when the -asmap configuration option is used.

    The issue occurs upon bitcoind restart due to an initialization order change in src/init.cpp in that commit, whereby CAddrman asmap is set after deserializing peers.dat, rather than before.

    Issue reported on the #bitcoin-core-dev IRC channel starting at https://www.erisian.com.au/bitcoin-core-dev/log-2021-08-23.html#l-263.

    0addrman lost 22813 new and 2 tried addresses due to collisions or invalid addresses
    1ADDRMAN CONSISTENCY CHECK FAILED!!! err=-17 bitcoind: ./addrman.h:707: void CAddrMan::Check() const: Assertion `false' failed. Aborted
    

    How to reproduce:

    • git checkout 181a1207, build, and launch bitcoind with the -asmap and -checkaddrman=1 configuration options enabled
    • restart bitcoind
    • bitcoind aborts on the second call to the addrman consistency checks in CAddrMan::Check()

    How to test this pull:

    • git checkout 181a1207, cherry pick the first commit of this branch, build, git checkout this branch, run test/functional/rpc_net.py, which should pass, and then run test/functional/feature_asmap.py, which should fail with the following output:
    0AssertionError: Unexpected stderr bitcoind: ./addrman.h:739: void CAddrMan::Check() const: Assertion `false' failed.
    
  2. DrahtBot added the label Tests on Aug 29, 2021
  3. DrahtBot commented at 0:18 am on August 30, 2021: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #22087 (Validate port numbers by amadeuszpawlik)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. in test/functional/feature_asmap.py:45 in 6d9da9bc96 outdated
    40@@ -38,6 +41,10 @@ class AsmapTest(BitcoinTestFramework):
    41     def set_test_params(self):
    42         self.num_nodes = 1
    43 
    44+    def set_peers_dat(self):
    45+        peers_data = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'data/peers.dat')
    46+        shutil.copyfile(peers_data, os.path.join(self.datadir, 'peers.dat'))
    


    MarcoFalke commented at 6:45 am on August 30, 2021:
    The file is .2 MB, so I am wondering if there is a way to generate it with a few lines of python? Can addpeeraddress fill the tried table?

    jonatack commented at 6:50 am on August 30, 2021:
    Didn’t immediately see how to add tried entries programmatically but I think the file can be much smaller if concept ack.

    MarcoFalke commented at 7:36 am on August 30, 2021:
    An alternative is always to add the data to https://github.com/bitcoin-core/qa-assets, which doesn’t slow down the main repo clone at all.

    jonatack commented at 9:01 am on August 30, 2021:

    Reduced the peers.dat size 20x while still ensuring the test passes/fails

    before

    0236 KB
    1Addrman checks started: new 3157, tried 96, total 3253
    

    after

    011.9 KB
    1Addrman checks started: new 95, tried 3, total 98
    

    Files in test/functional/data for me locally

    0$ ls -l test/functional/data
    1-rw-r--r-- 1 88238 Jun  5 16:07 blockheader_testnet3.hex
    2-rw-r--r-- 1  8056 Aug 28 23:27 invalid_txs.py
    3-rw-r--r-- 1 12137 Aug 30 11:12 peers.dat
    4drwxr-xr-x 2  4096 Aug 19 14:46 __pycache__
    5-rw-r--r-- 1  3431 Jun  5 16:07 rpc_bip67.json
    6-rw-r--r-- 1 57899 Jun  5 16:07 rpc_getblockstats.json
    7-rw-r--r-- 1 47236 Jun  5 16:07 rpc_psbt.json
    

    jonatack commented at 9:02 am on August 30, 2021:
    (The new -checkaddrman option and logging is super helpful :)
  5. MarcoFalke commented at 7:37 am on August 30, 2021: member
    Concept ACK on adding the test
  6. jonatack force-pushed on Aug 30, 2021
  7. jonatack force-pushed on Aug 30, 2021
  8. jonatack commented at 10:44 am on August 30, 2021: member
    Re-pushed with a 20x smaller peers.dat containing 95 new and 3 tried entries (we need a few entries in the tried table for the asmap/addrman regression test) along with some docs per git diff c8a6035 792c9c2.
  9. MarcoFalke commented at 1:03 pm on August 30, 2021: member
    Maybe still include docs on how to generate the file? (See also #16796#pullrequestreview-285574484) Can be in the commit message or a separate file.
  10. jonatack force-pushed on Aug 30, 2021
  11. jonatack commented at 6:41 pm on August 30, 2021: member

    Maybe still include docs on how to generate the file? (See also #16796 (review)) Can be in the commit message or a separate file.

    Good idea. Done in the commit message.

  12. jonatack force-pushed on Aug 30, 2021
  13. amitiuttarwar commented at 8:52 pm on August 31, 2021: contributor
    concept ACK, thanks for adding tests. I think squashing cf90c5a7d5116468d42a812d5eed93a2ac56a248 & 80dbd9b0f46f2b12a9e6359e1aaf4c706b57f20b would make sense, because the functional test seems like the easiest way to verify the contents of the peers.dat file.
  14. in src/init.cpp:1216 in 0cdfe7fc90 outdated
    1212@@ -1184,10 +1213,12 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
    1213         } else {
    1214             // Addrman can be in an inconsistent state after failure, reset it
    1215             node.addrman = std::make_unique<CAddrMan>(/* deterministic */ false, /* consistency_check_ratio */ check_addrman);
    1216+            node.addrman->m_asmap = asmap;
    


    naumenkogs commented at 1:33 pm on September 3, 2021:
    Why making this assignment twice? The line node.addrman->m_asmap = asmap; is literally repeated? I think it could be done afterwards?

    naumenkogs commented at 1:35 pm on September 3, 2021:
    Ah, I realized that’s the whole point of this commit.
  15. jonatack commented at 3:08 pm on September 3, 2021: member

    I think squashing cf90c5a & 80dbd9b would make sense, because the functional test seems like the easiest way to verify the contents of the peers.dat file.

    They began as the same commit together, then I realized that it makes more sense for them to be separate:

    • the file addition is a standalone change that can be used for other tests (I have some ideas)
    • so that people can find the data file commit easily, see its documentation, how it was generated, maybe generate a new one (more addresses, or with tor v2 peers, etc.), or perhaps to revert it someday if we have a way to call CAddrMan::MakeTried() from the tests to move addresses into the tried table or to mock it
    • the commit messages are very different

    The easiest way to test the peers.dat file is to copy it to a datadir, start a node, and run commands on it like -addrinfo, getnodeaddresses, etc. Edit: added “how to test” info to the commit message.

  16. DrahtBot added the label Needs rebase on Sep 6, 2021
  17. jonatack force-pushed on Sep 6, 2021
  18. jonatack renamed this:
    p2p, bugfix: fix addrman tried table corruption on restart with asmap
    test, bugfix: addrman tried table corruption on restart with asmap
    on Sep 6, 2021
  19. jonatack commented at 12:42 pm on September 6, 2021: member
    Rebased, improved the commit messages, added additional test coverage, and renamed the test file to feature_addrman_asmap.py. Updated the PR title and description.
  20. DrahtBot removed the label Needs rebase on Sep 6, 2021
  21. jonatack force-pushed on Sep 6, 2021
  22. DrahtBot added the label Needs rebase on Sep 8, 2021
  23. jonatack force-pushed on Sep 8, 2021
  24. jonatack commented at 3:11 pm on September 8, 2021: member
    Rebased and merged the addrman and asmap tests into one test file.
  25. in test/functional/feature_addrman_asmap.py:59 in 4cda5cf421 outdated
    54+    def set_test_params(self):
    55+        self.num_nodes = 1
    56+
    57+    def set_peers_dat(self):
    58+        """Copy a peers.dat containing tried table entries into the data directory."""
    59+        peers_data = os.path.join(os.path.dirname(os.path.realpath(__file__)), "data/peers.dat")
    


    MarcoFalke commented at 3:48 pm on September 8, 2021:
    An alternative to adding a binary file would be to hack something together with the serialize_addrman function in this file

    jonatack commented at 8:01 am on September 9, 2021:

    I agree, it would be a logical next step if this is merged but this can do the job for now.

    (For now I think it’s fundamental to have these regression tests in place quickly, given the number of recent bugs that escaped review, the reproducible criticality of the last bug that this adds a test for, and the addrman/asmap refactoring taking place.)


    MarcoFalke commented at 3:10 pm on September 10, 2021:
    Obviously not a blocker, but if the file is removed after this is merged, then it would be better to not commit it in the first place and use serialize_addrman (or similar) from the beginning.

    amitiuttarwar commented at 3:24 pm on September 10, 2021:
    +1 to approach of serialize_addrman function.

    jonatack commented at 3:39 pm on September 10, 2021:
    Happy to look at it as a follow-up (or leave it for someone), but this does the job and by common sense ought to have been the first step in a responsible fix (or at least part of it), rather than ignored after #22791 (comment) while embarking on many refactoring and changes, no matter how desirable they may have been. Foot-dragging and bikeshedding doesn’t seem appropriate here compared to the gravity of the bug that is covered.

    MarcoFalke commented at 4:02 pm on September 10, 2021:

    Happy to look at it as a follow-up (or leave it for someone)

    Happy to propose an alternative that uses serialize_addrman, if you don’t want to do it here. I just think that we shouldn’t be adding binary files to the git repo, especially short-lived ones, when it can be avoided. Personally I’d like to see both approaches, so that they can be compared and the best one be picked. Though, again, not a blocker. If this gets review, I am happy to merge nonetheless.

    while embarking on many refactoring

    I am not aware of any refactoring changes that have been merged in the meantime after the bug was discovered.

    gravity of the bug that is covered

    I agree that test coverage for that is badly needed. However, the existence of a regression test doesn’t mean it should be merged without review. (At least I couldn’t find any review ACKs on this pull).


    jamesob commented at 4:23 pm on September 10, 2021:

    I am not aware of any refactoring changes that have been merged in the meantime after the bug was discovered.

    Instead of a straight revert, or simple reordering, #22791 was a larger than necessary diff with no accompanying tests.

  26. DrahtBot removed the label Needs rebase on Sep 8, 2021
  27. jonatack force-pushed on Sep 9, 2021
  28. MarcoFalke commented at 9:45 am on September 10, 2021: member
    Sorry, needs rebase again. Maybe split out the merge of the two files into a separate pull, or at least the first commit? Review of the later commits will be easier after getting the full picture first.
  29. DrahtBot added the label Needs rebase on Sep 10, 2021
  30. jonatack commented at 10:37 am on September 10, 2021: member

    Sorry, needs rebase again.

    It’s all good, this was the merge order that made sense to me.

  31. in test/functional/feature_asmap.py:18 in 916d54ac45 outdated
    11@@ -12,7 +12,10 @@
    12 
    13 3. `bitcoind -asmap=<relative path>`, using the unit test skeleton asmap
    14 
    15-4. `bitcoind -asmap/-asmap=` with no file specified, using the default asmap
    16+4. `bitcoind -asmap/-asmap=` with no file specified, using the default asmap,
    17+    combined with `-checkaddrman=1` as a regression test to ensure the correct
    18+    order of asmap/addrman initialization on bitcoind startup, failing which
    19+    the addrman tried table became corrupted and CAddrman::Check() aborted.
    


    fanquake commented at 12:50 pm on September 10, 2021:
    I’m not sure mentioning initialization order here is needed/wanted, as -checkaddrman is just a generic check for corruption. I also don’t think we should leak implementation details like CAddrman::Check(). Naming the function doesn’t add value, and it’s the kind of thing that just becomes outdated / needs changing again in a change like #22872.

    jonatack commented at 1:40 pm on September 10, 2021:
    I agree, and all this is removed in the commit that consolidates the tests.

    jonatack commented at 2:29 pm on September 10, 2021:

    Done, slimmed down the doc.

    0 4. `bitcoind -asmap/-asmap=` with no file specified, using the default asmap
    1+    combined with `-checkaddrman=1`
    
  32. in test/functional/feature_asmap.py:103 in 0ab75d87b7 outdated
     99@@ -98,6 +100,14 @@ def test_empty_asmap(self):
    100         self.node.assert_start_raises_init_error(extra_args=['-asmap'], expected_msg=msg)
    101         os.remove(self.default_asmap)
    102 
    103+    def test_missing_peers_dat_with_addrman_checks(self):
    


    fanquake commented at 1:09 pm on September 10, 2021:
    In 0ab75d87b7ff1cb6ebb69ffe68342b88403230b9: This commit is confusing because it adds an addrman test to feature_asmap.py, which from what I can see is basically the same test that is already in feature_addrman.py, with the addition of -checkaddrman, so isn’t actually asmap specific? Later in this PR the two files get merged together, so either this new version or the existing version essentially gets deleted. Why not just add -checkaddrman to the existing test after the merge, rather than another commit which is essentially duplicate code.

    jonatack commented at 1:31 pm on September 10, 2021:

    This commit was earlier.

    The goal was to have minimum regression tests in place before the fix, or all the refactoring, or at least as soon as possible, to reduce the chance of accidentally adding more critical bugs.

    I’m trying to catch up to a moving target and may slim this back down to just the first regression test.


    jonatack commented at 2:30 pm on September 10, 2021:
    Dropped
  33. fanquake commented at 1:19 pm on September 10, 2021: member

    Rebased and merged the addrman and asmap tests into one test file.

    Why? We are actively untangling addrman and asmap functionality, so I’m not sure why it now makes sense to combine the addrman and asmap functional tests into one place. This seems like the opposite of what we should be doing. cc @jnewbery.

  34. jnewbery commented at 1:24 pm on September 10, 2021: member

    Rebased and merged the addrman and asmap tests into one test file.

    Why? We are actively untangling addrman and asmap functionality, so I’m not sure why it now makes sense to combine the addrman and asmap functional tests into one place. This seems like the opposite of what we should be doing. cc @jnewbery.

    I agree. The main benefit of modularizing our components is the ability to then test them in isolation.

  35. jonatack commented at 1:34 pm on September 10, 2021: member

    Rebased and merged the addrman and asmap tests into one test file.

    Why? We are actively untangling addrman and asmap functionality, so I’m not sure why it now makes sense to combine the addrman and asmap functional tests into one place. This seems like the opposite of what we should be doing. cc @jnewbery.

    I agree. The main benefit of modularizing our components is the ability to then test them in isolation.

    The untested interaction between addrman and asmap was the source of (one of the) most critical bugs we’ve seen in a good while, so it makes sense to test that interaction – at least during the refactoring. The two files don’t have to be consolidated, but they do have a lot in common–including mocking an addrman, which could be the first step to not needing a peers.dat test file. It’s also quicker and easier to manually run one test file than two on these refactorings, and at the moment I have to cherry-pick these tests onto each addrman/asmap PR branch.

  36. MarcoFalke commented at 1:48 pm on September 10, 2021: member
    If people don’t like merging the two files, but you need access to addrman mocking, one solution would be to import the write_addrman helper into the asmap test.
  37. jamesob commented at 2:27 pm on September 10, 2021: member
    Concept ACK - I’m surprised and disconcerted that some minimal version of this wasn’t packaged with #22791, which would’ve allowed unambiguous demonstration of the fix.
  38. jonatack force-pushed on Sep 10, 2021
  39. jonatack commented at 2:36 pm on September 10, 2021: member

    Concept ACK - I’m surprised and disconcerted that some minimal version of this wasn’t packaged with #22791, which would’ve allowed unambiguous demonstration of the fix.

    I agree. Writing a regression test is the first step to take after finding and reproducing a severe bug.

    Slimmed this pull down to the original two commits that focus on regression test coverage for the issue. The additional addrman test coverage can be done in a follow-up.

    Per the PR description:

    How to reproduce the bug

    • git checkout 181a1207, build, and launch bitcoind with the -asmap and -checkaddrman=1 configuration options enabled
    • restart bitcoind
    • bitcoind aborts on the second call to the addrman consistency checks in CAddrMan::Check()

    How to test the test

    • git checkout 181a1207, build, and run test/functional/feature_asmap.py. Output:
    0AssertionError: Unexpected stderr bitcoind: ./addrman.h:739: void CAddrMan::Check() const: Assertion `false' failed.
    
  40. DrahtBot removed the label Needs rebase on Sep 10, 2021
  41. jamesob commented at 4:34 pm on September 10, 2021: member

    ACK 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6 (jamesob/ackr/22831.1.jonatack.test_bugfix_addrman_trie)

    Changeset looks like a minimal way to distinguish that the bug fixed in #22791 is actually fixed. I

    • git checkout 181a1207; make ...; git checkout 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6
    • ./test/functional/feature_asmap.py, observed failure.
    • git checkout master; make ...; git checkout 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6
    • ./test/functional/feature_asmap.py, ensured pass.
  42. mzumsande commented at 10:07 pm on September 10, 2021: member

    looks like a minimal way to distinguish that the bug fixed in #22791 is actually fixed

    Shouldn’t a single address in Tried in an otherwise empty addrman be the minimal way to test this? (instead of 96 addresses in new and 3 in tried)

  43. rajarshimaitra commented at 5:36 am on September 11, 2021: contributor

    tACK https://github.com/bitcoin/bitcoin/pull/22831/commits/22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6

    Verified that 181a1207 fails with this test while both master and the PR passes.

    0$ git checkout 181a1207
    1$ make
    2$ git checkout origin/pr/22831
    3$ test/functiona/feature_asmap.py 
    42021-09-11T05:22:59.032000Z TestFramework (ERROR): Unexpected exception caught during testing
    
    0$ git checkout master
    1$ make
    2$ git checkout origin/pr/22831
    3$ test/functiona/feature_asmap.py
    42021-09-11T05:30:53.276000Z TestFramework (INFO): Tests successful
    
    0$ git checkout origin/pr/22831
    1$ make
    2$ test/functional/feature_asmap.py
    32021-09-11T05:30:53.276000Z TestFramework (INFO): Tests successful
    
  44. benthecarman approved
  45. benthecarman commented at 2:47 am on September 12, 2021: contributor

    tACK 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6

    test that the new test passes on master and this pr, and that it fails on 181a1207ba6bd179d181f3e2534ef8676565ce72

  46. ben-kaufman commented at 4:32 pm on September 12, 2021: contributor

    tACK 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6

    Tested against masterand this PR - passing as expected Tested against 181a1207ba6bd179d181f3e2534ef8676565ce72 - failing with the expected error

  47. fanquake commented at 3:51 am on September 13, 2021: member

    I’ve seen multiple mentions that the bug this PR is providing regression testing for is “critical”, “severe” and “(one of the) most critical bugs we’ve seen in a good while”. I just want to provide some perspective. When considering the severity of bugs I’m generally taking two things into account. The impact: what does it cause & how bad is it, and the scope: how many users does it effect. There are other factors that can be considered, i.e how easy is it to exploit, how often does it occur, how hard is it to (covertly) patch etc, but just considering impact and scope at a high level do a pretty good job of determining what needs to be done, and how quickly. High impact + big scope = critical / severe.

    Things I consider to be high impact would be remote code execution, (remote) node crashing, something that would result in a loss of user funds, BTC printing, networking splitting, DDOS etc. When it comes to scope, it’s basically the larger the proportion of the existing network that the issue effects, the bigger the scope. So, a trivial to exploit, node crashing bug, that effects the last few major versions deployed on the network, is critical/severe. This is the kind of bug that once discovered, is generally patched quickly (and likely covertly), new releases are then made available, and the true impact of the issue may not become public for some time after that, maybe many years.

    The addrman/asmap initialisation order bug fixed by #22791, is, in my opinion, not only low impact, but also low in scope. For most users (let alone developers running master), the impact of losing your peers.dat is probably more of an annoyance, than a critical issue. As for scope, this bug existed only in master, for ~2 weeks (the time between #22697 and #22791), and would only be seen when using -asmap, which is still fairly new / experimental.

    I think this perspective is important because it can determine how we fix bugs. In this case, the bug is already fixed, and this PR is adding regression testing (which is important!). However given the impact and scope are low, there isn’t any sort of rush to get this regression testing added that we can’t do it “right” the first time.

    In this case, “right”, in my opinion, is writing some python to generate the peers.dat needed for the test (potentially utilising code that already exists in the feature_addrman.py test), rather than adding a new binary file (which is generated in a fairly adhoc way) to the repo, just to likely delete it shortly after. Writing new code to assist in generating peers.dat files will also be useful in the future, as more comprehensive tests are added. This opinion seems to be shared by at least a few of the other reviewers.

  48. jarolrod commented at 9:02 am on September 13, 2021: member

    ACK 22a17be

    I mainly tested that the tests introduced here are actually effective. Checked that it would fail on the mentioned commit hash in PR description and passes on master. The mentioned testing steps were quite helpful.

    I’ve read up on the contexts of the mentioned PRs and the mentioned issue. I have no opinion on the severity of this bug, especially since it’s been patched. As mentioned in the PR description, regardless of the severity, regression tests here are still a nice-to-have. @fanquake

    “In this case, “right”, in my opinion, is writing some python to generate the peers.dat needed for the test (potentially utilising code that already exists in the feature_addrman.py test)”

    This would be quite useful, for a variety of contexts. Seems most appropriate as a follow-up or just on its own as an independent PR?

  49. mzumsande commented at 9:47 am on September 13, 2021: member

    Two points: I don’t think that the bug specifically corrupts the tried tables: when peers.dat is deserialized without an asmap, and only later an asmap is loaded, both entries from new and tried will be in incorrect buckets with respect to that asmap. It just so happens that CAddrman::Check_() checks whether tried entries are in their appropiate bucket, while it doesn’t do the same for new.

    In this case, “right”, in my opinion, is writing some python to generate the peers.dat needed for the test

    An alternative approach might be to expand the test-only RPC -addpeeraddress with an optional arg to promote to tried (by calling Good() after Add()). That way, one could add addresses to the tried tables directly in functional tests and let bitcoind do the serialization to peers.dat. Though I’m not sure if this would have many applications beyond this issue, since most of the tried logic is already covered via unit tests.

  50. jnewbery commented at 11:18 am on September 13, 2021: member

    The addrman/asmap initialisation order bug fixed by #22791, is, in my opinion, not only low impact, but also low in scope.

    I totally agree @fanquake’s assessment. I obviously can’t be totally unbiased since I introduced the initialization order bug in #22697. However, I have discovered/fixed/reviewed fixes for bugs of the type that @fanquake mentions (DDoS, remote crashing, funds loss, etc) in released code, and I can objectively say that this bug doesn’t match the severity of those. If I had to characterize the severity/type of the bug, I’d say it’s much closer to issues like #21584 and #22467. They’re certainly bugs that need to be fixed in master before the next release, but it’s totally inaccurate to claim it’s “(one of the) most critical bugs we’ve seen in a good while”.

    Update: The bug was patched in another PR, #22791 (which also proceeded to do further refactoring)

    (@jonatack) #22831#issue-722006280

    Instead of a straight revert, or simple reordering, #22791 was a larger than necessary diff with no accompanying tests.

    (@jamesob) #22831 (review)

    There seems to be some confusion around the fix in #22791. The initialization order bug that #22791 fixed was introduced in #22697. However, the conditions for that bug have existed since #16702, when m_asmap was introduced as a mutable public data member of CAddrMan. That PR changed CAddrMan’s initialization to be two-phase: first the object is constructed, and later on, external code mutates CAddrMan’s internal m_asmap data member. If any public methods are called on CAddrMan between those two phases, its internal invariants may be invalidated. Therefore, #22791 fixed not just the proximal cause of the bug (initialization order), but also the ultimate cause (bad encapsulation and partially-initalized objects) by making that member const and private. Just changing the initialization order would be a partial fix, rather than eliminating a whole class of bugs.

    The suggestions in #22791 (review) were also essentially ignored.

    (@jonatack) #22791 (comment)

    This is simply untrue. I welcomed the comments here: #22791 (comment) and offered to incorporate them into the PR. Instead of that, this PR was opened, which was essentially a competing/conficting PR with only the partial fix. It’s a bit difficult to see the history of this PR since the description has been edited 20 times in the two weeks that it’s been open, but this was originally opened as a conflicting PR with #22791.

    Having two conflicting PRs to fix the same issue seemed unnecessarily confusing for reviewers, so I continued to focus my efforts on #22791. I’m still a concept ACK for increasing the test coverage of asmap/addrman, but I don’t see why there’s a strict necessity for that to be included in the same PR as the fix, especially when multiple reviewers have raised concerns over the approach of the tests in this PR.

    Unfortunately, this PR has become very charged and politicized, with reasonable technical points raised by very experienced contributors (https://github.com/bitcoin/bitcoin/pull/22831#discussion_r704550576) being rejected out of hand as “foot-dragging and bikeshedding”. I’d encourage everyone to take a deep breath and remember that we’re all on the same team, and everyone is here to improve the project.

    An alternative approach might be to expand the test-only RPC -addpeeraddress with an optional arg to promote to tried (by calling Good() after Add()). That way, one could add addresses to the tried tables directly in functional tests and let bitcoind do the serialization to peers.dat. @mzumsande

    I agree that this is the easiest way. It’s a couple of lines change to update addpeeraddress to allow addresses to be added to the tried table, and ~10 lines to update the test to use that new functionality. I have a demonstration branch here: https://github.com/jnewbery/bitcoin/tree/2021-09-test-addrman-restart. @jonatack - feel free to take what you need from there if you think it’s useful.

  51. laanwj commented at 11:25 am on September 13, 2021: member

    Code review ACK 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6 It tests what it needs to test. I think it’s good to have a test here. Would have been good to have it part of the original fix.

    I agree with @fanquake’s remark about the criticality of this issue being somewhat overstated. This needs to be fixed, for sure, but losing peers.dat is not the end of the world. It’s not a “everyone needs to panic” level issue in itself.

    The lesson to be learned is: review changes to bitcoind/bitcoin-qt’s initialization sequence carefully. Historically it has been a big source of subtle bugs. This one was far from obvious. I hope the refactoring helps making initialization dependencies more clear in the long run to make it less of a mine field.

    In this case, “right”, in my opinion, is writing some python to generate the peers.dat needed for the test (potentially utilising code that already exists in the feature_addrman.py test), rather than adding a new binary file (which is generated in a fairly adhoc way) to the repo, just to likely delete it shortly after.

    This would be better, but is also quite a bit of further work. I think it’s somewhat unfair to expect this from @jonatack just for reporting the issue! Someone could work on this, but in the meantime it is good to have a test in place.

    I agree with the general sentiment that we don’t want large binary files to blow up the repository. But a 12kB binary file is not enough to reject it IMO.

  52. MarcoFalke commented at 11:29 am on September 13, 2021: member

    If this will be merged, it would be good to update the title first, which mentions “bugfix”, but the code changes are adding a test.

    Also the description could be shortened to not explain what the fix in anther pull did.

  53. vasild approved
  54. vasild commented at 8:39 am on September 14, 2021: member

    ACK 22a17be0b9b56e0b35cfc5d1376e51b1e6dc0ed6

    The bug was fixed in 50fd77045e2f858a53486b5e02e1798c92ab946c. I checked that:

    • The test fails on 50fd77045e2f858a53486b5e02e1798c92ab946c~
    • The test passes on 50fd77045e2f858a53486b5e02e1798c92ab946c

    I think generating peers.dat on the fly, if possible to be done easily, would be better than including a pre-generated one. CAddrMan::Serialize() is somewhat involved and definitely not trivial-5-lines of code. Duplicating it into Python will have a maintenance cost in the long run. I like @mzumsande’s idea about addpeeraddress because it will use the actual CAddrMan::Serialize() to generate peers.dat. I was working on that but @jnewbery beat me to it. Looks like it is relatively easy, attaching my version here for reference. Would be happy to re-ACK.

     0diff --git i/src/rpc/net.cpp w/src/rpc/net.cpp
     1index 0f554ec5e7..b3562eaa05 100644
     2--- i/src/rpc/net.cpp
     3+++ w/src/rpc/net.cpp
     4@@ -921,6 +921,7 @@ static RPCHelpMan addpeeraddress()
     5         {
     6             {"address", RPCArg::Type::STR, RPCArg::Optional::NO, "The IP address of the peer"},
     7             {"port", RPCArg::Type::NUM, RPCArg::Optional::NO, "The port of the peer"},
     8+            {"good", RPCArg::Type::BOOL, RPCArg::Default{false}, "If set, add to the tried table"},
     9         },
    10         RPCResult{
    11             RPCResult::Type::OBJ, "", "",
    12@@ -929,7 +930,7 @@ static RPCHelpMan addpeeraddress()
    13             },
    14         },
    15         RPCExamples{
    16-            HelpExampleCli("addpeeraddress", "\"1.2.3.4\" 8333")
    17+            HelpExampleCli("addpeeraddress", "\"1.2.3.4\" 8333 true")
    18     + HelpExampleRpc("addpeeraddress", "\"1.2.3.4\", 8333")
    19         },
    20         [&](const RPCHelpMan& self, const JSONRPCRequest& request) -> UniValue
    21@@ -941,6 +942,7 @@ static RPCHelpMan addpeeraddress()
    22
    23     const std::string& addr_string{request.params[0].get_str()};
    24     const uint16_t port{static_cast<uint16_t>(request.params[1].get_int())};
    25+    const bool good{request.params[2].isTrue()};
    26
    27     UniValue obj(UniValue::VOBJ);
    28     CNetAddr net_addr;
    29@@ -951,7 +953,12 @@ static RPCHelpMan addpeeraddress()
    30         address.nTime = GetAdjustedTime();
    31         // The source address is set equal to the address. This is equivalent to the peer
    32         // announcing itself.
    33-        if (node.addrman->Add({address}, address)) success = true;
    34+        if (node.addrman->Add({address}, address)) {
    35+            if (good) {
    36+                node.addrman->Good(address);
    37+            }
    38+            success = true;
    39+        }
    40     }
    41
    42     obj.pushKV("success", success);
    
  55. jonatack commented at 9:18 am on September 14, 2021: member
    “You say it, you own it” is an expression that comes up time after time over the decades and why, when no test seemed forthcoming after my review feedback, I proposed one. It’s as simple as that. If other people owning their suggestions leads to improvement, that’s great! This is about Bitcoin; “do no harm” and collaborating to strengthen its robustness is the goal. I’ll try these ideas, thank you :+1:
  56. jonatack renamed this:
    test, bugfix: addrman tried table corruption on restart with asmap
    rpc, test: add addpeerinfo tried, test addrman checks on restart with asmap
    on Sep 14, 2021
  57. jonatack renamed this:
    rpc, test: add addpeerinfo tried, test addrman checks on restart with asmap
    rpc, test: add addpeeraddress "tried", test addrman checks on restart with asmap
    on Sep 14, 2021
  58. jonatack force-pushed on Sep 14, 2021
  59. jonatack commented at 12:40 pm on September 14, 2021: member
    Thank you very much to @jamesob, @rajarshimaitra, @benthecarman, @ben-kaufman, @jarolrod, @laanwj, and @vasild for the tested review ACKs. Updated the approach and pull title/description per review feedback.
  60. jonatack force-pushed on Sep 14, 2021
  61. in test/functional/rpc_net.py:268 in 8c41132419 outdated
    262@@ -262,6 +263,12 @@ def test_addpeeraddress(self):
    263         assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": False})
    264         assert_equal(len(node.getnodeaddresses(count=0)), 1)
    265 
    266+        self.log.debug('Test adding a second address to the tried table')
    267+        assert_equal(node.addpeeraddress(address="1.2.3.0", port=8333, tried=True), {"success": True})
    268+        with node.assert_debug_log(expected_msgs=["new 1, tried 1, total 2"]):
    


    MarcoFalke commented at 1:13 pm on September 14, 2021:

    nit in the second commit:

    Maybe assert on the longer string “Addrman checks started: new 1, tried 1, total 2” to clarify that this is triggered by the addrman check? Either in addition or as replacement for the comment below, which is not verified by the test itself.


    jonatack commented at 1:21 pm on September 14, 2021:
    I held off because that aspect of the logging may change in #22872, but that pull can update the test too, so ok taking your feedback.

    jonatack commented at 1:32 pm on September 14, 2021:
    done
  62. in test/functional/feature_asmap.py:89 in 55d1411a16 outdated
    84+        self.log.info('Test bitcoind -asmap with an addrman containing new and tried entries')
    85+        self.fill_addrman(node_id=0)
    86+        self.stop_node(0)
    87+        shutil.copyfile(self.asmap_raw, self.default_asmap)
    88+        self.start_node(0, ["-asmap", "-checkaddrman=1"])
    89+        self.node.getnodeaddresses()  # getnodeaddresses re-runs the addrman checks
    


    MarcoFalke commented at 1:16 pm on September 14, 2021:

    nit in the last commit:

    Maybe the same here with “Addrman checks started: new 2, tried 2, total 4”


    jonatack commented at 1:32 pm on September 14, 2021:
    done
  63. MarcoFalke approved
  64. MarcoFalke commented at 1:17 pm on September 14, 2021: member

    review ACK 55d1411a161d17757b61270a85626a62c3d63085 🎽

    Signature:

     0-----BEGIN PGP SIGNED MESSAGE-----
     1Hash: SHA512
     2
     3review ACK 55d1411a161d17757b61270a85626a62c3d63085 🎽
     4-----BEGIN PGP SIGNATURE-----
     5
     6iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
     7pUi7FgwAm0GAoP38/kTeZXLd4BiAr/zxha4e32USwtRrk6M2AlXfnKfkXXvWUJ03
     87xZ2f7bpNiBUpIOpUXh0dhc+uKZrqNzGo18M9MygCGEs2r0hZajTwxd8dxjLyMiF
     9Dfkh8OhroQKoEyc/F5yskCCNLvZdQKIaDfOAMHGOikeb0yK6PTqPJXOBfZTO3PY9
    1065XkGyp8hjs6hJNTF+3LbnPW0jbs6KObah46AaG3REYdTYEBJeLdUavnxFqQP7n2
    1127FoTQZkpdQ4Maq/3UGZxYJl3u24RFql7j9DhLT8o4DDXZw8ghu22Wnpx21pHf7n
    122F8W/iuCzlGRy71E8EPhetpOFMAiTtwff5tZ7TnnEtRrMuAEjUo47bZ8FT0IWjy7
    135hX75sE8xUZHa1yZNxdhSQqSpp3nmPdM3fMyTz+uN/kssN9z04IEcwFlpIHYIwyF
    14TQX+H8N4oNZSrW0OgZjE2dZfNi6oLEVtSJYYmw2UFtWgK9wkVclVAGynLr8zIugP
    15cEMl5t0V
    16=Q1W4
    17-----END PGP SIGNATURE-----
    

    Timestamp of file with hash 04e5bf52e7493be34aac2e3fdb2f7099eaf7de5cac3f8c7898beff8518d261cb -

  65. MarcoFalke renamed this:
    rpc, test: add addpeeraddress "tried", test addrman checks on restart with asmap
    test: add addpeeraddress "tried", test addrman checks on restart with asmap
    on Sep 14, 2021
  66. in src/rpc/net.cpp:945 in 55d1411a16 outdated
    941@@ -941,6 +942,7 @@ static RPCHelpMan addpeeraddress()
    942 
    943     const std::string& addr_string{request.params[0].get_str()};
    944     const uint16_t port{static_cast<uint16_t>(request.params[1].get_int())};
    945+    const bool tried{request.params[2].isTrue()};
    


    jnewbery commented at 1:28 pm on September 14, 2021:
    Nit: there’s nothing wrong with declaring and setting a local tried variable at the top of the function, but standard best practice would be to declare the variable where it’s needed (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-introduce / https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rnr-top), or better yet, don’t even use a local variable and just test request.params[2].isTrue() directly below.

    jonatack commented at 1:49 pm on September 14, 2021:
    That’s true. It’s subjective, but giving each of the params a name together in one place seems to make the code easier to understand in this case.

    jnewbery commented at 2:42 pm on September 14, 2021:
    Sure. It’s a style nit. Feel free to ignore.
  67. in test/functional/feature_asmap.py:85 in 55d1411a16 outdated
    79@@ -72,6 +80,15 @@ def test_default_asmap(self):
    80                 self.start_node(0, [arg])
    81         os.remove(self.default_asmap)
    82 
    83+    def test_asmap_interaction_with_addrman_containing_entries(self):
    84+        self.log.info('Test bitcoind -asmap with an addrman containing new and tried entries')
    85+        self.fill_addrman(node_id=0)
    


    jnewbery commented at 1:37 pm on September 14, 2021:
    Note that the addresses will get added to the new/tried table according to the asmap that was loaded in the previous subtest, which is currently test_default_asmap(). If the ordering of the tests changes, then the asmap used for the previous test may be different from this, and the addresses wouldn’t necessarily be rebucketed in the same locations.

    jonatack commented at 1:43 pm on September 14, 2021:

    Good point. Do you think this would be worthwhile?

    0     def test_asmap_interaction_with_addrman_containing_entries(self):
    1         self.log.info("Test bitcoind -asmap with an addrman containing new and tried entries")
    2-        self.fill_addrman(node_id=0)
    3         self.stop_node(0)
    4         shutil.copyfile(self.asmap_raw, self.default_asmap)
    5         self.start_node(0, ["-asmap", "-checkaddrman=1"])
    6+        self.fill_addrman(node_id=0)
    7+        self.restart_node(0, ["-asmap", "-checkaddrman=1"])
    8         with self.node.assert_debug_log(expected_msgs=["Addrman checks started: new 2, tried 2, total 4"]):
    9             self.node.getnodeaddresses()  # getnodeaddresses re-runs the addrman checks
    

    jonatack commented at 2:30 pm on September 14, 2021:
    updated

    jnewbery commented at 2:42 pm on September 14, 2021:
    :+1: that would make the subtests independent
  68. jnewbery commented at 1:38 pm on September 14, 2021: member
    utACK 55d1411a161d17757b61270a85626a62c3d63085
  69. jonatack force-pushed on Sep 14, 2021
  70. MarcoFalke commented at 1:43 pm on September 14, 2021: member

    review ACK 4de3271b1cb65111ea4b37d8fa9791aa0cd63b81 🍛

    Signature:

     0-----BEGIN PGP SIGNED MESSAGE-----
     1Hash: SHA512
     2
     3review ACK 4de3271b1cb65111ea4b37d8fa9791aa0cd63b81 🍛
     4-----BEGIN PGP SIGNATURE-----
     5
     6iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
     7pUjVpwwAnW+G/Az3yxcnBnDlYpPF+EnaE+9mRU+rBuVH4ta4EXbKmnkGBXcUMFES
     8vGyOSh214YYpBVMO0ibGV/5V0PwSeijSnSgFS90JgpL6kbGJ9a+Bcp2uc+nH8yAI
     96dCGwn40XN8oA0sNHbbqffsaFe4PdDIPsR+zV8Up/2SynpJhwxcN58Nf1KvC+dZD
    10vliNJO3LnMfVB9Gdah6ObJ2d99iqqWIyg8+aaWxYpUwMp/4oRtkKjZ7CHS9C4TdE
    11aeVybThyVnD8csbWpx/yKAxjiP9dVZ2yPzXmhTmVZPoWrUH+T3nN8IgYLv0+EkU6
    12NFXIdz1qZZbJ93TKl9RFIPxgB8LX/NH0nO9sRZji61jsSFKJMSv6HmQfYdAO5/tC
    138VUD2pqVnSvbnKYCfpfQEk8lkepDfM0FSH4uE7h3U8VysCbPsDJZpomC845eynai
    14THR4aeDwVylKCVnHZ7MUWSxAxxvjuh/i6SUYsPEKyxM8r9PcErhmYML2mafVKRdq
    1586qyGL//
    16=dWm5
    17-----END PGP SIGNATURE-----
    

    Timestamp of file with hash 4c359cb1138b60d4da195cd16d1d3fb668bf9409ad9ed02c51ecd1964b7f50f2 -

  71. in test/functional/rpc_net.py:280 in 4de3271b1c outdated
    262@@ -262,6 +263,12 @@ def test_addpeeraddress(self):
    263         assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": False})
    264         assert_equal(len(node.getnodeaddresses(count=0)), 1)
    265 
    266+        self.log.debug("Test adding a second address to the tried table")
    267+        assert_equal(node.addpeeraddress(address="1.2.3.0", port=8333, tried=True), {"success": True})
    268+        with node.assert_debug_log(expected_msgs=["Addrman checks started: new 1, tried 1, total 2"]):
    269+            addrs = node.getnodeaddresses(count=0)  # getnodeaddresses re-runs the addrman checks
    270+            assert_equal(len(addrs), 2)
    


    jonatack commented at 2:02 pm on September 14, 2021:
    I’m mildly worried this assert could be racy. Saw it fail once with 2 != 1 today. At the same time, it would be nice to assert on something in addition to the logging.

    jnewbery commented at 2:51 pm on September 14, 2021:

    1.2.3.0 is the same /16 as 1.2.3.4, and when using addpeeraddress(), the source address is set to the destination address. That means that 1.2.3.0 will be placed in the same new bucket as 1.2.3.4, and has a 1/64 chance of the bucket position colliding (depending on what addrman’s nKey has been set to). If that happens, then 1.2.3.0 won’t be added to addrman.

    You can reduce the chance of a collision by using an address from a different /16.


    jonatack commented at 2:54 pm on September 14, 2021:
    Excellent, thank you! Updating.

    jnewbery commented at 12:40 pm on September 15, 2021:
    I think this may still fail with a probability of 1/2^16 = 1/65536 since there could still be bucket/position collisions.

    jonatack commented at 1:12 pm on September 15, 2021:

    Sounds good. WDYT:

     0diff --git a/src/rpc/net.cpp b/src/rpc/net.cpp
     1index f7b6a2a333..227eec722f 100644
     2--- a/src/rpc/net.cpp
     3+++ b/src/rpc/net.cpp
     4@@ -956,7 +956,7 @@ static RPCHelpMan addpeeraddress()
     5         if (node.addrman->Add({address}, address)) {
     6             success = true;
     7             if (tried) {
     8-                // Mark entry as good and possibly move it to the "tried" addresses table.
     9+                // Attempt to move the address to the tried addresses table.
    10                 node.addrman->Good(address);
    11             }
    12         }
    13diff --git a/test/functional/rpc_net.py b/test/functional/rpc_net.py
    14index e2dfa3ea23..1b468aec22 100755
    15--- a/test/functional/rpc_net.py
    16+++ b/test/functional/rpc_net.py
    17@@ -254,16 +254,24 @@ class NetTest(BitcoinTestFramework):
    18 
    19         self.log.debug("Test that adding a valid address succeeds")
    20         assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": True})
    21-        addrs = node.getnodeaddresses(count=0)
    22-        assert_equal(len(addrs), 1)
    23-        assert_equal(addrs[0]["address"], "1.2.3.4")
    24-        assert_equal(addrs[0]["port"], 8333)
    25+        with node.assert_debug_log(expected_msgs=["Addrman checks started: new 1, tried 0, total 1"]):
    26+            addrs = node.getnodeaddresses(count=0)
    27+            assert_equal(len(addrs), 1)
    28+            assert_equal(addrs[0]["address"], "1.2.3.4")
    29+            assert_equal(addrs[0]["port"], 8333)
    30 
    31         self.log.debug("Test that adding the same address again when already present fails")
    32         assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": False})
    33         assert_equal(len(node.getnodeaddresses(count=0)), 1)
    34 
    35-        self.log.debug("Test adding a second address (with a different /16) to the tried table")
    36+        # RPC addpeeraddress sets the source address equal to the destination address.
    37+        # If addresses with the same /16 are passed to it, they will be placed in the
    38+        # same new bucket and have a 1/64 chance of the bucket positions colliding
    39+        # (depending on the value of nKey in the addrman), in which case the new
    40+        # address won't be added to the addrman.  Therefore, reduce the chance of a
    41+        # collision by using an address from a different /16.  There may still remain
    42+        # a 1/2^16 = 1/65536 probability of bucket/position collisions.
    43+        self.log.debug("Test adding a second address to the tried table")
    44         assert_equal(node.addpeeraddress(address="2.0.0.0", port=8333, tried=True), {"success": True})
    45         with node.assert_debug_log(expected_msgs=["Addrman checks started: new 1, tried 1, total 2"]):
    46             addrs = node.getnodeaddresses(count=0)  # getnodeaddresses re-runs the addrman checks
    

    jonatack commented at 1:16 pm on September 15, 2021:
    (an alternative could be to merge the first valid address test with the “tried” test)

    mzumsande commented at 1:16 pm on September 15, 2021:
    or maybe first add an addr to tried, and the next one to new - then new collisions wouldn’t be an issue.

    jonatack commented at 1:25 pm on September 15, 2021:
    @mzumsande done

    mzumsande commented at 1:56 pm on September 15, 2021:
    Hmm, unfortunately this issue also seems to be possible in the regression test commit where 4 addrs are added.

    jonatack commented at 2:39 pm on September 15, 2021:
    Good catch – inversed the order.
  72. jonatack force-pushed on Sep 14, 2021
  73. jonatack force-pushed on Sep 14, 2021
  74. in src/rpc/net.cpp:959 in d58681743a outdated
    952@@ -951,7 +953,13 @@ static RPCHelpMan addpeeraddress()
    953         address.nTime = GetAdjustedTime();
    954         // The source address is set equal to the address. This is equivalent to the peer
    955         // announcing itself.
    956-        if (node.addrman->Add({address}, address)) success = true;
    957+        if (node.addrman->Add({address}, address)) {
    958+            success = true;
    959+            if (tried) {
    960+                // Mark entry as good and possibly move it to the "tried" addresses table.
    


    amitiuttarwar commented at 11:58 pm on September 14, 2021:

    just a suggestion, the first part of the comment doesn’t add much

    0                // Attempt to move the address to the "tried" table.
    

    jonatack commented at 1:15 pm on September 15, 2021:
    thanks for the feedback, done
  75. in test/functional/rpc_net.py:266 in d58681743a outdated
    262@@ -262,6 +263,12 @@ def test_addpeeraddress(self):
    263         assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": False})
    264         assert_equal(len(node.getnodeaddresses(count=0)), 1)
    265 
    266+        self.log.debug("Test adding a second address (with a different /16) to the tried table")
    


    amitiuttarwar commented at 0:12 am on September 15, 2021:
    could be nice to pull out the (with a different /16) from the log & add a comment with the reasoning of why (so it maps to a different bucket) as described in #22831 (review)

    jonatack commented at 1:14 pm on September 15, 2021:
    Oops, meant to post it here. SGTM, what do you think: #22831 (review)

    jonatack commented at 2:43 pm on September 15, 2021:
    done!
  76. amitiuttarwar commented at 0:18 am on September 15, 2021: contributor

    review ACK d58681743a3ea99814c2c77ad8b7a4ea6ca3e14b

    two comment suggestions to consider if you retouch

  77. mzumsande commented at 12:34 pm on September 15, 2021: member

    Tested ACK d58681743a3ea99814c2c77ad8b7a4ea6ca3e14b

    I verified that peers.dat, which is now created at runtime, leads to Check() errors in 181a1207ba6bd179d181f3e2534ef8676565ce72 with the toy asmap ip_asn.map.

  78. Allow passing "tried" to rpc addpeeraddress to call CAddrMan::Good()
    Co-authored-by: Martin Zumsande <mzumsande@gmail.com>
    Co-authored-by: John Newbery <john@johnnewbery.com>
    Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
    ef242f5213
  79. Add test for rpc addpeeraddress with "tried" argument
    Co-authored-by: Amiti Uttarwar <amiti@uttarwar.org>
    Co-authored-by: John Newbery <john@johnnewbery.com>
    Co-authored-by: Martin Zumsande <mzumsande@gmail.com>
    869f136816
  80. jonatack force-pushed on Sep 15, 2021
  81. Add test for addrman consistency check on restart with asmap
    PR #22697 introduced a reproducible issue in commit 181a1207 that causes the
    addrman tried table to fail consistency checks and significantly lose peer
    entries when the `-asmap` configuration option is used.
    
    The issue occurs on bitcoind restart due to an initialization order change
    in `src/init.cpp` in that commit whereby CAddrman asmap is set after
    deserializing `peers.dat`, rather than before.
    
    Issue reported on the `#bitcoin-core-dev` IRC channel starting at
    https://www.erisian.com.au/bitcoin-core-dev/log-2021-08-23.html#l-263.
    
    ```
    addrman lost 22813 new and 2 tried addresses due to collisions or invalid addresses
    ADDRMAN CONSISTENCY CHECK FAILED!!! err=-17
    bitcoind: ./addrman.h:707: void CAddrMan::Check() const: Assertion `false' failed. Aborted
    ```
    
    How to reproduce:
    - `git checkout 181a1207` and recompile
    - launch bitcoind with `-asmap` and `-checkaddrman=1` config options
    - restart bitcoind
    - bitcoind aborts on second call to `CAddrMan::Check()`
    
    This commit adds a regression test to reproduce the case; it passes or fails
    with the same error.
    
    Co-authored-by: John Newbery <john@johnnewbery.com>
    Co-authored-by: Martin Zumsande <mzumsande@gmail.com>
    cdaab90662
  82. jonatack force-pushed on Sep 15, 2021
  83. jonatack commented at 2:43 pm on September 15, 2021: member

    Very good suggestions (thanks!) Updated per git diff d586817 cdaab90 along with further test improvements.

      0diff --git a/src/rpc/net.cpp b/src/rpc/net.cpp
      1index f7b6a2a333..227eec722f 100644
      2--- a/src/rpc/net.cpp
      3+++ b/src/rpc/net.cpp
      4@@ -956,7 +956,7 @@ static RPCHelpMan addpeeraddress()
      5         if (node.addrman->Add({address}, address)) {
      6             success = true;
      7             if (tried) {
      8-                // Mark entry as good and possibly move it to the "tried" addresses table.
      9+                // Attempt to move the address to the tried addresses table.
     10                 node.addrman->Good(address);
     11             }
     12         }
     13diff --git a/test/functional/feature_asmap.py b/test/functional/feature_asmap.py
     14index 46d55077d7..2dc1e3a7cb 100755
     15--- a/test/functional/feature_asmap.py
     16+++ b/test/functional/feature_asmap.py
     17@@ -14,7 +14,7 @@ Verify node behaviour and debug log when launching bitcoind in these cases:
     18 
     19 4. `bitcoind -asmap/-asmap=` with no file specified, using the default asmap
     20 
     21-5. `bitcoind -asmap` with an addrman containing new and tried entries
     22+5. `bitcoind -asmap` restart with an addrman containing new and tried entries
     23 
     24 6. `bitcoind -asmap` with no file specified and a missing default asmap file
     25 
     26@@ -42,8 +42,8 @@ class AsmapTest(BitcoinTestFramework):
     27         self.extra_args = [["-checkaddrman=1"]]  # Do addrman checks on all operations.
     28 
     29     def fill_addrman(self, node_id):
     30-        """Add 2 new addresses and 2 tried addresses to the addrman."""
     31-        for addr, tried in [[0, False], [1, False], [2, True], [3, True]]:
     32+        """Add 2 tried addresses to the addrman, followed by 2 new addresses."""
     33+        for addr, tried in [[0, True], [1, True], [2, False], [3, False]]:
     34             self.nodes[node_id].addpeeraddress(address=f"101.{addr}.0.0", tried=tried, port=8333)
     35 
     36     def test_without_asmap_arg(self):
     37@@ -81,13 +81,18 @@ class AsmapTest(BitcoinTestFramework):
     38         os.remove(self.default_asmap)
     39 
     40     def test_asmap_interaction_with_addrman_containing_entries(self):
     41-        self.log.info("Test bitcoind -asmap with an addrman containing new and tried entries")
     42+        self.log.info("Test bitcoind -asmap restart with addrman containing new and tried entries")
     43         self.stop_node(0)
     44         shutil.copyfile(self.asmap_raw, self.default_asmap)
     45         self.start_node(0, ["-asmap", "-checkaddrman=1"])
     46         self.fill_addrman(node_id=0)
     47         self.restart_node(0, ["-asmap", "-checkaddrman=1"])
     48-        with self.node.assert_debug_log(expected_msgs=["Addrman checks started: new 2, tried 2, total 4"]):
     49+        with self.node.assert_debug_log(
     50+            expected_msgs=[
     51+                "Addrman checks started: new 2, tried 2, total 4",
     52+                "Addrman checks completed successfully",
     53+            ]
     54+        ):
     55             self.node.getnodeaddresses()  # getnodeaddresses re-runs the addrman checks
     56         os.remove(self.default_asmap)
     57 
     58diff --git a/test/functional/rpc_net.py b/test/functional/rpc_net.py
     59index e2dfa3ea23..4f5d8ce241 100755
     60--- a/test/functional/rpc_net.py
     61+++ b/test/functional/rpc_net.py
     62@@ -239,6 +239,14 @@ class NetTest(BitcoinTestFramework):
     63         assert_raises_rpc_error(-8, "Network not recognized: Foo", self.nodes[0].getnodeaddresses, 1, "Foo")
     64 
     65     def test_addpeeraddress(self):
     66+        """RPC addpeeraddress sets the source address equal to the destination address.
     67+        If an address with the same /16 as an existing new entry is passed, it will be
     68+        placed in the same new bucket and have a 1/64 chance of the bucket positions
     69+        colliding (depending on the value of nKey in the addrman), in which case the
     70+        new address won't be added.  The probability of collision can be reduced to
     71+        1/2^16 = 1/65536 by using an address from a different /16.  We avoid this here
     72+        by first testing adding a tried table entry before testing adding a new table one.
     73+        """
     74         self.log.info("Test addpeeraddress")
     75         self.restart_node(1, ["-checkaddrman=1"])
     76         node = self.nodes[1]
     77@@ -252,19 +260,21 @@ class NetTest(BitcoinTestFramework):
     78         assert_equal(node.addpeeraddress(address="", port=8333), {"success": False})
     79         assert_equal(node.getnodeaddresses(count=0), [])
     80 
     81-        self.log.debug("Test that adding a valid address succeeds")
     82-        assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": True})
     83-        addrs = node.getnodeaddresses(count=0)
     84-        assert_equal(len(addrs), 1)
     85-        assert_equal(addrs[0]["address"], "1.2.3.4")
     86-        assert_equal(addrs[0]["port"], 8333)
     87+        self.log.debug("Test that adding a valid address to the tried table succeeds")
     88+        assert_equal(node.addpeeraddress(address="1.2.3.4", tried=True, port=8333), {"success": True})
     89+        with node.assert_debug_log(expected_msgs=["Addrman checks started: new 0, tried 1, total 1"]):
     90+            addrs = node.getnodeaddresses(count=0)  # getnodeaddresses re-runs the addrman checks
     91+            assert_equal(len(addrs), 1)
     92+            assert_equal(addrs[0]["address"], "1.2.3.4")
     93+            assert_equal(addrs[0]["port"], 8333)
     94 
     95-        self.log.debug("Test that adding the same address again when already present fails")
     96-        assert_equal(node.addpeeraddress(address="1.2.3.4", port=8333), {"success": False})
     97+        self.log.debug("Test that adding an already-present tried address to the new and tried tables fails")
     98+        for value in [True, False]:
     99+            assert_equal(node.addpeeraddress(address="1.2.3.4", tried=value, port=8333), {"success": False})
    100         assert_equal(len(node.getnodeaddresses(count=0)), 1)
    101 
    102-        self.log.debug("Test adding a second address (with a different /16) to the tried table")
    103-        assert_equal(node.addpeeraddress(address="2.0.0.0", port=8333, tried=True), {"success": True})
    104+        self.log.debug("Test that adding a second address, this time to the new table, succeeds")
    105+        assert_equal(node.addpeeraddress(address="2.0.0.0", port=8333), {"success": True})
    106         with node.assert_debug_log(expected_msgs=["Addrman checks started: new 1, tried 1, total 2"]):
    107             addrs = node.getnodeaddresses(count=0)  # getnodeaddresses re-runs the addrman checks
    
  84. jnewbery commented at 4:57 pm on September 15, 2021: member

    utACK cdaab90662a54e331de0e49a89596bbb94a8ac45

    This isn’t a blocker, but I’ve never liked relying too heavily on the assert_debug_log() pattern in the functional tests. It’s essentially asserting that a specific line of code is hit, which couples the functional tests very closely with the implementation. Ideally, the functional tests would test the functionality, and the implementation could be changed without having to update the tests.

    In this case, we could extend the getnodeaddresses RPC to return whether the address is in the tried or new table, and (if it’s in the new table) its refcount. That’d allow us to more closely test the contents of the addrman using a public interface.

    Like I said, not a blocker for this PR, but perhaps something to consider as a potential follow-up.

  85. jonatack commented at 5:18 pm on September 15, 2021: member
    @jnewbery Agree–that sounds like a good idea for a possible follow-up. Edit: proposed in #23035.
  86. in test/functional/rpc_net.py:249 in cdaab90662
    244+        placed in the same new bucket and have a 1/64 chance of the bucket positions
    245+        colliding (depending on the value of nKey in the addrman), in which case the
    246+        new address won't be added.  The probability of collision can be reduced to
    247+        1/2^16 = 1/65536 by using an address from a different /16.  We avoid this here
    248+        by first testing adding a tried table entry before testing adding a new table one.
    249+        """
    


    vasild commented at 9:17 am on September 17, 2021:
    If this nondeterminism turns out to be a problem also in other cases, then I guess a new option bitcoind -addrmandeterministic could be introduced which creates a deterministic addrman.
  87. in test/functional/rpc_net.py:280 in cdaab90662
    284 
    285+        self.log.debug("Test that adding a second address, this time to the new table, succeeds")
    286+        assert_equal(node.addpeeraddress(address="2.0.0.0", port=8333), {"success": True})
    287+        with node.assert_debug_log(expected_msgs=["Addrman checks started: new 1, tried 1, total 2"]):
    288+            addrs = node.getnodeaddresses(count=0)  # getnodeaddresses re-runs the addrman checks
    289+            assert_equal(len(addrs), 2)
    


    vasild commented at 9:24 am on September 17, 2021:
    We don’t check the contents of addrs[] here, but checked it above. If the reason for this is that the order is undeterministic, then I guess it can be sorted and then checked that it contains 1.2.3.4 and 2.0.0.0.

    jonatack commented at 2:39 pm on September 19, 2021:
    Done in #23035.
  88. vasild approved
  89. vasild commented at 11:57 am on September 17, 2021: member

    ACK cdaab90662a54e331de0e49a89596bbb94a8ac45

    The new test Test bitcoind -asmap restart with addrman containing new and tried entries fails before the fix (50fd77045e~) and passes with the fix (50fd77045e).

  90. jamesob commented at 4:24 pm on September 20, 2021: member
    This seems ready for merge.
  91. mzumsande commented at 5:19 pm on September 20, 2021: member
    re-ACK cdaab90662a54e331de0e49a89596bbb94a8ac45 (based on code review of diff to d586817)
  92. jamesob commented at 7:16 pm on September 20, 2021: member

    Therefore, #22791 fixed not just the proximal cause of the bug (initialization order), but also the ultimate cause (bad encapsulation and partially-initalized objects) by making that member const and private. Just changing the initialization order would be a partial fix, rather than eliminating a whole class of bugs.

    I don’t want to belabor drama here, but there is a categorical difference between (i) extant bugs and (ii) footgun conditions that might lead to a bug. One case requires a clear and immediate fix (and if feasible an accompanying test) and the other can be responsibly treated in a number of ways with varying rapidity.

    I’d encourage everyone to take a deep breath and remember that we’re all on the same team, and everyone is here to improve the project.

    While I appreciate the sentiment here, this is a controversial, security-critical project that is likely to someday experience subversion, overt or otherwise. Consequently we should be eager to recognize and address process failures, and we should certainly not be gunshy about adding test coverage, whether or not its implementation is ideal.

  93. MarcoFalke commented at 7:34 am on September 21, 2021: member

    Thanks for adjusting this to use addpeeraddress (and hashing out the non-determinism issues that came with it).

    Concept ACK (didn’t re-review)

  94. MarcoFalke merged this on Sep 21, 2021
  95. MarcoFalke closed this on Sep 21, 2021

  96. jonatack deleted the branch on Sep 21, 2021
  97. jnewbery commented at 8:25 am on September 21, 2021: member

    @jamesob - I’m not going to continue the discussion of process in this PR, since I think we’re basically going in circles at this point. However, I do want to respond to your final point:

    we should certainly not be gunshy about adding test coverage, whether or not its implementation is ideal.

    I totally disagree with this. Our standards for test code should be as high as for code in the product itself, since all code (test or production) carries an ongoing maintenance cost. If a proposed test introduces difficult-to-maintain binary files or intermittently fails, then those issues absolutely should be brought up in review and resolved before merge. I agree with @fanquake’s assessment here:

    I think this perspective is important because it can determine how we fix bugs. In this case, the bug is already fixed, and this PR is adding regression testing (which is important!). However given the impact and scope are low, there isn’t any sort of rush to get this regression testing added that we can’t do it “right” the first time.

    There’s no rush here. A merge-first-fix-later attitude increases the overall cost of this code and transfers that cost onto developers who’ll have to maintain/modify the tests later.

  98. sidhujag referenced this in commit e7cc3a2168 on Sep 21, 2021
  99. Fabcien referenced this in commit db2ebc85a4 on Oct 20, 2022
  100. DrahtBot locked this on Oct 30, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-19 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me