test: add stress tests for initialization #23289

pull jamesob wants to merge 2 commits into bitcoin:master from jamesob:2021-10-stress-test-init changing 3 files +194 −3
  1. jamesob commented at 9:46 pm on October 15, 2021: member

    In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework… so here’s a new test.

    This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by

    • interrupting init with SIGTERM after certain log statements,
    • interrupting init at random points, and
    • starting init with some essential data missing (block files, block indices, etc.) to test init error paths.

    As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors).


    Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after loadblk thread start:

    02021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start'
    1  36    2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up
    2  37    2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen
    3  38    2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start'
    4  39    2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up
    5  ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines
    

    and then fails to start up afterwards. Combined logs showing Error: txindex best block of the index goes beyond pruned data, when the node under test is not pruned:

     0  node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907
     1   node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done
     2   test  2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing
     3				     Traceback (most recent call last):
     4				       File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main
     5					 self.run_test()
     6				       File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test
     7					 check_clean_start()
     8				       File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start
     9					 node.wait_for_rpc_connection()
    10				       File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection
    11					 raise FailedToStartError(self._node_msg(
    12				     test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
    13   test  2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread
    14   test  2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes
    15   test  2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node
    16
    17   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
    18   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
    19   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
    20   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
    21   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
    
  2. dongcarl requested review from dongcarl on Oct 15, 2021
  3. jamesob force-pushed on Oct 15, 2021
  4. DrahtBot added the label Tests on Oct 15, 2021
  5. jamesob force-pushed on Oct 15, 2021
  6. in test/functional/stress_init.py:84 in 1888c8dc40 outdated
    79+                    sigterm_node()
    80+                    break
    81+
    82+                if (time.time() - start) > MAX_SECS_TO_WAIT:
    83+                    self.log.warning(f"missed line {bail_line}; bailing now after {num_lines} lines")
    84+                    sigterm_node()
    


    brunoerg commented at 9:00 pm on October 16, 2021:
    if line and bail_line.lower() in line.lower() and (time.time() - start) > MAX_SECS_TO_WAIT sigterm_node() will be called twice, is this the intention?

    lsilva01 commented at 5:27 am on October 18, 2021:
    I think it won’t happen, as there is a break after these conditions.

    brunoerg commented at 5:44 pm on October 18, 2021:
    True! Thanks! missed that break
  7. in test/functional/test_runner.py:278 in 1888c8dc40 outdated
    274@@ -275,6 +275,7 @@
    275     'wallet_taproot.py',
    276     'p2p_fingerprint.py',
    277     'feature_uacomment.py',
    278+    'stress_init.py',
    


    MarcoFalke commented at 10:19 am on October 17, 2021:
    I think this would be called feature_init, no?

    jamesob commented at 4:11 pm on October 18, 2021:
    Yep sure, will change.
  8. MarcoFalke approved
  9. MarcoFalke commented at 10:20 am on October 17, 2021: member
    Strong concept ACK. I tried doing this manually for a few releases in the past, but a functional test simplifies that a lot.
  10. practicalswift commented at 7:30 pm on October 17, 2021: contributor

    Strong Concept ACK

    Functional testing, unit testing, fuzz testing and now stress testing: great to see!

    Fault injection testing and mutation testing next in line? :)

  11. lsilva01 approved
  12. lsilva01 commented at 2:34 pm on October 18, 2021: contributor
    Tested ACK 1888c8d on Ubuntu 20.04
  13. brunoerg approved
  14. brunoerg commented at 5:50 pm on October 18, 2021: member
    tACK 1888c8dc40d1dbbffe2f522749151ca8d0efee00
  15. jamesob force-pushed on Oct 18, 2021
  16. jamesob force-pushed on Oct 19, 2021
  17. dongcarl commented at 4:30 pm on October 19, 2021: member

    Code Review ACK 71ba24a724556af02901d2aa06a71d24e18f6f3c

    This is great, and is very helpful for testing initialization! Thank you James!

  18. jamesob requested review from MarcoFalke on Oct 19, 2021
  19. mzumsande commented at 9:48 pm on October 19, 2021: member

    Concept ACK, very cool idea.

    I also encountered the intermittent Error: txindex best block of the index goes beyond pruned data once while running the new test yesterday - what is the best way to proceed? Should that issue be analysed/fixed before merging this?

  20. in test/functional/feature_init.py:2 in 71ba24a724 outdated
    0@@ -0,0 +1,171 @@
    1+#!/usr/bin/env python3
    2+# Copyright (c) 2017-2020 The Bitcoin Core developers
    


    MarcoFalke commented at 9:01 am on October 20, 2021:
    2017 was 5 years ago
  21. in test/functional/test_framework/test_node.py:376 in de0a4e084e outdated
    367@@ -368,21 +368,28 @@ def is_node_stopped(self):
    368     def wait_until_stopped(self, timeout=BITCOIND_PROC_WAIT_TIMEOUT):
    369         wait_until_helper(self.is_node_stopped, timeout=timeout, timeout_factor=self.timeout_factor)
    370 
    371+    @property
    372+    def chain_path(self) -> str:
    373+        return os.path.join(self.datadir, self.chain)
    374+
    375+    @property
    376+    def debug_log_path(self) -> str:
    


    MarcoFalke commented at 9:12 am on October 20, 2021:
    Why not return a Path here when the function is called …_path?
  22. in test/functional/feature_init.py:53 in 71ba24a724 outdated
    50+
    51+        def check_clean_start():
    52+            """Ensure that node restarts successfully after various interrupts."""
    53+            node.start(extra_args=['-txindex=1'])
    54+            node.wait_for_rpc_connection()
    55+            node.waitforblockheight(200)
    


    MarcoFalke commented at 9:15 am on October 20, 2021:
    is this needed? wait_for_rpc_connection should already do that by waiting for the mempool to be loaded.

    jamesob commented at 1:40 pm on October 20, 2021:
    Yeah, because I think it’s important to explicitly verify that the number of blocks that the setup dictated (200) are present. This is to check for the case where, for example, the block index somehow gets screwed up and fewer blocks are loaded than the original count, despite having loaded the mempool successfully. But am I missing some reason this isn’t necessary?

    MarcoFalke commented at 1:43 pm on October 20, 2021:
    Wouldn’t an assert_equal be a better choice then?

    jamesob commented at 2:09 pm on October 20, 2021:
    Oh, what, to avoid a timeout? I guess that makes sense.
  23. in test/functional/feature_init.py:92 in 71ba24a724 outdated
    87+                    self.log.info(f"terminating node after {num_lines} log lines seen")
    88+                    sigterm_node()
    89+                    break
    90+
    91+                if (time.time() - start) > MAX_SECS_TO_WAIT:
    92+                    self.log.warning(f"missed line {bail_line}; bailing now after {num_lines} lines")
    


    MarcoFalke commented at 9:18 am on October 20, 2021:
    Shouldn’t this be an assertion error?

    jamesob commented at 1:00 pm on October 20, 2021:
    So I actually made this change last night… but the problem is that this is actually hit pretty routinely (about every third run) on master (but oddly not on @dongcarl’s #23280?). Each time I manually checked the debug.log and saw the loadblk thread started line, so I’m not sure how the Python code would be missing that as-written. I’ll investigate a bit because I agree, an assertion here would be preferable.
  24. in test/functional/feature_init.py:141 in 71ba24a724 outdated
    136+            Path('blocks') / 'blk*.dat': 'Error loading block database.',
    137+        }
    138+        chain_path = Path(node.chain_path)
    139+
    140+        for file_patt, err_fragment in files_to_disturb.items():
    141+            target_file = Path(glob.glob(str(chain_path / file_patt))[0])
    


    MarcoFalke commented at 9:24 am on October 20, 2021:

    Haven’t tried this, but what happens when you use:

    0target_file = node.chain_path.glob(file_patt)[0]
    

    jamesob commented at 1:45 pm on October 20, 2021:
    Works, but had to add a list(...) and str(file_patt). Still probably better than what I had.

    MarcoFalke commented at 2:11 pm on October 20, 2021:
    What is the point of str(Path(str()))? Just use str?

    jamesob commented at 2:22 pm on October 20, 2021:
    Okay I don’t understand what you’re saying anymore so if you want me to change it, suggest a patch.
  25. MarcoFalke approved
  26. MarcoFalke commented at 9:25 am on October 20, 2021: member
    ACK
  27. jamesob force-pushed on Oct 20, 2021
  28. jamesob force-pushed on Oct 20, 2021
  29. in test/functional/feature_init.py:144 in c870ee0d95 outdated
    139+            Path('chainstate') / '*.ldb': 'Error opening block database.',
    140+            Path('blocks') / 'blk*.dat': 'Error loading block database.',
    141+        }
    142+
    143+        for file_patt, err_fragment in files_to_disturb.items():
    144+            target_file = list(node.chain_path.glob(str(file_patt)))[0]
    


    MarcoFalke commented at 2:24 pm on October 20, 2021:
    0            'blocks/index/*.ldb': 'Error opening block database.',
    1            'chainstate/*.ldb': 'Error opening block database.',
    2            'blocks/blk*.dat': 'Error loading block database.',
    3        }
    4
    5        for file_patt, err_fragment in files_to_disturb.items():
    6            target_file = list(node.chain_path.glob(file_patt))[0]
    

    jamesob commented at 2:29 pm on October 20, 2021:
    Done, thanks.
  30. jamesob commented at 2:26 pm on October 20, 2021: member

    I also encountered the intermittent Error: txindex best block of the index goes beyond pruned data once while running the new test yesterday - what is the best way to proceed? Should that issue be analysed/fixed before merging this?

    On the one hand, holding off on merging test coverage because there may be a bug in master seems wrong. On the other hand, spurious test failures in CI for a hard-to-track-down bug will be annoying.

    For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it’s happening.

  31. MarcoFalke commented at 2:28 pm on October 20, 2021: member
    If txindex is causing issues here, maybe submit it in a follow-up pull request?
  32. jamesob force-pushed on Oct 20, 2021
  33. in test/functional/test_framework/test_node.py:393 in ff2fedef5d outdated
    391         yield
    392 
    393         while True:
    394             found = True
    395-            with open(debug_log, encoding='utf-8') as dl:
    396+            with open(str(self.debug_log_path), encoding='utf-8') as dl:
    


    MarcoFalke commented at 3:22 pm on October 20, 2021:
    Can you explain why str() is needed here? It should be possible to open a Path in python

    jamesob commented at 3:26 pm on October 20, 2021:
    Weird, thought I’d started off by doing that and then had to add str. Fixed.
  34. test: add node.chain_path and node.debug_log_path
    To allow easier access to the node's datadir and debug logs.
    23f85616a8
  35. jamesob force-pushed on Oct 20, 2021
  36. in test/functional/feature_init.py:5 in b32f5a5f9b outdated
    0@@ -0,0 +1,175 @@
    1+#!/usr/bin/env python3
    2+# Copyright (c) 2021 The Bitcoin Core developers
    3+# Distributed under the MIT software license, see the accompanying
    4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
    5+"""Stress tests related to node initialization (init.cpp)."""
    


    jonatack commented at 3:39 pm on October 20, 2021:

    nit, the file name could go out of date without being noticed

    0"""Stress tests related to node initialization."""
    
  37. in test/functional/feature_init.py:9 in b32f5a5f9b outdated
    0@@ -0,0 +1,175 @@
    1+#!/usr/bin/env python3
    2+# Copyright (c) 2021 The Bitcoin Core developers
    3+# Distributed under the MIT software license, see the accompanying
    4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
    5+"""Stress tests related to node initialization (init.cpp)."""
    6+import time
    7+import random
    8+import os
    9+from pathlib import Path
    


    jonatack commented at 3:40 pm on October 20, 2021:

    sort

    0-import time
    1-import random
    2 import os
    3 from pathlib import Path
    4+import random
    5+import time
    

    jnewbery commented at 11:28 am on October 21, 2021:
    This comment was marked resolved, but the imports are still unsorted.

    jamesob commented at 4:33 pm on October 26, 2021:
    Sorted imports are not in the Python style guide. I think keeping them sorted is an unnecessary burden to authors and reviewers.
  38. in test/functional/feature_init.py:54 in b32f5a5f9b outdated
    49+
    50+        def check_clean_start():
    51+            """Ensure that node restarts successfully after various interrupts."""
    52+            node.start(extra_args=['-txindex=1'])
    53+            node.wait_for_rpc_connection()
    54+            assert_equal(200, node.getblockchaininfo()['blocks'])
    


    jonatack commented at 3:45 pm on October 20, 2021:

    This would be simpler, I think.

    0            assert_equal(200, node.getblockcount())
    
  39. in test/functional/feature_init.py:85 in b32f5a5f9b outdated
    71+        for bail_line in lines_to_bail_after:
    72+            self.log.info(f"Starting node and will exit after line '{bail_line}'")
    73+            node.start(extra_args=['-txindex=1'])
    74+            logfile = open(node.debug_log_path, 'r', encoding='utf8')
    75+
    76+            MAX_SECS_TO_WAIT = 30
    


    jonatack commented at 3:48 pm on October 20, 2021:

    maybe better to snakecase this localvar throughout, as the variable name is reused with a different value in a different scope below

    0            max_seconds_to_wait = 30
    
  40. in test/functional/feature_init.py:90 in b32f5a5f9b outdated
    85+                # XXX mysteriously, if we remove this log statement the test
    86+                # will occasionally fail by missing the 'loadblk thread start'
    87+                # line and hitting the 'missed line' AssertionError. I think
    88+                # this may be some kind of overzealous-optimization bug in
    89+                # CPython but I really don't know.
    90+                self.log.debug(f'saw logline: {line}')
    


    jonatack commented at 3:56 pm on October 20, 2021:

    It would be nice to be able to omit this, as it renders the debug log level difficult to follow.

    Some other logging suggestions to clean up the output:

     0         for bail_line in lines_to_bail_after:
     1-            self.log.info(f"Starting node and will exit after line '{bail_line}'")
     2+            self.log.debug(f"Starting node and will exit after line '{bail_line}'")
     3             node.start(extra_args=['-txindex=1'])
     4             logfile = open(node.debug_log_path, 'r', encoding='utf8')
     5 
     6@@ -90,7 +90,7 @@ class InitStressTest(BitcoinTestFramework):
     7                 self.log.debug(f'saw logline: {line}')
     8 
     9                 if line and bail_line.lower() in line.lower():
    10-                    self.log.info(f"terminating node after {num_lines} log lines seen")
    11+                    self.log.debug(f"Terminating node after {num_lines} log lines seen")
    12                     sigterm_node()
    13                     break
    14 
    15@@ -109,7 +109,7 @@ class InitStressTest(BitcoinTestFramework):
    16         for _ in range(40):
    17             # Startup usually creates ~70 log lines
    18             terminate_after = random.randint(1, 80)
    19-            self.log.info(f"Starting node and will exit after {terminate_after} lines")
    20+            self.log.debug(f"Starting node and will exit after {terminate_after} lines")
    21             node.start(extra_args=['-txindex=1'])
    22             logfile = open(node.debug_log_path, 'r', encoding='utf8')
    23 
    24@@ -123,7 +123,7 @@ class InitStressTest(BitcoinTestFramework):
    25                     num_lines += 1
    26 
    27                 if num_lines >= terminate_after or (time.time() - start) > MAX_SECS_TO_WAIT:
    28-                    self.log.info(f"terminating node after {num_lines} log lines seen")
    29+                    self.log.debug(f"Terminating node after {num_lines} log lines seen")
    30                     sigterm_node()
    31                     break
    32 
    33@@ -144,7 +144,7 @@ class InitStressTest(BitcoinTestFramework):
    34         for file_patt, err_fragment in files_to_disturb.items():
    35             target_file = list(node.chain_path.glob(file_patt))[0]
    36 
    37-            self.log.info(f"tweaking file to ensure failure {target_file}")
    38+            self.log.info(f"Tweaking file to ensure failure {target_file}")
    39             bak_path = str(target_file) + ".bak"
    40             target_file.rename(bak_path)
    41 
    42@@ -165,7 +165,7 @@ class InitStressTest(BitcoinTestFramework):
    43                 match=ErrorMatch.PARTIAL_REGEX,
    44             )
    45 
    46-            self.log.info(f"restoring file from {bak_path} and restarting")
    47+            self.log.info(f"Restoring file from {bak_path} and restarting")
    48             Path(bak_path).rename(target_file)
    49             check_clean_start()
    50             self.stop_node(0)
    
  41. in test/functional/test_framework/test_node.py:23 in b32f5a5f9b outdated
    19@@ -20,6 +20,7 @@
    20 import collections
    21 import shlex
    22 import sys
    23+from pathlib import Path
    


    jonatack commented at 3:58 pm on October 20, 2021:

    nit, sort

     0@@ -13,6 +13,7 @@ import json
     1 import decimal
     2 from enum import Enum
     3 import errno
     4 import http.client
     5 import json
     6 import logging
     7 import os
     8+from pathlib import Path
     9 import re
    10 import subprocess
    11 import tempfile
    12 import time
    13 import urllib.parse
    14 import collections
    15 import shlex
    16 import sys
    17-from pathlib import Path
    18``
    
  42. jonatack commented at 4:03 pm on October 20, 2021: member
    Concept ACK. Nice idea!
  43. jamesob force-pushed on Oct 20, 2021
  44. mzumsande commented at 5:37 pm on October 20, 2021: member

    For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it’s happening.

    I see different things locally: For me, the failure happens in roughly 1-2% of the runs in the will exit after line 'txindex thread start' run, whereas check_clean_start of the previous run for loadblk still passes. Accordingly, if I remove all lines except txindex thread start from lines_to_bail_after, I’ll still get occasional fails (without any SIGTERM involved), so the failure does seem to occur “unprovoked”.

  45. jamesob force-pushed on Oct 20, 2021
  46. fanquake commented at 1:53 am on October 21, 2021: member
    Concept ACK
  47. jamesob force-pushed on Oct 21, 2021
  48. in test/functional/feature_init.py:28 in d5f0dd1efd outdated
    23+        self.setup_clean_chain = False
    24+        self.num_nodes = 1
    25+
    26+    def run_test(self):
    27+        """
    28+        - test bailing initialization after seeing a certain log line.
    


    jnewbery commented at 11:33 am on October 21, 2021:
    “to bail” is a bit colloquial and may not be understood by non-native english speakers. It might be better to use a more standard, technical term such as “to terminate”.

    jamesob commented at 4:50 pm on October 26, 2021:
    Fixed
  49. in test/functional/feature_init.py:47 in d5f0dd1efd outdated
    42+        node = self.nodes[0]
    43+
    44+        def sigterm_node():
    45+            node.process.terminate()
    46+            node.process.wait()
    47+            node.debug_log_path.unlink()
    


    jnewbery commented at 11:52 am on October 21, 2021:

    Doing this will probably make it more difficult to debug failures, since the bitcoind log is erased. For example, the second time I ran this test, I hit the following failure:

     0 ./test/functional/feature_init.py --nocleanup
     12021-10-21T11:43:53.709000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_lh9p1xfw
     22021-10-21T11:43:54.157000Z TestFramework (INFO): Starting node and will exit after line 'scheduler thread start'
     32021-10-21T11:43:54.164000Z TestFramework (INFO): terminating node after 49 log lines seen
     42021-10-21T11:43:54.165000Z TestFramework (INFO): Starting node and will exit after line 'Loading P2P addresses'
     52021-10-21T11:43:54.308000Z TestFramework (INFO): terminating node after 66 log lines seen
     62021-10-21T11:43:54.317000Z TestFramework (INFO): Starting node and will exit after line 'Loading banlist'
     72021-10-21T11:43:54.461000Z TestFramework (INFO): terminating node after 69 log lines seen
     82021-10-21T11:43:54.469000Z TestFramework (INFO): Starting node and will exit after line 'Loading block index'
     92021-10-21T11:43:54.619000Z TestFramework (INFO): terminating node after 81 log lines seen
    102021-10-21T11:43:54.626000Z TestFramework (INFO): Starting node and will exit after line 'Switching active chainstate'
    112021-10-21T11:43:54.773000Z TestFramework (INFO): terminating node after 82 log lines seen
    122021-10-21T11:43:54.781000Z TestFramework (INFO): Starting node and will exit after line 'Loaded best chain:'
    132021-10-21T11:43:54.927000Z TestFramework (INFO): terminating node after 92 log lines seen
    142021-10-21T11:43:54.935000Z TestFramework (INFO): Starting node and will exit after line 'init message: Verifying blocks'
    152021-10-21T11:43:55.082000Z TestFramework (INFO): terminating node after 93 log lines seen
    162021-10-21T11:43:55.090000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start'
    172021-10-21T11:43:55.240000Z TestFramework (INFO): terminating node after 115 log lines seen
    182021-10-21T11:43:55.359000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start'
    192021-10-21T11:43:55.517000Z TestFramework (ERROR): Assertion failed
    20Traceback (most recent call last):
    21  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main
    22    self.run_test()
    23  File "./test/functional/feature_init.py", line 103, in run_test
    24    raise AssertionError(f"node failed to start (line: '{bail_line}')")
    25AssertionError: node failed to start (line: 'txindex thread start')
    262021-10-21T11:43:55.569000Z TestFramework (INFO): Stopping nodes
    27Traceback (most recent call last):
    28  File "./test/functional/feature_init.py", line 176, in <module>
    29    InitStressTest().main()
    30  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    31    exit_code = self.shutdown()
    32  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 304, in shutdown
    33    self.stop_nodes()
    34  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 555, in stop_nodes
    35    node.stop_node(wait=wait, wait_until_stopped=False)
    36  File "/home/vagrant/bitcoin/test/functional/test_framework/test_node.py", line 336, in stop_node
    37    self.stop(wait=wait)
    38  File "/home/vagrant/bitcoin/test/functional/test_framework/test_node.py", line 184, in __getattr__
    39    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    40AssertionError: [node 0] Error: no RPC connection
    41[node 0] Cleaning up leftover process
    

    The bitcoind debug log and combined logs would be more helpful if they were complete.


    MarcoFalke commented at 8:35 am on December 1, 2021:
    I think this was not addressed?

    jamesob commented at 1:16 pm on December 1, 2021:

    This suggestion is at odds with the nature of the test itself, since the logfile contents determine when to sigterm and so have to be cleared each iteration.

    If someone would like to improve this test by consuming the node’s stdout instead of the logfile, that’s one possible improvement suitable for a follow-up.


    MarcoFalke commented at 11:45 am on December 10, 2021:
    Seems odd to delete a file to allow reading it. Why can´t this use the same approach like assert_debug_log, that is remember the last read position?
  50. in test/functional/feature_init.py:74 in d5f0dd1efd outdated
    69+        ]
    70+        if self.is_wallet_compiled():
    71+            lines_to_bail_after.append('Verifying wallet')
    72+
    73+        for bail_line in lines_to_bail_after:
    74+            self.log.info(f"Starting node and will exit after line '{bail_line}'")
    


    jnewbery commented at 11:53 am on October 21, 2021:
    This feels a little bit verbose for info logging. I’d suggest moving to debug. Same for the “terminating node after” log below.
  51. in test/functional/feature_init.py:91 in d5f0dd1efd outdated
    86+                    # XXX mysteriously, if we remove this log statement the test
    87+                    # will occasionally fail by missing the 'loadblk thread start'
    88+                    # line and hitting the 'missed line' AssertionError. I think
    89+                    # this may be some kind of overzealous-optimization bug in
    90+                    # CPython but I really don't know.
    91+                    self.log.debug(f'saw logline: {line}')
    


    jnewbery commented at 12:09 pm on October 21, 2021:
    This results in very verbose test logs
  52. in test/functional/feature_init.py:166 in d5f0dd1efd outdated
    156+            # tweaked_contents = bytearray(contents)
    157+            # tweaked_contents[50:250] = b'1' * 200
    158+            # target_file.write_bytes(bytes(tweaked_contents))
    159+            #
    160+            # At the moment I can't get this to work (bitcoind loads successfully?) so
    161+            # investigate doing this later.
    


    jnewbery commented at 12:10 pm on October 21, 2021:
    I think it’s generally better to leave TODOs out of the source and track them as issues, so that they can be discussed/amended/removed.
  53. in test/functional/feature_init.py:112 in d5f0dd1efd outdated
    107+
    108+        self.log.info("Terminate at some random point in the init process")
    109+
    110+        for _ in range(40):
    111+            # Startup usually creates ~70 log lines
    112+            terminate_after = random.randint(1, 80)
    


    jnewbery commented at 12:18 pm on October 21, 2021:
    Is there a way to derive this value programatically from the subtests above? I imagine it could change over time as log lines are added/removed in bitcoind, and this value would become outdated.

    jamesob commented at 4:51 pm on October 26, 2021:
    Good idea, done.
  54. jnewbery commented at 12:19 pm on October 21, 2021: member

    Concept ACK.

    This test fails intermittently for me. It failed on the AssertionError: node failed to start (line: 'txindex thread start') line on the second run. I think we should remove all failures before merging this to master to avoid spurious CI failures in unrelated PRs.

    I wonder if it makes sense to use BBF programs to hook into USDTs rather than have the python test framework poll for log lines as a trigger for process termination. There’s potentially a lot of non-determinism in timing between the bitcoind process writing a log line, having that log flushed to disk, and the python process reading that line and terminating the bitcoind process. With a BPF program being triggered by USDTs, the kernel could terminate the process at the exact point we’re interested in. This test is already disabled for windows, so I don’t think it’d be a huge problem to require eBPF (although it may make it impossible to run on CI).

  55. in test/functional/feature_init.py:52 in d5f0dd1efd outdated
    47+            node.debug_log_path.unlink()
    48+            node.debug_log_path.touch()
    49+
    50+        def check_clean_start():
    51+            """Ensure that node restarts successfully after various interrupts."""
    52+            node.start(extra_args=['-txindex=1'])
    


    mzumsande commented at 7:18 pm on October 24, 2021:

    I think I have found the cause for the intermittent failures, it’s related to #15946, see index/base.cpp:

    It happens when the node starts with -txindex=1 but ThreadSync is interrupted after it sets m_best_block_index to Genesis, and before it gets do any further work (like indexing the next block).

    Then, with the next restart, BaseIndex::Init() tries to check whether we know the last synced block of the partially synced index (which we might not if we were pruning, but do in this case).

    The problem is now that the loop while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA)) in BaseIndex::Init() won’t set prune_violation to false because block_to_test is Genesis, block->pprev evaluates to false, so the comparison block_to_test == block isn’t reached.

    Since I knew next to nothing about the indexes before looking into this: Do reviewers think that the correct way to fix this would be to rewrite the while loop in Init() to also check block_to_test == block in case of the Genesis block?


    dongcarl commented at 9:25 pm on October 25, 2021:

    I think you have identified the right cause. I obtained a datadir from jamesob that triggers the “Error: txindex best block of the index goes beyond pruned data…” error on master when run with the following invocation:

    0./src/bitcoind -txindex=1 -datadir=$PWD/james-datadir/home/james/tmp/bitcoin_func_test_cf43f1hi/node0/
    

    Now, if I apply the following (very naive, should not be merged) patch, the above invocation no longer fails and bitcoind operates normally:

     0diff --git a/src/index/base.cpp b/src/index/base.cpp
     1index fc6dd77a72..7549ae5d57 100644
     2--- a/src/index/base.cpp
     3+++ b/src/index/base.cpp
     4@@ -98,6 +98,9 @@ bool BaseIndex::Init()
     5                 }
     6                 block = block->pprev;
     7             }
     8+            if (prune_violation) {
     9+                prune_violation = block_to_test != active_chain.Genesis();
    10+            }
    11         }
    12         if (prune_violation) {
    13             return InitError(strprintf(Untranslated("%s best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)"), GetName()));
    

    MarcoFalke commented at 4:04 pm on October 26, 2021:
    If txindex is causing issues, maybe remove it for now and add it in a later pull?

    jamesob commented at 4:37 pm on October 26, 2021:
    Good idea, will do.

    jamesob commented at 4:51 pm on October 26, 2021:
    Nice detective work, guys! I’m going to follow @MarcoFalke’s advice and disable txindex here, but create an issue to follow this up.
  56. DrahtBot commented at 6:02 am on October 26, 2021: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #10102 (Multiprocess bitcoin by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  57. test: add stress tests for initialization d9803f7a0a
  58. jamesob force-pushed on Oct 26, 2021
  59. jamesob commented at 7:06 pm on October 26, 2021: member
    Okay, I’ve addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you’d like to PR the indexing change and fix you’re welcome to, otherwise I can take care of it shortly.
  60. mzumsande referenced this in commit 96badd667c on Oct 26, 2021
  61. mzumsande commented at 10:07 pm on October 26, 2021: member

    Okay, I’ve addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you’d like to PR the indexing change and fix you’re welcome to, otherwise I can take care of it shortly.

    I opened #23365 to fix the intermittent issue.

  62. laanwj commented at 12:28 pm on November 10, 2021: member
    Concept ACK
  63. jamesob commented at 4:51 pm on November 16, 2021: member
    Anything left here? I think as things stand this is a stable test that adds some good coverage. AFAIK there’s no outstanding feedback aside from a few nits and marginal suggestions.
  64. mzumsande referenced this in commit 698c524698 on Nov 26, 2021
  65. laanwj commented at 7:49 pm on November 30, 2021: member
    Code review ACK d9803f7a0a33688f7429cf10384244f4770851ca
  66. laanwj merged this on Nov 30, 2021
  67. laanwj closed this on Nov 30, 2021

  68. sidhujag referenced this in commit f5c617dc72 on Dec 1, 2021
  69. jnewbery commented at 10:15 am on December 1, 2021: member

    I’m surprised this was merged, given that there were no ACKs on the latest branch, and there were still unresolved approach/implementation issues and comments, eg:

  70. fanquake commented at 1:18 am on December 2, 2021: member
    This is also the cause of the recent sporadic failures in our CI. See #23646.
  71. MarcoFalke referenced this in commit bf66e258a8 on Dec 13, 2021
  72. 0xB10C referenced this in commit 20ddba85a3 on Dec 13, 2021
  73. sidhujag referenced this in commit 1fd54088ea on Dec 13, 2021
  74. RandyMcMillan referenced this in commit 7e002ad069 on Dec 23, 2021
  75. MarcoFalke referenced this in commit d69af93223 on Jan 3, 2022
  76. rebroad referenced this in commit c54a14e60e on Feb 3, 2022
  77. Fabcien referenced this in commit fdee373b10 on Nov 23, 2022
  78. DrahtBot locked this on Dec 10, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-01 13:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me