test: add stress tests for initialization

jamesob commented at 9:46 pm on October 15, 2021: member

In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework… so here’s a new test.

This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by

interrupting init with SIGTERM after certain log statements,
interrupting init at random points, and
starting init with some essential data missing (block files, block indices, etc.) to test init error paths.

As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors).

Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after loadblk thread start:

02021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start'
1  36   │ 2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up
2  37   │ 2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen
3  38   │ 2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start'
4  39   │ 2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up
5  ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines

and then fails to start up afterwards. Combined logs showing Error: txindex best block of the index goes beyond pruned data, when the node under test is not pruned:

 0  node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907
 1   node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done
 2   test  2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing
 3				     Traceback (most recent call last):
 4				       File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main
 5					 self.run_test()
 6				       File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test
 7					 check_clean_start()
 8				       File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start
 9					 node.wait_for_rpc_connection()
10				       File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection
11					 raise FailedToStartError(self._node_msg(
12				     test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
13   test  2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread
14   test  2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes
15   test  2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node
16
17   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
18   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
19   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
20   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)
21   node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)

dongcarl requested review from dongcarl on Oct 15, 2021

jamesob force-pushed on Oct 15, 2021

DrahtBot added the label Tests on Oct 15, 2021

jamesob force-pushed on Oct 15, 2021

in test/functional/stress_init.py:84 in 1888c8dc40 outdated

79+                    sigterm_node()
80+                    break
81+
82+                if (time.time() - start) > MAX_SECS_TO_WAIT:
83+                    self.log.warning(f"missed line {bail_line}; bailing now after {num_lines} lines")
84+                    sigterm_node()

brunoerg commented at 9:00 pm on October 16, 2021:

if line and bail_line.lower() in line.lower() and (time.time() - start) > MAX_SECS_TO_WAIT sigterm_node() will be called twice, is this the intention?

lsilva01 commented at 5:27 am on October 18, 2021:

I think it won’t happen, as there is a break after these conditions.

brunoerg commented at 5:44 pm on October 18, 2021:

True! Thanks! missed that break

in test/functional/test_runner.py:278 in 1888c8dc40 outdated

274@@ -275,6 +275,7 @@
275     'wallet_taproot.py',
276     'p2p_fingerprint.py',
277     'feature_uacomment.py',
278+    'stress_init.py',

MarcoFalke commented at 10:19 am on October 17, 2021:

I think this would be called feature_init, no?

jamesob commented at 4:11 pm on October 18, 2021:

Yep sure, will change.

MarcoFalke approved

MarcoFalke commented at 10:20 am on October 17, 2021: member

Strong concept ACK. I tried doing this manually for a few releases in the past, but a functional test simplifies that a lot.

practicalswift commented at 7:30 pm on October 17, 2021: contributor

Strong Concept ACK

Functional testing, unit testing, fuzz testing and now stress testing: great to see!

Fault injection testing and mutation testing next in line? :)

lsilva01 approved

lsilva01 commented at 2:34 pm on October 18, 2021: contributor

Tested ACK 1888c8d on Ubuntu 20.04

brunoerg approved

brunoerg commented at 5:50 pm on October 18, 2021: member

tACK 1888c8dc40d1dbbffe2f522749151ca8d0efee00

jamesob force-pushed on Oct 18, 2021

jamesob force-pushed on Oct 19, 2021

dongcarl commented at 4:30 pm on October 19, 2021: member

Code Review ACK 71ba24a724556af02901d2aa06a71d24e18f6f3c

This is great, and is very helpful for testing initialization! Thank you James!

jamesob requested review from MarcoFalke on Oct 19, 2021

mzumsande commented at 9:48 pm on October 19, 2021: member

Concept ACK, very cool idea.

I also encountered the intermittent Error: txindex best block of the index goes beyond pruned data once while running the new test yesterday - what is the best way to proceed? Should that issue be analysed/fixed before merging this?

in test/functional/feature_init.py:2 in 71ba24a724 outdated

0@@ -0,0 +1,171 @@
1+#!/usr/bin/env python3
2+# Copyright (c) 2017-2020 The Bitcoin Core developers

MarcoFalke commented at 9:01 am on October 20, 2021:

2017 was 5 years ago

in test/functional/test_framework/test_node.py:376 in de0a4e084e outdated

367@@ -368,21 +368,28 @@ def is_node_stopped(self):
368     def wait_until_stopped(self, timeout=BITCOIND_PROC_WAIT_TIMEOUT):
369         wait_until_helper(self.is_node_stopped, timeout=timeout, timeout_factor=self.timeout_factor)
370 
371+    @property
372+    def chain_path(self) -> str:
373+        return os.path.join(self.datadir, self.chain)
374+
375+    @property
376+    def debug_log_path(self) -> str:

MarcoFalke commented at 9:12 am on October 20, 2021:

Why not return a Path here when the function is called …_path?

in test/functional/feature_init.py:53 in 71ba24a724 outdated

50+
51+        def check_clean_start():
52+            """Ensure that node restarts successfully after various interrupts."""
53+            node.start(extra_args=['-txindex=1'])
54+            node.wait_for_rpc_connection()
55+            node.waitforblockheight(200)

MarcoFalke commented at 9:15 am on October 20, 2021:

is this needed? wait_for_rpc_connection should already do that by waiting for the mempool to be loaded.

jamesob commented at 1:40 pm on October 20, 2021:

Yeah, because I think it’s important to explicitly verify that the number of blocks that the setup dictated (200) are present. This is to check for the case where, for example, the block index somehow gets screwed up and fewer blocks are loaded than the original count, despite having loaded the mempool successfully. But am I missing some reason this isn’t necessary?

MarcoFalke commented at 1:43 pm on October 20, 2021:

Wouldn’t an assert_equal be a better choice then?

jamesob commented at 2:09 pm on October 20, 2021:

Oh, what, to avoid a timeout? I guess that makes sense.

in test/functional/feature_init.py:92 in 71ba24a724 outdated

87+                    self.log.info(f"terminating node after {num_lines} log lines seen")
88+                    sigterm_node()
89+                    break
90+
91+                if (time.time() - start) > MAX_SECS_TO_WAIT:
92+                    self.log.warning(f"missed line {bail_line}; bailing now after {num_lines} lines")

MarcoFalke commented at 9:18 am on October 20, 2021:

Shouldn’t this be an assertion error?

jamesob commented at 1:00 pm on October 20, 2021:

So I actually made this change last night… but the problem is that this is actually hit pretty routinely (about every third run) on master (but oddly not on @dongcarl’s #23280?). Each time I manually checked the debug.log and saw the loadblk thread started line, so I’m not sure how the Python code would be missing that as-written. I’ll investigate a bit because I agree, an assertion here would be preferable.

in test/functional/feature_init.py:141 in 71ba24a724 outdated

136+            Path('blocks') / 'blk*.dat': 'Error loading block database.',
137+        }
138+        chain_path = Path(node.chain_path)
139+
140+        for file_patt, err_fragment in files_to_disturb.items():
141+            target_file = Path(glob.glob(str(chain_path / file_patt))[0])

MarcoFalke commented at 9:24 am on October 20, 2021:

Haven’t tried this, but what happens when you use:

0target_file = node.chain_path.glob(file_patt)[0]

jamesob commented at 1:45 pm on October 20, 2021:

Works, but had to add a list(...) and str(file_patt). Still probably better than what I had.

MarcoFalke commented at 2:11 pm on October 20, 2021:

What is the point of str(Path(str()))? Just use str?

jamesob commented at 2:22 pm on October 20, 2021:

Okay I don’t understand what you’re saying anymore so if you want me to change it, suggest a patch.

MarcoFalke approved

MarcoFalke commented at 9:25 am on October 20, 2021: member

ACK

jamesob force-pushed on Oct 20, 2021

in test/functional/feature_init.py:144 in c870ee0d95 outdated

139+            Path('chainstate') / '*.ldb': 'Error opening block database.',
140+            Path('blocks') / 'blk*.dat': 'Error loading block database.',
141+        }
142+
143+        for file_patt, err_fragment in files_to_disturb.items():
144+            target_file = list(node.chain_path.glob(str(file_patt)))[0]

MarcoFalke commented at 2:24 pm on October 20, 2021:

0            'blocks/index/*.ldb': 'Error opening block database.',
1            'chainstate/*.ldb': 'Error opening block database.',
2            'blocks/blk*.dat': 'Error loading block database.',
3        }
4
5        for file_patt, err_fragment in files_to_disturb.items():
6            target_file = list(node.chain_path.glob(file_patt))[0]

jamesob commented at 2:29 pm on October 20, 2021:

Done, thanks.

jamesob commented at 2:26 pm on October 20, 2021: member

I also encountered the intermittent Error: txindex best block of the index goes beyond pruned data once while running the new test yesterday - what is the best way to proceed? Should that issue be analysed/fixed before merging this?

On the one hand, holding off on merging test coverage because there may be a bug in master seems wrong. On the other hand, spurious test failures in CI for a hard-to-track-down bug will be annoying.

For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it’s happening.

MarcoFalke commented at 2:28 pm on October 20, 2021: member

If txindex is causing issues here, maybe submit it in a follow-up pull request?

jamesob force-pushed on Oct 20, 2021

in test/functional/test_framework/test_node.py:393 in ff2fedef5d outdated

391         yield
392 
393         while True:
394             found = True
395-            with open(debug_log, encoding='utf-8') as dl:
396+            with open(str(self.debug_log_path), encoding='utf-8') as dl:

MarcoFalke commented at 3:22 pm on October 20, 2021:

Can you explain why str() is needed here? It should be possible to open a Path in python

jamesob commented at 3:26 pm on October 20, 2021:

Weird, thought I’d started off by doing that and then had to add str. Fixed.

test: add node.chain_path and node.debug_log_path

To allow easier access to the node's datadir and debug logs.

23f85616a8

jamesob force-pushed on Oct 20, 2021

in test/functional/feature_init.py:5 in b32f5a5f9b outdated

0@@ -0,0 +1,175 @@
1+#!/usr/bin/env python3
2+# Copyright (c) 2021 The Bitcoin Core developers
3+# Distributed under the MIT software license, see the accompanying
4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
5+"""Stress tests related to node initialization (init.cpp)."""

jonatack commented at 3:39 pm on October 20, 2021:

nit, the file name could go out of date without being noticed

0"""Stress tests related to node initialization."""

in test/functional/feature_init.py:9 in b32f5a5f9b outdated

0@@ -0,0 +1,175 @@
1+#!/usr/bin/env python3
2+# Copyright (c) 2021 The Bitcoin Core developers
3+# Distributed under the MIT software license, see the accompanying
4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
5+"""Stress tests related to node initialization (init.cpp)."""
6+import time
7+import random
8+import os
9+from pathlib import Path

jonatack commented at 3:40 pm on October 20, 2021:

sort

0-import time
1-import random
2 import os
3 from pathlib import Path
4+import random
5+import time

jnewbery commented at 11:28 am on October 21, 2021:

This comment was marked resolved, but the imports are still unsorted.

jamesob commented at 4:33 pm on October 26, 2021:

Sorted imports are not in the Python style guide. I think keeping them sorted is an unnecessary burden to authors and reviewers.

in test/functional/feature_init.py:54 in b32f5a5f9b outdated

49+
50+        def check_clean_start():
51+            """Ensure that node restarts successfully after various interrupts."""
52+            node.start(extra_args=['-txindex=1'])
53+            node.wait_for_rpc_connection()
54+            assert_equal(200, node.getblockchaininfo()['blocks'])

jonatack commented at 3:45 pm on October 20, 2021:

This would be simpler, I think.

0            assert_equal(200, node.getblockcount())

in test/functional/feature_init.py:85 in b32f5a5f9b outdated

71+        for bail_line in lines_to_bail_after:
72+            self.log.info(f"Starting node and will exit after line '{bail_line}'")
73+            node.start(extra_args=['-txindex=1'])
74+            logfile = open(node.debug_log_path, 'r', encoding='utf8')
75+
76+            MAX_SECS_TO_WAIT = 30

jonatack commented at 3:48 pm on October 20, 2021:

maybe better to snakecase this localvar throughout, as the variable name is reused with a different value in a different scope below

0            max_seconds_to_wait = 30

in test/functional/feature_init.py:90 in b32f5a5f9b outdated

85+                # XXX mysteriously, if we remove this log statement the test
86+                # will occasionally fail by missing the 'loadblk thread start'
87+                # line and hitting the 'missed line' AssertionError. I think
88+                # this may be some kind of overzealous-optimization bug in
89+                # CPython but I really don't know.
90+                self.log.debug(f'saw logline: {line}')

jonatack commented at 3:56 pm on October 20, 2021:

It would be nice to be able to omit this, as it renders the debug log level difficult to follow.

Some other logging suggestions to clean up the output:

 0         for bail_line in lines_to_bail_after:
 1-            self.log.info(f"Starting node and will exit after line '{bail_line}'")
 2+            self.log.debug(f"Starting node and will exit after line '{bail_line}'")
 3             node.start(extra_args=['-txindex=1'])
 4             logfile = open(node.debug_log_path, 'r', encoding='utf8')
 5 
 6@@ -90,7 +90,7 @@ class InitStressTest(BitcoinTestFramework):
 7                 self.log.debug(f'saw logline: {line}')
 8 
 9                 if line and bail_line.lower() in line.lower():
10-                    self.log.info(f"terminating node after {num_lines} log lines seen")
11+                    self.log.debug(f"Terminating node after {num_lines} log lines seen")
12                     sigterm_node()
13                     break
14 
15@@ -109,7 +109,7 @@ class InitStressTest(BitcoinTestFramework):
16         for _ in range(40):
17             # Startup usually creates ~70 log lines
18             terminate_after = random.randint(1, 80)
19-            self.log.info(f"Starting node and will exit after {terminate_after} lines")
20+            self.log.debug(f"Starting node and will exit after {terminate_after} lines")
21             node.start(extra_args=['-txindex=1'])
22             logfile = open(node.debug_log_path, 'r', encoding='utf8')
23 
24@@ -123,7 +123,7 @@ class InitStressTest(BitcoinTestFramework):
25                     num_lines += 1
26 
27                 if num_lines >= terminate_after or (time.time() - start) > MAX_SECS_TO_WAIT:
28-                    self.log.info(f"terminating node after {num_lines} log lines seen")
29+                    self.log.debug(f"Terminating node after {num_lines} log lines seen")
30                     sigterm_node()
31                     break
32 
33@@ -144,7 +144,7 @@ class InitStressTest(BitcoinTestFramework):
34         for file_patt, err_fragment in files_to_disturb.items():
35             target_file = list(node.chain_path.glob(file_patt))[0]
36 
37-            self.log.info(f"tweaking file to ensure failure {target_file}")
38+            self.log.info(f"Tweaking file to ensure failure {target_file}")
39             bak_path = str(target_file) + ".bak"
40             target_file.rename(bak_path)
41 
42@@ -165,7 +165,7 @@ class InitStressTest(BitcoinTestFramework):
43                 match=ErrorMatch.PARTIAL_REGEX,
44             )
45 
46-            self.log.info(f"restoring file from {bak_path} and restarting")
47+            self.log.info(f"Restoring file from {bak_path} and restarting")
48             Path(bak_path).rename(target_file)
49             check_clean_start()
50             self.stop_node(0)

in test/functional/test_framework/test_node.py:23 in b32f5a5f9b outdated

19@@ -20,6 +20,7 @@
20 import collections
21 import shlex
22 import sys
23+from pathlib import Path

jonatack commented at 3:58 pm on October 20, 2021:

nit, sort

 0@@ -13,6 +13,7 @@ import json
 1 import decimal
 2 from enum import Enum
 3 import errno
 4 import http.client
 5 import json
 6 import logging
 7 import os
 8+from pathlib import Path
 9 import re
10 import subprocess
11 import tempfile
12 import time
13 import urllib.parse
14 import collections
15 import shlex
16 import sys
17-from pathlib import Path
18``

jonatack commented at 4:03 pm on October 20, 2021: member

Concept ACK. Nice idea!

jamesob force-pushed on Oct 20, 2021

mzumsande commented at 5:37 pm on October 20, 2021: member

For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it’s happening.

I see different things locally: For me, the failure happens in roughly 1-2% of the runs in the will exit after line 'txindex thread start' run, whereas check_clean_start of the previous run for loadblk still passes. Accordingly, if I remove all lines except txindex thread start from lines_to_bail_after, I’ll still get occasional fails (without any SIGTERM involved), so the failure does seem to occur “unprovoked”.

jamesob force-pushed on Oct 20, 2021

fanquake commented at 1:53 am on October 21, 2021: member

Concept ACK

jamesob force-pushed on Oct 21, 2021

in test/functional/feature_init.py:28 in d5f0dd1efd outdated

23+        self.setup_clean_chain = False
24+        self.num_nodes = 1
25+
26+    def run_test(self):
27+        """
28+        - test bailing initialization after seeing a certain log line.

jnewbery commented at 11:33 am on October 21, 2021:

“to bail” is a bit colloquial and may not be understood by non-native english speakers. It might be better to use a more standard, technical term such as “to terminate”.

jamesob commented at 4:50 pm on October 26, 2021:

Fixed

in test/functional/feature_init.py:47 in d5f0dd1efd outdated

42+        node = self.nodes[0]
43+
44+        def sigterm_node():
45+            node.process.terminate()
46+            node.process.wait()
47+            node.debug_log_path.unlink()

jnewbery commented at 11:52 am on October 21, 2021:

Doing this will probably make it more difficult to debug failures, since the bitcoind log is erased. For example, the second time I ran this test, I hit the following failure:

 0→ ./test/functional/feature_init.py --nocleanup
 12021-10-21T11:43:53.709000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_lh9p1xfw
 22021-10-21T11:43:54.157000Z TestFramework (INFO): Starting node and will exit after line 'scheduler thread start'
 32021-10-21T11:43:54.164000Z TestFramework (INFO): terminating node after 49 log lines seen
 42021-10-21T11:43:54.165000Z TestFramework (INFO): Starting node and will exit after line 'Loading P2P addresses'
 52021-10-21T11:43:54.308000Z TestFramework (INFO): terminating node after 66 log lines seen
 62021-10-21T11:43:54.317000Z TestFramework (INFO): Starting node and will exit after line 'Loading banlist'
 72021-10-21T11:43:54.461000Z TestFramework (INFO): terminating node after 69 log lines seen
 82021-10-21T11:43:54.469000Z TestFramework (INFO): Starting node and will exit after line 'Loading block index'
 92021-10-21T11:43:54.619000Z TestFramework (INFO): terminating node after 81 log lines seen
102021-10-21T11:43:54.626000Z TestFramework (INFO): Starting node and will exit after line 'Switching active chainstate'
112021-10-21T11:43:54.773000Z TestFramework (INFO): terminating node after 82 log lines seen
122021-10-21T11:43:54.781000Z TestFramework (INFO): Starting node and will exit after line 'Loaded best chain:'
132021-10-21T11:43:54.927000Z TestFramework (INFO): terminating node after 92 log lines seen
142021-10-21T11:43:54.935000Z TestFramework (INFO): Starting node and will exit after line 'init message: Verifying blocks'
152021-10-21T11:43:55.082000Z TestFramework (INFO): terminating node after 93 log lines seen
162021-10-21T11:43:55.090000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start'
172021-10-21T11:43:55.240000Z TestFramework (INFO): terminating node after 115 log lines seen
182021-10-21T11:43:55.359000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start'
192021-10-21T11:43:55.517000Z TestFramework (ERROR): Assertion failed
20Traceback (most recent call last):
21  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main
22    self.run_test()
23  File "./test/functional/feature_init.py", line 103, in run_test
24    raise AssertionError(f"node failed to start (line: '{bail_line}')")
25AssertionError: node failed to start (line: 'txindex thread start')
262021-10-21T11:43:55.569000Z TestFramework (INFO): Stopping nodes
27Traceback (most recent call last):
28  File "./test/functional/feature_init.py", line 176, in <module>
29    InitStressTest().main()
30  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
31    exit_code = self.shutdown()
32  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 304, in shutdown
33    self.stop_nodes()
34  File "/home/vagrant/bitcoin/test/functional/test_framework/test_framework.py", line 555, in stop_nodes
35    node.stop_node(wait=wait, wait_until_stopped=False)
36  File "/home/vagrant/bitcoin/test/functional/test_framework/test_node.py", line 336, in stop_node
37    self.stop(wait=wait)
38  File "/home/vagrant/bitcoin/test/functional/test_framework/test_node.py", line 184, in __getattr__
39    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
40AssertionError: [node 0] Error: no RPC connection
41[node 0] Cleaning up leftover process

The bitcoind debug log and combined logs would be more helpful if they were complete.

MarcoFalke commented at 8:35 am on December 1, 2021:

I think this was not addressed?

jamesob commented at 1:16 pm on December 1, 2021:

This suggestion is at odds with the nature of the test itself, since the logfile contents determine when to sigterm and so have to be cleared each iteration.

If someone would like to improve this test by consuming the node’s stdout instead of the logfile, that’s one possible improvement suitable for a follow-up.

MarcoFalke commented at 11:45 am on December 10, 2021:

Seems odd to delete a file to allow reading it. Why can´t this use the same approach like assert_debug_log, that is remember the last read position?

in test/functional/feature_init.py:74 in d5f0dd1efd outdated

69+        ]
70+        if self.is_wallet_compiled():
71+            lines_to_bail_after.append('Verifying wallet')
72+
73+        for bail_line in lines_to_bail_after:
74+            self.log.info(f"Starting node and will exit after line '{bail_line}'")

jnewbery commented at 11:53 am on October 21, 2021:

This feels a little bit verbose for info logging. I’d suggest moving to debug. Same for the “terminating node after” log below.

in test/functional/feature_init.py:91 in d5f0dd1efd outdated

86+                    # XXX mysteriously, if we remove this log statement the test
87+                    # will occasionally fail by missing the 'loadblk thread start'
88+                    # line and hitting the 'missed line' AssertionError. I think
89+                    # this may be some kind of overzealous-optimization bug in
90+                    # CPython but I really don't know.
91+                    self.log.debug(f'saw logline: {line}')

jnewbery commented at 12:09 pm on October 21, 2021:

This results in very verbose test logs

in test/functional/feature_init.py:166 in d5f0dd1efd outdated

156+            # tweaked_contents = bytearray(contents)
157+            # tweaked_contents[50:250] = b'1' * 200
158+            # target_file.write_bytes(bytes(tweaked_contents))
159+            #
160+            # At the moment I can't get this to work (bitcoind loads successfully?) so
161+            # investigate doing this later.

jnewbery commented at 12:10 pm on October 21, 2021:

I think it’s generally better to leave TODOs out of the source and track them as issues, so that they can be discussed/amended/removed.

in test/functional/feature_init.py:112 in d5f0dd1efd outdated

107+
108+        self.log.info("Terminate at some random point in the init process")
109+
110+        for _ in range(40):
111+            # Startup usually creates ~70 log lines
112+            terminate_after = random.randint(1, 80)

jnewbery commented at 12:18 pm on October 21, 2021:

Is there a way to derive this value programatically from the subtests above? I imagine it could change over time as log lines are added/removed in bitcoind, and this value would become outdated.

jamesob commented at 4:51 pm on October 26, 2021:

Good idea, done.

jnewbery commented at 12:19 pm on October 21, 2021: member

Concept ACK.

This test fails intermittently for me. It failed on the AssertionError: node failed to start (line: 'txindex thread start') line on the second run. I think we should remove all failures before merging this to master to avoid spurious CI failures in unrelated PRs.

I wonder if it makes sense to use BBF programs to hook into USDTs rather than have the python test framework poll for log lines as a trigger for process termination. There’s potentially a lot of non-determinism in timing between the bitcoind process writing a log line, having that log flushed to disk, and the python process reading that line and terminating the bitcoind process. With a BPF program being triggered by USDTs, the kernel could terminate the process at the exact point we’re interested in. This test is already disabled for windows, so I don’t think it’d be a huge problem to require eBPF (although it may make it impossible to run on CI).

in test/functional/feature_init.py:52 in d5f0dd1efd outdated

47+            node.debug_log_path.unlink()
48+            node.debug_log_path.touch()
49+
50+        def check_clean_start():
51+            """Ensure that node restarts successfully after various interrupts."""
52+            node.start(extra_args=['-txindex=1'])

mzumsande commented at 7:18 pm on October 24, 2021:

I think I have found the cause for the intermittent failures, it’s related to #15946, see index/base.cpp:

It happens when the node starts with -txindex=1 but ThreadSync is interrupted after it sets m_best_block_index to Genesis, and before it gets do any further work (like indexing the next block).

Then, with the next restart, BaseIndex::Init() tries to check whether we know the last synced block of the partially synced index (which we might not if we were pruning, but do in this case).

The problem is now that the loop while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA)) in BaseIndex::Init() won’t set prune_violation to false because block_to_test is Genesis, block->pprev evaluates to false, so the comparison block_to_test == block isn’t reached.

Since I knew next to nothing about the indexes before looking into this: Do reviewers think that the correct way to fix this would be to rewrite the while loop in Init() to also check block_to_test == block in case of the Genesis block?

dongcarl commented at 9:25 pm on October 25, 2021:

I think you have identified the right cause. I obtained a datadir from jamesob that triggers the “Error: txindex best block of the index goes beyond pruned data…” error on master when run with the following invocation:

0./src/bitcoind -txindex=1 -datadir=$PWD/james-datadir/home/james/tmp/bitcoin_func_test_cf43f1hi/node0/

Now, if I apply the following (very naive, should not be merged) patch, the above invocation no longer fails and bitcoind operates normally:

 0diff --git a/src/index/base.cpp b/src/index/base.cpp
 1index fc6dd77a72..7549ae5d57 100644
 2--- a/src/index/base.cpp
 3+++ b/src/index/base.cpp
 4@@ -98,6 +98,9 @@ bool BaseIndex::Init()
 5                 }
 6                 block = block->pprev;
 7             }
 8+            if (prune_violation) {
 9+                prune_violation = block_to_test != active_chain.Genesis();
10+            }
11         }
12         if (prune_violation) {
13             return InitError(strprintf(Untranslated("%s best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again)"), GetName()));

MarcoFalke commented at 4:04 pm on October 26, 2021:

If txindex is causing issues, maybe remove it for now and add it in a later pull?

jamesob commented at 4:37 pm on October 26, 2021:

Good idea, will do.

jamesob commented at 4:51 pm on October 26, 2021:

Nice detective work, guys! I’m going to follow @MarcoFalke’s advice and disable txindex here, but create an issue to follow this up.

DrahtBot commented at 6:02 am on October 26, 2021: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#10102 (Multiprocess bitcoin by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

test: add stress tests for initialization d9803f7a0a

jamesob force-pushed on Oct 26, 2021

jamesob commented at 7:06 pm on October 26, 2021: member

Okay, I’ve addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you’d like to PR the indexing change and fix you’re welcome to, otherwise I can take care of it shortly.

mzumsande referenced this in commit 96badd667c on Oct 26, 2021

mzumsande commented at 10:07 pm on October 26, 2021: member

Okay, I’ve addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you’d like to PR the indexing change and fix you’re welcome to, otherwise I can take care of it shortly.

I opened #23365 to fix the intermittent issue.

laanwj commented at 12:28 pm on November 10, 2021: member

Concept ACK

jamesob commented at 4:51 pm on November 16, 2021: member

Anything left here? I think as things stand this is a stable test that adds some good coverage. AFAIK there’s no outstanding feedback aside from a few nits and marginal suggestions.

mzumsande referenced this in commit 698c524698 on Nov 26, 2021

laanwj commented at 7:49 pm on November 30, 2021: member

Code review ACK d9803f7a0a33688f7429cf10384244f4770851ca

laanwj merged this on Nov 30, 2021

laanwj closed this on Nov 30, 2021

sidhujag referenced this in commit f5c617dc72 on Dec 1, 2021

jnewbery commented at 10:15 am on December 1, 2021: member

I’m surprised this was merged, given that there were no ACKs on the latest branch, and there were still unresolved approach/implementation issues and comments, eg:

#23289 (review)
#23289#pullrequestreview-785566556

fanquake commented at 1:18 am on December 2, 2021: member

This is also the cause of the recent sporadic failures in our CI. See #23646.

MarcoFalke referenced this in commit bf66e258a8 on Dec 13, 2021

0xB10C referenced this in commit 20ddba85a3 on Dec 13, 2021

sidhujag referenced this in commit 1fd54088ea on Dec 13, 2021

RandyMcMillan referenced this in commit 7e002ad069 on Dec 23, 2021

MarcoFalke referenced this in commit d69af93223 on Jan 3, 2022

rebroad referenced this in commit c54a14e60e on Feb 3, 2022

Fabcien referenced this in commit fdee373b10 on Nov 23, 2022

DrahtBot locked this on Dec 10, 2022

test: add stress tests for initialization #23289

Conflicts