test: fix feature_init.py intermittencies

furszy commented at 2:41 pm on June 30, 2025: member

Aims to solve #32600. Found it while working on #26966 (this was really annoying there).

This ensures the node is index-synced before perturbing files. If the index sync gets interrupted before it starts, the database could be empty, making any following perturbation ineffective (which explains why the node does not abort during startup in the #32600 logs).

Also, the first commit avoids initializing components not under test. This reduces log flooding, which helped in understanding the issue.

Patch to reproduce the issue on master using feature_init.py (this simulates a node shutting down before the index starts syncing):

 0diff --git a/src/index/base.cpp b/src/index/base.cpp
 1--- a/src/index/base.cpp	(revision 1e03052c3fefb188f047e72548f2c6b0cc019e50)
 2+++ b/src/index/base.cpp	(date 1751293306725)
 3@@ -185,6 +185,7 @@
 4 void BaseIndex::Sync()
 5 {
 6     const CBlockIndex* pindex = m_best_block_index.load();
 7+    m_interrupt();
 8     if (!m_synced) {
 9         std::chrono::steady_clock::time_point last_log_time{0s};
10         std::chrono::steady_clock::time_point last_locator_write_time{0s};

DrahtBot renamed this:
~~test: fix feature_init.py intermittencies~~
test: fix feature_init.py intermittencies
on Jun 30, 2025

DrahtBot added the label Tests on Jun 30, 2025

DrahtBot commented at 2:41 pm on June 30, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32835.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	hodlinator, maflcko, achow101
Concept ACK	theStack
Approach ACK	TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#32427 ((RFC) kernel: Replace leveldb-based BlockTreeDB with flat-file based store by TheCharlatan)
#30469 (index: Fix coinstats overflow by fjahr)
#26966 (index: initial sync speedup, parallelize process by furszy)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

TheCharlatan commented at 2:58 pm on June 30, 2025: contributor

Approach ACK

in test/functional/feature_init.py:105 in 1e03052c3f outdated

 99@@ -100,39 +100,90 @@ def check_clean_start():
100         check_clean_start()
101         self.stop_node(0)
102 
103+        # Prior to deleting/perturbing index files, ensure indexes are synchronized (i.e., data exists to modify)
104+        self.start_node(0, extra_args=args)
105+        tip_height = node.getblockcount()
106+        self.wait_until(lambda: all(i["synced"] and i["best_block_height"] == tip_height for i in node.getindexinfo().values()))
107+        self.stop_node(0)

maflcko commented at 3:07 pm on June 30, 2025:

any reason to start the node again? Can’t this wait_until be done in check_clean_start?

furszy commented at 3:30 pm on June 30, 2025:

any reason to start the node again? Can’t this wait_until be done in check_clean_start?

Just because check_clean_start() is also called inside the deletion rounds and didn’t want to start all indexes there on each cycle round. But that seems easily solvable.. one minute..

furszy commented at 3:37 pm on June 30, 2025:

Done.

furszy force-pushed on Jun 30, 2025

in test/functional/feature_init.py:104 in 2a9ae6f162 outdated

 99@@ -97,42 +100,88 @@ def check_clean_start():
100             self.log.debug("Terminating node after terminate line was found")
101             sigterm_node()
102 
103-        check_clean_start()
104+        # Prior to deleting/perturbing index files, ensure indexes are synchronized (i.e., data exists to modify)
105+        check_clean_start(args, lambda n: self.wait_until(lambda h=n.getblockcount(): all(i["synced"] and i["best_block_height"] == h for i in n.getindexinfo().values())))

maflcko commented at 3:41 pm on June 30, 2025:

this fails lint and getindexinfo is already a no-op when no args are provided, so no need to conditionally run it? could just inline it?

furszy commented at 3:53 pm on June 30, 2025:

this fails lint and getindexinfo is already a no-op when no args are provided, so no need to conditionally run it? could just inline it?

Saw it late, pushed.

furszy force-pushed on Jun 30, 2025

DrahtBot added the label CI failed on Jun 30, 2025

DrahtBot commented at 3:44 pm on June 30, 2025: contributor

🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/45063074304 LLM reason (✨ experimental): Lint check failed due to a Python code style error identified by ruff.

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

furszy force-pushed on Jun 30, 2025

DrahtBot removed the label CI failed on Jun 30, 2025

fanquake commented at 12:39 pm on July 2, 2025: member

cc @hodlinator

in test/functional/feature_init.py:59 in e36e0b1aeb outdated

59-        def check_clean_start():
60+        def check_clean_start(args=None):
61             """Ensure that node restarts successfully after various interrupts."""
62-            node.start()
63+            extra_args = args or []
64+            node.start(extra_args)

hodlinator commented at 1:08 pm on July 2, 2025:

nit: Could simplify:

0        def check_clean_start(args=[]):
1            """Ensure that node restarts successfully after various interrupts."""
2            node.start(args)

furszy commented at 11:45 pm on July 2, 2025:

that was actually done on purpose. See https://leimao.github.io/blog/Python-Default-Argument-Mutable-Object/.

hodlinator commented at 1:08 pm on July 4, 2025:

Oof, that’s nasty.

in test/functional/feature_init.py:215 in e36e0b1aeb outdated

211@@ -158,7 +212,10 @@ def check_clean_start():
212                     tf.seek(150)
213                     tf.write(b"1" * 200)
214 
215-            start_expecting_error(err_fragment)
216+            self.log.info(f"Expecting error frag {err_fragment}")

hodlinator commented at 1:12 pm on July 2, 2025:

nit: Could add quotes and make output as verbose as variables:

0            self.log.info(f"Expecting error fragment: {err_fragment!r}")

furszy commented at 11:55 pm on July 2, 2025:

This is actually just a leftover from debugging the issue in #26966.. sorry for that. Will just remove the line.

in test/functional/feature_init.py:186 in e36e0b1aeb outdated

202                 self.log.info(f"Deleting file to ensure failure {target_file}")
203                 bak_path = str(target_file) + ".bak"
204                 target_file.rename(bak_path)
205 
206-            start_expecting_error(err_fragment)
207+            start_expecting_error(err_fragment, startup_args)

hodlinator commented at 1:16 pm on July 2, 2025:

nit: It might be good to send in the same startup_args to check_clean_start() further down in the loop? https://github.com/bitcoin/bitcoin/blob/e36e0b1aebe437c090ff897d27bbec0a7c5a100a/test/functional/feature_init.py#L192

(Side remark: It is unbalanced that we don’t call check_clean_start() at the bottom of the perturbation-loop, but saves execution time and should be alright as we check for different error messages).

furszy commented at 11:50 pm on July 2, 2025:

It might be good to send in the same startup_args to check_clean_start() further down in the loop?

I don’t think so. All files are restored from the backup after deletion, so we know for sure the indexes are synced at that point (nothing changed).

hodlinator commented at 1:16 pm on July 4, 2025:

If we know for sure, why are we running check_clean_start()? IMO we should do one of:

Trust that we fully restored the state and remove check_clean_start() here.
Run check_clean_start(startup_args) to verify that the index being tested was restored.

furszy commented at 1:46 pm on July 4, 2025:

If we know for sure, why are we running check_clean_start()?

I assume we are doing it to verify nothing else broke after deleting + restoring some data, which is mostly relevant for chain data. But you have a fair point. Pushed.

in test/functional/feature_init.py:124 in e36e0b1aeb outdated

146-            'indexes/blockfilter/basic/db/*.*': 'LevelDB error: Corruption',
147-            'indexes/coinstats/db/*.*': 'LevelDB error: Corruption',
148-            'indexes/txindex/*.log': 'LevelDB error: Corruption',
149-            'indexes/txindex/CURRENT': 'LevelDB error: Corruption',
150-            # Perturbing these files does not result in a startup error:
151-            # 'indexes/blockfilter/basic/*.dat', 'indexes/txindex/MANIFEST*', 'indexes/txindex/LOCK'

hodlinator commented at 7:36 pm on July 2, 2025:

This comment for the perturbed files is lost, unlike the one for the deleted files above.

furszy commented at 11:47 pm on July 2, 2025:

ups, reverting. Thanks.

in test/functional/feature_init.py:116 in e36e0b1aeb outdated

118+                'filepath_glob': 'blocks/index/*.ldb',
119+                'error_message': 'Error opening block database.',
120+                'startup_args': [],
121+            },
122+            {
123+                'filepath_glob': 'chainstate/*.ldb',

hodlinator commented at 7:53 pm on July 2, 2025:

nit: Could use @dataclass for slightly less stringly typed code, and tuple for immutability:

  0diff --git a/test/functional/feature_init.py b/test/functional/feature_init.py
  1index 6ee075f906..1913002962 100755
  2--- a/test/functional/feature_init.py
  3+++ b/test/functional/feature_init.py
  4@@ -9,6 +9,7 @@ import platform
  5 import shutil
  6 import signal
  7 import subprocess
  8+from dataclasses import dataclass, field
  9 
 10 from test_framework.test_framework import BitcoinTestFramework
 11 from test_framework.test_node import (
 12@@ -107,74 +108,55 @@ class InitTest(BitcoinTestFramework):
 13 
 14         self.log.info("Test startup errors after removing certain essential files")
 15 
 16-        deletion_rounds = [
 17-            {
 18-                'filepath_glob': 'blocks/index/*.ldb',
 19-                'error_message': 'Error opening block database.',
 20-                'startup_args': [],
 21-            },
 22-            {
 23-                'filepath_glob': 'chainstate/*.ldb',
 24-                'error_message': 'Error opening coins database.',
 25-                'startup_args': ['-checklevel=4'],
 26-            },
 27-            {
 28-                'filepath_glob': 'blocks/blk*.dat',
 29-                'error_message': 'Error loading block database.',
 30-                'startup_args': ['-checkblocks=200', '-checklevel=4'],
 31-            },
 32-            {
 33-                'filepath_glob': 'indexes/txindex/MANIFEST*',
 34-                'error_message': 'LevelDB error: Corruption: CURRENT points to a non-existent file',
 35-                'startup_args': ['-txindex=1'],
 36-            },
 37+ [@dataclass](/bitcoin-bitcoin/contributor/dataclass/)
 38+        class RoundInfo:
 39+            filepath_glob: str
 40+            error_message: str
 41+            startup_args: list[str] = field(default_factory=list)
 42+
 43+        deletion_rounds: tuple[RoundInfo, ...] = (
 44+            RoundInfo(filepath_glob='blocks/index/*.ldb',
 45+                      error_message='Error opening block database.'),
 46+            RoundInfo(filepath_glob='chainstate/*.ldb',
 47+                      error_message='Error opening coins database.',
 48+                      startup_args=['-checklevel=4']),
 49+            RoundInfo(filepath_glob='blocks/blk*.dat',
 50+                      error_message='Error loading block database.',
 51+                      startup_args=['-checkblocks=200', '-checklevel=4']),
 52+            RoundInfo(filepath_glob='indexes/txindex/MANIFEST*',
 53+                      error_message='LevelDB error: Corruption: CURRENT points to a non-existent file.',
 54+                      startup_args=['-txindex=1']),
 55             # Removing these files does not result in a startup error:
 56             # 'indexes/blockfilter/basic/*.dat', 'indexes/blockfilter/basic/db/*.*', 'indexes/coinstats/db/*.*',
 57             # 'indexes/txindex/*.log', 'indexes/txindex/CURRENT', 'indexes/txindex/LOCK'
 58-        ]
 59-
 60-        perturbation_rounds = [
 61-            {
 62-                'filepath_glob': 'blocks/index/*.ldb',
 63-                'error_message': 'Error loading block database.',
 64-                'startup_args': [],
 65-            },
 66-            {
 67-                'filepath_glob': 'chainstate/*.ldb',
 68-                'error_message': 'Error opening coins database.',
 69-                'startup_args': [],
 70-            },
 71-            {
 72-                'filepath_glob': 'blocks/blk*.dat',
 73-                'error_message': 'Corrupted block database detected.',
 74-                'startup_args': ['-checkblocks=200', '-checklevel=4'],
 75-            },
 76-            {
 77-                'filepath_glob': 'indexes/blockfilter/basic/db/*.*',
 78-                'error_message': 'LevelDB error: Corruption',
 79-                'startup_args': ['-blockfilterindex=1'],
 80-            },
 81-            {
 82-                'filepath_glob': 'indexes/coinstats/db/*.*',
 83-                'error_message': 'LevelDB error: Corruption',
 84-                'startup_args': ['-coinstatsindex=1'],
 85-            },
 86-            {
 87-                'filepath_glob': 'indexes/txindex/*.log',
 88-                'error_message': 'LevelDB error: Corruption',
 89-                'startup_args': ['-txindex=1'],
 90-            },
 91-            {
 92-                'filepath_glob': 'indexes/txindex/CURRENT',
 93-                'error_message': 'LevelDB error: Corruption',
 94-                'startup_args': ['-txindex=1'],
 95-            },
 96-        ]
 97+        )
 98+
 99+        perturbation_rounds: tuple[RoundInfo, ...] = (
100+            RoundInfo(filepath_glob='blocks/index/*.ldb',
101+                      error_message='Error loading block database.'),
102+            RoundInfo(filepath_glob='chainstate/*.ldb',
103+                      error_message='Error opening coins database.'),
104+            RoundInfo(filepath_glob='blocks/blk*.dat',
105+                      error_message='Corrupted block database detected.',
106+                      startup_args=['-checkblocks=200', '-checklevel=4']),
107+            RoundInfo(filepath_glob='indexes/blockfilter/basic/db/*.*',
108+                      error_message='LevelDB error: Corruption',
109+                      startup_args=['-blockfilterindex=1']),
110+            RoundInfo(filepath_glob='indexes/coinstats/db/*.*',
111+                      error_message='LevelDB error: Corruption',
112+                      startup_args=['-coinstatsindex=1']),
113+            RoundInfo(filepath_glob='indexes/txindex/*.log',
114+                      error_message='LevelDB error: Corruption',
115+                      startup_args=['-txindex=1']),
116+            RoundInfo(filepath_glob='indexes/txindex/CURRENT',
117+                      error_message='LevelDB error: Corruption',
118+                      startup_args=['-txindex=1']),
119+        )
120 
121         for round_info in deletion_rounds:
122-            file_patt = round_info['filepath_glob']
123-            err_fragment = round_info['error_message']
124-            startup_args = round_info['startup_args']
125+            file_patt = round_info.filepath_glob
126+            err_fragment = round_info.error_message
127+            startup_args = round_info.startup_args
128             target_files = list(node.chain_path.glob(file_patt))
129 
130             for target_file in target_files:
131@@ -195,9 +177,9 @@ class InitTest(BitcoinTestFramework):
132         self.log.info("Test startup errors after perturbing certain essential files")
133         dirs = ["blocks", "chainstate", "indexes"]
134         for round_info in perturbation_rounds:
135-            file_patt = round_info['filepath_glob']
136-            err_fragment = round_info['error_message']
137-            startup_args = round_info['startup_args']
138+            file_patt = round_info.filepath_glob
139+            err_fragment = round_info.error_message
140+            startup_args = round_info.startup_args
141 
142             for dir in dirs:
143                 shutil.copytree(node.chain_path / dir, node.chain_path / f"{dir}_bak")

furszy commented at 11:53 pm on July 2, 2025:

Thanks, but I’m about a ~0 here. I don’t really find that change better. I’ll just leave the current code unless someone feels strongly about changing it

hodlinator commented at 8:33 pm on July 2, 2025: contributor

Concept ACK e36e0b1aebe437c090ff897d27bbec0a7c5a100a

Concept: Theory for intermittent failure in #32600 makes sense, better to wait for all indexes before shutting down.

Testing: Successfully tested with the suggested diff modifying BaseIndex::Sync() to prevent syncing.

Approach: Left some nits, but most important thing I found was a comment that was probably accidentally removed.

test: feature_init, only init what's needed per perturbation/deletion round

Avoids initializing and syncing components not under test.
This not only speeds up execution a bit but also helps isolate
and debug issues more easily, as logs aren't flooded with
unrelated details.

abd07cf733

furszy force-pushed on Jul 3, 2025

furszy commented at 0:00 am on July 3, 2025: member

Updated per feedback. Thanks @hodlinator

DrahtBot added the label CI failed on Jul 3, 2025

DrahtBot removed the label CI failed on Jul 3, 2025

theStack commented at 3:19 pm on July 3, 2025: contributor

Concept ACK

test: feature_init, ensure indexes are synced prior to perturbing files 4207d9bf82

furszy force-pushed on Jul 4, 2025

hodlinator approved

hodlinator commented at 12:26 pm on July 7, 2025: contributor

ACK 4207d9bf823bea9f94b484ebf3c9274ca781b245

Changes since my concept review (https://github.com/bitcoin/bitcoin/pull/32835#issuecomment-3029241989):

Addressed my feedback (https://github.com/bitcoin/bitcoin/pull/32835#pullrequestreview-2979101334)

(Would still prefer less stringly typed code, but have a feeling my suggestion in #32835 (review) could probably be improved).

DrahtBot requested review from TheCharlatan on Jul 7, 2025

DrahtBot requested review from theStack on Jul 7, 2025

maflcko commented at 2:04 pm on July 7, 2025: member

lgtm ACK 4207d9bf823bea9f94b484ebf3c9274ca781b245 🍄

Signature:

0untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
1RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
2trusted comment: lgtm ACK 4207d9bf823bea9f94b484ebf3c9274ca781b245 🍄
32EXDV5heWoQfQZkpGwQh96nSo8LQOg6o+F9Pz8zWpupTdDL71GjRRQw5M7JI1Td094wyEHaiN9RTI4W/Cj9MBQ==

achow101 commented at 0:09 am on July 8, 2025: member

ACK 4207d9bf823bea9f94b484ebf3c9274ca781b245

achow101 merged this on Jul 8, 2025

achow101 closed this on Jul 8, 2025

furszy deleted the branch on Jul 8, 2025

test: fix feature_init.py intermittencies #32835

Code Coverage & Benchmarks

Reviews

Conflicts