Fixes: #23099
Handle disk-full more gracefully in functional tests
Fixes: #23099
Handle disk-full more gracefully in functional tests
<!--e57a25ab6845829454e8d69fc972939a-->
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
<!--006a51241073e994b41acfe9ec718e94-->
For detailed information about the code coverage, see the test coverage report.
<!--021abf342d371248e50ceaed478a90ca-->
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | cbergqvist, tdb3, itornaza, achow101 |
| Concept ACK | epiccurious |
| Stale ACK | maflcko, kevkevinpal, davidgumberg |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
<!--174a7506f384e20aa4161008e828411d-->
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
29 | @@ -30,6 +30,9 @@ 30 | 31 | os.environ["REQUIRE_WALLET_TYPE_SET"] = "1" 32 | 33 | +# Should fail testing and immediately stop if free space falls below. 34 | +MIN_FREE_SPACE = (50 + 16) * 1024 * 1024
Where comes 50+16 from?
I based the values based on the discussions on #23100 (comment) and #23100 (review) and
What is the total size of the directory when no test dir is cleaned up and all tests are run?
The total size of the generated testdir after running ./test_runner.py --nocleanup is 11GB on my system
See the list of the generated folders and their sizes on the attached file below:
sudo du -h /tmp/test_runner_₿_🏃_20240130_125641/ > ~/tmp-test_runner_₿_🏃_20240130_125641-test-files.txt
1.1G /tmp/test_runner_₿_🏃_20240130_125641/feature_block_289
So this value seems wrong either way? How can 66 MB be enough to fit 1.1GB, or even 11GB?
Looks like I need at least 1.25GB (sudo mount -t tmpfs -o size=1250M none /tmp) to run the tests on my system with no errors
I have plotted a map using gnuplot from the data collected tracking the size of the tmpdir when running the tests using:
watch -n 10 "sudo du -s /tmp/test_runner_₿_🏃_20240130_153709/ | awk '{print strftime(\"%H:%M:%S\"), int(\$1/1024)}' | tee -a ~/watch-tee1.log"
641 | @@ -639,6 +642,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 642 | logging.debug("Early exiting after test failure") 643 | break 644 | 645 | + testdir_disk_usage_stats = shutil.disk_usage(testdir) 646 | + if (testdir_disk_usage_stats.free < MIN_FREE_SPACE): 647 | + logging.debug("Not space enough on device: %s" % testdir)
nit: In python the () are not needed. Also, could use f-strings?
fixed
641 | @@ -639,6 +642,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 642 | logging.debug("Early exiting after test failure") 643 | break 644 | 645 | + testdir_disk_usage_stats = shutil.disk_usage(testdir)
I wonder if the check could be done upfront, to not even run any test, if it is unclear whether the device will even hold enough to run the tests.
moved the check up run_checks(...)
Concept ACK ebb842a
576 | @@ -574,6 +577,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 577 | if not result.wasSuccessful(): 578 | sys.exit("Early exiting after failure in TestFramework unit tests") 579 | 580 | + # Exit the test if there is not enough space on the testing dir 581 | + testdir_disk_usage_stats = shutil.disk_usage(tmpdir) 582 | + if (testdir_disk_usage_stats.free < MIN_FREE_SPACE): 583 | + sys.exit(f"Tests aborted: Insufficient space available on the device. Directory: {tmpdir}")
I think we can reword this to
Tests aborted: Insufficient space available in directory: {tmpdir}
Since the device may have space but the tmpdir we are using might have insufficient space
Tested ACK.
Aborts as expected with low disk space.
Temporary test directory at /tmp/test_runner_₿_🏃_20240131_100802
Running Unit Tests for Test Framework Modules
.....................
----------------------------------------------------------------------
Ran 21 tests in 22.349s
OK
Tests aborted: Insufficient space available in directory: /tmp/test_runner_₿_🏃_20240131_100802
Right now, the test aborts for low disk space after running the Unit Tests for Test Framework Modules.
Instead, does it make sense to abort for low disk space before?
@epiccurious moved aborting the tests before the Unit Tests for Test Framework Modules.
561 | @@ -559,6 +562,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 562 | if os.path.isdir(cache_dir): 563 | print("%sWARNING!%s There is a cache directory here: %s. If tests fail unexpectedly, try deleting the cache directory." % (BOLD[1], BOLD[0], cache_dir)) 564 | 565 | + # Exit the test if there is not enough space on the testing dir 566 | + testdir_disk_usage_stats = shutil.disk_usage(tmpdir) 567 | + if (testdir_disk_usage_stats.free < MIN_FREE_SPACE):
if testdir_disk_usage_stats.free < MIN_FREE_SPACE:
nit: no need for () in python
fixed
561 | @@ -559,6 +562,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 562 | if os.path.isdir(cache_dir): 563 | print("%sWARNING!%s There is a cache directory here: %s. If tests fail unexpectedly, try deleting the cache directory." % (BOLD[1], BOLD[0], cache_dir)) 564 | 565 | + # Exit the test if there is not enough space on the testing dir 566 | + testdir_disk_usage_stats = shutil.disk_usage(tmpdir) 567 | + if testdir_disk_usage_stats.free < MIN_FREE_SPACE:
# Exit the test if there is not enough space on the testing dir
if shutil.disk_usage(tmpdir).free < MIN_FREE_SPACE:
nit: can just inline this?
fixed
lgtm
661 | @@ -655,7 +662,7 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 662 | 663 | all_passed = all_passed and coverage_passed 664 | 665 | - # Clean up dangling processes if any. This may only happen with --failfast option. 666 | + # Clean up dangling processes if any. This may only happen with --failfast option or not enough space on test directory.
This change is wrong, no?
In general, please pay attention to the details yourself, before and after pushing the changes. The goal of review is not to write the changeset for you line-by-line, but to double check a finished, complete changeset for any accidental oversights.
fixed
review ACK 16baf8651d17ac1e0c7d7fd3526c19ad018ceb9c
tested ACK 16baf86
This may be out of scope for this PR but I have filled up an SD card almost full and when I run the following,
./test/functional/test_runner.py -t /Volumes/32GBSDCARD/
Temporary test directory at /Volumes/32GBSDCARD//test_runner_₿_🏃_20240222_215044
Tests aborted: Insufficient space available in directory: /Volumes/32GBSDCARD//test_runner_₿_🏃_20240222_215044
./test/functional/mempool_accept.py --tmpdir /Volumes/32GBSDCARD/
I get the following error
Traceback (most recent call last):
File "/Users/kevkevin/DEVDIR/bitcoin/test/functional/test_framework/test_framework.py", line 130, in main
self.setup()
File "/Users/kevkevin/DEVDIR/bitcoin/test/functional/test_framework/test_framework.py", line 266, in setup
os.makedirs(self.options.tmpdir, exist_ok=False)
File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: '/Volumes/32GBSDCARD'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kevkevin/DEVDIR/bitcoin/test/functional/test_framework/test_framework.py", line 148, in main
self.log.exception("Unexpected exception caught during testing")
^^^^^^^^
AttributeError: 'MempoolAcceptanceTest' object has no attribute 'log'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kevkevin/DEVDIR/bitcoin/./test/functional/mempool_accept.py", line 376, in <module>
MempoolAcceptanceTest().main()
File "/Users/kevkevin/DEVDIR/bitcoin/test/functional/test_framework/test_framework.py", line 154, in main
exit_code = self.shutdown()
^^^^^^^^^^^^^^^
File "/Users/kevkevin/DEVDIR/bitcoin/test/functional/test_framework/test_framework.py", line 304, in shutdown
if self.success == TestStatus.FAILED and self.options.pdbonfailure:
^^^^^^^^^^^^
AttributeError: 'MempoolAcceptanceTest' object has no attribute 'success'
df -h
/dev/disk8s1 30Gi 29Gi 160Mi 100% 2.5k 1.6M 0% /Volumes/32GBSDCARD
@kevkevinpal --tmpdir must not exist. In python this is expressed via exist_ok=False. Feel free to fixup the docstring with that. In any case, --tmpdir is different from --tmpdirprefix.
29 | @@ -30,6 +30,9 @@ 30 | 31 | os.environ["REQUIRE_WALLET_TYPE_SET"] = "1" 32 | 33 | +# Should fail testing and immediately stop if free space falls below. 34 | +MIN_FREE_SPACE = 1250 * 1024 * 1024
Thanks for the monitoring script you provided earlier @BrandonOdiwuor! Was able to get disk usage almost at 5GB using:
$ ./test/functional/test_runner.py --cachedir=/mnt/tmp/cache --tmpdir=/mnt/tmp --extended --jobs=16
So maybe some heuristic could be applied when running many jobs in parallel? Default number of jobs is 4, so maybe something like
MIN_FREE_SPACE = max(jobs, 4) * 320 * 1024 * 1024
That way we get 1280 * 1024 * 1024 for 4 jobs or less, and 5120 * 1024 * 1024 for 16 jobs.
Not sure how much logic this deserves. There is also --nocleanup, see #29335 (review), which should be the max
561 | @@ -559,6 +562,10 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 562 | if os.path.isdir(cache_dir): 563 | print("%sWARNING!%s There is a cache directory here: %s. If tests fail unexpectedly, try deleting the cache directory." % (BOLD[1], BOLD[0], cache_dir)) 564 | 565 | + # Exit the test if there is not enough space on the testing dir 566 | + if shutil.disk_usage(tmpdir).free < MIN_FREE_SPACE: 567 | + sys.exit(f"Tests aborted: Insufficient space available in directory: {tmpdir}")
Since the MIN_FREE_SPACE value is a bit of a guess, it might make sense to either make this into a VERY LOUD non-fatal warning, or add an arg like --non-fatal-disk-space-check to make it into a warning. That way people running in a bit resource-constrained environments and/or only interested in a specific low-disk-usage test don't need to edit the MIN_FREE_SPACE to be able to run their tests.
Seems fine to turn it into a warning.
@maflcko I have converted this to a warning as suggested
ACK 951a004!
I managed to create a disk image with available space just above MIN_FREE_SPACE such that the warning is bypassed when running test_runner.py but one of the tests (feature_signet.py) fails with error:
raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr Warning: Disk space for "/Volumes/Untitled/test_runner_₿_🏃_20240227_223648/feature_signet_91/node2/signet/blocks" may not accommodate the block files. Approximately 1 GB of data will be stored in this directory. !=
All of the tests pass for me when there's about 1.35 GB of space available, so 1280*1024*1024 would probably work better.
However, we've already acknowledged that this variable is sort of just a guess, so I'm not sure how critical it is to get exactly right. I'm also not sure how different these tests' disk space usage is across systems/machines. The fact that it's shown only as a warning makes sense then.
nit: I think the warning could be louder. The check above uses bold in its warning. Maybe this one can follow suit?
I think it would be good to stop the test runner once we actually run into an out of disk space error
Something like this:
while i < test_count:
if failfast and not all_passed:
break
for test_result, testdir, stdout, stderr, skip_reason in job_queue.get_next():
test_results.append(test_result)
i += 1
done_str = "{}/{} - {}{}{}".format(i, test_count, BOLD[1], test_result.name, BOLD[0])
if test_result.status == "Passed":
logging.debug("%s passed, Duration: %s s" % (done_str, test_result.time))
elif test_result.status == "Skipped":
logging.debug(f"{done_str} skipped ({skip_reason})")
else:
all_passed = False
print("%s failed, Duration: %s s\n" % (done_str, test_result.time))
print(BOLD[1] + 'stdout:\n' + BOLD[0] + stdout + '\n')
print(BOLD[1] + 'stderr:\n' + BOLD[0] + stderr + '\n')
if combined_logs_len and os.path.isdir(testdir):
pass #[omitted]
+ if stderr.strip().split('\n')[-1] == "OSError: [Errno 28] No space left on device":
+ logging.debug(f"Early exiting after test failure due to disk running out of space in {tmpdir}")
+ break
+
if failfast:
logging.debug("Early exiting after test failure")
break
We could also avoid the question of picking a good number for warning if we used this approach without printing any warning, while still solving the underlying issue of stopping the test runner from continuing to dump test failures and errors that are not obviously related to being out of disk space, and having an opportunity to communicate the reason for failure to the user.
I am not opposed to the warning message, but if we keep it I think it should
there may be insufficient free space in {tmpdir} to run the bitcoin functional test suite. Running the test suite with fewer than {amt} of free space might cause tests to fail.reACK 951a00440539ba1faba62c49c9cf0a7d8c950769
Tested with --failfast and normally and both seem to be working properly with the correct error message
cc @maflcko could you have another look at this
@BrandonOdiwuor As the pull request author, the burden is on you to read comments and reply to them, if they have questions.
29 | @@ -30,6 +30,9 @@ 30 | 31 | os.environ["REQUIRE_WALLET_TYPE_SET"] = "1" 32 | 33 | +# Minimum amount of space to run the tests. 34 | +MIN_FREE_SPACE = 1250 * 1024 * 1024
When a high enough parallelism is used, this is not enough.
It may be better to scale this based on the number of jobs that will be run in parallel. Could put a lower bound of 1.1 GB for feature_block which is an outlier, then start increasing by 100 MB for each additional job that will be run.
Also, if --nocleanup is provided, we'd probably want to warn at the size where nothing is cleaned up.
@achow101 could you elaborate more on this
Also, if --nocleanup is provided, we'd probably want to warn at the size where nothing is cleaned up.
I have updated the PR as suggested:
When a high enough parallelism is used, this is not enough.
It may be better to scale this based on the number of jobs that will be run in parallel. Could put a lower bound of 1.1 GB for feature_block which is an outlier, then start increasing by 100 MB for each additional job that will be run.
updated the PR to include the check for minimum space required for --nocleanup
I agree that if a test fails out of disk space, then we should stop running any further tests as well.
I think it would be useful for each test to also have a disk space check, and that could even have its own exit code that the test runner can use to detect when a test failed because of low disk space.
@achow101 @davidgumberg would it be a good idea to do both i.e include the check with a warning at the beginning and also check after every test is run if any fails then stop running any further test?
@achow101 @davidgumberg would it be a good idea to do both i.e include the check with a warning at the beginning and also check after every test is run if any fails then stop running any further test?
Yes, I think so.
I have updated the PR to include both the warning at the start and an exit if any tests fail due to insufficient space as suggested above
tested ACK 2f7fcd5fe7289f7ad3efd566dbbd725264654f8f
The user is notified when appropriate that there might be insufficient free space to run the tests. Moreover, if a test fails due to insufficient space, the functional test runner halts. Also, there is a relevant warning when the --nocleanup flag is used and hence the free disk requirements get bigger.
Note: All tests run on macos 14.4.1 (23E224)
--with-incompatible-bdb and --enable-suppress-external-warningsmake check and all tests pass--extended and all tests pass as wellUsed -j 13 for the number of jobs to be run in parallel since the M2 Pro Max chip supports them. Calculated the required space to run the tests with the 13 parallel jobs as described in the pr:
1.1 * 1024 + (13 - 1) * 100 MB = 2326 MB
and reated an image at /Volumes/tmp with the a little less space set to 2.2 GB, with the df -h /Volumes/tmp command returning:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk4s2 2.2Gi 16Mi 2.2Gi 1% 4 4.3G 0% /Volumes/tmp
Running the functional tests with ./test/functional/test_runner.py -j13 -t /Volumes/tmp I get the warning up front and all tests are executed.
Temporary test directory at /Volumes/tmp/test_runner_₿_🏃_20240402_192153
WARNING! There may be insufficient free space in /Volumes/tmp/test_runner_₿_🏃_20240402_192153 to run the Bitcoin functional test suite. Running the test suite with fewer than 2326.0 MB of free space might cause tests to fail.
Kept reducing the available space in the /Volumes/tmp partition, and a test eventually fails at 1.8 GB of free space
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk4s1 1.8Gi 12Mi 1.8Gi 1% 13 4.3G 0% /Volumes/tmp
with the following error:
Early exiting after test failure due to disk running out of space in /Volumes/tmp/test_runner_₿_🏃_20240402_202655
--nocleanupTesting the with the ./test/functional/test_runner.py -j13 -t /Volumes/tmp --nocleanup requirement for a minimum of 12 GB, with a partition of 11 GB and the df -h displaying:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk4s2 11Gi 27Mi 11Gi 1% 4 4.3G 0% /Volumes/tmp
I get the expected insufficient free space warning and all tests are executed.
WARNING! There may be insufficient free space in /Volumes/tmp/test_runner_₿_🏃_20240402_203805 to run the functional test suite with --nocleanup. A minimum of 12 GB of free space is required.
Tested down to at 5.5 GB and eventually a test fails halting the test runner as well.
Early exiting after test failure due to disk running out of space in /Volumes/tmp/test_runner_₿_🏃_20240402_210436
666 | @@ -646,6 +667,9 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 667 | logging.debug("Early exiting after test failure") 668 | break 669 | 670 | + if "[Errno 28] No space left on device" in stdout: 671 | + sys.exit(f"Early exiting after test failure due to disk running out of space in {tmpdir}")
nit: The directory is left full upon exit. The user would need to clean it up before re-running tests. Having the user manually clean up the directory seems reasonable to me (the user may want to see actions leading to the failure). An enhanced error message adds value.
diff --git a/test/functional/test_runner.py b/test/functional/test_runner.py
index dd3ea370bd..f08eb63b21 100755
--- a/test/functional/test_runner.py
+++ b/test/functional/test_runner.py
@@ -668,7 +668,9 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage=
break
if "[Errno 28] No space left on device" in stdout:
- sys.exit(f"Early exiting after test failure due to disk running out of space in {tmpdir}")
+ sys.exit(f"Early exiting after test failure due to insuffient free space in {tmpdir}\n"
+ f"Test execution data left in {tmpdir}.\n"
+ f"Additional storage is needed to execute testing.")
print_results(test_results, max_len_name, (int(time.time() - start_time)))
updated the exit error message
Nice work. Adding free storage awareness is a great addition.
ACK for 2f7fcd5fe7289f7ad3efd566dbbd725264654f8f. Left a nit inline.
Used a small ramdisk (1.2GB) with high jobs count (16) to force space exhaustion. As expected, received the initial warning Running the test suite with fewer than 2626.0 MB of free space might cause tests to fail. and also received the exiting error Early exiting after test failure due to disk running out of space...
Could you please take another look at this after incorporating the suggested changes @achow101 @maflcko @kevkevinpal @davidgumberg @epiccurious @tdb3 @itornaza
tested re-ACK for 858fa78637041ae704005d4b6e564dd8245f4b5d
--with-incompatible-bdb and --enable-suppress-external-warningsmake check and all tests pass--extended and all tests pass as wellRan the functional tests with the following options:
./test/functional/test_runner.py -t /Volumes/ramdisk with a ramdisk of 512 MB./test/functional/test_runner.py -t /Volumes/ramdisk --nocleanup with a ramdisk of 4 GBThe expected warnings are displayed upfront:
WARNING! There may be insufficient free space in /Volumes/ramdisk/test_runner_₿_🏃_20240411_211144 to run the Bitcoin functional test suite. Running the test suite with fewer than 1426.0 MB of free space might cause tests to fail.
and for the --nocleanup case:
WARNING! There may be insufficient free space in /Volumes/ramdisk/tmp/test_runner_₿_🏃_20240411_213828 to run the functional test suite with --nocleanup. A minimum of 12 GB of free space is required.
eventually when a test fails due to lack of disk space, the test harness exits and the new error message is displayed and is indeed much more informative!
Early exiting after test failure due to insuffient free space in /Volumes/ramdisk/tmp/test_runner_₿_🏃_20240411_212332
Test execution data left in /Volumes/ramdisk/tmp/test_runner_₿_🏃_20240411_212332.
Additional storage is needed to execute testing.
re-ACK for 858fa78637041ae704005d4b6e564dd8245f4b5d
Thanks for incorporating the suggestion.
Used a small ramdisk (1.2GB) with high jobs count (20) to force space exhaustion. As expected, received the initial warning Running the test suite with fewer than 3026.0 MB of free space might cause tests to fail. and also received the exiting error Early exiting after test failure due to disk running out of space...Test execution data left in ...Additional storage is needed to execute testing with the new text added.
reACK https://github.com/bitcoin/bitcoin/pull/29335/commits/858fa78637041ae704005d4b6e564dd8245f4b5d
Looks great, tested on a tmpfs 1 GiB in size, and the test runner prints the intended warning, and fails immediately when a test fails due to disk full.
666 | @@ -646,6 +667,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage= 667 | logging.debug("Early exiting after test failure") 668 | break 669 | 670 | + if "[Errno 28] No space left on device" in stdout: 671 | + sys.exit(f"Early exiting after test failure due to insuffient free space in {tmpdir}\n" 672 | ++ f"Test execution data left in {tmpdir}.\n" 673 | ++ f"Additional storage is needed to execute testing.")
In 858fa78637041ae704005d4b6e564dd8245f4b5d "test: Handle functional test disk-full error"
The +s at the beginning of the lines are unecessary.
f"Test execution data left in {tmpdir}.\n"
f"Additional storage is needed to execute testing.")
fixed
ACK 7c0c599f1e6d412b3cb92cd67aa46644a8286cb6
reACK 357ad110548d726021547d85b5b2bfcf3191d7e3. Looks good!
Tested on RAM disk of 1GB, 1430MB (doas mount -t tmpfs -o size=1430m tmpfs /mnt/tmp/) and varied the --jobs and --nocleanup, observing expected warnings + error on disk full.
re-ACK 357ad110548d726021547d85b5b2bfcf3191d7e3
Ran again all the tests as in #29335 (comment) and I get the same results as expected on disk space in the cases it is being insufficient.
reACK 357ad110548d726021547d85b5b2bfcf3191d7e3