Abort functional tests early upon low drive space #23100

katesalazar commented at 3:20 PM on September 26, 2021: contributor

This change rushes a fix for #23099.

This approach works for test/functional/test_runner.py in my system, changes to test/fuzz/test_runner.py are untested at the time of writing (I don't know how to run this command).

DrahtBot added the label Tests on Sep 26, 2021

laanwj commented at 10:49 AM on September 27, 2021: member

Concept ACK

in test/fuzz/test_runner.py:189 in 74524705a1 outdated

 182 | @@ -182,6 +183,11 @@ def main():
 183 |              use_valgrind=args.valgrind,
 184 |          )
 185 |  
 186 | +        shutil_disk_usage_stat_ = shutil.disk_usage(args.corpus_dir)
 187 | +        if shutil_disk_usage_stat_.free < 1:
 188 | +            logging.debug("No space left on device")
 189 | +            sys.exit(-1)

maflcko commented at 10:55 AM on September 27, 2021:

Is this needed? Pretty sure libFuzzer will print this error by itself.

Also, the fuzz inputs usually do not reside in the /tmp dir, which was the motivation for this patch?

in test/functional/test_runner.py:548 in b4aa309a73 outdated

 543 | @@ -544,6 +544,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage=
 544 |                  logging.debug("Early exiting after test failure")
 545 |                  break
 546 |  
 547 | +            shutil_disk_usage_stat_ = shutil.disk_usage(testdir)
 548 | +            if shutil_disk_usage_stat_.free < 1:

maflcko commented at 3:51 PM on September 27, 2021:

I think this can be bumped to 16 M, because it is not possible to create a datadir that is smaller than that.

The first block file will pre-allocate 16M:

16M	/tmp/aaa/regtest/blocks/blk00000.dat

katesalazar commented at 3:09 AM on September 28, 2021:

We cannot apply that, because that could make the output line "no free space left on device" be shown when that's not really the failure reason.

Though I've applied your idea in a different way.

in test/functional/test_runner.py:550 in b4aa309a73 outdated

 543 | @@ -544,6 +544,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage=
 544 |                  logging.debug("Early exiting after test failure")
 545 |                  break
 546 |  
 547 | +            shutil_disk_usage_stat_ = shutil.disk_usage(testdir)
 548 | +            if shutil_disk_usage_stat_.free < 1:
 549 | +                logging.debug("No space left on device")
 550 | +                sys.exit(-1)

maflcko commented at 3:53 PM on September 27, 2021:

This will leave dangling processes

katesalazar commented at 4:14 PM on September 27, 2021:

~~Unable to reproduce on my system when current test is feature_taproot.py --previous_release when execution flow reaches the line (don't know what other tests were running concurrently).~~

in test/functional/test_runner.py:32 in cb735b0a38 outdated

  26 | @@ -27,6 +27,15 @@
  27 |  import logging
  28 |  import unittest
  29 |  
  30 | +# Should fail testing and immediately stop if free space falls below.
  31 | +# This would be checked before running any and every test.
  32 | +# TODO consider increasing to 16 * 1024 * 1024 (https://github.com/bitcoin/bitcoin/pull/23100#discussion_r716825510)

maflcko commented at 1:22 PM on September 29, 2021:

why add a todo?

katesalazar commented at 1:36 PM on September 29, 2021:

because you wrote you think a test requires 16M

a todo is the best I can add here about that

maflcko commented at 1:40 PM on September 29, 2021:

I think it would be better to change the number instead of adding a TODO to change the number

katesalazar commented at 1:46 PM on September 29, 2021:

I think so as well, but I can't personally commit it

maflcko commented at 1:52 PM on September 29, 2021:

why?

katesalazar commented at 1:55 PM on September 29, 2021:

because I don't trust you, because I don't even know you, because one should only commit code one can take ownership of, because I can't take ownership of an assertion you seem not to be even sure of, because I would lie by committing it!

maflcko commented at 2:14 PM on September 29, 2021:

You don't have to trust me. The value is in the source code BLOCKFILE_CHUNK_SIZE, or you can test it yourself: #23100 (review)

Even if wasn't, what is the worst thing that could happen by increasing the number?

katesalazar commented at 5:12 PM on September 29, 2021:

The value is in the source code BLOCKFILE_CHUNK_SIZE, or you can test it yourself

I would have a look, thanks.

I was hoping that many of the >200 functional tests would not strictly require that a large space reserve. 🙄

Even if wasn't, what is the worst thing that could happen by increasing the number?

In case the constant didn't apply to all tests, it would result on unnecessary test halting for maybe successfully passable tests, and even maybe aborting a test run able to end successfully.

katesalazar commented at 5:48 PM on September 30, 2021:

I have had a casual look, I'd say there are many functional tests ending and being removed upon success way before reaching 16MiB, I'm feeling like removing MIN_FREE_SPACE_BEFORE_TEST. 🤔

katesalazar commented at 6:18 AM on October 1, 2021:

There's even some functional tests occupying only KiBs and in a blink of an eye it's successful and gone.

In a few days I'd be removing MIN_FREE_SPACE_BEFORE_TEST.

meshcollider commented at 2:24 AM on October 4, 2021: contributor

@katesalazar just a few notes:

Please squash your commits
MIN_FREE_SPACE_BEFORE_TEST is currently unused, so that needs to be corrected (unless you decide to remove it - but it think it is better to check there is enough space before starting each test, rather than checking after each).
I agree with Marco's comments about the 16M (and note that he is the test framework maintainer, so please do take his advice seriously 🙂)

katesalazar commented at 5:19 AM on October 4, 2021: contributor

@meshcollider wrote:

I agree with Marco's comments about the 16M (and note that he is the test framework maintainer, so please do take his advice seriously 🙂)

Does agreeing come simply from trusting Marco, or does it come from real positive a posteriori knowledge of the technology, you have?

maflcko commented at 7:03 AM on October 4, 2021: member

Again, no one asked you to trust me. Everything is in the source code, which you can read or execute yourself: #23100 (review)

katesalazar commented at 8:29 AM on October 4, 2021: contributor

You keep not proving your point, I am not applying your suggestion myself. It's better you commit your unproven magic numbers under your name in a subsequent PR or an alternative PR.

fanquake commented at 6:02 AM on October 5, 2021: member

I am not applying your suggestion myself. It's better you commit your unproven magic numbers under your name

That's fine, you don't have to do anything you don't want to. However, this Pull Request is not going to be merged with the review comments left unaddressed, or the TODO left in the code. So you either need to address everything, or you can close this PR.

katesalazar commented at 6:45 AM on October 5, 2021: contributor

I've watched example_test.py failing to run successfully, apparently refusing to start under ~10M free space. It's likely I'll investigate and end up adding Marco's suggestion.

katesalazar commented at 7:10 AM on October 5, 2021: contributor

In:

https://github.com/bitcoin/bitcoin/blob/9e530c6352c3e3d4f2936bbbb1bcb34ff9ca6378/src/util/system.cpp#L144-L150

katesalazar commented at 7:17 AM on October 5, 2021: contributor

I think it can refuse to boot with 'Disk space too low' with more than 16M free, tried using 30M, it seems it refused to boot.

katesalazar renamed this:
~~Abort tests early upon filling the test logs drive~~
Abort functional tests early upon low drive space
on Oct 5, 2021

katesalazar force-pushed on Oct 5, 2021

in test/functional/test_runner.py:31 in 471b970ec5 outdated

  26 | @@ -27,6 +27,9 @@
  27 |  import logging
  28 |  import unittest
  29 |  
  30 | +# Should fail testing and immediately stop if free space falls below.
  31 | +MIN_FREE_SPACE = 16 * 1024 * 1024  # Could be larger.

maflcko commented at 8:56 AM on October 5, 2021:

Instead of a comment "Could be larger" it could make sense to explain why this was chosen or where the magic number is from.

Also, why are you not choosing the value from constexpr min_disk_space?

katesalazar force-pushed on Oct 5, 2021

in test/functional/test_runner.py:582 in 86c66206bb outdated

 550 | @@ -546,6 +551,12 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage=
 551 |                  logging.debug("Early exiting after test failure")
 552 |                  break
 553 |  
 554 | +            shutil_disk_usage_stat_ = shutil.disk_usage(testdir)
 555 | +            if (shutil_disk_usage_stat_.free < MIN_FREE_SPACE
 556 | +                    and i < test_count - jobs):
 557 | +                logging.debug("No space left on device to continue testing")
 558 | +                break

maflcko commented at 9:34 AM on October 5, 2021:

The comment below " # Clean up dangling processes if any. This may only happen with --failfast option." will need to be updated

katesalazar commented at 9:44 AM on October 5, 2021:

Yes! I will just let the CI build finish successfully before changing it.

in test/functional/test_runner.py:32 in f2b6419b24 outdated

  26 | @@ -27,6 +27,11 @@
  27 |  import logging
  28 |  import unittest
  29 |  
  30 | +# Should (in general but not always) fail testing and immediately stop if free space falls below.
  31 | +# Constant value 50M comes from `bool CheckDiskSpace(2)` at
  32 | +# `src/util/system.h` via uses in `src/node/blockstorage.cpp`.

fanquake commented at 5:33 AM on October 7, 2021:

Please don't put source file names / C++ implementation details here, as they are likely to just become outdated in regards to the code. Also, comments like "in general but not always" are not really useful, unless they are elaborated on (and that would be overkill here), as the reader is just left wondering about the "but not always" conditions.

meshcollider commented at 12:08 PM on October 8, 2021: contributor

Please squash your commits again :)

It is good practice to squash after every fixup so that the PR is immediately ready to review and merge. Otherwise, squashing later will invalidate the ACKs on that commit hash.

yanmaani commented at 3:26 PM on October 8, 2021: none

Conceptually it'd be cleaner if you could detect Bitcoin Core complaining about free space once and then abort the test, but I don't know if that's feasible.

katesalazar commented at 8:52 PM on October 8, 2021: contributor

It is good practice to squash after every fixup so that the PR is immediately ready to review and merge. Otherwise, squashing later will invalidate the ACKs on that commit hash.

Understood and grabbed this advice for urgent changes. Which you won't see me taking care of in the next maybe 5 or 10 years, for reasons.

I'll squash on demand, because that's the rules. But squashing is a lossy* process, on Git and on GitHub, and I don't think I should be doing it lightly.

* Let's imagine for a while the references log does not exist.

Conceptually it'd be cleaner if you could detect Bitcoin Core complaining about free space once and then abort the test, but I don't know if that's feasible.

Could spawn a thread which monitors outputs looking for "Disk space too low". Probably that output and more cases, "No space left on drive" (IDK exactly)...

That is a great idea if a test can take a very long time to naturally abort once is theoretically clear it will fail. I would doubt that's the case, but IDK.

katesalazar force-pushed on Oct 8, 2021

katesalazar commented at 9:07 PM on October 8, 2021: contributor

BTW I very much appreciate your patience with my bi-polar bear things, thank you.

katesalazar force-pushed on Oct 9, 2021

katesalazar commented at 3:19 PM on October 9, 2021: contributor

dc439011d155033dec4ad0af2059808ce9b8cd73 is an amend of (only) the commit message of 646029c4365bc921a84cf8952200d81f6e68b1b4 (which passed CI checks).

meshcollider commented at 10:25 AM on October 10, 2021: contributor

But squashing is a lossy* process, on Git and on GitHub, and I don't think I should be doing it lightly.

You should, because in general people don't review PRs that look WIP - and unsquashed fixup commits immediately signal WIP. PRs are required to be clean and tidy for merge. There's no need to worry about loss because the new commit should always be an improvement over the old. Feel free to keep a backup of your old commits on a separate branch locally though if you want to.

meshcollider commented at 10:29 AM on October 10, 2021: contributor

utACK dc439011d155033dec4ad0af2059808ce9b8cd73

Thanks also for your patience with the review process :)

theStack commented at 1:41 PM on October 11, 2021: contributor

Concept ACK

It happened more than once to me that /tmp got clogged up with left-overs of failed test runs that I forgot to clean, so I'm supporting an early low-space detection. Will test later.

theStack commented at 3:22 PM on October 12, 2021: contributor

FWIW, on my system, functional tests start to fail already when less than ~67MB are available on the /tmp partition. This matches the code in CheckDiskSpace which needs min_disk_space (50MB) plus the value additional_bytes that is passed to this function, which in turn are the 16MB mentioned above for first block pre-allocation.

katesalazar commented at 3:33 PM on October 12, 2021: contributor

FWIW, on my system, functional tests start to fail already when less than ~67MB are available on the /tmp partition. This matches the code in CheckDiskSpace which needs min_disk_space (50MB) plus the value additional_bytes that is passed to this function, which in turn are the 16MB mentioned above for first block pre-allocation.

I never tested those conditions, but really looked like it should behave the way you describe.

katesalazar commented at 4:39 PM on October 12, 2021: contributor

Let's set MIN_FREE_SPACE to ((50 + 16) * 1024 * 1024).

katesalazar commented at 7:58 PM on October 13, 2021: contributor

A ((50 + 16) * 1024 * 1024) value should be better. I'll wait a few days and if there's no opposition I'll submit such change and report it to ACKer/s.

Thanks @theStack.

katesalazar commented at 7:13 AM on October 19, 2021: contributor

@meshcollider wrote:

utACK dc43901

I pushed 4b41e9d7745a6856263f on top of dc439011d155033dec.

I don't know what's the software allowing to push an 4b41e9d7745a6856263f ACK (would that be the case) on top of an dc439011d155033dec ACK, more importantly I don't know what's the software allowing to automatically drafting a replay of an 4b41e9d7745a6856263f ACK (would that be the case) on the rev resulting from squashing 4b41e9d7745a6856263f in dc439011d155033dec.

As the code should be the same and only the commit message should have to be reviewed, a tool should automatically draft such ACK on the squashed rev for accelerating the review process.

If that's a manual process as I suspect, then this tool fails to provide the features this project needs.

fanquake commented at 7:16 AM on October 19, 2021: member

@katesalazar I don't understand your previous comment. In any case, you need to squash your commits. You should just be doing that by default.

katesalazar commented at 7:28 AM on October 19, 2021: contributor

@fanquake wrote:

I don't understand your previous comment.

Don't worry, I'm surely missing some task you do when you review.

Let's suppose there is a rev A changing 1000 lines, it's acknowledged. A change happens and rev B has to be submitted changing 1 single line.

Submitting B on top of A is straight forward and safe.

You want me to directly push B fixed up in A, i.e. rev C, so that C(codebase) = B(A(codebase)) = (B∘A)(codebase).

I just need you to teach me what's the command that tells you that I didn't sneak a rogue change in C changing some 10-ish lines unexpectedly.

How do you compare A to C and get the equivalent to B without me even ever pushing B?

katesalazar commented at 7:33 AM on October 19, 2021: contributor

How do you compare A to C and get the equivalent to B without me even ever pushing B?

Clarification, I know git diff A C gives it; but my question is, do you really do this to see B? Or do you just read all over all of the changes in A once again when C is pushed?

fanquake commented at 7:35 AM on October 19, 2021: member

You probably want git range-diff. I'd suggest reading our productivity docs.

katesalazar force-pushed on Oct 19, 2021

katesalazar commented at 9:14 AM on October 19, 2021: contributor

You probably want git range-diff. I'd suggest reading our productivity docs.

Oh, we stand on the shoulders of giants!

But even if range-diff works, how would you make sure that the old feature branch tip revision (rev) isn't removed by the Git garbage collector when you finally range-diff a feature branch tip against the previously reviewed rev which is no longer part of a branch AKA reference (ref)?

A reviewed rev is worth a lot more an asset than an unreviewed rev. You would want to make sure that the garbage collector doesn't remove a reviewed rev orphaned from a reference until the new tip is reviewed.

Don't you need to reconfigure the references log or the garbage collector so that they use a larger buffer or timeframe?

Or what you do is to just never update your local copy of a feature branch ref unless you are going to review the new feature branch tip right away, hence the previously reviewed rev orphaned from the pull request ref becomes obsolete and discardable in the single step of reviewing the new feature branch tip?

maflcko commented at 9:39 AM on October 19, 2021: member

You can create a branch or tag to make sure git doesn't gc it

katesalazar commented at 9:51 AM on October 19, 2021: contributor

Understood. Good enough, thanks.

Sjors commented at 10:45 AM on December 9, 2021: member

Concept ACK. I tend to run functional tests on a RAM disk which occasionally runs out of space, resulting in weird behavior. Hopefully this PR helps.

katesalazar commented at 8:15 PM on December 10, 2021: contributor

save dev time good :) code review ack is invested time ;D thanks for the necromancy, cheers

On Thu, Dec 9, 2021 at 11:45 AM Sjors Provoost @.***> wrote:

Concept ACK. I tend to run functional tests on a RAM disk which occasionally runs out of space, resulting in weird behavior. Hopefully this PR helps.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/23100#issuecomment-989733726, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMRS4WZBYMCU4X3DRKAEJRDUQCCGHANCNFSM5EY4R27A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

katesalazar force-pushed on Jan 9, 2022

katesalazar marked this as a draft on Jan 21, 2022

katesalazar force-pushed on Jan 21, 2022

Abort functional tests early upon low drive space

As an exception, last functional tests to be run do not impose drive
space requirements.

Some neighboring lines do not follow PEP8 strictly, so neither does
some of the lines in this change.

Includes many suggestions courtesy of MarcoFalke and fanquake and
theStack.

84d7df8629

XXX must investigate why this fixes CI XXX 4bc38b1a74

in test/functional/test_runner.py:570 in 2c98a20807 outdated

 564 | @@ -565,7 +565,11 @@ def run_tests(*, test_list, src_dir, build_dir, tmpdir, jobs=1, enable_coverage=
 565 |                      logging.debug("Early exiting after test failure")
 566 |                      break
 567 |  
 568 | -            shutil_disk_usage_stat_ = shutil.disk_usage(testdir)
 569 | +            try:  # XXX
 570 | +              shutil_disk_usage_stat_ = shutil.disk_usage(testdir)
 571 | +            except FileNotFoundError:  # XXX

katesalazar commented at 7:39 AM on January 22, 2022:

have to investigate why this fixed CI

katesalazar force-pushed on Jan 22, 2022

Revert "XXX must investigate why this fixes CI XXX"

This reverts commit 4bc38b1a74c772d608bd947ef92e7faac3e47314.

4af6367cdf

Revert "Revert "XXX must investigate why this fixes CI XXX""

This reverts commit 4af6367cdf39b85202ededa879c89a2adb5dcf92.

3a1e705a5d

katesalazar force-pushed on Jan 24, 2022

fanquake commented at 2:24 PM on April 26, 2022: member

have to investigate why this fixed CI

What is the status of this? Are you still working on it? It's not mergable in it's current state.

katesalazar commented at 2:50 PM on April 26, 2022: contributor

What is the status of this? Are you still working on it?

This was finished at some point, but I failed to keep up with changes in the master branch.

It's not mergable in it's current state.

I don't have the time, hardware and drive I'd want in order to clean this PR up (4bc38b1a74) but I'm happy to let it go whenever you need so.

fanquake added the label Up for grabs on Apr 26, 2022

fanquake commented at 2:56 PM on April 26, 2022: member

I don't have the time, hardware and drive I'd want in order to clean this PR up

That's fine. We'll close it as "Up for Grabs".

fanquake closed this on Apr 26, 2022

bitcoin locked this on Apr 26, 2023

maflcko removed the label Up for grabs on Jan 29, 2024