Recently, many Travis builds fail due to timeouts, which happen in the comparison tool. Just a test to see if it can also be triggered without. Not meant to be merged as-is.
test: disable comparison tool #6278
laanwj wants to merge 1 commit into bitcoin:master from laanwj:2015_06_disable_comparison_tool, changing 1 file (+1 −1)
laanwj commented at 4:48 AM on June 13, 2015: member
-
test: disable comparison tool cf6b62bbda
- laanwj added the label Tests on Jun 13, 2015
-
dexX7 commented at 2:54 PM on June 13, 2015: contributor
Hmmm.. that's interesting. Are you sure this is related to the comparison tool?
I'm asking, because I've never seen timeouts in another project/fork, which is currently based on Bitcoin Core 0.10. There are probably less than 10 builds per day though. But what I actually noticed: after testing a migration to 0.11 (based on 053110d), the Boost tests timed out from time to time, but not always.
I assumed this was somehow a mistake on my part, or simply "bad luck". The number of 0.11 builds was very low, and right now 0.10 is still used, so this might not be related at all. On top, Travis builds are routed through the container based infrastructure.
However, when running ./src/test/test_bitcoin --log_level=test_suite in a loop locally (done a few minutes ago), it stops at some point here:
    Entering test suite "scheduler_tests"
    Entering test case "manythreads"
I think this rules out that it's related to Travis. I'm going to test a clean version of Core now.
-
dexX7 commented at 3:00 PM on June 13, 2015: contributor
The current master seems to have the same problem.
When running test_bitcoin in a loop, it stops at some point during the scheduler_tests.
Tested on Ubuntu 14.04 LTS x64 with:
    Bitcoin version v0.11.99.0-ab0ec67 (2015-06-12 16:49:53 +0200)
    Using OpenSSL version OpenSSL 1.0.1f 6 Jan 2014
    Using BerkeleyDB version Berkeley DB 4.8.30: (April 9, 2010)
-
laanwj commented at 5:52 AM on June 15, 2015: member
The scheduler test is another possible source of hangs. It has been fixed a few times, but there may still be a race condition that makes it either never finish or finish really slowly.
Adding timestamps to the test output, as well as more verbose logging to test_bitcoin, may be a good idea.
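One way the timestamp idea could be tried without touching test_bitcoin itself is to filter its output through a small shell function. This is only a sketch: the function name add_timestamps is made up for illustration, and the HH:MM:SS format is an assumption chosen to resemble the comparison tool log lines.

```shell
# Prefix every line read from stdin with the current wall-clock time.
# add_timestamps is a hypothetical helper name, not part of the repository.
add_timestamps() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}
```

It could then be used as ./src/test/test_bitcoin --log_level=test_suite 2>&1 | add_timestamps, so the last line printed before a stall shows when output stopped.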
-
laanwj commented at 5:54 AM on June 15, 2015: member
On the other hand, I see a lot of false positives in Travis end in the comparison tool, e.g. the end usually looks like
    11:06:51 15 BitcoindComparisonTool$1.onPreMessageReceived: Got empty header message from bitcoind
    11:06:51 1 BitcoindComparisonTool.main: Block "b3" completed processing
    11:06:51 1 BitcoindComparisonTool.main: Testing block b3 499b1ec0ece4c4ef3b123d7498e3a5cfc85685fc9998a3f07b0fc7c977433627
    11:06:51 1 BitcoindComparisonTool.main: Sent inv with block 499b1ec0ece4c4ef3b123d7498e3a5cfc85685fc9998a3f07b0fc7c977433627
    11:06:51 15 BitcoindComparisonTool$1.onPreMessageReceived: Got empty header message from bitcoind
    Exception in thread "main" java.lang.NullPointerException
        at com.google.bitcoin.core.BitcoindComparisonTool.main(BitcoindComparisonTool.java:311)
    No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
    The build has been terminated
I think this is the NULL pointer that @theuni is looking for too.
-
theuni commented at 3:10 AM on June 17, 2015: member
Ah right.
Nope, it just slipped my mind. Will do.
-
laanwj commented at 8:21 AM on June 17, 2015: member
Yippie. I found one build that random-errored on master that didn't involve the comparison tool.
    make[3]: Entering directory `/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/src'
    Running 158 test cases...
    No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
    The build has been terminated
Probably the scheduler-tests again, although without more diagnostics it's hard to say.
-
dexX7 commented at 1:37 PM on June 17, 2015: contributor
although without more diagnostics it's hard to say.
Locally I'm able to pin it down:
    STATUS=0; while [ "$STATUS" -eq 0 ]; do ./src/test/test_bitcoin --run_test=scheduler_tests/manythreads --log_level=all; STATUS=$?; done
After a few rounds it always stops with:
    Entering test case "manythreads"
    test/scheduler_tests.cpp(70): info: check nTasks == 0 passed
    test/scheduler_tests.cpp(82): info: check nTasks == 100 passed
    test/scheduler_tests.cpp(83): info: check first < last passed
    test/scheduler_tests.cpp(84): info: check last > now passed
I confirmed it via Travis in a similar manner, see here and here.
With additional output, it looks like it doesn't make it past joining the threads in the test:
    // Drain the task queue then exit threads
    microTasks.stop(true);
    microThreads.join_all(); // ... wait until all the threads are done | <---- not passed
There seem to be a few reports mentioning issues in this context:
I'm not sure how this plays together with the comparison tool, though.
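For anyone trying to reproduce this unattended, the loop above could be bounded so that a hang is reported instead of blocking forever. The following is a rough sketch, not what was actually run: run_until_failure, the round cap, and the time limit are illustrative assumptions, and it relies on the timeout utility from GNU coreutils, which exits with status 124 when the limit is reached.

```shell
# Run a command repeatedly and stop at the first failure or hang.
# Usage: run_until_failure MAX_ROUNDS LIMIT_SECONDS COMMAND [ARGS...]
# run_until_failure is a hypothetical helper; "timeout" is GNU coreutils.
run_until_failure() {
    max_rounds=$1
    limit=$2
    shift 2
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        timeout "$limit" "$@"
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout reports an expired limit with exit status 124.
            echo "round $round: no exit after ${limit}s (possible hang)"
            return 124
        elif [ "$status" -ne 0 ]; then
            echo "round $round: exited with status $status"
            return "$status"
        fi
        round=$((round + 1))
    done
    echo "no failure in $max_rounds rounds"
}
```

For example: run_until_failure 1000 60 ./src/test/test_bitcoin --run_test=scheduler_tests/manythreads --log_level=all. The 60 second limit is a guess and would need to be well above the normal runtime of the test case.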
-
gavinandresen commented at 1:49 PM on June 17, 2015: contributor
@dexX7 what OS and version of boost? (I cannot reproduce a hang using your while loop on OSX 10.10.3 and boost 1.58.0). @laanwj : if it turns out to be a bug in boost or the OS handling gazillions of threads, perhaps just removing the scheduler stress test would be the right thing to do. We aren't actually using gazillions of threads with the scheduler in Core code, just one.
-
dexX7 commented at 2:04 PM on June 17, 2015: contributor
@gavinandresen: locally I'm using Ubuntu 14.04.2 with Boost 1.54. Travis uses Ubuntu 12.04 with Boost 1.55.
Disabling the scheduler test may reduce the number of occurrences, but as this PR indicates, timeouts were also seen during the comparison tool tests. It's not yet shown that this is caused by Boost threads, and it's only my guess.
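To compare setups, the Boost version can be read from the installed headers via the BOOST_LIB_VERSION macro in boost/version.hpp. A small sketch, assuming the header lives at /usr/include/boost/version.hpp (the path and the helper name boost_lib_version are assumptions and may differ per system):

```shell
# Print the Boost version string recorded in version.hpp, e.g. "1_54".
# The default path is an assumption; pass another header path as $1.
boost_lib_version() {
    header=${1:-/usr/include/boost/version.hpp}
    sed -n 's/^#define BOOST_LIB_VERSION "\(.*\)"$/\1/p' "$header"
}
```

Calling boost_lib_version with no argument reads the default path and prints nothing if the macro is not found there.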
-
laanwj commented at 4:39 PM on June 17, 2015: member
The comparison tool failures and the scheduler test failures are completely unrelated; the only thing they have in common is that they cause transient Travis failures. @dexX7 Thanks for the extensive report. I'll have a look at the scheduler tests again and see if I can find the issue. I doubt that it is an issue with Boost, just with our usage of it (under load). Disabling the test is also a possibility, but I'd prefer to find out what is wrong.
- laanwj closed this on Jun 19, 2015
- MarcoFalke locked this on Sep 8, 2021