issueMarcoFalke
opened this issue on
February 3, 2016
MarcoFalke
commented at 8:30 PM on February 3, 2016:
member
This is somewhat rare but still worth to take a look sometime. I can reproduce locally and travis complains in wallet.py:
check -salvagewallet
start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated
Edit:
This will now show up as
check -salvagewallet
Unexpected exception caught during testing: Exception('bitcoind exited with status 1 during initialization',)
laanwj added the label Wallet on Feb 4, 2016
laanwj
commented at 1:12 PM on February 15, 2016:
member
laanwj referenced this in commit 85695802e2 on Feb 15, 2016
laanwj referenced this in commit ca8fb59ae1 on Feb 15, 2016
laanwj
commented at 2:02 PM on March 15, 2016:
member
I've noticed that this mostly happens on the 32 bit travis build (.3)
laanwj
commented at 10:14 AM on March 25, 2016:
member
I don't think we're even sure whether -salvagewallet terminates. In this case every time the last line seem to call some bitcoin-cli command with -rpcwait. In case the client would crash, or exit, it would just wait forever.
I so wish we had full access to the VM image that runs the tests in these cases.
MarcoFalke
commented at 11:24 AM on March 25, 2016:
member
In case the client would crash, or exit, it would just wait forever.
The client does not crash nor exit, iIrc. salvagewallet returns some sort of error and the client tries to shutdown. Somehow this fails and the "write-peers.dat-thread" just continues to run. Haven't had the time to look into that yet...
laanwj
commented at 11:26 AM on March 25, 2016:
member
So you managed to reproduce this locally?
dohh you say so in the OP
MarcoFalke
commented at 11:29 AM on March 25, 2016:
member
You can just run
$ for i in {0..10000}; do qa/pull-tester/rpc-tests.py wallet --nocleanup;if [ $? -ne 0 ];then echo; echo "fail"; sleep 1d;fi ;done
and come back 20 minutes later to dig through the dumps.
laanwj
commented at 12:00 PM on March 25, 2016:
member
I did a similar thing, I put the part of wallet.py that does the salvage in an infinite loop.
Managed to make it hang after ~100 iterations.
Looking at the debug.log it didn't hang during the salvage, but manages to go into normal operation. And the node didn't start shutting down at all. I could even succesfully send it a stop command and it stopped running. This won't set the python script in motion again though. Curious.
At least it looks like a problem in the script framework, not in the salvage process itself, that does put my concerns to rest a bit.
Oh, I didn't notice it spawns three nodes. The third one indeed has:
2016-03-25 10:41:54 Renamed wallet.dat to wallet.1458902514.bak
2016-03-25 10:41:54 CDBEnv::Salvage: Database salvage found errors, all data may not be recoverable.
2016-03-25 10:41:54 Salvage(aggressive) found 295 records
laanwj
commented at 12:32 PM on March 25, 2016:
member
Ok, I was right in my gist above. I can reproduce this 100% of the time now with a specific wallet:
After the salvage goes wrong (for whatever reason) the node will just exit.
The test framework somehow doesn't detect this sudden exit. I think it will try to send a stop command (with -rpcwait) which never arrives, because the node is not running anymore in the first place. It will try forever to send it.
This makes the test hang 100% of the time:
--- a/src/wallet/wallet.cpp
+++ b/src/wallet/wallet.cpp
@@ -390,8 +390,9 @@ bool CWallet::Verify(const string& walletFile, string& warningString, string& er
if (GetBoolArg("-salvagewallet", false))
{
// Recover readable keypairs:
- if (!CWalletDB::Recover(bitdb, walletFile, true))
- return false;
+ return false;
}
if (boost::filesystem::exists(GetDataDir() / walletFile))
That, of course, leaves the question why the wallet ends up in a non-salvageable state. But at least the test should not hang in this case, but fail with a clearer error.
laanwj referenced this in commit 81b0587ba6 on Mar 25, 2016
laanwj referenced this in commit f22f8ed8c0 on Mar 25, 2016
laanwj referenced this in commit 8869c3bbbb on Mar 25, 2016
laanwj
commented at 2:46 PM on March 25, 2016:
member
That, of course, leaves the question why the wallet ends up in a non-salvageable state.
The underlying problem is weird. It seems that randomly, after exit, the wallet database fails verification with DB_SALVAGE|DB_AGRESSIVE but otherwise works fine. E.g.
int result = db.verify(strFile.c_str(), NULL, &strDump, flags);
returns DB_VERIFY_BAD.
However, the wallet works as normally. Non -salvagewallet doesn't see any problems at all.
Unlike the error message would have you believe, the salvage successfully recovers the entire wallet.
laanwj referenced this in commit 018b60c5ea on Mar 26, 2016
MarcoFalke renamed this: salvagewallet sometimes never terminates salvagewallet fails verification on Apr 16, 2016
laanwj
commented at 9:33 AM on May 12, 2016:
member
This test will be disabled in #8038 until this is be fixed.
laanwj closed this on May 12, 2016
laanwj reopened this on May 12, 2016
makevoid referenced this in commit ae90831ab7 on Jun 13, 2016
ftrader referenced this in commit 5ba022fab1 on Dec 9, 2016
schinzelh referenced this in commit 61d56ee22e on Mar 14, 2017
UdjinM6 referenced this in commit d6f3ae95db on May 30, 2017
jnewbery
commented at 11:11 PM on June 5, 2017:
member
I've been digging into this failure over the last couple of days. This is what I've found so far:
This is fairly easy to reproduce. Just uncomment the -salvagewallet test and set BASE_SCRIPTS in test_runner.py to be a list [["wallet.py"] * 50]. In a run of 50, I usually get one failure.
the problem is caused by running Berkeley DB's Db::verify() with the DB_AGGRESSIVE flag. The documentation for that is (emphasis mine):
Output all the key/data pairs in the file that can be found. By default, Db::verify() does not assume corruption. For example, if a key/data pair on a page is marked as deleted, it is not then written to the output file. When DB_AGGRESSIVE is specified, corruption is assumed, and any key/data pair that can be found is written. In this case, key/data pairs that are corrupted or have been deleted may appear in the output (even if the file being salvaged is in no way corrupt), and the output will almost certainly require editing before being loaded into a database.
you can repro the corruption manually by running the db_dump utility function, which is bundled with BDB. -r runs salvage and -R runs salvage with aggressive.
This is the wallet.dat file that I was using in my tests (renamed wallet.dat.txt as github has rules about file types): wallet.dat.txt. If I run db_dump on this .dat file with -R, the output ends:
if I run db_dump with -r, then I don't see the DB_VERIFY_BAD message.
piping db_dump through to db_load rebuilds a database. If I use -R on db_dump, then this fails, because the database dump is corrupt. If I use -r, on db_dump, then it succeeds, and most (all?) of the key-value pairs appear to be intact.
Digging a bit more into the corruption, if I look at the output of db_dump -R, it's a superset of the output of db_dump -r. There are some extra rows that look like normal keys/values, but there are also a bunch of these:
554e4b4e4f574e5f4b4559 is ascii for UNKNOWN_KEY. You can see where that string is output in btree/bt_verify.c in the BDB source.
Critically, in the first group, we have (with my commentary):
554e4b4e4f574e5f4b4559 // UNKNOWN_KEY
020000000183365de8669a47426115d8224516599ce66f9e8dd75738e30b1c1812cc65210d010000006b483045022100fbec000102000000010000000000000000000000000000000000000000000000000000000000000000ffffffff0502d8000101ffffffff0200f902950000000023210230734f482295f0361adc9fdd518a17ce3914e30c5944e266c9c2d62a81bfabc0ac0000000000000000266a24aa21a9ede2f61c3f71d1defd3fa999dfa36953755c690689799962b48bebd836974e8cf900000000717b8547d59a8248a8f337f18b7ff9740c60ad7661a034534a3dc8b5ca85cb75000000000000030b66726f6d6163636f756e7400016e0232320974696d65736d6172740a313439363637373630380000000000e87c35590000732300010274783e056c621538aad4f361910ae102f93a9ca6b669 // some value
554e4b4e4f574e5f4b4559 // UNKNOWN_KEY
02000000010817df69d135686315416c5c2ab83d740882b5c733a2e24762dc4a1b52640e2000000000494830450221008732d20ce6ec67b39bbd6065149d66cd15fa10b1666512d2f72c315dadb4a5a802202de2365122b8f014994bdd8a98955f7ade117a1aadebd2b7769d1325f53c15e801ffffffff0200000000000000001976a914954ff91555b588a81796e3d1ae6313339ce28e4788acc0e4022a010000001976a914c58fb7d7c9407cf2eab0a7a52677869334c2c46488ac00000000ccf2bff45a3b0101020000000106e45a28096dd10e52d073e6a93ed5ae884aa1dc69fb4106df2656396da5d75f000000006b483045022100ede821e3abae05f5b94454bfe6e3cdf78525206355929b3bdc1fd11b5dc8d59102 // some value
554e4b4e4f574e5f4b4559 // UNKNOWN_KEY
036b65792103c58cdbfc2b4cedd96e56d2714aa980a1478feb0bc5f9f7b6036e075eb974faef // bitcoin pub key key
d63081d302010104202941b6af510f195ac980a8d5218d06b870d18ba057a2e8b3f63a9eb9be6d9cf7a08185308182020101302c06072a8648ce3d0101022100fffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f300604010004010704210279be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798022100fffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141020101a12403220003c58cdbfc2b4cedd96e56d2714aa980a1478feb0bc5f9f7b6036e075eb974faef60d62c885546d4410667ec2ce0cf0562b90fa067e6a8cd3dc20f127d659300d1 // bitcoin pubkey value
036b65792103c9c89b90f2ae69722c1ec7cea861a0c925bce1dfd996467aa7eb3fe887450655
d63081d302010104204b1b5c9cc37717ac4ed9f4376fa93bea36e8f065d4ea0f9991eec9905e916ecba08185308182020101302c06072a8648ce3d0101022100fffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f300604010004010704210279be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798022100fffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141020101a12403220003c9c89b90f2ae69722c1ec7cea861a0c925bce1dfd996467aa7eb3fe887450655b92751b48a09b2bff3ebf0ca0d40af28c2d3aa229711abd6967a5fa6c43d61c7
the output format for BDB should be key in one row followed by value in the next row. You can see here that the final 554e4b4e4f574e5f4b4559 key is followed by 036b65792103c58cdbfc2b4cedd96e56d2714aa980a1478feb0bc5f9f7b6036e075eb974faef, which is actually another key row (for a bitcoin public key). That means that the output of db_dump is out of alignment - there's actually an odd number of rows, so after this line, parsing keys/values gets mixed up.
I added some logging into ReadKeyValue() in bitcoind and see exactly the same thing when running with -salvagewallet. The key/values get misaligned when re-reading after calling Db::verify():
2017-06-05 19:57:03 Renamed wallet.dat to wallet.1496692623.bak
2017-06-05 19:57:03 CDBEnv::Salvage: Database salvage found errors, all data may not be recoverable.
2017-06-05 19:57:03 CDBEnv::Salvage: WARNING: Number of keys in data does not match number of values.
2017-06-05 19:57:03 CDBEnv::Salvage: WARNING: Unexpected end of file while reading salvage output.
2017-06-05 19:57:03 Salvage(aggressive) found 102 records
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: aa2f899c1b2fcf578522467a4d36d46b43d2ba318d9d3c17ddf0679630e67d02
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: 5fd7a56d395626df0641fb69dca14a88aed53ea9e673d0520ed16d09285ae406
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: 2f66a28ccfbbf6fccad3d168cbef6ffd36ed14d2f7303ca797708aebf3deb510
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: aada45522a31c27cab548975be7f5282471f484258451916bcad018b694dc311
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: 580ce8c795c5e3c5eaa946da7c054e6cd6336739c730c4fb87c9fa735632de15
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: ddae1c851b2275a9bb12d53bfe8a1f0ff9a7cdb149d8734042fe439e72cdd420
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: f3450597d463df2edd39cd7f0f0aa6e07801e143c2ecb3a96f33ba62bc1aa429
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: 31f5262195fd40bce453e44d69b6a69c3af902e10a9161f3d4aa3815626c053e
2017-06-05 19:57:03 ReadKeyValue: Reading record type tx
2017-06-05 19:57:03 ReadKeyValue: Reading tx record. hash: b6aa0e873ad8854e15855f9caf2a932edef07bfb5a261172351959defd83144d
2017-06-05 19:57:03 ReadKeyValue: Reading record type 0<81><D3>^B^A^A^D )A<B6><AF>Q^O^YZɀ<A8><D5>!<8D>^F<B8>pы<A0>W<A2><E8><B3><F6>:<9E><B9><BE>m<9C>����0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^CŌ<DB><FC>+L<ED><D9>nV<D2>qJ<A9><80><A1>G<8F><EB>^K<C5><F9><F7><B6>^Cn^G^<B9>t<FA><EF>
2017-06-05 19:57:03 ReadKeyValue: Reading 0<81><D3>^B^A^A^D )A<B6><AF>Q^O^YZɀ<A8><D5>!<8D>^F<B8>pы<A0>W<A2><E8><B3><F6>:<9E><B9><BE>m<9C>����0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^CŌ<DB><FC>+L<ED>
<D9>nV<D2>qJ<A9><80><A1>G<8F><EB>^K<C5><F9><F7><B6>^Cn^G^<B9>t<FA><EF> record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type 0<81><D3>^B^A^A^D KESC\<9C><C3>w^W<AC>N<D9><F4>7o<A9>;<EA>6<E8><F0>e<D4><EA>^O<99><91><EE>ɐ^<91>nˠ<81><85>0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^C<C9>ț<90><F2><AE>ir,^^<C7>Ψa<A0><C9>%<BC><E1><DF>ٖFz<A7><EB>?<E8><87>E^FU
2017-06-05 19:57:03 ReadKeyValue: Reading 0<81><D3>^B^A^A^D KESC\<9C><C3>w^W<AC>N<D9><F4>7o<A9>;<EA>6<E8><F0>e<D4><EA>^O<99><91><EE>ɐ^<91>nˠ<81><85>0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^C<C9>ț<90><F2><AE>ir,^^<C7>Ψa<A0><C9>%<BC><E1><DF>ٖFz<A7><EB>?<E8><87>E^FU record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type 0<81><D3>^B^A^A^D u<9C><E8><AB><F4>4D7<F2><D4>i<A9>uN<F7>3<C1>^HW<8C>0<B9><F2>ݑ<B4>^YN<FE><E0>@W<A0><81><85>0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^C<DE>u^XN"m<81><B8><9E>p<97><C4>5a<CA><C6>^V<A9>*<B6>p<99><AA><A5><B5><DC><D3>>N`(^Z
2017-06-05 19:57:03 ReadKeyValue: Reading 0<81><D3>^B^A^A^D u<9C><E8><AB><F4>4D7<F2><D4>i<A9>uN<F7>3<C1>^HW<8C>0<B9><F2>ݑ<B4>^YN<FE><E0>@W<A0><81><85>0<81><82>^B^A^A0,^F^G*<86>H<CE>=^A^A^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><FF><FF><FC>/0^F^D^A^@^D^A^G^D!^By<BE>f~<F9>ܻ<AC>U<A0>b<95>·^K^G^B<9B><FC><DB>-<CE>(<D9>Y<F2><81>[^V<F8>^W<98>^B!^@<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FE><BA><AE><DC><E6><AF>H<A0>;<BF><D2>^<8C><D0>6AA^B^A^A<A1>$^C"^@^C<DE>u^XN"m<81><B8><9E>p<97><C4>5a<CA><C6>^V<A9>*<B6>p<99><AA><A5><B5><DC><D3>>N`(^Z record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type test
2017-06-05 19:57:03 ReadKeyValue: Reading test record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type
2017-06-05 19:57:03 ReadKeyValue: Reading record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type рыба
2017-06-05 19:57:03 ReadKeyValue: Reading рыба record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type
2017-06-05 19:57:03 ReadKeyValue: Reading record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type
2017-06-05 19:57:03 ReadKeyValue: Reading record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type 𝅘𝅥𝅯
2017-06-05 19:57:03 ReadKeyValue: Reading 𝅘𝅥𝅯 record (no-op)
2017-06-05 19:57:03 ReadKeyValue: Reading record type ^@^@
2017-06-05 19:57:03 ReadKeyValue: Reading ^@^@ record (no-op)
(you can see at the end that it's reading keys as values and vice versa).
After running with -salvagewallet, the wallet.dat file is empty. All key-value pairs have gone. The wallet backup file wallet.<timestamp>.bak contains the db from before the salvage operation so still has all the key-value pairs.
Next steps:
-salvagewallet should be changed to not set the aggressive flag. Note the documentation for that flag: the output will almost certainly require editing before being loaded into a database. Currently the salvagewallet operation just directly reads that output, so it's not entirely surprising that it occasionally fails.
we may want an option to run salvage wallet with the aggressive flag set, but that should only be once we have the bitcoin_wallet_tool from #8745.
once #8745 is merged, we should add targeted testing of salvagewallet (rather then the incidental testing we have in wallet.py). Gavin had some testing here: #1895 / https://gist.github.com/gavinandresen/3812689 which we might use as a start.
TBD whether we want to merge my extra logging on BDB reads. We could also add extra logging on writes.
we should try to figure out what BD writes are causing the DB to get into a state where Db::verify() with the DB_AGGRESSIVE flag fails. Additional logging will help there. I can also add an option to the integration tests which verifies the wallet.dat files of every node at the end of every test.
jnewbery
commented at 3:10 PM on June 6, 2017:
member
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2026-04-17 06:15 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me