sqlite: Disallow writing from multiple SQLiteBatchs #29112

pull achow101 wants to merge 3 commits into bitcoin:master from achow101:sqlite-concurrent-writes changing 4 files +129 −6
  1. achow101 commented at 10:27 pm on December 18, 2023: member

    The way that we have configured SQLite to run means that only one database transaction can be open at a time. Typically, each individual read and write operation will be its own transaction that is opened and committed automatically by SQLite. However, sometimes we want these operations to be batched into a multi-statement transaction, so SQLiteBatch::TxnBegin, SQLiteBatch::TxnCommit, and SQLiteBatch::TxnAbort are used to manage the transaction of the database.

    However, once a db transaction is begun with one SQLiteBatch, any operations performed by another SQLiteBatch will also occur within the same transaction. Furthermore, those other SQLiteBatchs will not be expecting a transaction to be active, and will abort it once the SQLiteBatch is destructed. This is problematic as it will prevent some data from being written, and also cause the SQLiteBatch that opened the transaction in the first place to be in an unexpected state and throw an error.

    To avoid this situation, we need to prevent the multiple batches from writing at the same time. To do so, I’ve implemented added a CSemaphore within SQLiteDatabase which will be used by any SQLiteBatch trying to do a write operation. wait() is called by TxnBegin, and at the beginning of WriteKey, EraseKey, and ErasePrefix. post() is called in TxnCommit, TxnAbort and at the end of WriteKey, EraseKey, and ErasePrefix. To avoid deadlocking on TxnBegin() followed by a WriteKey(), `SQLiteBatch will now also track whether a transaction is in progress so that it knows whether to use the semaphore.

    This issue is not a problem for BDB wallets since BDB uses WAL and provides transaction objects that must be used if an operation is to occur within a transaction. Specifically, we either pass a transaction pointer, or a nullptr, to all BDB operations, and this allows for concurrent transactions so it doesn’t have this problem.

    Fixes #29110

  2. DrahtBot commented at 10:27 pm on December 18, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK furszy, ryanofsky, josibake

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. achow101 added the label Needs backport (26.x) on Dec 18, 2023
  4. achow101 removed the label Needs backport (26.x) on Dec 18, 2023
  5. achow101 added the label Wallet on Dec 18, 2023
  6. achow101 force-pushed on Dec 18, 2023
  7. DrahtBot added the label CI failed on Dec 18, 2023
  8. maflcko commented at 10:23 am on December 19, 2023: member
    0                                     File "/ci_container_base/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/wallet_descriptor.py", line 68, in test_concurrent_writes
    1                                       assert_equal(cache_records, 10000)
    2                                     File "/ci_container_base/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/util.py", line 57, in assert_equal
    3                                       raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
    4                                   AssertionError: not(0 == 10000)
    
  9. achow101 force-pushed on Dec 19, 2023
  10. achow101 force-pushed on Dec 19, 2023
  11. DrahtBot removed the label CI failed on Dec 19, 2023
  12. in src/wallet/sqlite.cpp:507 in 4d846b7b44 outdated
    471@@ -471,6 +472,8 @@ bool SQLiteBatch::WriteKey(DataStream&& key, DataStream&& value, bool overwrite)
    472     sqlite3_reset(stmt);
    473     if (res != SQLITE_DONE) {
    474         LogPrintf("%s: Unable to execute statement: %s\n", __func__, sqlite3_errstr(res));
    475+    } else if (!m_txn) {
    476+        m_database.m_sqlite_semaphore.post();
    477     }
    


    furszy commented at 2:22 pm on December 20, 2023:

    In 4d846b7b448:

    This will produce a deadlock. if the statement execution fails, the semaphore will remain taken forever. Blocking all future statements execution (see https://github.com/furszy/bitcoin-core/commit/20344b91a03259eb33652863700562f5ef198128). Same will occur on ExecStatement.


    achow101 commented at 0:26 am on December 21, 2023:
    Fixed. Also pulled the test with a few changes.
  13. in src/wallet/sqlite.cpp:621 in 4d846b7b44 outdated
    611@@ -606,10 +612,13 @@ std::unique_ptr<DatabaseCursor> SQLiteBatch::GetNewPrefixCursor(Span<const std::
    612 
    613 bool SQLiteBatch::TxnBegin()
    614 {
    615+    m_database.m_sqlite_semaphore.wait();
    616     if (!m_database.m_db || sqlite3_get_autocommit(m_database.m_db) == 0) return false;
    


    furszy commented at 2:51 pm on December 20, 2023:
    Need to check if TxnBegin() was already called on this object prior to acquiring the semaphore. Otherwise two TxnBegin() calls will cause a deadblock. Can throw a logic exception error if it happens.

    achow101 commented at 0:26 am on December 21, 2023:
    Fixed
  14. in src/wallet/sqlite.cpp:659 in 4d846b7b44 outdated
    616     if (!m_database.m_db || sqlite3_get_autocommit(m_database.m_db) == 0) return false;
    617     int res = sqlite3_exec(m_database.m_db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);
    618     if (res != SQLITE_OK) {
    619         LogPrintf("SQLiteBatch: Failed to begin the transaction\n");
    620+    } else {
    621+        m_txn = true;
    


    furszy commented at 3:05 pm on December 20, 2023:

    This can also cause a deadlock when the statement execution fails.

    The semaphore is not released after a TxnBegin failure which will block indefinitely if the software continues and executes a single write or another TxnBegin attempt. Which could happen as TxnBegin failure might be temporary.


    achow101 commented at 0:26 am on December 21, 2023:
    Fixed
  15. furszy commented at 3:15 pm on December 20, 2023: member
    Found possible deadlock causes and crafted a test that exercises one of them: https://github.com/furszy/bitcoin-core/commit/20344b91a03259eb33652863700562f5ef198128. Feel free to pull it here.
  16. achow101 force-pushed on Dec 21, 2023
  17. in src/wallet/test/db_tests.cpp:215 in 689ed43c2e outdated
    212+    // that the database remains active after failing to store an existing record.
    213+    for (const auto& database : TestDatabases(m_path_root)) {
    214+        std::unique_ptr<DatabaseBatch> handler = Assert(database)->MakeBatch();
    215+
    216+        // Write original record
    217+        std::unique_ptr<DatabaseBatch> batch = database->MakeBatch();
    


    furszy commented at 3:30 pm on December 23, 2023:

    In 689ed43c2:

    Pulling mistake here. Introduced another DatabaseBatch. Just need one.


    achow101 commented at 0:23 am on January 3, 2024:
    Fixed
  18. furszy commented at 4:18 pm on December 23, 2023: member

    For the sake of completeness and to avoid potential issues, have expanded the test coverage to exercise the previously mentioned “concurrent database transactions deadlock” behavior. All yours: https://github.com/furszy/bitcoin-core/commit/a80cc56f0d1bdc82e9458f957db03c08620a6dd7. It verifies that the database handler cannot commit/abort changes occurring in a different database handler.

    Also, while was working on this, discovered another deadlocking cause.. test https://github.com/furszy/bitcoin-core/commit/68e77b58e2284a92daff49a324fd229497da37ae. As it is a bdb-only issue, we could address it in another PR. But, it will mean that https://github.com/furszy/bitcoin-core/commit/a80cc56f0d1bdc82e9458f957db03c08620a6dd7 will need to be adapted to only run on sqlite (replace TestDatabases() for sqlite db creation call).

  19. achow101 force-pushed on Jan 3, 2024
  20. achow101 commented at 0:23 am on January 3, 2024: member

    Also, while was working on this, discovered another deadlocking cause.. test furszy@68e77b5. As it is a bdb-only issue, we could address it in another PR. But, it will mean that furszy@a80cc56 will need to be adapted to only run on sqlite (replace TestDatabases() for sqlite db creation call).

    Hmm, seems like BDB doesn’t behave how I thought it did. It looks like we can’t have concurrent transactions that try to operate on the same data, at least not without changing the isolation level, and I don’t think it’s worth the effort to figure that out when we’re planning on removing BDB soon anyways. I’ve taken your test and updated it to just use SQLite.

    I don’t think this particular issue is a problem for us however. The fact that it deadlocks in this particular test case is probably a good thing since it means that BDB is handling locking internally. If there were two threads trying to read/write at the same time, BDB would force one thread to wait for the other, as this PR is doing for SQLite. That’s the behavior that we want. It’s only a deadlock (and test failure) since the test is trying to do the behavior of two threads within one thread.

  21. in src/wallet/test/db_tests.cpp:291 in f1b5b259f6 outdated
    235+    std::string value2 = "value_2";
    236+
    237+    DatabaseOptions options;
    238+    DatabaseStatus status;
    239+    bilingual_str error;
    240+    const auto& database = MakeSQLiteDatabase(m_path_root / "sqlite", options, status, error);
    


    furszy commented at 1:27 pm on January 4, 2024:

    This will fail to compile without sqlite support. Need to scope the test under the USE_SQLITE compile flag.

     0diff --git a/src/wallet/test/db_tests.cpp b/src/wallet/test/db_tests.cpp
     1--- a/src/wallet/test/db_tests.cpp	(revision f1b5b259f6d4759e7fc514b82b4096f7669e8325)
     2+++ b/src/wallet/test/db_tests.cpp	(date 1704374489489)
     3@@ -228,6 +228,7 @@
     4     }
     5 }
     6 
     7+#ifdef USE_SQLITE
     8 BOOST_AUTO_TEST_CASE(concurrent_txn_dont_interfere)
     9 {
    10     std::string key = "key";
    11@@ -269,6 +270,7 @@
    12     BOOST_CHECK(handler2->Read(key, read_value));
    13     BOOST_CHECK_EQUAL(read_value, value2);
    14 }
    15+#endif // USE_SQLITE
    16 
    17 
    18 BOOST_AUTO_TEST_SUITE_END()
    

    achow101 commented at 2:54 pm on January 4, 2024:
    Done
  22. furszy commented at 1:45 pm on January 4, 2024: member

    I don’t think this particular issue is a problem for us however. The fact that it deadlocks in this particular test case is probably a good thing since it means that BDB is handling locking internally. If there were two threads trying to read/write at the same time, BDB would force one thread to wait for the other, as this PR is doing for SQLite. That’s the behavior that we want. It’s only a deadlock (and test failure) since the test is trying to do the behavior of two threads within one thread.

    The issue I envision is nested WalletBatch within the same thread. We need to be careful on not introducing an inner WalletBatch creation when there is an existing WalletBatch on any upper level.

    Still, agree on not doing the changes here. Isn’t related. I should revive #25297 after we finish #28574.

  23. achow101 force-pushed on Jan 4, 2024
  24. furszy commented at 2:36 pm on January 8, 2024: member

    Another possible deadlock can arise when the txn abortion fails during the destruction of the batch object. Essentially, if the txn abortion fails and the object is destructed, the semaphore remains locked indefinitely, leading to a deadlock in any subsequent write operation.

    Made a test replicating the behavior; cherry-pick 5d2b79a65dcb10abdd349471c125595d677909ab and fbd59f89d3f5ed7bc026efa47d1f78c5cee04291 (from https://github.com/furszy/bitcoin-core/commits/sqlite-concurrent-writes/ )

  25. maflcko added this to the milestone 27.0 on Jan 11, 2024
  26. achow101 force-pushed on Jan 11, 2024
  27. achow101 commented at 10:04 pm on January 11, 2024: member

    Another possible deadlock can arise when the txn abortion fails during the destruction of the batch object. Essentially, if the txn abortion fails and the object is destructed, the semaphore remains locked indefinitely, leading to a deadlock in any subsequent write operation.

    Made a test replicating the behavior; cherry-pick 5d2b79a and fbd59f8 (from https://github.com/furszy/bitcoin-core/commits/sqlite-concurrent-writes/ )

    Changed TxnAbort and TxnCommit to always release the lock. Also pulled the test, and modified it to test TxnCommit and explicit TxnAbort

  28. furszy commented at 3:05 pm on January 15, 2024: member

    Changed TxnAbort and TxnCommit to always release the lock.

    This breaks the db txn isolation property. The db txn is still active at the engine level, the lock shouldn’t be released when the sqlite statement fails to execute (only when the abort/commit command executes successfully). Otherwise, other concurrent handlers are allowed to access db at the same time and, in the worst possible outcome, dump information to disk that should have been discarded.

    Check this out 71170998. Made a test that exercises the behavior. Feel free to pull it.

    Only the handler who originated the db txn should be able to commit and/or abort such txn, not other handlers. If other handlers were able to commit txns that they did not start, they could be saving incomplete information to disk and affecting other processes.

    Aside from that, thanks for pulling all the tests commits. We are getting closer.

  29. achow101 commented at 4:04 pm on January 15, 2024: member

    The db txn is still active at the engine level, the lock shouldn’t be released when the sqlite statement fails to execute (only when the abort/commit command executes successfully). Otherwise, other concurrent handlers are allowed to access db at the same time and, in the worst possible outcome, dump information to disk that should have been discarded.

    If it fails to abort when the batch is destructed, then we’re deadlocked. I don’t see how this can be resolved without deadlocking and still preserving isolation.

  30. DrahtBot added the label CI failed on Jan 15, 2024
  31. furszy commented at 11:11 pm on January 15, 2024: member

    If it fails to abort when the batch is destructed, then we’re deadlocked. I don’t see how this can be resolved without deadlocking and still preserving isolation.

    Working on a solution. Will push it in few hours.

  32. furszy commented at 0:55 am on January 16, 2024: member

    Done. Created #29253 to fix the root issue. I took some of my commits there.

    With it, the fix for the problem #29112#pullrequestreview-1821862931 is straightforward. After resetting the database connection, we can release the lock, as closing the connection automatically rolls back the active transaction (see the PR). This ensures isolation across db handlers and will prevent the deadlocks here.

    To show the end result, created a quick branch merging all the changes: This PR + #29253 + the test shared in comment (71170998). See https://github.com/furszy/bitcoin-core/tree/2024_pr29112_with_dangling_txn_fixes. All tests passing there :).

  33. achow101 force-pushed on Jan 16, 2024
  34. achow101 commented at 8:15 pm on January 16, 2024: member
    Reverted this to b316a81704fdd7fc5984ee8c189a937302b70c83 as #29253 deals with the problem of failing TxnAbort() and TxnCommit().
  35. DrahtBot removed the label CI failed on Jan 16, 2024
  36. achow101 referenced this in commit a01da41112 on Jan 31, 2024
  37. DrahtBot added the label Needs rebase on Jan 31, 2024
  38. achow101 force-pushed on Feb 1, 2024
  39. DrahtBot removed the label Needs rebase on Feb 1, 2024
  40. in src/wallet/sqlite.h:122 in e7ab074cb0 outdated
    116@@ -115,6 +117,10 @@ class SQLiteDatabase : public WalletDatabase
    117 
    118     ~SQLiteDatabase();
    119 
    120+    // Batches must acquire this semaphore on writing, and release when done writing.
    121+    // This ensures that only one batch is modifying the database at a time.
    122+    CSemaphore m_sqlite_semaphore;
    


    ryanofsky commented at 3:39 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Would suggest calling this m_write_semaphore not m_sqlite_semaphore to be clear it is only needed for writing the database.


    achow101 commented at 7:12 pm on February 1, 2024:
    Renamed
  41. in src/wallet/sqlite.h:71 in e7ab074cb0 outdated
    57@@ -58,6 +58,8 @@ class SQLiteBatch : public DatabaseBatch
    58     sqlite3_stmt* m_delete_stmt{nullptr};
    59     sqlite3_stmt* m_delete_prefix_stmt{nullptr};
    60 
    61+    bool m_txn{false};
    


    ryanofsky commented at 3:56 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Suggest adding a comment about how this variable is used /** Boolean tracking whether this batch object has started a active transaction and whether it owns SqliteDatabase::m_write_semaphore. If the batch object starts a transaction, it acquires the semaphore and sets this to true, keeping the semaphore until the transaction ends to prevent other batch objects from writing to the database. If the batch object doesn’t start a transaction, it just acquires the semaphore transiently when writing and does not set m_txn. m_txn is different than SqliteDatabase::HasActiveTxn(), becaue it is only true when this batch object has started a transaction, not when any batch object has started a transaction. */


    achow101 commented at 7:12 pm on February 1, 2024:
    Added a comment.
  42. in src/wallet/sqlite.cpp:419 in e7ab074cb0 outdated
    415@@ -416,6 +416,7 @@ void SQLiteBatch::Close()
    416             // by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction
    417             // was opened will be rolled back and future transactions can succeed without committing old data.
    418             force_conn_refresh = true;
    419+            m_database.m_sqlite_semaphore.post();
    


    ryanofsky commented at 4:17 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Can’t add a review comment there, but line 411 if (m_database.HasActiveTxn()) above needs to be changed to if (m_txn) to prevent a race condition. The SQLiteBatch::Close function is called by the SQLiteBatch destructor so it called on every batch object, whether or not is a created a transaction. If a batch object is being destroyed in one thread while a batch object in another thread has an ongoing transaction, the batch object that’s being destroyed could see HasActiveTxn return true and abort a transaction it doesn’t own.

    Would also suggest adding a comment above this m_sqlite_semaphore.post() line to explain it. Maybe // If TxnAbort call failed, it will not have released the semaphore, so release it here to avoid deadlocks on future writes.


    achow101 commented at 7:12 pm on February 1, 2024:
    Done as suggested.
  43. in src/wallet/sqlite.cpp:497 in e7ab074cb0 outdated
    493@@ -493,13 +494,18 @@ bool SQLiteBatch::WriteKey(DataStream&& key, DataStream&& value, bool overwrite)
    494     if (!BindBlobToStatement(stmt, 1, key, "key")) return false;
    495     if (!BindBlobToStatement(stmt, 2, value, "value")) return false;
    496 
    497+    if (!m_txn) m_database.m_sqlite_semaphore.wait();
    


    ryanofsky commented at 4:20 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Would suggest adding a comment like // Acquire the write semaphore if it was not previously acquired while creating a transaction.


    achow101 commented at 7:13 pm on February 1, 2024:
    Done as suggested
  44. in src/wallet/sqlite.cpp:520 in e7ab074cb0 outdated
    516@@ -511,13 +517,18 @@ bool SQLiteBatch::ExecStatement(sqlite3_stmt* stmt, Span<const std::byte> blob)
    517     // Bind: leftmost parameter in statement is index 1
    518     if (!BindBlobToStatement(stmt, 1, blob, "key")) return false;
    519 
    520+    if (!m_txn) m_database.m_sqlite_semaphore.wait();
    


    ryanofsky commented at 4:23 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Would suggest the same comment here. // Acquire the write semaphore if it was not previously acquired while creating a transaction.

    Separately, it seems like SQLiteBatch::WriteKey and SQLiteBatch::ExecStatement could be deduplicated or consolidated. They are basically identical copies of the same code.


    achow101 commented at 7:14 pm on February 1, 2024:

    Added a comment.

    Separately, it seems like SQLiteBatch::WriteKey and SQLiteBatch::ExecStatement could be deduplicated or consolidated. They are basically identical copies of the same code.

    The main difference is that WriteKey has an additional BindBlobToStatement, and a way to deduplicate that didn’t occur to me when writing this originally. Will leave for a followup if anyone wants to do it though.

  45. in src/wallet/sqlite.cpp:650 in e7ab074cb0 outdated
    644@@ -634,30 +645,41 @@ std::unique_ptr<DatabaseCursor> SQLiteBatch::GetNewPrefixCursor(Span<const std::
    645 
    646 bool SQLiteBatch::TxnBegin()
    647 {
    648+    if (m_txn) return false;
    649+    m_database.m_sqlite_semaphore.wait();
    650     if (!m_database.m_db || m_database.HasActiveTxn()) return false;
    


    ryanofsky commented at 4:29 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Semaphore is leaked here if m_db is null. Would suggest:

    0if (!m_database.m_db || m_txn) return false;
    1m_database.m_sqlite_semaphore.wait();
    

    I would also suggest dropping the m_database.HasActiveTxn() check because it is confusing (it should never be true after the semaphore is acquired).

    If there really is an active transaction it is better to trigger the “Failed to begin the transaction” log print below, return false, and release the semaphore, than to just silently return false. Alternately this could assert(!m_database.HasActiveTxn());


    achow101 commented at 7:14 pm on February 1, 2024:
    Done. Changed the HasActiveTxn() check to an assert.
  46. in src/wallet/sqlite.cpp:663 in e7ab074cb0 outdated
    659 }
    660 
    661 bool SQLiteBatch::TxnCommit()
    662 {
    663-    if (!m_database.HasActiveTxn()) return false;
    664+    if (!m_txn || !m_database.HasActiveTxn()) return false;
    


    ryanofsky commented at 4:43 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Would suggest just writing if (!m_database.m_db || !m_txn) return false; and dropping the HasActiveTxn() check.

    HasActiveTxn should never be false if m_txn is true, and if it is, it is better to trigger the “Failed to commit transaction” log print and return false than to silently return false. Or alternately, this could assert(m_database.HasActiveTxn())


    achow101 commented at 7:14 pm on February 1, 2024:
    Done. Changed the HasActiveTxn() check to an assert.
  47. in src/wallet/sqlite.cpp:676 in e7ab074cb0 outdated
    673 }
    674 
    675 bool SQLiteBatch::TxnAbort()
    676 {
    677-    if (!m_database.HasActiveTxn()) return false;
    678+    if (!m_txn || !m_database.HasActiveTxn()) return false;
    


    ryanofsky commented at 4:49 pm on February 1, 2024:

    In commit “sqlite: Ensure that only one SQLiteBatch is writing to db at a time” (e7ab074cb0d07dd9a16015702d55491413e286bb)

    Similar comments as for TxnCommit(), would suggest just writing if (!m_database.m_db || !m_txn) return false; and dropping the HasActiveTxn() check which should never be false if m_txn is true.


    achow101 commented at 7:14 pm on February 1, 2024:
    Done. Changed the HasActiveTxn() check to an assert.
  48. ryanofsky commented at 5:06 pm on February 1, 2024: contributor
    Code review fe617be10860cb5108399192ca65d85dc270622d. Looks good but I think there is still a race in SQLiteBatch::Close and a semaphore leak in SQLiteBatch::TxnBegin. Still reviewing, mostly just looked at the first commit so far.
  49. achow101 force-pushed on Feb 1, 2024
  50. in src/wallet/sqlite.cpp:420 in 63e88efe79 outdated
    416             // If transaction cannot be aborted, it means there is a bug or there has been data corruption. Try to recover in this case
    417             // by closing and reopening the database. Closing the database should also ensure that any changes made since the transaction
    418             // was opened will be rolled back and future transactions can succeed without committing old data.
    419             force_conn_refresh = true;
    420+            // If TxnAbort failed, the semaphore was not released, so release it here to avoid deadlocks on future writes.
    421+            m_database.m_write_semaphore.post();
    


    furszy commented at 9:14 pm on February 5, 2024:
    This is an extreme edge case, but unlocking the semaphore prior to refreshing the connection could theoretically allow another thread to dump this information to disk. To be 100% sure that no race will ever happen, this line should be after the connection refreshing process (after line 447).

    achow101 commented at 5:24 pm on February 6, 2024:
    Moved it.
  51. furszy commented at 10:02 pm on February 5, 2024: member

    Code reviewed, looking good. Left a last comment.

    Also, have slightly modified my previous unit test to validate the changes (ba333a239b8aec4a7a6a79d11d4668730ee24803). It is not really needed but feel free to take it if you like it (conceptually, it is similar to your test, the diff is that it is verifying the writes execution order and checks that when a batch finishes writing data to disk, another batch in a concurrent process can read the data immediately).

  52. josibake commented at 5:02 pm on February 6, 2024: member

    Concept ACK

    Started reviewing, looks good. Will continue review tomorrow. Might be worth updating the last part of the description with the insights from #29112 (comment) ?

  53. sqlite: Ensure that only one SQLiteBatch is writing to db at a time
    A SQLiteBatch need to wait for any other batch to finish writing before
    it can begin writing, otherwise db txn state may be incorrectly
    modified. To enforce this, each SQLiteDatabase has a semaphore which
    acts as a lock and is acquired by a batch when it begins a write, erase,
    or a transaction, and is released by it when it is done.
    
    To avoid deadlocking on itself for writing during a transaction,
    SQLiteBatch also keeps track of whether it has begun a transaction.
    395bcd2454
  54. tests: Test for concurrent writes with db tx
    There are occasions where a multi-statement tx is begun in one batch,
    and a second batch is created which does a normal write (without a
    multi-statement tx). These should not conflict with each other and all
    of the data should end up being written to disk.
    548ecd1155
  55. test: wallet, coverage for concurrent db transactions
    Verifying that a database handler can't commit/abort changes
    occurring in a different database handler.
    cfcb9b1ecf
  56. achow101 force-pushed on Feb 6, 2024
  57. furszy commented at 9:58 pm on February 6, 2024: member
    ACK cfcb9b1ecf9be5267487713fa1e112ca09a1ae55
  58. DrahtBot requested review from josibake on Feb 6, 2024
  59. in test/functional/wallet_descriptor.py:41 in 548ecd1155 outdated
    35@@ -33,6 +36,41 @@ def skip_test_if_missing_module(self):
    36         self.skip_if_no_sqlite()
    37         self.skip_if_no_py_sqlite3()
    38 
    39+    def test_concurrent_writes(self):
    40+        self.log.info("Test sqlite concurrent writes are in the correct order")
    41+        self.restart_node(0, extra_args=["-unsafesqlitesync=0"])
    


    ryanofsky commented at 1:45 am on February 7, 2024:

    In commit “tests: Test for concurrent writes with db tx” (548ecd11553bea28bc79e6f9840836283e9c4e99)

    Could there be a comment to explain unsafesqlitesync? Is it just to make the test faster, or is it actually important to the functioning of the test?


    achow101 commented at 4:21 pm on February 7, 2024:
    The option’s help text states Set SQLite synchronous=OFF to disable waiting for the database to sync to disk. This is unsafe and can cause data loss and corruption. This option is only used by tests to improve their performance (default: false)
  60. in src/wallet/test/db_tests.cpp:293 in cfcb9b1ecf
    289+    DatabaseOptions options;
    290+    DatabaseStatus status;
    291+    bilingual_str error;
    292+    const auto& database = MakeSQLiteDatabase(m_path_root / "sqlite", options, status, error);
    293+
    294+    std::unique_ptr<DatabaseBatch> handler = Assert(database)->MakeBatch();
    


    ryanofsky commented at 1:50 am on February 7, 2024:

    In commit “test: wallet, coverage for concurrent db transactions” (cfcb9b1ecf9be5267487713fa1e112ca09a1ae55)

    Would suggest calling this “handler1” instead of “handler” because otherwise when the comments below refer to a “handler” is not clear if they are referring to this variable specifically or a handler generically.

    Also s/handler/batch/g throughout this function would make it more meaningful and consistent with the rest of wallet code, IMO

  61. in src/wallet/test/db_tests.cpp:306 in cfcb9b1ecf
    302+    // But, the same key, does not exist in another handler
    303+    std::unique_ptr<DatabaseBatch> handler2 = Assert(database)->MakeBatch();
    304+    BOOST_CHECK(handler2->Exists(key));
    305+
    306+    // Attempt to commit the handler txn calling the handler2 methods.
    307+    // Which, must not be possible.
    


    ryanofsky commented at 1:53 am on February 7, 2024:

    In commit “test: wallet, coverage for concurrent db transactions” (cfcb9b1ecf9be5267487713fa1e112ca09a1ae55)

    Would be good to clarify why it is not possible. Maybe “Which must not be possible because batch2->TxnBegin() was never called.”

  62. in src/wallet/test/db_tests.cpp:282 in cfcb9b1ecf
    278@@ -279,8 +279,48 @@ BOOST_AUTO_TEST_CASE(txn_close_failure_dangling_txn)
    279     BOOST_CHECK(!batch2->Exists(key));
    280 }
    281 
    282-#endif // USE_SQLITE
    283+BOOST_AUTO_TEST_CASE(concurrent_txn_dont_interfere)
    


    ryanofsky commented at 2:07 am on February 7, 2024:

    In commit “test: wallet, coverage for concurrent db transactions” (cfcb9b1ecf9be5267487713fa1e112ca09a1ae55

    Nothing about this test seems sqlite-specific. Seems like it would be nicer to call TestDatabases and be generic.


    furszy commented at 12:39 pm on February 7, 2024:
    This test was originally generic and it was moved to sqlite-only because of a bdb bug, see #29112#pullrequestreview-1795448339.
  63. in test/functional/wallet_descriptor.py:52 in 548ecd1155 outdated
    47+        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as thread:
    48+            topup = thread.submit(wallet.keypoolrefill, newsize=1000)
    49+
    50+            # Then while the topup is running, we need to do something that will call
    51+            # ChainStateFlushed which will trigger a write to the db, hopefully at the
    52+            # same time that the topup still has an open db transaction.
    


    ryanofsky commented at 2:11 am on February 7, 2024:

    In commit “tests: Test for concurrent writes with db tx” (548ecd11553bea28bc79e6f9840836283e9c4e99)

    This seems fine, but you could probably increase the probability of this by calling keypoolrefill in a loop until gettxoutsetinfo returns instead of calling it once.

  64. ryanofsky approved
  65. ryanofsky commented at 2:19 am on February 7, 2024: contributor

    Code review ACK cfcb9b1ecf9be5267487713fa1e112ca09a1ae55. This looks great and I think it is ready for merge. Just holding off because josibake seemed ready to review #29112 (comment) and might have more feedback.

    I did leave some comments but they are all minor and can be ignored. I was was also wondering about this in the PR description:

    re: #29112#issue-2047565192

    To avoid this situation, we need to prevent the multiple batches from writing at the same time. Although SQLite does have facilities for doing that, they require the Write Ahead Log (WAL) mode, which is a feature that we explicitly chose to disable since it spread wallet data to multiple files. As such, we need to be handling the concurrency ourselves.

    This doesn’t seem accurate to me. As long as all the batch objects are sharing a sqlite3* m_db pointer and using the same connection to the database, a semaphore or mutex needs to be used to keep their transactions separate. This would be true whether the database is in WAL or journaling mode. Also if each batch object had it’s own sqlite3* m_db connection, the semaphore should not be needed regardless of the database mode, IIUC.

  66. josibake commented at 10:03 am on February 7, 2024: member

    ACK https://github.com/bitcoin/bitcoin/pull/29112/commits/cfcb9b1ecf9be5267487713fa1e112ca09a1ae55

    Looks good, nothing to add. Happy to reACK if you end up taking ryanofsky’s suggestions.

  67. DrahtBot removed review request from josibake on Feb 7, 2024
  68. achow101 commented at 4:24 pm on February 7, 2024: member

    To avoid this situation, we need to prevent the multiple batches from writing at the same time. Although SQLite does have facilities for doing that, they require the Write Ahead Log (WAL) mode, which is a feature that we explicitly chose to disable since it spread wallet data to multiple files. As such, we need to be handling the concurrency ourselves.

    This doesn’t seem accurate to me. As long as all the batch objects are sharing a sqlite3* m_db pointer and using the same connection to the database, a semaphore or mutex needs to be used to keep their transactions separate. This would be true whether the database is in WAL or journaling mode. Also if each batch object had it’s own sqlite3* m_db connection, the semaphore should not be needed regardless of the database mode, IIUC.

    Indeed, I misunderstood the documentation. Dropped the sentence.


    Going to leave the nits for now.

  69. ryanofsky assigned ryanofsky on Feb 8, 2024
  70. ryanofsky merged this on Feb 8, 2024
  71. ryanofsky closed this on Feb 8, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-06-29 07:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me