Add Muhash3072 implementation in Python #19105

pull fjahr wants to merge 4 commits into bitcoin:master from fjahr:csi-2-muhash-py changing 4 files +144 −16
  1. fjahr commented at 12:06 pm on May 29, 2020: member

    This is the second in a series of pull requests to implement an Index for UTXO set statistics.

    This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.

  2. DrahtBot added the label Build system on May 29, 2020
  3. DrahtBot added the label Tests on May 29, 2020
  4. DrahtBot added the label Utils/log/libs on May 29, 2020
  5. in test/functional/test_framework/muhash.py:20 in fc6f2bd337 outdated
    15+        r1, r2 = r2, r1 - q * r2
    16+    if r1 > 1:
    17+        return None
    18+    if t1 < 0:
    19+        t1 += n
    20+    return t1
    


    theStack commented at 11:57 am on June 2, 2020:

    To simplify the code, one could just use Fermat’s little theorem here to calculate the modular inverse. The drawback is that it’s much slower than the extended Euclidean algorithm, calculating the modinv of a random 3072-bit number takes approx. 100-150ms on my machine, which is at least 1 order of magnitude slower. Not sure if that’s an issue and in the case of tests whether performance or readability is more important :-)

    0def modinv(a, n):
    1    """Compute the modular inverse of a prime modulo n using Fermat's little theorem."""
    2    return pow(a, n-2, n)
    

    fjahr commented at 8:25 pm on June 2, 2020:
    Thanks, I see your point, however, given how often these tests run on people’s machines and in the CI environment, performance does matter quite a bit. But I think it’s a great question to discuss during the PR review club next week :)

    sipa commented at 10:31 pm on June 2, 2020:
    Also, if comprehension is a concern, I suspect that people who are familiar with modular inverses will generally understand both the euclidean and the fermat approach; and to people who aren’t familiar with it both will look like black magic.

    theStack commented at 12:57 pm on June 3, 2020:
    @fjahr @sipa: Fair enough! By the way, Python 3.8 introduced support for negative exponents in the modulo (three argument) pow() (see https://bugs.python.org/issue36027, https://github.com/python/cpython/pull/13266), internally calculating the modular inverse via the extended Euclidean algorithm. I.e. somewhere in the future it could be just a very simple: return pow(a, -1, n)

    Sjors commented at 10:39 am on June 9, 2020:

    That’s worth annotating in a TODO: // Python 3.8: return pow(a, -1, n) (tested locally with Python 3.8.2)

    I like having the simpler approach here. It’s yet another sanity check that our implementation is correct, given the lack of test vectors. But given the performance impact, if it really matters compared to the rest of the test, better leave that as a TODO.


    narula commented at 5:48 pm on June 10, 2020:
    It would be nice to have a unittest to confirm that modinv(a, n) produces the same results as using pow(a, n-2, n)

    fjahr commented at 2:41 pm on June 11, 2020:
    @narula I added a unit test for this.
  6. fjahr commented at 8:29 pm on June 2, 2020: member
    Added a super simple test that reimplements the C++ impl unit test in Python. I am working on more exhaustive tests for the next PR in this series.
  7. fjahr force-pushed on Jun 5, 2020
  8. jnewbery added the label Review club on Jun 5, 2020
  9. DrahtBot commented at 2:03 am on June 6, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19253 (Tests: tidy up address.py and segwit_addr.py by jnewbery)
    • #19145 (Add hash_type MUHASH for gettxoutsetinfo by fjahr)
    • #17977 (Implement BIP 340-342 validation (Schnorr/taproot/tapscript) by sipa)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  10. in test/functional/test_framework/muhash.py:8 in a6ee4c2cee outdated
    0@@ -0,0 +1,80 @@
    1+# Copyright (c) 2020 Pieter Wuille
    2+# Distributed under the MIT software license, see the accompanying
    3+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
    4+"""Native Python MuHash3072 implementation."""
    5+
    6+import hashlib
    7+
    8+def modinv(a, n):
    


    jnewbery commented at 1:38 pm on June 9, 2020:
    This is duplication of the code in key.py. Can you move it into util.py or a new crypto_util.py instead of repeating it?

    fjahr commented at 2:41 pm on June 11, 2020:
    done
  11. in test/functional/test_framework/muhash.py:11 in a6ee4c2cee outdated
    17+        return None
    18+    if t1 < 0:
    19+        t1 += n
    20+    return t1
    21+
    22+def rot32(v, bits):
    


    jnewbery commented at 1:49 pm on June 9, 2020:
    Calling this function with bits < 0 or bits > 32 throws. Perhaps call bits %= 32 to reduce it to a valid value?

    fjahr commented at 2:41 pm on June 11, 2020:
    done
  12. in test/functional/test_framework/muhash.py:27 in a6ee4c2cee outdated
    22+def rot32(v, bits):
    23+    """Rotate the 32-bit value v left by bits bits."""
    24+    return ((v << bits) & 0xffffffff) | (v >> (32 - bits))
    25+
    26+def chacha20_doubleround(s):
    27+    """Apply a ChaCha20 double round to 16-element state array s."""
    


    jnewbery commented at 1:53 pm on June 9, 2020:
    I suggest you link to the specification for chacha here: https://cr.yp.to/chacha/chacha-20080128.pdf or here: https://tools.ietf.org/html/rfc7539

    troygiorshev commented at 1:39 pm on June 10, 2020:
    Possibly only the first link. There are slight differences between those two and https://tools.ietf.org/html/rfc8439. See #19225

    fjahr commented at 2:42 pm on June 11, 2020:
    done
  13. in test/functional/test_framework/muhash.py:28 in a6ee4c2cee outdated
    23+    """Rotate the 32-bit value v left by bits bits."""
    24+    return ((v << bits) & 0xffffffff) | (v >> (32 - bits))
    25+
    26+def chacha20_doubleround(s):
    27+    """Apply a ChaCha20 double round to 16-element state array s."""
    28+    for a, b, c, d in ((0, 4,  8, 12), (1, 5,  9, 13), (2, 6, 10, 14), (3, 7, 11, 15), (0, 5, 10, 15), (1, 6, 11, 12), (2, 7,  8, 13), (3, 4,  9, 14)):
    


    jnewbery commented at 1:58 pm on June 9, 2020:
    nit: remove double spaces

    jnewbery commented at 2:45 pm on June 9, 2020:

    Explicitly naming the quarter rounds would make this clearer:

    0    QUARTER_ROUNDS = [(0, 4, 8, 12),
    1                      (1, 5, 9, 13),
    2                      (2, 6, 10, 14),
    3                      (3, 7, 11, 15),
    4                      (0, 5, 10, 15),
    5                      (1, 6, 11, 12),
    6                      (2, 7, 8, 13),
    7                      (3, 4, 9, 14)]
    8    for a, b, c, d in QUARTER_ROUNDS:
    

    fjahr commented at 2:42 pm on June 11, 2020:
    done

    fjahr commented at 2:43 pm on June 11, 2020:
    done
  14. in test/functional/test_framework/muhash.py:42 in a6ee4c2cee outdated
    37+
    38+def chacha20_32_to_384(key32):
    39+    """Specialized ChaCha20 implementation with 32-byte key, 0 IV, 384-byte output."""
    40+    init = [1634760805, 857760878, 2036477234, 1797285236] + [0] * 12
    41+    for i in range(8):
    42+        init[4 + i] = int.from_bytes(key32[4*i:4*(i+1)], 'little')
    


    jnewbery commented at 2:00 pm on June 9, 2020:
    nit: this would be slightly easier on the eye with spaces around the operators (see https://www.python.org/dev/peps/pep-0008/#other-recommendations)

    jnewbery commented at 3:13 pm on June 9, 2020:

    Again, being explicitly about where these magic numbers come from would be nice:

    0    # See RFC 7539 section 2.3 for chacha20 parameters
    1    CONSTANTS = [0x61707865, 0x3320646e, 0x79622d32, 0x6b206574]
    2    key_bytes = []
    3    for i in range(8):
    4        key_bytes[i] = int.from_bytes(key32[4*i:4*(i+1)], 'little')
    5    INITIALIZATION_VECTOR = [0] * 4
    6    init = CONSTANTS + key_bytes + INITIALIZATION_VECTOR
    

    fjahr commented at 2:43 pm on June 11, 2020:
    done but also added extra brackets because otherwise it seemed weird to me

    fjahr commented at 2:43 pm on June 11, 2020:
    done
  15. in test/functional/test_framework/muhash.py:55 in a6ee4c2cee outdated
    39+    """Specialized ChaCha20 implementation with 32-byte key, 0 IV, 384-byte output."""
    40+    init = [1634760805, 857760878, 2036477234, 1797285236] + [0] * 12
    41+    for i in range(8):
    42+        init[4 + i] = int.from_bytes(key32[4*i:4*(i+1)], 'little')
    43+    out = bytearray()
    44+    for pos in range(6):
    


    jnewbery commented at 3:17 pm on June 9, 2020:
    What is pos? Would counter be a better name, since this value is used as the block counter?

    fjahr commented at 2:54 pm on June 11, 2020:
    yeah, I think that’s better. done
  16. in test/functional/test_framework/muhash.py:47 in a6ee4c2cee outdated
    42+        init[4 + i] = int.from_bytes(key32[4*i:4*(i+1)], 'little')
    43+    out = bytearray()
    44+    for pos in range(6):
    45+        init[12] = pos
    46+        s = list(init)
    47+        for rnd in range(10):
    


    jnewbery commented at 3:17 pm on June 9, 2020:
    s/rnd/round/. rnd is often used to mean random, which is confusing here.

    fjahr commented at 2:46 pm on June 11, 2020:
    It’s not used anyway so I just didn’t assign the value to a variable. Since the loop is just one line I think it’s still understandable.
  17. in test/functional/test_framework/muhash.py:46 in a6ee4c2cee outdated
    41+    for i in range(8):
    42+        init[4 + i] = int.from_bytes(key32[4*i:4*(i+1)], 'little')
    43+    out = bytearray()
    44+    for pos in range(6):
    45+        init[12] = pos
    46+        s = list(init)
    


    jnewbery commented at 3:20 pm on June 9, 2020:

    I think this would be clearer as:

    0        s = init.copy()
    

    troygiorshev commented at 1:52 pm on June 10, 2020:
    Also .copy() is around twice as fast.

    fjahr commented at 2:46 pm on June 11, 2020:
    done
  18. in test/functional/test_framework/muhash.py:54 in a6ee4c2cee outdated
    49+        for i in range(16):
    50+            out.extend(((s[i] + init[i]) & 0xffffffff).to_bytes(4, 'little'))
    51+    return bytes(out)
    52+
    53+def data_to_num3072(data):
    54+    """Map a byte array data to a 3072-bit number."""
    


    jnewbery commented at 3:39 pm on June 9, 2020:

    I think this docstring should indicate that it’s hashing using Chacha20

    0    """Hash a 32-byte array data to a 3072-bit number using 6 Chacha20 operations."""
    

    fjahr commented at 2:46 pm on June 11, 2020:
    done
  19. in test/functional/test_framework/muhash.py:59 in a6ee4c2cee outdated
    54+    """Map a byte array data to a 3072-bit number."""
    55+    bytes384 = chacha20_32_to_384(data)
    56+    return int.from_bytes(bytes384, 'little')
    57+
    58+class MuHash3072:
    59+    """Class representing the MuHash3072 computation of a set."""
    


    jnewbery commented at 3:40 pm on June 9, 2020:

    Reference where the algorithm comes from:


    fjahr commented at 2:46 pm on June 11, 2020:
    done
  20. in test/functional/rpc_blockchain.py:248 in a6ee4c2cee outdated
    244@@ -241,6 +245,15 @@ def _test_gettxoutsetinfo(self):
    245         del res['disk_size'], res3['disk_size']
    246         assert_equal(res, res3)
    247 
    248+        self.log.info("Test that MuHash implementation in Python returns the same result as C++")
    


    jnewbery commented at 3:42 pm on June 9, 2020:
    This should be a unit test in muhash.py. See #18576 for details on the python unit tests.

    fjahr commented at 2:46 pm on June 11, 2020:
    done
  21. in test/functional/test_framework/muhash.py:66 in a6ee4c2cee outdated
    53+def data_to_num3072(data):
    54+    """Map a byte array data to a 3072-bit number."""
    55+    bytes384 = chacha20_32_to_384(data)
    56+    return int.from_bytes(bytes384, 'little')
    57+
    58+class MuHash3072:
    


    ysangkok commented at 4:38 pm on June 10, 2020:
    this should implement MutableSet

    fjahr commented at 2:54 pm on June 11, 2020:
    Thanks for this suggestion, but I am not convinced that this is a good idea. I think it would be confusing because this is not really a Set but a hash representing a Set. So there is no way for example to know about the inclusion of a specific value or the number of values already added. There is also no enforcement here that this is actually a Set. You could add the same value twice and it would work. We are just not sure if this breaks security guarantees and the way MuHash is intended to be used in Bitcoin Core, for now, this should not happen because the UTXO set is a Set. So doing this would send the wrong message I think. And I would also like to keep this class as plain as possible.

    ysangkok commented at 2:56 pm on September 1, 2020:

    You could add the same value twice and it would work.

    Sets commonly ignore the second insertion. The set in Python also works like this.


    fjahr commented at 3:43 pm on September 1, 2020:
    Yes, but that is the whole point: Muhash cannot ignore the second insertion. Adding an element the second time changes the internal state because there is no way to tell if an element is already in the Muhash or not.
  22. in test/functional/test_framework/muhash.py:79 in a6ee4c2cee outdated
    63+    def __init__(self):
    64+        """Initialize for an empty set."""
    65+        self.numerator = 1
    66+        self.denominator = 1
    67+
    68+    def insert(self, data):
    


    ysangkok commented at 4:39 pm on June 10, 2020:
    the method is called add in collections.abc.
  23. in test/functional/test_framework/muhash.py:83 in a6ee4c2cee outdated
    67+
    68+    def insert(self, data):
    69+        """Insert a byte array data in the set."""
    70+        self.numerator = (self.numerator * data_to_num3072(data)) % self.MODULUS
    71+
    72+    def remove(self, data):
    


    ysangkok commented at 4:39 pm on June 10, 2020:
    the method is called discard in collections.abc
  24. ysangkok changes_requested
  25. ysangkok commented at 4:40 pm on June 10, 2020: none
    it would be clearer to use the official interfaces defined in collections.abc instead of inventing new names.
  26. fjahr force-pushed on Jun 11, 2020
  27. fjahr force-pushed on Jun 11, 2020
  28. fjahr force-pushed on Jun 11, 2020
  29. fjahr force-pushed on Jun 11, 2020
  30. fjahr commented at 3:02 pm on June 11, 2020: member
    Rebased and addressed all review comments. This is now also using SHA256 as discussed in #19055 .
  31. DrahtBot added the label Needs rebase on Jun 11, 2020
  32. fjahr force-pushed on Jun 11, 2020
  33. DrahtBot removed the label Needs rebase on Jun 12, 2020
  34. Sjors commented at 6:03 pm on June 12, 2020: member
    The chacha code also deserves its own file (and commit).
  35. sipa commented at 6:08 pm on June 12, 2020: member
    @Sjors I’d agree if it was generically useful ChaCha20 code, but given that’s it’s a minimal specialized implementation just for 3072-bit outputs with IV 0, I’m less convinced.
  36. jnewbery commented at 7:56 pm on June 12, 2020: member
    We’ll need a python chacha implementation when BIP324 lands, and the functionality in chacha20_doubleround() could be part of that, but until BIP324 I think it’s fine to have it in the same file as the muhash code.
  37. in test/functional/test_framework/util.py:637 in 46c3ef6387 outdated
    629@@ -629,3 +630,33 @@ def find_vout_for_address(node, txid, addr):
    630         if any([addr == a for a in tx["vout"][i]["scriptPubKey"]["addresses"]]):
    631             return i
    632     raise RuntimeError("Vout not found for address: txid=%s, addr=%s" % (txid, addr))
    633+
    634+def modinv(a, n):
    635+    """Compute the modular inverse of a modulo n
    636+
    637+    See https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers.
    


    Sjors commented at 10:59 am on July 6, 2020:
    You could add a note here on why we don’t use pow(a, n-2, n)
  38. in test/functional/test_framework/muhash.py:92 in 88997dd738 outdated
    87+        """Remove a byte array from the set."""
    88+        self.denominator = (self.denominator * data_to_num3072(data)) % self.MODULUS
    89+
    90+    def digest(self):
    91+        """Extract the final hash. Does not modify this object."""
    92+        # TODO: With Python 3.8 modinv can be replaced with pow(a, -1, n)
    


    Sjors commented at 11:01 am on July 6, 2020:
    Nit: since there’s another place that uses modinv, the helper function might be a better place to put this TODO. The helper function is useful anyway, because modinv is more readable than pow(a, -1, n).
  39. Sjors approved
  40. Sjors commented at 11:13 am on July 6, 2020: member

    ACK 88997dd738759ed0e399f0da56bb9937079ffdd4

    You could add a test framework unit test for the chacha functions.

    I don’t mind merging this before the C++ implementation, though it might make sense to wait for #19105.

  41. fjahr force-pushed on Jul 9, 2020
  42. fjahr commented at 11:52 am on July 9, 2020: member
    Took @Sjors suggestions: moved the TODO comment to the util function and added chacha20 test vectors with nonce 0.
  43. Sjors commented at 1:37 pm on July 9, 2020: member
    re-ACK 352c702901b369132de109ddf7b6b1addc512b34
  44. in test/functional/test_framework/util.py:639 in 352c702901 outdated
    634+def modinv(a, n):
    635+    """Compute the modular inverse of a modulo n using the extended Euclidean
    636+    Algorithm. See https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers.
    637+    """
    638+    # TODO: This does not use Python's own modular inverse pow(a, n-2, n) since
    639+    # that uses Fermat's little theorem which is significantly slower. However,
    


    jnewbery commented at 7:31 am on July 16, 2020:

    “… uses Fermat’s little theorem which is significantly slower” doesn’t really make sense. Fermat’s little theorem is that a^p = a (mod p). It’s not an algorithm for calculating powers.

    I’d suggest just removing this TODO entirely.


    Sjors commented at 10:26 am on July 16, 2020:
    // TODO: change to pow(a, -1, n) should be kept though (because it’s more readable)

    jnewbery commented at 1:26 pm on July 16, 2020:
    meh. Now that it’s implemented, it’s implemented. I don’t see much benefit to changing it (and in general I don’t think we should be leaving TODOs in the code)

    theStack commented at 2:48 pm on July 16, 2020:
    My two cents: it almost always makes sense to use what the standard library provides instead of reinventing the wheel with own implementations. Same as C++17 and C++20 substitutes will eventually be replaced when we switch over (e.g. for std::span or std::optional), the same should be done for Python functions in the future. No need to keep the codebase larger than necessary. “Now that it’s implemented, it’s implemented.” doesn’t really sound like a convincing argument to me.

    troygiorshev commented at 3:07 pm on July 16, 2020:
    Suggest to remove the TODO and open this change as a PR or issue and discuss it there. Then comments won’t get lost, we can show benchmarks, etc. Eventually this change will need review (with evidence) before it is merged. We’re not just going to blindly trust a TODO, so there’s no value in having it.

    fjahr commented at 4:24 pm on July 16, 2020:
    For now, I have kept a shortened TODO since personally I have a slight preference to keep it. While I dislike most instances of TODOs in the code, this is an ok one IMO because it is very clear what there is to do and when it should be done. I don’t think it’s better to open a PR now that will have to stay open for a long time until the min version is bumped. It’s just a different kind of clutter in the project.

    jnewbery commented at 4:30 pm on July 16, 2020:
    ok!
  45. in test/functional/test_framework/muhash.py:6 in 352c702901 outdated
     6+from .util import (
     7+    assert_equal,
     8+    modinv,
     9+)
    10+
    11+import hashlib
    


    jnewbery commented at 7:38 am on July 16, 2020:
    Normally, standard library modules are imported before local project modules: https://www.python.org/dev/peps/pep-0008/#imports

    jonatack commented at 8:02 am on July 16, 2020:
    Yes, further on this, a couple of Python linters I’ve been finding useful are pycodestyle (pip install pycodestyle) and black (pip install black), not necessarily to follow blindly but they do provide helpful suggestions.
  46. in test/functional/test_framework/muhash.py:104 in 352c702901 outdated
     99+        muhash.insert([0]*32)
    100+        muhash.insert([1] + [0]*31)
    101+        muhash.remove([2] + [0]*31)
    102+        finalized = muhash.digest()
    103+        # This mirrors the result in the C++ MuHash3072 unit test
    104+        assert_equal(finalized[::-1].hex(), "a44e16d5e34d259b349af21c06e65d653915d2e208e4e03f389af750dc0bfdc3")
    


    jnewbery commented at 7:50 am on July 16, 2020:
    prefer unittest’s self.assertEqual to our own version here.
  47. jnewbery commented at 7:51 am on July 16, 2020: member
    Looks good. Just a few minor comments inline.
  48. test: Move modinv to util and add unit test ab30cece0e
  49. test: Add Python MuHash3072 implementation to test framework b85543cb73
  50. test: Add basic Python/C++ Muhash implementation parity unit test 0e2b400fea
  51. test: Add chacha20 test vectors in muhash 36ec9801a4
  52. fjahr force-pushed on Jul 16, 2020
  53. fjahr commented at 4:24 pm on July 16, 2020: member
    Addressed @jnewbery ’s review comments.
  54. jnewbery commented at 4:32 pm on July 16, 2020: member
    utACK 36ec9801a
  55. laanwj commented at 2:48 pm on September 1, 2020: member
    Code review ACK 36ec9801a4edb9663ef9ce9ad298233766b903e8
  56. laanwj merged this on Sep 1, 2020
  57. laanwj closed this on Sep 1, 2020

  58. sidhujag referenced this in commit a97b9a0281 on Sep 3, 2020
  59. in test/functional/test_framework/muhash.py:16 in 36ec9801a4
    11+def rot32(v, bits):
    12+    """Rotate the 32-bit value v left by bits bits."""
    13+    bits %= 32  # Make sure the term below does not throw an exception
    14+    return ((v << bits) & 0xffffffff) | (v >> (32 - bits))
    15+
    16+def chacha20_doubleround(s):
    


    vindard commented at 12:13 pm on February 20, 2021:
    Adding a little explainer video here for anyone else curious about ChaCha20: https://youtu.be/UeIpq-C-GSA
  60. laanwj removed the label Build system on Feb 22, 2021
  61. Fabcien referenced this in commit 8986cfad54 on Sep 21, 2021
  62. kittywhiskers referenced this in commit 3c455b14e5 on Feb 26, 2022
  63. kittywhiskers referenced this in commit 65393bbaba on Feb 26, 2022
  64. kittywhiskers referenced this in commit 8324766317 on Apr 3, 2022
  65. kittywhiskers referenced this in commit 757a8cabe8 on Apr 20, 2022
  66. kittywhiskers referenced this in commit f3521c334d on Apr 24, 2022
  67. kittywhiskers referenced this in commit 5afca7f73c on Apr 27, 2022
  68. DrahtBot locked this on Aug 16, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-29 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me