contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO #33702

pull maflcko wants to merge 7 commits into bitcoin:master from maflcko:2510-everything-is-utf8 changing 63 files +217 −299
  1. maflcko commented at 10:07 am on October 25, 2025: member

    Historically, there was an attempt via test/lint/lint-python-utf8-encoding.py to enforce explicit UTF8 in every Python IO statement (open, subprocess, …). However, the lint check has many problems:

    • The check is incomplete and many IO statements lack the explicit UTF8 specification.
    • It was added at a time when some systems were not UTF8 by default.
    • The check is brittle, as it depends on a fragile regex.

    In theory, now that the minimum Python version is 3.10 (since commit 2123c94448ed142e78942421c597a1f264859c48), the check could be replaced by PYTHONWARNDEFAULTENCODING=1 from https://docs.python.org/3/whatsnew/3.10.html#optional-encodingwarning-and-encoding-locale-option. However, this comes with many other problems:

    • All our Python scripts already assume and require UTF8 to be set externally. On almost all modern systems, this is already the default. Some Windows versions do not have UTF8 by default and require PYTHONUTF8=1 to be set for the tests to run already today (with or without the changes in this pull). Also, the CI and many other Bash scripts force UTF8 via LC_ALL. Finally, Python 3.15 will likely enable UTF8 on all systems by default, per https://peps.python.org/pep-0686/#abstract.
    • So adding UTF8 to every single IO call is redundant, verbose, and confusing, given that it is the expected default.

    So fix all issues, by:

    • Removing the test/lint/lint-python-utf8-encoding.py check.
    • Removing the encoding on the individual IO calls.
    • Clarifying the existing docs around the existing UTF8 requirement and assumption.

    Obviously, every IO call is still free to specify UTF8 or any other encoding explicitly, if there is a documented need for it in the future.

  2. DrahtBot renamed this:
    contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO
    contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO
    on Oct 25, 2025
  3. DrahtBot added the label Scripts and tools on Oct 25, 2025
  4. DrahtBot commented at 10:07 am on October 25, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33702.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK theStack, laanwj

    If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33184 (test: Refactor rpc_getblockstats.py to use MiniWallet by enirox001)
    • #32929 (qa: Avoid knock-on exception in assert_start_raises_init_error by hodlinator)
    • #32928 (test: add logging to mock external signers by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  5. maflcko force-pushed on Oct 25, 2025
  6. fanquake commented at 4:43 pm on October 27, 2025: member
  7. maflcko force-pushed on Oct 29, 2025
  8. laanwj commented at 12:22 pm on October 31, 2025: member
    Concept ACK
  9. in test/lint/lint-submodule.py:3 in fa106c959c outdated
    0@@ -1,6 +1,6 @@
    1 #!/usr/bin/env python3
    2 #
    3-# Copyright (c) 2022 The Bitcoin Core developers
    4+# Copyright (c) 2022-present The Bitcoin Core developers
    


    laanwj commented at 12:50 pm on October 31, 2025:

    A copyright header change slipped in here 😄

    Edit: oh, never mind, there’s more of these here. don’t think it’s important.


    maflcko commented at 1:36 pm on October 31, 2025:

    Yeah, I try to update them while touching the file. Otherwise, one of the bot accounts will “follow-up” with it. Just some examples from this year: #32008, #31642, #32963, …

    Will leave as-is for now, but I am thinking that I may follow-up with something to close this bot and churn issue.

  10. laanwj approved
  11. laanwj commented at 1:00 pm on October 31, 2025: member
    Code review ACK fa2fd0ba1fbd4085df24a2c5472636957db28521
  12. theStack approved
  13. theStack commented at 8:23 pm on November 3, 2025: contributor

    Concept and code-review ACK fa2fd0ba1fbd4085df24a2c5472636957db28521

    Didn’t test on Windows. Regarding the first commit, I was wondering why supporting locale-dependent bash scripts was ever relevant, and it seems this was due to a workaround for a segfault-causing bug in shellcheck back then (see #13454 (comment) ff., in case other reviewers wonder as well).

  14. maflcko requested review from hebasto on Nov 4, 2025
  15. DrahtBot added the label Needs rebase on Nov 22, 2025
  16. hebasto commented at 2:51 pm on November 22, 2025: member

    I’ve tested fa2fd0ba1fbd4085df24a2c5472636957db28521 on a Windows 11 laptop using VS 2022 17.14.21.

    Functional tests work as expected.

  17. lint: Do not allow locale dependent shell scripts
    Bash is discouraged, and there was never a need to write locale
    dependent Bash.
    
    So remove the option and clarify that the LC_ALL settings enable UTF-8
    mode in Python.
    fa83e3a81d
  18. test: Clarify that Python UTF-8 mode is the default today for most systems
    It will likely be the default for all systems, starting with Python
    3.15, according to https://peps.python.org/pep-0686/#abstract.
    
    It is hard to find a system other than Windows that has it not enabled
    today. Nonetheless, Bitcoin Core requires UTF-8 in scripts and normally
    enforces it via LC_ALL=C.UTF-8 or PYTHONUTF8=1.
    faf39d8539
  19. lint: Drop check to enforce encoding to be specified in Python scripts
    The check was incomplete and brittle. A better check would be to enable
    `PYTHONWARNDEFAULTENCODING=1`
    https://docs.python.org/3/whatsnew/3.10.html#optional-encodingwarning-and-encoding-locale-option
    
    However, it is unclear what the goal of adding explicit encodings
    everywhere is, given that:
    
    * Most modern systems already have UTF-8 enabled by default, except for
      Windows.
    * Python 3.15 will likely enable it globally by default, according to
      https://peps.python.org/pep-0686/#abstract
    * Adding the explicit encodings will bloat all code for no benefit.
    
    So remove the lint check and drop all redundant encoding= kwargs.
    
    All encoding= that are set for a reason, are kept.
    fa7d72bd1b
  20. contrib: Remove confusing and redundant encoding from IO
    The encoding arg is confusing, because it is not applied consistently
    for all IO.
    
    Also, it is useless, as the majority of files are ASCII encoded, which
    are fine to encode and decode with any mode.
    
    Moreover, UTF-8 is already required for most scripts to work properly,
    so setting the encoding twice is redundant.
    
    So remove the encoding from most IO. It would be fine to remove from all
    IO, however I kept it for two files:
    
    * contrib/asmap/asmap-tool.py: This specifically looks for utf-8
      encoding errors, so it makes sense to sepecify the utf-8 encoding
      explicitly.
    * test/functional/test_framework/test_node.py: Reading the debug log in
      text mode specifically counts the utf-8 characters (not bytes), so it
      makes sense to specify the utf-8 encoding explicitly.
    fae612424b
  21. scripted-diff: Bump copyright headers after encoding changes
    Historically, the headers have been bumped some time after a file has
    been touched. Do it now to avoid having to touch them again in the
    future for that reason.
    
    -BEGIN VERIFY SCRIPT-
     sed -i --regexp-extended 's;( 20[0-2][0-9])(-20[0-2][0-9])? The Bitcoin Core developers;\1-present The Bitcoin Core developers;g' $( git show --pretty="" --name-only HEAD~0 )
    -END VERIFY SCRIPT-
    fa71c15f86
  22. contrib: Use text=True in subprocess over manual encoding handling
    All touched Python scripts already assume and require UTF8, so manually
    specifying encoding or decoding for functions in the subprocess module
    is redundant to just using text=True, which exists since Python 3.7
    fab085c15f
  23. test: Fix "typo" in written invalid content
    The appended content is irrelevant, but fix the "typo" to avoid
    spellchecker warnings.
    fad6118586
  24. maflcko force-pushed on Nov 26, 2025
  25. maflcko commented at 11:39 am on November 26, 2025: member
    rebased for a one-line conflict and a one-line silent conflict. Should be trivial to re-review via git range-diff
  26. theStack approved
  27. theStack commented at 11:59 am on November 26, 2025: contributor

    re-ACK fad61185861a6a9ed806c387aa63d2b31262b1db

    as per $ git range-diff fa2fd0ba1f...fad6118586

  28. DrahtBot requested review from laanwj on Nov 26, 2025
  29. DrahtBot removed the label Needs rebase on Nov 26, 2025
  30. laanwj approved
  31. laanwj commented at 11:46 am on December 2, 2025: member

    Re-ACK fad61185861a6a9ed806c387aa63d2b31262b1db

    0$ git range-diff 3bb30658e631ed45b6c8609292facc7ae3dd0f61..fa2fd0ba1fbd4085df24a2c5472636957db28521 b30262dcaa28c40a0a5072847b7194b3db203160..fad61185861a6a9ed806c387aa63d2b31262b1db
    

    • 4: 4444edeecc528ccd88b7dd78ffddbccd69762132 ! 4: fae612424b3e70acd6011a4459518174463b3424 contrib: Remove confusing and redundant encoding from IO - the only differing commit: checked that it is rebase

  32. fanquake merged this on Dec 3, 2025
  33. fanquake closed this on Dec 3, 2025

  34. in test/functional/feature_framework_startup_failures.py:52 in fad6118586
    51         except subprocess.TimeoutExpired as e:
    52             print("Unexpected child process timeout!\n"
    53                   "WARNING: Timeouts like this halt execution of TestNode logic, "
    54                   "meaning dangling bitcoind processes are to be expected.\n"
    55-                  f"<CHILD OUTPUT BEGIN>\n{e.output.decode('utf-8')}\n<CHILD OUTPUT END>",
    56+                  f"<CHILD OUTPUT BEGIN>\n{e.output}\n<CHILD OUTPUT END>",
    


    hodlinator commented at 2:13 pm on December 3, 2025:

    The change on this specific line is incorrect, see 9481948e0d979dccedd7b4e6616ef4903bc1cdba (#32929) for a fix.

    Verified that the changes to test/lint/lint-spelling.py (where we are also printing output from an exception field) to the contrary are correct in this PR by passing in an invalid argument in the subprocess call.


    maflcko commented at 1:48 pm on December 8, 2025:
    Heh, I wish all of this was written in Rust, so that types are checked at compile-time :sweat_smile:
  35. maflcko deleted the branch on Dec 8, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-12-12 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me