contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO #33702

pull maflcko wants to merge 7 commits into bitcoin:master from maflcko:2510-everything-is-utf8 changing 63 files +216 −298
  1. maflcko commented at 10:07 am on October 25, 2025: member

    Historically, there was an attempt via test/lint/lint-python-utf8-encoding.py to enforce explicit UTF8 in every Python IO statement (open, subprocess, …). However, the lint check has many problems:

    • The check is incomplete and many IO statements lack the explicit UTF8 specification.
    • It was added at a time when some systems were not UTF8 by default.
    • The check is brittle, as it depends on a fragile regex.

    In theory, now that the minimum Python version is 3.10 (since commit 2123c94448ed142e78942421c597a1f264859c48), the check could be replaced by PYTHONWARNDEFAULTENCODING=1 from https://docs.python.org/3/whatsnew/3.10.html#optional-encodingwarning-and-encoding-locale-option. However, this comes with many other problems:

    • All our Python scripts already assume and require UTF8 to be set externally. On almost all modern systems, this is already the default. Some Windows versions do not default to UTF8 and already require PYTHONUTF8=1 for the tests to run today (with or without the changes in this pull). Also, the CI and many other Bash scripts force UTF8 via LC_ALL. Finally, Python 3.15 will likely enable UTF8 on all systems by default, per https://peps.python.org/pep-0686/#abstract.
    • So adding UTF8 to every single IO call is redundant, verbose, and confusing, given that it is the expected default.
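
To illustrate the externally set UTF8 requirement, here is a minimal sketch (not code from this pull) that starts a child interpreter with PYTHONUTF8=1 and reports what it sees. The names PROBE and probe_child are hypothetical and exist only for this example.

```python
import os
import subprocess
import sys

# One-line program for the child: print the UTF-8 mode flag and the encoding
# that open() picks when no encoding= argument is passed.
PROBE = (
    "import os, sys; f = open(os.devnull); "
    "print(sys.flags.utf8_mode, f.encoding.lower()); f.close()"
)


def probe_child(env_overrides):
    """Run PROBE in a child interpreter and return (utf8_mode, default_encoding)."""
    env = dict(os.environ)
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    mode, encoding = result.stdout.split()
    return int(mode), encoding


# With PYTHONUTF8=1 the child runs in UTF-8 mode, regardless of the locale.
print(probe_child({"PYTHONUTF8": "1"}))
```

Once UTF-8 mode is on, open() without encoding= already uses UTF-8, which is why repeating the argument on each call adds nothing.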

    So fix all issues by:

    • Removing the test/lint/lint-python-utf8-encoding.py check.
    • Removing the encoding on the individual IO calls.
    • Clarifying the existing docs around the existing UTF8 requirement and assumption.

    Obviously, every IO call is still free to specify UTF8 or any other encoding explicitly, if there is a documented need for it in the future.

  2. DrahtBot renamed this:
    contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO
    contrib: Remove brittle, confusing and redundant UTF8 encoding from Python IO
    on Oct 25, 2025
  3. DrahtBot added the label Scripts and tools on Oct 25, 2025
  4. DrahtBot commented at 10:07 am on October 25, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33702.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33184 (test: Replace legacy wallet with MiniWallet in rpc_getblockstats.py by enirox001)
    • #32929 (qa: Avoid knock-on exception in assert_start_raises_init_error by hodlinator)
    • #32928 (test: add logging to mock external signers by Sjors)
    • #31974 (Drop testnet3 by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  5. maflcko force-pushed on Oct 25, 2025
  6. fanquake commented at 4:43 pm on October 27, 2025: member
  7. lint: Do not allow locale dependent shell scripts
    Bash is discouraged, and there was never a need to write locale
    dependent Bash.
    
    So remove the option and clarify that the LC_ALL settings enable UTF-8
    mode in Python.
    fa96a2f2f6
  8. test: Clarify that Python UTF-8 mode is the default today for most systems
    It will likely be the default for all systems, starting with Python
    3.15, according to https://peps.python.org/pep-0686/#abstract.
    
    It is hard to find a system other than Windows that does not have it enabled
    today. Nonetheless, Bitcoin Core requires UTF-8 in scripts and normally
    enforces it via LC_ALL=C.UTF-8 or PYTHONUTF8=1.
    fa626ba475
  9. lint: Drop check to enforce encoding to be specified in Python scripts
    The check was incomplete and brittle. A better check would be to enable
    `PYTHONWARNDEFAULTENCODING=1`
    https://docs.python.org/3/whatsnew/3.10.html#optional-encodingwarning-and-encoding-locale-option
    
    However, it is unclear what the goal of adding explicit encodings
    everywhere is, given that:
    
    * Most modern systems already have UTF-8 enabled by default, except for
      Windows.
    * Python 3.15 will likely enable it globally by default, according to
      https://peps.python.org/pep-0686/#abstract
    * Adding the explicit encodings will bloat all code for no benefit.
    
    So remove the lint check and drop all redundant encoding= kwargs.
    
    All encoding= kwargs that are set for a reason are kept.
    fa716dd2a0
  10. contrib: Remove confusing and redundant encoding from IO
    The encoding arg is confusing, because it is not applied consistently
    for all IO.
    
    Also, it is useless, as the majority of files are ASCII encoded, which
    are fine to encode and decode with any mode.
    
    Moreover, UTF-8 is already required for most scripts to work properly,
    so setting the encoding twice is redundant.
    
    So remove the encoding from most IO. It would be fine to remove from all
    IO, however I kept it for two files:
    
    * contrib/asmap/asmap-tool.py: This specifically looks for utf-8
      encoding errors, so it makes sense to specify the utf-8 encoding
      explicitly.
    * test/functional/test_framework/test_node.py: Reading the debug log in
      text mode specifically counts the utf-8 characters (not bytes), so it
      makes sense to specify the utf-8 encoding explicitly.
    4444edeecc
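
The two exceptions in the commit above rely on UTF-8 decoding being part of the logic rather than a default. A minimal sketch of both effects, using a hypothetical temporary file and hypothetical helper names (this is not the code in asmap-tool.py or test_node.py):

```python
import tempfile
from pathlib import Path


def char_and_byte_counts(path):
    """Return (characters, bytes) for a file that holds UTF-8 text."""
    with open(path, encoding="utf-8") as text_file:
        characters = len(text_file.read())
    with open(path, "rb") as binary_file:
        size = len(binary_file.read())
    return characters, size


def is_valid_utf8(path):
    """Return False if strict UTF-8 decoding of the file fails."""
    try:
        with open(path, encoding="utf-8") as text_file:
            text_file.read()
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.log"
    # "na\u00efve \u2713\n" contains two multi-byte characters.
    sample.write_bytes("na\u00efve \u2713\n".encode("utf-8"))
    counts = char_and_byte_counts(sample)
    # 0xff can never appear in valid UTF-8.
    sample.write_bytes(b"\xff\xfe not utf-8\n")
    valid = is_valid_utf8(sample)

print(counts, valid)
```

In the first case the character count is smaller than the byte count, and in the second the decode error is the signal being looked for, so in both cases the explicit encoding documents intent.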
  11. scripted-diff: Bump copyright headers after encoding changes
    Historically, the headers have been bumped some time after a file has
    been touched. Do it now to avoid having to touch them again in the
    future for that reason.
    
    -BEGIN VERIFY SCRIPT-
     sed -i --regexp-extended 's;( 20[0-2][0-9])(-20[0-2][0-9])? The Bitcoin Core developers;\1-present The Bitcoin Core developers;g' $( git show --pretty="" --name-only HEAD~0 )
    -END VERIFY SCRIPT-
    fa991397ae
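
To make the effect of the sed expression above easier to follow, here is a rough Python equivalent applied to single lines. It is only a sketch for reading the regex; the scripted-diff itself uses sed, and bump_copyright is a hypothetical name.

```python
import re

# Same pattern as the sed expression: a start year, an optional end year,
# then the fixed copyright holder text.
HEADER_RE = re.compile(r"( 20[0-2][0-9])(-20[0-2][0-9])? The Bitcoin Core developers")


def bump_copyright(line):
    """Replace an optional end year with '-present', keeping the start year."""
    return HEADER_RE.sub(r"\1-present The Bitcoin Core developers", line)


print(bump_copyright("# Copyright (c) 2014-2022 The Bitcoin Core developers"))
print(bump_copyright("# Copyright (c) 2019 The Bitcoin Core developers"))
```

A header that already ends in "-present" no longer matches the pattern, so running the substitution a second time leaves it unchanged.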
  12. contrib: Use text=True in subprocess over manual encoding handling
    All touched Python scripts already assume and require UTF8, so manually
    specifying encoding or decoding for functions in the subprocess module
    is redundant with simply using text=True, which has existed since Python 3.7.
    fa106c959c
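
A small sketch of the pattern this commit describes, comparing manual decoding with text=True. The child command is a made-up example and is not taken from any of the touched scripts.

```python
import subprocess
import sys

# Hypothetical command: a child interpreter that prints one ASCII line.
cmd = [sys.executable, "-c", "print('ok')"]

# Manual handling: capture bytes, then decode them explicitly.
manual = subprocess.check_output(cmd).decode("utf-8")

# text=True: the subprocess module decodes with the default encoding,
# which is UTF-8 when UTF-8 mode is enabled.
with_text = subprocess.check_output(cmd, text=True)

print(repr(manual.strip()), repr(with_text.strip()))
```

Note that text=True also turns on universal newline handling, so on Windows the two results can differ in their line endings before stripping.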
  13. test: Fix "typo" in written invalid content
    The appended content is irrelevant, but fix the "typo" to avoid
    spellchecker warnings.
    fa2fd0ba1f
  14. maflcko force-pushed on Oct 29, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-10-31 09:13 UTC
