stabilize translations by reverting old ids by text content #33270

pull l0rinc wants to merge 2 commits into bitcoin:master from l0rinc:l0rinc/stabilize-translations changing 5 files +1897 −1716
  1. l0rinc commented at 9:23 am on August 29, 2025: contributor

    Summary

    Regenerating the Qt translation template (src/qt/locale/bitcoin_en.xlf) previously used sequential _msg<N> IDs. When strings moved within the sorted message list, IDs shifted, creating large diffs and making translation platforms treat unchanged strings as new.

    This PR switches XLF <trans-unit id=...> values to stable IDs derived from a SHA256 hash of the message context and source text.

    Details

    As mentioned in #33224 (comment), any change that moves a translatable string within the sorted messages can renumber subsequent entries when IDs are sequential.
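To illustrate the renumbering, here is a minimal Python sketch. It assumes, hypothetically, that an ID is simply the 1-based position of a string in the sorted message list; `sequential_ids` and the sample strings are made up for illustration and are not taken from the actual tooling.

```python
def sequential_ids(messages):
    # Hypothetical stand-in for the old scheme: the ID is the position
    # of the string in the sorted message list.
    return {text: f"_msg{n}" for n, text in enumerate(sorted(messages), start=1)}

before = sequential_ids(["Clear", "Overview", "Wallet"])
after = sequential_ids(["Clear", "Network", "Overview", "Wallet"])

# Strings whose text is unchanged but whose ID moved anyway.
changed = sorted(text for text in before if before[text] != after[text])
print(changed)  # ['Overview', 'Wallet']
```

Adding the single string "Network" shifts every entry that sorts after it, which is the churn described above.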

    Some English texts can appear in multiple places (e.g. “Clear”) with different meanings depending on the Qt context. To avoid collisions, the hash includes the <group resname=...> context.

    Fix

    The translate target already runs contrib/devtools/stabilize_xlf_ids.py. Instead of trying to preserve/renumber _msg<N> IDs by matching old/new units, the script now rewrites every <trans-unit id> to:

    • sha256(resname + "\0" + source-text) (hex)
    • keeping the existing plural [N] suffix for plural forms

    The script also checks for duplicate IDs and aborts if any are detected.
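The ID scheme above can be sketched in a few lines of Python. This is not the code of contrib/devtools/stabilize_xlf_ids.py; the function names, the UTF-8 encoding, the untruncated hex digest, the use of `ValueError` for the abort, and the sample context names are all assumptions made for illustration.

```python
import hashlib

def stable_id(resname, source, plural_suffix=""):
    # Assumption: UTF-8 encoding and the full 64-character hex digest.
    payload = (resname + "\0" + source).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() + plural_suffix

def assign_ids(units):
    """Map each (resname, source, plural_suffix) tuple to its ID, rejecting duplicates."""
    by_id = {}
    for unit in units:
        unit_id = stable_id(*unit)
        if unit_id in by_id:
            raise ValueError(f"duplicate id {unit_id}: {by_id[unit_id]!r} and {unit!r}")
        by_id[unit_id] = unit
    return by_id

# Hypothetical contexts: the same text in two contexts, plus one plural pair.
ids = assign_ids([
    ("QObject", "Clear", ""),
    ("SendCoinsDialog", "Clear", ""),
    ("Intro", "%n GB of space available", "[0]"),
    ("Intro", "%n GB of space available", "[1]"),
])
print(len(ids))  # 4
```

The NUL separator keeps the context and the text from running together, so ("ab", "c") and ("a", "bc") hash differently.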

    Note: this is a one-time mass change from _msg<N> IDs to hashes; after that, rerunning translate on an unchanged tree should not churn IDs.

    With this PR applied, IDs become stable hashes so ordering changes no longer cause renumbering churn.

    Reproducer

    You can test the replacer by changing any translatable text:

    diff --git a/src/qt/bitcoingui.cpp b/src/qt/bitcoingui.cpp
    --- a/src/qt/bitcoingui.cpp	(revision a38b8cf788922174538161cda81ce28f2ac462ec)
    +++ b/src/qt/bitcoingui.cpp	(date 1767310724675)
    @@ -258,7 +258,7 @@
         connect(modalOverlay, &ModalOverlay::triggered, tabGroup, &QActionGroup::setEnabled);
     
         overviewAction = new QAction(platformStyle->SingleColorIcon(":/icons/overview"), tr("&Overview"), this);
    -    overviewAction->setStatusTip(tr("Show general overview of wallet"));
    +    overviewAction->setStatusTip(tr("Show general overview of Wallet"));
         overviewAction->setToolTip(overviewAction->statusTip());
         overviewAction->setCheckable(true);
         overviewAction->setShortcut(QKeySequence(QStringLiteral("Alt+1")));
    

    and regenerating the translations:

    cmake --preset dev-mode -DWITH_USDT=OFF && cmake --build build_dev_mode --target translate
    

    You will notice that the new text moved to a different place with a different hash, but all other messages retained their IDs.
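One way to check this without reading the diff by eye is to compare the set of `<trans-unit>` IDs before and after regeneration. The Python sketch below is only an illustration; `unit_ids` and `id_churn` are hypothetical helpers, and tags are matched by local name so that the XML namespace of the file does not matter.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # Drop any "{namespace}" prefix that ElementTree puts in front of tag names.
    return tag.rsplit("}", 1)[-1]

def unit_ids(root):
    """Collect the id attribute of every <trans-unit> element below root."""
    return {
        el.get("id")
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == "trans-unit"
    }

def id_churn(old_root, new_root):
    """Return (IDs only in the old file, IDs only in the new file), both sorted."""
    old_ids, new_ids = unit_ids(old_root), unit_ids(new_root)
    return sorted(old_ids - new_ids), sorted(new_ids - old_ids)
```

For example, after saving the previous template as `old.xlf` (a hypothetical file name), `id_churn(ET.parse("old.xlf").getroot(), ET.parse("src/qt/locale/bitcoin_en.xlf").getroot())` should list one removed and one added ID for the one-line change above, if the PR behaves as described.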

  2. DrahtBot commented at 9:24 am on August 29, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33270.

    Reviews

    See the guideline for information on the review process.

    Type         Reviewers
    Concept ACK  hebasto

    If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

  3. l0rinc renamed this:
    translations: recreate baseline to simplify testing
    stabilize translations by reverting old ids by text content
    on Aug 29, 2025
  4. l0rinc force-pushed on Aug 29, 2025
  5. DrahtBot added the label CI failed on Aug 29, 2025
  6. DrahtBot commented at 9:27 am on August 29, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/49171015206 LLM reason (✨ experimental): Lint failure: Python style error E401 (multiple imports on one line) in contrib/devtools/stabilize_xlf_ids.py.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  7. maflcko commented at 10:03 am on August 29, 2025: member
    Not sure if this needs a script. Post-freeze changes should be rare enough to just manually take over the affected string without going through any script.
  8. l0rinc force-pushed on Aug 29, 2025
  9. l0rinc marked this as ready for review on Aug 29, 2025
  10. DrahtBot removed the label CI failed on Aug 29, 2025
  11. l0rinc commented at 10:14 pm on August 29, 2025: contributor

    @maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex. @achow101 created a test Bitcoin Core translation sandbox for me where I’ve uploaded the latest bitcoin_en.xlf. Between Core versions most of the translations need to be reassigned: in the example of 29 vs 30, only 10% of the original IDs were kept, so all translations showed most of the strings as having no pairs.

    I have generated a dummy French translation and simulated a review on it:

    Before this PR, changing a single line invalidated 10% of the other translations as well because of the sorting bug described above:

    Uploaded the same file that’s generated with this PR for #33224 against a 100% approved language:

    And since the translation IDs are resurrected, it correctly shows that a single entry needs to be retranslated:

  12. maflcko commented at 1:42 pm on August 30, 2025: member

    @maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

    Ah, I see. I wonder if the shasum of the full content can be used as an id for the translation string. Conceptually it seems simpler to get a stable id, than to try to artificially number and re-number the strings, depending on the history. (Obviously this wouldn’t help with #33224, but this instance should be trivial to handle manually as a one-off.)

  13. hebasto commented at 9:42 am on September 1, 2025: member
    Since we are not using ID-based translations, it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.
  14. l0rinc commented at 6:30 pm on September 1, 2025: contributor

    we are not using ID-based translations

    @hebasto, are you suggesting that the bitcoin-test clone where we tried it isn’t representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we’re not using the ids, why aren’t the translations stable? What alternative do you suggest to stabilize them?

    shasum of the full content can be used as an id for the translation string

    @maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. Clear could mean “delete” or “agree”, based on context.

  15. hebasto commented at 7:44 pm on September 1, 2025: member

    we are not using ID-based translations

    @hebasto, are you suggesting that the bitcoin-test clone where we tried it isn’t representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we’re not using the ids, why aren’t the translations stable?

    We do not use them, but Transifex does.

    What alternative do you suggest to stabilize them?

    As I wrote in #33270 (comment):

    … it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

  16. l0rinc commented at 7:49 pm on September 1, 2025: contributor
    Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead? Any reason we didn’t do that before?
  17. hebasto commented at 7:55 pm on September 1, 2025: member

    Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead?

    That’s my guess, though I haven’t tested it.

    Any reason we didn’t do that before?

    I’m not aware of any specific reason.

  18. l0rinc commented at 8:07 pm on September 1, 2025: contributor

    My understanding is that Translation Memory Fillups are a Growth plan feature only.

    I have tried uploading the English translation without any ids (called keys in Transifex):

    Adding back a single id results in only that value being imported:

  19. achow101 commented at 10:58 pm on September 1, 2025: member
    The id attribute is required by the XLIFF 1.2 spec: https://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#trans-unit
  20. hebasto commented at 10:37 am on September 2, 2025: member

    stabilize translations…

    Concept ACK on this idea.

    Could we avoid reintroducing Python scripts into the translate build target?

  21. l0rinc commented at 5:42 pm on September 2, 2025: contributor

    If we’re worried about Python, it wouldn’t be difficult to turn this into a script that generates an id mapping which we’d run manually, something like:

    sed -E -i \
      -e 's|id="_msg1160"|id="_msg1239"|g' \
      -e 's|id="_msg1159"|id="_msg1160"|g' \
      -e 's|id="_msg120\[0\]"|id="_msg1240\[0\]"|g' \
      -e 's|id="_msg120\[1\]"|id="_msg1240\[1\]"|g' \
      'src/qt/locale/bitcoin_en.xlf'
    

    Would that be more useful in your opinion? What are the worries exactly, can you please point me to the discussion that you were referring to with “reintroducing Python scripts” (or is this just a reference to the cmake migration)? We could of course migrate this to cmake as well, but do we really want to do that :)? I don’t…

  22. hebasto commented at 2:10 pm on September 5, 2025: member

    … can you please point me to the discussion that you were referring to with “reintroducing Python scripts”…

    The Python dependency for the translate build target was recently removed in #33209 by @purpleKarrot.

  23. purpleKarrot commented at 6:12 am on September 6, 2025: contributor

    do we really want to do that :)? I don’t…

    I would do it, but I am concerned about the bus factor, so I would prefer to train someone else. 05255d5d1ec1852d8d8d7683ccbf28351f57b89e is an example for replacing a sed replacement with cmake code.

  24. fanquake commented at 11:00 am on December 3, 2025: member
    What is the status of this?
  25. l0rinc commented at 12:07 pm on December 3, 2025: contributor
    It worked correctly for stabilizing the translations - removals and moves alike. Some reviewers would prefer rewriting the logic from python to cmake, but it should work currently as-is. Reviews and reproducers are welcome.
  26. fanquake commented at 11:20 am on December 5, 2025: member
    Ok. I think @hebasto needs to make a call here then.
  27. maflcko commented at 1:43 pm on December 5, 2025: member

    shasum of the full content can be used as an id for the translation string

    @maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. Clear could mean “delete” or “agree”, based on context.

    In that case, the context should be hashed, too.

    I presumed that the majority of texts were invalidated anyway, so it shouldn’t matter much. But I haven’t confirmed this, and maybe I am wrong.

  28. l0rinc commented at 1:47 pm on December 5, 2025: contributor

    I presumed that the majority of texts were invalidated anyway

    This PR restores the IDs of all unchanged values so the translators should only see the changed entries.

  29. maflcko commented at 2:24 pm on December 5, 2025: member

    I presumed that the majority of texts were invalidated anyway

    This PR restores the IDs of all unchanged values so the translators should only see the changed entries.

    I understand, but my point is that there is no value in trying to keep the exact old IDs.

    Basically, every update invalidates all of them:

    • 5c9513ece92 in 2023 invalidated down to msg 16
    • be419674da6 in 2024 invalidated down to msg 61
    • 656e16aa5e6 in 2025 invalidated down to msg 168
  30. translations: refresh Qt translation template baseline
    Regenerate `src/qt/bitcoinstrings.cpp` and `src/qt/locale/bitcoin_en.{ts,xlf}` from the current source tree to provide a clean baseline for translation template updates.
    bdd7f13b6a
  31. translations: use context+text hashes for XLF ids
    Sequential `_msg<N>` IDs shift when strings move in the sorted message list, creating churn and misidentifying unchanged strings on translation platforms.
    
    Derive `<trans-unit id>` values as sha256(`resname` + "\0" + source text), preserving plural `[N]` suffixes and aborting on duplicate ids.
    a38b8cf788
  32. l0rinc commented at 11:34 pm on January 1, 2026: contributor

    In that case, the context should be hashed, too.

    Updated the script to do that: instead of trying to preserve/renumber sequential _msg<N> IDs by matching old/new XLF entries, the script now rewrites every <trans-unit id> deterministically to sha256(parent-group-resname + "\0" + source-text), keeping the plural [0]/[1] suffixes. This makes IDs independent of message ordering (so insertions/deletions don’t renumber unrelated strings), hashes the Qt context too (so identical text like &amp;Copy from different contexts won’t collide), and the script aborts if it would produce any duplicate IDs or if a trans-unit isn’t under a group[@resname]. This is a lot simpler than the previous solution, but it needs a bootstrapping phase (done in this commit) and a translation readjustment for every other language, since all of the IDs will mismatch now.
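As a rough illustration of the tree walk described here (not the actual script), the Python sketch below carries the nearest enclosing group resname down the XML tree and rewrites each id in place. The function name `rewrite_ids`, the UTF-8 encoding, the `ValueError` aborts, and the assumption that plural units sit in a nested group without their own resname are all illustrative guesses.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

PLURAL_SUFFIX = re.compile(r"\[\d+\]$")  # e.g. the "[0]" in "_msg120[0]"

def local_name(tag):
    # Drop any "{namespace}" prefix that ElementTree puts in front of tag names.
    return tag.rsplit("}", 1)[-1]

def rewrite_ids(element, resname=None, seen=None):
    """Rewrite trans-unit ids in place; return the set of new ids."""
    seen = set() if seen is None else seen
    name = local_name(element.tag)
    if name == "group" and element.get("resname") is not None:
        resname = element.get("resname")  # nearest enclosing context wins
    elif name == "trans-unit":
        if resname is None:
            raise ValueError("trans-unit is not under a group with a resname")
        source = next((c for c in element if local_name(c.tag) == "source"), None)
        if source is None:
            raise ValueError("trans-unit has no <source> child")
        text = "".join(source.itertext())
        new_id = hashlib.sha256((resname + "\0" + text).encode("utf-8")).hexdigest()
        suffix = PLURAL_SUFFIX.search(element.get("id", ""))
        if suffix:
            new_id += suffix.group(0)  # keep the plural form marker
        if new_id in seen:
            raise ValueError(f"duplicate id {new_id}")
        seen.add(new_id)
        element.set("id", new_id)
    for child in element:
        rewrite_ids(child, resname, seen)
    return seen
```

Because both units of a plural pair share the same source text, only the preserved suffix keeps their IDs distinct, which is why the sketch carries it over.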

    Let me know what you think of this approach: @maflcko, @hebasto, @fanquake.

  33. l0rinc force-pushed on Jan 1, 2026

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-01-22 21:13 UTC
