stabilize translations by reverting old ids by text content #33270

pull l0rinc wants to merge 2 commits into bitcoin:master from l0rinc:l0rinc/stabilize-translations changing 4 files +144 −59
  1. l0rinc commented at 9:23 am on August 29, 2025: contributor

    Summary

    Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.

    You can reproduce by checking out #33224 and running:

    0cmake --preset dev-mode -DWITH_USDT=OFF -DENABLE_IPC=OFF && cmake --build build_dev_mode --target translate
    

    This will change many translation IDs:

    0git diff | egrep '^[+-].+trans-unit id=' 
    1-      <trans-unit id="_msg1040">
    2-      <trans-unit id="_msg1041">
    3+      <trans-unit id="_msg1040">
    4...
    5-      <trans-unit id="_msg1160">
    6+      <trans-unit id="_msg1159">
    7+      <trans-unit id="_msg1160">
    

    Details

    As mentioned in #33224 (comment), changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.

    Fix

    The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.

    Reproducer

    This PR has two commits; the first resets the baseline to make testing simpler.

    Deleting a translation

     0diff --git a/src/init.cpp b/src/init.cpp
     1--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
     2+++ b/src/init.cpp	(date 1756457795674)
     3@@ -904,7 +904,6 @@
     4     }
     5
     6     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
     7-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     8     }
     9
    10     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.
    

    correctly deletes only that one

    0 git diff | egrep '^[+-].+trans-unit id='
    1-      <trans-unit id="_msg1039">
    

    Adding a new one:

     0diff --git a/src/init.cpp b/src/init.cpp
     1--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
     2+++ b/src/init.cpp	(date 1756457688152)
     3@@ -905,6 +905,7 @@
     4
     5     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
     6         InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     7+        InitWarning(_("Options2 '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     8     }
     9
    10     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.
    

    keep all other ones intact

    0git diff | egrep '^[+-].+trans-unit id='
    1+      <trans-unit id="_msg1239">
    

    Modifying a single one

     0diff --git a/src/init.cpp b/src/init.cpp
     1--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
     2+++ b/src/init.cpp	(date 1756457951822)
     3@@ -904,7 +904,7 @@
     4     }
     5
     6     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
     7-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     8+        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated and are expected to be removed in a future version."));
     9     }
    10
    11     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.
    

    is marked as a clean delete/add

    0 git diff | egrep '^[+-].+trans-unit id='
    1-      <trans-unit id="_msg1039">
    2+      <trans-unit id="_msg1239">
    

    And as a bonus, only changing the case of a translation

     0diff --git a/src/init.cpp b/src/init.cpp
     1--- a/src/init.cpp	(revision e94ee7674c5b023268fe33df8839b6879a81afab)
     2+++ b/src/init.cpp	(date 1756457080223)
     3@@ -904,7 +904,7 @@
     4     }
     5
     6     if (args.IsArgSet("-datacarriersize") || args.IsArgSet("-datacarrier")) {
     7-        InitWarning(_("Options '-datacarrier' or '-datacarriersize' are set but are marked as deprecated. They will be removed in a future version."));
     8+        InitWarning(_("OPTIONS '-DATACARRIER' OR '-DATACARRIERSIZE' ARE SET BUT ARE MARKED AS DEPRECATED. THEY WILL BE REMOVED IN A FUTURE VERSION."));
     9     }
    10
    11     // We no longer limit the orphanage based on number of transactions but keep the option to warn users who still have it in their config.
    

    results in keeping the message id

    0git diff | egrep '^[+-].+trans-unit id='
    1-      <trans-unit id="_msg1039">
    2+      <trans-unit id="_msg1039">
    
  2. translations: recreate baseline to simplify testing 14cb1f8211
  3. DrahtBot commented at 9:24 am on August 29, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33270.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK hebasto

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. l0rinc renamed this:
    translations: recreate baseline to simplify testing
    stabilize translations by reverting old ids by text content
    on Aug 29, 2025
  5. l0rinc force-pushed on Aug 29, 2025
  6. DrahtBot added the label CI failed on Aug 29, 2025
  7. DrahtBot commented at 9:27 am on August 29, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/49171015206 LLM reason (✨ experimental): Lint failure: Python style error E401 (multiple imports on one line) in contrib/devtools/stabilize_xlf_ids.py.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  8. maflcko commented at 10:03 am on August 29, 2025: member
    Not sure if this needs a script. Post-freeze changes should be rare enough to just manually take over the affected string without going through any script.
  9. stabilize translations by reverting old ids by text content
    Previously, recalculating the translation string templates caused a misalignment: after any changed string, subsequent entries were re-numbered and old values received new “stable” IDs.
    Changing a translatable string’s content moves it within the sorted messages and gives it a new sequential ID. Messages between its old and new positions were then misidentified, making translation work on Transifex and similar platforms harder.
    The translate CMake target now runs a small Python script that restores IDs from the previous translation file. It maps each message by (group resname + source text) to old and new IDs, and performs a single-pass search-and-replace on the newly generated XLF to restore the original IDs. Unchanged strings keep their IDs; genuinely new strings get the next sequential _msg following the very last id.
    291749d9f3
  10. l0rinc force-pushed on Aug 29, 2025
  11. l0rinc marked this as ready for review on Aug 29, 2025
  12. DrahtBot removed the label CI failed on Aug 29, 2025
  13. l0rinc commented at 10:14 pm on August 29, 2025: contributor

    @maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex. @achow101 created a test Bitcoin Core translation sandbox for me where I’ve uploaded the latest bitcoin_en.xlf. Between Core versions most of the translations need to be reassigned (in the example of 29 vs 30 only 10% of the original IDs were kept so the translations all showed that most of the strings don’t have pairs)

    I have generated a dummy French translation and simulated a review on it:

    Before this PR, changing a single line invalidated 10% of the other translations as well because of the sorting bug described above:

    Uploaded the same file that’s generated with this PR for #33224 agains a 100% approved language:

    And since the translation IDs are resurrected, it correctly shows that a single entry needs to be retranslated:

  14. maflcko commented at 1:42 pm on August 30, 2025: member

    @maflcko, this is for every translation update, after this change previously translated values are kept and recognized by Transifex.

    Ah, I see. I wonder if the shasum of the full content can be used as an id for the translation string. Conceptually it seems simpler to get a stable id, than to try to artificially number and re-number the strings, depending on the history. (Obviously this wouldn’t help with #33224, but this instance should be trivial to handle manually as a one-off.)

  15. hebasto commented at 9:42 am on September 1, 2025: member
    Since we are not using ID-based translations, it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.
  16. l0rinc commented at 6:30 pm on September 1, 2025: contributor

    we are not using ID-based translations @hebasto, are suggesting that the bitcoin-test clone where we tried it isn’t representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we’re not using the ids, why aren’t the translations stable? What alternative do you suggest to stabilize them?

    shasum of the full content can be used as an id for the translation string @maflcko we could of course do that, but it would likely invalidate every single text in the next release. We could of course do it step-by-step and only add hashes for the new entries (instead of max-id + 1, as I did in the script here). Note that hashes would prohibit same-text-with-different-translations, e.g. Clear could mean “delete” or “agree”, based on context.

  17. hebasto commented at 7:44 pm on September 1, 2025: member

    we are not using ID-based translations

    @hebasto, are suggesting that the bitcoin-test clone where we tried it isn’t representative? It did seem to me that we managed to reproduce and fix the instability by stabilizing the ids. But if we’re not using the ids, why aren’t the translations stable?

    We do not use them, but Transifex does.

    What alternative do you suggest to stabilize them?

    As I wrote in #33270 (comment):

    … it seems reasonable to consider removing the id attributes from trans-unit elements in the XML file.

  18. l0rinc commented at 7:49 pm on September 1, 2025: contributor
    Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead? Any reason we didn’t do that before?
  19. hebasto commented at 7:55 pm on September 1, 2025: member

    Do I understand you correctly that if we remove the IDs, Transifex will identify messages by content instead?

    That’s my guess, though I haven’t tested it.

    Any reason we didn’t do that before?

    I’m not aware of any specific reason.

  20. l0rinc commented at 8:07 pm on September 1, 2025: contributor

    My understanding is that Translation Memory Fillups are Growth plan feature only.

    I have tried uploading the English translation without any ids (called keys in Transifex)

    adding back a single id results in only that value being imported:

  21. achow101 commented at 10:58 pm on September 1, 2025: member
    The id attribute is required by the XLIFF 1.2 spec: https://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#trans-unit
  22. hebasto commented at 10:37 am on September 2, 2025: member

    stabilize translations…

    Concept ACK on this idea.

    Could we avoid reintroducing Python scripts into the translate build target?


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-02 12:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me