#1650 Investigate why Transifex is not exporting all msgstr's we push to it
Closed: Fixed None Opened 8 years ago by jdennis.

If we push a language po file to Transifex containing n msgstr's (a msgstr is the translation of a msgid, i.e. the English string in our source code) and then pull the same language po from Transifex we sometimes receive fewer than n msgstr's back (in other words we're losing translations). Sometimes the msgstr loss is significant, on the order of 30% of the translations. We need to figure out what is going on, document it and provide a fix if possible.

Note: Adding this as a task because I've already spent well over a day investigating this problem and we need to track where time is being spent.


Executive Summary:

We were computing the number of translations in a po file
incorrectly. This led to the erroneous conclusion we were losing
translations when importing a po file from Transifex.

Suggested Fix:

1) Utilize msgfmt instead of msgcmp to compute po file statistics.

2) Prevent msgmerge from incorporating fuzzy translations for new
msgid's.

Detailed Analysis:

A msgid is the raw i18n string extracted from our source code and
provided to a translator to translate into a specific language.

A msgstr is the translation of a msgid into a specific language.

A po file is a set of <msgid, msgstr> pairs which provides the mapping
from a string in the source code to a language specific translated
string. There is one po file per national language.

A pot file is the collection of all msgid's in a translation
catalog. It serves as a "template" for po files (hence 'po' + 't' for
template). pot files are created and updated by extracting i18n marked
strings in the source code (typically with the tool xgettext). As the
source code base evolves the set of i18n strings changes. The pot file
is regenerated periodically so it has the current set of i18n strings
in the pot file as msgid's.

When the pot file is updated each language po file must also be
updated so the msgid's in the po file match those in the pot file. New
msgid's in the pot file are added to the po file and msgid's in the po
file which are no longer in the pot file are removed. A translator
must subsequently add new translations for the new msgid's the new pot
file introduced. The process of updating a po file to match a pot file
is called "message merge" and is typically done via the msgmerge tool.
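
As a rough illustration of the merge step just described, here is a
minimal sketch in Python. It is not how msgmerge is implemented; the
function name merge_catalog and the sample strings are made up for
the example, and it deliberately leaves out everything else msgmerge
does.

```python
def merge_catalog(pot_msgids, po_entries):
    """Return a new msgid -> msgstr mapping whose keys match the pot file.

    pot_msgids: msgid's currently in the pot file (hypothetical input).
    po_entries: existing msgid -> msgstr mapping from the po file.
    """
    merged = {}
    for msgid in pot_msgids:
        # Keep an existing translation; a new msgid starts out untranslated.
        merged[msgid] = po_entries.get(msgid, "")
    # msgid's that are only in po_entries are dropped, because they are
    # never copied into the merged mapping.
    return merged


pot = ["Permissions", "Add Permission", "Certificate"]
po = {"Permissions": "<translation A>", "Provisioning": "<translation B>"}
merged = merge_catalog(pot, po)
```

Here "Provisioning" disappears from the result, while "Add Permission"
and "Certificate" appear with an empty msgstr waiting for a translator.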

The messages in a po file may have optional flags attached to them which
help the translator or the automated i18n tool chain.

One such flag is called "fuzzy" and is meant to indicate the msgstr
translation of the msgid is not 100% accurate and needs review by a
translator. Fuzzy flags may be inserted by a translator or by the i18n
tool chain.

msgmerge tries to make the translator's life easier by trying to
recognize new msgid's which are close to previous msgid's and copying
the msgstr translation associated with the closest match msgid. These
msgstr's are marked with the fuzzy flag because they are not 100%
correct but hopefully provide the translator with a good starting
point that she may need only to tweak. It would be much more work for
a translator if any edit whatsoever to a msgid caused the existing
translation (msgstr) to be thrown away forcing the translator to
re-translate the msgid from scratch again.
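
To make the idea of a "closest match" concrete, here is a small
sketch using Python's difflib. This is a stand-in technique only:
msgmerge has its own string comparison and will not necessarily pick
the same matches. The function name fuzzy_suggestion, the 0.6 cutoff
and the placeholder translations are assumptions made for the
example.

```python
import difflib

# Existing msgid -> msgstr pairs (placeholder translations).
old_translations = {
    "Permissions": "<translation of Permissions>",
    "Certificate": "<translation of Certificate>",
}


def fuzzy_suggestion(new_msgid, old_translations, cutoff=0.6):
    """Return (closest old msgid, its msgstr), or None if nothing is close."""
    matches = difflib.get_close_matches(
        new_msgid, list(old_translations), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    closest = matches[0]
    # The old translation is only a starting point; it would be flagged
    # fuzzy rather than treated as a valid translation.
    return closest, old_translations[closest]


suggestion = fuzzy_suggestion("Add Permission", old_translations)
no_suggestion = fuzzy_suggestion("zzzz", old_translations)
```

With these inputs "Add Permission" is paired with the translation of
"Permissions", which is close in spelling but not in meaning.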

Any msgstr marked as fuzzy is NOT considered a valid translation. Only
valid msgstr's are presented to the end user. Fuzzy msgstr's are for a
translator's benefit only.
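
The sketch below shows what the fuzzy flag looks like in a po file
(a "#, fuzzy" comment line above the entry) and how entries fall into
three groups. It is a simplified reader written for this example, not
a full po parser: it assumes entries are separated by blank lines and
ignores the header entry, plural forms and obsolete entries. The
sample po text and the function names are made up.

```python
import re

SAMPLE_PO = '''\
msgid "Permissions"
msgstr "<translation of Permissions>"

#, fuzzy
msgid "Add Permission"
msgstr "<translation of Permissions>"

msgid "Provisioning"
msgstr ""
'''


def msgstr_text(lines):
    """Return the msgstr of one entry, joining any continuation lines."""
    parts = []
    in_msgstr = False
    for line in lines:
        if line.startswith("msgstr "):
            in_msgstr = True
            parts.append(line[len("msgstr "):].strip())
        elif in_msgstr and line.startswith('"'):
            parts.append(line.strip())
        else:
            in_msgstr = False
    # Each part is a quoted string; drop the surrounding quotes.
    return "".join(part[1:-1] for part in parts)


def count_entries(po_text):
    """Classify each entry as translated, fuzzy or untranslated."""
    counts = {"translated": 0, "fuzzy": 0, "untranslated": 0}
    for block in re.split(r"\n\s*\n", po_text.strip()):
        lines = block.splitlines()
        if not msgstr_text(lines):
            counts["untranslated"] += 1
        elif any(l.startswith("#,") and "fuzzy" in l for l in lines):
            counts["fuzzy"] += 1
        else:
            counts["translated"] += 1
    return counts


counts = count_entries(SAMPLE_PO)
```

Only the "translated" group would ever be shown to an end user; the
fuzzy entry has a msgstr but does not count as a valid translation.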

We have a make target called 'msg-stats' which computes statistics
concerning our i18n translations. It is used to see what percentage of
a language is translated and to generally validate the state of our
translation files.

The precise cause of our problem was that we were counting the number of
msgstr's in a po file and comparing that to the count of msgid's in
the pot file to compute how much of a language had been translated. In
other words if the pot file had 10 msgid's and a po file had 8
msgstr's that would indicate 80% of the strings in that language had
been translated. But if a msgstr has the fuzzy flag associated with it
it's not a valid translation and does not count. The fix was to
replace our use of msgcmp with msgfmt --statistics which provides a
count of translations (msgstr's without the fuzzy flag), a count of
fuzzy translations, and a count of msgid's without a msgstr
(i.e. untranslated strings). The original logic we were using was
copied from another project and was presumed to be correct; however, it
wasn't.
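
A minimal sketch of how the three counts from msgfmt --statistics
could be turned into a percentage is shown below. This is not the
code from the patch. It assumes the English (LC_ALL=C) summary line
that msgfmt prints on stderr, in which the fuzzy and untranslated
parts are left out when they are zero; the function names and the
sample line are made up for the example.

```python
import os
import re
import subprocess


def parse_msgfmt_statistics(text):
    """Extract the three counts from a msgfmt --statistics summary line."""
    def grab(pattern):
        match = re.search(pattern, text)
        return int(match.group(1)) if match else 0  # absent part means 0

    return {
        "translated": grab(r"(\d+) translated message"),
        "fuzzy": grab(r"(\d+) fuzzy translation"),
        "untranslated": grab(r"(\d+) untranslated message"),
    }


def po_statistics(po_path):
    """Run msgfmt on po_path and return its counts (needs gettext installed)."""
    env = dict(os.environ, LC_ALL="C")  # assumption: force English output
    result = subprocess.run(
        ["msgfmt", "--statistics", "-o", os.devnull, po_path],
        capture_output=True, text=True, env=env, check=True,
    )
    return parse_msgfmt_statistics(result.stderr)


def percent_translated(stats):
    """Percentage of messages with a valid (non-fuzzy) translation."""
    total = sum(stats.values())
    return 100.0 * stats["translated"] / total if total else 0.0


sample = "8 translated messages, 1 fuzzy translation, 1 untranslated message."
stats = parse_msgfmt_statistics(sample)
percent = percent_translated(stats)
```

For the sample line this gives 80%, whereas counting the fuzzy entry
as a translation (the mistake described above) would have reported
90%.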

When we pull a po file from Transifex fuzzy translations are never
included because they are not valid translations. Transifex does keep
the fuzzy msgstr's as "suggestions" to the translator.

The confusion arose when we pulled a po file from Transifex and the
number of translations appeared to decrease from what we had in our
copy of the po (presuming our copy of the po contained fuzzy
translations) because we were incorrectly counting fuzzy
translations. This implied there was some type of data loss, when in
fact there wasn't.

How good are fuzzy suggestions?

During the process of trying to figure out how different versions of a
po file differed, and why, I wrote a tool to analyze pot and po
files. One of the things the tool allowed me to do was see how
msgmerge picked "close" strings to use as fuzzy suggestions. In some
cases it worked very well, however in a large number of cases the
suggestion was significantly incorrect and if a translator was not
careful they might accept a suggestion which was inaccurate. Below are
some examples from our own code base.

The string:
"Permissions"
Was suggested for the strings:
"Add Permission"
"Provisioning"
"Permission Type"
"Permission name"
"Self Service Permissions"

The string:
"Default users group"
Was suggested for the strings:
"Failed users/groups"
"Default user objectclasses"
"Don't create user private group"

The string:
"Member service groups"
Was suggested for the strings:
"Member HBAC service groups"
"Indirect Member HBAC service group"
"Member user group"

The string:
"Password used in bulk enrollment"
Was suggested for the string:
"Generate a random password to be used in bulk enrollment"

The string:
"Certificate"
Was suggested for the strings:
"Certificate Hold"
"New Certificate"
"Certificate Revoked"
"No Valid Certificate"
"Host Certificate"
"Service Certificate"

The string:
"External host"
Was suggested for the strings:
"External"
"External User"
"RunAs External User"
"RunAs External Group"

The string:
"Added sudo rule "%(value)s""
Was suggested for the strings:
"Changed password for "%(value)s""
"Added HBAC rule "%(value)s""

The string:
"Modified service "%(value)s""
Was suggested for the strings:
"Modified privilege "%(value)s""
"Modified HBAC service "%(value)s""
"Modified privilege "%(value)s""
"Modified selfservice "%(value)s""

The string:
"type, filter, subtree and targetgroup are mutually exclusive"
Was suggested for the string:
"filter and memberof are mutually exclusive"

As you can see from the above examples the suggested string can be
significantly different in content, meaning and intent from the actual
string. A translator would have to be alert when being presented with
suggestions to recognize the sometimes subtle but critical distinction
between the actual string and the suggestion.

My concern is with human nature. When a translator is presented
with hundreds of strings to translate and a large proportion of those
have inaccurate suggestions which can just be "clicked through" and
accepted it seems to me there is a high probability of introducing
inaccurate translations. I think the quality of our translations would
be better if we didn't provide suggestions which can be clicked through
and accepted; instead the translator would have to type the new
translation in from scratch. Yes, this would mean more work for the
translator but it doesn't seem terribly onerous either and could
result in a much higher quality translation. My suggestion would be to
turn off the automatic generation of suggestions during the message
merge phase (i.e. fuzzy msgstr's). Comments?
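
For reference, GNU msgmerge has a --no-fuzzy-matching option which
turns off the close-match copying. The sketch below only builds the
command line; the helper name msgmerge_cmd and the file names de.po
and ipa.pot are hypothetical and are not taken from our build.

```python
def msgmerge_cmd(po_path, pot_path, fuzzy=False):
    """Build a msgmerge command that updates po_path from pot_path."""
    cmd = ["msgmerge", "--update"]
    if not fuzzy:
        # Do not copy translations from "close" msgid's.
        cmd.append("--no-fuzzy-matching")
    cmd += [po_path, pot_path]
    return cmd


cmd = msgmerge_cmd("de.po", "ipa.pot")
```

The resulting list could be passed to subprocess.run(cmd, check=True)
on a machine with gettext installed.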

Seems reasonable, though I would suggest you post this analysis to the Transifex devel/user list and see what suggestions they have based on similar experience with other projects. I doubt we are the first to get "fuzzy" on the matter.

suggested patch submitted as:

[PATCH 39/39] ticket 1650 - compute accurate translation statistics

Metadata Update from @jdennis:
- Issue assigned to jdennis
- Issue set to the milestone: FreeIPA 2.1.1 (bug fixing)

2 years ago
