Global statistics on translation levels of Fedora products
dnf install translate-toolkit podman
mkdir -p ./src.rpms/f30/ ./results/f30/
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
This step is manual for now; I took the list of DNF packages from Koji:
Run ./download-f%%-srpm-in-container.sh, where %% is the Fedora version (30 or 31).
The download runs inside a container, so stats can be produced even from a Fedora 29 host. It amounts to about 7 GB for Fedora 30 and takes some time.
The result will be in multiple files inside the results folder.
Applies data cleanups and enhancements (CLDR name).
Aggregates the data per language, then applies it to territories (using the CLDR statistics on languages per territory).
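The aggregation step can be pictured with a small standard-library sketch. This is only one plausible reading of "per language, then per territory", not the project's actual code: the function names, the tuple layouts, and the choice to weight each language by its number of speakers are all assumptions made for illustration.

```python
from collections import defaultdict


def progress_per_language(rows):
    """Sum translated and total messages per language, return the ratio.

    `rows` is a hypothetical iterable of
    (language, translated_messages, total_messages) tuples.
    """
    translated = defaultdict(int)
    total = defaultdict(int)
    for language, done, count in rows:
        translated[language] += done
        total[language] += count
    # Languages with no messages at all are skipped to avoid dividing by zero.
    return {lang: translated[lang] / total[lang] for lang in total if total[lang] > 0}


def coverage_per_territory(progress, speakers):
    """Weight each language's progress by its speakers in a territory.

    `speakers` is a hypothetical iterable of
    (territory, language, population_speaking) tuples, standing in for
    the CLDR language-per-territory data. Languages without any
    translation progress count as 0.
    """
    covered = defaultdict(float)
    population = defaultdict(float)
    for territory, language, people in speakers:
        population[territory] += people
        covered[territory] += people * progress.get(language, 0.0)
    return {t: covered[t] / population[t] for t in population if population[t] > 0}


# Made-up figures, for illustration only.
rows = [("fr", 50, 100), ("fr", 25, 100), ("de", 100, 100)]
speakers = [("BE", "fr", 4.0), ("BE", "de", 1.0), ("BE", "nl", 5.0)]

progress = progress_per_language(rows)
coverage = coverage_per_territory(progress, speakers)
```

Note that summing speakers per territory counts multilingual people more than once; a real calculation would have to decide how to handle that.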
0.error.language not in cldr.csv: contains unknown languages (lines are removed)
0.error.languages is numeric.csv: contains numeric languages (lines are removed)
0.error.lang with point.csv: contains languages such as ".cp936" or ".big5" (lines are removed)
0.error.len(language).csv: contains languages with more than three characters (lines are removed)
0.error.len(territory).csv: contains territories with more than two characters (lines are removed)
0.error.no population for this language-territory couple.csv: contains the list of language-territory couples for which no language statistics exist (no impact on results)
1.debug.lang.csv: all lang (language + script + territory) values for debug (no impact on results)
1.debug.language.csv: all language values for debug (no impact on results)
1.debug.script.csv: all script values for debug (no impact on results)
1.debug.territory.csv: all territory values for debug (no impact on results)
1.debug.total message = 0.csv: all lang values for debug (lines are removed)
3.result.csv: full results per package with source filename and standardized language code, script code and territory code
4.0.cldr.csv: language per territory as provided by CLDR
4.1.results_per_language.csv: message and words progress percentages per language
4.1.results_per_language_ISO3.csv: message and words progress percentages per language merged with "country code" database using ISO3166-1-Alpha-2 code
4.2.cldr_and_results_full.csv: language per territory as provided by CLDR merged with message and words progress percentages per language
4.3.cldr_and_results_grouped.csv: aggregation per territory of 4.2.cldr_and_results_full.csv; provides the territory, the number of languages, the population, and the messages and words coverage.
4.4.world_stats.csv: merge of the results of 4.3.cldr_and_results_grouped.csv with the country database and geojson data.
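The removal rules implied by the 0.error.* files can be summarized in a short sketch. It is an illustration rather than the project's implementation: the function name, its arguments, and the order in which the checks run are assumptions; only the rules themselves and the labels (taken from the file names) come from the list above.

```python
def rejection_reason(language, territory, cldr_languages):
    """Return the 0.error.* label that would reject a row, or None to keep it.

    `cldr_languages` is a hypothetical set of language codes known to CLDR.
    The order of the checks is an assumption.
    """
    if "." in language:
        return "lang with point"        # e.g. ".cp936", ".big5"
    if language.isdigit():
        return "languages is numeric"
    if len(language) > 3:
        return "len(language)"
    if territory and len(territory) > 2:
        return "len(territory)"
    if language not in cldr_languages:
        return "language not in cldr"
    return None


# Made-up sample values, for illustration only.
known = {"fr", "de", "pt"}
samples = [("fr", "FR"), (".cp936", ""), ("123", ""), ("french", ""), ("pt", "BRA"), ("xx", "")]
reasons = [rejection_reason(lang, terr, known) for lang, terr in samples]
```

Rows for which the function returns a label would go to the matching error file and be dropped from the results; rows returning None would be kept.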
Data in the CLDR-raw folder comes from https://github.com/unicode-org/cldr/blob/master/common/main/en.xml
automatic calculation (group by territory + spoken percentage * spoken )
create stats: number of countries with official language > 50% and related population
create stats: number of languages impacting more than one official language
AppData and Zanata statistics: https://github.com/Jibec/fedora-translation-statistics
Transtats: https://transtats.fedoraproject.org/releases/