#20 Localization (thus non-vital) files of packages can take up to ~2/3 of the delivered bits
Opened 3 years ago by jpokorny. Modified 3 years ago

Hello,

I haven't seen this topic in a cursory search, hence raising it here,
demonstrated with one particularly common package, findutils:

$ rpm -qi findutils | grep '^\(Version\|Release\|Size\)'
Version     : 4.7.0
Release     : 4.fc33
Size        : 1808667
$ rpm -ql findutils | { while read f; do test -d "$f" || echo "$f"; done; } | xargs du -b | tee >(cut -d "$(printf '\t')" -f1 | paste -s -d+ - | bc)
328944  /usr/bin/find
80072   /usr/bin/xargs
25  /usr/lib/.build-id/43/34dd190460c23206f77d2e79168f0164cb5fdf
24  /usr/lib/.build-id/63/c7dd8dd86c5bae0f774c341fe9323bd3a9713c
1375    /usr/share/doc/findutils/AUTHORS
83731   /usr/share/doc/findutils/NEWS
4539    /usr/share/doc/findutils/README
1539    /usr/share/doc/findutils/THANKS
2860    /usr/share/doc/findutils/TODO
24251   /usr/share/info/find-maint.info.gz
89616   /usr/share/info/find.info-1.gz
1878    /usr/share/info/find.info-2.gz
2478    /usr/share/info/find.info.gz
35149   /usr/share/licenses/findutils/COPYING
2343    /usr/share/locale/be/LC_MESSAGES/findutils.mo
48466   /usr/share/locale/bg/LC_MESSAGES/findutils.mo
7982    /usr/share/locale/ca/LC_MESSAGES/findutils.mo
36184   /usr/share/locale/cs/LC_MESSAGES/findutils.mo
34612   /usr/share/locale/da/LC_MESSAGES/findutils.mo
36905   /usr/share/locale/de/LC_MESSAGES/findutils.mo
44457   /usr/share/locale/el/LC_MESSAGES/findutils.mo
34447   /usr/share/locale/eo/LC_MESSAGES/findutils.mo
24941   /usr/share/locale/es/LC_MESSAGES/findutils.mo
33712   /usr/share/locale/et/LC_MESSAGES/findutils.mo
36236   /usr/share/locale/fi/LC_MESSAGES/findutils.mo
37042   /usr/share/locale/fr/LC_MESSAGES/findutils.mo
20984   /usr/share/locale/ga/LC_MESSAGES/findutils.mo
24078   /usr/share/locale/gl/LC_MESSAGES/findutils.mo
35520   /usr/share/locale/hr/LC_MESSAGES/findutils.mo
37131   /usr/share/locale/hu/LC_MESSAGES/findutils.mo
20287   /usr/share/locale/id/LC_MESSAGES/findutils.mo
33636   /usr/share/locale/it/LC_MESSAGES/findutils.mo
28336   /usr/share/locale/ja/LC_MESSAGES/findutils.mo
1916    /usr/share/locale/ko/LC_MESSAGES/findutils.mo
2663    /usr/share/locale/lg/LC_MESSAGES/findutils.mo
6271    /usr/share/locale/lt/LC_MESSAGES/findutils.mo
1514    /usr/share/locale/ms/LC_MESSAGES/findutils.mo
34789   /usr/share/locale/nb/LC_MESSAGES/findutils.mo
35503   /usr/share/locale/nl/LC_MESSAGES/findutils.mo
35962   /usr/share/locale/pl/LC_MESSAGES/findutils.mo
35253   /usr/share/locale/pt/LC_MESSAGES/findutils.mo
36212   /usr/share/locale/pt_BR/LC_MESSAGES/findutils.mo
6589    /usr/share/locale/ro/LC_MESSAGES/findutils.mo
46244   /usr/share/locale/ru/LC_MESSAGES/findutils.mo
24148   /usr/share/locale/sk/LC_MESSAGES/findutils.mo
35181   /usr/share/locale/sl/LC_MESSAGES/findutils.mo
46489   /usr/share/locale/sr/LC_MESSAGES/findutils.mo
34848   /usr/share/locale/sv/LC_MESSAGES/findutils.mo
33280   /usr/share/locale/tr/LC_MESSAGES/findutils.mo
46292   /usr/share/locale/uk/LC_MESSAGES/findutils.mo
38059   /usr/share/locale/vi/LC_MESSAGES/findutils.mo
32873   /usr/share/locale/zh_CN/LC_MESSAGES/findutils.mo
13436   /usr/share/locale/zh_TW/LC_MESSAGES/findutils.mo
21948   /usr/share/man/man1/find.1.gz
5466    /usr/share/man/man1/xargs.1.gz
1808716

(note regarding the 1808667 vs. 1808716 discrepancy: the difference is 49 bytes,
i.e. 25 + 24, so it seems attributable to .build-id; EDIT: filed a bug)

We can easily see that, barring a find/xargs split, the must-have portion
is: 328944 + 80072 + 25 + 24 = 409065, or ~23% of the whole package.

The rest is:

  • documentation (droppable, perhaps except for %license;
    see also rpm --nodocs):
    1375+83731+4539+1539+2860+24251+89616+1878+2478+35149+21948+5466
    = 274830, or ~15%

  • localization:
    the rest, i.e. 1124821, or ~62%
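
The split above can be reproduced mechanically. What follows is only a sketch: it assumes input in the "size<TAB>path" form that du -b prints, and it classifies files purely by path prefix, which is an illustrative assumption (the function name size_breakdown is hypothetical, too).

```shell
# Classify "size<TAB>path" lines (as printed by du -b) into three buckets
# and print the byte total and percentage share of each bucket.
# The path prefixes used for classification are an assumption for illustration.
size_breakdown() {
    awk -F '\t' '
        $2 ~ /^\/usr\/share\/locale\//                  { sum["locale"] += $1; next }
        $2 ~ /^\/usr\/share\/(doc|info|man|licenses)\// { sum["docs"]   += $1; next }
                                                        { sum["core"]   += $1 }
        END {
            total = sum["core"] + sum["docs"] + sum["locale"]
            if (total == 0) exit
            n = split("core docs locale", order, " ")
            for (i = 1; i <= n; i++)
                printf "%-7s %9d %5.1f%%\n", order[i], sum[order[i]], 100 * sum[order[i]] / total
        }'
}
```

Appending "| size_breakdown" to the "xargs du -b" stage of the pipeline shown earlier should yield one line per bucket; everything outside the listed /usr/share prefixes (including the .build-id links) lands in "core", matching the accounting above.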

At least for quick containers, mockbuilds, etc., only about 1/4 of the packaged
bits are useful. Would there be room for improvement regarding minimization?


Something like rpm --locale-filter=CMD, perhaps?

Btw. this "content demultiplexing" is what I had in mind that would
nicely combine with cleverly chunked RPMs:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/EQ2UDRE6NA7IUC7IA7VZMEHIUJQ7H2K6/

You can already set the %_install_langs rpm macro to install only the relevant files. This is common in container builds. Admittedly, this does not save download bandwidth, only disk space.
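
For illustration, a minimal sketch of setting the macro persistently. It assumes the usual colon-separated language list and relies on rpm reading /etc/rpm/macros.* files; the function name set_install_langs and the file name macros.install-langs are hypothetical.

```shell
# Write an rpm macro file that restricts installed locale files to the
# given colon-separated language list, e.g. "en:de".
# $1: language list
# $2: macro directory (defaults to /etc/rpm, which needs root to write)
# The file name "macros.install-langs" is a hypothetical choice.
set_install_langs() {
    langs=$1
    dir=${2:-/etc/rpm}
    printf '%%_install_langs %s\n' "$langs" > "$dir/macros.install-langs"
}
```

For a one-off installation, the same macro can presumably be passed on the command line instead, e.g. rpm --define '_install_langs en:de' -i some-package.rpm (the package file name is a placeholder).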

Oh, thanks, so that's something thought of, perfect!
Just the interface towards users is rather buried.

Something like rpm --locale-filter=CMD, perhaps?

This idea, besides being eclipsed by %_install_langs as mentioned, is mostly superseded
by the existing --excludepath option that I missed originally (the idea was to "functionize"
which language identifiers to allow, where CMD could be something like:

cut -z -c1-1024 /etc/locale.conf /home/*/.config/locale.conf | xargs -0 -I{} sh -x -c "echo '{}' | sed -nE 's|^LANG=([\"]?)([[:alnum:]]+)[[:alnum:]._-]*\1|\2|p'"

). The problem with the exclusion approach is that it's harder to work with than
a list of the desired languages -- but making %_install_langs trigger something
like the above command would be doable, nonetheless.
Note: the command would certainly need more hardening.
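
A somewhat more defensive variant of that idea could look as follows. This is a sketch only, not a vetted implementation: the function name langs_from_conf is hypothetical, and it deliberately looks at LANG= lines alone, like the command above.

```shell
# Print a colon-separated, de-duplicated list of language codes taken from
# the LANG= lines of the given locale.conf-style files.
# Unreadable or missing files (including unmatched globs) are skipped.
# Nothing from the files is ever passed to a shell for evaluation.
langs_from_conf() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            cat -- "$f"
        fi
    done |
    sed -nE 's|^LANG="?([[:alpha:]]+)[[:alnum:]._@-]*"?[[:space:]]*$|\1|p' |
    sort -u |
    paste -s -d: -
}
```

The result is already in the colon-separated form that %_install_langs takes, so it could be fed to rpm along the lines of rpm --define "_install_langs $(langs_from_conf /etc/locale.conf /home/*/.config/locale.conf)" -i some-package.rpm, where the package file name is a placeholder.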

Side note: .build-id links will be explicitly avoidable at install time with
the --excludeartifacts option to rpm.
