#11631 Weird compression problem with Fedora-Rawhide-20231118.n.0 compose
Closed: Fixed a month ago by kevin. Opened 5 months ago by james.

In the latest composes none of the primary repodata files are in gzip format. Eg.

zcat Fedora-Rawhide-20231118.n.0/compose/Server/x86_64/os/repodata/50c50136ab16b1829b491887669d34e65ca3194f7a1ded72ebcb8cbd2aa5818d-primary.xml.zst
gzip: Fedora-Rawhide-20231118.n.0/compose/Server/x86_64/os/repodata/50c50136ab16b1829b491887669d34e65ca3194f7a1ded72ebcb8cbd2aa5818d-primary.xml.zst: not in gzip format

Or from zlib, you get:

zlib.error: Error -3 while decompressing data: incorrect header check

This happens for all of the arches.

I guess it will probably fix itself tomorrow? But maybe not, and would be worth knowing what happened.
Maybe related to the fact that the endings are now .xml.zst instead of .xml.gz ... which would imply it won't fix itself.


Yeah, this is because I updated compose-rawhide01 to f39 and with that createrepo_c 1.0 came in... and it changes defaults.

See: https://pagure.io/releng/issue/11664

I guess I can downgrade it? Or we could adjust things in pungi-fedora (where we pass args to createrepo_c in pungi config).

Or does it cause any issues? I guess it's just unexpected/unplanned...

CC: @mattia

I assume compose-rawhide01 and the template in pungi-fedora is only used for Rawhide compose... if that's true, I think there's no need to downgrade to createrepo_c < 1.0, we only need to adjust the template.
From https://docs.pagure.org/pungi/configuration.html#createrepo-settings I think we need to add
createrepo_extra_args = ['--general-compress-type', 'gzip']
(or it's just --compress-type?...)

But if compose-rawhide01 is also used for something related to EPEL7, I think it's safer to downgrade it.

BTW, maybe we can also disable createrepo_database, since it's only useful for yum based repositories? But that's another thing that can be discussed by releng.

Seems very weird to change the default compression type with no data (that I can see) in the repomd.xml to tell any users the default isn't gzip anymore. I mean there was a std. way to add newer compression types and be compatible ... but what do I know.

I guess latest dnf handles this fine? And that's all we care about?

Personally I would have assumed createrepo_c is buggy and would adjust the template as an interim solution ... but maybe this is all intentional and any non-dnf code that looks at repodata needs to update, so rawhide is kind of doing its job and we just need to care about non-rawhide repos?

FWIW I found out because I was running some non-dnf code to look at repodata ... but I'm not exactly normal.

Yeah, I am not sure what to do here off hand. dnf and dnf5 seem to both handle things fine, but might well be a surprise to folks like yourself using non dnf code.

CC: @ppisar @jkolarik

Hello, this is a consequence of the planned system-wide change, which was also reported in the releng. Was there a gap in how we communicated these changes? Or maybe, it might be better to file a ticket against the dependent components to ensure they are notified about such changes?

It mostly highlights one of the problems I've seen with Fedora change requests, where a significant backwards compatibility break is hidden among other changes. This is especially bad now that there are a lot of change requests for each release, and most can be ignored.

There are four "changes" in this system wide change:

  • Stop generating metadata in sqlite database format by default

Sqlite was an addon to the original format, was an optimization, and everything looking at repodata copes with it not being there (lots of repos didn't produce it).

  • We propose to include just one variant of groups.xml using specified compression

This is a minor compatibility problem, if you don't need group data you won't even notice and I think even rhel-7 yum can be configured to ignore group data (not checked). Obvious workaround is to pretend groupdata is empty and/or use the compressed one (the big problem being if you don't have zstd support).

It does break the original format, so it is worth letting people know ... but lots of workarounds.

  • increase major version above 0

I doubt anyone would notice.

  • Switch default compression from gz to zstd.

Breaks the format and probably breaks any non-DNF repodata code. Very hard to workaround if you don't have zstd decompression access (no python-zstandard in rhel8, but it's in epel8 ... and it's not even in epel7).

On first reading I would have assumed this was backwards compatible by adding primary_zstd/etc. fields ... you do mention it breaks everything in the compatibility section, but it might not be obvious it's not just a dnf problem.

Also, createrepo_c's compression doesn't fill in the content size (the docs' "frame header of the compressed data does not contain the content size" case), so you have to pass an arbitrary max_output_size to the Python API or you get:

data = zstandard.decompress(gzdata)

  File "/usr/lib64/python3.11/site-packages/zstandard/__init__.py", line 210, in decompress
    return dctx.decompress(data, max_output_size=max_output_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
zstd.ZstdError: could not determine content size in frame header

See the 7th paragraph of: https://python-zstandard.readthedocs.io/en/latest/decompressor.html#zstandard.ZstdDecompressor.decompress
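
For anyone hitting the same traceback, here is a minimal sketch of one possible workaround, assuming python-zstandard is installed. The streaming decompressobj() API should not need the content size from the frame header, so no guessed output size is required. The helper name decompress_zstd_unknown_size is made up for illustration.

```python
def decompress_zstd_unknown_size(data: bytes) -> bytes:
    """Decompress a zstd frame whose header omits the content size."""
    # Imported lazily so this module still loads on hosts where the
    # third-party python-zstandard package is not installed.
    import zstandard

    dctx = zstandard.ZstdDecompressor()
    # decompressobj() streams the frame, so it does not need to know the
    # output size up front the way one-shot dctx.decompress(data) does.
    dobj = dctx.decompressobj()
    return dobj.decompress(data)
```

This keeps the whole result in memory, which is probably fine for a primary.xml but may not be what you want for very large files.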

Thanks for the feedback. I acknowledge that there could've been more effort to inform users about changes in default values, such as displaying a warning message when running createrepo_c. Additionally, the scope section in the proposal could have explicitly guided other developers on how to handle upcoming changes or provided workarounds. Going forward, we'll work on improving communication to prevent similar situations in the future.

Regarding the content size issue, could you please file a bug report in the upstream repository along with a reproducer?

no data (that I can see) in the repomd.xml to tell any users the default isn't gzip anymore.

This is how the repomd.xml format works: the compression type is defined by the file name extension. If there is no extension, one has to detect it from the content of the file.
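
A rough sketch of the "detect it from the content" case, using the well-known leading magic bytes of each format. The function name sniff_compression and the set of formats covered are assumptions for illustration, not what any particular client does.

```python
# Leading "magic" bytes of compression formats commonly seen in repodata.
MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"BZh", "bzip2"),
)


def sniff_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    # No known magic: possibly plain, uncompressed XML.
    return "unknown"
```

Reading the first six bytes of a metadata file and passing them to sniff_compression() is enough to cover every entry in the table.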

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

5 months ago

This is how the repomd.xml format works: the compression type is defined by the file name extension. If there is no extension, one has to detect it from the content of the file.

Why would you think that?

If that was true why would fedora/rh have added group_gz ... why would other people/implementations have added primary_lzma, primary_7zip or primary_zck?

Here is the oldest official discussion I can easily find, where everyone seems in agreement that you can't just change the compression type because that'd break everything:

https://rpm-metadata.dulug.duke.narkive.com/VyxNPMsP/adding-different-compression-types-to-createrepo

...but even ignoring fedora history, upstream discussions and all other implementations, what is the future plan for any new compression code, if this is true ... you just update the libdnf client code to know about the new extension, wait a few months and change the server again breaking all clients that aren't equipped to handle the new compression code, again?

Because you have data/@type="primary" which today links to location/@href="...xml.zst", but in Fedora 36 data/@type="primary" linked to location/@href="...xml.gz". Obviously the compression is not declared in data/@type.
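
To make that concrete, here is a small sketch that reads data/@type and location/@href out of a repomd.xml and infers the compression purely from the file name extension. The sample document and its hrefs are invented for illustration, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}

# Invented sample; real files carry checksums, sizes and timestamps too.
SAMPLE = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/0123-primary.xml.zst"/>
  </data>
  <data type="filelists">
    <location href="repodata/4567-filelists.xml.gz"/>
  </data>
</repomd>"""


def list_repodata(repomd_xml: str) -> dict[str, str]:
    """Map each data/@type to its location/@href."""
    root = ET.fromstring(repomd_xml)
    result = {}
    for data in root.findall("repo:data", REPO_NS):
        location = data.find("repo:location", REPO_NS)
        if location is not None:
            result[data.get("type")] = location.get("href")
    return result


def compression_of(href: str) -> str:
    """Infer the compression from the file name extension only."""
    name = href.rsplit("/", 1)[-1]
    ext = name.rsplit(".", 1)[-1] if "." in name else ""
    known = {"gz": "gzip", "zst": "zstd", "xz": "xz", "bz2": "bzip2"}
    return known.get(ext, "none")


for dtype, href in list_repodata(SAMPLE).items():
    print(dtype, compression_of(href))
```

Nothing in the type attribute changes between the gzip and zstd cases; only the href extension does.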

I don't defend that approach. I also find no declaration of a compression format very confusing and error prone. I only explain how it is implemented.

IMHO abusing data/@type to convey both type and compression is also wrong. The repomd format would probably look differently if it was designed from scratch nowadays.

The plan for future incompatible changes is exactly as you described: Disseminate support into clients and switch servers later. With better formats and better handling in clients you can get better error messages, but not magically appearing support for the new features.

I don't know whether changing the default compression in createrepo_c was a good idea. It's difficult to say whether compatibility is more important than performance. However, I wish Fedora infrastructure recognized that the tooling is not only used by Fedora, stopped relying on defaults, and started specifying formats explicitly, especially when it produces repositories for different systems on a single host which follows the latest Fedora release. The implicit reliance on defaults impedes everyone: Fedora packagers, upstream, other distributions.

ok. I am switching rawhide back to gzip... we can have a discussion on when/if to switch it later.

Because you have data/@type="primary" which today links to location/@href="...xml.zst", but in Fedora 36 data/@type="primary" linked to location/@href="...xml.gz".

Again, AFAIK, all previous "changes" to the compression of primary (or other labels like group) came with a label extension. Eg. An additional labelled element was in repomd.xml called "primary_lzma" which had primary.xml.lzma and primary.xml.gz was in the "primary" label.

Thanks for the change Kevin ... I guess we should probably move this somewhere else, on Monday.

So... yeah, where do we go from here?

  1. Just close this and keep gz config for rawhide/branched/updates and if we change it, do so after discussing on devel and being deliberate about that change.

  2. Wait until rhel7 goes EOL and then switch to zstd.

  3. Wait until koji can handle rhel7 buildroot repos with newer createrepo_c and then switch back to zstd

  4. Something else

I guess my confusion with the change was that it was unclear to me if part of the change was 'and switch rawhide to zstd/default with createrepo_c update' or the change had nothing to do with changing rawhide and it's fine (for now at least) just to configure it to use gz ?

I suppose I would prefer option 1 and if we want to switch to zstd for rawhide, we make a new change for it.

Thoughts?

I'd say 1+2: let's propose a system change for Rawhide/F41 to switch to zstd compression and then gradually change stable releases to use the same settings (meaning everything below F41 will stay with current settings, but for F41 and higher we will use zstd - new Bodhi will make this possible).

In the same proposal I would also like to see the switch from --compress-type to --general-compress-type, so we compress everything, and maybe disable createrepo_database which was only useful for yum based repos.

And maybe someone else could pull in some clever optimization, since we're already changing things...

So then, I guess the first step might be a discussion about what the compression should be (is zstd really the best choice? perhaps it is?) and what other repodata adjustments to make?

Does someone want to start that discussion? or do you think it's better to propose something and have the discussion as part of that change?

And who should write up the change?

I've set up the basics at https://fedoraproject.org/wiki/Changes/ChangeComposeSettings
But someone else should do the real work as I'm not able to push such changes... @kevin @ppisar I've put your names in the proposal :stuck_out_tongue_winking_eye: ! Just remove yourself if you don't want to be listed there...

I've set the change as a F41, so post F40 branching, to be sure we're ready to update everything to use createrepo_c > 1.0.0 post EPEL7 EOL.
When you're happy with the proposal text let me know and I'll change to ProposalReady.

If I understand the change text correctly, it pertains only to the compose configuration. No changes in the createrepo_c tool are required. I'm happy to help you with review and consultations. But because I'm not a releng and don't have access to the compose configuration, I cannot implement this change. Thus I'll remove my name from the change text.

I'd say 1+2: let's propose a system change for Rawhide/F41 to switch to zstd compression and then gradually change stable releases to use the same settings (meaning everything below F41 will stay with current settings, but for F41 and higher we will use zstd - new Bodhi will make this possible).

This just pushes the same problem down the road, and I'm not sure that we gain much.

The backwards compatible fix is deployed faster, easier and everything will continue to work:

  1. Fix createrepo_c to always use the compatible defaults.

  2. Add _zstd labels from createrepo_c by default, using that compression type.

  3. Get DNF to look for primary_zstd etc.

...and everyone with the latest dnf can see the benefits within a week or so, and anyone with code that worked for 15+ years will continue to work.

Yeah, I like @james suggestion here (but of course it needs changes in createrepo_c and dnf)... @ppisar any chance that could happen?

I will look at it, but not now. I was away for a month and need to process my backlog.

I'd like to understand James' idea better:

  1. Fix createrepo_c to always use the compatible defaults.

"compatible" means gzip?
"always" means never change the default?

  2. Add _zstd labels from createrepo_c by default, using that compression type.

What is "labels"? A suffix for data/@type attribute value? E.g. if primary is zstd-compressed, then write type="primary_zstd" into the repomd.xml instead of "primary"?
What does "by default" mean here? That in some cases, e.g. with "--compatibility" option the suffix should be omitted? Or that zstd compression should become default?

  3. Get DNF to look for primary_zstd etc.

I believe that DNF5 supports primary_zstd. I will check it.

I'd like to understand James' idea better:

  1. Fix createrepo_c to always use the compatible defaults.

"compatible" means gzip?
"always" means never change the default?

Yes, that's what compatible means.

  2. Add _zstd labels from createrepo_c by default, using that compression type.

What is "labels"? A suffix for data/@type attribute value? E.g. if primary is zstd-compressed, then write type="primary_zstd" into the repomd.xml instead of "primary"?

Yes, as I pointed out above there's a long history of having a compression type suffix. So the repomd.xml can contain "primary" and a "primary_zstd" in one repo. while in another repo. someone could do a "primary" and "primary_lzma" ... the programs accessing the repos. only need to support the compatible gzip decompression to access everything, but can optionally support as many other types as they want.

What does "by default" mean here? That in some cases, e.g. with "--compatibility" option the suffix should be omitted? Or that zstd compression should become default?

That if you run a plain "createrepo ." it creates both the compatible "primary" data and the new (presumably smaller/faster) "primary_zstd" data.
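
As a sketch of what a client would do under this scheme: try the suffixed labels it can decompress, in order of preference, and fall back to the plain gzip label. The "_zstd" suffix here is the proposal from this thread, not something the ticket shows createrepo_c emitting today, and pick_metadata is a hypothetical helper.

```python
def pick_metadata(
    available: dict[str, str], base: str, supported: list[str]
) -> tuple[str, str]:
    """Return (label, href) for the best entry this client can read.

    available: data/@type -> location/@href, as found in repomd.xml.
    supported: compression suffixes the client handles, best first.
    """
    for suffix in supported:
        label = f"{base}_{suffix}"
        if label in available:
            return label, available[label]
    # The plain label is assumed to stay gzip, so every client can read it.
    return base, available[base]
```

A client without zstd support passes an empty list and keeps getting the gzip "primary", while a newer client passes ["zstd"] and picks up "primary_zstd" when the repo has it.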

So, where are we here?

Can we file the suggestions made by james in the last comment upstream with createrepo_c ?

I think the change might depend on if upstream decides to use _zstd, etc... if it did, dnf{5} could just decide what it wants to use and I don't know that we need any change.

Thoughts?

I've drafted a PR to Bodhi which adds support for the --compatibility flag in the createrepo_c command. I think all of the createrepo_c options we (Fedora and EPEL) care about can now be defined per release, but the settings defined in the createrepo_c.ini config file in ansible need tweaking to match old and new behavior.

FYI, as I understand the --compatibility flag of createrepo_c, and from a quick test I made locally, that flag just overrides other settings. So, using the flag in the createrepo_c call just ensures that repodata files are produced as the old (< 1.0) createrepo_c did, not two formats alongside each other.
In future, if we ever want to switch compression format, we will be at the starting point of this discussion again...

So, I think most of this is being tracked elsewhere now... the proposed change (thanks @mattia), the bodhi changes, etc.

The only thing I'm not sure is captured is the suggestion to use the _compressionname extensions.
@james Could you file that upstream?

If there's actually anything for us to still do/track here, feel free to reopen or file a new issue

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago
