#3119 Numerous package git repos fail git-fsck, causing issues for mirroring, and need to be fixed
Closed: Accepted 3 months ago by zbyszek. Opened 4 months ago by ngompa.

As part of Fedora ELN work to prepare for CentOS Stream 10 branching from Fedora Linux 40, @sgallagh discovered that a number of package git repositories fail git-fsck. The list of packages is detailed in releng#11822 (to sum up: mostly Java packages that have existed since the beginning of Dist-Git), and the consequences of this are pretty severe:

  • External mirrors of package Git repositories to systems that enforce git-fsck on push (like GitLab) will fail.
  • Remote pull requests are effectively impossible since forks cannot be hosted elsewhere.

The ELN SIG has asked Release Engineering for a course of action to fix this, but Release Engineering would like FESCo approval for it, as well as a broad announcement that this is happening and why.

I'm marking this for fast-track given the urgency and timeframe needed to resolve this for the SIG.

Proposal: FESCo approves this one-time effort to fix these packages provided there's an announcement from the ELN SIG to the community to inform everyone of what's going on, why, the impact, and how packagers should respond to this.


Metadata Update from @ngompa:
- Issue assigned to sgallagh

4 months ago

Just for the record, because it's not mentioned explicitly in the proposal above: This is about rewriting a decade of git history (including git pre-history imported from cvs) for the affected repositories, changing most commit hashes, etc.

However, since part of the proposal is to archive the current rawhide HEAD to ensure mapping koji builds etc. to commit hashes is still possible, I am +1 for this one-time effort, since broken git history is a problem in general.

So, a clarifying statement: If we archive the current rawhide HEAD, we need to do it "somewhere else". As long as those faulty commits exist in Fedora's dist-git (even on a different branch), it becomes impossible to fork the repo or migrate it to a new hosting site.

For the record, I've already taken steps to import fixed branches of those sixteen CentOS Stream 10 packages in the releng ticket into GitLab as the c10s branch over there. I can import those branches back to Fedora fairly easily once we decide if and how we will archive the existing Rawhide branches somewhere. I didn't touch the ones that were Fedora-only (yet).

Metadata Update from @zbyszek:
- Issue tagged with: meeting

4 months ago

So, a clarifying statement: If we archive the current rawhide HEAD, we need to do it "somewhere else". As long as those faulty commits exist in Fedora's dist-git (even on a different branch), it becomes impossible to fork the repo or migrate it to a new hosting site.

Consider me a +1 to the proposal, but my one question is where is "somewhere else" and who is the steward of that archived data? I'd like to see that documented or captured somewhere should we need to dig in to the archived data later.

my one question is where is "somewhere else" and who is the steward of that archived data?

That's a good question. My proposal: export the .git directory from before the rewrite and save it as a file in the git history of the repo after the rewrite:

fedpkg clone xmltool
cd xmltool
tar -Jcvf xmltool.git.tar.xz .git
git filter-repo ...
git add xmltool.git.tar.xz
git commit -m 'History rewrite: save previous .git directory

[skip changelog]'

This way it will never get lost, we don't need a new "place" to store things, and anyone can trivially dig into the history if they need to.

That's a good question. My proposal: export the .git directory from before the rewrite and save it as a file in the git history of the repo after the rewrite:

I'd much prefer to create a new rpms-archive distgit namespace and push the old repos there, but if you insist, it would make more sense to put that archive on a separate git checkout --orphan archive branch to avoid confusion or the accidental deletion of the archive.

Metadata Update from @zbyszek:
- Issue untagged with: fast track

4 months ago

This was discussed in today's meeting, but we didn't reach any conclusions.

I'd much prefer to create a new rpms-archive distgit namespace and push the old repos there

That is certainly a possibility. But I think that'd be overkill. We're unlikely to ever need to look at those repos. The rewrite is a rather trivial adjustment of the email address.

it would make more sense to put that archive on a separate git checkout --orphan archive branch to avoid confusion or the accidental deletion of the archive.

OK, I like that. (The archive cannot be deleted, because it's attached to a git commit, but yeah, it seems nicer.)

Updated proposal:

fedpkg clone xmltool
cd xmltool
tar -Jcvf xmltool.git.tar.xz .git
git filter-repo ...
git checkout --orphan archive
git add xmltool.git.tar.xz
git commit -m "Save previous .git directory before rewrite on $(date +%F)"
git switch -

So, a clarifying statement: If we archive the current rawhide HEAD, we need to do it "somewhere else". As long as those faulty commits exist in Fedora's dist-git (even on a different branch), it becomes impossible to fork the repo or migrate it to a new hosting site.

@sgallagh It is possible to have heads that are not cloned automatically. I expect Gitlab to clone refs/heads/* only, so you could have a branch refs/archive/rawhide with the old contents. This would also exclude it from manual git clone, which could be a good thing or a bad thing depending on how you look at it.

Please don't put git in tarballs in git.

Please put it where we always put such things, i.e. in archive/ see https://pagure.io/releng/issue/7265

That is indeed nicer. The only disadvantage is that git fsck would fail on a system which has the archive branch. But if we don't have plans to run fsck there, that doesn't matter.

So reading back through this and thinking about the issue ("All of the affected packages have the same root issue: a packager many years ago had an extra < character in their author/committer field, which causes a (harmless) validation error."), I am not convinced this is something that requires the heavy hammer of tarring up the history and archiving it and moving forward. Git has ways to correct errors in author and committer fields. I've done this before, especially when a contributor wants to change their email address. The actual commit IDs remain in place so that existing clones and forks work.

I would strongly prefer us exploring 'git commit --amend' and 'git-filter-repo' to correct the known issues rather than archiving history.
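For readers unfamiliar with the validation error quoted above, here is a rough Python sketch of the kind of check that rejects an author/committer line containing an extra `<`. It is a simplified approximation under my own assumptions, not git's actual fsck code, and the name and address in the example are made up.

```python
import re

# Simplified shape of an ident: "Name <email> <unix-timestamp> <+/-HHMM>".
# Neither the name nor the email may contain '<' or '>'.
# This is an illustrative approximation, not git's real implementation.
IDENT_RE = re.compile(rb"[^<>\n]+ <[^<>\n]*> \d+ [+-]\d{4}")


def ident_looks_valid(ident: bytes) -> bool:
    """Return True if the ident line matches the simplified shape above."""
    return IDENT_RE.fullmatch(ident) is not None


# Hypothetical ident lines: one with a stray '<', one corrected.
broken = b"Example Packager < <packager@example.com> 1200000000 +0000"
fixed = b"Example Packager <packager@example.com> 1200000000 +0000"

print(ident_looks_valid(broken), ident_looks_valid(fixed))
```

A hosting service that enforces such checks on push would refuse any history containing the first form, which matches the mirroring failures described at the top of this ticket.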

I would strongly prefer us exploring 'git commit --amend' and 'git-filter-repo' to correct the known issues rather than archiving history.

That is how we would fix it, but all the commit hashes will change in the branch where we do this, so we need the old commits archived somewhere in case they need to be pulled.

I would strongly prefer us exploring 'git commit --amend' and 'git-filter-repo' to correct the known issues rather than archiving history.

That is how we would fix it, but all the commit hashes will change in the branch where we do this, so we need the old commits archived somewhere in case they need to be pulled.

Right, the committer field is part of what is hashed to create the commit ID. You can absolutely amend or filter-repo to fix things up, but it DOES rewrite the history from that point. In order to retain the original commits, we need to store them somewhere.
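To illustrate why the ids change, here is a minimal Python sketch of how git derives an object id from an object's contents, assuming the SHA-1 object format. The packager name, address, timestamp, and commit message are made up for illustration; only the id computation (SHA-1 over a `<type> <size>\0` header followed by the body) reflects git's format.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> str:
    """Return the SHA-1 id git assigns to an object of the given kind."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(email: bytes) -> bytes:
    """Build a hypothetical root commit whose ident uses the given email."""
    ident = b"Example Packager <" + email + b"> 1200000000 +0000"
    return (
        # The tree id below is git's well-known empty tree.
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        b"author " + ident + b"\n"
        b"committer " + ident + b"\n"
        b"\n"
        b"Initial import\n"
    )


# Same commit, once with the stray " <" in the email and once without.
broken_id = git_object_id(b"commit", commit_body(b" <packager@example.com"))
fixed_id = git_object_id(b"commit", commit_body(b"packager@example.com"))

print(broken_id)
print(fixed_id)
```

The two printed ids differ even though the tree and message are identical. Because every child commit records its parent's id in its own body, the change cascades to all descendants of the first corrected commit.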

FYI, the exact command I ran to fix this up for CentOS Stream 10 was:

git filter-repo --force --email-callback 'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'
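The callback body in that command is plain Python operating on a bytes value. As a sketch, it can be written as a standalone function and tried on sample input; the function name `fix_email` and the second sample address are my own, chosen only for illustration.

```python
def fix_email(email: bytes) -> bytes:
    """Apply the same replacement as the --email-callback body above."""
    return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")


# The malformed value loses its leading " <"; other addresses pass through.
print(fix_email(b" <akurtako@redhat.com"))
print(fix_email(b"someone@example.com"))
```

Since the replacement only matches the malformed string, commits by other authors keep their ident unchanged, although their ids can still change when an ancestor commit is rewritten.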

There isn't a high urgency on this at the moment, so I'm taking it off the meeting agenda.

Metadata Update from @sgallagh:
- Issue untagged with: meeting
- Issue tagged with: stalled

3 months ago

I tested the solution proposed by @fweimer and it works as advertised.

Test:

fedpkg clone xmltool && cd xmltool
cp -av .git/refs/heads archive                   # note that 'archive' must be *outside* .git/ so it doesn't get rewritten
git filter-repo --force --email-callback 'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'
mv archive .git/refs/

After that, when the repo is cloned, we don't get the old branches. I checked that the original branches cannot be pushed to gitlab, but the new ones can. The old branches can be referred to via git rev-parse archive/f37 and similar.
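The reason a plain clone skips the old branches can be sketched as a pattern match: the default fetch refspec of a clone covers `refs/heads/*`, and `refs/archive/*` falls outside it. The Python below is only a model of that matching, with a hypothetical list of refs; it does not talk to git, and it leaves tags and HEAD out of the picture.

```python
from fnmatch import fnmatchcase

# Source side of the default fetch refspec set up by a clone.
DEFAULT_BRANCH_PATTERN = "refs/heads/*"


def is_branch_ref(ref: str) -> bool:
    """Return True if the ref falls under the default branch pattern."""
    return fnmatchcase(ref, DEFAULT_BRANCH_PATTERN)


# Hypothetical refs on the server after the rewrite.
refs = [
    "refs/heads/rawhide",
    "refs/heads/f37",
    "refs/archive/rawhide",
    "refs/archive/f37",
]

cloned = [r for r in refs if is_branch_ref(r)]
print(cloned)
```

Only the two `refs/heads/` entries are selected, so the archived history stays on the server without being copied into ordinary clones or forks.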

(EDIT: Note that this command is intended to be invoked in the "upstream" dist-git repo. When testing locally, after e.g. fedpkg clone, one has to first generate local branches, e.g. via for i in {14..37}; do git checkout f$i;done.)

PROPOSAL: FESCo approves rewriting the history in those repositories using git filter-repo, with the old branches saved to the refs/archive/ namespace.

Metadata Update from @zbyszek:
- Issue untagged with: stalled

3 months ago

I tested the solution proposed by @fweimer and it works as advertised.

Fantastic!

(EDIT: Note that this command is intended to be invoked in the "upstream" dist-git repo. When testing locally, after e.g. fedpkg clone, one has to first generate local branches, e.g. via for i in {14..37}; do git checkout f$i;done.)

Does this mean we'll need to tell people: "Just do a fresh checkout of the affected packages"?

PROPOSAL: FESCo approves rewriting the history in those repositories using git filter-repo, with the old branches saved to the refs/archive/ namespace.

+1

Does this mean we'll need to tell people: "Just do a fresh checkout of the affected packages"?

That is the easiest option and I think we should recommend that. Otherwise, you need to do a git pull (or git pull --rebase if the clone is old enough, because the default changed some time ago) in each branch before using it.

After two weeks and some change, the result is:
APPROVED (+4, 0, 0)

Metadata Update from @zbyszek:
- Issue tagged with: pending announcement

3 months ago

Metadata Update from @zbyszek:
- Issue untagged with: pending announcement
- Issue close_status updated to: Accepted
- Issue status updated to: Closed (was: Open)

3 months ago
