#1784 libsolv and dnf maintainers: failure to respond to critical bug
Closed: Invalid 6 years ago Opened 6 years ago by bcooksley.

This issue concerns the bug at https://bugzilla.redhat.com/show_bug.cgi?id=1483553

Currently we have an outstanding issue at KDE whereby we're unable to update the image used by our CI system because updates to libsolv and dnf have broken the ability for our Docker image script to complete successfully. This means that projects are unable to be built as we can't add the necessary dependencies into the image.

The failure log can be viewed at https://build.kde.org/view/CI%20Management/job/Docker%20Generate%20FedoraQt5.8%20Image/57/console

To date the DNF and libsolv maintainers have collectively failed to correct the issue, and have pushed out updates which theoretically should have resolved the issue, but which have failed to do so. In addition, following the last reopening they have not responded in any manner to this rather severe outstanding issue. It has been a whole week since it was last reopened.

I'd like to request FESCO require the DNF and libsolv maintainers take appropriate action as soon as practicable, and should they fail to act, instruct the archive maintainers to recall all of the defective updates from the archives to restore the ability of DNF to correctly install packages.


Metadata Update from @jforbes:
- Issue tagged with: meeting

6 years ago

It appears that things are working now? Job 58 completed. Please update if so.

  • AGREED: Issue 1784 is tabled for a week, to see if things are
    working now (+6,0,-0) (jforbes, 16:14:22)

FWIW, I believe the problem was not that dnf/libsolv were not fixed, I think you just needed a rebased container image (which @maxamillion released yesterday) that had the newer builds. I.e., I think you were running the old broken ones instead of the fixed ones.

I had this same problem on Bodhi's CI system, and I just confirmed that the new images seem to resolve it:

https://github.com/fedora-infra/bodhi/pull/1915

You can see the build passed here:

https://ci.centos.org/job/bodhi-bodhi/614/

I can confirm that our image build has now been able to complete successfully.

However, I'd like to note that it shouldn't have taken a reroll of the underlying Fedora Docker image to fix this. I'm a bit disappointed in the lack of responsiveness from the DNF developers as well, especially after the bug got reopened.

Looking closely at the log of one of the failed image builds, I note that the command which ends up failing did not upgrade libsolv or dnf - this was done by an earlier step. This seems to indicate that the upgrade process corrupted the package database, or otherwise failed to complete and left the system in a broken condition. Either of these would be a rather severe defect in a package manager.

If such updates are to be released in the future which result in such a condition, at the bare minimum the DNF developers should be required to coordinate with image maintainers so that we don't end up with another 3 week period where everything is broken.

Metadata Update from @sgallagh:
- Issue close_status updated to: Fixed

6 years ago

Metadata Update from @bcooksley:
- Issue status updated to: Open (was: Closed)

6 years ago

Reopening. Following another DNF update this has regressed again.

Given this is going to be a recurring issue, one-off fixes such as requesting that the Docker image maintainers rebuild the image to sidestep the root cause of this problem are insufficient.

I'd like to request that FESCO:
  • Immediately suspend the privileges of the DNF and libsolv maintainers to upload packages. This will prevent them from causing breakages in others' build processes, as they have done here for the second time.

  • Require that changes to the DNF and libsolv packages be coordinated with all relevant image maintainers, to minimize the window for breakage when such updates are released.

  • Require the DNF and libsolv maintainers to re-investigate the Docker issue as soon as practical and come up with a long-term solution which does not require image rebuilds to circumvent the issue.

While suspending their privileges may seem like an extreme measure, from my perspective this is justified as they completely ignored the issue last time, burying their heads in the sand and ignoring the issue for in excess of 2 weeks. The issue only went away last time because someone else circumvented their breakage.

The log of the breakage can be viewed at https://build.kde.org/view/CI%20Management/job/Docker%20Generate%20FedoraQt5.8%20Image/59/console

I can confirm that the issue has returned for my Bodhi CI tests. I'm seeing a (hand wavy not actually measured total guess) 70-90% failure rate in my PRs. It's been fairly crippling for Bodhi's PRs, though fortunately CI will pass if I ask Jenkins to re-run the tests over and over until they pass. I've filed a releng ticket to request the F26 container to be rebuilt (in my case I currently see this problem only on the F26 container).

I would love to develop an action plan to solve this problem more permanently. @ignatenkobrain, do you have any suggestions on what steps Fedora could take to eliminate or reduce the frequency of this particular problem?

@bowlofeggs: switch RPM to a different database, ndb or lmdb. This was planned for F24 (IIRC), but didn't get implemented. F27 got RPM 4.14 and users can theoretically try new backends, but the problem is that libsolv doesn't use librpm to read bdb; it uses custom code. @pmatilai is/was working on porting libsolv to librpm.

So in theory, the RPM guys and I could propose a change for F28 to switch to a different database, which would eliminate the problem completely.

Another option is that we stop updating glibc / libdb so often ;)

@ignatenkobrain: switching the database backend is just sweeping the thing under the carpet and hoping it goes away. We don't even know how LMDB (or ndb) will behave in this case; the fundamental design issues are still there, so they're just as likely to be broken. BDB happened to sort of work for years because it made invalid assumptions, only finally discovered by glibc changing in a way that enforces the issue.

@bowlofeggs, @bcooksley et al.: you need to understand that dnf is mostly just a passenger here. It's a really tricky and subtle problem over multiple components that few people truly understand. Go read the thing that unearthed this at https://bugzilla.redhat.com/show_bug.cgi?id=1394862 if you're not afraid of losing sleep. Oh and good luck demanding that glibc be downgraded to the F25 version...

And on top of that condvar thing, there's a libsolv-related deeper design issue that cannot be solved just like that, but it is being worked on. And there's packages running rpm from scriptlets when they should not, and dnf plugins executing rpm when they should not, and containers mounting things from the host when they shouldn't, AND file locking on overlayfs2 being broken (https://patchwork.kernel.org/patch/9236917/). And whatnot.

Finally, for F26 and newer a libsolv update with a workaround has been supplied, but it will only help if present at the start of the update. It's not at all clear from the bug that the multitude of reporters are actually using an updated image with the libsolv update present. NOBODY can help the case of non-updated images.

@pmatilai our CI system fetches the latest image from Dockerhub each time it performs a build, as noted in our log so that isn't the issue here.

Also, regardless of the root cause of this issue, I see no comment mentioning the background you just provided in the bug which originally triggered this request being made (https://bugzilla.redhat.com/show_bug.cgi?id=1483553). My point above about the dnf/libsolv maintainers being unresponsive stands on this basis.

Based on the above, I'd like to amend my request to be:

1) An immediate hard freeze for libdb, rpm, dnf, libsolv and any other involved packages across all versions of Fedora including Rawhide. Explicit permission from FESCO should be required for any changes, even if it's just a typo in a README file.

2) Hard requirement that all of the above maintainers of the affected packages coordinate with all image maintainers, to ensure that updates hit repositories and images are rebuilt at effectively the same time.

As an external user, not someone directly involved in Fedora, I shouldn't have to care about libsolv/dnf/libdb/etc. They should just work. Updates should never break things (especially not in the package manager of all things!), and while the risk they be able to do so continues, the relevant packages should be hard frozen and subject to direct supervision by FESCO.

Oh and BTW, in the KDE Docker build log there's no indication whatsoever that you're actually hitting that "BDB1539 Build signature doesn't match environment" error. Something goes wrong for sure, but YOU CANNOT JUST ASSUME it's the same thing, or a bug in dnf or any of its related components. File a bug and it might actually be investigated by somebody, but containers have been known to do evil things like break POSIX (see e.g. the overlayfs2 locking note) and such.

To provide some additional history, the whole problem started with Build #53 on our CI System, where the image failed to build with:
Running scriptlet: desktop-file-utils-0.23-3.fc26.x86_64 232/232 failed loading RPMDB

This error message is, to my understanding, related to the BDB1539 issue.
Following patches committed as part of the BDB1539 issue, this instead changed to a wall of:
Verifying : at-spi2-atk-devel-2.24.1-1.fc26.x86_64 193/232 atk-devel-2.24.0-1.fc26.x86_64 was supposed to be installed but is not!

The issue persisted until the upstream Fedora image was rebuilt at which point it succeeded (Build #58). Following changes as of late it has regressed again and these same errors are back, as seen in Build #59.

From my perspective, this is the same issue, or at the very least, an associated regression.
Due to this I'd rather not open a new bug, in part because we were totally ignored for 2-3 weeks by the responsible developers, but also as relevant information may be lost.

The issue was only resolved previously because I escalated the issue to FESCO (this task) and as such I'd prefer it remained with FESCO, as I've no confidence in the normal process for issues as critical as a broken package manager.

I have already received a request to integrate additional packages within our image, something which I am currently unable to do due to this breakage.

We will discuss this issue during Friday's FESCo meeting at 16:00UTC in #fedora-meeting on irc.freenode.net. All interested parties are invited to participate.

One more comment from me:

@bcooksley, you're saying here that things were better for a while and then they got worse again. With somewhat different errors, but you silently assumed that to be just more of the same. Assumed. Silently. There is not ONE comment from you in the bug you're complaining about, not about the previous problems nor about the new ones you're seeing. Which might be related in one way or the other (or not), but it's NOT THE SAME. How exactly do you suppose the maintainers could know you're seeing an issue?

And when one of the people who have actively been looking into the issue tells you to file a new bug about the new case you're seeing, you respond with "I'm not going to bother because I don't think anybody will do anything". Instead you're lobbying here to revoke rights from the only people that possibly COULD help you, IF you bothered to actually talk about the issues you're seeing in the place where they belong: bugzilla. (hint: a build log linked to a FESCO ticket will not be looked at by anybody)

Way to go, not.

First, the error message we are seeing now is the exact same message we saw after the initial fix went in (which is why the bug got reopened in the first place). No assumptions were made here. A patch which causes other breakages should not require a new bug report as it's a regression.

In that bug, Rex Dieter was making comments on my behalf. This included the initial comment concerning how it was broken, then the subsequent reopen when it got closed but not fixed.

A core part of my irritation, and the reason why I "don't think anybody will do anything" (not what I wrote) was because the bug was then completely ignored for over 2 weeks, until this issue was initially raised. Based on that I have no reason to expect the responsible developers and packagers will respond unless forced to do so by FESCO.

Opening this issue in the first place had that effect, and resulted in the Docker image being rebuilt (which worked around the underlying issue). The only reason we're back here is because another patch release has been made for one of the packages in the DNF stack, which has broken the workaround.

That is also why I want the responsible developers' and packagers' rights to be limited: to prevent them from re-breaking the workaround until a proper solution can be delivered.

Due to the times of FESCO meetings (I'm GMT+13) it won't be possible for me to attend the meeting, but I do hope that a proper solution to this problem can be worked out.

Oh well, I'll recap the situation once more, if only for FESCO's benefit (never say finally).

glibc 2.25 introduced a change that broke an invalid assumption Berkeley DB had been making for "forever", more than a decade anyway. The fix BDB was forced to make has percolated the issues up the food chain, and people have been scrambling to sort out the fallout from that. In particular, the BDB maintainer @pkubat has been going out of his way to alleviate the issues on the BDB side throughout the "condvar episode" (https://bugzilla.redhat.com/show_bug.cgi?id=1394862). Bug https://bugzilla.redhat.com/show_bug.cgi?id=1483553 is just another side-effect of the same thing, for which a work-around on the libsolv side was provided in September.

And here we have people asking to revoke rights from "dnf stack" maintainers who had zero to do with introducing any of this, but are trying to sort out the issues as new data comes up. Only there hasn't been any substantial new data in the bug in a long time; all I'm seeing is the same old thing with people running container images without an updated libsolv. So lately the issue seems to be mostly about the general major problem of container images: updates. To help diagnose that, we've now submitted a new libsolv update which makes it easier to tell whether an up-to-date libsolv is being used.

It's entirely possible there could be other more or less related issues going on but to investigate, the maintainers will need workable reproducers and more data, in bugzilla. Oh and if there's no "BDB1539 Build signature doesn't match environment" message in the log, it's not the same issue and should be tracked separately.
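
As a rough illustration of that distinction, a build log could be scanned for the messages quoted in this ticket before deciding which bug it belongs to. This is only a sketch: the message strings are the ones quoted above, while the function name `classify_log` and the labels are made up for the example, and it assumes the console output may still contain terminal colour codes.

```python
import re

# Terminal colour codes such as ESC[91m show up in raw CI console output.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Messages quoted in this ticket; the labels are invented for this sketch.
SIGNATURES = {
    "bdb1539": "BDB1539 Build signature doesn't match environment",
    "rpmdb_load_failure": "failed loading RPMDB",
    "install_verification_failure": "was supposed to be installed but is not!",
}


def classify_log(text: str) -> set[str]:
    """Return the label of every known message found in a build log."""
    clean = ANSI_ESCAPE.sub("", text)
    return {label for label, needle in SIGNATURES.items() if needle in clean}


sample = (
    "Running scriptlet: desktop-file-utils-0.23-3.fc26.x86_64 232/232"
    "\x1b[91mfailed loading RPMDB"
)
print(classify_log(sample))  # {'rpmdb_load_failure'}
```

A log that yields no `bdb1539` label would, by the reasoning above, be a candidate for a separate bug report.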

Oh and let's not forget (I almost did) that the container case manifests itself only with the non-POSIX overlay2 storage. A kernel patch to fix file locking on overlayfs has been proposed (https://patchwork.kernel.org/patch/9236917/) but I don't think it's applied anywhere yet. In the meanwhile, the workaround of using an actually POSIX-compatible storage (like the devicemapper backend) was suggested in early September in the bug (comment 12). So it's not exactly like people have been stuck with no options for months...

In regards to Docker (container) underlying file systems, in many instances you have no choice but to use the overlayfs backend as:
1) There is no raw disk space available for devicemapper/lvm to be setup in
2) The system doesn't have an existing zfs/btrfs file system present
3) aufs cannot be used as it is either not included in the kernel, or is of a version which contains severe bugs which the Java Runtime can easily trigger as part of its startup/shutdown routines. These bugs leave the kernel in an unsafe condition and require a complete reboot to resolve.

As for the September workaround, that would be roughly consistent with us reporting that it didn't fix the issue (at least not in full), and then being ignored. I've yet to see an explanation for that.

Yeah well, disk space needs might be inconvenient. But like lots of software out there, package management stack requires working POSIX file locking, and when you take that away ... well, things break, tough luck. Really. It's a filesystem bug (as also pointed out in the bug), and yelling at dnf maintainers or FESCO doesn't help a thing with that.

Implementing pre-transaction filesystem capability tests for required functionality for rpm has been on my long-term todo for quite a while. Testing for working fs locking just added to the list of things to check, inspired by this ticket.
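
A check for working file locking could look roughly like the following. This is a sketch only, not rpm's code: the name `posix_locking_works` is hypothetical, and it assumes a POSIX system where `os.fork` and `fcntl.lockf` are available. The parent takes an exclusive record lock on a scratch file and a child process then tries to take the same lock, which a correctly behaving filesystem must refuse.

```python
import fcntl
import os
import tempfile


def posix_locking_works(directory: str) -> bool:
    """Return True if a second process is refused a lock this process holds."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Take an exclusive POSIX record lock on the whole scratch file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        pid = os.fork()
        if pid == 0:
            # Child: record locks are per-process and not inherited across
            # fork, so a conflicting request here must be refused.
            status = 2  # unexpected failure, e.g. the file could not be opened
            try:
                child_fd = os.open(path, os.O_RDWR)
                try:
                    fcntl.lockf(child_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    status = 1  # lock granted twice: locking is broken
                except OSError:
                    status = 0  # refused as expected: locking works
            finally:
                os._exit(status)
        _, wait_status = os.waitpid(pid, 0)
        return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
    finally:
        os.close(fd)
        os.unlink(path)
```

Pointing it at a directory on the storage in question (for example, wherever the rpm database lives inside the container) would show whether a conflicting lock is refused there.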

I've repeatedly said that if there are unresolved issues after first updating the container to the latest libsolv, bring them up in a new bugzilla bug; nobody is going to debug stuff from a FESCO ticket. And again, it should be a new bug, because the case in 1483553 is considered done and dealt with as far as libsolv is concerned. It's not a "whatever might go wrong with installations on containers" bug but a specific case of BDB returning DB_VERSION_MISMATCHED in a situation that needs the DB_PRIVATE flag to work around. Which is what the supplied libsolv updates do.

This issue will be discussed during this week's FESCo meeting on Friday at 16:00UTC in #fedora-meeting on irc.freenode.net.

AFAICS the dnf maintainers are very responsive and seem to try to fix the problem as much as they can. Therefore I am -1 to the initial request. I am +1 to moving the discussion to a proper bug report in Bugzilla as requested by the dnf maintainers.

#agreed dnf maintainers will not be removed. FESCo encourages reporters to file separate tickets for each issue in bugzilla (+7, 0, -0)

https://meetbot.fedoraproject.org/fedora-meeting/2017-11-17/fesco.2017-11-17-16.03.html

Metadata Update from @bowlofeggs:
- Issue untagged with: meeting
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

6 years ago

Shrug

We dropped Fedora from the KDE CI system last week anyway after the failure of the last meeting to take place.

Given the length of time and level of escalation which was required to get a response from the DNF developers, along with their lack of apology for the issues we had (both with the software and their ignorance of the bug), it was decided that we could not have faith that Fedora as a whole would result in a maintainable system going forward.
