Issue #667: Request to fix CRITPATH update process - fesco

Closed None Opened 12 years ago by dledford.

= Proposal topic =

I have objected, many times, to the CRITPATH update process on the devel mailing list. This has, obviously, done nothing. I'm now filing this ticket as my last resort to get the issue fixed.

= Overview =

I object to the CRITPATH process on the basis that it makes it impossible for the maintainer of a package to actually have the power to fix users' problems in a timely manner yet holds the package maintainer accountable for fixing users' problems. It is unethical to make a person accountable for something over which they don't actually have the power to control.

I would cite my current f16 mdadm update as an example. It originally had mdadm-3.2.2-8.fc16 as the package in testing with one bug, which was verified to resolve the problem (732818). Before the update was approved, I modified the update with mdadm-3.2.2-9 due to another bug, which has since been verified (and which is a non-boot issue for some people, 729205). In total, the update has now lingered for 43 days. Yesterday, I was assigned bug 736530. In that bug, Adam Williamson made this comment:

note that this is a Beta blocker bug through 731177's dependence on it, so
please prioritize - thanks!

I think it's very appropriate here to point out that when my update that fixes known bugs that prevent proper bootup has lingered for 43 days, to have someone tell me essentially "this is important, get on it!" is a huge slap in the face. My response to this treatment is a resounding FU. FU and the horse you rode in on.

= Problem space =

Maintainers need sufficient power to solve users' problems.

= Solution Overview =

I propose that the CRITPATH update acceptance criteria be modified as such:

Any CRITPATH update shall be approved immediately when any one of the three following conditions is met:

1) All bugs listed on the update have been transitioned from ON_QA to VERIFIED indicating that the CRITPATH update solves the issues it is intended to solve. If the update creates new issues after release, then there will be a new CRITPATH update with new bugs to solve those issues.
2) The update receives a total of +3 karma with at least one proven tester +1 karma.
3) The update reaches an age of two weeks and nag mails are now being sent.
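
The three conditions above form an any-of rule. A minimal sketch of that rule in Python, assuming a hypothetical `Update` record (the field names are illustrative and are not Bodhi's actual data model):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Update:
    """Hypothetical stand-in for an update; not Bodhi's real schema."""
    bug_states: List[str]        # Bugzilla state of each bug listed on the update
    karma: int                   # total karma on the update
    proventester_plus_ones: int  # how many +1 votes came from proven testers
    age_days: int                # days since the update was submitted


def should_approve(update: Update) -> bool:
    # Condition 1: every listed bug has moved to VERIFIED.
    # Assumption: an update that lists no bugs cannot satisfy this condition.
    all_verified = bool(update.bug_states) and all(
        state == "VERIFIED" for state in update.bug_states
    )
    # Condition 2: +3 total karma, at least one +1 from a proven tester.
    enough_karma = update.karma >= 3 and update.proventester_plus_ones >= 1
    # Condition 3: the update is two weeks old (when nag mails start).
    timed_out = update.age_days >= 14
    return all_verified or enough_karma or timed_out
```

Under this sketch, the mdadm update described in the overview (43 days old, both listed bugs verified) would qualify through condition 1 and again through condition 3.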

= Active Ingredients =

What groups/systems/things are involved and/or affected by the proposal?

Unknown, I don't know the internals of the CRITPATH implementation.

= Owners =

Who owns this proposal?

Whoever created the god awful process we have in place now ought to be responsible for cleaning up their own damn mess.

adamwill commented 12 years ago

note that this is essentially the 'same fix, multiple releases' problem which has come up before: in the case of both bugs cited, the reporters tested the fix and provided karma, but only for F15. The same fixes were submitted to F16, but no reporters were running F16, it seems.

There have been a few proposals to cover the case where matching updates have testing for one release but not for another.

FWIW, I can't easily upkarma mdadm updates for F16 presently as I don't have an F16 system with an mdraid array in it. My laptop has such an array, but it's on F15 so I can do F15 karma...

on 736530: that's simply a standard comment to alert you to the fact the bug is effectively a blocker even though this is not immediately obvious (since it's a blocker-by-proxy and in itself does not block F16Beta). I'd post that comment on any bug in the same situation.

adamwill commented 12 years ago

here's some references, for convenience:

kevin commented 12 years ago

Sorry you are so frustrated with the process.

Adding meeting keyword to discuss at monday's meeting.

dledford commented 12 years ago

For clarification purposes: the f15 update only appears to be 14 days old, it is in fact just as old as the f16 update (43 days). The apparent difference is because between the mdadm-3.2.2-7 and mdadm-3.2.2-8 package updates, I redid the f15 errata while I edited the f16 errata.

And while this does bear some "same bug, different release" issue similarity, I would point out that the f15 update is still not approved either (the testers didn't provide karma, and even if they did, there is no proventester karma).

adamwill commented 12 years ago

oh right, I was looking at an older f15 update.

of the proventesters who +1ed the last f15 update I know robatino and myself have been busy with f16 stuff, i haven't done an f15 karma run for a while :( i'll try and do one soon.

kkofler commented 12 years ago

I wholeheartedly agree that the critpath process as it stands is totally and utterly broken and think that dledford's proposal (in particular, the 2-week testing timeout even for critpath packages) is the absolute MINIMUM which should be approved. Considering how painful even the process for regular updates has become, I question the value of having a stricter critpath process AT ALL anymore. IMHO, the regular process is paranoid enough to be suitable for critpath packages (and we should let regular packages go directly to stable again, but that's not the matter of this topic).

For some other examples of how the critpath process is broken, look at some Fedora 14 SECURITY (!!!) updates which were stuck in testing for a MONTH or more (!) due to the critpath process. E.g. https://admin.fedoraproject.org/updates/freetype-2.4.2-5.fc14 (where another 2.5 weeks were wasted by the maintainer because he was waiting for an autokarma of 3, but even if the update had been pushed IMMEDIATELY after the critpath approval, it would still have been stuck in testing for almost a month, also considering the push delay). It is just plain unacceptable to expose our users to security risks out of sheer paranoia over some phantomatic regression which is less likely to happen than the user getting killed in a car accident. (In fact, it is much more likely for the user to get his/her computer broken into due to our delayed security fix than to be hit by a regression from said fix if we had pushed it in a timely manner, and the results of the former are also usually much worse.)

I shall also point out that the original critpath policy FESCo voted (while I was in office; I voted against it, obviously) required 1 proventester karma, period. The text of the policy was "fixed" (by a wiki edit) to 1 proventester + 1 normal to align it with what was implemented for branched trees as part of No Frozen Rawhide. As far as I remember, there was never a formal vote on the "fix", and I think it makes no sense (and NFR should have been fixed instead), the point of a tester being a proventester is that we should TRUST him or her; requiring another karma point is a sign of distrust.

Basically, all these update policies are about distrusting our contributors: we don't trust package maintainers (even ones whose experience we explicitly acknowledged by giving them provenpackager or sponsor privileges), so we put technobureaucratic red tape in their way; we don't even trust our proventesters, instead, we require a second tester. We really need to go back to a culture of trust, not distrust!

kkofler commented 12 years ago

I shall add that while I keep seeing issues CAUSED by the critpath process (and its cousin, the update process) all the time, I have yet to see a SINGLE issue PREVENTED by the process. (In fact, there have been several regressions in critpath packages which went through the process and for which the process just delayed the fix (!) (exactly the phenomenon I had predicted and been warning about all along when this was being discussed in FESCo). Yet, so far, FESCo and QA have been sitting in denial and refusing to acknowledge that their process is broken. Can this please change? Pretty please!)

The lack of major regressions in our releases is NOT evidence that the critpath process was successful at preventing them any more than the lack of elephants on the street is evidence that clapping hands is successful at chasing them away. Even before the critpath process, that was entirely a non-issue.

In the whole history of Fedora, there were only 2 events which triggered this whole madness: one bad security update (D-Bus, which was easily fixable by rolling back to the previous build, and which had a fixed build available soon afterwards) and one update to a package not even installed by default (bind). In my books, that's only one issue which possibly qualifies as "major" (the D-Bus one), which caused no permanent damage (downgrading D-Bus completely fixed it until the permanent fix was out) and which was an isolated incident. An issue in a package most users don't even have installed and which doesn't cause any damage to the system not related to the package cannot possibly be a major issue, so the bind one doesn't qualify.

adamwill commented 12 years ago

"I shall add that while I keep seeing issues CAUSED by the critpath process (and its cousin, the update process) all the time, I have yet to see a SINGLE issue PREVENTED by the process."

I'd like to thank Andreas for making this very easy to disprove. There have been four glibc updates for F15 and F16 just in the last few weeks with serious bugs, which were filtered by the critpath process:

https://admin.fedoraproject.org/updates/FEDORA-2011-10556
https://admin.fedoraproject.org/updates/FEDORA-2011-10988
https://admin.fedoraproject.org/updates/FEDORA-2011-11443
https://admin.fedoraproject.org/updates/FEDORA-2011-12360

there have certainly been others, but it's something of a pain to search for, as bodhi doesn't have a straightforward way to search for rejected updates, as close as I can tell.

I certainly have some degree of sympathy with your frustrations with the process, but your utter refusal to acknowledge it could ever possibly achieve anything positive is, I think, hurting your position rather than helping it. The process clearly does have a benefit. Whether a better balance of benefits versus costs can be achieved by tweaking the process is a nuanced question, not the STRAIGHTFORWARD matter of a BLACK AND WHITE nature that ANY IDIOT could see, which you always represent it as.

"Yet, so far, FESCo and QA have been sitting in denial and refusing to acknowledge that their process is broken"

Not to play the blame game, but this really is not a QA process in any sense at all. We handle the group of proven testers because FESCo asked us to. The process was developed and implemented by FESCo.

kkofler commented 12 years ago

Those glibc updates got negative feedback within hours of getting pushed to testing. The only way they could have filtered through even without the critpath policy would have been a direct stable push, or an explicit push ignoring the negative karma. In both cases, the maintainer would have clearly done something wrong, as there's nothing in those updates which would have justified a direct stable push (and ignoring negative feedback would of course be very dumb). And in particular, the paranoid policy FESCo is forcing on ALL updates would have been sufficient to prevent these from going out, the critpath one is overkill. So this is NOT a valid example of a success of the critpath policy.

adamwill commented 12 years ago

That's your standby argument whenever anyone points out any update which was rejected. Given the way you've drawn up the rules of this game, I'm confused as to how you could ever possibly consider the updates policy to have 'won': any update that gets positive karma didn't need to be gated, any update which doesn't get karma is an EPIC FIAL, and any update which gets negative karma 'wouldn't have been pushed anyway' - even though the whole reason the policy exists is that we had documented and infamous instances where maintainers did push updates that 'wouldn't have been pushed anyway'. Remember, the catalyst for the policy was a glibc update very similar to those mentioned above.

What, for you, would constitute an example of the package update policy functioning well? If there is no positive answer to that question, you're not working with a fair set of rules.

adamwill commented 12 years ago

On the original proposal:

I brought this up at today's QA meeting to get a QA group overall take on it. We agreed we're not opposed to dledford's proposals (though we'd stipulate the two week timeout could only apply to updates with no negative karma). We acknowledge that the current proven tester manpower is not enough to cover all critpath updates for all releases in a timely manner, and dledford's proposal seems like a reasonable way to mitigate the impact of this.

kkofler commented 12 years ago

The facts are:
1. The critpath policy could have been much looser and those 4 broken glibc updates would still have been blocked. (In particular, the policy being enforced on all updates would already have blocked all 4 glibc updates above, even without the added critpath paranoia.)
2. If, instead of the current Bodhi stubbornness, there were informative rule-of-thumb guidelines simply saying "do not push an update directly to stable unless …" and "take all negative Bodhi feedback into account and never push a known broken update to stable", then (a) why and with what evidence do you assume the maintainer would just do the wrong thing anyway and (b) if it does happen, why wouldn't the right response be to complain to the offending maintainer and educate him about the policies rather than assuming everyone is incompetent by default? Especially maintainers of critical packages definitely should be experienced enough to know what they're doing.

"What, for you, would constitute an example of the package update policy functioning well? If there is no positive answer to that question, you're not working with a fair set of rules."

That's a non-sequitur. It just means that the policy makes no sense whatsoever and cannot possibly make sense. A competent maintainer is always going to make a better judgement about the quality of his updates than some dumb automated system with no grasp of logic and with inflexible hardcoded rules. (For an example of how such automated systems are necessarily dumb and unhelpful, try to click "push to stable" on a non-critical package after 6 days 23 hours and 59 minutes of testing. Obviously, it will reject it. Does the added minute of testing suddenly make the update more stable? And the actual push is likely to happen a couple hours later at the earliest anyway. I've been forced several times to wait a few minutes in front of my computer waiting for the 7 "magic" days to expire just because of this stupidity. How's that not a pointless waste of my time?)

kevin commented 12 years ago

We are going to revisit this next week. In the meantime:

nirik is going to try and start some proventester meetings
adamw is going to talk with QA about drumming up more testers.
dledford is going to look at a proposal to split up critpath into functional subparts.

mmaslano commented 12 years ago

ACTION: nirik will bring up question about list of untested updates on proventester meeting (mmaslano, 17:49:09)

dledford commented 12 years ago

Problem 1: The original critpath scope includes too many tasks in the list of covered tasks.

Proposal 1: Reduce the scope of critpath coverage. For instance, the most critical tasks for an installed system are those necessary to update from a broken package preventing proper operation to a fixed version of the package that restores proper functionality. It is not necessary to build a live cd image, to build an rpm package, or compose a new tree to update packages on an already installed machine. Even if the machine in question is a repo server for an internal department it still is not necessary to have the repogen tools in the critpath as the only real critpath even for the repo server is the ability for the repo server to update its own packages to fixed versions that re-enable successful repo creation (or downgrade if you can't wait for the fix upgrade). Some of these excluded items might make more sense in a different setting, so it might be advisable to differentiate between critpath for pre-GA early branched packages (where for instance anaconda makes sense), post-GA updates (where anaconda is more or less a red herring), and build system updates (where things like gcc-c++, rpmbuild, and a few others make much more sense, and here I draw a distinction between build system updates and build root updates...broken packages in the build root are easily backed out via a rel-eng ticket and koji tag change, but packages installed in the build system that create build roots are another matter). So, I would recommend that we separate CRITPATH into three different CRITPATH sets: post-GA CRITPATH, early branched CRITPATH, and buildsystem CRITPATH. I would then suggest that these three different sets pull in the packages appropriate for their given target audience. For example, we should be able to drop the livecd building tools, rpm building tools, gcc-c++, and a few others from the post-GA CRITPATH where we really care about people simply being able to update to the latest package to get their problem resolved.
Once the three critpath groups are created and populated to pull in the right sub-groups, we would then update bodhi to apply the proper critpath group to any given update depending on the update's destination.
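
The split into three sets, with the applicable set chosen by the update's destination, can be sketched as follows. This is an illustration only: the set names follow the proposal, but the destination labels and the package membership shown here are assumptions, not the real critpath list or Bodhi's actual logic.

```python
# Hypothetical sketch of Proposal 1. The package membership below is
# illustrative; only the idea of three separate sets comes from the proposal.
CRITPATH_SETS = {
    "post-GA": {"glibc", "mdadm", "rpm", "yum"},
    "early-branched": {"glibc", "mdadm", "rpm", "yum", "anaconda"},
    "buildsystem": {"glibc", "rpm", "gcc-c++", "rpm-build"},
}


def needs_critpath_approval(package: str, destination: str) -> bool:
    """Return True if `package` is critpath for the given destination.

    `destination` is assumed to be one of the three keys of CRITPATH_SETS.
    """
    return package in CRITPATH_SETS[destination]
```

With this layout, an anaconda update would be gated while a release is in early branched state but not after GA, and gcc-c++ would only be gated for build system updates.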

Problem 2: The critpath list in comps.xml does not distinguish between the various reasons that a package might be on the critpath list and does not therefore inform as to the types of testing that might be necessary to verify the package's critpath functionality.

Proposal 2: Update the critpath xml format to allow for a new attribute to be attached to any given package in the xml list, critpath-approval. Use this field to specify the type of approval needed for that specific package to be released by bodhi. Bodhi would use this field directly; if it isn't present, it would fall back to the current defaults (although I might suggest different defaults for the 3 different classes of critpath groups in proposal 1). Then, as packages and their role are inspected by qe/developers, if it is apparent that a given package would benefit from a different approval spec, write that approval spec into the comps.xml file, which will then result in bodhi implementing the new approval rules for that given package (if FESCo would like to be more involved with approving updates, that's up to them, I just didn't know who really should be the responsible party here). I would suggest making bodhi able to parse a flexible rule struct from the xml file so that you can express something like (+1 proventester && +1 other && !-1 any) or (+1 autoqa && +1 hardware owner && !-1 any) or (+1 all critpath bugs && !-1 any). However, since I'm not really an xml expert, I don't even know if expressing rules like this is reasonable in xml format.
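
As a rough feasibility check, a rule of this shape can be carried in an XML attribute and evaluated with a few lines of Python. Everything here is an assumption made for illustration: the `critpath-approval` attribute is the one proposed above and does not exist in comps.xml, the fragment is simplified, the rule grammar (terms joined by `&&`, optional leading `!`) is invented, each vote is assumed to carry exactly one voter category, and the default rule merely mirrors the "1 proventester + 1 normal" requirement mentioned earlier in this ticket. The "all critpath bugs" form is not covered.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical comps.xml fragment; "&&" must be escaped in XML.
COMPS_FRAGMENT = """
<group>
  <packagelist>
    <packagereq critpath-approval="+1 autoqa &amp;&amp; +1 hardware owner &amp;&amp; !-1 any">mdadm</packagereq>
    <packagereq>glibc</packagereq>
  </packagelist>
</group>
"""

# Assumed fallback when a package carries no critpath-approval attribute.
DEFAULT_RULE = "+1 proventester && +1 other && !-1 any"


def rules_from_comps(xml_text):
    """Map each package name to its approval rule string."""
    root = ET.fromstring(xml_text)
    return {
        req.text: req.get("critpath-approval", DEFAULT_RULE)
        for req in root.iter("packagereq")
    }


def parse_rule(text):
    """Turn '+1 a && !-1 any' into [(negated, value, category), ...]."""
    terms = []
    for raw in text.split("&&"):
        raw = raw.strip()
        negated = raw.startswith("!")
        if negated:
            raw = raw[1:]
        value, category = raw.split(None, 1)
        terms.append((negated, int(value), category.strip()))
    return terms


def satisfied(rule_text, votes):
    """votes is a list of (value, voter_category) pairs, e.g. (1, 'other')."""
    for negated, value, category in parse_rule(rule_text):
        found = any(v == value and category in ("any", c) for v, c in votes)
        # A plain term must be found; a negated term must not be found.
        if found == negated:
            return False
    return True
```

For example, `satisfied(DEFAULT_RULE, [(1, "proventester"), (1, "other")])` holds, while adding a `(-1, "other")` vote makes the same rule fail because of the `!-1 any` term.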

toshio commented 12 years ago

== Proposal #1: Would touch ==

mash
script that runs when mash finishes and syncs critpath packages into pkgdb
pkgdb (would need to know about separate critpath categories). Easy way would be a boolean for each. More correct method would be a new db table with the critpath category and package-release that it is referring to.
bodhi would need to be updated to 1) grab the separate lists from pkgdb and 2) make different decisions based on which critpath list the package was in
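
The "new db table" option for pkgdb mentioned above could look roughly like this. It is a sketch using an in-memory SQLite database; the table and column names are hypothetical and do not reflect pkgdb's real schema.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical membership table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE critpath_membership (
            package  TEXT NOT NULL,
            release  TEXT NOT NULL,   -- e.g. 'f16'
            category TEXT NOT NULL,   -- e.g. 'post-GA', 'buildsystem'
            PRIMARY KEY (package, release, category)
        )
        """
    )
    return db


def add_membership(db, package, release, category):
    db.execute(
        "INSERT INTO critpath_membership VALUES (?, ?, ?)",
        (package, release, category),
    )


def categories_for(db, package, release):
    """Return the sorted critpath categories a package-release belongs to."""
    rows = db.execute(
        "SELECT category FROM critpath_membership"
        " WHERE package = ? AND release = ?",
        (package, release),
    )
    return sorted(row[0] for row in rows)
```

Compared with one boolean column per category, a separate table lets a new critpath category be added without a schema change, which is presumably why it is called the more correct method above.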

A modification of #1 was proposed to not do dep solving on the list of critpath packages; instead make it explicit what's in critpath. This would allow us to remove mash and the script that syncs from mash to pkgdb. Depending on other proposals/issues we might be able to skip pkgdb as well and implement directly in bodhi (but we might also decide that we want to skip comps.xml and implement directly in pkgdb or that we need to maintain both comps.xml and pkgdb with slightly different lists/information).

The downside of this is that the list of dependent packages would need to be manually updated. I think that this additional proposal makes sense if we'd like to prune the list of critpath packages (for instance, I had a package that was critpath because anaconda used one function from it in two places in its code. Was this overkill? Especially post-release where anaconda is not going to be run using libraries in the updates repo?) This could make sense if we think that the pendulum has swung too far towards trying to guarantee product stability vs the barriers we have that prevent easy maintenance of critpath-listed packages.

== Proposal #2 ==

There's several problems with doing this as written:

We'd need to look at and possibly modify any software that uses comps.xml. This may be a problem. In the past (I think Jeremy era, though) we talked about comps.xml experiencing feature creep and wanting to remain focused on its use in the installer.
We'd need to figure out what to do with dependencies. Dependencies don't have an entry in the critpath group so they don't have a place to record this information in there. If we made the list explicit (non-depsolved) that would work. Giving dependencies a default method of approval might be another method but I think we'd want that default method to differ from the present approval-method. Something like telling what things in the default critpath list use the package as a dependency so that someone can test those base packages and then approve this package if those packages work. This may end up not exercising the non-critpath package at all (if the code paths tested in the explicitly listed critpath package don't exercise the dependent critpath package) but this may be acceptable (it's only critpath b/c of the dependency. If the dependent package continues to function in the cases we care enough to test, then maybe it's fine).

An alternative implementation might be to put this information somewhere other than comps.xml. If we're still using comps.xml to feed the extraction of dependent packages, then we'd have to maintain two lists that take user input. If we stop doing that (either we feed the extraction of dependent packages from this other list or we no longer calculate dependencies) then we only have one list. PackageDB would be one place that this could live which is already publicly visible and editable so it may be the right place for it if those are desirable. This proposal seems to have more ways it could go than proposal #1 though, so there should be more discussion of what this should really look like to evaluate it.

jwrdegoede commented 12 years ago

Hi All,

I started a thread on the devel list with a similar complaint to dledford's, see:
http://lists.fedoraproject.org/pipermail/devel/2011-October/157576.html
Peter Robinson helpfully pointed me to this ticket, so here I am.

My 2 cents on this:

1) It seems that my case and dledford's case have in common that we both maintain CRITPATH packages which are only used by a small percentage of our users; this makes it hard to get proven tester feedback for them (or any feedback at all).

2) I think that the solutions provided in dledford's comment 20 are too complex.

3) I think that the original solution suggested by dledford, that a CRITPATH package can be pushed after two weeks independent of feedback, as long as it has no negative karma, is a good solution to the problem at hand. It follows the KISS principle, which is important both for implementing it and for explaining it to various Fedora contributors. I think the solutions from dledford's comment 20 will be very hard to explain to new contributors and are therefore no good.

4) Besides the 2 week rule it would also be good to trim down the set of packages in CRITPATH in some (simple) way. For example it has already been discussed to split the xorg-x11-drivers package into 2 packages, one with important drivers like intel, nouveau and ati and one with less important ones like qxl, wacom, etc., so that the more obscure X drivers are not subject to the CRITPATH process. I'm afraid that such a thing won't help the mdraid case though ...

5) Another solution would be to make more people proven testers, in the case at hand for example me and dledford, and allow (select) maintainers to proven-tester +1 their own package; they will then still need another +1 under the current rules. But getting a non proven-tester +1 I'm pretty sure I can manage.

About 5. I know that the main goal of all the updates procedures is to try and improve the quality of Fedora as consumed by our end users, but I cannot help getting the feeling that some of it has been put into place because of the carelessness of some contributors and that in a sense it is a case where the good (contributors) are suffering because of measures taken due to behavior of the bad. This feeling has been strengthened a lot by the fact that tightening of the procedures has historically often happened as a response to some incident. Allowing certain contributors to proventester +1 packages (including their own) would go a long way towards fixing this, while at the same time ensuring that contributors who can do this will stay careful, because if they screw up their proventester rights are likely to be taken away again.

toshio commented 12 years ago

Replying to jwrdegoede (comment 22)

(non-FESCo) +1 to all of your points.

Two things to add:

1) IIRC, adamw has spoken in favor of unrecorded testing before. Having a period of time where an update must sit in the testing repository and people who run with that repository enabled are actively using the update, just not entering information that the update works into bodhi. Mandating that in the absence of explicitly listed testing, there is a period of time when an update must be in updates-testing before going live satisfies this.

2) There is wiggle room in the amount of time to give here. On one hand we want to ask how much time is sufficient for people to have installed the update from the testing repo, run with it a while, and report a problem if they're inclined to do so. On the other we want to ask at what point it becomes unreasonable for end users suffering from lack of a known bugfix to continue to wait. I think there's diminishing returns on both sides of this (in opposite directions in time). Currently, with no timeout, we're benefiting from people who run updates testing reporting bugs but we do not have an adequate safety-net solution for end users who do not get bugfixes due to updates not making it to stable in a timely manner. However, the timeout is a safety net -- we want to reap the majority of benefit from explicit reports of working/non-working state before we get to the timeout.

tmraz commented 12 years ago

(FESCo member) +1 to both jwrdegoede's and toshio's points above. I already tried to get the timeout for the CRITPATH packages approved in two previous FESCo meetings, but unsuccessfully. I think toshio's reasoning on why the timeout is a good idea can hopefully help to pass this change.

As for the self +1 votes - again I think it should be allowed, if for nothing else than for example the case where the developer asks someone without a Fedora account to test and gives the package +1 on his behalf.

notting commented 12 years ago

2011-10-17 FESCo meeting update: deferred based on pending bodhi change https://fedorahosted.org/bodhi/ticket/642 and statistics gathering.

sgallagh commented 12 years ago

2011-10-31 FESCo meeting update: Based on a review of the statistics, mjg59 recommended dropping proventester requirements on the critical path. FESCo voted to do so, waiting a week to get feedback on the development list.

adamwill commented 12 years ago

I missed the meeting, but that recommendation appears to make no real sense. the whole process was designed around proventesters, and the only function of the proventesters group is to provide the 'special' karma for critpath updates. if we're going to do this we may as well kill the proventesters group as there is no reason for its existence any more. but that seems like a fundamental change to the entire design of the critpath process, to me.

adamwill commented 12 years ago

okay, read the logs, so it seems like the proposal is indeed to drop proventesters. i have no huge objection to that in a practical sense, though I note that while 2 seems like a small number, it is not 0. What were those 2 updates and what would have been the impact on Fedora if they'd gone out?

Remember, the update which instigated this whole process - the infamous glibc one - was only one update...

mmaslano commented 12 years ago

#agreed defer this to next week for response

tmraz commented 12 years ago

We agreed to remove the requirement of proventester karma votes for critical path packages. The requirement will be satisfied by regular karma votes.

Leaving open until the policy change is implemented in bodhi.

adamwill commented 12 years ago

This has been implemented in Bodhi, and I just updated the 'Updates Policy' page to remove all references to proven tester karma (I verified with lmacken that he has made it so bodhi does not require proventester karma for any package at any release stage any more). So the ticket can probably be closed.

kevin commented 12 years ago

Thanks.
