#1278 establish Fedora Bat-Signal for ultra-critical security updates
Closed None Opened 10 years ago by mattdm.

This is primarily an idea for the Security SIG, but since there's no tracker for that, I'm putting it here for coordination.

{{{

I think we did a pretty good job in responding to CVE-2014-0160, but there's
also room for improvement.

One particular need is the ability to get in touch with owners of core
components, or if they are not available, provenpackagers with particular
security expertise -- and in either case, also testers with a security
background.

Maybe we need to have some sort of (opt-in) Fedora Bat Signal for
extra-critical and urgent security issues in core packages. We would promise
not to use it unless the internet were actually on fire, as it appears to be
in this case, and then have (escrowed somewhere?) private 24/7 contact
information (phone numbers, SMS).

What do you think? Anyone interested in developing this idea further?
}}}

We need to have responders for

  1. coordination (it helps when one person has the "incident lead" baton; can be passed around as needed)
  2. communications (drafting and sending community messages; email, web, social media)
  3. package fixing (ideally package maintainer is security expert, second best is package maintainer + security expert, third is security expert with provenpackager privileges or assistance from someone who has them, or last resort, provenpackager alone)
  4. quality assurance (again, ideally someone with security expertise to advise and coordinate, but fast widespread testing at all levels helps)
  5. release engineering (lots of work getting an update out as an exception to normal flow)

and the ability to get at least one person in each role out of bed in the event of an emergency.
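As a rough illustration of the "at least one person in each role" requirement, here is a minimal sketch of a coverage check over an opt-in roster. Everything in it is hypothetical: the `Responder` type, the `uncovered_roles` helper, the `reachable` flag, and the sample names are assumptions made for illustration, not part of any existing Fedora tooling.

```python
from dataclasses import dataclass

# The five responder roles listed above.
ROLES = (
    "coordination",
    "communications",
    "package fixing",
    "quality assurance",
    "release engineering",
)


@dataclass(frozen=True)
class Responder:
    """Hypothetical opt-in roster entry."""
    name: str
    roles: frozenset
    reachable: bool  # assumption: True once the person has answered the signal


def uncovered_roles(responders, roles=ROLES):
    """Return the roles that no reachable responder covers, in list order."""
    covered = set()
    for responder in responders:
        if responder.reachable:
            covered |= responder.roles
    return [role for role in roles if role not in covered]


# Made-up sample roster; names are placeholders.
roster = [
    Responder("alice", frozenset({"coordination", "communications"}), True),
    Responder("bob", frozenset({"package fixing"}), False),
    Responder("carol", frozenset({"package fixing", "release engineering"}), True),
]

missing = uncovered_roles(roster)
print(missing)
```

With this sample roster, only quality assurance is left without a reachable responder, so that is the role the incident lead would still need to fill.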


Coordination/communications could probably be merged together. Jóhann just mentioned on the security list that removing QA would make sense:

{{{
You can forget including QA in this, since maintainers don't provide the testing community with test cases, so testers can't quickly run through test cases for the affected package and provide the necessary karma.

JBG
}}}

So we'd end up with:

  1. Coordination & communications
  2. Package fixing
  3. Release engineering

On the release engineering side, is there really much to do there? If a package gets enough karma to move to stable quickly, doesn't it move relatively quickly on its own? I wasn't sure if release eng. had to get involved for CVE-2014-0160 (heartbleed).

Coordination and communication can be merged together, but it's important that someone have the coordination role and that they and others know who has that baton.

QA is important, whether Jóhann wants to participate or not. We don't want to send out an update that makes the situation ''worse''. At the very least, we need people to provide bodhi karma, but I'd prefer that we are actually relatively confident that the fixes are valid and do not introduce new problems.

Release Engineering needs to sign the packages and do the "push". Although it looks like packages magically move when they get enough karma, this is actually done by human beings behind the scenes. And they need to stay involved in case there are glitches with that process (as there were in the most recent case).

Totally agree on the incident leader baton. I wasn't aware of the human intervention required on the Release Eng. side.

One thing you brought up in the email thread was the way we're all brought together to solve the problem. Getting into a phone bridge might be troublesome with language barriers and people scattered in different countries. I could see an IRC channel working well but there would need to be that "bat signal" to let folks know that something is going on.

Would it be possible to use a service like PagerDuty for that? It's relatively pricey but I'm sure there are alternatives out there.

I like the idea of a coordinated/planned setup for events like these (even though they are thankfully rare).

Perhaps we could work on a wiki page that outlines the steps and process here? I don't think we should make it too heavyweight, just that we establish an IRC channel, decide on a leader, and pull in others as needed.

I think it's impractical to have cell phone contact info for all maintainers of all core components. We should hopefully be able to pull in needed resources via a network effect (i.e., the leader asks for the maintainer of foo, others reach out to people who might know them, etc.).

It might also help to have responders from different time zones available. Since Fedora is an international effort, this might be an easy way to increase coverage.

Replying to [comment:5 till]:

> It might also help to have responders from different time zones available. Since Fedora is an international effort, this might be an easy way to increase coverage.

Right, that was a small part of the delay in the last incident - the maintainers were in Europe and had gone offline. Fortunately a provenpackager (dgilmore) was ready and able to just do it (and the patch was clear and uncomplicated), so that was probably about 30 minutes delay while we made that call.

Do we have an updated proposal here? Or somebody who wants to create the wiki page mentioned in comment:4?

Paul Frields started a draft SOP for security updates at https://fedoraproject.org/wiki/User:Pfrields/Critical_security_update_SOP

More updates later :)

Just polling for an update here.

Can the rel-eng people provide an update here?

This doesn't appear to have been discussed for a year, and Bodhi has since been enhanced for updates. This ticket looks to me like it is about how to handle urgent security updates, but I think there is also a need to push urgent security updates ahead of normal updates. I think the more important factors here are the repository compose time and then the time needed to push the security updates.

I see we have a draft page https://fedoraproject.org/wiki/Urgent_updates_policy created by Kevin.

There's a second ticket somewhere for the rel-eng side of actually pushing out the urgent updates (I don't have it at hand at the moment). This one is about communications flow.

Replying to [comment:12 pnemade]:

I guess that second ticket is https://fedorahosted.org/rel-eng/ticket/5886

Yes, that's it. Thanks!

At today's FESCo meeting we decided to defer voting until mattdm starts a conversation with the security team.

Bodhi also needs to allow direct stable pushes again if you want to have any chance of fixing such urgent issues in a timely manner for all supported Fedora releases.

Per today's meeting, removing from meeting agenda until that conversation with the security team happens.

@mattdm Do you have any update on the conversation that was supposed to happen?

I think we can take this off the FESCo radar and keep it moving on the Security Team.

@mattdm, is there any ticket opened or email sent to the Security Team that we can link here?

It has been 10 months since this ticket had meaningful updates. Closing as WONTFIX.
