#8811 method for copying in ostree content to our ostree repos
Closed: Fixed 2 months ago by dustymabe. Opened 10 months ago by dustymabe.

  • Describe the issue

We need to be able to copy in ostree content from our Fedora CoreOS builds into our ostree repos. This is done manually right now by a Fedora release engineer during a release. We'd like to automate this based on fedora messaging messages.

Since I'll be helping develop and manage this "glue" it would be nice if we can run it in openshift so that I can view logs and such without having to gain more access to fedora releng/infra systems. Currently that would require a writable /mnt/koji inside of the openshift app that's being developed, which is probably a lot to ask.

General architecture:

  • listen for a fedora message
  • download ostree tarball
  • extract tarball
  • ostree pull-local into target ostree repo
  • update ostree summary file
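
The steps after "listen for a fedora message" could be sketched roughly as below. This is only an illustration, not the importer's actual code: the function names `run` and `import_ostree_tarball`, the `DRY_RUN` switch, and the assumption that the tarball unpacks directly into an ostree repo are all hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# import_ostree_tarball TARBALL_URL TARGET_REPO REF
# Assumes the tarball extracts to an ostree repo at its top level.
import_ostree_tarball() (
    url=$1
    target_repo=$2
    ref=$3
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT
    mkdir "$workdir/repo"
    # download ostree tarball
    run curl --fail --location --output "$workdir/ostree.tar" "$url" || exit 1
    # extract tarball
    run tar -xf "$workdir/ostree.tar" -C "$workdir/repo" || exit 1
    # ostree pull-local into target ostree repo
    run ostree pull-local --repo="$target_repo" "$workdir/repo" "$ref" || exit 1
    # update ostree summary file
    run ostree summary --repo="$target_repo" --update
)
```

Running it with `DRY_RUN=1 import_ostree_tarball <url> <repo> <ref>` only prints the four commands, which is handy for checking the wiring before pointing it at a real repo.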

We'll probably want a split repo setup where we import most content into a staging repo (like the compose repo we have today) and then sync over part of it when we do an actual release.


Metadata Update from @mohanboddu:
- Issue tagged with: meeting

10 months ago

from the meeting today:

#info We decided to create a separate volume for just ostree stuff with rw access
and dustymabe will test it in stg first

So we're going to migrate the ostree repos to a separate netapp volume (with the same level of backup support) and grant read/write access to that to openshift. There is a wrinkle in that we use two directories:

  1. /mnt/koji/compose/ostree/ - for composing into
  2. /mnt/koji/ostree/ - for prod ostree content (fronted by CDN)

Is it possible to make a single netapp volume have an ostree directory that gets mounted under those locations so that we don't have to change any of our existing consumers? I like the idea of it being a single netapp volume vs multiple as it would be easier to maintain and we get deduplication.

dustymabe will test it in stg first

I think I would like to test this in stg rather than communishift so that we can get a more realistic test. @kevin can you create a volume in stage and show me how to mount it?

I can try and set this up on friday (2019-10-04)

ok we have worked on this quite a bit today. We have a netapp share fedora-ostree-content that is shared to the staging openshift via two PVs fedora-ostree-content-volume-{1,2} that point at the same NFS share. We were then able to mount those two PVs into pods within two different projects within openshift. We can also mount those shares under /mnt/koji/ostree and /mnt/koji/compose/ostree even though it's a single netapp volume.

Next steps:

  • work on making sure permissions are what we expect them to be
    • will need to possibly look at 1 2
  • get the code reviewed and approved for use in Fedora Infra
  • work out a migration strategy for prod content in the ostree directories

I asked and was told I could use this ticket as an RFE for getting the ostree importer into Fedora. Can we please get a security audit of https://github.com/coreos/fedora-coreos-releng-automation/tree/master/coreos-ostree-importer ?

Metadata Update from @dustymabe:
- Issue assigned to kevin

9 months ago

> Can we please get a security audit of https://github.com/coreos/fedora-coreos-releng-automation/tree/master/coreos-ostree-importer ?

The security audit process is undergoing some changes. In the infrastructure meeting today @kevin volunteered to try to wrangle those changes and get this assigned for review.

The script looks okay as of commit 7e7ffb5fb6729beecbc8e857064296c16bb152f9.
If any significant changes occur, please re-request a security audit.

Also, for future cases: please note that it would help if people tag security audits with the "security" tag, so that they show up in my overviews.

Updated next steps:

  • work on making sure permissions are what we expect them to be
    • will need to possibly look at 1 2
  • work out a migration strategy for prod content in the ostree directories

Are the next steps here still accurate or did some of those get sorted out?

@jlebon @dustymabe started working on it with the help from @kevin , we will know more by the end of the day.

@kevin and I met today and worked on a strategy for permissions. For now we think we can appropriately run things by using a less restrictive umask (0002) that will allow for the group permissions to include write for newly created files.

We also found that we needed /usr/bin/ostree to create some files with group write permissions. There is an open PR for that now (thanks @jlebon). In the meantime we can work around it by running something like `find . -type d | xargs sudo chmod g+w` in each respective app after every time it touches the repo.
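
To illustrate what the 0002 umask buys us, here is a small sketch that creates a directory and a file in a throwaway temp dir and prints their modes. The function name `umask_demo` is made up for this example, and it assumes GNU `stat`.

```sh
#!/bin/sh
# Show that umask 0002 keeps group write on newly created dirs and files.
umask_demo() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT
    cd "$dir" || exit 1
    umask 0002
    mkdir newdir          # 0777 masked by 0002 -> 0775
    : > newfile           # 0666 masked by 0002 -> 0664
    stat -c '%a %n' newdir newfile
)
```

With the default 0022 umask the same commands would give 755 and 644, i.e. no group write, which is the case the umask change is meant to avoid.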

Updated next steps:

  • work out a migration strategy for prod content in the ostree directories
  • update coreos-ostree-importer code to add workaround

status update. I've done a lot of work to get coreos-ostree-importer running in stage and listening to fedora messages, processing the messages, and also sending a response message. I've updated our pipeline to also send a request message to the importer.

At this point we are still here:

  • work out a migration strategy for prod content in the ostree directories
  • update coreos-ostree-importer code to add workaround

For the 2nd item, @jlebon is working on that fix and it shouldn't be long. For the 1st bullet I'll try to work with mohan and kevin to see if we can schedule some time to do the migration.

Regarding:

  • work out a migration strategy for prod content in the ostree directories

here is a rough outline for the migration:

  • 0. make initial copy (rsync) of content from prod repos to new netapp volume
  • 1. make sure no writers to ostree repos are running
    • need to make sure no pungi/bodhi composes are running
    • possibly need to disable new-updates-sync
  • 2. make final copy (rsync) of content from production repos into new netapp volume
  • 3. maybe move prod repos to another directory (minimizes confusion)
  • 4. set up /mnt/koji/ostree /mnt/koji/compose/ostree mounts on composer to point to new netapp volume
  • 5. make any necessary changes to our webserver setup/cloudfront setup so that existing clients can download content

This is in progress now. I'll check off the following items as they get done.

[x] disable new-updates-sync
[x] make sure for real that no composes are running
[x] do a final final rsync
[x] move ostree directories to new directory name
[x] mount over ostree mount locations with the new netapp volume
[x] verify clients can still rpm-ostree upgrade
[x] mount up net-app volume into openshift pods in prod openshift
[x] watch new pungi composes to make sure content gets synced correctly
[x] watch new-updates-sync to make sure content gets synced correctly
[ ] watch coreos-ostree-importer to make sure content gets synced correctly

> [x] move ostree directories to new directory name
> [x] mount over ostree mount locations with the new netapp volume
> [x] verify clients can still rpm-ostree upgrade

We are still monitoring composes to make sure things are progressing. The pungi-initiated composes were stuck because robosignatory wasn't able to sign ostree commits (we forgot to add the mount to the robosignatory machine). We added the mount, those commits got signed, and we are now progressing.

Another thing we did after we did the final rsync was to run find . -type d | xargs chmod g+w in order to give directories group write permissions but this took FOREVER on the /mnt/koji/ostree/repo repo because it had to look at every file under objects to test if it was a directory or not. An optimization here is to use the fact that files under ./objects/* are directories and files under ./objects/*/* are normal files. This optimized command excludes ./objects from the find and runs much faster:

(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+w
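
A variant of the same idea that passes names NUL-delimited, so odd filenames can't confuse xargs, might look like the sketch below. The function name `add_group_write` is hypothetical, and like the command above it assumes every entry directly under ./objects is a directory. It relies on GNU find/xargs options (`-print0`, `-0`, `-r`).

```sh
#!/bin/sh
# add_group_write REPO_DIR
# Give group write to all directories without stat'ing every loose object.
add_group_write() (
    cd "$1" || exit 1
    {
        # Entries directly under objects/ are assumed to be directories,
        # so list them without testing their type.
        find ./objects -mindepth 1 -maxdepth 1 -print0
        # Everything else: print directories, and print ./objects itself
        # but do not descend into it.
        find . \( -path ./objects -prune -o -type d \) -print0
    } | xargs -0 -r chmod g+w
)
```

Usage would be something like `add_group_write /mnt/koji/ostree/repo`.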

ok one issue that we've hit with sharing the NFS mount into the openshift pods is that we have several different processes touching these shared mounts and the file/directory ownership is a bit all over the place. We have:

  • files created by pungi/koji runroot: 0:0 (root:root)
    • this is when a commit first gets created and is run against the compose repo
  • files created by new-updates-sync: 263:263 (ftpsync:ftpsync)
    • syncing from the compose repo to the prod repo
  • files created by robosignatory (signing): ids in the range of 988->992 (fedmsg hub?)
    • signed ostree commits, files under refs/heads/

One proposal here is to pick a uid/gid for all files/directories in the repo and then setgid on the directories so that new files will be group owned by that gid. I'm not sure if this would work everywhere or not.
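
For reference, the setgid behaviour this proposal relies on can be seen in a throwaway directory. This is just a sketch on a local Linux filesystem (function name `setgid_demo` is made up, GNU `stat` assumed); whether it behaves the same on the NFS mounts is exactly the open question.

```sh
#!/bin/sh
# Show that entries created inside a setgid directory inherit its group,
# and that new subdirectories inherit the setgid bit as well.
setgid_demo() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT
    cd "$dir" || exit 1
    umask 0002
    mkdir repo
    chmod g+s repo            # setgid on the top-level directory
    mkdir repo/objects        # created after setgid was set
    : > repo/config
    # %A = mode string, %g = numeric group id, %n = name
    stat -c '%A %g %n' repo repo/objects repo/config
)
```

The mode string of `repo/objects` should show an `s` in the group execute position, and all three entries should report the same gid.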

>   • files created by robosignatory (signing): ids in the range of 988->992 (fedmsg hub?)
>     • signed ostree commits, files under refs/heads/

robosignatory now runs as uid=992 gid=988 after the switch to Fedora Messaging. The other numbers in the range are from when it ran as part of fedmsg (gid=991) on the autosign box. This means that new files created by robosignatory should have uid=992 and gid=988.

OK here are some facts:

  • prod repo:

    • files created by new-updates-sync: 263:263 (ftpsync:ftpsync)
    • files created by coreos-ostree-importer: $randomuid:99
      • runs in openshift with random uid. gid for files created is 99
  • compose repo:

    • files/dirs created by pungi/koji runroot are 0:0 (root:root)
    • files/dirs created by robosignatory (signing) are 992:988
    • files created by coreos-ostree-importer: $randomuid:99
      • runs in openshift with random uid. gid for files created is 99
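
One way to get numbers like these for a repo is to count entries per uid:gid pair. The sketch below is an assumption about how one might do it, not what was actually run: the function name `ownership_summary` is hypothetical, it needs GNU find for `-printf`, and it skips ./objects entirely to stay fast.

```sh
#!/bin/sh
# ownership_summary REPO_DIR
# Count entries per uid:gid pair outside ./objects, most common first.
ownership_summary() (
    cd "$1" || exit 1
    find . -path ./objects -prune -o -printf '%U:%G\n' \
        | sort | uniq -c | sort -rn
)
```

Dropping the `-path ./objects -prune -o` part would include the objects too, at the cost of a much longer walk.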

OK so a few suggestions here after talking to kevin:

  • prod repo:
    • we switch all files in prod repo to ftpsync:ftpsync
    • we make all directories group writable (something we've tried to do already)
    • we setgid on directories so new files/dirs get gid=263
      • this is needed because coreos-importer has 263 in its supplemental groups, it's not its main group

For the compose repo I think this depends on robosignatory. kevin mentioned we might be able to just switch the uid/gid for robosig to be ftpsync (we should think about this more).

  • compose repo:
    • we switch robosig to uid/gid = 263:263
    • we switch all files in compose repo to ftpsync:ftpsync
    • we make all directories group writable (something we've tried to do already)
    • we setgid on directories so new files/dirs get gid=263
      • this is needed because coreos-importer has 263 in its supplemental groups, it's not its main group

If we can't switch robosig something like this could work:

  • compose repo:
    • we switch all files in compose repo to owned by robosig user but ftpsync group: 992:263
    • we make all directories group writable (something we've tried to do already)
    • we setgid on directories so new files/dirs get gid=263
      • this is needed because coreos-importer has 263 in its supplemental groups, it's not its main group

In this last option I think new-updates-sync would still work but not sure.

Yeah, we should be able to change robosignatory user to whatever uid/gid we want... it's not used anywhere but on the autosign01 system.

This plan should be ok. It also fixes another issue/thought I had: We should look at hardlinking the two directories to save room. Hardlinking needs the files to have the same checksum and ownership, so moving everything to ftpsync would work there.
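
On the ownership point: a hard link is a second name for the same inode, so both names necessarily share one owner, group and mode. A minimal sketch in a temp dir (the function name `hardlink_demo` and the `compose`/`prod` directory names are only illustrative; GNU `stat` assumed):

```sh
#!/bin/sh
# Show that two hard-linked names report the same inode and ownership.
hardlink_demo() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT
    cd "$dir" || exit 1
    mkdir compose prod
    echo "object payload" > compose/ab.file
    ln compose/ab.file prod/ab.file     # second name, same inode
    stat -c '%n inode=%i links=%h owner=%u:%g' compose/ab.file prod/ab.file
)
```

Both lines should show the same inode number and a link count of 2, which is why the two repos would have to agree on ownership before their objects could be linked together.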

How time sensitive is this? We are currently in freeze and this would need a freeze break... can we wait until after? Or should we go ahead with a freeze break request?

> Yeah, we should be able to change robosignatory user to whatever uid/gid we want... it's not used anywhere but on the autosign01 system.

That's great assuming other consumers of any files robosignatory creates can handle the changed file ownership.

> This plan should be ok. It also fixes another issue/thought I had: We should look at hardlinking the two directories to save room. Hardlinking needs the files to have the same checksum and ownership, so moving everything to ftpsync would work there.

Yeah we probably need to talk about this a bit more to see what's possible.

> How time sensitive is this? We are currently in freeze and this would need a freeze break... can we wait until after? Or should we go ahead with a freeze break request?

I see beta release target is 2020-03-17. I guess that's OK. One thing we probably need to do is prune some stuff though since we only added a small amount of storage the other day.

>> Yeah, we should be able to change robosignatory user to whatever uid/gid we want... it's not used anywhere but on the autosign01 system.

> That's great assuming other consumers of any files robosignatory creates can handle the changed file ownership.

Yeah, I would think so...

>> This plan should be ok. It also fixes another issue/thought I had: We should look at hardlinking the two directories to save room. Hardlinking needs the files to have the same checksum and ownership, so moving everything to ftpsync would work there.

> Yeah we probably need to talk about this a bit more to see what's possible.

ok...

>> How time sensitive is this? We are currently in freeze and this would need a freeze break... can we wait until after? Or should we go ahead with a freeze break request?

> I see beta release target is 2020-03-17. I guess that's OK. One thing we probably need to do is prune some stuff though since we only added a small amount of storage the other day.

ok. We could request a freeze break to manually prune? Or if you like I could just grow the volume more?

> ok. We could request a freeze break to manually prune? Or if you like I could just grow the volume more?

maybe just add a todo to check it early next week and see if it's about to run out and if it's getting close add some more

OK - one update to https://pagure.io/releng/issue/8811#comment-629051

> OK here are some facts:
>
>   • prod repo:
>     • files created by new-updates-sync: 263:263 (ftpsync:ftpsync)
>     • files created by coreos-ostree-importer: $randomuid:99
>       • runs in openshift with random uid. gid for files created is 99

The nightly branched/rawhide pungi runs also run an ostree pull-local to sync content over in to the prod repo.

I'm going to open an issue to see if we can explore using new-updates-sync there.

What we did yesterday:

# 20200408 ostree repo work

work items for today:

- open ticket to investigate moving nightly branched/rawhide to new-updates-sync
    - @dusty: @dusty will work on this one
    - @dusty: opened ticket: https://pagure.io/releng/issue/9392
- switch robosig to uid/gid 263:263
    - @dusty: @kevin, can you take care of this part?
    - @dusty: kevin did agree to work on this piece
    - it's working mostly now, but still needs work on the unit file to start correctly.
    - consider this done for now.
- check which user/group the nightly branched/rawhide composes run as when syncing OSTree content into the prod compose repo
    - @mohan says they run as root
    - @dusty: can we make just the ostree pull run as ftpsync/ftpsync?
    XXX: MOHAN KEVIN WDYT ^^
        - I could give it a try after the current compose is done
        - @dusty: scratch that - not needed as we just 

- check to make sure no composes are running (i.e. nothing writing to the repos)
    - @mohan: Currently there is a rawhide compose running, it will be done at approximately 5 EDT
        - https://koji.fedoraproject.org/koji/tasks?owner=releng&state=active&view=tree&method=all&order=-id

- maybe we should force netapp to take a snapshot here
    - safety precaution before we run the chmod/chown commands below

- run command to switch all directories to setgid
    - these commands should execute in < 60 seconds
    - `cd /mnt/koji/compose/ostree/repo`
    - `(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+s`
    - `cd /mnt/koji/ostree/repo/`
    - `(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+s`
- run command to switch all directories 263:263
    - these commands should execute in < 60 seconds
    - `cd /mnt/koji/compose/ostree/repo`
    - `(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chown 263:263`
    - `cd /mnt/koji/ostree/repo/`
    - `(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chown 263:263`
- run command to switch all files/directories in the repos to 263:263
    - unfortunately this will probably take a long time 😭
        - but that is OK because the directories are the important ones I think
        - so we should just be able to let this run and not worry about it too much
    - `cd /mnt/koji/compose/ostree/repo && chown -R 263:263`
        - Replaced it with `chown 263:263 -R /mnt/koji/compose/ostree/repo`
    - `cd /mnt/koji/ostree/repo/ && chown -R 263:263`
        - Replaced it with `chown 263:263 -R /mnt/koji/ostree/repo`
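
To check afterwards whether anything was missed by the chown, something along these lines could list paths that are not owned by the expected uid/gid. This is a sketch only: the function name `list_foreign_owned` is hypothetical, and on the big repos a full walk will be slow for the same reason the recursive chown is.

```sh
#!/bin/sh
# list_foreign_owned REPO_DIR UID GID
# Print every path under REPO_DIR whose owner or group differs from UID:GID.
list_foreign_owned() (
    cd "$1" || exit 1
    find . \( ! -user "$2" -o ! -group "$3" \) -print
)
```

For example `list_foreign_owned /mnt/koji/ostree/repo 263 263` should print nothing once the chown has finished.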

Can we do one more set of commands to add group write perms on already
created directories?

cd /mnt/koji/compose/ostree/repo
(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+w
cd /mnt/koji/ostree/repo/
(ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+w

> Can we do one more set of commands to add group write perms on already
> created directories?
>
> cd /mnt/koji/compose/ostree/repo
> (ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+w
> cd /mnt/koji/ostree/repo/
> (ls -d ./objects/*; find . -path ./objects -prune -o -type d) | xargs chmod g+w

Done

here are a few followup PRs that need to be merged and applied in order for this to be complete:

Now that those two PRs are merged I think we can call this DONE!

Metadata Update from @dustymabe:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 months ago
