#9715 Needs for debuginfod
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by pingou.

Describe what you would like us to do:


In order to deploy debuginfod in openshift @fche and @amerey need:

  • persistent storage to claim (a ~300G volume would be nice) to store the indexes
  • the ability to access /mnt/koji (as read-only of course).

I have created the sysadmin-debuginfod group in FAS that @fche is admin for. Once the playbook openshift-apps/debuginfod.yml is created/merged, I'll give this group rbac permission to run it.

When do you need this to be done by? (YYYY/MM/DD)


Soonish would be nice but not an emergency either


Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago

There is a mention in an email of having one per running release. Do you need 1 persistent storage volume for all of those, or 1 persistent storage volume per 'frontend'?

Stephen, this is all flexible with respect to the resource amount/granularity available or the simplicity sought. We can have 1 large server, or N smaller servers each with separate PVs, partitioning the content by whatever criteria we like, such as fedora major release, architecture, or even down to file name patterns if necessary. The general rule of thumb is a total index size of O(2%) of total RPM size.

We could start with a single 300GB PV, then use that to explore /mnt/koji for different strategies and propose something more tuned later.

I have a draft openshift dc/etc. that appears self-contained & functional, dependent on:

  • a PV named "debuginfod-volume", which would be this 300GBish PV
  • a PV named "koji-volume", which would be the read-only NFS mount of the /fedora_koji tree
  • an incoming route from the web frontends to a 'debuginfod-debuginfod.$openshift' service into the cluster
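
The dc would bind to the first of those through a claim; a minimal sketch of what that claim might look like (the claim name and access mode here are my assumptions, not necessarily what ends up in the draft):

  kind: PersistentVolumeClaim
  apiVersion: v1
  metadata:
    name: debuginfod-storage        # assumed name for the index-storage claim
  spec:
    accessModes:
    - ReadWriteOnce                 # may need ReadWriteMany, depending on how the PV is provisioned
    resources:
      requests:
        storage: 300Gi              # the ~300GB index volume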

I don't have a way of testing the ansible wrapping for any of this here at home, so ... not sure how to proceed with testing that aspect without access to batcave etc.

I'm not sure there is a need for a distinct stg- vs prod- configuration; please advise. At this point the yml doesn't conditionalize anything on {{ env == "staging" }}.
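
If we do end up wanting a distinction later, the usual pattern in these templates would be a Jinja branch on env, roughly like this (the hostnames are just my guesses at the obvious choices):

  {% if env == "staging" %}
  host: debuginfod.stg.fedoraproject.org
  {% else %}
  host: debuginfod.fedoraproject.org
  {% endif %}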

FWIW if nfs-backed PVs are available in this version of openshift, the following might work:

  kind: PersistentVolume
  apiVersion: v1
  metadata:
    name: koji-volume
  spec:
    capacity:
      storage: 1Gi
    # mount options belong at the spec level, as a list
    mountOptions:
    - ro
    nfs:
      server: ntap-iad2-c02-fedora01-nfs01a # prod, or 10.3.167.64 for stg
      path: /mnt/fedora_koji
    accessModes:
    - ReadOnlyMany
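
If that PV binds, a matching read-only claim along these lines could pair with it (again just a sketch; the claim name mirrors the PV name for illustration):

  kind: PersistentVolumeClaim
  apiVersion: v1
  metadata:
    name: koji-volume               # illustrative; mirrors the PV above
  spec:
    accessModes:
    - ReadOnlyMany
    storageClassName: ""            # avoid dynamic provisioning; bind to the static PV
    volumeName: koji-volume
    resources:
      requests:
        storage: 1Gi                # request within the PV's nominal capacity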

Thanks to nirik, we have the needed PVs set up on the staging cluster. Now just need the https://pagure.io/fedora-infra/ansible/pull-request/474 PR approved, pulled, and activated.

Thanks to pingou, the PR is pulled & activated! Now we just need some typo(?) correction in the openshift PV/PVCs. According to https://os.stg.fedoraproject.org/console/project/debuginfod/browse/storage the DC is blocked on Storage.

The debuginfod-storage-stg PVC might just need an accessMode: change from ReadWriteOnce -> ReadWriteMany. I can send a new PR to make that adjustment in roles/openshift-apps/debuginfod/templates/storage.yml.
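
Concretely, the adjustment in storage.yml should amount to just the access mode line, roughly (surrounding context assumed):

  # roles/openshift-apps/debuginfod/templates/storage.yml (context assumed)
  spec:
    accessModes:
    - ReadWriteMany    # was ReadWriteOnce; needs to match how the PV is provisioned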

The koji-volume PVC ... I can't see what's wrong with it, having no access to the PV data.

Filed https://pagure.io/fedora-infra/ansible/pull-request/514 for fixing the ReadWriteMany mismatch, but still dunno what can be done about the nfs koji-volume PV problem.

Status: https://debuginfod.stg.fedoraproject.org/ is up and running! It's in a cgroups-constrained openshift cluster, able to use some 15% of 1 CPU and 12GB of RAM. The former constraint means it's going to take a long time to finish indexing (weeks?). The latter constraint appears soft and hasn't impacted indexing so far.

I'll send a new PR to bump up scanning concurrency even on this little node, hoping to get a little more CPU usage out of it. 76% idleness is a Crime. :-)
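
For reference, the knob is debuginfod's -c/--concurrency flag; the bump would be something along these lines in the container args (the exact flags and mount path used in the dc are assumptions on my part):

  args:
  - -c
  - "8"                 # more scanner threads, to soak up the idle CPU allowance
  - -R                  # index RPM archives
  - /mnt/fedora_koji    # wherever the read-only koji volume is mounted in the pod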

OTOH when/if we get access to a less constrained cluster, the index database can be rsync'd from here to there. That should make it start with all previous work saved.

When we get a new home on a prod cluster, it would be great to get a storage PV that is local rather than NFS-backed. During the indexing stage, debuginfod is fairly I/O intensive to its database. With that stored on NFS as on the os.stg. cluster, we're observing O(20ms) latencies for sqlite operations that on local SSD storage take O(0.1ms).

We don't have local storage in our prod cluster either. :)

However, we are soon going to be working on bringing up a new v4 cluster, and hopefully it might have local storage.

Is the plan here to go to prod before freeze (next Tuesday), or to wait until after f34 is out?

We can wait till after f34, no external pressure or whatever. (We haven't announced this as an f34 Change or something like that.) We can hang out on the .stg. alone until the v4 OSD is up and running. Local storage would be a serious plus.

Or perhaps we could consider switching (back) to a VM-based service?

BTW just number-crunching, we can index ~20TB of content with the 300GB of storage, but with the NFS setup as it is currently on os.stg, it'd take about 230 days :-). Yeah pretty sure local storage will be a must, whether openshift or VM.

ok, so we moved this to VMs. Both the stg and prod ones are installed.

I think it's best to let stg finish indexing, then copy that to prod and fire it up from there (to prevent us having two of them hammering the netapp at the same time).

Please re-open if there's anything further to do here.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

