#9715 Needs for debuginfod
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by pingou.

Describe what you would like us to do:


In order to deploy debuginfod in openshift @fche and @amerey need:

  • persistent storage to claim (a ~300G volume would be nice) to store the indexes
  • the ability to access /mnt/koji (as read-only of course).

I have created the sysadmin-debuginfod group in FAS that @fche is admin for. Once the playbook openshift-apps/debuginfod.yml is created/merged, I'll give this group rbac permission to run it.

When do you need this to be done by? (YYYY/MM/DD)


Soonish would be nice but not an emergency either


Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago

There is a mention in an email of having one per running release. Do you need 1 persistent storage volume for all of those, or 1 persistent storage volume per 'frontend'?

Stephen, this is all flexible with respect to the resource amount/granularity available or the simplicity sought. We can have 1 large server, or N smaller servers each with separate PVs, partitioning the content by whatever criteria we like, such as fedora major release, architecture, or even down to file name patterns if necessary. The general rule of thumb is a total index size of O(2%) of total RPM size.

We could start with a single 300GB PV, then use that to explore /mnt/koji for different strategies and propose something more tuned later.

I have a draft openshift dc/etc. that appears self-contained & functional, dependent on:

  • a PV named "debuginfod-volume", which would be this 300GBish PV
  • a PV named "koji-volume", which would be the read-only NFS mount of the /fedora_koji tree
  • an incoming route from the web frontends to a 'debuginfod-debuginfod.$openshift' service into the cluster
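
The dc would bind to the first of those through a claim; a minimal sketch of what that claim might look like (the claim name and access mode here are my assumptions, not necessarily what ends up in the draft):

  kind: PersistentVolumeClaim
  apiVersion: v1
  metadata:
    name: debuginfod-storage        # assumed name for the index-storage claim
  spec:
    accessModes:
    - ReadWriteOnce                 # may need ReadWriteMany, depending on how the PV is provisioned
    resources:
      requests:
        storage: 300Gi              # the ~300GB index volume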

I don't have a way of testing the ansible wrapping for any of this here at home, so ... not sure how to proceed with testing that aspect without access to batcave etc.

I'm not sure there is a need for a distinct stg- vs prod- configuration; please advise. At this point the yml doesn't conditionalize anything on {{ env == "staging" }}.
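
If we do end up wanting a distinction later, the usual pattern in these templates would be a Jinja branch on env, roughly like this (the hostnames are just my guesses at the obvious choices):

  {% if env == "staging" %}
  host: debuginfod.stg.fedoraproject.org
  {% else %}
  host: debuginfod.fedoraproject.org
  {% endif %}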

FWIW if nfs-backed PVs are available in this version of openshift, the following might work:

  kind: PersistentVolume
  apiVersion: v1
  metadata:
    name: koji-volume
  spec:
    capacity:
      storage: 1Gi
    # mount options belong at the spec level, as a list
    mountOptions:
    - ro
    nfs:
      server: ntap-iad2-c02-fedora01-nfs01a # prod, or 10.3.167.64 for stg
      path: /mnt/fedora_koji
    accessModes:
    - ReadOnlyMany
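
If that PV binds, a matching read-only claim along these lines could pair with it (again just a sketch; the claim name mirrors the PV name for illustration):

  kind: PersistentVolumeClaim
  apiVersion: v1
  metadata:
    name: koji-volume               # illustrative; mirrors the PV above
  spec:
    accessModes:
    - ReadOnlyMany
    storageClassName: ""            # avoid dynamic provisioning; bind to the static PV
    volumeName: koji-volume
    resources:
      requests:
        storage: 1Gi                # request within the PV's nominal capacity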

Thanks to nirik, we have the needed PVs set up on the staging cluster. Now just need the https://pagure.io/fedora-infra/ansible/pull-request/474 PR approved, pulled, and activated.

Thanks to pingou, the PR is pulled & activated! Now we just need some typo(?) correction in the openshift PV/PVCs. According to https://os.stg.fedoraproject.org/console/project/debuginfod/browse/storage the DC is blocked on Storage.

The debuginfod-storage-stg PVC might just need an accessMode: change from ReadWriteOnce -> ReadWriteMany. I can send a new PR to make that adjustment in roles/openshift-apps/debuginfod/templates/storage.yml.
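
Concretely, the adjustment in storage.yml should amount to just the access mode line, roughly (surrounding context assumed):

  # roles/openshift-apps/debuginfod/templates/storage.yml (context assumed)
  spec:
    accessModes:
    - ReadWriteMany    # was ReadWriteOnce; needs to match how the PV is provisioned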

The koji-volume PVC ... I can't see what's wrong with it, having no access to the PV data.

Filed https://pagure.io/fedora-infra/ansible/pull-request/514 for fixing the ReadWriteMany mismatch, but still dunno what can be done about the nfs koji-volume PV problem.

Status: https://debuginfod.stg.fedoraproject.org/ is up and running! It's in a cgroups-constrained openshift cluster, able to use some 15% of 1 CPU and 12GB of RAM. The former constraint means it's going to take a long time to finish indexing (weeks?). The latter constraint appears soft and hasn't impacted indexing so far.

I'll send a new PR to bump up scanning concurrency even on this little node, hoping to get a little more CPU usage out of it. 76% idleness is a Crime. :-)
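
For reference, the knob is debuginfod's -c/--concurrency flag; the bump would be something along these lines in the container args (the exact flags and mount path used in the dc are assumptions on my part):

  args:
  - -c
  - "8"                 # more scanner threads, to soak up the idle CPU allowance
  - -R                  # index RPM archives
  - /mnt/fedora_koji    # wherever the read-only koji volume is mounted in the pod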

OTOH when/if we get access to a less constrained cluster, the index database can be rsync'd from here to there. That should make it start with all previous work saved.

When we get a new home on a prod cluster, it would be great to get a storage PV that is local rather than NFS-backed. During the indexing stage, debuginfod is fairly I/O intensive to its database. With that stored on NFS as on the os.stg. cluster, we're observing O(20ms) latencies for sqlite operations that on local SSD storage take O(0.1ms).

We don't have local storage in our prod cluster either. :)

However, we are soon going to be working on bringing up a new v4 cluster, and hopefully it might have local storage.

Is the plan here to go to prod before freeze (next Tuesday), or to wait until after f34 is out?

We can wait till after f34, no external pressure or whatever. (We haven't announced this as an f34 Change or something like that.) We can hang out on the .stg. alone until the v4 OSD is up and running. Local storage would be a serious plus.

Or perhaps we could consider switching (back) to a VM-based service?

BTW just number-crunching, we can index ~20TB of content with the 300GB of storage, but with the NFS setup as it is currently on os.stg, it'd take about 230 days :-). Yeah pretty sure local storage will be a must, whether openshift or VM.

ok, so we moved this to VMs. Both the stg and prod ones are installed.

I think it's best to let stg finish indexing, then copy that to prod and fire it up from there (to prevent us having two of them hammering the netapp at the same time).

Please re-open if there's anything further to do here.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

