#9245 Getting direct access to a view of datanomer database for data-analysis purposes
Opened a month ago by asaleh. Modified 17 days ago

Describe what you would like us to do:

As part of the Packager Healtcare initiative we were looking at the data in datagrepper/datanommer to see how our testing of the basic packager worlflow incluences the various systems.

Unfortunately, the dump of database itself [0], while reasonable to work with on a local machine with large enough disk-space, is unweildy if we want to share or publish the data. We have a more focused view that is just several MB compressed [1], regenerating it seems to require re-downloading the tar-file, and even then the view of the data will be at least a day old.

But with the analysis I ran localy, it seems that we would have sufficient data for the analysis with materialized view based on a query in the datanommer database:

SELECT * FROM messages 
  (topic LIKE '%.monitor-gating.%' 
  OR topic LIKE '%.dist-git-pr.%'
  OR topic LIKE '%.koji-build.%' 
  OR topic LIKE '%.resultsdb.result.new' 
  OR topic LIKE '%waiverdb.waiver.new' 
  OR topic LIKE '%greenwave.decision.update')
  AND timestamp > now() - interval '3 months';

alongside with some periodic refresh (i.e., every 6 hours?)

This would cover most of the RPM packager workflow with gating.

Four our purposes, we need to reach the database from either os.fedoraproject.org or ocp.ci.centos.org (because that is where we would run the grafana for the analysis)

I would like to know if this sounds like a reasonable approach, or any other suggestions.

If you'd like, I can show our POC grafana dashboards running on centos openshift.

Metadata Update from @mobrien:
- Issue tagged with: medium-gain, medium-trouble

a month ago

Potential option: Create a dedicated user (and password) in postgresql with read-only access to the DB.

Metadata Update from @pingou:
- Issue untagged with: medium-gain, medium-trouble

a month ago

Metadata Update from @pingou:
- Issue tagged with: medium-gain, medium-trouble

a month ago

Yes, new postgress user with read-only access to the db is definitly part of the answer :-)

I remember there was some concern about being able to run queries that are too expensive (and this was stated as the reason, why we can't i.e. just have this exposed in data-grepper)

Of course, my idea would be that those dashboards and their queries would be stored in our ansible repo, so it wouldn't need to be as fool-proof as if we just expose querying capacity publicly, so if we get access to the full table, we can still create some interesting dashboards, without wrecking the db performance :)

Just a note: I doubt ocp.ci.centos.org will be able to connect to our DB server, so os.fp.o may be the better option there.

Yep, the POC is on ocp.ci.centos.org mostly because os.fp.o has the cert for the generic *.apps.os.fp.o broken (and we mostly don't care, because things we care about on os.fp.o have proper domains)

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

a month ago

Yeah, a r/o db user seems fine... but we may have to adjust firewall rules to allow you to reach it.

I have created the user in the database, it is named datanommer_ro and its password is in the private repository under the variable datanommer_ro_password, which you should be able to access from ansible when deploying your app :)

Great, what is the hostname/port of the database? :-)

db-datanommer01.iad2.fedoraproject.org and the port is the regular postgresql one :)

Login to comment on this ticket.