#9164 Automating update / push of DNF user count data
Opened 3 months ago by wwoods. Modified 6 days ago

Describe what you would like us to do:


The server part of the DNF Better Counting feature all seems to be in place:

  • mirrors-countme code is ready to go
  • we've sorted out the log data hiccups from the colo move (e.g. #9065)

So the last task is to automate the daily/weekly updates:

  1. Run countme-update-rawdb.sh daily to parse access.log files and update the (sqlite) raw.db
  2. Run countme-update-totals.sh daily to count hits in raw.db, update totals.db, and generate totals.csv
  3. Do something to make totals.csv available to the public.
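
For illustration, the first two steps could be driven by a small wrapper like the one below. This is only a sketch under assumptions: it assumes both scripts are on PATH and can run without arguments, and the `run_daily_update` name and the lock file are hypothetical additions, not part of mirrors-countme. The lock is there because parsing the logs can take hours, so a daily job should skip a run rather than overlap with one still in progress.

```python
import fcntl
import subprocess

# Script names are taken from the ticket; running them with no arguments is an assumption.
STEPS = ("countme-update-rawdb.sh", "countme-update-totals.sh")


def run_daily_update(lock_path, steps=STEPS, run=subprocess.run):
    """Run the update steps in order.

    Returns False without running anything if another run holds the lock.
    """
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking exclusive lock; it is released when the file is closed.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        for step in steps:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so the totals step never runs after a failed parse.
            run([step], check=True)
    return True
```

A cron job or systemd timer could then call `run_daily_update("/run/countme.lock")` once a day; that lock path is likewise just a placeholder.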

So, I've got 3 questions:

1. Where should raw.db go?

The current default is /mnt/fedora_stats/data/countme/raw.db, but I'm not sure SQLite over NFS is the best idea. Something like /var/lib/countme/raw.db might be a better option? (Or maybe /var/cache, since technically this data can be regenerated.. it just takes 5+ hours to read through 1TB+ of logs..)

2. Where should totals.{csv,db} go?

The current default is /var/lib/countme/countme-totals.{csv,db}. Probably these are fine in the same place as raw.db, but maybe we should keep safe/public data separate from raw/private data?

3. How do we make totals.{csv,db} public?

We could probably copy them to /var/www/html/countme/ or something, but I don't think that's actually public?

@mattdm suggested that putting countme-totals.csv in a public git repository would make it easy for people to consume and give us efficient storage with complete change history. This seems like a good idea to me.

It would be pretty easy to extend countme-update-totals.sh so it also updates a local git repository containing countme-totals.csv. From there, we could do a git push, but:

  1. Where should it push to, and
  2. What account/credentials should it use to do that?
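
To make the local-repository half of that idea concrete, here is a rough sketch of a "commit only when the file changed" step. The function name `commit_if_changed` and its parameters are hypothetical; it only uses the standard `git status --porcelain`, `git add` and `git commit` commands. The push is deliberately left out, since the destination and credentials are exactly the open questions above.

```python
import subprocess


def commit_if_changed(repo_dir, filename, message, run=subprocess.run):
    """Commit filename in repo_dir if it has changed; return True if a commit was made."""
    base = ["git", "-C", repo_dir]
    # `git status --porcelain -- <file>` prints nothing when the file is unchanged,
    # and a line such as " M <file>" or "?? <file>" when it is modified or new.
    status = run(
        base + ["status", "--porcelain", "--", filename],
        check=True,
        capture_output=True,
        text=True,
    )
    if not status.stdout.strip():
        return False
    run(base + ["add", "--", filename], check=True)
    run(base + ["commit", "-m", message], check=True)
    return True
```

Skipping the commit when nothing changed keeps the history to one commit per real data update, which is what makes the change history useful to consumers.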

When do you need this to be done by? (YYYY/MM/DD)


I think we should at least have decided on the URL and the format for the public data before the start of Flock Nest, which is 2020/08/07.


(btw I'm doing a manual run of countme-update-rawdb.sh on log01 right now, so when we know where we want it to go we can copy ~wwoods/countme/raw.db into place rather than spending another 6 hours re-parsing all the logs on the first run.)

Metadata Update from @smooge:
- Issue assigned to smooge
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: groomed, medium-gain, medium-trouble

3 months ago

@wwoods does a /srv/stats which is local and contains a /srv/stats/html for dropping html data and a /srv/stats/data for the db files work?

Works fine for me - there's no HTML involved here (yet), but I can imagine someone wrapping a static website generator around the .csv file.

Right now my main concern is getting each week's updated totals.{db,csv} data pushed somewhere public. Drawing cool graphs or hosting chart explorers.. that'll be a project for another day.

Not sure if you had data-analysis.fp.o in mind as a domain for this but if so, we'll need to fix https://pagure.io/fedora-infrastructure/issue/9178 first :)

Ah, so there was a public URL pointing to /var/www/html/! Neat.

It sounds like we intend to fix #9178, so how does this sound:

  1. raw data path: /srv/stats/countme/raw.db
    Seems like it should stay on the local disk. Let me know if there's a preferred path tho.
  2. totals data path: /srv/stats/countme/totals.{db,csv}
    This gives a reliable public URL for the data, in two different forms. (I'll need to make sure they are updated atomically so people don't get partial/corrupt data during updates..)
  3. Make data public: /var/www/html/countme/totals.{db,csv}
    I'd still like to be able to push the updated data to a git repo on pagure (and/or github), but to do that we'd probably need a "data-analysis" service account on pagure, and we'd need to have its credentials on log01, etc. As long as the data is available somewhere, the git repo can wait 'til later.
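
One common way to get the atomic update mentioned in item 2 is to copy to a temporary file in the destination directory and then rename it over the old file. The sketch below shows that pattern; `publish_atomically` is a hypothetical helper, not something the countme scripts are known to provide, and the 0644 mode is an assumption about what the web server needs.

```python
import os
import shutil
import tempfile


def publish_atomically(src, dest_dir):
    """Copy src into dest_dir so readers never see a partially written file."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    # The temp file must live in the destination directory: os.replace() is
    # only atomic when both paths are on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())
        # mkstemp creates the file with mode 0600; assume the web server needs read access.
        os.chmod(tmp, 0o644)
        os.replace(tmp, dest)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return dest
```

For example, `publish_atomically("/srv/stats/countme/totals.csv", "/var/www/html/countme")` would swap in the new file in a single rename, so a download in progress keeps reading the old copy and new requests get the complete new one.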

If those sound reasonable, I'm going to work on an ansible PR to get the countme updates running daily on log01. Once that's running and #9178 is fixed, this can be closed.

@wwoods @smooge do we have anything left to do on this ticket?

I think the questions have been answered:

  1. Where should raw.db go?
  2. Where should totals.{csv,db} go?

Intermediate data will go in /var/lib/countme on log01.

  3. How do we make totals.{csv,db} public?

Copy them from /var/lib/countme to /var/www/html/csv-reports/countme. The public URLs will be:

..but at the moment, https://data-analysis.fedoraproject.org/ gives 401 Unauthorized, so either I'm mistaken about the plan, or that needs fixed.

Once that's sorted out, somebody¹ just needs to finish automating the process & submit a PR for https://pagure.io/fedora-Infra/ansible, and then this will be definitely fixed.

¹ probably somebody named @wwoods

Oh FFS this was working during FLOCK/NEST.

OK the link should work.. it is now a matter of setting up the other parts.

What's left to do here?

All of the above still needs to be done. It is currently only working in his home directory.

Okay! Filed a PR with proposed patches/scripts to automate this - see fedora-infra/ansible#238.

Okay! The ansible PR has been merged, though the playbook has yet to be run, so I dunno if it actually works yet.

I've been running the scripts manually once per week (or so) and I did find (and fix) a crash-bug in the log parser (see mirrors-countme#1), so the code itself should be fine.

For the moment, you can find my copy of the data (plus some examples using pandas/numpy/matplotlib to analyze and graph the data) here: https://github.com/wgwoods/fedora-countme-data/
