#9164 Automating update / push of DNF user count data
Opened 14 days ago by wwoods. Modified 5 days ago

Describe what you would like us to do:


The server part of the DNF Better Counting feature all seems to be in place:

  • mirrors-countme code is ready to go
  • we've sorted out the log data hiccups from the colo move (e.g. #9065)

So the last task is to automate the daily/weekly updates:

  1. Run countme-update-rawdb.sh daily to parse access.log files and update the (sqlite) raw.db
  2. Run countme-update-totals.sh daily to count hits in raw.db, update totals.db, and generate totals.csv
  3. Do something to make totals.csv available to the public.
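For steps 1 and 2, the scheduling side could be as simple as a cron fragment on log01. Everything in this sketch — the paths, the `countme` user, and the times — is an assumption to be settled in the eventual ansible PR, not the deployed configuration:

```shell
# Hypothetical /etc/cron.d/countme fragment; user, paths, and times
# are placeholders, not the actual deployed configuration.

# Parse new access.log entries into raw.db overnight:
30 4 * * * countme /usr/local/bin/countme-update-rawdb.sh
# Then recount totals and regenerate totals.{db,csv} afterwards:
30 9 * * * countme /usr/local/bin/countme-update-totals.sh
```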

So, I've got 3 questions:

1. Where should raw.db go?

The current default is /mnt/fedora_stats/data/countme/raw.db, but I'm not sure SQLite over NFS is the best idea. Something like /var/lib/countme/raw.db might be a better option? (Or maybe /var/cache, since technically this data can be regenerated.. it just takes 5+ hours to read through 1TB+ of logs..)

2. Where should totals.{csv,db} go?

The current default is /var/lib/countme/countme-totals.{csv,db}. Probably these are fine in the same place as raw.db, but maybe we should keep safe/public data separate from raw/private data?

3. How do we make totals.{csv,db} public?

We could probably copy them to /var/www/html/countme/ or something, but I don't think that's actually public?

@mattdm suggested that putting countme-totals.csv in a public git repository would make it easy for people to consume and give us efficient storage with complete change history. This seems like a good idea to me.

It would be pretty easy to extend countme-update-totals.sh so it also updates a local git repository containing countme-totals.csv. From there, we could do a git push, but:

  1. Where should it push to, and
  2. What account/credentials should it use to do that?
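Whatever the answers to those two questions turn out to be, the commit-and-push step itself is small. Here's a rough sketch of what the extension could look like — the function name, paths, and the change-detection via `git status --porcelain` are all my assumptions, and the remote/credentials part is exactly the open question above:

```shell
# Hypothetical helper: copy the fresh CSV into a local git clone and
# commit/push only when the data actually changed. Paths and the
# remote setup are assumptions -- the real repo and credentials are
# still to be decided.
publish_totals() {
    repo=$1; csv=$2
    cp "$csv" "$repo/countme-totals.csv"
    (
        cd "$repo" || exit 1
        git add countme-totals.csv
        # git status --porcelain prints nothing when the staged file
        # matches HEAD, so unchanged data produces no empty commit.
        if [ -n "$(git status --porcelain countme-totals.csv)" ]; then
            git commit -m "countme totals update $(date -u +%F)"
            # Push only if a remote is configured (account/creds TBD).
            if git remote get-url origin >/dev/null 2>&1; then
                git push origin HEAD
            fi
        fi
    )
}
```

Running something like this at the end of countme-update-totals.sh would give the complete change history @mattdm suggested, once we know where to push and as whom.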

When do you need this to be done by? (YYYY/MM/DD)


I think we should at least have decided on the URL and the format for the public data before the start of Flock Nest, which is 2020/08/07.


(btw I'm doing a manual run of countme-update-rawdb.sh on log01 right now, so when we know where we want it to go we can copy ~wwoods/countme/raw.db into place rather than spending another 6 hours re-parsing all the logs on the first run.)

Metadata Update from @smooge:
- Issue assigned to smooge
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: groomed, medium-gain, medium-trouble

14 days ago

@wwoods does a /srv/stats which is local and contains a /srv/stats/html for dropping html data and a /srv/stats/data for the db files work?

Works fine for me - there's no HTML involved here (yet), but I can imagine someone wrapping a static website generator around the .csv file.

Right now my main concern is getting each week's updated total.{db,csv} data pushed somewhere public. Drawing cool graphs or hosting chart explorers.. that'll be a project for another day.

Not sure if you had data-analysis.fp.o in mind as a domain for this but if so, we'll need to fix https://pagure.io/fedora-infrastructure/issue/9178 first :)

Ah, so there was a public URL pointing to /var/www/html/! Neat.

It sounds like we intend to fix #9178, so how does this sound:

  1. raw data path: /srv/stats/countme/raw.db
    Seems like it should stay on the local disk. Let me know if there's a preferred path tho.
  2. totals data path: /srv/stats/countme/totals.{db,csv}
    This gives a reliable public URL for the data, in two different forms. (I'll need to make sure they are updated atomically so people don't get partial/corrupt data during updates..)
  3. Make data public: /var/www/html/countme/totals.{db,csv}
    I'd still like to be able to push the updated data to a git repo on pagure (and/or github), but to do that we'd probably need a "data-analysis" service account on pagure, and we'd need to have its credentials on log01, etc. As long as the data is available somewhere, the git repo can wait 'til later.
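On the atomicity note in item 2: the usual trick is to write to a temporary file in the destination directory and rename it into place, since rename() is atomic within a single filesystem. A minimal sketch — the helper name and the dot-tmp naming convention are mine, not anything in the countme scripts:

```shell
# Hypothetical helper: atomically replace a published file, so readers
# always see either the complete old file or the complete new one.
publish_atomic() {
    src=$1; dest=$2
    # The temp file must live on the same filesystem as the
    # destination for mv to be a single atomic rename().
    tmp="$(dirname "$dest")/.$(basename "$dest").tmp"
    cp "$src" "$tmp"
    mv -f "$tmp" "$dest"
}
```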

If those sound reasonable, I'm going to work on an ansible PR to get the countme updates running daily on log01. Once that's running and #9178 is fixed, this can be closed.
