#56 disk usage, intermediate formats, etc.
Opened 2 years ago by mattdm. Modified 2 years ago

Smooge tells me that the current system's "rawdb" output is threatening to overwhelm the system. Will originally estimated that growth of rawdb would be something like 10GB/year, but I think we underestimated Fedora growth — and particularly, EPEL usage.

The current system processes raw Apache log files into rawdb and then generates totals.db from that file. It would be possible to instead accumulate directly into the totals.db.

 for every line in the http log:
     if the line matches the data format we're expecting
          collapse date to just "weeknum"
          extract (os_name, os_variant, os_version, os_arch, sys_age, repo_tag, repo_arch)
          if line matching weeknum + extracted info isn't in the db
                 add it with hits = 1
          else if it is in the db
                  increment hits

The last part could be done with a sqlite "upsert".
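To make the "upsert" concrete, here is a minimal sketch using Python's stdlib sqlite3. The table and column names are illustrative, not the project's actual schema, and this requires SQLite 3.24+ for `ON CONFLICT ... DO UPDATE`:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE totals (
        weeknum INTEGER, os_name TEXT, os_variant TEXT, os_version TEXT,
        os_arch TEXT, sys_age INTEGER, repo_tag TEXT, repo_arch TEXT,
        hits INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (weeknum, os_name, os_variant, os_version,
                     os_arch, sys_age, repo_tag, repo_arch)
    )
""")

def count_hit(db, row):
    # row is the extracted tuple: (weeknum, os_name, ..., repo_arch).
    # First sighting inserts with hits = 1; any later match on the
    # primary key increments hits instead.
    db.execute("""
        INSERT INTO totals VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1)
        ON CONFLICT (weeknum, os_name, os_variant, os_version,
                     os_arch, sys_age, repo_tag, repo_arch)
        DO UPDATE SET hits = hits + 1
    """, row)

row = (2650, "Fedora", "workstation", "38", "x86_64", 2,
       "updates-released-f38", "x86_64")
count_hit(db, row)
count_hit(db, row)
print(db.execute("SELECT hits FROM totals").fetchone()[0])  # 2
```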

The main complications, as I understand it, are:

  1. The intermediate "rawdb" files are faster to parse than going through the whole text logs again, if we need to do that.
  2. The log files rotate at midnight, but we might have stragglers off by as much as 241 seconds (according to comments in the code) which end up in the next day's log.
    • This is important because technically the "countme" week could land twice from one system in that same four-minute window if it's the week boundary. So we can't just do "one file, one whole day". Rather, we need to keep each day "open" until the next day comes
    • ...and therefore keep each week "open" until the next week's first log arrives

For the first, it might just be a matter of "well, that's too bad", if we don't have the space.

For the second, though, we could do this a different way.

  1. Instead of logging raw lines, apply the pseudocode above to "intermediate.db" as the log files come in.
  2. When we get the http log that contains the last possible minutes of the week (that is, the first day of the next week -- Monday's log file, with its possible few entries of the Sunday before):
    • process that
    • then, we can be sure that all the previous week entries are in (because there definitely won't be any from Sunday in Tuesday's log)
    • copy all of the entries matching previous weeknum to totals.db
    • verify totals.db
    • publish totals.db
    • delete all rows from intermediate.db before the current week
    • so it'll now just contain the entries from this latest Monday
    • and vacuum the intermediate.db sqlite file (no more space problem!)
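The rollover steps above can be sketched as one function, assuming each database has a "countme" table keyed by weeknum (the table name and schema are my invention, not the project's):

```python
import sqlite3

def publish_week(intermediate_path, totals_path, prev_weeknum):
    inter = sqlite3.connect(intermediate_path)
    # Attach totals.db so the copy is a single SQL statement.
    inter.execute("ATTACH DATABASE ? AS totals", (totals_path,))
    with inter:
        # Copy every row for the just-finished week into totals.db...
        inter.execute(
            "INSERT INTO totals.countme "
            "SELECT * FROM main.countme WHERE weeknum = ?",
            (prev_weeknum,))
        # ...then delete everything before the current week.
        inter.execute(
            "DELETE FROM main.countme WHERE weeknum <= ?",
            (prev_weeknum,))
    inter.execute("DETACH DATABASE totals")
    # VACUUM must run outside a transaction; it reclaims the file space.
    inter.execute("VACUUM")
    inter.close()
```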

Am I missing anything here? It's very possible I am!


This could, at the same time, have another accumulator, the "IP hits" accumulator (see https://pagure.io/fedora-infrastructure/issue/10443). The logic for that would be slightly different:

 for every line in the http log:
     if the line matches the data format we're expecting
          collapse date to "daynum" and "weeknum"
          extract the IP address
          if the IP address is in a table of daynum:IP
               continue to next line
          else
               add IP address to daynum:IP table
               extract (repo_tag, repo_arch)
               if line matching weeknum + extracted info isn't in the db
                      add it with iphits = 1
               else if it is in the db
                       increment iphits for that line
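A sketch of that dedup logic, using an sqlite table as the daynum:IP store (in the real setup it would live in its own file, per the cleanup note below; the names here are illustrative, and a plain dict stands in for the iphits db):

```python
import sqlite3

ipdb = sqlite3.connect(":memory:")
ipdb.execute(
    "CREATE TABLE seen (daynum INTEGER, ip TEXT, PRIMARY KEY (daynum, ip))")
totals = {}  # (weeknum, repo_tag, repo_arch) -> iphits

def count_ip_hit(daynum, weeknum, ip, repo_tag, repo_arch):
    # INSERT OR IGNORE changes 0 rows when (daynum, ip) already exists,
    # so rowcount doubles as the "have we seen this IP today?" check.
    cur = ipdb.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)",
                       (daynum, ip))
    if cur.rowcount == 0:
        return  # already counted this IP today
    key = (weeknum, repo_tag, repo_arch)
    totals[key] = totals.get(key, 0) + 1

count_ip_hit(18550, 2650, "192.0.2.1", "updates-released-f38", "x86_64")
count_ip_hit(18550, 2650, "192.0.2.1", "updates-released-f38", "x86_64")  # same day: ignored
count_ip_hit(18551, 2650, "192.0.2.1", "updates-released-f38", "x86_64")  # next day: counted
print(totals)  # {(2650, 'updates-released-f38', 'x86_64'): 2}
```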

And with this approach:

  • the daynum:IP table (stored separately from the intermediate log) could be rotated / cleaned every second day (or, technically, after that 241-second window passes), so in addition to the raw logs we'd be storing two days' worth of IP addresses
  • the intermediate db and totals db would still be free of IP addresses
  • when going from intermediate.db to totals.db, divide the daily-deduplicated IP hits by 7, to make them more comparable to countme's once-a-week check (visualizations later might do further scaling based on observed correlation of the two methods)

Another observation: totals.db should be "append only". So, rather than growing one file, we could write to a new week-####.db every week. This is mildly more annoying to process, but means that data-processing clients can rsync just what they need (or, wget with proper timestamping, or whatever), and cache that, rather than an ever-increasing totals.db file.
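With per-week files, a client can name exactly the files it is missing. The "week-####.db" pattern is from above; the zero-padding width is my assumption:

```python
def week_db_name(weeknum: int) -> str:
    # "week-####.db", zero-padded to four digits (padding is a guess).
    return f"week-{weeknum:04d}.db"

# A client that already has everything up to week 2648 fetches only:
wanted = [week_db_name(w) for w in range(2649, 2652)]
print(wanted)  # ['week-2649.db', 'week-2650.db', 'week-2651.db']
```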
