The "old" fedora/epel stats at https://data-analysis.fedoraproject.org/csv-reports/mirrors/mirrorsdata-all.csv is in what's known as "wide format" instead of "long format" (aka "tidy format") (see https://www.theanalysisfactor.com/wide-and-long-data/). This has several disadvantages:
So, instead of the current CSV, I would like a table (CSV or a SQLite database; the latter is measurably faster) which uses a schema similar to the one the countme system uses for its totals.db:
That is:
```sql
CREATE TABLE countme_totals (
    hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    os_name TEXT NOT NULL,
    os_version TEXT NOT NULL,
    os_variant TEXT NOT NULL,
    os_arch TEXT NOT NULL,
    sys_age INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, os_name, os_version, os_variant, os_arch,
            sys_age, repo_tag, repo_arch)
);
CREATE INDEX weeknum_idx ON countme_totals (weeknum);
```
With a uniqueness constraint across everything except `hits`.
Of course, since the old logs don't have the same information (no `system_age` or any of the `os_*` values), I think we just have:
```sql
CREATE TABLE ipcount_totals (
    daily_average_hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, repo_tag, repo_arch)
);
CREATE INDEX weeknum_idx ON ipcount_totals (weeknum);
```
... although @smooge can correct me if I'm wrong and we can actually figure out more.
Note that here, I'm expecting `daily_average_hits` to be the average of the daily counts across the week, so a system that hits the mirrors every day still counts roughly once per week. That makes it resemble the sampling frequency of dnf-countme.
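As a minimal sketch of what I mean, assuming we've already extracted one unique-IP count per day per repo from the logs (the input structure here is hypothetical, not an existing mirrors-countme API):

```python
# Collapse per-day unique-IP counts into one weekly row per
# (weeknum, repo_tag, repo_arch), averaging the daily counts.
from statistics import mean

def weekly_rows(daily_counts):
    """daily_counts maps (weeknum, repo_tag, repo_arch) to a list of up to
    seven per-day unique-IP counts; yields rows for ipcount_totals."""
    for (weeknum, repo_tag, repo_arch), counts in daily_counts.items():
        yield (round(mean(counts)), weeknum, repo_tag, repo_arch)
```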
This is not urgent. However, it'd be awesome to have in time to present at DevConf in a month. If that's not possible, by Flock/Nest next summer is the next really useful window.
I was originally thinking we'd entirely abandon this method, but:
We will also need to work out a method for cleaning up the rawdb regularly. Currently it keeps growing, and even with only 1/84th the traffic that the mirror logs would cover, the countme rawdb is 21GB for just over a year. [A system comes in once a week with countme but will show up around 12×7 = 84 times per week in the mirror logs, so I would expect that we are looking at something like 18 TB of raw data.] [I ran into trouble zeroing out the old rawdb, where totals.db got lost in counting. I am not sure if that has been fixed.]
I think the above SQL values are what I was planning to store with my original Python redo of the data. Looking at the current countme code, and maybe taking some things from my previous attempt, I think this is a 4-week coding project with 2 coders, QA, and docs.
The processed data should be much smaller, though, right? In case it didn't stand out: I'm suggesting that we aggregate on a weekly basis.
There is some data in May 2007, but it's mostly noise; 2007-06-01 is the real start date (or 2007-06-04, considering we start weeks on Monday), and that's only about 760 weeks ago.
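For concreteness, a small sketch of the week numbering, assuming we reuse the countme convention that week 0 begins on Monday 1970-01-05 (the first Monday after the Unix epoch) and weeks run Monday through Sunday:

```python
# Week numbering sketch, assuming the countme convention: week 0 starts on
# Monday 1970-01-05, and each week runs Monday through Sunday.
from datetime import date

COUNTME_EPOCH = date(1970, 1, 5)  # the first Monday after the Unix epoch

def weeknum(d: date) -> int:
    """Whole weeks elapsed between the epoch Monday and the given date."""
    return (d - COUNTME_EPOCH).days // 7

print(weeknum(date(2007, 6, 4)))  # first Monday of usable data -> 1952
```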
The current totals.db is 18MiB for 97 weeks, so assuming the same per-week size, 760 weeks would be ~142MiB (149MB, modern metric style).
But it should be significantly smaller than that, because there are fewer distinct rows if the fields are just `weeknum, repo_tag, repo_arch, daily_average_hits`...
There's this:
```
# everything currently
sqlite> select count(*) from countme_totals;
238753
# just double checking...
sqlite> select count(*) from (select distinct weeknum, repo_tag, repo_arch, os_arch, os_name, os_variant, os_version, sys_age from countme_totals);
238753
# and then:
sqlite> select count(*) from (select distinct weeknum, repo_tag, repo_arch from countme_totals);
14296
```
... which is just 1/16th the number of rows. And I think the number of distinct repo_tag, repo_arch combinations has only gone up since the earlier times, so the older weeks should have even fewer rows. ~142MiB divided by 16 is around 9MiB, so I bet the processed `mirrorweekly.db` will be under 10MB at this time.
Of course, what to do with the raw log data is a different matter. Presumably we want to save that to long-term backup somewhere.
It might be nice to apply the "dinosaur-eating" steps of https://pagure.io/velociraptorizer/blob/master/f/velociraptorizer.py as an intermediate step between generating the raw data and the processed `mirrorweekly.db`. Those steps are best-effort compensation for known data aberrations; I'm 85% confident they don't introduce any meaningful distortion, and 99.99% confident they don't introduce any distortion that matters long-term.
I could also continue to apply those corrections as an intermediate step in my graph drawing, but I think it's probably better for most people working with the data to have a corrected view.
On project size: sounds reasonable to me. I'll add it to the list for @amoloney!
The data is processed in a two-stage pipeline by the current countme tool, and still would be.
The raw.db is the currently ever-growing database, which may be around 17TB for just a couple of years of data (data growth being exponential). This is the one which needs work, because removing old data isn't happening, and trying to start 'fresh' seems to cause the script creating totals.db to stop working.
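To make the cleanup idea concrete, here's a minimal sketch under stated assumptions: the raw database holds one row per parsed log line with an integer Unix `timestamp` column, and the table and file names are illustrative rather than the verified raw.db schema:

```python
# Prune raw rows older than a cutoff once they've been folded into the
# totals, then VACUUM so SQLite actually returns the space to the filesystem.
# Table/column names are assumptions, not the verified raw.db schema.
import sqlite3
import time

KEEP_WEEKS = 13  # e.g. keep one quarter of raw data around for re-checks

def prune_raw(db_path="raw.db"):
    cutoff = int(time.time()) - KEEP_WEEKS * 7 * 86400
    con = sqlite3.connect(db_path)
    con.execute("DELETE FROM countme_raw WHERE timestamp < ?", (cutoff,))
    con.commit()
    con.execute("VACUUM")  # must run outside a transaction, hence after commit
    con.close()
```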
Ah, I see. That seems like a separate problem from this, really. Although I guess... since the same logs are being scanned, just counted in different ways, the two could be done in parallel, and fixing that could happen at the same time. Feeding two birds with one scone, as it were.
If I remember correctly, re-scanning at least the last day of the last week (so the last week's totals can change) is necessary if we don't want to add yet another day of delay, because a few entries for the previous week may come in after midnight. But after that, the logs shouldn't change, right?
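Something like this minimal sketch is what I have in mind; the database and table names follow the hypothetical schema above, not an existing tool:

```python
# Drop the most recent week's totals so that week can be re-counted on the
# next run, since a few of its entries may still trickle in after midnight.
# Names (mirrorweekly.db, ipcount_totals) follow the proposal above.
import sqlite3

def reset_latest_week(db_path="mirrorweekly.db"):
    con = sqlite3.connect(db_path)
    (latest,) = con.execute("SELECT MAX(weeknum) FROM ipcount_totals").fetchone()
    if latest is not None:
        con.execute("DELETE FROM ipcount_totals WHERE weeknum = ?", (latest,))
        con.commit()
    con.close()
    return latest  # the caller re-scans logs from this week onward
```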
Mainly it is a 'we are going to run into this by the time we finish one month of old data' problem. Dealing with it now means that this new mirror-analysis program won't eat all the disk space within a week of looking at just, say, the December 2021 logs (let alone all of 2021, or even going back to really old data).
Getting garbage collection done right in the first place will kill 2 birds with 1 scone. [What are you putting in these scones?]
I'm feeding the birds! They're very healthy scones!
Anyway, I'll add some thoughts on this to https://pagure.io/mirrors-countme/issues (now at https://pagure.io/mirrors-countme/issue/56)
This should be proposed as an initiative: https://docs.fedoraproject.org/en-US/cpe/initiatives/#_proposing_a_new_initiative
Sure! https://pagure.io/cpe/initiatives-proposal/issue/17
Closing as Initiative worthy
Metadata Update from @zlopez:
- Issue close_status updated to: Initiative Worthy
- Issue status updated to: Closed (was: Open)