#10443 process mirror data ("old stats") into format similar to countme db
Closed: Initiative Worthy 2 years ago by zlopez. Opened 2 years ago by mattdm.

Describe what you would like us to do:


The "old" fedora/epel stats at https://data-analysis.fedoraproject.org/csv-reports/mirrors/mirrorsdata-all.csv is in what's known as "wide format" instead of "long format" (aka "tidy format") (see https://www.theanalysisfactor.com/wide-and-long-data/). This has several disadvantages:

  • new columns need to be created for new Fedora Linux and EPEL releases
  • same with architecture
  • multi-dimensionality is lost -- it's impossible to correlate architecture with release number, for example; as it is, some columns exist which split the data into fedora/epel, but not all of them do, and it's kind of hacky.

So, instead of the current CSV, I would like a table (CSV or sqlite database — the latter is measurably faster) which uses a schema similar to that used by the countme system "totals.db":

That is:

CREATE TABLE countme_totals (
    hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    os_name TEXT NOT NULL,
    os_version TEXT NOT NULL,
    os_variant TEXT NOT NULL,
    os_arch TEXT NOT NULL,
    sys_age INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, os_name, os_version, os_variant, os_arch, sys_age, repo_tag, repo_arch));
CREATE INDEX weeknum_idx on countme_totals (weeknum);

With a uniqueness constraint across everything except hits.

Of course, since the old logs don't have the same information (no sys_age or any of the os_ values), I think we just have:

CREATE TABLE ipcount_totals (
    daily_average_hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, repo_tag, repo_arch));
CREATE INDEX weeknum_idx on ipcount_totals (weeknum);

... although @smooge can correct me if I'm wrong and we actually can figure out more.

Note that here, I'm expecting "hits" to be the daily average over the week. That makes it resemble the sampling frequency of dnf-countme.
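
For illustration, here's a minimal sketch (stdlib only) of melting the wide CSV into that table. Everything about the CSV layout is an assumption (an ISO date in the first column, one column per series), and split_label() is a placeholder for whatever mapping the real headers need; this is not existing code.

# Hypothetical wide-to-long conversion sketch; column layout and label
# parsing are assumptions, not the real mirrorsdata-all.csv format.
import csv
import sqlite3
from collections import defaultdict
from datetime import date

EPOCH_MONDAY = date(1970, 1, 5)   # assumed week origin: first Monday after the Unix epoch

def weeknum(d: date) -> int:
    return (d - EPOCH_MONDAY).days // 7

def split_label(label: str) -> tuple[str, str]:
    # Placeholder: the real mapping from a wide-format column name to
    # (repo_tag, repo_arch) depends on the actual CSV headers.
    tag, _, arch = label.partition("-")
    return tag, arch or "unknown"

sums = defaultdict(int)   # (weeknum, repo_tag, repo_arch) -> sum of daily hits
days = defaultdict(int)   # (weeknum, repo_tag, repo_arch) -> days seen that week

with open("mirrorsdata-all.csv", newline="") as f:
    reader = csv.DictReader(f)
    datecol = reader.fieldnames[0]            # assumption: first column is an ISO date
    for row in reader:
        d = date.fromisoformat(row[datecol])
        for label, value in row.items():
            if label == datecol or not value:
                continue
            key = (weeknum(d), *split_label(label))
            sums[key] += int(value)
            days[key] += 1

db = sqlite3.connect("mirrorweekly.db")
db.execute("""CREATE TABLE IF NOT EXISTS ipcount_totals (
    daily_average_hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, repo_tag, repo_arch))""")
for (wk, tag, arch), total in sums.items():
    db.execute("INSERT OR REPLACE INTO ipcount_totals VALUES (?, ?, ?, ?)",
               (round(total / days[(wk, tag, arch)]), wk, tag, arch))
db.commit()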

When do you need this to be done by? (YYYY/MM/DD)


This is not urgent. However, it'd be awesome to have in time to present at DevConf in a month. If that's not possible, by Flock/Nest next summer is the next really useful window.

I was originally thinking we'd entirely abandon this method, but:

  • It's nice to have the correlation with the other method to compare
  • having it in this format will let me produce "long term" graphs which meld the two formats — even if it's not really directly comparable, the correlation is actually quite close with a scaling factor, and it's nice to not just discard everything we know from before F30.
  • EPEL 6 is still really popular, and EPEL 7 continues to go through the roof. DNF countme doesn't show these, and I expect we'll find that EPEL information useful for at least the next five years.
  • This is a chance for someone other than Smooge to know how to do this, so all of my crazy requests do not fall just on him.

We will also need to work out a method for cleaning up the raw db regularly. Currently it keeps growing: with only 1/84th of the hit volume that the mirror data would have, countme's raw db is already 21GB for just over a year. [A system checks in once a week with countme, but will show up around 12*7 (84) times per week in the mirror logs.] So I would expect that we are looking at something like 18 TB of raw data. [I ran into trouble zeroing out the old raw db, where totals.db got lost in counting. I am not sure if that has been fixed.]

I think the above SQL values are what I was planning to store with my original Python redo of the data. Looking at Will's current code, and maybe taking some things from my previous attempt, I think this is a 4-week coding project with 2 coders, plus QA and docs.

The processed data should be much lower in size, though, right? In case it didn't stand out, I'm suggesting that we aggregate on a weekly basis.

There is some data in May 2007 but it's mostly noise; 2007-06-01 is the real start date (or 2007-06-04, if we start weeks on Monday), and that's only about 760 weeks up to now.
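
(Assuming the same week numbering countme uses, as I understand it: Monday-start weeks counted from 1970-01-05, the first Monday after the Unix epoch. A quick sanity check:)

# Quick week-arithmetic check; the epoch choice is an assumption about how
# countme numbers weeks (Monday-start, counted from 1970-01-05).
from datetime import date

EPOCH_MONDAY = date(1970, 1, 5)

def weeknum(d: date) -> int:
    return (d - EPOCH_MONDAY).days // 7

print(weeknum(date(2007, 6, 4)))                          # 1952: first Monday of usable data
print(weeknum(date.today()) - weeknum(date(2007, 6, 4)))  # roughly 760 at the time of writing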

The current totals.db is 18MiB for 97 weeks, so assuming the same per-week size, that's ~142MiB (149MB, modern metric style).

But it should be significantly smaller, because there are fewer distinct rows if the fields are just weeknum, repo_tag, repo_arch, daily_average_hits...

There's this:

# everything currently
sqlite> select count(*) from countme_totals;
238753

# just double checking...
sqlite> select count(*) from (select distinct weeknum, repo_tag, repo_arch, os_arch, os_name, os_variant, os_version, sys_age from countme_totals);
238753

# and then:
sqlite> select count(*) from (select distinct weeknum,repo_tag,repo_arch from countme_totals);
14296

... which is only about 1/16th as many rows. And I think the number of distinct repo_tag, repo_arch entries has only gone up since the early days. So I bet the processed "mirrorweekly.db" will be under 10MB at this time.
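
Back of the envelope, using the numbers above (purely illustrative):

# Back-of-envelope size estimate from the figures above.
weeks_covered = 760
mib_per_week = 18 / 97                  # current totals.db: 18 MiB over 97 weeks
row_ratio = 14296 / 238753              # distinct (weeknum, repo_tag, repo_arch) rows vs. all rows
print(weeks_covered * mib_per_week)              # ~141 MiB if rows were as numerous as countme's
print(weeks_covered * mib_per_week * row_ratio)  # ~8.4 MiB scaled by the distinct-row ratio (rows are narrower too, so likely even less)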

Of course, what to do with the raw log data is a different matter. Presumably we want to save that to long-term backup somewhere.

It might be nice to apply the "dinosaur-eating" steps of https://pagure.io/velociraptorizer/blob/master/f/velociraptorizer.py as an intermediate step between generating the raw data and the processed mirrorweekly.db. Those steps compensate, on a best-effort basis, for known data aberrations; I'm 85% confident they don't introduce any meaningful distortion and 99.99% confident they don't introduce any distortion that matters long term.

I could also continue to put those as an intermediate step in my graph drawing, but I think it's probably better for most people working with the data to have a corrected view.


On project size: sounds reasonable to me. I'll add it to the list for @amoloney!

The data is processed in two stages by the current countme tool, and would still be:

  1. Apache logs are grabbed, combined, and put on NFS (this is a process, but it isn't part of the main program).
  2. Take the log files, look for the entries we want, and stick them in raw.db.
  3. Weekly, take the raw.db and make a totals.db (a rough sketch of this step for the mirror data follows the list).
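
For concreteness, here's a rough sketch of what step 3 could look like for the mirror data. The raw_hits table and its daynum column are assumptions for illustration (daynum being days since Monday 1970-01-05), not the actual raw.db schema used by countme.

# Hypothetical weekly roll-up: a per-day raw table -> ipcount_totals.
# raw_hits(daynum, ipaddr, repo_tag, repo_arch) is an assumed schema,
# with daynum counted in days since Monday 1970-01-05.
import sqlite3

raw = sqlite3.connect("raw.db")
raw.executescript("""
ATTACH DATABASE 'mirrorweekly.db' AS weekly;
CREATE TABLE IF NOT EXISTS weekly.ipcount_totals (
    daily_average_hits INTEGER NOT NULL,
    weeknum INTEGER NOT NULL,
    repo_tag TEXT NOT NULL,
    repo_arch TEXT NOT NULL,
    UNIQUE (weeknum, repo_tag, repo_arch));

-- One row per (week, repo, arch): count distinct client IPs per day,
-- then average those daily counts over the week.
INSERT OR REPLACE INTO weekly.ipcount_totals
SELECT CAST(ROUND(AVG(ips)) AS INTEGER), weeknum, repo_tag, repo_arch
FROM (
    SELECT daynum / 7 AS weeknum, repo_tag, repo_arch,
           COUNT(DISTINCT ipaddr) AS ips
    FROM raw_hits
    GROUP BY daynum, repo_tag, repo_arch
)
GROUP BY weeknum, repo_tag, repo_arch;
""")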

The raw.db is the currently ever-growing database, which may be around 17TB for just a couple of years of data (data growth being exponential). This is the one which needs work, because removing old data isn't occurring, and trying to start 'fresh' seems to cause the script creating totals.db to stop working.
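
The pruning itself could be quite small, something like this sketch (reusing the hypothetical raw_hits/daynum names from the earlier sketch; the real raw.db schema and retention policy would differ):

# Hypothetical raw.db pruning: drop rows already rolled up into totals,
# keeping a short trailing window for late log entries.
# raw_hits and daynum are the assumed names from the sketch above.
import sqlite3

KEEP_WEEKS = 1   # keep the current and the previous week for re-scanning

db = sqlite3.connect("raw.db")
(latest_day,) = db.execute("SELECT MAX(daynum) FROM raw_hits").fetchone()
cutoff_week = latest_day // 7 - KEEP_WEEKS
db.execute("DELETE FROM raw_hits WHERE daynum / 7 < ?", (cutoff_week,))
db.commit()
db.execute("VACUUM")   # actually return the freed pages to the filesystem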

Ah, I see. That seems like a separate problem from this, really. Although I guess... since the same logs are being scanned, just counted in different ways, they could be done in parallel, and fixing that could be done at the same time. Feeding two birds with one scone, as it were.

If I remember correctly, re-scanning at least the last day of the previous week (so the totals for that week can still change) is necessary if we don't want to cause yet another day of delay, because a few entries for the previous week may come in after midnight. But after that, the logs shouldn't change, right?
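
(In case a sketch helps: assuming Monday-start weeks, figuring out which weeknums are still "open" is just date arithmetic. The one-day grace period here is my reading of the above, not how the current tool works.)

# Which weeknums may still change? Assumes Monday-start weeks numbered from
# 1970-01-05, and that yesterday's logs can still gain a few late entries.
from datetime import date, timedelta

EPOCH_MONDAY = date(1970, 1, 5)

def weeknum(d: date) -> int:
    return (d - EPOCH_MONDAY).days // 7

today = date.today()
still_open = {weeknum(today - timedelta(days=1)), weeknum(today)}
print(sorted(still_open))   # re-aggregate these weeknums on the next run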

Mainly it is a "we are going to run into this by the time we finish one month of old data" problem. Dealing with it now means that this new mirroranalysis program won't kill the disk space in a week of just looking at, say, December 2021 logs (never mind all of 2021, or going back to really old data).

Getting garbage collection done right in the first place will kill 2 birds with 1 scone. [What are you putting in these scones?]

I'm feeding the birds! They're very healthy scones!

Anyway, I'll add some thoughts on this to https://pagure.io/mirrors-countme/issues (now at https://pagure.io/mirrors-countme/issue/56)

Closing as Initiative worthy

Metadata Update from @zlopez:
- Issue close_status updated to: Initiative Worthy
- Issue status updated to: Closed (was: Open)

2 years ago
