# mirrors-countme

Parse HTTP `access_log` data, find DNF `countme` requests, and output
structured data that lets us estimate the number of people using various
Fedora releases.

See [Changes/DNF Better Counting] for more info about the `countme` feature in
general.

## How it works

The short version:

* Starting in Fedora 32, DNF adds "countme=N" to one random HTTP request
  per week for each repo that has its `countme` setting enabled.
* `parse-access-log.py` parses logs from mirrors.fedoraproject.org, finds
  those requests, and yields the following information:
    * request timestamp, repo & arch
    * client OS name, version, variant, and arch
    * client "age", from 1-4: 1 week, 1 month, 6 months, or >6 months.
* We use that data to make cool charts & graphs and estimate how many Fedora
  users there are and what they're using.

## Technical details

### Client behavior & configuration

DNF 4.2.9 added the `countme` option, which [dnf.conf(5)] describes like so:

>    Determines whether a special flag should be added to a single, randomly
>    chosen metalink/mirrorlist query each week.
>    This allows the repository owner to estimate the number of systems
>    consuming it, by counting such queries over a week's time, which is much
>    more accurate than just counting unique IP addresses (which is subject to
>    both overcounting and undercounting due to short DHCP leases and NAT,
>    respectively).
>
>    The flag is a simple "countme=N" parameter appended to the metalink and
>    mirrorlist URL, where N is an integer representing the "longevity" bucket
>    this system belongs to.
>    The following 4 buckets are defined, based on how many full weeks have
>    passed since the beginning of the week when this system was installed: 1 =
>    first week, 2 = first month (2-4 weeks), 3 = six months (5-24 weeks) and 4
>    = more than six months (> 24 weeks).
>    This information is meant to help distinguish short-lived installs from
>    long-term ones, and to gather other statistics about system lifecycle.

Note that the default is False, because we don't want to enable this for every
repo you have configured.

Starting with Fedora 32, we set `countme=1` in Fedora official repo configs:

```
[updates]
name=Fedora $releasever - $basearch - Updates
#baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/
metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
enabled=1
countme=1
repo_gpgcheck=0
type=rpm
gpgcheck=1
metadata_expire=6h
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
skip_if_unavailable=False
```

This means that the default configuration only adds "countme=N" to metalink
requests for official Fedora repos, which all go over HTTPS connections to
mirrors.fedoraproject.org. "countme=N" does _not_ get added in subsequent
requests to the chosen mirror(s).

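If you want to check which of your own repos would send `countme`, something like the following might do. It's only a sketch: it assumes `.repo` files parse cleanly as INI with Python's `configparser`, and that a repo with no `enabled=` line counts as enabled. The `thirdparty` repo and the `countme_repos()` helper are made up for illustration.

```python
import configparser

# Hypothetical sample; in practice this text would come from a file
# under /etc/yum.repos.d/.
SAMPLE_REPO = """\
[updates]
name=Fedora $releasever - $basearch - Updates
metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
enabled=1
countme=1

[thirdparty]
name=Some third-party repo
baseurl=https://repo.example/fedora/$releasever/$basearch/
enabled=1
"""


def countme_repos(repo_text):
    """Return the IDs of enabled repos that set countme to a true value."""
    # interpolation=None keeps configparser from treating '%' specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return [
        section
        for section in parser.sections()
        # Assumption: a repo without an explicit enabled= line is enabled.
        if parser.getboolean(section, "enabled", fallback=True)
        # countme defaults to False, as noted above.
        and parser.getboolean(section, "countme", fallback=False)
    ]


print(countme_repos(SAMPLE_REPO))
```

Only `updates` is listed, since `thirdparty` never sets `countme`.
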
### Privacy, randomization, and user counting

DNF makes a serious effort to keep the `countme` data anonymous _and_ accurate
by only sending `countme` with one _random_ request to each enabled repo _per
week_. So how does it decide when the week starts, and how does it choose
which request?

First, all clients use the same "week": Week 0 started at timestamp 345600
(Mon 05 Jan 1970 00:00:00 UTC - the first Monday of POSIX time), and weeks are
exactly 604800 (7×24×60×60) seconds long.

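The week arithmetic is simple enough to sketch in a few lines. The constant and function names here (`COUNTME_EPOCH`, `WEEK_SECONDS`, `week_number()`, `week_start()`) are made up for this example and aren't taken from libdnf or from the scripts in this repo.

```python
from datetime import datetime, timezone

COUNTME_EPOCH = 345600  # Mon 05 Jan 1970 00:00:00 UTC
WEEK_SECONDS = 604800   # 7 * 24 * 60 * 60


def week_number(timestamp):
    """Return the number of the week that contains a POSIX timestamp."""
    return (int(timestamp) - COUNTME_EPOCH) // WEEK_SECONDS


def week_start(weeknum):
    """Return the UTC datetime at which the given week begins."""
    return datetime.fromtimestamp(
        COUNTME_EPOCH + weeknum * WEEK_SECONDS, tz=timezone.utc
    )


# The timestamp of the sample log line used later in this README.
ts = datetime(2020, 3, 29, 16, 4, 28, tzinfo=timezone.utc).timestamp()
week = week_number(ts)
print(week, week_start(week).isoformat())
```

Every week computed this way starts on a Monday at 00:00:00 UTC, no matter what timezone the client is in.
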
Second, all clients have the same random chance - currently 1:4 - to send
`countme` with any request in a given week. Once it's been sent, the client
won't send another `countme` for that repo for the rest of the week.

The default update interval for the `updates` repo is 6 hours, which means
that clients who use `dnf-makecache.service` will probably send `countme`
sometime in the first 24 hours of a given week - and nothing for the rest of
the week.

This means that _daily_ totals of users are unreliably variable, since the
start of the week will have more `countme` requests than the end of the week.
But the weekly totals should be a consistent, representative sample of the
total population.

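To get a feel for why the start of the week is so front-loaded, here's a rough back-of-the-envelope model. It takes the "1:4 per request" description above at face value and assumes every request is an independent draw; the real logic in libdnf may well differ in detail, so treat `prob_sent_within()` as a hypothetical illustration only.

```python
def prob_sent_within(n_requests, chance=0.25):
    """Probability that countme has gone out within n_requests requests,
    assuming each request independently carries it with probability `chance`."""
    return 1.0 - (1.0 - chance) ** n_requests


# With a 6-hour update interval there are roughly 4 requests per day.
for days in (1, 2, 7):
    print(f"after {days} day(s): {prob_sent_within(4 * days):.3f}")
```

Under those assumptions roughly two-thirds of always-on clients have already sent `countme` by the end of the first day, which is why per-day counts are lopsided and per-week counts are not.
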
For more details on how libdnf handles the randomization, see
[libdnf/repo/Repo.cpp:addCountmeFlag()].

### Data collected

The only data we look at is in the HTTP request itself. Our log lines are in
the standard Combined Log Format, like so[^IPvBeefy]:

```
240.159.140.173 - - [29/Mar/2020:16:04:28 +0000] "GET /metalink?repo=fedora-modular-32&arch=x86_64&countme=1 HTTP/2.0" 200 18336 "-" "libdnf (Fedora 32; workstation; Linux.x86_64)"
```

We only look at log lines where the request method is "GET", the query string
includes "countme=N", the response status is 200 or 302, and the User-Agent
string matches the libdnf User-Agent format.

The only data we use are the timestamp, the query parameters (`repo`, `arch`,
`countme`), and the libdnf User-Agent data.

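The filtering rules above could be sketched roughly like this. This is not the code in `parse-access-log.py`, just a minimal illustration; the regex, the `parse_countme_line()` helper, and the shape of the returned dict are all assumptions made for the example.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Combined Log Format:
#   host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<nbytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

SAMPLE_LINE = (
    '240.159.140.173 - - [29/Mar/2020:16:04:28 +0000] '
    '"GET /metalink?repo=fedora-modular-32&arch=x86_64&countme=1 HTTP/2.0" '
    '200 18336 "-" "libdnf (Fedora 32; workstation; Linux.x86_64)"'
)


def parse_countme_line(line):
    """Return a dict for a countme hit, or None if the line doesn't qualify."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    if match["method"] != "GET" or match["status"] not in ("200", "302"):
        return None
    if not match["user_agent"].startswith("libdnf"):
        return None
    query = parse_qs(urlsplit(match["path"]).query)
    if "countme" not in query:
        return None
    # Note that the client IP is deliberately left out of the result.
    return {
        "time": match["time"],
        "repo": query.get("repo", [None])[0],
        "arch": query.get("arch", [None])[0],
        "countme": query["countme"][0],
        "user_agent": match["user_agent"],
    }


print(parse_countme_line(SAMPLE_LINE))
```

Anything that isn't a successful libdnf `GET` with a `countme` parameter comes back as `None` and is simply skipped.
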
#### libdnf User-Agent data

As in the log line above, the User-Agent header that libdnf sends looks like this:

```
User-Agent: libdnf (Fedora 32; workstation; Linux.x86_64)
```

This string is assembled in [libdnf/utils/os-release.cpp:getUserAgent()] and
the format is as follows:

```
{product} ({os_name} {os_version}; {os_variant}; {os_canon}.{os_arch})
```

where the values are:

`product`
:  "libdnf"

`os_name`
:  [/etc/os-release] `NAME`

`os_version`
:  [/etc/os-release] `VERSION_ID`

`os_variant`
:  [/etc/os-release] `VARIANT_ID`

`os_canon`
:  rpm `%_os` (via libdnf `getCanonOS()`)

`os_arch`
:  rpm `%_arch` (via libdnf `getBaseArch()`)

(Older versions of libdnf sent `libdnf/{LIBDNF_VERSION}` for the `product`,
but the version string was dropped in libdnf 0.37.2 due to privacy concerns;
see [libdnf commit d8d0984].)

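One way to pull those fields back out of the header is a single regex that follows the format string above. This is a sketch, not the parser this repo actually uses: `USER_AGENT_RE` and `parse_user_agent()` are hypothetical names, and the pattern assumes `os_version` never contains spaces while `os_name` might.

```python
import re

# {product} ({os_name} {os_version}; {os_variant}; {os_canon}.{os_arch})
USER_AGENT_RE = re.compile(
    r'^(?P<product>libdnf(?:/\S+)?) '                 # optional old-style /VERSION
    r'\((?P<os_name>.+) (?P<os_version>[^;\s]+); '    # name may contain spaces
    r'(?P<os_variant>[^;]+); '
    r'(?P<os_canon>[^.;]+)\.(?P<os_arch>[^)]+)\)$'
)


def parse_user_agent(user_agent):
    """Return the User-Agent fields as a dict, or None if it isn't libdnf."""
    match = USER_AGENT_RE.match(user_agent)
    return match.groupdict() if match else None


print(parse_user_agent("libdnf (Fedora 32; workstation; Linux.x86_64)"))
```

The optional `/VERSION` part keeps headers from older libdnf versions parseable; they just end up with a longer `product` value.
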
#### `repo=`, `arch=`, `countme=`

The `repo=` and `arch=` values are exactly what's set in the URL in the `.repo`
file.

`repo` is whatever string appears after `repo=` in the repo's `metalink` URL.
The values that are accepted for `repo` are determined by [mirrormanager];
see [mirrormanager2/lib/repomap.py] for some of the gnarly details there.

`arch` is usually set as `arch=$basearch`, which means that `os_arch` and
`repo_arch` (the `arch=` value from the request) are usually the same value.
But it _is_ valid for a client to use a repo with an `arch=` that's different
from rpm's `%_arch` - for example, an i686 system could use an i386 repo - so
`repo_arch` and `os_arch` _may_ be different values.

`countme`, as discussed in [dnf.conf(5)], is a value from 1 to 4 indicating
the "age" of the system, counted in _full_ weeks since the system was
first installed. The values are:

1. One week or less (0-1 weeks)
2. Up to one month (2-4 weeks)
3. Up to six months (5-24 weeks)
4. More than six months (25+ weeks)

These are defined in [libdnf/repo/Repo.cpp:COUNTME\_BUCKETS].

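The mapping from system age to bucket can be written as a small lookup. The thresholds below are read straight off the list above rather than copied from libdnf, and `BUCKET_STARTS` and `countme_bucket()` are names invented for this sketch.

```python
from bisect import bisect_right

# Age (in full weeks) at which buckets 2, 3 and 4 begin, per the list above.
BUCKET_STARTS = [2, 5, 25]


def countme_bucket(age_weeks):
    """Map a system age in full weeks to its countme value (1-4)."""
    if age_weeks < 0:
        raise ValueError("age_weeks must not be negative")
    # bisect_right counts how many bucket boundaries the age has reached.
    return bisect_right(BUCKET_STARTS, age_weeks) + 1


for weeks in (0, 1, 2, 4, 5, 24, 25, 100):
    print(weeks, countme_bucket(weeks))
```

Using `bisect_right` means a system that is exactly 2, 5, or 25 weeks old lands in the newer bucket, which matches the ranges listed above.
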
## OK but how do we actually use it in Fedora?

Because the raw log data contains IP addresses and timestamps that could be
used to track or identify users, we run the parsing and counting inside private
parts of the Fedora infrastructure and only publish the anonymous aggregate
data.

In practice, this is a three-part process:

1. Run `countme-update-rawdb.sh` daily to parse log data into `rawdb`
  * `rawdb` is a SQLite database of structured data for each `countme` hit
  * Kept private since it contains IP addresses and timestamps
  * Typical log data: ~6GB/day
  * Typical parsing time: ~5min (Intel Core i7-6770HQ, 2.60GHz)
  * Typical rawdb size: ~8MB/day for F32; I'd guess keeping 1 year of data for
    3 concurrent releases would take about 10GB.
  * Retaining historical data lets us quickly recalculate counts if we
    discover significant errors due to misconfigured/malicious clients
2. Run `countme-update-totals.sh` to read `rawdb` and update `totalsdb`
  * Counts up hits for each week, grouped by:
    * System info: `os_name`, `os_version`, `os_variant`, `os_arch`, `sys_age`
    * Repo requested: `repo_tag`, `repo_arch`
  * Only generates data for weeks where we have complete log data
  * No timestamps or IP addresses
  * Typical processing time: small, about 5s or less
  * Typical totalsdb size: ~55KB/week (~700 rows/week) for F32
  * After update, (re)generate `totals.csv`
3. Publish updated `totals.db` and `totals.csv`
  * See https://data-analysis.fedoraproject.org/csv-reports/countme/
  * Might end up in different places/forms in the future

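The weekly grouping in step 2 boils down to a `GROUP BY`, which can be sketched with an in-memory SQLite database. The table name `countme_raw`, the `timestamp`, `weeknum`, and `hits` columns, and the sample rows are all made up here; the real `rawdb` and `totalsdb` schemas may look different.

```python
import sqlite3

COUNTME_EPOCH = 345600  # start of week 0
WEEK_SECONDS = 604800   # one week

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE countme_raw (
           timestamp INTEGER, os_name TEXT, os_version TEXT, os_variant TEXT,
           os_arch TEXT, sys_age INTEGER, repo_tag TEXT, repo_arch TEXT)"""
)
rows = [
    (1585497868, "Fedora", "32", "workstation", "x86_64", 1, "fedora-modular-32", "x86_64"),
    (1585500000, "Fedora", "32", "workstation", "x86_64", 1, "fedora-modular-32", "x86_64"),
    (1585500001, "Fedora", "32", "server", "aarch64", 3, "updates-released-f32", "aarch64"),
]
conn.executemany("INSERT INTO countme_raw VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Collapse individual hits into per-week counts. The result carries a week
# number instead of timestamps, and no IP addresses at all.
totals = conn.execute(
    """SELECT (timestamp - ?) / ? AS weeknum,
              os_name, os_version, os_variant, os_arch, sys_age,
              repo_tag, repo_arch, COUNT(*) AS hits
         FROM countme_raw
        GROUP BY weeknum, os_name, os_version, os_variant, os_arch, sys_age,
                 repo_tag, repo_arch
        ORDER BY hits DESC""",
    (COUNTME_EPOCH, WEEK_SECONDS),
).fetchall()

for row in totals:
    print(row)
```

The two identical workstation hits collapse into one row with `hits` of 2, and nothing in the output can be traced back to a single request.
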
[^IPvBeefy]: Don't worry, 240.159.140.173 is a fake IP address. Actually,
             it's the 4-byte UTF-8 encoding for 🌭, U+1F32D HOT DOG.

[Changes/DNF Better Counting]: https://fedoraproject.org/wiki/Changes/DNF_Better_Counting
[dnf.conf(5)]: https://dnf.readthedocs.io/en/latest/conf_ref.html
[/etc/os-release]: http://man7.org/linux/man-pages/man5/os-release.5.html
[mirrormanager]: https://github.com/fedora-infra/mirrormanager2
[mirrormanager2/lib/repomap.py]: https://github.com/fedora-infra/mirrormanager2/blob/master/mirrormanager2/lib/repomap.py
[libdnf commit d8d0984]: https://github.com/rpm-software-management/libdnf/commit/d8d0984
[libdnf/utils/os-release.cpp:getUserAgent()]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/utils/os-release.cpp#L108
[libdnf/repo/Repo.cpp:addCountmeFlag()]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/repo/Repo.cpp#L1051
[libdnf/repo/Repo.cpp:COUNTME\_BUCKETS]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/repo/Repo.cpp#L92