# mirrors-countme

Parse HTTP `access_log` data, find DNF `countme` requests, and output
structured data that lets us estimate the number of people using various
Fedora releases.

See [Changes/DNF Better Counting] for more info about the `countme` feature in
general.

## How it works

The short version:

* Starting in Fedora 32, DNF adds "countme=N" to one random HTTP request
  per week for each repo that has its `countme` setting enabled.
* `parse-access-log.py` parses logs from mirrors.fedoraproject.org, finds
  those requests, and yields the following information:
  * request timestamp, repo & arch
  * client OS name, version, variant, and arch
  * client "age", from 1-4: 1 week, 1 month, 6 months, or >6 months.
* We use that data to make cool charts & graphs and estimate how many Fedora
  users there are and what they're using.

## Technical details

### Client behavior & configuration

DNF 4.2.9 added the `countme` option, which [dnf.conf(5)] describes like so:

> Determines whether a special flag should be added to a single, randomly
> chosen metalink/mirrorlist query each week.
> This allows the repository owner to estimate the number of systems
> consuming it, by counting such queries over a week's time, which is much
> more accurate than just counting unique IP addresses (which is subject to
> both overcounting and undercounting due to short DHCP leases and NAT,
> respectively).
>
> The flag is a simple "countme=N" parameter appended to the metalink and
> mirrorlist URL, where N is an integer representing the "longevity" bucket
> this system belongs to.
> The following 4 buckets are defined, based on how many full weeks have
> passed since the beginning of the week when this system was installed: 1 =
> first week, 2 = first month (2-4 weeks), 3 = six months (5-24 weeks) and 4
> = more than six months (> 24 weeks).
> This information is meant to help distinguish short-lived installs from
> long-term ones, and to gather other statistics about system lifecycle.

Note that the default is False, because we don't want to enable this for every
repo you have configured.

Starting with Fedora 32, we set `countme=1` in the official Fedora repo configs:

```
[updates]
name=Fedora $releasever - $basearch - Updates
#baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/
metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
enabled=1
countme=1
repo_gpgcheck=0
type=rpm
gpgcheck=1
metadata_expire=6h
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
skip_if_unavailable=False
```

This means that the default configuration only adds "countme=N" when using
official Fedora repos, all of which go through HTTPS connections to
mirrors.fedoraproject.org. "countme=N" does _not_ get added in subsequent
requests to the chosen mirror(s).

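To check which repos on a system would send `countme`, something like the following could work. This is only a sketch: it reads `.repo` text with Python's generic `configparser` rather than DNF's own config parser, so edge cases may be handled differently, and the `countme_enabled_repos` helper and the `[third-party]` repo are made up for illustration.

```python
import configparser


def countme_enabled_repos(repo_text):
    """Return the IDs of repo sections whose `countme` option is truthy."""
    # interpolation=None keeps configparser from treating '%' specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return [
        repo_id
        for repo_id in parser.sections()
        # A missing `countme` option falls back to False, like the DNF default.
        if parser.getboolean(repo_id, "countme", fallback=False)
    ]


# Hypothetical example input: one official repo, one made-up third-party repo.
example = """\
[updates]
name=Fedora $releasever - $basearch - Updates
metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
enabled=1
countme=1

[third-party]
name=Some other repo
baseurl=https://repo.example/$releasever/$basearch/
enabled=1
"""

print(countme_enabled_repos(example))
```

Only `updates` is reported here, since the made-up third-party repo never sets `countme`.
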
### Privacy, randomization, and user counting

DNF makes a serious effort to keep the `countme` data anonymous _and_ accurate
by only sending `countme` with one _random_ request to each enabled repo _per
week_. So how does it decide when the week starts, and how does it choose
which request?

First, all clients use the same "week": Week 0 started at timestamp 345600
(Mon 05 Jan 1970 00:00:00 UTC - the first Monday of POSIX time), and weeks are
exactly 604800 (7×24×60×60) seconds long.

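The week arithmetic above can be sketched in a few lines of Python. The constant and function names (`WEEK0_START`, `WEEK_LEN`, `week_number`, `week_start`) are chosen for this example and are not taken from libdnf or from this repo.

```python
from datetime import datetime, timezone

WEEK0_START = 345600         # Mon 05 Jan 1970 00:00:00 UTC
WEEK_LEN = 7 * 24 * 60 * 60  # 604800 seconds


def week_number(timestamp):
    """Return the shared week number that a POSIX timestamp falls in."""
    return (int(timestamp) - WEEK0_START) // WEEK_LEN


def week_start(weeknum):
    """Return the POSIX timestamp at which the given week begins."""
    return WEEK0_START + weeknum * WEEK_LEN


# A request made on Sun 29 Mar 2020 at 16:04:28 UTC
ts = int(datetime(2020, 3, 29, 16, 4, 28, tzinfo=timezone.utc).timestamp())
wk = week_number(ts)
start = datetime.fromtimestamp(week_start(wk), tz=timezone.utc)
print(wk, start.isoformat())
```

Because every client uses the same offset and length, any two requests with the same `week_number()` belong to the same counting window, no matter which timezone the client is in.
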
Second, all clients have the same random chance - currently 1:4 - to send
`countme` with any request in a given week. Once it's been sent, the client
won't send another `countme` for that repo for the rest of the week.

The default update interval for the `updates` repo is 6 hours, which means
that clients who use `dnf-makecache.service` will probably send `countme`
sometime in the first 24 hours of a given week - and nothing for the rest of
the week.

This means that _daily_ totals of users are unreliably variable, since the
start of the week will have more `countme` requests than the end of the week.
But the weekly totals should be a consistent, representative sample of the
total population.

For more details on how libdnf handles the randomization, see
[libdnf/repo/Repo.cpp:addCountmeFlag()].

### Data collected

The only data we look at is in the HTTP request itself. Our log lines are in
the standard Combined Log Format, like so[^IPvBeefy]:

```
240.159.140.173 - - [29/Mar/2020:16:04:28 +0000] "GET /metalink?repo=fedora-modular-32&arch=x86_64&countme=1 HTTP/2.0" 200 18336 "-" "libdnf (Fedora 32; workstation; Linux.x86_64)"
```

We only look at log lines where the request is "GET", the query string includes
"countme=N", the response status is 200 or 302, and the User-Agent string
matches the libdnf User-Agent header.

The only data we use are the timestamp, the query parameters (`repo`, `arch`,
`countme`), and the libdnf User-Agent data.

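As a rough illustration of that filter, here is a minimal Python sketch. It is not the code in `parse-access-log.py`: the regular expression, the `countme_hit` function, and the keys of the returned dict are assumptions made for this example.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Combined Log Format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<nbytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def countme_hit(line):
    """Return a dict for a qualifying countme log line, else None."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    if m["method"] != "GET" or m["status"] not in ("200", "302"):
        return None
    if not m["user_agent"].startswith("libdnf"):
        return None
    query = parse_qs(urlsplit(m["path"]).query)
    countme = query.get("countme", [""])[0]
    if not countme.isdecimal():
        return None
    when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Note that the client IP address is deliberately not copied over.
    return {
        "timestamp": int(when.timestamp()),
        "repo": query.get("repo", [None])[0],
        "arch": query.get("arch", [None])[0],
        "countme": int(countme),
        "user_agent": m["user_agent"],
    }


sample = (
    '240.159.140.173 - - [29/Mar/2020:16:04:28 +0000] '
    '"GET /metalink?repo=fedora-modular-32&arch=x86_64&countme=1 HTTP/2.0" '
    '200 18336 "-" "libdnf (Fedora 32; workstation; Linux.x86_64)"'
)
print(countme_hit(sample))
```

Lines that fail any of the four checks simply yield `None`, so the function can be applied to every line of a log and the `None` results discarded.
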
#### libdnf User-Agent data

As in the log line above, the User-Agent header that libdnf sends looks like this:

```
User-Agent: libdnf (Fedora 32; workstation; Linux.x86_64)
```

This string is assembled in [libdnf/utils/os-release.cpp:getUserAgent()] and
the format is as follows:

```
{product} ({os_name} {os_version}; {os_variant}; {os_canon}.{os_arch})
```

where the values are:

`product`
: "libdnf"

`os_name`
: [/etc/os-release] `NAME`

`os_version`
: [/etc/os-release] `VERSION_ID`

`os_variant`
: [/etc/os-release] `VARIANT_ID`

`os_canon`
: rpm `%_os` (via libdnf `getCanonOS()`)

`os_arch`
: rpm `%_arch` (via libdnf `getBaseArch()`)

(Older versions of libdnf sent `libdnf/{LIBDNF_VERSION}` for the `product`,
but the version string was dropped in libdnf 0.37.2 due to privacy concerns;
see [libdnf commit d8d0984].)

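Pulling those fields back out of the header value could look like the sketch below. The pattern and the `parse_user_agent` helper are assumptions written for this example (they are not the parser this repo uses), and the pattern assumes that `os_version` contains no spaces.

```python
import re

# {product} ({os_name} {os_version}; {os_variant}; {os_canon}.{os_arch})
UA_RE = re.compile(
    r"^(?P<product>libdnf(?:/\S+)?) "           # older clients: libdnf/VERSION
    r"\((?P<os_name>[^;]+) (?P<os_version>[^ ;]+); "
    r"(?P<os_variant>[^;]+); "
    r"(?P<os_canon>[^.;]+)\.(?P<os_arch>[^)]+)\)$"
)


def parse_user_agent(user_agent):
    """Split a libdnf User-Agent value into its fields, or return None."""
    m = UA_RE.match(user_agent)
    return m.groupdict() if m else None


print(parse_user_agent("libdnf (Fedora 32; workstation; Linux.x86_64)"))
```

The `os_name` group is allowed to contain spaces, so the last space before the first `;` is what separates the name from the version.
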
#### `repo=`, `arch=`, `countme=`

The `repo=` and `arch=` values are exactly what's set in the URL in the `.repo`
file.

`repo` is whatever string appears after `repo=` in the repo's `metalink` URL.
The values that are accepted for `repo` are determined by [mirrormanager];
see [mirrormanager2/lib/repomap.py] for some of the gnarly details there.

`arch` is usually set as `arch=$basearch`, which means that `os_arch` and
`repo_arch` are usually the same value. But it _is_ valid for a client to
use a repo with an `arch=` that's different from rpm's `%_arch` - for example,
an i686 system could use an i386 repo - so `repo_arch` and `os_arch` _may_ be
different values.

`countme`, as discussed in [dnf.conf(5)], is a value from 1 to 4 indicating
the "age" of the system, counted in _full_ weeks since the system was
first installed. The values are:

1. One week or less (0-1 weeks)
2. Up to one month (2-4 weeks)
3. Up to six months (5-24 weeks)
4. More than six months (25+ weeks)

These are defined in [libdnf/repo/Repo.cpp:COUNTME\_BUCKETS].

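The bucket boundaries can be expressed as a small lookup. This sketch follows the numbering in the [dnf.conf(5)] text quoted earlier, where week 1 is the week the system was installed; the names `BUCKET_STARTS` and `countme_bucket` are made up for this example, and the code is not a copy of what libdnf does.

```python
import bisect

# First week of buckets 2, 3 and 4, per the dnf.conf(5) description:
#   week 1 -> 1, weeks 2-4 -> 2, weeks 5-24 -> 3, weeks 25+ -> 4
BUCKET_STARTS = (2, 5, 25)


def countme_bucket(week):
    """Map a 1-based week of a system's life to its countme value (1-4)."""
    if week < 1:
        raise ValueError("week numbers start at 1 (the install week)")
    # Count how many bucket boundaries this week has reached or passed.
    return bisect.bisect_right(BUCKET_STARTS, week) + 1


for week in (1, 2, 4, 5, 24, 25, 100):
    print(week, countme_bucket(week))
```
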
## OK but how do we actually use it in Fedora?

Because the raw log data contains IP addresses and timestamps that could be
used to track or identify users, we run the parsing and counting inside private
parts of the Fedora infrastructure and only publish the anonymous aggregate
data.

In practice, this is a three-part process:

1. Run `countme-update-rawdb.sh` daily to parse log data into `rawdb`
   * `rawdb` is a SQLite database of structured data for each `countme` hit
   * Kept private since it contains IP addresses and timestamps
   * Typical log data: ~6GB/day
   * Typical parsing time: ~5min (Intel Core i7-6770HQ, 2.60GHz)
   * Typical rawdb size: ~8MB/day for F32; I'd guess keeping 1 year of data for
     3 concurrent releases would take about 10GB.
   * Retaining historical data lets us quickly recalculate counts if we
     discover significant errors due to misconfigured/malicious clients
2. Run `countme-update-totals.sh` to read `rawdb` and update `totalsdb`
   * Counts up hits for each week, grouped by:
     * System info: `os_name`, `os_version`, `os_variant`, `os_arch`, `sys_age`
     * Repo requested: `repo_tag`, `repo_arch`
   * Only generates data for weeks where we have complete log data
   * No timestamps or IP addresses
   * Typical parsing time: small, <=~5s
   * Typical totalsdb size: ~55KB/week (~700 rows/week) for F32
   * After update, (re)generate `totals.csv`
3. Publish updated `totals.db` and `totals.csv`
   * See https://data-analysis.fedoraproject.org/csv-reports/countme/
   * Might end up in different places/forms in the future

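The weekly grouping in step 2 amounts to a `GROUP BY` over the raw hits. The sketch below uses an in-memory SQLite database to show the idea; the table name `countme_raw`, the `timestamp` and `host` columns, and the sample rows are hypothetical, and the real schema and the scripts' queries may look different. It also skips the "complete weeks only" check.

```python
import sqlite3

WEEK0_START, WEEK_LEN = 345600, 604800

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for rawdb: one row per countme hit.
con.execute("""
    CREATE TABLE countme_raw (
        timestamp INTEGER, host TEXT,
        os_name TEXT, os_version TEXT, os_variant TEXT, os_arch TEXT,
        sys_age INTEGER, repo_tag TEXT, repo_arch TEXT
    )
""")
fedora32 = ("Fedora", "32", "workstation", "x86_64", 1,
            "fedora-modular-32", "x86_64")
con.executemany(
    "INSERT INTO countme_raw VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1585497868, "192.0.2.1") + fedora32,           # week 2620
        (1585497900, "192.0.2.2") + fedora32,           # week 2620
        (1585497868 + WEEK_LEN, "192.0.2.1") + fedora32,  # week 2621
    ],
)

# Aggregate per week; timestamps and hosts do not appear in the output.
totals = con.execute(
    """
    SELECT (timestamp - ?) / ? AS weeknum,
           os_name, os_version, os_variant, os_arch, sys_age,
           repo_tag, repo_arch,
           COUNT(*) AS hits
      FROM countme_raw
     GROUP BY weeknum, os_name, os_version, os_variant, os_arch,
              sys_age, repo_tag, repo_arch
     ORDER BY weeknum
    """,
    (WEEK0_START, WEEK_LEN),
).fetchall()

for row in totals:
    print(row)
```

Only the week number, the grouping columns, and the hit count survive the aggregation, which is what makes the totals safe to publish.
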
[^IPvBeefy]: Don't worry, 240.159.140.173 is a fake IP address. Actually,
it's the 4-byte UTF-8 encoding for 🌭, U+1F32D HOT DOG.

[Changes/DNF Better Counting]: https://fedoraproject.org/wiki/Changes/DNF_Better_Counting
[dnf.conf(5)]: https://dnf.readthedocs.io/en/latest/conf_ref.html
[/etc/os-release]: http://man7.org/linux/man-pages/man5/os-release.5.html
[mirrormanager]: https://github.com/fedora-infra/mirrormanager2
[mirrormanager2/lib/repomap.py]: https://github.com/fedora-infra/mirrormanager2/blob/master/mirrormanager2/lib/repomap.py
[libdnf commit d8d0984]: https://github.com/rpm-software-management/libdnf/commit/d8d0984
[libdnf/utils/os-release.cpp:getUserAgent()]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/utils/os-release.cpp#L108
[libdnf/repo/Repo.cpp:addCountmeFlag()]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/repo/Repo.cpp#L1051
[libdnf/repo/Repo.cpp:COUNTME\_BUCKETS]: https://github.com/rpm-software-management/libdnf/blob/0.47.0/libdnf/repo/Repo.cpp#L92