#9896 DNF better counting is broken
Closed: Upstream 2 years ago by kevin. Opened 2 years ago by mattdm.

Looks like data is not making it through the system? This graph basically says it all:

fedora-updates-timeseries-age-all-stacked.png

If this can be a fixed in a way that pulls in the old data, this isn't urgent. If we have it by mid-next month, that's fine. However, if it's losing data that won't be filled in later, I'd really like it to be fixed ASAP.

Thanks.


How I 'fixed' this the last 2 times.. was to run the programs over past logs in a 'force' mode.

./parse-access-log.py --progress --sqlite /var/lib/countme/raw.db --no-dup-check /mnt/fedora_stats/combined-http/2021/04/01/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/02/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/03/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/04/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/05/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/06/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/07/mirrors.fedoraproject.org-access.log

Doing it without dup-check it 'runs' but does not seem to add all the entries for dates so we get an 'undercount'. There will be a large drop for one week in late march where we did not get logs from proxies. The raw db is a large binary db so I do not know what its 'state' is and this maybe the problem.

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

Trying that parse-access-log.py now over the april logs.

Where is that image generated? I don't see it in the normal ones on https://data-analysis.fedoraproject.org/csv-reports/images/ ?

The picture is some program Matt has and not part of the data-analysis images. There will be a drop on the week of March 25 as we did not get logs from most servers for 4 days. however dates after April 1 should show up.

I forgot to say that I run the script as the countme user to keep permissions on dbs and such correct. After the db's are finished.. I run a

sh /usr/local/bin/countme-update-totals.sh --rawdb /var/lib/countme/raw.db --totals-db /var/lib/countme/totals.db --totals-csv /var/lib/countme/totals.csv

which should update the db that matt uses and the csv file I have used to see if things are working. If it doesn't then it will be a dev issue to work through the code

@kevin Yeah, this is from my own graphing code for the new DNF Better Counting. Those reports are from the Classic IP Addr Counting System. Here's a direct look at the database:

$ sqlite3 totals.db 
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> SELECT date(julianday('1970-01-05')+ weeknum * 7) as week,sum(hits) from countme_totals WHERE os_name="Fedora" AND os_variant = "workstation"  AND repo_tag LIKE "updates-released-%" AND weeknum>2654 GROUP BY weeknum;
2020-11-23|366463
2020-11-30|370701
2020-12-07|366514
2020-12-14|364404
2020-12-21|337926
2020-12-28|331814
2021-01-04|364301
2021-01-11|375153
2021-01-18|379698
2021-01-25|383402
2021-02-01|379734
2021-02-08|388700
2021-02-15|389459
2021-02-22|390081
2021-03-01|398500
2021-03-08|401790
2021-03-15|403854
2021-03-22|254365
2021-03-29|262491
2021-04-05|3
2021-04-12|2
sqlite> 

As you can see, we do have a drop on the week of March 25, but... rather than coming back after that, it just drops out.

Hmmm, in fact:

sqlite> select * from countme_totals where weeknum>2673
   ...> ;
2|2674|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64
2|2674|Fedora|32|generic|x86_64|1|fedora-32|x86_64
2|2674|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le
2|2674|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
2|2674|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64
2|2674|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64
2|2674|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le
2|2674|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le
12|2674|Fedora|35|silverblue|x86_64|1|rawhide|x86_64
2|2674|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64
119|2674|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64
119|2674|Fedora|33|coreos|x86_64|1|fedora-33|x86_64
2|2674|Fedora|35|silverblue|x86_64|2|rawhide|x86_64
1|2674|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
7|2674|Fedora|35|coreos|x86_64|1|rawhide|x86_64
1|2675|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le
1|2675|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le
1|2675|Fedora|32|generic|x86_64|1|fedora-32|x86_64
1|2675|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64
3|2675|Fedora|35|silverblue|x86_64|2|rawhide|x86_64
123|2675|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64
123|2675|Fedora|33|coreos|x86_64|1|fedora-33|x86_64
1|2675|Fedora|35|coreos|x86_64|1|rawhide|x86_64
3|2675|Fedora|35|silverblue|x86_64|1|rawhide|x86_64
1|2675|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64
1|2675|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64
1|2675|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
1|2675|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
1|2675|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le
1|2675|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64
44|2675|Fedora|34|silverblue|x86_64|1|updates-testing-f34|x86_64
72|2675|Fedora|34|silverblue|x86_64|1|updates-released-f34|x86_64
76|2675|Fedora|34|silverblue|x86_64|1|fedora-34|x86_64
32|2675|Fedora|34|coreos|x86_64|1|updates-testing-f34|x86_64
41|2675|Fedora|34|coreos|x86_64|1|updates-released-f34|x86_64
41|2675|Fedora|34|coreos|x86_64|1|fedora-34|x86_64
1|2675|Fedora|34|silverblue|x86_64|1|updates-testing-modular-f34|x86_64
2|2675|Fedora|34|iot|arm|1|updates-testing-f34|armhfp
2|2675|Fedora|34|iot|arm|1|updates-released-f34|armhfp
1|2675|Fedora|34|iot|arm|1|fedora-34|armhfp

^ those are ALL of the rows we've gotten for all of April.

Try now? I just ran the update-totals smooge noted above.

OK I am going to say this is a devel issue then. The code is running, but the db is not getting updated correctly. We had this happen once a long time ago and it needed a rerun of all the logs from 2020.
Out of the blue hypothesis: I am guessing the raw.db is 'trashed' somewhere and needs fixing and then whatever in the code which is causing the 'trash' needs to be found.

ok theroy wrong.. raw.db says its at least right somewhat

sqlite> PRAGMA integrity_check;
ok

OK so in /var/lib/countme I mv out the 'old' db and reran the commands completely;

$ ./countme-totals.py --update-from /var/lib/countme/raw.db --csv-dump /var/lib/countme/totals.csv --progress /var/lib/countme/totals.db
$ ls -l /var/www/html/csv-reports/countme/ /var/lib/countme/
/var/lib/countme/:
total 6584688
-rw-r--r--. 1 countme countme 6714486784 Apr 23 16:56 raw.db
-rw-rw-r--. 1 countme countme    7322886 Apr 23 17:20 totals.csv
-rw-r--r--. 1 countme countme    7081984 Apr 23 17:20 totals.db
-rw-r--r--. 1 countme countme    6733824 Mar 25 09:03 totals.db-old
-rw-r--r--. 1 countme countme    7086080 Apr 22 09:04 totals.db-old2

/var/www/html/csv-reports/countme/:
total 14072
-rw-rw-r--. 1 countme countme 7322881 Apr 23 16:38 totals.csv
-rw-r--r--. 1 countme countme 7086080 Apr 23 16:38 totals.db

The data in the file did not change for those last two weeks. The raw.db has the data in it

> SELECT * from countme_raw WHERE os_name="Fedora" AND os_variant = "workstation"  AND repo_tag LIKE "updates-released-%" AND date(julianday('2021-04-18')) limit 10;
1581323291|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-modular-f31|x86_64
1581345132|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-f31|x86_64
1581512575|77.20.132.93|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
1581521049|2604:3d08:1381:8800::93cf|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
1581521084|2604:3d08:1381:8800::93cf|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
1581589087|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-f31|x86_64
1581612017|2804:14c:4a:8950:3dc8:ff26:127c:8644|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
1581613651|2804:14c:4a:8950:80be:9ff8:4219:230e|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
1581615305|2804:14c:4a:8950:d401:6525:8ca7:fce|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
1581615337|2804:14c:4a:8950:d401:6525:8ca7:fce|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64

so the problem must be in countme-totals.py

And I found that my sql was wrong so I was counting dates in the raw.db incorrectly. There is a problem with counting after 2021-03-29 (if my improved query is correct) so the problem is in the parse-access-log program.

sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-03-28';
1807618
sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-03-29';
1728833
sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-04-01';
441117

Related to the changes for rpm-ostree maybe?

OK I have checked with the original log files to make sure that this wasn't a 'countme is right and we dont' have countme= in the logs'.

$ grep -c countme= */mirrors.fedoraproject.org-access.log 
01/mirrors.fedoraproject.org-access.log:277038
02/mirrors.fedoraproject.org-access.log:247856
03/mirrors.fedoraproject.org-access.log:192878
04/mirrors.fedoraproject.org-access.log:151194
05/mirrors.fedoraproject.org-access.log:143573
06/mirrors.fedoraproject.org-access.log:1130080
07/mirrors.fedoraproject.org-access.log:376237
08/mirrors.fedoraproject.org-access.log:302063
09/mirrors.fedoraproject.org-access.log:259353
10/mirrors.fedoraproject.org-access.log:236679

I am going to git revert the commits, rerun the logs and see if that fixes it.

OK I reran the countme with older code and it updated all the items and databases. So I believe the changes in commit 2872a824a04c8557500a5ea257fc7cac0a9907d5 may be causing this issue. My guess is the regex pattern:

@@ -103,8 +104,8 @@ LOG_PATTERN_FIELDS = {
 #
 # For more info on the User-Agent header, see RFC7231, Section 5.5.3:
 #   https://tools.ietf.org/html/rfc7231#section-5.5.3)
-LIBDNF_USER_AGENT_PATTERN = (
-    r'(?P<product>libdnf(?:/(?P<product_version>\S+))?)\s+'
+COUNTME_USER_AGENT_PATTERN = (
+    r'(?P<product>(libdnf(?:/(?P<product_version>\S+)))|rpm-ostree?)\s+'
     r'\('
       r'(?P<os_name>.*)\s'
       r'(?P<os_version>[0-9a-z._-]*?);\s'

I think I have applied a fix to the regex so that it will evaluate the logs.

Awesome. So, shall we close this for now?

Re-open if you see anything further.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Metadata Update from @mattdm:
- Issue status updated to: Open (was: Closed)

2 years ago
sqlite> select * from countme_totals where weeknum=2677;
160|2677|Fedora|34|silverblue|x86_64|1|updates-released-f34|x86_64
161|2677|Fedora|34|silverblue|x86_64|1|fedora-34|x86_64
1|2677|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64
92|2677|Fedora|34|silverblue|x86_64|2|fedora-34|x86_64
90|2677|Fedora|34|silverblue|x86_64|2|updates-released-f34|x86_64
2|2677|Fedora|35|silverblue|x86_64|3|rawhide|x86_64
3|2677|Fedora|34|iot|aarch64|1|updates-released-f34|aarch64
3|2677|Fedora|34|iot|aarch64|1|fedora-34|aarch64
1|2677|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
120|2677|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64
120|2677|Fedora|33|coreos|x86_64|1|fedora-33|x86_64
1|2677|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64
1|2677|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le
1|2677|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le
1|2677|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64
1|2677|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64
1|2677|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64
1|2677|Fedora|32|generic|x86_64|1|fedora-32|x86_64
1|2677|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le
48|2677|Fedora|34|coreos|x86_64|1|updates-released-f34|x86_64
48|2677|Fedora|34|coreos|x86_64|1|fedora-34|x86_64
2|2677|Fedora|34|silverblue|x86_64|1|fedora-modular-34|x86_64
11|2677|Fedora|35|silverblue|x86_64|1|rawhide|x86_64
1|2677|Fedora|34|silverblue|x86_64|2|updates-testing-modular-f34|x86_64
1|2677|Fedora|34|silverblue|x86_64|1|updates-testing-modular-f34|x86_64
1|2677|Fedora|34|silverblue|x86_64|1|updates-testing-f34|x86_64
1|2677|Fedora|34|silverblue|x86_64|1|updates-released-modular-f34|x86_64
1|2677|Fedora|34|iot|x86_64|2|fedora-34|x86_64
1|2677|Fedora|34|iot|x86_64|2|updates-released-f34|x86_64
3|2677|Fedora|34|silverblue|x86_64|2|updates-testing-f34|x86_64
1|2677|Fedora|35|coreos|x86_64|1|rawhide|x86_64
3|2677|Fedora|34|iot|x86_64|1|fedora-34|x86_64
3|2677|Fedora|34|iot|x86_64|1|updates-released-f34|x86_64
1|2677|Fedora|34|iot|arm|2|fedora-34|armhfp
1|2677|Fedora|34|iot|arm|2|updates-released-f34|armhfp

Any guesses? It looks like the ones that are coming through are mostly rpm-ostree (but maybe not all of those, even?). I'm eager to see the F34 numbers!

So I have rerun all the jobs with the code and it still counted nothing. I have removed the rpm-ostree regex from the code and am trying one more time to see if that fixes it. If it does then this needs to be a dev ticket/mini-initiative to fix the code properly.

I am not sure what is causing this at all. I am going to reconstruct the raw.db in a different db and see if that is the issue. This will probably take all weekend.

@smooge Thanks, I appreciate it. This is all going to be worth it :)

I have rerun all the data from 2020-01-01 until 2021-05-05?. It seems to have changed a lot of the csv lines so I am guessing the counts have been off or there are other problems in the code. I think we are at 'need a developer to fix'.

DNF worked for 1 week without breaking.

Yeah. The numbers seem lower this week than last week for some reason, but I'm not sure if that's something broken or just what the data looks like. I guess we'll see!

Not sure myself. I recreated the databases from scratch using just the current code which was used this week on the logs. I expect that this still needs someone to look at the code long term.

Metadata Update from @smooge:
- Issue assigned to smooge

2 years ago

We could but I am worried it would sit around not getting touched.

A mini-initiative means there are people who know the project and can work on this while doing their own regular day to day jobs. There needs to be someone on the team who is pushing it forward.

An initiative is meant for when there are not people who know what the project and/or the work is going to take out from their day to day jobs but mgmt can say 'that is ok, this other thing is more important than XYZ thing group B wanted'. It also means that it gets reviewed and pushed to conclusion.

There is no one of the CPE team who knows this project or its code. Also currently most of CPE's people are working CentOS Stream work. That would push it more towards being an initiative to get a focus on it.

I will see what I can look at over the weekend but I think for long term work statistics needs to be a 'real person' job versus volunteer.

OK I don't know if things are 'broken'. The raw data for the last week has been 'weird'

[smooge@log01 05][PROD-IAD2]$ for i in *; do echo -n "$i "; grep -c countme= $i/mirrors.fedoraproject.org-access.log; done                                                                          
01 285863
02 211358
03 193647
04 1622168
05 403975
06 332682
07 298408
08 271678
09 201907
10 193090
11 1662151
12 410051
13 347021
14 294174
15 71515
16 121151
17 200247
18 979033

The 15th and 16th are well below normal. The raw files are the usual sizes but the number of counts from clients took a nose dive. I do not have the sql guruhood to compare this to what the DB says.

For what it's worth there's weird dip in Old Counting Method numbers as well.

So not sure what is up.. the xz sizes of the days are the following:

14 507388       total
15 501832       total
16 476472       total

but the combined logs are much smaller

8.0G    14/mirrors.fedoraproject.org-access.log
2.2G    15/mirrors.fedoraproject.org-access.log
4.5G    16/mirrors.fedoraproject.org-access.log

OK something went wrong on the 15th and 16th for combining logs so we logs 6 gigs of lines. Redoing the merge did not give any errors and a new file. I will see about reprocessing the data for those days.

OK I have remerged the files and I have rerun the countme on those dates. The totals.db and totalscvs show new data so please check to see if this 'fixes' things. I will try to fix the old data also.

Thank you so much @smooge. I think we still need some CPE official attention on this, because you fixing it every week or so is not sustainable. :)

Yeah.. it definitely is still broken. I just looked over my simple read the EPEL ones and the counts for last week are 1/2 of what they were the week before.
There seem to be multiple problems going on with the log combining, the log parsing and the raw and total databases. I don't what they all are, and my 'fix' is currently to rebuild the raw db from scratch which is doubly unsustainable as it takes 4-6 days to do so.

OK ran various 'test' runs and the bug seems to be something to do with the raw.db where all data is put in first before being condensed into the totals.db. If I start with a new db for the last week and then run the totals I get 'correct' looking data counts.

I then made a copy of the known raw.db and ran the logs for the last 1 week again on it telling it to not check if it had an entry with it.. it did fix that week but it also 'fix' the week previous by adding some entries to it. At this point we are going to make this an development initiative with help from someone who really understands DB's. Either the writing to the db is getting corrupted or it is too large for what it is doing (the file is 8.3 GB in size now). It may be that there are some sort of cleaning and reordering which is needed to 'regularly' keep it nice.

In any case, at this point I am going to say this will be continually 'broke'

We now have a iniative working on this... I'm gonna go ahead and close this now. Feel free to re-open if there's some need to track things here.

Metadata Update from @kevin:
- Issue close_status updated to: Upstream
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.

Metadata
Boards 1
ops Status: Done