Issue #11204: DNF countme is broken for the week 2023-03-13->2023-03-19 - fedora-infrastructure

Closed: Fixed a year ago by kevin. Opened a year ago by smooge.

Normally the service updates on Thursday with the previous week's data. Today the data did not update, and I am not sure what is 'broken' in any of the databases. The CentOS database did update, so it is probably an issue with the base DB, /var/lib/countme/totals.db, or with /var/lib/countme/raw.db.

@mattdm I am not sure what is broken or what can be done to fix it. I am putting this ticket in so it can be tracked.

smooge commented a year ago

Update on the problem:

Usually totals.db and totals.csv are updated with the previous week's data on a Thursday. This week, however, they were not updated until the Friday run. I don't see anything in the logs indicating that dates were missed, or anything else that might have caused the 24-hour delay.

Metadata Update from @humaton:
- Issue tagged with: low-gain, medium-trouble, ops

a year ago

mattdm commented a year ago

Huh, weird. Curious to see what happens this week!

kevin commented a year ago

So, what action should we take here? Or it's working, but behaving differently?

mattdm commented a year ago

@kevin Seems like something took a really, really long time to process? Might be a symptom of something else wrong.

smooge commented a year ago

I can close this and if the problem occurs again on Thursday reopen it for a developer to look at it? I went looking at it a bit and could not find an 'ops' side which would have caused the problem.

Metadata Update from @smooge:
- Issue untagged with: ops
- Issue tagged with: dev

a year ago

kevin commented a year ago

Sounds reasonable. Also it might be something that could be looked at by the folks working on it this next quarter.

Metadata Update from @smooge:
- Issue close_status updated to: Initiative Worthy
- Issue status updated to: Closed (was: Open)

a year ago

mattdm commented a year ago

It updated today, but... brokenly: [attached graph]

Metadata Update from @smooge:
- Issue status updated to: Open (was: Closed)

a year ago

smooge commented a year ago

Reopened. I expect that there are multiple issues going on here which will need to be dealt with. My guesses are the following:
a. proxy logs did not get synced?
b. raw.db has a problem and the last month will need to be rerun
c. something new.

I am not able to help on this currently.

smooge commented a year ago

OK looked at the logs in /mnt/fedora_stats/combined-http/2023/03/*/mirrors.fedoraproject.org-access.log and they are all there and similar sizes. So I am going with b or c.

mattdm commented a year ago

Thanks Smooge. I'll escalate this with CPE.

james commented a year ago

I can start looking at this pretty soon, but I'm not sure what I need access to ... and I don't even seem to be able to assign myself to this ticket :).

smooge commented a year ago

you will need access to
systems:
- bastion01
- batcave01
- log01
pagure:
https://pagure.io/mirrors-countme
https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis

kevin commented a year ago

I've added permissions for you now, so you should be able to assign/etc.

smooge commented a year ago

As far as I can tell, some sort of 'problem' creeps into the raw.db file and then causes the second job, which updates totals.db, to fail. I normally have to take a 'fresh' date and create a new raw.db by:

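# Become the countme user; the remaining commands are run in that shell.
su -s /bin/bash countme
cd /var/lib/countme/
# Keep the suspect databases around for later inspection.
mv raw.db "raw.db-broke.$(date -I)"
cp totals.db "totals.db-broke.$(date -I)"
rawdb="/var/lib/countme/raw.db"
totsdb="/var/lib/countme/totals.db"
totscsv="/var/lib/countme/totals.csv"
# Log tree as referenced earlier in this ticket.
logdir="/mnt/fedora_stats/combined-http"
for year in 2023; do
    # Month directories are zero-padded (e.g. 2023/03), like the days from seq -w.
    for month in 01 02 03 04; do
        for day in $(seq -w 31); do
            logfile="${logdir}/${year}/${month}/${day}/mirrors.fedoraproject.org-access.log"
            # Skip days that do not exist or have no log yet.
            if [[ -f "${logfile}" ]]; then
                parse-access-log.py --progress --sqlite "${rawdb}" "${logfile}"
            fi
        done
    done
done
# Rebuild the totals from the freshly imported raw.db.
bash countme-update-totals.sh --rawdb "${rawdb}" --totals-db "${totsdb}" --totals-csv "${totscsv}" --progress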

That said, the long-term fix is to look at the existing 9.4GB raw.db file and figure out why it is broken.
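
The rebuild loop above can also be sketched in Python. This is only an illustrative sketch, not part of mirrors-countme: the `daily_log_paths` helper is hypothetical, and the log root is assumed to be the /mnt/fedora_stats/combined-http tree mentioned earlier in this ticket. Walking real calendar dates avoids probing days that do not exist (such as 02/31) and keeps the directory names zero-padded.

```python
from datetime import date, timedelta
from pathlib import Path

# Assumption: log tree location and file name as quoted earlier in this ticket.
LOG_ROOT = Path("/mnt/fedora_stats/combined-http")
LOG_NAME = "mirrors.fedoraproject.org-access.log"


def daily_log_paths(start: date, end: date, root: Path = LOG_ROOT) -> list[Path]:
    """Return one candidate log path per calendar day in [start, end]."""
    paths = []
    day = start
    while day <= end:
        # Year/month/day directories are zero-padded, e.g. 2023/03/07.
        paths.append(root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / LOG_NAME)
        day += timedelta(days=1)
    return paths


if __name__ == "__main__":
    # Print only the logs that actually exist; each one would then be fed
    # to parse-access-log.py as in the shell loop.
    for path in daily_log_paths(date(2023, 1, 1), date(2023, 4, 30)):
        if path.is_file():
            print(path)
```

The printed paths could be passed one at a time to the same parse-access-log.py invocation used in the shell version.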

Edited a year ago by smooge

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

a year ago

smooge commented a year ago

Edited the text, as I don't remember whether I used a fresh one or reused the old one.

mattdm commented a year ago

Please retag this as something other than "low-gain".
As discussed elsewhere, infinitely-growing raw.db could / should be refactored away.

Metadata Update from @kevin:
- Issue untagged with: low-gain
- Issue tagged with: high-gain

a year ago

james commented a year ago

So I've now finished the reimport for this year, not sure how to check the graphs etc.

smooge commented a year ago

Ok so what I did in the past was do a fresh import to a new raw.db and then did a

countme-totals.py --update-from /var/lib/countme/raw.db --csv-dump /var/lib/countme/totals1.csv --progress /var/lib/countme/totals1.db

to compare what it saw with the old /var/lib/countme/totals.db and /var/lib/countme/totals.csv. If the code I was hacking 'worked', then totals1.csv should have the results for the bad week fixed.

Since it had been a while since I last did this, I wanted to test that it was still valid. I started this run this morning as an example for people to look at later.
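
For the comparison step, a row-level diff of the two CSV files is one option. The sketch below is an assumption about how that comparison could be done, not how it was actually done here: `row_set`, `diff_rows` and `diff_files` are hypothetical helpers, and nothing is assumed about the column layout of totals.csv beyond it being plain CSV.

```python
import csv
from typing import Iterable


def row_set(lines: Iterable[str]) -> set[tuple[str, ...]]:
    """Parse CSV lines into a set of row tuples (the header row is included)."""
    return {tuple(row) for row in csv.reader(lines)}


def diff_rows(old_lines: Iterable[str], new_lines: Iterable[str]):
    """Return (rows only in old, rows only in new), each as a sorted list."""
    old_rows, new_rows = row_set(old_lines), row_set(new_lines)
    return sorted(old_rows - new_rows), sorted(new_rows - old_rows)


def diff_files(old_path: str, new_path: str):
    """Same as diff_rows, but reading the two CSV files from disk."""
    with open(old_path, newline="") as old, open(new_path, newline="") as new:
        return diff_rows(old, new)
```

Calling `diff_files("/var/lib/countme/totals.csv", "/var/lib/countme/totals1.csv")` would list the rows that differ; if the reimport changed the bad week, its rows should show up in both lists with different counts.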

smooge commented a year ago

Ok so there is no difference in the data for the week of 2023-03-20 which was the problematic week before.

===== Fedora  Base Stats =====
2023-03-13   fedora-3.  x86_64        398354
2023-03-13   fedora-3.  aarch64        10825
2023-03-13   fedora-3.  ppc64le         1759
2023-03-13   fedora-3.  s390x             74
===== Fedora  Base Stats =====
2023-03-20   fedora-3.  x86_64        167674
2023-03-20   fedora-3.  aarch64         6936
2023-03-20   fedora-3.  ppc64le         1264
2023-03-20   fedora-3.  s390x             10
===== Fedora  Base Stats =====
2023-03-27   fedora-3.  x86_64        401520
2023-03-27   fedora-3.  aarch64        11532
2023-03-27   fedora-3.  ppc64le         1694
2023-03-27   fedora-3.  s390x             78

So something is going on for that 'week' of data. The logs for the month all look the same size, and the centos.org CSV files for the equivalent period do not show a dip.
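
One way to make the dip explicit is to compare each week against the median of the weeks around it. A minimal sketch using the x86_64 figures above; the `find_dips` helper and the 0.6 threshold are illustrative assumptions, not part of the countme tooling.

```python
from statistics import median

# x86_64 "fedora-3." totals per week, copied from the stats above.
WEEKLY_X86_64 = {
    "2023-03-13": 398354,
    "2023-03-20": 167674,
    "2023-03-27": 401520,
}


def find_dips(weekly: dict[str, int], threshold: float = 0.6) -> list[str]:
    """Return the weeks whose count is below threshold * median of all weeks."""
    baseline = median(weekly.values())
    return [week for week, count in weekly.items() if count < threshold * baseline]


print(find_dips(WEEKLY_X86_64))  # ['2023-03-20']
```

With only three weeks the median is a crude baseline; a longer window of weeks would make the check less sensitive to a single outlier.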

smooge commented a year ago

OK, the 'raw' countme counts for that week look like they should have been higher:

$ for i in */mirrors.fedoraproject.org-access.log; do echo -n $i": "
> grep -c countme= $i
> done
01/mirrors.fedoraproject.org-access.log: 693529
02/mirrors.fedoraproject.org-access.log: 609849
03/mirrors.fedoraproject.org-access.log: 560413
04/mirrors.fedoraproject.org-access.log: 482537
05/mirrors.fedoraproject.org-access.log: 388629
06/mirrors.fedoraproject.org-access.log: 364415
07/mirrors.fedoraproject.org-access.log: 3299640
08/mirrors.fedoraproject.org-access.log: 670951
09/mirrors.fedoraproject.org-access.log: 570565
10/mirrors.fedoraproject.org-access.log: 537646
11/mirrors.fedoraproject.org-access.log: 488814
12/mirrors.fedoraproject.org-access.log: 389636
13/mirrors.fedoraproject.org-access.log: 376449
14/mirrors.fedoraproject.org-access.log: 3331590
15/mirrors.fedoraproject.org-access.log: 692252
16/mirrors.fedoraproject.org-access.log: 583591
17/mirrors.fedoraproject.org-access.log: 573738
18/mirrors.fedoraproject.org-access.log: 518574
19/mirrors.fedoraproject.org-access.log: 412978
20/mirrors.fedoraproject.org-access.log: 392952
21/mirrors.fedoraproject.org-access.log: 3363989
22/mirrors.fedoraproject.org-access.log: 708976
23/mirrors.fedoraproject.org-access.log: 635738
24/mirrors.fedoraproject.org-access.log: 583137
25/mirrors.fedoraproject.org-access.log: 546008
26/mirrors.fedoraproject.org-access.log: 412433
27/mirrors.fedoraproject.org-access.log: 389356
28/mirrors.fedoraproject.org-access.log: 3382676
29/mirrors.fedoraproject.org-access.log: 719028
30/mirrors.fedoraproject.org-access.log: 558641
31/mirrors.fedoraproject.org-access.log: 488492

[The logs cover the day before, so the one labeled 07 really covers 2023-03-06.] The per-week numbers are all fairly similar, so something else is happening to that week's data in raw.db. Sadly, I don't know what that is.
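
To back up the "fairly similar" observation, the daily grep counts can be summed in 7-file windows. A small sketch, with the counts copied from the output above; the `weekly_sums` helper is hypothetical, and the windows are assumed to start at the file with the large weekly spike (07, 14, 21).

```python
# countme= hits per daily log file for 2023-03 (files 07 through 27),
# copied from the grep output above.
DAILY_HITS = {
    7: 3299640, 8: 670951, 9: 570565, 10: 537646, 11: 488814, 12: 389636, 13: 376449,
    14: 3331590, 15: 692252, 16: 583591, 17: 573738, 18: 518574, 19: 412978, 20: 392952,
    21: 3363989, 22: 708976, 23: 635738, 24: 583137, 25: 546008, 26: 412433, 27: 389356,
}


def weekly_sums(daily: dict[int, int], first_day: int, weeks: int) -> dict[int, int]:
    """Sum counts over consecutive 7-file windows, keyed by the first file of each window."""
    return {
        start: sum(daily[day] for day in range(start, start + 7))
        for start in range(first_day, first_day + 7 * weeks, 7)
    }


for start, total in weekly_sums(DAILY_HITS, first_day=7, weeks=3).items():
    print(f"files {start:02d}-{start + 6:02d}: {total}")
```

The three window totals come out within about 5% of each other, which matches the point that the raw logs show no dip for that week.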

james commented a year ago

Just to double check I did:

for i in */mirrors.fedoraproject.org-access.log; do echo -n $i": "
 grep countme= $i | grep repo=fedora-3 | grep -c arch=x86_64
done

And the result was:

13/mirrors.fedoraproject.org-access.log: 33985
14/mirrors.fedoraproject.org-access.log: 275058
15/mirrors.fedoraproject.org-access.log: 85379
16/mirrors.fedoraproject.org-access.log: 65477
17/mirrors.fedoraproject.org-access.log: 54937
18/mirrors.fedoraproject.org-access.log: 47494
19/mirrors.fedoraproject.org-access.log: 38657
20/mirrors.fedoraproject.org-access.log: 35773
21/mirrors.fedoraproject.org-access.log: 275680
22/mirrors.fedoraproject.org-access.log: 85915
23/mirrors.fedoraproject.org-access.log: 66847
24/mirrors.fedoraproject.org-access.log: 55650
25/mirrors.fedoraproject.org-access.log: 53508
26/mirrors.fedoraproject.org-access.log: 38929

That sums to 600,987 for the week of log files starting on the 13th and 612,302 for the week starting on the 20th.

Edited a year ago by james

james commented a year ago

And the triple check:

time_t $(( 1581292800+(604800*162) ))
Sun Mar 19 20:00:00 2023
Mon Mar 20 00:00:00 2023 GMT

RAW.db:
sqlite> select count(*) from countme_raw where timestamp >= (1581292800+(604800*161)) AND timestamp <= (1581292800+(604800*162));
6483599
sqlite> select count(*) from countme_raw where timestamp >= (1581292800+(604800*162)) AND timestamp <= (1581292800+(604800*163));
3262313
sqlite> select count(*) from countme_raw where timestamp >= (1581292800+(604800*163)) AND timestamp <= (1581292800+(604800*164));
6197846
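
The same check can be scripted. The sketch below assumes only what the queries above show: a `countme_raw` table with a `timestamp` column, the epoch constant 1581292800 (Mon Feb 10 00:00:00 2020 UTC) and 604800-second weeks. The helper names are hypothetical. Unlike the queries above, it uses a half-open range so that a row sitting exactly on a week boundary is counted in one week only.

```python
import sqlite3
from datetime import datetime, timezone

COUNTME_EPOCH = 1581292800  # Mon Feb 10 00:00:00 2020 UTC, as used in the queries above
WEEK_SECONDS = 604800       # 7 * 24 * 60 * 60


def week_bounds(weeknum: int) -> tuple[int, int]:
    """Return the (start, end) Unix timestamps of a countme week; end is exclusive."""
    start = COUNTME_EPOCH + WEEK_SECONDS * weeknum
    return start, start + WEEK_SECONDS


def week_start_date(weeknum: int) -> datetime:
    """Return the UTC datetime at which the given week begins."""
    return datetime.fromtimestamp(week_bounds(weeknum)[0], tz=timezone.utc)


def count_week(conn: sqlite3.Connection, weeknum: int) -> int:
    """Count countme_raw rows whose timestamp falls inside the given week."""
    start, end = week_bounds(weeknum)
    row = conn.execute(
        "SELECT COUNT(*) FROM countme_raw WHERE timestamp >= ? AND timestamp < ?",
        (start, end),
    ).fetchone()
    return row[0]


print(week_start_date(162))  # 2023-03-20 00:00:00+00:00
```

Against the real database this would be used as `count_week(sqlite3.connect("/var/lib/countme/raw.db"), 162)`, repeated for the neighbouring week numbers to compare them.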

james commented a year ago

Ok, I think I've found the problem: https://pagure.io/mirrors-countme/pull-request/60

james commented a year ago

New stats.

===== Fedora  Base Stats =====
2023-03-13   fedora-3.  x86_64        397174
2023-03-13   fedora-3.  aarch64        10798
2023-03-13   fedora-3.  ppc64le         1759
2023-03-13   fedora-3.  s390x             74
===== Fedora  Base Stats =====
2023-03-20   fedora-3.  x86_64        402814
2023-03-20   fedora-3.  aarch64        11604
2023-03-20   fedora-3.  ppc64le         1703
2023-03-20   fedora-3.  s390x             72

mattdm commented a year ago

Current week's run seems to have the missing data, but is instead missing everything before 2023. :)

james commented a year ago

I manually merged the old data back in to totals.csv, and did the copy to the public locations.

I assume raw.db doesn't need all the old data (even though we are copying that file publicly too).

mattdm commented a year ago

> I manually merged the old data back in to totals.csv, and did the copy to the public locations.

Oh -- I don't use or look at totals.csv at all. I use totals.db.

mattdm commented a year ago

Anyway -- I see you have fixed it. Thanks @james!

kevin commented a year ago

Great. Let's close this then?

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

Metadata
- Assignee: james
- Tags: high-gain, medium-trouble, dev
- Blocking: None
- Depending on: None
- Priority: Waiting on Assignee
- Boards: dev (Status: Backlog)
- Attachments: 2023-03-26-fedora_updates_systems-timeseries-line-... (attached a year ago)
