#105 GSoC Idea - Centralized metrics for Fedora
Closed: Moved 2 years ago Opened 2 years ago by skamath.

Problem

Right now, metrics collection in CommOps is inefficient and requires a lot of manual work. Metrics for various events, FAS groups, and users are collected by scripts that query datagrepper and return results. Writing a new script each time is tedious, and querying datagrepper for the same data every time is redundant and time-consuming.

Example of current statistics generation : https://github.com/bee2502/fedora-stats-tools (Lots of hack-y scripts)
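
For context, each of those scripts ends up paging through datagrepper's `/raw` endpoint. Here is a minimal sketch of building one such query. The base URL is assumed to be the public datagrepper instance, and `build_query_url` is a hypothetical helper, not part of any Fedora tool. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the public datagrepper instance.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"


def build_query_url(user, delta_seconds, page=1, rows_per_page=100):
    """Build the URL for one page of a per-user datagrepper query.

    Hypothetical helper for illustration only.
    """
    params = [
        ("user", user),
        ("delta", delta_seconds),
        ("rows_per_page", rows_per_page),
        ("page", page),
    ]
    return DATAGREPPER_RAW + "?" + urlencode(params)


# One week of activity for one user, first page only.
url = build_query_url("skamath", 7 * 24 * 3600)
```

A report has to repeat this for every user and every page, which is where the redundant load on datagrepper comes from.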

Proposed Solution

Hack on statscache to build a central metrics generation system for Fedora, with handy features for pulling statistics. statscache consumes all the messages as they arrive and does not query datagrepper every single time, which makes it more efficient. By building on top of statscache, we can reduce the number of scripts required to gather metrics to almost zero.
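
To illustrate the difference, here is a rough sketch of the "consume once, keep running totals" idea. This is not the statscache plugin API: `MessageCounter`, the `usernames` key, and the topic strings are all assumptions made up for the example.

```python
from collections import Counter


class MessageCounter:
    """Hypothetical incremental aggregator; NOT the statscache plugin API."""

    def __init__(self):
        self.per_user = Counter()

    def handle(self, message):
        # Update the running totals as each message arrives,
        # instead of re-querying the full history later.
        for username in message.get("usernames", []):
            self.per_user[username] += 1

    def top(self, n):
        return self.per_user.most_common(n)


counter = MessageCounter()
for msg in [
    {"topic": "example.topic.a", "usernames": ["skamath"]},
    {"topic": "example.topic.b", "usernames": ["bee2502", "skamath"]},
]:
    counter.handle(msg)
```

Answering "who was most active" is then a lookup on `counter.top(1)` rather than a fresh crawl of the message history.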

Nice to have features :

  • Statistics by FAS group.
  • Per user statistics.
  • Statistics of users holding a badge (Useful for event statistics)
  • Exporting of stats in various formats (JSON, HTML, CSV, etc)
  • Date based filtering for all statistics (Useful for generating reports)
  • Graphs for statistics
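
As a rough idea of what date filtering and export could look like, here is a small sketch. The row shape (`user`, `day`, `count`) and the function names are assumptions for illustration. Only JSON and CSV are covered.

```python
import csv
import io
import json
from datetime import date


def filter_by_date(rows, start, end):
    """Keep rows whose 'day' falls within [start, end], inclusive."""
    return [row for row in rows if start <= row["day"] <= end]


def export(rows, fmt):
    """Serialize rows as 'json' or 'csv'; other formats are left out of this sketch."""
    plain = [
        {"user": r["user"], "day": r["day"].isoformat(), "count": r["count"]}
        for r in rows
    ]
    if fmt == "json":
        return json.dumps(plain, sort_keys=True)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["user", "day", "count"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(plain)
        return buffer.getvalue()
    raise ValueError("unsupported format: " + fmt)


rows = [
    {"user": "skamath", "day": date(2017, 3, 1), "count": 4},
    {"user": "bee2502", "day": date(2017, 3, 9), "count": 2},
]
first_week = filter_by_date(rows, date(2017, 3, 1), date(2017, 3, 7))
```

Keeping the filter separate from the serializer means every statistic gets every export format for free.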

I added the above-mentioned features based on datagrepper in the tool I had developed last summer. This can be used as a base for statscache integration.

Sample data generated using the tool : fedstats-data

Final Deliverables

  • Custom statscache plugins for various metrics.
  • A webapp to run queries (with a nice, minimal interface - based on statscache)
  • API for all the queries functions
  • Well-Documented code.

This is my initial proposal and we can definitely build upon this. Thoughts?
CC: @bee2502 @bex


Is statscache developed enough to support a webapp ? Also, who will mentor this project ?

Instead, a GSoC project to build plugins for basic metrics in statscache and statscache related development along with documentation could be an alternative. Someone from Infra team and hopefully @pingou can provide more info about this.

Is statscache developed enough to support a webapp ? Also, who will mentor this project ?

Yes. The whole point of statscache is to make metrics easier. Quoting the README :
"It is cool, but insufficient for some more advanced reporting and analysis that we would like to do."

Due to some confusion in my project last year, I couldn't work on statscache, but I have set it up and have explored the plugins and the platform quite a lot. I believe what I mentioned is quite doable. Even if the webapp is not, it is very possible to add API features for the features I mentioned.

Re : Mentorship, people working on metrics should be able to mentor this project. @bt0dotninja said he could help with the metrics part of stuff. I pinged sayan about this a couple of days back but looks like he is occupied with other projects atm. @bee2502, are you planning to mentor this time as well?

Instead, a GSoC project to build plugins for basic metrics in statscache and statscache related development along with documentation could be an alternative. Someone from Infra team and hopefully @pingou can provide more info about this.

I'd like to see the proposed mentors weigh in on this. I agree that we need better statistics, and in fact was having a conversation about this yesterday. I wonder if a better place to start might be to work on recreating the work that @mattdm does for his presentation in a stable tool like this? He has some sample code and defined metrics. I would not like to see this GSoC project get bogged down defining metrics instead of writing code.

After discussing this idea with @bee2502 , we decided to split the idea based on the priority of tasks :

High Priority

  • Custom statscache plugins for metrics
  • Post event metrics from badges.
  • Metrics for FAS groups.
  • Integrate fedora-stats-tool into statscache?
  • Ability to generate weekly/monthly/quarterly/yearly reports.
  • Ability to export statistics in different formats.
  • Well-documented code.

Medium Priority

  • A nice web interface for generating the statistics (using Flask/Bottle?)
  • Automatic statistics generation and storage.
  • Deploy the final deliverable on infracloud.
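
For the weekly/monthly/quarterly/yearly reports, the core of it is just mapping each date to a bucket. A minimal sketch follows. The label formats and function names are my own assumptions, and weeks follow the ISO calendar.

```python
from collections import Counter
from datetime import date


def period_key(day, period):
    """Map a date to a report bucket label."""
    if period == "weekly":
        iso_year, iso_week, _ = day.isocalendar()
        return "%d-W%02d" % (iso_year, iso_week)
    if period == "monthly":
        return "%d-%02d" % (day.year, day.month)
    if period == "quarterly":
        return "%d-Q%d" % (day.year, (day.month - 1) // 3 + 1)
    if period == "yearly":
        return str(day.year)
    raise ValueError("unknown period: " + period)


def report(days, period):
    """Count events per bucket for the chosen period."""
    return Counter(period_key(day, period) for day in days)


days = [date(2017, 3, 1), date(2017, 3, 20), date(2017, 4, 2)]
monthly = report(days, "monthly")
```

The same event list can be re-bucketed for any period without touching the data source again.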

@bee2502 Please add to this if I missed something :)

While I'm not labeled as a statscache mentor, I'd be willing to help with this project as well, seeing as we have an abundance of Python talent in the mentor pool this year. @skamath @bex

After discussing this idea with bee2502 , we decided to split the idea based on the priority of tasks :
High Priority
Custom statscache plugins for metrics

Which plugins, exactly?

Post event metrics from badges.
Metrics for FAS groups.
Integrate fedora-stats-tool into statscache?
Ability to generate weekly/monthly/quarterly/yearly reports.
Ability to export statistics in different formats.
Well-documented code.
Medium Priority
A nice web interface for generating the statistics (using Flask/Bottle? )

An interface for displaying the stats in statscache already exists. Is the plan to build a project-specific dashboard on top of statscache?

Automatic statistics generation and storage.

Can you elaborate on this?

Deploy the final deliverable on infracloud.

Why deploy it to infracloud rather than production?

The plan seems too abstract to me. Can you write up the implementation in detail, covering what you are planning and how?

After discussing this idea with bee2502 , we decided to split the idea based on the priority of tasks :
High Priority
Custom statscache plugins for metrics

Which all plugins?
As mentioned earlier, some of the plugins required would be :

  • Stats by FAS group : We need to be able to generate statistics for a FAS group as a whole. This can be used to generate reports for a SIG/group easily when necessary.

  • Stats by badge : This can be useful while generating event reports. This plugin needs to find all the recipients of, say, badge X and show the activity of those users over a given time delta.

  • FAS account trends : This plugin will be really useful for tracking newcomer trends in Fedora. It can also help CommOps/Join track newcomer retention rates. Please take a look at Slide 6 of the presentation [1] by @bee2502 for more information on this.

Also, we should integrate the stats required by @mattdm, as you mentioned on the statscache ticket [2].
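
To make the "stats by badge" idea a bit more concrete, here is a rough sketch of the logic only. The input shapes (pairs of username and badge name, pairs of username and timestamp), the badge name, and the function name are assumptions. A real plugin would get this data from the message stream.

```python
from collections import Counter
from datetime import datetime, timedelta


def activity_of_badge_holders(badge_awards, messages, badge, now, delta):
    """Count recent messages per user for everyone holding `badge`.

    badge_awards: iterable of (username, badge_name) pairs (assumed shape).
    messages: iterable of (username, timestamp) pairs (assumed shape).
    """
    holders = {user for user, name in badge_awards if name == badge}
    cutoff = now - delta
    # Start every holder at zero so inactive recipients still show up.
    counts = Counter({user: 0 for user in holders})
    for user, timestamp in messages:
        if user in holders and cutoff <= timestamp <= now:
            counts[user] += 1
    return dict(counts)


awards = [("alice", "example-badge"), ("bob", "example-badge"), ("carol", "other")]
messages = [
    ("alice", datetime(2017, 3, 9)),
    ("alice", datetime(2017, 3, 1)),
    ("carol", datetime(2017, 3, 9)),
]
stats = activity_of_badge_holders(
    awards, messages, "example-badge", datetime(2017, 3, 10), timedelta(days=7)
)
```

Recipients with no activity in the window are reported with a count of 0, which matters for event reports where "attended but went quiet" is itself a data point.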

Post event metrics from badges.
Metrics for FAS groups.
Integrate fedora-stats-tool into statscache?
Ability to generate weekly/monthly/quarterly/yearly reports.
Ability to export statistics in different formats.
Well-documented code.
Medium Priority
A nice web interface for generating the statistics (using Flask/Bottle? )

An interface for displaying the stats in statscache already exists. Is the plan to build project specific dashboard here on top of statscache?

Yes, but IMO the interface is really basic and needs some work. We can either build a dashboard on top of the statscache or revamp statscache to include interactive features such as filtering by date, ability to generate graphs, etc.

Automatic statistics generation and storage.

Can you elaborate this?

Sorry for not elaborating on that one. By that, I meant the ability to export the metrics in different formats for using it elsewhere.

Deploy the final deliverable on infracloud.

Why deploy it to infracloud rather production?

If the project is ready for production this summer, we can definitely go ahead and deploy it to production. If not, we can host it on commops Infracloud for testing it.

The plan seems too abstract to me. Can write a detail write of the implementation on what/how are you planning?

From what I see, the following should be the action plan :
1) Figure out how the statscache codebase can be expanded to accommodate the new requirements.
2) Tweak/add modules to statscache.
3) Revamp the statscache interface.
4) Write the plugins required for statistics.
5) Add interactivity to the interface. (This can either be built directly around statscache or be a separate dashboard for the project built on top of statscache.)

@sayanchowdhury Your thoughts on this? :)

[1] https://jflory7.fedorapeople.org/pub/flock/2016/i-contributed-now-what/i-contributed-now-what-slides.pdf
[2] https://github.com/fedora-infra/statscache/issues/50#issuecomment-285357472

I just cc myself to this ticket.

I am working on adding a plugin to collect statistics for Pagure. I am planning some blog posts to document the process, as there is not much documentation on how to develop plugins.

I'd like to take on the high priority task as a GSoC student, and I was thinking of getting started doing a part of the task to get to know the codebase better.

Do you think starting with creating the 'Stats by badge' plugin would be a good idea?

I sent my introduction mail here [0], I've setup statscache yet, and I'm excited to work on this!

[0] https://lists.fedoraproject.org/archives/list/summer-coding@lists.fedoraproject.org/thread/BCRGZZ66XPOOCJXYLLW3EIJIJKNECASY/

After discussing this idea with bee2502 , we decided to split the idea based on the priority of tasks :
High Priority
Custom statscache plugins for metrics
Which all plugins?
As mentioned earlier, some of the plugins required would be :
* Stats by FAS group : We need to be able to generate statistics of a FAS group as a whole. This can be used to generate reports of a SIG/group easily when necessary.

Stats by badge : This can be useful while generating event reports. This plugin needs to find all the recipients of, say badge X and should show activity of the users, given the time delta.

FAS Account trends : This plugin will be really useful to track the new comer trends to Fedora. This can also help CommOps/Join track newcomer retention rates. Please take a look at Slide 6 of the presentation [1] by bee2502 for more information on this.

Also, we should be integrating the stats required by @mattdm, as mentioned by you on the statscache ticket[2]

I think this would be possible atm, at least cannot be counted as a task in GSoC. This is a combination of generating analytics from fedmsg as well as generating via some internal scripts. So this would need migrating those scripts to push fedmsg messages and afaik @mattdm changes the presentation to make the analytics more interesting. So, it would be good to have a framework to accommodate that.

Post event metrics from badges.
Metrics for FAS groups.
Integrate fedora-stats-tool into statscache?
Ability to generate weekly/monthly/quarterly/yearly reports.
Ability to export statistics in different formats.
Well-documented code.
Medium Priority
A nice web interface for generating the statistics (using Flask/Bottle? )
An interface for displaying the stats in statscache already exists. Is the plan to build project specific dashboard here on top of statscache?

Yes, but IMO the interface is really basic and needs some work. We can either build a dashboard on top of the statscache or revamp statscache to include interactive features such as filtering by date, ability to generate graphs, etc.

Automatic statistics generation and storage.
Can you elaborate this?

Sorry for not elaborating on that one. By that, I meant the ability to export the metrics in different formats for using it elsewhere.

Deploy the final deliverable on infracloud.
Why deploy it to infracloud rather production?

If the project is ready for production this summer, we can definitely go ahead and deploy it to production. If not, we can host it on commops Infracloud for testing it.

The plan seems too abstract to me. Can write a detail write of the implementation on what/how are you planning?

From what I see, the following should should be the action plan :
1) Figure out how statscache codebase can be expanded to accomodate the new requirements.

I don't think statscache needs heavy modification to accommodate the changes.

2) Tweak/Add modules to statscache

Two plugins that I see (Badges & FAS)

3) Revamp the statscache interface.

Agreed

4) Write the required plugins required for statistics.

So, my point is that it should be clear ahead of GSoC which plugins the GSoC student would be building and what the timeframe for them would be.

Adding plugins to statscache is quite easy, and at the end of the day, if a plugin is necessary, it should be added. For example, the Badges & FAS plugins are needed by the CommOps team, so they should be built.

5) Add interactivity to the interface. ( This can either be a direct around statscache or a separate dashboard for the project build on top of statscache).

+1 to this

@sayanchowdhury Your thoughts on this? :)

I would say everything that is built should be generic in nature, like the exporting feature.

So, my point is that it should be clear ahead of GSoC which plugins the GSoC student would be building and what the timeframe for them would be.

Definitely agree on this.

I am also cc'ing @jberkus on this ticket so that he can comment on community metrics he has in mind for Project Atomic + Fedora.

Just as a small afterthought, some statistics to be collected on the Fedora Magazine / Community Blog via the WordPress API could be cool. Specifically, we would want to gain insight into:

  • Post performance (daily / weekly / monthly)
  • Overall trends in viewing (weekly / monthly / annually)
  • Comparative reports (e.g. how this {week,month,year} compared to last {week,month,year})
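
For the comparative reports, the calculation itself is small. Here is a sketch assuming view counts are already available per period. The function name is made up, and nothing here touches the WordPress API.

```python
def percent_change(current, previous):
    """Relative change from previous to current, in percent.

    Returns None when previous is 0, since the ratio is undefined.
    """
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


# Hypothetical view counts for two consecutive weeks.
this_week, last_week = 150, 100
change = percent_change(this_week, last_week)
```

Returning `None` for an empty previous period keeps a brand-new post from showing up as an infinite increase.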

It's worth noting that there might already be tools or resources out there to access the WordPress API for statistics, although I'm unaware of anything off the top of my head.

This list isn't as comprehensive as I'd like it to be, but I'm going to ping @pfrields and @ryanlerch too, since they might have ideas for "Fedora Magazine metric wishlist" to throw into the ring.

Here are some metrics suggestions for Atomic and Cockpit as suggested by @jberkus and me :

  • Knowing about contribution activity ( Issues posting and responses, Pull Requests, wiki edits)
    We'll also want to span the systems used by the various projects, so Pagure for Fedora Atomic, GitHub for Cockpit, etc.

  • Statistics about contributors activity
    tenure
    participation in FADs/vFADS
    participation in IRC meetings
    if possible, Trello activity
    wiki edits, but this needs more thought - small/large contributions and areas.

We also want to collect stats on Cockpit as a project.

One of the possible metrics : #42 Activity (mailing list or otherwise) by day of week and time of day in each contributor's local timezone
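
A rough sketch of how that metric could be computed, assuming we somehow know each contributor's UTC offset (an assumption; a fixed offset also ignores daylight saving time). The function name and input shapes are made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def local_activity_histogram(events, utc_offsets):
    """Count events per (weekday, hour) in each contributor's local time.

    events: iterable of (username, timezone-aware UTC datetime) pairs.
    utc_offsets: mapping of username -> offset from UTC in minutes
    (assumed input; users missing from the mapping are treated as UTC).
    """
    histogram = Counter()
    for user, when_utc in events:
        local_zone = timezone(timedelta(minutes=utc_offsets.get(user, 0)))
        local = when_utc.astimezone(local_zone)
        histogram[(DAYS[local.weekday()], local.hour)] += 1
    return histogram


events = [
    ("alice", datetime(2017, 3, 1, 20, 0, tzinfo=timezone.utc)),
    ("bob", datetime(2017, 3, 1, 20, 0, tzinfo=timezone.utc)),
]
histogram = local_activity_histogram(events, {"alice": 330, "bob": -300})
```

The two events happen at the same instant but land in different buckets, which is the point of normalizing to local time.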

@skamath would love your help on this ticket to finalize the list of metrics (since we have had numerous suggestions recently).

Metadata Update from @bee2502:
- Issue priority set to: critical (next week)
- Issue tagged with: GSoC, metrics, needs feedback

2 years ago

Metadata Update from @skamath:
- Issue assigned to skamath

2 years ago

So, maybe this will help? These are the scripts I used to generate contributor stats as presented at Flock and DevConf:

https://github.com/mattdm/fedora-stats-tools/tree/develop/contributor-trends

They're pretty ugly, and in particular hit the server harder than they should, and they cache nothing and stupidly have to start over at the beginning just to add data to the end.

In general, big things I'm interested in are the onboarding and retention numbers that Bee has been looking at. Things like:

  • Number of active contributors (segmented by area as far as possible) over time
  • Number of new contributors each month (or other period of time, but month is nice)
  • Percent of work done by relatively new contributors
  • Percent of work done by "old hands"
  • Retention of new contributors
  • and retention of long-time contributors, too
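
As a sketch of how the onboarding and retention numbers could be derived once activity is available per user and month: the input shape, the function names, and this particular definition of retention (share of a month's newcomers who are active again in a later month) are all assumptions for illustration.

```python
from collections import defaultdict


def new_contributors_by_month(activity):
    """activity: list of (username, 'YYYY-MM') pairs, in any order."""
    first_seen = {}
    for user, month in activity:
        # 'YYYY-MM' strings sort chronologically, so min-by-string works.
        if user not in first_seen or month < first_seen[user]:
            first_seen[user] = month
    newcomers = defaultdict(set)
    for user, month in first_seen.items():
        newcomers[month].add(user)
    return dict(newcomers)


def retention(activity, cohort_month, later_month):
    """Fraction of users first seen in cohort_month who are active in later_month."""
    cohort = new_contributors_by_month(activity).get(cohort_month, set())
    if not cohort:
        return None
    active_later = {user for user, month in activity if month == later_month}
    return len(cohort & active_later) / len(cohort)


activity = [
    ("alice", "2017-01"), ("bob", "2017-01"),
    ("carol", "2017-02"), ("alice", "2017-03"),
]
newcomers = new_contributors_by_month(activity)
```

With the sample data, `retention(activity, "2017-01", "2017-03")` is 0.5, since one of the two January newcomers is active again in March.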

These are basically relevant and interesting as indicators of project health.

It would also be interesting to map onboarding growth (going from mailing list posts to irc participation to pagure commits, say).

Another interesting thing is categorization of work patterns. Do some people tend to do a ton of work all at once, and then go dormant? Do some people do some small thing every week, adding up to a lot? This stuff isn't project health per se, but still is interesting to know and we might do something with it.

Closing in favor of #114

All the ideas noted here will be migrated there :)

Metadata Update from @skamath:
- Issue untagged with: GSoC, metrics, needs feedback
- Issue close_status updated to: Moved
- Issue priority set to: None (was: critical (next week))
- Issue status updated to: Closed (was: Open)

2 years ago
