#3956 [RFE] API endpoint to find all projects with a branch named 'X'
Closed: Fixed 5 years ago by pingou. Opened 5 years ago by jcpunk.

In preparation for the CentOS switch to pagure, it would be helpful if I could search for projects with a specific named branch.

This would prevent huge query loads to locate any newly added package repos.


Metadata Update from @pingou:
- Issue tagged with: RFE

5 years ago

This would be good to prioritize in the next month or two if possible.

How up to date must the data be?
Would providing a JSON file of the data, refreshed daily or a few times a day, be enough?

I fear that extracting this info "live" from all 27,000+ repos will just be too much, but building a JSON file of all repos with their branches from a cron job sounds doable.
Would it work for your use-case?

Metadata Update from @pingou:
- Issue set to the milestone: 5.3

5 years ago

A refresh multiple times a day would be viable. Every 4 hours is probably the lowest frequency I could work with. Anything less frequent could significantly delay the build and publication of security errata on our end.

Could it be updated when a package is pushed in? Rather than a live scrape, treat it as more of an index?

@jcpunk: the problem I see with the current pagure implementation is that the git repositories themselves are now hosted on a distributed repoSpanner server, so nothing is really "on disk", except a cache from time to time.
The only way to show that "live" (I am not a maintainer, so I am only discussing here) would probably be a hook that runs on push and records this info in the DB used by pagure. On an API call, pagure would then have it "cached" with recent history, instead of having to open thousands of repositories to construct that JSON.

I'm going to write a POC in staging and we'll see how long it takes to run, that'll determine how often we can run it.
Hopefully we can get it to run every 3 or 4 hours, and we could check with the admins about running it sooner when a (known/important) push happens.

While working on this script, are we also interested in knowing the last commit for each branch? This way we'd know about new branches but also new commits on them.
What do you think?

That information could be very handy!

The CentOS workflow requires checking all repos for new changes, because the c7 repos are pushed by different people than those who build the software. So it would be important to include a "last-change" timestamp in the git JSON API: ideally a list of all branches with their last change, or at least a timestamp of the last change per git repo. The alternative would be to pull all the repos every four hours or so, which would cause a huge load. I mean an interface like: https://git.centos.org/rpc/?req=LIST_REPOSITORIES which can be used to fetch info about changed repos.
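To make the "last change per branch" idea concrete, here is a minimal sketch of how a consumer might compare two successive listings to find new or updated branches. The data layout (project name mapped to a dict of branch name to last commit hash) and the function name `changed_branches` are assumptions made for illustration; nothing in this ticket specifies the actual format.

```python
def changed_branches(previous, current):
    """Return {project: [branch, ...]} for branches that are new or moved.

    Both arguments are ASSUMED to map project name -> {branch: commit hash};
    this layout is hypothetical, not a documented pagure format.
    """
    changes = {}
    for project, branches in current.items():
        old = previous.get(project, {})
        moved = sorted(
            name
            for name, commit in branches.items()
            if name not in old or old[name] != commit
        )
        if moved:
            changes[project] = moved
    return changes


# Made-up sample data for illustration only.
previous = {"rpms/bash": {"c7": "aaa111"}}
current = {
    "rpms/bash": {"c7": "bbb222"},  # new commit on an existing branch
    "rpms/zsh": {"c7": "ccc333", "c8": "ddd444"},  # newly added repo
}
print(changed_branches(previous, current))
# {'rpms/bash': ['c7'], 'rpms/zsh': ['c7', 'c8']}
```

With such a listing, a single download per polling interval would be enough to decide which repos actually need to be pulled.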

The current APIs can already be used to do this. For example, the following lists all the branches for the first 100 repos.

import requests

DIST_GIT = "https://src.fedoraproject.org/api/0/"

# Fetch the first page of projects (100 per page).
response = requests.get(DIST_GIT + "projects", params={"per_page": 100})
projects = response.json().get("projects", [])

# One extra request per project to list its branches.
for p in projects:
    branches = requests.get(DIST_GIT + p["fullname"] + "/git/branches")
    print(branches.json().get("branches"))

Well - is it guaranteed that the first 100 projects returned include all the new ones? If not, this is not very helpful, because all projects need to be checked. So for 5000 projects that would be 5000+1 requests.

Yes, sorry, I meant that as of today we can already check all branches of every project on a regular basis (every ~4 hours), and with a thread pool or an asyncio client the performance might be acceptable.

But yes adding the last-change time stamp is a good idea :)
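The thread-pool idea mentioned above could look roughly like the sketch below. It is only an illustration: it uses the standard library's `urllib` instead of `requests`, the helper names `fetch_branches` and `branches_for_all` are made up, and the worker count of 8 is an arbitrary assumption that should be agreed with the server admins before hitting a real instance.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

DIST_GIT = "https://src.fedoraproject.org/api/0/"


def fetch_branches(fullname):
    """Fetch the branch list of one project (one HTTP request)."""
    url = DIST_GIT + quote(fullname) + "/git/branches"
    with urlopen(url, timeout=30) as response:
        return json.load(response).get("branches", [])


def branches_for_all(fullnames, fetch=fetch_branches, max_workers=8):
    """Map each project name to its branches, fetching concurrently.

    `fetch` is injectable so the function can be tried without network access.
    """
    fullnames = list(fullnames)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs names with results.
        return dict(zip(fullnames, pool.map(fetch, fullnames)))
```

For example, `branches_for_all(["rpms/bash", "rpms/zsh"])` would issue the two requests in parallel. This still costs one request per project, so it only reduces wall-clock time, not server load.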

With help from @puiterwijk we leveraged repoSpanner to extract this information.

The output is a (large) JSON file available at: https://src.stg.fedoraproject.org/extras/ (repoinfo.json) that is refreshed every two hours.

Does this solve your request?
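As a rough illustration of how such a file could answer the original question (all projects with a branch named 'X'), here is a minimal sketch. The layout of repoinfo.json is not described in this ticket, so the structure used below (project name mapped to a dict of branch name to last commit) is purely an assumption, as is the helper name `projects_with_branch`.

```python
import json


def projects_with_branch(repoinfo, branch):
    """Return sorted project names whose branch listing includes `branch`.

    ASSUMES repoinfo maps project name -> {branch name: last commit};
    the real repoinfo.json layout may differ.
    """
    return sorted(
        name for name, branches in repoinfo.items() if branch in branches
    )


# Tiny inline sample standing in for the downloaded file (hypothetical layout).
sample = json.loads("""
{
  "rpms/bash": {"master": "aaa111", "c7": "bbb222"},
  "rpms/zsh": {"master": "ccc333"}
}
""")
print(projects_with_branch(sample, "c7"))  # ['rpms/bash']
```

Since the whole index arrives in one download, the lookup is a local filter rather than one API request per project.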

Metadata Update from @pingou:
- Issue assigned to pingou

5 years ago

Yes, this works perfectly.

Perfect, let's close this as fixed :)

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

Any chance of getting this URL added to the infrastructure docs for greater visibility?
