All the Copr hypervisors seem to be very slow over IPv6 for some reason: establishing an IPv6 connection takes ages and often times out.
The problem for Copr, in particular, is that we heavily depend on ipv6 (copr-backend controls VMs on hypervisors over ipv6 because we don't have enough public ipv4 addresses for all the VMs). For users: This basically means that Copr is slower than usual (especially for the ppc64le architecture).
The other problem is that we communicate with the hypervisors themselves over IPv6 (the default IP stack). I'm going to switch that to IPv4, as it seems to be more reliable in the long term. But we still need IPv6 fixed to connect to the VMs themselves.
[copr_hypervisor]
vmhost-x86-copr01.rdu-cc.fedoraproject.org
vmhost-x86-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr03.rdu-cc.fedoraproject.org
vmhost-x86-copr04.rdu-cc.fedoraproject.org
vmhost-p08-copr01.rdu-cc.fedoraproject.org
vmhost-p08-copr02.rdu-cc.fedoraproject.org
vmhost-p09-copr01.rdu-cc.fedoraproject.org
I suspect the whole rdu-cc.f.o lab is affected.
On my local host, git pull from ssh://git@pagure.io/fedora-infra/ansible.git takes ages, too, while git pull -4 works normally.
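To put a number on the difference between the two stacks, a minimal Python sketch like the following could time a TCP connect over each address family. The function names (connect_time, compare_families), the default port 22 and the 10-second timeout are assumptions made for illustration, not anything used in this ticket. It only times the TCP handshake, so a stall that happens later in the SSH exchange would not show up here.

```python
import socket
import time


def connect_time(host, port, family, timeout=10.0):
    """Return the seconds needed to open a TCP connection to host:port
    over the given address family, or None if no address could be reached."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # The host has no address in this family (or resolution failed).
        return None
    for af, socktype, proto, _canonname, sockaddr in infos:
        start = time.monotonic()
        try:
            with socket.socket(af, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return time.monotonic() - start
        except OSError:
            # Timed out or refused; try the next address, if any.
            continue
    return None


def compare_families(host, port=22, timeout=10.0):
    """Time a TCP connect to the same host over IPv4 and over IPv6."""
    return {
        "ipv4": connect_time(host, port, socket.AF_INET, timeout),
        "ipv6": connect_time(host, port, socket.AF_INET6, timeout),
    }


# Hypothetical usage (needs network access):
#   compare_families("pagure.io")
```

Running compare_families against each hypervisor from the list above, and from more than one client network, would show whether the slowdown is specific to IPv6 and to which hosts.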
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: Needs investigation
This configuration helps us to ssh to the hypervisors to start new VMs, but we cannot ssh to those VMs.
Metadata Update from @praiskup: - Issue untagged with: Needs investigation - Issue priority set to: Needs Review (was: Waiting on Assignee)
Metadata Update from @praiskup: - Issue tagged with: copr, pagure
In order for networking to debug this, they are going to need a lot more information on this ticket:
If there are multiple hosts which are seeing this, collecting the data from all of them is useful to cut down which block may be causing an issue.
There are a lot of possible reasons for the slowdown, which will need to be looked at one by one:
A. The local rdu-cc router is overloaded. IPv6 takes a quadratically larger amount of memory and CPU than IPv4.
B. Some network in between has 'devalued' IPv6 traffic relative to IPv4 traffic.
C. An OS bug on the hosts.
D. An OS bug on a RH-controlled router.
E. A bug on one of the N routers between 'from' and 'to'.
So, from another host (www.osci.io) in the same DC, connecting to pagure.io is immediate. From France, it takes ages to ssh to both pagure.io and www.osci.io, but ping is fast.
So option C is out.
I have tested from a server in a German DC, and it is also an issue there. However, mtr shows the only common hop is 2600:3000:0:2::1d, the upstream provider of the DC.
So option E is likely out.
So, testing from another DC (the DigitalOcean one in NYC), I still see the ssh delay, but I do not see the 2600:3000:0:2::xx routers, which means the issue is between the servers and 2620:52:3:fffe::2, which is the RH IT router.
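The elimination above boils down to intersecting the hop lists seen from several vantage points: a hop that shows up on every slow path stays a suspect, and one that is missing from a slow path can be ruled out. A minimal Python sketch of that step, assuming the hop addresses have already been copied out of mtr by hand. The function name common_hops and the vantage labels are hypothetical, and every address in the example is from the 2001:db8::/32 documentation prefix, not real mtr output from this ticket.

```python
def common_hops(paths_by_vantage):
    """Return the hops that appear in every path, kept in the order
    in which they occur in the first path."""
    paths = list(paths_by_vantage.values())
    if not paths:
        return []
    shared = set(paths[0])
    for path in paths[1:]:
        shared &= set(path)
    return [hop for hop in paths[0] if hop in shared]


# Illustrative placeholder data only (documentation-prefix addresses).
example_paths = {
    "vantage-a": ["2001:db8:a::1", "2001:db8:ffff::1", "2001:db8:ffff::2"],
    "vantage-b": ["2001:db8:b::1", "2001:db8:ffff::1", "2001:db8:ffff::2"],
    "vantage-c": ["2001:db8:c::1", "2001:db8:ffff::2"],
}

suspects = common_hops(example_paths)
```

In this made-up example only the last hop is shared by all three paths, which is the same shape of argument as the one used above to narrow things down to the RH IT router.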
I opened a ticket for RH IT (INC2567813).
Thanks @misc !
IT seems to have fixed the issue. Neither smooge nor I can reproduce the problem anymore.
Seems to work fine! Thank you for the diagnosis. I'll provide more info next time.
Metadata Update from @praiskup: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)