#11215 rdu-cc ipv6 issues causing troubles to Copr and Pagure
Closed: Fixed 2 years ago by praiskup. Opened 2 years ago by praiskup.

All the copr hypervisors seem to be very slow on ipv6 for some reason; establishing
an ipv6 connection takes ages and often times out.

The problem for Copr, in particular, is that we heavily depend on ipv6 (copr-backend
controls VMs on hypervisors over ipv6 because we don't have enough public ipv4
addresses for all the VMs). For users: This basically means that Copr is
slower than usual (especially for the ppc64le architecture).

The other problem is that we communicate with the hypervisors themselves
over IPv6 (default ip stack). I'm going to switch to v4 as it seems to be more
reliable long-term. But we still need to fix ipv6 to connect to the VMs themselves.

[copr_hypervisor]
vmhost-x86-copr01.rdu-cc.fedoraproject.org
vmhost-x86-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr03.rdu-cc.fedoraproject.org
vmhost-x86-copr04.rdu-cc.fedoraproject.org
vmhost-p08-copr01.rdu-cc.fedoraproject.org
vmhost-p08-copr02.rdu-cc.fedoraproject.org
vmhost-p09-copr01.rdu-cc.fedoraproject.org

I suspect the whole rdu-cc.f.o lab is affected.


On my local host, git pull for ssh://git@pagure.io/fedora-infra/ansible.git takes ages, too; git pull -4 works normally.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation

2 years ago

This configuration helps us to ssh to the hypervisors to start new VMs, but we cannot ssh to those VMs.

Metadata Update from @praiskup:
- Issue untagged with: Needs investigation
- Issue priority set to: Needs Review (was: Waiting on Assignee)

2 years ago

Metadata Update from @praiskup:
- Issue tagged with: copr, pagure

2 years ago

In order for networking to debug this, they are going to need a lot more information on this ticket:

  1. From where (ipv6 address)
  2. To where (ipv6 address)
  3. What protocols and ports (TCP6, UDP6, port 22, port 80, port 443)
  4. When was this measured? UTC time
  5. What is the network path between the "from" and "to". (this can change regularly so we need to capture what was happening when the test happened)
  6. Does mtr using a protocol/port show any slowdowns in the path?
  7. What are the timings seen?
  8. A similar set of data for ipv4 is needed because you can end up with a completely different path, timing, etc.

If there are multiple hosts which are seeing this, collecting the data from all of them is useful to cut down which block may be causing an issue.
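
A minimal sketch of how several of the items above (source and destination address, port, UTC time, and timing, for both ipv6 and ipv4) could be collected in one go. It only measures TCP connect time with the Python standard library; it does not capture the network path, so mtr is still needed for that. The function name `time_tcp_connect` and the record layout are made up for illustration.

```python
import socket
import time
from datetime import datetime, timezone

# Map a readable label to the socket address family.
FAMILIES = {"ipv4": socket.AF_INET, "ipv6": socket.AF_INET6}


def time_tcp_connect(host, port, family, timeout=10.0):
    """Try one TCP connect over the given family and return a record of it.

    This is an illustrative helper, not part of any existing tool.
    """
    record = {
        "utc": datetime.now(timezone.utc).isoformat(),  # when it was measured
        "host": host,
        "port": port,
        "family": family,
        "from": None,     # local address used for the connection
        "to": None,       # resolved remote address
        "seconds": None,  # time spent in connect()
        "error": None,
    }
    try:
        infos = socket.getaddrinfo(host, port, FAMILIES[family], socket.SOCK_STREAM)
    except socket.gaierror as exc:
        record["error"] = f"resolve failed: {exc}"
        return record

    sockaddr = infos[0][4]
    record["to"] = sockaddr[0]
    start = time.monotonic()
    try:
        with socket.socket(FAMILIES[family], socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            record["seconds"] = time.monotonic() - start
            record["from"] = sock.getsockname()[0]
    except OSError as exc:
        # Covers timeouts, unreachable networks, and a missing ipv6 stack.
        record["seconds"] = time.monotonic() - start
        record["error"] = str(exc)
    return record
```

Calling, for example, `time_tcp_connect("pagure.io", 22, "ipv6")` and then the same with `"ipv4"` from each affected host gives two records that can be pasted into the ticket side by side.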

There are a lot of reasons for slowdown which will need to be looked at one-by-one:
A. local rdu-cc router is overloaded. ipv6 takes quadratically more memory and CPU than ipv4.
B. some network in between has 'devalued' ipv6 traffic over ipv4 traffic.
C. OS bug on the hosts
D. OS bug on a RH controlled router
E. bug on the N routers between 'from' and 'to'

So, from another host (www.osci.io) in the same DC, connecting to pagure.io is immediate. From France, it takes ages to ssh to both pagure.io and www.osci.io, but ping is fast.

So option C is out.

I have tested from a server in a German DC, and it is also an issue there. However, mtr shows the only common hop is 2600:3000:0:2::1d, the upstream provider of the DC.

So option E is likely out.

So, testing from another DC (the Digital Ocean one in NYC), I still see the ssh delay, but I do not see the 2600:3000:0:2::xx routers, so that means the issue is between the servers and 2620:52:3:fffe::2, which is the RH IT router.
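
The elimination above boils down to comparing the hop lists seen from several vantage points and keeping only what they share. A rough sketch of that comparison, assuming the hop addresses have already been copied out of the mtr reports by hand; `common_hops` is a hypothetical helper, and the addresses in the usage note are documentation-range placeholders, not the real traces.

```python
def common_hops(traces):
    """Return the hops present in every trace, in the order of the first trace.

    `traces` is a list of hop lists, one list of addresses per vantage point.
    """
    if not traces:
        return []
    # Intersect the hop sets of all traces.
    shared = set(traces[0]).intersection(*map(set, traces[1:]))
    # Preserve the path order of the first trace for readability.
    return [hop for hop in traces[0] if hop in shared]
```

For instance, with `[["2001:db8:a::1", "2001:db8:ffff::1", "2001:db8:ffff::2"], ["2001:db8:b::1", "2001:db8:ffff::2"]]` as input, only `2001:db8:ffff::2` is shared, so the problem would have to be at or behind that hop.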

I opened a ticket for RH IT ( INC2567813 ).

IT seems to have fixed the issue. Neither smooge nor I can reproduce the problem anymore.

Seems to work fine! Thank you for the diagnosis. I'll provide more info next time.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
