#9131 openQA worker hosts use 10.0.2.2/15 IP, seems to be causing network issues
Closed: Fixed 3 years ago by adamwill. Opened 3 years ago by adamwill.

mgalgoci and @smooge poked me the other day about openQA's use of the IP address 10.0.2.2.

To explain it as briefly as I can: openQA 'tap' worker hosts (worker hosts that run complex tests where two or more jobs need to run simultaneously and communicate with each other) have a br0 interface with the IP 10.0.2.2/15, which covers both the 10.0.1.x and 10.0.2.x ranges. The VMs that run 'tap' jobs take an IP in the 10.0.2.x range, and a helper script sets some clever openflow rules that rewrite their traffic so it appears to the bridge to come from addresses in the 10.0.1.x range (based on each VM's MAC address). That way, if we have two instances of the same job running on two different VLANs, they can both take 'the same' IP address, and we can pass traffic between each of them and the bridge without conflicts.
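For anyone wanting to check what the /15 on br0 actually claims, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `in_bridge_network` is made up for illustration; only the 10.0.2.2/15 value comes from the setup described above.

```python
import ipaddress

# The br0 address and prefix length described in this ticket.
bridge = ipaddress.ip_interface("10.0.2.2/15")
network = bridge.network


def in_bridge_network(addr: str) -> bool:
    """Return True if addr falls inside the network configured on br0.

    Hypothetical helper, for illustration only.
    """
    return ipaddress.ip_address(addr) in network


print(network)                  # 10.0.0.0/15
print(network[0], network[-1])  # 10.0.0.0 10.1.255.255
print(network.num_addresses)    # 131072

# The 10.0.1.x and 10.0.2.x ranges used by the tap jobs are both inside it.
print(in_bridge_network("10.0.1.5"))  # True
print(in_bridge_network("10.0.2.15")) # True
```

So the host treats everything from 10.0.0.0 to 10.1.255.255 as directly reachable via br0, which is a good deal more than just the two /24 ranges the jobs use.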

Non-tap jobs simply use qemu 'user' networking, in which (by default) the host also appears as 10.0.2.2 to the guest.

The openQA test API encodes this assumption: when test code needs to talk to the host for any reason (for instance, to upload files from the guest to the host, which is how we get log files and other things out), it expects to reach the host at 10.0.2.2.

So I started thinking about how we could patch openQA (actually os-autoinst, the openQA test runner; there's no code in openQA itself that assumes 10.0.2.2) to allow us to use something other than 10.0.2.2, but it's not really straightforward and I'm not sure of the best way to do it. So I'm filing this ticket to track progress on the Fedora end. I've filed an upstream issue too for discussion on how best to do this upstream, if we can.

It would probably be useful to know if it's only the bridge interfaces on tap worker hosts that are a problem for infra, or if the non-tap jobs using qemu user networking with the default 10.0.2.2 IP are a problem too.


Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: groomed, high-gain, high-trouble

3 years ago

Metadata Update from @smooge:
- Issue close_status updated to: Duplicate
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata Update from @smooge:
- Issue status updated to: Open (was: Closed)

3 years ago

This was erroneously closed.

Update here: the pull request to make this configurable upstream should be in a workable state now, so I will try it out today.

Well, I tried it, but it didn't really work out :/ It seems like tests could communicate with each other, but they definitely couldn't reach the internet and maybe couldn't reach the VM host. This is likely either me getting something wrong or some firewall or something, but it's going to be complex to figure out and I can't leave the prod instance broken all weekend, so I reverted to 10.0.2.2 for now and will take another swing at this next week.

This should be resolved now. I found the sneaky little thing that was making it not work and redeployed the change and it seems to be working OK now. We shouldn't be using any 10.x IPs for openQA any more, everything is moved to 172.x. Let me know if there still seem to be issues. Thanks!
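If anyone wants to double-check a host for leftover 10.x addresses, something like the sketch below would do it. The function name and the sample address list are hypothetical, not taken from the actual deployment; in practice you would feed it whatever addresses are configured on the worker host.

```python
import ipaddress

# Everything starting with 10. lives in this network.
TEN_NET = ipaddress.ip_network("10.0.0.0/8")


def addresses_in_10_net(addresses):
    """Return the entries whose IP falls inside 10.0.0.0/8.

    Accepts plain addresses or address/prefix strings.
    Hypothetical helper, for illustration only.
    """
    return [a for a in addresses if ipaddress.ip_interface(a).ip in TEN_NET]


# Made-up sample input, not the real worker configuration.
sample = ["172.16.2.2/15", "10.0.2.2/15", "192.168.1.10"]
print(addresses_in_10_net(sample))  # ['10.0.2.2/15']
```

An empty list back means nothing on the list is still using a 10.x address.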

Metadata Update from @adamwill:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
