#4977 Pagure takes a long time (~1m) to start transferring a large raw file
Closed: Fixed 3 years ago by pingou. Opened 3 years ago by ngompa.

This is keyed off the CentOS infra ticket asking for a pagure upgrade for git.centos.org to improve performance of downloading large raw files.

I mirrored the kernel rpm packaging repo from git.centos.org to pagure.io and attempted to test the same download, and for some reason we're taking ~1 minute before starting the file transfer:

~> curl -vvvvOL "https://pagure.io/centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard.patch"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 8.43.85.75:443...
* Connected to pagure.io (8.43.85.75) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [93 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2908 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [589 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=North Carolina; L=Raleigh; O=Red Hat Inc.; CN=pagure.io
*  start date: Dec  7 00:00:00 2017 GMT
*  expire date: Jan  5 12:00:00 2021 GMT
*  subjectAltName: host "pagure.io" matched cert's "pagure.io"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
} [5 bytes data]
> GET /centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard.patch HTTP/1.1
> Host: pagure.io
> User-Agent: curl/7.69.1
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 03 Sep 2020 01:36:50 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_wsgi/3.4 Python/2.7.5
< X-Content-Type-Options: nosniff
< Content-Security-Policy: default-src 'self';script-src 'self' 'nonce-OxhKl8LIPBUGl48gk7zPNyh1G'; style-src 'self' 'nonce-OxhKl8LIPBUGl48gk7zPNyh1G'; object-src 'none';base-uri 'self';img-src 'self' https:;connect-src 'self' https://pagure.io:8088;frame-src https://docs.pagure.org;frame-ancestors https://pagure.io;
< Set-Cookie: pagure=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmIjp7IiBiIjoiWVdZMU1UQXdNakEyTkdGa05XUTJPV0ZrWldZM09XVm1OR1V6WmpNMll6UXpZak0yTmpBMk1BPT0ifX0.EjHabw.jno8nldliQ3L2buAMx-qXGtca8A; Expires=Sun, 04-Oct-2020 01:37:51 GMT; Secure; HttpOnly; Path=/
< X-Xss-Protection: 1; mode=block
< Referrer-Policy: same-origin
< X-Frame-Options: ALLOW-FROM https://pagure.io/
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Content-Length: 1703165
< Content-Type: text/plain; charset=utf-8
<
  0 1663k    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0{ [5 bytes data]
100 1663k  100 1663k    0     0  27392      0  0:01:02  0:01:02 --:--:--  435k
* Connection #0 to host pagure.io left intact

I tested locally with my own instance configured with Apache as the frontend like pagure.io and git.centos.org, and I can reproduce it:

~> curl -vvvvOL "http://obs-gitdev-pagure.home.lan/packages/centos-kernel/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard.patch"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.0.130.26:80...
* Connected to obs-gitdev-pagure.home.lan (10.0.130.26) port 80 (#0)
> GET /packages/centos-kernel/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard.patch HTTP/1.1
> Host: obs-gitdev-pagure.home.lan
> User-Agent: curl/7.69.1
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:56 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 03 Sep 2020 02:00:13 GMT
< Server: Apache
< Content-Length: 1703165
< X-Content-Type-Options: nosniff
< Content-Security-Policy: default-src 'self';script-src 'self' 'nonce-K4yLO7WFbtZWnMiFZvaTj86zu'; style-src 'self' 'nonce-K4yLO7WFbtZWnMiFZvaTj86zu'; object-src 'none';base-uri 'self';img-src 'self' https:;
< Vary: Cookie
< Set-Cookie: pagure_local_cookie=; HttpOnly; Path=/
< Set-Cookie: pagure=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmX3Rva2VuIjoiNDc4NTUzMTE5ZjJhNmFkYTZkMTViYzI3NmU0OTg0ZmFiZDczYzMwMyJ9.EjHf5g.fd-Mwn_9Ls9DqB-0gogDFA-c-_g; Expires=Sun, 04-Oct-2020 02:01:10 GMT; HttpOnly; Path=/
< Content-Type: text/x-patch; charset=utf-8
<
{ [701 bytes data]
100 1663k  100 1663k    0     0  29653      0  0:00:57  0:00:57 --:--:--  405k
* Connection #0 to host obs-gitdev-pagure.home.lan left intact

So this is clearly something happening in the Pagure layer. Any idea of what's going on here? In some cases, it leads to a browser timeout and termination of the connection before it even begins fetching...


So I've added some logging in the code to figure out what is taking so much time.

Here is what I gathered.

The curl command:

$ curl -vvvOL "http://127.0.0.1:5000/rpms/centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch"
*   Trying 127.0.0.1:5000...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 127.0.0.1 (127.0.0.1) port 5000 (#0)
> GET /rpms/centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch HTTP/1.1
> Host: 127.0.0.1:5000
> User-Agent: curl/7.69.1
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:31 --:--:--     0* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< X-Content-Type-Options: nosniff
< Content-Type: text/plain; charset=utf-8
< Content-Length: 1703314
< Content-Security-Policy: default-src 'self';script-src 'self' 'nonce-X9MNhXNoYtMBOVnCX2FhCKguW'; style-src 'self' 'nonce-X9MNhXNoYtMBOVnCX2FhCKguW'; object-src 'none';base-uri 'self';img-src 'self' https:;
< Vary: Cookie
< Set-Cookie: pagure=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmX3Rva2VuIjoiM2U4NWQyNDVmZDI0MDQ1ZGRjMjUxNDE2YTNlZjI2NmY3MzgzMWJmYyJ9.EjuLog.mPr4AafOd9TGlLQZ1TEXXQ-rFys; Expires=Sun, 11-Oct-2020 10:04:18 GMT; HttpOnly; Path=/
< Server: Werkzeug/0.16.0 Python/3.8.5
< Date: Thu, 10 Sep 2020 10:04:18 GMT
< 
{ [32768 bytes data]
100 1663k  100 1663k    0     0  54003      0  0:00:31  0:00:31 --:--:--  385k
* Closing connection 0

The server logs:

2020-09-10 12:03:49,813 [INFO] pagure.ui.repo: view_raw_file
2020-09-10 12:03:49,816 [INFO] pagure.ui.repo: Got commit
2020-09-10 12:03:49,816 [INFO] pagure.ui.repo: filename provided
2020-09-10 12:03:49,836 [INFO] pagure.ui.repo: got content
2020-09-10 12:03:49,857 [INFO] pagure.ui.repo: got data
2020-09-10 12:03:49,857 [INFO] pagure.ui.repo: checking mimetype and returning
2020-09-10 12:03:49,857 [INFO] pagure.lib.mimetype: get_type_headers
2020-09-10 12:03:49,857 [INFO] pagure.lib.mimetype: filename: SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch
2020-09-10 12:03:49,857 [INFO] pagure.lib.mimetype: guessed from name: None - None
2020-09-10 12:03:49,858 [INFO] pagure.lib.mimetype: settling for text
2020-09-10 12:04:18,471 [DEBUG] pagure.lib.encoding_utils: Possible encodings: [Guess(encoding='utf-8', confidence=0.9587499999999999), Guess(encoding='ISO-8859-1', confidence=0.7299614052710883), Guess(encoding='ISO-8859-9', confidence=0.4501921945201518), Guess(encoding='TIS-620', confidence=0.14392848481446535), Guess(encoding='SHIFT_JIS', confidence=0.01), Guess(encoding='EUC-JP', confidence=0.01), Guess(encoding='GB2312', confidence=0.01), Guess(encoding='EUC-KR', confidence=0.01), Guess(encoding='CP949', confidence=0.01), Guess(encoding='Big5', confidence=0.01), Guess(encoding='EUC-TW', confidence=0.01), Guess(encoding='windows-1251', confidence=0.01), Guess(encoding='KOI8-R', confidence=0.01), Guess(encoding='MacCyrillic', confidence=0.01), Guess(encoding='IBM866', confidence=0.01), Guess(encoding='IBM855', confidence=0.01), Guess(encoding='windows-1253', confidence=0.01), Guess(encoding='ISO-8859-5', confidence=0.0), Guess(encoding='ISO-8859-7', confidence=0.0), Guess(encoding='windows-1255', confidence=0.0)]
2020-09-10 12:04:18,471 [INFO] pagure.lib.encoding_utils: Trying encoding: Guess(encoding='utf-8', confidence=0.9587499999999999)
2020-09-10 12:04:18,472 [INFO] pagure.lib.mimetype: Re-guessed encoding from data: utf-8
2020-09-10 12:04:18,473 [INFO] pagure.lib.mimetype: guessed encoding and mimetype
2020-09-10 12:04:18,473 [INFO] pagure.lib.mimetype: returning headers

So as we can see the encoding detection is what is taking the most time. We're using chardet for this and it comes with a CLI tool to do it, so I've ran it manually:

$ time chardetect SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch
SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch: utf-8 with confidence 0.87625
chardetect SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch  12.51s user 0.08s system 98% cpu 12.740 total

So even with the CLI, it takes 12 seconds to detect the encoding of the file with chardet.

I'm not too sure there is much we can do here.

Out of curiosity, I've tried submitting only a part of the file, it goes much faster but suddenly the encoding detected was ascii instead of utf-8.

IIRC this was code that @jcline contributed but I don't know if he would be interested (or have the time) to look at it again to see if we can optimize it.

The suggestion by chardet is to switch to cchardet, which fortunately already exists in Fedora and EPEL 8.

Compromising accuracy for speed by limiting how much of the file is read is certainly an option. If there's a clear UTF-8 encoding late in the file you'll likely mis-label it, but it's not possible to correctly guess 100% of the time anyway so... eh. There might also be some tune-ables for chardet to make it faster but less accurate, I don't recall.

You could also determine encoding ahead-of-time and save the results in a database.

Or, if you want to be really fast and usually right, you could just say everything is UTF-8 even if it's not.

Metadata Update from @pingou:
- Issue assigned to pingou

3 years ago

Metadata Update from @pingou:
- Issue set to the milestone: 5.12

3 years ago

Looks like cchardet is giving us quite some improvements already:

$ curl -vvvOL "http://127.0.0.1:5000/rpms/centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch"
*   Trying 127.0.0.1:5000...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 127.0.0.1 (127.0.0.1) port 5000 (#0)
> GET /rpms/centos-kernel-rpm/raw/c7-sig-centosplus-kernel/f/SOURCES/centos-linux-3.10-wireguard-1.0.20200908.patch HTTP/1.1
> Host: 127.0.0.1:5000
> User-Agent: curl/7.69.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< X-Content-Type-Options: nosniff
< Content-Type: text/plain; charset=utf-8
< Content-Length: 1703314
< Content-Security-Policy: default-src 'self';script-src 'self' 'nonce-JzwN3bEKjYhiID2cIyI8jUag1'; style-src 'self' 'nonce-JzwN3bEKjYhiID2cIyI8jUag1'; object-src 'none';base-uri 'self';img-src 'self' https:;
< Vary: Cookie
< Set-Cookie: pagure=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmX3Rva2VuIjoiMDZjNzBmNTc0NmUxNGVlYzlmYTUyMTY5NDJmNTZmZDNjZTcwOWZjNCJ9.EkEMkw.qWGT6ROXVNavI69vd-8ItpArUgU; Expires=Thu, 15-Oct-2020 14:15:47 GMT; HttpOnly; Path=/
< Server: Werkzeug/0.16.0 Python/3.8.5
< Date: Mon, 14 Sep 2020 14:15:47 GMT
< 
{ [32768 bytes data]
100 1663k  100 1663k    0     0  9190k      0 --:--:-- --:--:-- --:--:-- 9241k
* Closing connection 0

Metadata Update from @pingou:
- Issue set to the milestone: None (was: 5.12)

3 years ago

Metadata Update from @pingou:
- Issue set to the milestone: 5.12

3 years ago

Login to comment on this ticket.

Metadata
Related Pull Requests
  • #4991 Merged 3 years ago