#2135 Add more spiders which do not seem to honour robots.txt
Merged 6 months ago by zlopez. Opened 6 months ago by smooge.
fedora-infra/ smooge/ansible spiders-gone-wild-20240708  into  main

@@ -3,8 +3,13 @@ 

  RewriteCond %{HTTP_USER_AGENT} "lftp"

  RewriteRule ^.*$ https://fedoraproject.org/wiki/Infrastructure/Mirroring#Tools_to_avoid [R,L]

  

- RewriteRule ^/$ /pub [R=302,L]

+ # Spiders-gone-wild

+ # These spiders may not follow robots.txt and will

+ # hit admin sections which consume large amounts of CPU

+ RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot).*$ [NC]

+ RewriteRule .* - [F,L]

  

+ RewriteRule ^/$ /pub [R=302,L]

  

  RedirectMatch 302 ^/pub/fedora/linux/atomic/(.*$) https://kojipkgs.fedoraproject.org/atomic/$1

  RedirectMatch 302 ^/pub/fedora/linux/atomic https://kojipkgs.fedoraproject.org/atomic/
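The new RewriteCond applies a case-insensitive ([NC]) regular expression to the User-Agent header, and the `[F,L]` RewriteRule returns 403 for any match. A minimal sketch of the same match in Python (the `is_blocked` helper is mine for illustration, not part of the PR):

```python
import re

# The RewriteCond pattern from the diff; Apache's [NC] flag maps to re.IGNORECASE.
SPIDER_RE = re.compile(
    r"^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|"
    r"GPTBot|Barkrowler|YisouSpider|MJ12bot).*$",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """True if this User-Agent would get the 403 from the [F] flag."""
    return SPIDER_RE.match(user_agent) is not None
```

An agent string like `Mozilla/5.0 (compatible; ClaudeBot/1.0)` matches regardless of case; an ordinary browser User-Agent does not.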

@@ -9,3 +9,6 @@ 

  

  User-agent: ClaudeBot

  Disallow: /

+ 

+ User-agent: Barkrowler

+ Disallow: /
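The robots.txt stanzas only help against crawlers that actually consult the file, which is why the RewriteCond blocks exist as a backstop. A sketch of how a compliant crawler would read the new stanza, using Python's stdlib robotparser (the inline robots.txt content is a trimmed example, not the full deployed file):

```python
from urllib.robotparser import RobotFileParser

# Trimmed example mirroring the new stanzas in the diff.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Barkrowler
Disallow: /
"""

rp = RobotFileParser()
rp.modified()  # record a fetch time; can_fetch() assumes nothing is allowed until the file is read
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Barkrowler", "/archives/"))    # a compliant crawler would see: False
print(rp.can_fetch("SomeOtherBot", "/archives/"))  # no matching stanza, so: True
```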

@@ -33,9 +33,11 @@ 

  # Redirecting to hyperkitty if nothing is specified

  RewriteEngine on

  RewriteRule  ^/$    /archives [R,L]

+ 

  # Spiders-gone-wild

- # These spiders do not follow robots.txt

- RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot).*$ [NC]

+ # These spiders may not follow robots.txt and will

+ # hit admin sections which consume large amounts of CPU

+ RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot).*$ [NC]

  RewriteRule .* - [F,L]

  

  # Old static archives

@@ -51,6 +51,11 @@ 

  RewriteEngine on

  RewriteRule ^/$ /nagios/ [R]

  

+ # Spiders-gone-wild

+ # These spiders may not follow robots.txt and will

+ # hit admin sections which consume large amounts of CPU

+ RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot).*$ [NC]

+ RewriteRule .* - [F,L]

  

  Alias /nagios /usr/share/nagios/html/

  <Directory "/usr/share/nagios/html">

@@ -138,10 +138,13 @@ 

  #  RewriteEngine On

  #  RewriteCond %{REQUEST_URI} ^/fedora-web/websites$

  #  RewriteRule .* - [F]

-   # Reject Bytespider spider

+ 

    RewriteEngine On

-   RewriteCond %{HTTP_USER_AGENT} .*Bytespider.*

-   RewriteRule .* - [F]

+ # Spiders-gone-wild

+ # These spiders may not follow robots.txt and will

+ # hit admin sections which consume large amounts of CPU

+   RewriteCond %{HTTP_USER_AGENT} ^.*(Bytespider|ClaudeBot|Amazonbot|YandexBot|ChatGLM-Spider|GPTBot|Barkrowler|YisouSpider|MJ12bot).*$ [NC]

+   RewriteRule .* - [F,L]

  

    <Location /apache-status>

        SetHandler server-status

I went through the last couple of logs after the first round of 'turn
off the spiders' went out. I looked at the areas that /robots.txt
disallows, then looked for the bots which ignored it and still
requested things under 'accounts'. This should cut down on CPU spikes,
since those requests hit dynamic data which can 'blow' things up.

It might be good to add similar blocking to pagure and src, since they
also seem to be hit a lot in the logs.
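The kind of log pass described above can be sketched as follows; the log pattern, the `offending_agents` helper, and the `/accounts` prefix are illustrative assumptions, not tooling from this repo:

```python
import re
from collections import Counter

# Combined-log-format fields we need: request path and User-Agent.
LOG_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*"'  # request line
    r' \d{3} \S+ "[^"]*" '                     # status, size, referer
    r'"(?P<ua>[^"]*)"'                         # user agent
)

# Prefixes robots.txt asks crawlers to avoid (example list).
DISALLOWED = ("/accounts",)

def offending_agents(lines):
    """Count User-Agents that fetched disallowed paths despite robots.txt."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("path").startswith(DISALLOWED):
            hits[m.group("ua")] += 1
    return hits
```

Feeding access_log lines through this and sorting the counter surfaces the bots worth adding to the RewriteCond list.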

Signed-off-by: Stephen Smoogen ssmoogen@redhat.com

2 new commits added

  • Add blocks to nagios.conf httpd
  • Add blockers to dl.fedoraproject.org
6 months ago

rebased onto 377e83f

6 months ago


Pull-Request has been merged by zlopez

6 months ago

Merged and deployed by running:

ansible-playbook /srv/web/infra/ansible/playbooks/groups/mailman.yml -t config,robots
ansible-playbook /srv/web/infra/ansible/playbooks/groups/download.yml -t config
ansible-playbook /srv/web/infra/ansible/playbooks/groups/noc.yml -t nagios_server
ansible-playbook /srv/web/infra/ansible/playbooks/groups/pagure.yml -t config
ansible-playbook /srv/web/infra/ansible/playbooks/groups/pkgs.yml -t config