nginx: How To Set A Connection Limit For Search Engine Bots Gone Wild (Especially Bingbot)

As a server administrator, you might know this problem: you have done everything to optimize your server and it's working really well, and along comes a stupid search engine bot (like Bingbot) and hits all your vhosts at the same time with hundreds of connections, making your server load go up. Of course, you don't want to completely block the bot (unless you don't care about that particular search engine), so you can use robots.txt and/or nginx to control connections of a search engine bot to your server.

 

1 Preliminary Note

I'm focusing on Bingbot in this tutorial because this bot opens an excessive number of connections each time it visits a web site (I haven't noticed this with any other search engine bot). Of course, the first thing you should do is limit the crawl rate in the Bing webmaster tools. If that doesn't help, or if you don't have access to the Bing webmaster tools for all vhosts on your server, read on.

 

2 Using robots.txt

Bingbot understands the Crawl-delay directive (Googlebot doesn't, so don't use it for Googlebot!), so you can add it to your robots.txt file (see http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbot-s-question.aspx):

User-Agent: bingbot
Crawl-delay: 1

Because you've created a separate section for Bingbot in your robots.txt, the Allow/Disallow directives under User-Agent: * no longer apply to Bingbot (a crawler follows only the most specific section that matches it). Make sure to repeat your Allow/Disallow directives in the Bingbot section, e.g. like this:

User-Agent: *
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml

User-Agent: bingbot
Crawl-delay: 1
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml
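
To see why the Disallow lines have to be repeated, you can feed a robots.txt to a parser and ask it what Bingbot may fetch. The following is a minimal sketch using Python's standard urllib.robotparser module; the shortened robots.txt contents and the test URL are made up for illustration, and the result shows how Python's parser reads the file, which is not necessarily identical to how Bingbot itself interprets it.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Bingbot section WITHOUT the repeated Disallow rule (hypothetical file)
incomplete = parse_robots("""\
User-Agent: *
Disallow: /cache/

User-Agent: bingbot
Crawl-delay: 1
""")

# Bingbot section WITH the repeated Disallow rule (hypothetical file)
complete = parse_robots("""\
User-Agent: *
Disallow: /cache/

User-Agent: bingbot
Crawl-delay: 1
Disallow: /cache/
""")

url = "http://www.example.com/cache/page.html"

# The bingbot section replaces the "*" section instead of extending it.
print(incomplete.can_fetch("bingbot", url))  # True: /cache/ is open to Bingbot
print(complete.can_fetch("bingbot", url))    # False: the rule was repeated
print(complete.crawl_delay("bingbot"))       # 1
```

Other bots still fall back to the User-Agent: * section, so they are unaffected by the Bingbot section.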

 

3 Using nginx To Control Bot Connections

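We can use the HttpGeoModule (geo), the map directive and the HttpLimitReqModule (limit_req) to limit the rate at which search engine bots can send requests to your nginx server. Note that this limits requests per second, not the number of simultaneous connections. Open /etc/nginx/nginx.conf...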

vi /etc/nginx/nginx.conf

... and add this to your http {} container (before the part where your vhost configuration files are included/defined):

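[...]
# Flag requests from known bot networks (1 = bot, 0 = everyone else)
geo $isabot {
    default 0;
    # bingbot
    157.55.32.0/24 1;
    157.56.229.0/24 1;
    157.56.93.0/24 1;
    157.55.33.0/24 1;
}
# Bots are keyed by their IP address; all other clients get an empty key,
# and requests with an empty key are not counted by limit_req
map $isabot $limited_ip_key {
    0 '';
    1 $binary_remote_addr;
}
# Allow 2 requests per second per key and queue up to 200 excess requests
limit_req_zone $limited_ip_key zone=isabot:5m rate=2r/s;
limit_req zone=isabot burst=200;
[...]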

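This limits each Bingbot IP address in the listed networks to two requests per second (the limit key is $binary_remote_addr, so the rate applies per address, not to Bingbot as a whole). Requests that exceed this rate are delayed and queued, up to a burst of 200; once the burst is full, nginx answers further requests from that address with a 503 error.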
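
If you want a feeling for how rate=2r/s and burst=200 interact, the following Python sketch models the accounting for a single IP address. It is a simplified model written for this illustration, not nginx's actual code: the function name simulate_limit_req and the millisecond timestamps are assumptions, and the real module also delays queued requests, which the sketch only labels.

```python
def simulate_limit_req(arrival_times_ms, rate=2, burst=200):
    """Classify requests from ONE client as passed, delayed or rejected.

    Simplified leaky-bucket model: the backlog ("excess") is kept in
    thousandths of a request, drains at `rate` requests per second and
    grows by one request for every request that is accepted.
    """
    excess = 0      # current backlog, in thousandths of a request
    last = None     # arrival time of the last accepted request (ms)
    outcomes = []
    for now in arrival_times_ms:
        if last is None:
            new_excess = 0  # first request from this client
        else:
            # `rate` r/s drains rate thousandths of a request per ms
            new_excess = max(excess - rate * (now - last) + 1000, 0)
        if new_excess > burst * 1000:
            outcomes.append("rejected")  # nginx would answer with 503
            continue                     # rejected requests are not counted
        excess, last = new_excess, now
        outcomes.append("delayed" if excess > 0 else "passed")
    return outcomes


# 300 requests arriving at the same instant from one bot IP
flood = simulate_limit_req([0] * 300)
print({o: flood.count(o) for o in ("passed", "delayed", "rejected")})
# {'passed': 1, 'delayed': 200, 'rejected': 99}

# One request every 500 ms (exactly 2 r/s) never builds up a backlog
steady = simulate_limit_req(range(0, 10_000, 500))
print(set(steady))  # {'passed'}
```

In this model a sudden flood fills the burst queue after 200 extra requests and everything beyond that is rejected, while a bot that stays at two requests per second is never slowed down.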

Of course, you are free to add more IPs or subnets to the geo container.
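
Before adding subnets, it can help to check whether an address from your access log is already covered by the geo block. Here is a small sketch using Python's standard ipaddress module; the list simply copies the four networks from the configuration above, and the helper name is_listed_bot as well as the sample addresses are made up for illustration.

```python
import ipaddress

# The same networks as in the geo $isabot block
BOT_NETWORKS = [
    ipaddress.ip_network(net)
    for net in (
        "157.55.32.0/24",
        "157.56.229.0/24",
        "157.56.93.0/24",
        "157.55.33.0/24",
    )
]


def is_listed_bot(address):
    """Return True if the address falls into one of the listed networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in BOT_NETWORKS)


print(is_listed_bot("157.55.32.10"))  # True: inside 157.55.32.0/24
print(is_listed_bot("157.55.34.10"))  # False: not covered, would not be limited
```

An address for which this returns False gets an empty limit key in nginx and is therefore not rate-limited.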

Don't forget to check the configuration for syntax errors and reload nginx:

nginx -t
/etc/init.d/nginx reload

 

4 Links
  • nginx: http://nginx.org/
  • nginx Wiki: http://wiki.nginx.org/

Ref From: howtoforge