In this article I’m going to teach you how to identify and then block bad robots that could be using up system resources on your server.
What is a Bad Robot?
There are many reasons that an automated robot would crawl through your website. The most common is that large search engines such as Google or Bing are trying to find all the content on your website, so they can serve it up to their users in response to search queries.
These robots are supposed to follow rules that you place in a robots.txt file, such as how frequently they are allowed to request pages, and from what directories they are allowed to crawl through. They should also be supplying a consistent valid User-Agent string that identifies the requests as a bot request.
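As a rough illustration of the kind of rules described above, the sketch below writes a small robots.txt file. The function name write_sample_robots, the directory names, and the 10-second delay are all made up for the example. Note also that Crawl-delay is a non-standard directive: some crawlers honor it, while others (Googlebot among them) ignore it.

```shell
#!/bin/sh
# Hypothetical helper: write an example robots.txt to the path given in $1.
# The directories and the delay value are illustrative, not recommendations.
write_sample_robots() {
    cat > "$1" <<'EOF'
# Applies to every crawler that honors robots.txt
User-agent: *
# Ask for at most one request every 10 seconds (non-standard directive)
Crawl-delay: 10
# Keep crawlers out of these directories
Disallow: /cgi-bin/
Disallow: /private/
EOF
}
```

You could call it as write_sample_robots ~userna5/public_html/robots.txt, adjusting the path to your own document root.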
A bad robot will usually ignore robots.txt rules, request pages too quickly, re-visit your site too frequently, attempt to harvest email addresses, or in general simply provide no value back to your website. When a good robot crawls your site, this is typically so other people can find your content and then be directed to it from a search engine. When a bad robot crawls through your site, it could be for malicious intentions, such as copying your content so that its operator can use it as their own.
Identify a Bad Robot
Using the steps below, I’ll show you how to verify whether a robot is a good or a bad one.
Please note in order to follow these steps you would need to be on either a VPS (Virtual Private Server) or dedicated server that has SSH access. If you’re on a shared server you could read our guide on enabling raw access log archiving in cPanel to be able to view the same data, but it would have to be on your own local computer.
- Log in to your server via SSH.
- Navigate to your user’s home directory where the Apache access logs are stored, in this case our username is userna5, so we’ll use the following command:
cd ~userna5/access-logs/
- We can now use the following command to see all User-Agents that have made requests to our example.com website:
cat example.com | awk -F'"' '{print $6}' | sort | uniq -c | sort -n
This gives us back the output:
638 Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
1015 msnbot-UDiscovery/2.0b (+https://search.msn.com/msnbot.htm)
1344 Mozilla/5.0 (en-US) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229 Safari/537.4 pss-webkit-request
21937 -
So in this case, there have been 21,937 requests that didn’t supply a User-Agent string. This is an immediate red flag, as any human visitor requesting a page from your website would typically send the User-Agent of their web browser with each request.
We can also see that the next highest level of requests came from something calling itself pss-webkit-request, followed by msnbot-UDiscovery/2.0b, and then Googlebot/2.1.
- Next, we can filter for the requests that didn’t provide a User-Agent string and list the unique IP addresses that sent them, with the following command:
cat example.com | awk -F\" '$6 ~ "-"' | awk '{print $1}' | sort -n | uniq -c | sort -n
In this example, here are the top IPs that had requests without a User-Agent string:
421 74.125.176.94
434 74.125.176.85
463 74.125.176.95
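One caveat with the filter above: $6 ~ "-" is a regular-expression match, so it also catches User-Agent strings that merely contain a hyphen, such as msnbot-UDiscovery/2.0b. A minimal sketch of a stricter variant, assuming the Apache combined log format, compares the field for equality instead. The function name no_ua_ips is hypothetical, and the counting is done inside awk rather than with uniq -c:

```shell
#!/bin/sh
# Hypothetical helper: count requests per IP whose User-Agent is exactly "-".
# Assumes the Apache "combined" log format, where the User-Agent is the 6th
# field when each line is split on double quotes.
no_ua_ips() {
    awk -F'"' '
        $6 == "-" {
            split($1, parts, " ")   # $1 is: IP, ident, user, [timestamp]
            count[parts[1]]++       # parts[1] is the client IP
        }
        END { for (ip in count) print count[ip], ip }
    ' "$1" | sort -n
}
```

It could be run as no_ua_ips ~userna5/access-logs/example.com, and prints one "count IP" pair per line, with the busiest IP last.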
- We can now search for these IP addresses in our access log to see what might be going on with their requests. The following command looks for the User-Agent strings coming from 74.125.176.95, the IP address that had the most requests:
grep 74.125.176.95 example.com | awk -F\" '{print $6}' | sort | uniq -c | sort -n
This gives us back:
7 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
11 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
29 Mozilla/5.0 (en-US) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229 Safari/537.4 pss-webkit-request
434 -
Again, this is another red flag: requests coming from a single IP address will typically use the same User-Agent string for each request. We can go a step further and run the following command to find out some information about the IP address:
whois 74.125.176.95
Some of the pertinent information that command gives us back for that IP is:
NetRange: 74.125.0.0 - 74.125.255.255
NetName: GOOGLE
OrgName: Google Inc.
OrgId: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US
RegDate: 2000-03-30
Updated: 2011-09-24
Ref: https://whois.arin.net/rest/org/GOGL
This is an IP address that belongs to Google, and so are all the other IPs we saw coming from the 74.125 IP range. However this isn’t coming from the official Googlebot crawler which would be identified as such by the User-Agent string, but instead these are requests from their AppEngine/Cloud service.
So these requests are more than likely from custom crawlers that other users have built. In some cases they could simply be trying to index your content for their own purposes instead of providing links back to you, which would fall under our definition of a bad robot.
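Since a User-Agent string is easy to fake, a complementary check for whether an IP really belongs to Google's own crawler is a reverse DNS lookup followed by a forward lookup. The sketch below is only an illustration: the function names are hypothetical, it assumes the host utility is installed, and it assumes the googlebot.com and google.com suffixes that Google's documentation has described for its main crawlers.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Return 0 if the hostname in $1 ends in .googlebot.com or .google.com.
# A trailing dot, as printed by DNS tools, is stripped first.
is_google_hostname() {
    case "${1%.}" in
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Reverse-resolve the IP in $1, check the hostname suffix, then
# forward-resolve that hostname and confirm it maps back to the same IP.
# Assumes the `host` utility is available on the server.
verify_googlebot_ip() {
    ip=$1
    name=$(host "$ip" | awk '/domain name pointer/ { print $NF; exit }')
    [ -n "$name" ] || return 1
    is_google_hostname "$name" || return 1
    host "${name%.}" | awk -v ip="$ip" '$NF == ip { found = 1 } END { exit !found }'
}
```

For example, verify_googlebot_ip 74.125.176.95 && echo "verified" would only print verified when both lookups agree. An IP in a Google-owned range that fails this check is not necessarily malicious, but it is not the search crawler itself.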
Block a Bad Robot
Now that you understand a bit about how to identify a possible bad robot, the next step is usually to block that robot if it has been causing usage problems on your website.
Using the steps below, I’ll show you how to block the entire range of 74.125 IPs we were seeing from accessing the example.com website, while still allowing their requests through if they do happen to mention Google in their User-Agent string.
- Edit the .htaccess file for your website with the following command:
vim ~userna5/public_html/.htaccess
Once the vim text editor has loaded the file, hit i to enter Insert mode, then enter the following code (in most SSH clients you can also right-click to paste text from the clipboard):
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !^.*(Google).*$
RewriteCond %{REMOTE_ADDR} ^74\.125\.
RewriteRule .* - [R=503,L]
Once you’ve entered that information in, hit Esc to exit Insert mode, then hold down Shift and type in ZZ to save the file.
These rules first check whether the User-Agent has the word Google in it anywhere. If it does not, processing moves on to the next rule, which checks whether the IP address begins with 74.125. If the IP matches, the visitor is served a 503 response of Site disabled for crawling, which uses very few server resources, instead of being allowed to hit your various website pages that could be causing a lot of usage.
- As a bonus, you can use the following command to check how many requests from these bad bots you’re saving your server from having to serve:
cat ~userna5/access-logs/example.com | grep "74.125" | awk '$9 ~ 503' | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c | sed 's/[ ]*//'
This gives you back the number of requests blocked per hour by your rule:
2637 14:00
2823 15:00
2185 16:00
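The pipeline above can also be collapsed into a single awk program. This is a minimal sketch assuming the Apache combined log format, where the client IP is the first whitespace-separated field, the timestamp is the fourth, and the status code is the ninth. The function name blocked_per_hour is hypothetical. Unlike grep "74.125", it only matches the prefix at the start of the client IP, so a stray 74.125 elsewhere in a log line is not counted.

```shell
#!/bin/sh
# Hypothetical helper: count 503 responses per hour for one IP prefix.
#   $1 = access log file
#   $2 = IP prefix to match at the start of the client IP, e.g. "74.125."
blocked_per_hour() {
    awk -v prefix="$2" '
        index($1, prefix) == 1 && $9 == 503 {
            split($4, t, ":")       # $4 looks like [10/Feb/2013:14:01:02
            count[t[2]]++           # t[2] is the hour
        }
        END { for (h in count) print count[h], h ":00" }
    ' "$1" | sort -k2
}
```

It could be run as blocked_per_hour ~userna5/access-logs/example.com "74.125." and prints one "count hour" pair per line, ordered by hour.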
You should now understand how to identify and then block bad robots from causing usage problems on your website with their excessive requests.
Hi there:
First off I want to say thanks for a really useful article and that I have read thoroughly and put into practice. However, being a bit of a novice, I have run into a couple of hassles around the awk command. I am not sure what the original intention was, but in my environment, running Ubuntu 22, the awk command is giving me a hassle because of quote issues – sometimes unclosed single-quotes, sometimes a double-quote – they consistently landed me at a prompt so I went hunting for help and this is what I got:
Original command: cat /var/log/apache2/access.log | awk -F" '$6 ~ "-"' | awk '{print $1}' | sort -n | uniq -c | sort -n
Updated command: cat /var/log/apache2/access.log | awk -F'"' '$6 ~ "-" {print $1}' | sort -n | uniq -c | sort -n
Original command: grep 198.199.119.63 /var/log/apache2/access.log | awk -F" '{print $6}' | sort | uniq -c | sort -n
Updated command: grep 198.199.119.63 /var/log/apache2/access.log | awk -F' ' '{print $6}' | sort | uniq -c | sort -n
Apart from those two issues, I really found this article useful. Thanks for sharing it.
Kind regards.
-Michael
Michael – thanks for your reply and the information that you have provided.
Thank you for the great article!
In step 4, you said:
cat example.com | awk -F\" '$6 ~ "-"' | awk '{print $1}' | sort -n | uniq -c | sort -n
Isn’t this looking for entries that contain "-" rather than entries equal to it? If so, it also returns requests that do have a user-agent value, as long as that value contains "-". Should we use $6 == "-" instead, or am I missing something?
Thanks again.
The idea here is to ignore the IPs that provide a user-agent string. It looks like the equal sign provides no output in my test.
Thanks for the article
From what I know: WhoIs lookup does not always help. There are services that will hide real domain holder identity. It’s usually the case with large botnet websites.
I’d also use 3 things to detect and prevent bad bots:
1) Google Analytics (rather time-consuming as you’d have to add new bots to your ban list constantly)
2) Heat Maps and Video Sessions Recording (works fine on smaller scales)
3) Fingerprinting – essentially a technology that replaces cookies with unique tracking codes that don’t change and are a lot harder to manipulate. There are affordable solutions like fraudhunt.net and more expensive ones – Forensiq and Distil.
Hope it helps
Thanks for this, what a great way to handle bad bots!
We’ve been harvesting the IPs from a contact form on our site. The form has multiple required fields, but bots are able to crawl the form action URL without invoking the field validator, so the result is an e-mail with no content other than the sender’s IP address.
We store the IPs in a flat file on the server and then redirect any traffic coming from those addresses to www.***.gov/default.aspx. Our solution works well; however, your solution should catch more “intruders” and reject them in a more efficient way.
I use PHP and filter on IP address and User-Agent to make the ‘bad bot’ wait 999 seconds
and return 0 bytes.
https://gelm.net/How-to-block-Baidu-with-PHP.htm