In this article we’ll review how to locate potentially problematic user agents in the requests to your site that could be causing additional resource usage on your server.
This guide is meant for VPS or dedicated server customers who have SSH access to their server. If you have set up a server load monitoring bash script, or you’re using one of the tools mentioned in our advanced server load monitoring article, and you see that your server’s load average has been spiking, it’s a good idea to check whether any particular user agents in your access logs seem to be causing it.
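If you want a quick spot check of the load before digging into the logs, the standard Linux commands below print the 1, 5, and 15 minute load averages; this is only a simple sketch and not a replacement for the monitoring tools mentioned above.

# Show the current time, uptime, logged-in users, and the 1, 5, and 15 minute load averages
uptime
# The same load averages are also available directly from the kernel
cat /proc/loadavg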
Locate high numbers of duplicate user agents
You can check your Apache access logs for a high number of duplicate requests from a single user agent using the steps below.
- Log in to your server via SSH.
- Navigate to the access-logs directory under the home directory of the account you’d like to investigate. In this example, our cPanel username is userna5 and our domain name is example.com:
cd /home/userna5/access-logs
- You can use the awk command to print only the user-agent column of the Apache log. We’ll then pipe (|) that output to the sort command so that all of the user agents are grouped together by name, pipe that to the uniq -c command to count how many times each user agent occurs, and finally pipe it all to the sort -n command so the user agents are sorted by how many total requests they had:
awk -F""" '{print $(NF-1)}' example.com | sort | uniq -c | sort -n
You should get back something similar to this:
1308 facebookexternalhit/1.0 (+https://www.facebook.com/externalhit_uatext.php)
1861 facebookexternalhit/1.1 (+https://www.facebook.com/externalhit_uatext.php)
1931 msnbot-media/1.1 (+https://search.msn.com/msnbot.htm)
3293 Mozilla/5.0 (compatible; AhrefsBot/4.0; +https://ahrefs.com/robot/)
- Now we can see that the AhrefsBot/4.0 search engine crawler currently has far more requests than any other user agent. In this case, let’s say this website doesn’t necessarily want to be indexed by that search engine, and the site owner is only concerned with Google and Bing (MSN) crawling it. We could then use the robots.txt file to stop that search engine from crawling the website, as shown in the sketch after this list.
- If requests from this user agent continue to flood in and are causing an immediate issue on your server, the robots.txt rules won’t stop them until the bot requests the rules again. You can use our guide on how to block bad users based on their user agent string to stop them from accessing your site right away; a sketch of one common approach also follows this list.
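As a minimal sketch of the robots.txt approach, the commands below append a rule asking AhrefsBot to stop crawling the site entirely while leaving other crawlers unaffected. The document root /home/userna5/public_html is an assumption based on the example account, so adjust the path for your site, and keep in mind that crawlers only honor these rules the next time they fetch robots.txt.

# Ask AhrefsBot not to crawl any part of the site (assumed document root shown)
cat >> /home/userna5/public_html/robots.txt << 'EOF'
User-agent: AhrefsBot
Disallow: /
EOF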
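The guide linked above covers blocking by user agent string in detail; as one hedged example of the general technique, the mod_rewrite rules below return a 403 Forbidden response to any request whose User-Agent header contains AhrefsBot. This assumes mod_rewrite is enabled on your server and uses the same assumed document root as above.

# Immediately deny requests from a matching user agent (example pattern: AhrefsBot)
cat >> /home/userna5/public_html/.htaccess << 'EOF'
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
RewriteRule .* - [F,L]
EOF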
You should now understand how to locate potentially problematic user agents that are hitting your site and causing issues.