In this article, we'll review the steps you can take when your server's load average is spiking, to help determine the root cause of the issue.
For these examples, you would need to be on a VPS (Virtual Private Server) or dedicated server, so that you have SSH access to run commands on the command line.
Determining the Cause of a Server Usage Spike
- Log in to your server via SSH.
- Check on the load average of your server over a minute with the following command:
sar -q 5 12
02:10:06 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:10:11 PM         1       112      1.29      1.36      1.43
02:10:16 PM         3       109      1.27      1.35      1.43
02:10:21 PM         3       108      1.41      1.38      1.44
02:10:26 PM         4       118      1.62      1.42      1.45
02:10:31 PM         0       108      1.73      1.45      1.46
02:10:36 PM         4       119      1.67      1.44      1.46
02:10:41 PM         2       122      1.69      1.45      1.46
02:10:46 PM         0       113      1.64      1.44      1.46
02:10:51 PM         2       112      1.59      1.44      1.46
02:10:56 PM         0       103      1.46      1.41      1.45
02:11:01 PM         1       102      1.42      1.40      1.44
02:11:06 PM         0        97      1.31      1.38      1.44
Average:            2       110      1.51      1.41      1.45
This runs the sar command with the -q flag, which shows load averages.
The 5 tells it to run a check every 5 seconds, and the 12 tells it to do it 12 times.
If your ldavg-1 column stays consistently high, or continues to rise during this load check, this is an indication that you could have something on the server spiking its usage.
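As a rough rule of thumb (a general guideline, not something sar itself reports), a load average is weighed against the number of CPU cores on the server, so it can help to check both at once:

# Show the number of CPU cores, then the current 1, 5, and 15 minute load averages.
# A 1-minute load that consistently sits at or above the core count usually means
# the CPU is saturated (a rule of thumb, not a hard limit).
nproc
cat /proc/loadavg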
- Now, looking at our load averages, we can see that at 02:10:11 PM the load was 1.29, and it continued to climb until 02:10:31 PM, where it peaked at 1.73.
It's very common for website traffic, especially requests that run PHP scripts or other server-side code, to cause these spikes in usage. So the next step is to check your Apache access logs for what was going on around the time of the spike.
Using the following command, we are going to look at our website's access log to see how many "hits" happened from 2:09 PM – 2:10 PM (14:09 – 14:10). This way we can see the requests leading up to the spike, as well as after it:
egrep "15/Jan/2013:14:09|15/Jan/2013:14:10" /home/userna5/access-logs/example.com | wc -l
502
So here we can see that there were 502 requests over those two minutes. We can take this a bit further and break down how many requests happened per minute with this command:
egrep "15/Jan/2013:14:09|15/Jan/2013:14:10" example.com | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c | sed 's/[ ]*//'
164 14:09
338 14:10

This right here is already a pretty telling sign: our load average started to spike at 2:10 PM (14:10), and during that minute we had more than double the number of requests to our site compared to the previous minute. So it makes sense that the server is having to work harder to serve those requests.
- Now comes the part where we take an even deeper look at what was going on with those requests. Because the server can handle 100 or so image or plain HTML page requests with much less of a usage spike than the same number of PHP script requests, it's important to know exactly what is being requested.
We can use the following command in order to see what duplicate requests have been happening:
egrep "15/Jan/2013:14:09|15/Jan/2013:14:10" example.com | cut -d" -f2 | awk '{print $1 " " $2}' | cut -d? -f1 | sort | uniq -c | sort -n | sed 's/[ ]*//'
15 GET /wp-content/plugins/s2member/s2member-o.php
22 GET /about-us/
26 GET /wp-content/uploads/2012/06/logo.png

Here we can see that this happens to be a WordPress site. The most duplicated request is the logo.png image, which on its own probably isn't going to cause a load spike. However, the 22 requests for /about-us/ and the 15 for /wp-content/plugins/s2member/s2member-o.php in a two-minute period might have.
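One quick way to sanity-check that assumption is to time a request to each URL with curl (a rough comparison only, since response times vary from request to request; the example.com hostname is just the placeholder used earlier in the article):

# Time a static image versus the PHP-generated page (rough comparison only).
curl -s -o /dev/null -w '%{time_total}s  %{url_effective}\n' http://example.com/wp-content/uploads/2012/06/logo.png
curl -s -o /dev/null -w '%{time_total}s  %{url_effective}\n' http://example.com/about-us/

If the PHP page consistently takes several times longer than the image, that supports the idea that those requests are what the server is working hardest on.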
Taking a look at this WordPress site, I noticed that there is currently no form of caching enabled, such as the W3 Total Cache plugin. That means each time the /about-us/ page is requested, the server has to re-process the PHP script, connect to the database, and retrieve the page. So, within a few minutes of a server load spike, we were able to determine that the likely culprit was a sudden influx of requests for a WordPress page that isn't cached.
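If you aren't sure whether a caching plugin is already doing its job, one quick, rough check is to look at the response headers for the heavy page; which cache-related headers show up (if any) depends entirely on the plugin and server configuration, so treat this as a hint rather than proof:

# Look for cache-related response headers on the heavy page (header names vary by setup).
curl -sI http://example.com/about-us/ | grep -i -E '^(cache-control|expires|age|x-cache)'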
You should now have a basic understanding of how to track down the possible cause of a server load spike. You might also be interested in reading our articles about advanced server load monitoring, or about how to create a server load monitoring bash script to alert you via email when your server's load is spiking.
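To give a taste of what that kind of script can look like, here is a minimal sketch; the threshold, the email address, and the availability of the mail command are all assumptions, and the dedicated article covers this in more detail:

#!/bin/bash
# Minimal load-alert sketch. Assumptions: the "mail" command is installed,
# a 1-minute load above 5 counts as a spike on this server, and the address
# below is a placeholder.
THRESHOLD=5
EMAIL="admin@example.com"

# The first field of /proc/loadavg is the 1-minute load average.
LOAD=$(cut -d' ' -f1 /proc/loadavg)

# Compare as floating point with awk, since bash arithmetic is integer-only.
if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    echo "Load average is currently $LOAD" | mail -s "Load spike on $(hostname)" "$EMAIL"
fi

Run from cron every few minutes, something along these lines is enough to get an email while a spike is still happening.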
Since I have 90+ cpanels, I had the same issue as Rama — getting to step 3 didn’t help. After searching around, and a couple of support calls, I tried using the command
`top -icd1`
which shows (among other things) %CPU usage. This at least gave me cPanel users to focus on for troubleshooting high usage.
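For a quick non-interactive snapshot of the same information, something like this also works (just a sketch; adjust the columns and the number of lines to taste):

# List the top CPU consumers along with their users (GNU ps on Linux).
ps -eo user,pcpu,pmem,args --sort=-pcpu | head -n 15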
But a lot of my cPanels were showing up in this listing, almost all of them. I noticed that one of the recurring commands (another thing that is displayed) was admin-ajax.php. So I followed a lot of other help articles here in support: installing and configuring the Heartbeat Control plugin, installing an Idle User Logout plugin, etc., but this did not really lower my server usage.
Then I tried looking for a site that was currently showing high usage with admin-ajax.php, opening the Raw Access logs and looking for admin-ajax.php calls. What I found is that a lot of them were coming from the WordFence plugin, which we use on all our sites. In their help forum, I found some settings to reduce server usage: don't scan outside your WP install; don't scan images, binary, and other files as if they were executable; don't enable high sensitivity scanning; and DO enable low resource scanning. Then also disable live view.
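For anyone else hunting the same thing, a rough per-account count of those admin-ajax.php calls can be pulled straight from the access logs; the /home/*/access-logs/* path just mirrors the path used earlier in the article and may differ on your server:

# Count admin-ajax.php hits per access log, busiest first (the path is an assumption).
grep -c "admin-ajax.php" /home/*/access-logs/* 2>/dev/null | sort -t: -k2 -rn | head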
Using the export/import settings feature in WordFence, I was able to quickly get these settings changed on about half of our cPanels, and the server load has come way down. Hoping to finish the rest of the cPanels soon.
We also use BackWPup plugin on all our sites, and it also has an option in settings to reduce server load. I didn’t see that call happening in the raw logs, but that does give me one more place to make an adjustment if we have more high server usage.
Hope someone else finds this helpful!
Thank you so much for sharing your insight and experience!
Cheers!
John-Paul
This method won’t work. Why? When you get to
“Step 3” there is no way to determine which website’s access log to use; there are hundreds of websites (or more, depending on the server), so at best, you’re guessing. 😉
Rama, that’s an excellent point, but remember that this article is for illustration purposes. Having said that, you could change the text for ‘example.com’ in the first egrep command to an * and it would show all domains for that user. However, it is very important to keep in mind that if you have a lot of domains this process may drive up usage on your server, too, and may take a long time to complete.
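As a rough sketch of what that would look like (keeping the date, times, and the userna5 path from the article as placeholders):

egrep "15/Jan/2013:14:09|15/Jan/2013:14:10" /home/userna5/access-logs/* | wc -l

Dropping the | wc -l from the end will also show which log each request came from, since grep prefixes every matching line with the file name when it searches more than one file.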