After some late-night Googling, I ran across a proposed method for significantly speeding up a grep search, from dogbane over on Stack Overflow.
I went ahead and dug deeper with some research and even set up a little test to try things out and understand what’s going on.
As someone who’s used grep for nearly a decade, I’m a bit embarrassed to say I’d never heard of this.
If you’d rather skip my extensive research and are just curious about the actual testing results, I won’t get offended, much.
Locale and internationalisation variables
In a shell execution environment, you alter the environment behaviour with variables.
There is a special subset of internationalisation variables that control how internationalised applications behave, and grep is one of those applications.
You can easily view your server’s current locale setting by running:
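The command isn’t shown above, but assuming it’s the standard locale utility, a minimal sketch would be:

```shell
# Print the current locale settings (LANG, LC_ALL and each LC_* category).
# Assumes the standard `locale` utility is installed; the values vary by system.
current_locale=$(locale)
echo "$current_locale"
```

Each line of the output is one variable, so a value like en_US.UTF-8 next to LC_CTYPE means you are currently on a UTF8 locale.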
LC_ALL variable
One variable you can adjust is called LC_ALL. It overrides all of the LC_* variables at once with a specified locale.
If we simply put LC_ALL=C in front of our command, we change the locale used by that command only.
The C locale is the system’s base Unix/Linux locale, which works in plain ASCII.
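As a quick sketch of that scoping (the sh -c wrapper and the variable names are only there for illustration), the assignment applies to the single command it prefixes and leaves the rest of your shell session alone:

```shell
# Remember what LC_ALL is in the current shell (often empty or unset).
before="${LC_ALL-}"

# Prefixing the assignment sets LC_ALL=C for this one command only.
inside=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')

# The surrounding shell is unchanged afterwards.
after="${LC_ALL-}"

echo "inside the command: $inside"   # prints: inside the command: C
```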
UTF8 vs ASCII
This all might not make a whole lot of sense yet, but hang with me.
Basically when you grep something, on most systems your default locale is going to be internationalised and set to UTF8.
UTF8 can represent every character in the Unicode character set to help display any of the world’s writing systems, currently more than 110,000 unique characters.
So what’s the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set consists of a whopping 128 unique characters.
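To make the difference concrete, here is a small sketch that counts bytes. The sample characters are arbitrary picks for illustration: a plain ASCII letter is always one byte, while an accented letter encoded as UTF8 takes two, so a UTF8-aware tool has to work out where each character begins and ends.

```shell
# 'e' is plain ASCII: one byte.
ascii_bytes=$(printf 'e' | wc -c | tr -d ' ')

# e-acute encoded as UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
utf8_bytes=$(printf '\303\251' | wc -c | tr -d ' ')

echo "ASCII e: $ascii_bytes byte, UTF-8 e-acute: $utf8_bytes bytes"
```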
Servers and computers these days can quickly process data thrown at them, but the more efficiently we hand it data, the faster it will be able to accomplish the task and with fewer resources.
Using strace to see what’s going on
I won’t get too technical on it in this article, but strace is a utility to keep tabs on what a process is up to.
Below I’m displaying a file with 1 line with the cat command. The strace output is stored in a file called TRACE.
Then I call egrep to only show mentions of open and read operations:
Now here is the same thing with our little LC_ALL=C trick:
That was 19 opens and 5 reads for my first test, and 3 opens and 3 reads for the LC_ALL=C test.
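The trace commands aren’t reproduced above, so here is a rough sketch of how such a comparison could be run. The count_calls helper and the one_line.txt and TRACE_C names are hypothetical, and the counts you get will depend on your system and strace version, so don’t expect them to match mine exactly.

```shell
# Hypothetical helper: count lines in a trace file that start with a given
# syscall name. Also matches the "at" variant (e.g. openat), which newer
# systems log instead of plain open.
count_calls() {
    grep -cE "^$1(at)?\(" "$2"
}

printf 'one line\n' > one_line.txt

# Only trace if strace is installed and allowed to run on this machine.
if command -v strace >/dev/null 2>&1 &&
   strace -o TRACE cat one_line.txt >/dev/null 2>&1; then
    # Same command again, this time with the C locale.
    LC_ALL=C strace -o TRACE_C cat one_line.txt >/dev/null 2>&1
    echo "default:  $(count_calls open TRACE) opens, $(count_calls read TRACE) reads"
    echo "LC_ALL=C: $(count_calls open TRACE_C) opens, $(count_calls read TRACE_C) reads"
fi
```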
You can see that in the default test we had to open multiple files in the /usr/lib/locale/en_US.utf8 directory.
The largest of these locale files are LC_COLLATE and LC_CTYPE.
I threw in a plain en_US locale for comparison’s sake, which is much smaller than the utf8 version:
Sorting things out
If you’re anything like me, I’m sure at some point in your life you’ve had to recite your ABCs to figure out the alphabetical sorting of something. Imagine having thousands and thousands of letters to keep track of and having to keep starting over. Doesn’t sound too efficient, does it?
Now I also bring up an alphabetical sorting example because it’s important to note that you don’t want to just go using LC_ALL=C for everything. I won’t go in depth here, but basically just know that the sort command is gonna order things differently depending on the locale.
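A tiny sketch of that sorting difference, using made-up sample data: in the C locale, sort compares raw byte values, so every uppercase ASCII letter lands before every lowercase one.

```shell
# Sample input is made up for illustration.
# In the C locale, sort compares raw byte values, so uppercase ASCII
# letters (65-90) all come before lowercase ones (97-122).
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr -d '\n')
echo "$c_sorted"   # prints ABab
```

Under a locale such as en_US.UTF-8 the same input will typically come back with upper and lower case interleaved instead, though the exact order depends on the locale data installed on your system.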
Proof is in the pudding
Once I understood the basic principles of what was supposed to be happening, I was excited to start testing right away to see just how much of a boost in search speed I could get.
Avoiding filesystem caching
Now due to the way Linux caches things from the disk into memory, you might have noticed if you grep a file, the first time you do it could take 5-10 seconds. But if you do the exact same search a bit later, it’s almost instant.
That’s because the filesystem caches the file into memory which is way faster than your hard drive.
I knew this going into my tests, so I knew I couldn’t just run a timed grep against my file and then do it again a few seconds later without severely skewed results due to the system caching.
So what I did was first build up a 5MB or so test file, by running the following command:
So now my WP_LOGINS file had about 21,000 lines of attempted wp-login.php requests.
I wanted a ton more to really see the impact on large files, so I proceeded to duplicate the contents of my WP_LOGINS file 100 times into a new file called WP_LOGINS2 with this command:
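The original command isn’t reproduced here. One way to do it, sketched as a small hypothetical helper named repeat_file, is to append the source file to the destination in a loop:

```shell
# Hypothetical helper: write COPIES back-to-back copies of SRC into DST.
repeat_file() {
    src=$1
    dst=$2
    copies=$3
    : > "$dst"                      # start with an empty destination file
    i=0
    while [ "$i" -lt "$copies" ]; do
        cat "$src" >> "$dst"        # append one more copy
        i=$((i + 1))
    done
}

# Mirrors the setup described here; only runs if WP_LOGINS exists.
if [ -f WP_LOGINS ]; then
    repeat_file WP_LOGINS WP_LOGINS2 100
fi
```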
Now I’ve got a 504MB file with 2,100,000 lines, and that should provide a great testbed, at least for one of the tests. So I also duplicated this file multiple times to again avoid filesystem caching in-between tests.
Testing LC_ALL=C grep and fgrep performance
I ran 2 tests with the default grep command, looking for hits of wp-login.php and providing a count.
I also did 2 with LC_ALL=C set first, and 2 using both LC_ALL=C and fgrep, which matches only fixed strings and is even more efficient for simple searches like this one.
Here is the series of tests I ran:
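The exact commands aren’t shown here, so the following is only a sketch of what the three variants might look like, based on the description above. grep -c prints the number of matching lines, and grep -F is the modern spelling of fgrep. The sample.log file is a hypothetical miniature stand-in for the big test file.

```shell
# Hypothetical miniature stand-in for the large WP_LOGINS2 copies.
printf 'POST /wp-login.php\nGET /index.php\nPOST /wp-login.php\n' > sample.log

# 1) Default locale, regular expression search.
default_count=$(grep -c 'wp-login.php' sample.log)

# 2) C locale, regular expression search.
c_count=$(LC_ALL=C grep -c 'wp-login.php' sample.log)

# 3) C locale, fixed-string search (grep -F is equivalent to fgrep).
fixed_count=$(LC_ALL=C grep -cF 'wp-login.php' sample.log)

echo "$default_count $c_count $fixed_count"   # prints 2 2 2
```

All three variants report the same count; only the time taken differs. For the timed runs, each variant would be prefixed with time, for example time grep -c 'wp-login.php' WP_LOGINS2, and pointed at a copy of the large file that has not been read yet.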
The time command breaks the results up into 3 values. Funnily enough, the LC_ALL=C locale also alters the output format of the time command, which is why the results look different between tests.
Here are the meanings behind these values:
real – How much wall clock time the test took
user – CPU seconds consumed in user space
sys – CPU seconds consumed in system space
Now here are the results from the tests:
Here it is in a table:
Test | real (s) | user (s) | sys (s) |
---|---|---|---|
grep 1 | 9.56 | 9.42 | 0.13 |
grep 2 | 9.45 | 9.32 | 0.13 |
LC_ALL 1 | 1.50 | 1.37 | 0.13 |
LC_ALL 2 | 1.48 | 1.37 | 0.11 |
fgrep 1 | 0.67 | 0.54 | 0.12 |
fgrep 2 | 0.66 | 0.54 | 0.12 |
Conclusion
So there you have it: standard grep took about 9 1/2 seconds.
Using the LC_ALL=C locale made the search roughly 6.4 times as fast (about 640% of the original speed) and brought that time down to 1 1/2 seconds.
Using fgrep made it roughly 14.3 times as fast (about 1427% of the original speed) and brought that time down to just over 1/2 a second.
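As a quick check of those figures, the speedup factor is just the baseline time divided by the faster time. A small sketch using the real times from the first run of each test in the table (the variable names are only for illustration):

```shell
# Real (wall clock) times in seconds, first run of each test in the table.
grep_real=9.56
lc_all_real=1.50
fgrep_real=0.67

# Speedup factor = baseline time / faster time.
# LC_ALL=C keeps awk's number formatting independent of the user's locale.
lc_all_speedup=$(LC_ALL=C awk -v a="$grep_real" -v b="$lc_all_real" 'BEGIN { printf "%.1f", a / b }')
fgrep_speedup=$(LC_ALL=C awk -v a="$grep_real" -v b="$fgrep_real" 'BEGIN { printf "%.1f", a / b }')

echo "LC_ALL=C: ${lc_all_speedup}x, fgrep: ${fgrep_speedup}x"   # prints LC_ALL=C: 6.4x, fgrep: 14.3x
```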
If you skipped down here and were wondering why things got faster, check out my locale research above.
Needless to say, I’ll be using this tactic in a ton of scripts and when doing manual grep searches going forward. Hopefully this information will help speed along your own searches as well.
I imagine the reason it did not work for many people is that their default language was already C (mine is). If you are unsure of what your default locale is, you should set LANG to en_US.UTF-8 or whatever before running the tests.
Does not have any effect on Ubuntu 14.04 and 16.04. It did work for me for sure last time I tried it back in 2008.
Zaar
Thank you for sharing this great finding! This trick made my process go from 1.20 hours to a matter of seconds!
Thanks again for making my life better =)
No errors, just no improvement in time to run grep commands.
Tried this on a CentOS 5 system with no luck. Is it OS or distro specific?
Hello MadMan,
This tutorial was made on either a CentOS 5.6 or 6.0 server, so this should work. Are you getting any errors?
Best Regards,
TJ Edens
Hey great article Jacob,
This does affect more than meets the eye. For instance, download a file with UTF-8 characters in it, like many web pages, and then use strace to see how grep’s regex is affected:
As you wrote, this may not give you what you were expecting:
LC_ALL=C sort moop.txt
But, this might:
LC_ALL=C sort -f moop.txt
You could also keep the Linux buffer cache from skewing your testing by dropping the caches before each test.
echo 3 > /proc/sys/vm/drop_caches
Hello Noah, and thanks for the comment!
You are correct! That is another great way to make sure the page cache isn’t skewing results. However, be careful, because your system could seem a bit sluggish while it rebuilds the page cache after totally clearing it out.
Thanks again!
– Jacob