Speed up grep searches with LC_ALL=C

Ever try to run a grep search on a large file, and wish there was a way to speed things up?

After some late-night Googling, I ran across a method for significantly speeding up a grep search, proposed by dogbane over on StackOverflow.

I went ahead and dug deeper with research, and even set up a little test to try things out and understand what’s going on.

As someone who’s used grep for nearly a decade, I’m a bit embarrassed to say I’d never heard of this.

If you care to skip over my extensive research on this and are just curious about the actual testing results, I won’t get offended, much.


Locale and internationalisation variables

In a shell execution environment, you alter the environment’s behaviour with variables.

A special subset of these, the internationalisation variables, controls how locale-aware applications behave, and grep is one of those applications.

You can easily view your server’s current locale setting by running:

    root@server [~] locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

LC_ALL variable

One variable you can adjust is called LC_ALL. This sets all LC_ type variables at once to a specified locale.

If we simply put LC_ALL=C in front of our command, we change the locale used by that one command.

The C locale (also known as POSIX) is the minimal base locale of Unix/Linux systems, in which text is handled as plain ASCII bytes.

    root@server [~] LC_ALL=C locale
    LANG=en_US.UTF-8
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=C
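To make the scoping concrete, here is a small sketch of the difference between prefixing a single command and exporting the variable. The `sh -c` child process is only a stand-in for grep or any other command, and the variable names are made up for illustration:

```shell
# Remember what LC_ALL looks like in the current shell ("unset" if absent).
before=${LC_ALL-unset}

# Prefix form: the assignment applies only to this one child process.
scoped=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')
after=${LC_ALL-unset}

# Export form, confined to a subshell here so the demo leaves no trace:
# every command started after the export inherits the setting.
exported=$(export LC_ALL=C; sh -c 'printf "%s" "$LC_ALL"')

echo "child saw:       $scoped"
echo "shell before:    $before"
echo "shell after:     $after"
echo "after an export: $exported"
```

The prefix form is the one used throughout this article, since it leaves the rest of your session in its normal locale.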

UTF8 vs ASCII

This all might not make a whole lot of sense yet, but hang with me.

Basically, when you grep something, by default your locale is going to be internationalised and set to UTF-8.

UTF-8 can represent every character in the Unicode character set to help display any of the world’s writing systems, currently more than 110,000 unique characters.

So what’s the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set is comprised of a whopping 128 unique characters.

Servers and computers these days can quickly process data thrown at them, but the more efficiently we hand them data, the faster they will be able to accomplish the task, and with fewer resources.
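If you are not sure whether a file really is plain ASCII, one rough way to check is to count the bytes that fall outside the 7-bit range. This is only a sketch; `count_non_ascii` is a hypothetical helper name, not something from the original tests:

```shell
# Print how many bytes in a file are outside the 7-bit ASCII range.
# tr deletes every byte from octal 000 to 177, so only the non-ASCII bytes
# are left for wc to count; the final tr strips any padding wc adds.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 means the file is pure ASCII, so searching it in the C locale should match exactly what a UTF-8 search would. Anything above 0 means multi-byte characters are present, and patterns that rely on character counts or classes may behave differently.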

Using strace to see what’s going on

I won’t get too technical on it in this article, but strace is a utility to keep tabs on what a process is up to.

Below I’m displaying a file with 1 line with the cat command. The strace output is stored in a file called TRACE.

Then I call egrep to only show mentions of open and read operations:

    root@server [~] strace -o TRACE cat TEST_FILE
    This is a test

    root@server [~] egrep "open|read" TRACE
    open("/etc/ld.so.cache", O_RDONLY)      = 3
    open("/lib64/libc.so.6", O_RDONLY)      = 3
    read(3, "177ELF2113>13003321"..., 832) = 832
    open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
    open("/usr/share/locale/locale.alias", O_RDONLY) = 3
    read(3, "# Locale name alias data base.\n#"..., 4096) = 2528
    read(3, "", 4096)                       = 0
    open("/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION", O_RDONLY) = 3
    open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_MEASUREMENT", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_ADDRESS", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_NAME", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_PAPER", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_MONETARY", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY) = 3
    open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY) = 3
    open("TEST_FILE", O_RDONLY)             = 3
    read(3, "This is a test\n", 4096)       = 15
    read(3, "", 4096)

Now here is the same thing with our little LC_ALL=C trick:

    root@server [~] LC_ALL=C strace -o TRACE cat TEST_FILE
    This is a test

    root@server [~] egrep "open|read" TRACE
    open("/etc/ld.so.cache", O_RDONLY)      = 3
    open("/lib64/libc.so.6", O_RDONLY)      = 3
    read(3, "177ELF2113>13003321"..., 832) = 832
    open("TEST_FILE", O_RDONLY)             = 3
    read(3, "This is a test\n", 4096)       = 15
    read(3, "", 4096)

That was 19 opens and 5 reads for my first test, and 3 opens and 3 reads for the LC_ALL=C test.

You can see that in the default test we had to open multiple files in the /usr/lib/locale/en_US.utf8 directory.
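Rather than counting those lines by eye, a small helper can tally them from the trace file. This is a sketch under a couple of assumptions: `count_calls` is a made-up name, the trace was written by `strace -o` without `-f` (so each line starts with the syscall name), and `openat` is matched as well because newer systems tend to log that instead of `open`:

```shell
# Count open-style and read calls recorded in an strace output file.
count_calls() {
    trace_file=$1
    # grep -c still prints 0 when nothing matches but exits non-zero,
    # so "|| true" keeps the function usable under "set -e".
    opens=$(grep -cE '^(open|openat)\(' "$trace_file" || true)
    reads=$(grep -cE '^read\(' "$trace_file" || true)
    echo "opens=$opens reads=$reads"
}
```

Running `count_calls TRACE` after each of the two strace commands above gives a quick side-by-side comparison.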

The largest of these locale files are LC_COLLATE and LC_CTYPE.

I threw in a plain en_US locale for comparison’s sake, which is much smaller than the utf8 version:

    root@server [~] ls -lahSr /usr/lib/locale/en_US/LC_C* /usr/lib/locale/en_US.utf8/LC_C*
    -rw-r--r--  43 root root  19K May 30 17:10 /usr/lib/locale/en_US/LC_COLLATE
    -rw-r--r--  73 root root 203K May 30 17:10 /usr/lib/locale/en_US/LC_CTYPE
    -rw-r--r-- 152 root root 233K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_CTYPE
    -rw-r--r--  98 root root 860K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_COLLATE

Sorting things out

If you’re anything like me, I’m sure at some point in your life you’ve had to recite your ABCs to figure out the alphabetical sorting of something. Imagine having thousands and thousands of letters to keep track of and having to keep starting over. Doesn’t sound too efficient, does it?

Now, I bring up an alphabetical sorting example because it’s important to note that you don’t want to go using LC_ALL=C for everything. I won’t go in depth here, but just know that the sort command will give you a different ordering depending on the locale.

    root@server [~] cat TEST_FILE
    C
    B
    A
    c
    a
    b

    root@server [~] sort TEST_FILE
    a
    A
    b
    B
    c
    C

    root@server [~] LC_ALL=C sort TEST_FILE
    A
    B
    C
    a
    b
    c
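One way to keep the speed-up confined to grep, so that commands like sort keep their normal locale behaviour, is a small wrapper function. This is just a sketch, and `cgrep` is a hypothetical name:

```shell
# Run grep in the C locale without changing the locale of the session.
# The LC_ALL=C prefix applies only to this single grep invocation.
cgrep() {
    LC_ALL=C grep "$@"
}
```

In a pipeline such as `cgrep wp-login.php WP_LOGINS | sort`, only the grep stage runs in the C locale, while sort still orders its input according to your session’s locale.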

Proof is in the pudding

Once I understood the basic principles of what was supposed to be happening, I was excited to start testing right away to see just how much of a boost in search speed I could get.

Avoiding filesystem caching

Now, due to the way Linux caches things from the disk into memory, you might have noticed that the first time you grep a file it can take 5-10 seconds, but if you run the exact same search a bit later, it’s almost instant.

That’s because the filesystem caches the file into memory which is way faster than your hard drive.

I knew this going into my tests, so I knew I couldn’t just run a timed grep against my file and then do it again a few seconds later without severely skewed results due to the system caching.

So what I did was first build up a 5MB or so test file, by running the following command:

 grep wp-login.php /usr/local/apache/domlogs/ -R > WP_LOGINS 

So now my WP_LOGINS file had about 21,000 lines of attempted wp-login.php requests.

I wanted a ton more to really see the impact on large files, so I proceeded to duplicate the contents of my WP_LOGINS file 100 times into a new file called WP_LOGINS2 with this command:

 for i in {1..100}; do cat WP_LOGINS >> WP_LOGINS2; done 

Now I’ve got a 504MB file with 2,100,000 lines, and that should provide a great testbed, at least for one of the tests. So I also duplicated this file multiple times to again avoid filesystem caching in-between tests.
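The duplication step can be scripted along these lines. This is only a sketch of one way to do it: `make_copies` is a made-up helper, and the numbered names simply follow the WP_LOGINS_001 style used in the tests further down:

```shell
# Make numbered copies of a source file: PREFIX_001, PREFIX_002, ...
make_copies() {
    src=$1
    prefix=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # %03d zero-pads the counter to three digits.
        cp "$src" "$(printf '%s_%03d' "$prefix" "$i")"
        i=$((i + 1))
    done
}
```

For example, `make_copies WP_LOGINS2 WP_LOGINS 6` would produce WP_LOGINS_001 through WP_LOGINS_006, one fresh file per timed run.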

Testing LC_ALL=C grep and fgrep performance

I ran 2 tests with the default grep command, looking for hits of wp-login.php and providing a count.

I also ran 2 with LC_ALL=C set first, and 2 using both LC_ALL=C and fgrep, which matches only fixed strings and is even more efficient for simple searches like this one.
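As a side note, fgrep is equivalent to `grep -F`, and recent GNU grep releases print an obsolescence warning when the fgrep name is used, so `grep -F` is the safer spelling in new scripts. The sketch below, which uses a throwaway sample file invented for illustration, also shows that the two forms are not quite the same search, because in a regular expression each `.` matches any character:

```shell
# Two sample log lines: one real hit and one near miss.
sample=$(mktemp)
printf '%s\n' 'GET /wp-login.php HTTP/1.1' 'GET /wp-loginXphp HTTP/1.1' > "$sample"

# As a regex, the "." matches any single character, so both lines count.
regex_hits=$(LC_ALL=C grep -c 'wp-login.php' "$sample")

# With -F the pattern is a literal string, so only the first line counts.
fixed_hits=$(LC_ALL=C grep -F -c 'wp-login.php' "$sample")

echo "regex: $regex_hits  fixed: $fixed_hits"
rm -f "$sample"
```

For log data like the file used here the difference rarely matters, but it is worth knowing that the fixed-string search is stricter as well as faster.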

Here is the series of tests I ran:

    time grep wp-login.php WP_LOGINS_001 -c
    time grep wp-login.php WP_LOGINS_002 -c
    time LC_ALL=C grep wp-login.php WP_LOGINS_003 -c
    time LC_ALL=C grep wp-login.php WP_LOGINS_004 -c
    time LC_ALL=C fgrep wp-login.php WP_LOGINS_005 -c
    time LC_ALL=C fgrep wp-login.php WP_LOGINS_006 -c

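The time command breaks the results up into 3 values. Funnily enough, the runs prefixed with LC_ALL=C printed their timings in a different format (the one-line layout used by the external GNU time program rather than the shell’s built-in time), which is why the results look different.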

Here are the meanings behind these values:

real – How much wall clock time the test took

user – CPU seconds consumed in user space

sys – CPU seconds consumed in system space

Now here are the results from the tests:

    real    0m9.545s
    user    0m9.416s
    sys     0m0.126s

    real    0m9.445s
    user    0m9.316s
    sys     0m0.130s

    1.37user 0.13system 0:01.50elapsed
    1.37user 0.11system 0:01.48elapsed

    0.54user 0.12system 0:00.67elapsed
    0.54user 0.12system 0:00.66elapsed

Here it is in a table (all times in seconds):

                real    user    sys
    grep 1      9.56    9.42    0.13
    grep 2      9.45    9.32    0.13
    LC_ALL 1    1.50    1.37    0.13
    LC_ALL 2    1.48    1.37    0.11
    fgrep 1     0.67    0.54    0.12
    fgrep 2     0.66    0.54    0.12

Conclusion

So there you have it, standard grep took 9 1/2 seconds.

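Using the LC_ALL=C locale made the search run at about 640% of the original speed (roughly 6.4 times as fast) and brought that time down to 1 1/2 seconds.

Using fgrep made the search run at about 1427% of the original speed (roughly 14 times as fast) and brought that time down to just over a 1/2 second.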
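If you want to work out the same kind of ratio for your own timings, the arithmetic is just the baseline time divided by the improved time. Here is a minimal sketch; `speedup` is a hypothetical helper, and the figures passed in are wall-clock values from the results above. The exact ratio shifts slightly depending on which of the two runs you compare:

```shell
# Speedup factor = baseline seconds / improved seconds, to two decimals.
# awk itself is run in the C locale so "." is always the decimal separator.
speedup() {
    LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 9.545 1.50   # default grep vs LC_ALL=C grep
speedup 9.545 0.67   # default grep vs LC_ALL=C fgrep
```

Multiply the factor by 100 to express it as a percentage of the original speed.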

If you skipped down here and were wondering why things got faster, check out my locale research above.

Needless to say, I’ll be using this tactic in a ton of scripts and when doing manual grep searches going forward. Hopefully this information will help speed along your own searches as well.

InMotion Hosting Contributor

10 thoughts on “Speed up grep searches with LC_ALL=C”

  1. I imagine the reason it did not work for many people is that their default language was already C (mine is). If you are unsure of what your default locale is, you should set LANG to en_US.UTF-8 or whatever before running the tests.

  2. Does not have any effect on Ubuntu 14.04 and 16.04. It did work for me for sure the last time I tried it, back in 2008.

    Zaar

  3. Thank you for sharing this great finding! This trick made my process go from 1.20 hours to a matter of seconds!
    Thanks again for making my life better =)

    1. Hello MadMan,

      This tutorial was made on either a CentOS 5.6 or 6.0 server. This should work. Are you getting any errors?

      Best Regards,
      TJ Edens

  4. Hey great article Jacob,

    This does affect more than meets the eye.  For instance, download a file with UTF-8 characters in it, like many web pages, and then use strace to see how grep’s regex is affected:


    $ export LANG=C LC_ALL=C;
    $ strace -f -q -e trace=write -o TRACE.CC 2>&1 grep -o '.\{1\}' t.html

    When in ASCII mode grep will incorrectly count utf-8 characters.

    write(1, "\342\n", 2)
    write(1, "\200\n", 2)
    write(1, "\235\n", 2)

    vs

    $ export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8;
    $ strace -f -q -e trace=write -o TRACE.8 2>&1 grep -o '.\{1\}' t.html

    When in UTF-8 mode grep will correctly count a utf-8 character as 1 character.

    write(1, "\342\200\235\n", 4)
  5. You could also eliminate the Linux buffer cache from skewing your testing by dropping the caches before each test.

        echo 3 > /proc/sys/vm/drop_caches

     

     

    1. Hello Noah, and thanks for the comment!

      You are correct! That is another great way to make sure the pagecache isn’t skewing results. However, be careful, because your system could seem a bit sluggish as it rebuilds the pagecache after totally clearing it out.

      Thanks again!

      – Jacob
