After some late-night Googling, I ran across a method, proposed by dogbane over on Stack Overflow, for significantly speeding up a grep search.
I dug deeper with some research and even set up a little test to try things out and understand what’s going on.
As someone who’s used grep for nearly a decade, I’m a bit embarrassed to say I’d never heard of this.
If you’d rather skip my extensive research and are just curious about the actual test results, I won’t get offended, much.
Locale and internationalisation variables
In a shell execution environment, you alter the environment behaviour with variables.
There is a special subset of internationalisation variables that control how internationalised applications behave, and grep is one of these applications.
You can easily view your server’s current locale setting by running:
root@server [~] locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
LC_ALL variable
One variable you can adjust is called LC_ALL. This sets all LC_ type variables at once to a specified locale.
If we simply prepend LC_ALL=C to our command, we change the locale used by that command only.
The C locale (also known as the POSIX locale) is the minimal default locale, in which the server falls back to its base Unix/Linux character set of plain ASCII.
root@server [~] LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
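If you want to convince yourself that the prefix only affects the one command it sits in front of, here is a minimal sketch. The sample file and its contents are made up purely for illustration; swap in whatever file and pattern you actually care about.

```shell
#!/bin/sh
# Hypothetical sample file; the name and contents are illustrative only.
sample=$(mktemp)
printf 'alpha\nbeta\nalphabet\n' > "$sample"

# The LC_ALL=C assignment applies only to this one grep process.
matches=$(LC_ALL=C grep -c 'alpha' "$sample")
echo "matching lines: $matches"

# A child process started with the prefix sees LC_ALL=C ...
child_value=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')
echo "LC_ALL inside the prefixed command: $child_value"

# ... while the calling shell keeps whatever it had before (possibly unset).
echo "LC_ALL in the calling shell: ${LC_ALL:-<unset or empty>}"

rm -f "$sample"
```

Because nothing is exported, there is nothing to undo afterwards, which makes the prefix form a low-risk way to try the C locale on a single grep.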
UTF-8 vs ASCII
This all might not make a whole lot of sense yet, but hang with me.
Basically, when you grep something, your locale is by default going to be internationalised and set to UTF-8.
UTF-8 can represent every character in the Unicode character set, which covers the world’s writing systems and currently holds more than 110,000 unique characters.
So what’s the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set consists of a whopping 128 unique characters.
Servers and computers these days can quickly process the data thrown at them, but the more efficiently we hand them that data, the faster they can accomplish the task, and with fewer resources.
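One way to picture the extra work: UTF-8 is a variable-width encoding, so a single character can take anywhere from one to four bytes, whereas every ASCII character is exactly one byte. The rough sketch below counts the bytes behind three sample characters of my own choosing. It relies on wc -c counting bytes regardless of the active locale, and the octal escapes are simply the UTF-8 byte sequences for those characters.

```shell
#!/bin/sh
# Count the bytes UTF-8 needs for three sample characters.
# wc -c counts bytes, not characters, whatever the locale is.
# tr strips the padding some wc implementations add.
ascii_bytes=$(printf 'a' | wc -c | tr -d ' ')
accent_bytes=$(printf '\303\251' | wc -c | tr -d ' ')     # e-acute, U+00E9
euro_bytes=$(printf '\342\202\254' | wc -c | tr -d ' ')   # euro sign, U+20AC

echo "plain a:   $ascii_bytes byte(s)"
echo "e-acute:   $accent_bytes byte(s)"
echo "euro sign: $euro_bytes byte(s)"
```

A tool working in a UTF-8 locale has to be prepared to group bytes into characters like this, while in the C locale it can treat each byte on its own.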
Using strace to see what’s going on
I won’t get too technical on it in this article, but strace is a utility to keep tabs on what a process is up to.
Below, I’m displaying a one-line file with the cat command while strace records what cat does. The strace output is stored in a file called TRACE.
Then I call egrep to show only the open and read operations:
root@server [~] strace -o TRACE cat TEST_FILE
This is a test
root@server [~] egrep "open|read" TRACE
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "177ELF211 3 >