Search Your Files With Grep and Regex

To understand the grep regex combination, ask yourself a question first.  How do you search through a file?  On the surface, this might seem like sort of a silly question.

But somewhere between the common-sense answer for many (“double click it and start reading!”) and the heavily technical (“command line text grep regex”) lies an interesting set of questions.

  • Where does this file reside?
  • What kind of file is it?
  • How big is the file?
  • What, exactly, are you looking for in the file?

Today, we’re going to look at one of the most versatile ways to search a file: using grep and regex (short for regular expression).

What You Will Learn

In this article, you will learn why grep and regex together are so powerful.

  • We will show a common problem we have when manually searching.
  • Next, we are going to discuss what both grep and regex are.
  • Then, we will dive into some examples: we will show examples of grep and regex separately, then we will show how you can use them both together for powerful searching.
  • And finally, we will wrap up with some advice on judicious use of these tools.

Learning Grep and Regex Teaches You a Powerful Search Technique

Using this combination of tools, you can search files of any sort and size.  You can also search with extremely limited access to your environment, and if you get creative, you can find just about anything.

Imagine: you fret about organizing your files, because you can never seem to find where you put the last file. With grep, that concern goes away.

Or perhaps you are an avid note-taker, but you are too busy jotting them down to worry about organization. Again, grep and regex can rescue you here.

But with that versatility comes a bit of a learning curve.

So let’s look at how to take the edge off of that and get you familiar with this file search technique.  To do that, I’ll walk through a hypothetical example of trying to extract some information.

When you’re done reading, you’ll understand the basics enough to search your files with grep-regex.


What is Grep? A High Level Understanding

Grep (actually, “grep” — you don’t capitalize it) is a command line utility originating in the Unix world.  It’s since made its way onto Linux machines and even into the Windows world.

What does it do?  Simple enough. Grep helps you search through files, looking for patterns.

Here’s a template of what that looks like.

 grep [-options] pattern [filename]

So basically, at a command line prompt, you would type “grep ford cars.txt” if you wanted to search for the text “ford” in the file “cars.txt.”  The grep utility would print any matching lines right there in the console for you to review.

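The -options part is just that: it lets you supply some options.  For instance, you can tell grep to ignore the case of the characters (-i) or to show the line number of each match (-n).  If you want the results in a new file, you redirect grep’s output there from the shell.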
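To make that template concrete, here is a minimal sketch you can paste into a terminal. The contents of cars.txt are made up for illustration, and the file lives in a temporary directory so nothing of yours gets touched.

```shell
# Build a small, made-up cars.txt in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'ford mustang' 'Ford F-150' 'honda civic' 'toyota corolla' > "$workdir/cars.txt"

# Case-sensitive search: only the lowercase "ford" line matches.
exact=$(grep ford "$workdir/cars.txt")

# -i ignores case, so "Ford F-150" matches as well.
any_case=$(grep -i ford "$workdir/cars.txt")

# Shell redirection (not a grep option) saves the matches to a new file.
grep -i ford "$workdir/cars.txt" > "$workdir/fords.txt"
saved_lines=$(wc -l < "$workdir/fords.txt" | tr -d ' ')

printf 'exact:\n%s\n\nany case:\n%s\n\nsaved lines: %s\n' "$exact" "$any_case" "$saved_lines"
rm -rf "$workdir"
```

The same three pieces (options, pattern, filename) show up in every grep command in the rest of this post.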

And that’s really all there is to grep.  Its beauty lies in its simplicity and the power it gives you to do things.

What Does Regex Mean?

Speaking of power, let’s talk about regex.

The term is actually, as I mentioned earlier, “regular expressions,” but it’s such a ubiquitous term in the programmer world that it’s earned a nickname.  I don’t think regex has quite made the English dictionary yet, but programmers know what you mean by this.

You’ll find that programmers have a love-hate relationship with regex — as in, some programmers love them and others hate them.

People love regex for the power they confer on their users. Others hate them for their incomprehensibility and the confusion they create.

Alright, So What Is It, Really?

Like grep, it’s simple enough.  Regular expressions are sequences of characters that represent patterns, and they instruct regex parsers on ways to search text and match patterns.

Think of a much simpler version of this concept: the wildcard.  The wildcard lets you enter a search like, say, “d*g” in the dictionary and receive results that include “dog,” “dig,” and “dug.”

d*g => {dig, dog, dug}

Regex takes this to a whole new level.

You can do simple matches and wildcard searches with them.  For instance, the expression “d.*g” says the same thing as my wildcard example: match words that start with d and end with g and have stuff between them.
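If you want to try that yourself, here is a small sketch. The word list is invented, and I use grep’s -x option so the pattern has to match the whole line rather than any part of it.

```shell
# A tiny, made-up word list, one word per line.
words=$(printf '%s\n' dig dog dug drag cat dogs)

# "d.g" allows exactly one character between d and g.
one_between=$(printf '%s\n' "$words" | grep -x 'd.g' | paste -s -d ' ' -)

# "d.*g" allows any number of characters (including none) in between.
any_between=$(printf '%s\n' "$words" | grep -x 'd.*g' | paste -s -d ' ' -)

printf 'd.g  => %s\n' "$one_between"
printf 'd.*g => %s\n' "$any_between"
```

Note that “dogs” is left out both times: with -x, the line has to end at the g.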

But you can get more complicated, too.  A lot more complicated.

^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$

Want to hazard a guess what that does?  It matches a date in yyyy-mm-dd format from 1900 through 2099.  I mean, obviously, right?

It’s this complexity that drives the love-hate in the programming world.

Expressing validation logic for a date in just 64 characters is powerful.  But good luck understanding them without significant study and memorization.  They’re hard to read.
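One way to take some of the mystery out of an expression like that is to throw a few strings at it. The sketch below does so with grep -E (extended regex). Two assumptions to flag: the candidate strings are made up, and I spell the digit shorthand as [0-9] because not every grep understands \d.

```shell
# The date pattern from above, with each digit written as [0-9].
date_re='^(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'

# Made-up candidate strings, one per line.
candidates=$(printf '%s\n' '2020-06-20' '1999/12/31' '2100-01-01' '2020-13-01' '2020-02-30')

# Keep only the lines that match the whole pattern.
accepted=$(printf '%s\n' "$candidates" | grep -E "$date_re")
printf '%s\n' "$accepted"
```

The year 2100 and month 13 are rejected, but 2020-02-30 slips through. The expression checks the shape of a date, not the calendar.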

Grep: a Simple, Practical Example

By now, you’ve probably put together the grep-regex equation in your mind with its value proposition.

  • Grep lets you search files or strings from the command line,
  • And regex lets you do some really formidable stuff.

But let’s walk before we run and take a look at an example using just grep.

A lot of hosted solutions using Apache feature something called the Webalizer.  It offers a very specific sort of log aggregation, but that’s not of interest for this example.

Instead, we’re interested in its configuration file, webalizer.conf.  For this example, I want to log in to my hosted web server and figure some things out about my configuration.

A lot of people default to popping open a file in a text editor to take a look around and search.  But recall the restrictions I mentioned at the beginning of the post.

  • The file resides on a server where I have limited SSH access and can’t use a graphical text editor.
  • I don’t know exactly what’s in it, and it could be really big, for all I know.
  • I might need to adjust my search as I go.

So because of these restrictions, I go with grep.

I know that I can use it from the command line to search the file.  Now, let’s say that a Google search for some problem told me that I needed to check on a series of settings that start with “All” and I don’t really know where in the file they are.

Grep to the rescue.

grep All webalizer.conf

That’s pretty good, but I don’t really care about those comments or “HideAllSites,” so I revise it just a touch.

grep "#All" webalizer.conf

(The quotes are there because I want to search for the special character “#” as well.)

Pretty handy!  Now I’m seeing only the lines immediately of interest to me.
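If you don’t have a webalizer.conf handy, you can follow along with a stand-in. The file below is a made-up excerpt (the setting names are real Webalizer keywords, but the values and the comment line are invented for this sketch).

```shell
# Create a hypothetical webalizer.conf excerpt in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/webalizer.conf" <<'EOF'
# The All* keywords control whether full listings are produced
#AllSites       no
#AllURLs        no
#AllReferrers   yes
#AllSearchStr   yes
HideAllSites    no
EOF

# The broad search also hits the comment line and HideAllSites.
broad_count=$(grep -c All "$workdir/webalizer.conf")

# Adding the "#" keeps only the commented-out All* settings.
narrow=$(grep '#All' "$workdir/webalizer.conf")

printf 'lines matching All: %s\n' "$broad_count"
printf '%s\n' "$narrow"
rm -rf "$workdir"
```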

Grep Regex: a Simple, Practical Example

Alright, now we’re making progress.  I’ve got just the settings that I want.  And I can also see that my problem may stem from the fact that the settings are not enabled, by virtue of the “#” commenting them out.

But let’s say that for this hypothetical example, I wanted to drill in a little further.  And let’s also say that the file were a lot bigger with a lot more matches, so simply opening it and looking for these lines weren’t feasible.

What if I wanted to narrow this down to just the “yes” entries?

Huh.

I’ve pretty much reached the limit of what I can do with simple search.  Not only is there indeterminate text between the #All and the yes/no, but there’s also a variable number of spaces.  Let’s further say that I’m only interested in the existence of a “yes,” not a “no” or a hypothetical blank.

How would I do that?

Well, I’d get ready to start using regex.  (And I’d also test them with this tool because regex is hard.)  Let’s see what happens with this one.

grep "#All.*yes" webalizer.conf

Success!

Now we’ve narrowed the results to exactly the items we care about: the settings that are set to “yes” but are commented out.

As you can see, this is extremely powerful. Even with access only to the command line, and without ever opening a file, you can perform remarkably sophisticated searches to zero in on issues.
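Here is a sketch of that search against a made-up excerpt of the file (values and the trailing comment are invented). It also shows one thing to watch for: “.*” matches anything, so a “yes” sitting in a trailing comment counts too. The second, stricter pattern is my own variation, not something the search above requires.

```shell
# Hypothetical webalizer.conf excerpt; one line has "yes" only in a comment.
workdir=$(mktemp -d)
cat > "$workdir/webalizer.conf" <<'EOF'
#AllSites       no     # change to yes for a full list
#AllURLs        no
#AllReferrers   yes
#AllSearchStr   yes
HideAllSites    no
EOF

# Loose: "#All", then anything, then "yes" somewhere later on the line.
loose=$(grep '#All.*yes' "$workdir/webalizer.conf")

# Stricter: setting name, one or more spaces, and "yes" at the end of the line.
strict=$(grep '^#All[A-Za-z]*[[:space:]][[:space:]]*yes$' "$workdir/webalizer.conf")

loose_count=$(printf '%s\n' "$loose" | wc -l | tr -d ' ')
printf 'loose pattern matched %s lines; strict pattern matched:\n%s\n' "$loose_count" "$strict"
rm -rf "$workdir"
```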

I can’t think of a better way to understand grep and regex than looking at examples. But before diving into more examples, take a quick look at what regex characters mean:

  • . (dot) - Match any character
  • ^ - Match the beginning of the line
  • $ - Match the end of the line
  • * - Match zero or more occurrences of the previous character
  • + - Match one or more occurrences of the previous character
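To see each of those characters in action, here is a quick sketch over a made-up word list. I pass -E (extended regex) throughout because plain grep treats “+” as an ordinary character, and -x where the pattern should cover the whole line. The match helper is just a hypothetical convenience for this example.

```shell
# Made-up word list, one word per line.
words=$(printf '%s\n' cat cot ct coot scat cats)

# Hypothetical helper: run grep -E over the list and join matches on one line.
match() { printf '%s\n' "$words" | grep -E "$@" | paste -s -d ' ' -; }

dot=$(match -x 'c.t')     # any single character between c and t
star=$(match -x 'co*t')   # zero or more o's
plus=$(match -x 'co+t')   # one or more o's
starts=$(match '^cat')    # lines beginning with "cat"
ends=$(match 'cat$')      # lines ending with "cat"

printf 'c.t  => %s\nco*t => %s\nco+t => %s\n^cat => %s\ncat$ => %s\n' \
  "$dot" "$star" "$plus" "$starts" "$ends"
```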

I’ve just scratched the surface. You can find more on regex characters here. Delving any more into grep (and especially into regex) would carry us well beyond the scope of this post.

Some Additional Grep/Regex Examples

Before closing, I’ll leave you with some additional, quick examples to help you wrap your head around the idea further.

  • grep "^#All" webalizer.conf searches for any line with “#All” at the beginning of the line
  • grep "yes$" webalizer.conf searches for any line that ends with “yes”
  • grep "^$" webalizer.conf finds any blank lines in webalizer.conf
  • grep -E "[0-9]{3}-[0-9]{4}" contacts.txt searches contacts.txt for phone numbers (the -E option turns on extended regex, which the {3} repetition syntax needs)
  • grep -E "home|mobile" contacts.txt searches for any line that has either the word “home” or the word “mobile” (the | alternation also needs -E)

Hopefully, these examples give you an idea of just how versatile these types of searches are.  This isn’t intended to be comprehensive, but rather to help you come away with an idea of all you might do.
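One detail worth trying for yourself: by default grep uses “basic” regex, where { } and | are treated as ordinary characters, so the phone number and home/mobile searches need the -E (extended regex) option. The sketch below shows the difference on a made-up contact list.

```shell
# Made-up contact lines.
contacts=$(printf '%s\n' 'home: 555-0142' 'mobile: 555-0199' 'fax: none' 'work: 555-0100')

# Basic regex: "{3}" is read literally, so nothing matches.
# grep -c still prints 0; "|| true" just ignores its "no match" exit status.
basic_count=$(printf '%s\n' "$contacts" | grep -c '[0-9]{3}-[0-9]{4}' || true)

# Extended regex: {3} and {4} are repetition counts.
phones=$(printf '%s\n' "$contacts" | grep -E '[0-9]{3}-[0-9]{4}')
phone_count=$(printf '%s\n' "$phones" | wc -l | tr -d ' ')

# Extended regex: | means "either side".
home_or_mobile=$(printf '%s\n' "$contacts" | grep -E 'home|mobile')

printf 'basic: %s matches, extended: %s matches\n' "$basic_count" "$phone_count"
printf '%s\n' "$home_or_mobile"
```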

Not Quite Enough? Here Are a Couple Recipes

Even after all our examples you may still find it difficult to see where grep and regex can help boost your productivity.

Finding Logs By Timestamp

It can be very useful to troubleshoot a software problem by seeing the logs for a certain day, or even a certain hour of that day. This recipe lets you do that:

grep '2020-06-20T09:[0-5][0-9]' server.log

It is simple but powerful. A matching line might look like this:

2020-06-20T09:30 INFO started server successfully.

The “[0-5][0-9]” will look for events between 9:00 and 9:59.
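Here is the recipe run against a made-up server.log, as a rough sketch. The log lines are invented; the second search is a variation of mine that widens the window from one hour to the whole day.

```shell
# Create a hypothetical server.log in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/server.log" <<'EOF'
2020-06-20T08:59 INFO warming up caches.
2020-06-20T09:30 INFO started server successfully.
2020-06-20T09:47 WARN slow response from database.
2020-06-20T10:02 INFO health check passed.
2020-06-21T09:15 INFO started server successfully.
EOF

# Events from 9:00 through 9:59 on June 20.
nine_oclock=$(grep '2020-06-20T09:[0-5][0-9]' "$workdir/server.log")

# Count every event on June 20, whatever the hour.
whole_day_count=$(grep -c '^2020-06-20T' "$workdir/server.log")

printf '%s\n' "$nine_oclock"
printf 'events on 2020-06-20: %s\n' "$whole_day_count"
rm -rf "$workdir"
```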

Finding Those Notes You Took Months Ago

As I mentioned earlier, perhaps you take notes but scatter them across your “documents” folder. Instead of tediously opening every file, you can point grep at the entire directory to find specific notes. You can do it like this:

grep -rnwi 'market launch' documents/

The “-r” means recursive, to search subdirectories. “n” means show the line number. “w” means search on whole words, which you can also do through regex. “i” means case-insensitive. You can also do case-insensitive search with regex directly. Results may look like this:

documents//demo.txt:6:Here are my notes about the Market Launch
documents//demo.txt:7:This market launch is super-intense
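To see what each flag contributes, here is a sketch that builds a throwaway documents folder (file names and contents are made up) and runs the search with and without “w”. I sort the output only because grep does not promise any particular order when it walks a directory.

```shell
# Build a hypothetical documents/ tree in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/documents/2020"
printf '%s\n' 'Shopping list' 'Here are my notes about the Market Launch' \
  > "$workdir/documents/demo.txt"
printf '%s\n' 'Q3 planning' 'Supermarket launching soon' 'market launch budget' \
  > "$workdir/documents/2020/plans.txt"

# Recursive, line-numbered, whole-word, case-insensitive search.
matches=$(cd "$workdir" && grep -rnwi 'market launch' documents | LC_ALL=C sort)

# Without -w, "Supermarket launching" counts as a match too.
without_w_count=$(cd "$workdir" && grep -rni 'market launch' documents | wc -l | tr -d ' ')

printf '%s\n' "$matches"
printf 'matches without -w: %s\n' "$without_w_count"
rm -rf "$workdir"
```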

The answer to “Where can I use grep and regex?” is, “Wherever there are strings.” To understand the wide scope of where grep and regex can be used, let’s look at some of the most commonly used regex patterns.

Common Regex Patterns

Validating Email Address

You don’t need junk email addresses stored in your database. If you have an application that asks for an email, you’d want to validate the email before accepting it. So the regex for a valid email looks something like this:

/^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$/
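As a rough sketch of how you might try this from the command line, the function below feeds a string to grep -E and reports whether it matches. The name is_valid_email and the sample addresses are made up for illustration. This version escapes the dot before the top-level domain as “\.” so it only matches a literal period, and it drops the outer “( )*” wrapper so an empty string does not pass.

```shell
# Email pattern for grep -E (extended regex), anchored to the whole string.
email_re='^[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$'

# Hypothetical helper: succeeds (exit status 0) when $1 looks like an email.
is_valid_email() {
  printf '%s\n' "$1" | grep -Eq "$email_re"
}

for candidate in 'jane@example.com' 'a.b-c@mail.example.org' 'jane@example' 'not an email'; do
  if is_valid_email "$candidate"; then
    printf 'valid:   %s\n' "$candidate"
  else
    printf 'invalid: %s\n' "$candidate"
  fi
done
```

A pattern like this only checks the shape of the address. It can’t tell you whether the mailbox actually exists.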

Check Password Strength

To improve the security of your organization, you might want to strengthen the passwords being used. You know what a strong password looks like, so you can use a regex to check the strength instead of manually verifying all the passwords. The minimum criteria here for a strong password are a length of at least 8 characters and at least one uppercase letter, one lowercase letter, one digit, and one special character. The regex for this is as follows:

/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$@$!%*?&])[A-Za-z\d$@$!%*?&]{8,}/
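The “(?=...)” pieces in that expression are lookaheads, which grep’s basic and extended modes don’t support. So rather than run the expression as-is, the sketch below swaps in a different technique: it chains several simple grep checks, one per rule. The function name and sample passwords are made up, and I use POSIX classes such as [[:lower:]] so the checks don’t depend on locale settings.

```shell
# Hypothetical helper: succeeds only if $1 passes every rule.
is_strong_password() {
  candidate=$1
  printf '%s\n' "$candidate" | grep -q '[[:lower:]]' &&          # a lowercase letter
  printf '%s\n' "$candidate" | grep -q '[[:upper:]]' &&          # an uppercase letter
  printf '%s\n' "$candidate" | grep -q '[[:digit:]]' &&          # a digit
  printf '%s\n' "$candidate" | grep -q '[$@!%*?&]' &&            # a special character
  printf '%s\n' "$candidate" | grep -Eq '^[A-Za-z0-9$@!%*?&]{8,}$'  # 8+ allowed characters
}

for candidate in 'Sup3r$ecret' 'password' 'Sh0rt!' 'NoSpecial123'; do
  if is_strong_password "$candidate"; then
    printf 'strong: %s\n' "$candidate"
  else
    printf 'weak:   %s\n' "$candidate"
  fi
done
```

Chaining checks like this is longer than the one-line regex, but each rule is readable on its own, which is a fair trade when the expression starts to look like line noise.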

Match URL

Want to check whether a URL provided by an applicant or client is valid? Here’s a regex that can match a valid URL:

/^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)*\/?$/

Match IP Address

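Want to find the IPv4 addresses in a lengthy document? This regex matches a string that consists of exactly one valid address (drop the ^ and $ anchors to pick addresses out of surrounding text):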

/^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$/
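Here is a sketch of both uses with grep -E. It assumes your grep supports the -o option (GNU and BSD grep do), which prints only the matched text instead of the whole line. The sample text is made up, and the dots between octets are escaped as “\.” so they only match a literal period.

```shell
# One octet (0-255), then the full dotted-quad pattern without anchors.
octet='([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])'
ipv4_re="($octet\\.){3}$octet"

# Made-up text with addresses mixed into ordinary sentences.
text='Gateway 192.168.1.1 answered; 10.0.0.254 timed out.
Version 1.2.3 is not an address.
Backup DNS: 8.8.8.8'

# -o prints each match on its own line.
found=$(printf '%s\n' "$text" | grep -oE "$ipv4_re")
printf '%s\n' "$found"

# -x requires the whole line to match, which acts like the ^ and $ anchors.
if printf '%s\n' '256.1.1.1' | grep -Eqx "$ipv4_re"; then
  verdict='valid'
else
  verdict='invalid'
fi
printf '256.1.1.1 is %s\n' "$verdict"
```

Without anchors or word boundaries, an unanchored search can also pick a valid-looking piece out of a longer run of digits and dots, so treat the extracted list as candidates rather than gospel.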

Match HTML Tags

Web scraping is a common thing these days. But when you scrape a website, you might also get some HTML tags along with the text. You can find these tags with a regex so that you can separate them from the text:

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Grep and Regex: Know When to Say When

I’ll close with a bit of philosophical advice.  As you learn your way around these tools, you’ll find yourself able to do some truly cool stuff.  They’ll help you solve problems and be productive.

But don’t let yourself lose sight of the forest for the trees.

Grep and regex are powerful for automatic ingestion, parsing, and analysis of common types of files. One such common type of file is the log. Grep and regex simplify working with logs more than you might think. If you deal with logs on a regular basis, grep and regex are surely something you should get used to. But there’s a lot more to do with logs. If you’re looking for a complete log-management solution, Scalyr is your savior.

In a pinch (like my hypothetical one) where you need to log into a server, drill into some configuration file, and find stuff in it, this is great.  If you find yourself doing similar things on a routine basis, you might start to ask yourself why. And if this is your life — grepping and regexing your way through countless, massive files (e.g., log files), then you probably have better options at your disposal.