
Ask Ars: how do I use the find command in a pipeline?

In this command line tutorial, we will introduce the find command and explain …

In 1998, Ask Ars was an early feature of the newly launched Ars Technica. Now, as then, it's all about your questions and our community's answers. Each week, we'll dig into our question bag, provide our own take, then tap the wisdom of our readers. To submit your own question, see our helpful tips page.

Q: I know I can use the find command at the command line to locate files, but how do I use it with other commands to perform a real-world task? What's the difference between the -exec parameter and piping into xargs?

The find command is a standard utility on UNIX and Linux systems. It will recurse through directory structures and look for files that conform with the user's specified parameters. There are a number of different search operators that can be used together to achieve fine-grained file matching.

In this tutorial, I'll explain how to use the find command with several common search operators and then I'll show you some examples of how to use the find command in a pipeline.

The most basic use of the find command is simple filename matching. This is accomplished by using the -name parameter. The following example shows how the find command can be used to locate all of the README files in the /usr/share directory. The command's output will display one matching file result on each line.

$ find /usr/share -name README

The first argument is the path that I want the find command to search under. It's important to remember that the search will be performed recursively, which means that it will look in all of the subdirectories and descendants of the /usr/share directory, not just that individual path.

The find command allows you to specify as many different paths as you want. It will generally interpret every argument that appears before the search operators as an additional target path.
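
For example, I could search two directory trees in a single invocation like this (the second path is just an illustration and may not exist on your system):

$ find /usr/share /usr/local/share -name README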

The -name parameter is a search operator. It tells the find command that you want to locate all files and folders that have the desired name, which is 'README' in this case. It is case-sensitive and will only identify an exact match. For a case-insensitive version, you would use the -iname parameter instead.
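
If I wanted a case-insensitive version of the earlier README search, for instance, it would look like this, and would match README, readme, ReadMe, and any other capitalization:

$ find /usr/share -iname readme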

If you want to find partial matches, you can use the -name or -iname parameter with a simple glob expression. For example, this is how I find all of the files with a txt extension in my Journalism folder:

$ find ~/Journalism -name '*.txt'

The single quotes around the glob expression, which ensure that it gets passed into the find command as a literal string, are very important. If I left out the single quotes, the glob could potentially be expanded by the shell. That's not desirable in this case, because it would make the -name operator match against the expanded value rather than using the glob to perform the search.
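
If you want to see the difference for yourself, you can experiment with echo and a couple of throwaway files in an otherwise empty directory (the filenames here are just placeholders). The unquoted version prints the two filenames, because the shell expands the glob before echo ever runs, while the quoted version prints the literal *.txt pattern:

$ touch demo-one.txt demo-two.txt
$ echo *.txt
$ echo '*.txt'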

If you want the find command to do more sophisticated pattern matching, you might consider using the -regex search operator. It takes a regular expression instead of a glob and, unlike -name, it matches against the entire path rather than just the file name. That's an enormously powerful feature for complex searches.
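
For example, this is how I might pick out numbered draft files (the draft-naming scheme here is hypothetical, and the leading .*/ is needed because the pattern has to match the entire path):

$ find ~/Journalism -regex '.*/draft-[0-9][0-9]*\.txt'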

The -name and -iname search operators match solely against the last segment of the filesystem path, the actual name of a file or directory. If you want to perform glob matches against the entire path, you can use the -path or -ipath search operators instead. For example, imagine a complex folder hierarchy where you have a bunch of nested directories at various levels and you want to find all of the ".c" files at any level under any "src" directory. You could do something like this:

$ find ~/Programming -path '*/src/*.c'

It's important to understand that the way the -path operator compares a glob expression against full paths is different from how a glob would be expanded in the shell. It compares against the path as a plain string, so wildcards can match slashes like any other character. For example, it could match something like "test/project/code/src/modules/stuff.c" where the file is deep in the hierarchy.

There may be situations where you want to limit the depth of the find command's recursion or only match against results that are a certain number of layers deep in a directory hierarchy. To limit the depth, you would use the -maxdepth search operator. The -mindepth operator will let you filter for results at or beyond the specified depth.
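
For example, the first command below restricts the earlier README search to the first two levels under /usr/share, while the second skips anything shallower than three levels:

$ find /usr/share -maxdepth 2 -name README
$ find /usr/share -mindepth 3 -name README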

In addition to name matching, there are search operators for matching against file age, ownership, size, and other attributes. There are too many to discuss at length here, so I'll only show one more basic example before we move on to the really interesting stuff. You can refer to the manual page if you want more details about the other search operators.

Let's say that I want to find all of the ".jpg" files that are larger than 500 kilobytes. I can use the -size search operator in conjunction with a -name search operator:

$ find ~/Images/Screenshots -size +500k -iname '*.jpg'

I put a + at the beginning of the value after the -size operator to indicate that I want files larger than the specified size. The 'k' at the end indicates that the value is in kilobytes. You can use an 'M' for megabytes or a 'G' for gigabytes if you want to.
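
For instance, this variation on the same search flags screenshots larger than two megabytes:

$ find ~/Images/Screenshots -size +2M -iname '*.jpg'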

Using the find command in a pipeline

Now that we have gotten past the basics, it's time to see how the find command can be used with other commands to complete real-world tasks. The ability to find files that match certain search operators is useful by itself, but there are many situations where the user will want to run some commands on the files in the result set.

The find command has a special -exec parameter that can be used to perform an action on each file that matches the search operators. After the -exec parameter, you put a shell command that indicates what action you want performed. For example, if you wanted to echo all of the text in every ".txt" file in a directory hierarchy to stdout, you could do the following:

$ find ~/Journalism -name '*.txt' -exec cat {} \;

The curly brace pair ("{}") is a placeholder that the find command will automatically replace with the name of the matched file when it's performing the command on each item in the result set. The semi-colon at the end, which must be escaped, indicates the end of the -exec command.

The problem with -exec is that it will perform the action separately on each individual file that matches your search. It will spawn a new instance of the specified command for each file, creating a lot of extra processes and unnecessary overhead.

Rather than calling cat separately on each file, it would be better to call cat once, and pass it all of the matching files as parameters. In cases where you don't need to have a separate instance of the command for each file, you can simply pipe into the xargs command instead of using the find command's -exec parameter.

The xargs command is a simple utility that takes the output of the previous command in a pipeline and passes it to another command as arguments. Using xargs with find isn't always straightforward, however. The way that the xargs command uses whitespace to break up its input into arguments can sometimes have undesired results.
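
To get a feel for the basic behavior, here's a throwaway example that has nothing to do with find (the three words are arbitrary): xargs collects them from standard input and hands them to a single echo invocation as three separate arguments.

$ printf 'alpha\nbeta\ngamma\n' | xargs echo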

If you have spaces in the filenames returned by the find command, for example, xargs would break a filename with spaces into multiple arguments—definitely not the behavior that we want. Fortunately, there is a workaround built into both find and xargs.

If you supply the -print0 parameter in the find command, it will use the ASCII null character as a delimiter to separate the filenames instead of using linebreaks. Then you can give the xargs command the -0 parameter, which tells it to split the input at the null character instead of whitespace.

This approach allows us to safely use the xargs command to perform an operation on the output of the find command, but with proper handling for filenames with spaces. This is what the previous trivial cat example looks like when it's done correctly with xargs instead of -exec:

$ find ~/Journalism -name '*.txt' -print0 | xargs -0 cat

On my desktop computer, performing that operation with the xargs command ended up being more than three times faster than using the -exec parameter.
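
If you want to try a similar comparison on your own files, a rough sketch using the shell's time keyword looks like this (the output is redirected to /dev/null so the timing isn't dominated by printing to the terminal):

$ time find ~/Journalism -name '*.txt' -exec cat {} \; > /dev/null
$ time find ~/Journalism -name '*.txt' -print0 | xargs -0 cat > /dev/null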

Now that we understand the basics of using find with xargs, let's try a more complicated example. If I want to see the word count of every ".txt" file in my Journalism directory, I could do the following:

$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w

I used the wc command to count the words in all of the files. Each line displays the word count followed by the file path. At the very end of the output, on the last line, the wc command shows the total word count of all the files put together.

If I want to compute the average word count for the text files, I would just need to take the total word count and divide it by the number of matched files. Let's try that now by adding a simple awk invocation at the end of the pipeline:

$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w \
      | awk 'END { print $1/(NR-1) }'

In awk scripting, the code in the END block is performed when the end of the input is reached. That means that the value of the $1 variable in the END block will be the first column of text on the last line. In this case, that's the total word count.

NR is a built-in awk variable that holds the number of records, which in this case is the number of lines emitted by the wc command. I can get the average word count by dividing the value of $1 by the value of NR. I subtracted one from NR before performing the division so that it doesn't count the wc command's total line as a record.
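
If you want to convince yourself of how NR and the END block behave, a quick throwaway test is enough. The following prints the number of input lines followed by the first field of the final line; seq just generates the numbers 1 through 5, so with the GNU awk found on most Linux systems both values come out as 5:

$ seq 5 | awk 'END { print NR, $1 }'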

Just for fun, here's a variation on the previous example where the awk script is extended to compute the average word count only of articles that are less than 2000 words:

$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w \
      | awk '$1 < 2000 {v += $1; c++} END {print v/c}'

As you can see, combining the find command with an awk one-liner can often be very useful. There are a lot of situations where using the find command in a pipeline will enable practical automation and simplify day-to-day tasks.

Listing image by Jake Bouma
