In 1998, Ask Ars was an early feature of the newly launched Ars Technica. Now, as then, it's all about your questions and our community's answers. Each week, we'll dig into our question bag, provide our own take, then tap the wisdom of our readers. To submit your own question, see our helpful tips page.
Q: I know I can use the find command at the command line to locate files, but how do I use it with other commands to perform a real-world task? What's the difference between the -exec parameter and piping into xargs?
The find command is a standard utility on UNIX and Linux systems. It recurses through directory structures, looking for files that match the user's specified parameters. There are a number of different search operators that can be used together to achieve fine-grained file matching.
In this tutorial, I'll explain how to use the find command with several common search operators, and then I'll show you some examples of how to use the find command in a pipeline.
The most basic use of the find command is simple filename matching, accomplished with the -name parameter. The following example shows how the find command can be used to locate all of the README files in the /usr/share directory. The command's output will display one matching file per line.
$ find /usr/share -name README
The first argument is the path that I want the find command to search under. It's important to remember that the search is performed recursively, which means that it will look in all of the subdirectories and descendants of the /usr/share directory, not just that individual path.
The find command allows you to specify as many different paths as you want. It will generally interpret every argument that appears before the search operators as an additional target path.
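For example, here's a sketch of multi-path searching using a throwaway directory tree of my own invention (the /tmp/findpaths path is hypothetical, not from the article):

```shell
# Build a small sandbox with two separate trees, each holding a README.
mkdir -p /tmp/findpaths/a /tmp/findpaths/b
touch /tmp/findpaths/a/README /tmp/findpaths/b/README

# Every argument before the search operators is a starting path,
# so both trees are searched recursively.
find /tmp/findpaths/a /tmp/findpaths/b -name README
```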
The -name parameter is a search operator. It tells the find command that you want to locate all files and folders that have the desired name, which is 'README' in this case. It is case-sensitive and will only identify a full match. For a case-insensitive version, you would use the -iname parameter instead.
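Here's a quick sketch of the difference, using a throwaway directory and filenames of my own invention:

```shell
# Sandbox with README files in mixed case.
mkdir -p /tmp/findcase
touch /tmp/findcase/README /tmp/findcase/ReadMe /tmp/findcase/readme.txt

# -name readme would match nothing here, because the comparison is
# case-sensitive and must cover the whole name. -iname readme matches
# README and ReadMe, but still not readme.txt (no full-name match).
find /tmp/findcase -iname readme
```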
If you want to find partial matches, you can use the -name or -iname parameter with a simple glob expression. For example, this is how I find all of the files with a txt extension in my Journalism folder:
$ find ~/Journalism -name '*.txt'
The single quotes around the glob expression, which ensure that it gets passed to the find command as a literal string, are very important. If I left out the single quotes, the glob could be expanded by the shell before find ever saw it. That's not desirable in this case, because it would make the -name operator match against the expanded value rather than using the glob to perform the search.
If you want the find command to do more sophisticated pattern matching, you might consider using the -regex search operator, which takes a regular expression instead of a glob. That's an enormously powerful feature for complex searches.
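A brief sketch, assuming GNU find (where -regex matches against the whole path, using emacs-style regex syntax by default); the /tmp/findregex directory and file names are hypothetical examples of mine:

```shell
# Sandbox with numbered draft files.
mkdir -p /tmp/findregex
touch /tmp/findregex/draft-1.txt /tmp/findregex/draft-12.txt /tmp/findregex/notes.txt

# GNU find's -regex compares the pattern against the entire path,
# so the pattern starts with '.*/' to cover the leading directories.
find /tmp/findregex -regex '.*/draft-[0-9][0-9]*\.txt'
```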
The -name and -iname search operators match solely against the last segment of the filesystem path, the actual name of a file or directory. (Note that -regex behaves differently: in GNU find, it matches against the entire path.) If you want to perform glob matches against the entire path, you can use the -path or -ipath search operators instead. For example, imagine a complex folder hierarchy where you have a bunch of nested directories at various levels, and you want to find all of the ".c" files at any level under any "src" directory. You could do something like this:
$ find ~/Programming -path '*/src/*.c'
It's important to understand that the way the -path operator compares a glob expression against full paths is different from how a glob would be expanded in the shell. It compares against the path as a plain string, so a '*' can match across '/' separators. For example, the pattern above could match something like "test/project/code/src/modules/stuff.c", where the file is deep in the hierarchy.
There may be situations where you want to limit the depth of the find command's recursion, or only match against results that are a certain number of layers deep in a directory hierarchy. To limit the depth, you would use the -maxdepth search operator. The -mindepth operator will let you filter for results at or beyond the specified depth.
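A sketch of both operators against a throwaway three-level tree (the /tmp/finddepth path is a hypothetical example of mine):

```shell
# Sandbox: README files at depth 1 and at depth 3.
mkdir -p /tmp/finddepth/sub/subsub
touch /tmp/finddepth/README /tmp/finddepth/sub/subsub/README

# -maxdepth 1 stops the recursion at entries directly inside the
# starting directory, so only the shallow README is reported.
find /tmp/finddepth -maxdepth 1 -name README

# -mindepth 2 skips the first level, reporting only the deep README.
find /tmp/finddepth -mindepth 2 -name README
```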
In addition to name matching, there are search operators for matching against file age, ownership, size, and other attributes. There are too many to discuss at length here, so I'll only show one more basic example before we move on to the really interesting stuff. You can refer to the manual page if you want more details about the other search operators.
Let's say that I want to find all of the ".jpg" files that are larger than 500 kilobytes. I can use the -size search operator in conjunction with the -iname search operator:
$ find ~/Images/Screenshots -size +500k -iname '*.jpg'
I put a + at the beginning of the value after the -size operator to indicate that I want files larger than the specified size. The 'k' at the end indicates that the value is in kilobytes. You can use an 'M' for megabytes or a 'G' for gigabytes if you want to.
Using the find command in a pipeline
Now that we have gotten past the basics, it's time to see how the find command can be used with other commands to complete real-world tasks. The ability to find files that match certain search operators is useful by itself, but there are many situations where the user will want to run some commands on the files in the result set.
The find command has a special -exec parameter that can be used to perform an action on each file that matches the search operators. After the -exec parameter, you put a shell command that indicates what action you want performed. For example, if you wanted to echo all of the text in every ".txt" file in a directory hierarchy to stdout, you could do the following:
$ find ~/Journalism -name '*.txt' -exec cat {} \;
The curly brace pair ("{}") is a placeholder that the find command automatically replaces with the name of each matched file as it performs the command on the result set. The semicolon at the end, which must be escaped so the shell doesn't consume it, indicates the end of the -exec command.
The problem with -exec is that it performs the action separately on each individual file that matches your search. It spawns a new instance of the specified command for each file, creating a lot of extra processes and unnecessary overhead.
Rather than calling cat separately on each file, it would be better to call cat once and pass it all of the matching files as parameters. In cases where you don't need a separate instance of the command for each file, you can simply pipe into the xargs command instead of using the find command's -exec parameter.
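Incidentally, modern find also offers a middle ground worth knowing about: terminating -exec with a plus sign instead of an escaped semicolon makes find itself batch many filenames into a single command invocation, much like xargs does. A sketch using a throwaway directory of mine:

```shell
# Sandbox with two small text files.
mkdir -p /tmp/findexec
printf 'one\n' > /tmp/findexec/a.txt
printf 'two\n' > /tmp/findexec/b.txt

# Ending -exec with '+' passes all matched files to one cat invocation,
# instead of spawning cat once per file as '\;' would.
find /tmp/findexec -name '*.txt' -exec cat {} +
```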
The xargs command is a simple utility that takes the output of the previous command in a pipeline and passes it to another command as arguments. Using xargs with find isn't always straightforward, however. The way that the xargs command uses whitespace to break up its input into arguments can sometimes have undesired results.
If you have spaces in the filenames returned by the find command, for example, xargs would break a filename with spaces into multiple arguments, which is definitely not the behavior that we want. Fortunately, there is a workaround built into both find and xargs.
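To see the breakage concretely, here's a sketch using a throwaway directory and a filename of my own invention that contains a space:

```shell
# Sandbox containing a filename with an embedded space.
mkdir -p /tmp/findspace
printf 'hello world\n' > '/tmp/findspace/my notes.txt'

# xargs splits on the space, so wc is asked to open "/tmp/findspace/my"
# and "notes.txt", neither of which exists, and the pipeline fails.
find /tmp/findspace -name '*.txt' | xargs wc -w
```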
If you supply the -print0 parameter to the find command, it will use the ASCII null character as a delimiter to separate the filenames instead of using linebreaks. Then you can give the xargs command the -0 parameter, which tells it to split the input at the null character instead of whitespace.
This approach allows us to safely use the xargs command to perform an operation on the output of the find command, with proper handling for filenames with spaces. This is what the previous trivial cat example looks like when it's done correctly with xargs instead of -exec:
$ find ~/Journalism -name '*.txt' -print0 | xargs -0 cat
On my desktop computer, performing that operation with the xargs command ended up being more than three times faster than using the -exec parameter.
Now that we understand the basics of using find with xargs, let's try a more complicated example. If I want to see the word count of every ".txt" file in my Journalism directory, I could do the following:
$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w
I used the wc command to count the words in all of the files. Each line of output displays a word count followed by a file path. At the very end of the output, on the last line, the wc command shows the total word count of all the files put together.
If I want to compute the mean word count of the text files, I just need to take the total word count and divide it by the number of matched files. Let's try that now by adding a simple awk invocation at the end of the pipeline:
$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w | awk 'END { print $1/(NR-1) }'
In awk scripting, the code in the END block is performed when the end of the input is reached. That means that the value of the $1 variable in the END block will be the first column of text on the last line. In this case, that's the total word count.
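The END-block behavior is easy to demonstrate in isolation with a few lines of made-up wc-style input:

```shell
# In the END block, $1 still holds the first field of the LAST line
# read, and NR holds the total number of lines. This prints "40 3".
printf '10 a.txt\n30 b.txt\n40 total\n' | awk 'END { print $1, NR }'
```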
NR is a built-in awk variable that holds the number of records, which in this case means the number of lines emitted by the wc command. I can get the average word count by dividing the value of $1 by the value of NR. I subtract one from NR before performing the division so that the wc command's total line isn't counted as a record.
Just for fun, here's a variation on the previous example where the awk script is extended to compute the average word count of only the articles that are less than 2000 words:
$ find ~/Journalism -name '*.txt' -print0 | xargs -0 wc -w | awk '$1 < 2000 {v += $1; c++} END {print v/c}'
As you can see, combining the find command with an awk one-liner can often be very useful. There are a lot of situations where using the find command in a pipeline will enable practical automation and simplify day-to-day tasks.
Listing image: Photograph by Jake Bouma