T2: Pipes, Search & Text

The Unix philosophy — small tools, composed

T1 introduced the pipe character | as a way to chain commands. T2 is where pipes go from a trick you've heard of to a tool you reach for automatically.

Unix was designed around one idea: each program does exactly one thing, does it well, reads from standard input, and writes to standard output. No program knows or cares what comes before or after it in a pipeline. That narrow contract is what makes composition possible. A tool written in 1975 pipes into a tool written last year because they both speak the same language: text.

When you write cmd1 | cmd2 | cmd3, the shell wires the output of cmd1 directly into the input of cmd2, and cmd2's output into cmd3. No temp files. The commands run concurrently — as cmd1 produces output, cmd2 is already consuming it. This is the engine that makes one-liners like "top 10 most frequent words in a file" a 30-second job.
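
A quick way to see the concurrency for yourself, as a minimal sketch: yes prints "y" forever, so this pipeline could never finish if the shell waited for each command to exit before starting the next. The variable name first_three is only for illustration.

```shell
# yes never exits on its own; head reads three lines and quits,
# which closes the pipe and ends yes as well.
first_three=$(yes | head -n 3)
echo "$first_three"   # prints y on three separate lines
```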

Before the tools: the three standard streams every process has from birth.

stdin (0) — where a process reads input from. By default, your keyboard.
stdout (1) — where a process writes normal output. By default, your terminal.
stderr (2) — where a process writes errors and diagnostics. Also your terminal by default, but a separate stream from stdout — so you can redirect them independently.

You'll meet those stream numbers again when you redirect stderr (2>) later in this module.

pbcopy and pbpaste — macOS-only pipe targets

macOS ships two utilities that bridge the shell and the clipboard: pbcopy reads from stdin and puts it on the clipboard; pbpaste writes the clipboard contents to stdout. You can pipe into and out of the clipboard just like any other command: cat results.txt | pbcopy copies the file; pbpaste | sort sorts whatever is on your clipboard. Linux doesn't have these built in — another small BSD delta to file away.

grep — search inside files

grep searches for a pattern in files — or in piped input — and prints every line that matches. It's one of the most-reached-for tools in the Unix toolkit.

$ grep "error" app.log              # lines containing "error"
$ grep -i "error" app.log           # -i: case-insensitive (Error, ERROR, error)
$ grep -r "TODO" src/               # -r: recursive — search a whole directory tree
$ grep -n "error" app.log           # -n: show line numbers with each match
$ grep -v "DEBUG" app.log           # -v: invert — lines that do NOT match
$ grep -c "error" app.log           # -c: count matching lines (just the number)
$ grep -E "error|warn" app.log      # -E: extended regex — match "error" OR "warn"
$ grep -F "1.2.3" version.txt       # -F: fixed string — no regex, literal match

Flags combine freely: grep -rn "TODO" src/ searches recursively and shows line numbers. grep -iv "debug" app.log case-insensitively excludes debug lines.
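
To try the flags without touching real logs, here is a small sketch that builds a throwaway app.log first. The log contents and the variable names are made up for the example.

```shell
# Build a small sample log in a scratch directory (contents are invented).
tmp=$(mktemp -d)
printf '%s\n' 'INFO start' 'ERROR disk full' 'DEBUG retry' 'error: timeout' 'WARN slow' > "$tmp/app.log"

# Case-sensitive with line numbers: only the lowercase "error" on line 4 matches.
exact=$(grep -n "error" "$tmp/app.log")
# -c and -i combined: count matching lines regardless of case.
any_case=$(grep -ci "error" "$tmp/app.log")
# -c, -i and -v combined: count every line except the DEBUG one.
no_debug=$(grep -civ "debug" "$tmp/app.log")

echo "$exact"      # 4:error: timeout
echo "$any_case"   # 2
echo "$no_debug"   # 4
```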

macOS: BSD grep has no -P

On Linux, grep -P enables Perl-compatible regular expressions — a powerful superset of basic regex. macOS ships with BSD grep, which does not support -P. Scripts copied from Linux that use grep -P "\d+" will fail on your Mac with an "invalid option" error. The fix: use grep -E "[0-9]+" instead — extended regex is available on both. When you look up grep examples online, watch for -P patterns and know you'll need to rewrite them.
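
A minimal sketch of the rewrite, using an invented notes.txt. It also uses -o, a flag not covered above that prints only the matched text; both BSD and GNU grep support it.

```shell
# Sample input (invented for this example).
tmp=$(mktemp -d)
printf '%s\n' 'build 42 passed' 'no digits here' 'retry 7' > "$tmp/notes.txt"

# [0-9]+ under -E stands in for the Perl-only \d+ shorthand.
digit_lines=$(grep -cE "[0-9]+" "$tmp/notes.txt")
# -o prints only the matched text, one match per line.
numbers=$(grep -oE "[0-9]+" "$tmp/notes.txt")

echo "$digit_lines"   # 2
echo "$numbers"       # 42, then 7 on the next line
```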

Regex, just enough

Regular expressions are a mini-language for describing patterns. You don't need to be an expert — you need enough to read and write basic patterns without being surprised.

.       # any single character (except newline)
*       # zero or more of the preceding (e.g. "ab*" matches "a", "ab", "abb", ...)
^       # anchor to start of line (e.g. "^Error" only matches at line start)
$       # anchor to end of line (e.g. "\.log$" — line ends with ".log")
[abc]   # character class — matches one of a, b, or c
[0-9]   # range — matches any single digit
\.      # escaped dot — literal period (unescaped, . means "any character")

The dot is a wildcard — use -F for literal strings

A pattern like grep "1.2.3" looks like it searches for the literal string 1.2.3, but the dots are wildcards — it also matches 1x2y3, 1a2b3, and anything else with one character between each digit. This is one of the most common silent mismatch bugs in shell work. When you're searching for a literal string that contains dots (version numbers, IP addresses, file extensions), use grep -F "1.2.3" to treat the pattern as a fixed string, or escape the dots: grep "1\.2\.3".
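
The mismatch is easy to reproduce. This sketch writes three invented lines to a scratch version.txt and counts how many each form of the pattern matches.

```shell
# Only the first line contains the literal string 1.2.3.
tmp=$(mktemp -d)
printf '%s\n' 'version 1.2.3' 'serial 1x2y3' 'build 10203' > "$tmp/version.txt"

# Unescaped dots match any character, so all three lines match.
regex_hits=$(grep -c "1.2.3" "$tmp/version.txt")
# -F treats the pattern as a fixed string.
fixed_hits=$(grep -cF "1.2.3" "$tmp/version.txt")
# Escaped dots match literal periods only.
escaped_hits=$(grep -c "1\.2\.3" "$tmp/version.txt")

echo "$regex_hits $fixed_hits $escaped_hits"   # 3 1 1
```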

In practice for T2, you mainly need -E for "either/or" patterns (grep -E "error|warn") and -F when you want literal strings. Deeper regex is covered in context when you need it — don't memorise; recognise and look up.

find — locate files

find walks a directory tree and prints files that match criteria you specify. It's how you answer "where is the config file I changed last week?" or "how many log files are in this tree?"

$ find . -name "*.txt"              # files named *.txt from current dir
$ find ~/projects -name "*.py"      # Python files under ~/projects
$ find . -type f                    # files only (not directories)
$ find . -type d                    # directories only
$ find . -size +1M                  # files larger than 1 megabyte
$ find . -mtime -1                  # modified in the last 24 hours
$ find . -name "*.log" -exec rm {} \;   # delete every .log file found

Always quote the pattern — the most common find mistake

The shell processes your command before find ever sees it. If you write find . -name *.txt, the shell tries to expand *.txt against your current directory first. If there are no .txt files here, bash leaves *.txt as-is and find works by accident, while zsh (the macOS default shell) stops with a "no matches found" error and never runs find. If there's exactly one, the shell replaces *.txt with that filename and find searches for only that name. If there are several, the shell expands to all of them, find sees extra arguments it can't parse, and errors out. Protect the pattern from shell expansion by quoting it: find . -name "*.txt". Always.
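
Here is a sketch of the "exactly one match" case, which fails silently. The scratch directory layout (a.txt at the top, b.txt in a subdirectory) is invented for the demonstration.

```shell
# One .txt file in the starting directory, one deeper in the tree.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
touch "$tmp/a.txt" "$tmp/sub/b.txt"

# Unquoted: the shell rewrites *.txt to a.txt before find runs,
# so sub/b.txt is never considered. (tr strips the padding BSD wc adds.)
unquoted=$(cd "$tmp" && find . -name *.txt | wc -l | tr -d ' ')
# Quoted: find receives the pattern itself and matches both files.
quoted=$(cd "$tmp" && find . -name "*.txt" | wc -l | tr -d ' ')

echo "unquoted found $unquoted, quoted found $quoted"   # unquoted found 1, quoted found 2
```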

macOS: BSD find differences

BSD find on macOS has two notable differences from GNU find on Linux. First, it requires a starting path: bare find -name "*.txt" without a path errors on macOS, so always pass find . or a real path. Second, it has no -printf, a GNU-only flag used in many Linux examples to format output. When you see find ... -printf "%f\n" online, it will fail on your Mac. A portable replacement for that particular format (print just the file name) is -exec basename {} \;. For other formats, pipe the paths to xargs or use -exec with a command that prints what you need.

The -exec flag runs a command on each matched file: {} is a placeholder for the matched path, and \; marks the end of that command (the backslash stops the shell from treating ; as its own command separator). Before running -exec rm, always dry-run with -print first to see exactly what you'd be deleting: find . -name "*.tmp" -print.
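
The dry-run habit can be sketched like this, in a scratch directory with invented file names so nothing real is at risk.

```shell
# Two disposable files and one that must survive.
tmp=$(mktemp -d)
touch "$tmp/a.tmp" "$tmp/b.tmp" "$tmp/keep.txt"

# Step 1: dry run. Same search, but only print what matched.
preview=$(find "$tmp" -name "*.tmp" -print | sort)
echo "$preview"

# Step 2: once the list looks right, swap -print for -exec rm.
find "$tmp" -name "*.tmp" -exec rm {} \;

remaining=$(ls "$tmp")
echo "$remaining"   # keep.txt
```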

wc — counting

wc (word count) counts lines, words, and bytes in its input. You'll reach for it constantly to sanity-check what's in a file or pipeline.

$ wc -l app.log                     # count lines
$ wc -w essay.txt                   # count words
$ wc -c data.bin                    # count bytes
$ cat *.log | wc -l                 # total lines across all log files
$ grep "error" app.log | wc -l      # count lines that matched

Note the distinction between grep -c "error" app.log and grep "error" app.log | wc -l. They give the same number for a single file. Given several files, -c prints a separate count for each file, while the pipe form produces one total and works at the end of any multi-step pipeline.
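
A small sketch of that difference with two invented log files:

```shell
# a.log has two error lines, b.log has one.
tmp=$(mktemp -d)
printf '%s\n' 'error one' 'ok' 'error two' > "$tmp/a.log"
printf '%s\n' 'ok' 'error three' > "$tmp/b.log"

# grep -c with several files: one count per file, prefixed by the file name.
per_file=$(cd "$tmp" && grep -c "error" a.log b.log)
# Pipe form: a single total across everything that flowed through.
total=$(cat "$tmp"/*.log | grep "error" | wc -l | tr -d ' ')

echo "$per_file"   # a.log:2 then b.log:1
echo "$total"      # 3
```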

sort and uniq — order and deduplicate

sort sorts lines of text. By default it sorts alphabetically; add flags for other orderings:

$ sort names.txt                    # alphabetical ascending
$ sort -r names.txt                 # reverse alphabetical
$ sort -n sizes.txt                 # numeric sort ("10" after "9", not before)
$ sort -u words.txt                 # sort and remove duplicates in one step

uniq removes adjacent duplicate lines. Paired with sort, it handles deduplication across an entire file:

$ sort words.txt | uniq             # deduplicate (sort first, collapse adjacent dupes)
$ sort words.txt | uniq -c          # prefix each unique line with its count
$ sort words.txt | uniq -d          # show only lines that appear more than once

uniq only collapses adjacent duplicates — always sort first

uniq looks at consecutive pairs of lines. If the same value appears at line 1 and line 50 of an unsorted file, uniq will not collapse them — both appear in the output. Running uniq on unsorted input is almost always a silent bug: it looks like it worked, but it missed most of the duplicates. Always pipe through sort first. The only exception is when you deliberately want run-length collapse (adjacent duplicates only) — and that's rare.
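
You can watch the bug happen with six invented lines. This is a sketch; the variable names are only for illustration.

```shell
# "apple" appears three times, but only two of those are adjacent.
fruits=$(printf '%s\n' apple banana apple apple cherry banana)

# Without sort, only the adjacent pair of apples collapses.
unsorted_count=$(printf '%s\n' "$fruits" | uniq | wc -l | tr -d ' ')
# With sort first, every duplicate becomes adjacent and collapses.
sorted_count=$(printf '%s\n' "$fruits" | sort | uniq | wc -l | tr -d ' ')

echo "$unsorted_count"   # 5
echo "$sorted_count"     # 3
```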

The canonical idiom — count how often each value appears, most frequent first:

$ sort access.log | uniq -c | sort -rn

Read it left to right: sort the log (identical lines become adjacent) → count adjacent duplicates with uniq -c (prepends each unique line with its count) → sort numerically in reverse (largest count first). Result: a frequency table of every unique line, ranked by occurrence. You'll use this exact pipeline constantly.
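
As a sketch of what the output looks like, here is the same pipeline run on six invented HTTP status codes instead of a real access.log. The width of the padding before each count differs between macOS and Linux, so only the order is noted.

```shell
# Invented input: 200 three times, 404 twice, 500 once.
codes=$(printf '%s\n' 200 404 200 500 200 404)

# sort groups identical lines, uniq -c counts each group, sort -rn ranks them.
table=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rn)

echo "$table"   # 200 first with count 3, then 404 with 2, then 500 with 1
```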

cut — slice columns

cut extracts specific fields from each line. Essential for structured text like CSVs and tab-separated files:

$ cut -d',' -f1 data.csv            # column 1 from comma-separated file
$ cut -d',' -f1,3 data.csv          # columns 1 and 3
$ cut -f2 report.tsv                # column 2 from tab-delimited file (tab is the default)
$ cut -d':' -f1 /etc/passwd         # usernames (colon-delimited)

-d sets the delimiter character; -f picks the field number(s). For anything more complex — quoted CSV fields, computed columns, restructured output — reach for awk, which you'll meet in T4.
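
A minimal sketch on an invented three-line data.csv, to show that cut keeps the original delimiter when you pick more than one field:

```shell
# Sample CSV with a header row (contents are made up).
tmp=$(mktemp -d)
printf '%s\n' 'name,role,city' 'ada,engineer,london' 'linus,maintainer,portland' > "$tmp/data.csv"

# Field 1 only.
names=$(cut -d',' -f1 "$tmp/data.csv")
# Fields 1 and 3: the output lines are still comma-separated.
name_city=$(cut -d',' -f1,3 "$tmp/data.csv")

echo "$names"       # name, ada, linus (one per line)
echo "$name_city"   # name,city then ada,london then linus,portland
```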

Streams and redirection, deeper

T1 covered > (write to file) and >> (append). Now the rest — specifically stderr, which trips everyone up eventually.

$ cmd > output.txt              # redirect stdout to a file
$ cmd 2> errors.txt             # redirect stderr to a file (stdout still to terminal)
$ cmd > output.txt 2>&1        # stdout to file, then merge stderr into stdout
$ cmd > output.txt 2>/dev/null  # stdout to file, discard stderr
$ cmd 2>/dev/null               # silence all errors, stdout to terminal
$ cmd >/dev/null 2>&1          # discard everything (fully silent)

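The numbers are file descriptors: 1 = stdout, 2 = stderr. 2>&1 means "redirect fd 2 to wherever fd 1 currently points." Order matters because the shell applies redirections left to right: write the stdout redirect first, then 2>&1. Reversed (cmd 2>&1 > output.txt), stderr is pointed at the terminal, which is where stdout pointed at that moment, and only stdout ends up in the file.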
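
The ordering rule is easier to trust once you've seen it. This sketch assumes a hypothetical function called emit that writes one line to each stream, then tries both orders.

```shell
# emit is a made-up helper: one line to stdout, one to stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

tmp=$(mktemp -d)

# Correct order: stdout goes to the file, then stderr follows it there.
emit > "$tmp/right.txt" 2>&1

# Reversed: stderr is copied from stdout's current target first, and only
# then is stdout moved to the file. Inside $( ) that current target is the
# capture, so the stderr line lands in the variable, not in the file.
escaped=$(emit 2>&1 > "$tmp/wrong.txt")

right_lines=$(wc -l < "$tmp/right.txt" | tr -d ' ')
wrong_lines=$(wc -l < "$tmp/wrong.txt" | tr -d ' ')

echo "$right_lines"   # 2
echo "$wrong_lines"   # 1
echo "$escaped"       # to stderr
```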

/dev/null is the kernel's black hole — anything written there is discarded immediately, no disk I/O. Reading from it gives you an empty stream. You'll most often use it to silence noisy commands you're running for their side effects.

Pipes only carry stdout — stderr passes through

When you pipe a command, only stdout goes into the pipe. Stderr still prints to your terminal. This is usually what you want — errors stay visible even when you're piping output somewhere else. When you want to suppress them (e.g. find printing "Permission denied" lines), append 2>/dev/null to the command that generates them: find / -name "*.conf" 2>/dev/null | grep apache.
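
To see stderr bypass a pipe, this sketch again assumes a hypothetical emit function and sends its output through tr to uppercase whatever arrives on the pipe.

```shell
# emit is a made-up helper: one line to stdout, one to stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Only stdout enters the pipe. The stderr line skips tr entirely and is
# printed unchanged on the terminal.
piped=$(emit | tr 'a-z' 'A-Z')

# 2>&1 before the pipe sends stderr through the pipe as well.
merged=$(emit 2>&1 | tr 'a-z' 'A-Z')

echo "$piped"    # TO STDOUT
echo "$merged"   # TO STDOUT then TO STDERR
```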

Putting it together — real pipelines

This is where the tools pay off. Three worked examples, each built left-to-right so you can see how a pipeline grows.

Example 1 — top 10 most frequent words in a file

$ cat essay.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -10

Left to right: print the file → translate spaces to newlines so each word lands on its own line (-s squeezes each run of repeated separators into one, so consecutive spaces don't produce empty lines) → sort alphabetically (identical words become adjacent) → count adjacent duplicates → sort by count largest-first → keep only the top 10. Six stages built from five distinct tools (sort appears twice); each does one thing.
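
A sketch of the same pipeline on a one-line invented essay.txt, trimmed to the top two words so the result is easy to check by eye:

```shell
# Invented input: "the" appears three times, "and" twice, the rest once.
tmp=$(mktemp -d)
echo "the cat and the  dog and the bird" > "$tmp/essay.txt"

# The double space before "dog" is squeezed, so no empty line is counted.
top_two=$(cat "$tmp/essay.txt" | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -n 2)

echo "$top_two"   # "the" with a count of 3, then "and" with a count of 2
```

Note that this version treats "The" and "the" as different words and leaves punctuation attached to them; normalising those would take extra tr stages.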

Example 2 — count ERROR lines in a log

$ grep -i "error" app.log | wc -l

Simple two-step. To see the error lines and the count, use tee to split the stream: grep -i "error" app.log | tee /dev/stderr | wc -l — lines print to the terminal via stderr, count appears at the end via stdout.

Example 3 — 5 largest files in a directory tree

$ find . -type f -exec du -sk {} \; 2>/dev/null | sort -rn | head -5

Find all files, run du -sk on each (prints size in kilobytes + path), silence permission errors with 2>/dev/null, sort numerically largest-first, show the top 5. The -k flag on du gives kilobytes — portable across macOS and Linux.
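
To try this safely, here is a sketch that creates three files of different sizes in a scratch directory and asks for the two largest. The file names and sizes are invented, and the numbers du reports depend on the filesystem, so no exact output is shown.

```shell
# Three zero-filled files of roughly 8 KB, 64 KB and 512 KB.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/small.bin"  bs=1024 count=8   2>/dev/null
dd if=/dev/zero of="$tmp/medium.bin" bs=1024 count=64  2>/dev/null
dd if=/dev/zero of="$tmp/big.bin"    bs=1024 count=512 2>/dev/null

# Same pipeline as above, pointed at the scratch directory and trimmed to 2.
largest=$(find "$tmp" -type f -exec du -sk {} \; 2>/dev/null | sort -rn | head -n 2)

echo "$largest"
```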

By the end of this module you will: