A lot of introductions to using a shell, whether it’s on Linux, one of the BSDs, the Mac (or even Windows using WSL!), show examples that are a bit on the light side (looking at you, cowsay 😅) or dump cryptic command sequences on the unwary newbie that make an inscription in hieroglyphs on an Egyptian temple column look easy. Both approaches make sense. The first one tries not to scare people when they use the command line, while the second one shows how powerful it is compared to clicking around in a GUI. But they don’t really explain the advantages of a shell or the UNIX idea of “do one thing and do it well”.
An introduction should be easy to understand and follow, show a real-world use case, and ideally pick a task that takes more effort in a graphical environment. A few years back, I was planning a weekend workshop about using the command line for data analysis, and I came up with an idea for an example called “The Bard and The Shell” that I’d like to share. I hope it’s useful when someone asks why so many of us prefer the command line for certain tasks.
It shows some common commands (not too many, to keep it easy to follow), the advantages of pipelining, and how to solve a problem iteratively. We’re going to find the 25 most-used words in Shakespeare’s “Much Ado About Nothing”. If you try this in a GUI, you’ll quickly see that it’s not as simple as it seems, and that it’s hard to log the steps you took to get the results you’re looking for.
First, we need the text of the Bard. You can find it online, but you can also download the text file containing “Much Ado About Nothing” from: https://www.arminhanisch.de/data/muchado.zip. Just unzip the file and put the muchado.txt file in a directory of your choice. Now let’s get this show on the road. I’m using bash for this example, but this should work with other shells too (we will keep the fact that there are different shells, each with its own dedicated following, for a later post 😉). Open a terminal window and change to the directory where you put the muchado.txt file (using the cd command).
The first step when analyzing the text to find the most frequent words is to convert it so that each word is on its own line. We’ll be using the tr command for this. tr stands for “translate”. Like the name says, it’s a command-line utility for translating or deleting characters. It supports a bunch of different transformations. You can change text to uppercase or lowercase, squeeze repeating characters, delete specific characters, and do basic find and replace. You can also use it with UNIX pipes to support more complex translations.
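To make those transformations a bit more tangible, here is a small sketch on a made-up sample line (the sample text and the variable names are my own and not part of the workshop material). It squeezes repeated spaces, converts to uppercase, and deletes a set of characters.

```shell
# Hypothetical sample line, not taken from muchado.txt
sample='Much   Ado  About Nothing'

# -s squeezes every run of the listed character into a single one
squeezed=$(printf '%s\n' "$sample" | tr -s ' ')

# Two sets: each lowercase letter is translated to its uppercase twin
upper=$(printf '%s\n' "$squeezed" | tr '[:lower:]' '[:upper:]')

# -d deletes every character that appears in the set
no_vowels=$(printf '%s\n' "$squeezed" | tr -d 'aeiou')

printf '%s\n' "$squeezed" "$upper" "$no_vowels"
```

Note that tr works on single characters only, so “find and replace” here means swapping one character for another, not replacing whole words.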
Let’s turn the Bard’s work into a long list of words, one per line.
cat muchado.txt | tr '[:blank:]' '\n'
This finds every space or tab (the [:blank:] character class) and replaces it with a newline character. The output will be a very long list of over 22,000 lines of text, so you might want to just read along for the time being or wait until your terminal window finishes displaying the words.
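If you’d rather try the idea on something shorter than a whole play first, this sketch runs the same tr call on a single invented line that contains both a space and a tab (the line and the variable name are just for illustration).

```shell
# Hypothetical one-line input; \t in the printf format becomes a real tab
words=$(printf 'Sigh no\tmore, ladies\n' | tr '[:blank:]' '\n')

# Each word now sits on its own line; punctuation is still attached
printf '%s\n' "$words"
```

The comma is still glued to “more,” at this point, which is exactly what the next step takes care of.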
The next step is to take out all the punctuation, quotes, and other stuff. So, we just send the output of the last command to a new call to tr and then another. A backslash at the end of a line continues the command on the next line, which keeps a long pipeline readable.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'"
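Here is a quick sketch of what deleting the [:punct:] class does to two made-up lines (my own samples, not quotes checked against the file). As far as I know, the class covers the ASCII punctuation characters, including the hyphen and the plain apostrophe.

```shell
# Hypothetical sample lines with a comma, hyphen, apostrophe and more
clean=$(printf '%s\n' 'Hey, nonny-nonny!' "What's this?" | tr -d '[:punct:]')

printf '%s\n' "$clean"
```

One side effect worth knowing: because the hyphen is deleted rather than replaced, a hyphenated pair like “nonny-nonny” ends up as a single word in the counts.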
We don’t want to distinguish between a “You” and a “you” because they’re the same word, so we’re going to convert everything to lowercase, again using the mighty tr command. tr also gives us character classes for this, so we don’t have to specify every letter of the alphabet and its lowercase counterpart.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]'
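For comparison, this little sketch lowercases an invented sample once with the character classes and once with explicit ranges. For plain ASCII text I’d expect both to give the same result; the classes are the safer habit because ranges can behave differently depending on your locale.

```shell
# Hypothetical sample text
sample='You and YOU'

# Character classes, as used in the pipeline above
by_class=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]')

# Explicit ranges, the more old-school spelling
by_range=$(printf '%s\n' "$sample" | tr 'A-Z' 'a-z')

printf '%s\n' "$by_class" "$by_range"
```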
I don’t want to bore you with tr over and over, so for our next task of removing empty lines (no word, no need to count), we’ll switch to another command named grep. grep stands for Global Regular Expression Print. If you continue using the shell, you’ll learn the meaning of a lot of these cryptic abbreviations. 😎 Anyway, how do we get rid of empty lines with grep? The pattern '^$' matches lines with nothing between their start and their end, and the -v option inverts the match so that only the other lines get through. Like so:
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v
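Where do the empty lines come from in the first place? This sketch, using a made-up two-line sample with a double space and a blank line in it, shows that every extra blank turns into an empty line, and that grep -v drops them again.

```shell
# Hypothetical input: two spaces between "much" and "ado", plus a blank line
raw=$(printf 'much  ado\n\nabout nothing\n' | tr '[:blank:]' '\n')

# -c counts matching lines instead of printing them
empty_count=$(printf '%s\n' "$raw" | grep -c '^$')

# -v keeps only the lines that do NOT match the pattern
kept=$(printf '%s\n' "$raw" | grep -v '^$')

echo "empty lines before filtering: $empty_count"
printf '%s\n' "$kept"
```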
Now, let’s sort all these words alphabetically. You’ve got to do this step first because the next step, which is to remove all the duplicates and count them, needs its input to be sorted.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v \
| sort
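Why does the order of these two steps matter? A minimal sketch with three invented words shows it: uniq only collapses duplicates that sit directly next to each other, so without sort the second “you” survives.

```shell
# Hypothetical word list with a duplicate that is not adjacent
words='you
me
you'

# Without sorting, uniq sees no adjacent duplicates and changes nothing
unsorted=$(printf '%s\n' "$words" | uniq)

# After sorting, the two "you" lines are neighbours and collapse into one
sorted=$(printf '%s\n' "$words" | sort | uniq)

printf 'without sort:\n%s\n\nwith sort:\n%s\n' "$unsorted" "$sorted"
```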
Now the play’s word list looks a lot more orderly. Here’s a fun fact: the last word is “zeal” and it only appears once in the whole text. Maybe you weren’t too zealous, William? 😂 Alright, let’s go ahead and remove all the duplicates while we’re counting them.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v \
| sort \
| uniq -c
For the play, the output now has fewer than 3,000 lines, one per distinct word. Looks like you can read Shakespeare even if you don’t speak English perfectly. How do I know that number? As an aside, I’m using the wc command (word count) to do the counting. Want to know how many lines your output has? Just add wc with the -l option (for lines) to the end of the pipeline. And yes, wc can also count words and characters.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v \
| sort \
| uniq -c \
| wc -l
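As a small aside on the other counters, here is a sketch on two invented lines. The -w and -c options are standard; note that -c strictly counts bytes, which is the same as characters for plain ASCII text. Some versions of wc pad their output with blanks, so I strip those with tr before printing.

```shell
# Hypothetical two-line sample
text='much ado
about nothing'

# -l counts lines, -w counts words, -c counts bytes
lines=$(printf '%s\n' "$text" | wc -l | tr -d '[:blank:]')
words=$(printf '%s\n' "$text" | wc -w | tr -d '[:blank:]')
bytes=$(printf '%s\n' "$text" | wc -c | tr -d '[:blank:]')

echo "$lines lines, $words words, $bytes bytes"
```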
Ending the pipeline with wc -l does not print the long list of words, just the number 2978. OK, back to our task…
We want this list sorted by count in reverse order. There’s a command for this, and it’s called sort (what a surprise 😁). It has a bunch of options, but we’ll only use two: -n for numerical sorting and -r for reverse.
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v \
| sort \
| uniq -c \
| sort -nr
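To see why the -n matters, this sketch sorts two made-up count lines both ways. Without -n, sort compares the text character by character, so a line starting with 9 ranks above one starting with 10.

```shell
# Hypothetical count lines, written without padding to make the effect visible
counts='10 you
9 zeal'

# Reverse text sort: "9" beats "1" as a character, so 9 comes first
as_text=$(printf '%s\n' "$counts" | sort -r)

# Reverse numeric sort: 10 is larger than 9, so 10 comes first
as_numbers=$(printf '%s\n' "$counts" | sort -nr)

printf 'sort -r:\n%s\n\nsort -nr:\n%s\n' "$as_text" "$as_numbers"
```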
We’re getting closer. We just need to make sure we output only the first 25 lines. The command that keeps only the start of a stream of lines is called head, and it takes the number of lines via its -n option. And yes, you guessed it: if you want the last part of a list of lines, you’d use the command tail. 😉
cat muchado.txt | tr '[:blank:]' '\n' \
| tr -d '[:punct:]' \
| tr -d "'" \
| tr '[:upper:]' '[:lower:]' \
| grep -e '^$' -v \
| sort \
| uniq -c \
| sort -nr \
| head -n 25
And there you have it: the 25 most frequently used words from “Much Ado About Nothing”:
694 i
628 and
581 the
491 you
485 a
428 to
360 of
311 in
302 is
291 that
281 my
256 it
250 not
223 her
220 for
219 me
212 don
200 he
199 with
199 will
198 benedick
196 claudio
182 your
182 be
173 but
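If you want to keep the steps around for the next text, one option is to wrap the pipeline in a shell function. This is only a sketch: the name top_words, its parameters, and the sample sentence are my own inventions, and the function simply repeats the steps from above with the file name and the number of words as arguments.

```shell
# top_words FILE [N] (hypothetical helper): print the N most frequent
# words in FILE, defaulting to 25 as in the example above
top_words() {
  tr '[:blank:]' '\n' < "$1" \
    | tr -d '[:punct:]' \
    | tr -d "'" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -nr \
    | head -n "${2:-25}"
}

# Try it on a tiny made-up sample instead of the whole play
sample_file=$(mktemp "${TMPDIR:-/tmp}/bard.XXXXXX")
printf 'You and I, you and me: YOU!\n' > "$sample_file"
top=$(top_words "$sample_file" 2)
rm -f "$sample_file"

printf '%s\n' "$top"
```

With the play in the current directory, top_words muchado.txt should produce the same kind of list as above, and top_words muchado.txt 10 only the first ten entries. How words with the same count are ordered among themselves may differ between sort implementations.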
IMHO that’s a great way to get started with “data science on the command line” and see how flexible and useful the command line tools and the concept of pipelines can be to solve a specific task. Taking a look at Shakespeare through the lens of a one-liner…