Welcome to today's tutorial on searching text files with regular expressions, using tools such as grep, egrep, fgrep, sed and the -regex option of find. String-searching algorithms are used by so many data-processing tasks that Unix-like operating systems have their own ubiquitous implementation: regular expressions, often abbreviated as REs. A regular expression is a character sequence that makes up a generic pattern used to locate, and sometimes modify, a corresponding sequence in a larger string of characters. Regular expressions greatly expand the ability to:
- Write parsing rules for requests in HTTP servers, nginx in particular.
- Write scripts that convert text-based datasets to another format.
- Search for occurrences of interest in journal entries or documents.
- Filter markup documents, keeping semantic content.
Differences Between Basic and Extended Regular Expressions
Basic Regular Expressions
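Basic regular expressions (BREs) include metacharacters such as a single dot (.) to represent any one character, and a dot followed by an asterisk (.*) to represent any sequence of zero or more characters. They may also use brackets to represent one character from a set, such as [aeiou] (do not separate the characters with commas, because a comma inside the brackets is matched literally), or from a range of characters, such as [A-Z] or [a-z]. A range like [A-z] should be avoided, because in ASCII order it also covers the punctuation characters that sit between Z and a. When brackets are employed, it is called a bracket expression.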
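As a minimal sketch of the dot and bracket expressions described above, the following commands run grep against a small word list. The file contents and variable names are made up for this illustration, and LC_ALL=C is set on the assumption that you want plain ASCII ordering for ranges.

```shell
export LC_ALL=C

# Hypothetical sample data, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' 'cat' 'cut' 'cot' 'c9t' 'Cat' 'ct' > "$sample"

# A single dot matches exactly one character, so "ct" is excluded.
dot_matches=$(grep 'c.t' "$sample")

# A bracket expression matches one character from the listed set.
vowel_matches=$(grep 'c[aou]t' "$sample")

# A range inside brackets: one uppercase letter followed by "at".
upper_matches=$(grep '[A-Z]at' "$sample")

printf 'dot:\n%s\nbracket:\n%s\nrange:\n%s\n' \
  "$dot_matches" "$vowel_matches" "$upper_matches"
rm -f "$sample"
```

The dot pattern accepts c9t because the dot matches any character, while the bracket pattern accepts only the three listed vowels.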
Anchor characters: To find text file records that begin with particular characters, precede them with a caret (^) symbol. To find records where particular characters are at the record's end, follow them with a dollar sign ($) symbol. Both the caret and the dollar sign are called anchor characters for BREs, because they fasten the pattern to the beginning or the end of a text line.
Using the grep command with a BRE pattern:
$ grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
nm-openvpn:x:115:121:NetworkManager OpenVPN,,,:/var/lib/openvpn/chroot:/usr/sbin/nologin
The above command searches for instances of the string root within the password file. It displays two lines: the record of the root account, and the nm-openvpn record, which contains root inside the word chroot.
Using grep with an anchor to display only the lines that begin with the PATTERN:
$ grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash
The above command places the BRE ^ anchor before the word root. This regular expression pattern causes grep to display only lines in the password file that begin with root.
Using the grep command to audit the password file:
$ grep -v nologin$ /etc/passwd
root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync
tss:x:103:108:TPM software stack,,,:/var/lib/tpm:/bin/false
speech-dispatcher:x:111:29:Speech Dispatcher,,,:/run/speech-dispatcher:/bin/false
hplip:x:116:7:HPLIP system user,,,:/run/hplip:/bin/false
whoopsie:x:117:122::/nonexistent:/bin/false
gnome-initial-setup:x:121:65534::/run/gnome-initial-setup/:/bin/false
gdm:x:122:127:Gnome Display Manager:/var/lib/gdm3:/bin/false
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
mysql:x:995:1001::/home/mysql:/bin/sh
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
The -v option is useful when auditing your configuration files with the grep utility. It produces a list of text file records that do not match the pattern. The above output shows all the records in the password file that do not end in nologin. Notice that the BRE pattern puts the $ at the end of the word. If you were to place the $ before the word in an unquoted pattern, the shell would treat it as a variable name instead of passing it to grep as part of the BRE pattern.
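The anchors and the -v option discussed above can be tried safely on a scratch file instead of the real password file. This is only a sketch: the three passwd-style records below are invented for the example, and the patterns are single-quoted so the shell leaves the ^ and $ characters alone.

```shell
export LC_ALL=C

# Hypothetical passwd-style records; not taken from a real system.
accounts=$(mktemp)
cat > "$accounts" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
chroot-user:x:1002:1002::/home/chroot-user:/bin/sh
EOF

# ^ anchors the match to the start of the line, so the
# chroot-user record is not selected even though it contains "root".
starts_with_root=$(grep '^root' "$accounts")

# $ anchors the match to the end of the line; -v inverts the selection,
# leaving only the records that do NOT end in nologin.
interactive=$(grep -v 'nologin$' "$accounts")

printf '%s\n' "$starts_with_root" "$interactive"
rm -f "$accounts"
```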
Character classes: A special group of bracket expressions are character classes. They have predefined names and can be considered bracket expression shortcuts. Their interpretation depends on the LC_CTYPE locale environment variable. A class name is only valid inside a bracket expression, which is why it is written with double brackets, as in [[:digit:]].
Commonly Used Character Classes:
- [:alnum:]: Represents an alphanumeric character.
- [:alpha:]: Represents an alphabetic character.
- [:ascii:]: Represents a character that fits into the ASCII character set. This one is not part of the POSIX set of classes, so not every tool accepts it.
- [:blank:]: Represents a blank character, that is, a space or a tab.
- [:cntrl:]: Represents a control character.
- [:digit:]: Represents a digit (0 through 9).
- [:graph:]: Represents any printable character except space.
- [:lower:]: Represents a lowercase character.
- [:print:]: Represents any printable character including space.
- [:punct:]: Represents any printable character which is not a space or an alphanumeric character.
- [:space:]: Represents white-space characters: space, form-feed (\f), newline (\n), carriage return (\r), horizontal tab (\t), and vertical tab (\v).
- [:upper:]: Represents an uppercase letter.
- [:xdigit:]: Represents hexadecimal digits (0 through 9, A through F, and a through f).
We have a file named users.txt. Let's check what it contains with the cat command:
$ cat users.txt
pilot
3434
frank
64646
sshd
77767
Using the grep command with a character class to select the lines that contain a digit:
$ grep "[[:digit:]]" users.txt
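To see what character classes select, the sketch below recreates the users.txt content shown above in a temporary file (the temporary path and variable names are assumptions of this example) and filters it with two classes.

```shell
export LC_ALL=C

# Recreate the users.txt content from the listing above.
users=$(mktemp)
printf '%s\n' pilot 3434 frank 64646 sshd 77767 > "$users"

# Lines that contain at least one digit.
with_digits=$(grep '[[:digit:]]' "$users")

# Lines made up only of alphabetic characters: the class is
# repeated with * and anchored at both ends of the line.
only_letters=$(grep '^[[:alpha:]]*$' "$users")

printf 'digits:\n%s\nletters:\n%s\n' "$with_digits" "$only_letters"
rm -f "$users"
```

With this input, the digit class selects the three numeric lines and the anchored alphabetic class selects the three user names.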
Quantifiers: An atom is a single character, with or without special meaning, or a bracket expression. The reach of an atom can be adjusted using an atom quantifier. Atom quantifiers define atom sequences, that is, a match occurs when a contiguous repetition of the atom is found in the string. The substring corresponding to the match is called a piece. However, quantifiers and other features of regular expressions are treated differently depending on which standard (basic or extended) is being used.
The * quantifier has the same function in both basic and extended REs (the atom occurs zero or more times). It is a literal character if it appears at the beginning of the regular expression or if it is preceded by a backslash (\).
In extended regular expressions, the plus sign quantifier + selects pieces containing one or more atom matches in sequence. With the question mark quantifier ?, a match occurs if the corresponding atom appears once or does not appear at all. If preceded by a backslash (\), both lose their special meaning and are matched literally.
Basic regular expressions also support the + and ? quantifiers, but they need to be preceded by a backslash (\+ and \?). This backslash form is a GNU extension rather than part of POSIX. Unlike in extended regular expressions, + and ? by themselves are literal characters in basic regular expressions.
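The difference between the two standards is easiest to see side by side. The following is a small sketch with an invented word list; the \+ form in the last command assumes GNU grep, since it is an extension to POSIX basic regular expressions.

```shell
export LC_ALL=C

# Hypothetical word list for the demonstration.
words=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'a+b' > "$words"

# ERE: ? makes the preceding atom optional (zero or one "u").
ere_optional=$(grep -E '^colou?r$' "$words")

# ERE: + requires one or more occurrences of the preceding atom.
ere_plus=$(grep -E '^colou+r$' "$words")

# BRE: an unescaped + is an ordinary character, so this
# finds the literal text "a+b".
bre_literal_plus=$(grep 'a+b' "$words")

# BRE with the backslash form (assumes GNU grep): same meaning as ERE +.
bre_plus=$(grep '^colou\+r$' "$words")

printf '%s\n' "$ere_optional" "$ere_plus" "$bre_literal_plus" "$bre_plus"
rm -f "$words"
```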
Special characters:
- . (dot): Atom matches any character.
- ^ (caret): Atom matches the beginning of a line.
- $ (dollar sign): Atom matches the end of a line.
Extended Regular Expressions
Extended regular expressions (EREs) allow more complex patterns. For example, a vertical bar symbol (|) allows you to specify two possible words or character sets to match. You can also employ parentheses to group subexpressions. The egrep and grep -E commands discussed below both interpret their patterns as EREs.
Using grep
One of the most common uses of grep is to facilitate the inspection of long files, using the regular expression as a filter applied to each line, for example to show only the lines starting with a certain term. Its support for regular expressions is what makes grep so effective at filtering text files.
Syntax:
grep [OPTION] PATTERN [FILE…]
Commonly Used Options with grep Command
Short | Long | Description |
-c | --count | Display a count of text file records that contain a PATTERN match. |
-d action | --directories=action | When a file is a directory, if action is set to read, read the directory as if it were a regular text file; if action is set to skip, ignore the directory; and if action is set to recurse, act as if the -R, -r, or --recursive option was used. |
-E | --extended-regexp | Designate the PATTERN as an extended regular expression. |
-i | --ignore-case | Ignore the case in the PATTERN as well as in any text file records. |
-R, -r | --recursive | Search a directory’s contents, and for any subdirectory within the original directory tree, consecutively search its contents as well (recursively). |
-v | --invert-match | Display only text file records that do not contain a PATTERN match. |
Using a simple grep command to search a file. No options are used, and the grep utility searches for the word frank (PATTERN) within /etc/passwd (FILE):
$ grep frank /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
In the above output, the grep command returns each file record (line) that contains an instance of the PATTERN, which in this case was the word frank.
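Several of the options from the table can be combined in one command. The sketch below uses an invented log file and directory (all names here are hypothetical) to show -c, -i, -v and -r in action.

```shell
export LC_ALL=C

# Hypothetical log file for the demonstration.
log=$(mktemp)
printf '%s\n' 'ERROR disk full' 'info started' 'Error timeout' 'warning slow' > "$log"

# -i ignores case, -c prints the number of matching lines
# instead of the lines themselves.
match_count=$(grep -c -i 'error' "$log")

# -v keeps only the lines that do not match.
non_errors=$(grep -v -i 'error' "$log")

# -r descends into a directory tree; each hit is prefixed with its path.
logdir=$(mktemp -d)
mkdir "$logdir/sub"
printf '%s\n' 'error in nested file' > "$logdir/sub/app.log"
recursive_hits=$(grep -r 'error' "$logdir")

printf '%s\n' "$match_count" "$non_errors" "$recursive_hits"
rm -rf "$log" "$logdir"
```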
egrep (Extended-regexp)
The egrep command is the same as grep -E. It interprets PATTERNS as extended regular expressions, which allow more complex patterns. Recent GNU grep releases mark egrep as obsolescent and recommend grep -E instead.
Using the grep -E command with an ERE pattern:
$ grep -E "^root|^pilot" /etc/passwd
root:x:0:0:root:/root:/bin/bash
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
In the above output, the grep command uses the -E option to indicate that the pattern is an extended regular expression. Without the -E option, grep would treat the vertical bar as a literal character and the pattern would not match these records. Quotation marks around the ERE pattern protect it from misinterpretation by the shell. The command searches for any password file records that start with either the word root or the word pilot. Thus, a caret (^) is placed prior to each word, and a vertical bar (|) separates the words to indicate that the record can start with either word.
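The effect of the -E option on the vertical bar can be checked with a short experiment. This sketch uses three invented passwd-style records; the last command also shows an equivalent pattern that groups the alternatives in parentheses.

```shell
export LC_ALL=C

# Hypothetical passwd-style records for the demonstration.
accounts=$(mktemp)
cat > "$accounts" <<'EOF'
root:x:0:0:root:/root:/bin/bash
frank:x:1000:1000::/home/frank:/usr/bin/zsh
pilot:x:1001:1002::/home/pilot:/bin/bash
EOF

# ERE: the bar separates two alternatives, each with its own anchor.
with_e=$(grep -E '^root|^pilot' "$accounts")

# BRE: the bar is an ordinary character, so nothing matches.
# "|| true" only hides grep's non-zero exit status for "no match".
without_e=$(grep '^root|^pilot' "$accounts" || true)

# ERE with grouping: one anchor applied to both alternatives.
grouped=$(grep -E '^(root|pilot):' "$accounts")

printf 'with -E:\n%s\nwithout -E:\n%s\n' "$with_e" "$without_e"
rm -f "$accounts"
```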
Using the egrep command with an ERE pattern:
$ egrep "(sshd|s).*sh" /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
mysql:x:995:1001::/home/mysql:/bin/sh
In the above output, the egrep command is employed, which is equivalent to using the grep -E command. The ERE pattern here also uses quotation marks to avoid misinterpretation and employs parentheses to group a subexpression. The subexpression consists of a choice, indicated by the vertical bar (|), between the word sshd and the letter s. Also in the ERE pattern, the .* symbols indicate that there can be anything between the subexpression choice and the string sh in the text file record.
fgrep (fixed-strings)
The fgrep command is the same as grep -F. It interprets PATTERNS as fixed strings, not regular expressions. It is often combined with the -f option to search for patterns stored in a text file.
Using the fgrep and grep -F commands to search for patterns stored in a text file. We have a file users.txt; let's look at its contents with the cat command:
$ cat users.txt
pilot
frank
sshd
Let's use the fgrep command to search the /etc/passwd file for the patterns stored in users.txt:
$ fgrep -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
In the above output, the patterns are stored in the users.txt file, which is first displayed using the cat command. Next, the fgrep command is employed, along with the -f option to indicate the file that holds the patterns. The /etc/passwd file is searched for all the patterns stored within the users.txt file, and the results are displayed.
Let's use the grep -F command to search the /etc/passwd file for the same patterns:
$ grep -F -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
Notice that the grep -F command is equivalent to the fgrep command, which is why the two commands produce identical results.
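What "fixed strings" means in practice shows up when the search text contains regular expression metacharacters. The sketch below, built on invented sample lines, compares a normal grep with grep -F.

```shell
export LC_ALL=C

# Hypothetical sample lines containing regex metacharacters.
data=$(mktemp)
printf '%s\n' 'price: 1.50' 'price: 1x50' 'a.*b' 'aXYZb' > "$data"

# As a regular expression, the dot matches any character,
# so both price lines are selected.
regex_hits=$(grep '1.50' "$data")

# With -F the dot is just a dot.
fixed_hits=$(grep -F '1.50' "$data")

# With -F the sequence .* has no special meaning either.
fixed_star=$(grep -F 'a.*b' "$data")

printf 'regex:\n%s\nfixed:\n%s\n%s\n' "$regex_hits" "$fixed_hits" "$fixed_star"
rm -f "$data"
```

This makes grep -F a reasonable choice when the search terms come from user input or data files and should never be interpreted as patterns.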
Searching with Regular Expressions
The immediate benefit offered by regular expressions is to improve searches on filesystems and in text documents. The -regex option of the find command tests every path in a directory hierarchy against a regular expression. The expression has to match the whole path, not just a part of it.
Using -regex with the find command:
$ find $HOME -regex '.*/\..*' -size +100M
/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-fedora28/3.2.10/virtualbox/generic-fedora28-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/ubuntu-VAGRANTSLASH-focal64/20210302.0.0/virtualbox/ubuntu-focal-20.04-cloudimg.vmdk
/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-centos8/3.2.10/virtualbox/generic-centos8-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/gusztavvargadr-VAGRANTSLASH-docker-linux/2010.0.2012/virtualbox/gusztavvargadr-u1604s-dc-2012.0.0-1608130612-disk001.vmdk
/home/frank/.vagrant.d/boxes/batrusi-VAGRANTSLASH-suse_minimal/0.0.1/virtualbox/box-disk001.vmdk
The above command searches for files larger than 100 megabytes (100 units of 1048576 bytes), but only in paths inside the user's home directory that match .*/\..*, that is, paths containing a /. preceded and followed by any number of characters. In other words, only hidden files or files inside hidden directories are listed, regardless of the position of /. in the corresponding path.
For case-insensitive regular expressions, the -iregex option is used instead:
$ find /usr/share/fonts -regextype posix-extended -iregex '.*(dejavu|liberation).*sans.*(italic|oblique).*'
/usr/share/fonts/truetype/liberation2/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation2/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-BoldItalic.ttf
In the above example, the regular expression contains branches (written in extended style) to list only specific font files under the /usr/share/fonts directory hierarchy. Extended regular expressions are not enabled by default, but find allows them to be enabled with -regextype posix-extended or -regextype egrep. The default RE standard for find is findutils-default, which is virtually a basic regular expression clone.
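Both find examples can be reproduced on a throwaway directory tree. This is a sketch under a few assumptions: the file names are invented, the -regextype option requires GNU find, and the search starts from . inside the scratch directory so that the temporary directory's own path cannot influence the match.

```shell
export LC_ALL=C

# Hypothetical directory tree for the demonstration.
tree=$(mktemp -d)
mkdir -p "$tree/.cache" "$tree/docs"
touch "$tree/.cache/data.bin" "$tree/.profile" \
      "$tree/docs/notes.txt" "$tree/docs/Report.TXT" "$tree/docs/README.md"

# Hidden files, or files inside hidden directories: the path must
# contain "/." somewhere, and the regex must match the whole path.
hidden=$(cd "$tree" && find . -type f -regex '.*/\..*' | sort)

# Case-insensitive ERE with alternation (GNU find assumed).
# -regextype has to appear before the -iregex test it affects.
text_files=$(cd "$tree" && find . -type f -regextype posix-extended \
  -iregex '.*\.(txt|md)' | sort)

printf 'hidden:\n%s\ntext:\n%s\n' "$hidden" "$text_files"
rm -rf "$tree"
```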
Using Sed (stream editor)
There are times when you will want to edit text without having to pull out a full-fledged text editor. A stream editor modifies text that is passed to it via a file or the output of a pipeline. This editor uses special commands to make text changes as the text "streams" through the editor utility.
The sed editor changes data based on commands either entered on the command line or stored in a text file. The process the editor goes through is as follows:
- Reads one text line at a time from the input stream
- Matches that text with the supplied editor commands
- Modifies the text as specified in the commands
- Displays the modified text
After the sed editor matches all the specified commands against a text line, it reads the next text line and repeats the editing process. Once sed reaches the end of the text lines, it stops.
Syntax:
sed [OPTIONS] [SCRIPT]… [FILENAME]
Using sed to modify/substitute file text
You can modify text stored in a file using the sed command. We have a file named AboutLinux.txt. Let's check its contents with the cat command:
$ cat AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
Using sed to modify file text:
$ sed 's/Linux/Unix/' AboutLinux.txt
Linus Torvalds developed Unix OS
Unix OS made everything simple
Unix OS has many Distros
We love Unix OS
We love Technology
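The s (substitute) command replaces the first match of Linux on each line with Unix. The stream editor only displays the modified text to STDOUT, and the file itself is left unchanged. You could save the modified text to another file name via a STDOUT redirection operator, if desired.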
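Two details of the substitute command are worth trying out: without a flag it changes only the first match on each line, and its result can be redirected to a new file. The sketch below uses an invented two-line file and hypothetical variable names.

```shell
export LC_ALL=C

# Hypothetical input with two matches on the first line.
src=$(mktemp)
out=$(mktemp)
printf '%s\n' 'Linux OS runs Linux tools' 'We love Technology' > "$src"

# Without a flag, only the first match on each line is replaced.
first_only=$(sed 's/Linux/Unix/' "$src")

# The g flag replaces every match on the line.
all_matches=$(sed 's/Linux/Unix/g' "$src")

# Redirect STDOUT to keep the result; the source file is untouched.
sed 's/Linux/Unix/g' "$src" > "$out"
saved=$(cat "$out")
original=$(cat "$src")

printf '%s\n---\n%s\n' "$first_only" "$all_matches"
rm -f "$src" "$out"
```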
Using sed to delete file text
You can also delete lines using the stream editor. Use the syntax '/PATTERN/d' for the sed command's SCRIPT to accomplish it.
Using sed to delete file text:
$ sed '/Torvalds/d' AboutLinux.txt
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
Notice that the AboutLinux.txt file line that contains the word Torvalds is not displayed to STDOUT. It was "deleted" in the output, but it still exists within the text file.
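The d command accepts a line number as well as a pattern in front of it. As a short sketch, the following recreates the AboutLinux.txt content in a temporary file (the temporary path is an assumption of this example) and deletes lines both ways.

```shell
export LC_ALL=C

# Recreate the AboutLinux.txt content from the listing above.
about=$(mktemp)
cat > "$about" <<'EOF'
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
EOF

# A line number as the address: drop the first line only.
without_first=$(sed '1d' "$about")

# A regular expression as the address: drop every line
# that starts with "We love".
without_love=$(sed '/^We love/d' "$about")

# The file on disk still has all five lines.
lines_on_disk=$(wc -l < "$about" | tr -d ' ')

printf '%s\n---\n%s\n' "$without_first" "$without_love"
rm -f "$about"
```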
Using sed to change an entire file line
You can also change an entire line of text. To accomplish this, you use the syntax 'ADDRESScNEWTEXT' for the sed command's SCRIPT. The ADDRESS refers to the file's line number, and the NEWTEXT is the different text line you want displayed.
Using sed to change an entire file line:
$ sed '5cLinux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more' AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
Linux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more
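The change command can also be written with a backslash and a newline after the c, and its address can be a pattern instead of a line number. The sketch below assumes the same five-line file, recreated in a temporary location, and uses a made-up replacement text.

```shell
export LC_ALL=C

# Recreate the AboutLinux.txt content from the listing above.
about=$(mktemp)
cat > "$about" <<'EOF'
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
EOF

# Line-number address: c, a backslash, then the new text on the next line.
changed=$(sed '5c\
We love open source' "$about")

# Pattern address: change every line that matches the expression.
changed_by_pattern=$(sed '/Technology/c\
We love open source' "$about")

printf '%s\n' "$changed"
rm -f "$about"
```

Because only line 5 contains the word Technology, both commands produce the same output here.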
Conclusion
That's all on searching text files using regular expressions with grep, egrep, fgrep, sed and find. Stay tuned for more LPIC – 101 guides.