16.5. File and Archiving Commands

Archiving

tar

The standard UNIX archiving utility. [1] Originally a Tape ARchiving program, it has developed into a general purpose package that can handle all manner of archiving with all types of destination devices, ranging from tape drives to regular files to even stdout (see Example 3-4). GNU tar has been patched to accept various compression filters, for example: tar czvf archive_name.tar.gz *, which recursively archives and gzips all files in a directory tree except dotfiles in the current working directory ($PWD). [2]

Some useful tar options (a few short usage examples follow the list):

  1. -c create (a new archive)

  2. -x extract (files from existing archive)

  3. --delete delete (files from existing archive)

    Caution

    This option will not work on magnetic tape devices.

  4. -r append (files to existing archive)

  5. -A append (tar files to existing archive)

  6. -t list (contents of existing archive)

  7. -u update archive

  8. -d compare archive with specified filesystem

  9. --after-date only process files with a date stamp after specified date

  10. -z gzip the archive

    (compress or uncompress, depending on whether it is combined with the -c or the -x option)

  11. -j bzip2 the archive
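
A few representative invocations (the archive and directory names here are hypothetical):

   tar -cvf backup.tar project/            # Create archive "backup.tar" from a directory tree.
   tar -tvf backup.tar                     # List the contents of the archive.
   tar -xvf backup.tar -C /tmp/restore     # Extract into an (existing) target directory.
   tar -czvf backup.tar.gz project/        # Archive and gzip in a single step (-z).
   tar -xjvf backup.tar.bz2                # Extract a bzip2-compressed archive (-j).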

Caution

It may be difficult to recover data from a corrupted gzipped tar archive. When archiving important files, make multiple backups.

shar

Shell archiving utility. The text and/or binary files in a shell archive are concatenated without compression, and the resultant archive is essentially a shell script, complete with #!/bin/sh header, containing all the necessary unarchiving commands, as well as the files themselves. Unprintable binary characters in the target file(s) are converted to printable ASCII characters in the output shar file. Shar archives still show up in Usenet newsgroups, but otherwise shar has been replaced by tar/gzip. The unshar command unpacks shar archives.

The mailshar command is a Bash script that uses shar to concatenate multiple files into a single one for e-mailing. This script supports compression and uuencoding.
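
A minimal round trip, using hypothetical filenames:

   shar file1.txt file2.txt > archive.shar   # Bundle the files into a shell archive.
   sh archive.shar                           # Unpack by executing the archive as a script . . .
   unshar archive.shar                       # . . . or let 'unshar' do the work.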

ar

Creation and manipulation utility for archives, mainly used for binary object file libraries.
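
For example, bundling object files into a static library (hypothetical names):

   ar rcs libmystuff.a foo.o bar.o   # r = insert/replace members, c = create, s = write an index.
   ar t libmystuff.a                 # List the members of the archive.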

rpm

The Red Hat Package Manager, or rpm utility provides a wrapper for source or binary archives. It includes commands for installing and checking the integrity of packages, among other things.

A simple rpm -i package_name.rpm usually suffices to install a package, though there are many more options available.

Tip

rpm -qf identifies which package a file originates from.

 bash$ rpm -qf /bin/ls
 coreutils-5.2.1-31
 	      

Tip

rpm -qa gives a complete list of all installed rpm packages on a given system. An rpm -qa package_name lists only the package(s) corresponding to package_name.

 bash$ rpm -qa
 redhat-logos-1.1.3-1
 glibc-2.2.4-13
 cracklib-2.7-12
 dosfstools-2.7-1
 gdbm-1.8.0-10
 ksymoops-2.4.1-1
 mktemp-1.5-11
 perl-5.6.0-17
 reiserfs-utils-3.x.0j-2
 ...
 
 
 bash$ rpm -qa docbook-utils
 docbook-utils-0.6.9-2
 
 
 bash$ rpm -qa docbook | grep docbook
 docbook-dtd31-sgml-1.0-10
 docbook-style-dsssl-1.64-3
 docbook-dtd30-sgml-1.0-10
 docbook-dtd40-sgml-1.0-11
 docbook-utils-pdf-0.6.9-2
 docbook-dtd41-sgml-1.0-10
 docbook-utils-0.6.9-2
 	      

cpio

This specialized archiving copy command (copy input and output) is rarely seen any more, having been supplanted by tar/gzip. It still has its uses, such as moving a directory tree. With an appropriate block size (for copying) specified, it can be appreciably faster than tar.


Example 16-30. Using cpio to move a directory tree

   1 #!/bin/bash
   2 
   3 # Copying a directory tree using cpio.
   4 
   5 # Advantages of using 'cpio':
   6 #   Speed of copying. It's faster than 'tar' with pipes.
   7 #   Well suited for copying special files (named pipes, etc.)
   8 #+  that 'cp' may choke on.
   9 
  10 ARGS=2
  11 E_BADARGS=65
  12 
  13 if [ $# -ne "$ARGS" ]
  14 then
  15   echo "Usage: `basename $0` source destination"
  16   exit $E_BADARGS
  17 fi  
  18 
  19 source="$1"
  20 destination="$2"
  21 
  22 ###################################################################
  23 find "$source" -depth | cpio -admvp "$destination"
  24 #               ^^^^^         ^^^^^
  25 #  Read the 'find' and 'cpio' info pages to decipher these options.
  26 #  The above works only relative to $PWD (current directory) . . .
  27 #+ unless full pathnames are specified.
  28 ###################################################################
  29 
  30 
  31 # Exercise:
  32 # --------
  33 
  34 #  Add code to check the exit status ($?) of the 'find | cpio' pipe
  35 #+ and output appropriate error messages if anything went wrong.
  36 
  37 exit $?

rpm2cpio

This command extracts a cpio archive from an rpm one.


Example 16-31. Unpacking an rpm archive

   1 #!/bin/bash
   2 # de-rpm.sh: Unpack an 'rpm' archive
   3 
   4 : ${1?"Usage: `basename $0` target-file"}
   5 # Must specify 'rpm' archive name as an argument.
   6 
   7 
   8 TEMPFILE=$$.cpio                         #  Tempfile with "unique" name.
   9                                          #  $$ is process ID of script.
  10 
  11 rpm2cpio < $1 > $TEMPFILE                #  Converts rpm archive into
  12                                          #+ cpio archive.
  13 cpio --make-directories -F $TEMPFILE -i  #  Unpacks cpio archive.
  14 rm -f $TEMPFILE                          #  Deletes cpio archive.
  15 
  16 exit 0
  17 
  18 #  Exercise:
  19 #  Add check for whether 1) "target-file" exists and
  20 #+                       2) it is an rpm archive.
  21 #  Hint:                    Parse output of 'file' command.

pax

The pax portable archive exchange toolkit facilitates periodic file backups and is designed to be cross-compatible between various flavors of UNIX. It was designed to replace tar and cpio.

   1 pax -wf daily_backup.pax ~/linux-server/files 
   2 #  Creates a tar archive of all files in the target directory.
   3 #  Note that the options to pax must be in the correct order --
   4 #+ pax -fw     has an entirely different effect.
   5 
   6 pax -f daily_backup.pax
   7 #  Lists the files in the archive.
   8 
   9 pax -rf daily_backup.pax ~/bsd-server/files
  10 #  Restores the backed-up files from the Linux machine
  11 #+ onto a BSD one.

Note that pax handles many of the standard archiving and compression commands.

Compression

gzip

The standard GNU/UNIX compression utility, replacing the inferior and proprietary compress. The corresponding decompression command is gunzip, which is the equivalent of gzip -d.

Note

The -c option sends the output of gzip to stdout. This is useful when piping to other commands.

The zcat filter decompresses a gzipped file to stdout, as possible input to a pipe or redirection. This is, in effect, a cat command that works on compressed files (including files processed with the older compress utility). The zcat command is equivalent to gzip -dc.

Caution

On some commercial UNIX systems, zcat is a synonym for uncompress -c, and will not work on gzipped files.

See also Example 7-7.
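
A few typical uses of gzip -c and zcat, with hypothetical filenames:

   gzip -c logfile > logfile.gz                    # Compressed copy; original "logfile" is left in place.
   tar cf - project/ | gzip -c > project.tar.gz    # Gzip a tar stream coming from a pipe.
   zcat logfile.gz | grep "error"                  # Search the compressed file without unpacking it.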

bzip2

An alternate compression utility, usually more efficient (but slower) than gzip, especially on large files. The corresponding decompression command is bunzip2.

Similar to the zcat command, bzcat decompresses a bzip2-compressed file to stdout.

Note

Newer versions of tar have been patched with bzip2 support.
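
With such a tar, archiving and bzip2 compression collapse into a single step (hypothetical names):

   tar cjvf archive.tar.bz2 project/   # Create a bzip2-compressed tarball.
   tar xjvf archive.tar.bz2            # Unpack it again.
   bzcat archive.tar.bz2 | tar tvf -   # Listing via bzcat, for a tar lacking the -j option.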

compress, uncompress

This is an older, proprietary compression utility found in commercial UNIX distributions. The more efficient gzip has largely replaced it. Linux distributions generally include a compress workalike for compatibility, although gunzip can unarchive files treated with compress.

Tip

The znew command transforms compressed files into gzipped ones.

sq

Yet another compression (squeeze) utility, a filter that works only on sorted ASCII word lists. It uses the standard invocation syntax for a filter, sq < input-file > output-file. Fast, but not nearly as efficient as gzip. The corresponding uncompression filter is unsq, invoked like sq.

Tip

The output of sq may be piped to gzip for further compression.

zip, unzip

Cross-platform file archiving and compression utility compatible with DOS pkzip.exe. "Zipped" archives seem to be a more common medium of file exchange on the Internet than "tarballs."
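
Typical usage, assuming a directory named project:

   zip -r project.zip project/      # Recursively zip the directory.
   unzip -l project.zip             # List the contents of the archive.
   unzip project.zip -d /tmp/work   # Extract into a target directory.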

unarc, unarj, unrar

These Linux utilities permit unpacking archives compressed with the DOS arc.exe, arj.exe, and rar.exe programs.

lzma, unlzma, lzcat

Highly efficient Lempel-Ziv-Markov compression. The syntax of lzma is similar to that of gzip. The 7-zip Website has more information.

xz, unxz, xzcat

A new high-efficiency compression tool, backward compatible with lzma, and with an invocation syntax similar to gzip. For more information, see the Wikipedia entry.
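
Invocation mirrors that of gzip (the filename is hypothetical):

   xz -9 bigfile              # Compresses "bigfile" to "bigfile.xz" (maximum compression).
   xzcat bigfile.xz | wc -l   # Read the compressed file in a pipe.
   unxz bigfile.xz            # Decompress, restoring "bigfile".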

File Information

file

A utility for identifying file types. The command file file-name will return a file specification for file-name, such as ascii text or data. It references the magic numbers found in /usr/share/magic, /etc/magic, or /usr/lib/magic, depending on the Linux/UNIX distribution.

The -f option causes file to run in batch mode, to read from a designated file a list of filenames to analyze. The -z option, when used on a compressed target file, forces an attempt to analyze the uncompressed file type.

 bash$ file test.tar.gz
 test.tar.gz: gzip compressed data, deflated,
 last modified: Sun Sep 16 13:34:51 2001, os: Unix
 
 bash$ file -z test.tar.gz
 test.tar.gz: GNU tar archive (gzip compressed data, deflated,
 last modified: Sun Sep 16 13:34:51 2001, os: Unix)
 	      

   1 # Find sh and Bash scripts in a given directory:
   2 
   3 DIRECTORY=/usr/local/bin
   4 KEYWORD=Bourne
   5 # Bourne and Bourne-Again shell scripts
   6 
   7 file $DIRECTORY/* | fgrep $KEYWORD
   8 
   9 # Output:
  10 
  11 # /usr/local/bin/burn-cd:          Bourne-Again shell script text executable
  12 # /usr/local/bin/burnit:           Bourne-Again shell script text executable
  13 # /usr/local/bin/cassette.sh:      Bourne shell script text executable
  14 # /usr/local/bin/copy-cd:          Bourne-Again shell script text executable
  15 # . . .


Example 16-32. Stripping comments from C program files

   1 #!/bin/bash
   2 # strip-comment.sh: Strips out the comments (/* COMMENT */) in a C program.
   3 
   4 E_NOARGS=0
   5 E_ARGERROR=66
   6 E_WRONG_FILE_TYPE=67
   7 
   8 if [ $# -eq "$E_NOARGS" ]
   9 then
  10   echo "Usage: `basename $0` C-program-file" >&2 # Error message to stderr.
  11   exit $E_ARGERROR
  12 fi  
  13 
  14 # Test for correct file type.
  15 type=`file $1 | awk '{ print $2, $3, $4, $5 }'`
  16 # "file $1" echoes file type . . .
  17 # Then awk removes the first field, the filename . . .
  18 # Then the result is fed into the variable "type."
  19 correct_type="ASCII C program text"
  20 
  21 if [ "$type" != "$correct_type" ]
  22 then
  23   echo
  24   echo "This script works on C program files only."
  25   echo
  26   exit $E_WRONG_FILE_TYPE
  27 fi  
  28 
  29 
  30 # Rather cryptic sed script:
  31 #--------
  32 sed '
  33 /^\/\*/d
  34 /.*\*\//d
  35 ' $1
  36 #--------
  37 # Easy to understand if you take several hours to learn sed fundamentals.
  38 
  39 
  40 #  Need to add one more line to the sed script to deal with
  41 #+ case where line of code has a comment following it on same line.
  42 #  This is left as a non-trivial exercise.
  43 
  44 #  Also, the above code deletes non-comment lines with a "*/" . . .
  45 #+ not a desirable result.
  46 
  47 exit 0
  48 
  49 
  50 # ----------------------------------------------------------------
  51 # Code below this line will not execute because of 'exit 0' above.
  52 
  53 # Stephane Chazelas suggests the following alternative:
  54 
  55 usage() {
  56   echo "Usage: `basename $0` C-program-file" >&2
  57   exit 1
  58 }
  59 
  60 WEIRD=`echo -n -e '\377'`   # or WEIRD=$'\377'
  61 [[ $# -eq 1 ]] || usage
  62 case `file "$1"` in
  63   *"C program text"*) sed -e "s%/\*%${WEIRD}%g;s%\*/%${WEIRD}%g" "$1" \
  64      | tr '\377\n' '\n\377' \
  65      | sed -ne 'p;n' \
  66      | tr -d '\n' | tr '\377' '\n';;
  67   *) usage;;
  68 esac
  69 
  70 #  This is still fooled by things like:
  71 #  printf("/*");
  72 #  or
  73 #  /*  /* buggy embedded comment */
  74 #
  75 #  To handle all special cases (comments in strings, comments in string
  76 #+ where there is a \", \\" ...),
  77 #+ the only way is to write a C parser (using lex or yacc perhaps?).
  78 
  79 exit 0

which

which command gives the full path to "command." This is useful for finding out whether a particular command or utility is installed on the system.

 bash$ which rm
 /usr/bin/rm

For an interesting use of this command, see Example 36-16.

whereis

Similar to which, above, whereis command gives the full path to "command," but also to its manpage.

 bash$ whereis rm
 rm: /bin/rm /usr/share/man/man1/rm.1.bz2

whatis

whatis command looks up "command" in the whatis database. This is useful for identifying system commands and important configuration files. Consider it a simplified man command.

 bash$ whatis whatis
 whatis               (1)  - search the whatis database for complete words


Example 16-33. Exploring /usr/X11R6/bin

   1 #!/bin/bash
   2 
   3 # What are all those mysterious binaries in /usr/X11R6/bin?
   4 
   5 DIRECTORY="/usr/X11R6/bin"
   6 # Try also "/bin", "/usr/bin", "/usr/local/bin", etc.
   7 
   8 for file in $DIRECTORY/*
   9 do
  10   whatis `basename $file`   # Echoes info about the binary.
  11 done
  12 
  13 exit 0
  14 
  15 #  Note: For this to work, you must create a "whatis" database
  16 #+ with /usr/sbin/makewhatis.
  17 #  You may wish to redirect output of this script, like so:
  18 #    ./what.sh >>whatis.db
  19 #  or view it a page at a time on stdout,
  20 #    ./what.sh | less

See also Example 11-3.

vdir

Show a detailed directory listing. The effect is similar to ls -lb.

This is one of the GNU fileutils.

 bash$ vdir
 total 10
 -rw-r--r--    1 bozo  bozo      4034 Jul 18 22:04 data1.xrolo
 -rw-r--r--    1 bozo  bozo      4602 May 25 13:58 data1.xrolo.bak
 -rw-r--r--    1 bozo  bozo       877 Dec 17  2000 employment.xrolo
 
 bash$ ls -l
 total 10
 -rw-r--r--    1 bozo  bozo      4034 Jul 18 22:04 data1.xrolo
 -rw-r--r--    1 bozo  bozo      4602 May 25 13:58 data1.xrolo.bak
 -rw-r--r--    1 bozo  bozo       877 Dec 17  2000 employment.xrolo
 	      

locate, slocate

The locate command searches for files using a database stored for just that purpose. The slocate command is the secure version of locate (which may be aliased to slocate).

 bash$ locate hickson
 /usr/lib/xephem/catalogs/hickson.edb

getfacl, setfacl

These commands retrieve or set the file access control list -- the owner, group, and file permissions.

 bash$ getfacl *
 # file: test1.txt
 # owner: bozo
 # group: bozgrp
 user::rw-
 group::rw-
 other::r--

 # file: test2.txt
 # owner: bozo
 # group: bozgrp
 user::rw-
 group::rw-
 other::r--
  
 
  
 bash$ setfacl -m u:bozo:rw yearly_budget.csv
 bash$ getfacl yearly_budget.csv
 # file: yearly_budget.csv
 # owner: accountant
 # group: budgetgrp
 user::rw-
 user:bozo:rw-
 user:accountant:rw-
 group::rw-
 mask::rw-
 other::r--
 	      

readlink

Disclose the file that a symbolic link points to.

 bash$ readlink /usr/bin/awk
 ../../bin/gawk
 	      
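
In a script, this is handy for finding the script's own directory, even when the script is invoked through a symbolic link. A minimal sketch, using the GNU -f option to fully canonicalize the path:

   #!/bin/bash
   # where-am-i.sh: Report the script's real location.

   target=$(readlink -f "$0")        # Canonical, symlink-free pathname (GNU readlink).
   script_dir=$(dirname "$target")
   echo "This script actually lives in $script_dir."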

strings

Use the strings command to find printable strings in a binary or data file. It will list sequences of printable characters found in the target file. This might be handy for a quick 'n dirty examination of a core dump or for looking at an unknown graphic image file (strings image-file | more might show something like JFIF, which would identify the file as a jpeg graphic). In a script, you would probably parse the output of strings with grep or sed. See Example 11-8 and Example 11-10.


Example 16-34. An "improved" strings command

   1 #!/bin/bash
   2 # wstrings.sh: "word-strings" (enhanced "strings" command)
   3 #
   4 #  This script filters the output of "strings" by checking it
   5 #+ against a standard word list file.
   6 #  This effectively eliminates gibberish and noise,
   7 #+ and outputs only recognized words.
   8 
   9 # ===========================================================
  10 #                 Standard Check for Script Argument(s)
  11 ARGS=1
  12 E_BADARGS=85
  13 E_NOFILE=86
  14 
  15 if [ $# -ne $ARGS ]
  16 then
  17   echo "Usage: `basename $0` filename"
  18   exit $E_BADARGS
  19 fi
  20 
  21 if [ ! -f "$1" ]                      # Check if file exists.
  22 then
  23     echo "File \"$1\" does not exist."
  24     exit $E_NOFILE
  25 fi
  26 # ===========================================================
  27 
  28 
  29 MINSTRLEN=3                           #  Minimum string length.
  30 WORDFILE=/usr/share/dict/linux.words  #  Dictionary file.
  31 #  May specify a different word list file
  32 #+ of one-word-per-line format.
  33 #  For example, the "yawl" word-list package,
  34 #  http://bash.deta.in/yawl-0.3.2.tar.gz
  35 
  36 
  37 wlist=`strings "$1" | tr A-Z a-z | tr '[:space:]' Z | \
  38        tr -cs '[:alpha:]' Z | tr -s '\173-\377' Z | tr Z ' '`
  39 
  40 # Translate output of 'strings' command with multiple passes of 'tr'.
  41 #  "tr A-Z a-z"  converts to lowercase.
  42 #  "tr '[:space:]'"  converts whitespace characters to Z's.
  43 #  "tr -cs '[:alpha:]' Z"  converts non-alphabetic characters to Z's,
  44 #+ and squeezes multiple consecutive Z's.
  45 #  "tr -s '\173-\377' Z"  converts all characters past 'z' to Z's
  46 #+ and squeezes multiple consecutive Z's,
  47 #+ which gets rid of all the weird characters that the previous
  48 #+ translation failed to deal with.
  49 #  Finally, "tr Z ' '" converts all those Z's to whitespace,
  50 #+ which will be seen as word separators in the loop below.
  51 
  52 #  ***********************************************************************
  53 #  Note the technique of feeding/piping the output of 'tr' back to itself,
  54 #+ but with different arguments and/or options on each successive pass.
  55 #  ***********************************************************************
  56 
  57 
  58 for word in $wlist                    #  Important:
  59                                       #  $wlist must not be quoted here.
  60                                       # "$wlist" does not work.
  61                                       #  Why not?
  62 do
  63   strlen=${#word}                     #  String length.
  64   if [ "$strlen" -lt "$MINSTRLEN" ]   #  Skip over short strings.
  65   then
  66     continue
  67   fi
  68 
  69   grep -Fw $word "$WORDFILE"          #   Match whole words only.
  70 #      ^^^                            #  "Fixed strings" and
  71                                       #+ "whole words" options. 
  72 done  
  73 
  74 exit $?

Comparison

diff, patch

diff: flexible file comparison utility. It compares the target files line-by-line sequentially. In some applications, such as comparing word dictionaries, it may be helpful to filter the files through sort and uniq before piping them to diff. diff file-1 file-2 outputs the lines in the files that differ, with carets showing which file each particular line belongs to.

The --side-by-side option to diff outputs each compared file, line by line, in separate columns, with non-matching lines marked. The -c and -u options likewise make the output of the command easier to interpret.

Various fancy frontends to diff are available, such as sdiff, wdiff, xdiff, and mgdiff.

Tip

The diff command returns an exit status of 0 if the compared files are identical, and 1 if they differ (or 2 when binary files are being compared). This permits use of diff in a test construct within a shell script (see below).
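
For example, a quick identical-or-not test (the two filenames are placeholders):

   if diff "$file1" "$file2" > /dev/null 2>&1
   then
     echo "Files are identical."
   else
     echo "Files differ (or one of them is unreadable)."
   fi

   # More tersely, with the -q (quiet) option:
   diff -q "$file1" "$file2" && echo "Identical." || echo "Different."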

A common use for diff is generating difference files to be used with patch. The -e option outputs files suitable for ed or ex scripts.

patch: flexible versioning utility. Given a difference file generated by diff, patch can upgrade a previous version of a package to a newer version. It is much more convenient to distribute a relatively small "diff" file than the entire body of a newly revised package. Kernel "patches" have become the preferred method of distributing the frequent releases of the Linux kernel.

   1 patch -p1 <patch-file
   2 # Takes all the changes listed in 'patch-file'
   3 # and applies them to the files referenced therein.
   4 # This upgrades to a newer version of the package.

Patching the kernel:

   1 cd /usr/src
   2 gzip -cd patchXX.gz | patch -p0
   3 # Upgrading kernel source using 'patch'.
   4 # From the Linux kernel docs "README",
   5 # by anonymous author (Alan Cox?).

Note

The diff command can also recursively compare directories (for the filenames present).

 bash$ diff -r ~/notes1 ~/notes2
 Only in /home/bozo/notes1: file02
 Only in /home/bozo/notes1: file03
 Only in /home/bozo/notes2: file04
 	      

Tip

Use zdiff to compare gzipped files.

Tip

Use diffstat to create a histogram (point-distribution graph) of output from diff.

diff3, merge

An extended version of diff that compares three files at a time. This command returns an exit value of 0 upon successful execution, but unfortunately this gives no information about the results of the comparison.

 bash$ diff3 file-1 file-2 file-3
 ====
 1:1c
   This is line 1 of "file-1".
 2:1c
   This is line 1 of "file-2".
 3:1c
   This is line 1 of "file-3"
 	      

The merge (3-way file merge) command is an interesting adjunct to diff3. Its syntax is merge Mergefile file1 file2. The result is to output to Mergefile the changes that lead from file1 to file2. Consider this command a stripped-down version of patch.

sdiff

Compare and/or edit two files in order to merge them into an output file. Because of its interactive nature, this command would find little use in a script.

cmp

The cmp command is a simpler version of diff, above. Whereas diff reports the differences between two files, cmp merely shows at what point they differ.

Note

Like diff, cmp returns an exit status of 0 if the compared files are identical, and 1 if they differ. This permits use in a test construct within a shell script.


Example 16-35. Using cmp to compare two files within a script.

   1 #!/bin/bash
   2 # file-comparison.sh
   3 
   4 ARGS=2  # Two args to script expected.
   5 E_BADARGS=85
   6 E_UNREADABLE=86
   7 
   8 if [ $# -ne "$ARGS" ]
   9 then
  10   echo "Usage: `basename $0` file1 file2"
  11   exit $E_BADARGS
  12 fi
  13 
  14 if [[ ! -r "$1" || ! -r "$2" ]]
  15 then
  16   echo "Both files to be compared must exist and be readable."
  17   exit $E_UNREADABLE
  18 fi
  19 
  20 cmp $1 $2 &> /dev/null
  21 #   Redirection to /dev/null buries the output of the "cmp" command.
  22 #   cmp -s $1 $2  has same result ("-s" silent flag to "cmp")
  23 #   Thank you  Anders Gustavsson for pointing this out.
  24 #
  25 #  Also works with 'diff', i.e.,
  26 #+ diff $1 $2 &> /dev/null
  27 
  28 if [ $? -eq 0 ]         # Test exit status of "cmp" command.
  29 then
  30   echo "File \"$1\" is identical to file \"$2\"."
  31 else  
  32   echo "File \"$1\" differs from file \"$2\"."
  33 fi
  34 
  35 exit 0

Tip

Use zcmp on gzipped files.

comm

Versatile file comparison utility. The files must be sorted for this to be useful.

comm -options first-file second-file

comm file-1 file-2 outputs three columns:

  • column 1 = lines unique to file-1

  • column 2 = lines unique to file-2

  • column 3 = lines common to both.

The options allow suppressing output of one or more columns.

  • -1 suppresses column 1

  • -2 suppresses column 2

  • -3 suppresses column 3

  • -12 suppresses both columns 1 and 2, etc.

This command is useful for comparing "dictionaries" or word lists -- sorted text files with one word per line.
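
For instance, to compare two hypothetical word lists:

   sort list1 > list1.sorted
   sort list2 > list2.sorted

   comm -12 list1.sorted list2.sorted   # Words present in both lists.
   comm -23 list1.sorted list2.sorted   # Words found only in list1.
   comm -13 list1.sorted list2.sorted   # Words found only in list2.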

Utilities

basename

Strips the path information from a file name, printing only the file name. The construction basename $0 lets the script know its name, that is, the name it was invoked by. This can be used for "usage" messages if, for example, a script is called with missing arguments:
   1 echo "Usage: `basename $0` arg1 arg2 ... argn"

dirname

Strips the basename from a filename, printing only the path information.

Note

basename and dirname can operate on any arbitrary string. The argument does not need to refer to an existing file, or even be a filename for that matter (see Example A-7).


Example 16-36. basename and dirname

   1 #!/bin/bash
   2 
   3 address=/home/bozo/daily-journal.txt
   4 
   5 echo "Basename of /home/bozo/daily-journal.txt = `basename $address`"
   6 echo "Dirname of /home/bozo/daily-journal.txt = `dirname $address`"
   7 echo
   8 echo "My own home is `basename ~/`."         # `basename ~` also works.
   9 echo "The home of my home is `dirname ~/`."  # `dirname ~`  also works.
  10 
  11 exit 0

split, csplit

These are utilities for splitting a file into smaller chunks. Their usual use is for splitting up large files in order to back them up on floppies or preparatory to e-mailing or uploading them.

The csplit command splits a file according to context, the split occurring where patterns are matched.
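
A short sketch of both, with hypothetical filenames (the csplit repeat count {*} is a GNU extension):

   split -b 1m bigfile.bin chunk.   # Split into 1 MB pieces: chunk.aa, chunk.ab, ...
   cat chunk.* > bigfile.copy       # Reassemble the pieces.

   csplit server.log '/^BEGIN SESSION/' '{*}'
   #  Splits the log at every line matching the pattern,
   #+ producing xx00, xx01, xx02 . . .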


Example 16-37. A script that copies itself in sections

   1 #!/bin/bash
   2 # splitcopy.sh
   3 
   4 #  A script that splits itself into chunks,
   5 #+ then reassembles the chunks into an exact copy
   6 #+ of the original script.
   7 
   8 CHUNKSIZE=4    #  Size of first chunk of split files.
   9 OUTPREFIX=xx   #  csplit prefixes, by default,
  10                #+ files with "xx" ...
  11 
  12 csplit "$0" "$CHUNKSIZE"
  13 
  14 # Some comment lines for padding . . .
  15 # Line 15
  16 # Line 16
  17 # Line 17
  18 # Line 18
  19 # Line 19
  20 # Line 20
  21 
  22 cat "$OUTPREFIX"* > "$0.copy"  # Concatenate the chunks.
  23 rm "$OUTPREFIX"*               # Get rid of the chunks.
  24 
  25 exit $?

Encoding and Encryption

sum, cksum, md5sum, sha1sum

These are utilities for generating checksums. A checksum is a number [3] mathematically calculated from the contents of a file, for the purpose of checking its integrity. A script might refer to a list of checksums for security purposes, such as ensuring that the contents of key system files have not been altered or corrupted. For security applications, use the md5sum (message digest 5 checksum) command, or better yet, the newer sha1sum (Secure Hash Algorithm). [4]

 bash$ cksum /boot/vmlinuz
 1670054224 804083 /boot/vmlinuz
 
 bash$ echo -n "Top Secret" | cksum
 3391003827 10
 
 
 
 bash$ md5sum /boot/vmlinuz
 0f43eccea8f09e0a0b2b5cf1dcf333ba  /boot/vmlinuz
 
 bash$ echo -n "Top Secret" | md5sum
 8babc97a6f62a4649716f4df8d61728f  -
 	      

Note

The cksum command shows the size, in bytes, of its target, whether file or stdin.

The md5sum and sha1sum commands display a dash when they receive their input from stdin.


Example 16-38. Checking file integrity

   1 #!/bin/bash
   2 # file-integrity.sh: Checking whether files in a given directory
   3 #                    have been tampered with.
   4 
   5 E_DIR_NOMATCH=80
   6 E_BAD_DBFILE=81
   7 
   8 dbfile=File_record.md5
   9 # Filename for storing records (database file).
  10 
  11 
  12 set_up_database ()
  13 {
  14   echo ""$directory"" > "$dbfile"
  15   # Write directory name to first line of file.
  16   md5sum "$directory"/* >> "$dbfile"
  17   # Append md5 checksums and filenames.
  18 }
  19 
  20 check_database ()
  21 {
  22   local n=0
  23   local filename
  24   local checksum
  25 
  26   # ------------------------------------------- #
  27   #  This file check should be unnecessary,
  28   #+ but better safe than sorry.
  29 
  30   if [ ! -r "$dbfile" ]
  31   then
  32     echo "Unable to read checksum database file!"
  33     exit $E_BAD_DBFILE
  34   fi
  35   # ------------------------------------------- #
  36 
  37   while read record[n]
  38   do
  39 
  40     directory_checked="${record[0]}"
  41     if [ "$directory_checked" != "$directory" ]
  42     then
  43       echo "Directories do not match up!"
  44       # Tried to use file for a different directory.
  45       exit $E_DIR_NOMATCH
  46     fi
  47 
  48     if [ "$n" -gt 0 ]   # Not directory name.
  49     then
  50       filename[n]=$( echo ${record[$n]} | awk '{ print $2 }' )
  51       #  md5sum writes records backwards,
  52       #+ checksum first, then filename.
  53       checksum[n]=$( md5sum "${filename[n]}" )
  54 
  55 
  56       if [ "${record[n]}" = "${checksum[n]}" ]
  57       then
  58         echo "${filename[n]} unchanged."
  59 
  60         elif [ "`basename ${filename[n]}`" != "$dbfile" ]
  61                #  Skip over checksum database file,
  62                #+ as it will change with each invocation of script.
  63                #  ---
  64                #  This unfortunately means that when running
  65                #+ this script on $PWD, tampering with the
  66                #+ checksum database file will not be detected.
  67                #  Exercise: Fix this.
  68         then
  69           echo "${filename[n]} : CHECKSUM ERROR!"
  70         # File has been changed since last checked.
  71         fi
  72 
  73       fi
  74 
  75 
  76 
  77     let "n+=1"
  78   done <"$dbfile"       # Read from checksum database file. 
  79 
  80 }  
  81 
  82 # =================================================== #
  83 # main ()
  84 
  85 if [ -z  "$1" ]
  86 then
  87   directory="$PWD"      #  If not specified,
  88 else                    #+ use current working directory.
  89   directory="$1"
  90 fi  
  91 
  92 clear                   # Clear screen.
  93 echo " Running file integrity check on $directory"
  94 echo
  95 
  96 # ------------------------------------------------------------------ #
  97   if [ ! -r "$dbfile" ] # Need to create database file?
  98   then
  99     echo "Setting up database file, \""$directory"/"$dbfile"\"."; echo
 100     set_up_database
 101   fi  
 102 # ------------------------------------------------------------------ #
 103 
 104 check_database          # Do the actual work.
 105 
 106 echo 
 107 
 108 #  You may wish to redirect the stdout of this script to a file,
 109 #+ especially if the directory checked has many files in it.
 110 
 111 exit 0
 112 
 113 #  For a much more thorough file integrity check,
 114 #+ consider the "Tripwire" package,
 115 #+ http://sourceforge.net/projects/tripwire/.

Also see Example A-19, Example 36-16, and Example 10-2 for creative uses of the md5sum command.

Note

There have been reports that the 128-bit md5sum can be cracked, so the more secure 160-bit sha1sum is a welcome new addition to the checksum toolkit.

 bash$ md5sum testfile
 e181e2c8720c60522c4c4c981108e367  testfile
 
 
 bash$ sha1sum testfile
 5d7425a9c08a66c3177f1e31286fa40986ffc996  testfile
 	      

Security consultants have demonstrated that even sha1sum can be compromised. Fortunately, newer Linux distros include longer bit-length sha224sum, sha256sum, sha384sum, and sha512sum commands.

uuencode

This utility encodes binary files (images, sound files, compressed files, etc.) into ASCII characters, making them suitable for transmission in the body of an e-mail message or in a newsgroup posting. This is especially useful where MIME (multimedia) encoding is not available.
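
For example (hypothetical filenames and address; the second argument to uuencode is the name the file will have when decoded):

   uuencode fractal.png fractal.png > fractal.uu   # Encode to a text file . . .
   uuencode fractal.png fractal.png | mail -s "Image" jdoe@example.com
   #  . . . or pipe the encoded file straight into an e-mail message.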

uudecode

This reverses the encoding, decoding uuencoded files back into the original binaries.


Example 16-39. Uudecoding encoded files

   1 #!/bin/bash
   2 # Uudecodes all uuencoded files in current working directory.
   3 
   4 lines=35        # Allow 35 lines for the header (very generous).
   5 
   6 for File in *   # Test all the files in $PWD.
   7 do
   8   search1=`head -n $lines $File | grep begin | wc -w`
   9   search2=`tail -n $lines $File | grep end | wc -w`
  10   #  Uuencoded files have a "begin" near the beginning,
  11   #+ and an "end" near the end.
  12   if [ "$search1" -gt 0 ]
  13   then
  14     if [ "$search2" -gt 0 ]
  15     then
  16       echo "uudecoding - $File -"
  17       uudecode $File
  18     fi  
  19   fi
  20 done  
  21 
  22 #  Note that running this script upon itself fools it
  23 #+ into thinking it is a uuencoded file,
  24 #+ because it contains both "begin" and "end".
  25 
  26 #  Exercise:
  27 #  --------
  28 #  Modify this script to check each file for a newsgroup header,
  29 #+ and skip to next if not found.
  30 
  31 exit 0

Tip

The fold -s command may be useful (possibly in a pipe) to process long uudecoded text messages downloaded from Usenet newsgroups.

mimencode, mmencode

The mimencode and mmencode commands process multimedia-encoded e-mail attachments. Although mail user agents (such as pine or kmail) normally handle this automatically, these particular utilities permit manipulating such attachments manually from the command-line or in batch processing mode by means of a shell script.

crypt

At one time, this was the standard UNIX file encryption utility. [5] Politically-motivated government regulations prohibiting the export of encryption software resulted in the disappearance of crypt from much of the UNIX world, and it is still missing from most Linux distributions. Fortunately, programmers have come up with a number of decent alternatives to it, among them the author's very own cruft (see Example A-4).

openssl

This is an Open Source implementation of Secure Sockets Layer encryption.
   1 # To encrypt a file:
   2 openssl aes-128-ecb -salt -in file.txt -out file.encrypted \
   3 -pass pass:my_password
   4 #          ^^^^^^^^^^^   User-selected password.
   5 #       aes-128-ecb      is the encryption method chosen.
   6 
   7 # To decrypt an openssl-encrypted file:
   8 openssl aes-128-ecb -d -salt -in file.encrypted -out file.txt \
   9 -pass pass:my_password
  10 #          ^^^^^^^^^^^   User-selected password.

Piping openssl to/from tar makes it possible to encrypt an entire directory tree.
   1 # To encrypt a directory:
   2 
   3 sourcedir="/home/bozo/testfiles"
   4 encrfile="encr-dir.tar.gz"
   5 password=my_secret_password
   6 
   7 tar czvf - "$sourcedir" |
   8 openssl des3 -salt -out "$encrfile" -pass pass:"$password"
   9 #       ^^^^   Uses des3 encryption.
  10 # Writes encrypted file "encr-dir.tar.gz" in current working directory.
  11 
  12 # To decrypt the resulting tarball:
  13 openssl des3 -d -salt -in "$encrfile" -pass pass:"$password" |
  14 tar -xzv
  15 # Decrypts and unpacks into current working directory.

Of course, openssl has many other uses, such as obtaining signed certificates for Web sites. See the info page.

shred

Securely erase a file by overwriting it multiple times with random bit patterns before deleting it. This command has the same effect as Example 16-61, but does it in a more thorough and elegant manner.

This is one of the GNU fileutils.
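
For example (the filename is hypothetical):

   shred -v -n 3 -z -u secret-notes.txt
   #  -n 3  overwrite the file 3 times,
   #  -z    add a final pass of zeroes to mask the shredding,
   #  -u    then truncate and remove it.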

Caution

Advanced forensic technology may still be able to recover the contents of a file, even after application of shred.

Miscellaneous

mktemp

Create a temporary file [6] with a "unique" filename. When invoked from the command-line without additional arguments, it creates a zero-length file in the /tmp directory.

 bash$ mktemp
 /tmp/tmp.zzsvql3154
 	      

   1 PREFIX=filename
   2 tempfile=`mktemp $PREFIX.XXXXXX`
   3 #                        ^^^^^^ Need at least 6 placeholders
   4 #+                              in the filename template.
   5 #   If no filename template supplied,
   6 #+ "tmp.XXXXXXXXXX" is the default.
   7 
   8 echo "tempfile name = $tempfile"
   9 # tempfile name = filename.QA2ZpY
  10 #                 or something similar...
  11 
  12 #  Creates a file of that name in the current working directory
  13 #+ with 600 file permissions.
  14 #  A "umask 177" is therefore unnecessary,
  15 #+ but it's good programming practice nevertheless.

make

Utility for building and compiling binary packages. This can also be used for any set of operations triggered by incremental changes in source files.

The make command checks a Makefile, a list of file dependencies and operations to be carried out.

The make utility is, in effect, a powerful scripting language similar in many ways to Bash, but with the capability of recognizing dependencies. For in-depth coverage of this useful tool set, see the GNU software documentation site.

install

Special purpose file copying command, similar to cp, but capable of setting permissions and attributes of the copied files. This command seems tailor-made for installing software packages, and as such it shows up frequently in Makefiles (in the make install section). It could likewise prove useful in installation scripts.
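
A typical fragment from an installation script (paths and names are hypothetical):

   install -d /usr/local/lib/myutils              # Create the target directory.
   install -m 755 myscript.sh /usr/local/bin/myscript
   #        ^^^^^ Sets the permissions in the same step as the copy.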

dos2unix

This utility, written by Benjamin Lin and collaborators, converts DOS-formatted text files (lines terminated by CR-LF) to UNIX format (lines terminated by LF only), and vice-versa.

ptx

The ptx [targetfile] command outputs a permuted index (cross-reference list) of the targetfile. This may be further filtered and formatted in a pipe, if necessary.

more, less

Pagers that display a text file or stream to stdout, one screenful at a time. These may be used to filter the output of stdout . . . or of a script.

An interesting application of more is to "test drive" a command sequence, to forestall potentially unpleasant consequences.
   1 ls /home/bozo | awk '{print "rm -rf " $1}' | more
   2 #                                            ^^^^
   3 		 
   4 # Testing the effect of the following (disastrous) command-line:
   5 #      ls /home/bozo | awk '{print "rm -rf " $1}' | sh
   6 #      Hand off to the shell to execute . . .       ^^

The less pager has the interesting property of doing a formatted display of man page source. See Example A-39.

Notes

[1]

An archive, in the sense discussed here, is simply a set of related files stored in a single location.

[2]

A tar czvf ArchiveName.tar.gz * will include dotfiles in subdirectories below the current working directory. This is an undocumented GNU tar "feature."

[3]

The checksum may be expressed as a hexadecimal number, or to some other base.

[4]

For even better security, use the sha256sum, sha512sum, and sha1pass commands.

[5]

This is a symmetric block cipher, used to encrypt files on a single system or local network, as opposed to the public key cipher class, of which pgp is a well-known example.

[6]

Creates a temporary directory when invoked with the -d option.