How To Parse Text Files Line by Line in Unix scripts

I'm finally back from my holidays and thrilled to be sharing the next of my Unix tips with you!

Today I'd like to talk about parsing text files in Unix shell scripts. This is one of the most popular areas of scripting, and there are a few typical limitations that everyone comes across.

Reading text files in Unix shell

If we agree that "reading a text file" means going through all the lines of a plain text file in order to process the data somehow, then the cat command is the simplest demonstration of such a procedure:

redhat$ cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

As you can see, there's only one line in the /etc/redhat-release file, and we see what this line is.

But if for whatever reason you wanted to read this file from a script and assign the whole release information line to a shell variable, looping over the cat output would not work as expected:

bash-3.1$ for i in `cat /etc/redhat-release`; do echo $i; done;
Red
Hat
Enterprise
Linux
Client
release
5
(Tikanga)

Instead of reading a line of text from the file, our one-liner splits the line and prints every word on a separate line. This happens because of how the shell expands the command: the output of an unquoted command substitution is split into words on whitespace (spaces, tabs and newlines by default, as set in the IFS variable). The for loop therefore receives a list of words rather than a list of lines, and returns them one by one.
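
To see the splitting in isolation, here is a minimal sketch, assuming a POSIX shell. The temporary file and the function names count_unquoted and count_quoted are made up for this illustration; only the file contents mimic the release line above.

```shell
#!/bin/sh
# count_unquoted FILE: count the items a for loop sees when the
# command substitution is left unquoted (split on spaces, tabs, newlines).
count_unquoted() {
    n=0
    for item in $(cat "$1"); do
        n=$((n + 1))
    done
    echo "$n"
}

# count_quoted FILE: same loop, but the substitution is quoted,
# so no splitting happens and the whole file arrives as one item.
count_quoted() {
    n=0
    for item in "$(cat "$1")"; do
        n=$((n + 1))
    done
    echo "$n"
}

# Demo on a throwaway two-line file (hypothetical contents).
demo=$(mktemp)
printf 'Red Hat Enterprise Linux\nClient release 5 (Tikanga)\n' > "$demo"
echo "unquoted: $(count_unquoted "$demo") items"
echo "quoted:   $(count_quoted "$demo") items"
rm -f "$demo"
```

Neither variant yields one item per line: the unquoted loop sees every word, and the quoted loop sees the whole file as a single item.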

How to read text files line by line

Here's what I decided: if I can't make Unix shell ignore the spaces between words of each line of text, I'll disguise these spaces. Since my solution was getting pretty bulky for a one-liner, I've made it into a script. Here it is:

bash-3.1$ cat /tmp/cat.sh
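#!/bin/sh
# Print a file line by line by hiding spaces from the shell's word splitting.
FILE=$1
# Placeholder for spaces; it must not occur anywhere in the input file.
UNIQUE='-={GR}=-'
#
if [ -z "$FILE" ]; then
        echo "Usage: $0 FILE" >&2
        exit 1
fi
#
# Replace every space with the placeholder so each line becomes a single word.
# Note: tabs and wildcard characters in the input are still subject to shell expansion.
for LINE in `sed "s/ /$UNIQUE/g" "$FILE"`; do
        # Restore the spaces before printing the line.
        LINE=`echo "$LINE" | sed "s/$UNIQUE/ /g"`
        echo "$LINE"
done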

As you can see, I've introduced the idea of a UNIQUE variable: a combination of characters that I can use to replace spaces in the original string. This combination must not occur anywhere in your text files, because later we turn the string back into its original version by replacing all instances of the $UNIQUE text with plain spaces.

Since most of my needs involved relatively small text files, this rather expensive approach (it starts a separate sed process for every line) proved to be quite usable and fast enough.
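
For comparison, the splitting characters themselves can be changed. The sketch below is a different technique from my placeholder script: it assumes a POSIX shell, sets IFS to a newline only, and turns off filename expansion so that the for loop receives whole lines. The function name print_lines_ifs is hypothetical. One caveat: empty lines are dropped, because consecutive newlines count as a single separator.

```shell
#!/bin/sh
# print_lines_ifs FILE: print every non-empty line of FILE unchanged.
# The body runs in a subshell ( ... ), so the IFS and set -f changes
# do not leak into the calling script.
print_lines_ifs() (
    IFS='
'             # split on newlines only, not on spaces or tabs
    set -f    # do not expand *, ? or [ found in the data
    for line in $(cat "$1"); do
        printf '%s\n' "$line"
    done
)
```

It would be called the same way as my script, for example print_lines_ifs /etc/redhat-release.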

Update: please see the comments to this post for a much better approach to the same problem. Thanks again, Nails!

Here's how my script would work on the already known /etc/redhat-release file:

bash-3.1$ /tmp/cat.sh /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

Exactly what I wanted! Hopefully this little trick will save some of your time as well. Let me know if you like it or know an even better one yourself!

Related books

If you want to learn more, here's a great book:


Classic Shell Scripting

Comments

  • n00b (http://n00bsys0p.wordpress.com)

    Just what I was looking for! Thanks!

  • Gleb Reys

    I wish I thought about this a few years earlier – so many scripts of mine could be much better!

  • Nails Carmody

    I don't mean to be rude or condescending, but you are trying to solve a problem that doesn't exist. While your UNIQUE variable idea is clever, why don't you just use a while loop:

    while read LINE
    do
    echo "$LINE"
    done < /etc/redhat-release

    Also, your `cat $1` is often referred to as a UUOC (Useless Use of Cat). I found this link to be very instructional:

    http://partmaps.org/era/unix/award.html

    Sorry I had to disagree with you.

    Regards,

    Nails

  • Gleb Reys

    Nails, thanks for finding the courage to speak up! I'm glad you recognize the thinking pattern I've followed (trying to cat a file and expecting a line at a time)!

    I'm glad you brought the `cat $1` part up too, not only should it be `cat $FILE` in my particular example, but I have never heard about the UUOC, so look forward to reading a whole page about it.

    THANK YOU and please comment on anything else in the future – like I said somewhere, I'm not trying to look all-knowing, and I welcome any opportunity to learn and improve myself.

  • zenith191

    One problem with Nails' solution is that it removes leading whitespace. This causes loss of indentation.

  • ming

    Why not use read line? For example:
    cat aFile | while read line; do echo "$line"; done

  • Heba

    Many thanks, ming, it worked for me.

  • P

    I am facing a problem with while read line on content that has a lot of '\' characters, because the loop drops this character. For example, my file has:
    ad\bc\gf\ufg, xyz
    aw\ne\hw\iue\dkiue, sde
    which is read by the while read line loop as:
    adbcgfufg, xyz
    awnehwiuedkiue, sde

    Does anyone have any idea how to overcome this?
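
Both problems reported in the comments above, the lost indentation and the disappearing backslashes, come from the defaults of read: it trims leading and trailing whitespace listed in IFS, and it treats a backslash as an escape character. Here is a minimal sketch, assuming a POSIX shell, that compares the plain loop with one using an empty IFS and the -r option. The function names read_plain and read_raw and the sample file are made up for this example.

```shell
#!/bin/sh
# read_plain FILE: the loop from the comments; read trims surrounding
# whitespace and removes backslashes that escape the next character.
read_plain() {
    while read line; do
        printf '%s\n' "$line"
    done < "$1"
}

# read_raw FILE: IFS= keeps leading and trailing whitespace,
# -r makes read treat backslashes as ordinary characters.
read_raw() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done < "$1"
}

# Demo on a throwaway file with an indented line and backslashes.
demo=$(mktemp)
printf '%s\n' '    indented' 'ad\bc\gf\ufg, xyz' > "$demo"
echo "plain read:"
read_plain "$demo"
echo "IFS= read -r:"
read_raw "$demo"
rm -f "$demo"
```

printf '%s\n' is used instead of echo because the echo built into some shells interprets backslash sequences such as \n on its own, which would mangle the data again on output.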