Word Count Lab

Introduction

The purpose of this lab is to gain experience programming in Ruby. Unlike other assignments in this class, it is OK to work on the code for this lab in groups.

Part 1 - Simple Counts

For this part of the lab, you will write a simple Ruby script that will calculate word counts for a plain text file. The goal for this lab is to write clean, concise, and readable code for text processing; Ruby is an excellent language for this task.

  1. Bring up a terminal window (on Mac OS X, you can do this quickly by pressing "⌘-Space" and entering "Terminal"). Create a folder somewhere and navigate there (example terminal command: "mkdir wclab && cd wclab").
  2. Create a simple text file called test.txt with some test words in it. You may use any editor you wish, but it must be capable of editing plain text files.

    I recommend GNU nano if you've never used a terminal-based text editor before. It is quicker than trying to set up major IDEs like Eclipse and easier to learn than other terminal-based editors like Vim or Emacs. You can use nano to edit plain text files as well as Ruby scripts. To launch nano, run "nano <file>" where <file> is the name of the file you'd like to edit. The bottom of the screen should contain a list of various editing commands.

    Here is an example of some text that you could use for testing:
            test text
            here is a test
    
            more text 
  3. Create a Ruby script called word_count.rb in the same folder. In that file, write a method called count_words that takes a single filename as a parameter and returns a Ruby Hash where the keys are words from the file and the values are the counts of those words in the file. Here is some pseudocode for that method:
    
            create new hash table
            for each line in the file:
              parse the line into words
              for each word:
                increment the corresponding count in the hash table
              end
            end
            return hash table


    Hints:
    • Useful documentation pages:   Array | Hash | String
    • You may wish to choose a default value for the Hash object, so that looking up a word you have not seen yet returns that value instead of nil. You can do this by passing the default value as an argument to the Hash constructor (Hash.new).
    • Recall that you can iterate over all lines in a file using the following syntax:
                  File.foreach(filename) do |line|
                    # process line
                  end 
    • Recall that you can split a Ruby string on spaces (or any other pattern) using the split method.
  4. At the bottom of the file, outside the method, write some code that calls count_words with the name of the file you created earlier and stores the result in a variable. Then iterate over the word-count pairs in the resulting Hash and print them out. Here are sample results:
            test: 2
            text: 2
            here: 1
            is: 1
            a: 1
            more: 1 
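
The Hash and split hints from step 3, together with the printing loop from step 4, can be sketched on a small in-memory example. This is only an illustration under stated assumptions: the method name tally_words and the variable sample_lines are made up for this sketch, and it works on an array of strings rather than reading a file, so the File.foreach part is still left to you.

```ruby
# Hypothetical helper for illustration: tallies words in an array of
# strings (not in a file, which is what count_words must do).
def tally_words(lines)
  counts = Hash.new(0)            # unseen words read as 0 instead of nil
  lines.each do |line|
    # split with no argument breaks the line on runs of whitespace
    line.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Made-up sample data mirroring the test text from step 2.
sample_lines = ["test text", "here is a test", "", "more text"]

# Iterating over a Hash yields each key together with its value.
tally_words(sample_lines).each do |word, count|
  puts "#{word}: #{count}"
end
```

Without the default value of 0, the first `counts[word] += 1` for a new word would fail, because nil has no + method.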

Part 2 - Nicer Counts

Now that the basic word counting routine is working, switch to working with a larger input text. Download a larger text from somewhere on the web. For example, you could download a copy of the U.S. Constitution with the following command:

    wget http://www.usconstitution.net/const.txt

If you run your old code on this file, the results will likely not be very illuminating: the script will generate a lot of unorganized output, probably including some incorrectly detected "words". You should make several improvements to your script:

  1. Improve your parsing routine by using a regular expression for the splitting. The regular expression should match various punctuation symbols (commas, colons, etc.) as well as whitespace characters as potential word breaks.
  2. Make the counting case-insensitive by converting words to lower case before storing them in the Hash.
  3. Filter the results by only printing words that occur more than five times in the source document.
  4. Improve the relevance of the results by sorting them so that the most frequently seen words are at the top (or bottom) of the list. You may wish to look into the sort_by method, which Hash inherits from the Enumerable module.

Here is an excerpt from some example results after improvements have been made:

    ...
    congress: 60
    have: 64
    as: 64
    any: 79
    state: 79
    united: 85
    for: 85
    a: 97
    by: 101
    president: 109
    states: 129
    in: 147
    ...
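
The four improvements listed above can be sketched on a short string. This is a minimal illustration, not a reference solution: the method name frequent_words, its min_count parameter, the sample sentence, and the particular punctuation characters in the regular expression are all assumptions made for this sketch, and you will probably need to adjust the pattern for a real document.

```ruby
# Hypothetical helper for illustration; works on a string, not a file.
def frequent_words(text, min_count)
  counts = Hash.new(0)
  # Treat any run of whitespace or listed punctuation as a word break.
  # The character class here is an assumption, not a complete list.
  text.split(/[\s,.:;!?"()]+/).each do |word|
    next if word.empty?            # a leading delimiter yields ""
    counts[word.downcase] += 1     # case-insensitive counting
  end
  # Keep words seen more than min_count times, least frequent first.
  counts.select { |_word, count| count > min_count }
        .sort_by { |_word, count| count }
end

# Made-up sample text; the lab itself asks for a threshold of five.
sample = "The States, the People; the states: THE Union. People!"
frequent_words(sample, 1).each do |word, count|
  puts "#{word}: #{count}"
end
```

Note that sort_by called on a Hash returns an Array of [word, count] pairs rather than a Hash, which is fine for printing. Words with equal counts may appear in either order.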

Challenge (Optional)

How many lines of code is your solution? It is possible to fit the final count_words method into fewer than 15 lines of Ruby code. See how concise you can make yours. Remember, it must still be readable!

Submission

Submit your completed word_count.rb script on Canvas. Please include a comment at the top with your name. If you worked in a group with others, make sure that everyone's name is listed in the file and that everyone gets a copy to submit on Canvas.