Word Count Lab
Introduction
The purpose of this lab is to gain experience programming in Ruby. Unlike other assignments in this class, it is OK to work on the code for this lab in teams or groups.
Part 1 - Simple Counts
For this part of the lab, you will write a simple Ruby script that will calculate word counts for a plain text file. The goal for this lab is to write clean, concise, and readable code for text processing; Ruby is an excellent language for this task.
- Bring up a terminal window (on Mac OS X, you can do this quickly by pressing "⌘-Space" and entering "Terminal"). Create a folder somewhere and navigate there (example terminal command: "mkdir wclab && cd wclab").
- Create a simple text file called test.txt with some test words in it. You may use any editor you wish, but it must be capable of editing plain text files.
I recommend GNU nano if you've never used a terminal-based text editor before. It is quicker than trying to set up major IDEs like Eclipse and easier to learn than other terminal-based editors like Vim or Emacs. You can use nano to edit plain text files as well as Ruby scripts. To launch nano, run "nano <file>" where <file> is the name of the file you'd like to edit. The bottom of the screen should contain a list of various editing commands.
Here is an example of some text that you could use for testing:
    test text here is a test more text
- Create a Ruby script called word_count.rb in the same folder. In that file, write a method called count_words that takes a single filename as a parameter and returns a Ruby Hash where the keys are words from the file and the values are the counts of those words in the file. Here is some pseudocode for that method:
    create new hash table
    for each line in the file:
        parse the line into words
        for each word:
            increment the corresponding count in the hash table
        end
    end
    return hash table
Hints:
- Useful documentation pages: Array | Hash | String
- You may wish to choose a default value for the Hash object. You can do this by passing the default value as a parameter to the Hash constructor.
- Recall that you can iterate over all lines in a file using the following syntax:
    File.foreach(filename) do |line|
      # process line
    end
- Recall that you can split a Ruby string on spaces (or any other pattern) using the split method.
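The hints above can be tried out in isolation before they go into count_words. The following is only a small sketch: the variable names (tally, line_lengths) and the sample strings are made up for illustration, and a Tempfile stands in for a real input file so that the snippet runs on its own.

```ruby
require "tempfile"

# Hash.new(0) returns 0 for any missing key, so a count can be
# incremented without first checking whether the key exists.
tally = Hash.new(0)

# With no arguments, split breaks a string on runs of whitespace.
"b a b".split.each { |letter| tally[letter] += 1 }

# File.foreach yields one line at a time (trailing newline included).
# The temporary file here is a stand-in for a real text file.
line_lengths = []
Tempfile.create("demo") do |f|
  f.write("one two\nthree\n")
  f.flush
  File.foreach(f.path) { |line| line_lengths << line.split.length }
end

p tally
p line_lengths
```

Note that this counts letters and words per line, not words per file; combining the pieces into count_words is left to you.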
- Write some code outside the method at the bottom of the file that calls the method with the filename of the file you created earlier and stores the result in a variable. Then iterate over the word-count pairs in the resulting Hash and print them out. Here are sample results:
    test: 2
    text: 2
    here: 1
    is: 1
    a: 1
    more: 1
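Iterating over the word-count pairs of a Hash might look like the sketch below. The counts are hard-coded here (an assumption, so that the example runs without an input file); in your script they would come from count_words instead.

```ruby
# Hypothetical counts, hard-coded so the example needs no input file.
counts = { "test" => 2, "text" => 2, "here" => 1 }

# Iterating over a Hash yields each key together with its value.
lines = counts.map { |word, count| "#{word}: #{count}" }

# puts prints each element of an array on its own line.
puts lines
```

A Ruby Hash preserves insertion order, so the pairs come out in the order in which the words were first stored.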
Part 2 - Nicer Counts
Now that the basic word counting routine is working, switch to working with a larger input text. Download a larger text from somewhere on the web. For example, you could download a copy of the U.S. Constitution with the following command:
wget http://www.usconstitution.net/const.txt
If you run your old code on this, the results likely will not be terribly illuminating because it will generate a lot of output without any organization (and probably with some incorrectly-detected "words"). You should make several improvements to your script:
- Improve your parsing routine by using a regular expression for the splitting. The regular expression should match various punctuation symbols (commas, colons, etc.) as well as whitespace characters as potential word breaks.
- Make the counting case-insensitive by converting words to lower case before storing them in the Hash.
- Filter the results by only printing words that occur more than five times in the source document.
- Improve the relevance of the results by sorting them so that the most frequently-seen words are at the top (or bottom) of the list. You may wish to look into the sort_by method, which Hash provides through the Enumerable module.
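The four improvements above can be sketched on a short in-memory string. Everything specific here is an assumption made for illustration: the sample sentence, the particular punctuation characters in the regular expression, and the min_count threshold of 1 (the lab asks for words that occur more than five times, which would filter out everything in a text this short).

```ruby
sample = "We the People, of the United States: the people."

# Lower-case first, then split on runs of whitespace or punctuation.
# The character class is an assumption; real texts may need more symbols.
words = sample.downcase.split(/[\s,.:;!?"()]+/).reject(&:empty?)

counts = Hash.new(0)
words.each { |word| counts[word] += 1 }

# Keep only words seen more than min_count times (illustrative threshold).
min_count = 1
frequent = counts.select { |_word, n| n > min_count }

# sort_by returns an array of [word, count] pairs, smallest count first.
sorted = frequent.sort_by { |_word, n| n }

sorted.each { |word, n| puts "#{word}: #{n}" }
```

Because sort_by returns an array of pairs and not a Hash, the result can be reversed or iterated like any other array.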
Here is an excerpt from some example results after improvements have been made:
    ...
    congress: 60
    have: 64
    as: 64
    any: 79
    state: 79
    united: 85
    for: 85
    a: 97
    by: 101
    president: 109
    states: 129
    in: 147
    ...
Challenge (Optional)
How many lines of code is your solution? It is possible to fit the final count_words method into fewer than 15 lines of Ruby code. See how concise you can make yours. Remember, it must still be readable!
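As one idea for shortening the counting step, Ruby 2.7 and newer provide Enumerable#tally, which builds a Hash of counts from any collection. The sketch below assumes such a Ruby version is available; the words array is made up for illustration.

```ruby
# Hypothetical word list standing in for the parsed words of a file.
words = %w[a b a c b a]

# tally (Ruby 2.7+) counts how often each element occurs.
counts = words.tally

# Negating the count sorts the most frequent words first.
by_count = counts.sort_by { |_word, n| -n }

p by_count
```

Whether this is more readable than an explicit loop is a judgment call; the manual version with a default Hash value works on older Ruby versions as well.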
Submission
Submit your completed word_count.rb script on Canvas. Please include a comment at the top with your name. If you worked in a group with others, make sure that everyone's name is listed in the file and that everyone gets a copy to submit on Canvas.