Modules and Dictionaries Lab

Objectives

The goal of this lab is to get comfortable working with basic features of the Python programming language, in particular modules and dictionaries.

Introduction

In this lab, you will begin working with modules and dictionaries in Python. Here are links to the pertinent sections of the Python documentation:

This lab will also prepare you to complete the first programming assignment (PA1).

Quick Reference

Modules

    # my_lib.py
    class MyObj():
        pass

    def my_func():
        pass

    # my_main.py
    import my_lib
    a = my_lib.MyObj()
    my_lib.my_func()

Dictionaries

    creating:     d = dict()
    adding:       d[key] = value
    updating:     d[key] = new_value
    retrieving:   d[key]
    checking:     key in d
                  key not in d
    get all keys: d.keys()
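
The dictionary operations listed above can be tried out in a few lines. The following is a minimal sketch; the dictionary name `counts` and the words stored in it are made up for illustration and do not come from the lab files.

```python
# Hypothetical example data: word -> number of occurrences.
counts = dict()                # creating (equivalent to counts = {})

counts["jmu"] = 1              # adding a new key
counts["jmu"] = 2              # updating the value stored under an existing key
counts["campus"] = 5           # adding a second key

print(counts["jmu"])           # retrieving the value for a key
print("campus" in counts)      # checking whether a key is present
print("alumni" not in counts)  # checking whether a key is absent

# keys() returns a view of the keys; sorted() builds an ordered list from it.
print(sorted(counts.keys()))
```

Retrieving a key that is not in the dictionary with `d[key]` raises a `KeyError`, which is why the `key in d` check is usually done first.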


Exercises

  1. Getting Started
    • Save the file dictionaries.py to your desktop (or somewhere else that is easily accessible).
    • Save the file crawler_util.py to the same folder as dictionaries.py. The crawler_util module contains a class called HTMLGrabber that you will use to download webpage information in PA1. Read (or at least skim) the documentation for the __init__ function (the constructor) and the get_* accessor methods.
    • Start up the Python interpreter (python3) or an IDLE instance in the same folder. Execute the following line in the interactive interpreter:
      
              import crawler_util
              
      This tells Python to look for a file called "crawler_util.py" (notice that it will automatically append the ".py" file extension). There are various places that it will look for this file, but for the purposes of this lab, you will want it to be in the current folder. Assuming it finds the requested file, Python will open it and load any code that it contains. After you import the module, you can refer to any of the classes or functions in the module using the "crawler_util." prefix.
    • Create an instance of the HTMLGrabber class by instantiating it with a URL:
      
              grabber = crawler_util.HTMLGrabber("http://www.jmu.edu", True)
              
      Note that you do need to include the "http://" part of the URL. The second parameter controls whether errors are reported while grabbing pages. Passing True will alert you to malformed URLs. You should pass True as the second parameter for all parts of this lab.

      Creating an instance of the HTMLGrabber will execute its constructor. Python uses the special function name "__init__" to denote constructors. The code in that function in crawler_util.py handles the ugly details of downloading and parsing the HTML file to extract the various information that we are interested in. You do not need to understand all of the code in HTMLGrabber, but you do need to know how to use it. After spending some time reading the documentation and predicting what each function will return, try calling each of the following functions on your grabber object in the interpreter:

      • get_url()
      • get_title()
      • get_links()
      • get_text()
    • Close the Python interpreter if you have it open. The rest of the exercises will involve editing and testing dictionaries.py.
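
The import behaviour described in this exercise can be inspected directly. The sketch below uses the standard-library `json` module as a stand-in for `crawler_util` (an assumption made so the example runs without the lab files); the same ideas apply to any module you import.

```python
import sys
import json   # stand-in for crawler_util; chosen only because it ships with Python

# sys.path is the list of folders Python searches, in order, when it runs "import".
# The first entry usually refers to the current folder (interactive interpreter)
# or to the folder of the script being run.
for folder in sys.path[:3]:
    print(repr(folder))

# A module loaded from a file records where it was found.
print(json.__file__)

# Names defined in the module are reached through the module prefix.
text = json.dumps({"lab": "dictionaries"})
print(text)
```

If `import crawler_util` fails with a `ModuleNotFoundError`, printing `sys.path` is a quick way to confirm whether the folder containing crawler_util.py is being searched.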
  2. Basic Information
    • In the file dictionaries.py, start a new function with the following signature:
      
              print_info(url)
              
      This function takes a URL string as its only parameter. It will download the referenced webpage and print various information about it. For this part of the exercise, write the function so that it downloads the page (by creating an HTMLGrabber object; don't forget to use the crawler_util module prefix!) and prints two pieces of information: the page title and the page URL, each on a separate line. Here is an example of the desired output:
      
              Data Structures And Algorithms
              http://w3.cs.jmu.edu/spragunr/CS240
              
    • Write some testing code in main() that exercises print_info() by calling it with various well-known URLs. Verify that the results are correct by opening the same URLs in a browser and checking the window title.
    • Extend print_info() to also print the total number of words in the page's text, using the following example as a guideline for formatting:
      
              Data Structures And Algorithms
              http://w3.cs.jmu.edu/spragunr/CS240
              46 total word(s)
              
    • Extend print_info() to also print a count and list of all the links in the given webpage. Here is some example output of the final version of print_info():
      
              Data Structures And Algorithms
              http://w3.cs.jmu.edu/spragunr/CS240
              46 total word(s)
              8 link(s)
                - https://canvas.jmu.edu/
                - http://www.jmu.edu/
                - http://w3.cs.jmu.edu/spragunr/
                - http://w3.cs.jmu.edu/spragunr/supplement.shtml
                - http://w3.cs.jmu.edu/spragunr/CS240/
                - http://w3.cs.jmu.edu/spragunr/schedule.shtml
                - http://xkcd.com/353/
                - http://w3.cs.jmu.edu/spragunr/syllabus.shtml
              
  3. Word Frequencies
    • In the dictionaries.py file, write a function with the following signature:
      
              print_word_frequencies(url, min_frequency)
              
      This function takes a URL string as its first parameter. It should download the referenced webpage, then calculate and print word frequencies. The output should be a list of all the words in the document that occur at least min_frequency times (min_frequency is the second parameter). The list should contain each such word exactly once and should include a count of the number of times that word appeared in the document. The list should be sorted in alphabetical order by word. Here is an excerpt of some sample output from JMU's home page with a minimum frequency of 10:
      
              alumni: 14
              and: 25
              aug: 24
              campus: 21
              faculty: 10
              for: 17
              holiday: 10
              in: 23
              jmu: 55
              learning: 11
              madison: 16
              ...
              
      You will want to use a Python dictionary for this exercise. You may also wish to use the sorted() built-in function, which returns a new sorted list built from the items of any iterable.
    • Write some testing code in main() that exercises print_word_frequencies() by calling it with various well-known URLs or test webpages.

      You should create some short testing pages of your own. You can do this on your local system because the HTMLGrabber class understands file:// URLs. For instance, if the file you want to test is called test.html in the /home/lam/cs240 folder, you can test your code with it using the following URL:
      
              file:///home/lam/cs240/test.html
              
      You will probably need to escape any spaces in the folder or file names (e.g., "My\ Documents" rather than "My Documents").

      Here is an example test page: simple_counts.html
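
The dictionary-based counting that this exercise calls for can be sketched on a plain string. The function names `count_words` and `frequent_words` are hypothetical. The sketch makes two assumptions: words are lowercased before counting (the sample output above is all lowercase), and min_frequency is treated as an inclusive threshold (the sample output includes words with a count of exactly 10).

```python
def count_words(text):
    """Return a dictionary mapping each word in text to its number of occurrences."""
    counts = dict()
    for word in text.lower().split():
        if word in counts:
            counts[word] = counts[word] + 1   # seen before: update the count
        else:
            counts[word] = 1                  # first occurrence: add the key
    return counts


def frequent_words(counts, min_frequency):
    """Return (word, count) pairs with count >= min_frequency, in alphabetical order."""
    # Iterating over sorted(counts) visits the keys in alphabetical order.
    return [(word, counts[word]) for word in sorted(counts)
            if counts[word] >= min_frequency]


# Made-up sample text standing in for a downloaded page.
sample = "the cat and the dog and the bird"
counts = count_words(sample)
for word, n in frequent_words(counts, 2):
    print("{}: {}".format(word, n))
```

Keeping the counting separate from the filtering and printing makes it possible to test the counts on a short string before trying the code on a real webpage.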

Submission

This lab will not be graded so there is nothing to submit. However, you will most likely find it very useful to reference this lab while doing PA1. Make sure you keep a copy of your code for future reference. If you would like to discuss your solution or any problems you encounter while working on this lab, please come to office hours or make an appointment.