Your objective for this assignment is to build a simple text-based search engine in Python.
A search engine has three main components: a crawler that downloads a collection of HTML documents, an index that maps from the words in those documents to the documents that contain them, and a search interface that uses the index to answer queries.
As a starting point, I have provided a utility module named crawler_util.py that contains the HTMLGrabber class. This class will handle some of the messy details of networking in Python and parsing HTML. Here is an abbreviated version of the documentation for that class:
class HTMLGrabber(HTMLParser.HTMLParser)
 |  This class handles downloading and processing an HTML document.
 |
 |  Methods defined here:
 |
 |  __init__(self, url, verbose=False)
 |      Construct an HTMLGrabber object.
 |
 |      This constructor downloads the HTML file at the indicated
 |      address and parses it in order to extract the page title, the
 |      page text and all outgoing links.
 |
 |      Notes:
 |      - robots.txt is respected.  If the server doesn't want the
 |        file crawled, this function won't crawl it.
 |      - If any sort of exception occurs during processing, this
 |        function will quietly give up and subsequent calls to
 |        getText etc. will return empty defaults.
 |
 |      Parameters:
 |        url     - A string containing a valid absolute url.  Eg:
 |                  "http://www.jmu.edu" or
 |                  "file:///home/spragunr/somefile.html"
 |        verbose - If True, an error message will be printed if an
 |                  exception is encountered.  If False, exceptions
 |                  are quietly ignored.  Defaults to False.
 |
 |  getLinks(self)
 |      Return all outgoing links from the HTML document.
 |
 |      Notes:
 |      - Only one copy of each link will be returned.
 |      - All links are returned.  Returned links may or may not
 |        refer to HTML documents.
 |
 |      Returns:
 |        An unordered list of strings, where each string is a URL.
 |
 |  getText(self)
 |      Return all of the document text as a single lower-case string.
 |      All punctuation will be removed.  This may return an empty
 |      string if there were errors during parsing.
 |
 |  getTitle(self)
 |      Return the HTML page title.  This may return an empty string
 |      if there were errors during parsing, or if the page has no
 |      title.
 |
 |  getURL(self)
 |      Returns a string containing the URL of this page.
Here is an example of using this class from within the Python interpreter:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from crawler_util import HTMLGrabber
>>> coursePage = HTMLGrabber("http://w3.cs.jmu.edu/spragunr/CS240/")
>>> print coursePage.getLinks()
['http://xkcd.com/353/', 'https://blackboard.jmu.edu/', 'http://w3.cs.jmu.edu/spragunr/CS240/', 'https://piazza.com/jmu/spring2013/cs240/home', 'http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml', 'http://www.jmu.edu/', 'http://w3.cs.jmu.edu/spragunr/', 'http://w3.cs.jmu.edu/spragunr/CS240/supplement.shtml', 'http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml']
>>> print coursePage.getText()
cs 240 data structures and algorithms james madison university spring 2013 instructor nathan sprague meeting times section 1 meets mwf 9 05 9 55 hhs 2208 section 2 meets mwf 10 10 11 00 hhs 2208 syllabus detailed schedule piazza page blackboard supplemental material obligatory xkcd comic
>>> print coursePage.getTitle()
Data Structures And Algorithms
>>> print coursePage.getURL()
http://w3.cs.jmu.edu/spragunr/CS240/
>>>
I recommend that you get started on this project by playing around with this class in the terminal until you have a good understanding of how it works.
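For example, continuing the interpreter session above, the lower-case, punctuation-free string returned by getText is easy to split into individual words, which is exactly what you will need later when you build your index:

>>> words = coursePage.getText().split()
>>> words[:4]
['cs', '240', 'data', 'structures']
>>> 'algorithms' in words
True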
Your main task for this project is to complete a search engine class. The following skeleton has been provided in the file engine.py.
""" This module includes the SearchEngine class. The heart of JMUahoogling. Author: Date: """ #------------------------------------------------------------------- # I certify that this work complies with the JMU Honor Code. [I would # like to acknowledge the assistance of NAME(S). DESCRIPTION OF HELP # PROVIDED.] # -YOUR NAME #------------------------------------------------------------------- import crawler_util class SearchResult(object): """ This is a simple container class for storing search results Public Attributes: url - A string containing the URL of an HTML document. title - A string containing the title of the HTML page. """ #YOU SHOULD NOT NEED TO MODIFY THIS CLASS. def __init__(self, url, title): """ Initialize the search result object. """ self.url = url self.title = title def __eq__(self, rhs): """ Overide the == operator. Search results are considered equal if they have the same url. """ return type(self) == type(rhs) and self.url == rhs.url class SearchEngine(object): """ The SearchEngine class is responsible for crawling some collection of HTML documents and creating an index that maps from words in those documents to SearchResult objects containing URLs and page titles. This class also provides a method for searching the index once it is created. """ def __init__(self): """ Construct a SearchEngine. At the time of construction the index is empty. """ # YOUR CODE HERE You will (at least) need someplace to store # the search index (I suggest a dictionary) and some way of # keeping track of which URLs have already been crawled (I # suggest a set). pass # pass is a do-nothing instruction. It's just here because # it is a syntax error in Python to have an empty code block. # Replace it with good code. def crawl(self, url, maxDepth): """ Update the index by crawling a collection of HTML documents. The crawl begins at url and will visit all pages within maxDepth links. For example, if maxDepth = 0, only the document located at the indicated url is indexed. If maxDepth = 2, the original document is indexed, along with all documents that it links to and all of the documents that THOSE documents link to. This method tracks which URLs have already been visited so that no URL will be processed more than once. Arguments: url - A string containing a valid URL. For example: "http://www.jmu.edu" or "file:///home/spragunr/somefile.html" maxDepth - The maximum crawl depth. See explanation above. No return value. """ # YOUR CODE HERE pass def search(self, word): """ Return a list of SearchResult objects that match the given word. This function returns a (possibly empty) list of SearchResult objects. Each object refers to a document that contains the indicated word. The search is NOT case sensitive. This method will accept any string, but assumes that the string represents a single word. Since the index only stores individual words, any strings with spaces or punctuation will necessarily have 0 hits. Arguments: word - A string to search for in the search engine index. Return: A list of SearchResult objects. The order of the objects in the list is not specified. """ #YOUR CODE HERE. pass if __name__ == "__main__": # TESTING CODE HERE. IF YOU WANT IT. pass
You are free to add as many instance variables and methods to this class as you like. However, you may not change the method signatures of the three existing methods above. Those methods must be completed according to the specifications provided in the docstrings.
Of the three methods above, crawl will be the most challenging to complete. The __init__ and search methods should only be a few lines of code each. The easiest way to implement crawl is probably to create a recursive helper method. Crawling is inherently recursive: you crawl a web page by visiting the page, then crawling all of the linked pages. The following pseudocode captures the basic idea:
Crawling the web starting at URL:
    Record the fact that this URL has been crawled.
    Download all of the text from the URL and update the index.
    Download all of the links from the URL.
    For each downloaded link that hasn't been crawled yet:
        recursively crawl starting at that link.
There is one problem with this algorithm: it won't terminate until all pages reachable from the starting URL have been indexed. In most cases, this will represent a large percentage of the World Wide Web. As you probably know, the web is big. Your program would run out of memory long before it finished crawling.
The SearchEngine class addresses this issue by allowing us to put a limit on the maximum number of links that are followed from the starting URL to any other URL. Implementing this limit requires only a slight modification to the algorithm above:
Crawling the web starting at URL:
    Record the fact that this URL has been crawled.
    Download all of the text from the URL and update the index.
    Download all of the links from the URL.
    If continuing to crawl won't exceed the depth limit:
        For each downloaded link that hasn't been crawled yet:
            recursively crawl starting at that link.
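To make the recursion concrete, here is one way that pseudocode might translate into Python. Treat this purely as a rough sketch of one possible approach, not as the required implementation: the helper method _crawlPage and the instance variables _index and _crawled are illustrative names of my own choosing, and the docstrings from the provided skeleton are omitted.

import crawler_util  # provides the HTMLGrabber class documented above


class SearchEngine(object):
    # Docstrings omitted; see the provided skeleton in engine.py.
    # SearchResult is the container class defined in that same file.

    def __init__(self):
        self._index = {}       # maps each word to a list of SearchResult objects
        self._crawled = set()  # URLs that have already been visited

    def crawl(self, url, maxDepth):
        self._crawlPage(url, maxDepth)

    def _crawlPage(self, url, depthRemaining):
        # Record the fact that this URL has been crawled.
        self._crawled.add(url)

        # Download the page and update the index with every word on it.
        page = crawler_util.HTMLGrabber(url)
        result = SearchResult(page.getURL(), page.getTitle())
        for word in page.getText().split():
            hits = self._index.setdefault(word, [])
            if result not in hits:  # SearchResult.__eq__ compares URLs
                hits.append(result)

        # If continuing to crawl won't exceed the depth limit, recurse
        # into every link that hasn't been crawled yet.
        if depthRemaining > 0:
            for link in page.getLinks():
                if link not in self._crawled:
                    self._crawlPage(link, depthRemaining - 1)

    def search(self, word):
        # getText() already lower-cases the page text, so lower-casing
        # the query is enough to make the search case insensitive.
        return list(self._index.get(word.lower(), []))

Notice that decrementing the depth argument on each recursive call matches the specification of maxDepth: with maxDepth = 0 only the starting page is indexed, while maxDepth = 2 reaches pages up to two links away.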
An interaction with your program should look like the following. The text formatting and prompts should be exactly the same as in the example below. The ordering of the results need not be the same.
$ python search_ui.py
Enter URL to crawl: http://w3.cs.jmu.edu/spragunr/CS240/
Depth: 1
Enter search term: stacks
Results:
Data Structures And Algorithms: Syllabus
http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml
CS240 Schedule
http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml
Enter search term: Python
Results:
xkcd: Python
http://xkcd.com/353/
Data Structures And Algorithms: Syllabus
http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml
Data Structures And Algorithms: Supplemental Material
http://w3.cs.jmu.edu/spragunr/CS240/supplement.shtml
CS240 Schedule
http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml
Enter search term: [USER PRESSES ENTER]
Thank you for using JMUahoogling!
Your user interface code must be submitted in a file named search_ui.py.
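One possible shape for search_ui.py is sketched below, assuming the SearchEngine class lives in engine.py as described above. This is only a sketch, not the required implementation; in particular, you are still responsible for making the prompts and output formatting match the example exactly. The raw_input calls match the Python 2.7 interpreter shown earlier.

import engine


def main():
    searchEngine = engine.SearchEngine()

    # Crawl once, using the URL and depth supplied by the user.
    url = raw_input("Enter URL to crawl: ")
    depth = int(raw_input("Depth: "))
    searchEngine.crawl(url, depth)

    # Answer search queries until the user presses enter on an empty line.
    term = raw_input("Enter search term: ")
    while term != "":
        print "Results:"
        for result in searchEngine.search(term):
            print result.title
            print result.url
        term = raw_input("Enter search term: ")
    print "Thank you for using JMUahoogling!"


if __name__ == "__main__":
    main()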
In order to facilitate testing, I am providing a small collection of web pages that contains no links to outside pages. The following diagram illustrates the file names of the pages, the contents of the pages, and the link structure:
For example, trees.html is a page with only two words: "Ash" and "Maple". It contains only a single outgoing link to streets.html. You can browse these pages here: http://w3.cs.jmu.edu/spragunr/CS240/activities/crawler/simple/trees.html.
You are welcome to use that URL in your testing, but it may be more convenient to download a local copy of those pages and test on your own computer. First, download and unzip the file simple.zip. You can then create a URL that specifies a location on the local machine using the file scheme. For example, on my own computer, I might use:
"file:///home/spragunr/simple/trees.html"
On your computer you will need to replace "/home/spragunr/" with the absolute path to the directory where you saved the web pages.
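For example, once you have unzipped the pages, a quick sanity check from the interpreter might look something like the following sketch (it reuses the example path from above and assumes your engine.py is complete):

>>> from engine import SearchEngine
>>> searchEngine = SearchEngine()
>>> searchEngine.crawl("file:///home/spragunr/simple/trees.html", 2)
>>> print [result.url for result in searchEngine.search("maple")]

Since trees.html contains the word "Maple" and getText lower-cases everything, the printed list should at least include the URL of trees.html.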
I recommend you get your code working on these pages before you
turn it loose on the web at large. When you do test your code on the
web, you'll probably need to restrict your crawling depth to 2 or 3 at
the most. If we assume that most web pages have around 10 outgoing
links (a conservative estimate), then a crawl of depth three will
traverse:
1 (the initial page) +
10 (all linked pages) +
100 (all pages linked from those pages) +
1000 (etc.)
= 1111 total pages.
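More generally, if each page has b outgoing links, a crawl of depth d visits at most 1 + b + b^2 + ... + b^d = (b^(d+1) - 1) / (b - 1) pages, so the amount of work grows exponentially with the depth limit.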
Submit your completed engine.py and search_ui.py. You should not submit crawler_util.py. Your code should conform to the CS240 Python Coding Conventions.