Your objective for this assignment is to build a simple text-based search engine in Python.
A search engine has three main components: a crawler that downloads a collection of HTML documents, an index that maps from the words in those documents to the documents that contain them, and a search interface that uses the index to answer queries.
As a starting point, I have provided a utility module named crawler_util.py that contains the HTMLGrabber class. This class will handle some of the messy details of networking in Python and parsing HTML. Here is an abbreviated version of the documentation for that class:
class HTMLGrabber(HTMLParser.HTMLParser)
 |  This class handles downloading and processing an HTML document.
 |
 |  Methods defined here:
 |
 |  __init__(self, url, verbose=False)
 |      Construct an HTMLGrabber object.
 |
 |      This constructor downloads the HTML file at the indicated
 |      address and parses it in order to extract the page title, the
 |      page text and all outgoing links.
 |
 |      Notes:
 |      - robots.txt is respected.  If the server doesn't want the
 |        file crawled, this function won't crawl it.
 |      - If any sort of exception occurs during processing, this
 |        function will quietly give up and subsequent calls to
 |        getText etc. will return empty defaults.
 |
 |      Parameters:
 |        url     - A string containing a valid absolute url.  Eg:
 |                  "http://www.jmu.edu" or
 |                  "file:///home/spragunr/somefile.html"
 |        verbose - If True, an error message will be printed if an
 |                  exception is encountered.  If False, exceptions
 |                  are quietly ignored.  Defaults to False.
 |
 |  getLinks(self)
 |      Return all outgoing links from the HTML document.
 |
 |      Notes:
 |      - Only one copy of each link will be returned.
 |      - All links are returned.  Returned links may or may not
 |        refer to HTML documents.
 |
 |      Returns:
 |        An unordered list of strings, where each string is a URL.
 |
 |  getText(self)
 |      Return all of the document text as a single lower-case string.
 |      All punctuation will be removed.  This may return an empty
 |      string if there were errors during parsing.
 |
 |  getTitle(self)
 |      Return the HTML page title.  This may return an empty string
 |      if there were errors during parsing, or if the page has no
 |      title.
 |
 |  getURL(self)
 |      Returns a string containing the URL of this page.
Here is an example of using this class from within the Python interpreter:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from crawler_util import HTMLGrabber
>>> coursePage = HTMLGrabber("http://w3.cs.jmu.edu/spragunr/CS240/")
>>> print coursePage.getLinks()
['http://xkcd.com/353/', 'https://blackboard.jmu.edu/', 'http://w3.cs.jmu.edu/spragunr/CS240/', 'https://piazza.com/jmu/spring2013/cs240/home', 'http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml', 'http://www.jmu.edu/', 'http://w3.cs.jmu.edu/spragunr/', 'http://w3.cs.jmu.edu/spragunr/CS240/supplement.shtml', 'http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml']
>>> print coursePage.getText()
cs 240 data structures and algorithms james madison university spring 2013 instructor nathan sprague meeting times section 1 meets mwf 9 05 9 55 hhs 2208 section 2 meets mwf 10 10 11 00 hhs 2208 syllabus detailed schedule piazza page blackboard supplemental material obligatory xkcd comic
>>> print coursePage.getTitle()
Data Structures And Algorithms
>>> print coursePage.getURL()
http://w3.cs.jmu.edu/spragunr/CS240/
>>>
I recommend that you get started on this project by playing around with this class in the terminal until you have a good understanding of how it works.
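For example, continuing the interpreter session above, the lower-case, punctuation-free string returned by getText is easy to split into individual words, which is exactly what you will need later when you build your index:

>>> words = coursePage.getText().split()
>>> words[:4]
['cs', '240', 'data', 'structures']
>>> 'algorithms' in words
True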
Your main task for this project is to complete a search engine class. The following skeleton has been provided in the file engine.py.
""" This module includes the SearchEngine class. The heart of JMUahoogling. Author: Date: """ #------------------------------------------------------------------- # I certify that this work complies with the JMU Honor Code. [I would # like to acknowledge the assistance of NAME(S). DESCRIPTION OF HELP # PROVIDED.] # -YOUR NAME #------------------------------------------------------------------- import crawler_util class SearchResult(object): """ This is a simple container class for storing search results Public Attributes: url - A string containing the URL of an HTML document. title - A string containing the title of the HTML page. """ #YOU SHOULD NOT NEED TO MODIFY THIS CLASS. def __init__(self, url, title): """ Initialize the search result object. """ self.url = url self.title = title def __eq__(self, rhs): """ Overide the == operator. Search results are considered equal if they have the same url. """ return type(self) == type(rhs) and self.url == rhs.url class SearchEngine(object): """ The SearchEngine class is responsible for crawling some collection of HTML documents and creating an index that maps from words in those documents to SearchResult objects containing URLs and page titles. This class also provides a method for searching the index once it is created. """ def __init__(self): """ Construct a SearchEngine. At the time of construction the index is empty. """ # YOUR CODE HERE You will (at least) need someplace to store # the search index (I suggest a dictionary) and some way of # keeping track of which URLs have already been crawled (I # suggest a set). pass # pass is a do-nothing instruction. It's just here because # it is a syntax error in Python to have an empty code block. # Replace it with good code. def crawl(self, url, maxDepth): """ Update the index by crawling a collection of HTML documents. The crawl begins at url and will visit all pages within maxDepth links. For example, if maxDepth = 0, only the document located at the indicated url is indexed. If maxDepth = 2, the original document is indexed, along with all documents that it links to and all of the documents that THOSE documents link to. This method tracks which URLs have already been visited so that no URL will be processed more than once. Arguments: url - A string containing a valid URL. For example: "http://www.jmu.edu" or "file:///home/spragunr/somefile.html" maxDepth - The maximum crawl depth. See explanation above. No return value. """ # YOUR CODE HERE pass def search(self, word): """ Return a list of SearchResult objects that match the given word. This function returns a (possibly empty) list of SearchResult objects. Each object refers to a document that contains the indicated word. The search is NOT case sensitive. This method will accept any string, but assumes that the string represents a single word. Since the index only stores individual words, any strings with spaces or punctuation will necessarily have 0 hits. Arguments: word - A string to search for in the search engine index. Return: A list of SearchResult objects. The order of the objects in the list is not specified. """ #YOUR CODE HERE. pass if __name__ == "__main__": # TESTING CODE HERE. IF YOU WANT IT. pass
You are free to add as many instance variables and methods to this class as you like. However, you may not change the method signatures of the three existing methods above. Those methods must be completed according to the specifications provided in the docstrings.
Of the three methods above, crawl will be the most challenging to complete. The __init__ and search methods should only be a few lines of code each. The easiest way to implement crawl is probably to create a recursive helper method. Crawling is inherently recursive: you crawl a web page by visiting the page, then crawling all of the linked pages. The following pseudocode captures the basic idea:
Crawling the web starting at URL:
    Record the fact that this URL has been crawled.
    Download all of the text from the URL and update the index.
    Download all of the links from the URL.
    For each downloaded link that hasn't been crawled yet:
        recursively crawl starting at that link.
There is one problem with this algorithm: it won't terminate until all pages reachable from the starting URL have been indexed. In most cases, this will represent a large percentage of the World Wide Web. As you probably know, the web is big. Your program would run out of memory long before it finished crawling.
The SearchEngine class addresses this issue by allowing us to put a limit on the maximum number of links that are followed from the starting URL to any other URL. Implementing this limit requires only a slight modification to the algorithm above:
Crawling the web starting at URL:
    Record the fact that this URL has been crawled.
    Download all of the text from the URL and update the index.
    Download all of the links from the URL.
    If continuing to crawl won't exceed the depth limit:
        For each downloaded link that hasn't been crawled yet:
            recursively crawl starting at that link.
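To make the recursion concrete, here is one way that pseudocode might translate into Python. Treat this purely as a rough sketch of one possible approach, not as the required implementation: the helper method _crawlPage and the instance variables _index and _crawled are illustrative names of my own choosing, and the docstrings from the provided skeleton are omitted.

import crawler_util  # provides the HTMLGrabber class documented above


class SearchEngine(object):
    # Docstrings omitted; see the provided skeleton in engine.py.
    # SearchResult is the container class defined in that same file.

    def __init__(self):
        self._index = {}       # maps each word to a list of SearchResult objects
        self._crawled = set()  # URLs that have already been visited

    def crawl(self, url, maxDepth):
        self._crawlPage(url, maxDepth)

    def _crawlPage(self, url, depthRemaining):
        # Record the fact that this URL has been crawled.
        self._crawled.add(url)

        # Download the page and update the index with every word on it.
        page = crawler_util.HTMLGrabber(url)
        result = SearchResult(page.getURL(), page.getTitle())
        for word in page.getText().split():
            hits = self._index.setdefault(word, [])
            if result not in hits:  # SearchResult.__eq__ compares URLs
                hits.append(result)

        # If continuing to crawl won't exceed the depth limit, recurse
        # into every link that hasn't been crawled yet.
        if depthRemaining > 0:
            for link in page.getLinks():
                if link not in self._crawled:
                    self._crawlPage(link, depthRemaining - 1)

    def search(self, word):
        # getText() already lower-cases the page text, so lower-casing
        # the query is enough to make the search case insensitive.
        return list(self._index.get(word.lower(), []))

Notice that decrementing the depth argument on each recursive call matches the specification of maxDepth: with maxDepth = 0 only the starting page is indexed, while maxDepth = 2 reaches pages up to two links away.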
An interaction with your program should look like the following. The text formatting and prompts should be exactly the same as in the example below. The ordering of the results need not be the same.
$ python search_ui.py
Enter URL to crawl: http://w3.cs.jmu.edu/spragunr/CS240/
Depth: 1
Enter search term: stacks
Results:
Data Structures And Algorithms: Syllabus
http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml
CS240 Schedule
http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml
Enter search term: Python
Results:
xkcd: Python
http://xkcd.com/353/
Data Structures And Algorithms: Syllabus
http://w3.cs.jmu.edu/spragunr/CS240/syllabus.shtml
Data Structures And Algorithms: Supplemental Material
http://w3.cs.jmu.edu/spragunr/CS240/supplement.shtml
CS240 Schedule
http://w3.cs.jmu.edu/spragunr/CS240/schedule.shtml
Enter search term: [USER PRESSES ENTER]
Thank you for using JMUahoogling!
Your user interface code must be submitted in a file named search_ui.py.
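One possible shape for search_ui.py is sketched below, assuming the SearchEngine class lives in engine.py as described above. This is only a sketch, not the required implementation; in particular, you are still responsible for making the prompts and output formatting match the example exactly. The raw_input calls match the Python 2.7 interpreter shown earlier.

import engine


def main():
    searchEngine = engine.SearchEngine()

    # Crawl once, using the URL and depth supplied by the user.
    url = raw_input("Enter URL to crawl: ")
    depth = int(raw_input("Depth: "))
    searchEngine.crawl(url, depth)

    # Answer search queries until the user presses enter on an empty line.
    term = raw_input("Enter search term: ")
    while term != "":
        print "Results:"
        for result in searchEngine.search(term):
            print result.title
            print result.url
        term = raw_input("Enter search term: ")
    print "Thank you for using JMUahoogling!"


if __name__ == "__main__":
    main()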
In order to facilitate testing, I am providing a small collection of web pages that contains no links to outside pages. The following diagram illustrates the file names of the pages, the contents of the pages, and the link structure:
For example, trees.html is a page with only two words: "Ash" and "Maple". It contains only a single outgoing link to streets.html. You can browse these pages here: http://w3.cs.jmu.edu/spragunr/CS240/activities/crawler/simple/trees.html.
You are welcome to use that URL in your testing, but it may be more convenient to download a local copy of those pages and test on your own computer. First, download and unzip the file simple.zip. You can then create a URL that specifies a location on the local machine using the file scheme. For example, on my own computer, I might use:
"file:///home/spragunr/simple/trees.html"
On your computer you will need to replace "/home/spragunr/" with the absolute path to the directory where you saved the web pages.
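For example, once you have unzipped the pages, a quick sanity check from the interpreter might look something like the following sketch (it reuses the example path from above and assumes your engine.py is complete):

>>> from engine import SearchEngine
>>> searchEngine = SearchEngine()
>>> searchEngine.crawl("file:///home/spragunr/simple/trees.html", 2)
>>> print [result.url for result in searchEngine.search("maple")]

Since trees.html contains the word "Maple" and getText lower-cases everything, the printed list should at least include the URL of trees.html.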
I recommend you get your code working on these pages before you
turn it loose on the web at large. When you do test your code on the
web, you'll probably need to restrict your crawling depth to 2 or 3 at
the most. If we assume that most web pages have around 10 outgoing
links (a conservative estimate), then a crawl of depth three will
traverse:
1 (the initial page) +
10 (all linked pages) +
100 (all pages linked from those pages) +
1000 (etc.)
= 1111 total pages.
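More generally, if each page has b outgoing links, a crawl of depth d visits at most 1 + b + b^2 + ... + b^d = (b^(d+1) - 1) / (b - 1) pages, so the amount of work grows exponentially with the depth limit.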
Submit your completed engine.py and search_ui.py. You should not submit crawler_util.py. Your code should conform to the CS240 Python Coding Conventions.