Advanced Web Clients

Web browsers are basic Web clients. They are used primarily for searching and downloading documents from the Web. Advanced clients of the Web are those applications that do more than download single documents from the Internet. One example of an advanced Web client is a crawler (a.k.a. spider, robot). Crawlers are programs that explore and download pages from the Internet for a variety of reasons.
The crawler we present below, crawl.py, takes a starting Web address (URL), downloads that page, and then downloads every page linked from it and its successors, but only those in the same domain as the starting page. Without such a limitation, you would run out of disk space! The source for crawl.py follows:

Example 19.1. An Advanced Web Client: a Web Crawler (crawl.py)

The crawler consists of two classes, one to manage the entire crawling process (Crawler), and one to retrieve and parse each downloaded Web page (Retriever).
1   #!/usr/bin/env python
2
3   from sys import argv
4   from os import makedirs, unlink
5   from os.path import dirname, exists, isdir, splitext
6   from string import replace, find, lower
7   from htmllib import HTMLParser
8   from urllib import urlretrieve
9   from urlparse import urlparse, urljoin
10  from formatter import DumbWriter, AbstractFormatter
11  from cStringIO import StringIO
12
13  class Retriever:                # download Web pages
14
15      def __init__(self, url):
16          self.url = url
17          self.file = self.filename(url)
18
19      def filename(self, url, deffile='index.htm'):
20          parsedurl = urlparse(url, 'http:', 0)   # parse path
21          path = parsedurl[1] + parsedurl[2]
22          ext = splitext(path)
23          if ext[1] == '':        # no file, use default
24              if path[-1] == '/':
25                  path = path + deffile
26              else:
27                  path = path + '/' + deffile
28          dir = dirname(path)
29          if not isdir(dir):      # create archive dir if nec.
30              if exists(dir): unlink(dir)
31              makedirs(dir)
32          return path
33
34      def download(self):         # download Web page
35          try:
36              retval = urlretrieve(self.url, self.file)
37          except IOError:
38              retval = ('*** ERROR: invalid URL "%s"' % \
39                  self.url,)
40          return retval
41
42      def parseAndGetLinks(self): # parse HTML, save links
43          self.parser = HTMLParser(AbstractFormatter( \
44              DumbWriter(StringIO())))
45          self.parser.feed(open(self.file).read())
46          self.parser.close()
47          return self.parser.anchorlist
48
49  class Crawler:                  # manage entire crawling process
50
51      count = 0                   # static downloaded page counter
52
53      def __init__(self, url):
54          self.q = [url]
55          self.seen = []
56          self.dom = urlparse(url)[1]
57
58      def getPage(self, url):
59          r = Retriever(url)
60          retval = r.download()
61          if retval[0] == '*':    # error situation, do not parse
62              print retval, 'skipping parse'
63              return
64          Crawler.count = Crawler.count + 1
65          print '\n(', Crawler.count, ')'
66          print 'URL:', url
67          print 'FILE:', retval[0]
68          self.seen.append(url)
69
70          links = r.parseAndGetLinks()    # get and process links
71          for eachLink in links:
72              if eachLink[:4] != 'http' and \
73                      find(eachLink, '://') == -1:
74                  eachLink = urljoin(url, eachLink)
75              print '* ', eachLink,
76
77              if find(lower(eachLink), 'mailto:') != -1:
78                  print 'discarded, mailto link'
79                  continue
80
81              if eachLink not in self.seen:
82                  if find(eachLink, self.dom) == -1:
83                      print 'discarded, not in domain'
84                  else:
85                      if eachLink not in self.q:
86                          self.q.append(eachLink)
87                          print 'new, added to Q'
88                      else:
89                          print 'discarded, already in Q'
90              else:
91                  print 'discarded, already processed'
92
93      def go(self):               # process links in queue
94          while self.q:
95              url = self.q.pop()
96              self.getPage(url)
97
98  def main():
99      if len(argv) > 1:
100         url = argv[1]
101     else:
102         try:
103             url = raw_input('Enter starting URL: ')
104         except (KeyboardInterrupt, EOFError):
105             url = ''
106
107     if not url: return
108     robot = Crawler(url)
109     robot.go()
110
111 if __name__ == '__main__':
112     main()
Line-by-line (class-by-class) explanation:

Lines 1-11

The top part of the script consists of the standard Python Unix start-up line and the importation of the various module attributes employed in this application.

Lines 13-47

The Retriever class has the responsibility of downloading pages from the Web and parsing the links located within each document, adding them to the "to-do" queue if necessary. A Retriever instance object is created for each page downloaded from the net. Retriever consists of several methods to aid in its functionality: a constructor (__init__()), filename(), download(), and parseAndGetLinks().

The filename() method takes the given URL and comes up with a safe and sane corresponding file name to store locally. Basically, it removes the "http://" prefix from the URL and uses the remaining part as the file name, creating any directory paths necessary. URLs without trailing file names are given a default file name of "index.htm". (This name can be overridden in the call to filename().) The constructor instantiates a Retriever object and stores both the URL string and the corresponding file name returned by filename() as local attributes.

The download() method, as you may imagine, actually goes out to the net to download the page with the given link. It calls urllib.urlretrieve() with the URL and saves the page to the file name returned by filename(). If the download succeeds, the return value of urlretrieve() is passed back; otherwise a tuple containing an error string is returned. If the Crawler determines that no error has occurred, it invokes the parseAndGetLinks() method to parse the newly downloaded page and determine the course of action for each link located on that page.

Lines 49-96

The Crawler class is the "star" of the show, managing the entire crawling process; thus only one instance is created for each invocation of our script.
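The path-derivation logic of filename() can be sketched in modern Python 3 (where urlparse() now lives in urllib.parse); this simplified version, with a hypothetical name local_filename(), covers only the name mapping and omits the directory-creation side effects:

```python
import os
from urllib.parse import urlparse

def local_filename(url, deffile='index.htm'):
    """Map a URL to a local file path, mirroring Retriever.filename()."""
    parts = urlparse(url, 'http')           # default scheme is 'http'
    path = parts.netloc + parts.path        # strip the 'http://' prefix
    if os.path.splitext(path)[1] == '':     # no file extension: use default
        if path.endswith('/'):
            path = path + deffile
        else:
            path = path + '/' + deffile
    return path

print(local_filename('http://www.null.com/home/index.html'))
# www.null.com/home/index.html
print(local_filename('http://www.null.com/home/'))
# www.null.com/home/index.htm
```

As in the original, a URL that ends in a bare directory name gets "index.htm" appended, so every downloaded page has a place on disk.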
The Crawler stores three items set by the constructor during the instantiation phase. The first is q, a queue of links to download. This list fluctuates during execution, shrinking as each page is processed and growing as new links are discovered within each downloaded page. The other two data values are seen, a list of all the links that "we have seen" (downloaded) already, and dom, the domain name of the starting link, which is used to determine whether succeeding links are part of the same domain. Crawler also has a static data item named count, a counter that keeps track of the number of pages downloaded from the net. It is incremented for every page successfully downloaded.

Crawler has a pair of methods in addition to its constructor, getPage() and go(). go() is simply the method used to start the Crawler and is called from the main body of code. It consists of a loop that continues to execute as long as there are new links in the queue that need to be downloaded. The workhorse of this class, though, is the getPage() method. getPage() instantiates a Retriever object with the given link and lets it go off to the races. If the page is downloaded successfully, the counter is incremented and the link added to the "already seen" list. getPage() then looks at all the links featured inside the downloaded page and determines whether any more of them should be added to the queue. The main loop in go() continues to process links until the queue is empty, at which time victory is declared. Links that are part of another domain, have already been downloaded, are already in the queue waiting to be processed, or are "mailto:" links are ignored and not added to the queue.

Lines 98-112

main() is executed if this script is invoked directly and is the starting point of execution.
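The screening logic in getPage() is a standard graph-traversal pattern. A condensed Python 3 sketch follows, with a caller-supplied get_links(url) callback standing in for the download-and-parse step, and an exact host comparison in place of the substring test the book's code performs:

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, get_links):
    """Visit links reachable from start_url, staying within its domain.

    get_links(url) is assumed to return the raw anchor strings found on
    that page. Returns the URLs processed, in the order handled.
    """
    dom = urlparse(start_url).netloc
    queue, seen = [start_url], []
    while queue:                           # same loop shape as Crawler.go()
        url = queue.pop()
        seen.append(url)
        for link in get_links(url):
            link = urljoin(url, link)      # resolve relative links
            if link.lower().startswith('mailto:'):
                continue                   # discarded, mailto link
            if urlparse(link).netloc != dom:
                continue                   # discarded, not in domain
            if link not in seen and link not in queue:
                queue.append(link)         # new, added to Q
    return seen

# Tiny in-memory "site" for demonstration (hypothetical data):
pages = {
    'http://www.null.com/index.html': ['a.html', 'mailto:x@null.com',
                                       'http://bogus.com/index.html'],
    'http://www.null.com/a.html': ['index.html'],
}
print(crawl('http://www.null.com/index.html', lambda u: pages.get(u, [])))
# ['http://www.null.com/index.html', 'http://www.null.com/a.html']
```

The mailto link and the bogus.com link are filtered out, and index.html is not revisited even though a.html links back to it.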
Other modules that import crawl.py will need to invoke main() to begin processing. main() needs a URL to begin with: if one is given on the command line (i.e., when the script is invoked directly), that one is used. Otherwise, the script enters interactive mode, prompting the user for a starting URL. With a starting link in hand, the Crawler is instantiated and away we go. One sample invocation of crawl.py may look like:

% crawl.py
Enter starting URL: http://www.null.com/home/index.html

( 1 )
URL: http://www.null.com/home/index.html
FILE: www.null.com/home/index.html
*  http://www.null.com/home/overview.html new, added to Q
*  http://www.null.com/home/synopsis.html new, added to Q
*  http://www.null.com/home/order.html new, added to Q
*  mailto:postmaster@null.com discarded, mailto link
*  http://www.null.com/home/overview.html discarded, already in Q
*  http://www.null.com/home/synopsis.html discarded, already in Q
*  http://www.null.com/home/order.html discarded, already in Q
*  mailto:postmaster@null.com discarded, mailto link
*  http://bogus.com/index.html discarded, not in domain

( 2 )
URL: http://www.null.com/home/order.html
FILE: www.null.com/home/order.html
*  mailto:postmaster@null.com discarded, mailto link
*  http://www.null.com/home/index.html discarded, already processed
*  http://www.null.com/home/synopsis.html discarded, already in Q
*  http://www.null.com/home/overview.html discarded, already in Q

( 3 )
URL: http://www.null.com/home/synopsis.html
FILE: www.null.com/home/synopsis.html
*  http://www.null.com/home/index.html discarded, already processed
*  http://www.null.com/home/order.html discarded, already processed
*  http://www.null.com/home/overview.html discarded, already in Q

( 4 )
URL: http://www.null.com/home/overview.html
FILE: www.null.com/home/overview.html
*  http://www.null.com/home/synopsis.html discarded, already processed
*  http://www.null.com/home/index.html discarded, already processed
*  http://www.null.com/home/order.html discarded, already processed

After execution, a www.null.com directory would be created in the local file system, with a home subdirectory. Within home, all the HTML files processed can be found.
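The htmllib and formatter modules that parseAndGetLinks() relies on were removed in Python 3. An equivalent anchor-collecting parser can be sketched with html.parser, which remains in the standard library; the AnchorParser class name and its anchorlist attribute (chosen to echo htmllib's attribute of the same name) are our own:

```python
from html.parser import HTMLParser

class AnchorParser(HTMLParser):
    """Collect the href of every <a> tag, like htmllib's anchorlist."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.anchorlist.append(value)

parser = AnchorParser()
parser.feed('<html><body>'
            '<a href="overview.html">Overview</a>'
            '<a href="mailto:postmaster@null.com">Mail</a>'
            '</body></html>')
print(parser.anchorlist)
# ['overview.html', 'mailto:postmaster@null.com']
```

Unlike the original, which routes the page through a formatter just to get link extraction as a side effect, this subclass collects anchors directly as the tags are encountered.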
© 2002, O'Reilly & Associates, Inc.