This is a simple piece of code that shows how a parser works. The program asks the user for a key phrase and for an integer limit on the number of links to crawl. It submits the phrase to Google search, crawls the result page, writes the links it finds to a text file, and then keeps crawling those links until the limit is reached.

A parser and crawler in Python

This program shows the user how to extract information from a page, such as its title and links. It uses simple regular expressions to pull that information out of the downloaded HTML. After reading this code you can also learn how to deal with HTTP error codes and with robots.txt: a page is downloaded only if robots.txt allows it, and if robots.txt disallows it the page is skipped. The program also takes care of errors, i.e. it will not raise an exception or crash if a page cannot be downloaded. You may use the code at your own risk; you are free to use it, modify it, and abuse it under a common free license, version 3.0.
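Before the full listing, here is a minimal standalone sketch of this kind of regex-based extraction in Python 2. The URL and the patterns are illustrative examples only, not the exact expressions used in the program below.

# Minimal sketch of regex-based extraction (Python 2).
# The URL and patterns are illustrative, not the ones used in the full program.
import re
import urllib2

url = "http://www.example.com/"                       # placeholder URL
html = urllib2.urlopen(url).read()

# <title>...</title> gives the page title
title_exp = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)
# href="..." inside anchor tags gives the outgoing links
href_exp = re.compile(r'<a\s[^>]*href=[\'"]([^\'"]+)[\'"]', re.IGNORECASE)

titles = title_exp.findall(html)
links = href_exp.findall(html)

print "title:", (titles[0].strip() if titles else "(none)")
for link in links[:10]:
    print "link:", link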

# To change this template, choose Tools | Templates
# and open the template in the editor.

__author__="shahid"
__date__ ="$Dec 15, 2011 11:42:29 PM$"

import sys
import re
import urllib
import urlparse
import socket
import urllib2
import os
from urllib import FancyURLopener
from time import strftime
import time
timeout = 15
socket.setdefaulttimeout(timeout)
tocrawl = []
crawled = []
result=[]
visited={}
t1=0
t2=0
t3=0
code_200=0
code_403=0
code_404=0
error=0
number=0
totalsize=0
link_exp = re.compile(r'<a\s[^>]*href=[\'"]([^\'"]+)[\'"]', re.IGNORECASE)    # matches href targets of anchor tags
frame_exp = re.compile(r'<frame\s[^>]*src=[\'"]([^\'"]+)[\'"]', re.IGNORECASE)  # added R: matches src targets of frame tags
numberoflink = raw_input("How many links do you want to crawl? ")
numberoflinks = int(numberoflink)
var = raw_input("Enter key phrase: ")
crawling="http://www.google.com/search?q="+urllib.quote_plus(var)   # encode spaces etc. in the query
f=open(var+".txt","w")
size=open("filesize.txt","w")  # added R
f.write("*******Level-1*********\n")
f.write("http://www.google.com/search?q="+var)
f.write("        \n")
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

try:
    myopener = MyOpener()
    page = myopener.open(crawling)
    msg=page.read()
    #print msg
except IOError, error_code:
    error=1
    if error_code[0]=="http error":
        if error_code[1]==401:
            print "Password required"
        elif error_code[1]==404:
            print "file not found"
        elif error_code[1]==500:
            print "server is down"
        else:
            print (error_code)
if error==0:
    links = link_exp.findall(msg)
    newlinks1 = frame_exp.findall(msg) # added R
    size.write(msg)                    # added R    
    crawled.append(crawling)
    mainlink = links + newlinks1       # added R  
    f.write("*******Level-2*********\n")
    f.write("        \n")
    for link in (mainlink.pop(0) for _ in xrange(len(mainlink))):  # added R
        
        if number+1 <= numberoflinks:
            if (link not in crawled) and (link not in tocrawl) and (not link.startswith('/') and not link.startswith('#') and  "google.com" not in link and  "youtube" not in link and  "q=cache" not in link):
                size=open("filesize.txt","w")
                myopener = MyOpener()
                page1=myopener.open(link)
                code=urllib.urlopen(link).getcode()
                if code==200:
                   code_200+=1
                elif code==403:
                   code_403+=1
                elif code==404:
                   code_404+=1     
                msg1=page1.read()
                size.write(msg1)
                t=time.time()
                print number+1,"|",link,"|",os.path.getsize("filesize.txt"),"bytes","|", strftime("%H:%M:%S"),"|", code
                totalsize+=os.path.getsize("filesize.txt")
                tocrawl.append(link)
                result.append(link)
                f.write(link)
                size.close()
                f.write("\n")
                if number==1:
                        t1=t
                elif number==numberoflinks-1:
                        t2=t
                number+=1
        else:
            t3=t2-t1
            print "total time=",t3
            print "total number of 200(ok) pages=",code_200
            print "total number of 403(forbidden) pages=",code_403
            print "total number of 404(page not found) pages=",code_404
            print "totalsize = ",totalsize
            sys.exit()
#################################################################
#####################Got 10 results from Google##################
#################################################################
for URL in (tocrawl.pop(0) for _ in xrange(len(tocrawl))):
    error=0                         # reset the error flag for each URL
    try:
        myopener = MyOpener()
        page = myopener.open(URL)
        f.write("        \n")
        f.write("********"+URL+"*********")
        f.write("        \n")
        msg=page.read()
    except IOError, error_code:
        error=1
        if error_code[0]=="http error":
            if error_code[1]==401:
                print "Password required"
            elif error_code[1]==404:
                print "file not found"
            elif error_code[1]==500:
                print "server is down"
            else:
                print (error_code)
    if error==0:
       newlinks = link_exp.findall(msg)
       newlinks1 = frame_exp.findall(msg)   # added R
       size=open("filesize.txt","w")
       size.write(msg)
       crawled.append(URL)
       mainlink = newlinks1 + newlinks          # added R
       Domain = urlparse.urlparse(URL)
       for link in (mainlink.pop(0) for _ in xrange(len(mainlink))): # added R
       
           if number+1 <= numberoflinks:
                if link.startswith('/'):
                    link = 'http://' + Domain[1] + link
                elif link.startswith('#'):
                    link = 'http://' + Domain[1] + Domain[2] + link
                elif not link.startswith('http'):
                    link = 'http://' + Domain[1] + '/' + link

                if (link not in crawled) and (link not in tocrawl) and (link not in result):
                    size=open("filesize.txt","w")
                    myopener = MyOpener()
                    page1=myopener.open(link)
                    code=urllib.urlopen(link).getcode()
                    if code==200:
                         code_200+=1
                    elif code==403:
                         code_403+=1
                    elif code==404:
                         code_404+=1     
                    msg1=page1.read()
                    size.write(msg1)
                    t=time.time()    
                    print number+1,"|",link,"|",os.path.getsize("filesize.txt"),"bytes","|", strftime("%H:%M:%S"),"|", code
                    totalsize+=os.path.getsize("filesize.txt")
                    tocrawl.append(link)
                    result.append(link)
                    f.write(link)
                    size.close()
                    f.write("\n")
                    if number==1:
                        t1=t
                    elif number==numberoflinks-1:
                        t2=t
                    number+=1
           else:
               print "End of program"
               t3=t2-t1
               print "total time=",t3
               print "total number of 200(ok) pages=",code_200
               print "total number of 403(forbidden) pages=",code_403
               print "total number of 404(page not found) pages=",code_404
               print "totalsize = ",totalsize
               sys.exit()

print "End of program"
t3=t2-t1
print "total time=",t3
print "total number of 200(ok) pages=",code_200
print "total number of 403(forbidden) pages=",code_403
print "total number of 404(page not found) pages=",code_404
print "totalsize = ",totalsize
sys.exit()
f.close()

Some explanation of the code

This parser is written in Python 2.6 using only built-in libraries; no custom or third-party libraries are used.

The crawler performs the following tasks:
1. The program parses N links starting from the top 10 Google results, where N is the number entered by the user.
2. The program checks for duplicate pages and does not parse them; this is done by taking each page title and comparing it with the titles already in the title list.
3. The program measures the time taken to download each page and displays the total time taken to parse the N links at the end of the program.
4. The program also checks whether each page is available for download and records any errors; the error counts are displayed at the end of the program.
5. The program checks whether a web page may be crawled using robotparser, and it continues to crawl only if it is allowed to do so (a minimal sketch of such a check is shown after this list).
6. The program computes the total size of the web pages downloaded.
7. The program stores the list of URLs in a text file whose name is the key phrase provided by the user (.txt).
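Item 5 refers to robots.txt handling; the listing above does not show that step explicitly, so here is a minimal sketch of how such a check can be written with Python 2's robotparser module. The URL and user-agent string are placeholders, not values taken from the program.

# Minimal robots.txt check using Python 2's robotparser module.
# The URL and user-agent string are placeholders.
import robotparser
import urlparse

def can_fetch(url, user_agent="*"):
    parts = urlparse.urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    try:
        rp.read()                      # download and parse robots.txt
    except IOError:
        return True                    # robots.txt unreachable: assume allowed here
    return rp.can_fetch(user_agent, url)

if can_fetch("http://www.example.com/somepage.html"):
    print "allowed to crawl"
else:
    print "disallowed by robots.txt"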

Following are some features NOT implemented in the program:

1. The program doesn't capture the timeout response when trying to access a webpage.
2. The program doesn't check the MIME type of the crawled page (one possible check is sketched after this list).
3. In terms of performance, displaying the output is a bit slow, because each page has to be downloaded to see whether it is a duplicate or not before it is added to the links.
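
For reference, items 1 and 2 could be handled roughly as follows. This is only a sketch of one possible approach using urllib2; the URL is a placeholder and this code is not part of the program above.

# Sketch of timeout handling and a MIME-type check with urllib2 (Python 2).
# The URL is a placeholder; this is not part of the program above.
import socket
import urllib2

url = "http://www.example.com/"
try:
    page = urllib2.urlopen(url, timeout=15)        # give up after 15 seconds
except (urllib2.URLError, socket.timeout), e:
    print "could not fetch", url, "-", e
else:
    mime_type = page.info().gettype()              # e.g. "text/html"
    if mime_type == "text/html":
        html = page.read()
        print "downloaded", len(html), "bytes of HTML"
    else:
        print "skipping non-HTML content:", mime_type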