Every day I receive between 50 and 100 emails from readers asking for help with different things. One of the most common requests is, “Can you provide source code for some of your articles in different programming languages?” For example, my articles about creating screen scrapers are among the most viewed on my site. I have already provided source code for screen scrapers in C# and Perl. I thought I had provided one in Java as well, but I can’t find it, so I’ll have to add that later. Anyway, according to the emails I get, many of my readers are Python programmers and would prefer to see my examples written in Python. Personally, I prefer writing in Java, C#, and PHP, but I also like to keep my readers happy. So, this article is for all of you Python programmers looking for an easy way to do screen scraping.

Note: The code snippets in the walkthrough may lose their indentation. The full source code, with formatting intact, is at the end of the article.

First off, we need to define a purpose for our example screen scraper. In reply to a few of the emails I’ve received lately, I’m going to focus my screen scraper on getting URLs out of a web page. To do that, I’m going to leverage a couple of libraries that are built into Python. (By the way, I’m using Python 2.7.) The libraries I’ll be working with are “urllib” and “sgmllib”, so we’ll need to import them.

import urllib, sgmllib

Once we have our libraries imported, we need to create a new class which will house the meat of our screen scraper. I have chosen to call my class “Scraper”, and I’m deriving it from “sgmllib.SGMLParser” so that it inherits the parser’s behavior. Before continuing, I should probably give a quick explanation of the SGML library. SGML stands for “Standard Generalized Markup Language”. The “sgmllib” module is a simple SGML parser, and here we’re using it to parse HTML. A quick Google search shows that sgmllib has been deprecated since Python 2.6 and was completely removed in Python 3.0. But, that’s not going to stop me from using it in this tutorial. :-)

class Scraper(sgmllib.SGMLParser):

The first thing we need to do in our Scraper class is to initialize the class and its parent class, sgmllib.SGMLParser. In our init method, we will also declare a list to store our links in.

    def __init__(self, verbose = 0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.links = []

Next, we need to define a method that accepts the HTML as a string and passes it to the “feed” method our class inherited from SGMLParser. Once we’ve fed the HTML to the parser, we need to close it so that any buffered data gets processed.

    def scrape_links(self, s):
        self.feed(s)
        self.close()

Next, we need to define a method named “start_a”. SGMLParser calls it automatically for every opening “a” tag it encounters and passes in that tag’s attributes as a list of (name, value) pairs. Our method walks through the pairs and checks for any attribute called “href”. Once an “href” has been located, its value is appended to the links list we defined in our init method earlier.

    def start_a(self, attrs):
        for name, value in attrs:
            if name == "href":
                self.links.append(value)
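To make the shape of that attribute list concrete, here is a minimal sketch. Since sgmllib is not available on Python 3, it uses “html.parser.HTMLParser” as a stand-in, which is an assumption on my part and not the library used in this tutorial. Its “handle_starttag” method receives the tag name as an extra argument, but the attributes arrive in the same (name, value) form. The class name “AttrRecorder” is hypothetical.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: records the attrs list seen on each <a> tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, one per attribute.
        if tag == "a":
            self.seen.append(attrs)


recorder = AttrRecorder()
recorder.feed('<a href="/about" class="nav">About</a>')
recorder.close()

print(recorder.seen)
# prints [[('href', '/about'), ('class', 'nav')]]
```

Each tuple pairs an attribute name with its value, which is why the loop above can unpack them as “name, value”.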

The only thing you have left to add to your Scraper class is a getter that returns the links list.

    def get_links(self):
        return self.links

That’s everything you need for your Scraper class. The only thing left to do is test it. Call the “urlopen” function in the “urllib” module, passing it the URL of the page you want to scrape links from. It returns a file-like object; call its “read” method to get the page content as a string. Then create a new instance of your scraper and call its “scrape_links” method, passing it that string. At this point, your scraper object will hold a list of all the “href” values found in the page’s anchor tags. Now, it’s just a matter of using that list for whatever devious plans you might have for it. To keep things simple, I just printed my list to make sure everything worked as expected.
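One thing to keep in mind when using the list: the scraper stores each “href” exactly as it appears in the page, so some entries may be relative paths or bare fragments. The sketch below shows one possible way to turn them into absolute URLs with “urljoin”. It is written for Python 3, where the function lives in “urllib.parse” (on Python 2.7 it lives in the “urlparse” module). The helper name “absolutize” and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Hypothetical helper: resolve each scraped href against the page URL."""
    return [urljoin(base_url, link) for link in links]


# Illustrative values only; these are not taken from a real page.
base = "http://www.example.com/blog/index.html"
raw_links = ["/about", "post-1.html", "http://other.example.org/", "#top"]

absolute_links = absolutize(base, raw_links)
for link in absolute_links:
    print(link)
```

Links that are already absolute pass through unchanged, so it is safe to run the whole list through the helper.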

Below is the complete source code including a test case at the bottom.

import urllib, sgmllib

class Scraper(sgmllib.SGMLParser):
    def __init__(self, verbose = 0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.links = []
    
    def scrape_links(self, s):
        self.feed(s)
        self.close()

    def start_a(self, attrs):
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

    def get_links(self):
        return self.links


f = urllib.urlopen('http://www.prodigyproductionsllc.com')
s = f.read()

scraper = Scraper()
scraper.scrape_links(s)

print scraper.get_links()
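Because sgmllib no longer exists in Python 3, the code above only runs on Python 2. For readers on Python 3, here is a rough sketch of the same idea using “html.parser.HTMLParser” and “urllib.request.urlopen”. This is my own adaptation under stated assumptions, not a drop-in replacement: the class name “LinkScraper” and the function “fetch_links” are hypothetical, and the download step assumes the page is UTF-8 encoded. The demonstration at the bottom feeds a literal HTML string so that it runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkScraper(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def scrape_links(self, html):
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        # HTMLParser has no per-tag start_a hook, so filter on the tag name.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>.
            if name == "href" and value is not None:
                self.links.append(value)

    def get_links(self):
        return self.links


def fetch_links(url):
    """Download a page and return its links (assumes UTF-8 content)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    scraper = LinkScraper()
    scraper.scrape_links(html)
    return scraper.get_links()


# Offline demonstration with a small, made-up HTML fragment.
sample = (
    '<p><a href="http://example.com/">One</a> '
    '<a name="top">no href</a> '
    '<a href="/two">Two</a></p>'
)
scraper = LinkScraper()
scraper.scrape_links(sample)
print(scraper.get_links())
# prints ['http://example.com/', '/two']
```

To scrape a live page instead, call “fetch_links” with the URL you are interested in.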

