Monday, December 18, 2017

Web scraping with Python tutorial, part 2 -- Scrapy

This is the second part of this tutorial on web scraping with Python. In the first part we looked at scraping static content from a particular website, using the structure of that site's HTML to access the desired information. In this part, we will look at web scrapers that could, in principle, crawl the entire internet.

It would be quite simple to extend the scraper we already developed to search the entire internet rather than just one website. All we need to do is collect the links on each page we process and then follow them to the next pages. If we build up a list of links this way, it is very unlikely that we will ever run out of pages to visit.
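
Just to make the idea concrete, here is a rough sketch of such a naive crawler. It uses the requests library and a simple regular expression to pull links out of the HTML (both are assumptions on my part, not the approach from part 1), and it has no politeness delay, no robots.txt handling and no error handling, so treat it purely as an illustration.

import re
import requests

def naive_crawler(start_url, max_pages=10):
    # Toy breadth-first crawler: fetch a page, pull out its links, queue them.
    to_visit = [start_url]
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url).text
        # Very naive link extraction: absolute links inside href="..." attributes.
        links = re.findall(r'href="(https?://[^"]+)"', html)
        to_visit.extend(links)
        print(url, "->", len(links), "links found")

naive_crawler("https://stackoverflow.com/")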

Rather than building such a scraper ourselves, we are going to use Scrapy, a Python package built exactly for this kind of large-scale web scraping. Install Scrapy with

    pip install scrapy

and create a new project with

    scrapy startproject email_scraper
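
This creates a small project skeleton. On a recent Scrapy version the layout looks roughly like this (the exact files may differ slightly between versions):

    email_scraper/
        scrapy.cfg
        email_scraper/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py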

Now there should be a folder called email_scraper. Our own scraping code goes into the spiders subfolder of that project. I wrote a short spider there called email_spider.py:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re
# scrapy crawl email_spider
def write_to_file(emails):
    # Append every address we found to a CSV file, one address per line.
    # The with-statement closes the file for us.
    with open("list_of_emails.csv", "a") as myfile:
        for email in emails:
            myfile.write("%s\n" % email)
    return

class PureEvil(CrawlSpider):
    name = "email_spider"
    download_delay = 1.0                         # wait a second between requests
    allowed_domains = ['stackoverflow.com']      # never leave this domain
    start_urls = ['https://stackoverflow.com/']
    rules = [Rule(
        LinkExtractor(allow=".*"),               # follow every link we find
        callback='parse_page',
        follow=True)
    ]

    def parse_page(self, response):
        # Crude pattern for anything that looks like an email address.
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body.decode('utf-8'))
        if emails:
            write_to_file(emails)
        return
This spider is named email_spider and if you go through the code you can see why: it searches the source of every page it visits for email addresses and appends them to a CSV file. It is restricted to the domain stackoverflow.com and starts at the front page. From there it collects all the links it finds and follows them automatically until it has crawled the entire domain.
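
To try the spider, change into the project folder and start the crawl by its name (the same command as in the comment at the top of email_spider.py):

    scrapy crawl email_spider

After a while, list_of_emails.csv will start to fill up with the addresses the spider finds.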

You can see how easy it is to write a scraper with Scrapy: just a few lines of code give us a very powerful tool. Let me know if you have comments or questions below.
cheers
Florian
