It would be quite simple to extend the scraper we already developed so that it searches the entire internet rather than just one website. All we need to do is extract the links on each page we process and then follow one of them to the next page. If we keep adding the links we find to a list this way, it is very unlikely that we will ever run out of pages to visit.
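To make the idea concrete, here is a minimal sketch of such a link-queue crawler, using the requests package and a crude regex for links. This is only an illustration of the approach, not the crawler we build below: it ignores robots.txt, relative links and politeness delays.

# Minimal link-queue crawler sketch (illustration only).
import re
import requests

to_visit = ["https://stackoverflow.com/"]   # arbitrary start page
seen = set(to_visit)

while to_visit:
    url = to_visit.pop(0)
    try:
        body = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # collect absolute links found on this page and queue the new ones
    for link in re.findall(r'href="(https?://[^"]+)"', body):
        if link not in seen:
            seen.add(link)
            to_visit.append(link)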
Rather than building such a crawler ourselves, we are going to use Scrapy, a Python package made exactly for this kind of large-scale web scraping. Install Scrapy with
pip install scrapy
and create a new project with
scrapy startproject email_scraper
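This generates a small project skeleton. Depending on your Scrapy version, it should look roughly like this:

email_scraper/
    scrapy.cfg
    email_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py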
Now there should be a folder called email_scraper. Under email_scraper/spiders is where our scraping code goes. I wrote a short spider called email_spider.py:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

# run with: scrapy crawl email_spider


def write_to_file(emails):
    # append every address we found to a CSV file
    # (the with-statement closes the file for us)
    with open("list_of_emails.csv", "a") as myfile:
        for email in emails:
            myfile.write("%s\n" % email)


class PureEvil(CrawlSpider):
    name = "email_spider"
    download_delay = 1.0
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/']
    rules = [Rule(LinkExtractor(allow=".*"),
                  callback='parse_page',
                  follow=True)]

    def parse_page(self, response):
        # crude regex for email addresses, applied to the decoded page body
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body.decode('utf-8'))
        if emails:
            write_to_file(emails)
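To run the spider, change into the project folder and start the crawl (this is the command noted in the comment at the top of the file):

cd email_scraper
scrapy crawl email_spider

If you want the crawler to behave politely, the generated settings.py is the place to tune it. The following are standard Scrapy settings, shown here only as a sketch:

# email_scraper/settings.py
ROBOTSTXT_OBEY = True        # respect robots.txt before fetching a page
DOWNLOAD_DELAY = 1.0         # pause between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # slow down automatically if the server struggles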
You can see how easy it is to write a scraper with Scrapy: just a few lines of code give us a very powerful tool. Let me know if you have comments/questions below.
cheers
Florian