上QQ阅读APP看书,第一时间看更新
How to do it...
The script for this recipe is 01/03_events_with_scrapy.py. The following is the code:
import scrapy
from scrapy.crawler import CrawlerProcess
class PythonEventsSpider(scrapy.Spider):
name = 'pythoneventsspider'
start_urls = ['https://www.python.org/events/python-events/',]
found_events = []
def parse(self, response):
for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
event_details = dict()
event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
event_details['time'] = event.xpath('p/time/text()').extract_first()
self.found_events.append(event_details)
if __name__ == "__main__":
process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
process.crawl(PythonEventsSpider)
spider = next(iter(process.crawlers)).spider
process.start()
for event in spider.found_events: print(event)
The following runs the script and shows the output:
~ $ python 03_events_with_scrapy.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. '}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. '}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. '}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. '}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. '}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. '}
{'name': 'PyCon Pakistan', 'location': 'Lahore, Pakistan', 'time': '16 Dec. – 17 Dec. '}
{'name': 'PyCon Indonesia 2017', 'location': 'Surabaya, Indonesia', 'time': '09 Dec. – 10 Dec. '}
The same result but with another tool. Let's go take a quick review of how this works.