⚝ Damien Accorsi
⚝ Freelance software architect and backend developer
⚝ Creator of Tracim — http://tracim.fr
⚝ LinuxFR contributor — http://linuxfr.org/users/lebouquetin
« An open source and collaborative framework
for extracting the data you need from websites.
In a fast, simple, yet extensible way. »
Example: price comparators, classified-ads aggregators...
Example: building statistics about a website...
Example: processing remote CSV or XML data...
« write the rules to extract the data and let Scrapy do the rest »
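In practice the "rules" are mostly XPath (or CSS) selectors, and Scrapy's interactive shell is a convenient way to try them out before writing a spider. For instance, against the AFPY page used in the first example below (output abridged; the first title comes from the HTML excerpt further down):

$ scrapy shell 'http://www.afpy.org/jobs'
>>> response.xpath('//div[@class="jobitem"]/a/h2[@class="tileHeadline"]/text()').extract()
[u'Full Stack Developper - Python/Django + Angular', ...]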
Three examples to illustrate how Scrapy works:
AFPY — job offers
LinuxFR — news posts
APEC — « python » salaries
Extracting the list of job offers from the AFPY website — http://www.afpy.org/jobs/
<div class="jobitem">
  <a href="http://www.afpy.org/jobs/full-stack-developper-python-django-angular">
    <h2 class="tileHeadline">Full Stack Developper - Python/Django + Angular</h2>
  </a>
  <span class="discreet">
    Créé le 26/01/2015 21:28
    par <a href="http://www.labadens.eu/w">Labadens</a>
  </span>
  <p>If you are a Python Developer with Front-end development experience, please read on!</p>
  <div class="portletMore">
    <a href="http://www.afpy.org/jobs/full-stack-developper-python-django-angular">Lire l'offre</a>
  </div>
</div>
...
<div class="listingBar">
  <span class="next">
    <a href="http://www.afpy.org/jobs?b_start:int=10">
      10 éléments suivants »
    </a>
  </span>
  ...
</div>
from scrapy import Spider, Item, Field, Request

class Job(Item):
    title = Field()
    url = Field()

class AfpyJobSpider(Spider):
    name = 'afpy_jobs'
    start_urls = ['http://www.afpy.org/jobs']

    def parse(self, response):
        # One <div class="jobitem"> per job offer
        for job in response.xpath('//div[@class="jobitem"]'):
            title_xpath = './a/h2[@class="tileHeadline"]/text()'
            url_xpath = './a/@href'
            title = job.xpath(title_xpath)[0].extract()
            url = job.xpath(url_xpath)[0].extract()
            yield Job(title=title, url=url)

        # Follow the "10 éléments suivants" link; it is absent on the last page
        next_page_url_xpath = '//div[@class="listingBar"]/span[@class="next"]/a/@href'
        next_page_url = response.xpath(next_page_url_xpath).extract()
        if next_page_url:
            yield Request(url=next_page_url[0])
$ scrapy runspider afpy_spider.py -o afpy_jobs.xml
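With the .xml extension, Scrapy selects its XML feed exporter. Assuming the exporter's default layout (an <items> root, one <item> per Job), the output looks roughly like this, with values taken from the HTML excerpt above:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <title>Full Stack Developper - Python/Django + Angular</title>
    <url>http://www.afpy.org/jobs/full-stack-developper-python-django-angular</url>
  </item>
  ...
</items>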
Listing the news posts of LinuxFR — http://www.linuxfr.org
taking the visited / unvisited status into account, which requires logging in through the site's form:
<form id="new_account" class="new_account" action="/compte/connexion" accept-charset="UTF-8" method="post">
  <input name="utf8" type="hidden" value="✓" />
  <p>
    <label for="account_login">Identifiant</label>
    <input id="account_login" required="required" placeholder="Identifiant" size="20" type="text" name="account[login]" />
  </p>
  <p>
    <label for="account_password">Mot de passe</label>
    <input id="account_password" required="required" placeholder="Mot de passe" size="20" type="password" name="account[password]" />
  </p>
  <p>
    <input name="account[remember_me]" type="hidden" value="0" />
    <input id="account_remember_me" type="checkbox" value="1" name="account[remember_me]" />
    <label for="account_remember_me">Connexion automatique</label>
  </p>
  <p>
    <input type="submit" name="commit" value="Se connecter" id="account_submit" />
  </p>
</form>
import time
from scrapy import Spider, Item, Field, Request, FormRequest, log

class LinuxNewsPost(Item):
    title = Field()
    url = Field()
    score = Field()
    pub_date = Field()
    comment_nb = Field()
    visited = Field()

class LinuxfrNewsSpider(Spider):
    name = 'linuxfr_news_spider'

    def __init__(self, login, password, page=0, *args, **kwargs):
        super(LinuxfrNewsSpider, self).__init__(*args, **kwargs)
        # Start on the login page; parse() will submit the form
        self.start_urls = ['https://linuxfr.org/compte/connexion']
        self.start_page_id = page
        self.login = login
        self.password = password

    def parse(self, response):
        # Pre-fill the login form found in the page, overriding
        # only the login and password fields
        return FormRequest.from_response(
            response,
            formxpath='//form[@id="new_account"]',
            formdata={
                'account[login]': self.login,
                'account[password]': self.password,
            },
            callback=self.after_login,
        )
    def after_login(self, response):
        if "Identifiant ou mot de passe invalide" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # The sidebar shows the name of the logged-in user
        username_xpath = '//aside[@id="sidebar"]/div[@class="login box"]/h1/a/text()'
        print 'IDENTIFIED as ' + response.xpath(username_xpath)[0].extract()
        print 'waiting 5 seconds before continuing...'
        time.sleep(5)
        yield Request('https://linuxfr.org/news?page=%s' % self.start_page_id,
                      self.authenticated_parse)
    def authenticated_parse(self, response):
        # self.settings['DOWNLOAD_DELAY'] = 2
        articles = response.xpath('//article')
        for article in articles:
            title = article.xpath('./header/h1/a/text()')[0].extract()
            path = article.xpath('./header/h1/a/@href')[0].extract()
            pub_date = article.xpath('./header/div[@class="meta"]/time/@datetime')[0].extract()
            score = article.xpath('.//figure[@class="score"]/text()')[0].extract()
            # Keep only the digits of the comment count
            comment_nb = article.xpath('./footer//span[@class="nb_comments"]/text()').re(r'\d+')[0]
            # The text after the comma carries the visited / unvisited status
            visited = article.xpath('./footer//span[@class="visit"]/text()').re(r', (.*)')[0]
            yield LinuxNewsPost(
                title=title,
                url='https://linuxfr.org' + path,
                score=score,
                pub_date=pub_date,
                comment_nb=int(comment_nb),
                visited=visited,
            )
        # Follow pagination until the "next" link disappears
        next_page_xpath = '//nav[@class="toolbox"]/nav[@class="pagination"]/span[@class="next"]/a/@href'
        next_page_path = response.xpath(next_page_xpath).extract()
        if next_page_path:
            yield Request('https://linuxfr.org' + next_page_path[0],
                          self.authenticated_parse)
$ scrapy runspider linuxfr_news_spider.py -a login=pyuggre -a password="mon_mot_de_passe" -a page=0
Each -a name=value option is passed as a keyword argument to the spider's __init__(), which is how login, password and page reach the spider.
Which gives, as CSV (slightly post-processed)...
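The result table is not reproduced here; the exported columns are simply the LinuxNewsPost fields (Scrapy's CSV exporter does not guarantee their order):

title,url,score,pub_date,comment_nb,visited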
What is the average salary of « python » job offers on the APEC website?
Challenges:
As a side issue: coping with the technical debt of the APEC website
<table class="noFieldsTable">
  <tr>
    <th>Référence Apec :</th>
    <td>125496545W-5417-6876</td>
  </tr>
  ...
  <tr>
    <th>Lieu :</th>
    <td>BOULOGNE</td>
  </tr>
  <tr>
    <th>Salaire :</th>
    <td>De 45000 à 50000 EUR par an</td>
  </tr>
  ...
</table>
But the text may also be absent, or take forms such as "aux alentours de 40K€" ("around 40K€") or "à discuter" ("to be discussed")...
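A cleaning step is therefore needed before any averaging. A minimal sketch (parse_salary is a hypothetical helper; it only handles the forms quoted above):

import re

def parse_salary(text):
    """Best-effort yearly salary range in euros, or None when the
    offer gives no usable figure."""
    if not text:
        return None
    figures = [int(f) for f in re.findall(r'\d+', text)]
    if not figures:
        return None                              # e.g. "à discuter"
    if 'K' in text.upper():
        figures = [f * 1000 for f in figures]    # e.g. "aux alentours de 40K€"
    return min(figures), max(figures)

>>> parse_salary('De 45000 à 50000 EUR par an')
(45000, 50000)

The average can then be computed over the midpoint of each returned range.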
Technical solution:
Code architecture:
Get the Python source code from GitHub