What was the "primary productive force" again?

Background

A friend asked me to help scrape the wallpapers off a site (simpledesktops.com).
A quick look showed the page URLs aren't sequential, so the naive crawlers I'd written before wouldn't cut it.
Then I remembered there's a crawler framework called Scrapy that's supposed to be very handy, so this was a good excuse to learn it.

Process

Installation and project structure

sudo pip install scrapy
scrapy startproject <project name>
scrapy crawl <spider name>
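
With the names used later in this post (project WallPaper, spider Wallpaper), that works out to something like:

scrapy startproject WallPaper
cd WallPaper
scrapy crawl Wallpaper

Note that scrapy crawl has to be run from inside the project directory, and the spider name is whatever the spider's name attribute says, not the file name.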

The directory structure looks like this:

WallPaper/
    scrapy.cfg
    WallPaper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            WallpaperSpider.py
            ...

items.py defines the data model for the scraped items
pipelines.py handles storing and processing the scraped data
settings.py holds the project configuration (the bits this post relies on are sketched after pipelines.py below)
WallpaperSpider.py is the actual spider implementation

Analysis

The main task is working out the download link and the next-page link. A quick look at the page markup turns up the two pieces shown below:

(figure: download_link, the relative download URL taken from the <h2> inside div.desktop)

(figure: next_link, the relative next-page URL taken from <a class="forward">)
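
Before writing the spider, it's handy to sanity-check selectors like these in the scrapy shell; a quick sketch using the same XPath expressions as the spider below:

scrapy shell "http://simpledesktops.com/browse/desktops/2011/jul/14/cassette/"

# inside the shell:
response.xpath('//div[@class="desktop"]//h2//@href').extract_first()   # relative download link
response.xpath('//a[@class="forward"]//@href').extract_first()         # relative next-page link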

Implementation

items.py

# -*- coding: utf-8 -*-
import scrapy

class WallpaperItem(scrapy.Item):
    title = scrapy.Field()        # wallpaper title (left unused by the spider below)
    image_urls = scrapy.Field()   # download URL(s) of the wallpaper

WallpaperSpider.py

# -*- coding: utf-8 -*-
import scrapy
from WallPaper.items import WallpaperItem

class WallpaperSpider(scrapy.Spider):

    name = 'Wallpaper'  # spider name used by `scrapy crawl`

    allowed_domains = ["simpledesktops.com", "static.simpledesktops.com"]
    start_urls = ["http://simpledesktops.com/browse/desktops/2011/jul/14/cassette/"]

    def parse(self, response):

        host = u"http://simpledesktops.com"
        item = WallpaperItem()

        # item['title'] = response.xpath('//div[@class="desktop"]//img//@title').extract()

        # the download link is relative, so prefix it with the host
        download = response.xpath('//div[@class="desktop"]//h2//@href').extract()
        if len(download) > 0:
            download[0] = host + download[0]

        item['image_urls'] = download

        yield item

        # follow the "forward" link to the next wallpaper, if there is one
        next_href = response.xpath('//a[@class="forward"]//@href').extract_first()
        if next_href:
            yield scrapy.Request(host + next_href, callback=self.parse)
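
Normally the crawl is kicked off with scrapy crawl Wallpaper from the project root. As a side note, the spider can also be driven from a plain Python script with CrawlerProcess; a minimal sketch, assuming the script sits next to scrapy.cfg and the spider module is named as above:

# run.py - hypothetical standalone runner, placed next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from WallPaper.spiders.WallpaperSpider import WallpaperSpider

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl(WallpaperSpider)
process.start()  # blocks until the crawl is finished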

pipelines.py

# -*- coding: utf-8 -*-
import os
import urllib

from WallPaper import settings

class WallpaperPipeline(object):

    def process_item(self, item, spider):

        # one sub-directory per spider under IMAGES_STORE
        store_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)
        if not os.path.exists(store_path):
            os.makedirs(store_path)

        for image_url in item['image_urls']:

            # derive a file name from the last '='-separated piece of the URL
            url_name = image_url.split('=')
            file_name = url_name[-1] + '.png'
            file_path = '%s/%s' % (store_path, file_name)

            # skip wallpapers that have already been downloaded
            if os.path.exists(file_path):
                continue

            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # Python 2 urllib
                file_writer.write(conn.read())

        return item
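
For this pipeline to run at all, it has to be registered in settings.py, and IMAGES_STORE (which the pipeline reads) has to point somewhere. The post doesn't show settings.py, so this is just a minimal sketch with an assumed storage path:

# settings.py (relevant parts only; the IMAGES_STORE path is an assumption)
BOT_NAME = 'WallPaper'

SPIDER_MODULES = ['WallPaper.spiders']
NEWSPIDER_MODULE = 'WallPaper.spiders'

# directory the custom pipeline saves wallpapers into
IMAGES_STORE = '/tmp/wallpapers'

# enable the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'WallPaper.pipelines.WallpaperPipeline': 300,
}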

Results

The wallpapers turned out to be pretty nice; it grabbed about 500 of them in half an hour.
