Python concurrent crawling demo

Author: 二营长 | Published: 2016-12-20

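Today I had a new requirement: crawling roughly 200,000 pages. In the past I mostly used the threading library. I have also used the Pool thread pool, just less often. In my experience the thread pool has some real advantages over threading. First, with threading I had to make sure the number of source URLs was evenly divisible by the number of threads to avoid exceptions. For example, with 10,000 URLs, setting the thread count to 5 or 10 is fine, but setting it to 3 or 7 leads to errors such as a list index running out of range (I forget the exact name). The thread pool has no such concern. Second, thread pool code is more concise. So today I am posting a small demo. If you have criticism, please go easy on me.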
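To illustrate the point about thread counts, here is a minimal Python 3 sketch. It assumes a stand-in worker named `square` (hypothetical, it replaces the real "fetch one URL" step) and a helper named `run` (also hypothetical). It only shows that `Pool.map` from `multiprocessing.dummy` accepts an item count that is not a multiple of the worker count.

```python
from multiprocessing.dummy import Pool as ThreadPool


def square(n):
    # Stand-in for "fetch one URL"; a real worker would do network I/O here.
    return n * n


def run(items, workers):
    # The pool hands items to whichever worker is free, so len(items)
    # does not have to be a multiple of the number of workers.
    pool = ThreadPool(workers)
    try:
        return pool.map(square, items)
    finally:
        pool.close()
        pool.join()


# 10 items with 7 workers: 10 is not divisible by 7, and nothing breaks.
results = run(list(range(10)), 7)
print(results)
```

Because `map` keeps the input order, the result list lines up with the input list no matter which thread handled which item.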

Here is the code:

# coding: utf-8
author = 'heziliang'
import socket
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
from bs4 import BeautifulSoup
socket.setdefaulttimeout(3)
# The modules above are everything the script needs.
fuu_list = []      # source URLs, filled in by main()
totalThread = 10   # number of worker threads in the pool

def getFuck(i):
    # Request the page, parse it, and write the matching links to a local txt file.
    try:
        # The argument i is one of the values in fuuNum_list, i.e. an index into fuu_list.
        r = requests.get(fuu_list[i], timeout=3)
        s = BeautifulSoup(r.content, "lxml")
        for a in s.find('div', attrs={'id': 'J_selector'}).find_all('a'):
            href = a.get('href') or ''
            if 'javascript' not in href and 'cid' in href:
                print(href, a.get_text())
                with mutex:  # acquire the lock; it is released when the block exits
                    f.write(a.get_text() + ' , ' + href + '\n')
    except Exception as e:
        # A failed request or a page without the expected div ends up here.
        print(e)

def main():
    fuuNum = 0
    fuuNum_list = []
    for url in open('urlid_duo.txt'):
        url = url.strip()
        fuu_list.append(url)
        # Store the natural numbers 0, 1, 2, ... in fuuNum_list,
        # while the URLs read from the source file go into fuu_list.
        # The first value in fuuNum_list, 0, corresponds to the first entry of fuu_list,
        # i.e. the first URL in the source file.
        # In other words, each value in fuuNum_list is an index into fuu_list.
        # I have a feeling this part could be simplified. If any expert sees how, please contact me.
        fuuNum_list.append(fuuNum)
        fuuNum += 1
    pool = ThreadPool(totalThread)
    pool.map(getFuck, fuuNum_list)  # map passes each value (a natural number) in fuuNum_list to getFuck
    pool.close()
    pool.join()

f = open('result_duo.txt', 'a+', encoding='utf-8')
mutex = threading.Lock()  # lock that serializes writes to the output file
main()
f.close()
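On the question raised in the comments about simplifying the index list: one possible approach, sketched here in Python 3 under stated assumptions, is to let `map` iterate over the URLs directly and have each worker return its results instead of writing to a shared file. The names `fetch_links`, `crawl`, and `format_lines` are hypothetical, and `fetch_links` is a network-free stand-in for the request-and-parse step, so the sketch runs without touching any real site.

```python
from multiprocessing.dummy import Pool as ThreadPool


def fetch_links(url):
    # Hypothetical stand-in for the request-and-parse step in getFuck.
    # It returns a list of (text, href) pairs instead of writing to a file.
    return [('link for ' + url, url + '?cid=1')]


def crawl(urls, worker, threads=10):
    # map() passes each URL straight to the worker, so no separate list of
    # indexes is needed, and results come back in the same order as urls.
    pool = ThreadPool(threads)
    try:
        return pool.map(worker, urls)
    finally:
        pool.close()
        pool.join()


def format_lines(results):
    # Flatten the per-page results into 'text , href' output lines.
    return [text + ' , ' + href + '\n' for page in results for text, href in page]


urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
lines = format_lines(crawl(urls, fetch_links, threads=2))
print(''.join(lines), end='')
```

In a real run the URL list would come from something like `[line.strip() for line in open('urlid_duo.txt')]`, and the main thread would write `lines` to the result file in one go, so no lock is required. The trade-off is that all results stay in memory until `map` returns, which may matter for a crawl of around 200,000 pages.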

Updated: 2016-12-20, 10 p.m.

Because some company data is involved, the source URLs are not published here.

Email: techseo.cn@gmail.com. Feel free to get in touch.
