[Python] A Top 10 Crawler for 小百合 (Lily BBS)

Programming Diary 2024-03-06 10:00:00

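Over the National Day holiday I read several blog posts at home about writing web crawlers in Python. I have been learning Python on and off for a few months, yet I had never written a single proper program and my programming had not improved, which is a poor return on the time I spent. I often know what the right thing to do is, but I am too lazy to stick with one thing and do it well. That is probably the gap between me and the people who excel. This month I will try to write more code and finish this series!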

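The links below are the posts I read over the holiday. I learned a lot from them!

Teach Yourself Web Crawler Development with Python 3 from Scratch (零基础自学用Python 3开发网络爬虫): a well-written post, easy to follow and nicely phrased
Zhihu answer on how to write a crawler in Python (1)
Zhihu answer on how to write a crawler in Python (2)
Regular Expressions in 30 Minutes (正则表达式30分钟入门)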

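Writing a crawler for the 小百合 (Lily BBS) Top 10 takes two steps:

1. Visit the Top 10 page and get the Top 10 information;

2. Crawl the content of each Top 10 thread.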

1. Visit the Top 10 page and get the Top 10 information

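When a browser visits a web page, it sends an HTTP request to the server. The server receives the request and sends the requested content back, and the browser renders that response for the user.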

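To visit the Top 10 page from Python, we have to imitate what the browser does and send the HTTP request to the server ourselves. Python's urllib2 module provides this: urllib2.urlopen(url) can open several kinds of URL, such as http://www.baidu.com or ftp://cs.nju.edu.cn.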

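To pass as a browser, we add a User-Agent header to the request, declaring ourselves to be a browser :)

Without it, urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of Python).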
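The header behaviour can be checked without touching the network. The sketch below is written for Python 3, where urllib2's Request and opener classes live in urllib.request; the helper names build_request and default_user_agent are made up for this illustration, and the User-Agent string is the one used in this post's spider class.

```python
import sys
import urllib.request

# Browser-like User-Agent string, the same value the spider in this post uses.
BROWSER_UA = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def default_user_agent():
    """Return the User-Agent an opener sends when we do not set one."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)['User-agent']


req = build_request('http://bbs.nju.edu.cn/bbstop10')
# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
print(default_user_agent())
```

Nothing is sent here: the request object is only constructed and inspected, so the snippet is safe to run offline.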

 1 def get_top10article(self):
 2     top10_url = 'http://bbs.nju.edu.cn/bbstop10'
 3     bbs_url = 'http://bbs.nju.edu.cn/'
 4     # Send the request with the browser-like headers and read the page
 5     req = urllib2.Request(top10_url, headers=self.headers)
 6     response = urllib2.urlopen(req)
 7     top10_page = response.read()
 8
 9     # One match per Top 10 row; the first seven groups are rank, board link,
10     # board, thread link, title, author link and author
11     pattern_str = r'<tr.*?bgcolor=.*?><td>(.*?)<td><a.*?href=(.*?)>(.*?)</a><td><a.*?href="(.*?)">(.*?)\n</a><td><a.*?href=(.*?)>(.*?)</a><td>(.*?)\n'
12     pattern = re.compile(pattern_str)
13     # findall returns one tuple of captured groups per matching row
14     top10_retrive_infos = pattern.findall(top10_page)
15     for info in top10_retrive_infos:
16         article = Article(info[0], bbs_url + info[1], info[2], bbs_url + info[3], info[4], bbs_url + info[5], info[6])
17         self.top10.append(article)

Lines 5-7 of the code above send the HTTP request to 小百合 and read the response. Lines 11-17 use a regular expression to capture each thread's details and store them in the list top10 (line 17). The links in the page are relative, so bbs_url is prepended to each of them.
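The regular-expression step can be tried offline. The sketch below, in Python 3, runs the same pattern on a small made-up page: SAMPLE_PAGE is only a guess at the row layout that the pattern implies, not a copy of the real Top 10 page, and TopEntry and parse_top10 are names invented for this illustration.

```python
import re
from collections import namedtuple

BBS_URL = 'http://bbs.nju.edu.cn/'

# Stand-in for the post's Article class, with the same seven fields.
TopEntry = namedtuple(
    'TopEntry',
    'rank board_link board article_link title author_link author')

# The pattern from the post, written as raw strings split over three lines.
TOP10_PATTERN = re.compile(
    r'<tr.*?bgcolor=.*?><td>(.*?)<td><a.*?href=(.*?)>(.*?)</a>'
    r'<td><a.*?href="(.*?)">(.*?)\n</a>'
    r'<td><a.*?href=(.*?)>(.*?)</a><td>(.*?)\n')

# Hypothetical markup: two rows shaped the way the pattern expects.
SAMPLE_PAGE = (
    '<table>\n'
    '<tr bgcolor=#f0f0f0><td>1<td><a href=bbsdoc?board=Pictures>Pictures</a>'
    '<td><a href="bbstcon?board=Pictures&file=M.1.A">A sample title\n</a>'
    '<td><a href=bbsqry?userid=alice>alice</a><td>42\n'
    '<tr bgcolor=#ffffff><td>2<td><a href=bbsdoc?board=NJUExpress>NJUExpress</a>'
    '<td><a href="bbstcon?board=NJUExpress&file=M.2.A">Another sample title\n</a>'
    '<td><a href=bbsqry?userid=bob>bob</a><td>7\n'
    '</table>\n')


def parse_top10(page, base_url=BBS_URL):
    """Turn every matching row into a TopEntry with absolute links."""
    entries = []
    for groups in TOP10_PATTERN.findall(page):
        # The pattern captures eight groups; only the first seven are used.
        rank, board_link, board, article_link, title, author_link, author = groups[:7]
        entries.append(TopEntry(rank, base_url + board_link, board,
                                base_url + article_link, title,
                                base_url + author_link, author))
    return entries


for entry in parse_top10(SAMPLE_PAGE):
    print(entry.rank, entry.board, entry.title, entry.article_link)
```

A pattern this tightly bound to the markup stops matching as soon as the page layout changes, so it is worth checking that the number of matches is what you expect before using the result.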

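2. Crawl the content of each Top 10 thread

Using the thread information collected in step 1, crawl the content of every reply in each thread. Again a regular expression extracts the main content of each post and leaves out the unneeded HTML tags.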

 1 def get_article(self, url):
 2     # Appending '&start=-1' shows every post in the thread
 3     all_article_url = url + '&start=-1'
 4     req = urllib2.Request(all_article_url, headers=self.headers)
 5     response = urllib2.urlopen(req)
 6     article_content = response.read()
 7
 8     # Use a regular expression to find the content of every reply
 9     pattern_str = r'<textarea.*?id=.*?class=hide>(.*?)--\n.*?</textarea>'
10     pattern = re.compile(pattern_str, re.S)
11     all_replies_content = pattern.findall(article_content)
12
13     # Keep a copy on disk; the with block closes the file when done
14     with open('all_replies_content.txt', 'w') as f:
15         for reply in all_replies_content:
16             f.write(reply)
17     return all_replies_content

Lines 3-6 fetch every post in the thread, and lines 9-11 use a regular expression to extract the content of each reply. The re.S flag lets . match newlines, so a reply that spans several lines is captured up to the first -- followed by a newline.
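The effect of re.S on the reply pattern is easy to see on a made-up thread page. In the Python 3 sketch below, SAMPLE_THREAD is hypothetical markup shaped the way the pattern expects (the real page may differ), and extract_replies is a name invented for this illustration.

```python
import re

REPLY_RE_TEXT = r'<textarea.*?id=.*?class=hide>(.*?)--\n.*?</textarea>'

# With re.S (DOTALL), '.' also matches newlines, so a reply may span lines.
REPLY_PATTERN = re.compile(REPLY_RE_TEXT, re.S)

# Hypothetical markup: two posts, each ending with '--' and a signature.
SAMPLE_THREAD = (
    '<textarea name=t0 id=t0 class=hide>First post, line one\n'
    'line two\n'
    '--\n'
    'signature of alice\n'
    '</textarea>\n'
    '<textarea name=t1 id=t1 class=hide>A short reply\n'
    '--\n'
    'signature of bob\n'
    '</textarea>\n')


def extract_replies(page):
    """Return the text of every reply, without the trailing signature."""
    return REPLY_PATTERN.findall(page)


replies = extract_replies(SAMPLE_THREAD)
for reply in replies:
    print(repr(reply))

# The same pattern without re.S finds nothing in this sample, because each
# reply body is separated from its '--' line by a newline.
print(re.findall(REPLY_RE_TEXT, SAMPLE_THREAD))
```

One caveat of this approach: because the capture group is lazy but may cross newlines, a post that contains no -- line would be merged with the following post up to that post's -- line.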

The complete code is as follows:

# -*- coding: cp936 -*-
import urllib2
import re

# Custom thread class with 7 fields: Top 10 rank, board link, board name,
# thread link, thread title, author link and author
class Article:
    def __init__(self, rank, board_link, board, article_link, title, author_link, author):
        self.rank = rank
        self.board_link = board_link
        self.board = board
        self.article_link = article_link
        self.title = title
        self.author_link = author_link
        self.author = author

class Lily_Top10_Spider:
    def __init__(self):
        self.top10 = []
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}

    # Fetch the Top 10 information, store it in self.top10 and print it
    def get_top10article(self):
        top10_url = 'http://bbs.nju.edu.cn/bbstop10'
        bbs_url = 'http://bbs.nju.edu.cn/'

        req = urllib2.Request(top10_url, headers=self.headers)
        response = urllib2.urlopen(req)
        top10_page = response.read()

        pattern_str = r'<tr.*?bgcolor=.*?><td>(.*?)<td><a.*?href=(.*?)>(.*?)</a><td><a.*?href="(.*?)">(.*?)\n</a><td><a.*?href=(.*?)>(.*?)</a><td>(.*?)\n'
        pattern = re.compile(pattern_str)
        top10_retrive_infos = pattern.findall(top10_page)
        for info in top10_retrive_infos:
            # Links in the page are relative, so prepend the site root
            article = Article(info[0], bbs_url + info[1], info[2], bbs_url + info[3], info[4], bbs_url + info[5], info[6])
            self.top10.append(article)

        for a in self.top10:
            print a.title, ' ', a.author, ' ', a.board, ' ', a.article_link

    # Fetch one thread and return the content of all its posts
    def get_article(self, url):
        # Appending '&start=-1' shows every post in the thread
        all_article_url = url + '&start=-1'
        req = urllib2.Request(all_article_url, headers=self.headers)
        response = urllib2.urlopen(req)
        article_content = response.read()

        # Use a regular expression to find the content of every reply
        pattern_str = r'<textarea.*?id=.*?class=hide>(.*?)--\n.*?</textarea>'
        pattern = re.compile(pattern_str, re.S)
        all_replies_content = pattern.findall(article_content)

        # Keep a copy on disk; the with block closes the file when done
        with open('all_replies_content.txt', 'w') as f:
            for reply in all_replies_content:
                f.write(reply)
        return all_replies_content


ls = Lily_Top10_Spider()
ls.get_top10article()

# top10[0] is the first-ranked thread, matching the '#1' label below
print '#1 article content:'
article_content = ls.get_article(ls.top10[0].article_link)
for s in article_content:
    print s
print 'print end.'

References: HOWTO Fetch Internet Resources Using urllib2
Python Crawler Beginner Tutorial (Python爬虫入门教程)

Scrapy, a good existing Python crawler framework

Official site: http://scrapy.org/
GitHub: https://github.com/scrapy/scrapy
