How do you scrape the answers and comments of a Zhihu question?

Programming Diary 2024-09-22 06:50:00

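This post uses the question "为什么中医没有得到外界认可？" ("Why has traditional Chinese medicine not been recognized by the outside world?") as an example of how to scrape the answers to a Zhihu question together with their comments.

Scraping web data usually involves three steps.

Step 1: Page analysis. Identify where the data you want is actually served from and work out the pattern behind those URLs.

Step 2: Scraping. Fetch the data, then clean it and organize it into structured records.

Step 3: Storage. Save the data for later analysis.

Each step is described in detail below.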

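1. Page Analysis

Open the page to be scraped in Chrome:

https://www.zhihu.com/question/370697253

Press F12 to open the developer tools, click "Network", then select the "XHR" filter.

First click the small circle on the left to clear the request list, which makes the relevant requests easier to find, then press F5 to reload the page.

Find the URL in the list that returns the answer data and click it; the "Preview" panel shows the data in JSON format.

Compare the URLs for successive pages of data.

Page 1: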

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default

Page 2:

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default

Page 3:

https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=10&platform=desktop&sort_by=default

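The URLs are identical except for the value of the offset parameter, which starts at 0 and increases by 5 with each page. On the last page, the paging -> is_end property in the JSON is true; on every earlier page it is false.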
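The offset pattern can be sketched in a few lines of Python. This is only an illustration: the helper name answers_page_url is made up for this example, and the long include parameter from the real URLs is left out for brevity, so the generated URLs are shorter than the ones captured in the browser.

```python
from urllib.parse import urlencode

ANSWERS_API = "https://www.zhihu.com/api/v4/questions/{question_id}/answers"


def answers_page_url(question_id, page, limit=5):
    """Return the URL of a zero-based page of answers (include parameter omitted)."""
    query = urlencode({
        "limit": limit,
        "offset": page * limit,  # 0, 5, 10, ... as observed above
        "platform": "desktop",
        "sort_by": "default",
    })
    return ANSWERS_API.format(question_id=question_id) + "?" + query


# Print the URLs for the first three pages.
for page in range(3):
    print(answers_page_url(370697253, page))
```

Only the offset changes from one printed URL to the next, which is the regularity the scraper relies on.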

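That covers the page analysis for the answers. Next, look at the comments on each answer.

Following the same steps as above, find the URL that actually serves the comments.

The URLs for successive pages of data are as follows:

Page 1: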

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=0&status=open

Page 2:

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=20&status=open

Page 3:

https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=40&status=open

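"1014424784" is the ID of the answer; every answer has a different ID. The URLs above all request comments on the same answer and are identical except for the value of the offset parameter, which starts at 0 and increases by 20 with each page. On the last page, the paging -> is_end property in the JSON is true; on every earlier page it is false.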
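The stopping rule for both endpoints (keep raising offset until paging -> is_end is true) can be sketched as a small generator. The names iter_items and fake_fetch are hypothetical, and fake_fetch is a stand-in that serves made-up records from memory instead of calling Zhihu, so the sketch runs without network access.

```python
def iter_items(fetch_page, limit):
    """Yield every item across pages, stopping once paging.is_end is true."""
    offset = 0
    while True:
        page = fetch_page(offset)
        yield from page["data"]
        if page["paging"]["is_end"]:
            break
        offset += limit


# Stand-in for a real HTTP call: 45 made-up comments served 20 at a time.
FAKE_COMMENTS = [{"id": n} for n in range(45)]


def fake_fetch(offset, limit=20):
    chunk = FAKE_COMMENTS[offset:offset + limit]
    is_end = offset + limit >= len(FAKE_COMMENTS)
    return {"data": chunk, "paging": {"is_end": is_end}}


comments = list(iter_items(fake_fetch, limit=20))
print(len(comments))
```

In a real scraper, fetch_page would request the URL for the given offset and return the decoded JSON.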

2. Common Libraries

(1) requests

requests sends HTTP requests and returns the response data.

The official documentation is here:

https://docs.python-requests.org/zh_CN/latest/user/quickstart.html

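(2) json

JSON is a lightweight data-interchange format: a text format that is completely independent of any programming language. Back-end applications usually return their response data as JSON, and Python's built-in json module converts between JSON text and Python objects.

The official documentation is here: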

https://docs.python.org/zh-cn/3.7/library/json.html
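As a minimal sketch of what parsing looks like, the snippet below decodes a hand-written JSON string shaped like the responses described earlier. The field names (paging, is_end, data, id, author, name, comment_count) follow the ones used in this post; the values are made up for illustration.

```python
import json

# A made-up response body with the same shape as the answers API output.
sample = """
{
  "paging": {"is_end": false},
  "data": [
    {"id": 1, "author": {"name": "user_a"}, "comment_count": 2},
    {"id": 2, "author": {"name": "user_b"}, "comment_count": 0}
  ]
}
"""

response = json.loads(sample)
is_end = response["paging"]["is_end"]  # JSON false becomes Python False
names = [item["author"]["name"] for item in response["data"]]
print(is_end, names)
```

After json.loads, the response is an ordinary nested structure of dicts and lists, so fields are read with normal indexing.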

(3) lxml

lxml is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them.

The official documentation is here:

https://lxml.de/index.html

Because space here is limited, separate posts will cover each of these scraping-related libraries.

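3. Complete Code

The GetAnswers function scrapes the answers to the question.

After structuring, each answer record has these fields: the answer ID (answer_id), the author's name (author), the publication time (created_time), and the answer text (content).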
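The flattening of one raw answer into those four fields might look like the sketch below. The helper names format_timestamp and answer_to_row and the sample values are assumptions made for this example, and the utc flag exists only to make the demonstration output independent of the machine's time zone. Stripping HTML tags from the content is left out here.

```python
import time


def format_timestamp(ts, utc=False):
    """Format a Unix timestamp as YYYY-MM-DD HH:MM:SS (local time by default)."""
    parts = time.gmtime(ts) if utc else time.localtime(ts)
    return time.strftime("%Y-%m-%d %H:%M:%S", parts)


def answer_to_row(answer, utc=False):
    """Flatten one raw answer object into [id, created_time, author, content]."""
    return [
        str(answer["id"]),
        format_timestamp(answer["created_time"], utc=utc),
        answer["author"]["name"],
        answer["content"],  # still HTML here; tag stripping is not shown
    ]


# Made-up sample with the same shape as one entry of the "data" array.
sample_answer = {
    "id": 1014424784,
    "created_time": 0,
    "author": {"name": "example_user"},
    "content": "<p>example</p>",
}
print(answer_to_row(sample_answer, utc=True))
```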

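The GetComments function scrapes the comments on each answer.

After structuring, each comment record has these fields: the comment ID (answer_id_comment_id, that is, the answer ID and the comment ID joined by an underscore), the author's name (author), the publication time (created_time), and the comment text (content).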
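The composite ID scheme can be sketched as a small helper. The function name comment_row_id is made up for this example. The optional third part covers replies to a comment, which the code in this post labels by appending the reply's own ID to the parent comment's composite ID.

```python
def comment_row_id(answer_id, comment_id, child_id=None):
    """Join the IDs with underscores: answer_comment or answer_comment_child."""
    parts = [answer_id, comment_id]
    if child_id is not None:
        parts.append(child_id)
    return "_".join(str(part) for part in parts)


print(comment_row_id(1014424784, 7))      # a root comment
print(comment_row_id(1014424784, 7, 9))   # a reply to that comment
```

Because the answer ID is the prefix, every comment row in the CSV can be traced back to the answer it belongs to.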

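All of this data is stored in the file "知乎评论.csv". Note that the Chinese text appears garbled when the file is opened in Excel; for a fix, see the earlier post "如何解决Python3写入CSV出现'gbk' codec can't encode的错误？" (how to resolve the 'gbk' codec can't encode error when writing CSV in Python 3).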
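One common workaround for the Excel display problem, which is not necessarily the approach taken in the post referenced above, is to write the file with the utf-8-sig encoding. It prepends a byte order mark that Excel uses to detect UTF-8. The sketch below writes a small demo file to a temporary directory; the file name and the sample row are made up.

```python
import csv
import os
import tempfile

# Write to a throwaway location so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "zhihu_comments_demo.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_time", "author", "content"])
    writer.writerow(["1", "2021-02-01 12:00:00", "示例用户", "中文内容"])

# The first three bytes of the file are the UTF-8 byte order mark.
with open(path, "rb") as f:
    print(f.read(3))
```

Reading the file back in Python with encoding="utf-8-sig" removes the mark again, so the first header cell is still "id".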

import csv
import json
import time

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36',
}

# Answers endpoint; {0} is the offset of the page to fetch.
ANSWERS_URL = (
    'https://www.zhihu.com/api/v4/questions/370697253/answers'
    '?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%'
    '2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%'
    '2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%'
    '2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%'
    '2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%'
    '2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%'
    '2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={0}&platform=desktop&'
    'sort_by=default'
)

# Root-comments endpoint; {0} is the answer ID, {1} is the offset.
COMMENTS_URL = ('https://www.zhihu.com/api/v4/answers/{0}/root_comments'
                '?order=normal&limit=20&offset={1}&status=open')

csvfile = open('知乎评论.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(csvfile)
writer.writerow(['id', 'created_time', 'author', 'content'])


def fetch_json(url):
    # Keep retrying until the request succeeds, then parse the JSON body.
    # Only network errors are retried, so Ctrl+C still stops the script.
    while True:
        try:
            res = requests.get(url, headers=headers, timeout=(3, 7))
        except requests.RequestException:
            continue
        res.encoding = 'utf-8'
        return json.loads(res.text)


def format_time(timestamp):
    # Convert a Unix timestamp to a local-time string.
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))


def extract_text(html):
    # Join the text of every <p> element; empty content yields an empty string.
    if not html:
        return ''
    return ''.join(etree.HTML(html).xpath('//p//text()'))


def write_row(row):
    writer.writerow(row)
    print(row)


def GetAnswers():
    i = 0  # offset of the current page of answers
    while True:
        jsonAnswer = fetch_json(ANSWERS_URL.format(i))
        is_end = jsonAnswer['paging']['is_end']
        for data in jsonAnswer['data']:
            answer_id = str(data['id'])
            write_row([
                answer_id,
                format_time(data['created_time']),
                data['author']['name'],
                extract_text(data['content']),
            ])
            # Fetch comments only if commenting is open and there are any.
            if (not data['admin_closed_comment']
                    and data['can_comment']['status']
                    and data['comment_count'] > 0):
                GetComments(answer_id)
        i += 5
        print('Finished page {0}'.format(i // 5))
        if is_end:
            break
        time.sleep(1)


def GetComments(answer_id):
    j = 0  # offset of the current page of comments
    while True:
        jsonComment = fetch_json(COMMENTS_URL.format(answer_id, j))
        is_end = jsonComment['paging']['is_end']
        for data in jsonComment['data']:
            comment_id = str(answer_id) + "_" + str(data['id'])
            write_row([
                comment_id,
                format_time(data['created_time']),
                data['author']['member']['name'],
                extract_text(data['content']),
            ])
            # Replies to this comment are returned inline as child_comments.
            for child_comments in data['child_comments']:
                write_row([
                    comment_id + "_" + str(child_comments['id']),
                    format_time(child_comments['created_time']),
                    child_comments['author']['member']['name'],
                    extract_text(child_comments['content']),
                ])
        j += 20
        if is_end:
            break
        time.sleep(1)


try:
    GetAnswers()
finally:
    csvfile.close()
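The request loops in the code above retry until a request succeeds, which means the script never gives up if the site stays unreachable. A bounded variant is sketched below as one possible alternative. The helper name with_retries and its parameters are made up for this example, and flaky is a stand-in that simulates two network failures, so the sketch runs without network access.

```python
import time


def with_retries(action, max_attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call action() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay)  # wait before the next attempt
    raise last_error


# Stand-in for a network call that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


print(with_retries(flaky, max_attempts=5))
```

In the scraper, action would wrap the requests.get call and retry_on would be limited to network errors, so that a permanent failure raises an error instead of looping forever.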

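4. Summary

This post is based on a report given at the 75th academic seminar of the Big Data and Philosophy & Social Sciences Laboratory. If you are interested, reply "资料下载" in the WeChat account backend to get the source code and the more than 8,000 records scraped from this question.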
