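The previous post covered how to fetch images with Python; this one looks at fetching text.
Analyzing the page structure
The overall workflow is much the same as before: open the site and locate the text you want. The novel used here is at http://www.136book.com/mieyuechuanheji/. Right-click the page and view its source, as shown in the screenshot:
You can see that the chapter list sits in an ol tag with class=clearfix.
Get all ol tags whose class is clearfix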
url = 'http://www.136book.com/mieyuechuanheji/'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html)
divResult = soup.findAll('ol',attrs={"class":"clearfix"})
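A note for readers on Python 3: urllib2 exists only in Python 2, and the equivalent request goes through urllib.request. The following is a minimal sketch of that step, not the code used in this post. The helper name fetch_html, the utf-8 default and the 10-second timeout are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen


def fetch_html(url, encoding='utf-8', timeout=10):
    """Download url and return the response body decoded as text.

    encoding and timeout are assumed defaults; adjust them for the real site.
    """
    req = Request(url)
    with urlopen(req, timeout=timeout) as response:
        # Undecodable bytes are replaced rather than raising an error
        return response.read().decode(encoding, errors='replace')


# Hypothetical usage (needs network access):
# html = fetch_html('http://www.136book.com/mieyuechuanheji/')
```

The returned string can then be handed to BeautifulSoup in the same way as html above.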
Loop over the ol tags to get the text and link of each a tag
for div in divResult:
    # get all the a tags
    aarray = div.findAll('a')
    for a in aarray:
        print a.string
        link = a.get('href')
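One caveat with the loop above: it assumes every href is already an absolute URL. The sketch below shows the same traversal in Python 3 with bs4 (not the urllib2/BeautifulSoup 3 pair used in this post), adding urljoin as a guard in case a link turns out to be relative. The sample markup and the name list_chapters are made up for illustration, so the example runs without touching the network.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.136book.com/mieyuechuanheji/'

# Hypothetical markup standing in for the real chapter list
SAMPLE_HTML = """
<ol class="clearfix">
  <li><a href="http://www.136book.com/mieyuechuanheji/ch1/">Chapter 1</a></li>
  <li><a href="ch2/">Chapter 2</a></li>
</ol>
"""


def list_chapters(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for each link inside ol.clearfix."""
    soup = BeautifulSoup(html, 'html.parser')
    chapters = []
    for ol in soup.find_all('ol', attrs={'class': 'clearfix'}):
        for a in ol.find_all('a'):
            href = a.get('href')
            if href:  # skip anchors that carry no href
                # urljoin leaves absolute URLs alone and resolves relative ones
                chapters.append((a.get_text(strip=True), urljoin(base_url, href)))
    return chapters


for title, link in list_chapters(SAMPLE_HTML):
    print(title, link)
```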
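link is the URL of each chapter. Click into any chapter and view its source, as shown in the screenshot:
The novel text is in the div with id=content. Get the text content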
linkreq = urllib2.Request(link)
linkresponse = urllib2.urlopen(linkreq)
htmlres = linkresponse.read()
soups = BeautifulSoup(htmlres)
textResult = soups.findAll('div',attrs={"id":"content"})
Loop over all the p tags in the content and write them to a txt file
for p in textResult:
    parray = p.findAll('p')
    for string in parray:
        f.write(string.string)
        f.write('\n\n')
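One thing to watch in the loop above: in BeautifulSoup, .string is None whenever a tag has more than one child, for example a p tag that contains a b or br tag, and writing None to a file raises an error. A possible workaround is get_text(), sketched here in Python 3 with bs4. The sample markup and the helper name extract_paragraphs are assumptions for illustration, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter markup; the second paragraph has a nested tag
SAMPLE_HTML = """
<div id="content">
  <p>First paragraph.</p>
  <p>Second <b>bold</b> paragraph.</p>
</div>
"""


def extract_paragraphs(html):
    """Return the text of every p tag inside the div with id=content."""
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('div', attrs={'id': 'content'})
    if content is None:
        return []
    # get_text() concatenates all nested text, so it never returns None
    return [p.get_text() for p in content.find_all('p')]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
second_p = soup.find_all('p')[1]
print(second_p.string)       # None, because the tag has several children
print(second_p.get_text())   # full text, including the nested b tag
print(extract_paragraphs(SAMPLE_HTML))
```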
Downloading:
Download finished:
The full source code is as follows
#!/usr/bin/python
#-*- coding: utf-8 -*-
#encoding=utf-8
import urllib2
import urllib
from BeautifulSoup import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

if __name__ == '__main__':
    url = 'http://www.136book.com/mieyuechuanheji/'
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html)
    divResult = soup.findAll('ol',attrs={"class":"clearfix"})
    print divResult
    for div in divResult:
        # get all the a tags
        aarray = div.findAll('a')
        f = open('/Users/kangbing/Desktop/python/miyuezhuan.txt','w')
        for a in aarray:
            print a.string
            link = a.get('href')
            linkreq = urllib2.Request(link)
            linkresponse = urllib2.urlopen(linkreq)
            htmlres = linkresponse.read()
            soups = BeautifulSoup(htmlres)
            textResult = soups.findAll('div',attrs={"id":"content"})
            f.write(a.string)
            for p in textResult:
                parray = p.findAll('p')
                for string in parray:
                    f.write(string.string)
                    f.write('\n\n')
        f.close()
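That is roughly the whole routine for scraping a novel. The way each node is located can certainly be improved, and it is worth reading the BeautifulSoup documentation and trying simpler approaches. Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html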
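As one example of a simpler approach from that documentation, bs4 supports CSS selectors through select(), which can replace the nested findAll loops. This is only a sketch in Python 3 with bs4, run against invented sample markup; the variable names and the example.com URLs are placeholders, and the real pages may need different selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the chapter-list page and one chapter page
LIST_HTML = """
<ol class="clearfix">
  <li><a href="http://example.com/ch1/">Chapter 1</a></li>
  <li><a href="http://example.com/ch2/">Chapter 2</a></li>
</ol>
"""
CHAPTER_HTML = """
<div id="content">
  <p>Line one.</p>
  <p>Line two.</p>
</div>
<div id="footer"><p>Not part of the novel.</p></div>
"""

# 'ol.clearfix a[href]' matches every linked a tag inside an ol with class clearfix
list_soup = BeautifulSoup(LIST_HTML, 'html.parser')
links = [a['href'] for a in list_soup.select('ol.clearfix a[href]')]

# '#content p' matches every p tag nested in the element with id=content
chapter_soup = BeautifulSoup(CHAPTER_HTML, 'html.parser')
paragraphs = [p.get_text() for p in chapter_soup.select('#content p')]

print(links)
print(paragraphs)
```

Each selector does the work of one findAll pair from the script above, which keeps the traversal on a single line.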