Python Crawler: Scraping the Novel 芈月传 (Mi Yue Zhuan)

The previous post covered how to fetch images with Python; this one looks at fetching text.

Analyze the page structure

The overall flow is basically the same: open the site and find the text you want. The novel's index page is http://www.136book.com/mieyuechuanheji/. Right-click and view the page source, as shown in the screenshot:

You can see the chapter list sits in ol tags with class=clearfix.

Grab every ol tag whose class is clearfix:

url = 'http://www.136book.com/mieyuechuanheji/'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html)
divResult = soup.findAll('ol', attrs={"class": "clearfix"})
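A side note: urllib2 raises an exception rather than returning an error page when a request fails. A minimal sketch of guarding the fetch:

# minimal sketch: urllib2 raises HTTPError (bad status code)
# or URLError (network failure) when a request goes wrong;
# HTTPError must be caught first since it subclasses URLError
try:
    response = urllib2.urlopen(urllib2.Request(url))
    html = response.read()
except urllib2.HTTPError as e:
    print 'HTTP error:', e.code
except urllib2.URLError as e:
    print 'network error:', e.reason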

Iterate over the ol tags and pull out each a tag's text and link:

for div in divResult:
    # grab all the a tags (one per chapter)
    aarray = div.findAll('a')
    for a in aarray:
        print a.string
        link = a.get('href')
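Note that the loop above overwrites link on every iteration, so only the last chapter's URL survives it. If you'd rather fetch the chapters in a later pass, a small variant is to collect (title, link) pairs first:

# variant: keep every chapter instead of only the last link
chapters = []
for div in divResult:
    for a in div.findAll('a'):
        chapters.append((a.string, a.get('href')))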

link is the URL of an individual chapter. Click into a chapter and view its source, as shown:

The novel text lives in the div with id=content; fetch it:

linkreq = urllib2.Request(link)
linkresponse = urllib2.urlopen(linkreq)
htmlres = linkresponse.read()
soups = BeautifulSoup(htmlres)
textResult = soups.findAll('div', attrs={"id": "content"})
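One caveat: some sites reject requests that lack a browser-like User-Agent. Whether 136book requires this is not something this post verifies, but if a bare request gets refused, passing a headers dict to Request is usually enough:

# optional: spoof a browser User-Agent in case the site
# rejects bare urllib2 requests (may not be needed here)
headers = {'User-Agent': 'Mozilla/5.0'}
linkreq = urllib2.Request(link, headers=headers)
linkresponse = urllib2.urlopen(linkreq)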

Iterate over all the p tags inside the content and write them to a txt file:

for content in textResult:
    parray = content.findAll('p')
    for ptag in parray:
        if ptag.string:  # .string is None when a p tag has nested markup
            f.write(ptag.string)
            f.write('\n\n')
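The full script below leans on the sys.setdefaultencoding hack so that a plain open/write accepts the UTF-8 text. An alternative, if you want to avoid that hack, is the standard-library codecs module, which opens the file with an explicit encoding:

# alternative to the setdefaultencoding hack: an explicitly
# UTF-8-encoded file handle from the standard codecs module
import codecs
f = codecs.open('/Users/kangbing/Desktop/python/miyuezhuan.txt', 'w', encoding='utf-8')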

Downloading:

Download complete:

The full source code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
from BeautifulSoup import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

if __name__ == '__main__':
    url = 'http://www.136book.com/mieyuechuanheji/'
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html)
    divResult = soup.findAll('ol', attrs={"class": "clearfix"})

    print divResult
    # open the output file once, before the loop, so a second
    # ol tag does not truncate what was already written
    f = open('/Users/kangbing/Desktop/python/miyuezhuan.txt', 'w')
    for div in divResult:
        # grab all the a tags (one per chapter)
        aarray = div.findAll('a')
        for a in aarray:
            print a.string
            link = a.get('href')
            linkreq = urllib2.Request(link)
            linkresponse = urllib2.urlopen(linkreq)
            htmlres = linkresponse.read()
            soups = BeautifulSoup(htmlres)
            textResult = soups.findAll('div', attrs={"id": "content"})

            # chapter title, then the chapter body
            f.write(a.string)
            f.write('\n\n')
            for content in textResult:
                parray = content.findAll('p')
                for ptag in parray:
                    if ptag.string:  # None for p tags with nested markup
                        f.write(ptag.string)
                        f.write('\n\n')
    f.close()

The overall process and pattern for scraping a novel are basically like this. There is certainly room to optimize how each node is extracted; consult the BeautifulSoup documentation and try simpler approaches. Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
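The linked docs cover BeautifulSoup 4, whose API differs slightly from the BeautifulSoup 3 import used above. As a rough sketch, assuming bs4 and requests are installed (pip install beautifulsoup4 requests), the chapter-list step could shrink to a single CSS selector:

# a sketch in bs4 style (assumes bs4 and requests are installed)
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.136book.com/mieyuechuanheji/')
soup = BeautifulSoup(resp.content, 'html.parser')
# one CSS selector replaces the nested findAll loops
for a in soup.select('ol.clearfix a'):
    print a.get_text(), a.get('href')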