Bob blog


Python Crawler (3): Parsing Web Page Content

December 16, 2019 - by Bo

python crawler spider

The previous post covered making simple requests and fetching pages; this one is about parsing the page content.

When we crawl HTML, much of what comes back is content we don't need, so we have to parse the HTML and extract just the parts we want.

There are three ways to parse the content:

1. Match it with regular expressions.

2. Parse it with lxml.

3. Parse it with BeautifulSoup.

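The first approach does not always match well, especially on complex HTML structures, and a hand-written regular expression can easily miss content or capture too much. So I recommend the latter two.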
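To make the regex risk concrete, here is a minimal sketch using only the standard library. The HTML snippet, the patterns, and the function names are all made up for illustration; they are not taken from a real Baidu page.

```python
import re

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<div class="result"><h3 class="c-title"><a href="/a">First title</a></h3></div>
<div class="result"><h3 class="c-title" id="t2"><a href="/b">Second title</a></h3></div>
<div class="result"><h3 class='c-title'>
  <a href="/c">Third <em>title</em></a></h3></div>
"""

# A naive pattern that assumes one exact attribute layout and plain-text titles.
NAIVE_TITLE = re.compile(r'<h3 class="c-title"><a href="[^"]*">([^<]*)</a>')

# A greedy pattern: ".*" runs to the last "</a>" on the line.
GREEDY_LINK = re.compile(r'<a[^>]*>(.*)</a>')


def naive_titles(page):
    """Return the titles the naive pattern manages to match."""
    return NAIVE_TITLE.findall(page)


def greedy_link_text(line):
    """Return what the greedy pattern captures on a single line."""
    return GREEDY_LINK.findall(line)


# Misses content: the extra id attribute, the single quotes, and the
# nested <em> each defeat the pattern, so only one title is found.
print(naive_titles(SAMPLE_HTML))
# ['First title']

# Captures too much: two links on one line collapse into one match.
print(greedy_link_text('<a href="/a">One</a> <a href="/b">Two</a>'))
# ['One</a> <a href="/b">Two']
```

A parser works on the element tree rather than on the raw text, so attribute order, quoting style, and nested tags do not trip it up in the same way.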

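lxml and BeautifulSoup both parse HTML/XML; what differs is the API you call. Another difference is cost: BeautifulSoup builds the entire document into a tree of Python objects, which generally makes it slower and more memory-hungry than lxml, which is backed by the C library libxml2.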
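As a rough side-by-side of the two APIs, the sketch below extracts the same titles from a small inline snippet, so it runs without any network access. The markup and the function names are assumptions made up for this example; the XPath and the `find_all` call mirror the ones used later in this post.

```python
from bs4 import BeautifulSoup
from lxml import html as lh

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<html><body>
  <div class="result"><h3 class="c-title"><a href="/a"> First title </a></h3></div>
  <div class="result"><h3 class="c-title"><a href="/b">Second <em>title</em></a></h3></div>
</body></html>
"""


def titles_with_lxml(page):
    """Locate the title links with XPath and return their text."""
    doc = lh.fromstring(page)
    nodes = doc.xpath("//div[@class='result']/h3[@class='c-title']/a")
    return [node.text_content().strip() for node in nodes]


def titles_with_bs4(page):
    """Find the title h3 tags, then read the text of each child link."""
    soup = BeautifulSoup(page, "html.parser")
    headings = soup.find_all("h3", class_="c-title")
    return [h3.find("a").get_text().strip() for h3 in headings]


print(titles_with_lxml(SAMPLE_HTML))
print(titles_with_bs4(SAMPLE_HTML))
```

With lxml the whole path is expressed in one XPath string; with BeautifulSoup you walk the tree through method calls. Which one reads better is largely a matter of taste.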

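For example, suppose I have a simple requirement: given a keyword, get the titles of the results on the first three pages of Baidu News for that keyword.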
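The "first three pages" part of the requirement comes down to three URLs that differ only in a result offset. Here is a minimal sketch using the standard library, assuming the query parameters and the `pn = page * 10` offset used in this post's code; `build_page_urls` and the constant names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"
RESULTS_PER_PAGE = 10  # assumption: mirrors the page*10 offset used in this post


def build_page_urls(keyword, pages=3):
    """Build one search URL per result page, percent-encoding the keyword."""
    urls = []
    for page in range(pages):
        query = {
            "tn": "news",
            "rsv_dl": "ns_pc",
            "word": keyword,
            "pn": page * RESULTS_PER_PAGE,  # offsets 0, 10, 20
        }
        urls.append(BASE_URL + "?" + urlencode(query))
    return urls


for url in build_page_urls("成都地铁"):
    print(url)
```

`urlencode` percent-encodes the Chinese keyword as UTF-8, so the resulting URL is plain ASCII whichever HTTP client ends up sending it.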

Fetching was covered in the previous post, so here we'll parse the result with lxml and with BeautifulSoup in turn.

The code below parses with lxml: it loads the fetched HTML, locates the titles with an XPath expression, and extracts their text.

# with lxml

from urllib.parse import quote

from lxml import html as lh
from utils.http_helper import HttpHelper  # helper from the previous post


if __name__ == "__main__":
    # capture the news titles returned by a Baidu news search
    pages = 3  # number of result pages to fetch
    keyword = "成都地铁"
    for page in range(pages):
        # pn is the result offset (0, 10, 20); the keyword is percent-encoded for the URL
        url = "https://www.baidu.com/s?tn=news&rsv_dl=ns_pc&word=%s&pn=%d" % (quote(keyword), page*10)
        additional_header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0"}
        response = HttpHelper.get_response_by_url(url, data=None, headers=additional_header)
        if response.status_code == 200:
            result = response.content.decode('utf-8')
            # build the element tree from the HTML string
            doc = lh.fromstring(result)
            # each title is the <a> inside <h3 class="c-title"> inside <div class="result">
            title_xpath = "//div[@class='result']/h3[@class='c-title']/a"
            nodes = doc.xpath(title_xpath)
            print("News titles on page %d with the url %s: " % (page + 1, url))
            for n in nodes:
                # text_content() also collects text from nested tags
                print(n.text_content().strip())

The code below parses with BeautifulSoup: it loads the fetched HTML, finds the matching h3 nodes, and takes the text of the a child node in each.

# with BeautifulSoup

from urllib.parse import quote

from bs4 import BeautifulSoup as bs
from utils.http_helper import HttpHelper  # helper from the previous post


if __name__ == "__main__":
    # capture the news titles returned by a Baidu news search
    pages = 3  # number of result pages to fetch
    keyword = "成都地铁"
    for page in range(pages):
        # pn is the result offset (0, 10, 20); the keyword is percent-encoded for the URL
        url = "https://www.baidu.com/s?tn=news&rsv_dl=ns_pc&word=%s&pn=%d" % (quote(keyword), page*10)
        additional_header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0"}
        response = HttpHelper.get_response_by_url(url, data=None, headers=additional_header)
        if response.status_code == 200:
            result = response.content.decode('utf-8')
            # parse with Python's built-in HTML parser
            bs_result = bs(result, "html.parser")
            # every <h3 class="c-title"> holds one news title
            title_h3_tags = bs_result.find_all("h3", class_="c-title")
            print("News titles on page %d with the url %s: " % (page + 1, url))
            for title in title_h3_tags:
                # the title text sits in the <a> child of the h3
                print(title.find("a").text.strip())

Both versions produce the same result, so the simple requirement is met: crawl the first three pages of Baidu News and extract the titles.
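One caveat about that agreement: the two lookups in this post treat the class attribute differently. The XPath test `@class='c-title'` compares the whole attribute string, while BeautifulSoup's `class_="c-title"` matches any one of an element's classes. The sketch below shows the difference on made-up markup; the `c-font-large` class and the function names are hypothetical.

```python
from bs4 import BeautifulSoup
from lxml import html as lh

# Hypothetical markup: the second h3 carries an extra class.
SAMPLE_HTML = """
<html><body>
  <div class="result"><h3 class="c-title"><a href="/a">Plain title</a></h3></div>
  <div class="result"><h3 class="c-title c-font-large"><a href="/b">Styled title</a></h3></div>
</body></html>
"""

EXACT_XPATH = "//h3[@class='c-title']/a"
# Common XPath 1.0 idiom for "has this class among others".
TOLERANT_XPATH = (
    "//h3[contains(concat(' ', normalize-space(@class), ' '), ' c-title ')]/a"
)


def xpath_titles(page, expression):
    """Return the text of every node the XPath expression selects."""
    doc = lh.fromstring(page)
    return [node.text_content().strip() for node in doc.xpath(expression)]


def bs4_titles(page):
    """Return the link text under every h3 that has the c-title class."""
    soup = BeautifulSoup(page, "html.parser")
    return [h3.find("a").get_text().strip()
            for h3 in soup.find_all("h3", class_="c-title")]


print(xpath_titles(SAMPLE_HTML, EXACT_XPATH))     # ['Plain title']
print(xpath_titles(SAMPLE_HTML, TOLERANT_XPATH))  # ['Plain title', 'Styled title']
print(bs4_titles(SAMPLE_HTML))                    # ['Plain title', 'Styled title']
```

If the two approaches ever disagree on a real page, the class test is a reasonable first thing to check.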

Next post: Commonly Used SQL

Previous post: Python Crawler (2): Getting Baidu Search Indexed Results
