Python Web Scraping Basics

Overview: requests, beautifulsoup, selenium, xpath

1. import requests:

import requests
import json

# Using requests
# requests.get()
response = requests.get("http://localhost:31547")
content = response.content

# print(response.content)
print('\n', '*'*50, 'GET', '*'*50)
print(type(response))
print(response.encoding)
print(response.status_code)
print(response.text)


# requests.post to log in and obtain cookies
print('\n', '*'*50, 'POST for login', '*'*50)
body = {"userName": "admin", "password": "123456"}
r = requests.post('http://localhost:31547/login', data=body)
print("PostResponse.status = ", r.status_code)
print('PostResponse.text = ', r.text)
print('PostResponse.cookies', r.cookies)
# keep the cookies for the requests below
cks = r.cookies


# requests.get with cookies
print('\n', '*'*50, 'GET all users with cookies', '*'*50)
r = requests.get('http://localhost:31547/user', cookies=cks)
print(r.text)

print('\n', '*'*50, 'GET user info with cookies', '*'*50)
r = requests.get('http://localhost:31547/user/1', cookies=cks)
print(r.text)

# requests.get articles
print('\n', '*'*50, 'GET all articles with cookies', '*'*50)
r = requests.get('http://localhost:31547/article', cookies=cks)
print(r.text)


print('\n', '*'*50, 'GET article info', '*'*50)
r = requests.get('http://localhost:31547/article/4', cookies=cks)
print(r.text)

# requests.post to add an article
print('\n', '*'*50, 'Add article', '*'*50)
# request body
body = {
    "id": 39,
    "title": "机器学习Machine Learning",
    "content": "机器学习与人工智能",
    "authorId": "1"
}
# we send JSON, so set the Content-Type header accordingly
headers = {'Content-Type': 'application/json'}
r = requests.post('http://localhost:31547/article', cookies=cks, data=json.dumps(body), headers=headers)
print(r.text)

# requests.put to update an article
print('\n', '*'*50, 'Update article', '*'*50)
body = {
    "id": 35,
    "title": "机器学习Machine Learning",
    "content": "机器学习与人工智能, 深度学习",
    "authorId": "1"
}
headers = {'Content-Type': 'application/json'}
r = requests.put('http://localhost:31547/article/35', cookies=cks, data=json.dumps(body), headers=headers)
print(r.text)


# requests.delete an article (no request body needed)
print('\n', '*'*50, 'Delete article', '*'*50)
r = requests.delete('http://localhost:31547/article/34', cookies=cks)
print(r.text)
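
Passing cookies=cks to every call works, but requests can also manage cookies for you. Below is a minimal sketch using requests.Session, which persists cookies (and other defaults) across requests; it assumes the same local test server as above.

import requests

# A Session keeps the cookies set by the login response and sends them automatically.
s = requests.Session()
s.post('http://localhost:31547/login', data={"userName": "admin", "password": "123456"})

# No explicit cookies= argument needed; the session reuses the login cookie.
r = s.get('http://localhost:31547/user')
print(r.text)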

2. from bs4 import BeautifulSoup:

BeautifulSoup supports the HTML parser from Python's standard library as well as several third-party parsers; if no third-party parser is installed, Python falls back to the built-in html.parser.

  • Install or upgrade the relevant packages for Python 3:
    pip3 install --upgrade beautifulsoup4
    pip3 install --upgrade html5lib
Parser                    Usage                                Notes
Python standard library   BeautifulSoup(html, 'html.parser')   Built into Python
lxml                      BeautifulSoup(html, 'lxml')          Fast; parses XML too, via BeautifulSoup(html, 'xml')
html5lib                  BeautifulSoup(html, 'html5lib')      Most fault-tolerant; parses the way a browser does and produces valid HTML5
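
Parser choice matters most for malformed markup. A quick illustration (the exact repaired output can vary slightly across library versions):

from bs4 import BeautifulSoup

broken = '<p>one<p>two'
# html.parser closes the dangling tags in place, e.g. <p>one</p><p>two</p>
print(BeautifulSoup(broken, 'html.parser'))
# html5lib repairs the fragment the way a browser would, adding html/head/body
print(BeautifulSoup(broken, 'html5lib'))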

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four kinds:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

2.0 Importing and creating a BeautifulSoup object

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<c href="http://example.com/elsie" class="sister" id="link1"><!--This is a comment--></c>,

<p class="story">...</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.prettify())

2.1 Tag

# soup.TagName returns a tag, but only the first one that matches
print('\n\n', '*'*60, 'Tag', '*'*60, '\n')
print(type(soup.head))
print(soup.head, '***name=', soup.head.name, '***attr=', soup.head.attrs)
print(soup.title, '***name=', soup.title.name, '***attr=', soup.title.attrs)
print(soup.p, '***name=', soup.p.name, '***attr=', soup.p.attrs)
print(soup.b, '***name=', soup.b.name, '***attr=', soup.b.attrs)
print(soup.a, '***name=', soup.a.name, '***attr=', soup.a.attrs)
# iterate over a Tag's attributes (attrs)
for att, val in soup.a.attrs.items():
    print(att, '====', val)

2.2 NavigableString: a traversable string

# Get a Tag's text content: Tag.string / Tag.get_text()
print('\n\n', '*'*60, 'NavigableString', '*'*60, '\n')
print(type(soup.a.string))
print(type(soup.a.get_text()))
print(soup.a.get_text(), soup.a.string)

2.3 BeautifulSoup

# The BeautifulSoup object represents the whole document. Most of the time it can be
# treated like a Tag: it supports most of the tree-traversal and tree-search methods.
print('\n\n', '*'*60, 'BeautifulSoup', '*'*60, '\n')
print(type(soup))
print(soup.name)
print(soup.attrs)

2.4 Comment

# Tag, NavigableString and BeautifulSoup cover almost everything in an HTML or XML
# document, but there are a few special objects. The one most likely to cause trouble
# is the comment:
# a Comment object is a special type of NavigableString:
print('\n\n', '*'*60, 'Comment', '*'*60, '\n')
print(soup.c)
print(soup.c.string)
print(type(soup.c.string))
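
Since Comment is a subclass of NavigableString, you can test for it explicitly when you need to tell comments apart from ordinary text; a small sketch against the same soup:

from bs4 import Comment

# .string on the <c> tag above returns the comment's text as a Comment object
if isinstance(soup.c.string, Comment):
    print('this is a comment, not regular text')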

2.5 Traversing the document tree

2.5.1 Direct children: Tag.contents (returns a list), Tag.children (returns a list iterator)

print('\n\n', '*'*60, 'Iterate the DOM', '*'*60, '\n')
# contents
cont = soup.body.contents
print(type(cont), 'len(cont)=', len(cont))
for i in cont:
    print('>>>', i)

# children
chils = soup.body.children
print(type(chils), chils)
for i in chils:
    print("*"*20, i)

2.5.2 All descendants
Tag.descendants recursively yields every descendant of a tag; like children, it must be iterated over to get its contents.

print(type(soup.descendants))
for item in soup.descendants:
    print("descendant:", item)

2.5.3 A node's content: Tag.string, Tag.get_text()

  • If a tag has exactly one NavigableString child, Tag.string returns that child.
  • If a tag's only child is another tag, Tag.string also works and returns the same result as calling Tag.string on that only child.
  • If a tag has multiple children, it cannot tell which child's content Tag.string should return, so Tag.string is None.
body = soup.body
print(body.string)
print(soup.title.string)

2.5.4 Multiple contents of a node: Tag.strings, Tag.stripped_strings (whitespace stripped)

for i in body.strings:
    print('node strings:', i)
for i in body.stripped_strings:
    print('node stripped_strings:', i)

2.5.5 Parent node: Tag.parent

title = soup.title
print(title.parent)

2.5.6 All ancestors: Tag.parents

for i in title.parents:
    print('ancestor:', i.name)

2.5.7 Sibling nodes: Tag.next_sibling, Tag.previous_sibling

a = soup.a
print(a)
while a is not None:
    a = a.next_sibling
    print(a)

2.5.8 All siblings: Tag.next_siblings, Tag.previous_siblings

a = soup.a
for i in a.next_siblings:
    print("sibling:", i)

2.5.9 Adjacent nodes: Tag.next_element, Tag.previous_element
Unlike Tag.next_sibling and Tag.previous_sibling, these are not limited to siblings: they walk all nodes in document order, regardless of nesting level.

print('previous element:', a.previous_element)

2.5.10 All preceding/following nodes: Tag.next_elements, Tag.previous_elements

for i in a.previous_elements:
    print('previous elements:', i.name)


2.6 Searching the document tree: find_all(name, attrs, recursive, text, **kwargs)

2.6.1 The name argument
It accepts: 1. a string, 2. a regular expression, 3. a list, 4. True, 5. a function. The snippet below shows the list, regex, and function forms; the string and True forms follow after it.

import re

for i in soup.find_all(['a', 'b', 'c']):
    print('name as list: ', i)
for i in soup.find_all(re.compile('^b')):
    print('name as regex: ', i.name)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

for i in soup.find_all(has_class_but_no_id):
    print('name as function: ', i.name)
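
For completeness, the string and True forms from the list above; a string matches tags by exact name, and True matches every tag in the document:

for i in soup.find_all('a'):
    print('name as string: ', i)
for i in soup.find_all(True):
    print('name as True: ', i.name)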

2.6.2 Keyword arguments

for i in soup.find_all(id='link1'):
    print('keyword argument id: ', i)

# class is a Python keyword, so use class_ instead
for i in soup.find_all(class_='sister'):
    print('keyword argument class_: ', i)

2.6.3 The text argument

for i in soup.find_all(text=['Elsie', 'Lacie']):
    print('text argument: ', i)
for i in soup.find_all(text=re.compile('Dormouse')):
    print('text argument: ', i)

2.6.4 The limit argument

for i in soup.find_all('a', limit=2):
    print('limit argument: ', i)

2.6.5 The recursive argument
When you call a tag's find_all(), Beautiful Soup searches all of the tag's descendants. To search only the direct children, pass recursive=False. In the example below, <title> is not a direct child of the document, so the first loop prints nothing, while <html> is a direct child and is found.

for i in soup.find_all('title', recursive=False):
    print('recursive argument: ', i)
for i in soup.find_all('html', recursive=False):
    print('recursive argument: ', i)

2.6.6 find(name, attrs, recursive, text, **kwargs)
The only difference from find_all() is the return value: find_all() returns a list (even when it contains a single element), while find() returns the first matching result directly.

print('find : ',soup.find('a'))

2.6.7 Other methods

find_parents()           find_parent()
find_next_siblings()     find_next_sibling()
find_previous_siblings() find_previous_sibling()
find_all_next()          find_next()
find_all_previous()      find_previous()
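
These mirror find_all() and find(), but search upward, among siblings, or in document order instead of among descendants. A short sketch against the same soup:

# find_parent walks up the tree; find_next_sibling walks forward among siblings
a = soup.find('a')
print(a.find_parent('p').attrs)   # the <p class="story"> containing the links
print(a.find_next_sibling('a'))   # the second <a> tag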

2.7 CSS selectors:

In CSS, a tag name is written plain, a class name is prefixed with a dot, and an id is prefixed with #. We can filter elements the same way here using soup.select(), which returns a list.
2.7.1 Find by tag name

soup.select('title')

2.7.2 Find by class name

for i in soup.select('.sister'):
    print('find by class name: ', i)

2.7.3 Find by id

for i in soup.select('#link1'):
    print('find by id: ', i)

2.7.4 Combined selectors

for i in soup.select('p #link1'):
    print('combined selector: ', i)

2.7.5 Attribute selectors

for i in soup.select('a[class="sister"]'):
    print('attribute selector: ', i)


3. from selenium import webdriver

3.1 Installing and configuring Selenium

  • First, install selenium:

    pip3 install selenium
  • Next, install the Chrome browser driver and add it to the environment variables (as shown below)

    # 1. Download chromedriver and unpack it to the following path
    /usr/local/chromeDriver/chromedriver

    # 2. Add it to the environment variables
    # for selenium.webdriver.Chrome()
    export CHROMEDRIVER=/usr/local/chromeDriver/
    export PATH=$PATH:$CHROMEDRIVER

3.2 Using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')

# find_element(By.NAME, ...) replaces the find_element_by_* helpers,
# which were removed in Selenium 4
search_input = browser.find_element(By.NAME, "wd")
search_input.send_keys('科比')

submit = browser.find_element(By.ID, "su")

submit.submit()


cks = browser.get_cookies()
print(cks)
# print(browser.page_source)

# search_input.clear()
# browser.close()
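
The snippet above reads the cookies immediately after submitting the form; on a slow page it is safer to wait until the results have actually loaded. A minimal sketch with an explicit wait (the element id 'content_left' is assumed here to be Baidu's result container and may change):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the (assumed) result container appears
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'content_left'))
)
print(browser.get_cookies())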


4. xpath

XPath is a language for finding information in XML documents, and it is a core component of XSLT.
XQuery and XPointer are both built on XPath expressions, and XPath also comes up in Selenium and in the Scrapy framework.

4.1 Getting an element's XPath with the Chrome developer tools

  1. For any page, the Chrome developer tools can produce the XPath of an element.
  2. Open the page in Chrome, press F12 to open the developer tools -> find the target element under Elements -> right-click -> Copy -> Copy XPath.
  3. You can then tweak the copied expression by hand as needed (a usage sketch follows below).
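
A copied XPath can be dropped straight into Selenium. A minimal sketch, assuming a working Chrome driver; the XPath below is a hypothetical example of what DevTools might copy:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.runoob.com/')
# hypothetical copied XPath; replace it with what DevTools actually gives you
el = browser.find_element(By.XPATH, '//*[@id="index-nav"]/li[1]/a')
print(el.text)
browser.quit()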

4.2 XPath with lxml: an example

from lxml import etree

text = '''
<div class="container navigation">
<div class="row">
<div class="col nav">
<ul class="pc-nav">
<li id='1' class='home'><a href="//www.runoob.com/">首页</a></li>
<li id='2' class='front'><a href="/html/html-tutorial.html">HTML</a></li>
<li id='3' class='front'><a href="/css/css-tutorial.html">CSS</a></li>
<li id='4' class='front'><a href="/js/js-tutorial.html">JavaScript</a></li>
<li id='5' class='front'><a href="/jquery/jquery-tutorial.html">jQuery</a></li>
<li id='6' class='front'><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap</a></li>
<li id='7' class='backend'><a href="/sql/sql-tutorial.html">SQL</a></li>
<li id='8' class='backend'><a href="/mysql/mysql-tutorial.html"><span class="bold">MySQL</span></a></li>
<li id='9' class='backend'><a href="/php/php-tutorial.html">PHP</a></li>
<li id='10' class='lang'><a href="/python/python-tutorial.html">Python2</a></li>
<li id='11' class='lang'><a href="/python3/python3-tutorial.html"><span class="bold">Python3</span></a></li>
<li id='12' class='lang'><a href="/cprogramming/c-tutorial.html">C</a></li>
<li id='13' class='lang'><a href="/cplusplus/cpp-tutorial.html">C++</a></li>
<li id='14' class='lang'><a href="/csharp/csharp-tutorial.html">C#</a></li>
<li id='15' class='langs'><a href="/java/java-tutorial.html"><span class="bold">Java</span></a></li>
<li id='16' class='more'><a href="/sitemap">更多……</a></li>
<!--
<li id='17'><a href="javascript:;" class="runoob-pop">登录</a></li>
-->
</ul>
<ul class="mobile-nav">
<li class='mobile'><a href="//www.runoob.com/">首页</a></li>
<li class='mobile'><a href="/html/html-tutorial.html">HTML</a></li>
<li class='mobile'><a href="/css/css-tutorial.html">CSS</a></li>
<li class='mobile'><a href="/js/js-tutorial.html">JavaScript</a></li>
<a href="javascript:void(0)" class="search-reveal">Search</a>
</ul>

</div>
</div>
</div>
'''

# Besides reading from a string, you can also parse from a file
# html = etree.parse('test.html')
html = etree.HTML(text)
result = etree.tostring(html, pretty_print=True)

print(result)

# 1. Get all li tags
lis = html.xpath('//li')


# 2. Get each li tag's class attribute and id
for i in lis:
    print('2', type(i), ' class=', i.xpath('@class'), ' id=', i.xpath('@id'))


# 3. Get the class attribute of all li tags
classes = html.xpath('//li/@class')
print('3', classes)


# 4. Get the href of every a tag under li
ta = html.xpath('//li/a/@href')
print('4. href for all tag A=', ta)


# 5. Get all span tags under li (compare with the previous example)
# / selects direct children only, and <span> is not a direct child of <li>,
# so a double slash is needed here
tspan = html.xpath('//li//span')
for i in tspan:
    print('5', i.text)


# 6. Get the href of the a tag inside the last li (careful: this is easy to get wrong)
lasta = html.xpath('//li[last()]/a/@href')
print('6', lasta)

# 7. Get the tags whose class is lang
langs = html.xpath('//li[@class="lang"]/a')
for i in langs:
    print('7', '=', i.text)

# 8. Get the tags whose class is bold
bolds = html.xpath('//*[@class="bold"]')
for i in bolds:
    print('8', i.text)
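
XPath can also return text nodes directly with text(), instead of reading .text on the returned elements; a short addition to the example above:

# 9. Get the link texts directly (returns plain strings, not elements).
# Note: text wrapped in <span> (e.g. Python3 above) is not a direct text
# child of <a>, so it does not show up here.
texts = html.xpath('//li[@class="lang"]/a/text()')
print('9', texts)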