1.1 Downloading Web Pages: Using the requests Module


Thanks to the rapid growth of the Internet, a vast amount of data is available online. However, opening pages one by one in a browser and collecting the data by hand is a daunting task. Fortunately, we can write a web crawler program to gather the data we need, and Python offers several third-party packages that help us do exactly that.

A web crawler program can be divided into two steps (a minimal sketch combining both steps follows the list below):

  • Download the page: use the requests module.
  • Parse the page: use the BeautifulSoup module.
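
The following is a minimal sketch of how the two steps fit together, assuming the beautifulsoup4 package is also installed (pip install beautifulsoup4); it simply prints the page title.

# Minimal sketch: download with requests, then parse with BeautifulSoup
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.google.com.tw')
r.raise_for_status()                      # stop early if the request failed

soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string)                  # e.g. 'Google'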

Installing the requests Module

Because requests is not one of Python's built-in modules, it must first be installed with pip.

pip install requests
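
After installing, you can quickly check that the module imports correctly; the version shown below is only an example and depends on what pip installed.

>>> import requests
>>> requests.__version__
'2.18.4'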

Downloading a Web Page with requests.get()

The requests.get() function makes it fairly easy to download a web page or file from a website.

>>> import requests
>>> r = requests.get('http://automatetheboringstuff.com/files/rj.txt')
>>>

>>> type(r)
<class 'requests.models.Response'>

>>> r.status_code                      # status_code tells us whether the request succeeded
200                                    # 200 means success

>>> requests.codes.ok                  # the requests module provides the expected success code
200

>>> r.status_code == requests.codes.ok 
True
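
A common follow-up, sketched below, is to branch on the status code before using the response; the printed message is just an example.

>>> if r.status_code == requests.codes.ok:
...     print('Download succeeded: {0} characters received'.format(len(r.text)))
... else:
...     print('Download failed with status code {0}'.format(r.status_code))
...
Download succeeded: 174128 characters received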

Adding Error Handling

If the URL passed to requests.get() does not exist, we can wrap the call in a try-except block for error handling. When something goes wrong while fetching the page, the program can respond appropriately, for example by printing a message and doing nothing else when the request fails.

Reference file: requests_try_except.py

# Import the module
import requests


# Request a page that does not exist
r = requests.get('http://inventwithpython.com/page_that_does_not_exist')

try:
    r.raise_for_status()
except Exception as e:
    print('There was a problem: {0}'.format(e))


# Execution result (run in IDLE)
>>> 
=========== RESTART: C:/Users/xxxxx/Desktop/Crawlar_read_write/requests_try_except.py ===========
There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
>>>
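
As a variation on the example above, you can catch the library's own exception base class and pass a timeout so that a slow or unreachable server does not block the program; both requests.exceptions.RequestException and the timeout parameter are standard parts of the requests API.

import requests

try:
    r = requests.get('http://inventwithpython.com/page_that_does_not_exist',
                     timeout=5)                    # give up after 5 seconds
    r.raise_for_status()                           # raise HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print('There was a problem: {0}'.format(e))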

Saving the Downloaded Page to a File

The downloaded page content is saved to a text file using UTF-8 encoding.

>>> import requests
>>> r = requests.get('http://automatetheboringstuff.com/files/rj.txt')

>>> with open('rj_utf8.txt', 'w', encoding='UTF-8') as f:
...     f.write(r.text)
...
174128


# Execution result: rj_utf8.txt
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Romeo and Juliet

Author: William Shakespeare

Posting Date: May 25, 2012 [EBook #1112]
Release Date: November, 1997  [Etext #1112]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***


... (omitted) ...

Besides text files, the downloaded page content can also be saved to other kinds of files, such as the CSV, Excel, or pickle files covered in the file I/O chapter.
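
For binary content such as images or zip archives, a common pattern, sketched below, is to open the output file in binary mode and write the response in chunks with Response.iter_content(); the file name rj_binary.txt is only an example.

# Sketch: save the raw bytes in chunks instead of the decoded text
import requests

r = requests.get('http://automatetheboringstuff.com/files/rj.txt')
r.raise_for_status()

with open('rj_binary.txt', 'wb') as f:             # 'wb': write raw bytes, no re-encoding
    for chunk in r.iter_content(chunk_size=100000):
        f.write(chunk)                             # write up to 100 KB at a time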

A Quick Exercise

  • Use the requests module.
  • Page to download: the Google website.
  • Save the downloaded page to an "*.html" file.

# Import the module
>>> import requests
>>> r = requests.get('http://www.google.com.tw')

>>> r.status_code
200

>>> with open('google_utf8.html', 'w', encoding='UTF-8') as f:
...     f.write(r.text)
...
242852
>>>


# Execution result: google_utf8.html
<!doctype html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="zh-TW">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta content="/logos/doodles/2017/teachers-day-2017-taiwan-5744001050214400-law.gif" itemprop="image">
<meta content="2017 教師節" property="twitter:title">
<meta content="2017 教師節 #GoogleDoodle" property="twitter:description">
<meta content="2017 教師節 #GoogleDoodle" property="og:description">
<meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site">
<meta content="https://www.google.com/logos/doodles/2017/teachers-day-2017-taiwan-5744001050214400-law.gif" property="twitter:image">
<meta content="https://www.google.com/logos/doodles/2017/teachers-day-2017-taiwan-5744001050214400-law.gif" property="og:image">
<meta content="450" property="og:image:width">
<meta content="200" property="og:image:height">
<meta content="http://www.google.com/logos/doodles/2017/teachers-day-2017-taiwan-5744001050214400-2xa.gif" property="og:url">
<meta content="video.other" property="og:type">
<title>Google</title>

... (omitted) ...
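
One detail worth checking when saving pages this way: r.text is decoded using the encoding that requests inferred from the HTTP headers, so if the saved file looks garbled you can inspect r.encoding and, if necessary, switch to r.apparent_encoding before reading r.text. The values printed below are illustrative.

>>> r.encoding                          # encoding inferred from the HTTP headers
'UTF-8'
>>> r.apparent_encoding                 # encoding guessed from the page content itself
'utf-8'
>>> r.encoding = r.apparent_encoding    # override before accessing r.text, if necessary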

Using the httpbin.org Test Site

>>> import requests
>>> r = requests.get('http://httpbin.org/get')
>>>

>>> r.status_code
200

>>> r.text
'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, d
eflate", \n    "Cache-Control": "max-age=302400", \n    "Connection": "close", \n    "Host
": "httpbin.org", \n    "User-Agent": "python-requests/2.18.4"\n  }, \n  "origin": "36.231
.96.20, 172.30.4.36, 172.30.0.39, 61.219.37.4", \n  "url": "http://httpbin.org/get"\n}\n'

>>> with open('httpbin_utf8.txt', 'w', encoding='UTF-8') as f:
...     f.write(r.text)
...
344


# Contents of httpbin_utf8.txt
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "max-age=302400", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "origin": "36.231.96.20, 172.30.4.36, 172.30.0.39, 61.219.37.4", 
  "url": "http://httpbin.org/get"
}


>>> d = r.json()                        # r.json() already returns a dict
>>> d
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Cache-Contr
ol': 'max-age=302400', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python
-requests/2.18.4'}, 'origin': '36.231.96.20, 172.30.4.36, 172.30.0.39, 61.219.37.4', 'url'
: 'http://httpbin.org/get'}

>>> r.url
'http://httpbin.org/get'
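
Since httpbin.org/get echoes back whatever it receives, it is also convenient for testing query strings. The sketch below uses the params argument of requests.get(); the keys 'name' and 'page' are made-up examples.

>>> payload = {'name': 'python', 'page': '1'}      # made-up example parameters
>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> r.url
'http://httpbin.org/get?name=python&page=1'
>>> r.json()['args']                               # httpbin echoes the parameters back
{'name': 'python', 'page': '1'}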

>>> r = requests.get('https://httpbin.org/html')
>>> r
<Response [200]>

>>> with open('httpbin_html.html', 'w', encoding='UTF-8') as f:
...     f.write(r.text)
...
3739


# HTML source saved in httpbin_html.html
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
      <h1>Herman Melville - Moby-Dick</h1>

      <div>
        <p>
          Availing himself of the mild, ... (omitted) ...
        </p>
      </div>
  </body>
</html>
