Introduction to Python

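Introduction

Web scraping is a technique for automatically collecting data from websites. Python has a strong set of libraries for it, so even beginners can get started with relatively little effort. This article walks through practical scraping methods, focusing on BeautifulSoup for static pages and Selenium for dynamic pages.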

Environment Setup

Installing Python

Python 3.7 or later is recommended. Download it from the official site, or use Anaconda for convenience.

Installing the required libraries

pip install requests beautifulsoup4 selenium

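Selenium also needs a WebDriver. For Chrome, one option is webdriver-manager (Selenium 4.6 and later can also locate a driver on its own through the bundled Selenium Manager):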

pip install webdriver-manager

Scraping static pages with BeautifulSoup

BeautifulSoup parses HTML and XML and is well suited to extracting specific elements.

Basic usage

import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Get the page title
title = soup.title.text
print(title)
# Get every link
for link in soup.find_all("a"):
    print(link.get("href"))

Practical example: fetching headlines from a news site

import requests
from bs4 import BeautifulSoup
url = "https://news.yahoo.co.jp/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Extract the news headlines (adjust the selector to match the site)
headlines = soup.select(".newsFeed_item_title")
for headline in headlines:
    print(headline.text.strip())
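
Selectors such as `.newsFeed_item_title` depend on the site's current markup, so it can help to try a selector against a small inline HTML snippet before pointing it at a live page. The following is a minimal sketch under that assumption: the HTML is made up for illustration, and `extract_headlines` is a hypothetical helper name, not something from the original example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a news page; the class names are assumptions.
SAMPLE_HTML = """
<ul>
  <li><a class="newsFeed_item_title" href="/a"> First headline </a></li>
  <li><a class="newsFeed_item_title" href="/b">Second headline</a></li>
  <li><a class="other" href="/c">Not a headline</a></li>
</ul>
"""


def extract_headlines(html, selector=".newsFeed_item_title"):
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]


print(extract_headlines(SAMPLE_HTML))
# ['First headline', 'Second headline']
```

An empty list from a live page usually means the selector no longer matches, or the content is filled in by JavaScript after the page loads.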

Notes

robots.txt: check the site's policy before scraping.

Request interval: space out requests with time.sleep() so you do not overload the server.

User-Agent: some sites block the default User-Agent, so set a header.

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers)
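
The robots.txt check mentioned above can be done with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt string so that it runs without network access. The rules, the `my-scraper` agent name, and the `build_parser` helper are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules written for this sketch; a real site's robots.txt will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("my-scraper", "https://example.com/news/"))      # True
print(parser.can_fetch("my-scraper", "https://example.com/private/x"))  # False
print(parser.crawl_delay("my-scraper"))                                 # 5
```

Against a real site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the file instead. When `crawl_delay()` returns a number, it is a reasonable value to pass to `time.sleep()` between requests.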

Scraping dynamic pages with Selenium

Use Selenium for pages rendered with JavaScript and for pages that require logging in.

Setup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

Basic usage

driver.get("https://example.com")
# Get an element
element = driver.find_element(By.CSS_SELECTOR, "h1")
print(element.text)
# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Close the browser
driver.quit()

Practical example: collecting data from an infinite-scroll page

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com/infinite-scroll")
# Keep scrolling until the page height stops growing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the next batch of items to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Collect the data
elements = driver.find_elements(By.CLASS_NAME, "item")
for element in elements:
    print(element.text)
driver.quit()
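
The scroll loop above has no upper bound, so a page that never stops growing would keep it running indefinitely. One possible variation, sketched here, caps the number of rounds. `scroll_to_bottom`, `max_rounds`, and `FakeDriver` are names introduced for this illustration. `FakeDriver` is only a stand-in that lets the loop be tried without a browser; it is not part of Selenium.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: each scroll 'loads' the next height."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._height = self._heights.pop(0)

    def execute_script(self, script):
        if script.startswith("return"):
            return self._height
        if self._heights:
            self._height = self._heights.pop(0)
        return None


print(scroll_to_bottom(FakeDriver([1000, 2000, 3000]), pause=0))  # 3
```

With a real browser, the same function would be called as `scroll_to_bottom(driver)` after `driver.get(...)`. The third scroll in the demo is the one that detects the height has stopped changing.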

Notes

Headless mode: use this to run without showing a browser window.

from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

Waiting: wait until the element has loaded.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "target")))

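Choosing between BeautifulSoup and Selenium

Feature	BeautifulSoup	Selenium
Target	Static HTML	Dynamic content (JS-rendered)
Speed	Fast	Slow (launches a browser)
Interaction	Data extraction only	Clicks, text input, and scrolling
Learning curve	Low	Somewhat higher

As a rule, try BeautifulSoup first and switch to Selenium only when the page depends on dynamic content; that keeps scraping efficient.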
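
One rough way to apply this rule is to check whether the elements you want already exist in the HTML returned by a plain request. The sketch below is a heuristic only. `needs_browser` is a hypothetical helper, and both sample pages are made up for illustration.

```python
from bs4 import BeautifulSoup


def needs_browser(static_html, selector):
    """Return True when the selector matches nothing in the static HTML."""
    soup = BeautifulSoup(static_html, "html.parser")
    return len(soup.select(selector)) == 0


# Made-up pages: one ships its items in the HTML, the other adds them with JavaScript.
STATIC_PAGE = '<div id="list"><p class="item">A</p><p class="item">B</p></div>'
JS_PAGE = '<div id="list"></div><script>/* items are added at runtime */</script>'

print(needs_browser(STATIC_PAGE, ".item"))  # False
print(needs_browser(JS_PAGE, ".item"))      # True
```

When Selenium is needed, the rendered HTML is available as `driver.page_source` and can be passed to `BeautifulSoup` in the same way, so the extraction code can stay the same for both paths.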

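Ethics and legal considerations

Copyright: take care when reusing collected data.

Terms of service: always check the site's terms of service.

No excessive access: do not send a large number of requests in a short time.

Use an API: use the official API when one is available.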
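
To make the point about excessive access concrete, here is a minimal sketch of a throttle that enforces a minimum gap between requests. The `Throttle` class and the placeholder paths are assumptions made for this example, and the 0.2-second interval is only there to keep the demo quick. A real scraper would normally wait a second or more.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)  # short interval for the demo only
start = time.monotonic()
for path in ["/page/1", "/page/2", "/page/3"]:  # placeholder paths
    throttle.wait()
    # A real script would call requests.get(...) for `path` here.
elapsed = time.monotonic() - start
print(f"3 iterations took {elapsed:.2f}s")
```

Unlike a fixed `time.sleep()` after every request, this waits only for whatever part of the interval has not already been spent on processing.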

Summary

Combining BeautifulSoup and Selenium lets Python handle most sites. Start with simple static pages and work up to dynamic ones. Follow good scraping etiquette while making your data collection more efficient.
