Introduction to Python Web Scraping (2025 Edition)

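A Python web scraper is a program that automatically retrieves data from web pages; scrapers are widely used for data collection, analysis, and task automation. Python's rich library ecosystem (requests, BeautifulSoup, Scrapy) and concise syntax have made it a common first choice for scraper development. In 2025, Python scrapers are widely used in data science, AI data collection, and business analytics, and are combined with modern tools such as the asynchronous HTTP library aiohttp and the browser automation library Playwright to handle dynamic pages and anti-scraping measures. This tutorial covers the fundamentals, core libraries, usage, and practice of Python web scraping, drawing on official documentation, CSDN, and the Python community, and is aimed at beginners and working developers. PyCharm or VS Code is recommended for practice, together with Chrome DevTools (F12) for inspecting pages.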

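I. Python Web Scraping Overview (Must Know)

Definition: A web crawler (also called a spider) is a program that requests web pages the way a browser would, parses the returned HTML or JSON, and extracts target data such as text, images, and links.
Core uses:
Data collection: scraping news, product prices, and social media data.
Automated testing: verifying page content or functionality.
Competitive analysis: monitoring competitors' websites.
Characteristics:
Easy to use: libraries such as requests simplify HTTP requests.
Flexible: both static and dynamic pages can be scraped.
Rich ecosystem: pairs with pandas, matplotlib, and similar libraries for data processing.
2025 trends:
Asynchronous scraping (aiohttp, asyncio) improves throughput and suits large-scale crawls.
Dynamic page scraping (Playwright, Selenium) handles JavaScript-rendered content.
Anti-scraping measures (CAPTCHAs, IP bans) drive the use of proxy pools and headless browsers.
In KMP (Kotlin Multiplatform) projects, Python scrapers can support back-end data collection.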
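
The request, parse, extract flow in the definition above can be sketched with the standard library alone. This is a minimal illustration rather than a production crawler: the `LinkExtractor` class, the `SAMPLE_HTML` string, and the base URL are assumptions made for the example, and the download step is replaced by an inline string so the snippet runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/news/1">First story</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as /news/1 into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse the HTML text and return the absolute URLs of all links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML, "https://example.com/")
print(links)
```

In a real project the parsing step would usually be handled by BeautifulSoup or lxml, as described in the following sections; the standard-library parser is used here only to keep the sketch dependency-free.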

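II. Core Components and Libraries (Must Master)

This section introduces the key libraries and tools for scraper development, with installation commands and basic usage.

1. Core libraries

requests: sends HTTP requests and retrieves page content.

  pip install requests

BeautifulSoup: parses HTML/XML and extracts data.

  pip install beautifulsoup4

Scrapy: a full-featured crawling framework suited to complex projects.

  pip install scrapy

aiohttp: asynchronous HTTP requests, suited to high concurrency.

  pip install aiohttp

Playwright: drives a real browser to handle dynamic pages that rely on JavaScript rendering.

  pip install playwright
  playwright install

2. Other tools

lxml: a fast HTML/XML parser that can be used on its own or as the parser behind BeautifulSoup.

  pip install lxml

pandas: data storage and analysis.

  pip install pandas

selenium: simulates browser behavior to handle dynamic content.

  pip install selenium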
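III. Scraper Development Steps (Must Master)

The steps below walk through the scraper development workflow, with example code.

1. Sending an HTTP request

Use requests to fetch a page:

  import requests

  url = "https://example.com"
  # A timeout keeps the script from hanging on an unresponsive server.
  response = requests.get(url, timeout=10)
  if response.status_code == 200:
      print(response.text)  # print the HTML content
  else:
      print(f"Failed: {response.status_code}")

Notes:
Check status_code to confirm the request succeeded (200 means success).
Add headers to mimic a browser:

  headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"}
  response = requests.get(url, headers=headers, timeout=10)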
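
Checking the status code is often paired with a retry when a request fails temporarily. The sketch below is one possible way to do that, under stated assumptions: `fetch_with_retry` and `fake_fetch` are hypothetical helpers, and `fetch` is any callable returning a `(status_code, body)` pair, standing in for a real call such as `requests.get` so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    fetch is any callable returning (status_code, body); it is a stand-in
    for a real HTTP call such as requests.get.
    """
    status = None
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < retries - 1:
            # Wait longer after each failure: backoff, 2*backoff, 4*backoff...
            sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"{url} failed after {retries} attempts (last status {status})")


# Fake fetcher for demonstration: fails twice with 503, then succeeds.
_responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])


def fake_fetch(url):
    return next(_responses)


waits = []  # records the delays instead of actually sleeping
body = fetch_with_retry(fake_fetch, "https://example.com", sleep=waits.append)
print(body, waits)
```

Passing `sleep` as a parameter is a design choice for the demo: it lets the delays be recorded rather than waited out. In real use the default `time.sleep` applies.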

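2. Parsing page content

Use BeautifulSoup to extract data:

  from bs4 import BeautifulSoup
  import requests

  url = "https://example.com"
  response = requests.get(url, timeout=10)
  soup = BeautifulSoup(response.text, "html.parser")
  heading = soup.find("h1")  # returns None if the page has no <h1>
  if heading is not None:
      print(f"Title: {heading.text}")

Notes:
html.parser is the parser built into Python's standard library; lxml can be used instead and is faster.
Common methods: find (first match), find_all (all matches), select (CSS selectors).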

3. Handling dynamic pages

Use Playwright to render JavaScript:

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=True)
      page = browser.new_page()
      page.goto("https://example.com")
      heading = page.query_selector("h1")  # None if the page has no <h1>
      if heading is not None:
          print(f"Title: {heading.inner_text()}")
      browser.close()

Note: Playwright drives a real browser, which makes it a good fit for SPAs (single-page applications).

4. Storing data

Save to CSV with pandas:

  import pandas as pd

  data = [{"title": "Example", "url": "https://example.com"}]
  df = pd.DataFrame(data)
  df.to_csv("output.csv", index=False)  # index=False omits the row-number column
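
When pandas is not otherwise needed, the same CSV output can be produced with the standard library. The following is a minimal sketch under that assumption; `write_rows` is a hypothetical helper, and an in-memory `io.StringIO` buffer stands in for a real file so the example leaves nothing on disk.

```python
import csv
import io


def write_rows(rows, fileobj, fieldnames=("title", "url")):
    """Write scraped rows (a list of dicts) as CSV to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    {"title": "Example", "url": "https://example.com"},
    {"title": "Comma, inside", "url": "https://example.com/2"},
]

# io.StringIO stands in for open("output.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes fields that contain commas, so the second title survives a round trip through the file intact.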

5. Asynchronous scraping (high concurrency)

Use aiohttp to make asynchronous requests:

  import aiohttp
  import asyncio

  async def fetch(session, url):
      # Reuse one session so connections are pooled across requests.
      async with session.get(url) as response:
          return await response.text()

  async def main():
      url = "https://example.com"
      async with aiohttp.ClientSession() as session:
          html = await fetch(session, url)
      print(html[:100])  # print the first 100 characters

  asyncio.run(main())
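
The aiohttp example above fetches a single URL; the benefit of async code shows up when many URLs are fetched at once, usually with a cap on how many requests are in flight. The sketch below illustrates that pattern with `asyncio.Semaphore` and `asyncio.gather`. It is an assumption-laden demo: `fake_fetch` and `crawl` are hypothetical names, and `fake_fetch` sleeps briefly instead of touching the network, so an aiohttp-based coroutine would need to be swapped in for real use.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for an aiohttp request; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, fetch=fake_fetch, limit=3):
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def bounded(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(bounded(u) for u in urls))
    return pages, peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages, peak = asyncio.run(crawl(urls))
print(len(pages), peak)
```

Capping concurrency keeps a large crawl from overwhelming the target site, which also ties in with the request-rate advice later in this tutorial.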

IV. Practical Examples (Putting It Together)

Scraping page headings (a simple scraper):

  import requests
  from bs4 import BeautifulSoup

  def scrape_titles(url):
      """Return the text of every <h2> on the page, or [] on failure."""
      headers = {"User-Agent": "Mozilla/5.0"}
      response = requests.get(url, headers=headers, timeout=10)
      if response.status_code == 200:
          soup = BeautifulSoup(response.text, "html.parser")
          titles = soup.find_all("h2")
          return [title.text.strip() for title in titles]
      return []

  def main():
      url = "https://example.com"
      titles = scrape_titles(url)
      for i, title in enumerate(titles, 1):
          print(f"{i}. {title}")

  if __name__ == "__main__":
      main()

What it does: scrapes every <h2> heading on the page and prints them.

Scraping a dynamic page (Playwright example):

  from playwright.sync_api import sync_playwright
  import pandas as pd

  def scrape_dynamic(url):
      """Render the page in headless Chromium and collect .item elements."""
      with sync_playwright() as p:
          browser = p.chromium.launch(headless=True)
          page = browser.new_page()
          page.goto(url)
          items = page.query_selector_all(".item")
          data = [{"text": item.inner_text()} for item in items]
          browser.close()
          return data

  def main():
      url = "https://example.com"
      data = scrape_dynamic(url)
      df = pd.DataFrame(data)
      df.to_csv("dynamic_data.csv", index=False)
      print("Data saved to dynamic_data.csv")

  if __name__ == "__main__":
      main()

What it does: scrapes the .item elements of a dynamic page and saves them as CSV.

A Scrapy spider:

Create the project:

  scrapy startproject myspider
  cd myspider

Spider file (myspider/spiders/example_spider.py):

  import scrapy

  class ExampleSpider(scrapy.Spider):
      name = "example"
      start_urls = ["https://example.com"]

      def parse(self, response):
          # Select the text node of every <h2> with a CSS selector.
          titles = response.css("h2::text").getall()
          for title in titles:
              yield {"title": title.strip()}

Run it:

  scrapy crawl example -o output.json

What it does: scrapes the <h2> headings and saves them as JSON.

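V. Caveats and Best Practices

Anti-scraping measures:

User-Agent: present a browser-like identity:

  headers = {"User-Agent": "Mozilla/5.0"}

IP proxies: route requests through a proxy pool (for example, requests with the proxies argument). Include an "https" entry as well, since requests picks the proxy by URL scheme:

  proxies = {"http": "http://proxy:port", "https": "http://proxy:port"}
  response = requests.get(url, proxies=proxies)

Request delays: pause between requests to avoid being banned:

  import time
  time.sleep(1)  # wait 1 second between requests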
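
A fixed `time.sleep(1)` works, but it ignores how long the request itself took and produces a perfectly regular pattern. One possible refinement, sketched here as an assumption rather than an established recipe, is a small throttle that enforces a minimum gap and adds random jitter. `Throttle` is a hypothetical class, and the short delays are chosen only so the demo finishes quickly.

```python
import random
import time


class Throttle:
    """Enforces a minimum delay, plus random jitter, between requests."""

    def __init__(self, min_delay=1.0, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Sleep just long enough to keep the configured gap, then record the time."""
        if self.last_request is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            elapsed = self.clock() - self.last_request
            if elapsed < target:
                self.sleep(target - elapsed)
        self.last_request = self.clock()


throttle = Throttle(min_delay=0.05, jitter=0.02)
start = time.monotonic()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    throttle.wait()  # blocks if the previous request was too recent
    # A real crawler would call requests.get(url, ...) here.
elapsed_total = time.monotonic() - start
print(f"{elapsed_total:.2f}s")
```

For real crawling, a `min_delay` of a second or more, matching the advice above, is a more considerate setting.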

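Dynamic pages:

Use Playwright or Selenium to handle JavaScript rendering.
Playwright is generally regarded as the lighter-weight of the two and is a reasonable default for new projects.

Data storage:

For small datasets, pandas is a convenient way to save CSV or JSON:

  df.to_csv("data.csv", index=False)

For large volumes, use a database such as SQLite or MySQL.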
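
As a rough illustration of the database option, the sketch below stores scraped items in SQLite using Python's built-in sqlite3 module. The table layout, the `save_items` helper, and the sample rows are all assumptions made for the example; the URL is used as the primary key so that re-crawled pages are skipped rather than duplicated.

```python
import sqlite3


def save_items(conn, items):
    """Insert scraped items; rows whose URL already exists are skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO pages (url, title) VALUES (:url, :title)",
            items,
        )


# ":memory:" keeps the demo self-contained; pass a file path to persist data.
conn = sqlite3.connect(":memory:")
save_items(conn, [
    {"url": "https://example.com/1", "title": "First"},
    {"url": "https://example.com/2", "title": "Second"},
])
# Re-running a crawl often yields pages that are already stored.
save_items(conn, [
    {"url": "https://example.com/2", "title": "Second (again)"},
    {"url": "https://example.com/3", "title": "Third"},
])
stored = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(stored)
```

`INSERT OR IGNORE` keeps the first version of each page; if the latest version should win instead, `INSERT OR REPLACE` is the corresponding SQLite form.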

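Legality:

Respect each site's robots.txt and terms of service.
Avoid high-frequency requests; an aggressive crawler can overload a server much as a denial-of-service attack would.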
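
Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below is a minimal example under stated assumptions: the `ROBOTS_TXT` content and the `my-crawler` agent name are made up for illustration, and the rules are parsed from a string so the snippet runs offline (a real crawler would load them from the site with `set_url` and `read`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = build_robots(ROBOTS_TXT)
agent = "my-crawler"  # assumed User-Agent name for this example

allowed_news = robots.can_fetch(agent, "https://example.com/news/1")
allowed_private = robots.can_fetch(agent, "https://example.com/private/x")
delay = robots.crawl_delay(agent)
print(allowed_news, allowed_private, delay)
```

Calling `can_fetch` before each request, and honoring `crawl_delay` when the site sets one, covers the two points above in code. Terms of service still have to be read by a person.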

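2025 trends:

Asynchronous scraping: aiohttp suits large-scale crawls.
Headless browsers: Playwright is increasingly chosen over Selenium and is often faster in practice.
KMP integration: a Python scraper can supply data to a Kotlin WebView (illustrative only; the file URL is a placeholder):

  webView.loadUrl("file://data.json")

AI assistance: GitHub Copilot in VS Code can generate scraper code.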

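VI. Learning Suggestions

Practice: use requests and BeautifulSoup to scrape a simple page, such as news headlines.
Resources:
Official documentation: https://docs.python-requests.org/
MDN (page structure): https://developer.mozilla.org/
CSDN: search for "Python 爬虫" (Python web scraping).
Bilibili: Python scraping tutorials (for example, "尚硅谷 Python").
Time: 2 to 3 days to learn basic scraping; about a week to get comfortable with dynamic scraping and Scrapy.
Projects: build a small scraper, for example one that tracks e-commerce prices or news headlines.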

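VII. Summary

To work with Python web scraping you need to know the core libraries (requests, BeautifulSoup, Scrapy) and the development workflow (request, parse, store), and you need to be able to scrape both static and dynamic pages and deal with anti-scraping measures. In 2025, asynchronous scraping and headless browsers such as Playwright make scrapers more efficient, and they are widely used in data collection and in KMP projects. Compared with other languages, Python offers a rich scraping ecosystem and a gentle learning curve.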



likuolei
