This article walks through using the Scrapy framework to crawl product data from Taobao and store it in a local database. We cover each step in detail, from login and project configuration to the crawling logic and data storage, with working code examples along the way.
First, make sure the required libraries are installed. Besides Scrapy, Selenium, and BeautifulSoup, the code below also needs lxml (the parser BeautifulSoup will use) and pymysql (for the database pipeline):

```bash
pip install scrapy selenium beautifulsoup4 lxml pymysql
```
Create a new Scrapy project and generate a skeleton spider:
```bash
scrapy startproject TaobaoSpider
cd TaobaoSpider
scrapy genspider taobao taobao.com
```
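After these commands the project should look roughly like this (the exact files vary slightly between Scrapy versions):

```
TaobaoSpider/
├── scrapy.cfg          # deployment configuration
└── TaobaoSpider/
    ├── __init__.py
    ├── items.py        # data models (edited below)
    ├── middlewares.py
    ├── pipelines.py    # storage logic (edited below)
    ├── settings.py     # project settings (edited below)
    └── spiders/
        └── taobao.py   # the spider generated by genspider
```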
In settings.py, add the necessary configuration (note that allowed_domains belongs on the spider class, not here):
```python
BOT_NAME = 'TaobaoSpider'

# Browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}

# Ignore robots.txt
ROBOTSTXT_OBEY = False

# Request timeout in seconds
DOWNLOAD_TIMEOUT = 15
```
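One more setting is needed here: the database pipeline written in the next step only runs if it is registered in settings.py. A minimal registration, assuming the project and class names used in this article:

```python
# Route scraped items through the MySQL pipeline defined in pipelines.py
# (lower number = higher priority when several pipelines are enabled)
ITEM_PIPELINES = {
    'TaobaoSpider.pipelines.TaobaoPipeline': 300,
}
```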
In pipelines.py, set up the database connection and the insert logic:
```python
import pymysql


class TaobaoPipeline(object):

    SQL = """INSERT INTO taobao
             (img_url, title, price, sale_volume, evaluate, integral, detail_url)
             VALUES (%s, %s, %s, %s, %s, %s, %s)"""

    def __init__(self):
        # Database configuration
        self.host = 'localhost'
        self.port = 3306
        self.user = 'username'
        self.password = 'password'
        self.database = 'database_name'
        self.charset = 'utf8mb4'  # pymysql expects 'utf8'/'utf8mb4', not 'utf-8'
        # Connect to the database
        self.conn = pymysql.connect(**self.base_params)
        self.cursor = self.conn.cursor()

    @property
    def base_params(self):
        return {
            'host': self.host,
            'port': self.port,
            'user': self.user,
            'password': self.password,
            'database': self.database,
            'charset': self.charset,
        }

    def process_item(self, item, spider):
        self.cursor.execute(self.SQL, (
            item['img_url'], item['title'], item['price'],
            item['sale_volume'], item['evaluate'], item['integral'],
            item['detail_url'],
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.cursor.close()
        self.conn.close()
```
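The pipeline assumes a taobao table already exists. Below is a minimal schema matching the columns inserted above; the column types are an assumption, so adjust them to your data:

```sql
CREATE TABLE taobao (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    img_url     VARCHAR(512),
    title       VARCHAR(256),
    price       VARCHAR(64),
    sale_volume INT,
    evaluate    INT,
    integral    INT,
    detail_url  VARCHAR(512)
) CHARACTER SET utf8mb4;
```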
In taobao.py, write the spider class. Scrapy only issues the initial request here; the actual page loads go through a headless Selenium browser, and BeautifulSoup parses the rendered HTML:
```python
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from time import sleep

from TaobaoSpider.items import TaobaoItem


class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['taobao.com']
    # This request only exists to trigger parse();
    # all real page fetching is done through Selenium below
    start_urls = ['https://s.taobao.com/search?q=java&s=0']

    def get_driver(self):
        # Start Chrome in headless mode (no visible window)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        return webdriver.Chrome(options=chrome_options)

    def parse(self, response):
        browser = self.get_driver()
        # Log in first (QR-code scan, see login() below)
        bro = self.login(browser)
        # Taobao lists 44 items per result page, so the `s` offset
        # advances in steps of 44; widen the range for more pages
        for i in range(0, 44 * 3, 44):
            bro.get(f'https://s.taobao.com/search?q=java&s={i}')
            soup = BeautifulSoup(bro.page_source, 'lxml')
            for item in soup.find_all('div', class_='J_MouserOnverReq'):
                # Image URL (data-src is protocol-relative, e.g. //g-search1.alicdn.com/...)
                img_url = 'http:' + item.find('img', 'J_ItemPic')['data-src']
                # Title
                title = item.find('h3', 'itemTitle').text.strip()
                # Price
                price = item.find('div', 'itemPrice').text.strip()
                # Detail page link
                detail_url = item.find('a', 'pic-link').get('href')
                # Sales volume; default to 0 when the element is missing
                try:
                    sale_volume = int(item.find('div', 'tm-ind-sellCount').text.replace('销量', ''))
                except (AttributeError, ValueError):
                    sale_volume = 0
                # Review count
                try:
                    evaluate = int(item.find('div', 'tm-ind-reviewCount').text.replace('评价', ''))
                except (AttributeError, ValueError):
                    evaluate = 0
                # Points
                try:
                    integral = int(item.find('div', 'tm-ind-emPointCount').text.replace('积分', ''))
                except (AttributeError, ValueError):
                    integral = 0
                yield TaobaoItem(
                    img_url=img_url,
                    price=price,
                    title=title,
                    sale_volume=sale_volume,
                    evaluate=evaluate,
                    integral=integral,
                    detail_url=detail_url,
                )
        # Close the browser once all pages are done
        bro.quit()
```
Add a method to the spider class that handles login: it switches the login page to QR-code mode and pauses so you can scan the code with the Taobao app:
```python
    def login(self, driver):
        driver.implicitly_wait(30)  # wait for page elements to load
        # Switch to QR-code login
        driver.find_element(By.CLASS_NAME, 'icon-qrcode').click()
        sleep(3)  # increase this if scanning the code takes longer
        return driver
```
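A fixed sleep plus an implicit wait is fragile. An alternative, shown here as a sketch, is Selenium's explicit-wait API, which blocks until the QR-code element is actually clickable (the 'icon-qrcode' class name comes from the code above and may change whenever Taobao reworks its login page):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

def login(self, driver):
    # Block up to 30 s until the QR-code tab can be clicked
    qrcode = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'icon-qrcode'))
    )
    qrcode.click()
    sleep(30)  # leave time to actually scan the code
    return driver
```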
In items.py, define the data model:
```python
import scrapy


class TaobaoItem(scrapy.Item):
    img_url = scrapy.Field()
    price = scrapy.Field()
    title = scrapy.Field()
    sale_volume = scrapy.Field()
    evaluate = scrapy.Field()
    integral = scrapy.Field()
    detail_url = scrapy.Field()
```
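Scrapy items support dictionary-style access, which is exactly how TaobaoPipeline reads the fields in process_item. A quick illustration (the values are made up):

```python
item = TaobaoItem(title='Java 核心技术', price='99.00')
print(item['title'])    # dict-style read, as in the pipeline
item['evaluate'] = 0    # fields can also be assigned after creation
```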
With everything in place, run the spider (locally, or after deploying the project to a server):

```bash
scrapy crawl taobao
```
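Before wiring up MySQL, it can be handy to sanity-check the spider by dumping items to a file instead; Scrapy's built-in feed export does this with the -o flag:

```bash
scrapy crawl taobao -o taobao.json
```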
A few closing notes: time.sleep() controls how long the browser pauses for page loads and the QR-code scan, and the try-except blocks absorb missing fields so the crawl keeps running. With the steps above, you now have a working setup for scraping Taobao data with Scrapy and storing it in a database. For the technical details of each step, the official documentation and related tutorials are worth consulting for a deeper understanding.