Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
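
Because uvloop is only installed on Linux, code that wants to benefit from it usually needs a fallback. The following is a minimal sketch of that pattern using plain asyncio, not a description of how gain itself selects its event loop; the helper name new_event_loop and the ping coroutine are made up for illustration.

```python
import asyncio

try:
    import uvloop  # optional dependency, typically present only on Linux
except ImportError:
    uvloop = None


def new_event_loop():
    """Return a uvloop event loop when available, else the asyncio default."""
    if uvloop is not None:
        return uvloop.new_event_loop()
    return asyncio.new_event_loop()


async def ping():
    # Trivial coroutine, used only to show that the chosen loop runs.
    await asyncio.sleep(0)
    return "pong"


loop = new_event_loop()
try:
    result = loop.run_until_complete(ping())
finally:
    loop.close()

print(result)
```

The same script runs unchanged on platforms where uvloop is not installed; only the loop implementation differs.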

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d from being treated as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
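
To see which URLs the two Parser patterns above are meant to select, the regular expressions can be tried on their own with the standard re module. This is only a sketch: the sample URLs are hypothetical, the classify helper is not part of gain, and the use of re.fullmatch is an assumption about matching, not a statement of how gain applies the patterns internally.

```python
import re

# The two patterns from the spider above, written as raw strings.
LIST_PAGE = r'https://blog.scrapinghub.com/page/\d+/'
POST_PAGE = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'


def classify(url):
    """Label a URL as 'post', 'listing', or None (hypothetical helper)."""
    if re.fullmatch(POST_PAGE, url):
        return 'post'
    if re.fullmatch(LIST_PAGE, url):
        return 'listing'
    return None


# Hypothetical sample URLs, chosen only to exercise each branch.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2016/10/27/example-post/',
    'https://blog.scrapinghub.com/about/',
]

labels = {url: classify(url) for url in samples}
for url, label in labels.items():
    print(label, url)
```

In the spider, the first pattern has no Item attached, so it only identifies listing pages to follow, while the second is paired with Post so that matching pages are parsed into items.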

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider, as shown above.
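
Both spiders set concurrency = 5. As a rough illustration of what such a cap means in asyncio, the sketch below limits the number of simultaneous tasks with asyncio.Semaphore. It is not gain's implementation: fetch is a hypothetical stand-in that sleeps instead of making a request, the example.com URLs are placeholders, and asyncio.run requires Python 3.7 or newer.

```python
import asyncio

CONCURRENCY = 5  # mirrors `concurrency = 5` in the spiders above


async def fetch(url, semaphore, stats):
    # Hypothetical stand-in for an HTTP request; no network access happens.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulate I/O latency
        stats["active"] -= 1
        return url


async def crawl(urls):
    # At most CONCURRENCY fetches hold the semaphore at any moment.
    semaphore = asyncio.Semaphore(CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    pages = await asyncio.gather(*(fetch(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]


urls = ["https://example.com/page/%d/" % i for i in range(20)]
pages, peak = asyncio.run(crawl(urls))
print(len(pages), "pages fetched; peak simultaneous fetches:", peak)
```

All 20 placeholder URLs are processed, but the recorded peak never exceeds the configured limit.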

Run python spider.py

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.


Owner

Jiuli Gao
