A universal package of scraper scripts for humans

Overview

Logo

MIT License

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides scraper scripts for the most commonly used machine learning and data science domains. Instead of driving a browser, Scrapera scrapes directly and asynchronously from public API endpoints. This removes the heavy browser overhead, making Scrapera extremely fast and robust to DOM changes. Currently, Scrapera supports the following scrapers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.
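Scrapera's own internals are not shown here, but the browser-free, endpoint-first approach described above can be sketched with nothing beyond the standard library. The URLs and the JSON shape below are placeholders for illustration, not real Scrapera endpoints; a real fetcher would issue HTTP requests instead of the offline stub used here:

```python
import asyncio

# Placeholder for an HTTP GET against a public JSON API endpoint.
# A real implementation would use aiohttp or urllib here; this stub
# only simulates network latency so the sketch runs offline.
async def fetch_json(url: str) -> dict:
    await asyncio.sleep(0.01)  # simulated network round-trip
    return {"url": url, "items": [f"record-{i}" for i in range(3)]}

async def scrape_endpoints(urls):
    # Issue all requests concurrently instead of driving a browser page by page.
    pages = await asyncio.gather(*(fetch_json(u) for u in urls))
    # Flatten the per-endpoint payloads into one result list.
    return [item for page in pages for item in page["items"]]

if __name__ == "__main__":
    urls = [f"https://api.example.com/page/{n}" for n in range(1, 4)]
    results = asyncio.run(scrape_endpoints(urls))
    print(len(results))  # 9 records from 3 endpoints
```

Because the requests run concurrently under asyncio, total time is bounded by the slowest endpoint rather than the sum of all round-trips, which is where the speed advantage over a browser-driven crawl comes from.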

DISCLAIMER: The owner and contributors do not take any responsibility for misuse of data obtained through Scrapera. Contact the owner if copyright terms are violated by any module provided by Scrapera.

Getting Started

Prerequisites

Prerequisites can be installed separately through the requirements.txt file:

pip install -r requirements.txt

Installation

Scrapera is built for Python 3 and can be installed directly with pip:

pip install scrapera

Alternatively, to install the latest version directly from GitHub, run:

pip install git+https://github.com/DarshanDeshpande/Scrapera.git

Usage

To use any sub-module, import it, instantiate the scraper and execute:

from scrapera.video.vimeo import VimeoScraper
scraper = VimeoScraper()
scraper.scrape('https://vimeo.com/191955190', '540p')
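Scrapers hit live endpoints, so individual downloads can fail. A small convenience wrapper, built only on the scrape(url, quality) call shown above, can batch several downloads and collect failures instead of aborting on the first error. The helper name and structure below are illustrative, not part of Scrapera's API:

```python
# Hypothetical batch helper (not part of Scrapera): call scrape() for each
# URL with the same quality setting, recording failures rather than raising.
def scrape_all(scraper, urls, quality='540p'):
    failures = []
    for url in urls:
        try:
            scraper.scrape(url, quality)
        except Exception as exc:  # network scrapers fail routinely; keep going
            failures.append((url, str(exc)))
    return failures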

For more examples, please refer to the individual test folders in the respective modules.

Contributing

Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails in any instance. Feel free to fork the repository and add your own scrapers to help the community!
For more guidelines, refer to CONTRIBUTING.

License

Distributed under the MIT License. See LICENSE for more information.

Sponsors

Logo

Contact

Feel free to reach out for any issues or requests related to Scrapera.

Darshan Deshpande (Owner) - Email | LinkedIn

Acknowledgements

Darshan Deshpande (Owner) - Helping Machines Learn Better 💻😃