A universal package of scraper scripts for humans


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, which removes the heavy browser overhead and makes it fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to bundle common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.
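The asynchronous, browser-free approach described above can be illustrated with a minimal sketch. This is not Scrapera's actual implementation: the function names `fetch_json` and `fetch_all` are hypothetical, and the sketch uses only the Python standard library (blocking `urllib` calls moved to worker threads with `asyncio.to_thread`, Python 3.9+) rather than whichever HTTP client Scrapera uses internally.

```python
import asyncio
import json
import urllib.request


def _get_json(url, timeout=10.0):
    """Blocking helper: GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


async def fetch_json(url):
    """Run the blocking request in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_get_json, url)


async def fetch_all(urls, fetch=fetch_json, max_concurrency=5):
    """Fetch all URLs concurrently, with at most max_concurrency requests in flight.

    Results come back in the same order as urls. A failed request is
    returned as its exception object instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore caps concurrency so the endpoint is not flooded.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Calling `asyncio.run(fetch_all(list_of_endpoint_urls))` would return one parsed JSON object (or exception) per URL. Because no browser is launched and no DOM is parsed, the only per-request cost in this sketch is the HTTP round trip.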

    DISCLAIMER: Owner or Contributors do not take any responsibility for misuse of data obtained through Scrapera. Contact the owner if copyright terms are violated due to any module provided by Scrapera.

    Prerequisites

    Prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, import its scraper class, instantiate it, and call its scrape method:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')

    For more examples, refer to the individual test folders in the respective modules.
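When scraping several URLs in one run, it can help to keep going after a single failure. The helper below is a hedged sketch, not part of Scrapera: `scrape_many` is a hypothetical name, it assumes only the `scrape(url, quality)` call shown in the usage example, and it assumes that a failed scrape surfaces as a raised exception.

```python
def scrape_many(scraper, urls, quality="540p"):
    """Call scraper.scrape(url, quality) for each URL, collecting failures.

    Returns a dict mapping each failed URL to the exception it raised.
    Assumption: the scraper signals failure by raising an exception.
    """
    failures = {}
    for url in urls:
        try:
            scraper.scrape(url, quality)
        except Exception as exc:  # keep going so one bad URL does not stop the batch
            failures[url] = exc
    return failures


# Possible usage with the scraper from the example above:
#   from scrapera.video.vimeo import VimeoScraper
#   failures = scrape_many(VimeoScraper(), ['https://vimeo.com/191955190'])
```

An empty result means every URL was scraped without raising. Any entries in the returned dict are good candidates to include when raising an issue.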

    Contributing

    Scrapera welcomes all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
