PyQuery-based scraping micro-framework.

Last update: Jul 20, 2022

demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085
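
The clean_value hook in TorrentDetails above is plain string handling, so it can be tried on its own. The sketch below copies that logic into a standalone function; the name strip_label is hypothetical and not part of demiurge.

```python
def strip_label(value):
    """Return the text after the first colon, stripped of whitespace."""
    # str.find returns -1 when there is no colon, so the slice then starts
    # at index 0 and the whole string is kept (only stripped).
    unlabel = value[value.find(':') + 1:]
    return unlabel.strip()


print(strip_label('Total size: 695.81 megabyte'))  # 695.81 megabyte
print(strip_label('no colon here '))               # no colon here
```

Only the first colon is used as the split point, so a value that itself contains a colon is kept intact after the label.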

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

Martín Gaitán (@mgaitan)

Comments

Reusable cleaning functions
You can now add a "clean" kwarg containing a function to a field.

This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

score = demiurge.TextField(selector=".score .upvoted", clean=int)
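
As a rough illustration of a reusable cleaning function, the sketch below parses a size string like the one returned by t.size in the quick start. The function name parse_size and its return shape are assumptions made for this example; passing it through the clean kwarg is shown only as a comment, following the description above.

```python
import re


def parse_size(text):
    """Turn a size string such as '695.81 MB' into (695.81, 'MB')."""
    # For str patterns, \s also matches the non-breaking space (\xa0)
    # that appears in the scraped value.
    match = re.match(r'\s*([\d.,]+)\s*([A-Za-z]+)\s*$', text)
    if match is None:
        raise ValueError('unrecognised size: %r' % (text,))
    number, unit = match.groups()
    return float(number.replace(',', '')), unit


# Hypothetical usage, based on the clean kwarg described above:
# size = demiurge.TextField(selector='td:eq(3)', clean=parse_size)
print(parse_size(u'695.81\xa0MB'))
```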
opened by traverseda 5
proof of concept: subitem field

short rationale: Sometimes I need to scrape a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analogous (in some way) to a ForeignKey / ManyToManyField in Django.

This is a first PR as a proof of concept, to discuss the idea and its API.

opened by mgaitan 5
RelatedItems only work across urls

An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

An example

http://pastebin.com/WDL4RjkE

Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.
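
For reference, the recursive comment-tree use case raised above can be sketched without demiurge. The example uses the standard library's ElementTree on a made-up, well-formed snippet; the markup, class name, and parse_comment helper are all assumptions for illustration, and this is not how RelatedItem is implemented.

```python
import xml.etree.ElementTree as ET

# Made-up markup: each comment holds a <p> and any number of nested comments.
SAMPLE = """
<div class="comment">
  <p>root</p>
  <div class="comment">
    <p>reply</p>
    <div class="comment"><p>nested reply</p></div>
  </div>
  <div class="comment"><p>second reply</p></div>
</div>
"""


def parse_comment(node):
    """Map one comment element to a dict, recursing into direct child comments."""
    return {
        'text': node.findtext('p'),
        'replies': [parse_comment(child)
                    for child in node.findall("div[@class='comment']")],
    }


tree = parse_comment(ET.fromstring(SAMPLE))
print(tree['text'], len(tree['replies']))
```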

opened by traverseda 2
Use lib "requests" for downloading

I'm right now making use of https://pypi.python.org/pypi/requests-cache which creates a cache of the downloaded stuff magically, and it's awesome. So, I would like to be able to take advantage of it using demiurge.

I don't know whether it should be an option or a replacement for the pyquery downloader.

What do you think?
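
To make the idea concrete, here is a minimal in-memory stand-in for the kind of caching described above. It is not requests-cache and not demiurge's downloader; CachedFetcher and fake_fetch are hypothetical names, and the fetch function is injected so the sketch runs offline.

```python
class CachedFetcher(object):
    """Wrap a fetch(url) callable and remember each response body by URL."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.misses = 0  # number of times the wrapped fetch was really called

    def get(self, url):
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_fetch(url):
    # Stand-in for a real HTTP download.
    return '<html>%s</html>' % url


fetcher = CachedFetcher(fake_fetch)
fetcher.get('/search/ubuntu/seeds')
fetcher.get('/search/ubuntu/seeds')
print(fetcher.misses)  # 1
```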

opened by jmansilla 2
docs: fix simple typo, ocurrence -> occurrence

There is a small typo in docs/index.rst.

Should read occurrence rather than ocurrence.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 1
Fix when no selector defined
The default selector is the whole page ('html'), but it is applied through PyQuery.find, which only traverses down to descendants. Example:

In [2]: PyQuery('<html>hello</html>').find('html')
Out[2]: []
In [3]: PyQuery('<html>hello</html>')('html')
Out[3]: [<html>]
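
The same pitfall (a search that looks only below the current element and so never matches the element itself) can be reproduced with the standard library. This is an analogy using xml.etree.ElementTree, not PyQuery's own code.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<html>hello</html>')

# findall looks at children of root, so root itself is never returned.
children_only = root.findall('html')

# iter walks the subtree starting at root, so root is included.
including_root = [el.tag for el in root.iter('html')]

print(children_only)   # []
print(including_root)  # ['html']
```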
opened by mgaitan 1
support self reference in RelatedItem

RelatedItem('self'). Also, the related item's class could be given by its name (i.e. RelatedItem("ItemClass")). A typical use case is a listing page with a "next page" link.
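
One possible way to resolve such references is sketched below. REGISTRY, register, and resolve_item are hypothetical helpers invented for this example and do not reflect demiurge internals.

```python
# Hypothetical name -> class registry; not part of demiurge.
REGISTRY = {}


def register(cls):
    """Class decorator that records the class under its own name."""
    REGISTRY[cls.__name__] = cls
    return cls


def resolve_item(ref, owner):
    """Resolve a reference given as a class, the string 'self', or a class name."""
    if isinstance(ref, type):
        return ref
    if ref == 'self':
        return owner
    return REGISTRY[ref]


@register
class Page(object):
    next_page = 'self'   # a listing page linking to another page of the same kind


@register
class Listing(object):
    first = 'Page'       # reference to another item class by name


print(resolve_item(Page.next_page, Page) is Page)    # True
print(resolve_item(Listing.first, Listing) is Page)  # True
```

Resolving lazily by name means a class can refer to itself or to a class defined later in the module.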

opened by mgaitan 0

Releases(v0.2)

v0.2(Sep 20, 2014)
Added docs.

Added RelatedItem.

Added field clean support.


Owner

Matias Bordese

GitHub Repository http://demiurge.readthedocs.org

Console application for downloading images from Reddit in Python

RedditImageScraper Console application for downloading images from Reddit in Python Introduction This short Python script was created for the mass-dow

0 Jul 04, 2021

WebScraping - Scrapes Job website for python developer jobs and exports the data to a csv file

WebScraping Web scraping Python program that scrapes Job website for python devel

2 Jul 22, 2022

Simple proxy scraper made by using ProxyScrape's api.

What is Moon? Moon is a lightweight and fast proxy scraper made by using ProxyScrape's api. What can i do with this? You can use proxies for varietys

1 Jul 04, 2022

fork huanghyw/jd_seckill

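Jd_Seckill Special notice: Any scripts involved in the jd_seckill project published in this repository are for testing, learning, and research only. Commercial use is prohibited, and their legality, accuracy, completeness, and effectiveness cannot be guaranteed; please use your own judgment. All resource files in this project may not be reposted or published in any form by any official account or self-media outlet.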

512 Jan 03, 2023

Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit

wallstreetbets-tracker Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit.

91 Dec 08, 2022

Dex-scrapper - Hobby project for scrapping dex data on VeChain

Folders /zumo_abis # abi extracted from zumo repo /zumo_pools # runtime e

3 Jan 20, 2022

An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line!

Social Media Scraper An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line! Go to the website » Vie

2 Aug 03, 2022

✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

3 Mar 07, 2022

Scrap-mtg-top-8 - A top 8 mtg scraper using python

1 Jan 24, 2022

A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

4 Jul 26, 2022

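Automatic web check-in implemented with python+selenium + daily email delivery + Kingsoft PowerWord (iCIBA) sentence of the day + "poisonous chicken soup" quotes (running stably since February)

Automatic web check-in implemented with python+selenium. Notes: this check-in script is intended for the Zhengzhou University health check-in; other web check-ins can also use it as a reference. (For my own use; running stably since February.) For learning and exchange only; please do not rely on it. The developer takes no responsibility for problems caused by using this script, makes no guarantee about its results, and in principle provides no technical support of any kind. To prevent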

1 Aug 27, 2022

A python script to extract answers to any question on Quora (Quora+ included)

quora-plus-bypass A python script to extract answers to any question on Quora (Quora+ included) Requirements Python 3.x

10 Aug 18, 2022

Web Scraping COVID 19 Meta Portal with Python

Web-Scraping-COVID-19-Meta-Portal-with-Python - Requests API and Beautiful Soup to scrape real-time COVID statistics from worldometer website and perform data cleaning and visual analysis in Jupyter

1 Jan 04, 2022

Web scraper for getting price quotes on items

WebScrapper This web scraper is developed in Python 3.10.0 to search the Cyberpuerta site for items in the catalog. The program

1 Oct 27, 2021

Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

slocpi-scraper Sun Life of Canada Philippines Inc Investment Funds Scraper Install dependencies pip install -r requirements.txt Usage General format:

2 Jan 07, 2022

Simply scrape / download all the media from an fansly account.

API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

A tool to easily scrape youtube data using the Google API

Instagram profile scrapper with python

Python script that crawls the first Shodan page and checks for the DBLTEK vulnerability
