A high-level distributed crawling framework.

Last update: Jan 04, 2023

Cola: high-level distributed crawling framework

Overview

Cola is a high-level distributed crawling framework, used to crawl pages and extract structured data from websites. It provides a simple, fast, yet flexible way to meet your data acquisition goals. Users only need to write one piece of code, which can run in both local and distributed mode.

Requirements

Python 2.7 (Python 3+ will be supported later)
Works on Linux, Windows and Mac OS X

Install

The quick way:

pip install cola

Or download the source code, then run:

python setup.py install

Write applications

The documentation will be updated soon; for now, refer to the wiki or weibo application as an example.

Run applications

For the wiki or weibo app, make sure its dependencies are installed first. Taking weibo as an example:

pip install -r /path/to/cola/app/weibo/requirements.txt

Local mode

To let your application support local mode, add the following code to its entry point:

import os
from cola.context import Context

# Run the job located in this file's directory, in local mode
ctx = Context(local_mode=True)
ctx.run_job(os.path.dirname(os.path.abspath(__file__)))

Then run the application:

python __init__.py

Stop the local job with Ctrl+C.

Distributed mode

Start master:

coca master -s [ip:port]

Start one or more workers:

coca worker -s -m [ip:port]

Then run the application (weibo as an example):

coca job -u /path/to/cola/app/weibo -r

Coca command

Coca is a convenient command-line tool for the whole cola environment.

master

Kill master to stop the whole cluster:

coca master -k

job

List all jobs:

coca job -m [ip:port] -l

Example output:

list jobs at master: 10.211.55.2:11103
====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, status: stopped
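
If you want to consume this listing from a script, the sample line above can be parsed with a regular expression. The following is only a sketch: the pattern is inferred from the single sample shown here, so the real output may contain variations it does not handle, and `parse_job_list` is a hypothetical helper, not part of cola.

```python
import re

# Pattern inferred from the sample output above (an assumption, not a
# documented format): "====> job id: X, job description: Y, status: Z"
JOB_LINE = re.compile(
    r"^====> job id: (?P<id>[^,\s]+), "
    r"job description: (?P<description>.*), "
    r"status: (?P<status>\S+)$"
)


def parse_job_list(output):
    """Return one dict per job line found in the text printed by `coca job -l`."""
    jobs = []
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            jobs.append(match.groupdict())
    return jobs


sample = (
    "list jobs at master: 10.211.55.2:11103\n"
    "====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, status: stopped\n"
)
jobs = parse_job_list(sample)
print(jobs)
```

Lines that do not match the pattern, such as the "list jobs at master" header, are simply skipped.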

You can run a job shown in the list above:

coca job -r 8ZcGfAqHmzc

Actually, you don't have to type the complete job id:

coca job -r 8Z

Part of the job id is enough, as long as it does not conflict with another job.
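
The idea of accepting a partial identifier can be sketched as follows. This is an illustration, not cola's actual implementation: it assumes the partial value is matched as a prefix, and `resolve_job_id` and the second job id are hypothetical.

```python
def resolve_job_id(partial, job_ids):
    """Return the single job id that starts with `partial`.

    Raises ValueError when nothing matches or when the prefix is ambiguous.
    Prefix matching is an assumption made for this sketch.
    """
    matches = [job_id for job_id in job_ids if job_id.startswith(partial)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("no job matches %r" % partial)
    raise ValueError(
        "ambiguous job id %r, candidates: %s" % (partial, ", ".join(sorted(matches)))
    )


# "8ZcGfAqHmzc" comes from the listing above; "8Yk2mPq0aaa" is made up.
known_jobs = ["8ZcGfAqHmzc", "8Yk2mPq0aaa"]
resolved = resolve_job_id("8Z", known_jobs)
print(resolved)
```

With these two ids, `8Z` identifies exactly one job, while `8` alone would be rejected as ambiguous.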

You can check the status of a running job with:

coca job -t 8Z

Status information, such as the counters collected while the job runs, is printed to the terminal.

You can kill a job with the kill command:

coca job -k 8Z

startproject

You can create an application with this command:

coca startproject colatest

Remember, the help command is always available:

coca -h

coca master -h

Notes

Chinese docs (wiki).

Donation

Cola is a non-profit project that is currently maintained by me alone, so any donation is an encouragement for further improvement of the cola project.

Alipay & Paypal: [email protected]

Comments

docs: Fix a few typos
There are small typos in:

cola/cluster/master.py

cola/core/bloomfilter/__init__.py

cola/core/opener.py

Fixes:

Should read experimentally rather than experimently.

Should read entries rather than enteries.

Should read continuously rather than continously.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
opened by timgates42 0

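Why does the task never exit after it has finished executing?

The run method of the Task class contains two loops, and the outermost loop only exits once the stop event has been set. Why?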

def run(self):
        try:
            curr_priority = 0
            while not self.stopped.is_set():
                priority_name = 'inc' if curr_priority == self.n_priorities \
                                    else curr_priority
                is_inc = priority_name == 'inc'
                
                while not self.nonsuspend.wait(5):
                    continue
                if self.stopped.is_set():
                    break
                
                self.logger.debug('start to process priority: %s' % priority_name)
                
                last = self.priorities_secs[curr_priority]
                clock = Clock()
                runnings = []
                try:
                    no_budgets_times = 0
                    while not self.stopped.is_set():
                        if clock.clock() >= last:
                            break
                        
                        if not is_inc:
                            status = self._apply(no_budgets_times)
                            if status == CANNOT_APPLY:
                                break
                            elif status == APPLY_FAIL:
                                no_budgets_times += 1
                                if not self._has_not_finished(curr_priority) and \
                                    len(runnings) == 0:
                                    continue
                                
                                if self._has_not_finished(curr_priority) and \
                                    len(runnings) == 0:
                                    self._get_unit(curr_priority, runnings)
                            else:
                                no_budgets_times = 0
                                self._get_unit(curr_priority, runnings)
                        else:
                            self._get_unit(curr_priority, runnings)
                            
                        if len(runnings) == 0:
                            break
                        if self.is_bundle:
                            self.logger.debug(
                                'process bundle from priority %s' % priority_name)
                            rest = min(last - clock.clock(), MAX_BUNDLE_RUNNING_SECONDS)
                            if rest <= 0:
                                break
                            obj = self.executor.execute(runnings.pop(), rest, is_inc=is_inc)
                        else:
                            obj = self.executor.execute(runnings.pop(), is_inc=is_inc)
                            
                        if obj is not None:
                            runnings.insert(0, obj)  
                finally:
                    self.priorities_objs[curr_priority].extend(runnings)
                    
                curr_priority = (curr_priority+1) % self.full_priorities
        finally:
            self.counter_client.sync()
            self.save()

opened by brightgems 5
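
For readers following this question, the control flow of the snippet above can be reduced to a small sketch. It only illustrates the behavior being asked about and does not explain the author's intent: running out of work ends one pass of the inner loop, but the outer loop moves on to the next priority and keeps cycling until the stop event is set. The function signature, the `log` list and the sleep intervals are hypothetical simplifications.

```python
import threading
import time


def run(stopped, n_priorities, log):
    """Cycle through priorities until `stopped` is set, like the outer loop above."""
    curr_priority = 0
    while not stopped.is_set():
        log.append(curr_priority)
        # The inner processing loop would go here. Finishing the work of one
        # priority only ends this pass; it does not end the outer loop.
        time.sleep(0.01)
        curr_priority = (curr_priority + 1) % n_priorities


stopped = threading.Event()
visited = []
worker = threading.Thread(target=run, args=(stopped, 3, visited))
worker.start()

time.sleep(0.2)                   # nobody has set the event yet
still_running = worker.is_alive()

stopped.set()                     # the only thing that ends the outer loop
worker.join(timeout=2)
print(still_running, worker.is_alive())
```

In this sketch the thread is still alive before `stopped.set()` is called and has exited after it, which mirrors why the task keeps running until something sets the stop event.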

I took a look: the log is the same as in the previous issue, so this is probably a problem with the mq not being properly protected

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/crawl/code/cola-code/cola/core/mq/__init__.py", line 103, in _init_process
    self.put(objs, flush=flush)
  File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 407, in put
    self._remote_or_local_batch_put(addr, self.caches[addr])
  File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 348, in _remote_or_local_batch_put
    self.mq_node.batch_put(objs)
  File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 151, in batch_put
    self.put(obs, force=force, priority=priority)
  File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 125, in put
    priority_store.put(objs, force=force)
  File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 291, in put
    result = self.put_one(obj, force, commit=False)
  File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 266, in put_one
    pos = self._seek_writable_pos(m)
  File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 228, in _seek_writable_pos
    size, = struct.unpack('I', map_handle[pos:pos+4])
TypeError: 'NoneType' object has no attribute '__getitem__'

opened by tottilin 0

Releases(0.1.0beta)

0.1.0beta(Jul 6, 2015)

The beta version of 0.1.0.
Source code(tar.gz)
Source code(zip)

Owner

Xuye (Chris) Qin

Core developer and architect of Mars, a tensor-based unified framework for large-scale data computation; has also worked on PyODPS and cola.
