Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
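
The four cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual cleaning script: the regular expressions, the emoji ranges, and the function name `clean_text` are all assumptions made for the example.

```python
import re

# All patterns below are illustrative assumptions; the repo's real rules are not published here.
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A few common Unicode emoji ranges (not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, smileys, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)
# Text emoticons such as :) ;-( :D =P, only when not glued to a word or number.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-^']?[)(\]\[DPp/\\|*3]+(?!\w)")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, emojis and emoticons, then collapse whitespace."""
    text = HTML_TAG_RE.sub(" ", text)   # tags first, so URLs inside attributes go too
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào 😀 :) xem https://example.com/a?b=1</p>"))
```

HTML tags are removed before URLs so that a link inside an attribute disappears together with its tag; the lookbehind in the emoticon pattern keeps times such as 10:30 intact.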

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The Vietnamese portion of OSCAR contains about 32 GB of text after duplicates are discarded.

link_raw, link_clean
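
As an illustration of what discarding duplicates can look like in practice, here is a minimal sketch of exact line-level deduplication. It is an assumption for the sake of example, not necessarily the procedure used by OSCAR or by this repo; the function name `dedup_lines` is hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it is seen, ignoring surrounding whitespace."""
    seen = set()
    for line in lines:
        # Store a fixed-size hash instead of the full line to limit memory use.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    sample = ["Hà Nội\n", "Đà Nẵng\n", "Hà Nội\n"]
    print(list(dedup_lines(sample)))
```

Because the function is a generator, it can be fed an open file object and written out line by line without loading the whole corpus into memory, although the set of hashes still grows with the number of unique lines.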

3. Vietnamese story dataset:

About 10 GB of short and long stories crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences of poetry collected from the internet by QAI.

link_clean

Owner

QAI Research
