DANeS is an open-source E-newspaper dataset created through a collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Last update: Aug 17, 2022

Overview

DANeS - Open-source E-newspaper dataset

Source: Technology vector created by macrovector - www.freepik.com.

DANeS is an open-source E-newspaper dataset created through a collaboration between DATASET .JSC (dataset.vn) and AIV Group (aivgroup.vn). It contains over 600,000 online newspaper articles, gathered from a number of Vietnamese publishers such as tuoitre.vn, baobinhduong.vn, baoquangbinh.vn, kinhtechungkhoan.vn, doanhnghiep.vn, vnexpress.net, ...

We hope to support the community by providing a multi-purpose set of raw data for different users (students, developers, companies, …). If you create something with this dataset, please share it with us by e-mail: [email protected]

Folder Tree
Data format
Labeling process
Reviewing process
Updating process
License of annotated dataset
About us

Folder Tree

DANeS
  |
  |____README.md
  |
  |____raw_data
  |	   |____ DANeS_batch_#1.json
  |	   |____ DANeS_batch_#2.json
  |	   |____ DANeS_batch_#3.json
  |	   |____ DANeS_batch_#4.json
  |	   |____ DANeS_batch_#5.json
  |	   |____ DANeS_batch_#6.json
  |	   |____ DANeS_batch_#7.json
  |	   |____ DANeS_batch_#8.json
  |	   |____ README.md
  |
  |____annotated_data
  |	   |____ #contains annotated data
  |
  |____model
	   |____ Train_opensource.py
	   |____ README.md
	   |____ LICENSE

Data format

The raw dataset is stored in the raw_data folder in .json format and is divided into 8 batches. Each batch file holds a single array of JSON objects, and each object is one record of the dataset. A record has the following fields:

Key	Type	Description
text	string	title of the digital news
meta	json	metadata of the digital news; holds the uri and description keys
uri	string	link to the digital news (nested inside meta)
description	string	description of the digital news (nested inside meta)

Example of a record in the dataset:

{
    "text": "Ba ra đi vào ngày nhận điểm thi, nữ sinh được hỗ trợ học phí",
    "meta": {
        "description": "Ngày nhận được tin đỗ đại học cũng là lúc bố mất vì Covid-19, L.A dường như gục ngã. Thế nhưng, bên cạnh em đã có các mạnh thường quân hỏi han, hỗ trợ về kinh tế.",
        "uri": "https://yan.vn/ba-ra-di-vao-ngay-nhan-diem-thi-nu-sinh-duoc-ho-tro-hoc-phi-277328.html"
    }
}
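As a rough illustration of the format above, the following Python sketch reads one batch file and flattens each record into a single dictionary. It assumes each batch file is one top-level JSON array encoded as UTF-8, and that the file sits at raw_data/DANeS_batch_#1.json as in the folder tree. The helper names load_records and flatten_record are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_records(path):
    """Read one batch file and return its list of record dicts.

    Assumption: the file holds a single top-level JSON array.
    """
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON array in {path}")
    return records


def flatten_record(record):
    """Pull the title and the nested metadata fields into one flat dict."""
    meta = record.get("meta") or {}
    return {
        "title": record.get("text", ""),
        "description": meta.get("description", ""),
        "uri": meta.get("uri", ""),
    }


if __name__ == "__main__":
    # Assumed location, taken from the folder tree in this README.
    batch = Path("raw_data") / "DANeS_batch_#1.json"
    if batch.exists():
        rows = [flatten_record(r) for r in load_records(batch)]
        print(len(rows), "records loaded")
```

Missing keys fall back to empty strings, so a record without a meta object does not stop the loop.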

Labeling process

Annotating:
- Each article is classified under one of three sentiments: Negative, Positive or Neutral.
- Each article is then classified by topic. There are 22 named topics plus Unidentified: World, Politics, Economics, Sports, Cultures, Entertainment, Technology, Science, Education, Daily life, Regulations, Real estate, Social, Traffic, Environment, Stock market, Covid-19, Breaking news, Game, Movies, Health, Travel, Unidentified. An article can carry several relevant topics.
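The labeling rules above can be checked mechanically. The sketch below is only an illustration: the README does not describe the schema of the annotated files, so the function signature and the rule that at least one topic is required are assumptions. The label spellings are copied from the list above.

```python
SENTIMENTS = {"Negative", "Positive", "Neutral"}

TOPICS = {
    "World", "Politics", "Economics", "Sports", "Cultures", "Entertainment",
    "Technology", "Science", "Education", "Daily life", "Regulations",
    "Real estate", "Social", "Traffic", "Environment", "Stock market",
    "Covid-19", "Breaking news", "Game", "Movies", "Health", "Travel",
    "Unidentified",
}


def validate_annotation(sentiment, topics):
    """Return a list of problems; an empty list means the labels are valid.

    Hypothetical helper: the real annotation schema is not documented here.
    """
    problems = []
    if sentiment not in SENTIMENTS:
        problems.append(f"unknown sentiment: {sentiment!r}")
    if not topics:
        # Assumption: every article gets at least one topic.
        problems.append("at least one topic is required")
    unknown = sorted(set(topics) - TOPICS)
    if unknown:
        problems.append(f"unknown topics: {unknown}")
    return problems


if __name__ == "__main__":
    print(validate_annotation("Positive", ["Education", "Covid-19"]))
    print(validate_annotation("Happy", ["Weather"]))
```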

Reviewing process

The admin or owner of the project selects qualified reviewers based on their attitude and performance. The reviewing process has two main phases: cross validation and project reviewing.

The person assigned to cross validation is given 20% of the records annotated by other annotators and is also in charge of correcting any mislabeled records.
After the cross validation phase, the person assigned to review the project randomly picks 20 - 50% of all annotated records. Records that do not meet the required quality can either be:
- Corrected by the project reviewer.
- Re-assigned to and corrected by the original annotator.
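The two sampling steps above could look roughly like the following sketch. It is not the project's actual tooling. It assumes each annotated record is a dict with a hypothetical "annotator" key, that the 20% share is drawn at random, and that shares are rounded to the nearest whole record.

```python
import random


def cross_validation_sample(records, reviewer, fraction=0.2, seed=None):
    """Sample a share of the records annotated by people other than reviewer."""
    rng = random.Random(seed)
    # "annotator" is a hypothetical field name used for illustration.
    others = [r for r in records if r["annotator"] != reviewer]
    return rng.sample(others, round(len(others) * fraction))


def project_review_sample(records, fraction, seed=None):
    """Randomly pick between 20% and 50% of all annotated records."""
    if not 0.2 <= fraction <= 0.5:
        raise ValueError("fraction must be between 0.2 and 0.5")
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * fraction))


if __name__ == "__main__":
    demo = [{"id": i, "annotator": "abc"[i % 3]} for i in range(15)]
    print(len(cross_validation_sample(demo, reviewer="a", seed=0)))
    print(len(project_review_sample(demo, fraction=0.4, seed=0)))
```

Passing a seed makes a draw repeatable, which helps when a review round needs to be audited later.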

Updating process

The raw data is expected to be uploaded in full in a single release.
The annotated records are expected to be updated once a month in the official repository of DANeS (https://github.com/dataset-vn/DANeS).

License of annotated dataset

The annotated dataset of DANeS is licensed under the Creative Commons Attribution 4.0 International License.

This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation. This is the most accommodating of licenses offered. Recommended for maximum dissemination and use of licensed materials.

About us

DATASET .JSC - (+84) 98 442 0826 - [email protected]

Dataset’s mission is to support individuals and organizations with data collection and data processing services by providing tools that simplify these processes and make them more efficient. With a large, professional workforce, Dataset aspires to provide partners with a comprehensive, high-quality solution suited to the characteristics of the technology market.

Website: Dataset.vn

LinkedIn: Dataset.vn - Data Crowdsourcing Platform

Facebook: Dataset.vn - Data Crowdsourcing Platform

AIV Group - (+84) 931 458 189 - [email protected]

AIV Group aims to apply advanced technologies, especially Artificial Intelligence (AI), Cloud Computing and Big Data, to digitize and modernize the long-established processes of information production and consumption in Vietnamese society. At the same time, we are working on solutions to new technology-related problems arising in the field of communication, such as fake news and automatically cut-and-merged images and videos.

Website: AIV Group

Facebook: AIV Group
