PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Related tags

Deep LearningCI-ToD
Overview

Don’t be Contradicted with Anything!CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

License: MIT

This repository contains the PyTorch implementation and the data of the paper: Don’t be Contradicted with Anything!CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System. Libo Qin, Tianbao Xie, Shijue Huang, Qiguang Chen, Xiao Xu, Wanxiang Che. EMNLP2021.[PDF] .

This code has been written using PyTorch >= 1.1. If you use any source codes or the datasets included in this toolkit in your work, please cite the following paper. The bibtex are listed below:

@article{qin2021CIToD,
  title={Don’t be Contradicted with Anything!CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System},
  author={Qin, Libo and Xie, Tianbao and Huang, Shijue and Chen, Qiguang and Xu, Xiao and Che, Wanxiang},
  journal={arXiv preprint arXiv:2109.11292},
  year={2021}
}

Abstract

Consistency Identification has obtained remarkable success on open-domain dialogue, which can be used for preventing inconsistent response generation. However, in contrast to the rapid development in open-domain dialogue, few efforts have been made to the task-oriented dialogue direction. In this paper, we argue that consistency problem is more urgent in task-oriented domain. To facilitate the research, we introduce CI-ToD, a novel dataset for Consistency Identification in Task-oriented Dialog system. In addition, we not only annotate the single label to enable the model to judge whether the system response is contradictory, but also provide more finegrained labels (i.e., Dialogue History Inconsistency(HI), User Query Inconsistency(QI) and Knowledge Base Inconsistency(KBI), which are as shown in the figure below) to encourage model to know what inconsistent sources lead to it. Empirical results show that state-of-the-art methods only achieve performance of 51.3%, which is far behind the human performance of 93.2%, indicating that there is ample room for improving consistency identification ability. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future directions.

Dataset

We construct the CI-ToD dataset based on the KVRET dataset. We release our dataset together with the code, you can find it under data.

The basic format of the dataset is as follows, including multiple rounds of dialogue, knowledge base and related inconsistency annotations (KBI, QI, HI):

[
    {
        "id": 74,
        "dialogue": [
            {
                "turn": "driver",
                "utterance": "i need to find out the date and time for my swimming_activity"
            },
            {
                "turn": "assistant",
                "utterance": "i have two which one i have one for the_14th at 6pm and one for the_12th at 7pm"
            }
        ],
        "scenario": {
            "kb": {
                "items": [
                    {
                        "date": "the_11th",
                        "time": "9am",
                        "event": "tennis_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "father"
                    },
                    {
                        "date": "the_18th",
                        "time": "2pm",
                        "event": "football_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "martha"
                    },
                    .......
                ]
            },
            "qi": "0",
            "hi": "0",
            "kbi": "0"
        },
        "HIPosition": []
    }

KBRetriever_DC

Dataset QI HI KBI SUM
calendar_train.json 174 56 177 595
calendar_dev.json 28 9 24 74
calendar_test.json 23 8 21 74
navigate_train.json 453 386 591 1110
navigate_dev.json 55 41 69 139
navigate_test.json 48 44 71 138
weather_new_train.json 631 132 551 848
weather_new_dev.json 81 14 66 106
weather_new_test.json 72 12 69 106

Model

Here is the model structure of non pre-trained model (a) and pre-trained model (b and c).

Preparation

we provide some pre-trained baselines on our proposed CI-TOD dataset, the packages we used are listed follow:

-- scikit-learn==0.23.2
-- numpy=1.19.1
-- pytorch=1.1.0
-- fitlog==0.9.13
-- tqdm=4.49.0
-- sklearn==0.0
-- transformers==3.2.0

We highly suggest you using Anaconda to manage your python environment. If so, you can run the following command directly on the terminal to create the environment:

conda env create -f py3.6pytorch1.1_.yaml

How to run it

The script train.py acts as a main function to the project, you can run the experiments by the following commands:

python -u train.py --cfg KBRetriver_DC/KBRetriver_DC_BERT.cfg

The parameters we use are configured in the configure. If you need to adjust them, you can modify them in the relevant files or append parameters to the command.

Finally, you can check the results in logs folder.Also, you can run fitlog command to visualize the results:

fitlog log logs/

Baseline Experiment Result

All experiments were performed in TITAN_XP except for BART, which was performed on Tesla V100 PCIE 32 GB. These may not be the best results. Therefore, the parameters can be adjusted to obtain better results.

KBRetriever_DC

Baseline category Baseline method QI F1 HI F1 KBI F1 Overall Acc
Non Pre-trained Model ESIM (Chen et al., 2017) 0.512 0.164 0.543 0.432
Infersent (Romanov and Shivade, 2018) 0.557 0.031 0.336 0.356
RE2 (Yang et al., 2019) 0.655 0.244 0.739 0.481
Pre-trained Model BERT (Devlin et al., 2019) 0.691 0.555 0.740 0.500
RoBERTa (Liu et al., 2019) 0.715 0.472 0.715 0.500
XLNet (Yang et al., 2020) 0.725 0.487 0.736 0.509
Longformer (Beltagy et al., 2020) 0.717 0.500 0.710 0.497
BART (Lewis et al., 2020) 0.744 0.510 0.761 0.513
Human Human Performance 0.962 0.805 0.920 0.932

Leaderboard

If you submit papers with these datasets, please consider sending a pull request to merge your results onto the leaderboard. By submitting, you acknowledge that your results are obtained purely by training on the training datasets and tuned on the dev datasets (e.g. you only evaluted on the test set once).

KBRetriever_DC

Baseline method QI F1 HI F1 KBI F1 Overall Acc
ESIM (Chen et al., 2017) 0.512 0.164 0.543 0.432
Infersent (Romanov and Shivade, 2018) 0.557 0.031 0.336 0.356
RE2 (Yang et al., 2019) 0.655 0.244 0.739 0.481
BERT (Devlin et al., 2019) 0.691 0.555 0.740 0.500
RoBERTa (Liu et al., 2019) 0.715 0.472 0.715 0.500
XLNet (Yang et al., 2020) 0.725 0.487 0.736 0.509
Longformer (Beltagy et al., 2020) 0.717 0.500 0.710 0.497
BART (Lewis et al., 2020) 0.744 0.510 0.761 0.513
Human Performance 0.962 0.805 0.920 0.932

Acknowledgement

Thanks for patient annotation from all taggers Lehan Wang, Ran Duan, Fuxuan Wei, Yudi Zhang, Weiyun Wang!

Thanks for supports and guidance from our adviser Wanxiang Che!

Contact us

  • Just feel free to open issues or send us email(me, Tianbao) if you have any problems or find some mistakes in this dataset.
Owner
Libo Qin
Ph.D. Candidate in Harbin Institute of Technology @HIT-SCIR. Homepage: http://ir.hit.edu.cn/~lbqin/
Libo Qin
Cross-media Structured Common Space for Multimedia Event Extraction (ACL2020)

Cross-media Structured Common Space for Multimedia Event Extraction Table of Contents Overview Requirements Data Quickstart Citation Overview The code

Manling Li 49 Nov 21, 2022
Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

On the Generative Utility of Cyclic Conditionals This repository is the official implementation of "On the Generative Utility of Cyclic Conditionals"

Chang Liu 44 Nov 16, 2022
The official implementation of A Unified Game-Theoretic Interpretation of Adversarial Robustness.

This repository is the official implementation of A Unified Game-Theoretic Interpretation of Adversarial Robustness. Requirements pip install -r requi

Jie Ren 17 Dec 12, 2022
Cweqgen - The CW Equation Generator

The CW Equation Generator The cweqgen (pronouced like "Queck-Jen") package provi

2 Jan 15, 2022
FSL-Mate: A collection of resources for few-shot learning (FSL).

FSL-Mate is a collection of resources for few-shot learning (FSL). In particular, FSL-Mate currently contains FewShotPapers: a paper list which tracks

Yaqing Wang 1.5k Jan 08, 2023
Dilated Convolution with Learnable Spacings PyTorch

Dilated-Convolution-with-Learnable-Spacings-PyTorch Ismail Khalfaoui Hassani Dilated Convolution with Learnable Spacings (abbreviated to DCLS) is a no

15 Dec 09, 2022
Official implementation of the paper Do pedestrians pay attention? Eye contact detection for autonomous driving

Do pedestrians pay attention? Eye contact detection for autonomous driving Official implementation of the paper Do pedestrians pay attention? Eye cont

VITA lab at EPFL 26 Nov 02, 2022
A multilingual version of MS MARCO passage ranking dataset

mMARCO A multilingual version of MS MARCO passage ranking dataset This repository presents a neural machine translation-based method for translating t

75 Dec 27, 2022
A python library for highly configurable transformers - easing model architecture search and experimentation.

A python library for highly configurable transformers - easing model architecture search and experimentation.

Anthony Fuller 51 Nov 20, 2022
LAMDA: Label Matching Deep Domain Adaptation

LAMDA: Label Matching Deep Domain Adaptation This is the implementation of the paper LAMDA: Label Matching Deep Domain Adaptation which has been accep

Tuan Nguyen 9 Sep 06, 2022
Zero-shot Learning by Generating Task-specific Adapters

Code for "Zero-shot Learning by Generating Task-specific Adapters" This is the repository containing code for "Zero-shot Learning by Generating Task-s

INK Lab @ USC 11 Dec 17, 2021
BoxInst: High-Performance Instance Segmentation with Box Annotations

Introduction This repository is the code that needs to be submitted for OpenMMLab Algorithm Ecological Challenge, the paper is BoxInst: High-Performan

88 Dec 21, 2022
Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech The family of UniSpeech: UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR UniSpeech-

Microsoft 282 Jan 09, 2023
Transformer Huffman coding - Complete Huffman coding through transformer

Transformer_Huffman_coding Complete Huffman coding through transformer 2022/2/19

3 May 19, 2022
ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

ENet in Caffe Execution times and hardware requirements Network 1024x512 1280x720 Parameters Model size (fp32) ENet 20.4 ms 32.9 ms 0.36 M 1.5 MB SegN

Timo Sämann 561 Jan 04, 2023
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 01, 2022
An OpenAI-Gym Package for Training and Testing Reinforcement Learning algorithms with OpenSim Models

Authors: Utkarsh A. Mishra and Dr. Dimitar Stanev Advisors: Dr. Dimitar Stanev and Prof. Auke Ijspeert, Biorobotics Laboratory (BioRob), EPFL Video Pl

Utkarsh Mishra 16 Dec 13, 2022
This program automatically runs Python code copied in clipboard

CopyRun This program runs Python code which is copied in clipboard WARNING!! USE AT YOUR OWN RISK! NO GUARANTIES IF ANYTHING GETS BROKEN. DO NOT COPY

vertinski 4 Sep 10, 2021
Measure WWjj polarization fraction

WlWl Polarization Measure WWjj polarization fraction Paper: arXiv:2109.09924 Notice: This code can only be used for the inference process, if you want

4 Apr 10, 2022
ParmeSan: Sanitizer-guided Greybox Fuzzing

ParmeSan: Sanitizer-guided Greybox Fuzzing ParmeSan is a sanitizer-guided greybox fuzzer based on Angora. Published Work USENIX Security 2020: ParmeSa

VUSec 158 Dec 31, 2022