Home

NLIDatasets.jl

NLIDatasets.jl is a Julia package for working with datasets for the Natural Language Inference (NLI) task, also known as Recognizing Textual Entailment (RTE).

It provides an interface to the following datasets:

SNLI

NLIDatasets.SNLI (Module)
SNLI

A corpus of 570k human-written English sentence pairs for NLI.

SNLI sentence pairs are manually labeled as entailment, contradiction, or neutral.

For details, see the SNLI home page or read the 2015 paper "A large annotated corpus for learning natural language inference" by Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher Manning.

Included data:

SNLI.train_tsv()
SNLI.train_jsonl()
SNLI.dev_tsv()
SNLI.dev_jsonl()
SNLI.test_tsv()
SNLI.test_jsonl()
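Each accessor above is assumed here to return the local path of the corresponding file (downloading it on first use); that return convention is an assumption, not stated on this page. As a sketch, the gold labels of an SNLI-style TSV (whose first column is `gold_label`, per the official release) can be tallied with a small helper; the helper itself is hypothetical and not part of NLIDatasets:

```julia
# Count gold labels in SNLI-style TSV lines (first column = gold_label).
# `lines` is any iterable of strings, e.g. eachline(path).
function count_gold_labels(lines)
    counts = Dict{String,Int}()
    for line in Iterators.drop(lines, 1)        # skip the header row
        gold = first(split(line, '\t'))
        counts[gold] = get(counts, gold, 0) + 1
    end
    return counts
end

# With the package installed, one would call:
#   count_gold_labels(eachline(SNLI.train_tsv()))
sample = ["gold_label\tsentence1\tsentence2",
          "entailment\tA dog runs.\tAn animal moves.",
          "neutral\tA dog runs.\tA poodle runs."]
counts = count_gold_labels(sample)
```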

MultiNLI

MultiNLI

A corpus of 433k sentence pairs for NLI, covering a broad range of genres of spoken and written text; the matched/mismatched dev sets support in-genre and cross-genre evaluation.

For details, see the MultiNLI home page or read the 2018 paper "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference" by Adina Williams, Nikita Nangia, and Samuel R. Bowman.

Included data:

MultiNLI.train_tsv()
MultiNLI.train_jsonl()
MultiNLI.dev_matched_tsv()
MultiNLI.dev_matched_jsonl()
MultiNLI.dev_mismatched_tsv()
MultiNLI.dev_mismatched_jsonl()

XNLI

NLIDatasets.XNLI (Module)
XNLI

A collection of 5,000 test and 2,500 dev sentence pairs drawn from the MultiNLI corpus and translated into 14 additional languages, enabling cross-lingual NLI evaluation.

For details, see the 2018 paper "XNLI: Evaluating Cross-lingual Sentence Representations" by Alexis Conneau et al.

Included data:

XNLI.dev_tsv()
XNLI.dev_jsonl()
XNLI.test_tsv()
XNLI.test_jsonl()

HANS

NLIDatasets.HANS (Module)
HANS

HANS (Heuristic Analysis for NLI Systems) is an evaluation dataset for NLI, designed to diagnose whether models rely on fallible syntactic heuristics such as lexical overlap.

It contains the set of examples used in the 2019 paper "Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference" by R. Tom McCoy, Ellie Pavlick, and Tal Linzen; see that paper for details.

It consists of a single evaluation set, provided via test_tsv.

Included data:

HANS.test_tsv()

BreakingNLI

BreakingNLI

A dataset of 8,193 premise-hypothesis sentence pairs for NLI.

Each pair is labeled as entailment, contradiction, or neutral. The premise and the hypothesis are identical except for a single word or phrase that has been replaced, so achieving reasonable performance requires lexical and world knowledge. The dataset is intended as a test set for models trained on the NLI task.

For details, see the GitHub page or read the 2018 paper "Breaking NLI Systems with Sentences that Require Simple Lexical Inferences" by Max Glockner, Vered Shwartz, and Yoav Goldberg.

Available data:

BreakingNLI.test_jsonl()
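Since BreakingNLI ships only as JSONL, each line is a single JSON object. As a dependency-free sketch, one string field can be pulled out of a JSONL line with a regex (in real use, a proper JSON parser such as JSON3.jl is the better choice; the `gold_label` key below follows the SNLI-style convention and is an assumption about the file's schema):

```julia
# Extract a string-valued field from one JSONL line with a regex.
# A quick sketch only: it does not handle escaped quotes or non-string values.
function json_string_field(line::AbstractString, key::AbstractString)
    m = match(Regex("\"\\Q$key\\E\"\\s*:\\s*\"([^\"]*)\""), line)
    return m === nothing ? nothing : m.captures[1]
end

# With the package installed, one would iterate:
#   for line in eachline(BreakingNLI.test_jsonl()) ... end
line = """{"gold_label": "contradiction", "sentence1": "A man is asleep.", "sentence2": "A man is awake."}"""
label = json_string_field(line, "gold_label")  # → "contradiction"
```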

SciTail

SciTail

SciTail is an NLI dataset created from multiple-choice science exam questions and web sentences.

For details, see the 2018 paper "SciTail: A Textual Entailment Dataset from Science Question Answering" by Tushar Khot, Ashish Sabharwal, and Peter Clark.

Included data:

SciTail.train_tsv()
SciTail.train_jsonl()
SciTail.dev_tsv()
SciTail.dev_jsonl()
SciTail.test_tsv()
SciTail.test_jsonl()

ANLI

NLIDatasets.ANLI (Module)
ANLI

ANLI (Adversarial NLI) is an NLI dataset collected in three rounds (R1, R2, R3) through an iterative human-and-model-in-the-loop procedure.

For details, see the GitHub page or read the 2019 paper "Adversarial NLI: A New Benchmark for Natural Language Understanding" by Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela.

Available data:

ANLI.R1_train_jsonl()
ANLI.R1_dev_jsonl()
ANLI.R1_test_jsonl()
ANLI.R2_train_jsonl()
ANLI.R2_dev_jsonl()
ANLI.R2_test_jsonl()
ANLI.R3_train_jsonl()
ANLI.R3_dev_jsonl()
ANLI.R3_test_jsonl()
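The nine ANLI accessors follow a regular `R{round}_{split}_jsonl` naming scheme, so the full grid can be enumerated programmatically. Looking the functions up by name with `getfield` is one way to drive all rounds from a loop; this sketch only assumes the names listed above:

```julia
# Build the 3 rounds x 3 splits grid of accessor names. With the package
# installed, each name could be resolved and called via
#   path = getfield(ANLI, name)()
accessor_names = vec([Symbol("R$(r)_$(s)_jsonl")
                      for r in 1:3, s in ("train", "dev", "test")])
n = length(accessor_names)  # 9 accessors, matching the list above
```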