Open-Source Datasets for NLP

What is NLP?

Natural Language Processing is a branch of artificial intelligence that gives computers the capability to comprehend, generate, process, and manipulate textual information. NLP combines computational linguistics and modelling of the human language with statistical, machine learning, and deep learning models. A few major applications of NLP are as follows.

  • Text Classification
  • Document Summarization
  • Sentiment Analysis
  • Natural Language Generation
  • Machine Translation
  • Question Answering

Does NLP use deep learning?

Yes, recent NLP applications are developed with deep learning algorithms. They use sequential models built from RNN, LSTM, and GRU units. These algorithms were introduced in this domain to approach human-level performance in language comprehension, and they require large amounts of data to train the models.

In this blog, we will be listing some of the open-source datasets for NLP applications. So, let’s explore the datasets!

Text Classification

In text classification, the algorithm categorizes a given text based on the information it contains. This is used in email classification, sentiment classification, and review classification, to name a few.

The open-source datasets which support text classification are listed below.

  1. IMDB Dataset

    The IMDB dataset includes 50,000 labelled movie reviews. It is a binary dataset with two labels, positive and negative, and is primarily used for sentiment classification. A sample of 25,000 highly polar movie reviews is used for training and the remaining 25,000 for testing.

  2. Amazon Reviews Dataset

    The Amazon Reviews Dataset comprises a few million Amazon customer reviews and ratings. This data can be used to train sentiment analysis models on review text.

  3. Twitter US Airline Sentiment Dataset

    The Twitter US Airline Sentiment Dataset has close to 15,000 tweets classified into three categories: positive, negative, and neutral. The negative tweets are also accompanied by reasons such as “late flight” and “bad service”.

  4. WordNet

    WordNet is a lexical database defining semantic relations between words. For tasks such as text classification, sentiment analysis, and text summarization, it is essential to understand the meaning of words in different contexts and how similar they are to one another. Here, WordNet can be used to solve linguistic problems in NLP models.

  5. SMS Spam Collection

    SMS Spam Collection is a public dataset of SMS messages, each labelled as ham (legitimate) or spam. The data was collected as part of mobile phone spam research. It contains 5,574 real messages and can be used for classification and clustering tasks.
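
To make the idea of categorizing text concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier. It assumes a "label<TAB>message" layout like the one used by the SMS Spam Collection file; the four sample messages and all function names are made up for illustration and are not taken from the dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample in an assumed "label<TAB>message" layout (not real dataset rows).
SAMPLE = """\
ham\tAre we still meeting for lunch today
spam\tWin a free prize now, call this number
ham\tI will call you when I get home
spam\tFree entry in a prize draw, text WIN now
"""

def parse_lines(text):
    """Split each non-empty line into a (label, message) pair."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, message = line.split("\t", 1)
        rows.append((label, message))
    return rows

def tokenize(message):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return message.lower().split()

def train_naive_bayes(rows):
    """Count documents per label and words per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, message in rows:
        label_counts[label] += 1
        word_counts[label].update(tokenize(message))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict(model, message):
    """Return the label with the highest log-probability for the message."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(message):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = parse_lines(SAMPLE)
model = train_naive_bayes(rows)
print(predict(model, "free prize"))  # prints "spam"
```

With the real dataset, the same functions could be pointed at the downloaded file, and a held-out split would be needed to measure accuracy.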

Document Summarization

Summarization is the process of providing shorter versions of a text contained in one or more documents. A summary can either be a subset of sentences selected from the original text (extraction), or newly written text that conveys the semantically meaningful information in the documents (abstraction). Both approaches can be tackled using deep learning algorithms.
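
As a rough illustration of extraction-based summarization, the sketch below scores each sentence by the average frequency of its words and returns the top-scoring sentences unchanged. This is a simple frequency heuristic, not a deep learning model; the function name and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

# Hypothetical input text, made up for this example.
text = (
    "The storm closed the airport on Monday. "
    "Flights were delayed because the storm closed the main runway. "
    "Passengers waited in the terminal."
)
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Every sentence in the output appears verbatim in the input, which is what distinguishes extraction from abstraction. A practical system would at least remove stop words before scoring.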

The following are a few datasets which support document summarization.

  1. Cornell Newsroom

    The Cornell Newsroom dataset is a large dataset containing 1.3 million articles and their summaries, which can be used for training and evaluating text summarization models. The data can be used for both extractive and abstractive summarization models.

  2. CNN Daily Mail

    The CNN Daily Mail Dataset is an English-language dataset containing over 300 thousand unique news articles written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization.

  3. Document Understanding Conferences Datasets

    The Document Understanding Conference (DUC) released datasets for its past workshops. These datasets contain documents, summaries, and results. The data was released for every year from 2001 to 2007, named DUC-2001 to DUC-2007. This data can be used for evaluating document summarization models.

Natural Language Generation

Natural Language Generation (NLG) is an application of NLP that automatically generates human-readable text. The models take vectorized data as input and return the output text in human-readable form. A practical application of NLG is the generation of summaries and smart narratives by feeding structured data from dashboards, sensors, IoT devices, and business reports. To make data more consumable and easily accessible, NextGen BI dashboards are implemented with smart narratives using NLG.

Some of the datasets that can be used for NLG are as follows.


  1. SUMTIME

    The SUMTIME dataset contains weather forecast data produced by human forecasters. This dataset can be used for the automatic generation of weather forecast reports using probabilistic generation models.

  2. E2E

    This dataset was released as part of the E2E NLG challenge and is hosted on the challenge website. It is a crowd-sourced dataset of 50,000 instances in the restaurant domain. Each instance consists of a dialogue act-based meaning representation (MR) and up to 5 references in natural language. The dataset includes open vocabulary, complex syntactic structures, and diverse discourse phenomena.

  3. ToTTo

    ToTTo is a table-to-text dataset with over 120,000 training examples. This is used for a controlled generation task. The dataset provides a single-sentence description for the content in highlighted cells of a Wikipedia table. This serves as a research benchmark dataset for high-precision conditional text generation.
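
To show what generating text from structured data looks like, here is a minimal sketch that parses a meaning representation written in the slot[value] style used by E2E and fills a fixed sentence template. The sample MR, the function names, and the template are illustrative assumptions; systems trained on these datasets learn the wording from the references instead of using a hand-written template.

```python
import re

def parse_mr(mr):
    """Turn 'slot[value], slot[value]' into a dict of slot -> value."""
    pairs = re.findall(r"([^,\[\]]+)\[([^\]]*)\]", mr)
    return {slot.strip(): value for slot, value in pairs}

def realize(slots):
    """Fill a tiny hand-written template (hypothetical, for illustration only)."""
    parts = [f"{slots['name']} is a {slots.get('eatType', 'restaurant')}"]
    if "food" in slots:
        parts.append(f"serving {slots['food']} food")
    if "area" in slots:
        parts.append(f"in the {slots['area']} area")
    return " ".join(parts) + "."

# Illustrative meaning representation in the assumed slot[value] layout.
mr = "name[Blue Spice], eatType[coffee shop], food[Italian], area[riverside]"
slots = parse_mr(mr)
print(realize(slots))
# prints "Blue Spice is a coffee shop serving Italian food in the riverside area."
```

A template like this gives one fixed wording per input, whereas the multiple references per instance in E2E are meant to capture the variety a learned model should produce.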

Machine Translation

Machine Translation is the application of NLP where a computer translates sentences in one language to the corresponding sentences in another language. When neural networks are used for the translation task, it is referred to as Neural Machine Translation (NMT). The input and output sentences can be of different lengths, and the word order can differ between the two languages. Machine learning algorithms learn these differences in grammatical structure. The task can be framed as a sequence-to-sequence generation problem. The datasets given below can be used for this application.

  1. Bilingual Sentence Pairs from The Tatoeba Project

    The Tatoeba Project dataset is a collection of sentences and their translations from English to other languages. Each collection contains examples of bilingual pairs like English to French, Hindi to English and so on.

  2. IIT Bombay English-Hindi Corpus

    This dataset has both an English-Hindi parallel corpus and a monolingual Hindi corpus. The parallel corpus contains over 1.65 million sentence pairs drawn from different sources. These examples can be used for training English to Hindi translation models.

  3. NMT Data by Stanford NLP group

    The NMT Data created by the Stanford NLP Group has three sets of translation pairs which are English-German, English-Czech, and English-Vietnamese.
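
The sequence-to-sequence framing can be sketched as follows: each training example is a pair of token sequences whose lengths need not match, and the target is wrapped in start and end markers so a decoder knows where to begin and stop. The tab-separated layout, the two sample pairs, and the `<sos>`/`<eos>` marker names are assumptions made for this example.

```python
# Hypothetical sample in an assumed "source<TAB>target" layout, one pair per line.
SAMPLE = "I am hungry.\tJ'ai faim.\nThank you very much.\tMerci beaucoup.\n"

def load_pairs(text):
    """Read (source, target) sentence pairs, skipping malformed lines."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        pairs.append((fields[0], fields[1]))
    return pairs

def to_sequences(pair):
    """Tokenize a pair and wrap the target in start/end markers."""
    source, target = pair
    src_tokens = source.lower().split()
    tgt_tokens = ["<sos>"] + target.lower().split() + ["<eos>"]
    return src_tokens, tgt_tokens

pairs = load_pairs(SAMPLE)
for pair in pairs:
    src, tgt = to_sequences(pair)
    # Source and target lengths are independent of each other.
    print(len(src), src, "->", len(tgt), tgt)
```

In practice the tokens would then be mapped to integer ids and padded to a common length per batch before being fed to an encoder-decoder model.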

Question Answering

Question Answering (Q&A) is one of the most popular applications of NLP. In this application, users can ask questions in natural language and get a brief response from the Q&A models. These models are typically found in search engines and conversational platforms like chatbots, which are fairly competent at answering with small snippets of information. The Q&A models can also be used to develop conversational AI applications (Conversational Chatbots, Digital/Intelligent/Virtual Assistants) to upgrade business processes.

A few datasets which support this application are as follows.

  1. Stanford Question Answering Dataset (SQuAD)

    SQuAD is a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage. In SQuAD 2.0, some questions are deliberately unanswerable. This dataset contains over 100,000 questions posed on over 500 articles. The train and development/test datasets are provided separately.

  2. TriviaQA

    TriviaQA is a large-scale dataset created for reading comprehension and question answering. The dataset has over 95,000 question-answer pairs.


  3. MS MARCO

    MS MARCO is a collection of datasets released by Microsoft that focuses mostly on deep learning in search. The MS MARCO dataset contains over 1 million search queries submitted via Bing or Cortana, along with human-generated answers for Bing questions. Apart from Q&A tasks, this collection has datasets for natural language generation, passage ranking, and key phrase extraction.

  4. Natural Questions

    This dataset was released by the Google research team. It contains queries asked in the Google search engine. Human annotators mark the answer in a Wikipedia page taken from the top 5 search results if the answer is present; otherwise, the example is labelled null. The whole dataset is divided into 307,373 examples for training, 7,830 examples for development, and 7,842 for the test set.
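
For span-based datasets such as SQuAD, an answer is stored as a text string plus its character offset in the passage. The sketch below walks a small record written in the SQuAD-style JSON layout (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`, and the `is_impossible` flag from version 2.0) and recovers each answer span by slicing the passage. The passage, questions, and ids are made up for illustration.

```python
import json

# Hypothetical record in a SQuAD-style layout (content invented for this example).
RAW = json.dumps({
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The IMDB dataset contains 50,000 movie reviews.",
            "qas": [
                {
                    "id": "q1",
                    "question": "How many movie reviews does the IMDB dataset contain?",
                    "is_impossible": False,
                    "answers": [{"text": "50,000", "answer_start": 26}],
                },
                {
                    "id": "q2",
                    "question": "Who directed the reviewed movies?",
                    "is_impossible": True,
                    "answers": [],
                },
            ],
        }],
    }]
})

def iter_answers(squad):
    """Yield (question, answer_span) pairs; the span is None if unanswerable."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False) or not qa["answers"]:
                    yield qa["question"], None
                    continue
                answer = qa["answers"][0]
                start = answer["answer_start"]
                end = start + len(answer["text"])
                # The span is recovered by slicing the passage at the stored offset.
                yield qa["question"], context[start:end]

results = list(iter_answers(json.loads(RAW)))
for question, span in results:
    print(question, "->", span)
```

Checking that the sliced span equals the stored answer text is a cheap sanity check worth running after any preprocessing that alters the passage, since such changes shift the offsets.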


This article covered a few common datasets across different domains of NLP. Some of these datasets can be used for multiple applications of NLP. A few other miscellaneous datasets are ComQA, Quora Insincere Questions Classification, Yelp dataset, 20 Newsgroups, and OpinRank.