GithubHelp home page GithubHelp logo

ibm / twitter-customer-care-document-prediction Goto Github PK

View Code? Open in Web Editor NEW
12.0 6.0 5.0 7.01 MB

Twitter dataset for Conversational Document Prediction to Assist Customer Care Agents (Ganhotra et al. 2020, EMNLP)

Home Page: https://www.aclweb.org/anthology/2020.emnlp-main.25

License: Apache License 2.0

customer-care-agents twitter-conversations-dataset machine-learning natural-language-processing dialog dialogue-systems dataset goal-oriented-dialog cdp document-classification

twitter-customer-care-document-prediction's Introduction

Twitter Conversations Dataset for Conversational Document Prediction

Dataset

This directory contains the Twitter Conversations dataset that we release for the task of Conversational Document Prediction.

The dataset includes conversations that occurred between users and customer care agents in 25 organizations on the Twitter platform. Each conversation ends with a customer care agent providing a URL to a document to resolve the issue the user is facing. The task is to predict the document given a dialog context.

Following files are included in the dataset.

File Description
train.json Training dataset
dev.json Validation dataset
test.json Test dataset
docID_content.tsv The content of the documents indexed by the document ID
docID_url.tsv The URLs of the documents indexed by the document ID
company_docIDs.tsv The document IDs corresponding to each organization.

The train, dev and test datasets include 10000, 525 and 500 conversations respectively.

Example

Following data format is followed by each conversation in train, dev and test datasets.

{
        "agentURL": {
            "doc_id": "9417",
            "turn": "4",
            "tweet_ID": "1232360458130161664",
            "url": "https://support.rockstargames.com/articles/204233943/Basic-Troubleshooting-for-the-PlayStation-4-Console",
            "url_utterance": "@tiffamy_lee We have a Support article that may help with that. Please see: https://support.rockstargames.com/articles/204233943/Basic-Troubleshooting-for-the-PlayStation-4-Console *SR"
        },
        "dialogContent": [
            {
                "client": "tiffamy_lee",
                "datetime": "2020-02-25T17:40:50.000Z",
                "message": "@RockstarSupport why isnt my gta installing at all ?",
                "tweet_ID": "1232359814828822534"
            },
            {
                "agent": "RockstarSupport",
                "datetime": "2020-02-25T17:41:59.000Z",
                "message": "@tiffamy_lee Please let us know the platform you are experiencing this on so we can help further. *SR",
                "tweet_ID": "1232360105036984320"
            },
            {
                "client": "tiffamy_lee",
                "datetime": "2020-02-25T17:42:23.000Z",
                "message": "@RockstarSupport Playstation 4",
                "tweet_ID": "1232360202671968257"
            }
        ],
        "dialogHeader": {
            "company": "RockstarSupport",
            "conversationDateTime": "2020-02-25T17:40:50.000Z",
            "duration": "00:08:46.0",
            "messageCount": "4",
            "sessionID": "5McvPTUPveowZZ4Z9j5Y"
        }
    }

Each conversation contains 3 fields.

  1. dialogHeader – Contains general information about the conversation. This include, organization name, number of messages in the conversation and a unique session ID given to the conversation.

  2. dialogContent – Contains the utterances prior to the agent responding with a URL to a document. Each utterance includes information such as the agent/user name, date and time of the utterance, the tweet ID corresponding to the utterance and the text of the utterance.

  3. agentURL – This includes information regarding the last agent utterance which contain an URL to a document. This contains the tweet ID of the agent utterance, the text of the utterance that contains the URL, the URL extracted from the utterance and the document ID of the URL. The content of the provided document can be obtained by referring to the ‘docID_content.tsv’ with the doc_id provided in this field.

Statistics

Main statistics of the dataset are provided supplementary to the dataset in the directory stats.

Data Collection

A set of 25 organizations which conduct customer care operations in Twitter platform were identified. Then, the user_timeline Twitter API was used to collect the tweets from these organizations containing in-domain URLs. The dialogs were constructed starting from these tweets and identifying the previous user and agent tweets to these tweets.

Public data release

During the public data release, adhering to the Twitter Developer policy for Content Redistribution, we will be releasing the Tweet IDs of each Tweet in a conversation and the Code for retrieving the actual tweet given the Tweet ID. We will also be releasing the code to obtain content from the URLs used in the conversations.

License

The dataset is released under Apache 2.0 license. For the full license, see LICENSE. Please cite the following paper if you use this dataset in your work

@inproceedings{ganhotra-etal-2020-conversational,
    title = "Conversational Document Prediction to Assist Customer Care Agents",
    author = "Ganhotra, Jatin  and
      Roitman, Haggai  and
      Cohen, Doron  and
      Mills, Nathaniel  and
      Gunasekara, Chulaka  and
      Mass, Yosi  and
      Joshi, Sachindra  and
      Lastras, Luis  and
      Konopnicki, David",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.25",
    pages = "349--356",
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value
name Twitter Conversations Dataset for Conversational Document Prediction (CDP) task
alternateName Twitter CDP dataset
url
sameAs https://github.com/IBM/twitter-customer-care-document-prediction
description The dataset contains the Twitter Conversations for the task of Conversational Document Prediction (CDP). The dataset includes conversations that occurred between users and customer care agents in 25 organizations on the Twitter platform. Each conversation ends with a customer care agent providing a URL to a document to resolve the issue the user is facing. The task is to predict the document given a dialog context.
provider
property value
name IBM
sameAs https://en.wikipedia.org/wiki/IBM
citation https://www.aclweb.org/anthology/2020.emnlp-main.25

twitter-customer-care-document-prediction's People

Contributors

jatinganhotra avatar stevemar avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.