NTU NLP Lab's Projects

amdrd

Analysis Model of Discourse Relations within a Document (AMDRD)

c2rc2

Categorizing Citation Relations in Scientific Papers Based on the Contributions of Cited Papers

chinese-word-ordering-errors-detection-and-correction-corpus

Word Ordering Errors (WOEs) are the most frequent type of sentence-level grammatical error for non-native learners of Chinese. Learners of Chinese as a foreign language often place characters in the wrong positions in a sentence, which results in wrong words or ungrammatical sentences. In addition, Chinese sentences have no clear word boundaries.

dialogue-mpdd

A dialogue dataset is an indispensable resource for building a dialogue system. Additional information such as emotions and interpersonal relationships labeled on conversations enables the system to capture the emotion flow of the participants in the dialogue. However, there is no publicly available Chinese dialogue dataset with emotion and relation labels. In this paper, we collect conversations from TV series scripts and annotate each utterance with emotion and interpersonal relationship labels. The dataset contains 25,548 utterances from 4,142 dialogues. We also set up experiments to observe the effect of the responded utterance on the current utterance, and the correlation between emotion and relation types in emotion and relation classification tasks.

finance-fin-some

Fin-SoMe is a dataset with 10,000 labeled financial tweets annotated by experts from both the front desk and the middle desk in a bank's treasury. These annotated results reveal that (1) writer-labeled market sentiment may be a misleading label; (2) writer's sentiment and market sentiment of an investor may be different; (3) most financial tweets provide unfounded analysis results; and (4) almost no investors write down the gain/loss results for their positions.

finance-finnum

Numerals are a crucial part of financial documents. To understand the details of opinions in financial documents, we should not only analyze the text but also examine the numeric information in depth. Because of its informal writing style, social media data is more challenging to analyze than news and official documents. FinNum is a dataset for fine-grained numeral understanding in financial social media data, where the task is to identify the category of a numeral.

finance-finprolex

FinProLex provides 5,162 tokens from professional analysts' reports and financial social media posts, each with an expert-like score. The expert-like scores are calculated based on pointwise mutual information (PMI).

finance-icrd

The ICRD contains two tasks. Each dataset is split into three parts: Train, Dev, and Test.

(1) Premise Detection: The premise detection task aims to identify whether a given sentence is a premise. Each instance has two keys. "sentence" is the given sentence. An "ans" value of 0 means the given sentence is not a premise; a value of 1 means it is a premise.

(2) Claim-Premise Inference: Given a claim and a sentence, models are asked to predict whether the sentence is a premise of the claim. Each instance has three keys. "claim" is the given claim and "compare_sent" is the other given sentence. An "ans" value of 0 means the given sentence is not a premise of the given claim; a value of 1 means it is a premise of the given claim.
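The instance layout for the two ICRD tasks can be illustrated with a small Python sketch. Only the key names ("sentence", "claim", "compare_sent", "ans") and the 0/1 label meaning come from the description; the JSON serialization, the sample sentences, and the helper names `load_instances` and `count_labels` are assumptions made for illustration and may not match the actual files in the repository.

```python
import json

# Hypothetical sample records. The key names follow the ICRD description;
# the sentence text and the JSON layout are made up for illustration.
PREMISE_DETECTION_JSON = """
[
  {"sentence": "Revenue grew 12% year over year.", "ans": 1},
  {"sentence": "We will discuss the outlook next.", "ans": 0}
]
"""

CLAIM_PREMISE_JSON = """
[
  {"claim": "The stock is undervalued.",
   "compare_sent": "Its P/E ratio is below the sector average.",
   "ans": 1}
]
"""


def load_instances(raw, required_keys):
    """Parse a JSON list of instances and check keys and label values."""
    records = json.loads(raw)
    for record in records:
        missing = set(required_keys) - set(record)
        if missing:
            raise ValueError(f"instance is missing keys: {sorted(missing)}")
        if record["ans"] not in (0, 1):
            raise ValueError(f"unexpected label: {record['ans']!r}")
    return records


def count_labels(records):
    """Count how many instances carry label 0 and label 1."""
    counts = {0: 0, 1: 0}
    for record in records:
        counts[record["ans"]] += 1
    return counts


premise_instances = load_instances(PREMISE_DETECTION_JSON, {"sentence", "ans"})
inference_instances = load_instances(
    CLAIM_PREMISE_JSON, {"claim", "compare_sent", "ans"}
)
print(count_labels(premise_instances))
print(count_labels(inference_instances))
```

Validating the keys up front keeps the two tasks from being mixed up, since both use the same "ans" label field.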

finance-ntusd-fin

NTUSD-Fin provides several scoring methods for its tokens, including frequency, CFIDF, chi-squared value, market sentiment score, and word vector. Only tokens that appear at least ten times and show a significant difference between expected and observed frequency under the chi-squared test are retained in the dictionary. The predetermined significance level is 0.05. The market sentiment score is calculated by subtracting the bearish PMI from the bullish PMI. The constructed dictionary, NTUSD-Fin, contains 8,331 words, 112 hashtags, and 115 emojis.
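As a rough illustration of the market sentiment score, the sketch below computes a bullish PMI and a bearish PMI for one token from post counts and subtracts the latter from the former. The standard PMI definition, log2(p(w, c) / (p(w) p(c))), is used here; the log base, the absence of smoothing, the function names, and the toy counts are all assumptions, not details confirmed by the NTUSD-Fin description.

```python
import math


def pmi(joint_count, word_count, label_count, total):
    """Pointwise mutual information between a token and a label (log base 2)."""
    p_joint = joint_count / total
    p_word = word_count / total
    p_label = label_count / total
    return math.log2(p_joint / (p_word * p_label))


def market_sentiment_score(bull_with_word, bear_with_word, bull_total, bear_total):
    """Bullish PMI minus bearish PMI for a single token.

    bull_with_word / bear_with_word: bullish / bearish posts containing the token.
    bull_total / bear_total: all bullish / bearish posts.
    Zero counts are rejected because this sketch applies no smoothing.
    """
    if min(bull_with_word, bear_with_word, bull_total, bear_total) <= 0:
        raise ValueError("all counts must be positive in this unsmoothed sketch")
    total = bull_total + bear_total
    word_count = bull_with_word + bear_with_word
    bullish_pmi = pmi(bull_with_word, word_count, bull_total, total)
    bearish_pmi = pmi(bear_with_word, word_count, bear_total, total)
    return bullish_pmi - bearish_pmi


# Toy counts: the token appears in 30 of 100 bullish posts and 10 of 100 bearish posts.
score = market_sentiment_score(30, 10, 100, 100)
print(f"market sentiment score: {score:.3f}")
```

Under this definition a positive score means the token is more associated with bullish posts, a negative score with bearish posts, and a score of zero means no preference.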

finance-numattach

Numerals are a crucial part of financial documents. To understand the details of opinions in financial documents, we should not only analyze the text but also examine the numeric information in depth. Because of its informal writing style, social media data is more challenging to analyze than news and official documents. NumAttach is a dataset for fine-grained numeral understanding in financial social media data, where the task is to detect the relation between a cashtag and a numeral.

finance-numclaim

Numerals provide important information in financial narratives. Our statistics on financial analysis reports show that over 58.47% of sentences contain at least one numeral. Without the numerals, much of the fine-grained information in the analysis reports would be lost. This evidences the importance of numerals in the financial narrative. Based on our observation, investors always make a claim with an estimation. This estimation can be a cue for detecting the investor's fine-grained claim. Therefore, we propose an expert-annotated dataset, NumClaim, for probing argument mining in the financial narrative. Among the 5,144 instances in the NumClaim dataset, 23.78% and 76.22% of instances containing numerals are annotated as "In-claim" and "Out-of-claim", respectively.

finance-numeracy-600k

Numerals are a crucial part of narratives, especially in financial documents. We should not only analyze the text but also examine the numeric information in depth. Numeracy-600K is a dataset for testing the numeracy of machines.

framenet-cfn-lex

A total of 36K lexical units that cover 779 frames for FrameNet in Chinese. This resource is extracted from a large-scale bilingual corpus to achieve higher coverage in terms of lexical units, which is helpful in providing frame recommendations for annotation campaigns or constructing robust frame identification systems.

framenet-cfn-sp

This system is trained on a FrameNet subset of 31 frames ('Arriving', 'Accompaniment', 'Visiting', 'Discussion', 'Meet_with', 'Presence', 'Ingestion', 'Ride_vehicle', 'Perception_active', 'Sleep', 'Competition', 'Attending', 'Giving', 'Text_creation', 'Transitive_action', 'Resolve_problem', 'Statement', 'Receiving', 'Taking_time', 'Social_event', 'Departing', 'Deciding', 'Arranging', 'Waiting', 'Perception_experience', 'Contacting', 'Borrowing', 'Commerce_buy', 'Questioning', 'Activity', 'Inspecting') that cover daily events for lifelogging. The overall frame semantic parsing performance of our system is an F1 score of 97.12 on the training set and 85.14 on the test set.

icda

Interactive Clinical Diagnostic Assistant for Medical Interview

lifeeventdialog

Life Event Dialog contains fine-grained personal life event annotations on DailyDialog.

lifelog-dialog

Conversation, a common way for people to share their experiences and feelings with others, contains important information about individuals' personal life events, but is rarely explored. In this dataset, we initiate the task of detecting personal life events from daily conversation. We extend a multi-turn dialog dataset, DailyDialog, with life event annotations. We collect 600 conversations of 4-6 utterances each from 4 topics of DailyDialog. Our goal is to detect the life events of each speaker in real time.

lifelog-livekb

People often forget things in daily life, so support for recalling information at the right time and place is an emerging need. Constructing a personal knowledge base for each individual is important for memory recall and living assistance applications. We collect data from 18 users who set their tweets as public and posted tweets between 2009 and 2017. We aim to extract life events from tweets shared on Twitter and construct personal knowledge bases of individuals.
