
lsiecker / text-mining

Natural and Technical Language Processing using spaCy, Named Entity Recognition, and a custom Relationship Extraction and Labeling component

Jupyter Notebook 98.87% Python 1.13%
ner nlp rel spacy tlp wikipedia-dump

text-mining's Introduction

Text-Mining

This repository contains the code for extracting information from text using NLP techniques. The code extends the spaCy NER component with custom training data and adds a REL component inspired by Explosion AI's REL component.

Project Structure

  • Data (data/): Contains all the raw Wikipedia texts and annotation files.
  • NER component (ner_model/): Contains the trained NER model, including the preprocessed training and validation files. This component is based on the spaCy NER component tutorial.
  • REL component (rel_model/): Contains the trained REL model, including the preprocessed training and validation files. This component is based on Explosion AI's REL component.

Documentation

  • Project Initialization and Data Preprocessing: Documentation on how to start the project yourself and how to preprocess the data.
  • Model Training: Documentation on how to start training the models (both the NER and the REL).
  • Model Output: Documentation on how to use the trained models to extract information from text and eventually create the knowledge graph.
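The preprocessed training and validation files use character-offset entity annotations, as in spaCy's training format. The sketch below assumes the (text, {"entities": [[start, end, label], ...]}) layout; the helper name and the example sentence are illustrative, not taken from the repository's files:

```python
def check_entities(example):
    """Return the (substring, label) pairs described by an annotated example.

    Assumes the (text, {"entities": [[start, end, label], ...]}) layout
    used by spaCy-style NER training data; the sentence and label below
    are illustrative assumptions.
    """
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

example = (
    "Ephesus was a city in Ancient Greece.",
    {"entities": [[0, 7, "landmark_name"]]},
)
print(check_entities(example))  # [('Ephesus', 'landmark_name')]
```

If a span does not slice out the intended entity text, the offsets (or the preprocessing that produced them) are off, which is worth checking before training.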

Assignment Description

The following is copied from the assignment description of the course 2AMM30 at the Eindhoven University of Technology. The official assignment description, as downloaded, can be found here.

Data description

In this assignment we will undertake the general challenge of extracting information from two different sources:

  1. An unsupervised (albeit structured) large corpus: the entirety of Wikipedia.
  2. Event registration and documentation for nuclear power plants in the US.

Wikipedia

A Wikipedia dump is a copy of all of the content from Wikipedia. This includes all of the articles, images, and other media files. We provide a somewhat stripped/cleaned version of Wikipedia which saves you a considerable amount of computing power required to start your project. This stripped version has had tables, headers and other “non-text” removed.

Technical language data (Nuclear power plants)

Language data from industry often poses unique challenges due to its specialized nature and domain-specific terminology. Unlike generic text found on the internet, industry-specific language is often rife with technical jargon, abbreviations, and context-dependent meanings. This requires a deep understanding of the specific field to accurately process and interpret the data. To familiarize yourself with this challenge, a second dataset is provided which describes a collection of events regarding unexpected reactor trips at commercial nuclear power plants in the U.S. It contains a selection of metadata, a short description, as well as a longer abstract describing the occurrences in detail.

Assignment

The overall goal consists of the following: perform information extraction on these datasets. In the end, you should end up with a collection of triplets ([subject, relation, object]) that could populate a Knowledge Graph, perhaps for a particular use case or subdomain in these datasets.

  • Identify and label entities and relations of interest
  • Select appropriate data sources that (likely) contain that information
  • Build working models that can extract this information
  • Evaluate the performance of extraction in parts and as a whole
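The target output described above can be sketched as a small step that combines predicted entities and relations into [subject, relation, object] triplets. The entity names, relation label, and function name below are illustrative assumptions, not output of the actual models:

```python
def to_triplets(entities, relations):
    """Combine entity spans and predicted relations into KG-ready triplets.

    entities:  mapping from entity id to entity text (illustrative).
    relations: (head_id, tail_id, relation_label) tuples (illustrative).
    """
    return [[entities[head], label, entities[tail]] for head, tail, label in relations]

entities = {0: "Ephesus", 1: "Ionia"}
relations = [(0, 1, "located_in")]
print(to_triplets(entities, relations))  # [['Ephesus', 'located_in', 'Ionia']]
```

Each triplet then maps directly onto a knowledge-graph edge: subject node, labeled edge, object node.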

text-mining's People

Contributors

diedevanderhoorn, fleureok, lsiecker, marlougielen, nielsvbeuningen


Forkers

yusuferkamozyer

text-mining's Issues

training_data and validation_data are equal

The following code produces the output shown below; the training and validation data need to differ, but are currently (nearly) identical.

print("Training data info item 1 \ntext:")
print(training_data[0][0])
print("Labels:")
print(*training_data[0][1]["entities"], sep = "\n")

print("\n Validation data info item 1 \ntext:")
print(validation_data[0][0])
print("Labels:")
print(*validation_data[0][1]["entities"], sep = "\n")

Output:

Training data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 9, 'landmark_name']

 Validation data info item 1 
text:
 Ephesus (; ; ; may ultimately derive from ) was a city in Ancient Greece on the coast of Ionia, southwest of present-day Seluk in zmir Province, Turkey.
Labels:
[1, 8, 'landmark_name']
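A likely cause is that the same list of annotated examples is written to both sets. A minimal sketch of a fix, assuming a single list of (text, annotations) pairs (the variable and function names here are illustrative): shuffle once with a fixed seed, then split into disjoint training and validation sets.

```python
import random

def split_data(examples, validation_fraction=0.2, seed=42):
    """Shuffle annotated examples once and split them into two disjoint sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * (1 - validation_fraction))
    return examples[:cut], examples[cut:]

# Illustrative stand-in for the real annotated (text, annotations) pairs.
annotated_examples = [(f"text {i}", {"entities": []}) for i in range(10)]
training_data, validation_data = split_data(annotated_examples)
print(len(training_data), len(validation_data))  # 8 2
```

Splitting before any per-set preprocessing guarantees that no example appears in both sets, so validation scores reflect generalization rather than memorization.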
