Meta-Science & Evaluation: Trend Prediction

This repository contains the code underlying our term paper Meta-Science & Evaluation: Trend Prediction.

Abstract

The volume of scientific work and related publications has increased sharply in the last two decades. This flood of text makes it increasingly difficult for academia and industry to evaluate the quality and importance of individual papers and to participate in a given field of research. The ever-advancing power of NLP can help us process these texts more efficiently than ever before. With this work, we contribute to the ongoing research on metascience and scientometric analysis. In a first step, we derive text embeddings to create unsupervised topic clusters of recent publications. We then use the publications' citation counts to train a DNN that forecasts relevant topic spaces for new or future research.

Environment setup

Create new virtual environment:

$ python -m venv venv

Activate environment:

$ source venv/bin/activate

Install the required Python packages:

$ pip install -r requirements.txt

To view Jupyter Notebooks (.ipynb), run:

$ jupyter-notebook

Download and extract datasets

Download from https://hessenbox.tu-darmstadt.de/getlink/fiB2mrQTRZTrjWcmCVySL58H/.

Guidelines for extraction and data folder structuring:

We provide a full data set with all results for papers of the ACL anthology from 1990 until 2020.

  1. Extract data.zip into the project folder. It includes the complete ACL anthology as well as the anthology filtered by the years and conferences used, with all additional information such as links to files and topics (anthology_conferences.csv).
  2. Extract pdfs.zip into the new data/ folder to add the full paper PDFs.
  3. Extract json.zip into the data/ folder to add the papers parsed with science-parse, structured in JSON format.
  4. Extract embeddings.zip into the data/ folder to add all tested embeddings, created with SentenceBERT and different pretrained models.
  5. Extract semantic_scholar.zip into the data/ folder to add information about papers and authors fetched from Semantic Scholar.
  6. Extract clusters.zip into the data/ folder to add the intermediate and final clustering results. Our best and final clustering is saved in final_best_one_clustering.json.
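
After extraction, a quick sanity check of the folder layout can save a failed run later. The following is a minimal sketch, not part of the repository: it only checks the paths that this README names explicitly, and the helper name missing_paths is hypothetical.

```python
from pathlib import Path

# Paths named in this README. Sub-folders created by the other archives
# are not listed because their names are not stated here.
EXPECTED_PATHS = [
    "data",
    "data/semantic_scholar/papers",
    "data/clusters/final_best_one_clustering.json",
]


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the project folder.
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

Run it from the project folder; an empty result means all three paths exist.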

Data collection / processing

The data is based on the ACL Anthology. We use additional sources to enrich it, e.g. with the topics of papers. Run the following scripts:

  1. parse_data.ipynb Downloads and filters the anthology, downloads the papers' PDFs, and adds abstracts from the parsed PDFs (see next point).
  2. parse_pdf.sh Parses the papers' PDFs with science-parse by allenai to get the abstracts.
  3. parse_semanticscholar.ipynb Downloads Semantic Scholar information about papers and authors, and adds topics to the anthology entries.
  4. parse_cso_classifier.ipynb Adds topics to each anthology entry based on the abstract, using the Python library of the CSO classifier.

Embedding creation

embeddings.ipynb Creates SentenceBERT embeddings of the papers' titles and abstracts, which are used for clustering.

Clustering

We use clustering based on embeddings to group papers that share topics. The following steps describe our approach to finding the most appropriate algorithm. Our final choice is K-Means clustering with 20 clusters. The embeddings are based on the pretrained model paraphrase-distilroberta-base-v2 with titles as input.

  1. clustering.ipynb Runs an extensive search over different clustering algorithms (see clustering_algorithms.py), runs a second extensive search over the algorithm/configuration pairs filtered in clustering_evaluation.ipynb, and runs the final best clustering.
  2. clustering_evaluation.ipynb Manually filters the best algorithm/configuration pairs after the first and second extensive search, using the evaluation metrics defined in clustering_metrics.py.
  3. cluster_presentation.ipynb Searches for the clusters that best match given keywords/topics and plots, for each cluster, the past development of citations and papers together with the future development predicted by the DNN. Figures are stored in the folder figures/.
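
To illustrate the idea behind K-Means on embedding vectors, here is a toy, dependency-free version of Lloyd's algorithm. It is a sketch for intuition only and not the code used in the notebooks (see clustering_algorithms.py for that); the function name kmeans, the random initialisation, and the tiny 2-D points standing in for title embeddings are all assumptions.

```python
import random
from math import dist


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups with Lloyd's algorithm.

    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    # Initialise the centroids with k distinct input points.
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Two well-separated toy groups standing in for paper embeddings.
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, _ = kmeans(toy, k=2)
    print(labels)
```

With real data, the points would be the title embeddings and k would be 20. Which cluster receives which index depends on the initialisation.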

Model creation

The model predicts the citation count of a given paper for the next five years. It takes as input

  • the embedding (SentenceBERT, paraphrase-distilroberta-base-v2),
  • the age of the paper since publication,
  • the accumulated h-indices of all authors, and
  • the number of authors.
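
The four inputs listed above could be combined into a single feature vector roughly as follows. This is a hedged sketch rather than the repository's implementation: the function name build_features, measuring age in whole years, and reading "accumulated" as the sum of the authors' h-indices are assumptions.

```python
def build_features(embedding, publication_year, reference_year, author_h_indices):
    """Concatenate the model inputs into one flat list of floats.

    embedding         -- sequence of floats for the paper
    publication_year  -- year the paper was published
    reference_year    -- year from which the age is measured (assumption)
    author_h_indices  -- one h-index per author
    """
    age = reference_year - publication_year
    if age < 0:
        raise ValueError("reference_year lies before publication_year")
    return list(embedding) + [
        float(age),                    # age of the paper since publication
        float(sum(author_h_indices)),  # accumulated h-indices (assumed: sum)
        float(len(author_h_indices)),  # number of authors
    ]


if __name__ == "__main__":
    # A 2-dimensional stand-in for a real SentenceBERT embedding.
    print(build_features([0.1, 0.2], 2015, 2020, [3, 7]))
```

The resulting vector has the embedding dimension plus three entries.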

Requirements

To create the model,

data/semantic_scholar/papers/

must exist and contain one JSON file for each paper. Also, to assign the predictions per paper to the cluster,

data/clusters/final_best_one_clustering.json

has to contain a mapping from cluster index to paper index.
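
As an illustration of how such a mapping can be used to assign per-paper predictions to clusters, consider the sketch below. The exact layout of the JSON file is an assumption here (an object whose keys are cluster indices and whose values are lists of paper indices), and both function names are hypothetical.

```python
import json


def paper_to_cluster(cluster_to_papers):
    """Invert {cluster index: [paper indices]} into {paper index: cluster index}."""
    inverted = {}
    for cluster, papers in cluster_to_papers.items():
        for paper in papers:
            inverted[paper] = int(cluster)  # JSON object keys are strings
    return inverted


def citations_per_cluster(cluster_to_papers, predicted):
    """Sum the predicted citation counts of the papers in each cluster.

    Papers without a prediction contribute 0.
    """
    return {
        int(cluster): sum(predicted.get(paper, 0.0) for paper in papers)
        for cluster, papers in cluster_to_papers.items()
    }


if __name__ == "__main__":
    # Inline stand-in for data/clusters/final_best_one_clustering.json.
    mapping = json.loads('{"0": [0, 2], "1": [1]}')
    predictions = {0: 3.0, 1: 1.5, 2: 2.0}  # made-up values for illustration
    print(paper_to_cluster(mapping))
    print(citations_per_cluster(mapping, predictions))
```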

Train the model and perform predictions

Run:

$ python predict_citations_next_few_years.py

This program writes intermediate results to the cache/ folder and the resulting model (keras.callbacks.ModelCheckpoint) to model/best_model.hdf5, alongside a figure showing the development of training and development loss values during the training process.

Contributors

eliaswendt
