
Web application: https://kaoutardaoudhiri-nlp-based-classification-of-metagen-app-z9o5q6.streamlitapp.com/

License: GNU General Public License v3.0

Topics: classification, natural-language-processing, language-model, machine-learning


NLP-based classification of software tools for metagenomics sequencing data analysis into EDAM semantic annotation

by Kaoutar Daoud Hiri, Matjaž Hren and Tomaž Curk


This paper has been submitted for peer review.

We used machine learning methods to develop a classification system for metagenomics tools based on text descriptions extracted from their published papers. Users can quickly and efficiently filter tools across 13 categories by identifying each tool's specific function.

Abstract

Motivation: The rapid growth of metagenomics sequencing data makes metagenomics increasingly dependent on computational and statistical methods for fast and efficient analysis. Consequently, novel analysis tools for big-data metagenomics are constantly emerging. One of the biggest challenges for researchers occurs in the analysis planning stage: selecting the most suitable metagenomics software tool to gain valuable insights from sequencing data. The building process of data analysis pipelines is often laborious and time-consuming since it requires a deep and critical understanding of how to apply a particular tool to complete a specified metagenomics task.

Results: We have addressed this challenge by using machine learning methods to develop a classification system of metagenomics software tools into 13 classes (11 semantic annotations of EDAM and two virus-specific classes) based on the descriptions of the tools. We trained three classifiers (Naive Bayes, Logistic Regression, and Random Forest) using 15 text feature extraction techniques (TF-IDF, GloVe, BERT-based models, and others). The manually curated dataset includes 224 software tools and contains text from the abstract and the methods section of the tools' publications. The best classification performance, with an Area Under the Precision-Recall Curve score of 0.85, is achieved using Logistic regression, BioBERT for text embedding, and text from abstracts only. The proposed system provides accurate and unified identification of metagenomics data analysis tools and tasks, which is a crucial step in the construction of metagenomics data analysis pipelines.
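
For orientation, the sketch below shows the general shape of such a pipeline: embed abstracts with BioBERT and train a logistic regression classifier on the embeddings. It is a minimal illustration rather than the authors' code; the model checkpoint, file name, and column names are assumptions.

    # Minimal sketch (not the authors' exact pipeline): embed abstracts with BioBERT
    # and fit a logistic regression classifier, roughly as described in the abstract.
    import numpy as np
    import pandas as pd
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    bert = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

    def embed(text):
        """Mean-pool BioBERT's last hidden states into one fixed-size vector."""
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state  # shape: (1, n_tokens, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()

    # Hypothetical dataset layout: one row per tool with its abstract and class label.
    df = pd.read_csv("tools.csv")  # assumed columns: "abstract", "edam_class"
    X = np.vstack([embed(t) for t in df["abstract"]])
    y = df["edam_class"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))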

Performance of the classifier on an independent test set

(Figure: confusion matrix of the classifier on the independent test set)

How to use the classifier to predict the main task of a sample tool

  • The classifier is available as a web application (see the Streamlit link at the top of this page)

  • You can also interact with the code by running the Jupyter notebook directly in Binder

  • From the Command line:

    • Create a "ToolName.txt" file containing the publication abstract of the metagenomics tool, and save it in the same folder as the "metagenomics-tool-classifier.py" file.

    • Run the classifier, for example on the file "GAVISUNK.txt" (an illustrative sketch of such a prediction step follows these instructions):

      python metagenomics-tool-classifier.py GAVISUNK.txt
      

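For illustration only, a command-line prediction step along these lines could look like the sketch below; the "model.joblib" file and the use of a saved scikit-learn pipeline are assumptions, not the actual contents of "metagenomics-tool-classifier.py".

    # Illustrative sketch: read the abstract from the .txt file given on the command
    # line and print the class predicted by a previously saved scikit-learn pipeline.
    # "model.joblib" is a hypothetical file name.
    import sys
    import joblib

    with open(sys.argv[1], encoding="utf-8") as fh:
        abstract = fh.read()

    pipeline = joblib.load("model.joblib")  # e.g. text vectorizer + classifier
    print("Predicted main task:", pipeline.predict([abstract])[0])
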
How to reproduce this work and create a sample tool classification model

Note: In the Jupyter notebooks the names of the EDAM classes are shortened. We keep the part outside the parentheses: (Sequence) alignment, (Taxonomic) classification, (Sequence) assembly, (Sequence) trimming, (Sequencing) quality control, (Sequence) annotation, (Sequence) assembly validation, (RNA-seq quantification for) abundance estimation, SNP-Discovery, Visualization.
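
For reference, the shortened notebook names can be mapped back to the full names listed above, for example with a small dictionary (a convenience sketch; verify the exact strings against the notebooks):

    # Shortened class names (as used in the notebooks) -> full names from the note above.
    SHORT_TO_FULL = {
        "alignment": "Sequence alignment",
        "classification": "Taxonomic classification",
        "assembly": "Sequence assembly",
        "trimming": "Sequence trimming",
        "quality control": "Sequencing quality control",
        "annotation": "Sequence annotation",
        "assembly validation": "Sequence assembly validation",
        "abundance estimation": "RNA-seq quantification for abundance estimation",
        "SNP-Discovery": "SNP-Discovery",
        "Visualization": "Visualization",
    }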

  • To explore the code and results, you can execute the Jupyter notebooks individually:

    • All source code used to preprocess the text, generate the text embeddings and train the models is in the "Code" folder.
    • The data used in this study is in the "Datasets" folder and the results generated by the code are in the "Results" folder.
    • The calculations and figure generation are all run inside Jupyter notebooks.

Dependencies

You'll need a working Python environment to run the code. The recommended way to set up your environment is through the Anaconda Python distribution which provides the conda package manager. Anaconda can be installed in your user directory and does not interfere with the system Python installation.

  • The list of Anaconda packages is in the "NLPclassifier.yaml" file; the environment can be created from it as shown below.
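
For example, the environment can be created and activated with conda (the environment name is defined inside the YAML file, so adjust the activate command if it differs from the name assumed here):

    conda env create -f NLPclassifier.yaml
    conda activate NLPclassifier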
