isabella232 / machinelearningsamples-biomedicalentityextraction

This project forked from azure-samples/machinelearningsamples-biomedicalentityextraction

MachineLearningSamples-BiomedicalEntityExtraction

License: MIT License

Python 83.17% Perl 16.83%

Biomedical Entity Recognition using TDSP Template

NOTE This content is no longer maintained. Visit the Azure Machine Learning Notebook project for sample Jupyter notebooks for ML and deep learning with Azure Machine Learning.

Link to the Microsoft DOCS site

The detailed documentation for this example includes the step-by-step walk-through: https://docs.microsoft.com/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition

Link to the Gallery GitHub repository

The public GitHub repository for this example contains all the code samples: https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction

Summary

Entity extraction, also known as named-entity recognition (NER), entity chunking, or entity identification, is a subtask of information extraction. Biomedical named entity recognition is a critical step for complex biomedical NLP tasks such as:

  • Extraction of diseases and symptoms from electronic medical or health records.
  • Drug discovery
  • Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship.

This real-world scenario focuses on how a large corpus of unstructured, unlabeled text, such as PubMed article abstracts, can be analyzed to train a domain-specific word embedding model. The output embeddings are then used as automatically generated features to train a neural entity extraction model, using Keras with the TensorFlow deep learning framework as the backend and a small amount of labeled data.
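As a rough illustration of how embeddings can serve as automatically generated features, the sketch below maps each token of a sentence to its vector and pads the sequence to a fixed length, which is the kind of input a recurrent tagger typically consumes. The embedding table, its three-dimensional vectors, and the helper name `sentence_to_features` are hypothetical. In the scenario the vectors would come from the Word2Vec model trained on PubMed abstracts.

```python
from typing import Dict, List

# Hypothetical toy embedding table; real vectors would come from the
# Word2Vec model trained on PubMed abstracts and have many more dimensions.
EMBEDDINGS: Dict[str, List[float]] = {
    "aspirin": [0.9, 0.1, 0.0],
    "reduces": [0.1, 0.8, 0.2],
    "fever": [0.2, 0.1, 0.9],
}
DIM = 3
ZERO = [0.0] * DIM  # used for padding and for out-of-vocabulary tokens


def sentence_to_features(tokens: List[str], max_len: int) -> List[List[float]]:
    """Map tokens to embedding vectors, truncating or padding to max_len."""
    rows = [list(EMBEDDINGS.get(tok.lower(), ZERO)) for tok in tokens[:max_len]]
    # Pad with zero vectors so every sentence has the same length.
    rows.extend(list(ZERO) for _ in range(max_len - len(rows)))
    return rows


# "severe" is not in the toy table, so it is mapped to the zero vector.
features = sentence_to_features(["Aspirin", "reduces", "severe", "fever"], max_len=5)
```

Mapping unknown tokens to a zero vector is only one possible choice; a dedicated "unknown" vector is a common alternative.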

Description

The aim of this real-world scenario is to highlight how to use Azure Machine Learning Workbench to solve a complicated NLP task such as entity extraction from unstructured text. Here are the key points addressed:

  1. How to train a neural word embeddings model on a text corpus of about 18 million PubMed abstracts using Spark Word2Vec implementation.
  2. How to build a deep Long Short-Term Memory (LSTM) recurrent neural network model for entity extraction on a GPU-enabled Azure Data Science Virtual Machine (GPU DSVM) on Azure.
  3. How domain-specific word embedding models can outperform generic word embedding models in the entity recognition task.
  4. How to train and operationalize deep learning models using Azure Machine Learning Workbench.
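To make the output of an entity extraction model concrete, here is a minimal sketch that groups per-token labels into entity spans. It assumes the tagger emits one label per token in a B-/I-/O scheme. That scheme, the label names `Drug` and `Disease`, and the function `decode_entities` are assumptions for illustration and are not taken from the project code.

```python
from typing import List, Tuple


def decode_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open entity and starts a new one.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity;
            # the token itself is not part of an entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], ""
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Metformin", "treats", "type", "2", "diabetes"]
tags = ["B-Drug", "O", "B-Disease", "I-Disease", "I-Disease"]
result = decode_entities(tokens, tags)
```

With the hypothetical tags above, `result` contains one `Drug` entity and one multi-token `Disease` entity.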

The scenario showcases the following capabilities of Azure Machine Learning Workbench:

  • Instantiation of Team Data Science Process (TDSP) structure and templates.
  • Automated management of your project dependencies including the download and the installation.
  • Execution of code in Jupyter notebooks as well as Python scripts.
  • Run history tracking for Python files.
  • Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters.
  • Execution of jobs in remote GPU VMs on Azure.
  • Easy operationalization of deep learning models as web-services hosted on Azure Container Services.

The detailed documentation for this scenario, including the step-by-step walk-through, is available at https://review.docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition.

For code samples, click the View Project icon on the right and visit the project GitHub repository.

Key components needed to run this example:

  • An Azure subscription

  • Azure Machine Learning Workbench with a workspace created. See installation guide.

  • To run this scenario with a Spark cluster, provision an Azure HDInsight Spark cluster (Spark 2.1 on Linux (HDI 3.6)) for scale-out computation. To process the full amount of MEDLINE abstracts discussed below, we recommend having a cluster with:

    • a head node of type D13_V2

    • at least four worker nodes of type D12_V2.

    • To maximize performance of the cluster, we recommend changing the parameters spark.executor.instances, spark.executor.cores, and spark.executor.memory by following the instructions here and editing the definitions in the "custom spark defaults" section.

  • You can run the entity extraction model training locally on a Data Science Virtual Machine (DSVM) or in a remote Docker container in a remote DSVM.

  • To provision a DSVM for Linux (Ubuntu), follow the instructions here. We recommend using NC6 Standard (56 GB, K80 NVIDIA Tesla).
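The three Spark executor properties named above could be derived from the cluster shape with a small helper like the one below. This is only a sketch: the one-executor-per-worker heuristic, the reserved core and memory amounts, and the assumed worker size of 4 cores and 28 GB are illustrative assumptions, not the values from the referenced instructions. Verify the actual node specifications for your VM type before using any numbers.

```python
from typing import Dict


def spark_executor_settings(worker_nodes: int, cores_per_node: int,
                            memory_gb_per_node: int,
                            reserved_cores: int = 1,
                            reserved_memory_gb: int = 4) -> Dict[str, str]:
    """Illustrative heuristic: one executor per worker node, leaving
    some cores and memory for the OS and cluster services."""
    if worker_nodes < 1:
        raise ValueError("worker_nodes must be at least 1")
    cores = cores_per_node - reserved_cores
    memory_gb = memory_gb_per_node - reserved_memory_gb
    if cores < 1 or memory_gb < 1:
        raise ValueError("node too small for the reserved resources")
    return {
        "spark.executor.instances": str(worker_nodes),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
    }


# Assumed worker size: 4 cores and 28 GB per node (verify for your VM type).
settings = spark_executor_settings(worker_nodes=4, cores_per_node=4,
                                   memory_gb_per_node=28)
```

The resulting dictionary holds the key/value pairs that would be entered in the "custom spark defaults" section.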

Data/Telemetry

The Biomedical named entity recognition scenario collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

machinelearningsamples-biomedicalentityextraction's People

Contributors

deguhath, hning86, linya9191, microsoftopensource, mohabdel-msft, mohabdel2013, msftgits, paulshealy1, rastala, rloutlaw, zodzun