GithubHelp home page GithubHelp logo

isabella232 / machinelearningsamples-documentcollectionanalysis Goto Github PK

View Code? Open in Web Editor NEW

This project forked from azure-samples/machinelearningsamples-documentcollectionanalysis

0.0 0.0 0.0 5.26 MB

MachineLearningSamples-DocumentCollectionAnalysis

License: MIT License

Python 4.80% Jupyter Notebook 95.20%

machinelearningsamples-documentcollectionanalysis's Introduction

Document Collection Analysis

NOTE This content is no longer maintained. Visit the Azure Machine Learning Notebook project for sample Jupyter notebooks for ML and deep learning with Azure Machine Learning.

Link to the Microsoft DOCS site

The detailed documentation for this real world scenario includes the step-by-step walkthrough:

https://docs.microsoft.com/azure/machine-learning/preview/scenario-document-collection-analysis

Link to the Gallery GitHub repository

The public GitHub repository for this real world scenario contains all materials, including code samples, needed for this example:

https://github.com/Azure/MachineLearningSamples-DocumentCollectionAnalysis

This scenario demonstrates how to summarize and analyze a large collection of documents, including techniques such as phrase learning, topic modeling, and topic model analysis using Azure ML Workbench. Azure Machine Learning Workbench provides for easy scale up for very large document collection, and provides mechanisms to train and tune models within a variety of compute contexts, ranging from local compute to Data Science Virtual Machines to Spark Cluster. Easy development is provided through Jupyter notebooks within Azure Machine Learning Workbench.

Overview

With a large amount of data (especially unstructured text data) collected every day, a significant challenge is to organize, search, and understand vast quantities of these texts. This document collection analysis scenario demonstrates an efficient and automated end-to-end workflow for analyzing large document collection and enabling downstream NLP tasks.

The key elements delivered by this scenario are:

  1. Learning salient multi-words phrase from documents.

  2. Discovering underlying topics presented in the document collection.

  3. Representing documents by the topical distribution.

  4. Presenting methods for organizing, searching, and summarizing documents based on the topical content.

The methods presented in this scenario could enable a variety of critical industrial workloads, such as discovery of topic trends anomaly, document collection summarization, and similar document search. It can be applied to many different types of document analysis, such as government legislation, news stories, product reviews, customer feedbacks, and scientific research articles.

The machine learning techniques/algorithms used in this scenario include:

  1. Text processing and cleaning

  2. Phrase Learning

  3. Topic modeling

  4. Corpus summarization

  5. Topical trends and anomaly detection

Prerequisites

The prerequisites to run this example are as follows:

Data/Telemetry

This advance scenarios for Document Collection Analysis collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (for example, label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

machinelearningsamples-documentcollectionanalysis's People

Contributors

hning86 avatar kehuangms avatar linya9191 avatar microsoftopensource avatar msftgits avatar rloutlaw avatar umlkhuang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.