
Herbert

Website

You can visit the website by clicking here!

What is Herbert

Herbert is a semantic search engine that aggregates and distills the most reliable herbal medicine information into a single, curated report view for consumers curious about using herbal medicine.

Why is it important

Total retail sales of herbal pharmaceuticals have surpassed $8.8 billion domestically, and growth has been accelerating over the past decade. Although there is a lot of interest in herbal medication, it is difficult for consumers to find trustworthy, reliable, and easy-to-understand information about these treatments. Standard web searches return a lot of information but leave it up to the user to sift through pages of results to find what is relevant to them. Other services are designed for medical professionals and use complicated jargon that makes them difficult for the layperson to understand. In short, while there is plenty of interest in alternative herbal medicine, there is also information overload.

What are our advantages

Efficient: Herbert filters out irrelevant information and focuses on the semantic relationships among herbs, conditions, and interactions
Trustworthy: Herbert aggregates information from trustworthy data sources and cross-references among them
User Friendly: Herbert avoids overly technical terms and uses layman's vocabulary for easy understanding
Transparent: Herbert provides links back to the original data sources for user reference

Technical Description

Process GIF

Our solution to distilling our sources into the relevant data points about an herb is a multi-stage pipeline that works at a finer granularity of text at each stage. Essentially, we break pages of text down into relevant paragraphs, relevant paragraphs into relevant sentences, and finally relevant sentences into relevant phrases, which become our bullet points.

To get our pages of text, we use a combination of RESTful APIs for sources that offer them, such as PubMed and Wikipedia, and the BeautifulSoup and Requests Python libraries for those that don't, such as NCCIH.
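A minimal sketch of the two retrieval paths is below. The URLs and the paragraph selector are illustrative assumptions, not the exact endpoints used in the pipeline.

```python
import requests
from bs4 import BeautifulSoup

# 1) REST API path, e.g. Wikipedia's summary endpoint for an herb page.
resp = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/summary/Ginger", timeout=10
)
summary = resp.json().get("extract", "")

# 2) Scraping path for sources without an API (illustrative NCCIH-style page).
page = requests.get("https://www.nccih.nih.gov/health/ginger", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```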

To explain our general extraction pipeline, we walk through an example using the Wikipedia page on ginger, illustrated by the animation above.

We start by selecting the relevant headings from the table of contents, which point us to the relevant paragraphs.

To find the desired headings, we set "seed words" related to topics of interest (e.g., for the topic of side effects: "adverse", "side-effect", "interact", and the like).

We then compare our seed words with the content headings; headings that are "similar enough" to our seed words dictate which paragraphs are kept, while the rest are discarded. The two are made comparable through word embeddings augmented with character-level n-grams, as described in Mikolov et al.'s paper "Enriching Word Vectors with Subword Information". Essentially, we look not only at whole words but also at chunks (character n-grams), so that even when a word is missing from the vocabulary we can still make use of root words, prefixes, suffixes, etc. The word embedding model was trained on millions of PubMed abstracts, full text from the PubMed Central Open Access subset, and text from an English Wikipedia dump. The vectors are compared by cosine similarity, and we set an empirically determined threshold to decide how "similar" a heading must be to a seed word in order to be retained.
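A sketch of the heading filter is below, assuming a fastText-style model loaded with gensim; the model file name, seed words, and 0.5 threshold are placeholders rather than the tuned values.

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# A fastText binary trained on biomedical text (placeholder path).
vectors = load_facebook_vectors("biomedical_fasttext.bin")

SEED_WORDS = ["adverse", "side-effect", "interact"]
THRESHOLD = 0.5  # empirically tuned in practice

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def heading_is_relevant(heading):
    # Character n-grams let fastText embed out-of-vocabulary heading words.
    heading_vecs = [vectors[w.lower()] for w in heading.split()]
    return any(
        cosine(hv, vectors[seed]) >= THRESHOLD
        for hv in heading_vecs
        for seed in SEED_WORDS
    )

headings = ["Culinary use", "Adverse effects", "Etymology"]
relevant = [h for h in headings if heading_is_relevant(h)]  # -> ["Adverse effects"]
```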

Now that we have our relevant paragraphs (pointed to by the content headings), we look to extract relevant sentences. We do this by using UMLS, a government-supported medical ontology, to identify sentences with medical content relating to conditions, symptoms, or other medical objects of interest.
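One way to implement this filter is sketched below using the QuickUMLS wrapper over a local UMLS installation; the actual UMLS client in the pipeline may differ, and the index path is a placeholder.

```python
from quickumls import QuickUMLS

# Matcher built from a local UMLS install (placeholder path).
matcher = QuickUMLS("/path/to/quickumls_index")

def has_medical_content(sentence):
    # Keep the sentence if UMLS finds any concept mention in it.
    return len(matcher.match(sentence, best_match=True)) > 0

sentences = [
    "Ginger has been used in cooking for centuries.",
    "Ginger may alleviate nausea and interact with anticoagulants.",
]
relevant = [s for s in sentences if has_medical_content(s)]
```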

From each relevant sentence, we find our relevant phrases through relation extraction, in which we look for subject-verb-object triples (SVOs); in the example above, "Ginger alleviates nausea".
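A minimal SVO-extraction sketch using spaCy's dependency parse is below; it is a simplification of full relation extraction, and the model name is just the small English pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svos(sentence):
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            # Direct syntactic subjects and objects of the verb.
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_svos("Ginger alleviates nausea."))  # [('Ginger', 'alleviate', 'nausea')]
```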

Finally, we apply a process similar to the one used to extract content headings (word embeddings + cosine similarity + thresholding). We look for verbs in the SVOs that indicate whether the phrase explains what the herb treats, interacts with, or causes. We normalize these words, combine them with the other data sources, and put them on the landing pages for our herbs.
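The sketch below shows how an SVO verb might be mapped onto one of those report sections with the same fastText-style model as the heading step; the seed verbs and threshold are placeholders.

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("biomedical_fasttext.bin")  # placeholder path

SECTION_SEEDS = {
    "treats": ["treat", "alleviate", "relieve"],
    "interacts": ["interact", "inhibit"],
    "causes": ["cause", "induce"],
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_verb(verb, threshold=0.5):
    # Return the section whose seed verbs are most similar to the SVO verb,
    # or None if nothing clears the threshold.
    best_section, best_score = None, threshold
    for section, seeds in SECTION_SEEDS.items():
        score = max(cosine(vectors[verb], vectors[s]) for s in seeds)
        if score > best_score:
            best_section, best_score = section, score
    return best_section

print(classify_verb("alleviate"))  # expected: "treats"
```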

Summary Diagram

For summaries of conditions and herbs, we use Wikipedia both for entity resolution and as the source of summary content. We summarize via TextRank, an unsupervised graph-based approach.
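A minimal TextRank sketch is below: sentences are ranked by PageRank over a word-overlap similarity graph. The production summarizer may instead use a library implementation; this is just the core idea.

```python
import itertools
import math
import networkx as nx

def textrank_summary(sentences, top_n=2):
    def similarity(a, b):
        # Word-overlap similarity normalized by sentence length (TextRank-style).
        wa, wb = set(a.lower().split()), set(b.lower().split())
        overlap = len(wa & wb)
        if overlap == 0:
            return 0.0
        return overlap / (math.log(len(wa) + 1) + math.log(len(wb) + 1))

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        w = similarity(sentences[i], sentences[j])
        if w > 0:
            graph.add_edge(i, j, weight=w)

    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]  # keep original sentence order
```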

Query Formula

The underlying search engine is built on the Python library Whoosh, using the Okapi BM25 (Best Matching 25) algorithm for relevance ranking.
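A sketch of a Whoosh index with BM25 scoring is below; the schema, field names, and document are illustrative rather than the actual Herbert index.

```python
import os

from whoosh import scoring
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Illustrative schema: one document per herb.
schema = Schema(name=ID(stored=True), body=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(
    name="ginger",
    body="Ginger may alleviate nausea and interact with anticoagulants.",
)
writer.commit()

# Search with BM25F weighting (Whoosh's BM25 variant).
with ix.searcher(weighting=scoring.BM25F()) as searcher:
    query = QueryParser("body", ix.schema).parse("nausea")
    for hit in searcher.search(query, limit=5):
        print(hit["name"])
```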
