brunoamaral / gregory

Artificial Intelligence and Machine Learning to help find scientific research and filter relevant content

Home Page: https://gregory-ai.com/

License: Other

HTML 10.51% Python 89.14% Shell 0.24% Dockerfile 0.11%

multiple-sclerosis health research-tool machine-learning django python neurology

gregory's Introduction

Gregory AI

Gregory is an AI system that uses Machine Learning and Natural Language Processing to track clinical research and identify papers that improve the wellbeing of patients.

Sources for research can be added by RSS feed or manually.

The output can be seen in a static site, built with build.py, or via the API provided by the Django Rest Framework.

The docker compose file also includes a Metabase container which is used to build dashboards and manage notifications.

Sources can also be added to monitor Clinical Trials, in which case Gregory can notify a list of email subscribers.

For other integrations, the Django app provides RSS feeds with a live update of relevant research and newly posted clinical trials.

Features

Machine Learning to identify relevant research
Configure RSS feeds to gather search results from PubMed and other websites
Configure searches on any public website
Integration with mailgun.com to send emails
Automatic emails to the admin team with results from the last 48 hours
Subscriber management
Configure email lists for different stakeholders
Public and Private API to integrate with other software solutions and websites
Configure categories to organize search results based on keywords in title
Configure different “subjects” to keep different research areas segmented
Identify authors and their ORCID
Generate different RSS feeds

Current Use case for Multiple Sclerosis

https://gregory-ms.com

Rest API: https://api.gregory-ms.com

Running in Production

Server Requirements

Docker and docker-compose, with 2 GB of swap memory so that the machine learning models can be built. (See: Adding swap for Ubuntu)
Mailgun (optional)

Installing Gregory

Install python dependencies locally
Edit the .env file to reflect your settings and credentials.

DOMAIN_NAME=DOMAIN.COM
# Set this to the subdomain you configured with Mailgun. Example: mg.domain.com
EMAIL_DOMAIN=
# The SMTP server and credentials you are using. For example: smtp.eu.mailgun.org
# These variables are only needed if you plan to send notification emails
EMAIL_HOST=
EMAIL_HOST_PASSWORD=
EMAIL_HOST_USER=
# We use Mailgun by default on the newsletters, input your API key here
EMAIL_MAILGUN_API_URL=
EMAIL_PORT=587
EMAIL_USE_TLS='True'
# Where you cloned the repository
GREGORY_DIR=
# Leave this blank and come back to it when you're finished installing Metabase.
METABASE_SECRET_KEY=
# Where do you want to host Metabase?
METABASE_SITE_URL='https://metabase.DOMAIN.COM/'
# Set your postgres DB and credentials
POSTGRES_DB=
POSTGRES_PASSWORD=
POSTGRES_USER=
SECRET_KEY='Yeah well, you know, that is just, like, your DJANGO SECRET_KEY, man' # you should set this manually https://docs.djangoproject.com/en/4.0/ref/settings/#secret-key
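Before running the setup script, it can help to see which variables in .env are still blank. The following is a minimal sketch, assuming a simple KEY=value layout like the one above. The function name blank_env_keys is hypothetical and not part of Gregory.

```python
def blank_env_keys(env_text):
    """Return the names of variables that have no value in KEY=value text."""
    blank = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip empty lines, comment lines, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment and surrounding quotes before checking.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        if not value:
            blank.append(key.strip())
    return blank


sample = """
DOMAIN_NAME=DOMAIN.COM
# The SMTP server and credentials you are using
EMAIL_DOMAIN=
EMAIL_PORT=587
POSTGRES_DB=
"""
print(blank_env_keys(sample))
```

Some variables, such as METABASE_SECRET_KEY, are meant to stay blank until later, so the result is a checklist rather than a list of errors.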

Execute python3 setup.py.

The script checks that you have all the requirements and then helps you set up the containers.

Once finished, log in at https://api.DOMAIN.TLD/admin or wherever your reverse proxy is listening.

Go to the admin dashboard and change the example.com site to match your domain
Go to custom settings and set the Site and Title fields.
Configure your RSS Sources in the Django admin page.
Set up database maintenance tasks. Gregory needs to run a series of tasks to fetch missing information before applying the machine learning algorithm. For that, we use django-cron. Add the following to your crontab:

*/3 * * * * /usr/bin/docker exec -t admin ./manage.py runcrons
#*/10 * * * * /usr/bin/docker exec -t admin ./manage.py get_takeaways
*/5 * * * * /usr/bin/flock -n /tmp/get_takeaways /usr/bin/docker exec admin ./manage.py get_takeaways

How everything fits together

Django and Postgres

Most of the logic is inside Django. The admin container provides the Django Rest Framework, manages subscriptions, and sends emails.

The following subscriptions are available:

Admin digest

This is sent every 48 hours with the latest articles and their machine learning prediction. It gives the admin an Edit link where each article can be edited and tagged as relevant.

Weekly digest

This is sent every Tuesday and lists the relevant articles discovered in the last week.

Clinical Trials

This is sent every 12 hours if a new clinical trial was posted.

The title of the email footer for these emails needs to be set in the Custom Settings section of the admin backoffice.

Django also lets you add new sources from which to fetch articles. See /admin/gregory/sources/

Node-RED

We use Node-RED to collect articles from sources without an RSS feed. These flows need to be added manually and configured to write to the postgres database. If your node-red container does not show a series of flows, import the flows.json file from this repository.

Mailgun

Emails are sent from the admin container using Mailgun.

To enable them, you will need a mailgun account, or you can replace them with another way to send emails.

You need to configure the relevant variables for this to work:

EMAIL_USE_TLS=true
EMAIL_MAILGUN_API='YOUR API KEY'
EMAIL_DOMAIN='YOURDOMAIN'
EMAIL_MAILGUN_API_URL="https://api.eu.mailgun.net/v3/YOURDOMAIN/messages"

As an alternative, you can configure Django to use any other email server.

RSS feeds and API

Gregory has the concept of 'subject'. In this case, Multiple Sclerosis is the only subject configured. A subject is a group of sources and their respective articles. Categories can also be created. A category is a group of articles whose title matches at least one keyword in the list for that category. Categories can include articles across subjects.
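The category rule can be illustrated with a short sketch. It assumes a case-insensitive substring match, which may differ from how Gregory actually compares keywords. The function name and the sample categories are hypothetical.

```python
def matching_categories(title, categories):
    """Return names of categories with at least one keyword found in the title."""
    lowered = title.lower()
    return [
        name
        for name, keywords in categories.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]


# Hypothetical categories: name -> list of keywords.
categories = {
    "Ocrelizumab": ["ocrelizumab"],
    "Rehabilitation": ["gait", "physiotherapy"],
}
print(matching_categories("Gait rehabilitation in multiple sclerosis", categories))
```

Because matching is done on the title alone, an article can fall into several categories at once, regardless of its subject.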

There are options to filter lists of articles by their category or subject in the format articles/category/<category> and articles/subject/<subject>, where <category> and <subject> are the lowercase names with spaces replaced by dashes.
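The naming rule for these paths can be sketched as follows. Collapsing runs of whitespace into a single dash is an assumption, and the helper names to_slug and subject_path are hypothetical.

```python
def to_slug(name):
    """Lowercase a name and join its words with dashes."""
    return "-".join(name.lower().split())


def subject_path(subject):
    """Build the relative path used to filter articles by subject."""
    return f"articles/subject/{to_slug(subject)}"


print(subject_path("Multiple Sclerosis"))
```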

Available RSS feeds

Latest articles, /feed/latest/articles/
Latest articles by subject, /feed/articles/subject/<subject>/
Latest articles by category, /feed/articles/category/<category>/
Latest clinical trials, /feed/latest/trials/
Latest relevant articles by Machine Learning, /feed/machine-learning/
Twitter feed, /feed/twitter/. This includes all relevant articles by manual selection and machine learning prediction. It is read by Zapier so that we can post on Twitter automatically.

How to update the Machine Learning Algorithms

This is not working right now; there is a pull request to set up an automatic process that keeps the machine learning models up to date.

It's useful to re-train the machine learning models once you have a good number of articles flagged as relevant.

cd docker-python; source .venv/bin/activate
python3 1_data_processor.py
python3 2_train_models.py

Running for local development

Edit the env.example file to fit your configuration and rename to .env

sudo docker-compose up -d
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Thank you to

@Antoniolopes for helping with the Machine Learning script.
@Chbm for help in keeping the code secure.
@Jneves for help with the build script
@Malduarte for help with the migration from sqlite to postgres.
@Melo for showing me Hugo
@Nurv for the suggestion in using Spacy.io
@Rcarmo for showing me Node-RED

And the Lobsters at One Over Zero

gregory's Issues

implement email digest with new articles to be sent weekly

weekly summary includes too many articles

Some of the articles listed do not seem to have been marked as relevant.

make dates for clinical trials and articles explicit UTC
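One possible way to make timestamps explicit is sketched below. It assumes that naive values already stored were recorded in UTC, which would need to be confirmed against the data. The helper name as_utc_iso is hypothetical.

```python
from datetime import datetime, timezone


def as_utc_iso(value):
    """Return an ISO 8601 string with an explicit UTC offset."""
    if value.tzinfo is None:
        # Assumption: naive values were recorded in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).isoformat()


print(as_utc_iso(datetime(2021, 6, 1, 12, 30)))
```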

Add European Clinical Trial Register

https://www.clinicaltrialsregister.eu/ctr-search/search?query=multiple+sclerosis

These results are available as an RSS Feed.
https://www.clinicaltrialsregister.eu/ctr-search/rest/feed/bydates?query=multiple+sclerosis
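A rough sketch of reading such a feed with the Python standard library follows. It assumes a plain RSS 2.0 layout with title and link elements inside each item; the actual structure of the register's feed has not been checked here, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


# Made-up sample feed, not real register data.
sample_feed = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>Trial A</title><link>https://www.example.com/a</link></item>
<item><title>Trial B</title><link>https://www.example.com/b</link></item>
</channel></rss>"""

for entry in parse_rss_items(sample_feed):
    print(entry["title"], entry["link"])
```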

error building the container on Ubuntu 21.04

$ sudo docker-compose up

Creating volume "gregory_flows" with local driver
Creating volume "gregory_python" with local driver
Creating node-red ... error

ERROR: for node-red  Cannot create container for service node-red: failed to mount local volume: mount ./docker-python:/var/lib/docker/volumes/gregory_python/_data, flags: 0x1000: no such file or directory

ERROR: for node-red  Cannot create container for service node-red: failed to mount local volume: mount ./docker-python:/var/lib/docker/volumes/gregory_python/_data, flags: 0x1000: no such file or directory
ERROR: Encountered errors while bringing up the project.

Admin container can't run training for the Machine Learning models

include DOI number in articles table

Manage subscriptions through django's admin

add an RSS feed for search results

this will be a by product of #5

Add forms to subscribe to the weekly digest and clinical trial notifications

Currently we need to add users to the mailing lists manually. This would allow them to subscribe on their own.

Requires

Spam protection
Ability to add existing users to new lists
Ability to unsubscribe from a list by email link

API for related articles returns the source_id instead of the source_name

Example https://gregory-ms.com/articles/1/

pythonshell node-red module is missing from the dockerfile

deleting an article does not delete the relationship with the category(ies)

I must have missed something when I wrote the models.

from django.db import models
class Categories(models.Model):
	category_id = models.AutoField(primary_key=True)
	category_name = models.CharField(blank=True, null=True,max_length=200)
	category_description = models.TextField(blank=True, null=True)

	def __str__(self):
		return self.category_name

	class Meta:
		managed = True
		verbose_name_plural = 'categories'
		db_table = 'categories'

class Articles(models.Model):
	article_id = models.AutoField(primary_key=True)
	title = models.TextField(blank=False, null=False, unique=True)
	summary = models.TextField(blank=True, null=True)
	link = models.URLField(blank=False, null=False, max_length=2000)
	published_date = models.DateTimeField(blank=True, null=True)
	discovery_date = models.DateTimeField()
	source = models.ForeignKey('Sources', models.DO_NOTHING, db_column='source', blank=True, null=True)
	relevant = models.BooleanField(blank=True, null=True)
	ml_prediction_gnb = models.BooleanField(blank=True, null=True)
	ml_prediction_lr = models.BooleanField(blank=True, null=True)
	noun_phrases = models.JSONField(blank=True, null=True)
	categories = models.ManyToManyField(Categories)
	entities = models.ManyToManyField('Entities')
	sent_to_admin = models.BooleanField(blank=True, null=True)
	sent_to_subscribers = models.BooleanField(blank=True, null=True)
	sent_to_twitter = models.BooleanField(blank=True, null=True)
	doi = models.CharField(max_length=280, blank=True, null=True)

	def __str__(self):
		return str(self.article_id)

	class Meta:
		managed = True
		# unique_together = (('title', 'link'),)
		verbose_name_plural = 'articles'
		db_table = 'articles'


class Entities(models.Model):
	entity = models.TextField()
	label = models.TextField()


	class Meta:
		managed = True
		verbose_name_plural = 'entities'
		db_table = 'entities'


class Sources(models.Model):
	TABLES = [('articles', 'Articles'),('trials','Trials')]


	source_id = models.AutoField(primary_key=True)
	source_for = models.CharField(choices=TABLES, max_length=50, default='articles')
	name = models.TextField(blank=True, null=True)
	link = models.TextField(blank=True, null=True)
	language = models.TextField()
	subject = models.TextField()
	method = models.TextField()
	

	def __str__(self):
		return self.name

	class Meta:
		managed = True
		verbose_name_plural = 'sources'
		db_table = 'sources'


class Trials(models.Model):
	trial_id = models.AutoField(primary_key=True)
	discovery_date = models.DateTimeField(blank=True, null=True)
	title = models.TextField(blank=False,null=False, unique=True)
	summary = models.TextField(blank=True, null=True)
	link = models.URLField(blank=False, null=False, max_length=2000)
	published_date = models.DateTimeField(blank=True, null=True)
	source = models.ForeignKey('Sources', models.DO_NOTHING, db_column='source', blank=True, null=True)
	relevant = models.BooleanField(blank=True, null=True)
	sent = models.BooleanField(blank=True, null=True)
	sent_to_twitter = models.BooleanField(blank=True, null=True)
	sent_to_subscribers = models.BooleanField(blank=True, null=True)

	def __str__(self):
		return str(self.trial_id) 

	class Meta:
		managed = True
		verbose_name_plural = 'trials'
		db_table = 'trials'

move database from SQLite to Postgres

reasons for it:

better handling of timestamp data
equal integration with metabase
faster (?) response time

best approach would be psql -d gregory -f ./docker-data/gregory.db but it results in syntax errors because of the html values in some columns.
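One way to sidestep quoting problems with HTML values is to pass them as query parameters instead of embedding them in SQL text. The sketch below reads rows from SQLite and yields an INSERT statement plus its parameters. The %s placeholder style is the one psycopg2 uses; the call that would execute the statement against Postgres is left out, and the table and column names are illustrative only.

```python
import sqlite3


def export_rows(sqlite_conn, table, columns):
    """Yield (sql, params) pairs for re-inserting each SQLite row elsewhere."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    insert_sql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    for row in sqlite_conn.execute(f"SELECT {column_list} FROM {table}"):
        # Values travel separately from the SQL text, so quotes and HTML
        # inside them never need escaping by hand.
        yield insert_sql, tuple(row)


# In-memory database with one row containing HTML and a single quote.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, summary TEXT)")
conn.execute(
    "INSERT INTO articles VALUES (?, ?)",
    ("A title", "<p>It's <b>HTML</b></p>"),
)
statements = list(export_rows(conn, "articles", ["title", "summary"]))
print(statements[0])
```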

convert machine learning scripts to use True / False instead of Zero / One

list relevant results in the last 30 days in the doctor's page

maybe using Metabase to avoid extra coding

make stopwords configurable

Gregory needs to be agnostic in order to be applied to any number of subjects.

Right now, stop words, or stop sentences, are hardcoded into the javascript:

https://github.com/brunoamaral/gregory/blob/main/assets/js/gregory.js#L5-L20

Maybe move this information into config.toml or another single configuration file that makes sense.

Make Dockerfile.django use requirements.txt

both articles and clinical trials use "source" as a parameter

This does not feel right because on clinical trials it should not be a source. Maybe it should be "sponsor" or "published_in"

update flows to use postgres

relevant articles are not listed

include spacy.io in node-red container

We are using https://github.com/explosion/spaCy to detect the noun phrases in the title of articles. This information is then used to list related articles on each page.

Half of the build process is running spacy.io, so it should be included in the node-red flows to save that information in the database.

We could run it as a separate script, but I don't want to split the different processing steps between the container and the host server.

Automatic categorisation does not take synonyms into account

This is a caveat: the system fails to include an article in the corresponding category if the noun used is different. For example, Ocrelizumab and Ocrevus, or Natalizumab and Tysabri. Each pair names a single medication, but in its current state Gregory can only identify the two names as separate entities.
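A possible direction is to map brand names to a canonical name before matching. The sketch below uses the two pairs mentioned above; the table SYNONYMS and the function canonical_terms are hypothetical and not part of Gregory.

```python
# Hypothetical synonym table: brand name -> canonical (generic) name.
SYNONYMS = {"ocrevus": "ocrelizumab", "tysabri": "natalizumab"}


def canonical_terms(title, synonyms=SYNONYMS):
    """Return canonical names mentioned in a title, directly or via a synonym."""
    found = set()
    lowered = title.lower()
    for alias, canonical in synonyms.items():
        if alias in lowered or canonical in lowered:
            found.add(canonical)
    return found


print(canonical_terms("Ocrevus in relapsing multiple sclerosis"))
```

With this kind of lookup, a category keyed on the canonical name would pick up titles that mention either form.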

Data should be fetched from the PG database and not the api

Right now, we are fetching articles/all?format=json which uses Django Rest Framework to dump the whole database.

Fetching from the postgres database directly will speed up the build and cut down on processing.

Move db maintenance into django container

list of clinical trials in the weekly email is too long

fix sorting of clinical trials on website

Let visitors browse the database freely

add listing and pagination for articles in markdown format

depends on #5

fetch rss data via python feedparser

The node-red feedparser doesn't let us add an RSS url to it, so instead we will be using /python-ml/feedreader.py

implement view of articles by source

running 3_predict.py returns an error using the scikit branch

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 100, in _predict_binary
    score = np.ravel(estimator.decision_function(X))
AttributeError: 'GaussianNB' object has no attribute 'decision_function'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "3_predict.py", line 120, in <module>
    data = predictor(dataset)
  File "3_predict.py", line 110, in predictor
    prediction = pipelines[model].predict([input])
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 470, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 457, in predict
    indices.extend(np.where(_predict_binary(e, X) > thresh)[0])
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 103, in _predict_binary
    score = estimator.predict_proba(X)[:, 1]
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 125, in predict_proba
    return np.exp(self.predict_log_proba(X))
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 104, in predict_log_proba
    jll = self._joint_log_likelihood(X)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 489, in _joint_log_likelihood
    n_ij = -0.5 * np.sum(np.log(2.0 * np.pi * self.var_[i, :]))
AttributeError: 'GaussianNB' object has no attribute 'var_'

Create frontend view for current research

https://www.mssociety.org.uk/research/explore-our-research/emerging-research-and-treatments/explore-treatments-in-trials

The goal is to list the current research as listed by MS Society with a listing of trials and published articles.

update docker image to include postgres nodes for node-red

Listing for physical therapists is missing

make build.py split the json into markdown files

add more information about sources to the database

Example:

[
    {
        "source": "CUF",
        "link": "https://www.example.com"
    },
    {
        "source": "ClinicalTrials.gov",
        "link": "https://www.example.com"
    },
    {
        "source": "Novartis",
        "link": "https://www.example.com"
    }
]

Other relevant information, the link of the search page and keywords we use.

excel export does not contain the full data source

implement search page

options:

lunr.js
Algolia

New sources for articles, by João Sequeira (Capuchos)

1. Registo nacional de ensaios clínicos (https://www.rnec.pt/pt_PT)
2. MS Journal (https://journals.sagepub.com/home/msj)
3. MS and Related Disorders Journal (https://www.msard-journal.com/)

Move API to django rest framework

List all articles
relevancy API
Remove pagination from articles
https://api.gregory-ms.com/articles/all

Example: https://api.gregory-ms.com/articles/all

List article that matches the {ID} number.

https://api.gregory-ms.com/articles/id/{ID}

Example: https://api.gregory-ms.com/articles/id/19

List all articles by keyword.

https://api.gregory-ms.com/articles/keyword/{keyword}

Example: https://api.gregory-ms.com/articles/keyword/myelin

List related articles by keywords

POST https://gregory-ms.com/articles/related/

Expects a json object of keywords in the post body.

{ "keywords": ["trials", "gait rehabilitation", "multiple sclerosis"] }
https://gregory-ms.com/articles/related/
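A request of this shape could be prepared as sketched below, assuming the endpoint behaves as described in this issue. The request is built but not sent, and the Content-Type header is an assumption. The function name build_related_request is hypothetical.

```python
import json
import urllib.request


def build_related_request(keywords, url="https://gregory-ms.com/articles/related/"):
    """Build (but do not send) a POST request carrying the keywords as JSON."""
    body = json.dumps({"keywords": list(keywords)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_related_request(["trials", "gait rehabilitation", "multiple sclerosis"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it. Note that json.dumps emits double-quoted strings, which JSON requires.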

List all relevant articles.

These are articles that we show on the home page because they appear to offer new courses of treatment.

https://api.gregory-ms.com/articles/relevant

Example: https://api.gregory-ms.com/articles/relevant

Articles’ Sources

List all articles from specified {source}.

https://api.gregory-ms.com/articles/source/{source_id}

Example: https://api.gregory-ms.com/articles/source/1

List all available sources.

https://api.gregory-ms.com/articles/sources

Example: https://api.gregory-ms.com/articles/sources

Trials

List all trials.

https://api.gregory-ms.com/trials/all

Example: https://api.gregory-ms.com/trials/all

List all trials by keyword.

https://api.gregory-ms.com/trials/keyword/{keyword}

Example: https://api.gregory-ms.com/trials/keyword/myelin

Trials’ Sources

List all trials from specified {source}.

https://api.gregory-ms.com/trials/source/{source}

Example: https://api.gregory-ms.com/trials/source/pubmed

List all available sources.

https://api.gregory-ms.com/trials/sources

Example: https://api.gregory-ms.com/trials/sources
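The URL patterns listed above can be wrapped in small helpers, sketched below. The function names are hypothetical, and percent-encoding the keyword and source is an assumption about how the API expects values containing spaces.

```python
from urllib.parse import quote

API_BASE = "https://api.gregory-ms.com"


def article_by_id(article_id):
    """URL for a single article, following /articles/id/{ID}."""
    return f"{API_BASE}/articles/id/{int(article_id)}"


def articles_by_keyword(keyword):
    """URL for articles matching a keyword, following /articles/keyword/{keyword}."""
    # Percent-encode so spaces and other characters are URL-safe (assumption).
    return f"{API_BASE}/articles/keyword/{quote(keyword, safe='')}"


def trials_by_source(source):
    """URL for trials from a source, following /trials/source/{source}."""
    return f"{API_BASE}/trials/source/{quote(source, safe='')}"


print(article_by_id(19))
print(articles_by_keyword("myelin"))
print(trials_by_source("pubmed"))
```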

Main requirement is that the value for an article's title or URL needs to be unique.
Diagram below shows a possible DB model for version 7.

DataError at /admin/gregory/articles/2097/change/
invalid input syntax for type json
LINE 1: ...amptz, "sent_to_twitter" = NULL, "noun_phrases" = '[''Centra...
                                                             ^
DETAIL:  Token "'" is invalid.
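The error message suggests that the value written to noun_phrases was a Python list rendered as text with single quotes, which is not valid JSON. The sketch below shows the difference between str() and json.dumps(). The sample phrases are illustrative and are not the values from the failing article.

```python
import json

# Illustrative values only.
noun_phrases = ["myelin repair", "multiple sclerosis"]

as_repr = str(noun_phrases)         # Python repr, uses single quotes
as_json = json.dumps(noun_phrases)  # JSON text, uses double quotes


def is_valid_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(as_repr, is_valid_json(as_repr))
print(as_json, is_valid_json(as_json))
```

If this is the cause, serialising the list with json.dumps() (or handing the list itself to the JSONField) before saving should avoid the error.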