jakelever / civicmine
Text mining cancer biomarkers for the CIVIC database
Home Page: http://bionlp.bcgsc.ca/civicmine
License: MIT License
Hi @jakelever,
I read the CIViCmine paper and came across the following statement in the abstract:
To this end, a group of cancer genomics experts annotated biomarkers and their clinical associations discussed in 800 sentences and achieved good inter-annotator agreement
May I know if the corpus linked by @swartchris8 in #2 (https://github.com/jakelever/cancermine/blob/master/data/cancermine_corpus.zip) is the one that contains these 800 sentences? Were these sentences manually annotated by experts?
If not, can you point me to the corpus of these 800 manually annotated sentences?
Hello Jake,
I have set up the environment for civicmine as described in your GitHub repository. I am trying to replicate the analysis with "pubrunner --test ." but I ran into some issues. Here is the output and error description:
I am using Python 3.7 and Ubuntu 19.10. The problem is temporarily solved if I change the following lines of code in the codecs.py file.
    def write(self, object):
        """ Writes the object's contents encoded to self.stream.
        """
        # Local change: decode the incoming bytes before encoding them again
        data, consumed = self.encode(object.decode('utf-8'), self.errors)
        self.stream.write(data)
However, other problems with snakemake then came up, so I had to undo the changes. I have been working on this for the past two days. I would appreciate any support you can offer.
Thanks.
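Editing codecs.py changes behaviour for every Python program on the machine, so it may be safer to do the same decoding where the data is written. The traceback is not shown above, so the following is only a minimal sketch under the assumption that bytes were being passed to a text writer. `write_text` is a hypothetical helper, not part of pubrunner, snakemake or civicmine.

```python
import io


def write_text(stream, value, encoding="utf-8"):
    """Write value to a text stream, decoding it first if it is bytes."""
    if isinstance(value, (bytes, bytearray)):
        value = value.decode(encoding)
    stream.write(value)


# Demonstration with an in-memory text stream
buffer = io.StringIO()
write_text(buffer, b"EGFR T790M\n")  # bytes are decoded before writing
write_text(buffer, "BRAF V600E\n")   # str is written unchanged
print(buffer.getvalue())
```

Keeping the conversion in a helper like this leaves the standard library untouched, so other tools that rely on codecs.py are not affected.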
Hi Jake,
Quick question: You have four categories (diagnostic, predictive, predisposing, prognostic). Do you by any chance have any plans to classify the corpus by whether a relationship in a sentence is a true positive or a false positive? Or do you know of a corpus like this out in the interwebs?
Thanks,
KMS
I was wondering where I can find the 1500 annotated sentences used to train the model?
I would assume these are the sentences: https://github.com/jakelever/cancermine/blob/master/data/cancermine_corpus.zip
Hi jakelever,
Thanks for this wonderful project.
When I used CIViCmine (http://bionlp.bcgsc.ca/civicmine), I couldn't find "T790M" in any sentence. That seemed odd to me, because EGFR T790M is a very well-known biomarker in cancer treatment.
This is a tokenizer problem: the spaCy language model (en_core_web_sm) tokenizes "T790M" as "T790" and "M": (('T790', 'NOUN'), ('M', 'PROPN')).
I changed the kindred package like this (kindred/Parser.py):
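    from spacy.symbols import ORTH  # needed for the special case below

    if model not in Parser._models:
        Parser._models[model] = spacy.load(model, disable=['ner'])
    self.nlp = Parser._models[model]
    # Keep "T790M" as a single token instead of splitting it
    special_case = [{ORTH: "T790M"}]
    self.nlp.tokenizer.add_special_case("T790M", special_case)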
Now "T790M" is kept as a single token, ('T790M', 'VERB'), so this is fixed.
best,
jakelever
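The special case above covers only T790M; other substitutions such as L858R may be split in the same way. A rough sketch of collecting candidate strings from text so that each could be registered as a special case in the same manner follows. It assumes variants are written as one-letter amino acid code, position, one-letter amino acid code. `find_substitutions` is a hypothetical helper and is not part of kindred or spaCy.

```python
import re

# Single-letter amino acid codes; a substitution such as T790M is
# written as <reference residue><position><new residue>.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")


def find_substitutions(text):
    """Return distinct substitution-like strings in order of first appearance."""
    seen = []
    for match in SUBSTITUTION.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen


sentence = "EGFR T790M and L858R were detected, and T790M predicted resistance."
variants = find_substitutions(sentence)
print(variants)
```

Each string returned here could then be passed to the tokenizer's add_special_case call shown in the snippet above. This pattern is deliberately narrow and will miss other notations (for example three-letter residue codes or deletions).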
Hi
Great repo!
I added an option to search by gene using a free text query.
If this is interesting I can make a PR.