bionlplab / bioc
Data structures and code to read/write BioC XML and JSON.
License: MIT License
I’m writing a Python script to convert a BioC XML file into a PubTator file.
I did not find a similar script, so I am writing one myself.
The bioc files are downloaded from :
https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/BioRED.zip
I tried to read the "Test.BioC.XML" in two ways:
1:
with open(fpath, 'r') as fp:
    collection = biocxml.load(fp)
docs = collection.documents
2:
with biocxml.iterparse(fpath) as reader:
    collection_info = reader.get_collection_info()
    for doc in reader:
        ...  # process each document here
Strangely, all annotations are missing, but the relations are parsed correctly.
Any idea why this happens?
First of all, thank you for this incredibly useful library.
I am trying to parse a brat file of this resource but I get the following error:
[ins] In [26]: a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [26], in <module>
----> 1 a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")
File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:214, in load_ann(fp, docid)
212 doc.add_annotation(loads_brat_note(line))
213 if line[0] == 'A' or line[0] == 'M':
--> 214 doc.add_annotation(loads_brat_attribute(line))
215 if line[0] == '*':
216 doc.add_annotation(loads_brat_equiv(line))
File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:16, in loads_brat_attribute(s)
12 """
13 ID [tab] TYPE REFID [FLAG1 FLAG2 ...]
14 """
15 toks = s.split('\t')
---> 16 assert len(toks) == 2, 'Illegal format: %s' % s
18 att = BratAttribute()
19 att.id = toks[0]
AssertionError: Illegal format: M
I have a BioC-formatted dataset that I'd like to be able to use in brat. I looked at the code for brat2bioc and it looks like there's no way to convert in the opposite direction; however, the brat2bioc tech report says that the original Java code allowed conversion in both directions ("that translates annotations originally in brat format into BioC and vice versa").
Is there a way to bring this functionality into the python module?
EDIT: spelling
Hi,
I used the brat export function on a protected corpus given as a BioC XML file, but I get an error:
AttributeError: 'BioCDocument' object has no attribute 'entities'
Is it possible to create BioC files without the definition of 'entities'?
I created the entities myself:
for passage in doc.passages:
    i = i + 1
    annotations = []
    for ann in passage.annotations:
        loc = ann.locations[0]
        key = len(annotations)
        start = loc.offset
        end = loc.offset + loc.length
        # BioC offsets are usually document-level, so shift by passage.offset to index passage.text
        text = passage.text[start - passage.offset:end - passage.offset]
        line = 'T' + str(key) + '\t' + ann.infons['type'] + ' ' + str(start) + ' ' + str(end) + '\t' + text
        annotations.append(line)
Do you have an idea?
Best regards, Christina
For datasets like tmVar v2.0, pubtator fails due to the presence of "RSID: xxxxxxxxx" at the end of the annotations. Example file here: https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/tmVar/tmVar.Normalization.txt
PubMed Central provides its Open Access articles in the BioC JSON format (see API and Bulk Download). I downloaded one portion with
wget https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/PMC095XXXXX_json_ascii.tar.gz
and want to apply a filter document by document (to save memory). I tried the following code:
from tqdm import tqdm
import gzip
import io
import json

keyword = 'diabetes'
my_doi_list = []
path_file_PMC = '/content/PMC095XXXXX_json_ascii.tar.gz'
path_file_PMC_filtered = '/content/result'
with gzip.open(path_file_PMC, 'rb') as gz, open(path_file_PMC_filtered, 'wb') as f_out:
    f = io.BufferedReader(gz)
    for line in tqdm(f.readlines()):
        record = json.loads(line)
        # doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
        if keyword in record['documents'][0]['passages'][0]['text']:
            # my_doi_list.append(doi)
            f_out.write(line)
But I get an error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
0%| | 0/95046 [00:00<?, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-21-cc459cfaf959> in <module>
19 # f = gz
20 for line in tqdm(f.readlines()):
---> 21 record = json.loads(line)
22 doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
23 if keyword in record['documents'][0]['passages'][0]['text']: # TODO: <<< change this to your filter
2 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
346 parse_int is None and parse_float is None and
347 parse_constant is None and object_pairs_hook is None and not kw):
--> 348 return _default_decoder.decode(s)
349 if cls is None:
350 cls = JSONDecoder
/usr/lib/python3.7/json/decoder.py in decode(self, s, _w)
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):
/usr/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Then I found your Python package, and I thought I could use one of the code examples provided here: https://bioc.readthedocs.io/en/latest/biocjson.html
However, I don't want to unzip or untar, and from your code examples it is not clear which format the file at filename has. Is it possible to use functions of your package with a .tar.gz file as input, or do I need to unzip (without untarring)?
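The JSONDecodeError above is consistent with gzip.open exposing the raw tar stream, whose first bytes are tar headers rather than JSON. A minimal sketch of reading the archive member by member with the standard-library tarfile module, without extracting to disk, could look like the following. It does not use the bioc package, and it assumes each archive member is a single JSON document with the documents/passages layout used in the code above; the helper names iter_json_members and first_passage_contains are made up for illustration.

```python
import json
import tarfile


def iter_json_members(tar_path):
    """Yield (member_name, parsed_json) for each regular file in a .tar.gz archive."""
    # "r:gz" makes tarfile decompress on the fly; nothing is extracted to disk.
    with tarfile.open(tar_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fobj = tar.extractfile(member)
            if fobj is None:
                continue
            try:
                # json.loads accepts bytes and detects the UTF encoding itself.
                yield member.name, json.loads(fobj.read())
            except ValueError:
                # Covers JSONDecodeError and UnicodeDecodeError: skip non-JSON members.
                continue


def first_passage_contains(record, keyword):
    """True if the first passage of the first document mentions keyword."""
    try:
        text = record["documents"][0]["passages"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return False
    return keyword in text
```

Only one member is held in memory at a time, so a filter loop can write each matching record out (for example as one JSON line per record) before moving on to the next member.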
I'm trying to read the BioRED file from here: https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/
As you will see, it is in PubTator format. The annotations are read in properly, but when I try to read the relations, I get no output.
The relations are written as follows:
14510914 Association D050033 D007454 No
14510914 Positive_Correlation p|DEL|439_443| C564766 Novel
14510914 Positive_Correlation p|DEL|439_443| D003409 Novel
14510914 Association C564766 D007454 No
Is this something bioc.pubtator supports?
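Independently of what bioc.pubtator supports, relation lines like the ones quoted above can be picked out with a few lines of plain Python. The sketch below assumes the fields are tab-separated (the tabs appear as spaces in the excerpt) and that a relation line has exactly five fields: PMID, relation type, two entity identifiers, and a novelty flag. The function names and dictionary keys are made up for illustration and are not part of the bioc API.

```python
def parse_relation_line(line):
    """Return a dict for a five-field relation line, or None for any other line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return None
    # Annotation lines carry a numeric start offset in the second field; relations do not.
    if fields[1].isdigit():
        return None
    pmid, rel_type, arg1, arg2, novelty = fields
    return {"pmid": pmid, "type": rel_type, "arg1": arg1, "arg2": arg2, "novelty": novelty}


def read_relations(lines):
    """Collect every relation found in an iterable of PubTator lines."""
    return [rel for rel in map(parse_relation_line, lines) if rel is not None]
```

Title and abstract lines contain no tabs and annotation lines start their second field with an offset, so both fall through to None under these assumptions.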
Hi,
I am attempting to use BioC on Mac OS X, and when I try to import bioc in Python 2.7.10 I get the following error:
>>> import bioc
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/bioc/__init__.py", line 4, in <module>
from .bioc import BioCCollection, BioCDocument, BioCPassage, BioCSentence, BioCAnnotation, \
File "/Library/Python/2.7/site-packages/bioc/bioc.py", line 24
def __init__(self, refid: str, role: str):
^
SyntaxError: invalid syntax
I have updated to the most recent version:
sudo pip install bioc
Password:
Requirement already satisfied: bioc in /Library/Python/2.7/site-packages (1.3.1)
Requirement already satisfied: docutils==0.14 in ./Library/Python/2.7/lib/python/site-packages (from bioc) (0.14)
Requirement already satisfied: lxml==4.2.5 in /Library/Python/2.7/site-packages (from bioc) (4.2.5)
Requirement already satisfied: jsonlines==1.2.0 in /Library/Python/2.7/site-packages (from bioc) (1.2.0)
Requirement already satisfied: six in ./Library/Python/2.7/lib/python/site-packages (from jsonlines==1.2.0->bioc) (1.11.0)
Thank you in advance for your help!
Karyn
Hi, the new release has stopped the use of BioCXMLDocumentReader using a 'with' statement. It looks like the __enter__ and __exit__ methods were removed in commit 8174c1d. The documentation still suggests that you can do it that way. It'd be really useful if this could be reintroduced.
Here's a short bit of example code that gives the error below it.
import bioc
with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
    pass
Traceback (most recent call last):
File "itertest.py", line 3, in <module>
with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
AttributeError: __enter__
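The AttributeError arises because a 'with' statement requires the object to define __enter__ and __exit__. Until that support is restored, one possible workaround is a small wrapper built on the standard-library contextlib module. This is only a sketch: managed_reader is a hypothetical helper, and whether the reader exposes a close() method is an assumption, which is why the wrapper checks for it instead of relying on it.

```python
from contextlib import contextmanager


@contextmanager
def managed_reader(reader):
    """Yield the reader unchanged, then call its close() afterwards if it has one."""
    try:
        yield reader
    finally:
        # close() is an assumption about the reader; skip cleanup when it is absent.
        close = getattr(reader, "close", None)
        if callable(close):
            close()
```

With this in place, the failing example would be written as `with managed_reader(bioc.BioCXMLDocumentReader('test.bioc.xml')) as reader:`.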
First of all, great library. Thank you for your work.
I'm wondering whether there is a strict requirement for lxml==4.4.1, or whether it could be more flexible, like lxml>=4.4.1. It doesn't seem like lxml introduced breaking changes (https://lxml.de/4.5/changes-4.5.0.html), and a looser pin would be really helpful in projects with multiple dependencies that can conflict.