contentmine / futuretdm

Materials of FutureTDM project

Jupyter Notebook 100.00%

tdm text-mining contentmine eupmc jupyter-notebook network-analysis tutorial zika data-mining open-access

futuretdm's Introduction

FutureTDM

All materials from our involvement in the FutureTDM project.

All materials in this repository were produced within FutureTDM - The Future of Text and Data Mining, an EU Horizon2020 research project with the participation of Open Knowledge International and ContentMine.

The main outcomes are:

three tutorials about specific use-cases of text data mining techniques
one workshop
one presentation of the outcomes at a conference

All content and data are licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.

DO CONTENT MINING

To do text data mining with the ContentMine software, you need to do two things:

Install the ContentMine software. Find out more in installation.md.
Learn about text data mining. As preparation, we recommend looking at the resources list in installation.md.

TUTORIALS

We worked out three different use-cases to show the power of text data mining with our software.

Zika Virus

Use text data mining to get an overview of the research on the Zika virus. How has the research field evolved over the last decades? Which authors and journals contributed most, and how are they connected? Then dive into the data and publications to get a better understanding of the state of the field, and have a look at the species mentioned.

Go to the Zika Tutorial.

P-Hacking

(soon to come...)

Systematic Literature Review (Train the Trainees for Librarians)

Filter and find relevant publications to support a systematic review around your research question - in a fully open and reproducible way.

Go to the Systematic Literature Review Tutorial.

WORKSHOPS

FutureTDM Workshop II at Brussels

Date: 29th of March 2017

Location: EU Parliament, Brussels

Go to the documentation.

Workshop at ELPUB 2017 Conference

Date: 6th of June 2017

Location: 21st ELPUB Conference at Limassol, Cyprus

Go to the documentation.

Presentation at FutureTDM Symposium at Salzburg

Date: 13th of June 2017

Location: University of Applied Sciences in Salzburg, Austria

Go to the documentation.

COPYRIGHT

All content is openly licensed under the Creative Commons Attribution 4.0 license, unless otherwise stated.

All source code is free software: you can redistribute it and/or modify it under the terms of the MIT License. Visit http://opensource.org/licenses/MIT to learn more about the MIT License.

CONTRIBUTION

In the spirit of free software, everyone is encouraged to help improve the content created and curated here.

Here are some ways you can contribute:

by reporting bugs
by suggesting new sections
by translating to a new language
by writing or editing documentation
by analyzing the data
by visualizing the data
by writing code (no pull request is too small: fix typos in the user interface, add code comments, clean up inconsistent whitespace)
by refactoring code
by closing issues
by reviewing pull requests
by enriching the data with other data sources

When you are ready, submit a pull request.

Submitting an Issue

We use the GitHub issue tracker to track bugs and features. Before submitting a bug report or feature request, check to make sure it hasn't already been submitted. When submitting a bug report, please try to provide a screenshot that demonstrates the problem.

RESOURCES

FutureTDM

ContentMine

Materials: Software tutorials, training guidelines and training modules for ContentMine.
pyCProject: Python wrapper for CProject.
Dictionaries
Discourse

futuretdm's People

daniel-mietchen egonw metavi bibliometrics makaraduman

futuretdm's Issues

tree isn't always available

We rely on tree (well, the tutorial references it) as if it were a Unix command available everywhere, but it doesn't seem to come with Mac OS X out of the box. I don't think it comes with Ubuntu either. We should therefore probably remove it from the tutorial and replace it with something else, or walk users through installation.
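One possible "something else", given that the tutorial already requires Python 3, is a few lines of standard-library Python. The sketch below is only an illustration: the function name render_tree and the indentation style are made up here, and it does not try to reproduce the exact output of the real tree command.

```python
import os


def render_tree(root):
    """Return an indented listing of `root`, roughly like the `tree` command.

    Hypothetical helper for illustration; not part of any ContentMine tool.
    """
    root = os.path.normpath(root)
    lines = [os.path.basename(root) + "/"]
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes os.walk visit subfolders alphabetically
        depth = dirpath.count(os.sep) - base_depth
        if depth > 0:
            lines.append("    " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            lines.append("    " * (depth + 1) + name)
    return "\n".join(lines)


# Example use on a project folder created by getpapers:
# print(render_tree("trigonopterus"))
```

Files are listed before subfolders within each directory, which differs from tree but keeps the code short.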

Singular/ plural of "tutorial"

The folder name is singular, but the folder contains several tutorials, and https://github.com/ContentMine/FutureTDM/blob/master/README.md refers to the folder name as plural.

Some consistency would help.

Issues encountered in PREPARATION

I am now ignoring my existing installation and going through the instructions in https://github.com/ContentMine/FutureTDM/blob/master/README.md for the recommended local installation, logging the problems on the way.

Upon

sudo npm install getpapers

I got

npm WARN deprecated [email protected]: use uuid module instead
npm WARN deprecated [email protected]: ReDoS vulnerability parsing Set-Cookie https://nodesecurity.io/advisories/130
- [email protected] node_modules/bluebird

which I ignored as well as the long Node package tree, which was followed by

npm WARN enoent ENOENT: no such file or directory, open '/Users/danielmietchen/package.json'
npm WARN danielmietchen No description
npm WARN danielmietchen No repository field.
npm WARN danielmietchen No README data
npm WARN danielmietchen No license field.

I tried to use https://github.com/ContentMine/getpapers/blob/master/package.json as a fix but that got me

sudo npm install getpapers
npm ERR! Darwin 16.3.0
npm ERR! argv "/Users/danielmietchen/.nvm/versions/node/v7.1.0/bin/node" "/Users/danielmietchen/.nvm/versions/node/v7.1.0/bin/npm" "install" "getpapers"
npm ERR! node v7.1.0
npm ERR! npm  v3.10.9
npm ERR! code ENOSELF

npm ERR! Refusing to install getpapers as a dependency of itself
npm ERR! 
npm ERR! If you need help, you may report this error at:
npm ERR!     <https://github.com/npm/npm/issues>

npm ERR! Please include the following file with any support request:
npm ERR!     /Users/danielmietchen/Programming/FutureTDM/FutureTDM-master/npm-debug.log

I circumvented that by copying over the package.json from my existing installation, but I guess there should be better documentation as to what to do at this point.

Also, the installation instructions are inconsistent:

https://github.com/ContentMine/contentmine.github.io/blob/master/osx.md says "npm install getpapers"
https://github.com/ContentMine/getpapers/blob/master/README.md says "npm install --global getpapers"

After getpapers, the tutorial lists norma as the next dependency to install, but https://github.com/ContentMine/norma refers to http://contentmine.github.io/ , which brought me to http://contentmine.github.io/osx.html , which then points to "the zip" at https://github.com/ContentMine/norma/releases .

The instruction at http://contentmine.github.io/osx.html to "Add the bin directory that you unzipped to your path" is something I remember having had trouble with in the past, which was fixed by PMR who sat next to me. That kind of troubleshooting is not available with the tutorial, so the documentation should be improved.

Since both norma and ami are already in my path, and no newer releases are available, I did not install them again.

In terms of installing Python3 and Jupyter, some word on how to handle existing installations would also be useful, or on how to check compatibility with ContentMine. Both are already installed on my system, so I am ignoring this part.

https://github.com/ContentMine/FutureTDM/blob/master/README.md does not list a version for pyCProject, but pip install pycproject went smoothly and resulted in Successfully installed pycproject-0.0.6.dev0.

Put a data dump on Zenodo and use it as the default in the notebook

That would help reduce the barriers to get started.

Feedback on the notebook

I ran the notebook after the ContentMine pipeline finished, with the following observations:

import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from pycproject.readctree import CProject
from pycproject.factnet import *
import os
from collections import Counter

%matplotlib inline

resulted in

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-bb95c5abafed> in <module>()
      1 import numpy as np
----> 2 from pandas import Series, DataFrame
      3 import matplotlib.pyplot as plt
      4 from pycproject.readctree import CProject
      5 from pycproject.factnet import *

//anaconda/lib/python3.5/site-packages/pandas/__init__.py in <module>()
     42 import pandas.core.config_init
     43 
---> 44 from pandas.core.api import *
     45 from pandas.sparse.api import *
     46 from pandas.stats.api import *

//anaconda/lib/python3.5/site-packages/pandas/core/api.py in <module>()
      7 from pandas.core.common import isnull, notnull
      8 from pandas.core.categorical import Categorical
----> 9 from pandas.core.groupby import Grouper
     10 from pandas.core.format import set_eng_float_format
     11 from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex

//anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in <module>()
     15 from pandas.core.base import PandasObject
     16 from pandas.core.categorical import Categorical
---> 17 from pandas.core.frame import DataFrame
     18 from pandas.core.generic import NDFrame
     19 from pandas.core.index import Index, MultiIndex, CategoricalIndex, _ensure_index

//anaconda/lib/python3.5/site-packages/pandas/core/frame.py in <module>()
     39                                    create_block_manager_from_arrays,
     40                                    create_block_manager_from_blocks)
---> 41 from pandas.core.series import Series
     42 from pandas.core.categorical import Categorical
     43 import pandas.computation.expressions as expressions

//anaconda/lib/python3.5/site-packages/pandas/core/series.py in <module>()
     33 from pandas.core.internals import SingleBlockManager
     34 from pandas.core.categorical import Categorical, CategoricalAccessor
---> 35 import pandas.core.strings as strings
     36 from pandas.tseries.common import (maybe_to_datetimelike,
     37                                    CombinedDatetimelikeProperties)

AttributeError: module 'pandas' has no attribute 'core'

I could not figure out how to fix this, so taking a break now.
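A first diagnostic step that might help here, offered only as a sketch and not as a fix, is to check which files Python would actually load for each package and which versions are installed. The helper name describe_module is hypothetical; it relies on importlib.util.find_spec and importlib.metadata (Python 3.8 or later) and does not import the package itself, so it should still run when the import is broken.

```python
import importlib.util
from importlib import metadata


def describe_module(name):
    """Report where `name` would be imported from, without importing it.

    Hypothetical diagnostic helper; it only inspects the import system.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False, "origin": None, "version": None}
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Standard-library modules and unpackaged local files have no metadata.
        version = None
    return {"name": name, "found": True, "origin": spec.origin, "version": version}


# The packages imported in the first notebook cell:
for module_name in ("numpy", "pandas", "matplotlib", "pycproject"):
    print(describe_module(module_name))
```

If the reported origin for pandas points somewhere unexpected (for example a local file rather than site-packages), that would be worth mentioning in the report. Restarting the notebook kernel before re-running the cell is also worth trying, since a failed import can leave a half-initialised module behind.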

Getpapers quoting is confusing

The workshop-resources tutorial for getpapers that we link to isn't all that clear on complex EuPMC queries. Specifically, it suggests a bad example that doesn't work.

getpapers -q ABSTRACT:ursus maritimus -o ursus -n

In this case we never even search for the word maritimus. This is obviously confusing to readers.
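The reason is ordinary shell word splitting: without quotes, maritimus reaches getpapers as a separate argument rather than as part of the value of -q. The sketch below uses Python's shlex module to mimic POSIX-style splitting. The quoted variant is an assumption about what the tutorial probably intended (a field-qualified phrase search in EuPMC syntax), not something taken from the getpapers documentation.

```python
import shlex

# The example from the tutorial, and a quoted variant (assumed intent).
unquoted = "getpapers -q ABSTRACT:ursus maritimus -o ursus -n"
quoted = "getpapers -q 'ABSTRACT:\"ursus maritimus\"' -o ursus -n"


def query_argument(command):
    """Return the single argument that follows -q after POSIX-style splitting."""
    argv = shlex.split(command)
    return argv[argv.index("-q") + 1]


print(query_argument(unquoted))  # ABSTRACT:ursus
print(query_argument(quoted))    # ABSTRACT:"ursus maritimus"
```

What getpapers does with the stray maritimus argument is not shown here; the point is only that it never becomes part of the query string.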

Update existing ContentMine installation

I already have a ContentMine installation but no simple way to test whether it conforms to the specs, or to update it to some specific version.
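As a stopgap until the documentation covers this, a short script can at least report which of the command-line tools are on the PATH. This is a minimal sketch: the tool names are the ones used in the tutorials, the helper name locate_tools is made up, and the script deliberately does not try to query versions, because the right way to do that for each tool is not established in this thread.

```python
import shutil

# Command names as they appear in the tutorials.
TOOLS = ["getpapers", "norma", "ami2-species", "ami2-word", "ami2-sequence"]


def locate_tools(names):
    """Map each command name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_tools(TOOLS).items():
    print(f"{name:15} {path or 'NOT FOUND on PATH'}")
```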

Feedback on "Learn ContentMining"

The section

Learn ContentMining

Run through the tutorials of getpapers, norma and ami.
Get a basic understanding of what a CProject and Scholarly HTML is.

in https://github.com/ContentMine/FutureTDM/blob/master/README.md probably sounds a bit daunting for beginners (I'm not, so I'm not sure), and some more context on what this entails, how long it should take and how it all fits together would be very helpful.

The first link, then, is highly irritating, as it did not point to a getpapers tutorial - this is hopefully fixed by #4 , which has the "getpapers" link point to https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers , which is an actual tutorial that I am following from now on, commenting only if there were surprises.

Instead of the ursus maritimus example (which I had done in the past), I went for

getpapers -q 'thank you' -n -o thank-you
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 36011 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api

This version incompatibility is irritating, but I'll ignore it for the moment.

Those 36k results are a bit too many for a quick download, so I'm adding "donation" as an additional keyword:

getpapers -q 'thank you' donation -n -o thank-you
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 36011 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api

Same number of results; not sure why. Perhaps add some pointers as to whether and how quoted and unquoted search strings can be combined?
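A likely explanation is that donation sits outside the quotes, so the shell hands it to getpapers as a separate argument and the query itself is still just thank you. One way to avoid this class of mistake is to build the command from a list and let shlex.join do the quoting, as sketched below. The helper name build_getpapers_command is hypothetical, and the assumption that EuPMC treats a quoted phrase followed by a bare word as "phrase AND word" has not been verified here.

```python
import shlex


def build_getpapers_command(query, outdir, dry_run=True):
    """Return a shell-safe getpapers command line with the whole query as one argument.

    Only the -q, -o and -n flags seen in the tutorial are used.
    """
    argv = ["getpapers", "-q", query, "-o", outdir]
    if dry_run:
        argv.append("-n")  # no-execute mode: report the hit count only
    return shlex.join(argv)


print(build_getpapers_command('"thank you" donation', "thank-you"))
# getpapers -q '"thank you" donation' -o thank-you -n
```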

I then went for a search term with way fewer results: trigonopterus .

Back in https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers , it says

To have a look at folder file structure, use the tree command.

That gave me

$ tree trigonopterus/
-bash: tree: command not found

I then tried

$ brew install tree
Error: The following formula:
  tree
cannot be installed as a a binary package and must be built from source.
To continue, you must install Xcode from the App Store,
or the CLT by running:
  xcode-select --install

which got me googling and landing at https://superuser.com/questions/359723/mac-os-x-equivalent-of-the-ubuntu-tree-command . I tried none of these options for tree, though one of the answers suggests that my brew attempt should have worked. Instead, I went for

 find trigonopterus/.

After the getpapers tutorial, I am switching to the one on norma:

norma --project trigonopterus -i fulltext.xml -o scholarly.html --transform nlm2html

This resulted in multiple lines of the kind

UNKNOWN: prefix: Dr
UNKNOWN: prefix: Prof
UNKNOWN: prefix: Mr
UNKNOWN: prefix: Ms

UNKNOWN: sec-meta: Taxon classificationAnimaliaColeopteraCurculionidae

I did not see the need to do the PDF part, so skipped it and went on to ami:

ami2-species --project trigonopterus/ -i scholarly.html --sp.species --sp.type binomial

ami2-word --project trigonopterus/ -i scholarly.html --w.words wordFrequencies

ami2-sequence --project trigonopterus --filter file\(\*\*/results.xml\) -o sequencesfiles.xml

All of these worked fine. I think I am now prepared enough for the Zika tutorial, which I will tackle next.
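For anyone who prefers to drive these steps from a notebook, the same four commands can be written as argument lists for subprocess. This is a sketch under the assumption that norma and the ami2-* commands are on the PATH; the function names are made up. Note that the backslashes in the ami2-sequence filter are only there to protect the parentheses and asterisks from the shell, so they disappear when the arguments are passed as a list.

```python
import subprocess


def pipeline_commands(project):
    """Return the norma and ami commands from above as argv lists for `project`."""
    return [
        ["norma", "--project", project, "-i", "fulltext.xml",
         "-o", "scholarly.html", "--transform", "nlm2html"],
        ["ami2-species", "--project", project, "-i", "scholarly.html",
         "--sp.species", "--sp.type", "binomial"],
        ["ami2-word", "--project", project, "-i", "scholarly.html",
         "--w.words", "wordFrequencies"],
        # No shell is involved, so the filter needs no backslash escaping.
        ["ami2-sequence", "--project", project,
         "--filter", "file(**/results.xml)", "-o", "sequencesfiles.xml"],
    ]


def run_pipeline(project):
    """Run each step in order, stopping at the first one that fails."""
    for argv in pipeline_commands(project):
        subprocess.run(argv, check=True)


# Example use: run_pipeline("trigonopterus")
```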
