mdanalysis / mdanalysisdata

Access to data for workshops and extended tests of MDAnalysis.

Home Page: https://www.mdanalysis.org/MDAnalysisData

License: BSD 3-Clause "New" or "Revised" License


mdanalysisdata's Introduction

MDAnalysisData


Access to data for workshops and extended tests of MDAnalysis.

Data sets are stored at external stable URLs (e.g., on figshare, zenodo, or DataDryad) and this package provides a simple interface to download, cache, and access data sets.

Installation

To use, install the package

pip install --upgrade MDAnalysisData

or install with conda

conda install --channel conda-forge mdanalysisdata

Accessing data sets

Import the datasets and access your data set of choice:

from MDAnalysisData import datasets

adk = datasets.fetch_adk_equilibrium()

The returned object contains attributes with the paths to topology and trajectory files so that you can use it directly with, for instance, MDAnalysis:

import MDAnalysis as mda
u = mda.Universe(adk.topology, adk.trajectory)

The metadata object also contains a DESCR attribute with a description of the data set, including relevant citations:

print(adk.DESCR)

Managing data

Data are locally stored in the data directory ~/MDAnalysis_data (i.e., in the user's home directory). This location can be changed by setting the environment variable MDANALYSIS_DATA, for instance

export MDANALYSIS_DATA=/tmp/MDAnalysis_data
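The same can be done from within Python, assuming the variable is set before any data set is fetched (MDAnalysisData reads it when resolving the data directory):

```python
import os

# Equivalent to the shell export above: point MDAnalysisData at a
# different cache directory. Set this before calling any fetch_*() function.
os.environ["MDANALYSIS_DATA"] = "/tmp/MDAnalysis_data"
```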

The location of the data directory can be obtained with

MDAnalysisData.base.get_data_home()

If the data directory is removed then data are downloaded again. Data file integrity is checked with a SHA256 checksum when the file is downloaded.
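The checksum comparison works along these lines (an illustrative sketch with a hypothetical helper name; MDAnalysisData performs the equivalent check internally after each download):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading in chunks so that
    multi-GB trajectory files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The computed digest is compared against the expected checksum stored with the data set's metadata; a mismatch indicates a corrupted or tampered download.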

The data directory can be wiped with the function

MDAnalysisData.base.clear_data_home()

Contributing new datasets

Please add new datasets to MDAnalysisData. See Contributing new datasets for details, but in short:

  1. raise an issue in the issue tracker describing what you want to add; this issue will become the focal point for discussions where the developers can easily give advice
  2. deposit data in an archive under an Open Data compatible license (CC0 or CC-BY preferred)
  3. write accessor code in MDAnalysisData

Credits

This package is modelled after sklearn.datasets. It uses code from sklearn.datasets (under the BSD 3-clause license).

No data are included; please see the DESCR attribute for each data set for authorship, citation, and license information for the data.

mdanalysisdata's People

Contributors

ialibay, lilyminium, micaela-matta, orbeckst, richardjgowers, vod555


mdanalysisdata's Issues

HTTPError: HTTP Error 403: Forbidden

When I try to

adk_trans = datasets.fetch_adk_transitions_DIMS()

print(adk_trans.DESCR)
print(adk_trans.topology)

it shows 403 Forbidden. How do I fix this error? Thanks!

need tests

Need tests that run on travis.

Downloading data as part of the tests might be expensive for big datasets but we should do something.

transitions dataset doesn't fetch

Looks like it just needs to be metadata['DESCR']

data.datasets.fetch_adk_transitions_DIMS()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-8bdebcdd3177> in <module>
----> 1 data.datasets.fetch_adk_transitions_DIMS()

~/miniconda3/envs/mda19/lib/python3.6/site-packages/MDAnalysisData/adk_transitions.py in fetch_adk_transitions_DIMS(data_home, download_if_missing)
     87     return _fetch_adk_transitions(METADATA['DIMS'],
     88                                   data_home=data_home,
---> 89                                   download_if_missing=download_if_missing)
     90 
     91 def fetch_adk_transitions_FRODA(data_home=None, download_if_missing=True):

~/miniconda3/envs/mda19/lib/python3.6/site-packages/MDAnalysisData/adk_transitions.py in _fetch_adk_transitions(metadata, data_home, download_if_missing)
    192                                records.N_trajectories))
    193 
--> 194     records.DESCR = _read_description(DESCRIPTION)
    195 
    196     return records

NameError: name 'DESCRIPTION' is not defined

add data files for pmda

Add a data set that we can use for pmda, especially for a workshop demo. See Wiki:add new data set for initial guidelines on how to do this: basically

  • dump to a data repository
  • copy and paste code in MDAnalysisData

use versioneer

Use versioneer for versioning. Use release-MAJOR.MINOR.PATCH as the tag (as we do for MDAnalysis/mdanalysis).

Smaller adk_transitions dataset

Thank you for this helpful repo of data! For tutorial purposes, would it be possible to upload smaller versions of the adk_transitions datasets? e.g. 10 trajectories each, for a more manageable collective ~70 MB.

Edit: or even 5 each...

generate docs

The datasets include docs. They should be built with sphinx and displayed under https://www.mdanalysis.org/MDAnalysisData.

  • create sphinx templates
  • copy the MDAnalysis styles (local.css, custom.css, and logos)
  • use Travis CI to build docs (they have an option that should allow us to do it automatically with gh-pages)

CG fiber dataset: docs?

Is the CG fiber dataset officially released?

If so

  • integrate into docs (usage.rst etc – check how the others were added)
  • CHANGELOG entry

version information in docs is missing

With the new GH actions-based docs.yml deployment (PR #52 ), the docs build does not get sufficient information to display the correct version information

Release: 0+untagged.1.g779bed1

Instead this should be something like 0.8.0+20.g654705d.

The problem is that versioneer requires the git tags, which are not provided with the shallow checkout.

0.5.0 release

I messed up the 0.4.0 release (progress bar was not included) so I am making a new 0.5.0... will probably also include King of Mocks testing.

Protein-ligand system

As discussed in MDAnalysis/UserGuide#105 (comment), an MD simulation of a protein-ligand complex would be useful for MDA tutorials etc...

I probably have quite a few simulations of some more classic FEP benchmark systems (T4 lysozyme, BRD4, etc..) that could be uploaded (although only ~ 20 ns in length since that's what you'll usually do for an ABFE window).

That being said, it might be that a more elaborate trajectory might be more helpful here, particularly considering potential future use cases, e.g. binding site waters (density analysis code), hydrogen bonding, RMSD (ligand symmetry), etc...

If such a trajectory doesn't already exist, I'm willing to do the leg work in generating the data if we can agree on an "ideal" system.

Python compatibility

What's the compat goal for this package? Can we be aggressive and only target 3.5+?

Make next release 3.6+?

Mainly looking at this because of #48: supporting py2.7/py3.5 (even if just for conda-installing an environment) is becoming increasingly tedious.

Given that MDA 2.0 will be py3.6+ and that 2.7 has now long been deprecated, would it be worth making MDAnalysisData's supported Python versions 3.6->3.9?

update deprecated use of pkg_resources

We use pkg_resources:

DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import resource_string

Use importlib instead, along the lines of

import importlib.resources as importlib_resources

__all__ = ["DX", "CCP4", "gOpenMol"]

DX = importlib_resources.files(__name__) / 'test.dx'

(from https://github.com/MDAnalysis/GridDataFormats/blob/master/gridData/tests/datafiles/__init__.py where this was recently fixed)

add more data from Becksteinlab

Collection of data sets

Do not include:

  • Google Drive: parallel analysis benchmark trajectories (should be transitioned to figshare if possible) – possibly do not initially include in MDAnalysisData because it is mainly useful for benchmarking but not for analysis
    • LeafletFinder – these are mostly the vesicles, for which a figshare exists
    • synthetic PSA trajectory ensemble with trajectories that have been increased in length and size by concatenation (the parallel_analysis_benchmark/PSA folder); originally based on the DIMS trajectories in the AdK PSA trajectory ensemble

tests for "big" fetch_*() functions

We should test all fetch_*() functions/datasets (see #6 and PR #16 ). The "big" datasets have sizes > ~150 MB and download takes a long(ish) time.

We should find a way to mock out the download or replace it with something tiny. It would nevertheless be good to ascertain that the download URLs are still valid (e.g., by downloading the first few bytes).
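Such a lightweight URL check could be sketched as follows (a hypothetical test helper, not part of MDAnalysisData; it uses an HTTP Range request so only the first few bytes are transferred):

```python
import urllib.request
import urllib.error

def url_is_reachable(url, nbytes=64, timeout=10):
    """Return True if the download URL resolves, requesting only the first
    `nbytes` bytes via a Range header instead of downloading the full file."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # 206 = partial content honoured; 200 = server ignored the Range
            # header but the URL is still valid.
            return resp.status in (200, 206)
    except urllib.error.URLError:
        return False
```

Servers that ignore the Range header will send the whole file, so a test using this should still avoid the very largest archives.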

Automate deployment

Add a deploy workflow (see MDA and others)

  • upload to TestPyPI when tagging a release
  • upload to PyPI when making a GitHub release
  • include a simple twine check in CI
  • update development docs

@IAlibay has a template for a suitable workflow. Either copy it or turn it into an action and use the action. This workflow needs TestPyPI and PyPI tokens as GitHub environment variables.

update tarfile behavior

Running the tests in py 3.12 I got the warning:

~/anaconda3/envs/py312/lib/python3.12/tarfile.py:2220: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
    warnings.warn(

We need to look into what changes and how we need to adapt the current code (possibly just say "extract all files, I know they are ok").

This may be related to security risks mentioned in PR #58.

[CI] testing after deployment fails

See #62 (comment)

🔴 Test after deployment FAILED https://github.com/MDAnalysis/MDAnalysisData/actions/runs/6698031942/job/18199197552

==================================== ERRORS ====================================
_________________ ERROR at setup of test_fetch_remote_sha_fail _________________
file /home/runner/work/MDAnalysisData/MDAnalysisData/MDAnalysisData/tests/test_base.py, line   107
  def test_fetch_remote_sha_fail(remote_topology, mocker):
E       fixture 'mocker' not found
>       available fixtures: bunch, cache, capfd, capfdbinary, caplog, capsys, capsysbinary,   doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, remote_topology, some_text, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

Need to add the missing plugins to the test environment (the mocker fixture is provided by pytest-mock).

Change license to BSD-3

The code is currently under GPL because MDAnalysis is under GPL. However, we want to keep MDAnalysisData general and don't want to import MDAnalysis so we can keep the license less restrictive. Most of the code originally came from sklearn so we should just go back to the sklearn license, which is BSD-3.

At the moment the only code contributor is @orbeckst and he agrees to this change ;-)

pytest-cov does not collect coverage

Since PR #34 , code coverage is not collected anymore.

 pytest -v --cov MDAnalysisData MDAnalysisData
============================= test session starts ==============================
platform linux -- Python 3.6.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/travis/virtualenv/python3.6.3/bin/python
cachedir: .pytest_cache
rootdir: /home/travis/build/MDAnalysis/MDAnalysisData, inifile: setup.cfg
plugins: pep8-1.0.6, mock-1.10.4, cov-2.6.1
collecting ... collected 35 items / 18 deselected / 17 selected
...
...
coverage.py warning: No data was collected. (no-data-collected)
