mdanalysis / mdanalysisdata

Access to data for workshops and extended tests of MDAnalysis.

Home Page: https://www.mdanalysis.org/MDAnalysisData

License: BSD 3-Clause "New" or "Revised" License


mdanalysisdata's Introduction

MDAnalysisData


Access to data for workshops and extended tests of MDAnalysis.

Data sets are stored at external stable URLs (e.g., on figshare, zenodo, or DataDryad) and this package provides a simple interface to download, cache, and access data sets.

Installation

To use, install the package

pip install --upgrade MDAnalysisData

or install with conda

conda install --channel conda-forge mdanalysisdata

Accessing data sets

Import the datasets and access your data set of choice:

from MDAnalysisData import datasets

adk = datasets.fetch_adk_equilibrium()

The returned object contains attributes with the paths to topology and trajectory files so that you can use it directly with, for instance, MDAnalysis:

import MDAnalysis as mda
u = mda.Universe(adk.topology, adk.trajectory)

The metadata object also contains a DESCR attribute with a description of the data set, including relevant citations:

print(adk.DESCR)

Managing data

Data are locally stored in the data directory ~/MDAnalysis_data (i.e., in the user's home directory). This location can be changed by setting the environment variable MDANALYSIS_DATA, for instance

export MDANALYSIS_DATA=/tmp/MDAnalysis_data
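The same can be done from within Python, assuming the variable is set before any data set is fetched (MDAnalysisData reads it when resolving the data directory):

```python
import os

# Equivalent to the shell export above: point MDAnalysisData at a
# different cache directory. Set this before calling any fetch_*() function.
os.environ["MDANALYSIS_DATA"] = "/tmp/MDAnalysis_data"
```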

The location of the data directory can be obtained with

MDAnalysisData.base.get_data_home()

If the data directory is removed then data are downloaded again. Data file integrity is checked with a SHA256 checksum when the file is downloaded.
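The checksum comparison works along these lines (an illustrative sketch with a hypothetical helper name; MDAnalysisData performs the equivalent check internally after each download):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading in chunks so that
    multi-GB trajectory files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The computed digest is compared against the expected checksum stored with the data set's metadata; a mismatch indicates a corrupted or tampered download.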

The data directory can be wiped with the function

MDAnalysisData.base.clear_data_home()

Contributing new datasets

Please add new datasets to MDAnalysisData. See Contributing new datasets for details, but in short:

  1. raise an issue in the issue tracker describing what you want to add; this issue will become the focal point for discussions where the developers can easily give advice
  2. deposit data in an archive under an Open Data compatible license (CC0 or CC-BY preferred)
  3. write accessor code in MDAnalysisData

Credits

This package is modelled after sklearn.datasets. It uses code from sklearn.datasets (under the BSD 3-clause license).

No data are included; please see the DESCR attribute for each data set for authorship, citation, and license information for the data.

mdanalysisdata's People

Contributors

ialibay, lilyminium, micaela-matta, orbeckst, richardjgowers, vod555


mdanalysisdata's Issues

HTTPError: HTTP Error 403: Forbidden

When I try to

adk_trans = datasets.fetch_adk_transitions_DIMS()

print(adk_trans.DESCR)
print(adk_trans.topology)

it shows 403 Forbidden. How do I fix this error? Thanks!

need tests

Need tests that run on travis.

Downloading data as part of the tests might be expensive for big datasets but we should do something.

transitions dataset doesn't fetch

Looks like it just needs to be metadata['DESCR']

data.datasets.fetch_adk_transitions_DIMS()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-8bdebcdd3177> in <module>
----> 1 data.datasets.fetch_adk_transitions_DIMS()

~/miniconda3/envs/mda19/lib/python3.6/site-packages/MDAnalysisData/adk_transitions.py in fetch_adk_transitions_DIMS(data_home, download_if_missing)
     87     return _fetch_adk_transitions(METADATA['DIMS'],
     88                                   data_home=data_home,
---> 89                                   download_if_missing=download_if_missing)
     90 
     91 def fetch_adk_transitions_FRODA(data_home=None, download_if_missing=True):

~/miniconda3/envs/mda19/lib/python3.6/site-packages/MDAnalysisData/adk_transitions.py in _fetch_adk_transitions(metadata, data_home, download_if_missing)
    192                                records.N_trajectories))
    193 
--> 194     records.DESCR = _read_description(DESCRIPTION)
    195 
    196     return records

NameError: name 'DESCRIPTION' is not defined

add data files for pmda

Add a data set that we can use for pmda, especially for a workshop demo. See Wiki:add new data set for initial guidelines on how to do this: basically

  • dump to a data repository
  • copy and paste code in MDAnalysisData

use versioneer

Use versioneer for versioning. Use release-MAJOR.MINOR.PATCH as the tag (as we do for MDAnalysis/mdanalysis).

Smaller adk_transitions dataset

Thank you for this helpful repo of data! For tutorial purposes, would it be possible to upload smaller versions of the adk_transitions datasets? e.g. 10 trajectories each, for a more manageable collective ~70 MB.

Edit: or even 5 each...

generate docs

The datasets include docs. They should be built with sphinx and displayed under https://www.mdanalysis.org/MDAnalysisData.

  • create sphinx templates
  • copy the MDAnalysis styles (local.css, custom.css, and logos)
  • use Travis CI to build docs (they have an option that should allow us to do it automatically with gh-pages)

CG fiber dataset: docs?

Is the CG fiber dataset officially released?

If so

  • integrate into docs (usage.rst etc – check how the others were added)
  • CHANGELOG entry

version information in docs is missing

With the new GH actions-based docs.yml deployment (PR #52 ), the docs build does not get sufficient information to display the correct version information

Release: 0+untagged.1.g779bed1

Instead this should be something like 0.8.0+20.g654705d.

The problem is that versioneer requires the git tags, which are not provided with the shallow checkout.

0.5.0 release

I messed up the 0.4.0 release (progress bar was not included) so I am making a new 0.5.0... will probably also include King of Mocks testing.

Protein-ligand system

As discussed in MDAnalysis/UserGuide#105 (comment), an MD simulation of a protein-ligand complex would be useful for MDA tutorials etc...

I probably have quite a few simulations of some more classic FEP benchmark systems (T4 lysozyme, BRD4, etc..) that could be uploaded (although only ~ 20 ns in length since that's what you'll usually do for an ABFE window).

That being said, it might be that a more elaborate trajectory might be more helpful here, particularly considering potential future use cases, e.g. binding site waters (density analysis code), hydrogen bonding, RMSD (ligand symmetry), etc...

If such a trajectory doesn't already exist, I'm willing to do the leg work in generating the data if we can agree on an "ideal" system.

Python compatibility

What's the compat goal for this package? Can we be aggressive and only target 3.5+?

Make next release 3.6+?

Mainly looking at this because of #48: supporting py2.7/py3.5 (even if just for conda-installing an environment) is becoming increasingly tedious.

Given that MDA 2.0 will be py3.6+ and that 2.7 has now long been deprecated, would it be worth making MDAnalysisData's supported Python versions 3.6->3.9?

update deprecated use of pkg_resources

We use pkg_resources:

DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import resource_string

Use importlib instead, along the lines of

import importlib.resources as importlib_resources

__all__ = ["DX", "CCP4", "gOpenMol"]

DX = importlib_resources.files(__name__) / 'test.dx'

(from https://github.com/MDAnalysis/GridDataFormats/blob/master/gridData/tests/datafiles/__init__.py where this was recently fixed)

add more data from Becksteinlab

Collection of data sets

Do not include:

  • Google Drive: parallel analysis benchmark trajectories (should be transitioned to figshare if possible) – possibly do not initially include in MDAnalysisData because it is mainly useful for benchmarking but not for analysis
    • LeafletFinder – these are mostly the vesicles, for which a figshare exists
    • synthetic PSA trajectory ensemble with trajectories that have been increased in length and size by concatenation (the parallel_analysis_benchmark/PSA folder); originally based on the DIMS trajectories in the AdK PSA trajectory ensemble

tests for "big" fetch_*() functions

We should test all fetch_*() functions/datasets (see #6 and PR #16 ). The "big" datasets have sizes > ~150 MB and download takes a long(ish) time.

We should find a way to mock out the download or replace it with something tiny. It would nevertheless be good to ascertain that the download URLs are still valid (e.g., by downloading the first few bytes).
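Such a lightweight URL check could be sketched as follows (a hypothetical test helper, not part of MDAnalysisData; it uses an HTTP Range request so only the first few bytes are transferred):

```python
import urllib.request
import urllib.error

def url_is_reachable(url, nbytes=64, timeout=10):
    """Return True if the download URL resolves, requesting only the first
    `nbytes` bytes via a Range header instead of downloading the full file."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # 206 = partial content honoured; 200 = server ignored the Range
            # header but the URL is still valid.
            return resp.status in (200, 206)
    except urllib.error.URLError:
        return False
```

Servers that ignore the Range header will send the whole file, so a test using this should still avoid the very largest archives.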

Automate deployment

Add a deploy workflow (see MDA and others)

  • upload to TestPyPI when tagging a release
  • upload to PyPI when making a GitHub release
  • include a simple twine check in CI
  • update development docs

@IAlibay has a template for a suitable workflow. Either copy it or turn it into an action and use the action. This workflow needs TestPyPI and PyPI tokens as GitHub environment variables.

update tarfile behavior

Running the tests in py 3.12 I got the warning:

~/anaconda3/envs/py312/lib/python3.12/tarfile.py:2220: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
    warnings.warn(

We need to look into what changes and how we need to adapt the current code (possibly just say "extract all files, I know they are ok").

This may be related to security risks mentioned in PR #58.

[CI] testing after deployment fails

See #62 (comment)

🔴 Test after deployment FAILED https://github.com/MDAnalysis/MDAnalysisData/actions/runs/6698031942/job/18199197552

==================================== ERRORS ====================================
_________________ ERROR at setup of test_fetch_remote_sha_fail _________________
file /home/runner/work/MDAnalysisData/MDAnalysisData/MDAnalysisData/tests/test_base.py, line   107
  def test_fetch_remote_sha_fail(remote_topology, mocker):
E       fixture 'mocker' not found
>       available fixtures: bunch, cache, capfd, capfdbinary, caplog, capsys, capsysbinary,   doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, remote_topology, some_text, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

Need to add the missing plugins to the test environment (the mocker fixture is provided by pytest-mock).

Change license to BSD-3

The code is currently under GPL because MDAnalysis is under GPL. However, we want to keep MDAnalysisData general and don't want to import MDAnalysis so we can keep the license less restrictive. Most of the code originally came from sklearn so we should just go back to the sklearn license, which is BSD-3.

At the moment the only code contributor is @orbeckst and he agrees to this change ;-)

pytest-cov does not collect coverage

Since PR #34 , code coverage is not collected anymore.

 pytest -v --cov MDAnalysisData MDAnalysisData
============================= test session starts ==============================
platform linux -- Python 3.6.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/travis/virtualenv/python3.6.3/bin/python
cachedir: .pytest_cache
rootdir: /home/travis/build/MDAnalysis/MDAnalysisData, inifile: setup.cfg
plugins: pep8-1.0.6, mock-1.10.4, cov-2.6.1
collecting ... collected 35 items / 18 deselected / 17 selected
...
...
coverage.py warning: No data was collected. (no-data-collected)
