gdcc / pydataverse
Python module for Dataverse Software (dataverse.org).
Home Page: http://pydataverse.readthedocs.io/
License: MIT License
The example at http://guides.dataverse.org/en/4.15/api/native-api.html#add-a-file-to-a-dataset illustrates that a jsonData object can be uploaded when adding a file:
curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F '[email protected]' -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"true"}' "https://example.dataverse.edu/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
The jsonData object allows the user to add additional metadata about the file. It would be great if api.upload_file supported this jsonData object.
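A minimal sketch of what such support could look like (hypothetical helper and keyword argument, not the current pyDataverse API; it simply mirrors the curl call above):
import json
import requests


def upload_file_with_metadata(base_url, api_token, pid, filename, json_data=None):
    """Sketch: add a file to a dataset, optionally passing the jsonData metadata object."""
    url = '{0}/api/datasets/:persistentId/add?persistentId={1}'.format(base_url, pid)
    # jsonData is sent as a form field next to the file, as in the curl example.
    data = {'jsonData': json.dumps(json_data)} if json_data else None
    with open(filename, 'rb') as f:
        resp = requests.post(
            url,
            headers={'X-Dataverse-key': api_token},
            data=data,
            files={'file': f},
        )
    return resp


# Usage (placeholder values):
# upload_file_with_metadata(
#     'https://example.dataverse.edu', '**API-TOKEN**', 'doi:10.5072/FK2/AAA000',
#     'data.tsv',
#     json_data={'description': 'My description.', 'directoryLabel': 'data/subdir1',
#                'categories': ['Data'], 'restrict': 'true'},
# )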
Afaik, at the current state (for me it's 4.11), DDI XML is the only way to get existing DOIs into Dataverse. Therefore this functionality would be very useful. I guess it's just the :importddi endpoint that would need to be added.
Add the passing of the API token to the Api() creation in the basic usage example in the Docs.
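Something along these lines (placeholder values):
from pyDataverse.api import Api

# Pass the API token directly when creating the Api instance.
api = Api('https://demo.dataverse.org', api_token='**YOUR-API-TOKEN**')
print(api.status)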
It should be possible to create a Dataverse object from a JSON file, such as this example file.
The following error is thrown:
FileNotFoundError: [Errno 2] No such file or directory: 'schemas/json/dataverse_upload_schema.json'
Include Dataverse schema
import json

from pyDataverse.api import NativeApi
from pyDataverse.models import Dataverse
from pyDataverse.exceptions import OperationFailedError  # assuming this is where the exception lives

napi = NativeApi(base_url=url, api_token=token)
dv = Dataverse()

with open('dataverse.json') as dataverse:
    data = json.load(dataverse)
dv.set(data)

try:
    napi.create_dataverse(identifier=dv.alias, metadata=dv.to_json())
    # Uploading the file directly works, though
    # napi.create_dataverse(identifier=dv.alias, metadata=data)
except OperationFailedError:
    print("Dataverse already created")
Branch - develop
Commit hash - 3b040ff
Environment - macOS using Pipenv
https://jenkins.dataverse.org is a new service being offered to the Dataverse community for automated testing, continuous integration and perhaps any other use you can dream up. 😄
For more about this Jenkins service, please see http://guides.dataverse.org/en/4.14/developers/testing.html#continuous-integration
I'm very glad to see that Travis tests are already set up for pyDataverse at https://travis-ci.com/AUSSDA/pyDataverse
I am not suggesting that we replace Travis with Jenkins. Rather, I'm suggesting a "belt and suspenders" approach. In fact, for Dataverse itself we are currently using Travis to know if our Java code even compiles (and if the unit tests pass) and Jenkins to know if our API test suite is passing.
The way to add pyDataverse is to talk to me and @donsizemore at http://chat.dataverse.org (we're both in the eastern timezone of the United States and don't work weekends 😛 ). We'll get the test suite passing (with help from @skasberger probably) and then add it as a job to https://github.com/IQSS/dataverse-jenkins . Actually, once I talk to Don I'll probably create an issue over in that issue tracker for adding the job definition (XML, I believe).
Check which API endpoints accept the Dataverse database ID and/or the PID as identifier.
API Endpoints
After that, update the requests so that both variants are possible and implemented.
See #71
Purpose:
Functionalities:
Resources
The dict() function outputs empty arrays, but it should not. Check whether this is also the case for Dataverses and Datasets.
The function get_datafile() is using the following call to get_request(). As auth is False by default when calling get_request(query_str, params=None, auth=False), the API token is not being sent and the server may return a 403 error.
The same behavior could be noticed by calling get_dataset_export(), get_datafiles(), get_datafile_bundle(), ...
Other functions like get_dataverse() work as expected by using get_request() with the extra auth parameter:
Is there any workaround besides changing auth=True in the get_request() definition?
Environment:
pyDataverse==0.2.1
requests==2.22.0
urllib3==1.25.7
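One possible workaround in the meantime (a sketch; the /access/datafile/<id> path is my assumption about what get_datafile() targets): call get_request() directly and pass auth=True explicitly.
from pyDataverse.api import Api

api = Api('https://demo.dataverse.org', api_token='**API-TOKEN**')

# Workaround sketch: bypass get_datafile() and force auth=True so the
# X-Dataverse-key header is sent with the request.
datafile_id = 42  # placeholder id
resp = api.get_request('/access/datafile/{0}'.format(datafile_id), auth=True)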
http://guides.dataverse.org/en/4.15/api/native-api.html#delete-published-dataset describes a "destroy" API that allows superusers to delete datasets even after they are published. I can think of a couple use cases for this.
Here are the curl command examples from the API Guide link above:
Destroy by Persistent ID (PID):
curl -H "X-Dataverse-key:$API_TOKEN" -X DELETE http://$SERVER/api/datasets/:persistentId/destroy/?persistentId=doi:10.5072/FK2/AAA000
Destroy by dataset ID:
curl -H "X-Dataverse-key:$API_TOKEN" -X DELETE http://$SERVER/api/datasets/999/destroy
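A sketch of what a wrapper could look like (hypothetical helper built directly on requests, mirroring the two curl commands above):
import requests


def destroy_dataset(base_url, api_token, identifier, is_pid=True):
    """Sketch: DELETE /api/datasets/.../destroy (superusers only), by PID or database id."""
    if is_pid:
        url = '{0}/api/datasets/:persistentId/destroy/?persistentId={1}'.format(
            base_url, identifier)
    else:
        url = '{0}/api/datasets/{1}/destroy'.format(base_url, identifier)
    return requests.delete(url, headers={'X-Dataverse-key': api_token})


# Usage (placeholder values):
# destroy_dataset('https://demo.dataverse.org', '**API-TOKEN**', 'doi:10.5072/FK2/AAA000')
# destroy_dataset('https://demo.dataverse.org', '**API-TOKEN**', 999, is_pid=False)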
I'm happy to make a pull request if you'd like. Please let me know.
Error: /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'long_description_content_type'
The first argument to many functions is query_str, but maybe it should be path instead. This tripped me up a little. Here's the doc I was reading for v0.2.1:
Here's a reference on query strings vs path from https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Examples
It looks like I used the word "endpoint" instead of "path" here:
endpoint = '/builtin-users/' + username + '/api-token'
resp = api.get_request(endpoint, params=params, auth=True)
So maybe "path" or "endpoint"? To me, "query string" has a specific meaning: it's the key/value pairs after the "?", like tag=networking&order=newest in the example above.
In the function Dataset.export_metadata(), the format must be passed to self.dict(), i.e. self.dict(format=format).
Also check whether the same problem appears for Dataverse and Datafile.
When pulling a dataset, the response object contains the json of the dataset. Would it be attractive to instead return a Dataset object?
The Dataset class would simply contain getter and setter functions for all properties and the constructor only needs the json as an input.
I assume this would improve clarity (one could create a print function to show the metadata maybe using pandas?).
Additionally, a Dataset object could be passed to other functions like create_dataset.
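A rough sketch of the idea (class name and behaviour are assumptions, not existing pyDataverse API):
class DatasetResponse:
    """Sketch only: thin wrapper around the dataset JSON returned by the API."""

    def __init__(self, dataset_json):
        self._json = dataset_json

    @property
    def title(self):
        # Walk the citation metadata block for the 'title' field
        # (assumes the usual native API response layout).
        fields = self._json['data']['latestVersion']['metadataBlocks']['citation']['fields']
        for field in fields:
            if field['typeName'] == 'title':
                return field['value']
        return None

    def to_json(self):
        return self._json


# Usage sketch:
# resp = api.get_dataset(pid)
# ds = DatasetResponse(resp.json())
# print(ds.title)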
Add logging functionality to all modules.
For example, change "Dataset 10100 could not be created." to "Dataset 10100 could not be created via API."
Log levels: INFO, WARNING, ERROR
Snippets:
import requests
import json


def pretty_print_request(request):
    print('\n{}\n{}\n\n{}\n\n{}\n'.format(
        '-----------Request----------->',
        request.method + ' ' + request.url,
        '\n'.join('{}: {}'.format(k, v) for k, v in request.headers.items()),
        request.body)
    )


def pretty_print_response(response):
    print('\n{}\n{}\n\n{}\n\n{}\n'.format(
        '<-----------Response-----------',
        'Status code:' + str(response.status_code),
        '\n'.join('{}: {}'.format(k, v) for k, v in response.headers.items()),
        response.text)
    )


def test_post_headers_body_json():
    url = 'https://httpbin.org/post'
    # Additional headers.
    headers = {'Content-Type': 'application/json'}
    # Body
    payload = {'key1': 1, 'key2': 'value2'}
    # convert dict to json by json.dumps() for body data.
    resp = requests.post(url, headers=headers, data=json.dumps(payload, indent=4))

    # Validate response headers and body contents, e.g. status code.
    assert resp.status_code == 200
    resp_body = resp.json()
    assert resp_body['url'] == url

    # print full request and response
    pretty_print_request(resp.request)
    pretty_print_response(resp)
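For the logging itself, a minimal sketch of what module-level logging could look like, using nothing beyond the standard library logging module (the wrapper name and the 201 success check are assumptions):
import logging

# One logger per module, following the usual logging convention.
logger = logging.getLogger(__name__)


def create_dataset_logged(api, dataverse, metadata, identifier):
    """Sketch: wrap an API call with INFO/ERROR log messages."""
    logger.info('Creating Dataset %s via API.', identifier)
    resp = api.create_dataset(dataverse, metadata)
    if resp.status_code != 201:  # assuming 201 Created on success
        logger.error('Dataset %s could not be created via API.', identifier)
    else:
        logger.info('Dataset %s created via API.', identifier)
    return resp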
Using a Python script with create_dataset() I created a new dataset on demo.dataverse.org (and one more Dataverse server).
api = Api(base_url = dvserver, api_token = dvtoken)
api.create_dataset("1", dsmd)
Where dsmd is the content of dataset-finch1.json (and a slightly modified version of it for my last test) linked in the documentation.
dsmd = """{
"datasetVersion": {
"metadataBlocks": {
"citation": {
"fields": [
{
"value": "Dörwin's Fænches",
.
.
.
"""
Everything seems to work fine, but non-ASCII characters are not displayed correctly (they are replaced with �), neither when I open the dataverse through the browser nor when I download it back with get_dataset().
I'm on Windows 10 with Python 3.6.4 and pyDataverse 0.2.1. I tried to run it as a script from the command line and in Spyder with the same result.
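One thing that might be worth checking (an untested assumption, not a confirmed fix): build the payload with json.dumps() and its default ensure_ascii=True, so non-ASCII characters are sent as \uXXXX escapes and no longer depend on how the request body gets encoded.
import json

from pyDataverse.api import Api

api = Api('https://demo.dataverse.org', api_token='**API-TOKEN**')

# Build the metadata as a dict and serialize it with json.dumps();
# the default ensure_ascii=True produces pure-ASCII JSON with \u escapes.
metadata = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {"typeName": "title", "multiple": False,
                     "typeClass": "primitive", "value": "Dörwin's Fænches"},
                    # ... remaining citation fields ...
                ]
            }
        }
    }
}
dsmd = json.dumps(metadata)
resp = api.create_dataset("1", dsmd)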
This proposal is a bit out of scope for an issue tracker, but I really see a great opportunity for a synergy here.
If it became part of the effort to represent the Dataverse Native API in an OpenAPI Specification (formerly Swagger Specification, example: https://editor.swagger.io/ ), clients (or at least their interfaces) for many languages could be generated by a code generator like Swagger Codegen. At the same time it would be a major contribution to the Dataverse core project to have an OpenAPI definition.
I'd be glad to participate in this whole effort (OpenAPI or not), because I've been looking for a place to collaborate on such code, for example to duplicate (for visibility) metadata from Datacite to our institutional Dataverse, which I'm currently implementing at the WZB.
Best
Jonas
Allow downloading of unpublished draft dataset and its data files using the API token and its access credentials.
Export the metadata of a list of Dataverses, Datasets or Datafiles to a CSV file. The header should be the attribute names; one row = one Dataverse, Dataset or Datafile. The list must contain only one type of object, not a mixture of Dataverses and Datasets, for example.
Purpose:
Functionalities:
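A minimal sketch of the idea, assuming each object exposes its metadata via dict() as the existing models do (function name is illustrative):
import csv


def export_to_csv(objects, filename):
    """Sketch: write one row per Dataverse/Dataset/Datafile, columns = attribute names.

    Assumes all objects are of the same type and expose their metadata via dict().
    """
    dicts = [obj.dict() for obj in objects]
    # Collect every attribute name that occurs, so the header covers all columns.
    fieldnames = sorted({key for d in dicts for key in d})
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for d in dicts:
            writer.writerow(d)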
Write a test which checks whether the content of Dataverse.dict() contains the expected structure and values.
As a developer I would like to have more control over the keyword arguments get_datafile uses in requests to download files. I have had issues downloading large files in the past and had to implement a custom get_request method to allow access to the stream parameter of requests.get.
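A sketch of what forwarding keyword arguments could look like (standalone illustration with requests; the path and signature are assumptions, not the current get_datafile()):
import requests


def get_datafile(base_url, api_token, datafile_id, **kwargs):
    """Sketch: pass extra requests keyword arguments (e.g. stream=True) straight through."""
    url = '{0}/api/access/datafile/{1}'.format(base_url, datafile_id)
    return requests.get(url, headers={'X-Dataverse-key': api_token}, **kwargs)


# Download a large file without loading it into memory at once:
# resp = get_datafile(base_url, api_token, 42, stream=True)
# with open('large_file.dat', 'wb') as f:
#     for chunk in resp.iter_content(chunk_size=8192):
#         f.write(chunk)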
Purpose
Functionality
history_ID.json
history.json structure (DRAFT):
metadata
- date-created: string, YYYY-MM-DD HH:MM:SS
- history_version: version of the history schema
history: [{}]
- dataset_id: string
- dataverse_dataset_version: string
- datafiles: [FILENAMES], without path
- timestamp: string, YYYY-MM-DD HH:MM:SS
- description: string
- object_type: dataset or datafile
- object_id
- creator
- change_type: {dataverse_release: , dataverse_process: }
  - dataverse: init, update, delete, move
  - release: init, update, delete, move
  - edit: major/minor release version change in Dataverse
  - delete: major/minor release version change in Dataverse
  - internal
- specific_change_type: more detailed description of the change type, e.g. aussda
Write a test which checks whether the content of Dataset.json() contains the expected structure and values.
Add a resources section to the Docs, where materials such as videos, presentations, tutorials, blog posts, screencasts, etc. about pyDataverse can be collected.
Write a test which checks whether the content of Dataset.dict() contains the expected structure and values.
Support for the following APIs would be appreciated:
Implement mapping from and to DDI XML.
Requirements
As I mentioned at IQSS/dataverse#5235 (comment) I'm curious if the "DVTree" (Dataverse Tree) format could be used to upload sample data to a brand new Dataverse installation for use in demos and usability testing.
I would love to see some docs. Or a pointer to the code for now. Thanks! 😄
Please support the "Show Contents of a Dataverse" API.
At http://guides.dataverse.org/en/4.15/api/native-api.html#show-contents-of-a-dataverse it is documented like this:
Lists all the DvObjects under dataverse id.
GET http://$SERVER/api/dataverses/$id/contents
As a workaround for now I'm using api.get_request: https://github.com/IQSS/dataverse-sample-data/blob/ed52c316f530229b0c40463dc18c5f16d07cf11d/destroy_all_dvobjects.py#L30
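A sketch of what a dedicated method could look like (hypothetical name, built on the same get_request() pattern as the workaround above):
def get_dataverse_contents(api, identifier, auth=True):
    """Sketch: list the DvObjects in a dataverse via GET /api/dataverses/$id/contents."""
    query_str = '/dataverses/{0}/contents'.format(identifier)
    return api.get_request(query_str, auth=auth)


# Usage sketch:
# resp = get_dataverse_contents(api, ':root')
# for dvobject in resp.json()['data']:
#     print(dvobject['type'], dvobject['id'])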
Implement mapping from and to custom JSON.
Requirements
Add import functionality for the Dataverse API download JSON formats for Dataverses, Datasets and Datafiles, as returned by API requests.
Clarify: Which requests should be used for this?
Explanation:
When you retrieve a Dataset via the API, you get more metadata for your Dataset than you had to send for its original creation (e.g. creation date, PID, UNF, etc.). So you need a separate mapping with its own schema file.
The import is more important than the export into this format (I cannot think of a use case for the export).
Functionalities:
The metadata attribute title for a Datafile is not working.
When uploading a file, it should be possible to add metadata to the file.
Also, feedback could be given whether a file with the same checksum already exists. (Should there be an option to force overwriting the file?)
Shorten the import function name. Also update docstrings.
Hi @skasberger,
for GDCC/dvcli I want to make heavy use of your great library.
As I wrote some comments to other issues already, I wonder if we should talk about the scope of your library. It doesn't make much sense to implement things in dvcli and then here again or the other way round.
I would be happy to either contribute here or give you access to the dvcli project.
If you would prefer talking over writing, hit me 😄
Review other Python API wrapper modules to learn about testing.
I'm using pyDataverse 0.2.1 and I can't publish a dataset. I'm getting the following error:
Traceback (most recent call last):
File "create_and_publish_dataset.py", line 15, in <module>
resp = api.publish_dataset(dataset_pid, type='major')
File "/home/pdurbin/envs/dataverse-sample-data/lib/python3.6/site-packages/pyDataverse/api.py", line 727, in publish_dataset
query_str += '?persistentId={0}&type={1}'.format(identifier, type)
NameError: name 'identifier' is not defined
Something like this should fix it:
dhcp-10-250-190-90:pyDataverse pdurbin$ git diff src/pyDataverse/api.py
diff --git a/src/pyDataverse/api.py b/src/pyDataverse/api.py
index 2bebc05..972e427 100644
--- a/src/pyDataverse/api.py
+++ b/src/pyDataverse/api.py
@@ -673,7 +673,7 @@ class Api(object):
print('Dataset {} created.'.format(identifier))
return resp
- def publish_dataset(self, pid, type='minor', auth=True):
+ def publish_dataset(self, identifier, type='minor', auth=True):
"""Publish dataset.
Publishes the dataset whose id is passed. If this is the first version
@@ -705,7 +705,7 @@ class Api(object):
Parameters
----------
- pid : string
+ identifier : string
Persistent identifier of the dataset (e.g.
``doi:10.11587/8H3N93``).
type : string
dhcp-10-250-190-90:pyDataverse pdurbin$
Here's the code I'm using to exercise the bug:
from pyDataverse.api import Api
import json
import dvconfig
base_url = dvconfig.base_url
api_token = dvconfig.api_token
api = Api(base_url, api_token)
print(api.status)
dataset_json = 'data/dataverses/open-source-at-harvard/datasets/open-source-at-harvard/open-source-at-harvard.json'
with open(dataset_json) as f:
metadata = json.load(f)
dataverse = ':root'
resp = api.create_dataset(dataverse, json.dumps(metadata))
print(resp.json())
dataset_pid = resp.json()['data']['persistentId']
resp = api.publish_dataset(dataset_pid, type='major')
print(resp.json())
The "dvconfig" stuff comes from https://github.com/IQSS/dataverse-sample-data
Implement mapping from and to DSpace JSON.
Requirements
I'm using pyDataverse 0.2.1 and trying to get ds.export_metadata working based on the example at https://pydataverse.readthedocs.io/en/v0.2.1/developer.html#pyDataverse.models.Dataset.export_metadata
Here's my code:
from pyDataverse.models import Dataset
ds = Dataset()
data = {
'title': 'pyDataverse study 2019',
'dsDescription': 'New study about pyDataverse usage in 2019',
'author': [{'authorName': 'LastAuthor1, FirstAuthor1'}],
'datasetContact': [{'datasetContactName': 'LastContact1, FirstContact1'}],
'subject': ['Engineering'],
}
ds.set(data)
ds.export_metadata('export_dataset.json')
Here's the error I'm getting:
Traceback (most recent call last):
File "exportds3.py", line 11, in <module>
ds.export_metadata('export_dataset.json')
File "/home/pdurbin/envs/dataverse-sample-data/lib/python3.6/site-packages/pyDataverse/models.py", line 1175, in export_metadata
return write_file_json(filename, self.dict())
File "/home/pdurbin/envs/dataverse-sample-data/lib/python3.6/site-packages/pyDataverse/models.py", line 945, in dict
'value': self.__generate_dicts(key, val)
File "/home/pdurbin/envs/dataverse-sample-data/lib/python3.6/site-packages/pyDataverse/models.py", line 1092, in __generate_dicts
for k, v in d.items():
AttributeError: 'str' object has no attribute 'items'
What am I doing wrong? Thanks.
reference: IQSS/dataverse#3068
The native API can be a bit verbose for non-expert users. Include an option to transform the Dataverse native API response to a more usable format.
To illustrate, here is the metadata information for a dataset title and author:
(sample code to transform the output below: https://github.com/IQSS/json-schema-test/blob/master/filemetadata/api_test/metadata_transformer.py)
{
    "citation": {
        "title": "North Carolina Vital Statistics -- Birth/Infant Deaths 1976",
        "author": [
            {
                "authorName": "State Center for Health Statistics"
            }
        ],
        "metadataBlocks": {
            "citation": {
                "displayName": "Citation Metadata",
                "fields": [
                    {
                        "typeName": "title",
                        "multiple": false,
                        "typeClass": "primitive",
                        "value": "North Carolina Vital Statistics -- Birth/Infant Deaths 1976"
                    },
                    {
                        "typeName": "author",
                        "multiple": true,
                        "typeClass": "compound",
                        "value": [
                            {
                                "authorName": {
                                    "typeName": "authorName",
                                    "multiple": false,
                                    "typeClass": "primitive",
                                    "value": "State Center for Health Statistics"
                                }
                            }
                        ]
                    },
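A rough sketch of such a transformation, written against the usual layout of the native JSON (field handling below is an assumption based on the example above, not the linked sample code):
def simplify_citation(native_json):
    """Sketch: flatten the citation metadata block into plain key/value pairs.

    Expects the 'metadataBlocks' structure shown above; compound multiple fields
    (like author) become lists of simple dicts.
    """
    simple = {}
    fields = native_json['metadataBlocks']['citation']['fields']
    for field in fields:
        name = field['typeName']
        value = field['value']
        if field['typeClass'] == 'compound':
            # e.g. author -> [{'authorName': 'State Center for Health Statistics'}]
            value = [
                {sub_name: sub_field['value'] for sub_name, sub_field in entry.items()}
                for entry in value
            ]
        simple[name] = value
    return simple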
cc/ @pdurbin
Get a DOI for the repo. Check how versioning of releases is done. When is a new DOI assigned to the repo/code?
[![DOI:10.7910/DVN/TJCLKP](https://img.shields.io/badge/DOI-10.7910%2FDVN%2FTJCLKP-orange.svg)](https://doi.org/10.7910/DVN/TJCLKP)
Purpose
Synchronize a local directory with a remote folder within a dataset at Dataverse.
User story
As a user of Dataverse, I would like to be able to continuously (e.g., daily, weekly) "mirror" ongoing data collections (e.g., by means of web scraping) with a (draft) version of my dataset at Dataverse. Currently, only one-time transfers are convenient to manage using PyDataverse.
Functionality
- get_datafiles(): use as argument a particular folder at the remote dataset (or the entire dataset, default), i.e. the folder that needs to be synchronized
- sync_folder() function, with arguments: local_folder (default: .), remote_folder (default: .), direction (one of: mirror local to remote but do not delete anything on remote; mirror remote to local but do not delete anything in local; synchronize both directories, and delete files where needed), comparison (only on the basis of file names, or also on the basis of file hashes; default: hash+filename). A sketch of such a function follows below.
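A minimal sketch of what sync_folder() could look like (names and defaults taken from the list above; the remote-listing helper and upload call are assumptions):
import hashlib
import os


def file_md5(path):
    """MD5 checksum of a local file (Dataverse reports MD5 checksums)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()


def list_remote_files(api, pid, remote_folder):
    """Placeholder: return {filename: md5_checksum} for files under remote_folder.

    Left as a stub because it depends on the get_datafiles() extension above.
    """
    raise NotImplementedError


def sync_folder(api, pid, local_folder='.', remote_folder='.',
                direction='local-to-remote', comparison='hash+filename'):
    """Sketch: mirror a local directory into a (draft) dataset folder, upload-only."""
    remote = list_remote_files(api, pid, remote_folder)
    for name in os.listdir(local_folder):
        path = os.path.join(local_folder, name)
        if not os.path.isfile(path):
            continue
        changed = name not in remote
        if not changed and comparison == 'hash+filename':
            changed = file_md5(path) != remote[name]
        if changed and direction in ('local-to-remote', 'both'):
            # upload_file() exists in pyDataverse; per-file folder metadata is
            # what the jsonData issue above asks for.
            api.upload_file(pid, path)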
On this page, right here: Just a detail, but you know what they say: that's where God and the Devil are!
Write a test which checks whether the content of Dataverse.json() contains the expected structure and values.
Update pyDataverse to use the endpoint described in the 5.2 release, in this PR: IQSS/dataverse#7345
Hello,
when I try to import the API using this line of code:
from pyDataverse.api import Api
the following error occurs:
Traceback (most recent call last):
  File "C:/Users/MIsawe/PycharmProjects/untitled/main.py", line 1, in <module>
    from pyDataverse.api import Api
  File "C:\Users\MIsawe\PycharmProjects\untitled\venv\lib\site-packages\pyDataverse\__init__.py", line 6, in <module>
    from requests.packages import urllib3
ModuleNotFoundError: No module named 'requests'