nceas / arcticreport
License: Other
The way we count creators right now is not ideal: we just look at names. Are there better ways to do this? ORCIDs are an obvious choice, but much of the legacy data doesn't have ORCIDs.
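One possible middle ground is to key each creator on an ORCID when one is available and fall back to a normalized name otherwise. The sketch below is only an illustration in base R: the `creators` table, its column names, the made-up ORCID values, and the helper names `normalize_name` and `resolve_creator_key` are all assumptions, not part of the package.

```r
# Hypothetical creators table: one row per (dataset, creator) pair.
# The ORCID values are invented for illustration.
creators <- data.frame(
  name  = c("Jane Q. Doe", "jane q doe", "J. Doe", "John Smith", "Smith, John"),
  orcid = c("0000-0000-0000-0001", NA, "0000-0000-0000-0001", NA, NA),
  stringsAsFactors = FALSE
)

# Crude name normalization: lowercase, strip punctuation, collapse whitespace.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Build one key per row: the row's own ORCID if present, else an ORCID
# seen elsewhere with the same normalized name, else the normalized name.
resolve_creator_key <- function(name, orcid) {
  norm <- normalize_name(name)
  has_orcid <- !is.na(orcid) & nzchar(orcid)
  known_names  <- norm[has_orcid]
  known_orcids <- orcid[has_orcid]
  # match() takes the first hit, so a name shared by two ORCIDs is ambiguous.
  matched  <- known_orcids[match(norm, known_names)]
  resolved <- ifelse(has_orcid, orcid, matched)
  ifelse(is.na(resolved), paste0("name:", norm), paste0("orcid:", resolved))
}

keys <- resolve_creator_key(creators$name, creators$orcid)
n_creators <- length(unique(keys))
```

In this toy table, counting raw names would give five creators, while the keyed count gives three. "John Smith" and "Smith, John" still count separately, so name ordering would need its own handling.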
The two primary query functions in the package, query_objects and query_version_chains, take 20 minutes and 100 minutes to run, respectively. query_objects returns a data.frame with a row for every object in the ADC. query_version_chains takes the result of query_objects and assigns an arbitrary series identifier to each version chain. The rest of the functionality in the package is slicing, dicing, summarizing, and plotting metrics based on those two tables.
Since the functions take so long to run, it is not practical to run them often. For CI, we could build a status page that runs everything once a day or so and fills in tables for the quarterly metrics when those milestones come up. For local testing, or for creating one-off plots, it would help to set up a standard way of caching those query results.
Open to any suggestions. The bigger of the two tables is about 100MB when saved to disk.
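As one starting point for local caching, a small wrapper could save a query result with saveRDS and reuse it until the file gets too old. This is a minimal sketch, not an existing function in the package: `cached_query`, its arguments, and `fake_query` (a stand-in for a slow query) are all hypothetical names.

```r
# Return the cached result if the cache file exists and is recent enough;
# otherwise run the (slow) query and write its result to the cache file.
cached_query <- function(query_fun, cache_file, max_age_hours = 24) {
  if (file.exists(cache_file)) {
    age_hours <- as.numeric(
      difftime(Sys.time(), file.mtime(cache_file), units = "hours")
    )
    if (age_hours <= max_age_hours) {
      return(readRDS(cache_file))
    }
  }
  result <- query_fun()
  dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, cache_file)
  result
}

# Demonstration with a stand-in for a slow query.
calls <- 0
fake_query <- function() {
  calls <<- calls + 1
  data.frame(id = 1:3)
}
cache_file <- tempfile(fileext = ".rds")
first  <- cached_query(fake_query, cache_file)  # runs the query, writes the cache
second <- cached_query(fake_query, cache_file)  # reads the cache instead
```

In practice the first argument would be a function that wraps the real query call, and the cache file would live somewhere more permanent than a temporary path. saveRDS compresses by default, so the file on disk may be smaller than the 100MB mentioned above.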
It works in certain cases, and you can plot cumulative count or size of either data files or metadata files, but it could be smarter at:
The unnesting step in this function (tidyr::unnest_longer) is fairly slow. I think performance could be improved by using data.table instead, as shown here. The only downside is that it adds a dependency.
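If the extra dependency is a concern, a list column can also be unnested with base R alone. This is a different technique from the data.table approach suggested above, and it has not been benchmarked against the real data; the `objects` table and the helper name `unnest_list_column` are made up for illustration.

```r
# Hypothetical input: one row per object, with a list-column of creators.
objects <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
objects$creator <- list(c("Doe", "Smith"), "Lee", character(0))

# Expand a list-column so that each element gets its own row.
unnest_list_column <- function(df, col) {
  n <- lengths(df[[col]])
  # Repeat each row once per element of its list entry;
  # rows whose entry is empty are repeated zero times, i.e. dropped.
  other_cols <- setdiff(names(df), col)
  out <- df[rep(seq_len(nrow(df)), times = n), other_cols, drop = FALSE]
  out[[col]] <- unlist(df[[col]], use.names = FALSE)
  rownames(out) <- NULL
  out
}

long <- unnest_list_column(objects, "creator")
```

Here `long` has one row per creator, with the `id` repeated for objects that have several creators. Object "c" has no creators and drops out of the result.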
I had the idea on a call that we could make query_objects much faster by keeping the parts of the cache that are still relevant. Below are some changes that would need to be made:
- add dateModified to the fields returned by the query function
- filter to keep all objects with a dateModified older than the datetime at runtime
- query for objects only with a dateModified more recent than the datetime at runtime
- cache_tolerance parameter
The count_creators function has a list of creators that are removed. The comment in the code says:
# Grep-based filters
# Bryce created these (and we can expand these) based upon what I saw in the results
# that looked like organizations or non-persons of some sort or another
We should review this list against the list of unique creators and decide if we want to expand, revise, or altogether remove this list.
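Returning to the idea above of making query_objects faster by keeping the still-relevant parts of the cache, the listed steps could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `refresh_objects`, the `id` column, the `query_since` callback (a stand-in for a query restricted by dateModified), and reading cache_tolerance as a safety margin in seconds subtracted from the time the cache was written.

```r
# Merge a cached table with freshly queried rows.
# cache:       data.frame with columns id and dateModified (POSIXct)
# cache_time:  POSIXct time at which the cache was written
# query_since: function(cutoff) returning rows with dateModified > cutoff
refresh_objects <- function(cache, cache_time, query_since,
                            cache_tolerance = 3600) {
  cutoff <- cache_time - cache_tolerance       # seconds of safety margin
  fresh  <- query_since(cutoff)                # only recently modified objects
  # Keep cached rows older than the cutoff, unless a newer copy was returned.
  keep <- cache[cache$dateModified <= cutoff & !(cache$id %in% fresh$id), ,
                drop = FALSE]
  rbind(keep, fresh)
}

# Demonstration with made-up data and a fake remote table.
as_utc <- function(x) as.POSIXct(x, tz = "UTC")
cache <- data.frame(
  id = c("a", "b", "c"),
  dateModified = as_utc(c("2024-01-01", "2024-01-02", "2024-01-03")),
  stringsAsFactors = FALSE
)
remote <- data.frame(
  id = c("a", "b", "c", "d"),
  dateModified = as_utc(c("2024-01-01", "2024-01-04", "2024-01-03", "2024-01-05")),
  stringsAsFactors = FALSE
)
fake_query_since <- function(cutoff) {
  remote[remote$dateModified > cutoff, , drop = FALSE]
}

updated <- refresh_objects(cache, as_utc("2024-01-03 12:00:00"), fake_query_since)
```

In this example only "b" (modified after the cache was written) and "d" (new) are fetched again; "a" and "c" come from the cache. Objects deleted on the server since the cache was written are not handled by this sketch.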