futrell / cliqs
Crosslinguistic investigations in quantitative syntax
By naming, I meant that we should create a DOI for each release so that the data can be cited on its own. This can be done either via https://guides.github.com/activities/citable-code/ (using Zenodo) or through ResearchGate.
As for versioning, the goal is to ensure that changes are tracked, happen immutably, and that each citation is linked to a specific version. I think git still works fine for the current corpora, but I am undecided about which data versioning tool is best (git-annex? dat? git-lfs?). I downloaded the entire ud-treebanks-v1.4 and found that the *.conllu files total 3 GB. Gzipping each file brings that down to 480 MB (about the size of the entire set of IETF RFCs!).
As for distribution, a torrent would suffice in addition to the FTP server (as used by CERN Open Data and data.gov.uk).
Edit: added URLs for better access.
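For anyone who wants to repeat the size comparison, here is a minimal sketch of the per-file gzip step. The function name `gzip_tree` and the directory layout are my own assumptions for illustration; nothing here comes from the cliqs code itself.

```python
import gzip
import shutil
from pathlib import Path


def gzip_tree(root, pattern="*.conllu"):
    """Gzip every matching file under `root`, keeping the originals.

    Returns (raw_bytes, gzipped_bytes) summed over all matched files.
    `gzip_tree` is a hypothetical helper, not part of cliqs.
    """
    raw_total = 0
    gz_total = 0
    for src in sorted(Path(root).rglob(pattern)):
        dst = src.with_name(src.name + ".gz")
        # Stream the file through gzip so large treebanks never sit in memory.
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        raw_total += src.stat().st_size
        gz_total += dst.stat().st_size
    return raw_total, gz_total


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python gzip_tree.py path/to/ud-treebanks-v1.4
    if len(sys.argv) > 1:
        raw, gz = gzip_tree(sys.argv[1])
        print(f"raw: {raw / 1e6:.1f} MB, gzipped: {gz / 1e6:.1f} MB")
```

Compressing each file separately, rather than the whole tree as one archive, keeps individual treebanks addressable, which matters if the files end up in git-annex or git-lfs.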
@Futrell, WDYT of publishing the code (maybe in a Jupyter notebook) that was used to create the plots summarizing the post-processed output? What if there existed a "reproducibility number" for every paper, whose count is incremented whenever a peer validates its result? I haven't fully fleshed out what the sufficient criterion for validating a result should be, or whether there are stages or hierarchies of criteria (perhaps somewhere between verifying a result and falsifying it).
At a minimum, this should be about checking for systematic bugs, as opposed to attesting whether a discovery is 5-sigma certain. It could complement a rough measure of scientific consensus, e.g. (citation count / size of the field). ...
(In short: this is a request for the code behind the fancy plots!)
I was trying to reproduce the result of line 49 in 1c72b06.
I found that pooling roughly halved the user CPU time of the run, while the wall-clock total stayed at about 13 to 15 minutes.
Without parallel:
python run_mindep.py run en fr 866.40s user 0.48s system 99% cpu 14:28.04 total
python run_mindep.py run en fr 893.17s user 0.53s system 99% cpu 14:55.14 total
python run_mindep.py run en fr 905.34s user 0.56s system 99% cpu 15:08.00 total
With parallel (pmap):
python run_mindep.py run en fr 404.78s user 13.91s system 48% cpu 14:23.18 total
python run_mindep.py run en fr 410.19s user 14.25s system 47% cpu 15:01.91 total
python run_mindep.py run en fr 418.29s user 14.64s system 54% cpu 13:09.16 total
This was run on an "Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz", which has two cores and four hardware threads.
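The shell timings above report both CPU time (`user`) and elapsed time (`total`), and the two can diverge once work is spread across processes. A minimal sketch for comparing wall-clock time of a serial loop against a process pool follows. The `busy` workload is a made-up stand-in, and `multiprocessing.Pool` is used here only as an assumed analogue of the repo's `pmap`; I have not checked how `pmap` is actually implemented.

```python
import time
from multiprocessing import Pool


def busy(n):
    """Hypothetical CPU-bound stand-in for one unit of work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_serial(jobs):
    return [busy(n) for n in jobs]


def run_pool(jobs, processes=2):
    # Pool.map preserves input order, so results are comparable to run_serial.
    with Pool(processes) as pool:
        return pool.map(busy, jobs)


if __name__ == "__main__":
    jobs = [200_000] * 8
    serial_result, serial_s = timed(run_serial, jobs)
    pool_result, pool_s = timed(run_pool, jobs)
    assert serial_result == pool_result
    print(f"serial: {serial_s:.2f}s wall, pool: {pool_s:.2f}s wall")
```

Measuring with `time.perf_counter` in the parent process captures elapsed time including pool start-up and result transfer, which is the number that matters when judging whether the parallel run finishes sooner.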
I think the run could be roughly an order of magnitude faster by adding several numba @jit decorators to deptransform/depgraph. So far I have tested @jit-ing gen_row but didn't observe any speedup.
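To make the numba idea concrete, here is a minimal sketch of the kind of function where `@njit` tends to pay off: a tight numeric loop over a NumPy array. The function `total_dependency_length` and its head-index encoding are my own illustration, not code from deptransform/depgraph. One general caveat is that numba's nopython mode targets numeric code over arrays and scalars, so functions built around arbitrary Python objects usually gain little.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba installed (no speedup).
    def njit(func):
        return func


@njit
def total_dependency_length(heads):
    """Sum of |head - dependent| over all non-root words.

    Assumed encoding (hypothetical, CoNLL-U style): heads[i] is the 1-based
    position of the head of word i + 1, and 0 marks the root.
    """
    total = 0
    for i in range(len(heads)):
        head = heads[i]
        if head > 0:
            total += abs(head - (i + 1))
    return total


if __name__ == "__main__":
    # Three-word sentence: words 1 and 3 both depend on word 2, the root.
    heads = np.array([2, 0, 2], dtype=np.int64)
    print(total_dependency_length(heads))
```

The first call includes compilation time, so any benchmark should time a second call on already-compiled code.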
There could be universals that are better uncovered with context-free languages. Programming-language treebanks are almost nonexistent, so if I were to construct one, my idea would be to draw the source from formalized mathematical proofs and established software protocols, both of which have been implemented in various languages.