futrell / cliqs Goto Github PK


5.0 5.0 5.0 145 KB

Crosslinguistic investigations in quantitative syntax

Python 89.22% Shell 0.61% R 10.18%

cliqs's People

Contributors

Stargazers

Watchers

Forkers

rht kashenfelter kylebgorman langprocgroup mahowak

cliqs's Issues

Versioning, naming, and distributing the cliqs dataset

By naming, I mean that we should create a DOI for each release, so that the data can be cited on its own. This can be done either via https://guides.github.com/activities/citable-code/ (using Zenodo) or through ResearchGate.

As for versioning, the goal is to ensure that changes are tracked, that releases are immutable, and that each citation is linked to a specific version. I think git still works fine for the current corpora, but I am undecided about which data versioning tool is best (git-annex? dat? git-lfs?). I downloaded the entire ud-treebanks-v1.4 and found that the *.conllu files add up to 3 GB. Gzipping each file individually brings the total down to 480 MB (about the size of the entire set of IETF RFCs!).
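The size check described above can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the function name gzip_conllu_files and the choice to keep the original files next to the compressed copies are assumptions, and only standard-library calls are used.

```python
import gzip
import shutil
from pathlib import Path


def gzip_conllu_files(root):
    """Gzip every *.conllu file under `root`, keeping the originals.

    Returns (raw_bytes, gzipped_bytes) summed over all files found.
    Hypothetical helper for illustration; not part of cliqs.
    """
    raw_total = 0
    gz_total = 0
    for path in sorted(Path(root).rglob("*.conllu")):
        # Write foo.conllu.gz next to foo.conllu.
        gz_path = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        raw_total += path.stat().st_size
        gz_total += gz_path.stat().st_size
    return raw_total, gz_total
```

Calling gzip_conllu_files on the unpacked treebank directory and dividing both totals by 1e9 gives the before and after sizes in GB. Compressing per file (rather than one big archive) keeps each treebank individually addressable, which matters for whichever versioning tool is chosen.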

As for distribution, a torrent would suffice in addition to the FTP server (this is how CERN Open Data and data.gov.uk distribute data).

Edit: added URLs for better access.

Ensure reproducibility of the full result of mindep

@Futrell, WDYT of publishing the code (maybe in a Jupyter notebook) that was used to create the plots summarizing the post-processed output? What if there \exists a "reproducibility number" for any paper, whose count is incremented whenever a peer validates its result? I haven't fully fleshed out what the sufficient criterion for validating a result should be, or whether there are stages/hierarchies of criteria (perhaps it lies somewhere between verifying a result and falsifying it).
At the least, this should be about checking for systematic bugs, as opposed to attesting whether a discovery is 5-sigma certain. This could complement one rough measure of scientific consensus, e.g. (citation count / size of the field). ...

(In short: this is a request for the code behind the fancy plots!)

Sentence-level parallelism

I was trying to reproduce the result of

cliqs/run_mindep.py

Line 49 in 1c72b06

 # it looks like that's actually slower than parallelizing over corpora, for some 

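I found that pooling did result in a roughly 2x reduction in the reported user CPU time, although the wall-clock totals below stay about the same (13 to 15 minutes either way).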

Without parallel:

python run_mindep.py run en fr  866.40s user 0.48s system 99% cpu 14:28.04 total
python run_mindep.py run en fr  893.17s user 0.53s system 99% cpu 14:55.14 total
python run_mindep.py run en fr  905.34s user 0.56s system 99% cpu 15:08.00 total

With parallel (pmap):

python run_mindep.py run en fr  404.78s user 13.91s system 48% cpu 14:23.18 total
python run_mindep.py run en fr  410.19s user 14.25s system 47% cpu 15:01.91 total
python run_mindep.py run en fr  418.29s user 14.64s system 54% cpu 13:09.16 total
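To make the comparison concrete, the figures above can be averaged directly. The snippet below is just arithmetic on the six timings listed in this issue; the helper names to_seconds and mean are made up for the example.

```python
def to_seconds(clock):
    """Convert an 'MM:SS.ss' wall-clock string (as printed by `time`) to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def mean(values):
    return sum(values) / len(values)


# Timings copied from the runs listed above.
serial_user = [866.40, 893.17, 905.34]
serial_wall = ["14:28.04", "14:55.14", "15:08.00"]
pooled_user = [404.78, 410.19, 418.29]
pooled_wall = ["14:23.18", "15:01.91", "13:09.16"]

user_ratio = mean(serial_user) / mean(pooled_user)
wall_ratio = (mean([to_seconds(t) for t in serial_wall])
              / mean([to_seconds(t) for t in pooled_wall]))

print(f"user-time ratio (serial / pooled):  {user_ratio:.2f}")
print(f"wall-clock ratio (serial / pooled): {wall_ratio:.2f}")
```

On these numbers the user-time ratio comes out a little above 2, while the wall-clock ratio is close to 1, so the end-to-end run is only marginally faster with pooling.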

This was run on an "Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz" (4 logical cores).
I think the run could be roughly an order of magnitude faster with several numba @jit decorators added in deptransform/depgraph. So far I have only tried @jit-ing gen_row, and did not observe any speedup.
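For readers unfamiliar with the idea, sentence-level pooling can be sketched with the standard library's multiprocessing.Pool. This is not the pmap used in run_mindep.py: the pmap signature here, the dependency_length function, and the head-list sentence encoding are all assumptions made for the example.

```python
from multiprocessing import Pool


def dependency_length(heads):
    """Total dependency length of one sentence.

    `heads[i]` is the 1-based position of the head of word i+1, with 0
    marking the root (the CoNLL-U HEAD convention). Hypothetical stand-in
    for the real per-sentence work.
    """
    return sum(abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0)


def pmap(func, items, processes=None, chunksize=64):
    """Map `func` over `items`, using a process pool unless processes == 1."""
    if processes == 1:
        # Sequential fallback, handy for debugging and for timing baselines.
        return [func(x) for x in items]
    with Pool(processes) as pool:
        # A chunksize above 1 amortizes inter-process overhead over many
        # small sentences.
        return pool.map(func, items, chunksize=chunksize)
```

In a script, a call such as pmap(dependency_length, sentences, processes=4) should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each sentence is a small unit of work, the pickling and scheduling overhead can eat much of the gain, which may be part of why the wall-clock times above barely move.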

Testing the DLM hypothesis on context-free language corpora?

Some universals might be easier to uncover in languages that are context-free. Programming-language treebanks are almost nonexistent, so if I were to construct one, my first thought would be to draw the source material from formalized mathematical proofs and established software protocols, both of which have been implemented in various languages.
