cliqs's People

Contributors

futrell, rht

cliqs's Issues

Versioning, naming, and distributing the cliqs dataset

By naming, I mean that we should create a DOI for each release, so that the data can be cited on its own. This can be done either via https://guides.github.com/activities/citable-code/ (using Zenodo) or via ResearchGate.

As for versioning, the goal is to ensure that changes are tracked, that releases are immutable, and that each citation is linked to a specific version. I think git still works fine with the current corpora, but I am undecided about which data versioning tool is best (git-annex? dat? git-lfs?). I downloaded the entire ud-treebanks-v1.4 and found that the *.conllu files total 3 GB. Gzipping each of them brings this down to 480 MB (about the size of the entire set of IETF RFCs!).
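A minimal sketch of the per-file gzip comparison described above, assuming the treebanks are unpacked in a local directory. The helper names gzip_file and compare_sizes are hypothetical and not part of cliqs.

```python
import gzip
import shutil
from pathlib import Path
from typing import Tuple


def gzip_file(path: Path) -> Path:
    """Write a gzip-compressed copy of `path` next to it and return the new path."""
    out_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def compare_sizes(root: Path, pattern: str = "*.conllu") -> Tuple[int, int]:
    """Return (total raw bytes, total gzipped bytes) for matching files under `root`."""
    raw_total = 0
    gz_total = 0
    # rglob walks subdirectories, since each treebank sits in its own folder (assumption).
    for path in sorted(root.rglob(pattern)):
        raw_total += path.stat().st_size
        gz_total += gzip_file(path).stat().st_size
    return raw_total, gz_total
```

For example, compare_sizes(Path("ud-treebanks-v1.4")) would return both totals in bytes, assuming that is where the archive was extracted.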

As for distribution, in addition to the FTP server, a torrent should suffice (as has been used by CERN Open Data and data.gov.uk).

Edit: add urls for better access

Ensure reproducibility of the full result of mindep

@Futrell, WDYT of publishing the code (maybe in a Jupyter notebook) that was used to create the plots summarizing the post-processed output? What if there existed a "reproducibility number" for any paper, whose count is incremented whenever a peer validates its result? I haven't yet fully fleshed out what the sufficient criterion for validating a result should be, or whether there are stages/hierarchies of criteria (perhaps it lies somewhere between verifying a result and falsifying it).
At the least, this should be about checking for systematic bugs, as opposed to attesting whether a discovery is 5-sigma certain. This could complement one rough measure of scientific consensus, e.g. (citation number / size of a field). ...

(In short: this is a request for the code behind the fancy plots!)

Sentence-level parallelism

I was trying to reproduce the result described in this code comment:

# it looks like that's actually slower than parallelizing over corpora, for some

I found that pooling roughly halved the reported user CPU time of the run (a ~2x reduction), although the wall-clock totals below are about the same.

Without parallel:

python run_mindep.py run en fr  866.40s user 0.48s system 99% cpu 14:28.04 total
python run_mindep.py run en fr  893.17s user 0.53s system 99% cpu 14:55.14 total
python run_mindep.py run en fr  905.34s user 0.56s system 99% cpu 15:08.00 total

With parallel (pmap):

python run_mindep.py run en fr  404.78s user 13.91s system 48% cpu 14:23.18 total
python run_mindep.py run en fr  410.19s user 14.25s system 47% cpu 15:01.91 total
python run_mindep.py run en fr  418.29s user 14.64s system 54% cpu 13:09.16 total

This was run on an "Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz", quad-core.
I think the run could be ~an order of magnitude faster by adding several numba @jit decorators to functions in deptransform/depgraph. So far I have tested @jit on gen_row but didn't observe any speedup.
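To make the serial-versus-pooled comparison concrete, here is a minimal sketch of sentence-level parallelism using the standard library's multiprocessing.Pool. It is not the pmap used in cliqs: sentence_cost, run_serial, and run_parallel are hypothetical names, and the per-sentence work is a stand-in (each sentence is assumed to be a list of 1-based head indices, with 0 marking the root).

```python
from multiprocessing import Pool


def sentence_cost(sentence):
    """Stand-in per-sentence work: sum of |head - position| over non-root tokens."""
    return sum(
        abs(head - pos)
        for pos, head in enumerate(sentence, start=1)
        if head != 0
    )


def run_serial(sentences):
    """Process every sentence in the current process."""
    return [sentence_cost(s) for s in sentences]


def run_parallel(sentences, processes=2, chunksize=64):
    """Process sentences in a worker pool; results keep the input order."""
    # A larger chunksize sends sentences to workers in batches, which reduces
    # the per-task pickling overhead that can cancel out the gain on small tasks.
    with Pool(processes=processes) as pool:
        return pool.map(sentence_cost, sentences, chunksize=chunksize)


if __name__ == "__main__":
    demo = [[2, 0, 2], [0, 1, 2, 3]] * 1000
    assert run_parallel(demo) == run_serial(demo)
```

Whether this beats parallelizing over corpora depends on how expensive each sentence is relative to the cost of shipping it to a worker, so timing both variants on the real data is the only reliable check.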

Testing the DLM hypothesis on context-free language corpora?

There could be universals that are better uncovered with languages that are context-free. Programming-language treebanks are almost nonexistent, so if I were to construct one, what comes to mind is to draw the source from formalized mathematical proofs and established software protocols, both of which have been implemented in various languages.
