Comments (7)
I think that sounds good. But I'm still unsure how to get all the different pieces of the system to share code when possible.
(When I tag @desilinguist does he see it, or does he have to be connected to the repository somehow?)
from aclpub.
This functionality more properly belongs in acl-pub and could be removed from here, but we should check with the current round of pub chairs, at least, since we might be interrupting something. Tagging the ones whose handles I could find:
@desilinguist and I just discussed this the other day actually:
Based on EMNLP 2018, he was expecting .csv files with the author index, session chair information, anthology paper id mappings, and abstracts. I do not know which script was used to create these .csv files for him, so we decided for this year that Nitin would write his own scraper for the .html files that live in the generated cdrom/ part of the proceedings.tgz (authors.html, index.html, and program.html).
@davidweichiang, if there is another way to obtain this information and create the relevant .csv files without using these .html files, it's fine from our perspective to remove that code, as long as the code that creates the .csv files either runs directly when we pull the proceedings.tgz or can be run after it's been pulled.
Hm, the plot thickens. I think I would suggest (if @desilinguist has not written this scraper already) that these .csv files be generated straight from the db, meta, and abstracts files, which are plain text and simpler to deal with than HTML. That would smooth the road to removing HTML generation from ACLPUB.
However, the db, meta, and abstracts files will contain TeX control sequences. I would suggest, if it's practical, that Nitin's scraper use the same TeX-to-XML/HTML code that @mbollmann is writing for the Anthology. Is that possible?
(Currently, there are about a half-dozen pieces of code across three Git repositories that do this same task, and each covers a different subset of TeX. I'm hoping that they can be unified into one really good one.)
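To make the "one shared converter" idea concrete, here is a minimal sketch of a single function that every tool could import. It is not the Anthology code mentioned above; the function name `tex_to_unicode` and the tables are hypothetical, and it covers only a handful of accents and symbols.

```python
import re
import unicodedata

# Combining characters for a few TeX accent commands (illustrative subset only).
ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

# Matches forms such as \'{e}, \'e and \"{o}.
ACCENT_RE = re.compile(r"""\\(['`^"~])\{?([A-Za-z])\}?""")

# A few non-accent control sequences (hypothetical selection).
SYMBOLS = {r"\textasciitilde": "~", r"\&": "&", r"\%": "%"}


def tex_to_unicode(text: str) -> str:
    """Convert a small subset of TeX control sequences to Unicode."""

    def compose(match):
        # Base letter plus combining accent, composed into one code point.
        combined = match.group(2) + ACCENTS[match.group(1)]
        return unicodedata.normalize("NFC", combined)

    text = ACCENT_RE.sub(compose, text)
    for tex, char in SYMBOLS.items():
        text = text.replace(tex, char)
    # Drop any grouping braces that are left over.
    return text.replace("{", "").replace("}", "")
```

The point of the sketch is the shape rather than the coverage: one table-driven function in one place, so that a newly supported control sequence is added once instead of in each repository.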
Here is some information from Nitin: "The only .csv file I created via scraping was anthology-mapping.csv. The rest were provided to me by the EMNLP program chairs. I haven't started to work on any script yet." He also said that the anthology-mapping.csv was created by scraping one of these .html files.
So it seems we can move away from HTML generation and instead generate the files directly in an easier-to-read format, such as separate .csv files or a single YAML file that can be parsed (a suggestion from Nitin). For a concrete example, the .csv files that Nitin used for EMNLP are here: https://github.com/emnlp2018/emnlp2018.github.io/tree/master/scripts/data
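As a rough illustration of generating .csv output straight from plain-text files, here is a sketch. It assumes a hypothetical input layout of `key: value` lines with blank lines between records; the real db and meta files may be laid out differently, and the function and column names below are made up for the example.

```python
import csv
import io
from typing import Dict, List


def parse_records(text: str) -> List[Dict[str, str]]:
    """Parse blank-line-separated blocks of 'key: value' lines (assumed layout)."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def records_to_csv(records: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize the chosen columns of each record as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; extra fields are left out.
        writer.writerow({name: record.get(name, "") for name in fieldnames})
    return buffer.getvalue()
```

For example, `records_to_csv(parse_records(text), ["id", "title"])` would give a two-column file. Values would still contain TeX control sequences at this point, so a conversion step would be needed before writing.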
I am the current owner of the acl-org organization so I see everything. Bwahahaha. Sorry :)
I would suggest leaving the HTML generation code in, since it is very useful for people to download the "all" target and check that everything is OK by looking at the CD-ROM directory.