Remove HTML generation code? (aclpub, closed)

acl-org commented on July 28, 2024
Remove HTML generation code?

from aclpub.

Comments (7)

davidweichiang commented on July 28, 2024

I think that sounds good. But I'm still unsure how to get all the different pieces of the system to share code when possible.

(When I tag @desilinguist does he see it, or does he have to be connected to the repository somehow?)

mjpost commented on July 28, 2024

This functionality more properly belongs in acl-pub and could be removed from here, but we should check with the current round of pub chairs, at least, since we might be interrupting something. Tagging the ones whose handles I could find:

slukin commented on July 28, 2024

@desilinguist and I just discussed this the other day actually:

From EMNLP 2018, he was expecting .csv files with the author index, session chair information, anthology paper id mappings, and abstracts. I do not know which script was used to create these .csv files for him, so we decided for this year that Nitin would write his own scraper for the .html files that live in the generated cdrom/ part of the proceedings.tgz (authors.html, index.html, and program.html).

@davidweichiang, if there is another way to obtain this information and create the relevant .csv files without using these .html files, it's fine to remove that code from our perspective, as long as the code that creates the .csv files either runs automatically when we pull the proceedings.tgz or can be run after it's been pulled.

davidweichiang commented on July 28, 2024

Hm, the plot thickens. I think I would suggest (if @desilinguist has not written this scraper already) that these .csv files be generated straight from the db, meta, and abstracts files, which are plain text and simpler to deal with than HTML. That would smooth the road to removing HTML generation from ACLPUB.

However, the db, meta, and abstracts files will contain TeX control sequences. I would suggest, if it's practical, that Nitin's scraper use the same TeX-to-XML/HTML code that @mbollmann is writing for the Anthology. Is that possible?

(Currently, there are about a half-dozen pieces of code across three Git repositories that do this same task, and each covers a different subset of TeX. I'm hoping that they can be unified into one really good one.)
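To make the suggestion above concrete, here is a minimal Python sketch of going from a plain-text metadata file straight to .csv, with a small TeX-accent cleanup step along the way. Everything specific in it is an assumption for illustration: the blank-line-separated `KEY: value` record layout, the field keys `P`, `T`, and `A`, and the helper names `detex`, `parse_records`, and `records_to_csv` are hypothetical and not taken from ACLPUB or the Anthology code. The accent handling covers only five simple accent commands; it is not a substitute for a shared TeX-to-XML/HTML converter.

```python
import csv
import io
import re
import unicodedata

# Hypothetical, deliberately tiny mapping from TeX accent commands
# to Unicode combining marks. Real TeX needs far more coverage.
TEX_ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

# Matches e.g. \'{e}, \'e, \"{u}
ACCENT_RE = re.compile(r"""\\(['`"^~])\{?([A-Za-z])\}?""")


def detex(text):
    """Convert a few simple TeX accents to Unicode and drop stray braces."""
    text = ACCENT_RE.sub(lambda m: m.group(2) + TEX_ACCENTS[m.group(1)], text)
    text = text.replace("{", "").replace("}", "")
    # Compose base letter + combining mark into a single code point.
    return unicodedata.normalize("NFC", text)


def parse_records(text):
    """Parse blank-line-separated 'KEY: value' blocks (assumed layout).

    Returns a list of dicts mapping each key to a list of values,
    since a key such as an author field may repeat within a record.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not in KEY: value form
        current.setdefault(key.strip(), []).append(detex(value.strip()))
    if current:
        records.append(current)
    return records


def records_to_csv(records, fields):
    """Write the chosen fields as CSV, joining repeated values with '; '."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for rec in records:
        writer.writerow(["; ".join(rec.get(f, [])) for f in fields])
    return out.getvalue()


# Made-up sample input, not real proceedings data.
SAMPLE = r"""
P: 1
T: A Study of Caf\'{e} Menus
A: Garc\'{i}a, Mar\'{i}a
A: M\"{u}ller, Hans

P: 2
T: Parsing Without Tears
A: Doe, Jane
"""

if __name__ == "__main__":
    print(records_to_csv(parse_records(SAMPLE), ["P", "T", "A"]), end="")
```

Keeping `detex` as a separate function is the point of the sketch: if a single shared TeX converter becomes available, only that one function would need to be swapped out.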

slukin commented on July 28, 2024

Here is some information from Nitin: "The only .csv file I created via scraping was anthology-mapping.csv. The rest were provided to me by the EMNLP program chairs. I haven't started to work on any script yet." He also said that the anthology-mapping.csv was created by scraping one of these .html files.

So it seems we can move away from HTML generation and instead generate the files directly in an easier-to-parse format, such as separate .csv files or a single YAML file (a suggestion from Nitin). For a concrete example, the .csv files that Nitin used for EMNLP are here: https://github.com/emnlp2018/emnlp2018.github.io/tree/master/scripts/data

desilinguist commented on July 28, 2024

I am the current owner of the acl-org organization so I see everything. Bwahahaha. Sorry :)

rrgerber commented on July 28, 2024

I would suggest leaving the HTML generation code in, since it is very useful for people to build the "all" target and check that everything is OK by looking at the CD-ROM directory.
