Comments (4)

danyaljj avatar danyaljj commented on September 9, 2024

As far as we know, these sentences were directly taken from a different dataset. I don't think we changed the tokenization; they came in like this.

from natural-instructions.

sanderland avatar sanderland commented on September 9, 2024

> As far as we know, these sentences were directly taken from a different dataset. I don't think we changed the tokenization; they came in like this.

There is no tokenization to change; that is just speculation on the origin.

Regardless of the origin and cause, you do not consider mis-formatted strings and "UNK" strings in your training data to be an issue?

danyaljj avatar danyaljj commented on September 9, 2024

I think the segment you mention is from this file, which was introduced in this PR: #284. The PR does not mention the origin of the data (it only cites the relevant paper).

My opinion is that these would be a concern if they were incomprehensible to humans (which I think they are not, unless I am missing something).

sanderland avatar sanderland commented on September 9, 2024

Some examples with unreadable outputs:

https://github.com/allenai/natural-instructions/blob/master/tasks/task1579_gigaword_incorrect_summarization.json

    "Positive Examples": [
        {
            "input": "four east timorese youths who scaled the french embassy 's fence here thursday , left the embassy on their way to portugal friday .",
            "output": "UNK latest east javanese asylum seekers leave for peru",
            "explanation": "The example is correct, as the location names are different from the passage"
        },
        {
            "input": "bosnian croat forces have begun torching homes in parts of western bosnia captured during a summer offensive but due to return to serbian control under the dayton peace agreement , un officials said friday .",
            "output": "croats build homes in areas for development",
            "explanation": "The example is correct, as it incorrectly summarizes the passage"
        }
    ],
    "Negative Examples": [
        {
            "input": "five east timorese youths who scaled the french embassy 's fence here thursday , left the embassy on their way to portugal friday .",
            "output": "UNK latest east timorese asylum seekers leave for portugal",
            "explanation": "The example is incorrect, as it correctly summarizes the passage"
        },
        {
            "input": "bosnian croat forces have begun torching homes in parts of western bosnia captured during a summer offensive but due to return to serb control under the dayton peace agreement , un officials said friday .",
            "output": "croats torch homes in areas due to return to serbs",
            "explanation": "The example is incorrect, as it correctly summarizes the passage"
        }
    ],
    "Instances": [
        {
            "id": "task1579-7f515582fdd9457cb5c481c3f89aa349",
            "input": "up to ## afghans have been killed and hundreds injured by a massive explosion at an ammunition depot in the eastern provincial capital jalalabad , kabul red cross officials said thursday .",
            "output": [
                "mobutu notes parliament 's decision to sack government"
            ]
        },
        {
            "id": "task1579-d4478904d95e41d699d0a984530c1fd9",
            "input": "the independent political groups known as ###s should be required to disclose their funding sources while election campaigns are in progress , three republican house members said wednesday .",
            "output": [
                "vanity of UNK : the conductor as composer as entrepreneur"
            ]
        },

        {
            "id": "task1579-60c744c284634dc39cc668edd6e1237b",
            "input": "an e-mail sent to paul harris , a member of the virginia house of delegates , elicits an automatic response : `` i will be out of the office from UNK until UNK .",
            "output": [
                "national semiconductor 's loss is smaller than expected"
            ]
        },

Overall, having weird artifacts in inputs seems OK, but having them in outputs does not: you don't want a model to learn this and start inserting spaces before apostrophes in its answers.
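To get a rough count of how widespread this is, something like the following could flag affected strings in a task file. This is only a sketch: the pattern names and regexes are my own guesses based on the examples quoted above (`UNK`, runs of `#`, and a space before an apostrophe), and `find_artifacts` and `scan_task` are hypothetical helpers, not part of the repository.

```python
import re

# Hypothetical artifact patterns, guessed from the examples quoted above.
ARTIFACT_PATTERNS = {
    "unk_token": re.compile(r"\bUNK\b"),          # literal UNK placeholder
    "masked_digits": re.compile(r"#{2,}"),        # digits replaced by ## or ###
    "detached_apostrophe": re.compile(r"\s'\w"),  # e.g. "parliament 's"
}


def find_artifacts(text):
    """Return the sorted names of the artifact patterns that match `text`."""
    return sorted(
        name for name, pattern in ARTIFACT_PATTERNS.items() if pattern.search(text)
    )


def scan_task(task):
    """Yield (section, field, artifact_names) for each flagged string in a task dict."""
    for section in ("Positive Examples", "Negative Examples", "Instances"):
        for item in task.get(section, []):
            for field in ("input", "output"):
                value = item.get(field, "")
                # In "Instances" the output is a list of strings; elsewhere it is one string.
                texts = value if isinstance(value, list) else [value]
                for text in texts:
                    hits = find_artifacts(text)
                    if hits:
                        yield section, field, hits
```

Loading a task file with `json.load` and passing the resulting dict to `scan_task` would list the flagged fields; filtering on `field == "output"` narrows it to the cases that matter most here.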
