google-research-datasets / natural-questions

License: Apache License 2.0

natural-questions's Introduction

Natural Questions

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.

Please see http://ai.google.com/research/NaturalQuestions to get the data and view the leaderboard. For more details on the design and content of the dataset, please see the paper Natural Questions: a Benchmark for Question Answering Research. To help you get started on this task we have provided some baseline systems that can be branched.

Data Description

NQ contains 307,373 training examples, 7,830 examples for development, and we withhold a further 7,842 examples for testing. In the paper, we demonstrate a human upper bound of 87% F1 on the long answer selection task, and 76% F1 on the short answer selection task.

To run on the hidden test set, you will have to upload a Docker image containing your system to the NQ competition site. Instructions on building the Docker image are given here.

Data Format

Each example in the original NQ format contains the rendered HTML of an entire Wikipedia page, as well as a tokenized representation of the text on the page.

This section will go on to define the full NQ data format, but we recognize that most users will only want a version of the data in which the text has already been extracted. We have supplied a simplified version of the training set and we have also supplied a simplify_nq_example function in data_utils.py which maps from the original format to the simplified format. Only the original format is provided by our competition site. If you use the simplified data, you should call simplify_nq_example on each example seen during evaluation and you should provide predictions using the start_token and end_token offsets that correspond to the whitespace separated tokens in the document text.
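
For instance, a minimal sketch of using simplify_nq_example at evaluation time (the function and file names are from this repository; the input filename and the exact loading loop are assumptions):

import gzip
import json

from data_utils import simplify_nq_example  # provided in this repository

# Assumed filename; the competition site supplies original-format JSON Lines.
with gzip.open("nq-dev-00.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        simplified = simplify_nq_example(json.loads(line))
        # Predictions must use start_token/end_token offsets into the
        # whitespace-separated tokens of the simplified document text.
        tokens = simplified["document_text"].split(" ")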

As well as recognizing predictions according to token offsets, the evaluation script also recognizes predictions as byte offsets into the original HTML. This allows users to define their own text extraction and tokenization schemes.

To help you explore the data, this repository also contains a simple data browser that you can run on your own machine and modify as you see fit. We have also provided extra preprocessing utilities and TensorFlow dataset code in the repository containing the baseline systems presented in our paper. The rest of this section describes the data format thoroughly, in reference to a toy example.

Each example contains a single question, a tokenized representation of the question, a timestamped Wikipedia URL, and the HTML representation of that Wikipedia page.

"question_text": "who founded google",
"question_tokens": ["who", "founded", "google"],
"document_url": "http://www.wikipedia.org/Google",
"document_html": "<html><body><h1>Google</h1><p>Google was founded in 1998 by ..."

We release the raw HTML, since this is what was seen by our annotators, and we would like to support approaches that make use of the document structure. However, we expect most initial efforts will prefer to use a tokenized representation of the page.

"document_tokens":[
  { "token": "<h1>", "start_byte": 12, "end_byte": 16, "html_token": true },
  { "token": "Google", "start_byte": 16, "end_byte": 22, "html_token": false },
  { "token": "inc", "start_byte": 23, "end_byte": 26, "html_token": false },
  { "token": ".", "start_byte": 26, "end_byte": 27, "html_token": false },
  { "token": "</h1>", "start_byte": 27, "end_byte": 32, "html_token": true },
  { "token": "<p>", "start_byte": 32, "end_byte": 35, "html_token": true },
  { "token": "Google", "start_byte": 35, "end_byte": 41, "html_token": false },
  { "token": "was", "start_byte": 42, "end_byte": 45, "html_token": false },
  { "token": "founded", "start_byte": 46, "end_byte": 53, "html_token": false },
  { "Token": "in", "start_byte": 54, "end_byte": 56, "html_token": false },
  { "token": "1998", "start_byte": 57, "end_byte": 61, "html_token": false },
  { "token": "by", "start_byte": 62, "end_byte": 64, "html_token": false },

Each token is either a word or an HTML tag that defines a heading, paragraph, table, or list. HTML tags are marked as such using the boolean field html_token. Each token also has an inclusive start_byte and an exclusive end_byte that identify the token's position within the UTF-8 encoding of the example's HTML string.
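
As a concrete illustration (a sanity check of my own, not repository code), the offsets can be verified by slicing the UTF-8 bytes of document_html rather than the Python string:

# Toy example mirroring the snippets above.
example = {
    "document_html": "<html><body><h1>Google inc.</h1><p>Google was founded in 1998 by",
    "document_tokens": [
        {"token": "<h1>", "start_byte": 12, "end_byte": 16, "html_token": True},
        {"token": "Google", "start_byte": 16, "end_byte": 22, "html_token": False},
    ],
}

# start_byte is inclusive and end_byte exclusive, and both index the UTF-8
# bytes of document_html (not Python str indices, which count code points).
html_bytes = example["document_html"].encode("utf-8")
for t in example["document_tokens"]:
    span = html_bytes[t["start_byte"]:t["end_byte"]].decode("utf-8")
    assert span == t["token"]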

Long Answer Candidates

The first task in Natural Questions is to identify the smallest HTML bounding box that contains all of the information required to infer the answer to a question. These long answers can be paragraphs, lists, list items, tables, or table rows. While the candidates can be inferred directly from the HTML or token sequence, we also include a list of long answer candidates for convenience. Each candidate is defined in terms of offsets into both the HTML and the document tokens. As with all other annotations, start offsets are inclusive and end offsets are exclusive.

"long_answer_candidates": [
  { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "top_level": true },
  { "start_byte": 65, "end_byte": 102, "start_token": 13, "end_token": 21, "top_level": false },

In this example, you can see that the second long answer candidate is contained within the first. We do not disallow nested long answer candidates; we just ask annotators to find the smallest candidate containing all of the information required to infer the answer to the question. However, we do observe that 95% of all long answers (including all paragraph answers) are not nested below any other candidate. Since we believe that some users may want to start by considering only non-overlapping candidates, we include a boolean flag top_level that identifies whether a candidate is nested below another (top_level = False) or not (top_level = True). Please be aware that this flag is included only for convenience; it is not related to the task definition in any way. For more information about the distribution of long answer types, please see the data statistics section below.
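
For example, restricting the search space to non-nested candidates is a one-liner (a sketch of my own, using the toy candidates above):

candidates = [
    {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "top_level": True},
    {"start_byte": 65, "end_byte": 102, "start_token": 13, "end_token": 21, "top_level": False},
]

# Per the statistics below, this keeps ~95% of long answers
# (including all paragraph answers).
top_level = [c for c in candidates if c["top_level"]]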

Annotations

The NQ training data has a single annotation for each example, and the evaluation data has five. Each annotation defines a "long_answer" span, a list of short_answers, and a yes_no_answer. If the annotator has marked a long answer, the long answer dictionary identifies it using byte offsets, token offsets, and an index into the list of long answer candidates. If the annotator has marked that no long answer is available, all of the fields in the long answer dictionary are set to -1.

"annotations": [{
  "long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 },
  "short_answers": [
    {"start_byte": 73, "end_byte": 78, "start_token": 15, "end_token": 16},
    {"start_byte": 87, "end_byte": 92, "start_token": 18, "end_token": 19}
  ],
  "yes_no_answer": "NONE"
}]

Each of the short answers is also identified using both byte offsets and token indices. There is no limit to the number of short answers. There is also often no short answer, since some questions such as "describe google's founding" do not have a succinct extractive answer. When this is the case, the long answer is given but the "short_answers" list is empty.

Finally, if no short answer is given, it is possible that there is a yes_no_answer for questions such as "did larry co-found google". The value of this field is YES or NO if a yes/no answer is given, and defaults to NONE when no yes/no answer is given. For statistics on long answers, short answers, and yes/no answers, please see the data statistics section below.
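
Putting the three fields together, here is a sketch (my own, not official code) of reading off the answer type implied by a single annotation:

def annotation_type(annotation):
    if annotation["yes_no_answer"] in ("YES", "NO"):
        return annotation["yes_no_answer"].lower()
    if annotation["short_answers"]:
        return "short"
    if annotation["long_answer"]["candidate_index"] != -1:
        return "long_only"  # long answer given, no succinct short answer
    return "null"           # annotator found no answer on the page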

Data Statistics

The NQ training data contains 307,373 examples. 152,148 have a long answer and 110,724 have a short answer. Short answers can be sets of spans in the document (106,926), or yes or no (3,798). Long answers are HTML bounding boxes, and the distribution of NQ long answer types is as follows:

HTML tags            Percent of long answers
<P>                  72.9%
<Table>              19.0%
<Tr>                  1.5%
<Ul>, <Ol>, <Dl>      3.2%
<Li>, <Dd>, <Dt>      3.4%

While we allow any paragraph, table, or list element to be a long answer, we find that 95% of the long answers are not contained by any other long answer candidate. We mark these top level candidates in the data, as described above.

Short answers may contain more than one span if the question asks for a list of answers (e.g. "who made it to stage 3 in american ninja warrior season 9"). However, almost all short answers (90%) contain only a single span of text. All short answers are contained by the long answer given in the same annotation.

Prediction Format

Please see the evaluation script for a description of the prediction format that your model should output.
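
For convenience, the docstring of nq_eval.py sketches predictions of roughly the following shape (the example_id and offsets below are illustrative, reusing values from this page, and the scores are placeholders; the script itself remains the authoritative reference):

{"predictions": [{
  "example_id": -1220107454853145579,
  "long_answer": {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22},
  "long_answer_score": 13.5,
  "short_answers": [{"start_byte": 73, "end_byte": 78, "start_token": 15, "end_token": 16}],
  "short_answers_score": 26.4,
  "yes_no_answer": "NONE"
}]}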

Contact us

If you have a technical question regarding the dataset, code or publication, please create an issue in this repository. This is the fastest way to reach us.

If you would like to share feedback or report concerns, please email us at [email protected].


natural-questions's Issues

EOFError: Compressed file ended before the end-of-stream marker was reached (on Google Colab)?

This is my code to open the data:

import gzip
import json

jsonfilename = "/content/v1.0-simplified_nq-dev-all.jsonl.gz"
with gzip.open(jsonfilename, 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))

However, I'm getting this error:
EOFError                                  Traceback (most recent call last)
in ()
      8 jsonfilename = "/content/v1.0-simplified_nq-dev-all.jsonl.gz"
      9 with gzip.open(jsonfilename, 'r') as fin:
---> 10     data = json.loads(fin.read().decode('utf-8'))

1 frames
/usr/lib/python3.7/gzip.py in read(self, size)
    491                 break
    492             if buf == b"":
--> 493                 raise EOFError("Compressed file ended before the "
    494                                "end-of-stream marker was reached")
    495

EOFError: Compressed file ended before the end-of-stream marker was reached
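
For reference: the EOFError itself means the gzip stream is truncated, which usually indicates an incomplete download, so re-downloading is the first thing to try. Separately, the file is JSON Lines (one example per line), so even an intact download will fail a single json.loads. A sketch that streams examples instead:

import gzip
import json

# Stream one JSON example per line rather than parsing the whole file
# as a single JSON document. If the EOFError persists, the .gz file is
# truncated and should be re-downloaded.
examples = []
with gzip.open("/content/v1.0-simplified_nq-dev-all.jsonl.gz", "rt", encoding="utf-8") as fin:
    for line in fin:
        examples.append(json.loads(line))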

How to get "Long Answer Candidates" from wikipedia source code.

I want to convert the source code of a Wikipedia webpage into the format provided in the competition. For example, if we look at this webpage, I want to convert it into the following format (as provided by you in the competition):

{
  "example_id": "-1220107454853145579",
  "question_text": "who is the south african high commissioner in london",
  "document_text": "High Commission of South Africa , London - wikipedia <H1> High Commission of South Africa , London </H1> <Table> <Tr> <Th_colspan=\"2\"> High Commission of South Africa in London </Th> </Tr> <Tr> <Td_colspan=\"2\"> </Td> </Tr> <Tr> <Th> Location </Th> <Td> Trafalgar Square , London </Td> </Tr> <Tr> <Th> Address </Th> <Td> Trafalgar Square , London , WC2N 5DP </Td> </Tr> <Tr> <Th> Coordinates </Th> <Td> 51 ° 30 ′ 30 '' N 0 ° 07 ′ 37 '' W  /  51.5082 ° N 0.1269 ° W  / 51.5082 ; - 0.1269 Coordinates : 51 ° 30 ′ 30 '' N 0 ° 07 ′ 37 '' W  /  51.5082 ° N 0.1269 ° W  / 51.5082 ; - 0.1269 </Td> </Tr> <Tr> <Th> High Commissioner </Th> <Td> Vacant </Td> </Tr> </Table> Balcony of South Africa House <P> The High Commission of South Africa in London is the diplomatic mission from South Africa to the United Kingdom . It is located at South Africa House , a building on Trafalgar Square , London . As well as containing the offices of the High Commissioner , the building also hosts the South African consulate . It has been a Grade II * Listed Building since 1982 . </P> <H2> Contents </H2> <Ul> <Li> 1 History </Li> <Li> 2 See also </Li> <Li> 3 References </Li> <Li> 4 External links </Li> </Ul> <H2> History ( edit ) </H2> <P> South Africa House was built by Holland , Hannen & Cubitts in the 1930s on the site of what had been Morley 's Hotel until it was demolished in 1936 . The building was designed by Sir Herbert Baker , with architectural sculpture by Coert Steynberg and Sir Charles Wheeler , and opened in 1933 . The building was acquired by the government of South Africa as its main diplomatic presence in the UK . During World War II , Prime Minister Jan Smuts lived there while conducting South Africa 's war plans . </P> <P> In 1961 , South Africa became a republic , and withdrew from the Commonwealth due to its policy of racial segregation . Accordingly , the building became an Embassy , rather than a High Commission . During the 1980s , the building , which was one of the only South African diplomatic missions in a public area , was targeted by protesters from around the world . During the 1990 Poll Tax Riots , the building was set alight by rioters , although not seriously damaged . </P> <P> The first fully free democratic elections in South Africa were held on the 27 April 1994 , and 4 days later , the country rejoined the Commonwealth , 33 years to the day after it withdrew upon becoming a republic . Along with country 's diplomatic missions in other Commonwealth countries , the mission once again became a High Commission . </P> <P> Today , South Africa House is no longer a controversial site , and is the focal point of South African culture in the UK . South African President Nelson Mandela appeared on the balcony of South Africa House in 1996 , as part of his official UK state visit . In 2001 , Mandela again appeared on the balcony of South Africa House to mark the seventh anniversary of Freedom Day , when the apartheid system was officially abolished . </P> <H2> See also ( edit ) </H2> <Ul> <Li> List of diplomatic missions of South Africa </Li> <Li> High Commission of Canada to the United Kingdom </Li> <Li> High Commission of Uganda , London </Li> </Ul> <H2> References ( edit ) </H2> <Table> <Tr> <Td> </Td> <Td> Wikimedia Commons has media related to South Africa House , London . </Td> </Tr> </Table> <Ol> <Li> ^ Jump up to : `` The London Diplomatic List '' ( PDF ) . 14 December 2013 . Archived from the original ( PDF ) on 11 December 2013 . 
</Li> <Li> Jump up ^ Historic England . `` Details from listed building database ( 1066238 ) '' . National Heritage List for England . Retrieved 28 September 2015 . </Li> <Li> Jump up ^ Cubitts 1810 -- 1975 , published 1975 </Li> <Li> Jump up ^ `` The east side of Trafalgar Square '' . BHO . Retrieved 22 November 2015 . </Li> <Li> Jump up ^ Palliser , David Michael ; Clark , Peter ; Daunton , Martin J. ( 2000 ) . The Cambridge Urban History of Britain : 1840 -- 1950 . Cambridge University Press . p. 126 . </Li> <Li> ^ Jump up to : South Africa returns to the Commonwealth fold , The Independent , 31 May 1994 </Li> <Li> Jump up ^ Burns , Danny ( 1992 ) . Poll tax rebellion . AK Press . p. 90 . </Li> <Li> Jump up ^ United Kingdom of Great Britain and Northern Ireland , Department of International Relations and Cooperation </Li> <Li> Jump up ^ Hero 's welcome for Mandela at concert . BBC News . April 30 , 2001 . </Li> </Ol> <H2> External links ( edit ) </H2> <Ul> <Li> Official site </Li> </Ul> <Table> <Tr> <Th_colspan=\"2\"> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> Diplomatic missions in the United Kingdom </Th> </Tr> <Tr> <Th> Africa </Th> <Td> <Ul> <Li> Algeria </Li> <Li> Angola </Li> <Li> Botswana </Li> <Li> Burundi </Li> <Li> Cameroon </Li> <Li> Democratic Republic of the Congo </Li> <Li> Egypt </Li> <Li> Equatorial Guinea </Li> <Li> Eritrea </Li> <Li> Ethiopia </Li> <Li> Gabon </Li> <Li> The Gambia </Li> <Li> Ghana </Li> <Li> Guinea </Li> <Li> Ivory Coast </Li> <Li> Kenya </Li> <Li> Lesotho </Li> <Li> Liberia </Li> <Li> Libya </Li> <Li> Malawi </Li> <Li> Mauritania </Li> <Li> Mauritius </Li> <Li> Morocco </Li> <Li> Mozambique </Li> <Li> Namibia </Li> <Li> Nigeria </Li> <Li> Rwanda </Li> <Li> Senegal </Li> <Li> Seychelles </Li> <Li> Sierra Leone </Li> <Li> South Africa </Li> <Li> South Sudan </Li> <Li> Sudan </Li> <Li> Swaziland </Li> <Li> Tanzania </Li> <Li> Togo </Li> <Li> Tunisia </Li> <Li> Uganda </Li> <Li> Zambia </Li> <Li> Zimbabwe </Li> </Ul> </Td> </Tr> <Tr> <Th> Americas </Th> <Td> <Ul> <Li> Antigua and Barbuda </Li> <Li> Argentina </Li> <Li> The Bahamas </Li> <Li> Barbados </Li> <Li> Belize </Li> <Li> Bolivia </Li> <Li> Brazil </Li> <Li> Canada </Li> <Li> Chile </Li> <Li> Colombia </Li> <Li> Costa Rica </Li> <Li> Cuba </Li> <Li> Dominica </Li> <Li> Dominican Republic </Li> <Li> Ecuador </Li> <Li> El Salvador </Li> <Li> Grenada </Li> <Li> Guatemala </Li> <Li> Guyana </Li> <Li> Haiti </Li> <Li> Honduras </Li> <Li> Jamaica </Li> <Li> Mexico </Li> <Li> Nicaragua </Li> <Li> Panama </Li> <Li> Paraguay </Li> <Li> Peru </Li> <Li> Saint Kitts and Nevis </Li> <Li> Saint Lucia </Li> <Li> Saint Vincent and the Grenadines </Li> <Li> Trinidad and Tobago </Li> <Li> United States of America </Li> <Li> Uruguay </Li> <Li> Venezuela </Li> </Ul> </Td> </Tr> <Tr> <Th> Asia </Th> <Td> <Ul> <Li> Afghanistan </Li> <Li> Armenia </Li> <Li> Azerbaijan </Li> <Li> Bahrain </Li> <Li> Bangladesh </Li> <Li> Brunei </Li> <Li> Cambodia </Li> <Li> China </Li> <Li> East Timor </Li> <Li> Georgia </Li> <Li> India </Li> <Li> Indonesia </Li> <Li> Iran </Li> <Li> Iraq </Li> <Li> Israel </Li> <Li> Japan </Li> <Li> Jordan </Li> <Li> Kazakhstan </Li> <Li> Kuwait </Li> <Li> Kyrgyzstan </Li> <Li> Laos </Li> <Li> Lebanon </Li> <Li> Malaysia </Li> <Li> Maldives </Li> <Li> Mongolia </Li> <Li> Myanmar </Li> <Li> Nepal </Li> <Li> North Korea </Li> <Li> Oman </Li> <Li> Pakistan </Li> <Li> The Philippines </Li> <Li> Qatar </Li> <Li> Saudi Arabia </Li> <Li> Singapore </Li> <Li> South Korea </Li> <Li> Sri Lanka </Li> <Li> Syria 
</Li> <Li> Tajikistan </Li> <Li> Thailand </Li> <Li> Turkey </Li> <Li> Turkmenistan </Li> <Li> United Arab Emirates </Li> <Li> Uzbekistan </Li> <Li> Vietnam </Li> <Li> Yemen </Li> </Ul> </Td> </Tr> <Tr> <Th> Europe </Th> <Td> <Ul> <Li> Albania </Li> <Li> Austria </Li> <Li> Belarus </Li> <Li> Belgium </Li> <Li> Bosnia and Herzegovina </Li> <Li> Bulgaria </Li> <Li> Croatia </Li> <Li> Cyprus </Li> <Li> Czech Republic </Li> <Li> Denmark </Li> <Li> Estonia </Li> <Li> Finland </Li> <Li> France </Li> <Li> Germany </Li> <Li> Greece </Li> <Li> Hungary </Li> <Li> Iceland </Li> <Li> Ireland </Li> <Li> Italy </Li> <Li> Kosovo </Li> <Li> Latvia </Li> <Li> Lithuania </Li> <Li> Luxembourg </Li> <Li> Macedonia </Li> <Li> Malta </Li> <Li> Moldova </Li> <Li> Monaco </Li> <Li> Montenegro </Li> <Li> The Netherlands </Li> <Li> Norway </Li> <Li> Poland </Li> <Li> Portugal </Li> <Li> Romania </Li> <Li> Russia </Li> <Li> Serbia </Li> <Li> Slovakia </Li> <Li> Slovenia </Li> <Li> Spain </Li> <Li> Sweden </Li> <Li> Switzerland </Li> <Li> Ukraine </Li> <Li> Vatican City ( Apostolic Nunciature ) </Li> </Ul> </Td> </Tr> <Tr> <Th> Oceania </Th> <Td> <Ul> <Li> Australia </Li> <Li> Fiji </Li> <Li> New Zealand </Li> <Li> Papua New Guinea </Li> <Li> Tonga </Li> </Ul> </Td> </Tr> <Tr> <Th> States with limited recognition </Th> <Td> <Ul> <Li> North Cyprus </Li> <Li> Palestine </Li> <Li> Taiwan </Li> </Ul> </Td> </Tr> <Tr> <Th> De facto independent states </Th> <Td> <Ul> <Li> Somaliland </Li> </Ul> </Td> </Tr> <Tr> <Th> British Overseas Territories </Th> <Td> <Ul> <Li> Anguilla </Li> <Li> Bermuda </Li> <Li> British Virgin Islands </Li> <Li> Cayman Islands </Li> <Li> Falkland Islands </Li> <Li> Gibraltar </Li> <Li> Montserrat </Li> <Li> Saint Helena </Li> <Li> Tristan da Cunha </Li> <Li> Turks and Caicos Islands </Li> </Ul> </Td> </Tr> <Tr> <Th> Other economies with their own representations </Th> <Td> Hong Kong </Td> </Tr> <Tr> <Th> International organisations </Th> <Td> <Ul> <Li> Arab League </Li> <Li> European Union </Li> <Li> International Organisation for Migration </Li> <Li> United Nations <Ul> <Li> UNHCR </Li> <Li> World Food Programme </Li> </Ul> </Li> <Li> World Bank </Li> </Ul> </Td> </Tr> </Table> <Table> <Tr> <Th_colspan=\"3\"> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> Trafalgar Square , London </Th> </Tr> <Tr> <Th> Buildings </Th> <Td> <Table> <Tr> <Th> Current </Th> <Td> <Ul> <Li> Clockwise from North : National Gallery </Li> <Li> St Martin - in - the - Fields </Li> <Li> South Africa House </Li> <Li> Drummonds Bank </Li> <Li> Admiralty Arch </Li> <Li> Uganda House <Ul> <Li> Embassy of Burundi </Li> <Li> High Commission of Uganda </Li> </Ul> </Li> <Li> Canadian Pacific building </Li> <Li> Admiralty ( pub ) </Li> <Li> Canada House </Li> </Ul> </Td> </Tr> <Tr> <Th> Former </Th> <Td> <Ul> <Li> Morley 's Hotel </Li> <Li> Northumberland House </Li> <Li> Royal Mews </Li> </Ul> </Td> </Tr> </Table> </Td> <Td> </Td> </Tr> <Tr> <Th> Statues </Th> <Td> <Table> <Tr> <Th> Plinths </Th> <Td> <Ul> <Li> SE : Henry Havelock </Li> <Li> SW : Charles Napier </Li> <Li> NE : George IV </Li> <Li> NW : Fourth plinth </Li> </Ul> </Td> </Tr> <Tr> <Th> Busts </Th> <Td> <Ul> <Li> Lord Beatty </Li> <Li> Lord Jellicoe </Li> <Li> Lord Cunningham </Li> </Ul> </Td> </Tr> <Tr> <Th> Other </Th> <Td> <Ul> <Li> Charles I <Ul> <Li> Charing Cross </Li> </Ul> </Li> <Li> Nelson 's Column </Li> <Li> James II </Li> <Li> George Washington </Li> </Ul> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Th> Adjacent streets </Th> <Td> <Ul> <Li> Charing Cross Road </Li> 
<Li> Cockspur Street </Li> <Li> Northumberland Avenue </Li> <Li> Strand </Li> <Li> Whitehall </Li> </Ul> </Td> </Tr> <Tr> <Th> People </Th> <Td> <Table> <Tr> <Th> Architects </Th> <Td> <Ul> <Li> Charles Barry </Li> <Li> Norman Foster </Li> <Li> Edwin Lutyens </Li> <Li> John Nash </Li> </Ul> </Td> </Tr> <Tr> <Th> Fourth plinth sculptors </Th> <Td> <Ul> <Li> Elmgreen and Dragset </Li> <Li> Katharina Fritsch <Ul> <Li> Hahn / Cock </Li> </Ul> </Li> <Li> Antony Gormley <Ul> <Li> One & Other </Li> </Ul> </Li> <Li> Marc Quinn </Li> <Li> Thomas Schütte </Li> <Li> Yinka Shonibare </Li> <Li> Mark Wallinger </Li> <Li> Rachel Whiteread </Li> <Li> Bill Woodrow </Li> </Ul> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Th> Events </Th> <Td> <Ul> <Li> Poll Tax Riots </Li> </Ul> </Td> </Tr> <Tr> <Th> Miscellaneous </Th> <Td> <Ul> <Li> Christmas tree </Li> </Ul> </Td> </Tr> <Tr> <Td_colspan=\"3\"> <Ul> <Li> </Li> <Li> Commons </Li> </Ul> </Td> </Tr> </Table> Retrieved from `` https://en.wikipedia.org/w/index.php?title=High_Commission_of_South_Africa,_London&oldid=850142361 '' Categories : <Ul> <Li> Diplomatic missions in London </Li> <Li> Trafalgar Square </Li> <Li> Diplomatic missions of South Africa </Li> <Li> Herbert Baker buildings and structures </Li> <Li> South Africa -- United Kingdom relations </Li> <Li> South Africa and the Commonwealth of Nations </Li> <Li> Grade II * listed buildings in the City of Westminster </Li> <Li> Buildings and structures completed in 1933 </Li> </Ul> <Ul> <Li> </Li> <Li> </Li> </Ul> <H2> </H2> <H3> </H3> <Ul> <Li> </Li> <Li> Talk </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> </Ul> <H3> </H3> <H3> </H3> <Ul> <Li> </Li> <Li> Contents </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> About Wikipedia </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> Afrikaans </Li> </Ul> Edit links <Ul> <Li> This page was last edited on 13 July 2018 , at 22 : 10 ( UTC ) . </Li> <Li> </Li> </Ul> <Ul> <Li> </Li> <Li> About Wikipedia </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <Ul> <Li> </Li> <Li> </Li> </Ul>",
  "long_answer_candidates": [
    {
      "end_token": 136,
      "start_token": 18,
      "top_level": true
    },
    {
      "end_token": 30,
      "start_token": 19,
      "top_level": false
    },
    {
      "end_token": 45,
      "start_token": 34,
      "top_level": false
    },
    {
      "end_token": 59,
      "start_token": 45,
      "top_level": false
    },
    {
      "end_token": 126,
      "start_token": 59,
      "top_level": false
    },
    {
      "end_token": 135,
      "start_token": 126,
      "top_level": false
    },
    {
      "end_token": 211,
      "start_token": 141,
      "top_level": true
    },
    {
      "end_token": 336,
      "start_token": 240,
      "top_level": true
    },
    {
      "end_token": 425,
      "start_token": 336,
      "top_level": true
    },
    {
      "end_token": 488,
      "start_token": 425,
      "top_level": true
    },
    {
      "end_token": 570,
      "start_token": 488,
      "top_level": true
    }
  ]
}

My main question is: can I get the script that converts the source code of a webpage into "document_text" and "long_answer_candidates"?
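
For reference, no official converter ships in this repository. Given a document_text that is already tokenized in the NQ style (HTML tags such as <P> and </P> appearing as whitespace-separated tokens), here is a rough sketch of my own that recovers token-level candidate spans with a tag stack; it ignores byte offsets and makes no attempt to match the official candidate generation exactly:

# Tags whose spans count as long answer candidates, per the README.
CANDIDATE_TAGS = {"p", "table", "tr", "ul", "ol", "dl", "li", "dd", "dt"}

def long_answer_candidates(document_text):
    tokens = document_text.split(" ")
    stack, candidates = [], []
    for i, tok in enumerate(tokens):
        if not tok.startswith("<"):
            continue
        name = tok.strip("<>/").split("_")[0].lower()  # <Th_colspan="2"> -> "th"
        if tok.startswith("</"):
            if stack and stack[-1][0] == name:
                _, start = stack.pop()
                if name in CANDIDATE_TAGS:
                    candidates.append({"start_token": start,
                                       "end_token": i + 1,  # exclusive
                                       "top_level": not stack})
        else:
            stack.append((name, i))
    return candidates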

pre-processing code for document_html to document_tokens

Hi, I'm looking for the code for pre-processing document_html to document_tokens. When I just join the document_tokens using whitespace, the resulting document is different from what I have in the original Wikipedia article.

For instance document_tokens with whitespace join looks like (say, format 1):
It is mostly water ( up to 95 % by volume ) , and contains dissolved proteins ( 6 -- 8 % )

while the same part I get from WikiExtractor (https://github.com/attardi/wikiextractor) is (format 2):
It is mostly water (up to 95% by volume), and contains dissolved proteins (6\u20138%)

Since I want to run inference on articles from WikiExtractor, I need to convert format 1 to format 2 (or vice versa) for NQ training, or convert document_html to format 2 while keeping the answer annotations.
For me, converting document_html to format 2 while keeping the answer annotations would be the best option.

Thanks.
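
For reference, NQ does not ship a detokenizer, but a rough heuristic (my own, and lossy: it does not recover annotation offsets or map "--" back to dashes) gets close to running text for simple cases:

import re

def detokenize(document_tokens):
    # Drop HTML-tag tokens, then remove the spaces NQ inserts around punctuation.
    text = " ".join(t["token"] for t in document_tokens if not t["html_token"])
    text = re.sub(r" +([,.;:?!%)])", r"\1", text)  # no space before these
    text = re.sub(r"([($]) +", r"\1", text)        # no space after these
    return text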

Running nq_browser.py on Windows OS

I'm having trouble running nq_browser.py on my Windows 10 machine. There seem to be several modules, or parts of the script, that don't work on Windows. Are there any workarounds or alternatives for running it on Windows?

clarification of 'score'

Hello, in 'nq_eval.py' it is mentioned that "Each prediction should be provided with a long answer score, and a short answers score".

May I clarify what these scores refer to? Are these scores supposed to represent the confidence of the model's predictions, or is there a fixed method to obtain scores?

For example, can we define the 'score' to simply be the sum of the start and end logits of the prediction?

Lastly, are scores also required for null predictions?

Thank you very much!

Error when loading gold file

Hello,

I am trying to generate predictions from the sample of the development set provided here. However, the script fails upon loading the gold data.

Input: python natural_questions/make_test_data.py --gold_path sample/v1.0_sample_nq-dev-sample.jsonl.gz --output_path whatever

Error:

I0201 17:10:07.177878 140447407568704 eval_utils.py:261] parsing sample/v1.0_sample_nq-dev-sample.jsonl.gz ..... 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "XXX/natural-questions/eval_utils.py", line 264, in read_annotation_from_one_split
    for line in input_file:
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "XXX/miniconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 406, in _read_gzip_header
    magic = self._fp.read(2)
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "XXX/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "natural-questions/make_test_data.py", line 120, in <module>
    app.run(main)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "natural-questions/make_test_data.py", line 53, in main
    n_threads=FLAGS.num_threads)
  File "XXX/natural-questions/eval_utils.py", line 303, in read_annotation
    dict_list = pool.map(read_annotation_from_one_split, input_paths)
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 290, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 683, in get
    raise self._value

Specifically it fails at line 264 of eval_utils.py when trying to iterate over the gold file.

Can you help me solve this issue?

Thanks a lot!
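
For reference, 0x8b is the second byte of the gzip magic number, so the traceback suggests the compressed file is being decoded as UTF-8 text before gzip sees it. A quick check (a sketch, not repository code) that the file on disk really is gzip:

with open("sample/v1.0_sample_nq-dev-sample.jsonl.gz", "rb") as f:
    print(f.read(2) == b"\x1f\x8b")  # True for a valid gzip stream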

how to train with this data

There are many answers, including long answers and short answers. If I train a model, which answer should I use? The model can't use all of the candidate long answers; if it did, it wouldn't converge.

Question identification: good or bad?

First, thank you for the amazing dataset.

As described in the paper, the first step of the annotation process is to classify the question as good or bad. But I could not find this information in the dataset. Is it available?

Regarding Char offset

I cannot understand the char offset included in the BERT-BASELINE code. Why were HTML tokens not considered while calculating the offset values? Can you please elaborate?

Convert token to text

Is there a way to convert the output (currently in the form of tokens) of the model to text for easy interpretation and testing?

Python3 data processing

Problem

Environment

Docker image FROM tensorflow/tensorflow:latest-gpu

Modification

I got rid of the warnings by making the modifications recommended for TF2.

Processing Script

python3 -m language.question_answering.bert_joint.prepare_nq_data \
  --logtostderr \
  --input_jsonl /mnt/data/v1.0/nq-train-??.jsonl.gz \
  --output_tfrecord ~/output_dir/nq-train.tfrecords-00000-of-00001 \
  --max_seq_length=512 \
  --include_unknowns=0.02 \
  --vocab_file=bert-joint-baseline/vocab-nq.txt

Output

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Examples with correct context retained: 0 out of 0

Epilog

When will you release the Python 3 version? I don't want to revert to Python 2.

Encoding issue in traindata

Hi,
I've stumbled on a strange problem reading the files downloaded with gsutil -m cp -R gs://natural_questions/v1.0 <mydir>.
I try to find the answers using the start_byte and end_byte positions of the tokens in document_html. Tokens with low start/end_byte values are correct, but later in the document the positions are wrong. The following Python 3 script shows the error:

import json

fp = open("nq-train-00.jsonl", encoding="utf-8")
line = fp.readline()
j = json.loads(line)

for toks in j["document_tokens"]:  
    print("token:    {%s}\t%d\t%d" % (toks["token"], toks["start_byte"], toks["end_byte"]))
    print(" in text: {%s}" % (j["document_html"][toks["start_byte"]:toks["end_byte"]]))
    print()

At the beginning, this produces the correct correspondence between the tokens listed in document_tokens and the text:

token:    {The} 92      95
 in text: {The}

token:    {Walking}     96      103
 in text: {Walking}

token:    {Dead}        104     108
 in text: {Dead}

but later on, notably after a non-breaking space (U+00A0) in document_html, things get weird:

token:    {season}      53862   53868
 in text: {season}

token:    {8}   53870   53871
 in text: {)}

token:    {)}   53871   53872
 in text: {<}

token:    {</Th>}       53872   53877
 in text: {/TH> }

token:    {</Tr>}       53878   53883
 in text: {/TR> }

It looks as if the start/end_byte values are shifted. The same happens with em dashes (U+2014) and ← characters.

Is there a corrected version available, or is there a list of characters that were replaced by a sequence of characters before the start/end_byte values were calculated?
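
For reference, the Data Format section above says the offsets index the token's position within the UTF-8 encoding of the HTML string. Slicing the Python str counts code points instead of bytes, so every multi-byte character (U+00A0, U+2014, ...) earlier in the document shifts the apparent position, which matches the behaviour shown. A sketch of the fix, reusing j from the script above:

# Slice the UTF-8 bytes, not the decoded string.
html_bytes = j["document_html"].encode("utf-8")

for toks in j["document_tokens"]:
    span = html_bytes[toks["start_byte"]:toks["end_byte"]].decode("utf-8")
    print("token:    {%s}\t%d\t%d" % (toks["token"], toks["start_byte"], toks["end_byte"]))
    print(" in text: {%s}" % span)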

Running simplify_nq_data on the dev set?

Is there any guidance on how we can run simplify_nq_data? I'm trying to run the simplify script on the dev set downloaded from Google, but it doesn't seem to output anything.
