
microsoft / msmarco


website for MS Marco

Home Page: https://microsoft.github.io/msmarco/

License: Creative Commons Attribution 4.0 International

CSS 17.11% HTML 23.56% JavaScript 52.71% Python 6.62%

msmarco's Introduction

Terms and Conditions

The MS MARCO and ORCAS datasets are intended for non-commercial research purposes only to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the datasets. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets you are automatically agreeing to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the datasets will end automatically.

Please contact us at [email protected] if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/.

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.


msmarco's Issues

textually-duplicate passages in msmarco v2

I noticed that there's a sizeable number of passages in the v2 corpus that have text that exactly matches other passages: ~27.8 million passages, which amounts to around 20% of all passages in the corpus. Sometimes it's extremely prevalent, with one passage even being repeated 23,680 times [1]. [code] [file containing the duplicate passage IDs]

This is realistic, of course, since multiple documents often do contain the same passage. This is reflected in the other passage fields. I am wondering how this will affect evaluation, though. If I recall correctly, in the past NIST assessors evaluated the passage retrieval task irrespective of the context from the document. Is that the case again this year, or will the associated document also be considered? If only the passage text is considered, how will duplicates be handled?

[1] FWIW cases like this particular one (msmarco_passage_27_152452064, an advertising disclosure from Yellow Pages) are rather unlikely to be an answer to an actual question. Other exact duplicates are high-quality answers, though.
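For readers who want to reproduce this kind of count, here is a minimal sketch of exact-duplicate detection by hashing passage text. It is not the code linked in the issue. It assumes the corpus is stored as JSON lines and that each record has a `pid` field and a `passage` field; both field names, and the sample records, are assumptions for illustration.

```python
import hashlib
import json
from collections import defaultdict


def find_exact_duplicates(lines, id_field="pid", text_field="passage"):
    """Group passage IDs whose text is exactly identical.

    `lines` is any iterable of JSON strings, for example an open corpus file.
    The field names are assumptions about the record layout.
    """
    groups = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        # Hash the text so that only a fixed-size digest is kept per passage.
        digest = hashlib.sha1(record[text_field].encode("utf-8")).hexdigest()
        groups[digest].append(record[id_field])
    # Keep only the texts that occur more than once.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records, made up for this example.
sample = [
    json.dumps({"pid": "p0", "passage": "Advertising disclosure."}),
    json.dumps({"pid": "p1", "passage": "A unique answer."}),
    json.dumps({"pid": "p2", "passage": "Advertising disclosure."}),
]
duplicate_groups = find_exact_duplicates(sample)
print(duplicate_groups)
```

Each returned group lists the IDs that share one passage text, so the number of duplicated passages is the sum of the group sizes. This matches text exactly; passages that differ only in whitespace or casing would not be grouped.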

Downloading msmarco_v2_doc.tar

We're really excited that the v2 document corpus is now available! A couple of questions:

  • Since this file is pretty big, is it possible to replicate it in multiple regions? I'm seeing pretty drastic differences in download speeds depending on where the request is coming from. On the West Coast US: ~100MB/s. On the East Coast: ~10MB/s. In the UK: 2-3MB/s. Using azcopy didn't make a difference.
  • And/or could HTTP Range requests be enabled on the files, allowing downloads to recover from network interruptions without needing to start over? (I'm not super familiar with Azure, but from what I can tell, it looks like this is something that can be enabled.)
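If Range requests were enabled, a client could resume an interrupted download roughly as sketched below. This is a minimal illustration using only the Python standard library, not a tested recipe for the MS MARCO hosting. The function names are hypothetical, and the URL and destination path are left to the caller.

```python
import os
import urllib.request


def range_header(offset):
    """Build the header that asks the server to start at byte `offset`."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def resume_download(url, dest, chunk_size=1 << 20):
    """Download `url` to `dest`, continuing from a partial file if one exists.

    Assumes the server honors Range requests. If the local file is already
    complete, the server will typically answer 416 and urlopen raises HTTPError.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url, headers=range_header(offset))
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content: the range was honored, so append.
        # 200 OK: the server ignored the range and sent the whole file, so overwrite.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)


print(range_header(1024))
```

Calling `resume_download` again after a network interruption would continue from the number of bytes already on disk instead of starting over.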

Availability of previous NIST Labels

Hi,

I was wondering if the NIST labels from previous years are available? Is it still possible to evaluate models on those labels?

Also I was wondering what those labels are? (Are they manual rankings (1, 2, 3, ..) of e.g. the top 10 documents? Were the documents for labelling pre-selected or did labellers go through the entire document corpus?)

Thanks for the great work! :)
