abdulwarissherzad / document-retrieval-system

Document Retrieval System / Simple Text Retrieval System, for the Reuters-21578 dataset [SGM -> XML -> Text File]

Home Page: https://paperswithcode.com/dataset/reuters-21578

Java 100.00%
java reuters reuters-21578 reuters-connect reuters-corpus reuters-dataset reuters-text reutersnews

document-retrieval-system's Introduction

Acknowledgements

Reuters-21578 is one of the standard text collections for text categorization. It has been used widely in information retrieval, machine learning, and other corpus-based research. The collection is freely available on the Internet, and its files are in Standard Generalized Markup Language (SGML) format. SGML, defined by ISO 8879, is a metalanguage for defining markup languages for documents. It is a descendant of IBM's Generalized Markup Language (GML), created in the 1960s. As a markup language, it has a specific vocabulary (elements and attributes) and a declared syntax (a defined grammar). In 1998, the World Wide Web Consortium (W3C) published Extensible Markup Language (XML) as a Recommendation for the Internet community. XML is a profile, or subset, of SGML.
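Because XML is a subset of SGML, an SGML file can sometimes be brought into XML form with a few textual fixes. The following is a minimal sketch of that idea, not this project's actual converter. It assumes the input starts with an SGML `DOCTYPE` declaration, contains several top-level `REUTERS` elements, and may include numeric character references (such as `&#3;`) that XML 1.0 does not allow. The class name `SgmToXmlSketch`, the method `toXml`, and the wrapper element `COLLECTION` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SgmToXmlSketch {

    // Matches an SGML document type declaration such as <!DOCTYPE lewis SGML "lewis.dtd">.
    private static final Pattern DOCTYPE = Pattern.compile("<!DOCTYPE[^>]*>");

    // Matches decimal character references such as &#3; (at most 7 digits, so parseInt cannot overflow).
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops the DOCTYPE, removes character references XML forbids,
    // and wraps everything in a single (hypothetical) root element.
    static String toXml(String sgm) {
        String noDoctype = DOCTYPE.matcher(sgm).replaceAll("");
        Matcher m = NUMERIC_REF.matcher(noDoctype);
        StringBuilder body = new StringBuilder();
        int last = 0;
        while (m.find()) {
            body.append(noDoctype, last, m.start());
            int codePoint = Integer.parseInt(m.group(1));
            if (isValidXmlChar(codePoint)) {
                body.append(m.group()); // keep references that XML accepts
            }
            last = m.end();
        }
        body.append(noDoctype, last, noDoctype.length());
        return "<COLLECTION>" + body.toString().trim() + "</COLLECTION>";
    }

    public static void main(String[] args) {
        String sgm = "<!DOCTYPE lewis SGML \"lewis.dtd\">\n"
                   + "<REUTERS NEWID=\"1\"><BODY>Text&#3;</BODY></REUTERS>";
        System.out.println(toXml(sgm));
    }
}
```

Real SGML files may need further cleanup (for example, unescaped ampersands), so the output of a sketch like this should be checked with an XML parser before it is used.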

Documentation

Document Retrieval System

XML was designed to describe data and to focus on what the data is. For a number of technical reasons related to SGML, XML became the more widely accepted format for serving documents over the web.

The "Reuters-21578, Distribution 1.0" corpus consists of stories that appeared on the Reuters newswire in 1987. An earlier version of the corpus, Reuters-22173, was first used in the CONSTRUE text categorization system (Hayes & Weinstein, 1990). The newer Reuters-21578 version was introduced to fix problems such as duplicated stories and typographical errors.

The Java programming language does not have an API for parsing SGML files, but it provides several APIs for processing and writing XML. Older Java versions supported only the DOM (Document Object Model) API and the SAX (Simple API for XML) API. DOM can be used to read and write XML files, while SAX is a Java API for reading XML files sequentially. Newer Java versions include many additional features.
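To illustrate the difference between the two APIs, here is a minimal sketch that extracts story titles from a small XML string, first with DOM and then with SAX. The sample document, its element names (`COLLECTION`, `REUTERS`, `TITLE`, `BODY`), and the class and method names are assumptions made for this example; they are not taken from this project's source code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlParsingSketch {

    // Hypothetical sample input; the structure is an illustrative assumption.
    static final String SAMPLE =
        "<COLLECTION>"
        + "<REUTERS NEWID=\"1\"><TITLE>First story</TITLE><BODY>Body one.</BODY></REUTERS>"
        + "<REUTERS NEWID=\"2\"><TITLE>Second story</TITLE><BODY>Body two.</BODY></REUTERS>"
        + "</COLLECTION>";

    // DOM: load the whole document into an in-memory tree, then query the tree.
    static List<String> titlesWithDom(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("TITLE");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent());
        }
        return titles;
    }

    // SAX: read the document sequentially and react to start/end/text events.
    static List<String> titlesWithSax(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a TITLE element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("TITLE".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("TITLE".equals(qName)) {
                    titles.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesWithDom(SAMPLE)); // [First story, Second story]
        System.out.println(titlesWithSax(SAMPLE)); // [First story, Second story]
    }
}
```

DOM is convenient when the file fits in memory and needs to be modified or written back, while SAX keeps memory use low on large files because it never builds the full tree.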

Screenshots

App Screenshot 'Simple Retrieval System'

App Screenshot 'Output after selecting news'

App Screenshot 'Output Folder'

Contributing

Contributions are always welcome!

See contributing.md for ways to get started.

Please adhere to this project's code of conduct.

🚀 About Me

I'm a Java developer. I graduated in 2021 and then worked for one year at Neptune Company. Since then, I have continued to work independently 🦾🔥 on my own projects...

Authors
