
medhakant / irsystem


An advanced version of the previously implemented search engine. It acts as an information retrieval system over the Cranfield collection of 1400 documents and also uses a stemming algorithm. Everything else is much the same as in the earlier SearchEngine project.

Java 100.00%
java cranfield-collection parsing information-retrieval linked-list hashmap hashmaps information-extraction data data-structures hashset query-engine

irsystem's Introduction

Readme for the Information Retrieval System, which I named SearchEngine

PreProcessing (Part 1)

For the preprocessing step, a Java file PreProcessing.java was created. It has a split method that splits the Cranfield collection into 1400 separate documents and saves them as txt files. The second step is in the main method of the PreProcessing class, which processes the split documents by removing all non-alphabetic characters, removing all stopwords, and stemming the text. To test it yourself, simply compile and run the PreProcessing class. You will first need to create two folders, "splitdoc" and "stemmeddoc", in the same folder as the Java classes so that the output files stay organized. An output file containing the lengths of the documents is also produced.
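
The cleaning step described above can be sketched roughly as follows. This is only an illustration, not the code in PreProcessing.java: the class name PreProcessingSketch, the tiny stopword list, and the naive suffix-stripping stem method are all assumptions made for the example, and the real project may use a different stemmer and stopword list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PreProcessingSketch {
    // Tiny illustrative stopword list (assumption; the project's real list is not shown).
    static final Set<String> STOPWORDS = Set.of("a", "an", "and", "the", "of", "in", "is", "to");

    // Naive suffix stripper standing in for the project's stemmer (this is NOT the Porter algorithm).
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercase the text, drop non-alphabetic characters, remove stopwords, stem the rest.
    static List<String> clean(String text) {
        List<String> tokens = new ArrayList<>();
        // Every run of non-letters becomes a single space.
        String lettersOnly = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
        if (lettersOnly.isEmpty()) {
            return tokens;
        }
        for (String word : lettersOnly.split(" ")) {
            if (!STOPWORDS.contains(word)) {
                tokens.add(stem(word));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [wing, flow, boundary, layer]
        System.out.println(clean("The wings, and the flows of a boundary layer!"));
    }
}
```

In the actual project, the cleaned tokens for each document would be written to a file in the "stemmeddoc" folder; file handling is left out here to keep the sketch short.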

Information Retrieval (Part 2)

For this part, a Java class PageEntry was created that takes a document as input and breaks it down into words, which are stored in a HashMap that I also built myself. The IDF and the word frequency within each document are then calculated. A MySort class sorts the results so that only the top 100 are shown. To test this part, simply run the SearchEngine class and enter queries on the terminal. The output is stored in text files named query#.txt, where # is the query number starting from 0, and is also printed to the terminal. For this part too, you need to create a folder named "queryOutputs" in the same location as the Java classes; it will contain all the query output text files. Also note that you don't need to run part 1 again for part 2, but you do need to copy the "stemmeddoc" folder from part 1 to the code location of part 2.
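
As a rough illustration of the scoring and sorting described above, here is a minimal sketch. It is not the project's PageEntry or MySort code: the class RankingSketch and its method names are hypothetical, java.util.HashMap stands in for the hand-built HashMap, and the IDF variant ln(N / df) combined with raw word counts is an assumption, since the README does not state the exact formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankingSketch {
    // Count how often each word occurs in one (already stemmed) document.
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed IDF variant: ln(N / df), or 0 when no document contains the word.
    static double idf(String word, Map<String, Map<String, Integer>> index) {
        long df = index.values().stream().filter(c -> c.containsKey(word)).count();
        return df == 0 ? 0.0 : Math.log((double) index.size() / df);
    }

    // Score each document as the sum of tf * idf over the query words, then keep the top k.
    static List<String> rank(List<String> query, Map<String, Map<String, Integer>> index, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : query) {
            double wordIdf = idf(word, index);
            for (Map.Entry<String, Map<String, Integer>> doc : index.entrySet()) {
                int tf = doc.getValue().getOrDefault(word, 0);
                if (tf > 0) {
                    scores.merge(doc.getKey(), tf * wordIdf, Double::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken by document name for a stable order.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        index.put("doc1", termCounts(List.of("wing", "flow", "flow")));
        index.put("doc2", termCounts(List.of("wing", "layer")));
        index.put("doc3", termCounts(List.of("layer", "heat")));
        // The README shows the top 100 results; k = 100 mirrors that.
        System.out.println(rank(List.of("wing", "flow"), index, 100));
    }
}
```

Only documents that contain at least one query word receive a score, so the result list can be shorter than k.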
