pvnieo / searchy

Implementation of a search engine on the cacm and CS276 (Stanford) collections.

Python 10.88% Jupyter Notebook 89.12%
search-engine vector-space-model boolean-search cacm stanford-corpus python-3

searchy's Introduction

Search engine

Implementation of a search engine for a collection of files.

Installation

Searchy runs on Python >= 3.6. Use pip to install the dependencies:

pip3 install -r requirements.txt

Install the data packages required by nltk with the following command:

python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet');"

Usage

Use the searchy.py script to index a collection:

usage: searchy.py [-h] [-q QUERY] [-m {bool,vect}]
                  [-n {cos,dice,jaccard,overlap}] [-t THRESHOLD]
                  [-w {f,tfidf,nf}] [-s] [-f] [--no-cache]
                  collection

Builds a search engine on a collection of documents

positional arguments:
  collection            Path to collection file (CACM format), directory or
                        url to zip

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Execute a search query
  -m {bool,vect}, --model {bool,vect}
                        Search engine model
  -n {cos,dice,jaccard,overlap}, --norm {cos,dice,jaccard,overlap}
                        Vectorial search norm
  -t THRESHOLD, --threshold THRESHOLD
                        Vectorial search norm threshold
  -w {f,tfidf,nf}, --weighting {f,tfidf,nf}
                        Vectorial weighting method
  -s, --silent          Disable verbose mode
  -f, --force           Force re-indexing overwrite cache
  --no-cache            Disable disk cache
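The four values accepted by --norm match the names of classic similarity measures from the vector space model. As a rough illustration, here are the textbook definitions applied to sparse term-weight vectors stored as dicts. The function names and the dict representation are assumptions made for this sketch; searchy's own implementation may differ.

```python
import math


def dot(q, d):
    """Inner product of two sparse vectors (dict: term -> weight)."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def sq_norm(v):
    """Squared Euclidean norm of a sparse vector."""
    return sum(w * w for w in v.values())


def cos(q, d):
    # Cosine: dot product divided by the product of the vector lengths.
    denom = math.sqrt(sq_norm(q) * sq_norm(d))
    return dot(q, d) / denom if denom else 0.0


def dice(q, d):
    # Dice: twice the dot product over the sum of squared norms.
    denom = sq_norm(q) + sq_norm(d)
    return 2 * dot(q, d) / denom if denom else 0.0


def jaccard(q, d):
    # Jaccard: dot product over (sum of squared norms minus dot product).
    p = dot(q, d)
    denom = sq_norm(q) + sq_norm(d) - p
    return p / denom if denom else 0.0


def overlap(q, d):
    # Overlap: dot product over the smaller of the two squared norms.
    denom = min(sq_norm(q), sq_norm(d))
    return dot(q, d) / denom if denom else 0.0
```

All four return 0.0 when the denominator is zero, which covers the case of an empty query or document vector.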

Usage examples

Vector model

Queries are plain sentences. Here we search the CACM collection.

$ ./searchy.py data/CACM/cacm.all
Loading data/CACM/cacm.all
Using cache 64f76a63
  documents 	 3204
  tokens 	 113754
  terms 	 5961
memory: 0.42 mb
🔍  > Processes and Proofs of Theorems and Programs
 -----
 3079. An Algorithm for Reasoning About Equality [93.99%]
 -----
.T
An Algorithm for Reasoning About Equality
.W
A simple technique for reasoning about equalities
that is fast and complete for ground formulas
...
 -----
 3140. Social Processes and Proofs of Theorems and Programs [93.87%]
 -----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics.  Furthermore the absence
...

total results: 260     2.94 s
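The percentages in the output above are similarity scores between the query and each document. A minimal sketch of how such a ranking can be produced with tf-idf weights and cosine similarity follows. It assumes whitespace tokenization and an unsmoothed idf of log(N / df). The names tokenize, build_index and search are hypothetical and are not taken from searchy's code, which also applies nltk preprocessing that is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplification: lowercase and split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """docs: dict doc_id -> text. Returns (tf-idf vectors, idf table)."""
    tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in tfs.values() for term in tf)
    n = len(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {t: f * idf[t] for t, f in tf.items()}
        for doc_id, tf in tfs.items()
    }
    return vectors, idf


def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0


def search(query, vectors, idf, threshold=0.0):
    """Return (doc_id, score) pairs above the threshold, best first."""
    # Terms unknown to the collection get weight 0 and do not contribute.
    q = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    scored = ((doc_id, cosine(q, vec)) for doc_id, vec in vectors.items())
    hits = [(doc_id, s) for doc_id, s in scored if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

The threshold parameter plays the role suggested by the -t option: documents scoring at or below it are dropped from the result list.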

To load the Stanford collection quickly, you can download it and extract it into the dumps/pa1-data/pa1-data directory, which gives a structure like this:

dumps/pa1-data/pa1-data/0
dumps/pa1-data/pa1-data/1
...
dumps/pa1-data/pa1-data/9

Then load it with searchy:

$ ./searchy.py dumps/pa1-data

Alternatively, you can pass the URL directly as the argument, and searchy will perform the previous steps automatically.

$ ./searchy.py http://web.stanford.edu/class/cs276/pa/pa1-data.zip
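The collection argument can thus be a file, a directory, or a URL to a zip archive. One way such an argument could be handled is sketched below using only the standard library. The helpers collection_kind and fetch_zip are hypothetical names, and the destination directory is chosen by the caller; how searchy actually downloads and caches the archive is not shown in this README.

```python
import os
import tempfile
import urllib.request
import zipfile
from urllib.parse import urlparse


def collection_kind(collection):
    """Classify the argument as 'url', 'directory' or 'file'."""
    if urlparse(collection).scheme in ("http", "https"):
        return "url"
    if os.path.isdir(collection):
        return "directory"
    # Anything else is treated as a path to a single collection file.
    return "file"


def fetch_zip(url, dest_dir):
    """Download a zip archive and extract it into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    # Reserve a temporary file name for the downloaded archive.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        archive_path = tmp.name
    try:
        urllib.request.urlretrieve(url, archive_path)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
    finally:
        os.remove(archive_path)
    return dest_dir
```

With these helpers, a caller would run fetch_zip only when collection_kind returns "url", then index the extracted directory like any local one.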

Boolean model

Queries must use the following Boolean format: (word1 & word2) | ~word3. The allowed Boolean operators are & (and), | (or), and ~ (negation).

$ ./searchy.py -m bool data/CACM/cacm.all
Loading data/CACM/cacm.all
Using cache 64f76a63
  documents 	 3204
  tokens 	 113754
  terms 	 5961
memory: 0.42 mb
🔍  > processes & Proofs & theorems & programs
 -----
 3140. Social Processes and Proofs of Theorems and Programs [100.00%]
 -----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics.  Furthermore the absence
of continuity, the inevitability of change, and the complexity of
specification of significantly many real programs make the form
al verification process difficult to justify and manage.  It is felt
that ease of formal verification should not dominate program
language design.
.K
Formal mathematics, mathematical proofs,
program verification, program specification
2.10 4.6 5.24

total results: 1     2.96 s
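Boolean queries of this form can be evaluated with set operations over an inverted index. The sketch below is a small recursive-descent evaluator written for illustration. It assumes the usual precedence (~ binds tighter than &, which binds tighter than |), lowercases query terms, and takes the index as a dict mapping each term to a set of document IDs. None of this is taken from searchy's source.

```python
import re

# A token is either a single operator/parenthesis or a run of other characters.
TOKEN = re.compile(r"\s*([()&|~]|[^\s()&|~]+)")


def evaluate(query, index, all_docs):
    """Return the set of document IDs matching a Boolean query."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == "|":
            advance()
            result = result | parse_and()  # union
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            advance()
            result = result & parse_not()  # intersection
        return result

    def parse_not():
        if peek() == "~":
            advance()
            return all_docs - parse_not()  # complement
        return parse_atom()

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = advance()
        if tok == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        if tok in {")", "&", "|"}:
            raise ValueError(f"unexpected token {tok!r}")
        # Unknown terms match no documents.
        return set(index.get(tok.lower(), ()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result
```

Negation needs the full set of document IDs (all_docs) because the complement of a posting list is only defined relative to the whole collection.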

Contributors

lypnol, pvnieo