tsinghuasearchengine's Introduction

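Crawling data with Heritrix

Main work

  • Set the crawler seeds to http://news.tsinghua.edu.cn/ and http://info.tsinghua.edu.cn/
  • Set the crawl scope: reject every link unless it belongs to the tsinghua.edu.cn domain or is linked directly from a page in the tsinghua.edu.cn domain. Links that are too long, that end in certain formats (such as mpeg), or that cannot be parsed correctly are not crawled.
  • Set the storage mode to mirror, storing only the supported formats (htm, html, pdf)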
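The scope and storage rules above can be sketched as plain Java predicates. This is only an illustration of the decision logic, not Heritrix's own configuration or API: the class name `CrawlScope`, the length limit, and the single rejected suffix are assumptions (the source names `mpeg` as one example and gives no numeric limit), and the storage check assumes the format is judged by file extension.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the crawl-scope rules; not part of Heritrix. */
final class CrawlScope {
    static final String DOMAIN = "tsinghua.edu.cn";
    // Assumed values: the source gives no length limit and only "mpeg" as an example.
    static final int MAX_URL_LENGTH = 512;
    static final List<String> REJECTED_SUFFIXES = List.of(".mpeg");
    static final List<String> STORED_SUFFIXES = List.of(".htm", ".html", ".pdf");

    /** True if the host is the domain itself or one of its subdomains. */
    static boolean inDomain(String host) {
        if (host == null) return false;
        String h = host.toLowerCase(Locale.ROOT);
        return h.equals(DOMAIN) || h.endsWith("." + DOMAIN);
    }

    /** Lower-cased path of a URL, or null if the URL cannot be parsed. */
    static String pathOf(String url) {
        try {
            String path = URI.create(url).getPath();
            return path == null ? "" : path.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Decides whether a link should be crawled.
     * @param url       the candidate link
     * @param sourceUrl the page the link was found on, or null for a seed
     */
    static boolean accept(String url, String sourceUrl) {
        if (url.length() > MAX_URL_LENGTH) return false;   // too long
        String path = pathOf(url);
        if (path == null) return false;                    // cannot be parsed
        for (String suffix : REJECTED_SUFFIXES) {
            if (path.endsWith(suffix)) return false;       // excluded format
        }
        if (inDomain(URI.create(url).getHost())) return true;
        // Outside the domain: accept only if linked directly from an in-domain page.
        return sourceUrl != null && pathOf(sourceUrl) != null
                && inDomain(URI.create(sourceUrl).getHost());
    }

    /** Decides whether a fetched URL is written to the mirror (extension-based). */
    static boolean shouldStore(String url) {
        String path = pathOf(url);
        if (path == null) return false;
        for (String suffix : STORED_SUFFIXES) {
            if (path.endsWith(suffix)) return true;
        }
        return false;
    }
}
```

For example, `CrawlScope.accept("http://example.com/a.html", "http://info.tsinghua.edu.cn/")` is accepted because the link is one hop out of the domain, while the same link found on an out-of-domain page is rejected.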

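Building the search engine framework with Lucene

Main work

  • Compute the PageRank of every page from the link structure
  • Traverse all documents, tokenize them with IKAnalyzer, and build the index; each document's weight (Boost) is determined by its PageRank
  • Provide single-term search that returns the set of documents ranked highest by similarity weighted with PageRank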
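The PageRank step can be illustrated with a standard power iteration. The sketch below is not the project's actual code: the class name `PageRankSketch`, the adjacency-list input, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical power-iteration PageRank over an adjacency list. */
final class PageRankSketch {
    /**
     * @param outLinks   outLinks.get(i) holds the indices of pages that page i links to
     * @param damping    damping factor (0.85 is a conventional choice)
     * @param iterations number of power-iteration steps to run
     * @return one rank per page; the ranks sum to 1
     */
    static double[] compute(List<List<Integer>> outLinks, double damping, int iterations) {
        int n = outLinks.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                 // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;                  // rank held by pages with no out-links
            for (int i = 0; i < n; i++) {
                List<Integer> targets = outLinks.get(i);
                if (targets.isEmpty()) {
                    dangling += rank[i];
                } else {
                    double share = rank[i] / targets.size();
                    for (int t : targets) next[t] += share;
                }
            }
            for (int i = 0; i < n; i++) {
                // Random-jump term plus damped link and dangling contributions.
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

A call such as `PageRankSketch.compute(List.of(List.of(1), List.of(0), List.of(0)), 0.85, 50)` ranks page 0 highest, since both other pages link to it. How the resulting value is attached to a document as its boost depends on the Lucene version in use, which the source does not state.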

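Building the search front end with Tomcat

Main work

  • Set up the basic MyEclipse + Tomcat 7 framework
  • Handle HTTP GET requests and return the corresponding search results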
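The GET handling can be sketched as follows. To keep the example runnable without a servlet container, it uses the JDK's built-in `com.sun.net.httpserver` server as a stand-in for a Tomcat servlet; the `/search` path, the `q` parameter name, the class name, and the placeholder `search` method are all assumptions, since the source does not describe the request format.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical GET front end; stands in for a servlet deployed on Tomcat. */
final class SearchFrontEndSketch {
    /** Returns the decoded value of the assumed "q" parameter, or "" if absent. */
    static String queryTerm(String rawQuery) {
        if (rawQuery == null) return "";
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("q")) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return "";
    }

    /** Placeholder for the index lookup; returns canned data instead of real hits. */
    static List<String> search(String term) {
        return term.isEmpty() ? List.of() : List.of("result for " + term);
    }

    /** Answers GET requests with one result per line; other methods get 405. */
    static void handle(HttpExchange exchange) throws IOException {
        if (!"GET".equals(exchange.getRequestMethod())) {
            exchange.sendResponseHeaders(405, -1);   // -1: no response body
            exchange.close();
            return;
        }
        String term = queryTerm(exchange.getRequestURI().getRawQuery());
        byte[] body = String.join("\n", search(term)).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    /** Starts the stand-in server on the given port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/search", SearchFrontEndSketch::handle);
        server.start();
        return server;
    }
}
```

After `SearchFrontEndSketch.start(8080)`, a request to `http://localhost:8080/search?q=tsinghua` would return the placeholder result list. In a servlet, the same logic would live in `doGet`, with the container handling parameter decoding.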

Screenshot

[Image: screenshot of the running search engine]
