nkusearchit's Introduction

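Introduction

NkuSearchIt is a web search engine for Nankai University resources. The project uses a crawler to iteratively collect resources from Nankai's campus sites and builds a website that offers rich, personalized search services. The tech stack is Django + Elasticsearch.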

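Implementation

  1. Web crawling

    1. Fetch pages with the requests library

        response = requests.get(url, timeout=crawl_timeout, headers=headers_parameters, allow_redirects=False)  # allow_redirects=False: do not follow redirects
        response.encoding = response.apparent_encoding  # use the detected page encoding; otherwise the text is easily garbled
    2. Find more pages from the page currently being crawled

        for item in bs.find_all("a"):  # every <a> tag in the current page's HTML
            href = item.get("href")  # extract the link target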
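    3. Filter out invalid pages

      (1) Pages that cannot be accessed (status 404, 302, 301)

      (2) Off-campus URLs

      (3) Certain special URLs, such as the public security filing site

    4. Handle download links separately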

        index_suffix = href.rfind(".")  # position of the file-type suffix, if any
        if href[index_suffix + 1:] in download_suffix_list:  # a download link: store it in the separate Elasticsearch document index
            json_data_document = {"url": href, "text": item.get_text()}
            res = es.index(index="test_document", document=json_data_document)  # index the download link
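The filtering rules above can be sketched as a small set of standard-library helpers. This is only an illustration, not the project's actual code: `CAMPUS_DOMAIN`, `BLOCKED_HOSTS`, `DOWNLOAD_SUFFIXES`, and all function names are assumptions introduced here, and relative links are resolved with `urljoin` before any rule is applied.

```python
from urllib.parse import urljoin, urlparse

# All constants below are illustrative assumptions, not values from the project.
CAMPUS_DOMAIN = "nankai.edu.cn"
BLOCKED_HOSTS = {"www.beian.gov.cn"}
DOWNLOAD_SUFFIXES = {"pdf", "doc", "docx", "xls", "xlsx", "zip", "rar"}
UNREACHABLE_STATUS = {301, 302, 404}


def is_reachable(status_code):
    """Rule (1): treat 301, 302 and 404 responses as pages to skip."""
    return status_code not in UNREACHABLE_STATUS


def normalize_link(page_url, href):
    """Resolve href against the current page; return None for non-HTTP links."""
    if not href:
        return None
    parsed = urlparse(urljoin(page_url, href.strip()))
    if parsed.scheme not in ("http", "https"):
        return None
    return parsed._replace(fragment="").geturl()  # drop "#anchor" parts


def is_campus_url(url):
    """Rule (2): keep only hosts inside the campus domain."""
    host = urlparse(url).hostname or ""
    return host == CAMPUS_DOMAIN or host.endswith("." + CAMPUS_DOMAIN)


def classify_link(page_url, href):
    """Return ("skip" | "download" | "page", absolute_url_or_None)."""
    url = normalize_link(page_url, href)
    if url is None:
        return "skip", None
    host = urlparse(url).hostname or ""
    if host in BLOCKED_HOSTS or not is_campus_url(url):  # rules (3) and (2)
        return "skip", url
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    suffix = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    if suffix in DOWNLOAD_SUFFIXES:
        return "download", url
    return "page", url
```

Taking the suffix from the last path segment rather than from the whole URL avoids mistaking a dot in the host name or in the query string for a file extension.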
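  2. Text indexing

    1. Parse the HTML with the BeautifulSoup library

        html = get_html(url)  # fetch the page HTML
        bs = BeautifulSoup(html, "html.parser")  # build the parse tree
        title = bs.title.get_text()  # page title (assigned so it can be saved below)
        content = ""
        for item in bs.find_all():  # visit every tag
            content += item.get_text()  # append its text; text in nested tags is appended once per enclosing tag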
    2. Save the title and content for each URL

        json_data = {"url": url, "title": title, "content": content}
        with open(os.path.join(dirname, str(index) + ".json"), 'w', encoding="utf-8") as file:  # save url, title and content
            json.dump(json_data, file, ensure_ascii=False)
    3. Build the index

        with open(os.path.join(path, file), encoding="utf-8") as f:  # "file" is the name of a saved JSON file
            json_data = json.load(f)
            res = es.index(index="test", document=json_data)  # index the page
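As a point of comparison for the extraction step, here is a minimal sketch that collects the title and the visible text exactly once per text node, using only the standard library's `html.parser` in place of BeautifulSoup. The names `PageTextExtractor` and `build_document` are hypothetical, and skipping `<script>` and `<style>` content is an assumption about what should count as page text.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collect <title> text and visible body text, each text node once."""

    SKIPPED_TAGS = {"script", "style"}  # assumed to be non-content

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.content_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.content_parts.append(text)


def build_document(url, html):
    """Return a dict with the same keys as the JSON records saved above."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": " ".join(parser.title_parts),
        "content": " ".join(parser.content_parts),
    }
```

Because each text node is visited once, text inside nested tags is not repeated in `content`, which keeps the indexed documents smaller.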
  3. Link analysis

    1. Build the link graph as a weighted directed graph, using digraph from the pygraph library

        if url_expand not in urls_taken:  # link not visited yet
            page_rank_digraph.add_node(url_expand)  # add a node to the page-rank graph
            page_rank_digraph.add_edge((url, url_expand))  # a new node cannot have this edge yet, so add it directly
        else:  # link already visited
            if not page_rank_digraph.has_edge((url, url_expand)):  # add the edge if it does not exist
                page_rank_digraph.add_edge((url, url_expand))
            else:  # the edge exists: increase its weight by 1
                page_rank_digraph.set_edge_weight((url, url_expand), page_rank_digraph.edge_weight((url, url_expand)) + 1)
    2. Compute the page rank iteratively on the graph built in the previous step

        for i in range(self.max_iterations):
            change = 0
            for node in nodes:
                rank = 0
                for incident_page in self.graph.incidents(node):  # every page that links to this node
                    rank += self.damping_factor * (page_rank[incident_page] / len(self.graph.neighbors(incident_page)))
                rank += damping_value
                change += abs(page_rank[node] - rank)  # accumulate the absolute change
                page_rank[node] = rank
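The iteration above can be tried without pygraph using the following self-contained sketch. It represents the graph as a plain dict that maps each page to the list of pages it links to. Several things here are assumptions rather than details taken from the project: the function signature, the default `damping_factor=0.85`, the value `(1 - damping_factor) / N` for `damping_value`, and the `min_delta` stopping rule. Like the loop above, it ignores edge weights and divides each page's rank by its number of outgoing links.

```python
def page_rank(out_links, damping_factor=0.85, max_iterations=100, min_delta=1e-6):
    """Iterative PageRank over {page: [pages it links to]}.

    Every page, including pages that are only link targets, must appear
    as a key of out_links.
    """
    nodes = list(out_links)
    if not nodes:
        return {}

    # Invert the graph once so the incoming links of each node are cheap to read.
    in_links = {node: [] for node in nodes}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    damping_value = (1.0 - damping_factor) / len(nodes)  # assumed definition

    for _ in range(max_iterations):
        change = 0.0
        for node in nodes:
            new_rank = damping_value
            for source in in_links[node]:
                new_rank += damping_factor * rank[source] / len(out_links[source])
            change += abs(rank[node] - new_rank)
            rank[node] = new_rank  # updated in place, as in the loop above
        if change < min_delta:  # stop early once the ranks have settled
            break
    return rank


ranks = page_rank({"A": ["C"], "B": ["C"], "C": ["A"]})
best_page = max(ranks, key=ranks.get)  # "C", the page with the most incoming links
```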
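  4. Query service

    1. Site search

      Because the crawled links are filtered, every query is a site search. Results are ranked by a weighted combination of title and content matches: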

        res = es.search(index="test", query={"bool": {
            "should": [{"match": {'title': {"query": search_text, "boost": 2}}},
                       {"match": {'content': {"query": search_text, "boost": 1}}}],
            "minimum_should_match": "50%"}})
    2. Document search

      The URL is matched with a wildcard query and the link text with an ordinary match query

        res = es.search(index="test_document", query={"bool": {
            "should": [{"wildcard": {'url': {"wildcard": "*." + search_document_suffix, "boost": 2}}},
                       {"match": {'text': {"query": search_document_name, "boost": 1}}}],
            "minimum_should_match": "50%"}}, size=10)
    3. Phrase search

      Phrase search is implemented with match_phrase

        res = es.search(index="test", query={"bool": {
            "should": [{"match_phrase": {'title': {"query": search_phrase, "boost": 2}}},
                       {"match_phrase": {'content': {"query": search_phrase, "boost": 1}}}],
            "minimum_should_match": "50%"}}, size=10)
    4. Wildcard search

      Wildcard search is implemented with wildcard

        res = es.search(index="test", query={"bool": {
            "should": [{"wildcard": {'title': {"wildcard": search_regexp, "boost": 2}}},
                       {"wildcard": {'content': {"wildcard": search_regexp, "boost": 1}}}],
            "minimum_should_match": "50%"}}, size=10)
    5. Search log

        def add_search_log(search_text, search_type):
            with open("search_log.json", 'r', encoding="utf-8") as file:  # load the search history
                search_log = json.load(file)
            # the with statement closes the file, so no explicit close() is needed
            search_log.append({"search_text": search_text, "search_type": search_type, "time": datetime.now().strftime("%Y:%m:%d:%H:%M:%S")})
            with open("search_log.json", 'w', encoding="utf-8") as file:  # write the history back with the current search appended
                json.dump(search_log, file)
    6. Page snapshot

        def search_result_open(request):  # a new action in the Django views that serves the page snapshot
            result_title = request.POST.get('result_title')
            result_content_all = request.POST.get('result_content_all')
            params = {"result_title": result_title, "result_content_all": result_content_all}
            return render(request, 'search/search_result_open.html', {'params': params})
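The site, phrase, and wildcard searches above share one shape: a `bool` query with two `should` clauses, where the title is boosted over the content. A minimal sketch of a helper that builds this body is shown below. The name `build_bool_query` and its parameters are hypothetical, and the helper only constructs the dict, so it can be inspected without a running Elasticsearch node.

```python
def build_bool_query(clause_type, search_text, title_boost=2, content_boost=1):
    """Build the title/content bool query used by the searches above.

    clause_type is "match", "match_phrase" or "wildcard".
    """
    if clause_type not in {"match", "match_phrase", "wildcard"}:
        raise ValueError("unsupported clause type: %r" % clause_type)
    # The queries above pass the pattern under "wildcard" and the text under "query".
    value_key = "wildcard" if clause_type == "wildcard" else "query"
    return {
        "bool": {
            "should": [
                {clause_type: {"title": {value_key: search_text, "boost": title_boost}}},
                {clause_type: {"content": {value_key: search_text, "boost": content_boost}}},
            ],
            # With two should clauses, "50%" means at least one of them must match.
            "minimum_should_match": "50%",
        }
    }


phrase_query = build_bool_query("match_phrase", "search engine")
# Intended use, given an Elasticsearch client named es as in the snippets above:
# res = es.search(index="test", query=phrase_query, size=10)
```

Keeping the boosts as parameters makes it easy to experiment with how strongly title matches should outrank content matches.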
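  5. Personalization

    Pages the user opens are recorded in the user's browsing history, which is saved to a JSON file. In later searches these URLs receive a corresponding score bonus.

      def search_result_url_open(request):
          result_url = request.GET.get('result_url')  # the URL being opened
          with open("url_open_history.json", 'r', encoding="utf-8") as file:  # load the browsing history
              url_open_history = json.load(file)
          # use get() so a URL opened for the first time does not raise KeyError
          url_open_history[result_url] = url_open_history.get(result_url, 0) + 1
          with open("url_open_history.json", 'w', encoding="utf-8") as file:  # save the updated browsing history
              json.dump(url_open_history, file)
          return HttpResponseRedirect(result_url)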
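The text above does not show how the bonus is applied at query time. One possible approach, sketched here purely as an assumption, is to re-rank the hits returned by Elasticsearch by adding a capped bonus per recorded open. The function name and the `bonus_per_open` and `max_bonus` values are hypothetical; the hit layout (`_score`, `_source`) follows the standard Elasticsearch search response.

```python
def rerank_with_history(hits, url_open_history, bonus_per_open=0.5, max_bonus=3.0):
    """Return the hits sorted by score plus a capped browsing-history bonus.

    hits: list of Elasticsearch hit dicts with "_score" and "_source"["url"].
    url_open_history: {url: open_count}, as stored in url_open_history.json.
    """
    rescored = []
    for hit in hits:
        opens = url_open_history.get(hit["_source"]["url"], 0)
        bonus = min(opens * bonus_per_open, max_bonus)  # cap so history cannot dominate relevance
        rescored.append({**hit, "_score": hit["_score"] + bonus})  # copy; input hits stay unchanged
    return sorted(rescored, key=lambda h: h["_score"], reverse=True)


example_hits = [
    {"_score": 2.0, "_source": {"url": "http://www.nankai.edu.cn/a"}},
    {"_score": 1.8, "_source": {"url": "http://www.nankai.edu.cn/b"}},
]
reranked = rerank_with_history(example_hits, {"http://www.nankai.edu.cn/b": 2})
# The second page was opened twice, so it now ranks first.
```

Capping the bonus is a design choice that keeps a frequently opened page from outranking clearly more relevant results.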
  6. Web pages

    The site is built with Django and was tested by searching for related terms.

    [Screenshots of the search pages]

nkusearchit's People

Contributors

bughere
