qihoo360 / poseidon

A search engine which can hold 100 trillion lines of log data.

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 2.64%, Makefile 1.30%, Go 66.46%, Protocol Buffer 3.63%, Java 15.02%, Roff 10.95%
Topics: poseidon, search-engine, golang, big-data, map-reduce

poseidon's Introduction

Poseidon (波塞冬)

Poseidon is the god of the sea in Greek mythology; here the name stands for the master of massive volumes of data.

Poseidon is a log search platform that can quickly analyze and retrieve specific strings from hundreds of trillions of log lines totaling hundreds of PB. Qihoo 360 is a security company; when tracking APT (Advanced Persistent Threat) events, it frequently needs to search massive amounts of historical log data for specific information, for example the activity of a malicious sample within a given time window. Before Poseidon, this was done by writing Map/Reduce jobs on a Hadoop cluster; a single job took anywhere from several hours to several days, which severely limited the efficiency of APT investigations. Poseidon was built to meet this need: it can find the required data within seconds from datasets of hundreds of trillions of records, greatly improving productivity. At the same time, the data needs no extra storage and stays in the Hadoop cluster, saving a large amount of storage and compute resources. The system can be applied to retrieval over any structured or unstructured data at massive scale (from trillions to quadrillions of records).

Technologies

  • Inverted index: the core technique behind the log search engine (a conceptual sketch follows this list)
  • Hadoop: stores the raw data and the index data, and runs the Map/Reduce jobs that build the index
  • Java: the index-building Map/Reduce jobs are written in Java
  • Golang: the retrieval services are written in Go
  • Redis/Memcached: store the Meta (metadata) information
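
For readers new to the idea, here is a minimal, purely illustrative Go sketch of an inverted index. It is not Poseidon's implementation (the real index maps each token to a compressed DocId list stored on HDFS); it only shows the underlying concept of mapping tokens to the documents that contain them.

package main

import (
    "fmt"
    "strings"
)

// Toy inverted index: token -> list of document IDs that contain it.
// Poseidon's real index maps tokens to DocId lists stored as compressed
// files on HDFS; this only illustrates the concept.
type InvertedIndex map[string][]uint64

// Add tokenizes a log line (here: a simple whitespace split) and records
// the document ID under every token.
func (idx InvertedIndex) Add(docID uint64, line string) {
    for _, tok := range strings.Fields(line) {
        idx[tok] = append(idx[tok], docID)
    }
}

// Search returns the IDs of all documents containing the token,
// without scanning the raw logs.
func (idx InvertedIndex) Search(token string) []uint64 {
    return idx[token]
}

func main() {
    idx := InvertedIndex{}
    idx.Add(1, "evil.example.com GET /payload")
    idx.Add(2, "benign.example.org GET /index.html")
    fmt.Println(idx.Search("evil.example.com")) // [1]
}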

Directory Layout

builder

The data-generation tools live here:

  • doc: converts raw logs into Poseidon-format data.
  • docmeta: writes Doc-related metadata into the NoSQL store.
  • index: generates the inverted-index data from the raw logs; it runs as a Hadoop Map/Reduce job.
  • indexmeta: writes the inverted-index metadata into the NoSQL store.

common

Currently holds only the protobuf definitions used by this project.

docs

Contains the related technical documentation.

service

The individual HTTP microservices live here:

  • hdfsreader: reads a range of bytes from a given file path in HDFS (see the sketch after this list).
    • /service/hdfsreader
  • idgenerator: the global ID-generation service.
    • /service/idgenerator
  • meta: a unified HTTP interface in front of the NoSQL store that holds the Meta information.
    • /service/meta/business/doc/get : query interface for DocGzMeta
    • /service/meta/business/doc/set : update interface for DocGzMeta
    • /service/meta/business/index/get : query interface for InvertedIndexGzMeta
    • /service/meta/business/index/set : update interface for InvertedIndexGzMeta
  • searcher: the core retrieval service of the Poseidon search engine.
  • proxy: a proxy in front of searcher that also supports queries spanning time ranges.
  • allinone: to simplify deployment, bundles the idgenerator/meta/searcher/proxy microservices into a single process that exposes a unified service interface.
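
To make the service boundaries concrete, below is a hedged Go sketch of a client calling the hdfsreader service. The /read-hdfs path and the path/offset/length query parameters mirror the URL format that appears in the searcher logs quoted in the issues further down; the base address and file path are placeholders, not guaranteed defaults.

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

// readHDFS asks the hdfsreader microservice for `length` bytes starting at
// `offset` of a file stored in HDFS. The query format follows the
// read-hdfs?path=...&offset=...&length=... URLs seen in the searcher logs;
// the base address used in main() is an assumption for illustration.
func readHDFS(base, path string, offset, length int64) ([]byte, error) {
    q := url.Values{}
    q.Set("path", path)
    q.Set("offset", fmt.Sprint(offset))
    q.Set("length", fmt.Sprint(length))

    resp, err := http.Get(base + "/read-hdfs?" + q.Encode())
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func main() {
    data, err := readHDFS("http://127.0.0.1:39997",
        "/home/poseidon/src/test/index/2017-03-16/textindex/part-00792.gz", 0, 1024)
    if err != nil {
        fmt.Println("read-hdfs error:", err)
        return
    }
    fmt.Printf("got %d bytes of (compressed) index data\n", len(data))
}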

Other

  • QQ discussion group: 21557451

poseidon's People

Contributors

dunixd, guojun1992, liwei-ch, qihoo360github, zieckey

poseidon's Issues

Caused by: java.lang.VerifyError: class proto.PoseidonIf$DocIdList$Builder

Running start.sh under the index directory on Hadoop 2.6.0 fails with:
Caused by: java.lang.VerifyError: class proto.PoseidonIf$DocIdList$Builder overrides final method mergeUnknownFields.(Lcom/google/protobuf/UnknownFieldSet;)Lcom/google/protobuf/GeneratedMessage$Builder;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at proto.PoseidonIf$DocIdList.toBuilder(PoseidonIf.java:1572)
at proto.PoseidonIf$DocIdList.newBuilder(PoseidonIf.java:1564)
at InvertedIndex.ReduceGroupData$MetaData.(ReduceGroupData.java:15)
at InvertedIndex.ReduceGroupData.runGroup(ReduceGroupData.java:54)
at InvertedIndex.InvertedIndexGenerateReducer.runAsMemory(InvertedIndexGenerateReducer.java:183)
at InvertedIndex.InvertedIndexGenerateReducer.reduce(InvertedIndexGenerateReducer.java:118)
at InvertedIndex.InvertedIndexGenerateReducer.reduce(InvertedIndexGenerateReducer.java:32)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The protoc command runs fine under the current user, so protobuf was installed successfully. The code was downloaded from GitHub and built on 2017-01-16.

Problems encountered with Poseidon

The quick start indexes TAB-delimited data; I tried to index JSON-formatted data instead.
I modified dist/index-0.1/etc/test.json and changed data_format to JSON (there are probably other places that need configuring as well; documentation would be appreciated).
Running docformat and index both works, but sh test.sh xxx finds no data.

Questions:
1. How do I index JSON-formatted data?
2. How do I run on a Hadoop cluster instead of with "local_mock": true?
3. How do I do fuzzy search, and are aggregation queries supported?

Many thanks.

CommonLogParser outputs each tokenizer's result separately during parsing

Looking at lines 74-92 of CommonLogParser.java:

if (format_ == Format.JSON) {
    try {
        JSONObject json = new JSONObject(line);

        for (int i = 0; i < this.tokenParsers_.size(); i++) {
            output_set = tokenParsers_.get(i).Process(json);
            if (output_set == null || output_set.isEmpty()) {
                continue;
            }
            for (String s : output_set) {
                ok = true;
                output(s, tokenParsers_.get(i).alias(), docid, line_offset, context);
            }
        }
    } catch (Exception e) {
        //e.printStackTrace();
        System.err.println(line);
    }
}

My reading of the code is that each tokenizer processes the original input (the JSON) independently, and whatever results it produces are output directly. But given this configuration:

"tokenizer": {
      "text": [
        {
          "split": "\\|\\|"
        },
        "urlencode",
        "keyword"
      ]
    },

my understanding was that all tokenizers should be applied in sequence, with the output emitted only at the end (the sketch below contrasts the two readings).
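
For illustration only, here is a small Go sketch contrasting the two readings; the tokenizers are simplified stand-ins, not Poseidon's actual split/urlencode/keyword implementations.

package main

import (
    "fmt"
    "strings"
)

// A tokenizer step, e.g. split / urlencode / keyword in the config above.
type Tokenizer func(in []string) []string

// independent: what the Java code above appears to do — every tokenizer
// runs against the original input and contributes its own results.
func independent(input string, steps []Tokenizer) []string {
    var out []string
    for _, step := range steps {
        out = append(out, step([]string{input})...)
    }
    return out
}

// chained: what the configuration suggests — each tokenizer consumes the
// previous tokenizer's output, and only the final result is emitted.
func chained(input string, steps []Tokenizer) []string {
    cur := []string{input}
    for _, step := range steps {
        cur = step(cur)
    }
    return cur
}

func main() {
    // Hypothetical stand-ins for two tokenizer steps.
    split := func(in []string) []string {
        var out []string
        for _, s := range in {
            out = append(out, strings.Split(s, "||")...)
        }
        return out
    }
    lower := func(in []string) []string {
        var out []string
        for _, s := range in {
            out = append(out, strings.ToLower(s))
        }
        return out
    }
    steps := []Tokenizer{split, lower}
    fmt.Println(independent("Foo||Bar", steps)) // [Foo Bar foo||bar]
    fmt.Println(chained("Foo||Bar", steps))     // [foo bar]
}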

Is this the intended behavior, or is it a bug?

Bugs when using HDFS (with mock_local set to false)

1. The docformat, index, and indexmeta jars in dist/index-0.1/lib (as produced by build.sh) are:
index-0.1.jar, indexmeta-0.1.jar, docmeta-0.1.jar
but start.sh refers to:
cmd="$HADOOP jar ${current_dir}/lib/docmeta-1.0-SNAPSHOT.jar ...
cmd="$HADOOP jar ${current_dir}/lib/index-1.0-SNAPSHOT.jar ...
cmd="$HADOOP jar ${current_dir}/lib/indexmeta-1.0-SNAPSHOT.jar ...

2. In some situations start.sh fails to add the jars under dist/index-0.1/lib to HADOOP_CLASSPATH, so the hadoop jar docmeta-1.0.jar ... command fails with "no class found meta/MetaConfigured".

3. hadoop jar docmeta-1.0.jar ... reports the following error:
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:355)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:520)
at meta.MetaConfigured.run(MetaConfigured.java:97)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at meta.DocMetaConfigured.main(DocMetaConfigured.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
But the namenode is configured in etc/test.json, so I don't understand why it reports "no host".

2019/10/31 20:39:14 logto_hdfs_collector.go:754: MAIN [ERROR] LogtoHdfsCollector.copySingleFileToHdfs mkdir err, remoteDir: /home/poseidon/src/test/docid/2019-10-30, remotePath: /home/poseidon/src/test/docid/2019-10-30/0_n3_1031202219_2019-10-30-20-00.gz, retry: 180, err: exit status 1

Running /bin/bash bin/mock_start.sh 2016-12-12 fails

[root@localhost index-0.1]# /bin/bash bin/mock_start.sh 2016-12-12
/usr/jdk1.7.0_25/bin/java
index_process_base_path /root/src/github.com/Qihoo360/poseidon/dist/index-0.1
bussiness test
hdp_src /home/poseidon/src//test/2016-12-12
index_process_base_path /root/src/github.com/Qihoo360/poseidon/dist/index-0.1
java -classpath /root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jaxb-impl-2.2.3-1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jersey-guice-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jersey-client-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jackson-mapper-asl-1.9.13.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-mapreduce-client-shuffle-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/xercesImpl-2.9.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-compress-1.4.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-net-3.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/junit-3.8.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/apacheds-i18n-2.0.0-M15.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jsr305-3.0.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-collections-3.2.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-hdfs-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/api-asn1-api-1.0.0-M20.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-io-2.4.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/zookeeper-3.4.6.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/protobuf-java-3.1.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/curator-recipes-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-auth-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jsp-api-2.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jackson-core-asl-1.9.13.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-common-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/log4j-1.2.17.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/netty-3.7.0.Final.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/xml-apis-1.3.04.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-yarn-common-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-daemon-1.0.13.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/index-0.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/curator-framework-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-mapreduce-client-common-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-logging-1.1.3.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-annotations-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/httpclient-4.2.5.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-mapreduce-client-core-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/java-xmlbuilder-0.4.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/aopalliance-1.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/api-util-1.0.0-M20.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jackson-xc-1.9.13.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-yarn-api-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jaxb-api-2.2.2.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-mapreduce-client-jobclient-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0
.1/lib/snappy-java-1.0.4.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jetty-6.1.26.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jetty-util-6.1.26.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-httpclient-3.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/indexmeta-0.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-cli-1.2.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/cglib-2.2.1-v20090111.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/stax-api-1.0-2.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-yarn-server-common-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-configuration-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/servlet-api-2.5.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/javax.inject-1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/slf4j-log4j12-1.7.10.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/slf4j-api-1.7.10.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/httpcore-4.2.4.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/guice-servlet-3.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/json.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/ikanalyzer2012ff.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/leveldbjni-all-1.8.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-codec-1.6.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/activation-1.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jersey-core-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/paranamer-2.3.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/guice-3.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-yarn-client-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/asm-3.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/xz-1.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jets3t-0.9.0.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jersey-server-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jersey-json-1.9.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/docmeta-0.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/xmlenc-0.52.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/avro-1.7.4.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/hadoop-yarn-server-nodemanager-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-lang-2.6.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jackson-jaxrs-1.9.13.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/netty-all-4.0.23.Final.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/gson-2.2.4.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/htrace-core-3.1.0-incubating.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jsch-0.1.42.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jettison-1.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/curator-client-2.7.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/guava-16.0.1.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/jline-0.9.94.jar:/root/src/github.com/Qihoo360/poseidon/dist/index-0.1/lib/commons-math3-3.1.1.jar: meta.DocMetaConfigured 
/home/poseidon/src//test/docid/2016-12-12/ /home/poseidon/src//test/firstdocid/2016-12-12/ 2016-12-12 /root/src/github.com/Qihoo360/poseidon/dist/index-0.1/etc/test.json
doc meta setter map reduce test 2016-12-12 failed

I am deploying Poseidon without a Hadoop cluster; if there is no Hadoop cluster, why are Hadoop jars involved when it runs? Does a Hadoop cluster have to be deployed first?
memcached was started with /usr/local/bin/memcached -d -m 10 -u root -l 127.0.0.1 -p 11211 -c 256 -P /tmp/memcached.pid, and its address matches the meta configuration.
After running sh bin/demo.sh weibo_data.txt, no new files were generated under /home/poseidon/src/test/YYYY-MM-DD/ as the official documentation says; only log20161212002048_2016-12-12-00.txt was generated under /home/poseidon/data.
These problems have me quite stuck; I hope someone can help me find the answers. Thanks.

searcher returns no results

searcher log:
2017/03/17 20:56:57 request filter =
2017/03/17 20:56:57 http://127.0.0.1:39610/service/meta/test/index/get
2017/03/17 20:56:57 http://127.0.0.1:39610/service/meta/test/index/get
2017/03/17 20:56:57 text17031603983756

2017/03/17 20:56:57 map[text17031603983756: :]
2017/03/17 20:56:57 FetchIndex symc key map[text17031603983756: :]
2017/03/17 20:56:57 stored get key text17031603983756
2017/03/17 20:56:57 stored get key result=
2017/03/17 20:56:57 fetchIndexMeta ok, token=3607866730897298,filePath=/home/poseidon/src/test/index/2017-03-16/textindex/part-00792.gz,offset=0,length=0
2017/03/17 20:56:57 FetchIndex routine field=text token=3607866730897298 path=/home/poseidon/src/test/index/2017-03-16/textindex/part-00792.gz
2017/03/17 20:56:57 ReadZip url=http://127.0.0.1:39997/read-hdfs?path=/home/poseidon/src/test/index/2017-03-16/textindex/part-00792.gz&offset=0&length=0
2017/03/17 20:56:57 unzip err : len=0
2017/03/17 20:56:57 ReadHDFS indexData read err EOF token=3607866730897298
2017/03/17 20:56:57 FetchIndex routine field=text token=3607866730897298 docList.size=0
2017/03/17 20:56:57 SearchDocItems token 3607866730897298 FetchIndex err=read hdfs fail
2017/03/17 20:56:57 ---- 127.0.0.1:36418 handleSearch &{100 0 0 {2017-03-16 test map[text:3607866730897298]}}, err=read hdfs fail

No matter what I search for, len is 0, which makes the HDFS read fail.

I don't know where the problem is.

Summary of problems (and fixes) when integrating Poseidon with HDFS

The main problems encountered:
1. Incomplete HDFS URI, no host: hdfs:///
Fix: in MetaConfigured and InvertedIndexGenerate, change

    conf.set("fs.default.name", "hdfs:///");

to

    conf.set("fs.default.name", "hdfs://" + name_node);

2. java.io.FileNotFoundException: File does not exist: //home/poseidon/src/test/conf/2017-01-05/IKAnalyzer.cfg.xml
Fix: in MetaConfigured and InvertedIndexGenerate, change

    fs_default_name = "hdfs://" + name_node + "/";

to

    fs_default_name = "hdfs://" + name_node;

3. Error: java.lang.ClassNotFoundException: org.wltea.analyzer.core.IKSegmenter
Fix: download the IKAnalyzer2012_u6.jar package and put it on the Hadoop classpath.
4. Error: class proto.PoseidonIf$DocIdList$Builder overrides final method setUnknownFields.(Lcom/google/protobuf/UnknownFieldSet;)Lcom/google/protobuf/GeneratedMessage$Builder;
This is a protobuf compatibility problem: the cluster ships protobuf 2.5 while Poseidon uses 3.1. Put the 3.1 jar onto the MapReduce classpath in place of the old one. The classpath is configured via mapreduce.application.classpath in mapred-site.xml.
5. org.json.JSONException: JSONObject["token_filter_files"] not found.
The property is missing; add it to test.json with a JSON object as its value. (Leaving it out did not affect the test results in my testing.)
6. When querying with curl: java.io.IOException: No FileSystem for scheme: hdfs
Hadoop jars are missing, mainly the HDFS-related ones; put them into the hdfsreader-0.1/lib directory.
7. Querying with curl also reports a protobuf-related error; copy protobuf-java-3.1.jar into the hdfsreader-0.1/lib directory.
8. There are a few other errors as well, e.g. 127.0.0.1 cannot be used in the relevant configuration files, and hadoop_cmd may not work as a shortcut/alias depending on the environment.

Now for my own problem: with the officially provided test data, the first curl query very often returns nothing, and data only shows up on the second attempt. If I put multiple dates in the days condition, e.g. 2017-01-06,2017-01-09, the day attribute in the returned results is sometimes 2017-01-06 and sometimes 2016-01-07. A similar problem occurs when the text entry in keywords has multiple values. Hoping for a fix.

Offline Installation

How can Poseidon be installed without network access? Our servers cannot reach the Internet.

idgenerator flag redefined: log_dir

/src/github.com/Qihoo360/poseidon/dist/idgenerator/bin/idgenerator flag redefined: log_dir
panic: /src/github.com/Qihoo360/poseidon/dist/idgenerator/bin/idgenerator flag redefined: log_dir

goroutine 1 [running]:
panic(0x739b40, 0xc420011490)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
flag.(*FlagSet).Var(0xc4200581e0, 0x955f40, 0xc420011440, 0x7bcc29, 0x7, 0x7cdea2, 0x2f)
/usr/local/go/src/flag/flag.go:791 +0x43e
flag.(*FlagSet).StringVar(0xc4200581e0, 0xc420011440, 0x7bcc29, 0x7, 0x0, 0x0, 0x7cdea2, 0x2f)
/usr/local/go/src/flag/flag.go:694 +0x8b
flag.(*FlagSet).String(0xc4200581e0, 0x7bcc29, 0x7, 0x0, 0x0, 0x7cdea2, 0x2f, 0xc420011430)
/usr/local/go/src/flag/flag.go:707 +0x90
flag.String(0x7bcc29, 0x7, 0x0, 0x0, 0x7cdea2, 0x2f, 0xc420049f28)
/usr/local/go/src/flag/flag.go:714 +0x69
github.com/golang/glog.init()
/root/goquery/src/github.com/golang/glog/glog_file.go:41 +0x148
github.com/zieckey/simgo.init()
/root/goquery/src/github.com/zieckey/simgo/module_monitor.go:73 +0x76
main.init()
/src/github.com/Qihoo360/poseidon/service/idgenerator/main.go:41 +0x33

doc meta setter map reduce test 2017-02-17 failed

/2017-02-17/ /home/zxn/src/github.com/Qihoo360/poseidon/dist/index-0.1/home/poseidon/src//test/firstdocid/2017-02-17/ 2017-02-17 /home/zxn/src/github.com/Qihoo360/poseidon/dist/index-0.1/etc/test.json
doc meta setter map reduce test 2017-02-17 failed

executing request http://127.0.0.1:39610/service/meta/test/doc/set
executing request http://127.0.0.1:39610/service/meta/test/doc/set
executing request http://127.0.0.1:39610/service/meta/test/doc/set
executing request http://127.0.0.1:39610/service/meta/test/doc/set
executing request http://127.0.0.1:39610/service/meta/test/doc/set
executing request http://127.0.0.1:39610/service/meta/test/doc/set
cleanup
executing request http://127.0.0.1:39610/service/meta/test/doc/set

  • echo dco meta setter map reduce 'test 2017-02-16 success'
  • rm -rf /home/poseidon/src//test/conf/2017-02-16/fname_begin_docid.txt
  • mkdir -p /home/poseidon/src//test/conf/2017-02-16/
  • mv /home/poseidon/src//test/firstdocid/2017-02-16//part-r-00000 /home/poseidon/src//test/conf/2017-02-16/fname_begin_docid.txt
    mv: cannot stat '/home/poseidon/src//test/firstdocid/2017-02-16//part-r-00000': No such file or directory
  • rm -rf /home/poseidon/src//test/index/2017-02-16

The folder is wrong: the data is under /home/poseidon/src, but it reads data from .../index-0.1/home/poseidon/src/.

Problem building the document inverted index

Two formulas:

1. Map phase:
TokenId = HashId % 200

2. Reduce phase:
FileId = TokenId / 1000

The data does not match what these formulas compute (see the sketch below).
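
For reference, a minimal Go sketch that just evaluates the two formulas exactly as quoted (the constants 200 and 1000 are taken from this issue and have not been checked against the Poseidon source):

package main

import "fmt"

func main() {
    // HashId values are arbitrary examples; the second one is the token
    // hash that appears in the searcher log elsewhere on this page.
    hashIDs := []uint64{42, 3607866730897298, 123456789}
    for _, h := range hashIDs {
        tokenID := h % 200       // map phase:    TokenId = HashId % 200
        fileID := tokenID / 1000 // reduce phase: FileId  = TokenId / 1000
        fmt.Printf("HashId=%d -> TokenId=%d -> FileId=%d\n", h, tokenID, fileID)
    }
    // Note: with TokenId always in [0, 200), integer division by 1000
    // yields FileId == 0 for every token, which may be why the on-disk
    // data does not look like what these formulas predict.
}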

Visualization UI

Does Poseidon provide a visual interface for viewing information such as the progress of current tasks?

A few problems with the quick start

  1. On macOS, cp -r behaves differently than on Linux, so when the microservices under service are packaged into dist the directory structure is not preserved: there are no /bin, /conf, /log directories. My temporary workaround is to use rsync -r instead.
  2. In the serverctl scripts in the proxy and searcher directories under service, the $APP variable has a poseidon prefix, which does not match the executable name produced by go build, so serverctl start cannot find the file.

I have made the fixes in my fork; I can open a pull request if needed.
