
Comments (5)

Nicole00 commented on June 11, 2024

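The DataFrame that spark-connector returns is assembled by mapping the schema and the data onto each other in order. The schema is read separately through metaClient, so first confirm whether the order of the tag's schema as read by metaClient matches the order of the scanned data.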

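  1. First check the DataFrame schema printed in the Spark log line that starts with "dataset's schema:".
  2. You can scan the tag's data with java-client and see whether the order of the returned data matches the schema order.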
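
To make the positional mapping concrete, here is a minimal sketch in plain Scala (no Spark or Nebula dependency). The object `PositionalMapping`, the helper `assemble`, and the row values are made up for illustration; the two property orders are the ones that appear in the logs later in this thread.

```scala
// Hypothetical illustration: field names and values are paired purely by position.
object PositionalMapping {
  // Pair each schema field name with the value at the same index.
  def assemble(schema: Seq[String], row: Seq[Any]): Seq[(String, Any)] =
    schema.zip(row)

  def main(args: Array[String]): Unit = {
    // Property order reported after the fix later in this thread.
    val fixedOrder = Seq("vid", "vlength", "groupID", "isKey", "inDegree")
    // Property order reported before the fix.
    val loggedOrder = Seq("vid", "vlength", "inDegree", "groupID", "isKey")
    // Made-up row, laid out in fixedOrder.
    val row: Seq[Any] = Seq("v1", 10L, 3L, 0L, 7L)

    println(assemble(fixedOrder, row))  // each value sits under its own name
    println(assemble(loggedOrder, row)) // the last three values land under the wrong names
  }
}
```

If the schema order and the data order differ, nothing fails at this step; the values are simply labeled with the wrong column names.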

from nebula-java.

ReviveChan commented on June 11, 2024

Scanning directly from the web client should work too, right? This is the scan result:
[image: scan result]
The dataset schema log:

20/12/07 21:30:06 INFO NebulaRelation: dataset's schema: StructType(StructField(_vertexId,StringType,false), StructField(vid,StringType,true), StructField(vlength,LongType,true), StructField(inDegree,LongType,true), StructField(groupID,LongType,true), StructField(isKey,LongType,true))

So the two orders do not match. How can I adjust this?


ReviveChan commented on June 11, 2024

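I looked through the code:
https://github.com/vesoft-inc/nebula-java/blob/v1.0/tools/nebula-spark/src/main/scala/com/vesoft/nebula/tools/connector/reader/NebulaRelation.scala#L46
When the DataFrame schema is built here, the Nebula schema returned by metaClient.getTagSchema has the type Map[String, Class].
It looks like the column order could be lost there. I'm not sure whether that is the cause.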
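
A small sketch of why a map can lose the order, assuming the map behind getTagSchema is hash-based (the thread does not confirm which implementation it is). `MapOrderDemo` and `keyOrder` are hypothetical names, and `classOf[Object]` is only a placeholder value.

```scala
import java.util.{HashMap => JHashMap, LinkedHashMap => JLinkedHashMap}
import scala.collection.JavaConverters._

object MapOrderDemo {
  // Property names in the order the fixed code in this thread reports them.
  val declared: Seq[String] = Seq("vid", "vlength", "groupID", "isKey", "inDegree")

  // Insert the names into the given map and return its key iteration order.
  def keyOrder(m: java.util.Map[String, Class[_]]): Seq[String] = {
    declared.foreach(name => m.put(name, classOf[Object]))
    m.keySet.asScala.toSeq
  }

  def main(args: Array[String]): Unit = {
    // HashMap iterates in hash-bucket order, which need not match insertion order.
    println(keyOrder(new JHashMap[String, Class[_]]()))
    // LinkedHashMap iterates in insertion order.
    println(keyOrder(new JLinkedHashMap[String, Class[_]]()))
  }
}
```

An ordered sequence of column definitions, or an insertion-ordered map, avoids the problem because the iteration order is part of its contract.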


ReviveChan commented on June 11, 2024

I made a rough change, and the columns are now read in the correct order:

  /**
    * return the dataset's schema. Schema includes configured cols in returnCols or includes all properties in nebula.
    */
  def getSchema(nebulaOptions: NebulaOptions): StructType = {
    val returnColMap = nebulaOptions.getReturnColMap
    val fields: ListBuffer[StructField] = new ListBuffer[StructField]
    val metaClient = NebulaUtils.createMetaClient(nebulaOptions.getHostAndPorts, nebulaOptions)

    import scala.collection.JavaConverters._
    var nebulaSchema: Schema = null

    returnColMap.keySet.foreach(k => {
      if (DataTypeEnum.VERTEX.toString.equalsIgnoreCase(nebulaOptions.dataType)) {
        fields.append(DataTypes.createStructField("_vertexId", DataTypes.StringType, false))
        nebulaSchema = metaClient.getTag(nebulaOptions.spaceName, nebulaOptions.label)
      } else {
        fields.append(DataTypes.createStructField("_srcId", DataTypes.StringType, false))
        fields.append(DataTypes.createStructField("_dstId", DataTypes.StringType, false))
        nebulaSchema = metaClient.getEdge(nebulaOptions.spaceName, nebulaOptions.label)
      }
      if (nebulaOptions.allCols) {
        // if allCols is true, then fields should contain all properties.
        nebulaSchema.columns.asScala
          .foreach(columnDef => {
            LOG.info(s"prop name ${columnDef.getName}, type ${columnDef.getType} ")
            fields.append(
              DataTypes.createStructField(columnDef.getName,
                NebulaUtils.convertDataType(NebulaTypeUtil.supportedTypeToClass(columnDef.getType.getType)),
                true))
          })
      } else {
        // TODO: returning only the specified columns is not implemented yet
        throw new Error("to be continued")
      }
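      // NOTE: `++` returns a new collection and the result is discarded here,
      // so this line does not update labelFields.
      labelFields ++ Map(k -> fields)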
      datasetSchema = new StructType(fields.toArray)
    })
    LOG.info(s"dataset's schema: $datasetSchema")
    datasetSchema
  }

DataFrame schema order:

20/12/07 22:19:55 INFO NebulaRelation: dataset's schema: StructType(StructField(_vertexId,StringType,false), StructField(vid,StringType,true), StructField(vlength,LongType,true), StructField(groupID,LongType,true), StructField(isKey,LongType,true), StructField(inDegree,LongType,true))

I haven't worked out the version with specified columns, though.


Nicole00 commented on June 11, 2024

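"When the DataFrame schema is built here, the Nebula schema returned by metaClient.getTagSchema has the type Map[String, Class].
It looks like the column order could be lost there. I'm not sure whether that is the cause."
Yes, that is the cause. Following the tag's own schema is more accurate. For specified columns, you can use the result of metaClient.getTagSchema and let the column sequence passed in by the caller determine the order. You're welcome to open a PR.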
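
One possible shape for the specified-columns case, as a sketch only. `ReturnColOrder` and `orderedFields` are hypothetical names, and `typeByName` stands in for the name-to-type map that metaClient.getTagSchema is described as returning; converting the resulting pairs into Spark StructFields is left out.

```scala
object ReturnColOrder {
  // Look up each requested column in the name-to-type map and emit the pairs
  // in the caller's order; the map's own iteration order is never used.
  def orderedFields(returnCols: Seq[String],
                    typeByName: Map[String, Class[_]]): Seq[(String, Class[_])] =
    returnCols.map { name =>
      val tpe: Class[_] = typeByName.getOrElse(
        name,
        throw new IllegalArgumentException(s"unknown property: $name"))
      (name, tpe)
    }

  def main(args: Array[String]): Unit = {
    // Made-up schema map for illustration.
    val typeByName: Map[String, Class[_]] = Map(
      "vid"      -> classOf[String],
      "vlength"  -> classOf[java.lang.Long],
      "inDegree" -> classOf[java.lang.Long]
    )
    // The requested column list decides the field order: inDegree first, then vid.
    println(orderedFields(Seq("inDegree", "vid"), typeByName).map(_._1))
  }
}
```

Failing on an unknown column name is a design choice made for this sketch; silently skipping it would shift the remaining columns and bring back the same misalignment.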

