Facets Overview Spark

  • This code is intended to be given back to the open-source community; it is only temporarily hosted here for testing.

Google open-sourced the Facets project in 2017 (https://github.com/PAIR-code/facets) to help data scientists better understand their data sets, under the slogan "Better data leads to better models".

The Google "Facets" includes two sub-projects: Facets-overview and Facets-dive.

"Facets Overview "takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis.

From the Facets GitHub page (https://github.com/PAIR-code/facets):

Overview gives a high-level view of one or more data sets. It produces a visual feature-by-feature statistical analysis, and can also be used to compare statistics across two or more data sets. The tool can process both numeric and string features, including multiple instances of a number or string per feature.
Overview can help uncover issues with datasets, including the following:

* Unexpected feature values
* Missing feature values for a large number of examples
* Training/serving skew
* Training/test/validation set skew

Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red. Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.

Facets Overview consists of:

  • Feature Statistics Protocol Buffer
  • Feature Statistics Generation
  • Visualization

The feature statistics generation is implemented in two flavors: Python (used by Jupyter notebooks) and JavaScript (used by the web).

Overview visualization

The current implementations depend on NumPy (Python) or JavaScript, which means statistics generation is limited to a single machine or a single browser.

This project provides an additional implementation of statistics generation: we use Spark to generate the stats with distributed computing capability.

Overview visualization with Spark

Design Considerations

  • Need to find a Scala protobuf generator.
  • The Python implementation mutates state in loops; we need to rearrange it into an immutable container/collection style.
  • The NumPy implementation assumes all the data sits in an array on the current machine; we need a way to transform the data without collecting it into an array.
  • Need to find Spark equivalents for NumPy functions such as avg, mean, stddev, and histogram (see the first sketch after this list).
  • The "overview" implementation can be applied to TensorFlow records (tf.Example, tf.SequenceExample), where a feature value may be an array or an array of arrays. In this Scala + Spark implementation, TensorFlow record support leverages tensorflow/ecosystem/spark-tensorflow-connector, which loads TFRecords into a Spark DataFrame; the rest of the implementation is then no different (see the loading sketch after this list).
  • We use a DataFrame to represent each feature; this is equivalent to the feature array used in NumPy. Efficiency is not the major concern in this first version of the design; we may have to scan the data in multiple passes.
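
As a rough illustration of the NumPy-equivalents point, here is a minimal sketch of Spark stand-ins for the aggregations, assuming a single numeric column named "value" (the column name and helper functions are illustrative, not part of this project):

import org.apache.spark.sql.{DataFrame, functions => F}

// Spark stand-ins for NumPy mean/std/min/max/median over one double column.
// percentile_approx yields an approximate median without collecting the data.
def basicNumericStats(df: DataFrame): (Double, Double, Double, Double, Double) = {
  val row = df.agg(
    F.mean("value"), F.stddev("value"), F.min("value"), F.max("value"),
    F.expr("percentile_approx(value, 0.5)")
  ).head()
  (row.getDouble(0), row.getDouble(1), row.getDouble(2), row.getDouble(3), row.getDouble(4))
}

// numpy.histogram equivalent: Spark's RDD API computes (bucket edges, counts)
// in a distributed fashion.
def valueHistogram(df: DataFrame, buckets: Int): (Array[Double], Array[Long]) =
  df.select("value").rdd.map(_.getDouble(0)).histogram(buckets)

And a sketch of loading TFRecords through spark-tensorflow-connector ("tfrecords" and the recordType option are that connector's datasource name and setting; spark is a SparkSession and the path is a placeholder):

// Load tf.Example records into a DataFrame; use "SequenceExample"
// for tf.SequenceExample records.
val tfDf = spark.read
  .format("tfrecords")
  .option("recordType", "Example")
  .load("path/to/records.tfrecord")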

Data Structures

Based on the feature_statistics protobuf definitions:

  • Each dataset has a DatasetFeatureStatistics, which contains the dataset metadata (name of the dataset, number of examples) and a Seq[FeatureNameStatistics].
  • Each FeatureNameStatistics holds the stats for one feature of the given dataset: the feature name, its data type, and one of NumericStatistics, StringStatistics, or bytes_stats.
  • NumericStatistics consists of CommonStatistics plus the numeric stats (std_dev, mean, median, min, max, histogram).
  • StringStatistics consists of CommonStatistics plus the string stats (unique values, {value, frequency} pairs, top_values, avg_length, rank_histogram).
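
For illustration, a minimal ScalaPB-style sketch of this nesting (the import path depends on how feature_statistics.proto is compiled, and all values are placeholders):

// Hypothetical import path; the actual generated package depends on the
// ScalaPB configuration for feature_statistics.proto.
import featureStatistics.feature_statistics._

// Numeric stats for one feature (placeholder values).
val numStats = NumericStatistics(stdDev = 1.0, mean = 0.0, median = 0.0, min = -1.0, max = 1.0)

// A feature entry carrying its name, type, and the numeric stats.
val ageStats = FeatureNameStatistics(name = "Age")
  .withType(FeatureNameStatistics.Type.INT)
  .withNumStats(numStats)

// One dataset, then the top-level list the visualization consumes.
val dataset = DatasetFeatureStatistics(name = "train", numExamples = 100L, features = Seq(ageStats))
val proto   = DatasetFeatureStatisticsList(datasets = Seq(dataset))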

Google's Python implementation mainly uses dictionaries to hold the data structures. In this implementation, we define several additional data structures to help organize the data.

  • Each dataset is represented by a NamedDataFrame.
  • Each dataset can be split by feature, so it can also be converted to a DataEntrySet.
  • Each DataEntry represents one feature, with basic information as well as the DataFrame for that feature. Special note: featLens is only used for TensorFlow records.
  • Each feature is associated with the FeatureNameStatistics defined above; we use BasicNumStats and BasicStringStats to capture the basic statistics.

case class NamedDataFrame(name:String, data: DataFrame)

case class DataEntrySet(name     : String,
                        size     : Long,
                        entries  : Array[DataEntry])

case class DataEntry(featureName : String,
                     `type`      : ProtoDataType,
                     values      : DataFrame,
                     counts      : DataFrame,
                     missing     : Long,
                     featLens    : Option[DataFrame] = None)

case class BasicNumStats(name: String,
                         numCount  : Long = 0L,
                         numNan    : Long = 0L,
                         numZeros  : Long = 0L,
                         numPosinf : Long = 0,
                         numNeginf : Long = 0,
                         stddev    : Double = 0.0,
                         mean      : Double = 0.0,
                         min       : Double = 0.0,
                         median    : Double = 0.0,
                         max       : Double = 0.0,
                         histogram : (Array[Double], Array[Long]) )

case class BasicStringStats(name     : String,
                            numCount : Long = 0,
                            numNan   : Long = 0L)
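
For illustration, these containers can be instantiated as follows (all values are placeholders; trainDf stands for any loaded DataFrame):

val ndf = NamedDataFrame(name = "train", data = trainDf)

// Placeholder statistics for a hypothetical "Age" feature.
val ageStats = BasicNumStats(
  name      = "Age",
  numCount  = 100L,
  stddev    = 12.3,
  mean      = 40.1,
  min       = 17.0,
  median    = 39.0,
  max       = 90.0,
  histogram = (Array(17.0, 53.5, 90.0), Array(60L, 40L)) // bucket edges, bucket counts
)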

 

Main class


class FeatureStatsGenerator(datasetProto: DatasetFeatureStatisticsList) {

  import FeatureStatsGenerator._

  /**
    * Creates a feature statistics proto from a set of Spark DataFrames.
    *
    * @param dataFrames     A list of NamedDataFrames, each pairing the DataFrame
    *                       of a dataset with the name that identifies the dataset
    *                       in the proto.
    * @param features       An optional whitelist of feature names; when empty,
    *                       statistics are generated for all features.
    * @param catHistgmLevel Controls the maximum number of levels to display in
    *                       histograms for categorical features. Useful to prevent
    *                       code/ID features from bloating the stats object.
    *                       Defaults to None.
    * @return The feature statistics proto for the provided datasets.
    */
  def protoFromDataFrames(dataFrames     : List[NamedDataFrame],
                          features       : Set[String] = Set.empty[String],
                          catHistgmLevel : Option[Int] = None): DatasetFeatureStatisticsList = {
    genDatasetFeatureStats(toDataEntries(dataFrames), features, catHistgmLevel)
  }

  // ... remaining generation helpers elided ...
}


 

FeatureStatsGenerator

Usage Sample


 
  test("integration") {
    val features = Array("Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
                         "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                         "Hours per week", "Country", "Target")

    val trainData: DataFrame = loadCSVFile("src/test/resources/data/adult.data.csv")
    val testData = loadCSVFile("src/test/resources/data/adult.test.txt")

    val train = trainData.toDF(features: _*)
    val test = testData.toDF(features: _*)

    val dataframes = List(NamedDataFrame(name = "train", train),
                          NamedDataFrame(name = "test", test))

    val proto = generator.protoFromDataFrames(dataframes)
    persistProto(proto, base64Encode = false, new File("src/test/resources/data/stats.pb"))
    persistProto(proto, base64Encode = true, new File("src/test/resources/data/stats.txt"))
  }

  
  private def loadCSVFile(filePath: String) : DataFrame = {
    val spark = sqlContext.sparkSession
    spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .option("header", "false") //reading the headers
      .option("mode", "DROPMALFORMED")
      .option("inferSchema", "true")
      .load(filePath)
  }
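
Note that the fully-qualified datasource class name above can be replaced by Spark's built-in short alias; a sketch of the equivalent call:

// Equivalent shorthand for the built-in CSV datasource.
spark.read
  .option("header", "false")
  .option("mode", "DROPMALFORMED")
  .option("inferSchema", "true")
  .csv(filePath)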


  def writeToFile(fileName:String, content:String): Unit = {
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8))
  }

  private def toJson(proto: DatasetFeatureStatisticsList) : String = {
    import scalapb.json4s.JsonFormat
    JsonFormat.toJsonString(proto)
  }
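
The JSON form can be written out with the writeToFile helper above, e.g. for manual inspection (the path is a placeholder):

// Serialize the proto to JSON next to the binary outputs.
writeToFile("src/test/resources/data/stats.json", toJson(proto))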
  
  
  private[spark] def persistProto(proto: DatasetFeatureStatisticsList, base64Encode: Boolean, file: File): Unit = {
    import java.nio.file.{Files, Paths}
    if (base64Encode) {
      import java.util.Base64
      // Base64.encode already returns ASCII bytes; write them directly.
      val b = Base64.getEncoder.encode(proto.toByteArray)
      Files.write(Paths.get(file.getPath), b)
    } else {
      // Raw binary serialization of the proto.
      Files.write(Paths.get(file.getPath), proto.toByteArray)
    }
  }
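
To sanity-check a round trip, the binary file can be parsed back with the ScalaPB-generated companion object (a sketch, assuming DatasetFeatureStatisticsList provides the usual parseFrom):

import java.nio.file.{Files, Paths}

// Read the binary proto back and inspect the dataset count.
val bytes  = Files.readAllBytes(Paths.get("src/test/resources/data/stats.pb"))
val parsed = DatasetFeatureStatisticsList.parseFrom(bytes)
println(s"datasets in proto: ${parsed.datasets.size}")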

