ASN.1 Data Source for Apache Spark 2.x

A library for parsing and querying ASN.1 encoded (BER/DER) data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 2.0+

Features

This package allows reading ASN.1 encoded files from a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts several options:

  • path: location of files. As with other Spark data sources, standard Hadoop globbing expressions are accepted.
  • schemaFileType: the type of the file that contains the schema (currently asn and json files are supported).
  • schemaFilePath: the path of the file that contains the schema definition.
  • customDecoderLanguage: the language in which the custom decoder is written (currently scala and java are supported).
  • customDecoder: the fully qualified name of the user's custom decoder.
  • precisionFactor: the number of subsequent records checked to validate a split's start position; defaults to 5.
  • mainTag: the name of the main structure of the ASN.1 file; defaults to 'sequence'. (precisionFactor and mainTag are illustrated in the sketch after this list.)
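
For example, precisionFactor and mainTag are passed like any other read option. A minimal sketch, reusing the test files from the examples below (the option values shown are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-asn1-datasource")
val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
val asn1DataFrame = spark.read.format("asn1V1")
      .option("schemaFileType","asn")
      .option("schemaFilePath", "src/test/resources/simpleTypes.asn")
      .option("precisionFactor","10") // check the next 10 records when validating a split position
      .option("mainTag","sequence")   // name of the main ASN.1 structure
      .load("src/test/resources/simpleTypes.ber")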

Scala API

Schema inference is not yet supported; you must provide either the path and type of a schema definition file, or an explicit schema:

  • asn schema definition:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-asn1-datasource")
val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
val asn1DataFrame = spark.read.format("asn1V1")
      .option("schemaFileType","asn")
      .option("schemaFilePath", "src/test/resources/simpleTypes.asn")
      .load("src/test/resources/simpleTypes.ber")
  • json schema definition:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-asn1-datasource")
val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
val asn1DataFrame = spark.read.format("asn1V1")
      .option("schemaFileType","json")
      .option("schemaFilePath", "src/test/resources/simpleTypes.json")
      .load("src/test/resources/simpleTypes.ber")
  • explicit schema definition: You can manually specify the schema when reading data:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val conf = new SparkConf().setAppName("spark-asn1-datasource")
val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
val schema = StructType(
      StructField("recordNumber", IntegerType, false) ::
        StructField("callingNumber", StringType, true) ::
        StructField("calledNumber", StringType, true) ::
        StructField("startDate", StringType, true) ::
        StructField("startTime", StringType, true) ::
        StructField("duration", IntegerType, true) :: Nil
    )

val asn1DataFrame = spark.read.format("asn1V1")
      .schema(schema)
      .load("src/test/resources/simpleTypes.ber")

You can use your own decoding logic: extend the ScalaDecoder trait and implement decode, which takes an encoded record and a schema, decodes the record, and returns it as a sequence of values:

package customDecoding

import org.apache.hadoop.io.Text
import org.apache.spark.sql.types.StructType

object CustomScalaDecoder extends ScalaDecoder {
  override def decode(record: Text, schema: StructType): Seq[Any] = {
    // your own decoding logic
    null
  }
}
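
For illustration, here is a hypothetical decoder body (SkeletonScalaDecoder is an invented name) showing the expected contract: the returned sequence holds one value per schema field, in order. Real logic would decode the record's raw bytes against the ASN.1 definition:

package customDecoding

import org.apache.hadoop.io.Text
import org.apache.spark.sql.types.StructType

object SkeletonScalaDecoder extends ScalaDecoder {
  override def decode(record: Text, schema: StructType): Seq[Any] = {
    val bytes = record.copyBytes() // raw encoded record, to be parsed field by field
    // Placeholder: emit one null per column so the row shape matches the schema.
    schema.fields.map(_ => null: Any).toSeq
  }
}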

After creating your custom decoder, use the customDecoder option to integrate it:

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-asn1-datasource")
val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
val asn1DataFrame = spark.read.format("asn1V1")
      .option("schemaFileType","asn")
      .option("schemaFilePath", "src/test/resources/simpleTypes.asn")
      .option("customDecoder","customDecoding.CustomScalaDecoder")
      .option("customDecoderLanguage","scala")
      .load("src/test/resources/simpleTypes.ber")

Hadoop InputFormat

The library also contains a Hadoop InputFormat for ASN.1 encoded files, which you can use directly as follows:

import org.apache.spark.sql.SparkSession
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import hadoopIO.AsnInputFormat

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val conf: Configuration = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("precisionFactor","5")
val records = spark.sparkContext
                       .newAPIHadoopFile(
                       "src/test/resources/simpleTypes.ber", 
                       classOf[AsnInputFormat], 
                       classOf[LongWritable], 
                       classOf[Text], 
                       conf)
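
The result is an RDD of (LongWritable, Text) pairs: the byte offset of each record and its raw encoded content. A minimal sketch of consuming it:

// records: RDD[(LongWritable, Text)]
records.take(5).foreach { case (offset, raw) =>
  println(s"record at offset ${offset.get} is ${raw.getLength} bytes long")
}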

Issues

Splitting logic: handle all ASN.1 types (tagged/untagged)

The current version only handles files based on sequence or set types, and the splitting logic assumes a two-byte separator (the first byte for the tag number, the second for the length), which is not always the case.

  1. Modify the Jasn1 class generation method to add methods that return the needed parameters:
  • isTagged: returns whether the main type is tagged
  • getInternalTags: returns the list of internal tags (as with the choice type)
  2. Integrate these parameters into the splitting logic.

Nested schema handling

The current version only handles non-nested schemas created from the ASN.1 definition file using the Jasn1 BerClassWriter class; this class should be extended to generate nested schemas.
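
For context, a nested schema in Spark is a StructType whose fields may themselves be StructTypes. A hypothetical example (field names invented for illustration) of the shape BerClassWriter would need to produce:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val nestedSchema = StructType(
      StructField("recordNumber", IntegerType, false) ::
        StructField("caller", StructType(
          StructField("number", StringType, true) ::
            StructField("countryCode", StringType, true) :: Nil
        ), true) :: Nil
    )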
