fixed-width's People

Contributors

dk1844, miroslavpojer, zejnilovic

fixed-width's Issues

Improve code-coverage & add GH check action

Background

Improve or add code coverage support to be able to measure code quality.

Feature

New ability to measure current code coverage as one of the QA metrics.
Add a GH action to check the coverage of changed files.

Leverage the `columnNameOfCorruptRecord` existence in case of parsing errors

Background

If a special column, whose name is specified by the Spark setting spark.sql.columnNameOfCorruptRecord, exists in the schema, it is used to record parsing errors.

Feature

If a column with the name specified in spark.sql.columnNameOfCorruptRecord exists in the schema, use it to log parsing errors.

Related to and dependent on #1
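
A minimal sketch of the requested behaviour, in plain Python rather than the project's own code. The function `parse_line` and the `(name, width, converter)` field layout are hypothetical. Storing the raw line in the corrupt-record column is an assumption about what "logging the parsing error" should mean here; `_corrupt_record` is Spark's default value for spark.sql.columnNameOfCorruptRecord.

```python
def parse_line(line, fields, corrupt_column=None):
    """Split a fixed-width line into a row dict.

    fields is a list of (name, width, converter) tuples (hypothetical layout).
    If corrupt_column is given, it receives the raw line when any field
    fails to convert, and None otherwise.
    """
    row = {}
    failed = False
    offset = 0
    for name, width, convert in fields:
        raw = line[offset:offset + width]
        offset += width
        try:
            row[name] = convert(raw)
        except ValueError:
            # Keep going: the field becomes null and the row is flagged.
            row[name] = None
            failed = True
    if corrupt_column is not None:
        # Assumption: the column holds the unparsed input, not a message.
        row[corrupt_column] = line if failed else None
    return row
```

When `corrupt_column` is left as `None`, the schema is assumed not to contain the column and nothing extra is added to the row.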

'width' metadata is not recognized if integer

Describe the bug

If the value of the compulsory width metadata is not a string, it is not recognized. An integer value should be accepted too.

To Reproduce

Define a schema where the width metadata value of a field is an integer, not a string (e.g. 5 instead of "5"). Processing will fail with an exception: "Unable to parse metadata: width of column:..."

Expected behaviour

Integer definition of width is accepted.
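
The expected behaviour can be sketched as follows, in plain Python rather than the library's own code. The function name `parse_width`, the dict-based metadata and the error messages are hypothetical; only the `width` key comes from the report.

```python
def parse_width(column_name, metadata):
    """Return the column width as an int, accepting 5 as well as "5"."""
    raw = metadata.get("width")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(raw, int) and not isinstance(raw, bool):
        width = raw
    elif isinstance(raw, str):
        try:
            width = int(raw.strip())
        except ValueError:
            raise ValueError(
                f"Cannot read width of column {column_name!r}: {raw!r}"
            ) from None
    else:
        raise ValueError(f"Cannot read width of column {column_name!r}: {raw!r}")
    if width <= 0:
        raise ValueError(f"Width of column {column_name!r} must be positive")
    return width
```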

Detect unrecognized data source options

Background

Options are passed to Spark data sources as key-value pairs. It is very easy to make a typo in an option name. If the data source has a default behaviour for the intended option (e.g. the option is not mandatory), the misspelled option is not recognized and is silently ignored.

It would be very useful if the data source detected unrecognized options.

Feature

  1. Detect unrecognized options and log them as warnings.
  2. Add .option("pedantic", true), which, when enabled, will fail the Spark Application if there is at least one unrecognized option passed to the source.

Additional context

As inspiration for a possible implementation, Cobrix uses a wrapper that takes the incoming options. All configuration is read from that wrapper instead of the original configuration. The wrapper records which options are queried at least once. At any given point it allows getting the list of options that haven't been used.
An advantage of such an implementation is that dependencies between options, e.g. a particular option that is only used if some other option is enabled, are tracked automatically.

https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/parameters/Parameters.scala
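
The wrapper idea described above can be sketched as follows. This is a simplified Python illustration, not the Cobrix code linked above; the names `TrackedOptions` and `check_unused` are hypothetical, and only the `pedantic` option comes from the proposal.

```python
import logging

logger = logging.getLogger("fixed_width_options")


class TrackedOptions:
    """Wraps reader options and remembers which keys were queried."""

    def __init__(self, options):
        self._options = dict(options)
        self._used = set()

    def get(self, key, default=None):
        # Every lookup counts as "used", even if the key is absent.
        self._used.add(key)
        return self._options.get(key, default)

    def unused(self):
        """Return the passed option names that were never queried."""
        return sorted(k for k in self._options if k not in self._used)


def check_unused(options):
    """Warn about options never queried; raise instead when pedantic is on."""
    # Query pedantic first so that it is itself marked as used.
    pedantic = str(options.get("pedantic", "false")).lower() == "true"
    unused = options.unused()
    if unused and pedantic:
        raise ValueError("Unrecognized options: " + ", ".join(unused))
    for key in unused:
        logger.warning("Unrecognized option ignored: %s", key)
    return unused
```

`check_unused` would be called once the reader has finished reading its configuration, so that every option the reader actually understands has already been queried through `get`.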

Add timestampFormat to enable parsing both date and timestamp

Background

When dateFormat is specified, it represents either a date or a date + time. That effectively makes it impossible to have both timestamp and date columns within the same file.

Feature

Add a timestampFormat parameter, used similarly to dateFormat, and make each of them apply exclusively to its own type: timestampFormat to timestamps and dateFormat to dates.
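
A minimal sketch of the proposed split, in plain Python. The function `parse_temporal` and the `"date"` / `"timestamp"` type labels are hypothetical, and the patterns use Python `strptime` directives rather than the pattern syntax Spark expects.

```python
from datetime import datetime


def parse_temporal(raw, target_type, options):
    """Parse raw with dateFormat for dates and timestampFormat for timestamps."""
    if target_type == "date":
        return datetime.strptime(raw, options["dateFormat"]).date()
    if target_type == "timestamp":
        return datetime.strptime(raw, options["timestampFormat"])
    raise ValueError(f"Unsupported temporal type: {target_type}")
```

With both options set, a date column and a timestamp column in the same file can each use their own pattern.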

Reflect the Spark `mode` setting in the reader

Background

Spark has the mode setting to describe the reader's behavior in case of a parsing error.

Feature

Reflect the PERMISSIVE and FAILFAST settings in case of parsing errors.

  • FAILFAST - in case of a parsing error, the process immediately fails with an exception
  • PERMISSIVE - parsing continues with best-effort results
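
The two modes listed above can be sketched as follows, in plain Python. The function `parse_int_column` is hypothetical, and treating "best effort" as "replace the unparsable value with null" is an assumption made for this illustration.

```python
def parse_int_column(values, mode="PERMISSIVE"):
    """Parse strings to ints, honouring FAILFAST or PERMISSIVE."""
    mode = mode.upper()
    if mode not in ("PERMISSIVE", "FAILFAST"):
        raise ValueError(f"Unsupported mode: {mode}")
    parsed = []
    for raw in values:
        try:
            parsed.append(int(raw))
        except ValueError:
            if mode == "FAILFAST":
                # Stop at the first bad value and surface the error.
                raise
            # PERMISSIVE: assumed here to mean a null in place of the value.
            parsed.append(None)
    return parsed
```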

Add code coverage support

Background

Add code coverage support to be able to measure code quality.

Feature

New ability to measure current code coverage as one of the QA metrics.

Create a setting via metadata to drive the trimming of the column value

Background

Values in fixed-width format columns usually have some trailing spaces. That can prevent parsing of non-string types.

Feature

Add code that accepts a metadata setting controlling the trimming of column values, overriding the global setting.

Proposed Solution

Suggested metadata field name: trim.
Accepted type: boolean, or a string convertible to boolean.
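
A minimal sketch of the proposed override, in plain Python. The helper names (`to_bool`, `should_trim`, `read_value`) and the dict-based metadata are hypothetical; only the `trim` field name and its accepted types come from the proposal. Which strings count as convertible to boolean is an assumption: here only "true" and "false", case-insensitively.

```python
def to_bool(value):
    """Accept a boolean, or the strings "true"/"false" in any case."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "false"):
            return lowered == "true"
    raise ValueError(f"Cannot convert {value!r} to boolean")


def should_trim(metadata, global_trim):
    """Column-level trim metadata wins over the global setting."""
    if "trim" in metadata:
        return to_bool(metadata["trim"])
    return global_trim


def read_value(raw, metadata, global_trim=False):
    """Return the raw column value, trimmed if the settings ask for it."""
    return raw.strip() if should_trim(metadata, global_trim) else raw
```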
