quux00 / hive-json-schema

Tool to generate a Hive schema from a JSON example doc

Java 100.00%

hive-json-schema's Issues

License

This looks like a great package, but would you consider adding a licence to it? (https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/adding-a-license-to-a-repository)

Thanks.

IllegalStateException on json with empty arrays

Hi @quux00, thank you for the tool, it's very handy!

So, I tried to generate a Hive schema from a JSON document containing an empty array, but it throws an exception:

Exception in thread "main" java.lang.IllegalStateException: Array is empty: []
    at net.thornydev.JsonHiveSchema.arrayJoin(JsonHiveSchema.java:136)
    at net.thornydev.JsonHiveSchema.toHiveSchema(JsonHiveSchema.java:129)
    at net.thornydev.JsonHiveSchema.valueToHiveSchema(JsonHiveSchema.java:179)
    at net.thornydev.JsonHiveSchema.createHiveSchema(JsonHiveSchema.java:103)
    at net.thornydev.JsonHiveSchema.main(JsonHiveSchema.java:66)

I read your recommendation to use a doc with a single entry in each array. Does that mean the tool doesn't support union types yet? I do understand that support for these fields is incomplete and that they can only be used in SELECT clauses.

Wrong classpath in README

The README contains execution examples such as:
java -cp target/json-hive-schema-1.0.jar net.thorndev.JsonHiveSchema file.json

This won't work, because the correct fully qualified class name is net.thornydev.JsonHiveSchema.
A 'y' was likely omitted by a typo.

So the correct examples should look like:
java -cp target/json-hive-schema-1.0.jar net.thornydev.JsonHiveSchema file.json

Impala compatibility

Hi!

I'm trying to use the Hive table on Impala, but I can't find a way to make Impala understand the "ADD JAR" command.

Is it possible to use a Hive table created with this JSON serialization with Impala?

Regards,
André

Order of the columns is not the same as in input.json

I used the JSON below to generate the DDL.

{
  "business_id": "String",
  "name": "String",
  "neighborhood": "String",
  "address": "String",
  "city": "String",
  "state": "String",
  "postal_code": 12345,
  "latitude": 124124124124,
  "longitude": -111.936102,
  "stars": 4.5,
  "review_count": 17,
  "is_open": 0,
  "attributes": [{
    "BikeParking": true,
    "BusinessAcceptsBitcoin": false,
    "BusinessAcceptsCreditCards": false,
    "BusinessParking": {
      "street": false,
      "validated": false,
      "lot": true,
      "valet": false
    },
    "DogsAllowed": false,
    "RestaurantsPriceRange2": 2,
    "WheelchairAccessible": true
  }],
  "categories": [
    "Tobacco Shops",
    "Nightlife",
    "Vape Shops",
    "Shopping"
  ],
  "hours": [
    "Monday 11:0-21:0",
    "Tuesday 11:0-21:0",
    "Wednesday 11:0-21:0",
    "Thursday 11:0-21:0",
    "Friday 11:0-22:0",
    "Saturday 10:0-22:0",
    "Sunday 11:0-18:0"
  ],
  "type": "business"
}

It created the following:

CREATE TABLE my_table_name (
address string,
attributes array<struct<bikeparking:boolean, businessacceptsbitcoin:boolean, businessacceptscreditcards:boolean, businessparking:struct<lot:boolean, street:boolean, valet:boolean, validated:boolean>, dogsallowed:boolean, restaurantspricerange2:int, wheelchairaccessible:boolean>>,
business_id string,
categories array<string>,
city string,
hours array<string>,
is_open int,
latitude int,
longitude double,
name string,
neighborhood string,
postal_code int,
review_count int,
stars double,
state string,
type string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

I want the columns to be generated in the same order as in the JSON, but in the output the columns are sorted alphabetically.
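
A minimal sketch of the difference, assuming the alphabetical order comes from collecting the keys in a sorted structure (not confirmed against the JsonHiveSchema source). A `LinkedHashMap` iterates in insertion order, while a `TreeMap` iterates in sorted key order. `ColumnOrderDemo` and its methods are hypothetical names used only for illustration, and this only helps if the JSON parser itself preserves key order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of hive-json-schema.
final class ColumnOrderDemo {
    // Renders "name type" pairs in the iteration order of the given map.
    static String render(Map<String, String> columns) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Map.Entry<String, String> e : columns.entrySet()) {
            joiner.add(e.getKey() + " " + e.getValue());
        }
        return joiner.toString();
    }

    // The first four columns, in the order they appear in the sample JSON.
    static Map<String, String> sampleColumns() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("business_id", "string");
        columns.put("name", "string");
        columns.put("neighborhood", "string");
        columns.put("address", "string");
        return columns;
    }
}
```

Passing `sampleColumns()` to `render` keeps the JSON order, while wrapping it in `new java.util.TreeMap<>(...)` first gives the alphabetical order seen in the generated DDL.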

Support multiple JSONs

I find your project interesting. However, I think it lacks a key feature: the ability to deduce a schema from multiple JSON documents, one per line, by computing the "greatest common denominator" of all of them.

This removes a layer of human intervention (putting all the keys in one document). For the implementation details, you can check out this project:

https://github.com/strelec/hive-serde-gen
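
A minimal sketch of such a merge, assuming each document has already been reduced to a flat map of column name to Hive type. It takes the union of all keys and, where two documents disagree, widens `int` to `bigint` to `double`, falling back to `string` for anything else. The class name, the flat-map representation, and the widening rules are assumptions for illustration; they are not taken from hive-serde-gen or from this tool.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of hive-json-schema or hive-serde-gen.
final class SchemaMerger {
    // Numeric types from narrowest to widest; an assumed widening order.
    private static final List<String> NUMERIC = Arrays.asList("int", "bigint", "double");

    // Picks a type able to hold values of both input types.
    static String widen(String a, String b) {
        if (a.equals(b)) {
            return a;
        }
        int ia = NUMERIC.indexOf(a);
        int ib = NUMERIC.indexOf(b);
        if (ia >= 0 && ib >= 0) {
            return NUMERIC.get(Math.max(ia, ib));
        }
        return "string"; // assumed fallback for incompatible types
    }

    // Merges flat column->type maps: union of keys, widened types on conflict.
    static Map<String, String> merge(List<Map<String, String>> schemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : schemas) {
            for (Map.Entry<String, String> e : schema.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), SchemaMerger::widen);
            }
        }
        return merged;
    }
}
```

Nested structs and arrays would need the same merge applied recursively, which this sketch leaves out.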

generating jars not working

Hi, I first reported this in a comment on an existing issue, but it did not seem relevant there, so I created a new one. I am trying to build the jar files, but the build gets stuck at the download phase, as shown below. Can you please help?

[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for net.thornydev:json-hive-schema:jar:1.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 13, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] -------------------< net.thornydev:json-hive-schema >-------------------
[INFO] Building json-hive-schema 1.0
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexus: http://nexus.elsst.com/content/groups/public/org/apache/maven/plugins/maven-assembly-plugin/2.4/maven-assembly-plugin-2.4.jar
Progress (1): 25/226 kB

Hive creation failed

Hi,
This is a great tool for simplifying JSON records into a Hive structure.
I have built the jar files and used them to generate the Hive DDL.
But the CREATE TABLE statement is failing in Hive.
Can you please help resolve the issue?

sample data record:
{"country":"uk","state":"ny","city":"fr","street":"nyk","zip":"1009","data":[{"country_code":"uk","state_set":"ny","city_code":"fr","street_code":"nyk","zip_code":"1009"}]}

user@ubuntu:~/lab/programs$ java -jar json-hive-schema-1.0-jar-with-dependencies.jar /home/user/Documents/json_cntry_schema.json
CREATE TABLE x (
city string,
country string,
data array<struct<city_code:string, country_code:string, state_set:string, street_code:string, zip_code:string>>,
state string,
street string,
zip string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

hive> CREATE external TABLE jcountry (
> city string,
> country string,
> data array<struct<city_code:string, country_code:string, state_set:string, street_code:string, zip_code:string>>,
> state string,
> street string,
> zip string)
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/user/data1/jason/';
FAILED: Parse Error: line 4:2 mismatched input 'data' expecting Identifier near ',' in column specification

Thanks,
Pushpa

Only first item examined?

It seems as if only the first item in an array is examined?

If the two elements of "wobble" in the example are swapped, "details2" is not generated.

auto generated schema with reserved name: timestamp

I changed timestamp to time_stamp to work around the problem.

CREATE TABLE visits (
time_stamp string,
user_info struct<app_key:string, device_id:string, user_id:string>,
visit struct<end_time:string, event_type:string, id:string, is_confirmed:boolean, is_ongoing:boolean, place:struct<estimated_address:struct<cc:string, city:string, country:string, formatted_address:string, formatted_city:string, postal_code:string, state:string, street_address:string>, estimated_geolocation:struct<accuracy:double, lat:double, long:double>, first_visit_time:string, id:string, last_visit_time:string, type:string>, place_id:string, start_time:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

The attribute name is unique in entire json.

Hi, if an object or attribute name is duplicated in different objects, such as within an array or a struct, then it is not working.
Expected: the names within a single object should be unique, but not across the entire JSON.

Made a simple web app for this

It saves you the time of running a Java runtime and installing dependencies.

https://json-to-hive-schema-convertor.herokuapp.com/

keys with whitespace or punctuation

Hi, thanks for making this, it is saving me some trouble.

I did notice that when reading keys with whitespace, the output schema syntax is invalid.

Here's an example:

{"name": "sometext",
"stuff": {"white space":false},
"white space": "blah blah"
}

The resulting schema output will be invalid syntax due to the spaces:

CREATE TABLE x (
  name string,
  stuff struct<white space:boolean>,
  white space string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

Hive does support column names with whitespace (and, in fact, any Unicode character).

The simple solution is to enclose the column names in backticks.
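
A minimal sketch of that quoting step, assuming the generator builds each column name as a plain string. `HiveIdentifiers.quote` is a hypothetical helper, not an existing method in the tool. It doubles any literal backtick, which as far as I know is how Hive escapes a backtick inside a quoted identifier.

```java
// Hypothetical helper, not part of hive-json-schema.
final class HiveIdentifiers {
    // Wraps a column name in backticks; a literal backtick inside the
    // name is doubled so it cannot terminate the quoted identifier.
    static String quote(String name) {
        return "`" + name.replace("`", "``") + "`";
    }
}
```

With this, `white space` would be emitted as `` `white space` ``. Whether field names inside `struct<...>` accept the same quoting may depend on the Hive version, so that part is left untested here.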

Use bigint for Long values

The generated table definition uses 'int' data types for values that are longs.
The 'curated' JSON document used Long.MAX_VALUE as the field value.

The scalarNumericType() method should probably attempt to convert the value to an int and, if that fails, return bigint instead of int.
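
A minimal sketch of that fallback, assuming the numeric value is available as its text form. `NumericTypes.hiveTypeOf` is a hypothetical stand-in and does not reproduce the tool's actual scalarNumericType() signature.

```java
// Hypothetical helper, not the tool's actual scalarNumericType().
final class NumericTypes {
    // Returns "int" if the text fits a Java int, "bigint" if it fits a
    // long, and "double" for anything else that parses as a number.
    static String hiveTypeOf(String number) {
        try {
            Integer.parseInt(number);
            return "int";
        } catch (NumberFormatException notAnInt) {
            // fall through to the wider types
        }
        try {
            Long.parseLong(number);
            return "bigint";
        } catch (NumberFormatException notALong) {
            // Throws NumberFormatException if the text is not numeric at all.
            Double.parseDouble(number);
            return "double";
        }
    }
}
```

Under this rule a value such as Long.MAX_VALUE, or 124124124124, maps to bigint rather than int.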

Ability to read stdin

Would like to be able to pass "-" instead of a filename to read from stdin.
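
A minimal sketch of how that could look, assuming the tool reads the whole document into a string before parsing. `JsonInput.read` is a hypothetical helper; the stream is passed in as a parameter so the caller can hand it `System.in`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of hive-json-schema.
final class JsonInput {
    // Reads the whole document from stdin when arg is "-",
    // otherwise from the file named by arg.
    static String read(String arg, InputStream stdin) throws IOException {
        if ("-".equals(arg)) {
            return new String(readAll(stdin), StandardCharsets.UTF_8);
        }
        return new String(Files.readAllBytes(Paths.get(arg)), StandardCharsets.UTF_8);
    }

    // Drains the stream into a byte array (works on Java 8 and later).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

A main method could then call `JsonInput.read(args[0], System.in)`, so that piping a file in and passing "-" behaves the same as naming the file.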
