quux00 / hive-json-schema
Tool to generate a Hive schema from a JSON example doc
This looks like a great package, but would you consider adding a licence to it? (https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/adding-a-license-to-a-repository)
Thanks.
Hi @quux00, thank you for the tool, it's very handy!
I tried to generate a Hive schema from a JSON document containing an empty array, but it throws an exception:
Exception in thread "main" java.lang.IllegalStateException: Array is empty: []
    at net.thornydev.JsonHiveSchema.arrayJoin(JsonHiveSchema.java:136)
    at net.thornydev.JsonHiveSchema.toHiveSchema(JsonHiveSchema.java:129)
    at net.thornydev.JsonHiveSchema.valueToHiveSchema(JsonHiveSchema.java:179)
    at net.thornydev.JsonHiveSchema.createHiveSchema(JsonHiveSchema.java:103)
    at net.thornydev.JsonHiveSchema.main(JsonHiveSchema.java:66)
I read your recommendation to use a document with a single entry in each array. Does that mean the tool doesn't support union types yet? As far as I understand, Hive's support for these fields is incomplete and they can only be used in SELECT clauses.
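One way the empty-array case could be handled is to fall back to a placeholder element type instead of throwing. The sketch below is illustrative only: the class `EmptyArraySketch`, its methods, and the choice of `string` as the fallback are assumptions, not the tool's actual code.

```java
import java.util.List;

// Hypothetical helper, not part of JsonHiveSchema.
final class EmptyArraySketch {
    // Assumed fallback element type when the sample array has no elements.
    static final String FALLBACK = "string";

    // Derives a Hive array type from the first sample element,
    // or uses the fallback when there is nothing to inspect.
    static String arrayType(List<?> elements) {
        if (elements.isEmpty()) {
            return "array<" + FALLBACK + ">";
        }
        return "array<" + scalarType(elements.get(0)) + ">";
    }

    // Maps a Java scalar value to a Hive type name.
    static String scalarType(Object value) {
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "bigint";
        if (value instanceof Double) return "double";
        return "string";
    }
}
```

Whether a silent fallback is preferable to an explicit error is a design choice; a warning on stderr alongside the fallback may be a reasonable middle ground.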
The README contains execution examples such as:
java -cp target/json-hive-schema-1.0.jar net.thorndev.JsonHiveSchema file.json
This won't work, because the correct class name is net.thornydev.JsonHiveSchema.
A 'y' was likely omitted by a typo.
So the correct example should be:
java -cp target/json-hive-schema-1.0.jar net.thornydev.JsonHiveSchema file.json
Hi!
I'm trying to use the Hive table on Impala, but I can't find a way to make Impala understand the "ADD JAR" command.
Is it possible to use a Hive table created with this JSON serialization from Impala?
Regards,
André
I used the JSON below to generate the DDL.
{
"business_id": "String",
"name": "String",
"neighborhood": "String",
"address": "String",
"city": "String",
"state": "String",
"postal_code": 12345,
"latitude": 124124124124,
"longitude": -111.936102,
"stars": 4.5,
"review_count": 17,
"is_open": 0,
"attributes": [{
"BikeParking": true,
"BusinessAcceptsBitcoin": false,
"BusinessAcceptsCreditCards": false,
"BusinessParking": {
"street": false,
"validated": false,
"lot": true,
"valet": false
},
"DogsAllowed": false,
"RestaurantsPriceRange2": 2,
"WheelchairAccessible": true
}],
"categories": [
"Tobacco Shops",
"Nightlife",
"Vape Shops",
"Shopping"
],
"hours": [
"Monday 11:0-21:0",
"Tuesday 11:0-21:0",
"Wednesday 11:0-21:0",
"Thursday 11:0-21:0",
"Friday 11:0-22:0",
"Saturday 10:0-22:0",
"Sunday 11:0-18:0"
],
"type": "business"
}
It generated the following:
CREATE TABLE my_table_name (
address string,
attributes array<struct<bikeparking:boolean, businessacceptsbitcoin:boolean, businessacceptscreditcards:boolean, businessparking:struct<lot:boolean, street:boolean, valet:boolean, validated:boolean>, dogsallowed:boolean, restaurantspricerange2:int, wheelchairaccessible:boolean>>,
business_id string,
categories array<string>,
city string,
hours array<string>,
is_open int,
latitude int,
longitude double,
name string,
neighborhood string,
postal_code int,
review_count int,
stars double,
state string,
type string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
I want the columns to be generated in the same order as in the JSON, but in the output the columns are sorted alphabetically.
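The difference between the two orderings can be illustrated with the standard map types. This is a minimal sketch, assuming the fields have already been collected into a name-to-type map; `ColumnOrderSketch` and `columns` are hypothetical names, not the tool's API.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of JsonHiveSchema.
final class ColumnOrderSketch {
    // Emits one "name type" line per field, in the map's iteration order.
    // A LinkedHashMap yields insertion (document) order;
    // a TreeMap yields alphabetical order.
    static String columns(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(",\n");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add("  " + field.getKey() + " " + field.getValue());
        }
        return joiner.toString();
    }
}
```

Passing a `LinkedHashMap` filled in document order keeps `business_id` ahead of `address`, while wrapping the same map in a `TreeMap` reproduces the alphabetical output shown above. Preserving document order also requires a JSON parser that keeps key order, which is an assumption about the parsing layer.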
I find your project interesting; however, I think it lacks a key feature: the ability to deduce a schema from multiple JSON documents, one per line, by computing the "greatest common denominator" of all of them.
This removes a layer of human intervention (putting all the keys in one document). For the implementation details, you can check out this project:
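Independently of any particular project, one possible reading of this suggestion is to infer a flat field-to-type map per document and then merge the maps, taking the union of keys. The sketch below assumes that representation; `SchemaMergeSketch` and the widening rules (numeric types widen, anything else falls back to `string`) are illustrative assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of JsonHiveSchema.
final class SchemaMergeSketch {
    // Merges two per-document schemas: union of keys, widened types on conflict.
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new TreeMap<>(a);
        for (Map.Entry<String, String> field : b.entrySet()) {
            merged.merge(field.getKey(), field.getValue(), SchemaMergeSketch::widen);
        }
        return merged;
    }

    // Assumed conflict rule: int < bigint < double; otherwise use string.
    static String widen(String x, String y) {
        if (x.equals(y)) return x;
        if (isNumeric(x) && isNumeric(y)) {
            return (x.equals("double") || y.equals("double")) ? "double" : "bigint";
        }
        return "string";
    }

    static boolean isNumeric(String type) {
        return type.equals("int") || type.equals("bigint") || type.equals("double");
    }
}
```

Folding `merge` over the schemas of all lines in a file would give a single schema covering every key seen. Nested structs and arrays would need a recursive variant, which this sketch omits.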
Hi, I originally posted this as a comment on an existing issue, but it did not seem relevant to that issue, so I created a new one. I am trying to build the jar files, but the build gets stuck at the download phase, as shown below. Can you please help?
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for net.thornydev:json-hive-schema:jar:1.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 13, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] -------------------< net.thornydev:json-hive-schema >-------------------
[INFO] Building json-hive-schema 1.0
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexus: http://nexus.elsst.com/content/groups/public/org/apache/maven/plugins/maven-assembly-plugin/2.4/maven-assembly-plugin-2.4.jar
Progress (1): 25/226 kB
Hi,
This is a great tool for simplifying the conversion of a JSON record into a Hive structure.
I built the jar files and used them to generate Hive DDL,
but the CREATE TABLE statement is failing in Hive.
Can you please help resolve the issue?
Sample data record:
{"country":"uk","state":"ny","city":"fr","street":"nyk","zip":"1009","data":[{"country_code":"uk","state_set":"ny","city_code":"fr","street_code":"nyk","zip_code":"1009"}]}
user@ubuntu:~/lab/programs$ java -jar json-hive-schema-1.0-jar-with-dependencies.jar /home/user/Documents/json_cntry_schema.json
CREATE TABLE x (
city string,
country string,
data array<struct<city_code:string, country_code:string, state_set:string, street_code:string, zip_code:string>>,
state string,
street string,
zip string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
hive> CREATE external TABLE jcountry (
> city string,
> country string,
> data array<struct<city_code:string, country_code:string, state_set:string, street_code:string, zip_code:string>>,
> state string,
> street string,
> zip string)
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/user/data1/jason/';
FAILED: Parse Error: line 4:2 mismatched input 'data' expecting Identifier near ',' in column specification
Thanks,
Pushpa
It seems as if only the first item in an array is examined?
If the two elements of "wobble" in the example are swapped, "details2" is not generated.
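A possible fix is to take the union of fields across all elements of an array rather than inspecting only the first one. The sketch below assumes each array element has already been reduced to a field-to-type map; `ArrayStructSketch` and `arrayOfStructs` are hypothetical names, not the tool's code.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of JsonHiveSchema.
final class ArrayStructSketch {
    // Builds array<struct<...>> from the fields of every element,
    // so the result does not depend on element order.
    static String arrayOfStructs(List<Map<String, String>> elements) {
        Map<String, String> fields = new TreeMap<>();
        for (Map<String, String> element : elements) {
            for (Map.Entry<String, String> field : element.entrySet()) {
                // Keeps the first type seen for a field; conflicts are not resolved here.
                fields.putIfAbsent(field.getKey(), field.getValue());
            }
        }
        StringJoiner joiner = new StringJoiner(", ", "array<struct<", ">>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add(field.getKey() + ":" + field.getValue());
        }
        return joiner.toString();
    }
}
```

With this approach, swapping the two elements produces the same struct definition either way.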
I renamed timestamp to time_stamp to work around the problem.
CREATE TABLE visits (
time_stamp string,
user_info struct<app_key:string, device_id:string, user_id:string>,
visit struct<end_time:string, event_type:string, id:string, is_confirmed:boolean, is_ongoing:boolean, place:struct<estimated_address:struct<cc:string, city:string, country:string, formatted_address:string, formatted_city:string, postal_code:string, state:string, street_address:string>, estimated_geolocation:struct<accuracy:double, lat:double, long:double>, first_visit_time:string, id:string, last_visit_time:string, type:string>, place_id:string, start_time:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Hi, if an object or attribute name is duplicated across different objects (for example, within an array or a struct), the tool does not work.
Expected: names SHOULD be unique within a single object, but not across the entire JSON document.
Saves you the time of setting up a Java runtime and installing dependencies.
Hi, thanks for making this, it is saving me some trouble.
I did notice that when reading keys with whitespace, the output schema syntax is invalid.
Here's an example:
{"name": "sometext",
"stuff": {"white space":false},
"white space": "blah blah"
}
The resulting schema output will be invalid syntax due to the spaces:
CREATE TABLE x (
name string,
stuff struct<white space:boolean>,
white space string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Hive does support column names with whitespace (and, in fact, any Unicode character).
The simple solution is to enclose the column names in backticks.
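The backtick approach could be sketched as follows. `QuoteSketch` and `quote` are hypothetical names; doubling an embedded backtick follows Hive's documented rule for quoted identifiers, but whether quoted names are also accepted inside `struct<...>` definitions is not verified here.

```java
// Hypothetical helper, not part of JsonHiveSchema.
final class QuoteSketch {
    // Wraps a column name in backticks, doubling any backtick it contains.
    static String quote(String name) {
        return "`" + name.replace("`", "``") + "`";
    }

    // Renders one column definition line with a quoted name.
    static String column(String name, String type) {
        return quote(name) + " " + type;
    }
}
```

Quoting every column name unconditionally keeps the generator simple and also sidesteps clashes with reserved words.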
The generated table definition uses 'int' data types for values that are longs.
The 'curated' JSON document used Long.MAX_VALUE as the field value.
The scalarNumericType() method should probably attempt to convert the value to an int and, if that fails, return bigint instead of int.
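The suggested behaviour might look like the sketch below. Only the method name `scalarNumericType` comes from the issue; the class name, the string-literal parameter, and the rule for detecting doubles are assumptions made for illustration.

```java
// Hypothetical sketch; the real method's signature may differ.
final class NumericTypeSketch {
    // Chooses a Hive numeric type for a JSON number given as its literal text.
    static String scalarNumericType(String literal) {
        // Assumed rule: a fraction or exponent marks a floating-point value.
        if (literal.contains(".") || literal.contains("e") || literal.contains("E")) {
            return "double";
        }
        try {
            Integer.parseInt(literal);
            return "int";
        } catch (NumberFormatException tooLargeForInt) {
            // Does not fit in 32 bits, so use Hive's 64-bit integer type.
            return "bigint";
        }
    }
}
```

With this rule, a value such as 124124124124 would be typed as bigint rather than int. Values beyond the 64-bit range are not handled by this sketch.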
Would like to be able to pass "-" instead of a filename to read from stdin.
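Reading from stdin when the argument is "-" could be sketched as below. `InputSketch` and `readInput` are hypothetical names, and UTF-8 input is assumed; this is not the tool's existing code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of JsonHiveSchema.
final class InputSketch {
    // Returns the JSON text from stdin when arg is "-", otherwise from the named file.
    static String readInput(String arg, InputStream stdin) throws IOException {
        if ("-".equals(arg)) {
            return readAll(stdin);
        }
        try (InputStream file = Files.newInputStream(Paths.get(arg))) {
            return readAll(file);
        }
    }

    // Drains a stream fully and decodes it as UTF-8 (assumed encoding).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In a main method this would be called as `readInput(args[0], System.in)`, which allows the tool to sit at the end of a shell pipeline.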