rbheemana / cobol-to-hive Goto Github PK

View Code? Open in Web Editor NEW

24.0 8.0 23.0 605 KB

Serde for Cobol Layout to Hive table

License: Apache License 2.0

Java 100.00%

hive serde mainframe cobol hdfs hadoop

cobol-to-hive's Introduction

Cobol-to-Hive

Serde for Cobol Layout to Hive table

Latest Updates

in MainframeVBRecordReader.java, on line 105, added cast as,

 filePosition = (Seekable) cIn;

modified pom.xml so that we can compile it using maven as,

mvn package

Commented the util package so that compilation can be done without maven install.
Added support for PIC clause starting with v Ex: v9(6)
Fixed issue for signed decimals.
Added support to ignore fields based on java regex pattern supplied via 'cobol.field.ignorePattern'='JAVA_REGEX_PATTERN'

ex: 'cobol.field.ignorePattern'='filler*'
Added support for multiple 01 levels

cobol-to-hive's People

Contributors

Stargazers

Watchers

cobol-to-hive's Issues

Ignore fields with name FILLER

Stop creating columns for cobol fields specified as FILLER.
When deserializing FILLER fields should not be populated.

performance for large ebcdic data

Hi Ram,

I wonder if you have tested Cobol-to-HIve on large ebcdic data (~1000 columns, ~100M rows, vb.length=32100)? Is this package designed to be used on large data in principle?

P.S. My workflow is to load ebcdic data in hive table and then save it to another format (for example parquet) from Spark.

Thanks, Anton

(post comments) Intoduction to Cobol to Hive Serde blogs

New line character and carriage return in string fields

When the serde is outputting string fields and the is a character like newline or carriage return in the string it will split the row at that character so that row becomes two and is split in the improper position. To get around this I added a replace all function to change newline and carriage return to spaces.

It is a little bit of a hack the way that I did it because it does not address characters that split the lines (form feed, vertical tab, etc.). It just does newline and carriage returns.

CobolStringField.java
case STRING:
return s1;
case VARCHAR:
return new HiveVarchar(s1, this.length);
I changed to this:
case STRING:
s1 = s1.replaceAll("\n", " ");
s1 = s1.replaceAll("\r", " ");
return s1;
case VARCHAR:
s1 = s1.replaceAll("\n", " ");
s1 = s1.replaceAll("\r", " ");
return new HiveVarchar(s1, this.length);

Missing LICENSE file

Can you please specify what LICENSE (Apache 2?) you make this code available with?

V9(6) in copybook -- not able to create column

I have 1 column in copybook as
01 DETAILS.
02 RATE PIC V9(6).

Serde is not able to convert ?? Please help .

ISSUE WITH PICTURE CLAUSE S99V999 COMP-3.

I see that, it was coded in such a way to handle S9(02)V9(03) COMP-3(which is having brackets)

but not raw format S99V999 COMP3,

most of the organizations copybook may have raw format fields

Ex: 09 PROM-FACTOR PIC S99V999 COMP-3.

"LAYOUT_GEN" error

Hi Ram,
I am having this error while creating Hive table:
"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. LAYOUT_GEN"

Will you please comment, help.

Translate COMP-5

Hi,
We are trying to translate a COMP-5 S9(10) field.
COMP-5 does not seems do be managed in the SERDE, shall we use COMP-4 instead or is there another option ?
Thanks

Not converting decimal location correctly comp 3

It is not able to convert the comp 3 decimal location. As a result it is only showing one row of data in table.
Where as we have 100 rows of data.
Any idea where I am going wrong?
Fir the S9(15)V9(2) COMP-3 Field!

Support parsing of PICTURE clause, currently only PIC is supported

(post comments) Transfer of mainframe file to HDFS blog

How to compile this project?

Hello,

Could you provide instructions on how to build a serde jar from source?

Thanks in advance.

Failing count(*) for the 1st time under beeline and also using prior table length (FB)

We are using this Serde for past few years under CLI but we are in the process of switching to beeline. While testing under beeline, we are encountering 2 strange behavior, which I think is related to same problem. All our tables are FB files and the same JAR is working perfectly fine under CLI and count(*) is returning fine.

When we issue count() for table (FB file) after add JAR for the very 1st time, we are getting following exception. But if we re-issue count() from exactly same table, it is running fine. But if we are selecting a column(s) from the table for the very 1st time, then there is NO issue.
"diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Fixed record length 0 is invalid. It should be set to a value greater than zero".
Any subsequent 1st count(*) for another table, the length of prior table is used and doesn't match with the total length of the 2nd table file and hence throwing exception. Here also, if we select column(s), then the current table FB length is used.

Any help is appreciated.

Thanks,
Rama.

issues while using cobol serde

Hello!

Thank you for this great work.
I'm testing the conversion of an EBCDIC File to Hive Table and have some issues:

When using in Hive:
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
I have the following error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table

HOWEVER,
when using:
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde2.cobol.CobolSerDe'
It works, but the format PIC S9(8)V9(3) USAGE COMP-3. is not correct for decimals:
8 instead of 8.1
9.16 instead of 9.216

Could you please help ?
Thanks!

Add support for unit Testing

Use https://github.com/sakserv/hadoop-mini-clusters to create testing environment.

S9(18) COMP-3 or greater value is returning null Value

Hi folks- I had a scenario where S9(18) COMP-3 or more is returning null value while using the SERDE. Noticed when the length is greater than or equal to 19, the fieldtype is set to String, however in the Switch statement String type is not handled.Due to which the field value is set to null by default.

Added a new case string statement to handle it, Attached the code change in the notepad. Could you please look into it.

Class Name - CobolNumberFiled

CobolNumberFiled.txt

Column names remains static

Once table is created, drop and create is not updating column names with new copy book details.

Details:
when creating a new table and does a describe on it.. its showing the table layout of the previous table.. set cobol.hive.mapping, this is also showing old layout(created using previous cobol copy book).

when creating external table A that points to copybook A, that works fine. But when creating another external table B pointing to another copybook B in a different location, the original table A structure has been replace with table B. Now Table A and B is having the same structure.

When i drop and recreate the table in hive cli, it is showing new ddl. but in hue its still showing the old ddl of the table.

Simple For loop Does not work properly

let num = [];

Field with format S9(11)V99 displays null value in hive.

Field with S9 format (11) V99 displays null value in hive.

Please, could you help us to verify if it is a problem or some step was executed incorrectly?

Details are on this drive: https://drive.google.com/open?id=1DE9nMV055NR1hG0kJIrx74E2nnT9fIW1PS9209A3QSo

regards Daniel

How to handle multiple 01 level in copybook

Hi @rbheemana @datawarlock ,

I have copybook with multiple 01 levels. How do i handle this with this SERDE.

Please let me know.

Db2 to Hive - handling Comp and comp3 data , which are part of a group variable

I am currently working on a Mainframes-Db2 to Hadoop , data download. Facing issues in how to read the data from Hadoop , that is a Mainframes group variable with COMP and COMP-3 fields

Below are the details of the Issue. Any suggestions /thoughts/ Information is greatly appreciated. Thankyou in advance !

Example:
columnA of the DB2 table is equivalent to a group variable in Cobol that is a combination of COMP and COMP-3 variables in Mainframes as shown below

10 COLUMNA
15 AAAA-DAILY-COUNTS OCCURS 30 TIMES.
    20 AAAA-BBB-REQT-D  PIC S9(05) COMP-3
    20 AAAA-BBBBB-D   PIC S9(09) COMP.

15 AAAA-WEEKLY-COUNTS OCCURS 26 TIMES.
    20 AAAA-DDDD-W              PIC S9(05) COMP-3.
    20 AAAA-CCCC-IN-VSN-W    PIC S9(09) COMP.

Data is landed from DB2 to Hadoop using Attunity tool .As part of validation for data landing, we are checking the HEX(ColumnA data in DB2) should be equal to HEX( COLUMN A data in HAdoop).

In what possible ways can we convert
1) the HEX form of the compressed field in Hadoop( which is a Group variable in cobol with comp and comp-3 fields) ,to readable , uncompressed form ?
2) Or any suggestion on other possible ways

load EBCDIC file from local dir to hdfs

Hi there,

I saw there is a way to load EBCDIC file from Mainframe server to hdfs (http://rbheemana.github.io/Cobol-to-Hive/transfer.html).
I wonder if one can load EBCDIC file from local dir to hdfs?

Thanks, Anton

Password encryption

@rbheemana everything written in this blog is working fine for me. however is there a way to
hide or make a password alias in the following command.
hadoop jar Cobolserde.jar {hostid:port} {username} {password}{mf_file} {location}. I am using this command in shell script.(hadoop jar CobolSerde.jar mf_host:21 atul96 main123 'acx.fds.fd(0)' /atul/mf So now how do I hide my password 'main123'. I want to run this in production environment.

(post comments) Cobol to Hive Serde blog

issues while using cobol serde

hello
I have error when I use ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table

I followed this issues but not success #16
I used this copybook

01 DOC2.
20 DOC2-V-ID PIC X(17).
20 DOC2-N-DATA PIC S9(5) COMP-3.
20 DOC2-VAL-DATA PIC X(100).

could you help me please ?
thanks

Not able to read signed character

in Copybook below are there.

FIELDNAME PIC S9(011)V99
SIGN TRAILING SEPARATE.
After deserialisation I am getting below column converted in HIVE.
FIELDNAME(13,3) and the value is 20.398 ... Instead 203.98 as V99 is there in the end.

Can yu please let me what could be the issue?

Support COMP and COMP-4 fields

Add support for the Cobol fields defined as COMP and COMP-4 fields.

Support Alpha Numeric Pic Clause when specified as XXXX

Issue with parsing the COM-3 columns

Hi Bheemana,

I have used the serde provided by you to pull the file from mainframe server and was able to create jar file. Created hive table using the serde. when I selected data from hive table.. com-3 fields were not shown correctly.
Mainframe -- hive table output
0 --12000000
0 --12000000
28.66 --12000000.02816
9.66 --12000000.00966

Do you have any idea why it is not able parse com-3 fields

Thanks in advance

Data and copybook file

Can anyone upload a sample data and copybook file?

COMP-3 Fields with even number of values is not calculating length properly.

The number of bytes a PIC S9(m) COMP-3 takes is (m+1)/2. So the number of bytes used for 9(2) and a 9(3) are the same. The last nibble of the byte will contain the "sign" of the value. Mainframes can implicitly deal with this but you have to explicitly change the value for the serde to handle it.

SERDE should create length as 2 for S9(2) COMP-3. Currently it is creating a length of 1.

large Files

Have you does any performance testing , i read the post on large files and the time seems a bit on the higher side . Could you please let me know if Hive internally uses Map-reduce to process a custom SerDe conversion when a Select query is executed

Thanks
Lep

Signed Packed decimals

COMP-3 PIC S9(08)V99.

Sorry, If I missed this in the documentation. How should I be defining signed packed values in copybook?

Execution error HiveException: Duplicate column name

We are executing your jar and performing below steps but we are getting hive exception as shown below. Can you please suggest some solution to this or why we are getting this error.

STEP 1: Add the CobolSerde.jar to the Hdfs path

STEP 2: Add the layout to the Hdfs path.

STEP 3: Login to Hive

STEP 4: Add the jar.
add jar hdfs://devmid:8020/dev/test/CobolSerde.jar

STEP 5: Run the DDL.

DROP TABLE IF EXISTS abc_enc.cobol_layout2;
CREATE TABLE abc_enc.cobol_layout1
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ('cobol.layout.url'='hdfs://devmid:8020/dev/test/layout2.txt','fb.length'='1000');

ERROR :
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name srvc_fr_dt_ccyy in the table definition.

As per my understanging:
Error point to the layout portion
20 SRVC-FROM-DATE.
25 SRVC-FR-DT-CCYY PIC X(4).
25 SRVC-FR-DT-MMDD PIC X(4).
20 SRVC-TO-DATE.
25 SRVC-TO-DT-CCYY PIC X(4).
25 SRVC-TO-DT-MMDD PIC X(4).

facing issues while importing data from mainframe

Hi, I have a below copybook structure on mainframe:

01  WS-TEST-RECORD GROUP-USAGE IS NATIONAL.                 
    05 WS-TEST-COUNTRY-CD          PIC N(03)   VALUE SPACES.
    05 WS-TEST-COUNTRY-NM          PIC N(64)   VALUE SPACES.
    05 WS-TEST-FUND-CD             PIC N(03)   VALUE SPACES.
    05 WS-TEST-YYYYMM-YM           PIC N(06)   VALUE SPACES.
    05 WS-TEST-BK-RATE             PIC N(11)   VALUE SPACES.

the I added jar to my hive script
ADD JAR /u/user/vikrant/hivespark/CobolSerde.jar;
and below is the copybook structure I have created in hdfs

/hdfspathlocation/vikrant/ddl/cobol.copybook

01   TEST-INPUT-FILE.
     05 WS-TEST-COUNTRY-CD             PIC N(03).
     05 WS-TEST-COUNTRY-NM             PIC N(64).
     05 WS-TEST-FUND-CD                PIC N(03).
     05 WS-TEST-YYYYMM-YM              PIC N(06).
     05 WS-TEST-BK-RATE                PIC N(11).

then I did FTP the file using below code to my unix location and later moved that file to hdfs location
filename = "INPUT_TESTFILE"
ftp.retrbinary('RETR \'CS.CS0008.TEST.INPUT.TESTFILE\'',  codecs.open(filename, 'wb').write)
ftp.quit()

I am creating an external table in hive with below configuration:

CREATE EXTERNAL TABLE udb.Cobol2Hive
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/hdfspathlocation/mainframe'
TBLPROPERTIES ('cobol.layout.url'='/hdfspathlocation/cobol.copybook','fb.length'='87');

I am getting below error message now:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table

how to handle Header and Trailer from this JAR

Hi @rbheemana @datawarlock

Could you guys please help in removing the header and trailer from EBCDIC binary file.

Variable Block Duplicates

Hello,

We've implemented the cobol to Hive Serde, with some small tweaks we previously found for handling COMP-3 values, but we are now seeing some issues with Serde causing duplicate records. The duplicate records seem to be coming from the split boundary. The immediate record after the split seems to be duplicated, but not in all scenarios. We've had some files, with block size divisible by 4, which seems to be the bytereader's approach that worked, and some also that did not work, ruling our our potential assumption we were having issues with Odd sized files. I was wondering if anyone had noticed this issue and may have some feedback?

skipping few fields

My copybook is relatively very huge. FB length of 1806.

Your steps worked great and I could able to create table and view data. Only issue I observed is it skipped several fields of copybook and created the rest as table columns from there.

Could you please throw some light on why several lines of copybook might got skipped?

Thanks,
Charan

issues while using cobol serde

Hello!

Thank you for this great work.
I'm testing the conversion of an EBCDIC File to Hive Table and have some issues:

Could you please help ?
Thanks!

Below the complete DLL

ADD JAR hdfs://cluster/user/username/cobolserde.jar; CREATE EXTERNAL TABLE IF NOT EXISTS new_table ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION '/TMP/DATA' TBLPROPERTIES ('cobol.layout.url'= 'hdfs://cluster/user/username/my_layouts/layout','fb.length'='120')

Issue while using the cobol2hive serde, comp & com 3 not getting converted

Hi Ram,

We need your guidance to solve the below issue
Description: -
We have received a binary file containing COMP and COMP 3 fixed length file.
The table is getting created using the cobol2hive serde with all the required columns based on the copy book with all the data type matching.

Below are few of the issues while transforming the data: -

While doing select * from abc; we are getting below error
Error: java.io.IOException: java.io.IOException: Partial record(length = 48) found at the end of split. (state=,code=0)
column with comp field are not getting converted properly.
For example value for COMP column which is '10001' in actual mainframe data file, we are getting 538978065 in the table.
The Mainframe file imported seems to be correct but the output table is not getting populated correctly for COMP and COMP 3

Sincerely,
Aniket

Trying to add support for the zoned decimal as well.

We are trying to add the support for zoned decimal as well. We will share the code after we are done.

How to handle the COBOL EBCDIC file with header of 115 bytes.

Hi Ram,

I have the EBSDIC fie with header of 115 bytes long the actual data starts fro 116.

Record format . . . : VB
Record length . . . : 521
Block size . . . . : 27998

Please help to for the example layout for such header file.

Regards,
Harshit

No data

Hi ,
I did try your code to run on hive. But i get no data back when i query the cobol2hive table . i have attached the file i have tried and also the code.
input.txt
bc70copy.txt
Kindly help with this request.

Issue with redefines with picture clause (which is not group)

I see the code is not handling redefines with self picture clause, Can you guide me , So that I can enhacne the code and give it

EX:
30 TOTAL-INT REDEFINES TOTAL-FEE PIC S9(12)V9(05) COMP-3.

Please help asap , its really need for us

S99 not interpreted correctly

Hi Guys,

We are having a problem with specific datatype
Here are the symptoms,
original copybook had S99 which was showing a value ==> null
we changed it to S9(2), still showing the same value ==> null
we changed it to, S9(3) COMP-3 ==> 15012
real value is 1

I'd appreciate if you can offer any insights or ideas on how to solve this.

thanks much,

ameet

how to remove header and trailer from binary file.

Hi @datawarlock @rbheemana ,

i'm getting header and trailer as well in binary file which is sent by external system. How do i ignore those records while reading the file.

Format:
Header(1 row)
Details(1000000 rows)
Trailer(1 row)

help is highly appreciated.

"signed fields" for the representation of PIC S9 fields

Hi Ram,

I wonder if "signed fields" for the representation of PIC S9 fields (for example, PIC S9(5) COMP-3.) are properly supported.

Thank you,
Anton

rbheemana / cobol-to-hive Goto Github PK

cobol-to-hive's Introduction

Cobol-to-Hive

Latest Updates

cobol-to-hive's People

Contributors

Stargazers

Watchers

Forkers

cobol-to-hive's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs