rbheemana / cobol-to-hive
Serde for Cobol Layout to Hive table
License: Apache License 2.0
We have been using this SerDe under the Hive CLI for the past few years, but we are in the process of switching to Beeline. While testing under Beeline we are encountering two strange behaviors, which I think are related to the same problem. All our tables are FB (fixed-block) files, and the same JAR works perfectly fine under the CLI, where count(*) returns correctly.
When we issue count(*) against a table (FB file) for the very first time after ADD JAR, we get the following exception. If we re-issue count(*) against exactly the same table, it runs fine. And if we select column(s) from the table the very first time instead, there is no issue:
"diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Fixed record length 0 is invalid. It should be set to a value greater than zero".
For any subsequent first count(*) against another table, the record length of the prior table is reused; it doesn't match the total record length of the second table's file, hence the exception. Here too, if we select column(s) instead, the current table's FB length is used.
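For context (my assumption, not confirmed from the SerDe's code): the "Fixed record length 0" message comes from Hadoop's FixedLengthInputFormat, which reads its record length from the job configuration key fixedlengthinputformat.record.length, and a count(*) that skips column deserialization may run before the SerDe has populated that key. If so, a possible workaround under Beeline is to set the key explicitly before the first count(*). The table name and length below are placeholders; the length must match the table's fb.length:

```sql
-- Possible workaround (assumption): set the fixed record length explicitly
-- before the first count(*); 'mydb.my_fb_table' and 1000 are placeholders.
SET fixedlengthinputformat.record.length=1000;
SELECT COUNT(*) FROM mydb.my_fb_table;
```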
Any help is appreciated.
Thanks,
Rama.
Hi, I have the below copybook structure on the mainframe:
01 WS-TEST-RECORD GROUP-USAGE IS NATIONAL.
05 WS-TEST-COUNTRY-CD PIC N(03) VALUE SPACES.
05 WS-TEST-COUNTRY-NM PIC N(64) VALUE SPACES.
05 WS-TEST-FUND-CD PIC N(03) VALUE SPACES.
05 WS-TEST-YYYYMM-YM PIC N(06) VALUE SPACES.
05 WS-TEST-BK-RATE PIC N(11) VALUE SPACES.
Then I added the JAR in my Hive script:
ADD JAR /u/user/vikrant/hivespark/CobolSerde.jar;
and below is the copybook structure I have created in HDFS:
/hdfspathlocation/vikrant/ddl/cobol.copybook
01 TEST-INPUT-FILE.
05 WS-TEST-COUNTRY-CD PIC N(03).
05 WS-TEST-COUNTRY-NM PIC N(64).
05 WS-TEST-FUND-CD PIC N(03).
05 WS-TEST-YYYYMM-YM PIC N(06).
05 WS-TEST-BK-RATE PIC N(11).
Then I FTPed the file to my Unix location using the code below, and later moved that file to the HDFS location:
filename = "INPUT_TESTFILE"
with open(filename, 'wb') as f:
    ftp.retrbinary("RETR 'CS.CS0008.TEST.INPUT.TESTFILE'", f.write)
ftp.quit()
I am creating an external table in Hive with the below configuration:
CREATE EXTERNAL TABLE udb.Cobol2Hive
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/hdfspathlocation/mainframe'
TBLPROPERTIES ('cobol.layout.url'='/hdfspathlocation/cobol.copybook','fb.length'='87');
I am now getting the below error message:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table
Hi @datawarlock @rbheemana ,
I'm getting a header and a trailer as well in the binary file, which is sent by an external system. How do I ignore those records while reading the file?
Format:
Header(1 row)
Details(1000000 rows)
Trailer(1 row)
Help is highly appreciated.
Stop creating columns for COBOL fields specified as FILLER.
When deserializing, FILLER fields should not be populated.
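A minimal sketch of the requested check, assuming field names are available as plain strings during layout parsing (the class and method names here are hypothetical, not the SerDe's actual API):

```java
// Hypothetical helper: decide whether a copybook field should become a Hive
// column. COBOL treats any field named FILLER as anonymous padding, so such
// fields would be skipped both when generating columns and when deserializing.
class FillerFilter {
    public static boolean isFiller(String fieldName) {
        return fieldName != null && fieldName.trim().equalsIgnoreCase("FILLER");
    }
}
```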
I have one field in the copybook:
01 DETAILS.
02 RATE PIC V9(6).
The SerDe is not able to convert it. Please help.
@rbheemana everything written in this blog is working fine for me. However, is there a way to
hide or alias the password in the following command?
hadoop jar CobolSerde.jar {hostid:port} {username} {password} {mf_file} {location}
I am using this command in a shell script (hadoop jar CobolSerde.jar mf_host:21 atul96 main123 'acx.fds.fd(0)' /atul/mf). How do I hide my password 'main123'? I want to run this in a production environment.
Hi there,
I saw there is a way to load EBCDIC file from Mainframe server to hdfs (http://rbheemana.github.io/Cobol-to-Hive/transfer.html).
I wonder if one can load an EBCDIC file from a local directory to HDFS?
Thanks, Anton
Hi,
I did try running your code on Hive, but I get no data back when I query the cobol2hive table. I have attached the file I tried and also the code.
input.txt
bc70copy.txt
Kindly help with this request.
When the SerDe outputs string fields and there is a character like a newline or carriage return in the string, it splits the row at that character, so that one row becomes two, split at the wrong position. To get around this I added a replaceAll call to change newlines and carriage returns to spaces.
The way I did it is a bit of a hack, because it does not address other characters that can split lines (form feed, vertical tab, etc.); it handles only newline and carriage return.
CobolStringField.java originally had:
case STRING:
    return s1;
case VARCHAR:
    return new HiveVarchar(s1, this.length);
I changed it to this:
case STRING:
    s1 = s1.replaceAll("\n", " ");
    s1 = s1.replaceAll("\r", " ");
    return s1;
case VARCHAR:
    s1 = s1.replaceAll("\n", " ");
    s1 = s1.replaceAll("\r", " ");
    return new HiveVarchar(s1, this.length);
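A more general version of this workaround could replace every row-splitting character in one pass. The class below is an illustrative stand-alone sketch, not the SerDe's actual code:

```java
// Illustrative generalization: replace any character that can split a text row
// (newline, carriage return, form feed, vertical tab) with a space.
class RowSanitizer {
    public static String sanitize(String s) {
        if (s == null) {
            return null;
        }
        // \x0B is the vertical-tab character in Java regex syntax.
        return s.replaceAll("[\\n\\r\\f\\x0B]", " ");
    }
}
```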
We are trying to add the support for zoned decimal as well. We will share the code after we are done.
Hi
Have you done any performance testing? I read the post on large files, and the time seems a bit on the higher side. Could you please let me know if Hive internally uses MapReduce to process a custom SerDe conversion when a SELECT query is executed?
Thanks
Lep
It is not able to convert the COMP-3 decimal location for the S9(15)V9(2) COMP-3 field. As a result it is only showing one row of data in the table, whereas we have 100 rows of data.
Any idea where I am going wrong?
The copybook contains the field below:
FIELDNAME PIC S9(011)V99
    SIGN TRAILING SEPARATE.
After deserialization I am getting the column converted in Hive as FIELDNAME(13,3), and the value is 20.398 instead of 203.98, since the V99 is at the end.
Can you please let me know what could be the issue?
Hi Ram,
We need your guidance to solve the issue below.
Description:
We have received a binary fixed-length file containing COMP and COMP-3 fields.
The table is getting created using the Cobol-to-Hive SerDe with all the required columns based on the copybook, with all the data types matching.
Below are a few of the issues encountered while transforming the data:
Sincerely,
Aniket
Hi Ram,
I have an EBCDIC file with a header 115 bytes long; the actual data starts from byte 116.
Record format . . . : VB
Record length . . . : 521
Block size . . . . : 27998
Please help with an example layout for such a header file.
Regards,
Harshit
Hi Guys,
We are having a problem with a specific datatype.
Here are the symptoms:
the original copybook had S99, which was showing a value of null;
we changed it to S9(2), still showing the same null value;
we changed it to S9(3) COMP-3, which shows 15012;
the real value is 1.
I'd appreciate it if you could offer any insights or ideas on how to solve this.
thanks much,
ameet
I see that the code was written to handle S9(02)V9(03) COMP-3 (the counted form, with brackets),
but not the raw format S99V999 COMP-3.
Most organizations' copybooks may have raw-format fields.
Ex: 09 PROM-FACTOR PIC S99V999 COMP-3.
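One possible pre-processing fix, sketched here as a stand-alone helper (not the SerDe's actual parsing code), is to normalize repeated-character PIC strings into counted form before the existing parser sees them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: rewrite runs of a repeated PIC character (9, X, A, N)
// into counted form, e.g. S99V999 -> S9(2)V9(3), XXX -> X(3). Already-counted
// forms such as S9(05) pass through unchanged.
class PicNormalizer {
    private static final Pattern RUN = Pattern.compile("([9XAN])\\1+");

    public static String normalize(String pic) {
        Matcher m = RUN.matcher(pic);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            m.appendReplacement(sb, run.charAt(0) + "(" + run.length() + ")");
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```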
Use https://github.com/sakserv/hadoop-mini-clusters to create a testing environment.
Could you guys please help with removing the header and trailer from an EBCDIC binary file?
Hello,
Could you provide instructions on how to build the SerDe JAR from source?
Thanks in advance.
I see the code is not handling a REDEFINES that has its own picture clause. Can you guide me so that I can enhance the code and contribute it back?
EX:
30 TOTAL-INT REDEFINES TOTAL-FEE PIC S9(12)V9(05) COMP-3.
Please help ASAP; we really need this.
Hi Ram,
I am getting this error while creating a Hive table:
"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. LAYOUT_GEN"
Will you please comment and help?
Hi @rbheemana @datawarlock ,
I have a copybook with multiple 01 levels. How do I handle this with this SerDe?
Please let me know.
Sorry if I missed this in the documentation, but how should I be defining signed packed values (e.g. COMP-3 PIC S9(08)V99) in the copybook?
let num = [];

function createNum() {
  // Collect 15 unique random numbers between 1 and 15; the pool has exactly
  // 15 values, so the loop must stop after 15 (asking for 16 never terminates).
  for (let i = 0; i < 15;) {
    let numGen = Math.floor(Math.random() * 15) + 1;
    if (!num.includes(numGen)) {
      num.push(numGen);
      i++;
    }
  }
}

createNum();
document.getElementById("selectedNumbersShownHere").innerHTML = num;
console.log(num);
Hi folks, I ran into a scenario where S9(18) COMP-3 or larger returns a null value when using the SerDe. I noticed that when the length is greater than or equal to 19, the field type is set to String; however, the String type is not handled in the switch statement, so the field value defaults to null.
I added a new case statement for String to handle it, and attached the code change in a notepad file. Could you please look into it?
Class name: CobolNumberField
Hi,
We are trying to translate a COMP-5 S9(10) field.
COMP-5 does not seem to be managed by the SerDe; shall we use COMP-4 instead, or is there another option?
Thanks
Add support for COBOL fields defined as COMP and COMP-4.
Hi Ram,
I wonder if "signed fields" for the representation of PIC S9 fields (for example, PIC S9(5) COMP-3.) are properly supported.
Thank you,
Anton
A field in S9(11)V99 format displays a null value in Hive.
Please, could you help us verify whether this is a problem or whether some step was executed incorrectly?
Details are on this drive: https://drive.google.com/open?id=1DE9nMV055NR1hG0kJIrx74E2nnT9fIW1PS9209A3QSo
Regards, Daniel
We are executing your JAR and performing the steps below, but we are getting the Hive exception shown. Can you please suggest a solution, or explain why we are getting this error?
STEP 1: Add the CobolSerde.jar to the Hdfs path
STEP 2: Add the layout to the Hdfs path.
STEP 3: Login to Hive
STEP 4: Add the jar.
add jar hdfs://devmid:8020/dev/test/CobolSerde.jar
STEP 5: Run the DDL.
DROP TABLE IF EXISTS abc_enc.cobol_layout2;
CREATE TABLE abc_enc.cobol_layout1
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ('cobol.layout.url'='hdfs://devmid:8020/dev/test/layout2.txt','fb.length'='1000');
ERROR :
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name srvc_fr_dt_ccyy in the table definition.
As per my understanding, the error points to this portion of the layout:
20 SRVC-FROM-DATE.
25 SRVC-FR-DT-CCYY PIC X(4).
25 SRVC-FR-DT-MMDD PIC X(4).
20 SRVC-TO-DATE.
25 SRVC-TO-DT-CCYY PIC X(4).
25 SRVC-TO-DT-MMDD PIC X(4).
My copybook is relatively huge, with an FB length of 1806.
Your steps worked great and I was able to create the table and view data. The only issue I observed is that it skipped several fields of the copybook and created the rest as table columns from that point on.
Could you please shed some light on why several lines of the copybook might have been skipped?
Thanks,
Charan
Hello!
Thank you for this great work.
I'm testing the conversion of an EBCDIC file to a Hive table and have some issues:
When using in Hive:
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
I have the following error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table
HOWEVER,
when using:
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde2.cobol.CobolSerDe'
It works, but the format PIC S9(8)V9(3) USAGE COMP-3. is not converted correctly for decimals:
8 instead of 8.1
9.16 instead of 9.216
Could you please help ?
Thanks!
Below is the complete DDL:
ADD JAR hdfs://cluster/user/username/cobolserde.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS new_table
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/TMP/DATA'
TBLPROPERTIES ('cobol.layout.url'='hdfs://cluster/user/username/my_layouts/layout','fb.length'='120');
I am currently working on a Mainframe DB2-to-Hadoop data download. I am facing issues in reading the data from Hadoop; it is a mainframe group variable with COMP and COMP-3 fields.
Below are the details of the issue. Any suggestions, thoughts, or information are greatly appreciated. Thank you in advance!
Example:
columnA of the DB2 table is equivalent to a group variable in COBOL that is a combination of COMP and COMP-3 fields on the mainframe, as shown below:
10 COLUMNA.
15 AAAA-DAILY-COUNTS OCCURS 30 TIMES.
20 AAAA-BBB-REQT-D PIC S9(05) COMP-3.
20 AAAA-BBBBB-D PIC S9(09) COMP.
15 AAAA-WEEKLY-COUNTS OCCURS 26 TIMES.
20 AAAA-DDDD-W PIC S9(05) COMP-3.
20 AAAA-CCCC-IN-VSN-W PIC S9(09) COMP.
Data is landed from DB2 into Hadoop using the Attunity tool. As part of the validation for the data landing, we check that HEX(columnA data in DB2) equals HEX(columnA data in Hadoop).
In what possible ways can we convert:
1) the HEX form of the compressed field in Hadoop (which is a group variable in COBOL with COMP and COMP-3 fields) into readable, uncompressed form?
2) Or are there any suggestions on other possible ways?
The number of bytes a PIC S9(m) COMP-3 field takes is floor(m/2) + 1, so the number of bytes used for 9(2) and 9(3) is the same: two. The last nibble of the final byte contains the sign of the value. Mainframes deal with this implicitly, but you have to account for it explicitly for the SerDe to handle it.
The SerDe should compute a length of 2 for S9(2) COMP-3; currently it computes a length of 1.
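The sizing and sign rules above can be sketched as code. This is an illustrative stand-alone helper, not the SerDe's actual CobolNumberField logic, assuming standard IBM packed-decimal encoding:

```java
// Illustrative packed-decimal (COMP-3) helpers. A PIC S9(m) COMP-3 field
// stores two digits per byte plus a sign nibble, so it occupies
// floor(m/2) + 1 bytes; 9(2) and 9(3) therefore both take two bytes.
class PackedDecimal {
    public static int byteLength(int digits) {
        return digits / 2 + 1;
    }

    // Decode packed bytes into a long. The final nibble is the sign:
    // 0xD means negative; 0xC and 0xF mean positive/unsigned.
    public static long decode(byte[] packed) {
        long value = 0;
        for (int i = 0; i < packed.length; i++) {
            int hi = (packed[i] >> 4) & 0x0F;
            int lo = packed[i] & 0x0F;
            if (i == packed.length - 1) {
                value = value * 10 + hi;           // last byte: one digit + sign
                return (lo == 0x0D) ? -value : value;
            }
            value = value * 100 + hi * 10 + lo;    // two digits per byte
        }
        return value;
    }
}
```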
Hi Bheemana,
I used the SerDe you provided to pull the file from the mainframe server and was able to create the JAR file. I created a Hive table using the SerDe, but when I selected data from the table, the COMP-3 fields were not shown correctly:
Mainframe -- Hive table output
0     -- 12000000
0     -- 12000000
28.66 -- 12000000.02816
9.66  -- 12000000.00966
Do you have any idea why it is not able to parse COMP-3 fields?
Thanks in advance
Can you please specify what license (Apache 2?) you make this code available under?
Hello,
I get an error when I use ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe':
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table
I followed issue #16 but without success.
I used this copybook
01 DOC2.
20 DOC2-V-ID PIC X(17).
20 DOC2-N-DATA PIC S9(5) COMP-3.
20 DOC2-VAL-DATA PIC X(100).
Could you help me, please?
Thanks
Once a table is created, dropping and re-creating it does not update the column names with the new copybook details.
Details:
When creating a new table and running DESCRIBE on it, it shows the layout of the previous table. `set cobol.hive.mapping` likewise shows the old layout (created using the previous copybook).
Creating external table A pointing to copybook A works fine. But when creating another external table B pointing to copybook B in a different location, the original table A structure is replaced with table B's; now tables A and B have the same structure.
When I drop and recreate the table in the Hive CLI, it shows the new DDL, but Hue still shows the old DDL of the table.
Hello,
We've implemented the Cobol-to-Hive SerDe, with some small tweaks we previously found for handling COMP-3 values, but we are now seeing issues with the SerDe producing duplicate records. The duplicates seem to come from the split boundary: the record immediately after a split seems to be duplicated, but not in all scenarios. Some files with a block size divisible by 4 (which seems to match the byte reader's approach) worked, and some did not, ruling out our initial assumption that the problem was limited to odd-sized files. Has anyone noticed this issue, and does anyone have feedback?
Can anyone upload sample data and a copybook file?
Hi Ram,
I wonder if you have tested Cobol-to-Hive on large EBCDIC data (~1000 columns, ~100M rows, vb.length=32100)? Is this package designed to be used on large data in principle?
P.S. My workflow is to load the EBCDIC data into a Hive table and then save it to another format (for example, Parquet) from Spark.
Thanks, Anton