
cda-client's Issues

CDA parquet file does not represent column as CLOB

We have found that users can enter a CLOB value in the UI, but when the Parquet file comes from the CDA Reader, its schema does not present the column as a CLOB. The field is therefore created as a VARCHAR, and the data import fails because there are too many characters to fit in the VARCHAR field. How are folks handling this scenario?
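One way to approach this is to size the target column from the data itself: Oracle's default VARCHAR2 limit is 4000 bytes, so any string column whose longest value exceeds that must be created as a CLOB. A minimal sketch of such a decision helper (the `OracleTypeChooser` name and threshold handling are assumptions, not part of cda-client):

```scala
// Hypothetical helper: choose an Oracle column type for a string column based
// on the longest value observed in the extracted data. Oracle's VARCHAR2 is
// capped at 4000 bytes by default, so anything longer must become a CLOB.
object OracleTypeChooser {
  val Varchar2MaxBytes = 4000

  def chooseType(maxObservedBytes: Int): String =
    if (maxObservedBytes <= Varchar2MaxBytes) s"VARCHAR2($Varchar2MaxBytes)"
    else "CLOB"
}
```

If the load goes through Spark's JDBC writer, a mapping like this could be fed into its `createTableColumnTypes` option so the table is created with the right types up front.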

7 Junit tests are failing

Hi,

I ran this OOTB code and am seeing 7 failures. Any idea why these 7 tests are failing? Your help is highly appreciated.

com.guidewire.cda.SavepointsProcessorTest > Testing SavepointsProcessor functionality SavepointsProcessor.checkExists test when savepoints directory exists, and check if the savepoints json file exists FAILED
    java.net.URISyntaxException

com.guidewire.cda.SavepointsProcessorTest > Testing SavepointsProcessor functionality SavepointsProcessor.readSavepoints should read savepoints from a savepoints.json file in a given directory, or return empty Map if no savepoints file exist FAILED
    java.net.URISyntaxException

com.guidewire.cda.SavepointsProcessorTest > Testing SavepointsProcessor functionality SavepointsProcessor.getSavepoints should get a savepoint for a table from a savepoints.json file, which is an Option FAILED
    java.net.URISyntaxException

com.guidewire.cda.SavepointsProcessorTest > Testing SavepointsProcessor functionality SavepointsProcessor.writeSavepoints should write the lastSuccessfulReadTimestamps to a savepoints.json file FAILED
    java.net.URISyntaxException

com.guidewire.cda.SavepointsProcessorTest > Testing SavepointsProcessor functionality SavepointsProcessor.writeSavepoints should allow savepoint entry to be updated multiple times, for the same table FAILED
    java.net.URISyntaxException

com.guidewire.cda.TableReaderTest > Testing TableReader functionality TableReader.getFingerprintsWithUnprocessedRecords should retain all fingerprints if no savepoint data exists FAILED
    java.net.URISyntaxException

gw.cda.api.outputwriter.LocalOutputWriterTest > initializationError FAILED
    org.scalatest.exceptions.NotAllowedException at LocalOutputWriterTest.scala:69
    Caused by: java.net.URISyntaxException at LocalOutputWriterTest.scala:69

36 tests completed, 7 failed

Issue with building the project: com.github.johnrengelman.shadow

While building the project, I am getting the following error:
Plugin [id: 'com.github.johnrengelman.shadow', version: '5.1.0'] was not found in any of the following sources:

I am using JDK 8 and IntelliJ Community Edition. Any help would be highly appreciated.
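Plugin resolution errors like this are often environmental: Gradle resolves `com.github.johnrengelman.shadow` from the Gradle Plugin Portal, which can be blocked by a corporate proxy or an internal mirror. A sketch of a `settings.gradle` fragment that makes the plugin repositories explicit (this assumes repository reachability is the cause, which the error message alone cannot confirm):

```groovy
// settings.gradle -- a sketch; assumes the plugin portal is not reachable by
// default in your environment (e.g. behind a proxy or internal mirror).
pluginManagement {
    repositories {
        gradlePluginPortal()
        mavenCentral()
    }
}
```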

Build Fails on Windows

When following the "Build the CDA Client" section of the readme.md, I get nothing but errors, all of which have to do with paths. As an example of one of the errors:

Caused by: java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\src\temp\cda-client-test

All other errors are very similar to the above. How would I get around this?
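That particular message ("Illegal character in opaque part at index 2") is what `java.net.URI` produces when handed a raw Windows path: `C:` parses as a URI scheme and the backslashes that follow are illegal. A small sketch of the underlying JDK behavior, not specific to cda-client:

```scala
import java.io.File
import java.net.URI

// Passing a raw Windows path straight to the URI constructor fails: "C:" is
// parsed as a URI scheme, leaving "\src\..." as an opaque part containing
// illegal backslash characters (hence "index 2" in the error).
// new URI("C:\\src\\temp\\cda-client-test") // throws URISyntaxException

// Converting through java.io.File yields a well-formed file: URI instead.
val uri: URI = new File("C:\\src\\temp\\cda-client-test").toURI
```

In the client's case the fix would have to happen wherever the path string is turned into a URI; this snippet only demonstrates why the raw path fails.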

Also, do I really have to build the client in order to run it? Isn't there a precompiled version of the CDA Client?

Thanks for the help in advance!

Sample Parquet files missing in the repository

Need your help!
Are the sample Parquet files checked in anywhere?
The sample-manifest.json and savepoints.json files in the repo refer to tables such as taccountlineitem, taccount, note, etc. For these tables, the Parquet files are expected in an S3 bucket.
It would immensely help developers if we could get a zip file containing the Parquet/manifest files.

If the sample files are not present in the git repository, please suggest an external location where I can get them.
Thanks in advance!

Timezone issue while loading data to oracle

Hi,
The date-time is entered in Eastern time from the UI, but the data lands in the S3 buckets as epoch time. When it is loaded into Oracle by the CDA open source reader, Oracle treats the epoch time as UTC and converts it back to EST, so the time we see in Oracle is 4-5 hours behind the actual time.

For example: UI shows: 06/13/2022 10:00 AM (EST)
Data in Oracle: 13-JUN-22 06.00.00.000000000 AM

Has anyone experienced the same issue, and how have you handled it? Can someone suggest a way to overcome this problem?

Your help is appreciated
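The epoch values in S3 are absolute instants; they only become a wall-clock time once a zone is applied during load or display. A sketch using `java.time` to render an epoch in America/New_York (which tracks EST/EDT transitions automatically), assuming the conversion can be hooked in before the Oracle insert:

```scala
import java.time.{Instant, ZoneId}

// An epoch timestamp is an absolute instant; it only becomes "10:00 AM" or
// "6:00 AM" once a zone is applied. Rendering it in America/New_York (which
// tracks EST/EDT transitions) recovers the wall-clock time shown in the UI.
val epochMillis = Instant.parse("2022-06-13T14:00:00Z").toEpochMilli // stand-in for a value from S3
val eastern = Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of("America/New_York"))
// eastern.getHour is 10 -- the UI's 10:00 AM, although the UTC instant reads 14:00
```

Note that on 06/13/2022 the zone is actually in daylight time (EDT, UTC-4), which matches the 4-hour gap in the example above; converting with a fixed EST offset would be off by an hour for half the year.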

Throws error if subfolder is not found or empty, and aborts the table merge

Getting the error below when the client encounters a missing path; the download for that table aborts.

[ForkJoinPool-1-worker-3] WARN com.guidewire.cda.TableReader - Copy Job FAILED for 'cc_claim' for fingerprint '4e588b71e9a149148b623a22da443314': org.apache.spark.sql.AnalysisException: Path does not exist: s3a://tenant-xxx/cc/4e588b71e9a149148b623a22da443314/1664585730000/*.parquet;

I was told by GW that it is normal to have a folder containing only the .cda/batch-metrics.json file, where the actual Parquet path does not exist. How do we handle this scenario?

"The failures are related to timestamp folders that don’t contain any Parquet files in them. This is standard behavior where, if all records processed in a batch for a given table were deemed as duplicates (previously seen), CDA would correctly not write them out to S3. But it would still write out the reconciliation stats to the .cda/batch-metrics.json file. It’s the presence of this file that’s causing the folder to show up in S3, even if there are no Parquet files inside of it."
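Given that behavior, one workaround is to pre-filter the timestamp folders before handing them to Spark, keeping only those whose listing actually contains a Parquet object. A pure-Scala sketch over a key listing (the helper name and the listing shape are assumptions, not cda-client code):

```scala
// Hypothetical pre-filter over an S3 key listing: keep only timestamp folders
// that actually contain Parquet objects, so folders holding just
// .cda/batch-metrics.json are never handed to spark.read.parquet.
def foldersWithParquet(keys: Seq[String]): Set[String] =
  keys.filter(_.endsWith(".parquet"))
      .map(k => k.substring(0, k.lastIndexOf('/'))) // parent "folder" of the key
      .toSet

val listing = Seq(
  "cc/4e588b71e9a149148b623a22da443314/1664585730000/.cda/batch-metrics.json", // metrics only
  "cc/4e588b71e9a149148b623a22da443314/1664585700000/part-00000.parquet",
  "cc/4e588b71e9a149148b623a22da443314/1664585700000/.cda/batch-metrics.json"
)
```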

Timestamp folders created across fingerprint schema folders are not in sequence

I have a table with two schema-fingerprint directories, each containing multiple timestamp directories. It was assumed that the minimum timestamp directory in the 2nd fingerprint directory would be greater than the maximum timestamp directory in the 1st, but this is not the case, and we are missing data because of it.

Since the 2nd fingerprint directory contains timestamp directories with values less than the maximum timestamp in the 1st, and the application only picks up data from 2nd-fingerprint timestamp directories whose value is greater than that maximum, those earlier directories are skipped.
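The pitfall can be shown with plain data: filtering the new fingerprint's folders by "timestamp > max of the previous fingerprint" silently drops earlier folders, whereas ordering all (fingerprint, timestamp) pairs globally by timestamp visits every folder. A hypothetical illustration (the values are made up):

```scala
// Hypothetical illustration: timestamp folders across two fingerprint
// directories do not necessarily partition cleanly in time.
val folders = Seq(
  ("fp1", 1000L), ("fp1", 3000L), // max timestamp under fp1 is 3000
  ("fp2", 2000L), ("fp2", 4000L)  // fp2 contains 2000, *before* fp1's max
)

// Filtering fp2 by "timestamp > max(fp1)" silently drops the 2000 folder:
val dropped = folders.filter { case (fp, ts) => fp == "fp2" && ts <= 3000L }

// Sorting all folders globally by timestamp visits every folder exactly once:
val globalOrder = folders.sortBy(_._2)
```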

Performance issue with CDA client while running in windows server

Hi,
We are not getting good performance from the CDA client when running it on a Windows server. We are trying to load the data into Oracle tables. Does anyone have recommended performance-tuning suggestions for running on a Windows server, or benchmarks for loading data into Oracle?

I have a couple of questions as well. If anyone can answer them, it would be very helpful.

  1. We noticed that performance suffers because there is an iteration over every row and column for the null check, as well as for making the data types compatible with Oracle. Has anyone done the data-type conversion outside of the CDA tool, during the database (Oracle) insert operation, using loading tools? What is your experience?
  2. We have noticed that CDA connects to S3 while doing inserts into the database. If this is for record validation, can GW change the logic to stop spawning new connections constantly? These round trips are costing us load performance.

Thanks in advance

TableDataReader.scala: nextReadPointKey probably should not add 1 to lastReadPoint.get.toLong

TableDataReader.getFingerprintToProcess and TableDataReader.getTableS3LocationWithTimestampsAfterLastSave calculate nextReadPointKey as lastReadPoint.get.toLong + 1. The value of nextReadPointKey is then passed to s3Client's listObjects through a ListObjectRequest instance's marker property.

As per the documentation for marker at https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/ListObjectsRequest.html, "Amazon S3 starts listing after this specified key" (emphasis mine). Hence, unless I am mistaken, adding 1 to lastReadPoint.get.toLong is superfluous and may cause some files to be skipped if they have a 1 second difference in timestamps.
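The marker semantics are easy to simulate: S3 resumes the listing strictly after the given key, so passing `lastReadPoint` itself already excludes the last-processed folder, while passing `lastReadPoint + 1` additionally skips any folder whose key equals that incremented value. A sketch with plain Scala standing in for the S3 listing:

```scala
// S3's ListObjects `marker` is exclusive: listing resumes strictly *after* the
// given key. Plain Scala standing in for the S3 listing:
def listAfter(keys: Seq[String], marker: String): Seq[String] =
  keys.sorted.filter(_ > marker)

val keys = Seq("1653539130000", "1653539130001", "1653539190000")

// marker = lastReadPoint already excludes the last-processed key...
val withoutPlusOne = listAfter(keys, "1653539130000")
// ...whereas marker = lastReadPoint + 1 also skips a key equal to that value.
val withPlusOne = listAfter(keys, "1653539130001")
```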

cda-client app needs to copy batch-metrics.json

Hi, I need to copy the batch-metrics.json file that is available under each table, e.g. cc_user/39848ddcb37a4c4a83046dcf9f4dac5d/1653539130000/.cda/batch-metrics.json. Scala is totally new to me; I would appreciate your guidance on how to get the file to the local filesystem.

Thank you
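Since Scala is new to you: the key for that file is just string concatenation over the table/fingerprint/timestamp layout, and the download itself is a single AWS SDK call. A sketch (the bucket name and local-path handling are assumptions; the SDK call is shown in comments so the snippet stands alone):

```scala
// The batch-metrics.json key follows the layout
// <table>/<fingerprint>/<timestamp>/.cda/batch-metrics.json, so building it is
// plain string work:
def metricsKey(table: String, fingerprint: String, timestamp: String): String =
  s"$table/$fingerprint/$timestamp/.cda/batch-metrics.json"

val key = metricsKey("cc_user", "39848ddcb37a4c4a83046dcf9f4dac5d", "1653539130000")

// Downloading it is one AWS SDK v1 call (commented out so this compiles
// without the SDK on the classpath; `s3Client` and `bucketName` are assumed):
// s3Client.getObject(
//   new com.amazonaws.services.s3.model.GetObjectRequest(bucketName, key),
//   new java.io.File("batch-metrics.json"))
```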

cda-client taking a long time to write to CSV file

Hi team,
Writing a DataFrame (9262) to CSV is taking 8.48 hours. The process is throwing OOM errors because of this long run time and the volume of tables processed. Please let me know of any way to improve performance for the lines of code below, where saveAsSingleFile=true:
if (saveAsSingleFile) {
  tableDF.coalesce(1).write.option("header", includeColumnNames).option("emptyValue", null).option("nullValue", null).mode(SaveMode.Overwrite).csv(pathToFolderWithCSV)
} else {
  tableDF.write.option("header", includeColumnNames).option("emptyValue", "").option("nullValue", null).mode(SaveMode.Overwrite).csv(pathToFolderWithCSV)
}
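`coalesce(1)` is the usual culprit here: it funnels the entire write through a single task, which is both slow and memory-hungry and matches the OOM symptom. One common alternative, sketched below under the assumption that the output lands on a local filesystem: let Spark write part files in parallel (drop the `coalesce(1)`), then concatenate the parts into one CSV afterwards.

```scala
import java.io.File

// coalesce(1) forces the entire write through one task. Alternative sketch:
// write part files in parallel, e.g.
//   tableDF.write.option("header", includeColumnNames).csv(pathToFolderWithCSV)
// and then merge the resulting part-*.csv files locally:
def mergeParts(dir: File, out: File): Unit = {
  val parts = dir.listFiles().filter(_.getName.startsWith("part-")).sortBy(_.getName)
  val writer = new java.io.PrintWriter(out)
  try {
    parts.foreach { p =>
      val src = scala.io.Source.fromFile(p)
      try src.getLines().foreach(writer.println) finally src.close()
    }
  } finally writer.close()
}
```

Note that if `header` is enabled, each part file carries its own header row, so a real merge would keep the first file's header and drop the rest; this sketch concatenates lines verbatim.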

Spark and Hadoop Version confirmation

Sorry for previously not asking the question on the right platform. I need clarity on the exact version of Spark. Did you use Spark 2.4.7 with Hadoop 2.7?

I also noticed hadoopAWSVersion = 2.8.5. Which version of Hadoop is this compatible with?

Thanks a lot for your quick support.
