azure-data-lake-store-java's Issues

Not compliant with Java 11

The azure-data-lake-store-sdk JAR bundles wildfly-openssl 1.0.0.CR5, which is not compatible with Java 11.

Running with jdk-11.0.1:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.wildfly.openssl.ByteBufferUtils (file:/D:/maven_repo/com/microsoft/azure/azure-data-lake-store-sdk/2.3.4/azure-data-lake-store-sdk-2.3.4.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.wildfly.openssl.ByteBufferUtils
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.wildfly.openssl.ByteBufferUtils.<clinit>(ByteBufferUtils.java:39)
	at org.wildfly.openssl.OpenSSLEngine.readEncryptedData(OpenSSLEngine.java:338)
	at org.wildfly.openssl.OpenSSLEngine.wrap(OpenSSLEngine.java:444)
	at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:479)
	at org.wildfly.openssl.OpenSSLSocket.write(OpenSSLSocket.java:501)
	at org.wildfly.openssl.OpenSSLOutputStream.write(OpenSSLOutputStream.java:46)
	at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
	at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
	at java.base/java.io.PrintStream.flush(PrintStream.java:417)
	at java.base/sun.net.www.MessageHeader.print(MessageHeader.java:301)
	at java.base/sun.net.www.http.HttpClient.writeRequests(HttpClient.java:655)
	at java.base/sun.net.www.http.HttpClient.writeRequests(HttpClient.java:666)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:711)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1602)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1509)
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:329)
	at com.microsoft.azure.datalake.store.HttpTransport.makeSingleCall(HttpTransport.java:307)
	at com.microsoft.azure.datalake.store.HttpTransport.makeCall(HttpTransport.java:90)
	at com.microsoft.azure.datalake.store.Core.getFileStatus(Core.java:691)
	at com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:742)
	at com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:725)
	 ...
Caused by: java.lang.IllegalAccessException: class org.wildfly.openssl.ByteBufferUtils cannot access class jdk.internal.ref.Cleaner (in module java.base) because module java.base does not export jdk.internal.ref to unnamed module @60e949e1
	at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:361)
	at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:591)
	at java.base/java.lang.reflect.Method.invoke(Method.java:558)
	at org.wildfly.openssl.ByteBufferUtils.<clinit>(ByteBufferUtils.java:36)
	... 27 more

Forcing the latest version of wildfly-openssl works fine:

    <dependency>
        <groupId>org.wildfly.openssl</groupId>
        <artifactId>wildfly-openssl</artifactId>
        <version>1.0.4.Final</version>
    </dependency>

Unit tests

Is there a standard account used for the unit tests for this SDK?

startAfter field in ADLStoreClient.enumerateDirectory ignored if unicode

I've noticed that calling ADLStoreClient.enumerateDirectory with the startAfter field causes the entire directory to be listed if startAfter contains Unicode characters. For example, in a directory where the Unicode filename "澳门.tst" is the last file, providing that path as startAfter re-lists the entire directory contents. I've reproduced this with other Unicode paths. This makes paging through a directory of Unicode paths impossible: any time the last path in a page is a Unicode path, enumeration restarts from the beginning.
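
For reference, here is a minimal paging sketch of the scenario. It assumes the enumerateDirectory(path, maxEntriesToRetrieve, startAfter) overload and the public DirectoryEntry.name field; the helper itself is illustrative, not SDK code:

    import com.microsoft.azure.datalake.store.ADLStoreClient;
    import com.microsoft.azure.datalake.store.DirectoryEntry;

    import java.io.IOException;
    import java.util.List;

    public class UnicodePagingRepro {
        // Page through a directory using startAfter. When the last entry of a page is a
        // Unicode name such as "澳门.tst", the next call re-lists from the beginning
        // instead of continuing after it.
        static void listInPages(ADLStoreClient client, String directory, int pageSize) throws IOException {
            String startAfter = null;
            while (true) {
                List<DirectoryEntry> page = client.enumerateDirectory(directory, pageSize, startAfter);
                if (page.isEmpty()) {
                    break;
                }
                for (DirectoryEntry entry : page) {
                    System.out.println(entry.name);
                }
                startAfter = page.get(page.size() - 1).name;  // e.g. "澳门.tst"
            }
        }
    }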

Incompatible with Java 16

Creating a directory throws the error below:

Java: 16

java.lang.IllegalStateException: javax.security.cert.CertificateException: Could not find class: java.lang.ClassNotFoundException: com/sun/security/cert/internal/x509/X509V1CertImpl
at org.wildfly.openssl.OpenSSlSession.initPeerCertChain(OpenSSlSession.java:285) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSlSession.initialised(OpenSSlSession.java:295) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLSessionContext.clientSessionCreated(OpenSSLSessionContext.java:132) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLClientSessionContext.storeClientSideSession(OpenSSLClientSessionContext.java:94) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLEngine.handshakeFinished(OpenSSLEngine.java:1016) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLEngine.getHandshakeStatus(OpenSSLEngine.java:1065) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLEngine.unwrap(OpenSSLEngine.java:654) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679) ~[na:na]
at org.wildfly.openssl.OpenSSLSocket.runHandshake(OpenSSLSocket.java:324) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at org.wildfly.openssl.OpenSSLSocket.startHandshake(OpenSSLSocket.java:210) ~[wildfly-openssl-1.0.7.Final.jar:1.0.7.Final]
at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:574) ~[na:na]
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:183) ~[na:na]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1419) ~[na:na]
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1390) ~[na:na]
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:220) ~[na:na]
at com.microsoft.azure.datalake.store.HttpTransport.makeSingleCall(HttpTransport.java:293) ~[azure-data-lake-store-sdk-2.3.9.jar:2.3.9]
at com.microsoft.azure.datalake.store.HttpTransport.makeCall(HttpTransport.java:90) ~[azure-data-lake-store-sdk-2.3.9.jar:2.3.9]
at com.microsoft.azure.datalake.store.Core.mkdirs(Core.java:440) ~[azure-data-lake-store-sdk-2.3.9.jar:2.3.9]
at com.microsoft.azure.datalake.store.ADLStoreClient.createDirectory(ADLStoreClient.java:665) ~[azure-data-lake-store-sdk-2.3.9.jar:2.3.9]

Data Lake library needs to use Shade plugin to prevent class path versioning issues

We have noticed an issue with some of the Azure Management SDKs: they don't repackage their dependencies into a custom Microsoft namespace using either the Maven Shade plugin or the Gradle Shadow plugin. Without these plugins, versioning issues can occur when this SDK is used alongside other open source libraries that depend on similar but different versions of the same dependencies (i.e. 'classpath hell'). When classpath hell occurs, you'll often see a run-time exception such as NoSuchMethodError, because two versions of the same dependency are on the classpath and the wrong one gets loaded for one of the libraries that uses it. Using the Shade/Shadow plugins prevents this.

I'm logging this issue because the POM for the azure-data-lake-store-java library doesn't reference the Maven Shade plugin. Please take a look at using this plugin to ensure that version conflicts don't occur when using the azure-data-lake-store-java SDK.

Here's a link to a similar issue logged against the Document DB library; it describes this type of problem in greater detail, since a versioning conflict was found when using the Document DB library in combination with the App Insights SDK: Azure/azure-documentdb-java#90

AdlFileSystem: SetOwner intermittently fails

I have a simple script that uses the Hadoop CLI to run create and setOwner continuously in the following way:

  1. hadoop fs -put /some/file fooX
  2. hadoop fs -chown username:groupname fooX
  3. X++
  4. back to step 1.

This works most of the time, but intermittently, the setOwner call fails with:

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Set Owner failed with failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).

The service principal used has owner-level access, and we're using the same username and group on every call, so my suspicion is that sometimes the file doesn't yet exist at the time of the setOwner call. Is it possible that in some situations read-after-write consistency isn't met and the file has not been created by the time it is chowned?

This is with hadoop-azure-datalake-2.7.3.2.6.4.0-91 and azure-data-lake-store-sdk-2.1.4. Please let me know what other info I can provide.

MSI token provider doesn't allow users to configure RetryOptions

While getting the token through MSI, AzureADAuthenticator uses a hardcoded exponential backoff policy:

  RetryPolicy retryPolicy = new ExponentialBackoffPolicyforMSI(3, 1000, 2);

This interval is too low, especially in cases where AKS clusters are trying to attach a pod identity, which can take longer than 8 s.

Can we allow users to configure this, or reuse the AAD OAuth 2.0 refresh-token retry options here?
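
For illustration only, a hypothetical sketch of what configurability could look like if the hardcoded values above (assumed here to map to maxRetries, retryIntervalMs, backoffFactor) were exposed to callers; this is not current SDK API:

    // Hypothetical only: the same constructor as the hardcoded call above, but with
    // caller-supplied values instead of the fixed (3, 1000, 2).
    int maxRetries = 5;          // more attempts than the fixed 3
    int retryIntervalMs = 4000;  // longer base interval to cover slow pod-identity attach
    int backoffFactor = 2;
    RetryPolicy retryPolicy = new ExponentialBackoffPolicyforMSI(maxRetries, retryIntervalMs, backoffFactor);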

MSI endpoint chunked response not handled

AzureADAuthenticator.java throws an exception when the response from the MSI endpoint is chunked. In that case no Content-Length header is present, which causes the content-length check responseContentLength > 0 (line 276) to return false:

if (httpResponseCode == 200 && responseContentType.startsWith("application/json") && responseContentLength > 0) {

This leads to a NullPointerException in the else block, which is meant for handling errors.

I found this issue in the Hadoop fork of this project: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java
However, it seems like it should be fixed here first as this is the official release. Let me know if I should post a Hadoop issue as well.

Example curl from an AKS pod showing a chunked response. I did not intentionally do anything to get a chunked response from the endpoint.

$ curl -v -H "metadata: true" "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://storage.azure.com/"
* Expire in 0 ms for 6 (transfer 0x55a4fcdb5fb0)
*   Trying 169.254.169.254...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55a4fcdb5fb0)
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://storage.azure.com/ HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.64.0
> Accept: */*
> metadata: true
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 15 Jun 2022 19:27:16 GMT
< Transfer-Encoding: chunked
<
{"access_token":"...","refresh_token":"","expires_in":"86400","expires_on":"1655407636","not_before* Connection #0 to host 169.254.169.254 left intact

NoClassDefFoundError when used with Azure Management SDK caused by incompatible Jackson version

Using this library with the Azure Management Java libraries causes a NoClassDefFoundError. This is because azure-data-lake-store-sdk depends on Jackson 2.8.6, but the Azure Java libraries depend on 2.9.4. This library should be upgraded to use 2.9.4 to avoid having to specify an exclusion.

Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/type/WritableTypeId
	at com.fasterxml.jackson.databind.jsontype.TypeSerializer.typeId(TypeSerializer.java:78)
	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase._typeIdDef(BeanSerializerBase.java:679)
	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:599)
	at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:729)
	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
	at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
	at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:2643)
	at com.fasterxml.jackson.databind.ObjectMapper.valueToTree(ObjectMapper.java:2785)
	at com.microsoft.rest.serializer.FlatteningSerializer.serialize(FlatteningSerializer.java:166)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
	at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1396)
	at com.fasterxml.jackson.databind.ObjectWriter._configAndWriteValue(ObjectWriter.java:1120)
	at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsBytes(ObjectWriter.java:1017)
	at com.microsoft.rest.serializer.JacksonConverterFactory$JacksonRequestBodyConverter.convert(JacksonConverterFactory.java:78)
	at com.microsoft.rest.serializer.JacksonConverterFactory$JacksonRequestBodyConverter.convert(JacksonConverterFactory.java:69)
	at retrofit2.ParameterHandler$Body.apply(ParameterHandler.java:355)
	at retrofit2.RequestFactory.create(RequestFactory.java:108)
	at retrofit2.OkHttpCall.createRawCall(OkHttpCall.java:190)
	at retrofit2.OkHttpCall.execute(OkHttpCall.java:173)
	at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:40)
	at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:24)
	at rx.Observable.unsafeSubscribe(Observable.java:10327)
	at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:48)
	at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:33)
	at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
	at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
	at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
	at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
	at rx.Observable.subscribe(Observable.java:10423)
	at rx.Observable.subscribe(Observable.java:10390)
	at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:443)
	at rx.observables.BlockingObservable.single(BlockingObservable.java:340)
	at com.microsoft.azure.management.datalake.analytics.implementation.JobsImpl.create(JobsImpl.java:403)
	at Main.SubmitJobByScript(Main.java:209)
	at Main.main(Main.java:94)
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.type.WritableTypeId
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 38 more

Spark job fails during CREATE file operation on Azure Data Lake Gen1

Hi,

I am facing an issue with a Spark job which is reading streaming data from Azure Event Hub and storing the data in the ADL (Azure Data Lake) Gen1 file system.

Spark Version: 3.0.0

Please help and let me know:

  1. What is the root cause of the issue?
  2. How can it be fixed? Does this have something to do with the size of the ADL Gen1 file system?
  3. One more observation: this usually happens when the number of input transactions is large (around 1 million), but is usually not seen when it is less than 1M. Is this just a coincidence, or does the size of the input load also matter?

Brief Overview
Our big data product runs in an AKS cluster deployed in Microsoft Azure.

All the jobs executed within the product are Apache Spark jobs. In addition to HDFS, Azure Data Lake Gen1 is one of the supported file systems.

Scenario
A source generates events and publishes them to Azure Event Hubs. A Spark Streaming job waits for events on a particular EH (Event Hub) and keeps writing the data to the Azure Data Lake Gen1 file system.

  • A huge number of transaction records (1 to 5 million) have been injected at the source side
  • The Spark Streaming job runs continuously for hours, writing the data to the ADL Gen1 file system

All of a sudden, it fails in the middle with the error below:
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
Caused by: com.microsoft.azure.datalake.store.ADLException: Error creating file /landing_home/delta_log/.00000000000000001748.json.9d2edecf-973c-4d61-a178-4db46bd70f2c.tmp
Operation CREATE failed with exception java.net.SocketTimeoutException : Read timed out
Last encountered exception thrown after 1 tries. [java.net.SocketTimeoutException]
[ServerRequestId:null]
at com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1169)
at com.microsoft.azure.datalake.store.ADLStoreClient.createFile(ADLStoreClient.java:281)
at org.apache.hadoop.fs.adl.AdlFileSystem.create(AdlFileSystem.java:374)
at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1228)
at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:100)
at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:605)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:696)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:692)

Data Lake Store Gen2

It is not specified anywhere, but I guess that this client implements only the Azure Data Lake Gen1 REST API. Am I right?

Is support for the Gen2 REST API on the roadmap?

ExponentialBackoffPolicy should add randomness to backoff interval

ExponentialBackoffPolicy.shouldRetry() uses a completely deterministic wait time. This can lead to continued failures in the face of multiple parallel processes launched more-or-less simultaneously, because each task will restart its attempt at the same time. While this case may seem contrived, it is exactly what happens when we attempt to launch 100s of tasks in Hadoop that perform parallel reads on a set of "part" files stored in ADL.

Suggested solution:

private void wait(int milliseconds) {
    if (milliseconds <= 0) {
        return;
    }
    try {
        // Sleep between 50% and 100% of the nominal interval to add jitter.
        Thread.sleep(milliseconds / 2 + ThreadLocalRandom.current().nextInt(milliseconds / 2 + 1));
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}

where ThreadLocalRandom (java.util.concurrent.ThreadLocalRandom) stands in for the suggested randomInt(); java.util.Random.nextInt() would work equally well if a highly random seed were used.

The retry count of ExponentialBackoffPolicy created by ADLFileInputStream is not configurable

We occasionally encounter errors under heavy load where all 5 tries are exhausted:

com/microsoft/azure/datalake/store/ADLFileInputStream.read:com.microsoft.azure.datalake.store.ADLException: Error reading from file [filename]
Operation OPEN failed with HTTP429 : ThrottledException
Last encountered exception thrown after 5 tries. [HTTP429(ThrottledException),HTTP429(ThrottledException),HTTP429(ThrottledException),HTTP429(ThrottledException),HTTP429(ThrottledException)]
[ServerRequestId: redacted]
com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1179)
com.microsoft.azure.datalake.store.ADLFileInputStream.readRemote(ADLFileInputStream.java:252)
com.microsoft.azure.datalake.store.ADLFileInputStream.readInternal(ADLFileInputStream.java:221)
com.microsoft.azure.datalake.store.ADLFileInputStream.readFromService(ADLFileInputStream.java:132)
com.microsoft.azure.datalake.store.ADLFileInputStream.read(ADLFileInputStream.java:101)

The readRemote() method uses the default ExponentialBackoffPolicy() constructor, and there doesn't seem to be any way to specify more retries or a steeper backoff. In our use case, we have tasks in Hadoop running in parallel, and they apparently overwhelm the default backoff strategy.
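
As a caller-side workaround until the policy is configurable, reads can be wrapped in an outer retry with more attempts and a steeper backoff. A minimal sketch, assuming ADLException exposes the public httpResponseCode field:

    import com.microsoft.azure.datalake.store.ADLException;
    import com.microsoft.azure.datalake.store.ADLFileInputStream;

    import java.io.IOException;

    final class ThrottledReadRetry {
        // Retry reads that fail with HTTP 429 (ThrottledException) using a caller-chosen
        // attempt count and a simple exponential backoff, on top of the SDK's own retries.
        static int readWithRetry(ADLFileInputStream in, byte[] buffer, int maxAttempts, long baseDelayMs)
                throws IOException, InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    return in.read(buffer);
                } catch (ADLException e) {
                    if (e.httpResponseCode != 429 || attempt >= maxAttempts) {
                        throw e;
                    }
                    Thread.sleep(baseDelayMs << (attempt - 1));  // 1x, 2x, 4x, ... of baseDelayMs
                }
            }
        }
    }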

Proxy support

Hi,
I don't see any proxy support in your SDK. Can you please confirm? Many corporations use a proxy, and without proxy support, how can on-prem systems behind a proxy work with Azure?
Thanks
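
As a possible interim workaround (standard JVM settings rather than an SDK feature), the HttpsURLConnection transport the SDK uses should honor the usual JVM proxy properties, e.g.:

    // Route the SDK's HttpsURLConnection traffic through a corporate proxy using
    // standard JVM proxy properties (host and port below are placeholders).
    public class ProxySettings {
        public static void main(String[] args) {
            System.setProperty("https.proxyHost", "proxy.example.com");
            System.setProperty("https.proxyPort", "8080");
            // Hosts that should bypass the proxy, e.g. the instance metadata endpoint used by MSI.
            System.setProperty("http.nonProxyHosts", "localhost|169.254.169.254");
            // ... create the ADLStoreClient and make calls as usual after this point.
        }
    }

The same properties can also be passed on the command line with -Dhttps.proxyHost=... and -Dhttps.proxyPort=....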

HttpTransport swallowing detailed error message.

Hi,

I am currently using the ADL store client implicitly, via the Hadoop AdlFileSystem.

While testing in a firewalled environment, I found it is really hard to know the cause of a failure because the error message is very short. Turning on DEBUG logging does not help either, as the response object has not been updated; also, it's hard to expect DEBUG mode to be on in production.

Below are the error message and a link to the code.

https://github.com/Azure/azure-data-lake-store-java/blob/master/src/main/java/com/microsoft/azure/datalake/store/HttpTransport.java#L185

DEBUG AADToken: starting to fetch token using client creds for client ID
DEBUG HTTPRequest,Failed,cReqId:.0,lat:47,err:HTTP0(null),Reqlen:0,Resplen:0,token_ns:47224784,sReqId:null,path:/tmp/,qp:op=MKDIRS&permission=750&api-version=2016-11-01
WARN Caught exception. This may be retried.
com.microsoft.azure.datalake.store.ADLException: Error fetching access token
Last encountered exception thrown after 1 tries [HTTP0(null)]
at com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1018)
at com.microsoft.azure.datalake.store.ADLStoreClient.createDirectory(ADLStoreClient.java:471)
at org.apache.hadoop.fs.adl.AdlFileSystem.mkdirs(AdlFileSystem.java:560)
...

Thanks,
Jin

WildFly upgrade - security issues

The WildFly version currently used by azure-data-lake-store-sdk (17.0.1) is outdated and has been identified as containing multiple security vulnerabilities:

[vulnerability scan screenshot]

Please upgrade to the latest version (currently 21.0.1) to mitigate those.

Thanks

AdlFileSystem: Set Owner failed with failed with error 0x83090aa2

I am trying to create a directory in Azure Data Lake and then set the owner attribute. I am using the following code to achieve this:

    Configuration conf = new Configuration();
    conf.addResource(new Path("/path/to/core-site.xml"));

    Path path =
      new Path("adl://accountname.azuredatalakestore.net/path/to/directory");

    AdlFileSystem fs = (AdlFileSystem) FileSystem.get(path.toUri(), conf);
    fs.mkdirs(path);  // SUCCEEDS

    FileStatus status = fs.getFileStatus(path); // SUCCEEDS

    fs.setOwner(path, status.getOwner(), status.getGroup()); // FAILS!

The fs.setOwner(...) call above fails with the following exception, even though the same owner and group parameters are passed as received from the fs.getFileStatus(path) call.

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Set Owner failed with  failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [715f4073-8615-4a6f-8bd0-bbcb2b3cd767][2017-08-29T03:13:51.5592332-07:00] [ServerRequestId:715f4073-8615-4a6f-8bd0-bbcb2b3cd767]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1167)
	at com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1132)
	at com.microsoft.azure.datalake.store.ADLStoreClient.setOwner(ADLStoreClient.java:756)
	at org.apache.hadoop.fs.adl.AdlFileSystem.setOwner(AdlFileSystem.java:649)
       [...]

Can you please suggest what the issue could be? I need to set the owner field using the ADL FileSystem Java APIs.

AclEntry class doesn't allow user id to be passed

The AclEntry class in the Java SDK doesn't allow a user ID to be passed and expects the name of the user as a parameter.

AclEntry newAcl = new AclEntry(AclScope.ACCESS,AclType.USER,name,AclAction.ALL);

When a service principal is deleted and recreated with the same name in AAD, the AclEntry class ends up adding the old service principal even after the old service principal has been explicitly removed from the file or directory access. While that behavior is being questioned in a different channel, I'm checking here in the SDK to see if this class can be improved to accept an ID as a parameter. I am not sure, since this wrapper is based on the WebHDFS REST API, but the PowerShell command https://docs.microsoft.com/en-us/powershell/module/azurerm.datalakestore/set-azurermdatalakestoreitemaclentry?view=azurermps-6.13.0 accepts an ID, so it looks like the WebHDFS REST API may accept an ID after all.

So the request here is to see whether the AclEntry class can accept a user ID instead of a name, to avoid adding the old principal when programmatically setting ACLs using the Java SDK.

Client Throwing fileAlreadyExists error and uploading empty file

We are running a high data volume ETL process to upload JSON files to ADLS using Java Spring. During a load test, the Azure ADLS client began throwing the following error:

com.azure.storage.file.datalake.models.DataLakeStorageException: Status code 409, "{"error":{"code":"PathAlreadyExists","message":"The specified path already exists.

After looking into this, I noticed that the library creates an empty file and then throws that error. I am fairly confident that this error is erroneous because we have two different file-naming approaches: some of our files are named with a unique ID for the given event being processed, and others are named with a Java-generated GUID. Both naming standards produced this error.

I could not find any other information online to help troubleshoot. I am wondering if this is a known bug or if there is some other known issue that may relate to this. Please let me know what information I can include to help troubleshoot this further because I do recognize I have provided limited information in this report.
