
Stocator - Storage Connector for Apache Spark

Apache Spark can work with multiple data sources that include various object stores like IBM Cloud Object Storage, OpenStack Swift and more. To access an object store, Apache Spark uses Hadoop modules that contain connectors to the various object stores.

Apache Spark needs only a small set of object store functionalities. Specifically, Apache Spark requires object listing, object creation, object reads, and getting data partitions. Hadoop connectors, however, must be compliant with the Hadoop ecosystem. This means they support many more operations, such as shell operations on directories, including move, copy, rename, etc. (these are not native object store operations). Moreover, the Hadoop MapReduce Client is designed to work with file systems and not object stores. The temp files and folders it uses for every write operation are renamed, copied, and deleted. This leads to dozens of useless requests targeted at the object store. It’s clear that Hadoop is designed to work with file systems and not object stores.

Stocator is designed specifically for object stores and has a very different architecture from the existing Hadoop connectors. It doesn’t depend on the Hadoop modules and interacts directly with object stores.

Stocator is a generic connector that may contain various implementations for object stores. It ships with OpenStack Swift and IBM Cloud Object Storage connectors, and can be easily extended with more object store implementations.

Major features

  • Hadoop ecosystem compliant. Implements the Hadoop FileSystem interface
  • No need to change or recompile Apache Spark
  • Stocator doesn’t create any temporary folders or files for write operations. Each Spark task generates only one object in the object store. Preserves the existing Hadoop fault tolerance model
  • An object's name may contain "/" and thus simulate directory structures
  • Containers / buckets are automatically created
  • Supports speculative execution (speculate mode)
  • Stocator uses Apache HttpComponents HttpClient version 4.5.2 and up

Stocator build procedure

  • Check out the master branch from https://github.com/SparkTC/stocator.git

  • Change directory to stocator

  • Execute mvn install

  • If you want to build a jar with all the dependencies, please execute

      mvn clean package -Pall-in-one
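
For reference, the whole build flow can be run as the following shell sequence (a sketch, assuming Git and Maven are installed):

git clone https://github.com/SparkTC/stocator.git
cd stocator
mvn install
# optionally, build the all-in-one jar with dependencies
mvn clean package -Pall-in-one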
    

Usage with Apache Spark

Stocator can be used easily with Apache Spark. There are several ways to use Stocator:

Using spark-packages

Stocator is deployed on spark-packages. This is the easiest way to integrate Stocator with Spark. Just follow the stocator entry on spark-packages.
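
For example, a hedged sketch of launching spark-shell with the package (the exact package coordinates and version X.Y.Z should be taken from the stocator spark-packages page; the coordinate below is an assumption based on the Maven groupId and artifactId used later in this document):

./bin/spark-shell --packages com.ibm.stocator:stocator:X.Y.Z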

Using Stocator without compiling Apache Spark

It is possible to execute Apache Spark with Stocator without compiling Apache Spark. Just make sure that the Stocator jar is on the class path. Build Stocator with

	mvn clean package -Pall-in-one

Directory stocator/target contains standalone jar stocator-X.Y.Z-jar-with-dependencies.jar.

The best approach is to extend Spark's class path to include the Stocator jar. Edit conf/spark-defaults.conf and add Stocator to the class path. For example

spark.driver.extraClassPath=/<PATH>/stocator/target/stocator-X-Y-Z-jar-with-dependencies.jar
spark.executor.extraClassPath=/<PATH>/stocator/target/stocator-X-Y-Z-jar-with-dependencies.jar

Another option is to run Apache Spark with

./bin/spark-shell --jars stocator-X.Y.Z-jar-with-dependencies.jar

However, this is less advised, as Spark will copy the Stocator jar to the executors, which may consume time.

Using Stocator with Apache Spark compilation

This is less recommended and only for advanced users who want to recompile Spark. Both the main pom.xml and core/pom.xml of Spark should be modified. Add to the <properties> of the main pom.xml

<stocator.version>X.Y.Z</stocator.version>

Add the Stocator dependency to the main pom.xml

<dependency>
	<groupId>com.ibm.stocator</groupId>
	<artifactId>stocator</artifactId>
	<version>${stocator.version}</version>
	<scope>${hadoop.deps.scope}</scope>
</dependency>

Modify core/pom.xml to include Stocator

<dependency>
	<groupId>com.ibm.stocator</groupId>
	<artifactId>stocator</artifactId>
</dependency>

Compile Apache Spark with Hadoop support as described here

General requirements

Stocator verifies that

mapreduce.fileoutputcommitter.marksuccessfuljobs=true

If not modified, the default value of mapreduce.fileoutputcommitter.marksuccessfuljobs is true.

Configuration keys

Stocator uses configuration keys that can be configured via Spark's core-site.xml or provided at run time without using core-site.xml. To provide keys at run time, use the SparkContext variable with

sc.hadoopConfiguration.set("KEY","VALUE")
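
For example, a minimal run-time sketch that sets the COS connector keys described later in this document (the service name myCos and the credential values are placeholders):

sc.hadoopConfiguration.set("fs.cos.myCos.endpoint", "http://s3-api.us-geo.objectstorage.softlayer.net")
sc.hadoopConfiguration.set("fs.cos.myCos.access.key", "ACCESS KEY")
sc.hadoopConfiguration.set("fs.cos.myCos.secret.key", "SECRET KEY")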

For usage with core-site.xml, see the configuration template located under conf/core-site.xml.template.

Stocator and IBM Cloud Object Storage (IBM COS)

Stocator allows access to IBM Cloud Object Storage via the cos:// schema. The general URI is of the form

cos://<bucket>.<service>/object(s)

where <bucket> is the object storage bucket and <service> identifies the configuration group entry.

Using multiple service names

Each <service> may be any text without special characters. Each service may use its own credentials and a different endpoint. Using multiple <service> names allows different endpoints to be used simultaneously.

For example, if service=myObjectStore, then the URI will be of the form

cos://<bucket>.myObjectStore/object(s)

and configuration keys will have the prefix fs.cos.myObjectStore. If no service name is provided, the default value is service, e.g.

cos://<bucket>.service/object(s)

and configuration keys will have the prefix fs.cos.service.
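
For example, a sketch of two services configured side by side (the service names myCosA and myCosB and the endpoint values are illustrative placeholders, not real endpoints):

<property>
	<name>fs.cos.myCosA.endpoint</name>
	<value>http://ENDPOINT-A</value>
</property>
<property>
	<name>fs.cos.myCosB.endpoint</name>
	<value>http://ENDPOINT-B</value>
</property>

Objects can then be addressed as cos://<bucket>.myCosA/object(s) and cos://<bucket>.myCosB/object(s) in the same job.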

Reference Stocator in the core-site.xml

Configure Stocator in conf/core-site.xml

<property>
	<name>fs.stocator.scheme.list</name>
	<value>cos</value>
</property>
<property>
	<name>fs.cos.impl</name>
	<value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<property>
	<name>fs.stocator.cos.impl</name>
	<value>com.ibm.stocator.fs.cos.COSAPIClient</value>
</property>
<property>
	<name>fs.stocator.cos.scheme</name>
	<value>cos</value>
</property>

Configuration keys

The Stocator COS connector exposes the "fs.cos" key prefix. For backward compatibility Stocator also supports the "fs.s3d" and "fs.s3a" prefixes, where "fs.cos" has the highest priority and will override the other keys, if present.

COS Connector configuration with IAM

To work with IAM and provide an API key, please switch to the relevant ibm-sdk branch, depending on the Stocator version you need. For example, for the Stocator 1.0.24 release, switch to 1.0.24-ibm-sdk; for Stocator master 1.0.25-SNAPSHOT, switch to 1.0.25-SNAPSHOT-IBM-SDK; and so on.

You will need to build Stocator manually, for example using the 1.0.24-ibm-sdk branch:

git clone https://github.com/SparkTC/stocator
cd stocator
git fetch
git checkout -b 1.0.24-ibm-sdk origin/1.0.24-ibm-sdk
mvn clean install -Dmaven.test.skip=true

You now need to include target/stocator-1.0.24-SNAPSHOT-IBM-SDK.jar in Spark's class path. Follow the section Using Stocator without compiling Apache Spark.

Configure Stocator

The next step is to configure Stocator with your COS credentials. The COS credentials are of the form

{
  "apikey": "123",
  "endpoints": "https://cos-service.bluemix.net/endpoints",
  "iam_apikey_description": "Auto generated apikey during resource-key operation for Instance - abc",
  "iam_apikey_name": "auto-generated-apikey-123",
  "iam_role_crn": "role",
  "iam_serviceid_crn": "identity-123::serviceid:ServiceId-XYZ",
  "resource_instance_id": "abc"
}

The following is the list of the Stocator configuration keys. <service> can be any value, for example myCOS

Key | Info | Mandatory | Value
fs.cos.<service>.iam.api.key | API key | mandatory | value of apikey
fs.cos.<service>.iam.service.id | Service ID | mandatory | value of iam_serviceid_crn. In certain cases you need only the value after :serviceid:
fs.cos.<service>.endpoint | COS endpoint | mandatory | open the link from endpoints and choose the relevant endpoint; this endpoint goes here

Example, configure <service> as myCOS:

<property>
	<name>fs.cos.myCos.iam.api.key</name>
	<value>123</value>
</property>
<property>
	<name>fs.cos.myCos.endpoint</name>
	<value>http://s3-api.us-geo.objectstorage.softlayer.net</value>
</property>
<property>
	<name>fs.cos.myCos.iam.service.id</name>
	<value>ServiceId-XYZ</value>
</property>

Now you can use URI

cos://mybucket.myCos/myobject(s)

Optionally, it is possible to provide an existing token instead of using an API key. Instead of providing fs.cos.myCos.iam.api.key, Stocator supports fs.cos.myCos.iam.api.token, which may contain the value of an existing token. When the token expires, Stocator will throw a 403 exception. It is the user's responsibility to provide a long-lived token or to re-create the token outside of Stocator.
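
A minimal sketch of the token-based alternative (the token value is a placeholder):

<property>
	<name>fs.cos.myCos.iam.api.token</name>
	<value>TOKEN</value>
</property>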

COS Connector configuration without IAM

The following is the list of the configuration keys. <service> can be any value, for example myCOS

Key | Info | Mandatory
fs.cos.<service>.access.key | Access key | mandatory
fs.cos.<service>.secret.key | Secret key | mandatory
fs.cos.<service>.session.token | Session token | optional
fs.cos.<service>.endpoint | COS endpoint | mandatory
fs.cos.<service>.v2.signer.type | Signer type | optional

Example, configure <service> as myCOS:

<property>
	<name>fs.cos.myCos.access.key</name>
	<value>ACCESS KEY</value>
</property>
<property>
	<name>fs.cos.myCos.endpoint</name>
	<value>http://s3-api.us-geo.objectstorage.softlayer.net</value>
</property>
<property>
	<name>fs.cos.myCos.secret.key</name>
	<value>SECRET KEY</value>
</property>
<property>
	<name>fs.cos.myCos.session.token</name>
	<value>SESSION TOKEN</value>
</property>
<property>
	<name>fs.cos.myCos.v2.signer.type</name>
	<value>false</value>
</property>

Now you can use URI

cos://mybucket.myCos/myobject(s)

COS Connector optional configuration tuning

Key | Default | Info
fs.cos.socket.send.buffer | 8*1024 | socket send buffer to be used in the client
fs.cos.socket.recv.buffer | 8*1024 | socket receive buffer to be used in the client
fs.cos.paging.maximum | 5000 | number of records to get while paging through a directory listing
fs.cos.threads.max | 10 | the maximum number of threads to allow in the pool used by TransferManager
fs.cos.threads.keepalivetime | 60 | the time an idle thread waits before terminating
fs.cos.signing-algorithm | | override the signature algorithm used for signing requests
fs.cos.connection.maximum | 10000 | number of simultaneous connections to the object store
fs.cos.attempts.maximum | 20 | number of times we should retry errors
fs.cos.block.size | 128 | size of a block in MB
fs.cos.connection.timeout | 800000 | amount of time (in ms) until we give up on a connection to the object store
fs.cos.connection.establish.timeout | 50000 | amount of time (in ms) until we give up trying to establish a connection to the object store
fs.cos.client.execution.timeout | 500000 | amount of time (in ms) to allow a client to complete the execution of an API call
fs.cos.client.request.timeout | 500000 | amount of time to wait (in ms) for a request to complete before giving up and timing out
fs.cos.connection.ssl.enabled | true | enables or disables SSL connections to COS
fs.cos.proxy.host | | hostname of the (optional) proxy server for COS connections
fs.cos.proxy.port | | proxy server port. If this property is not set but fs.cos.proxy.host is, port 80 or 443 is assumed (consistent with the value of fs.cos.connection.ssl.enabled)
fs.cos.proxy.username | | username for authenticating with the proxy server
fs.cos.proxy.password | | password for authenticating with the proxy server
fs.cos.proxy.domain | | domain for authenticating with the proxy server
fs.cos.user.agent.prefix | | user agent prefix
fs.cos.flat.list | false | in flat listing the result includes all objects under a specific path prefix, for example bucket/a/b/data.txt and bucket/a/d.data. If bucket/a* is listed, the result will include both objects. If flat list is set to false, listing behaves the same as the community s3a connector.
fs.stocator.cache.size | 2000 | the Guava cache size used by the COS connector
fs.cos.multipart.size | 8388608 | defines the multipart size, in bytes
fs.cos.multipart.threshold | Max Integer | minimum size in bytes before we start multipart uploads; the default is max integer
fs.cos.fast.upload | false | enable or disable block upload
fs.stocator.glob.bracket.support | false | if true, supports Hadoop string patterns of the form {ab,c{de, fh}}. Due to a possible collision with object names, this mode prevents creating an object whose name contains {}
fs.cos.atomic.write | false | enable or disable atomic write to COS using conditional requests. When the flag is set to true and the operation is create with overwrite == false, a conditional header will be used to handle race conditions for multiple writers. If the path gets created between fs.create and stream.close by an external writer, the close operation will fail and the object will not be written
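
For example, a minimal sketch enabling atomic write in core-site.xml:

<property>
	<name>fs.cos.atomic.write</name>
	<value>true</value>
</property>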

Stocator and Object Storage based on OpenStack Swift API

Stocator allows access to object stores based on the OpenStack Swift API via the unique schema swift2d://.

  • Uses streaming for object uploads, without knowing the object size. This is unique to the Swift connector and removes the need to store the object locally prior to uploading it.

  • Supports Swiftauth, Keystone V2, Keystone V3 Password Scope Authentication

  • Supports any object store that exposes the Swift API, with support for different authentication models

  • Supports access to public containers. For example

      sc.textFile("swift2d://dal05.objectstorage.softlayer.net/v1/AUTH_ID/CONT/data.csv")
    

Configure Stocator in the core-site.xml

Add the dependency on Stocator in conf/core-site.xml

<property>
	<name>fs.stocator.scheme.list</name>
	<value>swift2d</value>
</property>

If the Swift connector is used concurrently with the COS connector, then also configure

<property>
	<name>fs.stocator.scheme.list</name>
	<value>swift2d,cos</value>
</property>

Configure the rest of the keys

<property>
	<name>fs.swift2d.impl</name>
	<value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<property>
	<name>fs.stocator.swift2d.impl</name>
	<value>com.ibm.stocator.fs.swift.SwiftAPIClient</value>
</property>
<property>
	<name>fs.stocator.swift2d.scheme</name>
	<value>swift2d</value>
</property>

Swift connector configuration

The following is the list of the configuration keys

Key | Info | Default value
fs.swift2d.service.SERVICE_NAME.auth.url | Mandatory |
fs.swift2d.service.SERVICE_NAME.public | Optional. Values: true, false | false
fs.swift2d.service.SERVICE_NAME.tenant | Mandatory |
fs.swift2d.service.SERVICE_NAME.password | Mandatory |
fs.swift2d.service.SERVICE_NAME.username | Mandatory |
fs.swift2d.service.SERVICE_NAME.block.size | Block size in MB | 128MB
fs.swift2d.service.SERVICE_NAME.region | Mandatory for Keystone |
fs.swift2d.service.SERVICE_NAME.auth.method | Optional. Values: keystone, swiftauth, keystoneV3 | keystoneV3
fs.swift2d.service.SERVICE_NAME.nonstreaming.upload | Optional. If set to true, any object upload will be stored locally in a temp file and uploaded on the close method (disables Stocator streaming mode) | false

Example of core-site.xml keys

Keystone V2
<property>
   <name>fs.swift2d.service.SERVICE_NAME.auth.url</name>
	<value>http://IP:PORT/v2.0/tokens</value>
</property>
<property>
   <name>fs.swift2d.service.SERVICE_NAME.public</name>
	<value>true</value>
</property>
<property>
   <name>fs.swift2d.service.SERVICE_NAME.tenant</name>
	<value>TENANT</value>
</property>
<property>
   <name>fs.swift2d.service.SERVICE_NAME.password</name>
	<value>PASSWORD</value>
</property>
<property>
   <name>fs.swift2d.service.SERVICE_NAME.username</name>
	<value>USERNAME</value>
</property>
<property>
   <name>fs.swift2d.service.SERVICE_NAME.auth.method</name>
   <!-- swiftauth if needed -->
	<value>keystone</value>
</property>

Keystone V3 mapping to keys

Driver configuration key | Keystone V3 key
fs.swift2d.service.SERVICE_NAME.username | user id
fs.swift2d.service.SERVICE_NAME.tenant | project id
<property>
    <name>fs.swift2d.service.SERVICE_NAME.auth.url</name>
    <value>https://identity.open.softlayer.com/v3/auth/tokens</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.public</name>
    <value>true</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.tenant</name>
    <value>PROJECTID</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.password</name>
    <value>PASSWORD</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.username</name>
    <value>USERID</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.auth.method</name>
    <value>keystoneV3</value>
</property>
<property>
    <name>fs.swift2d.service.SERVICE_NAME.region</name>
    <value>dallas</value>
</property>

Swift connector optional configuration

Below is the optional configuration that can be provided to Stocator and used internally to configure HttpClient.

Configuration key | Default | Info
fs.stocator.MaxPerRoute | 25 | maximal connections per IP route
fs.stocator.MaxTotal | 50 | maximal concurrent connections
fs.stocator.SoTimeout | 1000 | low level socket timeout in milliseconds
fs.stocator.executionCount | 100 | number of retries for certain HTTP issues
fs.stocator.ReqConnectTimeout | 5000 | request level connect timeout; determines the timeout in milliseconds until a connection is established
fs.stocator.ReqConnectionRequestTimeout | 5000 | request level connection timeout; the timeout in milliseconds used when requesting a connection from the connection manager
fs.stocator.ReqSocketTimeout | 5000 | defines the socket timeout (SO_TIMEOUT) in milliseconds, which is the timeout for waiting for data or, put differently, the maximum period of inactivity between two consecutive data packets
fs.stocator.joss.synchronize.time | false | will disable JOSS time synchronization with the server. Setting this value to 'false' will badly impact temp URL support; however, this reduces HEAD requests on the account, which might be problematic if the user doesn't have access rights to HEAD an account
fs.stocator.tls.version | false | if not provided, the system default is chosen. In certain cases the user may set a custom value, like TLSv1.2

Configure Stocator's schemas (optional)

By default Stocator exposes swift2d:// for the Swift API and cos:// for IBM Cloud Object Storage. It is possible to configure Stocator to expose different schemas.

The following example configures Stocator to respond to swift:// in addition to swift2d://. This is useful so users don't need to modify existing jobs that already use the hadoop-openstack connector. Below is an example of how to configure Stocator to respond to both swift:// and swift2d://

<property>
	<name>fs.stocator.scheme.list</name>
	<value>swift2d,swift</value>
</property>
<!-- configure stocator as swift2d:// -->
<property>
	<name>fs.swift2d.impl</name>
	<value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<property>
	<name>fs.stocator.swift2d.impl</name>
	<value>com.ibm.stocator.fs.swift.SwiftAPIClient</value>
</property>
<property>
	<name>fs.stocator.swift2d.scheme</name>
	<value>swift2d</value>
</property>
<!-- configure stocator as swift:// -->
<property>
	<name>fs.swift.impl</name>
	<value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<property>
	<name>fs.stocator.swift.impl</name>
	<value>com.ibm.stocator.fs.swift.SwiftAPIClient</value>
</property>
<property>
	<name>fs.stocator.swift.scheme</name>
	<value>swift</value>
</property>

Examples

Persist results in IBM Cloud Object Storage

val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
val distData = sc.parallelize(data)
distData.saveAsTextFile("cos://mybucket.service/one1.txt")

Listing bucket mybucket directly with a REST client will display

one1.txt
one1.txt/_SUCCESS
one1.txt/part-00000-taskid
one1.txt/part-00001-taskid
one1.txt/part-00002-taskid
one1.txt/part-00003-taskid
one1.txt/part-00004-taskid
one1.txt/part-00005-taskid
one1.txt/part-00006-taskid
one1.txt/part-00007-taskid
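
Even though the data is stored as multiple part objects, Spark reads it back through the single logical name; a minimal sketch:

val readBack = sc.textFile("cos://mybucket.service/one1.txt")
readBack.collect().foreach(println)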

Using dataframes

val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value","square")
squaresDF.write.format("parquet").save("cos://mybucket.service/data.parquet")
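
Reading the DataFrame back uses the same path (a minimal sketch based on the example above):

val squaresBack = spark.read.parquet("cos://mybucket.service/data.parquet")
squaresBack.show()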

Running Terasort

Follow

https://github.com/ehiggs/spark-terasort

Set up Stocator with COS as previously explained. In the example we use the bucket teradata and the service name service.

Step 1:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Step 2:

./bin/spark-submit --driver-memory 2g --class com.github.ehiggs.spark.terasort.TeraGen /spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g cos://teradata.service/terasort_in

Step 3:

./bin/spark-submit --driver-memory 2g --class com.github.ehiggs.spark.terasort.TeraSort /target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g cos://teradata.service/terasort_in cos://teradata.service/terasort_out

Step 4:

./bin/spark-submit --driver-memory 2g --class com.github.ehiggs.spark.terasort.TeraValidate /target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar cos://teradata.service/terasort_out cos://teradata.service/terasort_validate

Functional tests

Copy

src/test/resources/core-site.xml.template to src/test/resources/core-site.xml

Edit src/test/resources/core-site.xml and configure Swift access details. Functional tests will use the container from fs.swift2d.test.uri. To use a different container, change drivertest to a different name. The container need not exist in advance and will be created automatically.
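
For example, a hypothetical fs.swift2d.test.uri entry (assuming the swift2d://<container>.<service> URI form used elsewhere in this document; drivertest is the default container name mentioned above and SERVICE_NAME is a placeholder):

<property>
	<name>fs.swift2d.test.uri</name>
	<value>swift2d://drivertest.SERVICE_NAME</value>
</property>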

How to develop code

If you would like to work on the code, you can easily set up an Eclipse project via

mvn eclipse:eclipse

and import it into Eclipse workspace.

To ease the debugging process, please modify conf/log4j.properties to

log4j.logger.com.ibm.stocator=ALL

Before you submit your pull request

We ask that you include a line similar to the following as part of your pull request comments:

“DCO 1.1 Signed-off-by: Random J Developer”.

“DCO” stands for “Developer Certificate of Origin” and refers to the same text used in the Linux Kernel community. By adding this simple comment, you tell the community that you wrote the code you are contributing, or you have the right to pass on the code that you are contributing.

Need more information?

Stocator Mailing list

Join the Stocator mailing list by sending an email to [email protected]. Use [email protected] to post questions.

Additional resources

Please follow our wiki for more details. More information about Stocator can be found at

This research was supported by IOStack, an H2020 project of the EU Commission


stocator's Issues

ObjectStoreGlobber.glob(path) is returning null. Seems that globStatus(path) is not handling the wildcard * properly

Create an object with the partition structure, something like
container/data/year=2015/month=11/data1.txt
container/data/year=2016/month=12/data2.txt
container/data/year=2015/month=11/data3.txt
container/data/year=2015/month=11/data4.txt

And now access all objects of the form
container/data/year=*/month=*

Stocator will return wrong result.

We also need to make sure that null is not returned, but empty collection. When ObjectStoreGlobber.glob(path) does not find any matches it should return an empty array instead of null

Improved error handling

When an error is thrown by stocator, it is very difficult to find the cause of the error. For example:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException
    at org.javaswift.joss.client.impl.ClientImpl.createAccount(ClientImpl.java:100)
    at org.javaswift.joss.client.impl.ClientImpl.createAccount(ClientImpl.java:27)
        ...

Stocator is quite difficult to work with because the error messages generally don't help you identify the root cause of the problem.

Unit tests should use createFile(..) instead of mkdirs()

Stocator doesn't implement mkdirs(). Some unit tests call mkdirs(), which does nothing.
We need to make sure that unit tests call createFile(..) and not mkdirs().
We can implement a new createFile(..) that creates small files of a few bytes.

Improve Stocator init method

Consider Spark with many workers. When a Spark job is submitted, each worker will try to initialize Stocator. This can be easily seen: just make Spark access an object with many parts, or access a container with multiple objects, and then look at the logs of the tasks. You will notice that each task on the worker node performs Stocator init.

Stocator's init method is costly, since it contains token authentication and configuration parsing. This means that each Spark task on each worker will perform the same Stocator init, which affects performance.

We should improve Stocator's init method and make it per worker JVM and not per task. Perhaps using a Singleton approach will do the trick, since it is per JVM. This way the init will happen per JVM and not every time per task. Token creation and configuration parsing should be in the Singleton.

Creating container from 2 threads at once fails

I have two threads trying to upload an object to the same container (which doesn't exist).
One thread creates the container and the second one fails and throws an exception:

Caused by: Command exception, HTTP Status code: 202 => ENTITY_ALREADY_EXISTS
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:86)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:58)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:542)
	at org.javaswift.joss.exception.HttpStatusToExceptionMapper.getException(HttpStatusToExceptionMapper.java:48)
	at org.javaswift.joss.exception.HttpStatusExceptionUtil.getException(HttpStatusExceptionUtil.java:16)
	at org.javaswift.joss.exception.HttpStatusExceptionUtil.throwException(HttpStatusExceptionUtil.java:10)
	at org.javaswift.joss.command.impl.core.httpstatus.HttpStatusChecker.isOk(HttpStatusChecker.java:30)
	at org.javaswift.joss.command.impl.core.httpstatus.HttpStatusChecker.verifyCode(HttpStatusChecker.java:41)
	at org.javaswift.joss.command.impl.core.AbstractCommand.call(AbstractCommand.java:50)
	at org.javaswift.joss.command.impl.core.AbstractSecureCommand.call(AbstractSecureCommand.java:31)
	at org.javaswift.joss.client.core.AbstractContainer.create(AbstractContainer.java:221)
	at com.ibm.stocator.fs.swift.SwiftAPIClient.initiate(SwiftAPIClient.java:272)
	at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:124)
	at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:90)

Important point: fs.swift2d.impl.disable.cache is set to true, which makes init run every time; this might be related.

Create method should generate objects with correct content-type

create(...) method in ObjectStoreFileSystem uses wrong content-type for generated objects:

FSDataOutputStream outStream = storageClient.createObject(objNameModified, "binary/octet-stream", metadata, statistics);

This should somehow be resolved to the correct content-type. It would be good to know how other Hadoop drivers resolve content-types. We need to decide whether we should have the correct content-type or leave "binary/octet-stream" as is

Need unit test coverage

Current unit tests are more like functional tests that use a real Swift cluster.
We need regular unit tests

Setting timeout in JOSS

We need to have an ability to setup timeout for the requests when we use JOSS.
It seems JOSS uses keep-alive or too large timeout. We need to provide additional configuration parameter and use it to setup HTTP timeout for JOSS.

Swift tenant configuration on a per job basis

For my use case I would like to have the ability to specify the Swift tenant credentials on a per tenant basis. Currently this is configured through xml files. Is there any way to dynamically specify the tenant credentials through spark context or similar?

Container listing

When trying to retrieve a strict subset of contained objects by using a wildcard, for example, x*.csv, stocator returned all the objects of the specified container.

Authentication token expiration in the Swift driver

I never tested what happens if authentication token expires.
For example, Spark saves 1000 tasks, each task persist it's relevant data - part-XXXX.

What happens if token expired in the middle?
The same for read operations.

Some functions in the code doesn't contain "Retry authentication", for example createObject in SwiftAPIClient. There should be coverage for retry in case of authentication token expire.

Also, need to test how JOSS behaves, when token expire ( JOSS suppose to cover it )

Improving cleanup process in the unitests

Many unit tests create temp data.
The way it works today is that each @Test creates its data and then
@After public void tearDown()
is responsible for cleaning the generated data.

This is highly inefficient since it happens for each test and not per test suite. We need to change the logic so that temp data is created only once in the setup() method and cleaned at the end of the class run:
@AfterClass public static void classTearDown() throws Exception { }
A good reference is TestSwiftFileSystemLsOperations. It is supposed to create temp data only once and then clean it after all tests complete their executions. The way it works today is that data is created and cleaned per function.

create object doesn't work in SSL enabled mode

The create method uses the wrong HTTP package, which doesn't natively support SSL mode.

We should use org.apache.commons.httpclient.methods.PutMethod instead of the current implementation. The package org.apache.commons.httpclient.methods.PutMethod internally supports SSL mode. This will make SSL mode work as expected when creating new objects.

Current code will throw

bicluster#7|Traceback (most recent call last):
bicluster#7| File "/home/biadmin/test-1461009081582/exporttoswift.py", line 73, in
bicluster#7| counts.saveAsTextFile(swift_file_url)
bicluster#7| File "/usr/iop/4.1.0.0/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1508, in saveAsTextFile
bicluster#7| File "/usr/iop/4.1.0.0/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in call
bicluster#7| File "/usr/iop/4.1.0.0/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
bicluster#7| File "/usr/iop/4.1.0.0/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
bicluster#7|py4j.protocol.Py4JJavaError: An error occurred while calling o86.saveAsTextFile.
bicluster#7|: java.lang.NullPointerException
bicluster#7| at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:257)
bicluster#7| at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:302)
bicluster#7| at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:524)
bicluster#7| at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
bicluster#7| at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
bicluster#7| at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
bicluster#7| at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
bicluster#7| at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:126)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1089)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
bicluster#7| at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
bicluster#7| at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
bicluster#7| at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
bicluster#7| at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)
bicluster#7| at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1426)
bicluster#7| at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1405)
bicluster#7| at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1405)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
bicluster#7| at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
bicluster#7| at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
bicluster#7| at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1405)
bicluster#7| at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:522)
bicluster#7| at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:47)
bicluster#7| at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
bicluster#7| at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
bicluster#7| at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
bicluster#7| at java.lang.reflect.Method.invoke(Method.java:497)
bicluster#7| at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
bicluster#7| at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
bicluster#7| at py4j.Gateway.invoke(Gateway.java:259)
bicluster#7| at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
bicluster#7| at py4j.commands.CallCommand.execute(CallCommand.java:79)
bicluster#7| at py4j.GatewayConnection.run(GatewayConnection.java:207)
bicluster#7| at java.lang.Thread.run(Thread.java:745)

ObjectStoreGlobber is not consistent with Globber

We need to make sure that ObjectStoreGlobber returns the same results as Globber.
We noticed that sometimes hadoop fs -ls returns different results when comparing the two.
ObjectStoreGlobber still needs to preserve the single GET request; however, its response should match Globber's

How to enforce _SUCCESS file creation?

Hadoop contains the configuration key "mapreduce.fileoutputcommitter.marksuccessfuljobs".
If the value is set to "true" or the value is not set, then each successful job will create a _SUCCESS file.
Stocator relies on this file for certain flows, and it needs to verify that mapreduce.fileoutputcommitter.marksuccessfuljobs is not set to false (the default is true, I think)

stocator hangs while trying to upload an object

I'm using stocator as a replacement for hadoop-openstack when using the Hadoop library (not in Spark).
While uploading an object, the app got stuck on:

Java callstack:
  at java/lang/Object.wait(Native Method)
  at java/lang/Object.wait(Object.java:201(Compiled Code))
  at java/io/PipedInputStream.awaitSpace(PipedInputStream.java:286(Compiled Code))
  at java/io/PipedInputStream.receive(PipedInputStream.java:244(Compiled Code))
    (entered lock: java/io/PipedInputStream@0x00000000E1212D08, entry count: 1)
  at java/io/PipedOutputStream.write(PipedOutputStream.java:161(Compiled Code))
  at com/ibm/stocator/fs/swift/SwiftOutputStream.write(SwiftOutputStream.java:165(Com
  at org/apache/hadoop/fs/FSDataOutputStream$PositionCache.write(FSDataOutputStream.ja
  at java/io/DataOutputStream.write(DataOutputStream.java:119(Compiled Code))

This is probably due to network issues (usually it does not happen), but I would have expected to get an exception.

Use HttpClient in SwiftOutputStream

We have to use HttpClient (version 4 and up) for the streaming upload.
Currently SwiftOutputStream uses HttpURLConnection to obtain the output stream that is used for streaming. We should implement the same based on CloseableHttpClient, similar to what we do in SwiftAPIDirect.

Running the same SQL a couple of times may throw java.lang.IndexOutOfBoundsException

I ran some SQL on a Parquet file.
The first run was OK. I then re-ran the query and got

16/05/26 08:45:04 DEBUG SwiftInputStream: negative seek
16/05/26 08:45:04 DEBUG SwiftInputStream: Reading 6619446 bytes starting at 6553910
16/05/26 08:45:04 WARN TaskSetManager: Lost task 5.0 in stage 9.0 (TID 472, localhost): java.lang.IndexOutOfBoundsException
at sun.security.ssl.AppInputStream.read(AppInputStream.java:90)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
at org.apache.http.impl.io.SocketInputBuffer.isDataAvailable(SocketInputBuffer.java:95)
at org.apache.http.impl.AbstractHttpClientConnection.isStale(AbstractHttpClientConnection.java:310)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.isStale(ManagedClientConnectionImpl.java:163)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:434)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.javaswift.joss.command.impl.core.AbstractCommand.call(AbstractCommand.java:49)
at org.javaswift.joss.command.impl.factory.AccountCommandFactoryImpl.authenticate(AccountCommandFactoryImpl.java:66)
at org.javaswift.joss.client.core.AbstractAccount.authenticate(AbstractAccount.java:171)
at com.ibm.stocator.fs.swift.SwiftInputStream.loadIntoBuffer(SwiftInputStream.java:259)
at com.ibm.stocator.fs.swift.SwiftInputStream.seek(SwiftInputStream.java:246)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:429)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at org.apache.spark.sql.execution.datasources.parquet.DefaultSource$$anonfun$buildReader$1.apply(ParquetRelation.scala:349)
at org.apache.spark.sql.execution.datasources.parquet.DefaultSource$$anonfun$buildReader$1.apply(ParquetRelation.scala:328)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:359)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:359)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Issues with Hadoop dependency on HttpClient

Stocator depends on httpcomponents.httpcore.version 4.4.5 and httpcomponents.httpclient.version 4.5.2. However, current Hadoop releases still depend on older versions
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.7.2:provided
[INFO] | +- net.java.dev.jets3t:jets3t:jar:0.9.0:provided
[INFO] | | - (org.apache.httpcomponents:httpclient:jar:4.1.2:provided - omitted for conflict with 4.2.1)
[INFO] | - org.apache.hadoop:hadoop-auth:jar:2.7.2:provided
[INFO] | - (org.apache.httpcomponents:httpclient:jar:4.2.5:provided - omitted for conflict with 4.1.2)

The issues with HttpClient versions do not affect Stocator only; they were addressed in [https://issues.apache.org/jira/browse/HADOOP-12767]. Unfortunately [https://issues.apache.org/jira/browse/HADOOP-12767] was not merged into 2.7.3 or 2.7.2, but only into Hadoop 2.8

Stocator throws error on count operation when reading .root file with 537 columns and 230 rows

val dfData1 = spark.
read.format("org.dianahep.sparkroot").
option("tree", "Events").
option("inferSchema", "true").
load("swift://testobjectstorage." + name + "/test.root")
dfData1.count()

The reference example
https://github.com/diana-hep/spark-root/blob/master/ipynb/publicCMSMuonia_exampleAnalysis_wROOT.ipynb

Printschema works fine but the count operation doesn't work.

Note that the number of columns is 537.
No error when reading from the local file system.
Error for count when reading from Object Storage.
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 1 in stage 6.0 failed 10 times, most recent failure: Lost task 1.9 in stage 6.0 (TID 35, yp-spark-dal09-env5-0021): com.ibm.stocator.fs.common.exception.ConfigurationParseException: Configuration parse exception: Missing mandatory configuration: .auth.url
at com.ibm.stocator.fs.common.Utils.updateProperty(Utils.java:183)
at com.ibm.stocator.fs.swift.ConfigurationHandler.initialize(ConfigurationHandler.java:80)
at com.ibm.stocator.fs.swift.SwiftAPIClient.initiate(SwiftAPIClient.java:179)
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:124)
at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.dianahep.root4j.RootFileReader.(RootFileReader.java:181)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:121)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:119)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Driver stacktrace:
StackTrace: at com.ibm.stocator.fs.common.Utils.updateProperty(Utils.java:183)
at com.ibm.stocator.fs.swift.ConfigurationHandler.initialize(ConfigurationHandler.java:80)
at com.ibm.stocator.fs.swift.SwiftAPIClient.initiate(SwiftAPIClient.java:179)
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:124)
at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.dianahep.root4j.RootFileReader.(RootFileReader.java:181)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:121)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:119)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1461)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1449)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1448)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1448)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:812)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:812)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:812)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1674)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1629)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at java.lang.Thread.getStackTrace(Thread.java:1117)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1887)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1900)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:932)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:378)
at org.apache.spark.rdd.RDD.collect(RDD.scala:931)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226)
... 48 elided
Caused by: com.ibm.stocator.fs.common.exception.ConfigurationParseException: Configuration parse exception: Missing mandatory configuration: .auth.url
at com.ibm.stocator.fs.common.Utils.updateProperty(Utils.java:183)
at com.ibm.stocator.fs.swift.ConfigurationHandler.initialize(ConfigurationHandler.java:80)
at com.ibm.stocator.fs.swift.SwiftAPIClient.initiate(SwiftAPIClient.java:179)
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:124)
at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.dianahep.root4j.RootFileReader.(RootFileReader.java:181)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:121)
at org.dianahep.sparkroot.package$RootTableScan$$anonfun$2.apply(sparkroot.scala:119)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

SwiftOutputStream doesn't throw exceptions

Assume the user doesn't have write access to the container and uploads a small object.
In this case the write thread activates on close() and the exception is not thrown to the caller class at all.
We need to modify the exception handler in SwiftOutputStream so that the exception is thrown from the thread to the caller method.

Example of the code to reproduce

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WritePermissionTest {

    private static Configuration mConf = new Configuration(true);

    public static void main(String[] args) {
        mConf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem");
        mConf.set("fs.swift2d.service.srv.auth.url", "https://identity.open.softlayer.com/v3/auth/tokens");
        mConf.set("fs.swift2d.service.srv.public", "true");
        mConf.set("fs.swift2d.service.srv.tenant", "PROJECT ID");
        mConf.set("fs.swift2d.service.srv.password", "<PASSWORD>");
        mConf.set("fs.swift2d.service.srv.username", "<USER ID>");
        mConf.set("fs.swift2d.service.srv.region", "<REGION>");

        FileSystem fs = null;
        String p = "swift2d://container.srv/a1";
        try {
            // Write a small object; the missing write permission is expected to surface as an exception here
            fs = FileSystem.get(URI.create(p), mConf);
            FSDataOutputStream out = fs.create(new Path(p));
            out.write("abcdefgh".getBytes());
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (RuntimeException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

FileSystem.getChecksum should return the checksum and not null

The issue comes up when using Stocator via the FileSystem API directly.
For example, the following Java code uploads an object

FileSystem fs = null;
String p = "swift2d://container.bmtest1/some.data";
try {
    fs = FileSystem.get(URI.create(p), mConf);
} catch (IOException e) {
    e.printStackTrace();
}
try {
    FSDataOutputStream out = fs.create(new Path(p));
    out.write("abcdefgh".getBytes());
    out.close();
} catch (Exception e) {
    e.printStackTrace();
}

A user might want to verify that the object was created successfully and call FileSystem.getChecksum.
However, the current code returns null, which looks like a bug.
We need to return the checksum.
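
A minimal sketch of the expected usage, continuing the snippet above (the Hadoop FileSystem API exposes this as getFileChecksum; hypothetical code, not from the project):

// Hypothetical follow-up to the upload above (requires org.apache.hadoop.fs.FileChecksum).
// Today the call returns null for Stocator, which is the behaviour this issue asks to fix.
try {
    FileChecksum checksum = fs.getFileChecksum(new Path(p));
    if (checksum == null) {
        System.out.println("getFileChecksum returned null");
    } else {
        System.out.println("checksum: " + checksum);
    }
} catch (IOException e) {
    e.printStackTrace();
}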

rename support for sub-directories

The current rename implementation works on objects only.
We should extend it to support rename on sub-directories as well, for example:

container/a/b/c/data_1.txt
container/a/b/c/data_2.txt
container/a/b/c/data_3.txt
container/a/b/c/data_4.txt

And now we want to rename container/a/b/c to container/a/b/d. The implementation will need to handle

rename container/a/b/c/data_1.txt to container/a/b/d/data_1.txt
rename container/a/b/c/data_2.txt to container/a/b/d/data_2.txt
rename container/a/b/c/data_3.txt to container/a/b/d/data_3.txt
rename container/a/b/c/data_4.txt to container/a/b/d/data_4.txt
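
A minimal sketch of that idea, assuming the connector can list the objects under the source prefix and rename them one by one (a hypothetical helper, not the actual Stocator implementation):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenamePrefix {
    // Hypothetical sketch: rename every object under the source "directory" prefix.
    static void renamePrefix(FileSystem fs, Path src, Path dst) throws IOException {
        for (FileStatus status : fs.listStatus(src)) {
            // container/a/b/c/data_1.txt -> container/a/b/d/data_1.txt, and so on
            fs.rename(status.getPath(), new Path(dst, status.getPath().getName()));
        }
    }
}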

Support compression

Using the default swift connector, compression works out of the box. Compressing the data helps us achieve significant performance gains, especially in our case where the files saved by one Spark job are processed by another job. We generate around 4-6 GB of data per hour; compression reduces it to around 800 MB.

If we can support something like this, it would be great.

import org.apache.hadoop.io.compress.GzipCodec

val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
val distData = sc.parallelize(data)
distData.saveAsTextFile("swift2d://logs.swauth/one1.txt", classOf[GzipCodec])

The 403 Forbidden error is ignored at PUT flow

The configuration:
I have two users defined in Keystone:
user1
user2

I have two containers; both users have read access to both containers, but write access is allowed to only one user per container:
user1 can write to container1 (but not to container2)
user2 can write to container2 (but not to container1)

If I try to write to a container with the wrong credentials, I get a 403 Forbidden error.

But when I try to upload a new object through Stocator, Stocator ignores the 403 Forbidden error and the upload looks as if it succeeded, although of course the object is not written to the container.

I think the problem is that Stocator uses 100-continue for chunked uploads, but does not check the returned status.
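
For illustration, a hedged sketch of the missing check, assuming the chunked PUT is issued with Apache HttpClient (which Stocator uses); the endpoint and object names below are placeholders, not the actual Stocator flow:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.InputStreamEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class PutStatusCheck {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        // Placeholder URL; the real request also carries the Swift auth token header.
        HttpPut put = new HttpPut("https://example.objectstorage.net/v1/AUTH_id/container1/myobject");
        // Length -1 means the entity is sent with chunked transfer encoding.
        put.setEntity(new InputStreamEntity(new ByteArrayInputStream("abcdefgh".getBytes()), -1));
        try (CloseableHttpResponse response = client.execute(put)) {
            int status = response.getStatusLine().getStatusCode();
            if (status == 403) {
                // This is the check the issue asks for: fail instead of silently ignoring 403.
                throw new IOException("403 Forbidden: no write access to the container");
            }
        }
    }
}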

The problem can be reproduced by setting up the configuration described above and running the following code, with user2 used for authentication:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext()
sqlContext = SQLContext(sc)

myList = [[1,'a'],[2,'b'],[3,'c'],[4,'d'],[5,'e'],[6,'f']]
parallelList = sc.parallelize(myList).collect()
schema = StructType([StructField('column1', IntegerType(), False),
StructField('column2', StringType(), False)])
df = sqlContext.createDataFrame(parallelList, schema)
dfTarget = df.coalesce(1)
dfTarget.write.parquet("swift2d://container1.spark/myobject")
print "Done!"

multiple partition object and create object with overwrite

Assume the object store contains:

container/d
container/d/part-001
container/d/part-002
container/_SUCCESS

We now call the create method to create the same object in overwrite mode.
Assume the new object "d" is smaller and has only one part.

The existing code does not delete the previous object; it only overwrites "d" and "d/part-001". In this scenario the old parts, and even the old _SUCCESS object, remain from the previous run.

Create in overwrite mode should delete the previous object before creating the new one. I assume calling the delete method will be enough (the change should be in ObjectStoreFileSystem).
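
A minimal sketch of the proposed behaviour (hypothetical helper, not the actual ObjectStoreFileSystem patch): remove all parts of the previous object before creating the new one.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOverwrite {
    // Hypothetical sketch: delete the old multi-part object (d, d/part-001, d/part-002,
    // _SUCCESS, ...) before creating "d" again in overwrite mode.
    static FSDataOutputStream createWithOverwrite(FileSystem fs, Path target) throws IOException {
        if (fs.exists(target)) {
            fs.delete(target, true);   // recursive delete removes all old parts
        }
        return fs.create(target, true);
    }
}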

Question: Swift URL for SoftLayer?

When saving a file to SoftLayer, the example in the README is this:

sc.textFile("swift2d://dal05.objectstorage.softlayer.net/v1/AUTH_ID/CONT/data.csv")

Where do I get the AUTH_ID from?

container auto create - turn on and off

If Spark persists an object in a container that does not exist, the current code automatically creates the container. To make this happen, Stocator's init method always checks whether the container exists and, if not, creates it. All of this happens in the init method of SwiftAPIClient.

In certain cases we would like to disable automatic container creation and skip the container existence check during init. This saves the cost of the HEAD Container operation performed by the init method.

We need to introduce a new property, "fs.stocator.dataroot.autocreate", that is true by default when not provided in the configuration. If it is provided with the value false, the init method should skip the container existence check and not create the container automatically.
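
With such a property in place, disabling auto-creation might look like this (the property name is taken from this issue; it is a proposal, not an existing option):

import org.apache.hadoop.conf.Configuration;

public class AutoCreateOff {
    public static void main(String[] args) {
        // Hypothetical usage of the proposed property: when false, the init method
        // would skip the HEAD Container check and not create the container.
        Configuration conf = new Configuration();
        conf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem");
        conf.setBoolean("fs.stocator.dataroot.autocreate", false);
    }
}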

Reference the new driver in the core-site.xml

The xml snippet shown references the class name for the new driver. However, it doesn't include the filesystem path where the class is located.

<property>
    <name>fs.swift2d.impl</name>
    <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>

Where should the driver location be specified (using Ambari)?

Thanks

How to use encryption with stocator?

When working with enterprise data, there is often a requirement to have encryption of data at rest.

How can Stocator support encryption at rest?

Testing various Hadoop versions

Currently the code has been tested with Hadoop 2.6.0 and 2.7.2.
It would be nice to know how it behaves with other Hadoop versions.

Delete swift files/containers with stocator?

I have saved RDDs to Swift. I now need to clean up some containers.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from operator import add

    sc = SparkContext()

    prefix = "fs.swift2d.service." + service_name

    sc._jsc.hadoopConfiguration().set("fs.swift2d.impl","com.ibm.stocator.fs.ObjectStoreFileSystem")

    sc._jsc.hadoopConfiguration().set(prefix + ".auth.url",     "https://identity.open.softlayer.com/v3/auth/tokens")
    sc._jsc.hadoopConfiguration().set(prefix + ".public",       "true")
    sc._jsc.hadoopConfiguration().set(prefix + ".tenant",       project_id)
    sc._jsc.hadoopConfiguration().set(prefix + ".password",     password)
    sc._jsc.hadoopConfiguration().set(prefix + ".username",     username)
    sc._jsc.hadoopConfiguration().set(prefix + ".auth.method", "keystoneV3")
    sc._jsc.hadoopConfiguration().set(prefix + ".region",      "dallas")

    sqlContext = SQLContext(sc)

    # read file from HDFS
    lines = sc.textFile(license_filename, 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add) \
                  .filter(lambda x: x[0].isalnum())

    # Create a SQL DataFrame from the counts RDD
    hhdf = sqlContext.createDataFrame(counts,['letter', 'count'])

    # destination url
    swift_file_url = "swift2d://{0}.{1}/counts".format(container, service_name)

    # save to swift
    counts.saveAsTextFile(swift_file_url)

    # some code omitted for brevity

    ### how can I delete the container at swift_file_url?

Is it possible to delete Swift files/containers with Stocator? If so, please provide an example.
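
Since Stocator implements the Hadoop FileSystem interface, one approach (a hedged sketch, not an official answer; the container and service names below are placeholders matching the script above) is to call FileSystem.delete on the same swift2d URL:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem");
        // ...plus the same fs.swift2d.service.<service_name>.* credentials used above...
        String url = "swift2d://container.service_name/counts";   // placeholder container/service
        FileSystem fs = FileSystem.get(URI.create(url), conf);
        fs.delete(new Path(url), true);   // recursive: removes the object and all its parts
    }
}

This removes the objects under the path; whether the container itself can be removed this way depends on the connector.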
