[Issue ported from https://github.com/bixolabs/cascading.avro/issues/4]
We have a situation with complex CoGroup joins and GroupBy operations where the input directory for one or more sources is sometimes empty. Before converting to Avro, we could point Cascading at an empty directory and simply get zero tuples back with no error. Now we get the error:
java.lang.IllegalStateException: scheme cannot be generated as no input file present!!
at com.icrossing.collection.cascading.avro.AvroTest.getAvroScheme(AvroTest.java:140)
at com.icrossing.collection.cascading.avro.AvroTest.testForSortOnString(AvroTest.java:92)
Cascading.Avro has all the information necessary to generate a Scheme, since we pass the fields and data types in at construction time. This is also not backward compatible with Cascading's behavior. I see two immediate alternatives to resolve this: either always create an output file with metadata whenever a flow runs, or create a virtual Scheme when no input file exists.
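As a rough sketch of the second alternative (a caller-side workaround, not a change to cascading.avro): since the fields and types are already known up front, the helper in the test case below could fall back to constructing an AvroScheme directly instead of throwing. The method name getAvroSchemeOrFallback and the expectedTypes parameter are hypothetical.
// Hypothetical fallback, not part of cascading.avro: when no input file
// exists, build a "virtual" scheme from the fields/types the caller already
// knows, so the flow still runs and simply reads zero tuples from this source.
private static Scheme getAvroSchemeOrFallback(Path filePath, Configuration conf,
        Fields selectorFieldList, Class[] expectedTypes)
    throws IOException
{
    FileSystem dfs = FileSystem.get(conf);
    if (dfs.isFile(filePath))
    {
        // Input file exists: derive the scheme from the embedded Avro schema,
        // reusing the readAvroFileToGetScheme() helper from the test case below.
        return readAvroFileToGetScheme(filePath, conf, selectorFieldList);
    }
    return new AvroScheme(selectorFieldList, expectedTypes);
}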
Test case:
package com.icrossing.collection.cascading.avro;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.log4j.Logger;
import org.junit.After;
import org.junit.Test;
import com.bixolabs.cascading.avro.AvroScheme;
import cascading.CascadingTestCase;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.cogroup.OuterJoin;
import cascading.scheme.Scheme;
import cascading.scheme.TextDelimited;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.tuple.TupleEntryCollector;
import cascading.tuple.TupleEntryIterator;
public class AvroTest
extends CascadingTestCase
{
private static final String DIR_NAME = "./testdata/";
private static final String SOURCE_FILE_EMP = DIR_NAME + "emp/part-00000.avro";
private static final String OUTPUT_DIR = DIR_NAME + "output";
private static final String SOURCE_FILE_MGMT = DIR_NAME + "mgmt/part-00000.avro";
private static final Logger LOG = Logger.getLogger(AvroTest.class);
private static final Fields EMP_FIELDS = new Fields("e_id", "e_salary");
private static final Fields MGMT_FIELDS = new Fields("m_id", "m_designation");
public void setUpData()
throws Exception
{
createDirectory(DIR_NAME);
createDirectory(DIR_NAME + "emp/");
createDirectory(DIR_NAME + "mgmt/");
Class[] empTypes = new Class[] { Integer.class, Long.class };
Tap empSink = new Lfs(new AvroScheme(EMP_FIELDS, empTypes), DIR_NAME + "emp/");
TupleEntryCollector out = empSink.openForWrite(new JobConf());
TupleEntry t = new TupleEntry(EMP_FIELDS, new Tuple(new Object[empTypes.length]));
t.set("e_id", 1);
t.set("e_salary", 320000L);
out.add(t);
out.close();
}
@After
public void tearDown()
throws Exception
{
deleteFile(SOURCE_FILE_EMP);
deleteDirectoryRecursively(DIR_NAME);
}
@Test
public void testForSortOnString()
throws Exception
{
setUpData();
/**
* get AVRO scheme from created avro file...
*/
Scheme empScheme =
getAvroScheme(new Path(SOURCE_FILE_EMP), new Configuration(), EMP_FIELDS);
/**
* avro file does not exist
*/
Scheme mgmtScheme =
getAvroScheme(new Path(SOURCE_FILE_MGMT), new Configuration(), MGMT_FIELDS);
Tap emp = new Lfs(empScheme, SOURCE_FILE_EMP);
Tap mgmt = new Lfs(mgmtScheme, SOURCE_FILE_MGMT);
Map<String, Tap> source = new HashMap<String, Tap>();
source.put("emp", emp);
source.put("mgmt", mgmt);
Tap sink = new Lfs(new TextDelimited(Fields.ALL, "\t"), OUTPUT_DIR, true);
Pipe empPipe = new Pipe("emp");
Pipe mgmtPipe = new Pipe("mgmt");
Pipe assembly =
new CoGroup(empPipe, new Fields("e_id"), mgmtPipe, new Fields("m_id"), new OuterJoin());
Flow flow = new FlowConnector().connect(source, sink, assembly);
flow.complete();
TupleEntryIterator tupleEntryIteratorInvalid = sink.openForRead(flow.getJobConf());
LOG.info("Result : ");
while (tupleEntryIteratorInvalid.hasNext())
{
LOG.info(tupleEntryIteratorInvalid.next().getTuple());
}
}
/**
* Derives the Avro Scheme from an existing .avro file.
*
* @param filePath
* @param conf
* @param selectorFieldList
* @return the generated Scheme, or null if the file could not be read
*/
private static Scheme getAvroScheme(Path filePath, Configuration conf, Fields selectorFieldList)
{
Scheme outputScheme = null;
boolean isFilePresent = false;
try
{
FileSystem dfs = FileSystem.get(conf);
if (dfs.isFile(filePath))
{
isFilePresent = true;
outputScheme = readAvroFileToGetScheme(filePath, conf, selectorFieldList);
}
if (!isFilePresent)
{
LOG.error("scheme cannot be generated as no input file present!!");
throw new IllegalStateException(
"scheme cannot be generated as no input file present!!");
}
}
catch (IOException e)
{
LOG.warn("Error retrieving avro part file", e);
}
return outputScheme;
}
/**
* Reads the Avro data file and builds an AvroScheme from its embedded schema.
*
* @param avroFilePath
* @param conf
* @param selectorFieldList
* @return the generated Scheme
* @throws IOException
*/
@SuppressWarnings("unchecked")
private static Scheme readAvroFileToGetScheme(Path avroFilePath, Configuration conf,
Fields selectorFieldList)
throws IOException
{
DataFileReader dataFileReader = null;
try
{
dataFileReader =
new DataFileReader(new FsInput(avroFilePath, conf),
new GenericDatumReader());
Schema schema = dataFileReader.getSchema();
Comparable fieldNames[] = new Comparable[selectorFieldList.size()];
Class types[] = new Class[selectorFieldList.size()];
for (int fieldCount = 0; fieldCount < schema.getFields().size(); fieldCount++)
{
Field field = schema.getFields().get(fieldCount);
for (int reqFieldCount = 0; reqFieldCount < selectorFieldList.size(); reqFieldCount++)
{
if (StringUtils.equals(selectorFieldList.get(reqFieldCount).toString(), field.name()))
{
fieldNames[reqFieldCount] = field.name();
// Assumes the field schema is a nullable union [null, type]; index 1 holds the actual type.
types[reqFieldCount] = getClass(field.schema().getTypes().get(1).getName());
}
}
}
Fields fields = new Fields(fieldNames);
return new AvroScheme(fields, types);
}
finally
{
if (dataFileReader != null)
{
dataFileReader.close();
}
}
}
/**
* Maps an Avro primitive type name to the corresponding Java class.
*
* @param dataType
* @return the Java class, or null for unsupported types
*/
private static Class getClass(String dataType)
{
Class type = null;
if (StringUtils.equals("int", dataType))
{
type = Integer.class;
}
else if (StringUtils.equals("long", dataType))
{
type = Long.class;
}
return type;
}
private boolean createDirectory(String dirName)
throws IOException
{
boolean isSuccessful = false;
File directory = new File(dirName);
isSuccessful = directory.mkdirs();
return (isSuccessful);
}
private boolean deleteFile(String fileName)
throws IOException
{
boolean isSuccessful = false;
File file = new File(fileName);
if (file.exists()) isSuccessful = file.delete();
return (isSuccessful);
}
private boolean deleteDirectoryRecursively(String dirName)
throws IOException
{
if (dirName == null)
{
return false;
}
File directory = new File(dirName);
if (directory.isDirectory())
{
String[] files = directory.list();
for (int i = 0; i < files.length; i++)
{
File child = new File(dirName + File.separator + files[i]);
if (child.isDirectory())
{
deleteDirectoryRecursively(dirName + File.separator + files[i]);
}
else
{
boolean success = deleteFile(dirName + File.separator + files[i]);
if (!success) return false;
}
}
}
return (directory.delete());
}
}