
tarfilesystem's Introduction

Tar FileSystem for Hadoop


Version: 2.0_beta

TAR is a widely used format for storing backup images, distributing large datasets, and so on. Many of these archives could serve directly as input to analytic jobs.

Apache Hadoop, as of now, is not TAR aware. That is, it cannot directly read a file inside a TAR, nor can it run MapReduce on those files. To run analytic jobs on a TAR, one must first copy it to local disk, un-TAR it, and copy the contents back to the Hadoop file system, or convert it to a sequence file or another Hadoop-aware format using a custom (Java) program. This procedure is time consuming, and the user ends up with two copies of the data.

With TarFileSystem for Hadoop, Hadoop can directly read files inside a TAR and run analytic jobs on those files; no conversion or extraction is required.

Building

Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at ./target/hadoop-tarfs-2.0_beta.jar

Distribution and Configuration

The TAR File System binary for Hadoop is distributed as a JAR library (hadoop-tarfs-*.jar). This JAR contains all the classes required to support TarFileSystem. Copy it to the HADOOP_HOME/lib directory (HDFS_HOME/lib for Hadoop 2.0), or add it to the HADOOP_CLASSPATH environment variable.

Next, expose the tar:// URI scheme to Hadoop by adding the following property to the HADOOP_CONF_DIR/core-site.xml file.

<property>
  <name>fs.tar.impl</name>
  <value>org.apache.hadoop.fs.tar.TarFileSystem</value>
</property>

Optional Configuration:

By default, TarFileSystem creates an .index file in the same directory as the TAR file. Index writing may fail if you do not have sufficient permission on that directory. In that case you may specify a temporary directory where you do have write permission and tell TarFileSystem to use it instead, by adding the following property to core-site.xml:

<property>
  <name>tarfs.tmp.dir</name>
  <value>/a/directory/with/write/permission</value>
</property>

Note that TarFileSystem still prefers the directory containing the TAR file for writing the .index file. Only if writing there fails does it fall back to tarfs.tmp.dir. If tarfs.tmp.dir is not specified, or writing to that directory also fails, it skips writing the .index file and logs a warning.
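For illustration only, the fallback order described above might look roughly like the sketch below. This is not the project's actual code; writeIndexTo is a hypothetical helper.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

/** Sketch of the index-write fallback described above; not the project's actual code. */
class IndexWriteFallbackSketch {
  void writeIndexWithFallback(Path tarPath, Configuration conf) {
    Path preferred = new Path(tarPath.getParent(), tarPath.getName() + ".index");
    try {
      writeIndexTo(preferred);                     // 1. try next to the TAR file
      return;
    } catch (IOException primaryFailure) {
      String tmpDir = conf.get("tarfs.tmp.dir");   // 2. optional fallback location
      if (tmpDir != null) {
        try {
          writeIndexTo(new Path(tmpDir, tarPath.getName() + ".index"));
          return;
        } catch (IOException ignored) {
          // fall through to the warning below
        }
      }
      // 3. give up: skip the index and warn; it is rebuilt in memory on next access
      System.err.println("WARN: could not persist .index file for " + tarPath);
    }
  }

  // Hypothetical helper; the real project has its own index writer.
  void writeIndexTo(Path indexPath) throws IOException { /* ... */ }
}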

Using TAR File System

Hadoop can access a TAR archive using the TAR URI scheme (a URI starting with tar://). The following examples show this.

The following is a TAR archive inside the Hadoop file system:

[jd@node1 ~]$ bin/hadoop fs -ls /tardemo/archive.tar ↲
Found 1 items
-rw-r--r--   1 jd supergroup    1751040 2013-07-15 20:30 /tardemo/archive.tar

To access files inside this TAR, simply prefix the path with tar:// to make it a TAR File System URI:

[jd@node1 ~]$ bin/hadoop fs -ls tar:///tardemo/archive.tar ↲
13/07/15 20:33:04 INFO tar.TarFileSystem: *** Using Tar file system ***
Found 3 items
-rw-rw-r--   1 jd jd     502760 2013-07-15 20:27 /tardemo/archive.tar+/data+file2.txt
-rw-rw-r--   1 jd jd     594933 2013-07-15 20:26 /tardemo/archive.tar+/data+file1.txt
-rw-rw-r--   1 jd jd     641720 2013-07-15 20:27 /tardemo/archive.tar+/data+file3.txt

To access a file inside a TAR archive, append the name of the file to the TAR URI using a ‘+’ sign. Sub-directory paths within a TAR archive are also delimited with ‘+’ signs. For example, if the file is at path dir1/dir2/file1.txt within the TAR archive, use the following path to read it:

[jd@node1 ~]$ bin/hadoop fs -cat tar://hdfs-localhost:54310/tardemo/archive.tar/+dir1+dir2+file1.txt ↲
13/07/15 20:38:35 INFO tar.TarFileSystem: *** Using Tar file system ***
This is the file content.
[...]
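For illustration, the ‘+’ convention above maps onto an ordinary in-archive path by replacing ‘+’ with ‘/’. The small sketch below is not the project's actual code; the class and method names are made up:

/** Illustrative only: translate the '+'-delimited suffix of a tar:// path
 *  into the slash-delimited path used inside the archive. */
class TarPathSketch {
  static String toInArchivePath(String tarUriPath) {
    // e.g. "/tardemo/archive.tar/+dir1+dir2+file1.txt" -> "/dir1/dir2/file1.txt"
    int split = tarUriPath.indexOf("/+");
    if (split < 0) {
      return "/";                                       // the archive root itself
    }
    String suffix = tarUriPath.substring(split + 1);    // "+dir1+dir2+file1.txt"
    return suffix.replace('+', '/');                    // "/dir1/dir2/file1.txt"
  }

  public static void main(String[] args) {
    System.out.println(toInArchivePath("/tardemo/archive.tar/+dir1+dir2+file1.txt"));
  }
}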

In TAR File System, the TAR archive is modeled as a directory, and the files inside the TAR are modeled as files within that directory. One can run MapReduce jobs on files within a TAR archive just as on normal files.

[jd@node1 ~]$ bin/hadoop jar hadoop*examples*.jar wordcount tar:///tardemo/archive.tar wc_out ↲ 
13/07/15 20:43:05 INFO tar.TarFileSystem: *** Using Tar file system ***
13/07/15 20:43:05 INFO input.FileInputFormat: Total input paths to process : 3
13/07/15 20:43:05 INFO mapred.JobClient: Running job: job_201307151954_0001
13/07/15 20:43:06 INFO mapred.JobClient:  map 0% reduce 0%
 [...]

TO DO

  1. Implement efficient seek in SeekableTarInputStream
  2. Support compressed TAR archives

tarfilesystem's People

Contributors

jdatta, munterkalmsteiner


tarfilesystem's Issues

Long file names in tar are truncated in the index

Hi,
Great effort!
We are trying to use your code and it seems the index file is not created correctly for long file names inside a tar. Changing the TAR index creation to use TarArchiveInputStream with:

while (null != (tarArchiveEntry = tarArchiveInputStream.getNextTarEntry())) ...

instead of using the byte array seems to fix the problem.
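For reference, here is a minimal sketch of index creation driven by Apache Commons Compress along the lines suggested above. This is not the project's actual code; it only illustrates that TarArchiveInputStream resolves long (GNU/PAX) names by itself, so nothing is truncated:

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class IndexBuilderSketch {
  public static void buildIndex(InputStream rawTarStream) throws IOException {
    try (TarArchiveInputStream tarIn = new TarArchiveInputStream(rawTarStream)) {
      TarArchiveEntry entry;
      while ((entry = tarIn.getNextTarEntry()) != null) {
        if (entry.isDirectory()) {
          continue;  // the current index is flat; directory entries are ignored
        }
        // getName() returns the full (possibly long) path, already reassembled
        // from GNU longname / PAX headers by the library.
        long dataOffset = tarIn.getBytesRead();  // stream position after the header,
                                                 // i.e. where this entry's data begins
        long length = entry.getSize();
        System.out.printf("%s offset=%d length=%d%n",
            entry.getName(), dataOffset, length);
      }
    }
  }
}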

"Not a valid DFS filename" error on Hadoop 2.6.0

After installing TarFileSystem as described in the readme, I try:

hadoop dfs -ls tar:///user/cloudera/file.tar

which produces the error:

INFO tar.TarFileSystem *** Using Tar file system ***
-ls: Pathname /tar:/user/cloudera/file.tar from /tar:/user/cloudera/file.tar is not a valid DFS filename

This is on Hadoop 2.6.0-cdh5.9.0 (Cloudera Distribution of Hadoop).

It looks as if the tar:// URL is getting incorrectly turned into a pathname?

ReadIndexFile does not work if file name contains spaces

The TarFileSystem index is stored as space-separated values in the parent file system for later reuse. When we try to read the index back, it fails if a file name contains spaces.

17/04/22 00:20:32 INFO tar.TarFileSystem: *** Using Tar file system ***
17/04/22 00:20:33 ERROR tar.TarIndex: Invalid Index File: /jd/systems.tar.index

We should either use a properly quoted CSV file or a structured format like JSON.

If we support a hierarchical index in the future to represent directories, a structured format like JSON seems to be the better choice.
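As a rough sketch of the JSON option (not the project's actual code, and assuming a Jackson dependency is available on the classpath), writing one JSON object per index line makes file names containing spaces round-trip safely; the field names below are illustrative:

import java.util.LinkedHashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

/** Illustrative only: one JSON object per index line, so any file name survives. */
class JsonIndexLineSketch {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    Map<String, Object> entry = new LinkedHashMap<>();
    entry.put("name", "dir with spaces/file 1.txt");   // spaces are no longer a problem
    entry.put("offset", 1024L);
    entry.put("length", 502760L);
    String line = mapper.writeValueAsString(entry);    // serialize one index line
    System.out.println(line);
    // {"name":"dir with spaces/file 1.txt","offset":1024,"length":502760}
    Map<?, ?> back = mapper.readValue(line, Map.class); // reading it back
    System.out.println(back.get("name"));
  }
}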

Write index to disk asynchronously

Once the index is prepared in memory, writing it to permanent storage for later reuse can be done asynchronously. This reduces latency.
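A minimal sketch of what the asynchronous write could look like, assuming a single background writer thread; persistAsync and writeIndexBytes are hypothetical names, not the project's API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch: hand the already-built in-memory index to a background
 *  thread so the caller does not wait for the write to finish. */
class AsyncIndexWriteSketch {
  private final ExecutorService writer =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "tarfs-index-writer");
        t.setDaemon(true);              // do not block JVM shutdown
        return t;
      });

  void persistAsync(byte[] serializedIndex, String targetPath) {
    writer.submit(() -> {
      try {
        writeIndexBytes(serializedIndex, targetPath);   // hypothetical helper
      } catch (Exception e) {
        // a failed background write only costs a rebuild next time
        System.err.println("WARN: async index write failed: " + e);
      }
    });
  }

  void writeIndexBytes(byte[] data, String path) throws Exception { /* ... */ }
}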

Support hierarchical directory structure in TarFileSystem

Summary

Currently, TarFileSystem does not support a hierarchical directory structure. It ignores all directory entries inside the TAR and presents all files in a flat hierarchy.

This issue tracks the effort required to support a hierarchical directory structure in TAR. It is the umbrella issue; sub-issues may be created later to track specific tasks.

Sub-tasks

To support a hierarchical directory structure (nested directories), we need to make the following changes to the TarFileSystem code.

Make the index data-structure hierarchical

Currently the index is maintained as a flat hashmap. To capture nested directories we need some kind of tree structure. The current plan is a simple n-way tree (each IndexEntry of type directory would contain a list of child index entries) along with a map for fast lookup of a specific node in the tree (Map<String, IndexEntry>); a rough sketch follows.
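A rough sketch of that structure (the field and class names are illustrative, not the project's current IndexEntry):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the n-way tree described above: directory entries hold
 *  children, and a flat map gives O(1) lookup by in-archive path. */
class HierarchicalIndexSketch {
  static class IndexEntry {
    final String name;          // path component, e.g. "dir2" or "file1.txt"
    final boolean directory;
    final long offset;          // offset of the tar header (files only)
    final long length;          // file length (files only)
    final List<IndexEntry> children = new ArrayList<>();  // empty for files

    IndexEntry(String name, boolean directory, long offset, long length) {
      this.name = name;
      this.directory = directory;
      this.offset = offset;
      this.length = length;
    }
  }

  final IndexEntry root = new IndexEntry("/", true, -1, -1);
  // Secondary map for fast point lookups, keyed by full in-archive path.
  final Map<String, IndexEntry> byPath = new HashMap<>();

  void add(String inArchivePath, IndexEntry entry, IndexEntry parent) {
    parent.children.add(entry);
    byPath.put(inArchivePath, entry);
  }
}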

Change serialization of index

The index is serialized to facilitate later reuse. The current format is space-separated values. In addition to the existing issue (it does not work if a file name contains spaces), this format would not easily accommodate a hierarchical index. The current plan is to use JSON. We need to check whether JSON still works when a TAR contains a huge number of entries, say 1 billion.
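For very large archives, a streaming writer avoids holding the whole JSON document in memory (reading it back would likewise need a streaming parser). The sketch below uses Jackson's streaming API purely as an illustration and is not the project's actual code:

import java.io.ByteArrayOutputStream;
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

/** Illustrative only: stream index entries out one by one so the whole JSON
 *  document never has to sit in memory, even for a very large archive. */
class StreamingIndexWriterSketch {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonGenerator gen = new JsonFactory().createGenerator(out, JsonEncoding.UTF8);
    gen.writeStartArray();
    // In a real writer this loop would run over the in-memory index.
    for (int i = 0; i < 3; i++) {
      gen.writeStartObject();
      gen.writeStringField("name", "dir/file" + i + ".txt");
      gen.writeNumberField("offset", 512L * i);
      gen.writeNumberField("length", 1024L);
      gen.writeEndObject();
    }
    gen.writeEndArray();
    gen.close();
    System.out.println(out.toString("UTF-8"));
  }
}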

Re-evaluate IndexEntry data structure

Currently IndexEntry only holds the length and offset of the tar header. When the archived file is accessed, we seek to the offset and read the header again. We need to evaluate whether we can store more information in the index itself so that this disk access is not required. At one extreme we store just the length and offset; at the other we keep the complete tar header in the index. We need to make sure this change does not cause memory pressure.

Note: the actual requirement is to ensure sequential reads when we do ls and ls -R. If all the children of a directory entry are stored sequentially in the tar, i.e. if the entries in the tar are sorted by path string, then performance would be no worse than it is now even if we store just the length and offset. For a recursive walk, we would need to ensure the calling code processes the directory hierarchy in the same order as the entries are stored inside the tar^. This restriction may turn out to be difficult to impose.
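As a sketch of the "store more in the index" end of that spectrum, an entry could carry everything a file status needs so the header never has to be re-read; all field names below are illustrative, not the project's current IndexEntry:

/** Illustrative sketch: an index entry that carries the metadata needed to answer
 *  getFileStatus() directly, so listing does not require re-reading tar headers. */
class RichIndexEntrySketch {
  final String name;        // full in-archive path
  final long dataOffset;    // where the file data starts in the tar
  final long length;        // file size in bytes
  final long modTime;       // modification time from the tar header
  final int mode;           // permission bits from the tar header
  final String owner;
  final String group;

  RichIndexEntrySketch(String name, long dataOffset, long length,
                       long modTime, int mode, String owner, String group) {
    this.name = name;
    this.dataOffset = dataOffset;
    this.length = length;
    this.modTime = modTime;
    this.mode = mode;
    this.owner = owner;
    this.group = group;
  }
  // Storing more per entry trades memory for fewer header reads; whether that is
  // acceptable depends on how many entries the archive contains.
}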

Issue in hadoop dfs -ls -R command

To recursively list all paths, -ls -R appends the name component of the child path to the parent path. Here the separator is hardcoded to / (ouch!).

Here is the corresponding code from the FsShell module:

// check getPath() so scheme slashes aren't considered part of the path
String basename = childPath.getName();
String separator = uri.getPath().endsWith(Path.SEPARATOR)
    ? "" : Path.SEPARATOR;
return uriToString(uri, inferredSchemeFromPath) + separator + basename;

Obviously this does not work with the current path format of TarFileSystem: for TarFileSystem, the separator for archived files is +, and the getName call returns the complete path of the archived file. This results in the following bug:

  • Parent abs path: tar:///file.tar/+dir1+dir2+
  • Child abs path: tar:///file.tar/+dir1+dir2+f1.txt
  • FsShell would form path as:
    tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt
  • While looking up the file inside the TAR, TarFileSystem would translate the in-archive path as /dir1/dir2///dir1/dir2/f1.txt, resulting in a FileNotFoundException

Footnotes

^ For example, say the TAR order is as follows:

a/b/
a/b/f1.txt
a/b/c/
a/b/c/f2.txt
a/f3.txt
a/f4.txt

Then the calling code that does the recursive walk needs to do a depth-first search (DFS); as we can see, a breadth-first search (BFS) would cause non-sequential reads.

StackOverflowError with a tar containing many files

I have a tar with ~160000 files which produces a StackOverflowError (see below). I am not sure where the problem is. Running the job on the 160000 files (not tarred) works fine.

Exception in thread "main" java.lang.StackOverflowError
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:721)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:656)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.fs.tar.TarFileSystem.readHeaderBuffer(TarFileSystem.java:191)
    at org.apache.hadoop.fs.tar.TarFileSystem.readHeaderEntry(TarFileSystem.java:179)
    at org.apache.hadoop.fs.tar.TarFileSystem.getFileStatus(TarFileSystem.java:288)
    at org.apache.hadoop.fs.tar.TarFileSystem.listStatus(TarFileSystem.java:208)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
    at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:1714)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1713)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1696)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPathRecursively(FileInputFormat.java:343)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPathRecursively(FileInputFormat.java:348)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPathRecursively(FileInputFormat.java:348)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPathRecursively(FileInputFormat.java:348)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPathRecursively(FileInputFormat.java:348)
