GithubHelp home page GithubHelp logo

tamargoadam / huffman-coding Goto Github PK

View Code? Open in Web Editor NEW
1.0 0.0 0.0 312 KB

Huffman Coding lossless compression and decompression for large text files

Java 98.29% Makefile 1.71%
java huffman-coding binary-heap 4-way-cached-heap pairing-heap

huffman-coding's Introduction

Huffman Coding

Encoding and decoding of large text files via Huffman coding lossless compression algorithm

Overview

How to run

  • From the src directory, run make to build the encoder and decoder programs
  • Command to run encoder: java encoder <input_file_name>
  • Command to run decoder: java decoder <encoded_file_name> <code_table_file_name>

Encoding

  • Reads an input file that is to be compressed and generates two output files
    • A compressed .bin file version of the input
    • A code table that maps characters in the origional file to their binary representation
  • Can use one of three types of priority queue to facilitate building code table
    • Binary heap
    • 4-way cache optimized heap
    • Pairing heap

Decoding

  • Builds a decode tree using the code table file as input
  • Generates a decoded text file from the input encoded.bin file using the decode tree

Program Structure

Encoding

  • huffEncodeTable(HuffmanTree huff)
    • Returns a hash table containing all characters encoded with the Huffman tree and their corresponding binary code Strings
  • characterCode(HuffBaseNode bNode, char c, String code)
    • Returns a String of 1’s and 0’s that corresponds to the character argument
    • Traverses tree recursively and adds 1 or 0 based on branch traversed
  • fileToHash(File inputFile)
    • Reads each line of the input file and increments the value corresponding to the read characters' keys in an instance -> frequency hash table.
    • If no key with the instance is present, a new entry with key = instance and value = 1 is created
  • buildTree(File inputFile)
    • Returns a Huffman tree corresponding to the input file
    • Creates a root node for each value and frequency read in and adds them to a priority queue
    • The two trees with the smallest frequency are removed from the heap and combined into a single Huffman tree whose new root has a weight of the sum of its two children
    • This repeats until only one tree remains
  • generateEncodedFile(File inputFile, HuffmanTree huffTree)
    • Reads each line of the input file and, utilizing the value to code hash table, translates the values corresponding code to binary that is written to the encoded.bin file
  • generateCodeTable(HuffmanTree huffTree)
    • Writes the key value pairs generated by huffEncodeTree() to the code_table.txt file

Binary Heap

  • Array implementation of binary heap
    • Child indexes for parent at index x are:
    • 2x
    • 2x+1
  • On removeMin(), check each child’s weight is less than the current node, if not, swap the positions of the parent node and the least child node

Four-way Heap

  • Array implementation of four-way heap
    • Root is held at index 3
    • Child indexes of parent at index x are:
    • 4(x-3)+4
    • 4(x-3)+5
    • 4(x-3)+6
    • 4(x-3)+7
  • Same checking and adjustments as binary heap necessary on removeMin() call, but must check all 4 children

Pairing Heap

  • Nodes hold child, left sibling, and right sibling pointers and a Huffman tree data item
  • Multi-pass scheme is used when melding after removeMin() call
    • All children of removed root node are added to a queue
    • First two heaps are removed from queue and melded, and result is added to queue
    • Repeats until one heap exists

Decoding

  • createCodeTable(File codes)
    • Creates a code -> value hash table based on the input file provided
  • decodeBinFile(File encoded)
    • Reads each byte of the binary input file and converts binary to string, checking for matches against the code -> value hash table each bit

Queue Performance Analysis

Average times to build code table with each variation of priorety queue for the same large textfile recorded over multiple runs and reported below.

  • Binary heap: avg. 7360 ms
    • With a delete min operation with complexity O(log2(n)), the binary heap is a decent structure to use in the creation of the Huffman tree, but its lack of optimizations elsewhere prevent it from being the best option.
  • Four-way heap: avg. 6603 ms
    • Although the time complexity of the four-way heap’s removeMin operation is worse than that of the binary heap (4 log4(n) vs. log2(n)), the cache optimization that this approach utilizes makes it the fastest approach for large inputs.
  • Pairing heap: avg. 7832 ms
    • While pairing heaps can be faster than binary or k-way heaps in many applications, here that is not the case. The pairing heap’s largest advantage comes from its constant time merge operation (compared to the alternative heaps linear time merge), however, the merge is not utilized outside of insert and removeMin operations and it performs worse due to it having a greater constant time overhead per operation.

huffman-coding's People

Contributors

tamargoadam avatar

Stargazers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.