External Sort

This program implements an external merge sort algorithm, which is used to sort data that is larger than the amount of memory available for sorting.

The algorithm works as follows, given a buffer pool size of B pages and a page size of P in bytes:

Pass 0: Split the input file into runs of size P. Each run is sorted in memory with Quicksort and written out to disk individually, producing R runs in total.

Pass 1 to Pass ceil(log_k(R)): In each of the remaining passes, we merge k runs at a time (where k = B - 1) and write each merged run to disk. We use k buffers for input and the (k+1)th buffer for output. The final pass produces a single run, which is fully sorted.
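The two phases above can be sketched roughly as follows. This is an illustrative outline only, not the program's actual code: it assumes integer records, keeps runs in memory instead of writing them to disk, and measures run size in records rather than bytes. The names ExternalSortOutline, createRuns and mergePasses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExternalSortOutline {

    // Pass 0: cut the input into chunks of at most recordsPerRun values and sort each one.
    // Assumption: the real program would write every run to its own file on disk.
    static List<int[]> createRuns(int[] input, int recordsPerRun) {
        if (recordsPerRun <= 0) {
            throw new IllegalArgumentException("recordsPerRun must be positive");
        }
        List<int[]> runs = new ArrayList<>();
        for (int start = 0; start < input.length; start += recordsPerRun) {
            int end = Math.min(start + recordsPerRun, input.length);
            int[] run = Arrays.copyOfRange(input, start, end);
            Arrays.sort(run); // the JDK uses a dual-pivot Quicksort for primitive arrays
            runs.add(run);
        }
        return runs;
    }

    // Number of merge passes needed to reduce `runs` runs to one when up to
    // k runs are merged at a time, i.e. ceil(log_k(runs)) in integer arithmetic.
    static int mergePasses(long runs, int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        int passes = 0;
        while (runs > 1) {
            runs = (runs + k - 1) / k; // each pass replaces every group of k runs with one run
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        int[] input = {5, 8, 9, 1, 2, 7, 3};
        List<int[]> runs = createRuns(input, 3);
        for (int[] run : runs) {
            System.out.println(Arrays.toString(run)); // [5, 8, 9] then [1, 2, 7] then [3]
        }
        int numBuffers = 3; // hypothetical B, so k = B - 1 = 2
        System.out.println(mergePasses(runs.size(), numBuffers - 1)); // prints 2
    }
}
```

With R = 100 runs and B = 4 buffers (k = 3), mergePasses returns 5, since the run count shrinks as 100, 34, 12, 4, 2, 1.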

Optimizations

Reduce the number of initial runs

One way to create the initial runs would be to split up the file by its natural sort order. For example, "5, 8, 9, 1, 2" would be split into two runs: R1 = "5, 8, 9" and R2 = "1, 2".

However, natural runs in unsorted data tend to be short. By producing runs of a fixed size, P, we typically reduce the number of initial runs and, in turn, the number of merge passes.
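For comparison, the natural-run splitting described above (the approach this program does not use) could look like the sketch below. It assumes integer records held in memory, and the class name NaturalRuns and method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NaturalRuns {

    // Starts a new run whenever a value is smaller than the one before it,
    // so every run is already in non-decreasing order.
    static List<List<Integer>> split(int[] input) {
        List<List<Integer>> runs = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int value : input) {
            if (!current.isEmpty() && value < current.get(current.size() - 1)) {
                runs.add(current);
                current = new ArrayList<>();
            }
            current.add(value);
        }
        if (!current.isEmpty()) {
            runs.add(current);
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(split(new int[]{5, 8, 9, 1, 2})); // prints [[5, 8, 9], [1, 2]]
    }
}
```

On input in descending order, this approach yields one run per record, which is the worst case for the number of initial runs.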

Perform k-way merges

Rather than merging runs in pairs during each pass, we use as many buffers as possible and thus merge k runs at a time. This also reduces the total number of passes of the algorithm.
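A k-way merge is commonly built on a min-heap that holds the next unread value of each run. The sketch below takes that approach as an assumption; the source does not say how the program selects the smallest value. It also merges in-memory arrays rather than buffered pages read from disk, and the names KWayMerge and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // Merges already-sorted runs into a single sorted list.
    static List<Integer> merge(List<int[]> runs) {
        // Each heap entry is {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (runs.get(i).length > 0) {
                heap.add(new int[]{runs.get(i)[0], i, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] smallest = heap.poll();
            merged.add(smallest[0]);
            // Refill the heap from the run that just supplied the smallest value.
            int[] run = runs.get(smallest[1]);
            int next = smallest[2] + 1;
            if (next < run.length) {
                heap.add(new int[]{run[next], smallest[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> runs = Arrays.asList(
                new int[]{5, 8, 9}, new int[]{1, 2, 7}, new int[]{3});
        System.out.println(merge(runs)); // prints [1, 2, 3, 5, 7, 8, 9]
    }
}
```

The heap never holds more than k entries, one per input run, which matches the idea of dedicating one input buffer to each run being merged.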

Do not write the final run to disk (not implemented)

A further optimization would be to not write the final run out to disk. Instead, during the last pass, we could perform the k-way merge and stream the final run directly to STDOUT, which would save the I/O of writing that run.

Usage

Usage: java -jar externalSort.jar [--help] [-n=numBuffers] [-p=pageSize] FILE
      FILE                         The file to sort
      --help                       display this help message
  -n, --numBuffers=numBuffers      The number of buffers in the buffer pool
  -p, --pageSize=pageSize          The page size in bytes
