dataspora / big-data-tools

Miscellaneous tools in bash, Python, and Perl for munging Big Data.

Home Page: http://data-analytics-tools.blogspot.com

Perl 8.33% Python 9.42% R 34.61% Shell 47.64%

big-data-tools's Introduction

This directory contains several tools for operating on very large
data sets.

== Map Function in Bash ==

- map.sh - a map function implemented in Bash

Now that multi-core processors are the norm, it is only reasonable
that we ought to be able to parallelize even shell scripts.  This
script provides a means for operating in parallel on sets of files
contained in directories.
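
The general idea of mapping a command over the files in a directory, in
parallel, can be sketched in Python. This is not map.sh and does not
reflect its interface; the function name `map_files`, its parameters,
and the default of four workers are assumptions made for illustration
only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def map_files(command, directory, workers=4):
    """Run `command <file>` for every regular file in `directory`, in parallel.

    Hypothetical helper: returns a dict mapping each file path to the
    text the command wrote to standard output.
    """
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())

    def run(path):
        # Each call launches a separate OS process, so the work can be
        # spread over several cores even though the callers are threads.
        result = subprocess.run(
            [*command, str(path)], capture_output=True, text=True, check=True
        )
        return result.stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(run, files))
    return dict(zip(files, outputs))
```

For example, `map_files(["wc", "-l"], "logs/")` would count the lines of
each file under a hypothetical `logs/` directory, running up to four
`wc` processes at a time.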

== Reservoir Sampling ==

- samplen.py - a reservoir sampler implemented in Python
- samplen.pl - a reservoir sampler implemented in Perl

Algorithms that perform calculations on evolving data streams, but in
fixed memory, have increasing relevance in the Age of Big Data.

The reservoir sampling algorithm outputs a sample of N lines from a
file of undetermined size. It does so in a single pass, using memory
proportional to N.

These two features -- (i) a memory footprint that depends only on N,
not on the size of the input, and (ii) a capacity to operate on files
of indeterminate size -- make it ideal for working with the very large
data sets common to event processing.

While it has likely been discovered and implemented independently many
times, like many algorithms, it was codified in Knuth's The Art of
Computer Programming.

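The trick of this algorithm is to first fill up the sample buffer with
the first N lines, and afterwards, for each additional line of input,
to probabilistically replace one of the buffered lines with it. The
k-th line (k > N) is kept with probability N/k, which leaves every
line of the input equally likely to end up in the sample.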
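
The fill-then-replace procedure described above can be sketched as
follows. This is a minimal illustration of the standard technique, not
the code in samplen.py; the function name `reservoir_sample` and the
optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(lines, n, rng=random):
    """Return a uniform random sample of up to n items from an iterable.

    Makes a single pass over `lines` and never holds more than n items.
    """
    sample = []
    for count, line in enumerate(lines, start=1):
        if count <= n:
            # Phase 1: fill the buffer with the first n lines.
            sample.append(line)
        else:
            # Phase 2: keep the count-th line with probability n / count,
            # overwriting a uniformly chosen slot in the buffer.
            slot = rng.randrange(count)
            if slot < n:
                sample[slot] = line
    return sample
```

Because the function accepts any iterable, something like
`reservoir_sample(sys.stdin, 10)` would draw ten lines from standard
input without knowing in advance how many lines will arrive. If the
input has fewer than n lines, all of them are returned.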

Contributors: dataspora
