This project forked from piantado/ngrampy

Tools in python for dealing with Google Books Ngram files and other similar data sets.

ngrampy's Introduction

ngrampy is a Python library for manipulating Google (or similarly formatted) n-gram data. It provides a Python class for very basic table manipulations, such that operations on tables are mimicked by operations on the hard drive. This lets it work with huge n-gram files that cannot be read into RAM. It takes a lot of hard drive time, but can handle arbitrary file sizes (5-20 GB is typical). It is *not* optimized for speed, since these jobs take a long time anyway and are typically run once.

Usually, it makes more sense to process the Google files once, concatenating them and collapsing over dates into one large file with all the n-grams, since this step may take a few days. For this, the process-google.py script is fastest (much faster than LineFile). Because it collapses dates, it produces a much smaller file (~10 GB for eng-us 2-grams):
	gzip -dc /home/piantado/Desktop/GoogleBooks/eng-us-all/2/* | python process-google.py /tmp/G-eng-us-all
Alternatively, unpigz is about 2x as fast as gzip on my computer (it multithreads fetching, file I/O, etc.).

This script does not do any fancy filtering of the n-grams.
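
To illustrate what "collapsing by dates" means, here is a minimal sketch. It is not the code in process-google.py. It assumes tab-separated input lines of the form `ngram<TAB>year<TAB>match_count<TAB>...` and that all lines for a given n-gram are adjacent in the stream. The function name `collapse_years` and the `min_year` parameter are hypothetical, chosen only for this example.

```python
import itertools

def collapse_years(lines, min_year=0):
    """Sum match counts over years for each n-gram in a stream of lines.

    Assumption: each line is 'ngram<TAB>year<TAB>match_count<TAB>...'
    and lines for the same n-gram are adjacent (as in a sorted file).
    """
    def parse(line):
        fields = line.rstrip("\n").split("\t")
        return fields[0], int(fields[1]), int(fields[2])

    # Parse lazily so the whole file never has to fit in RAM.
    rows = (parse(line) for line in lines if line.strip())

    # groupby only merges adjacent rows, which is why the input must be grouped.
    for ngram, group in itertools.groupby(rows, key=lambda row: row[0]):
        total = sum(count for _, year, count in group if year >= min_year)
        if total > 0:
            yield ngram, total

# Tiny in-memory sample standing in for the gzip stream.
sample = [
    "the cat\t1990\t3\t2\n",
    "the cat\t1991\t4\t3\n",
    "the dog\t1990\t1\t1\n",
]
collapsed = list(collapse_years(sample))
```

In a real pipeline, `lines` would be `sys.stdin`, fed by `gzip -dc` as in the command above, so only one n-gram's rows are held in memory at a time.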
	

To download data from Google, you can use download.py.

NOTE: In general, you should use this library with 
 
   export PYTHONIOENCODING=utf-8
 
so that you can handle UTF-8 characters from Google.

NOTE: This splits columns in the text files by whitespace. If you want a field that contains spaces to be treated as a single column, merge its words with underscores (or some other non-whitespace character) first.
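
As a small illustration of why the merge matters, the sketch below uses plain Python string methods, not ngrampy's own API. The sample line and its tab-separated layout are assumptions made for the example.

```python
# A hypothetical 2-gram line: the n-gram itself contains a space.
line = "new york\t1990\t5"

# Splitting on any whitespace breaks the n-gram into two columns.
naive = line.split()

# Merge the words of the n-gram with underscores before writing the line out,
# so that a later whitespace split sees it as one column.
ngram, year, count = line.split("\t")
safe_line = " ".join([ngram.replace(" ", "_"), year, count])
safe = safe_line.split()
```

Here `naive` has four columns, while `safe` has the intended three.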

NOTE: PyPy tends to run much faster than standard Python for this!

========================================================
== LICENSE
========================================================

ngrampy is licensed under GPL 3.0

========================================================
== INSTALLATION:
========================================================

Put this library somewhere--mine lives in /home/piantado/mit/Libraries/ngrampy/
	
Set the PYTHONPATH environment variable to point to ngrampy/:
	
	export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy
	
You can put this into your .bashrc file so that it is set automatically whenever you open a terminal. On Ubuntu and most other Linux distributions, this is:
	
	echo 'export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy' >> ~/.bashrc

You can also do
  
        echo 'export PYTHONIOENCODING=utf-8' >> ~/.bashrc
        
although this will change the default encoding of stdin, stdout, and stderr for every Python program you run from that shell.
	
After that, you should be ready to use the library.

ngrampy's People

Contributors

futrell, piantado
