Download all messages from Google Group archive

google-group-crawler is a Bash-4 script to download all (original) messages from a Google Group archive. Private groups require a cookies file in Netscape format. Groups with adult content are not yet supported.

Installation

The script requires bash-4, sort, wget, sed, awk.

Make the script executable with chmod 755 and put it somewhere in your $PATH (e.g., /usr/local/bin/).
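
A typical installation could look like this (the clone URL matches the upstream project; the target directory is an assumption, adjust to taste):

git clone https://github.com/icy/google-group-crawler.git
cd google-group-crawler
chmod 755 crawler.sh
cp crawler.sh /usr/local/bin/     # or any other directory in your $PATH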

The script may not work in a Windows environment, as reported in icy#26.

Usage

The first run

For a private group, please prepare your cookies file first (see below).

# export _WGET_OPTIONS="-v"       # extra wget options, e.g., to pass cookies
# export _HOOK_FILE="/some/path"  # provide a hook file; see "The hook" below

# export _ORG="your.company"      # required if you are using G Suite
export _GROUP="mygroup"           # specify your group
./crawler.sh -sh                  # first run, for testing
./crawler.sh -sh > wget.sh        # save the generated script
bash wget.sh                      # download the mbox files

You can execute the wget.sh script multiple times; wget will quickly skip any files that have already been fully downloaded.
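
To get a rough sense of progress, you can count the files fetched so far (a trivial check, assuming the default output layout):

ls "$_GROUP"/mbox/ | wc -l        # number of messages downloaded so far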

Update your local archive via the RSS feed

After you have an archive from the first run, you only need to add the latest messages as shown in the group's RSS feed. You can do that with the -rss option and the additional _RSS_NUM environment variable:

export _RSS_NUM=50                # (optional; see "For script hackers" below)
./crawler.sh -rss > update.sh     # using rss feed for updating

Running this frequently is a convenient way to keep your local archive up to date.
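
For example, a nightly cron job can automate the update (the schedule, paths, and group name below are assumptions):

# m h dom mon dow  command
0 3 * * * cd "$HOME/group-archive" && _GROUP=mygroup /usr/local/bin/crawler.sh -rss > update.sh && bash update.sh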

Private group or Group hosted by an organization

To download messages from a private group, or from a group hosted by your organization, you need to provide cookies in the legacy Netscape format.

  1. Export cookies for the Google domains from your browser and save them to a file. Please use the Netscape format; you may need to edit the file to meet a few conditions:

    1. The first line should be # Netscape HTTP Cookie File
    2. The file must use tabs instead of spaces.
    3. The first field of every line in the file must be groups.google.com.

    A simple script to process this file is shown below; it writes the fixed cookies file used in the next step:

     tail -n +3 original_cookies.txt \
       | awk -v OFS='\t' \
         'BEGIN {printf("# Netscape HTTP Cookie File\n\n")}
          {$1 = "groups.google.com"; printf("%s\n", $0)}' \
       > fixed_cookies.txt
    

    See the sample files in the tests/ directory

    1. The original file: tests/sample-original-cookies.txt
    2. The fixed file: tests/sample-fixed-cookies.txt
  2. Specify your cookie file via _WGET_OPTIONS:

     export _WGET_OPTIONS="--load-cookies /your/path/fixed_cookies.txt --keep-session-cookies"
    

    Now every hidden group can be downloaded :)

The hook

If you want to execute a hook command after an mbox file is downloaded, you can do as below.

  1. Prepare a Bash script file that contains a definition of the __wget_hook command. The first argument specifies the output filename, and the second argument specifies a URL. For example, here is a simple hook:

     # $1: output file
     # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
     __wget_hook() {
       # stat -c %b prints the number of allocated blocks; 0 means an empty file
       if [[ "$(stat -c %b "$1")" == 0 ]]; then
         echo >&2 ":: Warning: empty output '$1'"
       fi
     }
    

    In this example, the hook checks whether the output file is empty and, if so, prints a warning to standard error.

  2. Set the _HOOK_FILE environment variable to the path of your script. For example:

     export _GROUP=archlinuxvn
     export _HOOK_FILE=$HOME/bin/wget.hook.sh
    

    The hook file will now be loaded by the output of future crawler.sh -sh or crawler.sh -rss commands.

What to do with your local archive

The downloaded messages are found under $_GROUP/mbox/*.

They are in RFC 822 format (possibly with obfuscated email addresses) and can easily be converted to mbox format before being imported into your email client (Thunderbird, claws-mail, etc.).
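
For example, a minimal (and naive) conversion sketch simply prepends the mbox "From " separator line to each raw message; a proper converter would also escape body lines that start with "From ":

for f in "$_GROUP"/mbox/m.*; do
  # mbox messages are delimited by a "From <sender> <date>" line
  printf 'From crawler %s\n' "$(date -u '+%a %b %e %H:%M:%S %Y')"
  cat "$f"
  echo                            # blank separator line between messages
done > "$_GROUP.mbox"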

You can also use the mhonarc utility to convert the downloaded messages to HTML files.
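
For instance, assuming you have built a single mbox file as sketched above:

mhonarc "$_GROUP.mbox" -outdir html    # render the archive as static HTML pages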

See also icy#15.

Rescan the whole local archive

Sometimes you may need to rescan or re-download all messages. This can be done by removing all temporary files:

rm -fv "$_GROUP"/threads/t.*    # this is a must
rm -fv "$_GROUP"/msgs/m.*       # see also "For script hackers" below

or you can use _FORCE option:

_FORCE="true" ./crawler.sh -sh

Another option is to delete all files under the $_GROUP/ directory. As usual, remember to back up before you delete anything.

Known problems

  1. Fails on groups with adult content (icy#14).
  2. The script may not recover the original emails from public groups. With valid cookies, you may get the original emails if you are a manager of the group. See also icy#16.
  3. When cookies are used, the original emails may be recovered, so you must filter them before making your archive public (a rough sketch follows this list).
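
Regarding the last point, here is a rough sketch of such a filter. It is an assumption, not part of the tool; review the result carefully before publishing:

# copy the archive and crudely obfuscate e-mail addresses in the copy
mkdir -p public/mbox
for f in "$_GROUP"/mbox/m.*; do
  sed -e 's/@/ -at- /g' "$f" > "public/mbox/${f##*/}"
done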

For script hackers

Please skip this section unless you really know how to work with Bash and shells.

  1. If you clean your files (as below), you may notice that re-downloading everything is very slow. Consider using the -rss option instead; it fetches data from the group's RSS feed.

    It's recommended to use the -rss option for daily updates. By default, the number of items is 50; you can change it with the _RSS_NUM variable. However, don't use a very big number, because Google will ignore it.

  2. Because the Topics list is a FIFO list, you only need to remove the last file: the script will re-download the last item, and if there is a new page, that page will be fetched. The snippet below prints that last file for each topic prefix.

     # print the highest-numbered page file of each topic prefix;
     # remove that file to make the script re-fetch the last page (and any new pages)
     ls "$_GROUP"/msgs/m.* \
     | sed -e 's#\.[0-9]\+$##g' \
     | sort -u \
     | while read -r f; do
         last_item="$f.$( \
           ls "$f".* \
           | sed -e 's#^.*\.\([0-9]\+\)$#\1#g' \
           | sort -n \
           | tail -1 \
         )";
         echo "$last_item";
       done
    
  3. The list of threads is a LIFO list. If you want to rescan your list, you will need to delete all files under $_D_OUTPUT/threads/
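
    For example ($_D_OUTPUT is the script's output directory; this mirrors the cleanup command shown earlier):

     rm -fv "$_D_OUTPUT"/threads/t.*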

  4. You can set the timestamp of each mbox output file from its Date: header, as below.

     ls "$_GROUP"/mbox/m.* \
     | while read -r FILE; do
         # use the first Date: header of each message as the file's mtime
         date="$(grep '^Date: ' "$FILE" | head -1 | sed -e 's#^Date: ##')"
         touch -d "$date" "$FILE"
       done
    

    This will be very useful, for example, when you want to use the mbox files with mhonarc.


License

This work is released under the terms of an MIT license.

Author

This script is written by Anh K. Huynh.

He wrote this script because he couldn't solve the problem using Node.js, PhantomJS, or Watir.

New web technology just makes life harder, doesn't it?
