seanbreckenridge / google_takeout_parser

Home Page: https://pypi.org/project/google-takeout-parser/

License: MIT License

google_takeout_parser's Introduction

google_takeout_parser

Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...). This:

  • parses both the Historical HTML and new JSON format for Google Takeouts
  • caches individual takeout results behind cachew
  • merges multiple takeouts into unique events

This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to PR a parser or create an issue if this doesn't parse some part you want.

This can take a few minutes to parse depending on what you have in your Takeout (especially while using the old HTML format), so this uses cachew to cache the function result for each Takeout you may have. That means this'll take a few minutes the first time parsing a takeout, but then only a few seconds every subsequent time.

Since the Takeout slowly removes old events over time, I would recommend periodically backing up your data (personally, I do it once every few months), so you don't lose old events and can pick up new ones. To do so, go to takeout.google.com; for reference, once on that page, I hit Deselect All, then select:

  • Chrome
  • Google Play Store
  • Location History
    • Select JSON as format
  • My Activity
    • Select JSON as format
  • Youtube and Youtube Music
    • Select JSON as format
    • In options, deselect music-library-songs, music-uploads and videos

Be sure to select JSON instead of HTML whenever possible. Code to parse the HTML format is included here, but it is treated as legacy code and comes with worse performance and a myriad of other issues. See Legacy HTML Parsing below.

The process for getting these isn't that great -- you have to manually go to takeout.google.com every few months, select what you want to export, and then it puts the zipped file into your Google Drive. You can schedule exports to run at specific intervals, but I personally haven't found that to be reliable.

This currently parses:

  • Activity (from dozens of Google Services) - My Activity/*.html|*.json
  • Chrome History - Chrome/BrowserHistory.json
  • Google Play Installs - Google Play Store/Installs.json
  • Location History:
    • Semantic Location History - Location History/Semantic Location History/*
    • Location History - Location History/Location History.json, Location History/Records.json
  • Youtube:
    • History - YouTube and YouTube Music/history/*.html|*.json
    • Comments:
      • Legacy HTML Comment/Live chats: YouTube and YouTube Music/my-comments/*.html and YouTube and YouTube Music/my-live-chat-messages/*.html
      • CSV/JSON (comment text is stored as a JSON blob, see below):
        • Youtube/comments/comments.csv
        • Youtube/live chats/live chats.csv
    • Likes: YouTube and YouTube Music/playlists/likes.json

This was extracted out of my HPI modules, which were in turn modified from the google files in karlicoss/HPI

Installation

Requires Python 3.8+

To install with pip, run:

pip install google-takeout-parser

Usage

The directory structure of the Google Takeout changes depending on your Google account's main language. If this doesn't support your language, see contributing. This currently supports:

  • EN: English
  • DE: German (thanks to @parthux1)

CLI Usage

Can be accessed by either google_takeout_parser or python -m google_takeout_parser. Offers a basic interface to list/clear the cache directory, and/or parse/merge a takeout and interact with it in a REPL:

Usage: google_takeout_parser parse [OPTIONS] TAKEOUT_DIR

  Parse a takeout directory

Options:
  -f, --filter [Activity|LikedYoutubeVideo|PlayStoreAppInstall|Location|ChromeHistory|YoutubeComment|PlaceVisit]
                                  Filter to only show events of this type
  -l, --locale [EN|DE]            Locale to use for matching filenames [default: EN]  [env var:
                                  GOOGLE_TAKEOUT_PARSER_LOCALE]
  -a, --action [repl|summary|json]
                                  What to do with the parsed result  [default: repl]
  --cache / --no-cache            [default: no-cache]
  -h, --help                      Show this message and exit.

If you use a language this doesn't support, see contributing.

To clear the cachew cache: google_takeout_parser cache_dir clear

A few examples of parsing takeouts:

$ google_takeout_parser --quiet parse ~/data/Unpacked_Takeout --cache
Interact with the export using res

In [1]: res[-2]
Out[1]: PlayStoreAppInstall(title='Hangouts', device_name='motorola moto g(7) play', dt=datetime.datetime(2020, 8, 2, 15, 51, 50, 180000, tzinfo=datetime.timezone.utc))

In [2]: len(res)
Out[2]: 236654

$ google_takeout_parser --quiet merge ./Takeout-Old ./Takeout-New --action summary --no-cache

Counter({'Activity': 366292,
         'Location': 147581,
         'YoutubeComment': 131,
         'PlayStoreAppInstall': 122,
         'LikedYoutubeVideo': 100,
         'ChromeHistory': 4})

Can also dump the info to JSON; e.g. to filter YouTube-related stuff from your Activity using jq:

google_takeout_parser --quiet parse -a json -f Activity --no-cache ./Takeout-New |
  # select stuff like Youtube, m.youtube.com, youtube.com using jq
  jq '.[] | select(.header | ascii_downcase | test("youtube"))' |
  # grab the titleUrl, ignoring nulls
  jq 'select(.titleUrl) | .titleUrl' -r

Also contains a small utility command to help move/extract the google takeout:

$ google_takeout_parser move --from ~/Downloads/takeout*.zip --to-dir ~/data/google_takeout --extract
Extracting /home/sean/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
Moving /tmp/tmp07ua_0id/Takeout to /home/sean/data/google_takeout/Takeout-1634993897
$ ls -1 ~/data/google_takeout/Takeout-1634993897
archive_browser.html
Chrome
'Google Play Store'
'Location History'
'My Activity'
'YouTube and YouTube Music'

Library Usage

Assuming you maintain an unpacked view, e.g. like:

 $ tree -L 1 ./Takeout-1599315526
./Takeout-1599315526
├── Google Play Store
├── Location History
├── My Activity
└── YouTube and YouTube Music

To parse one takeout:

from google_takeout_parser.path_dispatch import TakeoutParser
tp = TakeoutParser("/full/path/to/Takeout-1599315526")
# to check if files are all handled
tp.dispatch_map()
# to parse without caching the results in ~/.cache/google_takeout_parser
uncached = list(tp.parse())
# to parse with cachew cache https://github.com/karlicoss/cachew
cached = list(tp.parse(cache=True))

To parse a locale this doesn't support yet, you can create a dictionary which maps the names of the files to functions; see locales/en.py for an example. That can be passed as handlers to TakeoutParser.
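As a rough sketch, a custom handler map might look like this (the regex key and the choice of parse function are illustrative; match them to the files in your takeout):

from google_takeout_parser.path_dispatch import TakeoutParser
from google_takeout_parser.parse_json import _parse_json_activity  # illustrative choice of parse function

# keys are regexes matched against paths relative to the takeout root,
# values are functions which receive the matched Path and yield models
handlers = {
    r"Meine Aktivitäten/.*?\.json": _parse_json_activity,
}
tp = TakeoutParser("/full/path/to/Takeout", handlers=handlers)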

To cache and merge takeouts (this maintains a single dependency on the paths you pass -- so if you change the input paths, it does a full recompute):

from google_takeout_parser.merge import cached_merge_takeouts
results = list(cached_merge_takeouts(["/full/path/to/Takeout-1599315526", "/full/path/to/Takeout-1634971143"]))

If you don't want to cache the results but want to merge results from multiple takeouts, you can do something custom by directly using the merge_events function:

from google_takeout_parser.merge import merge_events
from google_takeout_parser.path_dispatch import TakeoutParser
itrs = []  # list of iterators of google events
for path in ['path/to/Takeout-1599315526', 'path/to/Takeout-1616796262']:
    # ignore errors, error_policy can be 'yield', 'raise' or 'drop'
    tk = TakeoutParser(path, error_policy="drop")
    itrs.append(tk.parse(cache=False))
res = list(merge_events(*itrs))

The events this returns are a combination of all the types in models.py; to filter to a particular type, you can provide it to skip parsing other files:

from google_takeout_parser.models import Location
from google_takeout_parser.path_dispatch import TakeoutParser
# filter_type can be a list to filter multiple types
locations = list(TakeoutParser("path/to/Takeout").parse(filter_type=Location))
len(locations)
99913

I personally use this exclusively through the HPI google takeout file, as a configuration layer to locate where my takeouts are on disk, and because that 'automatically' unzips the takeouts (I store them as zips), i.e., it doesn't require me to maintain an unpacked view.

Youtube Comment JSON

The content of the CSV youtube comment files is stored as a JSON blob, which looks like:

{
  "takeoutSegments": [
    {
      "text": "I symlink /bin/sh to dash, its a very minimal/posix compliant implementation. see "
    },
    {
      "text": "https://wiki.archlinux.org/index.php/Dash",
      "link": {
        "linkUrl": "https://wiki.archlinux.org/index.php/Dash"
      }
    }
  ]
}

This exposes some functions to help parse those into text or markdown, or to just extract the links:

from google_takeout_parser.path_dispatch import TakeoutParser
from google_takeout_parser.models import CSVYoutubeComment
from google_takeout_parser.parse_csv import extract_comment_links, reconstruct_comment_content


path = "./Takeout-1599315526"
comments = list(
    TakeoutParser(path, error_policy="raise").parse(
        cache=False, filter_type=CSVYoutubeComment
    )
)
for e in comments:
    print(extract_comment_links(e.contentJSON))
    print(reconstruct_comment_content(e.contentJSON, "text"))
    print(reconstruct_comment_content(e.contentJSON, "markdown"))

Legacy HTML Parsing

I would heavily recommend against using the HTML format for My Activity. It is not always possible to properly parse the metadata, is more prone to errors parsing dates due to local timezones, and takes much longer to parse than the JSON format.

On certain machines, the giant HTML files may even take so much memory that the process is eventually killed for using too much memory. For a workaround, see split_html.

Contributing

Just to give a brief overview, to add new functionality (parsing some new folder that this doesn't currently support), you'd need to:

  • Add a model for it in models.py, subclassing BaseEvent and adding it to the Union at the bottom of the file. That should have a key property function which describes each event uniquely (used to merge takeout events)
  • Write a function which takes the Path to the file you're trying to parse and converts it to the model you created (see examples in parse_json.py). Ideally, extract a single raw item from the takeout file and add a test for it, so it's obvious when/if the format changes (see the sketch after this list)
  • Add a regex match for the file path to the handler map in google_takeout_parser/locales/en.py.
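As a rough sketch, the first two steps might look like this (the model name, fields, and parse function here are hypothetical):

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator

from google_takeout_parser.models import BaseEvent

@dataclass
class SomeNewEvent(BaseEvent):  # hypothetical model for a new folder
    title: str
    dt: datetime

    @property
    def key(self) -> str:
        # describes this event uniquely; used when merging takeout events
        return f"{self.title}-{int(self.dt.timestamp())}"

def _parse_some_new_file(path: Path) -> Iterator[SomeNewEvent]:
    # extract each raw item from the file and convert it to the model
    ...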

Don't feel required to add support for all locales; it's somewhat annoying to swap languages on Google, request a takeout, wait for it to process, and then swap back.

Though, if your takeout is in some language this doesn't support, you can create an issue with the file structure (run find Takeout and/or tree Takeout), or contribute a locale file by creating a path -> function mapping, and adding it to the global LOCALES variables in locales/all.py and locales/main.py

This is pretty difficult to maintain, as it requires a lot of manual testing from people who have access to these takeouts and who actively use the language the takeout is in. My Google account's main language is English, so I upkeep that locale whenever I notice changes, but it's not trivial to port those changes to other locales without swapping my language, making an export, waiting, and then switching back. I keep track of mismatched changes in this board

Ideally, you would select everything when doing a takeout (not just the My Activity/Chrome/Location History like I suggested above), so paths that are not parsed can be ignored properly.

Testing

git clone 'https://github.com/seanbreckenridge/google_takeout_parser'
cd ./google_takeout_parser
pip install '.[testing]'
mypy ./google_takeout_parser
flake8 ./google_takeout_parser
pytest

google_takeout_parser's People

Contributors

karlicoss, parthux1, ryanbateman, seanbreckenridge

google_takeout_parser's Issues

add handler for Google Fit data

Fit/Daily Aggregations csv files -- started appearing in 2017

Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx files -- perhaps worth just having a function to get them; something else should actually handle the tcx files
also, a bunch of them seem to have disappeared in 2020 (comparing with 2018) -- not sure if it's some sort of retention
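A minimal sketch of just locating the tcx files, leaving the actual tcx parsing to something else:

from pathlib import Path
from typing import Iterator

def fit_activity_files(takeout_dir: Path) -> Iterator[Path]:
    # covers both Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx
    yield from takeout_dir.glob("Fit/Activities/**/*.tcx")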

use streaming html parser

loading the whole html document into memory is pretty expensive memory-wise; could either use a streaming html parser, or maybe split the file before loading it?
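A sketch of what the streaming approach could look like with lxml's iterparse (assuming lxml is an acceptable dependency):

from lxml import etree

def iter_divs(path: str):
    # iterparse emits elements as their closing tag is parsed, so the whole
    # document never has to be held in memory at once
    for _, el in etree.iterparse(path, events=("end",), tag="div", html=True):
        yield el.text, dict(el.attrib)
        el.clear()  # free the element once processed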

add parser for Hangouts/Hangouts.json

also, oddly, my takeout from 2014 has Hangouts.json and Hangouts2.json (with similar sizes); the diff shows about 10% of lines differing, but it's hard to tell what's actually different -- seems very random
all later takeouts only have a single Hangouts.json

some semantic locations are missing placeId

Even in the latest export, I'm getting quite a few errors originating from

otherCandidateLocations=[
CandidateLocation.from_dict(pv)
for pv in placeVisit.get("otherCandidateLocations", [])
],

The reason is that sometimes they are missing placeId, which is required:

placeId=data["placeId"],

Most of them look like this, with either TYPE_HOME or TYPE_WORK:

{'latitudeE7': ..., 'longitudeE7': ..., 'semanticType': 'TYPE_HOME', 'locationConfidence': 26.890783}

(sometimes it's TYPE_HOME/TYPE_WORK and placeId is present as well, but not always)

There are a few remaining ones with a different type:

{'latitudeE7': ...,
'locationConfidence': 14.805286,
'longitudeE7': ...,
'semanticType': 'TYPE_ALIASED_LOCATION'}

however, I'm not sure what that one means.

Not sure what the right way to handle this is; perhaps making placeId optional, and perhaps also adding a semanticType field?
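A sketch of that fix, with placeId and semanticType made optional (field names follow the JSON above; the real model may differ):

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class CandidateLocation:
    latitudeE7: int
    longitudeE7: int
    locationConfidence: Optional[float] = None
    placeId: Optional[str] = None  # sometimes missing for TYPE_HOME/TYPE_WORK
    semanticType: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CandidateLocation":
        return cls(
            latitudeE7=data["latitudeE7"],
            longitudeE7=data["longitudeE7"],
            locationConfidence=data.get("locationConfidence"),
            # .get so a missing key becomes None instead of raising KeyError
            placeId=data.get("placeId"),
            semanticType=data.get("semanticType"),
        )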

split cached databases by type

I believe this would make the size smaller, since individual rows for the cachew union type would be smaller, so the cache doesn't grow to unreasonable sizes.

Would probably leave the one in HPI google_takeout as-is, since there's just one of those, not multiple that grow with the number of exports.

As it stands, I'm comfortable with the tradeoff here -- trading ease for disk space -- but it definitely could be improved.

parse/check new youtube files

[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel URL configs.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel feature data.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel page settings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/comments/comments.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/live chats/live chats.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video recordings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video texts.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/videos.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel URL configs.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel feature data.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel page settings.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/comments/comments.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/live chats/live chats.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video recordings.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video texts.csv
[W 231203 09:46:56 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/videos.csv
[W 231203 09:48:47 path_dispatch:331] No function to handle parsing Maps (your places)/Saved Places.json

add parser for saved places on google maps

Seem to be scattered across different formats 💩

"Saved" list is in "Maps (your places)/Saved Places.json" -- present since 2015

{
  "type" : "FeatureCollection",
  "features" : [ {
    "geometry" : {
      "coordinates" : [ -0.1202100, 51.5979200 ],
      "type" : "Point"
    },
    "properties" : {
      "Google Maps URL" : "http://maps.google.com/?cid=17295021474934382781",
      "Location" : {
        "Address" : "United Kingdom",
        "Business Name" : "Alexandra Palace",
        "Country Code" : "GB",
        "Geo Coordinates" : {
          "Latitude" : "51.5979200",
          "Longitude" : "-0.1202100"
        }
      },
      "Published" : "2017-09-27T09:56:06Z",
      "Title" : "Alexandra Palace",
      "Updated" : "2017-09-27T09:56:06Z"
    },
    "type" : "Feature"
  }, {
    "geometry" : {
      "coordinates" : [ -0.1307733, 51.5941783 ],
      "type" : "Point"
    },
...
]}

Whereas other lists are in CSV files (since 2018), in the "Saved" directory, one for each list in google maps,
e.g. Saved/Paris.csv

Title,Note,URL
Urfa Durum,,"https://www.google.com/search?q=Urfa+Durum&ludocid=15623525448940569321&ibp=gwp;0,7"

doesn't seem like this data is present anywhere else in takeouts
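A sketch of parsing the "Saved Places.json" structure above (SavedPlace is a hypothetical model):

import json
from pathlib import Path
from typing import Iterator, NamedTuple

class SavedPlace(NamedTuple):
    title: str
    lat: float
    lng: float
    url: str

def parse_saved_places(path: Path) -> Iterator[SavedPlace]:
    data = json.loads(path.read_text())
    for feature in data["features"]:
        # GeoJSON coordinate order is [longitude, latitude]
        lng, lat = feature["geometry"]["coordinates"]
        props = feature["properties"]
        yield SavedPlace(
            title=props["Title"],
            lat=lat,
            lng=lng,
            url=props.get("Google Maps URL", ""),
        )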

remove placevisit JSON property

previously used json.dumps on the other candidate locations, and had a property that decoded it, to comply with cachew not allowing JSON objects as values. cachew now uses a JSON encoder internally, so it should be possible to type the list of dictionaries properly

Recreate cache on version upgrades

Unless a model changes, the hash for cachew doesn't update, but the code may have changed while we still have old results. So, unless you clear the directory, you could have results generated from old functionality.

The clear command does fix that, but it would be nice for this to invalidate old results automatically, by inspecting the package installation to see what version this is and putting a 'version' file in the cache directory (or maybe in the cachew hash db table?)

Could add an environment variable/flag that lets you use mismatched hashes during development.

Could also maybe just add the version at the front of the _cachew_depends_on, since that gets stored as part of the hash.
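A sketch of that last idea, using importlib.metadata so the package doesn't need to expose a __version__ attribute (the helper name is hypothetical):

from importlib.metadata import version
from typing import List

def _depends_on(paths: List[str]) -> List[str]:
    # prepending the installed version means a package upgrade changes the
    # stored hash, so cachew recomputes instead of serving stale results
    return [version("google-takeout-parser"), *paths]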

parse youtube video metadata

includes information about the video release date, the description, duration, etc.
video recordings.csv also includes a lat/lon, which could be nice to expose

support Windows separators in path_dispatch

While setting up Windows CI for promnesia, the takeout tests failed with these in the logs:

2022-05-09T21:03:56.5877916Z [INFO    2022-05-09 20:58:14 promnesia extract.py:49] extracting via promnesia.sources.takeout:index ...
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\MyActivity.html
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\README

I guess it's because forward slashes are hardcoded in path_dispatch. Perhaps the quickest fix would be to do something like .replace(os.sep, '/') here -- individual filenames in a takeout shouldn't contain forward or backward slashes anyway https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/path_dispatch.py#L94
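i.e., a sketch of that quick fix:

import os

def _normalize(relative_path: str) -> str:
    # filenames inside a takeout shouldn't legitimately contain backslashes,
    # so mapping the OS separator to '/' lets the hardcoded regexes match on Windows
    return relative_path.replace(os.sep, "/")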

add parser for Google Keep data

Seems to be in the "Keep/" directory. Mostly in HTML

pretty messy filenames:

  • in 2015
2015-05-18T18_43_03.920Z.html
5.html
  • in 2017
2017-01-29T19_43_26.664Z
2017-01-29T19_43_29.485Z
  • 2021 has both html and json, but jsons are mostly empty, almost no data
2018-05-09T09_29_49.983+01_00.html
2018-05-09T09_29_49.983+01_00.json

example HTML:

...
<body><div class="note DEFAULT"><div class="heading"><div class="meta-icons">
<span class="archived" title="Note archived"></span>
</div>
Apr 7, 2019, 1:11:02 PM</div>

<div class="content">HTML content</div>


</div></body></html>
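A sketch of pulling fields out of that HTML with BeautifulSoup (assuming bs4; date parsing is left out):

from bs4 import BeautifulSoup

def parse_keep_note(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("div.heading")
    return {
        # the heading div's text is the note date, e.g. 'Apr 7, 2019, 1:11:02 PM'
        "date": heading.get_text(strip=True) if heading else None,
        "archived": soup.select_one("span.archived") is not None,
        "content": soup.select_one("div.content").get_text(),
    }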

Do something about http:// youtube links

It might make sense to replace http:// with https:// for some links, e.g. to youtube videos.

For instance, Takeout/My Activity/Video Search/MyActivity.{json,html} might contain http:// links for some old entries:

{'header': 'youtube.com', 'title': 'Watched Octobass @ the Musical Instrument Museum - YouTube', 'titleUrl': 'http://www.youtube.com/watch?v=FP1QqtGe8ts', 'time': '2015-06-10T12:24:03.796Z', 'products': ['Video Search']}

In case of youtube, switching to https doesn't really hurt (http and https are equivalent and both are available), and it might make it easier to consume downstream, e.g. might prevent duplicates.
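e.g., a sketch of the normalization:

def normalize_youtube_url(url: str) -> str:
    # http and https are equivalent for youtube, so normalizing old entries
    # to https prevents duplicate events downstream
    if url.startswith("http://") and "youtube.com" in url:
        return "https://" + url[len("http://"):]
    return url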

zulip discussion: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/279605540

push to pypi

Already have a release just to have the name registered, but leaving the install method as git+ for now, especially because there might be more changes (i.e. #2) and this is a relatively new project right now

add some helper methods for the CSV comment/live chats on youtube

currently it's a list of JSON blobs stored as a string

it's not clear what the user's desired access pattern is, so creating one format that works for everything is difficult

but it would at least be nice to have a helper on the model which yields items out of it in a nicer format, and perhaps a .text method that just converts it to a block of text

could also maybe have a markdown() function, as it would codify all the complexity relatively well
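e.g., a minimal sketch of the .text idea, based on the takeoutSegments structure shown above:

import json

def comment_text(content_json: str) -> str:
    # join the text of each takeoutSegment into one block of text
    data = json.loads(content_json)
    return "".join(seg.get("text", "") for seg in data.get("takeoutSegments", []))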

Takeout folders are localized according to the account's main language

problem

If an account doesn't have English as its main language, folders and some files are named differently (localized).
This results in no parsed folders, due to _match_handler misses.

possible solution

I think it would be beneficial to add default handler maps like the one defined here (DEFAULT_HANDLER_MAP) for other languages.
A user could select a HandlerMap via a command line argument.

If you approve of this idea, I could work on a pull request adding a German localization, as well as a command line argument for selecting a handler map.

If there's already an option to achieve this please fill me in :)
