seanbreckenridge / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

Home Page: https://pypi.org/project/google-takeout-parser/

License: MIT License


google_takeout_parser's Issues

some semantic locations are missing placeId

Even in the latest export, I'm getting quite a few errors originating from

otherCandidateLocations=[
CandidateLocation.from_dict(pv)
for pv in placeVisit.get("otherCandidateLocations", [])
],

The reason is that sometimes they are missing placeId, which is required:

placeId=data["placeId"],

Most of them look like this, with either TYPE_HOME or TYPE_WORK:

{'latitudeE7': ..., 'longitudeE7': ..., 'semanticType': 'TYPE_HOME', 'locationConfidence': 26.890783}

(sometimes it's TYPE_HOME/TYPE_WORK and placeId is present as well, but not always)

There are a few remaining ones with a different type:

{'latitudeE7': ...,
'locationConfidence': 14.805286,
'longitudeE7': ...,
'semanticType': 'TYPE_ALIASED_LOCATION'}

however, I'm not sure what that one means.

Not sure what the right way to handle this is -- perhaps making placeId optional, and perhaps also adding a semanticType field?
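
Something along these lines could work (a minimal sketch -- the field set is trimmed for illustration, and it assumes only placeId/semanticType can be absent):

from typing import Any, Dict, NamedTuple, Optional


class CandidateLocation(NamedTuple):
    latitudeE7: int
    longitudeE7: int
    locationConfidence: float
    placeId: Optional[str]       # missing on some TYPE_HOME/TYPE_WORK entries
    semanticType: Optional[str]  # e.g. TYPE_HOME, TYPE_WORK, TYPE_ALIASED_LOCATION

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CandidateLocation":
        return cls(
            latitudeE7=data["latitudeE7"],
            longitudeE7=data["longitudeE7"],
            locationConfidence=data["locationConfidence"],
            placeId=data.get("placeId"),          # .get() so a missing key becomes None
            semanticType=data.get("semanticType"),
        )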

push to pypi

Already have a release just to have the name registered, but leaving the install method as git+ for now, especially because there might be more changes (i.e. #2) and it's a relatively new project right now.

add some helper methods for the CSV comment/live chats on youtube

Currently it's a list of JSON blobs stored as a string.

It's not clear what the user's desired access pattern is, so creating one format that works for everything is difficult.

But it would at least be nice to have a helper on the model which yields items out of it in a nicer format, and perhaps a .text method that just converts it to a block of text.

Could also maybe have a markdown() function, as it would codify all the complexity relatively well.
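
A rough sketch of what the helpers could look like, assuming the raw field is a JSON-encoded list of message dicts (the key names here are hypothetical):

import json
from typing import Any, Dict, Iterator


class LiveChatMessage:
    def __init__(self, raw: str) -> None:
        # raw is the JSON-encoded list of blobs straight from the CSV
        self.raw = raw

    def items(self) -> Iterator[Dict[str, Any]]:
        """Yield each decoded blob in a nicer format."""
        yield from json.loads(self.raw)

    @property
    def text(self) -> str:
        """Flatten all text segments into a single block of text."""
        return "".join(seg.get("text", "") for seg in self.items())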

use streaming html parser

Loading the whole HTML document into memory is pretty expensive memory-wise; could either use a streaming HTML parser, or maybe split the file before loading it?
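
For reference, Python's built-in html.parser supports incremental feeding, so something along these lines would keep memory bounded (a sketch, not the library's current parser):

from html.parser import HTMLParser
from pathlib import Path
from typing import List


class LinkExtractor(HTMLParser):
    """Example handler: collect hrefs without building a full DOM."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(path: Path, chunk_size: int = 64 * 1024) -> List[str]:
    parser = LinkExtractor()
    with path.open("r", encoding="utf-8") as f:
        # feed() accepts partial input, so only one chunk is in memory at a time
        while chunk := f.read(chunk_size):
            parser.feed(chunk)
    parser.close()
    return parser.links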

remove placevisit JSON property

Previously, json.dumps was used on the other candidate locations, with a property that decoded it back, to comply with cachew not allowing JSON objects as values. cachew now uses a JSON encoder internally, so it should be possible to type the list of dictionaries properly.
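
In other words, something like this should now be possible (a sketch; the actual model has more fields):

from typing import Any, Dict, List, NamedTuple


class PlaceVisit(NamedTuple):
    # previously this was a str holding json.dumps(...) output, plus a
    # property that json.loads-ed it back on access; cachew's internal
    # JSON encoder should now allow typing it directly
    otherCandidateLocations: List[Dict[str, Any]]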

Takeout folders are localized according to the account's main language

problem

If an account doesn't have English as its main language, folders and some files are named differently (localized).
This results in no folders being parsed, due to _match_handler misses.

possible solution

I think it would be beneficial to add default handler maps for other languages, like the one defined here (DEFAULT_HANDLER_MAP).
A user could then select a HandlerMap via a command line argument.

If you approve of this idea, I could work on a pull request adding a German localization, as well as a command line argument for selecting a handler map.

If there's already an option to achieve this, please fill me in :)
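
A rough sketch of what this could look like (the HandlerMap alias, the GERMAN_HANDLER_MAP contents, and the locale-selection helper are all hypothetical):

from typing import Callable, Dict, Optional

# maps a path fragment to a parse function (None = explicitly ignored)
HandlerMap = Dict[str, Optional[Callable]]

DEFAULT_HANDLER_MAP: HandlerMap = {}  # stand-in for the existing English map
GERMAN_HANDLER_MAP: HandlerMap = {}   # same handlers, keyed by German folder names

LOCALE_HANDLER_MAPS: Dict[str, HandlerMap] = {
    "en": DEFAULT_HANDLER_MAP,
    "de": GERMAN_HANDLER_MAP,
}


def resolve_handler_map(locale: str) -> HandlerMap:
    """Pick a handler map from e.g. a --locale command line argument."""
    return LOCALE_HANDLER_MAPS.get(locale.lower(), DEFAULT_HANDLER_MAP)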

support Windows separators in path_dispatch

While setting up Windows CI for promnesia, the takeout tests failed and had these in logs:

2022-05-09T21:03:56.5877916Z [INFO    2022-05-09 20:58:14 promnesia extract.py:49] extracting via promnesia.sources.takeout:index ...
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\MyActivity.html
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\README

I guess it's because forward slashes are hardcoded in path_dispatch. Perhaps the quickest fix would be to do something like .replace(os.sep, '/') here -- paths inside takeouts shouldn't contain either forward or backward slashes anyway: https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/path_dispatch.py#L94
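
i.e. something along these lines (a sketch, not the actual path_dispatch code):

import os


def _normalize_path(relative_path: str) -> str:
    # handler patterns are written with forward slashes; on Windows os.sep
    # is "\\", and since individual takeout file names shouldn't contain
    # either kind of slash themselves, a blanket replace is safe
    return relative_path.replace(os.sep, "/")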

Recreate cache on version upgrades

Unless a model changes, the hash for cachew doesn't update, but the code may have changed while we still have old results. So, unless you clear the directory, you could have results generated from old functionality.

The clear command does fix that, but it would be nice for this to invalidate old results automatically, by inspecting the package installation to see what version this is and putting a 'version' file in the cache directory (or maybe in the cachew hash db table?).

Could add an environment variable/flag that lets you use mismatched hashes during development

Could also maybe just add the version at the front of the _cachew_depends_on, since that gets stored as part of the hash
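
The last option might look roughly like this (a sketch -- the real _cachew_depends_on computes other inputs as well):

from importlib.metadata import version
from pathlib import Path
from typing import List


def _cachew_depends_on(path: Path) -> List[str]:
    # prepending the installed version means any package upgrade changes
    # the stored hash, so stale cached results get regenerated automatically
    return [version("google_takeout_parser"), str(path)]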

add parser for saved places on google maps

Seem to be scattered across different formats 💩

"Saved" list is in "Maps (your places)/Saved Places.json" -- present since 2015

{
  "type" : "FeatureCollection",
  "features" : [ {
    "geometry" : {
      "coordinates" : [ -0.1202100, 51.5979200 ],
      "type" : "Point"
    },
    "properties" : {
      "Google Maps URL" : "http://maps.google.com/?cid=17295021474934382781",
      "Location" : {
        "Address" : "United Kingdom",
        "Business Name" : "Alexandra Palace",
        "Country Code" : "GB",
        "Geo Coordinates" : {
          "Latitude" : "51.5979200",
          "Longitude" : "-0.1202100"
        }
      },
      "Published" : "2017-09-27T09:56:06Z",
      "Title" : "Alexandra Palace",
      "Updated" : "2017-09-27T09:56:06Z"
    },
    "type" : "Feature"
  }, {
    "geometry" : {
      "coordinates" : [ -0.1307733, 51.5941783 ],
      "type" : "Point"
    },
...
]}

Whereas other lists are in CSV files (since 2018), in the "Saved" directory, one for each list in Google Maps,
e.g. Saved/Paris.csv:

Title,Note,URL
Urfa Durum,,"https://www.google.com/search?q=Urfa+Durum&ludocid=15623525448940569321&ibp=gwp;0,7"

Doesn't seem like this data is present anywhere else in takeouts.
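
Parsing the GeoJSON-style Saved Places.json could look roughly like this (the SavedPlace model and field choices are illustrative, based on the sample above):

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, NamedTuple


class SavedPlace(NamedTuple):
    title: str
    lat: float
    lng: float
    url: str
    published: datetime


def parse_saved_places(path: Path) -> Iterator[SavedPlace]:
    data = json.loads(path.read_text())
    for feat in data["features"]:
        lng, lat = feat["geometry"]["coordinates"]  # GeoJSON order is [lng, lat]
        props = feat["properties"]
        yield SavedPlace(
            title=props["Title"],
            lat=lat,
            lng=lng,
            url=props["Google Maps URL"],
            published=datetime.strptime(
                props["Published"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
        )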

add parser for Hangouts/Hangouts.json

Also, oddly, my takeout from 2014 has Hangouts.json and Hangouts2.json (with similar sizes); the diff shows about 10% of lines differing, but it's hard to tell what's actually different -- seems very random.
All later takeouts only have a single Hangouts.json.

parse youtube video metadata

Includes information about the video release date, the description, duration, etc.
video recordings.csv also includes a lat/lon, so that could be nice to expose.

add handler for Google Fit data

Fit/Daily Aggregations csv files -- started appearing in 2017

Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx files -- perhaps worth just having a function to get them; something else should actually handle the TCX files. See the sketch below.
Also, a bunch of them seem to have disappeared in 2020 (comparing with 2018) -- not sure if it's some sort of retention.
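
The locating-only helper could be as simple as this sketch (directory names taken from the paths above):

from pathlib import Path
from typing import Iterator


def fit_tcx_files(takeout_dir: Path) -> Iterator[Path]:
    # only locate the files -- actually parsing TCX is better left to a
    # dedicated library downstream
    activities = takeout_dir / "Fit" / "Activities"
    yield from sorted(activities.glob("*.tcx"))
    yield from sorted((activities / "Low Accuracy").glob("*.tcx"))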

split cached databases by type

I believe this would make the size smaller, since individual rows for the cachew union type would be smaller, so the cache doesn't grow to unreasonable sizes.

Would probably leave the one in HPI google_takeout as-is, since there's just one of those, not multiple that keep growing with the number of exports.

As it stands, I'm comfortable with the tradeoff here -- trading ease of use for disk space -- but it could definitely be improved.

Do something about http:// youtube links

It might make sense to replace http:// with https:// for some links, e.g. to youtube videos.

For instance, Takeout/My Activity/Video Search/MyActivity.{json,html} might contain http:// links for some old entries:

{'header': 'youtube.com', 'title': 'Watched Octobass @ the Musical Instrument Museum - YouTube', 'titleUrl': 'http://www.youtube.com/watch?v=FP1QqtGe8ts', 'time': '2015-06-10T12:24:03.796Z', 'products': ['Video Search']}

In the case of youtube, switching to https doesn't really hurt (http and https are equivalent and both are available), and it might make it easier to consume downstream, e.g. it might prevent duplicates.

zulip discussion: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/279605540
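
A conservative rewrite that only upgrades known youtube hosts might look like this (hypothetical helper, not an existing function):

def normalize_youtube_url(url: str) -> str:
    # only touch hosts known to serve identical content over https, so we
    # don't accidentally break other links
    prefix = "http://"
    if url.startswith(prefix):
        host = url[len(prefix):].split("/", 1)[0]
        if host in {"www.youtube.com", "youtube.com", "youtu.be"}:
            return "https://" + url[len(prefix):]
    return url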

parse/check new youtube files

[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel URL configs.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel feature data.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel page settings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/comments/comments.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/live chats/live chats.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video recordings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video texts.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/videos.csv
[W 231203 09:48:47 path_dispatch:331] No function to handle parsing Maps (your places)/Saved Places.json

add parser for Google Keep data

Seems to be in the "Keep/" directory, mostly in HTML.

Pretty messy filenames:

  • in 2015:
2015-05-18T18_43_03.920Z.html
5.html
  • in 2017:
2017-01-29T19_43_26.664Z
2017-01-29T19_43_29.485Z
  • in 2021: both HTML and JSON, but the JSONs are mostly empty, almost no data:
2018-05-09T09_29_49.983+01_00.html
2018-05-09T09_29_49.983+01_00.json

Example HTML:

...
<body><div class="note DEFAULT"><div class="heading"><div class="meta-icons">
<span class="archived" title="Note archived"></span>
</div>
Apr 7, 2019, 1:11:02 PM</div>

<div class="content">HTML content</div>


</div></body></html>
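
As an aside, timestamps can be recovered from the timestamp-style filenames fairly mechanically (a hypothetical helper; names like 5.html simply yield None):

from datetime import datetime
from pathlib import Path
from typing import Optional


def keep_filename_to_dt(path: Path) -> Optional[datetime]:
    """Recover a note timestamp from Keep export filenames.

    Names like 2018-05-09T09_29_49.983+01_00.html use "_" where an
    ISO-8601 timestamp has ":"; names like 5.html carry no timestamp.
    """
    name = path.name
    for ext in (".html", ".json"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    iso = name.replace("_", ":").replace("Z", "+00:00")
    try:
        return datetime.fromisoformat(iso)
    except ValueError:
        return None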
