seanbreckenridge / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

Home Page: https://pypi.org/project/google-takeout-parser/

License: MIT License


google_takeout_parser's Issues

some semantic locations are missing placeId

Even in the latest export, I'm getting quite a few errors originating from

otherCandidateLocations=[
CandidateLocation.from_dict(pv)
for pv in placeVisit.get("otherCandidateLocations", [])
],

The reason is that sometimes they are missing placeId, which is required:

placeId=data["placeId"],

Most of them look like this, with either TYPE_HOME or TYPE_WORK:

{'latitudeE7': ..., 'longitudeE7': ..., 'semanticType': 'TYPE_HOME', 'locationConfidence': 26.890783}

(sometimes it's TYPE_HOME/TYPE_WORK and placeId is present as well, but not always)

There are a few remaining ones with a different type:

{'latitudeE7': ...,
'locationConfidence': 14.805286,
'longitudeE7': ...,
'semanticType': 'TYPE_ALIASED_LOCATION'}

however, I'm not sure what that one means.

Not sure what the right way to handle this is -- perhaps making placeId optional, and perhaps also adding a semanticType field?
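
Something along these lines could work (a minimal sketch -- the field set is trimmed for illustration, and it assumes only placeId/semanticType can be absent):

from typing import Any, Dict, NamedTuple, Optional


class CandidateLocation(NamedTuple):
    latitudeE7: int
    longitudeE7: int
    locationConfidence: float
    placeId: Optional[str]       # missing on some TYPE_HOME/TYPE_WORK entries
    semanticType: Optional[str]  # e.g. TYPE_HOME, TYPE_WORK, TYPE_ALIASED_LOCATION

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CandidateLocation":
        return cls(
            latitudeE7=data["latitudeE7"],
            longitudeE7=data["longitudeE7"],
            locationConfidence=data["locationConfidence"],
            placeId=data.get("placeId"),          # .get() so a missing key becomes None
            semanticType=data.get("semanticType"),
        )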

push to pypi

Already have a release just to have the name registered, but leaving the install method as git+ for now, especially because there might be more changes (i.e. #2) and it's a relatively new project right now.

add some helper methods for the CSV comment/live chats on youtube

Currently it's a list of JSON blobs stored as a string.

It's not clear what the user's desired access pattern is, so creating one format that works for everything is difficult.

But it would at least be nice to have a helper on the model which yields items out of it in a nicer format, and perhaps a .text method that just converts it to a block of text.

Could also maybe have a markdown() function, as it would codify all the complexity relatively well.
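
A rough sketch of what the helpers could look like, assuming the raw field is a JSON-encoded list of message dicts (the key names here are hypothetical):

import json
from typing import Any, Dict, Iterator


class LiveChatMessage:
    def __init__(self, raw: str) -> None:
        # raw is the JSON-encoded list of blobs straight from the CSV
        self.raw = raw

    def items(self) -> Iterator[Dict[str, Any]]:
        """Yield each decoded blob in a nicer format."""
        yield from json.loads(self.raw)

    @property
    def text(self) -> str:
        """Flatten all text segments into a single block of text."""
        return "".join(seg.get("text", "") for seg in self.items())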

use streaming html parser

Loading the whole HTML document into memory is pretty expensive memory-wise; could either use a streaming HTML parser, or maybe split the file before loading it?
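
For reference, Python's built-in html.parser supports incremental feeding, so something along these lines would keep memory bounded (a sketch, not the library's current parser):

from html.parser import HTMLParser
from pathlib import Path
from typing import List


class LinkExtractor(HTMLParser):
    """Example handler: collect hrefs without building a full DOM."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(path: Path, chunk_size: int = 64 * 1024) -> List[str]:
    parser = LinkExtractor()
    with path.open("r", encoding="utf-8") as f:
        # feed() accepts partial input, so only one chunk is in memory at a time
        while chunk := f.read(chunk_size):
            parser.feed(chunk)
    parser.close()
    return parser.links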

remove placevisit JSON property

Previously, json.dumps was used on the other candidate locations, with a property that decoded it back, to comply with cachew not allowing JSON objects as values. cachew now uses a JSON encoder internally, so it should be possible to type the list of dictionaries properly.
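
In other words, something like this should now be possible (a sketch; the actual model has more fields):

from typing import Any, Dict, List, NamedTuple


class PlaceVisit(NamedTuple):
    # previously this was a str holding json.dumps(...) output, plus a
    # property that json.loads-ed it back on access; cachew's internal
    # JSON encoder should now allow typing it directly
    otherCandidateLocations: List[Dict[str, Any]]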

Takeout folders are localized according to the account's main language

problem

If an account doesn't have English as its main language, folders and some files are named differently (localized).
This results in no folders being parsed, due to _match_handler misses.

possible solution

I think it would be beneficial to add default handler maps for other languages, like the one defined here (DEFAULT_HANDLER_MAP).
A user could then select a HandlerMap via a command line argument.

If you approve of this idea, I could work on a pull request adding a German localization, as well as a command line argument for selecting a handler map.

If there's already an option to achieve this, please fill me in :)
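
A rough sketch of what this could look like (the HandlerMap alias, the GERMAN_HANDLER_MAP contents, and the locale-selection helper are all hypothetical):

from typing import Callable, Dict, Optional

# maps a path fragment to a parse function (None = explicitly ignored)
HandlerMap = Dict[str, Optional[Callable]]

DEFAULT_HANDLER_MAP: HandlerMap = {}  # stand-in for the existing English map
GERMAN_HANDLER_MAP: HandlerMap = {}   # same handlers, keyed by German folder names

LOCALE_HANDLER_MAPS: Dict[str, HandlerMap] = {
    "en": DEFAULT_HANDLER_MAP,
    "de": GERMAN_HANDLER_MAP,
}


def resolve_handler_map(locale: str) -> HandlerMap:
    """Pick a handler map from e.g. a --locale command line argument."""
    return LOCALE_HANDLER_MAPS.get(locale.lower(), DEFAULT_HANDLER_MAP)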

support Windows separators in path_dispatch

While setting up Windows CI for promnesia, the takeout tests failed and had these in logs:

2022-05-09T21:03:56.5877916Z [INFO    2022-05-09 20:58:14 promnesia extract.py:49] extracting via promnesia.sources.takeout:index ...
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\MyActivity.html
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\README

I guess it's because forward slashes are hardcoded in path_dispatch. Perhaps the quickest fix would be to do something like .replace(os.sep, '/') here -- paths inside takeouts shouldn't contain either forward or backward slashes anyway: https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/path_dispatch.py#L94
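
i.e. something along these lines (a sketch, not the actual path_dispatch code):

import os


def _normalize_path(relative_path: str) -> str:
    # handler patterns are written with forward slashes; on Windows os.sep
    # is "\\", and since individual takeout file names shouldn't contain
    # either kind of slash themselves, a blanket replace is safe
    return relative_path.replace(os.sep, "/")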

Recreate cache on version upgrades

Unless a model changes, the hash for cachew doesn't update, but the code may have changed while we still have old results. So, unless you clear the directory, you could have results generated from old functionality.

The clear command does fix that, but it would be nice for this to invalidate old results automatically, by inspecting the package installation to see what version this is and putting a 'version' file in the cache directory (or maybe in the cachew hash db table?).

Could add an environment variable/flag that lets you use mismatched hashes during development

Could also maybe just add the version at the front of the _cachew_depends_on, since that gets stored as part of the hash
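
The last option might look roughly like this (a sketch -- the real _cachew_depends_on computes other inputs as well):

from importlib.metadata import version
from pathlib import Path
from typing import List


def _cachew_depends_on(path: Path) -> List[str]:
    # prepending the installed version means any package upgrade changes
    # the stored hash, so stale cached results get regenerated automatically
    return [version("google_takeout_parser"), str(path)]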

add parser for saved places on google maps

Seem to be scattered across different formats 💩

"Saved" list is in "Maps (your places)/Saved Places.json" -- present since 2015

{
  "type" : "FeatureCollection",
  "features" : [ {
    "geometry" : {
      "coordinates" : [ -0.1202100, 51.5979200 ],
      "type" : "Point"
    },
    "properties" : {
      "Google Maps URL" : "http://maps.google.com/?cid=17295021474934382781",
      "Location" : {
        "Address" : "United Kingdom",
        "Business Name" : "Alexandra Palace",
        "Country Code" : "GB",
        "Geo Coordinates" : {
          "Latitude" : "51.5979200",
          "Longitude" : "-0.1202100"
        }
      },
      "Published" : "2017-09-27T09:56:06Z",
      "Title" : "Alexandra Palace",
      "Updated" : "2017-09-27T09:56:06Z"
    },
    "type" : "Feature"
  }, {
    "geometry" : {
      "coordinates" : [ -0.1307733, 51.5941783 ],
      "type" : "Point"
    },
...
]}

Whereas other lists are in CSV files (since 2018), in the "Saved" directory, one for each list in Google Maps,
e.g. Saved/Paris.csv:

Title,Note,URL
Urfa Durum,,"https://www.google.com/search?q=Urfa+Durum&ludocid=15623525448940569321&ibp=gwp;0,7"

Doesn't seem like this data is present anywhere else in takeouts.
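
Parsing the GeoJSON-style Saved Places.json could look roughly like this (the SavedPlace model and field choices are illustrative, based on the sample above):

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, NamedTuple


class SavedPlace(NamedTuple):
    title: str
    lat: float
    lng: float
    url: str
    published: datetime


def parse_saved_places(path: Path) -> Iterator[SavedPlace]:
    data = json.loads(path.read_text())
    for feat in data["features"]:
        lng, lat = feat["geometry"]["coordinates"]  # GeoJSON order is [lng, lat]
        props = feat["properties"]
        yield SavedPlace(
            title=props["Title"],
            lat=lat,
            lng=lng,
            url=props["Google Maps URL"],
            published=datetime.strptime(
                props["Published"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
        )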

add parser for Hangouts/Hangouts.json

Also, oddly, my takeout from 2014 has Hangouts.json and Hangouts2.json (with similar sizes); the diff shows about 10% of lines differing, but it's hard to tell what's actually different -- seems very random.
All later takeouts only have a single Hangouts.json.

parse youtube video metadata

Includes information about the video release date, the description, duration, etc.
video recordings.csv also includes a lat/lon, so that could be nice to expose.

add handler for Google Fit data

Fit/Daily Aggregations csv files -- started appearing in 2017

Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx files -- perhaps worth just having a function to get them; something else should actually handle the TCX files. See the sketch below.
Also, a bunch of them seem to have disappeared in 2020 (comparing with 2018) -- not sure if it's some sort of retention.
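
The locating-only helper could be as simple as this sketch (directory names taken from the paths above):

from pathlib import Path
from typing import Iterator


def fit_tcx_files(takeout_dir: Path) -> Iterator[Path]:
    # only locate the files -- actually parsing TCX is better left to a
    # dedicated library downstream
    activities = takeout_dir / "Fit" / "Activities"
    yield from sorted(activities.glob("*.tcx"))
    yield from sorted((activities / "Low Accuracy").glob("*.tcx"))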

split cached databases by type

I believe this would make the size smaller, since individual rows for the cachew union type would be smaller, so the cache doesn't grow to unreasonable sizes.

Would probably leave the one in HPI google_takeout as-is, since there's just one of those, not multiple that keep growing with the number of exports.

As it stands, I'm comfortable with the tradeoff here -- trading ease of use for disk space -- but it could definitely be improved.

Do something about http:// youtube links

It might make sense to replace http:// with https:// for some links, e.g. to youtube videos.

For instance, Takeout/My Activity/Video Search/MyActivity.{json,html} might contain http:// links for some old entries:

{'header': 'youtube.com', 'title': 'Watched Octobass @ the Musical Instrument Museum - YouTube', 'titleUrl': 'http://www.youtube.com/watch?v=FP1QqtGe8ts', 'time': '2015-06-10T12:24:03.796Z', 'products': ['Video Search']}

In the case of youtube, switching to https doesn't really hurt (http and https are equivalent and both are available), and it might make it easier to consume downstream, e.g. it might prevent duplicates.

zulip discussion: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/279605540
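
A conservative rewrite that only upgrades known youtube hosts might look like this (hypothetical helper, not an existing function):

def normalize_youtube_url(url: str) -> str:
    # only touch hosts known to serve identical content over https, so we
    # don't accidentally break other links
    prefix = "http://"
    if url.startswith(prefix):
        host = url[len(prefix):].split("/", 1)[0]
        if host in {"www.youtube.com", "youtube.com", "youtu.be"}:
            return "https://" + url[len(prefix):]
    return url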

parse/check new youtube files

[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel URL configs.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel feature data.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel page settings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/channels/channel.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/comments/comments.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/live chats/live chats.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video recordings.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/video texts.csv
[W 231203 09:46:54 path_dispatch:331] No function to handle parsing YouTube and YouTube Music/video metadata/videos.csv
[W 231203 09:48:47 path_dispatch:331] No function to handle parsing Maps (your places)/Saved Places.json

add parser for Google Keep data

Seems to be in the "Keep/" directory, mostly in HTML.

Pretty messy filenames:

  • in 2015:
2015-05-18T18_43_03.920Z.html
5.html
  • in 2017:
2017-01-29T19_43_26.664Z
2017-01-29T19_43_29.485Z
  • in 2021: both HTML and JSON, but the JSONs are mostly empty, almost no data:
2018-05-09T09_29_49.983+01_00.html
2018-05-09T09_29_49.983+01_00.json

Example HTML:

...
<body><div class="note DEFAULT"><div class="heading"><div class="meta-icons">
<span class="archived" title="Note archived"></span>
</div>
Apr 7, 2019, 1:11:02 PM</div>

<div class="content">HTML content</div>


</div></body></html>
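
As an aside, timestamps can be recovered from the timestamp-style filenames fairly mechanically (a hypothetical helper; names like 5.html simply yield None):

from datetime import datetime
from pathlib import Path
from typing import Optional


def keep_filename_to_dt(path: Path) -> Optional[datetime]:
    """Recover a note timestamp from Keep export filenames.

    Names like 2018-05-09T09_29_49.983+01_00.html use "_" where an
    ISO-8601 timestamp has ":"; names like 5.html carry no timestamp.
    """
    name = path.name
    for ext in (".html", ".json"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    iso = name.replace("_", ":").replace("Z", "+00:00")
    try:
        return datetime.fromisoformat(iso)
    except ValueError:
        return None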
