sausy-lab / itinerum-trip-breaker Goto Github PK

Trip/activity delimiting tool for Itinerum travel survey app

Python 97.48% R 2.36% Shell 0.15%

trips gis travel-survey itinerum gps-tracking gps

itinerum-trip-breaker's Introduction

Introduction

This is an application for parsing Itinerum travel survey data into a more standard travel survey format, showing trips, activity locations, modes of travel and times of arrival/departure, etc.

How to run the scripts

Clone or download the repo.
Place your Itinerum output csv files in the imput directory

The scripts require 2 input files: coordinates.csv, survey_reponses.csv (and eventually prompt_responses.csv)
If your output/input file names do not match these then change them, or modify the corresponding names in config.py

If you have ground truth data for generating quality metrics, place them in the outputs directory, titled episodes_ground_truth.csv and locations_ground_truth.csv respectively.

You can also change the corresponding names in config.py for ground truth data.

Ensure you have all the dependencies, which are identified below.
In your shell of choice, run python3 main.py to generate episode and locations files in the output directory.
If you have ground truth data, you can run python3 compare.py to generate quality metrics.
Other settings can be modified in config.py at your discretion.

Testing data

3 people (users A, B, and C) have generously contributed a week each of itinerum survey and classified ground truth data for testing. User D is simple a test case for error detection and not a real user.

Algorithm

For each user:

Data cleaning:

The basic idea in the cleaning phase to remove points not based on decent GPS data. Many points may be derived from cell-towers, wifi networks, etc. The phone is a black box in this regard. Such points often repeat a location precisely, which is unlikely with a GPS reading or they may suddenly appear very far away from their temporal neighbors.

Remove points with high known error (h_accuracy > x meters)
Remove points at same location as temporal neighbors
Any major jump away and back again, especially if the h_error doesn't justify the distance may indicate a non-GPS signal or a bad error estimate. Away-and-back-again points are identified by the minimum distance from neighboring points and the angle formed between the three.
Repeat step 2
More to come...

Data segmentation

Data is segmented into lists of points where we feel confident that we haven't lost track of the user, e.g. through a powered-off phone. We refer to these as known segments and let the surrounding time be considered unknown. For the moment, this consists only of:

Consider as unknown any time where the user moves more than 1 kilometer and two hours without reporting a location. Unless the start and end of that segment is within 200m of a subway tunnel.
More to come...

Location detection

This phase largely follows the method described by Thierry Chaix and Kestens. We essentially do a time-weighted KDE on the user's points, spatio-temporally linear-interpolated where necessary. The kernel density estimation is done in R (for lack of a decent Python package) and then brought back into Python.

Calculate a KDE based on time weighted GPS points, interpolated as appropriate. GPS points are also used as points where the PDF is estimated. This saves the cost of estimating a grid of points over sparse GPS data.
Estimate an "activity threshold" for that surface. i.e., how high would a peak in the KDE be if someone spent X seconds at a given location, given the known parameters of
- total time under the surface (sum of time weights)
- average GPS error for the user
- kernel bandwidth
Points with a PDF estimate above the threshold are clustered into contiguous groups.
The maximum PDF estimate from each cluster (the peak) will be taken as a potential activity location. (It's possible that it may make sense to use a polygonized version of the cluster as a definition of the activity location rather than a point from the peak. This is not implemented yet.)

Activity/Trip sequencing

The ultimate goal of this program is to create a travel-diary-like sequence of activities; this step is where that happens. Conceptually, time may be spent in only one of two (three) ways in this model: 1) travelling, 2) at an 'activity' (or 3 it may be classified as unknown). Framing the problem thus leads to a hidden Markov model approach to discrete activity sequence classification. Each observed GPS point can be assigned an emission probability associated with each location, based on distance between the two. Transition probabilities give strong preference to state continuity and disallow teleporting (activities without intermediating travel).

The Viterbi algorithm provides the most likely underlying classification of points, and these are used to map time to travel and activities.

Any stationary activities shorter than the configured time threshold are merged back into surrunding acivities.

Output Data

Data is output in several files, some of which contain somewhat overlapping information.

Episodes File

The episodes file is the most esential output. It's in a sort of this happenend and then that happened format.

Locations File

This file contains a list of potential activity locations, many of which will correspond to an entry in the episodes file. Locations not used will be marked as such.

Person Days File

This file summarizes the episodes file per calendar day. Since many people may stay up a little past midnight, but few go all the way around the clock, a calndar date is considered to go from 3am to 3am. Each episode type is counted and summed. E.g. How much time was spent at home today and how many trips were made? Eventually this will be made timezone aware; for now it s configured for a single configured time zone.

Dependencies

Python 3
- Rpy2
- pyproj
- scipy
- editdistance
- osmium
R 3.3+
- ks package ks Needs to be installed through R:

install.packages('ks')

itinerum-trip-breaker's People

Contributors

Stargazers

Watchers

Forkers

kafitz wyzdevin jamesssss00000

itinerum-trip-breaker's Issues

Limit the number of points sent to KDE

We are using a KDE to estimate locations where more than 10 minutes has been spent in one spot. To do this, we feed in all the interpolated points. But we should know that we can't/shouldn't have any locations found where there are only interpolated points. The only reason we include them all is that it's an easier way to spread out time weights where they shouldn't be too concentrated on the original (sparse) input points.

Better/smarter weight assignments to the original points should reduce the time spent in the KDE step.

Refactor compare.py/compare_locations

Should be structured more like compare_episodes

Also location comparison should exclude computed locations that aren't referenced in the episodes file.

Many long, obscure functions everywhere should be shortened

In particular:
init file reading should be handled elsewhere (part of issue #4 )
flush likewise part of issue #4
get_days inner loop can be abstracted

refactor for readability:
get_known_subsets
get_activity_locations
break_trips
find_peaks
observe_neighbours

Study location needs to be made optional

The script fails when users are present with no study location (and probably also when there is another location like work missing).

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 46, in <module>
    school = Location(row['location_study_lon'], row['location_study_lat'])
  File "/home/nate/itinerum/itinerum-trip-breaker/location.py", line 6, in __init__
    self.latitude = float(latitude)
ValueError: could not convert string to float:

The string in question is empty: ''

Some temporal interpolation needed.

In the new algorithm, activity time attribution will be done on the set of interpolated points. The spatial interpolation is fine, but some additional temporal interpolation will be needed where there are large temporal gaps. It doesn't need to be even across the whole gap, unless that is much easier.

It may be more efficient to do this after the KDE.

Reading headers is currently hardcoded

Should be generalized for any order of headers.

Also double check output file headers for consistency with compare.py

Identify home/work/school locations from user survey data

Right now we have a naive enough algorithm that we are basically not even trying to do this.

Different platforms giving different clustering results

On the test data Nate's computer yields 16 activity locations with default parameters. Felipe's has many more points above the threshold and yields only 11 14 locations.

Possibly an issue of differing precision, but the magnitude of the difference seems to indicate otherwise.

Classify Mischa's survey data (with unknown gaps)

Mischa's survey data (already uploaded) should be a good test of our classification of unknown time. Will need detailed manual reclassification.

Peak finding algorithm seems broken after R ks update

There are 'locations' being found that are unacceptably close to one another, < the clustering distance.

Improve data input efficiency

Computing time spent at each location

Run scripts on full datasets to identify possible bugs

Refactor compare.py/compare_user

Note identical points in input data, somehow flag as more likely being error

Android phones seem to produce many points at the same locations, perhaps when they are acquiring a signal or when the signal is weak. Most of these are basically crap though and should be discarded if they aren't already. This information should be used in the data cleaning stage to remove points that are repeated, with limited exceptions where points repeat just by chance due to rounding.

Be smarter about identifying gaps due to subway travel

How should ground truth data be classified?

The purpose of the ground truth data is to test the performance of the algorithm on a known dataset. It seems to me that there are two broad potential approaches to this:

We can classify the data according to what actually happened on the ground, just translated into the required language of discrete trips and activities.
Or we can classify according to what we see in the GPS points, informed by what actually happened on the ground.

The ground truth data we currently have (my own) is a sloppy mix of these.

To give an example, should we include activity locations that we actually visited but that don't look like activities in coordinates.csv, perhaps because of missing or inaccurate data?

The benefit of producing a properly true ground truth is that we can measure how far our algorithm (considered as encompassing the app, the phone, etc.) is from actual reality as interpreted by the one who lived it, or at least from a more traditional activity survey.

The benefit of ground truth as manual classification of input data is that it tells us how far we are from the best possible results we can get from the data we have available.

My Reality > Phone's Reality > Our interpretation of Phone's Reality

Limit bottlenecks on large datasets

Larger datasets in the tower survey sample can take more than 5 minutes to run. E.g. user 0c7714d8-ce0b-421a-a698-5d77152e165a. This user happens to have too many days in the dataset, but we can use them as a test case for speed improvements.

Some suggestions:

Thin out the interpolation if it results in too many points.
Thin out just the points at which the KDE is estimated through sampling
Run Viterbi on only original points? This may actually be quite fast already - that needs to be tested.

Too few data points entering KDE function

I wonder if interpolation is going awry? There are fewer points in the interpolated input tot he KDE than in the original (cleaned) dataset.

Document time_at_loc in trace.py

This is a very cryptic function. Documentation needed along with better variable names.

KDE crashes on CFS data

Various bugs come up: zero division error, some runtime errors in clustering, a segmentation fault in larger traces, errors when length of trace is too short

Check for repeating records in the coordinates table

Not sure if this would actually cause us any trouble, but there are some (many?) records in the input coordinate data that are repeated verbatim. Same timestamp, etc.
Would be good to check for this in the cleaning step just in case though!

ground truth user aborting after/during KDE

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
1 user(s) to clean
User: 1 2128 points at start for 44b9444a-ccb8-426b-878b-8b09318348e5
	 123 points removed as high stated error
	 28 points removed as duplicate
	 45 points removed by positional cleaning
	 17 points removed as duplicate
	 1 gap(s) found in data
	Running KDE on 4672 points
44b9444a-ccb8-426b-878b-8b09318348e5 aborted

Output points geometry/diagnostics/classification file

We need to output information on each point used in the algorithm:

Location, obviously
Viterbi state classification (trip? activity location?) which can be used to construct geometry of trips
Whether it was original input point or interpolated
Whether it was removed during cleaning (used to debug/verify cleaning process)
Temporal weight assigned during KDE step
Estimated PDF probability at that location
etc
point_diagnostics.csv (the header is being written already) should replace a couple of intermediate diagnostic files

This data should already be stored with each point object, so we just need to write the output.

Output trip geometry file

Coded as an additional column on the coordinates file, labeling each point to what trip/activity it corresponds to, if any.

Also, see if any standards exists for representing trip geometry

Ground truth I-phone survey data

Check with Mischa/Kyle to be sure the app is working again and how to know if it isn't.

Error when running main.py

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
3 user(s) to clean
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    user = Trace(user_id, user_ids[user_id], survey_responses[user_id])
  File "/home/nate/itinerum/itinerum-trip-breaker/trace.py", line 29, in __init__
    h = read_headers()
TypeError: read_headers() missing 1 required positional argument: 'fname'

Run Viterbi algorithm on interpolated subsets

Currently running this on only the original point sets. Improved spatial resolution should improve performance, hopefully. Better interpolation and subsetting algorithms will eventually lead to better results.

Improve and/or refactor misc_funcs.py/kde

We've discussed changing the kde function to improve location detection accuracy, but short of that it should be refactored for readability.

Generate output quality/testing metrics comparing to ground-truth data

We need ground truth data

Implement smarter point clustering algorithm

Cluster detection is currently the most computationally expensive process in the script. We end up calculating an entire NxN distancce matrix where N can be as large as 15,000 and separating clusters based on simple distance. There should be a faster and more memory efficient way to partition that many points into clusters. This is almost certainly an out of the box function from an existing library - we just need to pick the ideal algorithm based on the nature of the problem and choose a library.

KDE grid is inefficient sample of sparse points

The KDE calculation takes a crazy amount of time with a sufficiently fine resolution. Tens of thousands of points are evaluated that are nowhere near points in the input dataset.

Changing from a grid to a set of provided eval.points will effect the peak-finding algorithm, but could save a huge amount of time.

Output compare.py diagnostics to CSV

The output from this script is pretty much taking up a full screen to display properly on my computer. Once we add a few more diagnostics, printing to the console will be untenable.

Or maybe the printout can be reformatted?

Refactor trace.py/break_trips

KDE bandwidth not consistent between users?

It appears that the distance decay function (kernel bandwidth) for user F95144C6-73FC-4310-9B71-7CCD217F13E4 is much too high, while for user 44b9444a-ccb8-426b-878b-8b09318348e5 it seems about right. (also for 1FEBD82C-8ED9-46AB-B3AF-2111BA007FAE)

The KDE values for the first just appear overly smooth.

Why would this be? No parameters should be changing between them.