How Google knows where you live pt2
In my previous post I showed how to access the location history Google stores for your phone (if you have location history enabled). With two easy assumptions I showed that it is easy to extract the location of my home and work place from this data.
This post I am digging deeper. I discuss the peculiarities of the data and clustering techniques to extract meaningful locations from the data. This time I will include the python code I used to do this.
For those who are interested in academic work on this topic, the paper "Discovering Personally Meaningful Places: An Interactive Clustering Approach" by C. Zhou et al, published in the ACM Transactions on Information Systems, volume 25, number 3 in July 2007 seem like a good start.
Properties of the data
There are a few quirks in the data as downloaded in the KML file from Google. The three I mentioned in the last post are: There are more points in the file than in the web interface, the error margin is not included in the file and the timezone is off. In addition to this I would like to add that the Latitude and Longitude are reversed in the KML file. For example:
<when>2014-03-31T06:46:06.075-07:00</when> <gx:coord>6.5359769 53.2402912 0</gx:coord>
An then there is the fact that the earth is round. Because of this a vector is Lat, Lng space has to be converted to know the actual distance. In this blog post I'm only going to look at data recorded in Groningen, The Netherlands, where the error between Lat, Lng space and actual distances should be the roughly equal for all distances in the city. Because of this and time/motivation constrains I'm not going deeper into this right now.
The objective in my last post was to find the location of my home and work using the location data from my Android phone. This post my objective is to find more locations where I went. My assumption this time is that I stayed for a while at locations that are meaningful. If I stayed somewhere for a while there should be a concentration of data points there. By finding concentrations of data points I should be able to find meaningful locations.
This could be wrong, in my last post I showed the time between data points fluctuates. It could be that my phone sends less location data when I'm indoor, lowering the concentration of points in meaningful locations compared to the rest.
There are techniques to find concentrations of data points, these are called clustering techniques. For an excellent comparison between clustering techniques, see the docs of the Python library Scikit Learn: http://scikit-learn.org/stable/modules/clustering.html
One of the most common clustering techniques is k-means. It is easy to implement and fast, even on large datasets. There is, however, a good reason not to use it in this case: the parameter of the technique is the amount of clusters you expect to find in the data. For a given day I could not give you the amount of places I go to, let alone the amount easily discernible in the data.
A clustering technique related to k-means is Mean-Shift. It is better suited for this data, since its parameter is "bandwidth", basically the diagonal of the area you consider to be meaningful place. In our case this parameter is related to the noise that is present in the data. A value that corresponds to a diagonal of about 50 meters seems to work well for my data.
Hierarchical clustering has a parameter that is related to "bandwidth", the cut off parameter. When using complete linkage as criterium the results should be similar to the Mean-Shift technique. The main reason to go for Mean-Shift is that it is new to me (exiting!), and it outputs cluster centers directly (HC does not output cluster centers).
Applying this to a real day
Let me tell you how the 31st of March went for me. I woke up at my home, went to the corner store to buy breakfast. After breakfast I rode my bicycle to the university and worked there until the end of the afternoon. Then I ate with my girlfriend at her place, we decided to walk into town, sat down for a beer and after an hour or so we went back to her place again.
The next figure shows how this data looks:
There are 2 errors among the location points. The first is easiest to spot, the huge jump location from the points in the path from the lower left corner to the top left corner. The second is harder to spot, it is a jump in the path that runs in the lower right corner.
In the figure it is hard to spot the areas with a large concentration of points. Luckily we have a clustering technique to extract those. The results are below. I have redacted some of the locations, I'm not giving away the coordinates to my home ;) The two locations I have included are the university building I study at, and the cafe I drank a beer at that night.
Cluster 0, # of points: 190, loc= Cluster 1, # of points: 182, loc=(53.240288, 6.536754) Cluster 2, # of points: 94, loc= Cluster 3, # of points: 66, loc=(53.218403, 6.570528) Cluster 4, # of points: 5, loc= Cluster 5, # of points: 7, loc= Cluster 6, # of points: 3, loc= Cluster 7, # of points: 2, loc=
In total there where 55 clusters, most of them with only a single point. Four clusters contain almost all of the 608 location points that are recorded on this day. These are my home, my girlfriends home, the university building I study at and the cafe we drank a beer at. Cluster 4 points to the location of my corner store, the other small clusters contain noise point at the university campus and the routes between the actual locations.
If you use Google Maps on the (53.218403, 6.570528) coordinates you see the following location. The cafe we drank beer at.
In this post I applied the clustering technique to one day. In a next post I will extend this to longer periods of time. What makes this more difficult is that fact that routes I cycle every day get a higher density of points, while places I only go to once in a while do not necessarily get more points
The code can be found here.