A Travel Mode Detection approach based on hierarchical and non-hierarchical clustering

Patterns and Trends in Environmental Data

Author

Lucia Scheele and Laura Vetter

preprocessing

#loading all the data from the preprocessing files
load("my_environment.rdata")

library(tmap)
library(sf)
library(maptiles)

tmap_mode("plot") # bc mode "view" is too work intensive
#define background map
bbox <- st_bbox(selected_tracks_na_omit)
# Fetch OSM tiles
osm_bg <- get_tiles(bbox, provider = "OpenStreetMap", zoom = 16)

Abstract

This project aims to analyze travel behavior and clustering of travel segments by different clustering methods and visual analysis. Parameters like speed, acceleration, elevation and sinuosity and derived parameters were used for clustering with kmeans and hclust. Additionally, a segmentation by threshold (using the parameter speed) and a manual segmentation using GIS were carried out to compare their results to the clustering. Verification took place using the Chi-square test. It was found that for the partinoning (choosing the amount of k), only one of the statistical tests coincided with the actual number of movement types (k=5). Furthermore, the results suggest that hclust provides a clustering that matches the ground truth better than k-means. Specifically, the ward method in hclust performs better than the single and complete method. It was not proven, that clustering shows a better (or equally good) result than manual classification in GIS. This is mainly due to a lack of differentiation between bike and tram movement, as well as a mis-classification of movement in turns (e.g. bike was classified as walking)

Introduction

In order to understand travel behaviour, it is critical to analyze movement data and be able to classify different types of movement correctly (Sadeghian et al. (2022)). However, the exact detection of the corresponding travel modes is complex and can be time consuming.
Generic methodologies to detect travel modes have been described using unsupervised learning algorithms, GIS multi-criteria process (eg. by Sadeghian et al. (2022)), and supervised learning algorithms or decision trees and confusion matrix(Shamoun-Baranes et al. (2012)). This project aims to answer the reasearch question of which travel modes can be differentiated in trajectories with different travel modes within the trajectories using hierarchical and non-hierarical clustering? Therefore, the project focuses on a comparison of these two clustering methods, and a comparison with a manual segmentation using ArcGIS and a threshold segmentation using the parameter speed.

Material and Methods

Data

The data sample consists of movement data collected with the Tracking App Strava, by two individuals over a timespan of April - May 2024. Overall, 31 trips of different travel modes were recorded with the movement types walking, running, cycling, bus, tram, train and car. For simplification of the design and to be able to focus on the application of the clustering methods, a sample of 2 trips with different movement types within the trajectory were selected to be analyzed. The tracks were selected so that all movement types standing, walking, running, biking and tram were included. Data points were recorded every second, resulting in a relatively high resolution of data. However, data frequency is not consistent in all cases due to lack of signal (e.g. in a train). At first, including contextual data such as traffic infrastructure data by the city of Zurich ( https://www.stadt-zuerich.ch/geodaten/) was considered. Out of two reasons, it was decided not to include this data: 1) categorical data is not suitable for the evaluation with the k-means clustering algorithm and 2) the infrastructure data dit not overlap with the trajectories in many cases and was therefore not especially meaningful. Therefore, the focus was set on the analysis of the numerical data.

Calulation of Parameters

The analysis was made with data of the individual fixes. instead of using segments for the analysis of e.g. acceleration, the data of moving windows ( X steps before and after the fix) was assigned to the individual fix. This was done to have more information for each fix (data point).

5 different travel modes could be identified in the samples (standing, walking, running, tram, biking), therefore the optimal number of clusters expected is 5, which is also the number of classes assigned in GIS and with the speed parameter. Also, the trajectory data will be analysed with partitioning methods to determine the number of clusters needed, and results will be compared. Space and time of trajectories were treated to be relative and are filtered out for the analysis. The parameters for clustering will be calculated from the given geographical data and include:

distance point1 - point2
speed p1 - p2
acceleration p1 - p2
mean speed in moving window
max speed in moving window
mean acceleration in moving window
max acceleration in moving window
elevation change in moving window
sinuosity in moving window

Code

    distance = distance_by_element(geometry, lead(geometry, 1)), 
    time_diff = as.numeric(timestamp - lag(timestamp)),
    speed = distance / time_diff,
    speed_kmh = speed * 3.6, 
    acceleration = (speed - lag(speed)) / time_diff,
    avg_speed_10s = slide_dbl(speed_kmh, mean, .before = 5, .after = 5, .complete = TRUE),
    avg_speed_60s = slide_dbl(speed_kmh, mean, .before = 30, .after = 30, .complete = TRUE), 
    max_speed_10s = slide_dbl(speed_kmh, max, .before = 5, .after = 5, .complete = TRUE),
    avg_acc_10s = slide_dbl(acceleration, mean, .before = 5, .after = 5, .complete = TRUE) ,
    avg_acc_60s = slide_dbl(acceleration, mean, .before = 30, .after = 30, .complete = TRUE),
    max_acc_10s = slide_dbl(acceleration, max, .before = 5, .after = 5, .complete = TRUE),
    el_change = (elevation -lag(elevation,  10)),
    d_direct10 = distance_by_element(lag(geometry,4), lead(geometry,5)),
    d_sinu10 = rollsum(distance, 10,align = "center", fill = NA), 
    d_direct10 = case_when(d_direct10 == 0 ~ d_sinu10,
                           TRUE ~ d_direct10),
    sinuosity = d_sinu10/d_direct10

The calculation of sinuosity, some adaptions needed to be made as some values had the value “Inf”. This occurs when the direct distance = 0, meaning that start and end point of the moving window are the same. The actual moving distance is >1, as within the moving window movement occured.

actual moving distance /direct distance = x/0=Inf

There are several ways to deal with this problem:

Set inf values to the maximum value of the sinuosity
Filter out values completely from the table
Insert the actual distance traveled per 10 points, actual/direct=x/x=1

The third method was chosen, resulting in the lowest sinuosity value possible (=1).

Speed Parameter Classification

In order to verify how well speed alone can already predict the different movement types, a classification was made based on speed as the only criterion. The different travel modes were assigned to different speeds where:

Code

#speed_cluster = case_when(
   # speed_kmh == 0 ~ "1", #standing
   # speed_kmh < 5 ~ "2", #walking
   # speed_kmh >= 5 & speed_kmh < 18 ~ "3",  #running
   # speed_kmh >= 18 & speed_kmh < 30 ~ "4", # biking 
   # speed_kmh > 30 ~ "5" # tram 
   # )

Possible unhomogenity of some fixes could be a problem, e.g. when the threshold is set at a point where the speed varies constantly (e.g. speed>5 = running, speed<5 = walking). This issue will be resolved with manual classification in GIS.

Manual Classification using GIS

In order to test the results of our cluster analysis, it was aimed to assess in how far the cluster categories are consistent with the actual movement types conducted during the trajectory recordings. This required two independent enrichments of the fix point data. In one step, the information of the actual movement type had to be added to the single trajectory fixes by using the recorder’s knowledge of the track. This step was conducted within ArcGis (Tools: Select and Calculate Field). To help the recognition of movement type change, the fix points were colored using the speed (km/h) variable.
In another step, the individual cluster groups were inspected visually and statistically to find out which movement they are likely to represent.

Clustering with kmeans

K means is one of the most commonly used partition based algorithms (Yuan et al. (2017)), where k defines the number of clusters and is identified before processing. Each partition is one cluster and contains at least one data point, where each data point is assigned to only one cluster. First, partitioning (determining the appropriate number of k) is carried out using the elbow method from the package factoextra (“Determining-the-Optimal-Number-of-Clusters-3-Must-Know-Methods”) and the cascade method from the package vegan.
All parameters included in the clustering analysis are numerical:

Code

head(km_no_geom, 5)

# A tibble: 5 × 11
  distance speed acceleration avg_speed_10s avg_speed_60s avg_acc_10s
     <dbl> <dbl>        <dbl>         <dbl>         <dbl>       <dbl>
1     4.21  4.21        2.59           8.34          9.65     -0.0108
2     2.02  2.02       -2.19           8.11          9.87     -0.0652
3     3.01  3.01        0.994          9.32         10.1       0.338 
4     3.34  3.34        0.331          9.74         10.3       0.115 
5     1.40  1.40       -1.94           9.64         10.4      -0.0284
# ℹ 5 more variables: avg_acc_60s <dbl>, el_change <dbl>, d_direct10 <dbl>,
#   d_sinu10 <dbl>, sinuosity <dbl>

Clustering with hclust

When using k-means, the iterative process results in a cluster partition that differs with the number of requested cluster groups. In contrast, hierarchical cluster algorithms also allow to explore how the various clusters are related to each other with different clustering levels. The agglomerative hierarchical clustering does this, by first assigning each object its own cluster and iteratively joining the two most similar clusters, until there is just one single cluster. The distance between clusters is calculated according to a distance measure of choice (“Hclust Function - RDocumentation — Rdocumentation.org”). Here, three typical methods for this type of hierarchical clustering are explored: Ward’s Minimum Variance Method, the complete linkage method and single linkage method.
Here, the agglomerative clustering methods were applied using hclust function from the stats package. The required input data has to be supplied as a dissimilarity structure. The latter was calculated from the scaled parameter values of each observation using the dist function (i.e. chord-distance). This prevents, that ensures, that parameters with different value scales have a equal influence on the results.

Results

Two trajectories were chosen which include several travel modes. The first one (light blue) is a mixture of walking, running, tram and standing (at the tram stop). The second one (green) consists of biking, walking and standing.

Code

trackID_map <- tm_shape(osm_bg) +
  tm_rgb(alpha = 0.4)+
  tm_shape(selected_tracks_na_omit) + 
  tm_dots(col = "trajID", palette = "Paired", alpha = 1)

trackID_map

Map with the two Trajectories to be analyzed

Speed Classification

The parameters distance, speed, acceleration, elevation and sinuosity were calculated as mentioned in the methods. The visual evaluation showed that speed might be the most appropriate as a reference predictor for classifying the movement types. Therefore, it was taken and a threshold analysis was made.

Code

speed_map <-  tm_shape(selected_tracks_na_omit)+
  tm_dots(col = "speed_kmh", palette = "-RdYlGn") 


cluster_speed_map <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "speed_cluster_name", palette = "Paired")


tmap_arrange(speed_map, cluster_speed_map )

Map 1: Speed in km/h , Map 2: Classification by Speed

The first map shows the actual speed per datafix, while the second map shows the classification via the speed thresholds. Here, most of the biking part and the tram are classified correctly, however the differentiation between walking and running is not perfect and also incorrect when the tram slows down.

GIS Classification

To be able to differentiate between the different groups, a manual assignation of the groups was made in ArcGIS. It works as the “reference classification” or ground truth.

Code

cluster_GIS_map <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "GIS_name", palette = "Paired")

cluster_GIS_map

Map of manually assigned groups in ArcGIS

kmeans Cluster

Partitioning

In order to define the appropriate number of clusters, the elbow statistics is used. The bend (like a bent elbow) in the line indicates the number of appropriate k.

Code

plot_k_elbow

This method indicates that the best amount of clusters seems to be 2, as the curve starts to bend there. The method was tried with several configurations (different n-starts, leaving out parameters like elevation or sinuosity, taking one track only), but the indicated number for k stayed the same. As this does not correspond to the actual amount of movement types, a different method, the cascade, was tried out. Here, the suitability was tested for 2-8 cluster groups.

Code

cascade_results

        2 groups     3 groups     4 groups     5 groups     6 groups
SSE 40794.518994 3.420099e+04 3.076598e+04 2.749286e+04 2.535376e+04
ssi     0.600691 6.244613e-01 7.005974e-01 1.441969e-01 5.654394e-01
        7 groups     8 groups
SSE 2.328762e+04 2.199303e+04
ssi 9.243113e-01 4.765403e-01

The SSE (sum of squared errors) is a measure of how well the data points are summarized within the clusters. The lower, the better the clustering. In the ssi (silhouette index) the higher the value (from -1 to 1) the better the clustering. The ssi is highest for 5 groups, while the SSE decreases constantly with the rising number of groups. This indicates that a well chosen amount of k is 5.

Applying kmeans

The importance of the nstart criterium in the kmeans analysis is not to underestimate. It is recommended to run k-means multiple times with different random n-starts and choosing the best result, as this helps avoid getting stuck in a poor local optimum. In the following graphic, we can see the different results when using no predetermined n, n=20 and n=100.

Code

plot_cluster_5

Code

plot_cluster_5_20

Code

plot_cluster_5_100

The classification of the sample points in the point cloud, gives insight into how many data points belong to which cluster. From this, it is not yet possible to determine which clustering method might correspond best to the original movement type and which amount of n suits our classification best.

Code

cluster_k5_map <-   tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "kmeans5", palette = "Paired")

cluster_k5_20_map <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "kmeans5_20", palette = "Paired")

cluster_k5_100_map <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "kmeans5_100", palette = "Paired")

tmap_arrange(cluster_k5_map, cluster_k5_20_map, cluster_k5_100_map, nrow = 3 )

Looking at the clusters with k=5, k=5&n=20 and k=5&n=100, we can see that the distribution of clusters is very similar, meaning that (sadly, this is much better visible in the interactive map, but it is too heavy to load in the html document) This means, n does not make a great difference in the assignation of the fixes to the clusters, different to what the plots before indicated. However, the problem of clustering difficulties for running/walking and differentiation of tram & bike are encountered again.

Code

cluster_k4_map <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "kmeans4", palette = "Paired")
cluster_k4_map

Changing to k=4, running, walking and a third category can be relatively clearly distinguished, but there is neither a distinction between tram and bike visible. Also curves with the bike (which imply slowing down) or tram stops are not classified correctly. For further analysis we will therefore leave out the k=4.

hclust Cluster

Hierarchical clustering (hclust)

The result of the h-clustering process is a tree of clusters which shows how the different clusters are related to eachother.

Code

par(mfrow = c(1, 3))
plot(hc_single, main = "Single", labels = NULL, sub = NULL)
plot(hc_complete, main = "Complete")
plot(hc_ward, main = "Ward")

Code

par(mfrow = c(1,1))

Depending on the input data and the question asked, the cluster can be split at different cluster levels. In our case, the cluster level of 5 is a sensible choice when considering the amount of different movement types between which we would like to distinguish. Looking at the resulting summary of the fix point distribution among the five clusters, it became apparent, that the Ward method is the only method which was able to create spherical clusters of reasonable sizes.

Code

summary_table_hclust

            1   2   3    4    5
single   5415   1   2    2    2
complete 5364  20  25   11    2
ward     1183 434 528 1944 1333

Since in the given tracks, the movement types were easily practiced for more than a minute, we would expect there to be more than 60 fixes per movement type cluster. As the table above shows, the clusters from the single and complete method cannot account for this distribution. This is most likely due to the effect of chaining, where single observations are considered different from all other observations. This can also be observed in the given dendrograms. In the following, we will thus only consider the results from the ward cluster.

Code

cluster_h5_ward <- tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "c_ward_5", palette = "Paired")

tmap_arrange(cluster_h5_ward)

The visual inspection of this track is already giving a good impression of how the fix point data is split into segments. Though there is a lack of continuity, as longer stretches of the same color are interrupted by single fixes of a different color. Generally the clustering works well with differentiation between running and walking, but bike and tram appear to be within the same cluster. This is likely due to similar parameter characteristics from the parameters used for the clustering. It would thus be necessary, to find a parameter, which would allow the clustering method to differentiate between a bike and tram ride.

Verification with chi square test

Clusterings with different k values are not directly comparable, as they have fundamentally different structures. Therefore, we will focus on the comparison of k=5 only. The n criteria in kmeans does not play such a big role in cluster assignation after all, therefore, only k=5&n=100 is taken into account for verification. The best method to analyse several categorical variables (cluster groups) with no supposed causality between them is the chi square or fishers test. However, fishers test is for smaller data samples (sample size less than 1000), therefore the chi square is the appropriate test (“Fisher’s_exact_test”).

We will compare:

GIS - Speed
GIS - kmeans_100
GIS - hclust_ward
speed - kmeans_100
speed - hclust_ward
hclust_ward - kmeans_100

X-squared: This is the value of the chi-squared statistic. The larger χ2, the more likely that the variables are related.

df: This is the number of degrees of freedom. It is typically calculated as (number of rows-1)×(number of columns-1) in the contingency table.

p-value: This is the p-value of the test. A very small p-value (in this case extremely small) indicates that the probability of obtaining the observed data when the null hypothesis is true is very low.

Code

results_df

              X_squared.X-squared df.df p_value
GIS_speed                5237.772    16       0
GIS_kmeans               6531.559    16       0
GIS_hmeans               9211.967    16       0
speed_hmeans             6084.513    16       0
speed_kmeans             7081.301    16       0
hmeans_kmeans           12111.207    16       0

The X-squared value is highest comparing the hclust and kmeans values, but this is not of relevance. Taking GIS as the reference, from all classifications, the hclust cluster has the highest X-squared value and thus seems to be closest to the GIS classification. Interestingly, kmeans is closer to the speed classification.

Lets check these results visually:

Code

tmap_arrange(cluster_k5names_map, cluster_h5_names_map, cluster_GIS_map, nrow = 3)

Code

cluster_GIS_map <- tm_shape(osm_bg) +
  tm_rgb(alpha = 0.4)+
  tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "GIS_name", palette = "Paired")

cluster_k5names_map <- tm_shape(osm_bg) +
  tm_rgb(alpha = 0.4)+
  tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "k5_names", palette = "Paired")

cluster_h5_names_map <-tm_shape(osm_bg) +
  tm_rgb(alpha = 0.4)+
  tm_shape(selected_tracks_na_omit)+ 
  tm_dots(col = "hward_names", palette = "Paired")

tmap_arrange(cluster_k5names_map, cluster_h5_names_map, cluster_GIS_map)
  
cluster_k5names_map
cluster_h5_names_map
cluster_GIS_map

Category Standing was not Identified with either of the clustering methods.
the “undefined” category is the same in both clustering methods, and seems to consist of points with high sinuosity or points where data nearby is missing (NA). No time for further investigation.
Differentiation between bike&tram was not possible.

Discussion

The Chi Square results suggest that the hclust method is more appropriate to predict transport modes than the kmeans clustering. However, when analyzing visually, both methods fail to differentiate between tram and biking. The parameter sinuosity, which was initially thought to help distinguish between the two classes, did not seem to have an effect. In an earlier stage of this project, it was observed, that hclust generated a more accurate differentiation when only looking at the data of one track (results of this are not included/got lost on the way). This leads to the assumption, that the movement type information gets more obscure with the information of several tracks, causing a greater heterogeneity within one movement type and potentially more overlap among different movement types. This raises the question, if this method is suitable for bigger sets of data or sets with different movement patterns. Considering the included data, it should also be mentioned, that the stationary phases were not removed from the movement data. It would be interesting to see, if removing them could help with the accuracy of the clustering methods. Furthermore, it is to mention, that the validation with GIS used the speed information from the parameters that were also used within the clustering. It is not clear, in how far this might cause a dependency between the ground-truth from GIS and the cluster results. Lastly, it is not clear how the dependency of spatial data influences the clustering procedure. This could be further investigated with a more thorough literature analysis on the method, but this was not in the scope of our project.

Further Research

Apart from the suggested investigation that are mentioned in the discussion above, the following points could be of interest:

The turning angle was considered as an alternative parameter to mitigate the inaccuracies that occured when the bike slowed down in curves, however there was no time left to include this in the analysis and will therefore be left for another project. Another interesting investigation would have been the possibility to extract the parameters by which the clusters were formed (how big of a weight had, e.g. the speed in comparison to the average acceleration). If this were possible, the weights could be adjusted to balance certain inaccuracies as well. Further clustering, such as kmodes, which also allows to consider ordinal/categorical variables, would have been of interest, as infrastructural data could have been included. Finally, it could be tested, in how far smoothing out the inconsistencies within segments of one movement type would improve the validation method.

Appendix

Wordcount

Code

#for word count:
#install.packages("pacman") 
library("pacman")
p_install_gh("benmarwick/wordcountaddin")
wordcountaddin::word_count("index.qmd")

[1] 2946

Code

#knitr::spin("preprocessing.R") --> macht "run" von preprocessing file und produziert html.

References

“Determining-the-Optimal-Number-of-Clusters-3-Must-Know-Methods.” https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/#elbow-method.

“Fisher’s_exact_test.” https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/02%3A_Tests_for_Nominal_Variables/2.07%3A_Fisher's_Exact_Test.

“Hclust Function - RDocumentation — Rdocumentation.org.” https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust.

Sadeghian, Paria, Xiaoyun Zhao, Arman Golshan, and Johan Håkansson. 2022. “A Stepwise Methodology for Transport Mode Detection in GPS Tracking Data.” Travel Behaviour and Society 26: 159–67.

Shamoun-Baranes, Judy, Roeland Bom, E Emiel van Loon, Bruno J Ens, Kees Oosterbeek, and Willem Bouten. 2012. “From Sensor Data to Animal Behaviour: An Oystercatcher Example.” PloS One 7 (5): e37997.

Yuan, Guan, Penghui Sun, Jie Zhao, Daxing Li, and Canwei Wang. 2017. “A Review of Moving Object Trajectory Clustering Algorithms.” Artificial Intelligence Review 47: 123–44.