A comparison of unsupervised curve classification methods for sport training data
Abstract
Achieving peak performance at a specified time is the primary goal of athletes' training programs. To optimize performance and reduce the risk of injury, many training program parameters (e.g. intensity, volume, frequency, distribution, duration and type) must be managed carefully. This work focuses on clustering the time-evolution curves of training measurements.
Training data are recorded densely over time. However, the duration of follow-up and the duration of the seasons vary among subjects, and subject-specific variation can induce substantial error. Functional data analysis (FDA) and longitudinal data analysis (LDA) are the main approaches to analyzing repeated measures data (in which multiple measurements are made on the same subject across time). Typically, FDA is applied when the data are dense and are assumed to be observations of a function of time on a continuum. LDA is usually applied when the data are sparse, possibly with different numbers of measurements across individuals, and subject to error. We compared several FDA and LDA methods implemented in publicly available R code: k-means based on the standard Euclidean distance, a discrete Fréchet distance [2], and a functional distance [1]; Gaussian mixture model-based clustering for standard [4], longitudinal [5] and functional [3] data; and latent class mixed models [6]. We discuss the advantages and limitations of each method, including computational and practical aspects.
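To illustrate the difference between two of the distances named above, the following is a minimal Python sketch (not the R code used in the comparison) of the standard Euclidean distance and a dynamic-programming discrete Fréchet distance between sampled curves. The function names and the toy curves are hypothetical and chosen only for illustration; the sketch assumes each curve is stored as an array of (time, value) points.

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two curves sampled at the same time points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("Euclidean distance needs curves of equal length")
    return float(np.linalg.norm(x - y))

def discrete_frechet(p, q):
    """Discrete Frechet distance between curves given as (n, d) and (m, d) arrays.

    The two curves may have different numbers of points.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n, m = len(p), len(q)
    # Pairwise point-to-point distances between the two curves.
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    # ca[i, j] is the coupling distance between p[:i+1] and q[:j+1].
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_previous = min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1])
            ca[i, j] = max(best_previous, d[i, j])
    return float(ca[-1, -1])

# Hypothetical toy curves stored as (time, value) points.
curve_a = [(0, 0), (1, 0), (2, 0)]
curve_b = [(0, 1), (1, 1), (2, 1)]
curve_c = [(0, 1), (2, 1)]  # same shape as curve_b, but fewer measurements

same_length = discrete_frechet(curve_a, curve_b)
different_length = discrete_frechet(curve_a, curve_c)
```

Unlike the Euclidean distance, the discrete Fréchet distance is defined here for curves with different numbers of measurements, which is relevant when follow-up durations vary among subjects. Either distance could then be plugged into a k-means-type procedure.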
References
[1] Febrero-Bande, M. and Oviedo de la Fuente, M. (2012). Statistical computing in functional data analysis: the R package fda.usc. Journal of Statistical Software, 51, 1–28.
[2] Genolini, C. and Falissard, B. (2011). Kml: A package to cluster longitudinal data. Computer Methods and Programs in Biomedicine.
[3] Jacques, J. and Preda, C. (2013). Funclust: A curves clustering method using functional random variables density approximation. Neurocomputing, 112, 164–171.
[4] Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G., and Govaert, G. (2014). Rmixmod: The R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. Journal of Statistical Software.
[5] McNicholas, P. D. and Murphy, T. B. (2010). Model-based clustering of longitudinal data. Canadian Journal of Statistics, 38, 153–168.
[6] Proust-Lima, C., Philipps, V., and Liquet, B. (2015). Estimation of extended mixed models using latent classes and latent processes: the R package lcmm. Technical report, University of Bordeaux. arXiv:1503.00890v2.