how_stickleback_works.Rmd
The accelerometers, magnetometers, and other sensors used in modern
bio-loggers allow ecologists to remotely observe animal behavior at ever
finer scales (Wilmers et al. 2015).
However, new computational techniques are needed for processing,
visualizing, and analyzing the large amount of data generated by these
sensors (Nathan et al. 2022; Williams et al.
2017; Cade et al. 2021). For example, detecting behavioral events
in bio-logging sensor data, such as feeding or social interactions,
requires sifting through hours of high-resolution data, a laborious and
potentially error-prone process. Existing methods for automating
behavioral event detection typically rely on signal processing (Sweeney et al. 2019), machine/deep learning
(Ngô et al. 2021; Bidder et al. 2020), or
a combination of the two (Chakravarty et al.
2020). However, bio-logging data are time series, which are
difficult to classify using traditional methods (Keogh and Kasetty 2003). Fortunately, the data
mining research community has developed new algorithms designed specifically
for time series (Bagnall et al. 2017; Ruiz et al. 2021), many of which are
implemented in the standardized Python package sktime (Löning et al. 2019).
stickleback, named for the classical animal behavior model organism, is a
machine learning pipeline for automating behavioral event detection in
bio-logging data. It interfaces with sktime to provide bio-logging scientists access to the
latest developments in time series learning. The user interface was
designed to solve many of the computational challenges facing
bio-logging scientists. For example, interactive visualizations
facilitate inspection of high-resolution, multi-variate bio-logging
data, and users can define a temporal tolerance for “close enough”
predictions. This package, rstickleback, addresses another critical problem
for bio-logging scientists: ecology as a field preferentially uses R (Lai et
al. 2019), but machine learning tools are most often developed in Python.
rstickleback resolves this language-domain mismatch by providing an R
interface to Python-based tools.
stickleback is a supervised learning pipeline that
operates in three steps. The local step trains a machine learning
classifier on a subset of the data to differentiate events from
non-events. The global step uses a sliding window and cross validation
to identify a prediction confidence threshold for events. Finally, the
boosting step uses prediction errors identified during the global step
to augment the training data for the local step.
stickleback requires two types of data: bio-logging
sensor data, \(S\), and labeled
behavioral events, \(E\). \(S\) can be raw data, such as tri-axial
acceleration, or derived variables such as pitch or overall dynamic body
acceleration (Gleiss, Wilson, and Shepard 2010;
Wilson et al. 2006). \(E\) must be points in time. Contrast event detection
with segmentation, in which behaviors are periods of time and which is usually
accomplished with unsupervised methods such as hidden Markov models (Langrock
et al. 2012). Both \(S\) and \(E\) must be associated with bio-logger
deployments, \(d\).
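As a concrete illustration, \(S\) and \(E\) can be represented as data frames containing a deployment identifier and a timestamp; the column names and values below are hypothetical and serve only to show the required structure.

```r
# Hypothetical sensor data S: one row per sample, tied to a deployment
sensors <- data.frame(
  deployid = "whale1",
  datetime = as.POSIXct("2018-07-01 00:00:00", tz = "UTC") + seq(0, 9.9, by = 0.1),
  depth    = runif(100, 0, 50),
  pitch    = runif(100, -pi / 4, pi / 4)
)

# Hypothetical labeled events E: each event is a single point in time
events <- data.frame(
  deployid = "whale1",
  datetime = as.POSIXct(c("2018-07-01 00:00:03", "2018-07-01 00:00:07"), tz = "UTC")
)
```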
The goal of the local step is to train a time series classifier to differentiate behavioral events from the background (non-events). From the user’s perspective, the critical inputs are (1) a time series classification model \(M\) and (2) a window size \(w\). \(w\) determines the length of the time series extracted for training \(M\).
stickleback extracts training data for \(M\), \(D_L\), composed of \(2n\)
windows from \(S\), where \(n\) is the number of events in \(E\). The training
data includes (1) the windows in \(S\) centered on all \(n\) events in \(E\)
(class events) and (2) a non-overlapping random sample of \(n\) windows in
\(S\) (class non-events).
Using a subset of \(S\) for \(D_L\) addresses the class imbalance inherent in behavioral event detection. In high-resolution bio-logging data, behavioral events can be outnumbered by non-events by a factor of 100-1,000 or more. Therefore, a random sample of \(n\) non-events undersamples the majority class, improving performance on the minority class (Haibo He and Garcia 2009). However, this can lead to increased false positive rates, which the boosting step addresses later.
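The sketch below illustrates, using the hypothetical sensors and events objects from above, how such a training set could be assembled. It is not the package's internal implementation, and the extract_window() helper is invented for illustration.

```r
w <- 50  # window length, in samples

# Illustrative helper: extract the window of w samples centered on a time point
extract_window <- function(sensors, center, w) {
  i  <- which.min(abs(sensors$datetime - center))
  lo <- max(1, i - w %/% 2)
  hi <- min(nrow(sensors), lo + w - 1)
  sensors[lo:hi, ]
}

# (1) Windows centered on the n labeled events (class "event")
event_windows <- lapply(events$datetime, extract_window, sensors = sensors, w = w)

# (2) A random sample of n windows for class "non-event"; a full implementation
#     would also ensure these do not overlap the labeled events
set.seed(1)
centers <- sample(sensors$datetime, length(event_windows))
nonevent_windows <- lapply(centers, extract_window, sensors = sensors, w = w)
```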
\(M\) must be a time series
classification model from the sktime
package, which the
local step fits to the local training data, \(D_L\).
The bio-logging sensor data, \(S\), is longitudinal, but the time series classification model, \(M\), is trained on windows of length \(w\), so the global step connects the two time scales. The critical inputs are the temporal tolerance, \(\epsilon\), and the number of folds for cross validation, \(f\).
In outline, the global step (1) slides a window of length \(w\) across \(S\) and applies \(M\) to produce a series of local event probabilities, \(p_l\), then (2) converts \(p_l\) into predicted event times and (3) chooses the prediction confidence threshold, \(\hat{r}\), that best reproduces \(E\), counting a prediction as correct if it falls within \(\epsilon\) of a labeled event. As just described, the global step would use the same data to train \(M\) and select \(\hat{r}\), which would bias \(\hat{r}\) too high, because the \(p_l\) that \(M\) produces for out-of-sample data will likely be lower than for in-sample data. Therefore, the global step actually partitions \(S\) and \(E\) into \(f\) folds and uses cross validation to choose \(\hat{r}\). For each fold, a copy of \(M\), \(M'\), is trained on the other \(f - 1\) folds. Step 1 uses \(M'\) to generate \(p_l\) for the held-out fold. The \(p_l\) series from all folds are then merged, and steps 2 and 3 use the combined \(p_l\) to select \(\hat{r}\).
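A minimal sketch of this deployment-level cross validation is shown below; it is illustrative only and assumes deployments are identified by a deployid column, with the model-fitting details left as comments.

```r
f <- 4
deployids <- paste0("whale", 1:8)                      # hypothetical deployments
folds <- split(deployids, rep_len(seq_len(f), length(deployids)))

for (k in seq_len(f)) {
  held_out  <- folds[[k]]                              # deployments scored out of sample
  train_ids <- setdiff(deployids, held_out)            # deployments used to train M'
  # Step 1: fit M' on windows drawn from train_ids, then slide M' across the
  # held-out deployments to produce their p_l series
}
# After merging p_l across all folds, steps 2 and 3 choose the threshold r-hat
```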
Undersampling the majority class (non-event) can lead to increased false
positives. These false positives are “near misses”, where the animal’s
movement was similar enough to the behavior of interest to fool the time
series classifier, \(M\). These windows of time contain important information
for differentiating between event and non-event windows, and are valuable for
training \(M\), but stickleback cannot know a priori when they occur.
Therefore, in the boosting step, all windows centered on the false positive
predictions are added to \(D_L\), the training data set for \(M\). Then the
local and global steps are repeated.
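Continuing the illustrative sketch from the local step (and reusing its extract_window() helper), the boosting step could be pictured as follows; the false positive times here are invented for illustration.

```r
# Suppose the global step produced false positive predictions at these times
false_positive_times <- as.POSIXct(
  c("2018-07-01 00:00:05", "2018-07-01 00:00:08"), tz = "UTC"
)

# Windows centered on the false positives join D_L as extra non-event examples
fp_windows <- lapply(false_positive_times, extract_window, sensors = sensors, w = w)
nonevent_windows <- c(nonevent_windows, fp_windows)

# The local and global steps are then repeated with the augmented training data
```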
Use Stickleback() to define the model. The argument tsc (Time Series
Classifier) corresponds to \(M\); use compose_tsc() or create_tsc() to define
tsc. Arguments win_size, tol, and n_folds correspond to \(w\), \(\epsilon\),
and \(f\), respectively. nth modifies how the global step generates \(p_l\):
if nth = 2, for example, then \(p_l\) is evaluated for every other window and
gaps are filled with cubic spline interpolation.
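For example, a model definition might look like the sketch below. The compose_tsc() argument names shown (module, algorithm, params, columns) and all parameter values are assumptions for illustration rather than a prescription; consult the package documentation for the exact signature.

```r
library(rstickleback)

# Define M via sktime (argument names and values here are illustrative)
tsc <- compose_tsc(
  module = "interval_based",
  algorithm = "SupervisedTimeSeriesForest",
  params = list(n_estimators = 2L, random_state = 4321L),
  columns = c("depth", "pitch")
)

# Define the Stickleback model
sb <- Stickleback(
  tsc,
  win_size = 50,   # w: length of the extracted windows
  tol = 5,         # epsilon: temporal tolerance for "close enough" predictions
  n_folds = 4,     # f: folds used for cross validation in the global step
  nth = 2          # evaluate p_l for every other window; gaps spline-interpolated
)
```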
sb_fit() runs all three steps in the method: local, global, and boosting.
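A fitting call might then look like the line below; the argument order (model, sensor data, events) is an assumption based on the inputs described above, and sensors_train and events_train are placeholders for prepared training data.

```r
# Fit the model: runs the local, global, and boosting steps
sb_fit(sb, sensors_train, events_train)
```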