preparing-data.Rmd
Examples in the rstickleback documentation rely on the load_lunges() helper function to provide data in a processed format. Though convenient, load_lunges() doesn’t help users understand how to prepare their own data. This vignette demonstrates the steps necessary to convert files in a directory to rstickleback-friendly objects. Your data won’t be in the same directory structure or follow the same naming conventions as these sample data, but hopefully this vignette gets you started in the right direction.
# Start by loading a few packages, including rstickleback
library(fs)
library(lubridate)
library(rstickleback)
library(tidyverse)
The bluewhaledata repository on GitHub has a zipped archive with the same sample data as this package, only in CSV form. The archive contains two directories, sensors and events, with CSV files containing bio-logging sensor data and behavioral event timestamps, respectively. For more details about how the data were collected and what they represent, see load_lunges().
You can download the data directly in R.
# Create a temporary directory
temp_dir <- tempdir()
# Download file and unzip data
url <- "https://github.com/FlukeAndFeather/bluewhaledata/raw/master/bluewhaledata.zip"
download.file(url, file.path(temp_dir, "bluewhaledata.zip"))
unzip(file.path(temp_dir, "bluewhaledata.zip"), exdir = temp_dir)
Each CSV file is named according to the deployment identifier and data type (event or sensor).
data_dir <- file.path(temp_dir, "data")
# Note: your root folder won't match exactly. It's a random path, generated by
# tempdir().
dir_tree(data_dir)
#> /var/folders/x2/pd2b41ls033f6_p4h4h37k880000gn/T//Rtmpc8BTUv/data
#> ├── events
#> │ ├── bw180904-44_events.csv
#> │ ├── bw180904-48_events.csv
#> │ ├── bw180904-52_events.csv
#> │ ├── bw180905-42_events.csv
#> │ ├── bw180905-49_events.csv
#> │ └── bw180905-53_events.csv
#> └── sensors
#> ├── bw180904-44_sensors.csv
#> ├── bw180904-48_sensors.csv
#> ├── bw180904-52_sensors.csv
#> ├── bw180905-42_sensors.csv
#> ├── bw180905-49_sensors.csv
#> └── bw180905-53_sensors.csv
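Because the deployment identifier is embedded in each file name, you can recover it programmatically. Here’s a small sketch (deployid_from_path() is a hypothetical helper written for this vignette, not part of rstickleback): the identifier is everything before the final underscore.
# Hypothetical helper: strip the "_events.csv"/"_sensors.csv" suffix from a
# file name to recover the deployment identifier
deployid_from_path <- function(path) {
  str_remove(path_file(path), "_(events|sensors)\\.csv$")
}
deployid_from_path("bw180904-44_events.csv")
#> [1] "bw180904-44"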
Inspecting the event and sensor data for the first deployment (identifier bw180904-44), we see that event CSV files have one datetime column, lunge, and sensor CSV files have five columns. In the sensor data, the timestamp_utc datetime column is the time and the other four columns (depth, pitch, roll, and speed) describe the animal’s movement.
sample_events <- read_csv(file.path(data_dir, "events", "bw180904-44_events.csv"))
#> Rows: 40 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dttm (1): lunge
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sample_events)
#> Rows: 40
#> Columns: 1
#> $ lunge <dttm> 2018-09-04 10:52:19, 2018-09-04 10:54:15, 2018-09-04 10:56:08, …
sample_sensors <- read_csv(file.path(data_dir, "sensors", "bw180904-44_sensors.csv"))
#> Rows: 72000 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): depth, pitch, roll, speed
#> dttm (1): timestamp_utc
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sample_sensors)
#> Rows: 72,000
#> Columns: 5
#> $ timestamp_utc <dttm> 2018-09-04 10:47:16, 2018-09-04 10:47:16, 2018-09-04 10…
#> $ depth <dbl> 0.3480302, 0.3534735, 0.3587432, 0.3638423, 0.3683940, 0…
#> $ pitch <dbl> -0.276757371, -0.253706385, -0.232672562, -0.212183548, …
#> $ roll <dbl> 0.34689666, 0.29694713, 0.24615438, 0.19446428, 0.141366…
#> $ speed <dbl> 6.297395, 6.110192, 5.890096, 5.673690, 5.456957, 5.2420…
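Before combining files, a quick sanity check on the sensor timestamps can save headaches later. This is just a sketch, not a required step: it summarizes the spacing between consecutive samples, which should be close to constant for regularly sampled data.
# Spacing (in seconds) between consecutive sensor samples; a regular sampling
# interval shows up as a near-constant value
summary(as.numeric(diff(sample_sensors$timestamp_utc), units = "secs"))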
To prepare these data for use in rstickleback, we have to combine all the CSVs into two data frames: one each for events and sensors. Here is one way of doing that using a for-loop, which is likely familiar to most readers. I demonstrate a more elegant solution later.
# A list of the deployment ids
deployids <- c("bw180904-44",
               "bw180904-48",
               "bw180904-52",
               "bw180905-42",
               "bw180905-49",
               "bw180905-53")
# Create empty data frames to hold the results
bw_events_df <- data.frame()
bw_sensors_df <- data.frame()
# Iterate through the deployment ids, read the data, and append to the results
for (id in deployids) {
  # Event data first...
  event_path <- dir(file.path(data_dir, "events"),
                    pattern = id,
                    full.names = TRUE)
  events <- read_csv(event_path, col_types = "T") %>%
    # I happen to know dates are in UTC. Metadata are important!
    mutate(lunge = force_tz(lunge, "UTC"),
           deployid = id)
  bw_events_df <- rbind(bw_events_df, events)
  # ...then sensor data
  sensor_path <- dir(file.path(data_dir, "sensors"),
                     pattern = id,
                     full.names = TRUE)
  # Specifying the column types with the `col_types` parameter suppresses a lot
  # of extraneous messages about column parsing.
  sensors <- read_csv(sensor_path, col_types = "Tdddd") %>%
    mutate(timestamp_utc = force_tz(timestamp_utc, "UTC"),
           deployid = id)
  bw_sensors_df <- rbind(bw_sensors_df, sensors)
}
# The results
glimpse(bw_events_df)
#> Rows: 218
#> Columns: 2
#> $ lunge <dttm> 2018-09-04 10:52:19, 2018-09-04 10:54:15, 2018-09-04 10:56:0…
#> $ deployid <chr> "bw180904-44", "bw180904-44", "bw180904-44", "bw180904-44", "…
glimpse(bw_sensors_df)
#> Rows: 416,046
#> Columns: 6
#> $ timestamp_utc <dttm> 2018-09-04 10:47:16, 2018-09-04 10:47:16, 2018-09-04 10…
#> $ depth <dbl> 0.3480302, 0.3534735, 0.3587432, 0.3638423, 0.3683940, 0…
#> $ pitch <dbl> -0.276757371, -0.253706385, -0.232672562, -0.212183548, …
#> $ roll <dbl> 0.34689666, 0.29694713, 0.24615438, 0.19446428, 0.141366…
#> $ speed <dbl> 6.297395, 6.110192, 5.890096, 5.673690, 5.456957, 5.2420…
#> $ deployid <chr> "bw180904-44", "bw180904-44", "bw180904-44", "bw180904-4…
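It’s also worth confirming that all six deployments made it into the combined data frames. Here’s one quick way to do that (just a sketch, using dplyr’s count()); the per-deployment totals should add up to the row counts above.
# Number of events and sensor records per deployment
count(bw_events_df, deployid)
count(bw_sensors_df, deployid)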
rstickleback-friendly objects
We now have two data frames containing all the event and sensor data for these six deployments, which we can use to create Events and Sensors objects for use with rstickleback.
bw_events <- Events(bw_events_df,
                    deployid_col = "deployid",
                    datetime_col = "lunge")
bw_sensors <- Sensors(bw_sensors_df,
                      deployid_col = "deployid",
                      datetime_col = "timestamp_utc",
                      sensor_cols = c("depth", "pitch", "roll", "speed"))
bw_events
#> Events
#> 6 deployments.
#> With 218 events.
bw_sensors
#> Sensors
#> 6 deployments.
#> With column(s): depth, pitch, roll, speed
These data are now ready to use with rstickleback functions. For example, you can visualize the event and sensor data with sb_plot_data() and train a model using sb_fit().
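As a rough sketch of the plotting call (see ?sb_plot_data for the actual arguments; I’m assuming here that it takes a deployment identifier followed by the Sensors and Events objects, as in the package README):
# Plot the sensor traces and lunge events for one deployment (assumed argument
# order: deployment ID, Sensors object, Events object; see ?sb_plot_data)
sb_plot_data("bw180904-44", bw_sensors, bw_events)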
As a final note, here’s the more elegant way to read an entire directory of CSVs, using regular expressions to extract the deployment identifier from the file name. Note: this only works if your version of readr is at least 2.0.0.
event_csvs <- dir(file.path(data_dir, "events"), full.names = TRUE)
bw_events_df2 <- read_csv(event_csvs, col_types = "T", id = "file_name") %>%
  mutate(deployid = str_extract(file_name, "bw[0-9]{6}-[0-9]{2}"))
sensor_csvs <- dir(file.path(data_dir, "sensors"), full.names = TRUE)
bw_sensors_df2 <- read_csv(sensor_csvs, col_types = "Tdddd", id = "file_name") %>%
  mutate(deployid = str_extract(file_name, "bw[0-9]{6}-[0-9]{2}"))
These data frames are functionally equivalent to those made with the for-loop, and work just as well with rstickleback functions.
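If you want to convince yourself of that, here’s a quick spot check (just a sketch; these data frames also carry an extra file_name column, so compare individual columns rather than whole data frames). Both comparisons should come back TRUE if the files were read in the same order.
# Spot check: the shared columns should match the for-loop results, assuming
# the files were read in the same (alphabetical) order
all.equal(bw_events_df$lunge, bw_events_df2$lunge)
all.equal(bw_events_df$deployid, bw_events_df2$deployid)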
bw_events2 <- Events(bw_events_df2,
                     deployid_col = "deployid",
                     datetime_col = "lunge")
bw_sensors2 <- Sensors(bw_sensors_df2,
                       deployid_col = "deployid",
                       datetime_col = "timestamp_utc",
                       sensor_cols = c("depth", "pitch", "roll", "speed"))
bw_events2
#> Events
#> 6 deployments.
#> With 218 events.
bw_sensors2
#> Sensors
#> 6 deployments.
#> With column(s): depth, pitch, roll, speed