preparing-data.Rmd
Examples in the rstickleback documentation rely on the load_lunges() helper function to provide data in a processed format. Though convenient, load_lunges() doesn’t help users understand how to prepare their own data. This vignette demonstrates the steps necessary to convert files in a directory to rstickleback-friendly objects. Your data won’t be in the same directory structure or follow the same naming conventions as these sample data, but hopefully this vignette gets you started in the right direction.
# Start by loading a few packages, including rstickleback
library(fs)
library(lubridate)
library(rstickleback)
library(tidyverse)
The bluewhaledata repository on GitHub has a zipped archive with the same sample data as this package, only in CSV form. The archive contains two directories, sensors and events, with CSV files containing bio-logging sensor data and behavioral event timestamps, respectively. For more details about how the data were collected and what they represent, see load_lunges().
You can download the data directly in R.
# Create a temporary directory
temp_dir <- tempdir()
# Download file and unzip data
url <- "https://github.com/FlukeAndFeather/bluewhaledata/raw/master/bluewhaledata.zip"
download.file(url, file.path(temp_dir, "bluewhaledata.zip"))
unzip(file.path(temp_dir, "bluewhaledata.zip"), exdir = temp_dir)
Each CSV file is named according to the deployment identifier and data type (event or sensor).
data_dir <- file.path(temp_dir, "data")
# Note: your root folder won't match exactly. It's a random path, generated by
# tempdir().
dir_tree(data_dir)
#> /var/folders/x2/pd2b41ls033f6_p4h4h37k880000gn/T//Rtmpc8BTUv/data
#> ├── events
#> │ ├── bw180904-44_events.csv
#> │ ├── bw180904-48_events.csv
#> │ ├── bw180904-52_events.csv
#> │ ├── bw180905-42_events.csv
#> │ ├── bw180905-49_events.csv
#> │ └── bw180905-53_events.csv
#> └── sensors
#> ├── bw180904-44_sensors.csv
#> ├── bw180904-48_sensors.csv
#> ├── bw180904-52_sensors.csv
#> ├── bw180905-42_sensors.csv
#> ├── bw180905-49_sensors.csv
#> └── bw180905-53_sensors.csv
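Because the deployment identifier is embedded in each file name, you can recover it programmatically. Here’s a small sketch (deployid_from_path() is a hypothetical helper written for this vignette, not part of rstickleback): the identifier is everything before the final underscore.
# Hypothetical helper: strip the "_events.csv"/"_sensors.csv" suffix from a
# file name to recover the deployment identifier
deployid_from_path <- function(path) {
  str_remove(path_file(path), "_(events|sensors)\\.csv$")
}
deployid_from_path("bw180904-44_events.csv")
#> [1] "bw180904-44"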
Inspecting the event and sensor data for the first deployment (identifier bw180904-44), we see that event CSV files have one datetime column, lunge, and sensor CSV files have five columns. In the sensor data, the timestamp_utc datetime column is the time and the other four columns (depth, pitch, roll, and speed) describe the animal’s movement.
sample_events <- read_csv(file.path(data_dir, "events", "bw180904-44_events.csv"))
#> Rows: 40 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dttm (1): lunge
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sample_events)
#> Rows: 40
#> Columns: 1
#> $ lunge <dttm> 2018-09-04 10:52:19, 2018-09-04 10:54:15, 2018-09-04 10:56:08, …
sample_sensors <- read_csv(file.path(data_dir, "sensors", "bw180904-44_sensors.csv"))
#> Rows: 72000 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): depth, pitch, roll, speed
#> dttm (1): timestamp_utc
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sample_sensors)
#> Rows: 72,000
#> Columns: 5
#> $ timestamp_utc <dttm> 2018-09-04 10:47:16, 2018-09-04 10:47:16, 2018-09-04 10…
#> $ depth <dbl> 0.3480302, 0.3534735, 0.3587432, 0.3638423, 0.3683940, 0…
#> $ pitch <dbl> -0.276757371, -0.253706385, -0.232672562, -0.212183548, …
#> $ roll <dbl> 0.34689666, 0.29694713, 0.24615438, 0.19446428, 0.141366…
#> $ speed <dbl> 6.297395, 6.110192, 5.890096, 5.673690, 5.456957, 5.2420…
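Before combining files, a quick sanity check on the sensor timestamps can save headaches later. This is just a sketch, not a required step: it summarizes the spacing between consecutive samples, which should be close to constant for regularly sampled data.
# Spacing (in seconds) between consecutive sensor samples; a regular sampling
# interval shows up as a near-constant value
summary(as.numeric(diff(sample_sensors$timestamp_utc), units = "secs"))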
To prepare these data for use in rstickleback, we have to combine all the CSVs into two data frames: one each for events and sensors. Here is one way of doing that using a for-loop, which is likely familiar to most readers. I demonstrate a more elegant solution later.
# A list of the deployment ids
deployids <- c("bw180904-44",
               "bw180904-48",
               "bw180904-52",
               "bw180905-42",
               "bw180905-49",
               "bw180905-53")
# Create empty data frames to hold the results
bw_events_df <- data.frame()
bw_sensors_df <- data.frame()
# Iterate through the deployment ids, read the data, and append to the results
for (id in deployids) {
  # Event data first...
  event_path <- dir(file.path(data_dir, "events"),
                    pattern = id,
                    full.names = TRUE)
  events <- read_csv(event_path, col_types = "T") %>%
    # I happen to know dates are in UTC. Metadata are important!
    mutate(lunge = force_tz(lunge, "UTC"),
           deployid = id)
  bw_events_df <- rbind(bw_events_df, events)
  # ...then sensor data
  sensor_path <- dir(file.path(data_dir, "sensors"),
                     pattern = id,
                     full.names = TRUE)
  # Specifying the column types with the `col_types` parameter suppresses a lot
  # of extraneous messages about column parsing.
  sensors <- read_csv(sensor_path, col_types = "Tdddd") %>%
    mutate(timestamp_utc = force_tz(timestamp_utc, "UTC"),
           deployid = id)
  bw_sensors_df <- rbind(bw_sensors_df, sensors)
}
# The results
glimpse(bw_events_df)
#> Rows: 218
#> Columns: 2
#> $ lunge <dttm> 2018-09-04 10:52:19, 2018-09-04 10:54:15, 2018-09-04 10:56:0…
#> $ deployid <chr> "bw180904-44", "bw180904-44", "bw180904-44", "bw180904-44", "…
glimpse(bw_sensors_df)
#> Rows: 416,046
#> Columns: 6
#> $ timestamp_utc <dttm> 2018-09-04 10:47:16, 2018-09-04 10:47:16, 2018-09-04 10…
#> $ depth <dbl> 0.3480302, 0.3534735, 0.3587432, 0.3638423, 0.3683940, 0…
#> $ pitch <dbl> -0.276757371, -0.253706385, -0.232672562, -0.212183548, …
#> $ roll <dbl> 0.34689666, 0.29694713, 0.24615438, 0.19446428, 0.141366…
#> $ speed <dbl> 6.297395, 6.110192, 5.890096, 5.673690, 5.456957, 5.2420…
#> $ deployid <chr> "bw180904-44", "bw180904-44", "bw180904-44", "bw180904-4…
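It’s also worth confirming that all six deployments made it into the combined data frames. Here’s one quick way to do that (just a sketch, using dplyr’s count()); the per-deployment totals should add up to the row counts above.
# Number of events and sensor records per deployment
count(bw_events_df, deployid)
count(bw_sensors_df, deployid)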
rstickleback-friendly objects
We now have two data frames containing all the event and sensor data for these six deployments, which we can use to create Events and Sensors objects for use with rstickleback.
bw_events <- Events(bw_events_df,
                    deployid_col = "deployid",
                    datetime_col = "lunge")
bw_sensors <- Sensors(bw_sensors_df,
                      deployid_col = "deployid",
                      datetime_col = "timestamp_utc",
                      sensor_cols = c("depth", "pitch", "roll", "speed"))
bw_events
#> Events
#> 6 deployments.
#> With 218 events.
bw_sensors
#> Sensors
#> 6 deployments.
#> With column(s): depth, pitch, roll, speed
These data are now ready to use with rstickleback functions. For example, you can visualize the event and sensor data with sb_plot_data() and train a model using sb_fit().
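As a rough sketch of the plotting call (see ?sb_plot_data for the actual arguments; I’m assuming here that it takes a deployment identifier followed by the Sensors and Events objects, as in the package README):
# Plot the sensor traces and lunge events for one deployment (assumed argument
# order: deployment ID, Sensors object, Events object; see ?sb_plot_data)
sb_plot_data("bw180904-44", bw_sensors, bw_events)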
As a final note, here’s the more elegant way to read an entire directory of CSVs, using regular expressions to extract the deployment identifier from the file name. Note: this only works if your version of readr is at least 2.0.0.
event_csvs <- dir(file.path(data_dir, "events"), full.names = TRUE)
bw_events_df2 <- read_csv(event_csvs, col_types = "T", id = "file_name") %>%
  mutate(deployid = str_extract(file_name, "bw[0-9]{6}-[0-9]{2}"))
sensor_csvs <- dir(file.path(data_dir, "sensors"), full.names = TRUE)
bw_sensors_df2 <- read_csv(sensor_csvs, col_types = "Tdddd", id = "file_name") %>%
  mutate(deployid = str_extract(file_name, "bw[0-9]{6}-[0-9]{2}"))
These data frames are functionally equivalent to those made with the for-loop, and work just as well with rstickleback functions.
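If you want to convince yourself of that, here’s a quick spot check (just a sketch; these data frames also carry an extra file_name column, so compare individual columns rather than whole data frames). Both comparisons should come back TRUE if the files were read in the same order.
# Spot check: the shared columns should match the for-loop results, assuming
# the files were read in the same (alphabetical) order
all.equal(bw_events_df$lunge, bw_events_df2$lunge)
all.equal(bw_events_df$deployid, bw_events_df2$deployid)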
bw_events2 <- Events(bw_events_df2,
                     deployid_col = "deployid",
                     datetime_col = "lunge")
bw_sensors2 <- Sensors(bw_sensors_df2,
                       deployid_col = "deployid",
                       datetime_col = "timestamp_utc",
                       sensor_cols = c("depth", "pitch", "roll", "speed"))
bw_events2
#> Events
#> 6 deployments.
#> With 218 events.
bw_sensors2
#> Sensors
#> 6 deployments.
#> With column(s): depth, pitch, roll, speed