A short introduction to climodr

Version 0.0.0.9003

1 Introduction to climodr

Welcome to climate modeler in R, climodr for short. This package uses point data from climate stations, spectral imagery, and elevation models to automatically create ready-to-use climate maps.

The idea of climodr is to deliver an easy-to-use method for creating high-quality climate maps, like the one we create in this vignette:

Temperature Map produced with climodr

Note: This example is built with dummy data and will not produce a good model; it is for educational purposes only.

Let's take a look at the basic structure of climodr:

Climodr Structure

Climodr is split into four main steps: Environment, Pre-Processing, Processing, and Plotting. With only these four steps, climodr creates basic but robust climate models and climate maps, without making things too complicated for the user.

This vignette guides you through the package, explains its functions, and gives you an idea of how to use climodr. The example contains a few dummy climate stations, a metadata file for the climate stations, a vector file containing our area of interest, a small multi-band satellite image, and a digital elevation model (DEM). That is everything one needs to run climodr!

This example creates climate maps for the climate sensor Ta_200, which is the air temperature measured at a height of 2 m above ground.

2 Getting Started with climodr

The idea of climodr is to speed up climate modelling processes and make them easier to use. You store all relevant input data in one folder structure, and the package does the rest using the example workflow provided in this vignette. The functions remain modifiable, so you can adjust the model workflow to your liking.

2.1 Downloading climodr

To start with climodr, you first need to download and install the package from CRAN.

#install climodr
install.packages("climodr") 

R may ask you to install all packages climodr needs to execute its functions. Installing these packages is mandatory; otherwise climodr won't be able to run, as it incorporates many of their functions into its comprehensive workflow.

You can also install the latest development version from the Environmental Informatics Lab (envima) at Marburg University from GitHub. To do so, you need the devtools package installed. Once devtools is installed, you can add climodr with the following command.

# install climodr (may take a moment, but shouldn't be longer than 5-10 minutes)
devtools::install_github(
  "https://github.com/envima/climodr.git", 
  dependencies = TRUE,
  build_vignettes = TRUE)


2.2 How to setup climodr

Setting up climodr requires just one step before you can get started.
Attaching the package will report some conflicts originating from tidyverse. This is completely fine, as those functions are not used by climodr. You can still use the masked functions by prefixing the package name, e.g. stats::filter().

With the envi.create() function, you specify a path where the package should store all its data. There is also the memfrac argument, which sets the fraction of your RAM the terra package is allowed to use. By default this number is quite low, so raising it can speed up processing.

library(climodr)

# setting up the environment for climodr
envrmt <- envi.create(tempdir(),
                      memfrac = 0.8)
## Succesfully created an environment in C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY.
# load in all the climodr example data for this vignette
clim.sample(envrmt = envrmt)
## Loading example data for the climodr example..
## Saved climodr example dependency files to {C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/input/dep}.
## Saved climodr example raster files to {C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/input/raster}.
## Saved climodr example tabular files to {C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/input/tabular}.
## Saved climodr example vector files to {C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/input/vector}.
## Done loading all the example files. You are ready to continue.
# remove everything in the global environment except of our environment path list
rm(list = setdiff(ls(), "envrmt"))

Climodr then creates an environment with three main folders:
- Input (for all necessary data the user must provide)
- Output (for ready-to-use data created by climodr)
- Workflow (for climodr to store data during the process)

Climodr Environment

The Input directory is where all data that shall be used for modelling should be saved beforehand. It consists of four folders:
- dep (Dependency, like a resolution image or metadata)
- raster (Raster data, work in progress)
- tabular (Tabular data, containing climate data from the climate stations)
- vector (Vector data, like the study area or climate station point data)

See the list of possible inputs for further details on what kind of input data can be used.
The Output directory is where all final data created by the package is stored. It consists of three folders:
- maps (basic ready-to-use maps)
- predictions (plain prediction imagery)
- statistics (performance of the predictions and other statistics)

The Output directory contains all the ready-to-use data in basic formats, which should be publication-ready if no further adjustments are required.

The Workflow directory contains all intermediate steps between Input and Output: models, test and training data, cleaned tabular data, and so on.
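The returned envrmt object is essentially a named list of these paths. As a conceptual sketch only (the real envi.create() builds more subfolders and uses different list names, e.g. envrmt$path_tworkflow), the idea looks roughly like this in base R:

```r
# Conceptual sketch of the environment setup, NOT climodr's actual code:
# build a folder tree and return a named list of paths.
create_env_sketch <- function(root) {
  dirs <- c("input/dep", "input/raster", "input/tabular", "input/vector",
            "output/maps", "output/predictions", "output/statistics",
            "workflow")
  paths <- file.path(root, dirs)
  for (p in paths) dir.create(p, recursive = TRUE, showWarnings = FALSE)
  # hypothetical naming scheme for this sketch: "path_" + last folder name
  stats::setNames(as.list(paths), paste0("path_", basename(dirs)))
}

env_sketch <- create_env_sketch(file.path(tempdir(), "climodr_sketch"))
env_sketch$path_dep
```

Keeping all paths in one list is what lets every climodr function find its inputs and outputs without you passing paths around manually.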

Note: 
 - Do not delete any of these folders, since climodr requires them to run properly!
 - The higher you set the fraction of RAM climodr may use, the slower the PC
becomes for anything you want to run in parallel with climodr. With a
fraction > 0.8 it can even be hard to use a browser while climodr is running.

3 Pre-Processing

A small showcase dataset has been prepared for this package and ships with climodr. It is a small scene located in Hainich National Park in Germany, with ten climate stations in it.

Note: This is just an example with very few stations and a very small scene,
which will not result in a good model and is used for educational purposes only. 

You don't have to start with raw, unprocessed data. If your data corresponds to one of the levels in this Pre-Processing step, you can enter the workflow at that stage. For now, you need to make sure the data matches the patterns climodr produces in this workflow.

3.1 Prepare tabular data for processing

First, we have to prepare the raw tabular data for further use. The prep.csv function cleans up the data and removes all NA values.

prep.csv(envrmt = envrmt, 
         method = "proc", 
         save_output = TRUE)
## Removed NAs from 10 stations
# check the created csv files
csv_files <- grep("_no_NAs.csv$", 
                  list.files(envrmt$path_tworkflow), 
                  value = TRUE)
csv_files
##  [1] "Station_G06_no_NAs.csv" "Station_G17_no_NAs.csv" "Station_G20_no_NAs.csv"
##  [4] "Station_G21_no_NAs.csv" "Station_G25_no_NAs.csv" "Station_G48_no_NAs.csv"
##  [7] "Station_W10_no_NAs.csv" "Station_W11_no_NAs.csv" "Station_W19_no_NAs.csv"
## [10] "Station_W20_no_NAs.csv"

3.2 Process tabular data to average values

Next, the data needs to be aggregated to the desired time steps.
In this version, data can be aggregated to "monthly" and "annual" values.
The rbind argument combines all climate station data into one file. This is recommended, since the data usually becomes much shorter after temporal aggregation and is easier to process further this way.
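The underlying idea of the temporal aggregation can be sketched in base R (this is not climodr's implementation; the sample readings are made up for illustration):

```r
# Sketch of monthly aggregation: average readings per station and month.
readings <- data.frame(
  plot = "Station_G06",
  datetime = as.POSIXct(c("2017-07-01 10:00", "2017-07-15 10:00",
                          "2017-08-01 10:00")),
  Ta_200 = c(17.0, 18.0, 16.0)
)
readings$month <- format(readings$datetime, "%Y-%m")
aggregate(Ta_200 ~ plot + month, data = readings, FUN = mean)
#          plot   month Ta_200
# 1 Station_G06 2017-07   17.5
# 2 Station_G06 2017-08   16.0
```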

csv_data <- proc.csv(envrmt = envrmt, 
                     method = "monthly",
                     rbind = TRUE,
                     save_output = TRUE)
head(csv_data)
##          plot            datetime year month monthly_mean_Ta_200
## 1 Station_G06 2017-07-01 12:00:00 2017     7            17.32016
## 2 Station_G17 2017-07-01 12:00:00 2017     7            17.21474
## 3 Station_G20 2017-07-01 12:00:00 2017     7            17.47536
## 4 Station_G21 2017-07-01 12:00:00 2017     7            17.32417
## 5 Station_G25 2017-07-01 12:00:00 2017     7            17.48638
## 6 Station_G48 2017-07-01 12:00:00 2017     7            17.91024

3.3 Spatial aggregation of tabular data

Next, the stations have to be spatially referenced in a coordinate system. This step is crucial for using the data in a modelling context.

csv_spat <- spat.csv(envrmt = envrmt, 
                     method = "monthly",
                     des_file = "plot_description.csv",
                     save_output = TRUE)
head(csv_spat)
##          plot            datetime year month Ta_200        x       y
## 1 Station_G06 2017-07-01 12:00:00 2017     7 17.320 810379.0 5894758
## 2 Station_G17 2017-07-01 12:00:00 2017     7 17.215 811096.8 5893045
## 3 Station_G20 2017-07-01 12:00:00 2017     7 17.475 807298.3 5892820
## 4 Station_G21 2017-07-01 12:00:00 2017     7 17.324 806897.6 5894563
## 5 Station_G25 2017-07-01 12:00:00 2017     7 17.486 811406.0 5891638
## 6 Station_G48 2017-07-01 12:00:00 2017     7 17.910 807097.4 5894840

3.4 Pre-Process Raster Data for data extraction

Now that we have spatial points for our stations, we can continue with the raster data. The preferred method here is "MB_Timeseries", which stands for multi-band time series. Use this method if you provide multiple single-band rasters or raster stacks with different time stamps (YYYYMMDD…) in the file names per scene. The function sorts them by date and crops the data to our study area.

crop.all(envrmt = envrmt, 
         method = "MB_Timeseries", 
         overwrite = TRUE)
## Reading in raster 1/2.
## Cropping raster...
## Saving raster to rworkflow [C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/workflow/rworkflow].
## Reading in raster 2/2.
## Cropping raster...
## Saving DGM to rfinal [C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/output/rfinal].

One important step in climate modelling is to have good predictor variables, which we extract from our spatial raster imagery. In this case we have the 10 spectral bands from the spectral raster stack and one additional elevation layer. This data is already usable for prediction, but may not be sufficient. We can enhance our predictors by creating additional spatial raster layers: by calculating spectral indices, which derive from different spectral bands, we can gain more information from the data. For example, we can calculate the NDVI, which indicates the presence or absence of chlorophyll and thus of vegetation. These layers can then be fed into our model as new predictor variables.
Next, we calculate some basic indices to create more predictor variables for our models. The vi argument selects the vegetation indices to create. You can either list the desired indices in a vector, or simply use "all" to generate all available indices. For more detailed information use ?calc.indices in the console.
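As a small illustration of what such an index is: the NDVI is simply a normalized difference of the near-infrared and red bands. The sketch below uses plain vectors with made-up reflectance values, while calc.indices performs the same arithmetic on whole raster layers:

```r
# NDVI = (NIR - red) / (NIR + red); values range from -1 to 1,
# with high values indicating dense green vegetation.
ndvi <- function(nir, red) (nir - red) / (nir + red)

# two made-up pixel reflectance values for illustration
round(ndvi(nir = c(2659.4, 2654.3), red = c(494.7, 326.9)), 3)
# [1] 0.686 0.781
```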

calc.indices(envrmt = envrmt, 
             vi = "all",
             bands = c("blue", "green", "red", 
                       "nir", "nirb", 
                       "re1", "re2", "re3", 
                       "swir1", "swir2"),
             overwrite = TRUE)
## Number of Scenes to calculate Indices for: 1
## Saving rasterstack with indices to rfinal [C:\Users\ALEXAN~1\AppData\Local\Temp\RtmpAf8TXY/output/rfinal].

3.5 Finalize tabular data for modelling

Now that we have spatial points as well as raster data, we can extract additional predictor variables at the station points from the spatial raster data. For this we use the fin.csv function. It takes the positions we added to the climate station data with spat.csv, extracts the raster values at these positions from every layer of the spatial raster data, and adds them to each corresponding climate station. During the modelling steps, this climate station csv file will be used to generate spatial points, which are then used to train our models.

Reminder: You can check your data at any step. Just go into the Workflow folder of the project directory you defined in the beginning with `envi.create` and take a look at the data. Make sure not to alter it, as this may cause climodr to not run the following functions correctly.

csv_fin <- fin.csv(envrmt = envrmt, 
                   method = "monthly",
                   save_output = TRUE)
head(csv_fin)
##          plot            datetime year month Ta_200        x       y elevation
## 1 Station_G06 2017-07-01 12:00:00 2017     7 17.320 810379.0 5894758  59.55454
## 2 Station_G17 2017-07-01 12:00:00 2017     7 17.215 811096.8 5893045  59.61478
## 3 Station_G20 2017-07-01 12:00:00 2017     7 17.475 807298.3 5892820  59.18333
## 4 Station_G21 2017-07-01 12:00:00 2017     7 17.324 806897.6 5894563  54.01669
## 5 Station_G25 2017-07-01 12:00:00 2017     7 17.486 811406.0 5891638  60.84670
## 6 Station_G48 2017-07-01 12:00:00 2017     7 17.910 807097.4 5894840  54.34514
##       blue    green      red      nir     nirb       re1      re2      re3
## 1 381.8609 594.5724 494.7336 2659.411 2539.976  845.2596 2039.955 2437.255
## 2 272.3525 466.0617 326.9346 2654.324 2504.981  696.6929 1987.608 2377.730
## 3 208.2174 400.8459 248.1707 2307.992 2201.704  604.3890 1759.978 2074.325
## 4 360.4428 687.6474 510.7201 3604.908 3454.017 1061.7035 2734.079 3258.767
## 5 263.9718 407.4745 293.0677 2107.948 1994.610  579.5432 1579.203 1893.036
## 6 373.8912 632.2380 528.5673 2631.460 2502.235  928.6302 2023.512 2378.249
##      swir1     swir2      NDVI       NDWI       NDBI      NDBSI     MSAVI
## 1 1425.205  872.6232 0.6862962 -0.6345572 -0.3021595 -0.2260200 0.8139402
## 2 1360.370  683.8118 0.7806734 -0.7012794 -0.3223044 -0.2686122 0.8768091
## 3 1068.956  494.7866 0.8058256 -0.7040459 -0.3669099 -0.3128039 0.8924525
## 4 2054.615 1044.7113 0.7518142 -0.6796093 -0.2739265 -0.2143751 0.8583096
## 5 1066.651  528.1791 0.7558803 -0.6760191 -0.3280091 -0.2712485 0.8609418
## 6 1668.608  918.8693 0.6654666 -0.6125634 -0.2239154 -0.1553429 0.7991048

Now the data is ready for further modelling.

Pre-processed data created by climodr

4 Processing

At this point, the spatial raster data and the climate station data are ready to use. If your data isn't, check out the 'Pre-Processing' chapter.

4.1 Test for Autocorrelation

First, we test the data for autocorrelation. The evaluation vectors contain all columns with the sensor data and predictor variables that will be tested for autocorrelation. This creates the first outputs of the package: one tabular file per sensor, containing all columns that should be excluded from modelling because they autocorrelate. If plot.corrplot is set to TRUE, it also creates visualizations of the autocorrelation for the user.

Note: The visualization can get quite messy when there are a lot of predictors.

autocorr(
  envrmt = envrmt, 
  method = "monthly",
  resp = 5, 
  pred = c(8:23),
  plot.corrplot = TRUE,
  corrplot = "coef"
  )

## Calculated autocorrelations for these sensors [Ta_200].

4.2 Create climate models from spatial station data

Processing steps during model workflow
Processing steps during model workflow

Now that we've done all necessary previous steps, we can start modelling. This is by far the function with the most arguments. Here is a quick overview of what these arguments do. For more detailed information, take a look at the associated paper for this package (to be published).

- timespan: Vector of years to build models from (in this example 2017).
- climresp: Vector of the column numbers to create models for (in this example Ta_200).
- classifier: Vector of all model variants to be used. In this case:
  - random forest = “rf”
  - partial least squares = “pls”
  - neural networks = “nnet”
  - linear regression = “lm”
- seed: Number to fix the randomness. With a seed, random draws become reproducible.
- p: Fraction of the full data randomly drawn as training data.
- folds: Character or vector. Method to create spacetime folds: “all”, “LLO”, “LTO” or “LLTO”.
- mnote: Character (“model note”). Marks the different model runs in a project.
- predrows: Vector with the column numbers used as predictors.
- tc_method: Train control method. Default is cross-validation, “cv”.
- metric: Summary metric to select the optimal model. Default: root mean square error, “RMSE”.
- autocorrelation: Logical. TRUE if the results of the autocorrelation test should be considered.
- doParallel: Logical. When set to TRUE, the model process parallelizes on all cores except two, so your PC will slow down a lot. Only recommended for PCs with at least 8 cores. Warning: your PC won’t be able to process other tasks efficiently during parallelization.

Once all parameters are set, one can run the model workflow calc.model like this:

calc.model(
  envrmt = envrmt, 
  method = "monthly",
  timespan = c(2017),
  climresp = c(5),
  classifier = c(
    "rf", 
    "pls", 
    "lm"),
  seed = 707,
  p = 0.8,
  folds = "LLO",
  mnote = "vignette",
  predrows = c(8:23),
  tc_method = "cv",
  metric = "RMSE",
  autocorrelation = TRUE,
  doParallel = FALSE)
## Training models for 2017.  Year-Nr.: 1/1
## Training monthly models for 2017.  Month-Nr.: 7
## Use autocorellation data for filtering..
## Run with spatial folds for cross validation.  Fold-Nr.: 1/1
## Calculate models for sensor: Ta_200
## Next model: Random Forest.  1/3
## Computing model...
## Lade nötiges Paket: ggplot2
## Lade nötiges Paket: lattice
## Note: No increase in performance found using more than 2 variables
## Next model: Partial-Least-Squares.  2/3
## Computing model...
## Note: No increase in performance found using more than 3 variables
## Next model: Linear Regression.  3/3
## Computing model...
## Note: No increase in performance found using more than 2 variables
## Done! Saving evaluation data frame.
##   year_month classifier  accuracy     Nrmse Rsqrd sensor modeltype     note
## 1     201707        raf 0.2325688 0.1778049   NaN Ta_200       LLO vignette
## 2     201707        pls 0.1762902 0.1347784   NaN Ta_200       LLO vignette
## 3     201707        lim 0.1516702 0.1159558   NaN Ta_200       LLO vignette
##          variables
## 1      NDBI, NDBSI
## 2 NDWI, NDBI, blue
## 3       re2, swir1

Congratulations, you have created your first models using climodr!

Climodr also creates an evaluation data frame, which is saved in the statistics folder. This performance information is later used to select the best models from this model run for prediction.

4.3 Predictions

Next, we can predict the scenes from our example with the climate models from our spatial station data using the climpred function. climpred also calculates the area of applicability (AOA) if the AOA argument is set to TRUE. In this example the AOA will flag a lot of dissimilarity, because the sample data is dummy data and far from the reality of the new data. Keep in mind to use the same mnote as in your model run, so climodr predicts with the models you created in that run.

climpred(
  envrmt = envrmt, 
  method = "monthly",
  mnote = "vignette", 
  AOA = TRUE) 
## Making Ta_200 lim-prediction for 201707.

Let's show the list of predictions:

predlist <- list.files(envrmt$path_predictions, 
                       pattern = ".tif", 
                       recursive = TRUE)
head(predlist)
## [1] "aoa/vignette_Ta_200_201707_LLO_lim_aoa.tif"   
## [2] "vignette_Ta_200_201707_LLO_lim_prediction.tif"

For easier navigation and searching, the names of the predictions, as well as all other file names created by climodr, follow a pattern. They always consist of: the mnote, the sensor name, the date of the scene, the folds used during the model run, the model classifier, and the word prediction.
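This pattern also makes the names easy to take apart programmatically. The helper below is hypothetical (not part of climodr) and assumes a sensor name containing exactly one underscore, like Ta_200:

```r
# Hypothetical helper: split a climodr prediction file name into its parts,
# following the pattern mnote_sensor_date_folds_classifier_prediction.
parse_predname <- function(filename) {
  parts <- strsplit(sub("\\.tif$", "", basename(filename)), "_")[[1]]
  list(mnote      = parts[1],
       sensor     = paste(parts[2], parts[3], sep = "_"),  # e.g. "Ta_200"
       date       = parts[4],
       folds      = parts[5],
       classifier = parts[6],
       type       = parts[7])
}

parse_predname("vignette_Ta_200_201707_LLO_lim_prediction.tif")$classifier
# [1] "lim"
```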

5 Plotting

The climplot function finally plots your predictions and saves them in the output folder. It uses the plotting functions from the terra package. These plots are very simple, but they contain all the important information you'll need in a map.

You can create plots of your predictions like this:

climplot(
  envrmt = envrmt, 
  mnote = "vignette",
  sensor = "Ta_200",
  aoa = TRUE,
  mapcolors = rev(heat.colors(50)),
  scale_position = "bottomleft",
  north_position = "topright"
)
## Creating maps for sensor [Ta_200].

So in the end you receive a map like the one we saw in the beginning.

Temperature Map produced with climodr