# 4 Tutorial 1: Introduction

## 4.1 Setup

Implementing Sykdomspulsen Core requires a number of functions to be called in the correct order. To make this as simple as possible, we have provided a skeleton implementation at sykdomspulsen-org/scskeleton.

For this tutorial you should clone the sykdomspulsen-org/sc-tutorial-start GitHub repo to your server. This is the package that you will be working on throughout this tutorial. You may choose to do a global find/replace on sc-tutorial-start with the name you want for your R package. We will refer to this R package as your “sc implementation”.

You can also clone sykdomspulsen-org/sc-tutorial-end to your server. This is the end product of the tutorial, and you should refer to it in order to check your work.

For the purposes of this tutorial, we assume that the reader is either using RStudio Server Open Source or RStudio Workbench inside Docker containers that have been built according to the Sykdomspulsen specifications. We will refer to your implementation of RStudio Server Open Source/RStudio Workbench with the generic term “RStudio”.

Open sc-tutorial-start in RStudio project mode. Restart the R session via Ctrl+Shift+F10, rstudioapi::restartSession(), or Session > Restart R. This will ensure that you have a clean working environment before you begin. You may now load your sc implementation. This can be done via Ctrl+Shift+L, devtools::load_all("."), or Build > Load All.

rstudioapi::restartSession()
devtools::load_all(".")

If you are working in the Sykdomspulsen infrastructure you might see a warning message of the form sh: 1: /bin/authenticate.sh: not found. This is related to authentication being performed automatically on sign-in, and you do not need to worry about it for the purposes of this tutorial. You might also see a warning starting with “Objects listed as exports, but not present in namespace:”. This can also be ignored.

You can now see which schemas have been loaded by running sc::tm_get_schema_names(). These schemas were included in the skeleton. Note that schemas beginning with config_* are special schemas that are automatically generated by sc.

sc::tm_get_schema_names()
## [1] "config_last_updated"                                           "config_structure_time"
## [3] "rundate"                                                       "config_datetime"
## [5] "anon_example_weather_rawdata"                                  "anon_example_weather_data"
## [7] "anon_example_income"                                           "anon_example_house_prices"
## [9] "anon_example_house_prices_outliers_after_adjusting_for_income"

When you do this you will not see the schemas related to weather, income, and house prices, and you might see a warning of the form “In setup_ns_exports(path, export_all, export_imports) : Objects listed as exports…”. This is as expected.

You can also see which tasks have been loaded by running sc::tm_get_task_names(). In sc-tutorial-start no tasks have been created yet, so you will see NULL; in sc-tutorial-end you will see the tasks that you will create during this tutorial.

sc::tm_get_task_names()
## [3] "weather_export_plots"                                           "household_incomes_and_house_prices_import_data"
## [5] "household_incomes_and_house_prices_fit_model_and_find_outliers" "household_incomes_and_house_prices_plot"

## 4.4 Weather data example

We will walk you through the development of a task that downloads weather data from an API and imports the raw data into a database table. We also create a task for cleaning the raw data, again with a schema, a task definition, and an implementation of a data selector function and an action function. Finally, we create a task for the creation of the plots.

All the schemas are specified in 03_db_schemas.r, and all the task definitions are specified in 04_tasks.r. The data selector functions and the action functions corresponding to each task have their own respective script, with the name specified in the task definition.

### 4.4.1 1. Schemas

The first step when developing any task is specifying the schemas that will be used.

It is strongly recommended that you use the RStudio Addins menu to help you quickly insert code templates.

If you go into the script 03_db_schemas.r you will see a function called set_db_schemas. All schemas are placed within this function. If you scroll down you will see that there is already a schema called anon_example_weather_rawdata, which is commented out.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L18-L64
##
## 18 |   ## > anon_example_weather_rawdata ----
## 20 |     name_access = c("anon"),
## 21 |     name_grouping = "example_weather",
## 22 |     name_variant = "rawdata",
## 23 |     db_configs = sc::config$db_configs,
## 24 |     field_types = c(
## 25 |       "granularity_time" = "TEXT",
## 26 |       "granularity_geo" = "TEXT",
## 27 |       "country_iso3" = "TEXT",
## 28 |       "location_code" = "TEXT",
## 29 |       "border" = "INTEGER",
## 30 |       "age" = "TEXT",
## 31 |       "sex" = "TEXT",
## 32 |
## 33 |       "date" = "DATE",
## 34 |
## 35 |       "isoyear" = "INTEGER",
## 36 |       "isoweek" = "INTEGER",
## 37 |       "isoyearweek" = "TEXT",
## 38 |       "season" = "TEXT",
## 39 |       "seasonweek" = "DOUBLE",
## 40 |
## 41 |       "calyear" = "INTEGER",
## 42 |       "calmonth" = "INTEGER",
## 43 |       "calyearmonth" = "TEXT",
## 44 |
## 45 |       "temp_max" = "DOUBLE",
## 46 |       "temp_min" = "DOUBLE",
## 47 |       "precip" = "DOUBLE"
## 48 |     ),
## 49 |     keys = c(
## 50 |       "granularity_time",
## 51 |       "location_code",
## 52 |       "date",
## 53 |       "age",
## 54 |       "sex"
## 55 |     ),
## 56 |     censors = list(
## 57 |       anon = list(
## 58 |
## 59 |       )
## 60 |     ),
## 61 |     validator_field_types = sc::validator_field_types_sykdomspulsen,
## 62 |     validator_field_contents = sc::validator_field_contents_sykdomspulsen,
## 63 |     info = "This db table is used for..."
## 64 |   )

We are now going to recreate this schema. Make sure your cursor is inside of the curly brackets. Go to the Addins menu and click Insert db schema (anon). You have now created a boilerplate for your schema.

#### 4.4.1.1 Schema name

Start by replacing GROUPING_VARIANT in anon_GROUPING_VARIANT with the name of your schema, for example example_weather_rawdata. The grouping will then be example_weather and the variant rawdata.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L20-L22
##
## 20 |     name_access = c("anon"),
## 21 |     name_grouping = "example_weather",
## 22 |     name_variant = "rawdata",

Fill this in for name_grouping and name_variant. The name of the schema is then anon_example_weather_rawdata.
#### 4.4.1.2 Validators

The validators are pre-made and you do not have to change anything.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L61-L63
##
## 61 |     validator_field_types = sc::validator_field_types_sykdomspulsen,
## 62 |     validator_field_contents = sc::validator_field_contents_sykdomspulsen,
## 63 |     info = "This db table is used for..."

These are validators that check:

• Are the column names/field types in the schema definition in line with style guidelines?
• Are the values/field contents of the datasets that will be uploaded to the database correct? E.g. does a date column actually contain dates?

When using validator_field_types = sc::validator_field_types_sykdomspulsen we expect that the first 16 columns are always as follows (i.e. standardized structural data):

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L25-L43
##
## 25 |       "granularity_time" = "TEXT",
## 26 |       "granularity_geo" = "TEXT",
## 27 |       "country_iso3" = "TEXT",
## 28 |       "location_code" = "TEXT",
## 29 |       "border" = "INTEGER",
## 30 |       "age" = "TEXT",
## 31 |       "sex" = "TEXT",
## 32 |
## 33 |       "date" = "DATE",
## 34 |
## 35 |       "isoyear" = "INTEGER",
## 36 |       "isoweek" = "INTEGER",
## 37 |       "isoyearweek" = "TEXT",
## 38 |       "season" = "TEXT",
## 39 |       "seasonweek" = "DOUBLE",
## 40 |
## 41 |       "calyear" = "INTEGER",
## 42 |       "calmonth" = "INTEGER",
## 43 |       "calyearmonth" = "TEXT",

The field info should contain a short description of the data table.

#### 4.4.1.3 Field types/column names

Add the specific column names and types needed. In our case we want to store the maximum and minimum temperature and the precipitation. Call them “temp_max”, “temp_min”, and “precip”. These are all “DOUBLE”. Remove "XXXX_n" = "INTEGER", and "XXXX_pr" = "DOUBLE" as these are dummy variables.
## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L45-L47
##
## 45 |       "temp_max" = "DOUBLE",
## 46 |       "temp_min" = "DOUBLE",
## 47 |       "precip" = "DOUBLE"

These are the extra columns that contain the context-specific data in this dataset.

#### 4.4.1.4 Keys

The combination of these columns represents a unique row in the dataset. In this dataset the combination of “granularity_time”, “location_code”, “date”, “age”, “sex”, which are the initial suggestions, is sufficient.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L49-L55
##
## 49 |     keys = c(
## 50 |       "granularity_time",
## 51 |       "location_code",
## 52 |       "date",
## 53 |       "age",
## 54 |       "sex"
## 55 |     ),

#### 4.4.1.5 Censoring

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L56-L60
##
## 56 |     censors = list(
## 57 |       anon = list(
## 58 |
## 59 |       )
## 60 |     ),

This specifies censoring that is applied to the datasets. In this example we do not apply any censoring, so remove the boilerplate suggestions.

### 4.4.2 2. Task definition (task_from_config)

Now we have a schema. The second step is defining the task. Again, it is strongly recommended that you use the RStudio Addins menu to help you quickly insert code templates.

Go to the script 04_tasks.r and place your cursor inside of the curly brackets. Use the Addins menu and click Insert task_from_config. Now you have a boilerplate for a task definition.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L21-L43
##
## 21 |   sc::add_task_from_config_v8(
## 22 |     name_grouping = "weather",
## 23 |     name_action = "download_and_import_rawdata",
## 24 |     name_variant = NULL,
## 25 |     cores = 1,
## 26 |     plan_analysis_fn_name = NULL,
## 27 |     for_each_plan = plnr::expand_list(
## 28 |       location_code = fhidata::norway_locations_names()[granularity_geo %in% c("municip")]$location_code
## 29 |     ),
## 30 |     for_each_analysis = NULL,
## 31 |     universal_argset = NULL,
## 32 |     upsert_at_end_of_each_plan = FALSE,
## 33 |     insert_at_end_of_each_plan = FALSE,
## 34 |     action_fn_name = "scexample::weather_download_and_import_rawdata_action",
## 35 |     data_selector_fn_name = "scexample::weather_download_and_import_rawdata_data_selector",
## 36 |     schema = list(
## 37 |       # input
## 38 |
## 39 |       # output
## 40 |       "anon_example_weather_rawdata" = sc::config$schemas$anon_example_weather_rawdata
## 41 |     ),
## 42 |     info = "This task downloads and imports the raw weather data from MET's API at the municipal level"
## 43 |   )

#### 4.4.2.1 Task name

Replace TASK_NAME with your task name, for example weather_download_and_import_rawdata. weather is the grouping name and download_and_import_rawdata is the action name. Insert these for name_grouping and name_action. For name_variant use NULL.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L22-L24
##
## 22 |     name_grouping = "weather",
## 23 |     name_action = "download_and_import_rawdata",
## 24 |     name_variant = NULL,

Now the name of the task is defined to be weather_download_and_import_rawdata.

#### 4.4.2.2 CPU cores

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L25-L25
##
## 25 |     cores = 1,

We specify that the plans will run sequentially with 1 CPU core. If the number of CPU cores is 2 or higher then the first and last plans will run sequentially, and all the plans in the middle will run in parallel. The first and last plans always run sequentially because this allows us to write “special” code for the first and last plans (i.e. “do this before everything runs” and “do this after everything runs”).
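The scheduling rule above can be sketched in plain R. This is an illustrative sketch only (run_plans and run_one are hypothetical names, not sc's actual scheduler), using parallel::mclapply for the middle plans:

```r
# Illustrative sketch of the scheduling rule (NOT sc's actual scheduler):
# with cores >= 2, the first and last plans run sequentially and the
# middle plans run in parallel. run_plans/run_one are hypothetical names.
run_plans <- function(plans, run_one, cores = 1) {
  n <- length(plans)
  if (cores < 2 || n <= 2) {
    # sequential: one plan after the other
    return(lapply(plans, run_one))
  }
  c(
    list(run_one(plans[[1]])),                    # first plan: sequential
    parallel::mclapply(plans[2:(n - 1)], run_one, # middle plans: parallel
                       mc.cores = cores),
    list(run_one(plans[[n]]))                     # last plan: sequential
  )
}

results <- run_plans(as.list(1:5), function(p) p * 2, cores = 2)
unlist(results)  # 2 4 6 8 10
```

This is why "do this before everything runs" and "do this after everything runs" code can safely live in the first and last plans.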

#### 4.4.2.3 Plan/analysis structure

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L26-L30
##
## 26 |     plan_analysis_fn_name = NULL,
## 27 |     for_each_plan = plnr::expand_list(
## 28 |       location_code = fhidata::norway_locations_names()[granularity_geo %in% c("municip")]$location_code
## 29 |     ),
## 30 |     for_each_analysis = NULL,

We specify the plan/analysis structure here. You may use one of the following combinations:

• plan_analysis_fn_name (rarely used)
• for_each_plan (plan-heavy, one analysis per plan)
• for_each_plan + for_each_analysis (typically analysis-heavy)

plan_analysis_fn_name is a (rarely used) function that will provide a list containing the plan/analysis structure. It is generally only used when the plan/analysis structure needs to be reactive depending upon some external data (e.g. “an unknown number of data files are provided each day and need to be cleaned”).

for_each_plan is a list, with each element corresponding to a plan defined by a named list. Within this named list, each of the named elements will be translated into argset elements that are available for the respective plans. This particular for_each_plan defines a task with 356 plans (one for each municipality).

for_each_analysis is nearly the same as for_each_plan. It specifies what kind of analyses you would like to perform within each plan. It is a named list, with each element corresponding to an analysis defined by a named list. Within this named list, each of the named elements will be translated into argset elements that are available for the respective analyses.

An example of a for_each_plan that would correspond to 11 plans (one for each county):

options(width = 150)
for_each_plan = plnr::expand_list(
  location_code = fhidata::norway_locations_names()[granularity_geo %in% c("county")]$location_code
)
for_each_plan
## [[1]]
## [[1]]$location_code
## [1] "county42"
##
##
## [[2]]
## [[2]]$location_code
## [1] "county34"
##
##
## [[3]]
## [[3]]$location_code
## [1] "county15"
##
##
## [[4]]
## [[4]]$location_code
## [1] "county18"
##
##
## [[5]]
## [[5]]$location_code
## [1] "county03"
##
##
## [[6]]
## [[6]]$location_code
## [1] "county11"
##
##
## [[7]]
## [[7]]$location_code
## [1] "county54"
##
##
## [[8]]
## [[8]]$location_code
## [1] "county50"
##
##
## [[9]]
## [[9]]$location_code
## [1] "county38"
##
##
## [[10]]
## [[10]]$location_code
## [1] "county46"
##
##
## [[11]]
## [[11]]$location_code
## [1] "county30"

fhidata::norway_locations_names() gives us location codes in Norway (try running it in your console). Implement a plan in your task which has the location codes of all municipalities (municip) in Norway.

### 4.4.3 Universal argset

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L31-L31
##
## 31 |     universal_argset = NULL,

Here we can specify a named list, where each of the named elements will be translated into argset elements that are available for all plans/analyses.

#### 4.4.3.1 Upsert/insert at end of each plan

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L32-L33
##
## 32 |     upsert_at_end_of_each_plan = FALSE,
## 33 |     insert_at_end_of_each_plan = FALSE,

If you include a schema called output, then these options will let you upsert/insert the returned value from action_fn_name at the end of each plan.

This is an important nuance, because when you write/develop your task, you can (typically) only write one function (action_fn_name) that is applied to all analyses. This means that if your action_fn wants to upsert/insert data to a schema, it (typically) will do this within every analysis. If you have an analysis-heavy task, then this will generate a lot of frequent traffic to the databases, which may affect performance. By using these flags, you can restrict the upsert/insert to the end of the plan, which may increase performance.

#### 4.4.3.2 action_fn_name

action_fn_name specifies the name of the function that corresponds to the action, i.e. the function that is called in every analysis. Note that:

• This is a string
• It must include the package name
• It is typically of the form PACKAGE::TASK_action

In our case the package is scskeleton and the TASK is weather_download_and_import_rawdata. Insert this in your task.
## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L34-L34
##
## 34 |     action_fn_name = "scexample::weather_download_and_import_rawdata_action",

#### 4.4.3.3 data_selector_fn_name

data_selector_fn_name specifies the name of the function that corresponds to the data selector, i.e. the function that is called at the start of every plan to provide data to all of the analyses inside the plan. Note that:

• This is a string
• It must include the package name
• It is typically of the form PACKAGE::TASK_data_selector

Try and guess what this would be in our example.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L35-L35
##
## 35 |     data_selector_fn_name = "scexample::weather_download_and_import_rawdata_data_selector",

#### 4.4.3.4 Schemas

The schemas argument specifies a named list, where each element consists of a schema. The names will be passed through as schema$name in action_fn_name and data_selector_fn_name. We must include both the schemas we get data from and the schemas we store data to.

In our example we do not yet have input data, so we only specify the schema we created earlier, which we called anon_example_weather_rawdata. This means you can remove the boilerplate input schema and replace SCHEMA_NAME_2 with anon_example_weather_rawdata.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L36-L41
##
## 36 |     schema = list(
## 37 |       # input
## 38 |
## 39 |       # output
## 40 |       "anon_example_weather_rawdata" = sc::config$schemas$anon_example_weather_rawdata
## 41 |     ),

Finally, create a small task description for the info field.

### 4.4.4 3. data_selector_fn

The third step in creating a task is defining a data selector function. This is the function that will perform the “one data-pull per plan” and subsequently provide the data to the action.

Go to script weather_download_and_import_rawdata.r.

Use the RStudio Addins menu to help you quickly insert code templates by clicking Insert action and data selector.

Just like that, a pre-made boilerplate is ready to go! Find the data_selector part of the script and replace TASK_NAME with our task name weather_download_and_import_rawdata.

#### 4.4.4.1 plnr::is_run_directly()

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L94-L101
##
##  94 |   if (plnr::is_run_directly()) {
##  96 |
##  97 |     index_plan <- 1
##  98 |
## 101 |   }

At the top of all data_selector_fns you will see a section of code wrapped inside if (plnr::is_run_directly()) {. This code will only be run if it is manually highlighted inside RStudio and then “run”. This is extremely beneficial to the user, because it means that the user can easily write small pieces of code that are only used during development, which will not be run when the code is run “properly”.

Sykdomspulsen Core uses these sections to let the user “jump” directly into the function. Look at the arguments for weather_download_and_import_rawdata_data_selector and you will see that it needs argset and schema.

The code inside if (plnr::is_run_directly()) { loads argset and schema for index_plan = 1. By running these lines, you can treat the inside of weather_download_and_import_rawdata_data_selector as an interactive script!

This makes the development of the code extremely easy as “everything is an interactive script”.

Check that you have an argset and a schema by running the lines within if (plnr::is_run_directly()) {}.
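To see why this works, here is a base-R emulation of the idea behind plnr::is_run_directly() (is_run_directly_sketch is a hypothetical stand-in, not plnr's actual implementation):

```r
# Base-R emulation of the idea behind plnr::is_run_directly() (a
# hypothetical stand-in, not plnr's implementation): the check is TRUE
# when the line is executed directly at the top level (i.e. highlighted
# and run in RStudio), and FALSE when it runs inside a function call.
is_run_directly_sketch <- function() {
  sys.nframe() <= 1  # only this helper's own frame on the call stack
}

inside_a_function <- function() is_run_directly_sketch()

is_run_directly_sketch()  # TRUE when run at the top level
inside_a_function()       # FALSE when called "properly"
```

This is how the same file can serve both as an interactive development script and as production code.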

### 4.4.5 Getting data

The majority of the data_selector_fn is concerned with selecting data (obviously). Remember that the data should be selected to meet the needs of the plan. If you have 11 plans (one for each county), then your data_selector_fn should only extract data for the county of interest.

Take a look at your argset for plan = 1. Since we do not have input data from a schema, we can remove the premade schema code. Instead we are going to get data from fhimaps::norway_lau2_map_b2020_default_dt, which provides latitudes (lat) and longitudes (long). Explore the available data by running fhimaps::norway_lau2_map_b2020_default_dt in your console. It returns a data.table. We only want the mean latitude and longitude of the specific location_code for this particular plan and analysis. Therefore, try to select only this data and call it gps.
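As a hint, the selection can be sketched with a made-up stand-in data.frame (the real data comes from fhimaps::norway_lau2_map_b2020_default_dt; map_dt and the coordinates below are invented for illustration):

```r
# Sketch of the per-plan selection, using a made-up stand-in for
# fhimaps::norway_lau2_map_b2020_default_dt (values are illustrative).
map_dt <- data.frame(
  location_code = c("municip0301", "municip0301", "municip1103"),
  lat  = c(59.9, 60.0, 58.9),
  long = c(10.7, 10.8, 5.7)
)
argset <- list(location_code = "municip0301")  # as provided by the plan

# mean lat/long for this plan's location_code only
sel <- map_dt[map_dt$location_code == argset$location_code, ]
gps <- data.frame(lat = mean(sel$lat), long = mean(sel$long))
gps  # lat 59.95, long 10.75
```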

Now we download the weather forecast for this specific location from an API by using httr::GET, with glue::glue to construct the right address:

httr::GET(glue::glue("https://api.met.no/weatherapi/locationforecast/2.0/classic?lat={gps$lat}&lon={gps$long}"), httr::content_type_xml()). Use xml2::read_xml() to read the content. Take a peek below and implement it yourself.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L103-L111
##
## 103 |   # find the mid lat/long for the specified location_code
## 104 |   gps <- fhimaps::norway_lau2_map_b2020_default_dt[location_code == argset$location_code,.(
## 105 |     lat = mean(lat),
## 106 |     long = mean(long)
## 107 |   )]
## 108 |
## 109 |   # download the forecast for the specified location_code
## 110 |   d <- httr::GET(glue::glue("https://api.met.no/weatherapi/locationforecast/2.0/classic?lat={gps$lat}&lon={gps$long}"), httr::content_type_xml())
## 111 |   d <- xml2::read_xml(d$content)
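If you want to experiment with the XML handling without calling the API, xml2 can parse a string directly. The fragment below is made up (loosely shaped like a forecast response) just to show the xml_find_all()/xml_attr() pattern:

```r
library(xml2)

# Made-up XML fragment, loosely shaped like a forecast response, to show
# the xml_find_all()/xml_attr() pattern without hitting the API.
doc <- read_xml('
<weatherdata>
  <time from="2022-03-01T00:00:00Z" to="2022-03-01T06:00:00Z">
    <location>
      <maxTemperature value="4.1"/>
      <minTemperature value="-0.3"/>
      <precipitation value="0.2"/>
    </location>
  </time>
</weatherdata>')

nodes <- xml_find_all(doc, ".//maxTemperature")
as.numeric(xml_attr(nodes, "value"))  # 4.1
```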

### 4.4.6 Returning data

data_selector_fn needs to return a named list. This will be made available to the user in action_fn (weather_download_and_import_rawdata_action) via the argument data.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L113-L116
##
## 113 |   # The variable returned must be a named list
## 114 |   retval <- list(
## 115 |     "data" = d
## 116 |   )

The entire data selector function should now look like this.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L88-L119
##
##  88 | # **** data_selector **** ----
##  90 | #' @param argset Argset
##  91 | #' @param schema DB Schema
##  92 | #' @export
##  94 |   if (plnr::is_run_directly()) {
##  96 |
##  97 |     index_plan <- 1
##  98 |
## 101 |   }
## 102 |
## 103 |   # find the mid lat/long for the specified location_code
## 104 |   gps <- fhimaps::norway_lau2_map_b2020_default_dt[location_code == argset$location_code,.(
## 105 |     lat = mean(lat),
## 106 |     long = mean(long)
## 107 |   )]
## 108 |
## 109 |   # download the forecast for the specified location_code
## 110 |   d <- httr::GET(glue::glue("https://api.met.no/weatherapi/locationforecast/2.0/classic?lat={gps$lat}&lon={gps$long}"), httr::content_type_xml())
## 111 |   d <- xml2::read_xml(d$content)
## 112 |
## 113 |   # The variable returned must be a named list
## 114 |   retval <- list(
## 115 |     "data" = d
## 116 |   )
## 117 |
## 118 |   retval
## 119 | }

Check that the data selector function works by restarting R (Ctrl+Shift+F10) and loading the package (Ctrl+Shift+L), then running through the data selector function line by line.

### 4.4.7 4. action_fn

The fourth step is defining an action function. This is the function that will perform the “action” within the analysis. That is, given:

• data
• argset
• schema

What do you actually want to do with them? Find the action part in your script and replace TASK_NAME with our task name weather_download_and_import_rawdata.

#### 4.4.7.1 plnr::is_run_directly()

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L95-L102
##
##  96 |
##  97 |     index_plan <- 1
##  98 |
## 101 |   }
## 102 |

At the top of all action_fns you will again see a section of code wrapped inside if (plnr::is_run_directly()) {. This works exactly the same as for the data_selector_fn.

Look at the arguments for weather_download_and_import_rawdata_action and you will see that it needs data, argset and schema. The code inside if (plnr::is_run_directly()) { loads data, argset and schema for index_plan = 1 and index_analysis = 1. By running these lines, you can treat the inside of weather_download_and_import_rawdata_action as an interactive script!

Check out the data, argset and schema you have by running these lines.

#### 4.4.7.2 argset$first_analysis

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L21-L24
##
## 21 |   # special case that runs before everything
## 22 |   if (argset$first_analysis == TRUE) {
## 23 |
## 24 |   }

This code is only run in the first analysis. It is typically used to drop rows in a database table, so that the following code can insert data (faster) instead of upserting data (slower). If you ran the full task at the beginning of this tutorial, you can place schema$anon_example_weather_rawdata$drop_all_rows() inside here to delete the stored data.

#### 4.4.7.3 Doing things

In this tutorial we do not go into much detail about how the data is processed, so for now copy the content of the action function into your action function. (You will find it commented out in your file.)

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L26-L80
##
## 26 |   a <- data$data
## 27 |
## 28 |   baz <- xml2::xml_find_all(a, ".//maxTemperature")
## 29 |   res <- vector("list", length = length(baz))
## 30 |   for (i in seq_along(baz)) {
## 31 |     parent <- xml2::xml_parent(baz[[i]])
## 32 |     grandparent <- xml2::xml_parent(parent)
## 33 |     time_from <- xml2::xml_attr(grandparent, "from")
## 34 |     time_to <- xml2::xml_attr(grandparent, "to")
## 35 |     x <- xml2::xml_find_all(parent, ".//minTemperature")
## 36 |     temp_min <- xml2::xml_attr(x, "value")
## 37 |     x <- xml2::xml_find_all(parent, ".//maxTemperature")
## 38 |     temp_max <- xml2::xml_attr(x, "value")
## 39 |     x <- xml2::xml_find_all(parent, ".//precipitation")
## 40 |     precip <- xml2::xml_attr(x, "value")
## 41 |     res[[i]] <- data.frame(
## 42 |       time_from = as.character(time_from),
## 43 |       time_to = as.character(time_to),
## 44 |       temp_max = as.numeric(temp_max),
## 45 |       temp_min = as.numeric(temp_min),
## 46 |       precip = as.numeric(precip)
## 47 |     )
## 48 |   }
## 49 |   res <- rbindlist(res)
## 50 |   res <- res[stringr::str_sub(time_from, 12, 13) %in% c("00", "06", "12", "18")]
## 51 |   res[, date := as.Date(stringr::str_sub(time_from, 1, 10))]
## 52 |   res[, N := .N, by = date]
## 53 |   res <- res[N == 4]
## 54 |   res <- res[
## 55 |     ,
## 56 |     .(
## 57 |       temp_max = max(temp_max),
## 58 |       temp_min = min(temp_min),
## 59 |       precip = sum(precip)
## 60 |     ),
## 61 |     keyby = .(date)
## 62 |   ]
## 63 |
## 64 |   # we look at the downloaded data
## 65 |   # res
## 66 |
## 67 |   # we now need to format it
## 68 |   res[, granularity_time := "day"]
## 69 |   res[, sex := "total"]
## 70 |   res[, age := "total"]
## 71 |   res[, location_code := argset$location_code]
## 72 |
## 73 |   # fill in missing structural variables
## 74 |   sc::fill_in_missing_v8(res, border = 2020)
## 75 |
## 77 |   # res
## 78 |
## 79 |   # put data in db table
## 80 |   schema$anon_example_weather_rawdata$insert_data(res)

Every analysis will perform this code.

Run through it line by line and pay special attention to: how the data from the data_selector_fn is accessed; the later part where the data is formatted and sc::fill_in_missing_v8(res, border = 2020) is used to fill in the mandatory data columns; and the end, where the data is inserted into the database.
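The middle part of the action (keeping only complete days and collapsing the four 6-hour intervals to daily values) can be sketched in base R with made-up numbers:

```r
# Base-R sketch of the aggregation step, with made-up numbers: keep only
# dates with all four 6-hour intervals, then collapse to daily values.
res <- data.frame(
  date     = as.Date("2022-03-01") + c(0, 0, 0, 0, 1),
  temp_max = c(4, 6, 5, 3, 7),
  temp_min = c(-1, 0, 1, -2, 2),
  precip   = c(0, 1.2, 0, 0.3, 0)
)

# drop incomplete days (2022-03-02 has only one interval here)
counts <- table(res$date)
res <- res[res$date %in% as.Date(names(counts)[counts == 4]), ]

# daily max temperature, min temperature, and total precipitation
daily <- do.call(rbind, lapply(split(res, res$date), function(d) {
  data.frame(date = d$date[1],
             temp_max = max(d$temp_max),
             temp_min = min(d$temp_min),
             precip   = sum(d$precip))
}))
daily  # one row: 2022-03-01, temp_max 6, temp_min -2, precip 1.5
```

The real task does the same with data.table (`res[, N := .N, by = date]`, `res[N == 4]`, then `keyby = .(date)`).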

#### 4.4.7.4 Accessing data from data_selector_fn

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L26-L26
##
## 26 |   a <- data$data

Here you see that we access the data that was passed to us from data_selector_fn.

#### 4.4.7.5 Structural data/sc::fill_in_missing_v8

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L68-L74
##
## 68 |   res[, granularity_time := "day"]
## 69 |   res[, sex := "total"]
## 70 |   res[, age := "total"]
## 71 |   res[, location_code := argset$location_code]
## 72 |
## 73 |   # fill in missing structural variables
## 74 |   sc::fill_in_missing_v8(res, border = 2020)

We have 16 structural data columns that we expect. These columns typically have a lot of redundancy (e.g. date, isoyear, isoyearweek). To make things easier, we provide a function called sc::fill_in_missing_v8 that uses the information present in the dataset to try and impute the missing structural data.
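To illustrate the redundancy, several of the structural columns can be derived from date alone. This is a base-R sketch of the idea, not sc::fill_in_missing_v8 itself (the exact formats sc uses may differ):

```r
# Base-R sketch of the idea behind sc::fill_in_missing_v8 (NOT the real
# implementation): several structural columns are derivable from date.
res <- data.frame(date = as.Date(c("2022-03-01", "2022-03-02")))
res$isoyear      <- as.integer(format(res$date, "%G"))  # ISO 8601 year
res$isoweek      <- as.integer(format(res$date, "%V"))  # ISO 8601 week
res$isoyearweek  <- sprintf("%d-%02d", res$isoyear, res$isoweek)
res$calyear      <- as.integer(format(res$date, "%Y"))
res$calmonth     <- as.integer(format(res$date, "%m"))
res$calyearmonth <- sprintf("%d-%02d", res$calyear, res$calmonth)
res
```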

#### 4.4.7.6 Insert/upsert to databases

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L80-L80
##
## 80 |   schema$anon_example_weather_rawdata$insert_data(res)

Here we insert the data to the database table.

Insert is an append (so the data cannot already exist in the database table), while upsert is “update (overwrite) if already exists, insert (append) if it doesn’t”.
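The difference can be sketched on a plain data.frame keyed by some columns (insert_sketch and upsert_sketch are hypothetical helpers written for illustration, not sc's API):

```r
# Sketch of insert vs upsert semantics on a keyed data.frame
# (insert_sketch/upsert_sketch are hypothetical helpers, not sc's API).
insert_sketch <- function(table, new, keys) {
  if (any(do.call(paste, table[keys]) %in% do.call(paste, new[keys]))) {
    stop("insert would duplicate existing keys")
  }
  rbind(table, new)  # plain append
}

upsert_sketch <- function(table, new, keys) {
  matched <- do.call(paste, table[keys]) %in% do.call(paste, new[keys])
  rbind(table[!matched, ], new)  # overwrite matches, append the rest
}

old <- data.frame(date = as.Date(c("2022-03-01", "2022-03-02")), precip = c(0.2, 1.0))
new <- data.frame(date = as.Date(c("2022-03-02", "2022-03-03")), precip = c(1.5, 0.0))

upsert_sketch(old, new, keys = "date")  # 3 rows; 2022-03-02 now has precip 1.5
```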

If you want to delete the data, use schema$NAME_DATABASE$drop_all_rows(); in our case, schema$anon_example_weather_rawdata$drop_all_rows().

#### 4.4.7.7 argset$last_analysis

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L82-L85
##
## 82 |   # special case that runs after everything
## 83 |   if (argset$last_analysis == TRUE) {
## 84 |
## 85 |   }

This code is only run if it is the last analysis. It is typically used to copy an internal database table (i.e. one that the public is not directly viewing) to an external database (i.e. one that the public is directly viewing).

By distinguishing between internal database tables (e.g. anon_webkhtint_test) and external database tables (e.g. anon_webkht_test) we can do whatever we want to anon_webkhtint_test while anon_webkht_test remains in place and untouched. This makes it less likely that any mistakes will affect any APIs or websites that the public uses.

### 4.4.8 Test the code

The action function should look like this.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_download_and_import_rawdata.r#L1-L86
##
##  1 | # **** action **** ----
##  3 | #' @param data Data
##  4 | #' @param argset Argset
##  5 | #' @param schema DB Schema
##  6 | #' @export
##  9 |
## 10 |   if (plnr::is_run_directly()) {
## 12 |
## 13 |     index_plan <- 1
## 14 |     index_analysis <- 1
## 15 |
## 17 |     argset <- sc::tm_get_argset("weather_download_and_import_rawdata", index_plan = index_plan, index_analysis = index_analysis)
## 19 |   }
## 20 |
## 21 |   # special case that runs before everything
## 22 |   if (argset$first_analysis == TRUE) {
## 23 |
## 24 |   }
## 25 |
## 26 |   a <- data$data
## 27 |
## 28 |   baz <- xml2::xml_find_all(a, ".//maxTemperature")
## 29 |   res <- vector("list", length = length(baz))
## 30 |   for (i in seq_along(baz)) {
## 31 |     parent <- xml2::xml_parent(baz[[i]])
## 32 |     grandparent <- xml2::xml_parent(parent)
## 33 |     time_from <- xml2::xml_attr(grandparent, "from")
## 34 |     time_to <- xml2::xml_attr(grandparent, "to")
## 35 |     x <- xml2::xml_find_all(parent, ".//minTemperature")
## 36 |     temp_min <- xml2::xml_attr(x, "value")
## 37 |     x <- xml2::xml_find_all(parent, ".//maxTemperature")
## 38 |     temp_max <- xml2::xml_attr(x, "value")
## 39 |     x <- xml2::xml_find_all(parent, ".//precipitation")
## 40 |     precip <- xml2::xml_attr(x, "value")
## 41 |     res[[i]] <- data.frame(
## 42 |       time_from = as.character(time_from),
## 43 |       time_to = as.character(time_to),
## 44 |       temp_max = as.numeric(temp_max),
## 45 |       temp_min = as.numeric(temp_min),
## 46 |       precip = as.numeric(precip)
## 47 |     )
## 48 |   }
## 49 |   res <- rbindlist(res)
## 50 |   res <- res[stringr::str_sub(time_from, 12, 13) %in% c("00", "06", "12", "18")]
## 51 |   res[, date := as.Date(stringr::str_sub(time_from, 1, 10))]
## 52 |   res[, N := .N, by = date]
## 53 |   res <- res[N == 4]
## 54 |   res <- res[
## 55 |     ,
## 56 |     .(
## 57 |       temp_max = max(temp_max),
## 58 |       temp_min = min(temp_min),
## 59 |       precip = sum(precip)
## 60 |     ),
## 61 |     keyby = .(date)
## 62 |   ]
## 63 |
## 65 |   # res
## 66 |
## 67 |   # we now need to format it
## 68 |   res[, granularity_time := "day"]
## 69 |   res[, sex := "total"]
## 70 |   res[, age := "total"]
## 71 |   res[, location_code := argset$location_code]
## 72 |
## 73 |   # fill in missing structural variables
## 74 |   sc::fill_in_missing_v8(res, border = 2020)
## 75 |
## 76 |   # we look at the downloaded data
## 77 |   # res
## 78 |
## 79 |   # put data in db table
## 80 |   schema$anon_example_weather_rawdata$insert_data(res)
## 81 |
## 82 |   # special case that runs after everything
## 83 |   if (argset$last_analysis == TRUE) {
## 84 |
## 85 |   }
## 86 | }

Try restarting and loading all, then run the code line by line.

### 4.4.9 Which plan/analysis is which?

Inside the if (plnr::is_run_directly()) { sections, you specify index_plan and index_analysis. However, these are just numbers. If you want to specifically look at the plan for Oslo municipality, how do you know which index_plan this corresponds to?

options(width = 150)
sc::tm_get_plans_argsets_as_dt("weather_download_and_import_rawdata")
##      index_plan index_analysis **universal** **plan** location_code **analysis** **automatic** index      today  yesterday first_analysis
##   1:          1              1             *        *   municip1820            *             *     1 2022-03-01 2022-02-28           TRUE
##   2:          2              1             *        *   municip5403            *             *     2 2022-03-01 2022-02-28          FALSE
##   3:          3              1             *        *   municip3428            *             *     3 2022-03-01 2022-02-28          FALSE
##   4:          4              1             *        *   municip4631            *             *     4 2022-03-01 2022-02-28          FALSE
##   5:          5              1             *        *   municip1871            *             *     5 2022-03-01 2022-02-28          FALSE
##  ---
## 352:        352              1             *        *   municip3442            *             *   352 2022-03-01 2022-02-28          FALSE
## 353:        353              1             *        *   municip3048            *             *   353 2022-03-01 2022-02-28          FALSE
## 354:        354              1             *        *   municip3440            *             *   354 2022-03-01 2022-02-28          FALSE
## 355:        355              1             *        *   municip4626            *             *   355 2022-03-01 2022-02-28          FALSE
## 356:        356              1             *        *   municip3453            *             *   356 2022-03-01 2022-02-28          FALSE
##      first_argset last_analysis last_argset
##   1:         TRUE         FALSE       FALSE
##   2:        FALSE         FALSE       FALSE
##   3:        FALSE         FALSE       FALSE
##   4:        FALSE         FALSE       FALSE
##   5:        FALSE         FALSE       FALSE
##  ---
## 352:        FALSE         FALSE       FALSE
## 353:        FALSE         FALSE       FALSE
## 354:        FALSE         FALSE       FALSE
## 355:        FALSE         FALSE       FALSE
## 356:        FALSE          TRUE        TRUE
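If you know the structural variables of the plan you are after, one practical approach is to filter the data.table returned by sc::tm_get_plans_argsets_as_dt. A sketch (the assumption here is that municip0301 is Oslo municipality's location code):

```r
# List every plan/analysis and its argset as one data.table
dt <- sc::tm_get_plans_argsets_as_dt("weather_download_and_import_rawdata")

# Look up the row whose argset matches what you want,
# e.g. Oslo municipality (municip0301 is an assumed code)
dt[location_code == "municip0301", .(index_plan, index_analysis, location_code)]
```

The index_plan in the matching row is the number you plug into sc::tm_get_argset when running the code directly.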

Try and change the plan number and run the script again.

Now you have implemented your first task by creating a schema, a task definition, a data selector function, and an action function! Congratulations!

Run the entire task by running tm_run_task("weather_download_and_import_rawdata").

## 4.5 Developing weather_clean_data

The task weather_clean_data focuses on cleaning the raw data and inserting it in another database table. That is, the data source is a Sykdomspulsen Core database table, and the output is also a Sykdomspulsen Core database table.

We will walk you through the development of weather_clean_data; however, the description of this task will be less comprehensive than for the previous task, focusing primarily on the parts that are novel.

We already mentioned that weather_clean_data cleans the raw data. We want this task to take the raw data we obtained in the previous task and aggregate it to obtain weather data for geographical regions other than municipalities. To do so we use some pre-made FHI functions, such as fhidata::make_skeleton, which makes a data table skeleton for the regions of interest, and fhidata::norway_locations_hierarchy, which converts location codes from one location code level to another.
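As a rough sketch of how these two helpers fit together (the dates and arguments below are illustrative, based on how the functions are called later in this tutorial):

```r
# A skeleton with one row per date/location for the requested granularities
multiskeleton_day <- fhidata::make_skeleton(
  date_min = as.Date("2022-01-01"),
  date_max = as.Date("2022-01-31"),
  granularity_geo = list("municip" = c("municip"))
)

# A mapping from municipality codes (from_code) to county codes (to_code),
# which can be joined to the skeleton on "location_code==from_code"
municip_to_county <- fhidata::norway_locations_hierarchy(
  from = "municip",
  to = "county"
)
```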

### 4.5.1 1. Schemas

First we start by creating a schema for the data we want this task to store. The structure is exactly the same as for the previous task, with name_access "anon" and temp_max, temp_min, and precip as additional columns. Try to create this schema in 03_db_schemas.r.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L66-L113
##
##  66 |   ## > anon_example_weather_data ----
##  68 |     name_access = c("anon"),
##  69 |     name_grouping = "example_weather",
##  70 |     name_variant = "data",
##  71 |     db_configs = sc::config$db_configs,
##  72 |     field_types = c(
##  73 |       "granularity_time" = "TEXT",
##  74 |       "granularity_geo" = "TEXT",
##  75 |       "country_iso3" = "TEXT",
##  76 |       "location_code" = "TEXT",
##  77 |       "border" = "INTEGER",
##  78 |       "age" = "TEXT",
##  79 |       "sex" = "TEXT",
##  80 |
##  81 |       "date" = "DATE",
##  82 |
##  83 |       "isoyear" = "INTEGER",
##  84 |       "isoweek" = "INTEGER",
##  85 |       "isoyearweek" = "TEXT",
##  86 |       "season" = "TEXT",
##  87 |       "seasonweek" = "DOUBLE",
##  88 |
##  89 |       "calyear" = "INTEGER",
##  90 |       "calmonth" = "INTEGER",
##  91 |       "calyearmonth" = "TEXT",
##  92 |
##  93 |       "temp_max" = "DOUBLE",
##  94 |       "temp_min" = "DOUBLE",
##  95 |       "precip" = "DOUBLE"
##  96 |     ),
##  97 |     keys = c(
##  98 |       "granularity_time",
##  99 |       "location_code",
## 100 |       "date",
## 101 |       "age",
## 102 |       "sex"
## 103 |     ),
## 104 |     censors = list(
## 105 |       anon = list(
## 106 |
## 107 |       )
## 108 |     ),
## 109 |     validator_field_types = sc::validator_field_types_sykdomspulsen,
## 110 |     validator_field_contents = sc::validator_field_contents_sykdomspulsen,
## 111 |     info = "This db table is used for..."
## 112 |   )
## 113 |

### 4.5.2 2. Task definition (task_from_config)

The next step is to define the task in 04_tasks.r. For this task the aim is to aggregate data to higher levels, meaning we need all of the data available at the same time and we perform the entire task in one analysis. Hence we need a task with only one plan (x = 1). We need the schema from the database table created in the previous task as the input schema, and the schema we just implemented as the output schema. Try to create this task definition! (Remember to use the addins menu to get a boilerplate task definition.)
#### 4.5.2.1 Plan/analysis structure

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L52-L56
##
## 52 |     plan_analysis_fn_name = NULL,
## 53 |     for_each_plan = plnr::expand_list(
## 54 |       x = 1
## 55 |     ),
## 56 |     for_each_analysis = NULL,

For this particular task, we have decided to only implement one plan containing one analysis, which will process all of the data at once. If we were only aggregating municipality data to the county level, we could have implemented 11 plans (one for each county). However, because we are also aggregating to the national level, we need all of the data available at once.

#### 4.5.2.2 Schemas

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L62-L68
##
## 62 |     schema = list(
## 63 |       # input
## 64 |       "anon_example_weather_rawdata" = sc::config$schemas$anon_example_weather_rawdata,
## 65 |
## 66 |       # output
## 67 |       "anon_example_weather_data" = sc::config$schemas$anon_example_weather_data
## 68 |     ),

We need to specify the schemas that are used for both input and output.
#### 4.5.2.3 Full task description

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L45-L70
##
## 45 |   ## > weather_clean_data ----
## 46 |   # tm_run_task("weather_clean_data")
## 47 |   sc::add_task_from_config_v8(
## 48 |     name_grouping = "weather",
## 49 |     name_action = "clean_data",
## 50 |     name_variant = NULL,
## 51 |     cores = 1,
## 52 |     plan_analysis_fn_name = NULL,
## 53 |     for_each_plan = plnr::expand_list(
## 54 |       x = 1
## 55 |     ),
## 56 |     for_each_analysis = NULL,
## 57 |     universal_argset = NULL,
## 58 |     upsert_at_end_of_each_plan = FALSE,
## 59 |     insert_at_end_of_each_plan = FALSE,
## 60 |     action_fn_name = "scexample::weather_clean_data_action",
## 61 |     data_selector_fn_name = "scexample::weather_clean_data_data_selector",
## 62 |     schema = list(
## 63 |       # input
## 64 |       "anon_example_weather_rawdata" = sc::config$schemas$anon_example_weather_rawdata,
## 65 |
## 66 |       # output
## 67 |       "anon_example_weather_data" = sc::config$schemas$anon_example_weather_data
## 68 |     ),
## 69 |     info = "This task cleans the raw data and aggregates it to county and national level"
## 70 |   )

### 4.5.3 3. data_selector_fn

Now we are ready to create the data selector function. Go to the script weather_clean_data.r. Use the addins menu as before to get a boilerplate for the action function and the data selector function, and scroll down to the data selector part. Start by inserting your task name instead of TASK_NAME.

#### 4.5.3.1 Getting data (specify the schema)

Next, fill in the name of the input schema instead of SCHEMA_NAME, connecting to the database table linked to the schema.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L200-L200
##
## 200 |   d <- schema$anon_example_weather_rawdata$tbl() %>%

#### 4.5.3.2 Getting data (sc::mandatory_db_filter)

We then introduce sc::mandatory_db_filter. This is a filter on the most common structural variables.
We say this is “mandatory” because we want the user to always keep in mind:

- The minimal amount of data needed to do the job
- To be as explicit as possible about what data is needed to do the job

Fill in the mandatory filters as best you can, and take a peek below if you are not sure.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L201-L212
##
## 201 |     sc::mandatory_db_filter(
## 202 |       granularity_time = "day",
## 203 |       granularity_time_not = NULL,
## 204 |       granularity_geo = "municip",
## 205 |       granularity_geo_not = NULL,
## 206 |       country_iso3 = NULL,
## 207 |       location_code = NULL,
## 208 |       age = "total",
## 209 |       age_not = NULL,
## 210 |       sex = "total",
## 211 |       sex_not = NULL
## 212 |     ) %>%

You will notice that we don’t use all of the arguments passed into the function, but we use as many as we can.

#### 4.5.3.3 Getting data (dplyr::select)

We always want to be as explicit as possible about what data is needed to do the job. To achieve this, we use dplyr::select to select the columns that we are interested in. If you want to quickly generate a dplyr::select boilerplate for your schema that you can copy/paste, you can do this via the following:

schema$anon_example_weather_rawdata$print_dplyr_select()

## dplyr::select(
##   granularity_time,
##   granularity_geo,
##   country_iso3,
##   location_code,
##   border,
##   age,
##   sex,
##   date,
##   isoyear,
##   isoweek,
##   isoyearweek,
##   season,
##   seasonweek,
##   calyear,
##   calmonth,
##   calyearmonth,
##   temp_max,
##   temp_min,
##   precip
## ) %>%

Use this function and replace the dplyr::select part in your data selector function. To aggregate the data we need location_code, date, temp_max, temp_min, precip, and granularity_time (daily, weekly, etc.). Comment out the other variables.
## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L213-L237
##
## 213 |     dplyr::select(
## 214 |       granularity_time,
## 215 |       # granularity_geo,
## 216 |       # country_iso3,
## 217 |       location_code,
## 218 |       # border,
## 219 |       # age,
## 220 |       # sex,
## 221 |
## 222 |       date,
## 223 |
## 224 |       # isoyear,
## 225 |       # isoweek,
## 226 |       # isoyearweek,
## 227 |       # season,
## 228 |       # seasonweek,
## 229 |
## 230 |       # calyear,
## 231 |       # calmonth,
## 232 |       # calyearmonth,
## 233 |
## 234 |       temp_max,
## 235 |       temp_min,
## 236 |       precip
## 237 |     ) %>%

#### 4.5.3.4 Getting data (dplyr::collect)

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L238-L238
##
## 238 |     dplyr::collect() %>%

This executes the SQL call to the database.

#### 4.5.3.5 Getting data (data.table and setorder)

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L239-L243
##
## 239 |     as.data.table() %>%
## 240 |     setorder(
## 241 |       location_code,
## 242 |       date
## 243 |     )

Firstly, as a general rule we prefer to use data.table, so we convert our data.frame to a data.table. Secondly, we are not guaranteed to receive our data in any particular order. Because of this, it is very important that we sort our data on arrival (if this is relevant to the action_fn, e.g. if cumulative sums are created).

#### 4.5.3.6 Set a name

Finally, give the dataset you return a suitable name, for example day_municip. Check that your data selector function works by saving, restarting, and loading all packages. Then run through the function line by line.
#### 4.5.3.7 Example of the data_selector function

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L184-L251
##
## 184 | # **** data_selector **** ----
## 185 | #' weather_clean_data (data selector)
## 186 | #' @param argset Argset
## 187 | #' @param schema DB Schema
## 188 | #' @export
## 189 | weather_clean_data_data_selector <- function(argset, schema) {
## 190 |   if (plnr::is_run_directly()) {
## 191 |     # sc::tm_get_plans_argsets_as_dt("weather_clean_data")
## 192 |
## 193 |     index_plan <- 1
## 194 |
## 195 |     argset <- sc::tm_get_argset("weather_clean_data", index_plan = index_plan)
## 196 |     schema <- sc::tm_get_schema("weather_clean_data")
## 197 |   }
## 198 |
## 199 |   # The database schemas can be accessed here
## 200 |   d <- schema$anon_example_weather_rawdata$tbl() %>%
## 201 |     sc::mandatory_db_filter(
## 202 |       granularity_time = "day",
## 203 |       granularity_time_not = NULL,
## 204 |       granularity_geo = "municip",
## 205 |       granularity_geo_not = NULL,
## 206 |       country_iso3 = NULL,
## 207 |       location_code = NULL,
## 208 |       age = "total",
## 209 |       age_not = NULL,
## 210 |       sex = "total",
## 211 |       sex_not = NULL
## 212 |     ) %>%
## 213 |     dplyr::select(
## 214 |       granularity_time,
## 215 |       # granularity_geo,
## 216 |       # country_iso3,
## 217 |       location_code,
## 218 |       # border,
## 219 |       # age,
## 220 |       # sex,
## 221 |
## 222 |       date,
## 223 |
## 224 |       # isoyear,
## 225 |       # isoweek,
## 226 |       # isoyearweek,
## 227 |       # season,
## 228 |       # seasonweek,
## 229 |
## 230 |       # calyear,
## 231 |       # calmonth,
## 232 |       # calyearmonth,
## 233 |
## 234 |       temp_max,
## 235 |       temp_min,
## 236 |       precip
## 237 |     ) %>%
## 238 |     dplyr::collect() %>%
## 239 |     as.data.table() %>%
## 240 |     setorder(
## 241 |       location_code,
## 242 |       date
## 243 |     )
## 244 |
## 245 |   # The variable returned must be a named list
## 246 |   retval <- list(
## 247 |     "day_municip" = d
## 248 |   )
## 249 |
## 250 |   retval
## 251 | }

### 4.5.4 4. action_fn

The final step in the process is creating the action function. Replace TASK_NAME with your task name.

#### 4.5.4.1 Skeleton

In this action function we use FHI skeletons to create bases for our data tables. You can read about the concept of skeletons at https://folkehelseinstituttet.github.io/fhidata/articles/Skeletons.html.

Start by creating a variable (for example d_agg) holding an empty list, and copy the data collected in the data selector function into it (d_agg$day_municip <- copy(data$day_municip)). Extract the first and last date from this dataset.

Now we are going to create a skeleton where we separate between regions where we have data (municipalities) and regions where we do not have data (bo- og arbeidsregioner). The skeleton function takes the min and max dates, and in this case we will pass it a granularity_geo consisting of the following list:

list(
  "nodata" = c(
    "wardoslo",
    "extrawardoslo",
    "missingwardoslo",
    "wardbergen",
    "missingwardbergen",
    "wardstavanger",
    "missingwardstavanger",
    "notmainlandmunicip",
    "missingmunicip",
    "notmainlandcounty",
    "missingcounty"
  ),
  "municip" = c(
    "municip"
  )
)

## $nodata
##  [1] "wardoslo"             "extrawardoslo"        "missingwardoslo"      "wardbergen"           "missingwardbergen"    "wardstavanger"
##  [7] "missingwardstavanger" "notmainlandmunicip"   "missingmunicip"       "notmainlandcounty"    "missingcounty"
##
## $municip
## [1] "municip"

#### 4.5.4.2 Merge in weather data

Next we want to merge the weather information we have for the municipalities into this skeleton.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L67-L84
##
## 67 |   # Merge in the information you have at different geographical granularities
## 68 |   # one level at a time
## 69 |   # municip
## 70 |   multiskeleton_day$municip[
## 71 |     d_agg$day_municip,
## 72 |     on = c("location_code", "date"),
## 73 |     c(
## 74 |       "temp_max",
## 75 |       "temp_min",
## 76 |       "precip"
## 77 |     ) := .(
## 78 |       temp_max,
## 79 |       temp_min,
## 80 |       precip
## 81 |     )
## 82 |   ]
## 83 |
## 84 |   multiskeleton_day$municip[]

#### 4.5.4.3 Aggregate to a county level

Now aggregate the data to a county level with the help of fhidata::norway_locations_hierarchy

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L86-L109
##
##  86 |   # Aggregate up to higher geographical granularities (county)
##  87 |   multiskeleton_day$county <- multiskeleton_day$municip[
##  88 |     fhidata::norway_locations_hierarchy(
##  89 |       from = "municip",
##  90 |       to = "county"
##  91 |     ),
##  92 |     on = c(
##  93 |       "location_code==from_code"
##  94 |     )
##  95 |   ][,
##  96 |     .(
##  97 |       temp_max = mean(temp_max, na.rm = T),
##  98 |       temp_min = mean(temp_min, na.rm = T),
##  99 |       precip = mean(precip, na.rm = T),
## 100 |       granularity_geo = "county"
## 101 |     ),
## 102 |     by = .(
## 103 |       granularity_time,
## 104 |       date,
## 105 |       location_code = to_code
## 106 |     )
## 107 |   ]
## 108 |
## 109 |   multiskeleton_day$county[]

#### 4.5.4.4 Aggregate to a national level

Similarly, aggregate the data to the national level. No location hierarchy is needed here, because every municipality aggregates into the single location_code "norge".

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L111-L125
##
## 111 |   # Aggregate up to higher geographical granularities (nation)
## 112 |   multiskeleton_day$nation <- multiskeleton_day$municip[
## 113 |     ,
## 114 |     .(
## 115 |       temp_max = mean(temp_max, na.rm = T),
## 116 |       temp_min = mean(temp_min, na.rm = T),
## 117 |       precip = mean(precip, na.rm = T),
## 118 |       granularity_geo = "nation",
## 119 |       location_code = "norge"
## 120 |     ),
## 121 |     by = .(
## 122 |       granularity_time,
## 123 |       date
## 124 |     )
## 125 |   ]
#### 4.5.4.5 Combine data

Combine all the different granularity_geos by using rbindlist, storing the result under a new name, e.g. skeleton_day.
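A minimal sketch of this step, mirroring the full example at the end of this section:

```r
library(data.table)

# combine the municip/county/nation skeletons into one data.table;
# fill = TRUE pads columns that are missing from some skeletons with NA
skeleton_day <- rbindlist(multiskeleton_day, fill = TRUE, use.names = TRUE)
```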

#### 4.5.4.6 Weekly data

As a challenge, try to aggregate the daily data to weekly data! You can use fhiplot::isoyearweek_c(date) to get the isoyearweek of a date.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L134-L153
##
## 134 |   # 10. (If desirable) aggregate up to higher time granularities
## 135 |   # if necessary, it is now easy to aggregate up to weekly data from here
## 136 |   skeleton_isoweek <- copy(skeleton_day)
## 137 |   skeleton_isoweek[, isoyearweek := fhiplot::isoyearweek_c(date)]
## 138 |   skeleton_isoweek <- skeleton_isoweek[
## 139 |     ,
## 140 |     .(
## 141 |       temp_max = mean(temp_max, na.rm = T),
## 142 |       temp_min = mean(temp_min, na.rm = T),
## 143 |       precip = mean(precip, na.rm = T),
## 144 |       granularity_time = "isoweek"
## 145 |     ),
## 146 |     keyby = .(
## 147 |       isoyearweek,
## 148 |       granularity_geo,
## 149 |       location_code
## 150 |     )
## 151 |   ]
## 152 |
## 153 |   skeleton_isoweek[]

#### 4.5.4.7 Structural data

The next step is to fill in all of the missing structural data. Fill in sex = "total" and age = "total" manually; then you can use sc::fill_in_missing_v8(skeleton_day, border = config$border). For the weekly data, make sure to also convert the date using as.Date(date) to ensure it is in the right format. rbindlist then binds the two data tables together.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L155-L173
##
## 155 |   # we now need to format it and fill in missing structural variables
## 156 |   # day
## 157 |   skeleton_day[, sex := "total"]
## 158 |   skeleton_day[, age := "total"]
## 159 |   sc::fill_in_missing_v8(skeleton_day, border = config$border)
## 160 |
## 161 |   # isoweek
## 162 |   skeleton_isoweek[, sex := "total"]
## 163 |   skeleton_isoweek[, age := "total"]
## 164 |   sc::fill_in_missing_v8(skeleton_isoweek, border = config$border)
## 165 |   skeleton_isoweek[, date := as.Date(date)]
## 166 |
## 167 |   skeleton <- rbindlist(
## 168 |     list(
## 169 |       skeleton_day,
## 170 |       skeleton_isoweek
## 171 |     ),
## 172 |     use.names = T
## 173 |   )

#### 4.5.4.8 Store the data

Insert the final data table into the database table specified in the task description.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L175-L176
##
## 175 |   # put data in db table
## 176 |   schema$anon_example_weather_data$drop_all_rows_and_then_insert_data(skeleton)

Restart R, load all packages, and try to run the task. Run the entire task by running tm_run_task("weather_clean_data"). If you ran the tasks at the beginning of the script, you might need to run schema$anon_example_weather_data$drop_all_rows() first.

#### 4.5.4.9 Full example

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_clean_data.r#L1-L182
##
##   1 | # **** action **** ----
##   2 | #' weather_clean_data (action)
##   3 | #' @param data Data
##   4 | #' @param argset Argset
##   5 | #' @param schema DB Schema
##   6 | #' @export
##   7 | weather_clean_data_action <- function(data, argset, schema) {
##   8 |   # tm_run_task("weather_clean_data")
##   9 |
##  10 |   if (plnr::is_run_directly()) {
##  11 |     # sc::tm_get_plans_argsets_as_dt("weather_clean_data")
##  12 |
##  13 |     index_plan <- 1
##  14 |     index_analysis <- 1
##  15 |
##  16 |     data <- sc::tm_get_data("weather_clean_data", index_plan = index_plan)
##  17 |     argset <- sc::tm_get_argset("weather_clean_data", index_plan = index_plan, index_analysis = index_analysis)
##  18 |     schema <- sc::tm_get_schema("weather_clean_data")
##  19 |   }
##  20 |
##  21 |   # special case that runs before everything
##  22 |   if (argset$first_analysis == TRUE) {
##  23 |
##  24 |   }
##  25 |
##  26 |   # make sure there's no missing data via the creation of a skeleton
##  27 |   # https://folkehelseinstituttet.github.io/fhidata/articles/Skeletons.html
##  28 |
##  29 |   # Create a variable (possibly a list) to hold the data
##  30 |   d_agg <- list()
##  31 |   d_agg$day_municip <- copy(data$day_municip)
##  32 |
##  33 |   # Pull out important dates
##  34 |   date_min <- min(d_agg$day_municip$date, na.rm = T)
##  35 |   date_max <- max(d_agg$day_municip$date, na.rm = T)
##  36 |
##  37 |   # Create multiskeleton
##  38 |   # granularity_geo should have the following groups:
##  39 |   # - nodata (when no data is available, and there is no "finer" data available to aggregate up)
##  40 |   # - all levels of granularity_geo where you have data available
##  41 |   # If you do not have data for a specific granularity_geo, but there is "finer" data available
##  42 |   # then you should not include this granularity_geo in the multiskeleton, because you will create
##  43 |   # it later when you aggregate up your data (baregion)
##  44 |   multiskeleton_day <- fhidata::make_skeleton(
##  45 |     date_min = date_min,
##  46 |     date_max = date_max,
##  47 |     granularity_geo = list(
##  48 |       "nodata" = c(
##  49 |         "wardoslo",
##  50 |         "extrawardoslo",
##  51 |         "missingwardoslo",
##  52 |         "wardbergen",
##  53 |         "missingwardbergen",
##  54 |         "wardstavanger",
##  55 |         "missingwardstavanger",
##  56 |         "notmainlandmunicip",
##  57 |         "missingmunicip",
##  58 |         "notmainlandcounty",
##  59 |         "missingcounty"
##  60 |       ),
##  61 |       "municip" = c(
##  62 |         "municip"
##  63 |       )
##  64 |     )
##  65 |   )
##  66 |
##  67 |   # Merge in the information you have at different geographical granularities
##  68 |   # one level at a time
##  69 |   # municip
##  70 |   multiskeleton_day$municip[
##  71 |     d_agg$day_municip,
##  72 |     on = c("location_code", "date"),
##  73 |     c(
##  74 |       "temp_max",
##  75 |       "temp_min",
##  76 |       "precip"
##  77 |     ) := .(
##  78 |       temp_max,
##  79 |       temp_min,
##  80 |       precip
##  81 |     )
##  82 |   ]
##  83 |
##  84 |   multiskeleton_day$municip[]
##  85 |
##  86 |   # Aggregate up to higher geographical granularities (county)
##  87 |   multiskeleton_day$county <- multiskeleton_day$municip[
##  88 |     fhidata::norway_locations_hierarchy(
##  89 |       from = "municip",
##  90 |       to = "county"
##  91 |     ),
##  92 |     on = c(
##  93 |       "location_code==from_code"
##  94 |     )
##  95 |   ][,
##  96 |     .(
##  97 |       temp_max = mean(temp_max, na.rm = T),
##  98 |       temp_min = mean(temp_min, na.rm = T),
##  99 |       precip = mean(precip, na.rm = T),
## 100 |       granularity_geo = "county"
## 101 |     ),
## 102 |     by = .(
## 103 |       granularity_time,
## 104 |       date,
## 105 |       location_code = to_code
## 106 |     )
## 107 |   ]
## 108 |
## 109 |   multiskeleton_day$county[]
## 110 |
## 111 |   # Aggregate up to higher geographical granularities (nation)
## 112 |   multiskeleton_day$nation <- multiskeleton_day$municip[
## 113 |     ,
## 114 |     .(
## 115 |       temp_max = mean(temp_max, na.rm = T),
## 116 |       temp_min = mean(temp_min, na.rm = T),
## 117 |       precip = mean(precip, na.rm = T),
## 118 |       granularity_geo = "nation",
## 119 |       location_code = "norge"
## 120 |     ),
## 121 |     by = .(
## 122 |       granularity_time,
## 123 |       date
## 124 |     )
## 125 |   ]
## 126 |
## 127 |   multiskeleton_day$nation[]
## 128 |
## 129 |   # combine all the different granularity_geos
## 130 |   skeleton_day <- rbindlist(multiskeleton_day, fill = TRUE, use.names = TRUE)
## 131 |
## 132 |   skeleton_day[]
## 133 |
## 134 |   # 10. (If desirable) aggregate up to higher time granularities
## 135 |   # if necessary, it is now easy to aggregate up to weekly data from here
## 136 |   skeleton_isoweek <- copy(skeleton_day)
## 137 |   skeleton_isoweek[, isoyearweek := fhiplot::isoyearweek_c(date)]
## 138 |   skeleton_isoweek <- skeleton_isoweek[
## 139 |     ,
## 140 |     .(
## 141 |       temp_max = mean(temp_max, na.rm = T),
## 142 |       temp_min = mean(temp_min, na.rm = T),
## 143 |       precip = mean(precip, na.rm = T),
## 144 |       granularity_time = "isoweek"
## 145 |     ),
## 146 |     keyby = .(
## 147 |       isoyearweek,
## 148 |       granularity_geo,
## 149 |       location_code
## 150 |     )
## 151 |   ]
## 152 |
## 153 |   skeleton_isoweek[]
## 154 |
## 155 |   # we now need to format it and fill in missing structural variables
## 156 |   # day
## 157 |   skeleton_day[, sex := "total"]
## 158 |   skeleton_day[, age := "total"]
## 159 |   sc::fill_in_missing_v8(skeleton_day, border = config$border)
## 160 |
## 161 |   # isoweek
## 162 |   skeleton_isoweek[, sex := "total"]
## 163 |   skeleton_isoweek[, age := "total"]
## 164 |   sc::fill_in_missing_v8(skeleton_isoweek, border = config$border)
## 165 |   skeleton_isoweek[, date := as.Date(date)]
## 166 |
## 167 |   skeleton <- rbindlist(
## 168 |     list(
## 169 |       skeleton_day,
## 170 |       skeleton_isoweek
## 171 |     ),
## 172 |     use.names = T
## 173 |   )
## 174 |
## 175 |   # put data in db table
## 176 |   schema$anon_example_weather_data$drop_all_rows_and_then_insert_data(skeleton)
## 177 |
## 178 |   # special case that runs after everything
## 179 |   if (argset$last_analysis == TRUE) {
## 180 |
## 181 |   }
## 182 | }

## 4.6 Developing weather_export_plots

The final task of this tutorial, weather_export_plots, takes the cleaned data and plots 11 graphs (one for each county) of min and max temperatures. This means we need 11 plans, one for each county. We use the data generated by weather_clean_data as input, so we do not need to create a new schema. We are going to pass a few universal argsets through the task definition to define where the figures are stored.

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L84-L88
##
## 84 |     universal_argset = list(
## 85 |       output_dir = tempdir(),
## 86 |       output_filename = "weather_{argset$location_code}.png",
## 87 |       output_absolute_path = fs::path("{argset$output_dir}", "{argset$output_filename}")
## 88 |     ),

The benefits of placing the output directories and filenames in the task declaration are:

- It makes your action_fn more generic, so it can be reused by multiple tasks
- It is easier to get an overview of where the output is being sent
- “More decisions” in the task config and “fewer decisions” in the action_fn make the system easier for everyone to understand, because the decisions become more explicit

Everything inside the curly brackets gets passed through to the action function. Each plan only needs the data for that specific location_code, which can be implemented in the mandatory filters in the data selector function. fs::dir_create(glue::glue(argset$output_dir)) can be used to create the output directory.

Try putting everything you have learned so far together and create this task by yourself. If you get stuck, you can always peek below. Good luck!
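As a sketch of how the universal argset can be used inside the action function (q is an assumed ggplot2 object built from the plan's data):

```r
# the argset strings contain glue placeholders, so evaluate them first
fs::dir_create(glue::glue(argset$output_dir))

# save this plan's plot to the path defined in the task definition
ggplot2::ggsave(
  filename = glue::glue(argset$output_absolute_path),
  plot = q
)
```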

### 4.6.1 1. Schemas

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/03_db_schemas.r#L66-L112
##
##  66 |   ## > anon_example_weather_data ----
##  68 |     name_access = c("anon"),
##  69 |     name_grouping = "example_weather",
##  70 |     name_variant = "data",
##  71 |     db_configs = sc::config$db_configs,
##  72 |     field_types = c(
##  73 |       "granularity_time" = "TEXT",
##  74 |       "granularity_geo" = "TEXT",
##  75 |       "country_iso3" = "TEXT",
##  76 |       "location_code" = "TEXT",
##  77 |       "border" = "INTEGER",
##  78 |       "age" = "TEXT",
##  79 |       "sex" = "TEXT",
##  80 |
##  81 |       "date" = "DATE",
##  82 |
##  83 |       "isoyear" = "INTEGER",
##  84 |       "isoweek" = "INTEGER",
##  85 |       "isoyearweek" = "TEXT",
##  86 |       "season" = "TEXT",
##  87 |       "seasonweek" = "DOUBLE",
##  88 |
##  89 |       "calyear" = "INTEGER",
##  90 |       "calmonth" = "INTEGER",
##  91 |       "calyearmonth" = "TEXT",
##  92 |
##  93 |       "temp_max" = "DOUBLE",
##  94 |       "temp_min" = "DOUBLE",
##  95 |       "precip" = "DOUBLE"
##  96 |     ),
##  97 |     keys = c(
##  98 |       "granularity_time",
##  99 |       "location_code",
## 100 |       "date",
## 101 |       "age",
## 102 |       "sex"
## 103 |     ),
## 104 |     censors = list(
## 105 |       anon = list(
## 106 |
## 107 |       )
## 108 |     ),
## 109 |     validator_field_types = sc::validator_field_types_sykdomspulsen,
## 110 |     validator_field_contents = sc::validator_field_contents_sykdomspulsen,
## 111 |     info = "This db table is used for..."
## 112 |   )

This schema has already been created by the previous task weather_clean_data.

### 4.6.2 2. Task definition (task_from_config)

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L72-L100
##
##  72 |   ## > weather_export_plots ----
##  73 |   # tm_run_task("weather_export_plots")
##  74 |   sc::add_task_from_config_v8(
##  75 |     name_grouping = "weather",
##  76 |     name_action = "export_plots",
##  77 |     name_variant = NULL,
##  78 |     cores = 1,
##  79 |     plan_analysis_fn_name = NULL,
##  80 |     for_each_plan = plnr::expand_list(
##  81 |       location_code = fhidata::norway_locations_names()[granularity_geo %in% c("county")]$location_code
##  82 |     ),
##  83 |     for_each_analysis = NULL,
##  84 |     universal_argset = list(
##  85 |       output_dir = tempdir(),
##  86 |       output_filename = "weather_{argset$location_code}.png",
##  87 |       output_absolute_path = fs::path("{argset$output_dir}", "{argset$output_filename}")
##  88 |     ),
##  89 |     upsert_at_end_of_each_plan = FALSE,
##  90 |     insert_at_end_of_each_plan = FALSE,
##  91 |     action_fn_name = "scexample::weather_export_plots_action",
##  92 |     data_selector_fn_name = "scexample::weather_export_plots_data_selector",
##  93 |     schema = list(
##  94 |       # input
##  95 |       "anon_example_weather_data" = sc::config$schemas$anon_example_weather_data
##  96 |
##  97 |       # output
##  98 |     ),
##  99 |     info = "This task produces plots"
## 100 |   )

#### 4.6.2.1 Plan/analysis structure

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L79-L83
##
## 79 |     plan_analysis_fn_name = NULL,
## 80 |     for_each_plan = plnr::expand_list(
## 81 |       location_code = fhidata::norway_locations_names()[granularity_geo %in% c("county")]$location_code
## 82 |     ),
## 83 |     for_each_analysis = NULL,

Here we choose a plan-heavy approach (11 plans, 1 analysis per plan) to minimize the amount of data loaded into RAM at any point in time.
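The for_each_plan argument is what turns one task into many plans. A minimal base-R sketch of the idea, using a hypothetical expand_plans() helper in place of plnr::expand_list() and three made-up county codes instead of the real eleven:

```r
# Hypothetical stand-in for plnr::expand_list(): builds one argset
# (a named list) per combination of the supplied values.
expand_plans <- function(...) {
  grid <- expand.grid(..., stringsAsFactors = FALSE)
  lapply(seq_len(nrow(grid)), function(i) as.list(grid[i, , drop = FALSE]))
}

# Plan-heavy: one plan per location, one analysis per plan, so each
# plan only needs a single county's data in RAM at a time.
plans <- expand_plans(location_code = c("county03", "county11", "county18"))

length(plans)              # 3 plans (the real task has 11, one per county)
plans[[1]]$location_code   # "county03"
```

An analysis-heavy alternative would be one plan with eleven analyses, which loads all counties' data at once; the plan-heavy choice trades a little scheduling overhead for a smaller memory footprint.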

#### 4.6.2.2 Universal argset

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/04_tasks.r#L84-L88
##
## 84 |     universal_argset = list(
## 85 |       output_dir = tempdir(),
## 86 |       output_filename = "weather_{argset$location_code}.png",
## 87 |       output_absolute_path = fs::path("{argset$output_dir}", "{argset$output_filename}")
## 88 |     ),

### 4.6.3 3. data_selector_fn

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_export_plots.r#L45-L110
##
##  45 | # **** data_selector **** ----
##  46 | #' weather_export_plots (data selector)
##  47 | #' @param argset Argset
##  48 | #' @param schema DB Schema
##  49 | #' @export
##  50 | weather_export_plots_data_selector = function(argset, schema){
##  51 |   if(plnr::is_run_directly()){
##  52 |     # sc::tm_get_plans_argsets_as_dt("weather_export_plots")
##  53 |
##  54 |     index_plan <- 1
##  55 |
##  56 |     argset <- sc::tm_get_argset("weather_export_plots", index_plan = index_plan)
##  57 |     schema <- sc::tm_get_schema("weather_export_plots")
##  58 |   }
##  59 |
##  60 |   # The database schemas can be accessed here
##  61 |   d <- schema$anon_example_weather_data$tbl() %>%
##  62 |     sc::mandatory_db_filter(
##  63 |       granularity_time = NULL,
##  64 |       granularity_time_not = NULL,
##  65 |       granularity_geo = NULL,
##  66 |       granularity_geo_not = NULL,
##  67 |       country_iso3 = NULL,
##  68 |       location_code = argset$location_code,
##  69 |       age = NULL,
##  70 |       age_not = NULL,
##  71 |       sex = NULL,
##  72 |       sex_not = NULL
##  73 |     ) %>%
##  74 |     dplyr::select(
##  75 |       # granularity_time,
##  76 |       # granularity_geo,
##  77 |       # country_iso3,
##  78 |       # location_code,
##  79 |       # border,
##  80 |       # age,
##  81 |       # sex,
##  82 |
##  83 |       date,
##  84 |
##  85 |       # isoyear,
##  86 |       # isoweek,
##  87 |       # isoyearweek,
##  88 |       # season,
##  89 |       # seasonweek,
##  90 |       #
##  91 |       # calyear,
##  92 |       # calmonth,
##  93 |       # calyearmonth,
##  94 |
##  95 |       temp_max,
##  96 |       temp_min
##  97 |     ) %>%
##  98 |     dplyr::collect() %>%
##  99 |     as.data.table() %>%
## 100 |     setorder(
## 101 |       # location_code,
## 102 |       date
## 103 |     )
## 104 |
## 105 |   # The variable returned must be a named list
## 106 |   retval <- list(
## 107 |     "data" = d
## 108 |   )
## 109 |   retval
## 110 | }
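Stripped of the database layer, the contract of a data_selector_fn is simple: take the plan's argset, filter the source data down to that plan's slice, and return a named list. A base-R sketch with an in-memory data.frame standing in for the anon_example_weather_data table (toy values, for illustration only):

```r
# Toy data standing in for the anon_example_weather_data db table.
weather <- data.frame(
  location_code = rep(c("county03", "county11"), each = 2),
  date          = as.Date("2022-01-01") + c(0, 1, 0, 1),
  temp_max      = c(3.1, 4.0, 7.2, 6.5),
  temp_min      = c(-2.0, -1.5, 2.1, 1.8)
)

# Minimal data selector: filter to the plan's location (the job
# sc::mandatory_db_filter does in the real task), keep only the
# columns the action needs, order by date, and return a *named list*.
toy_data_selector <- function(argset) {
  d <- weather[weather$location_code == argset$location_code,
               c("date", "temp_max", "temp_min")]
  d <- d[order(d$date), ]
  list(data = d)
}

selected <- toy_data_selector(list(location_code = "county11"))
nrow(selected$data)   # 2 rows: only county11's slice is in RAM
```

Because the return value is a named list, the action later refers to it as data$data, and additional datasets could be returned alongside it under other names.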

### 4.6.4 4. action_fn

## https://github.com/sykdomspulsen-org/sc-tutorial-end/blob/main/R/weather_export_plots.r#L1-L43
##
##  1 | # **** action **** ----
##  2 | #' weather_export_plots (action)
##  3 | #' @param data Data
##  4 | #' @param argset Argset
##  5 | #' @param schema DB Schema
##  6 | #' @export
##  7 | weather_export_plots_action <- function(data, argset, schema) {
##  9 |
## 10 |   if(plnr::is_run_directly()){
## 11 |     # sc::tm_get_plans_argsets_as_dt("weather_export_plots")
## 12 |
## 13 |     index_plan <- 1
## 14 |     index_analysis <- 1
## 15 |
## 16 |     data <- sc::tm_get_data("weather_export_plots", index_plan = index_plan)
## 17 |     argset <- sc::tm_get_argset("weather_export_plots", index_plan = index_plan, index_analysis = index_analysis)
## 18 |     schema <- sc::tm_get_schema("weather_export_plots")
## 19 |   }
## 20 |
## 21 |   # code goes here
## 22 |   # special case that runs before everything
## 23 |   if(argset$first_analysis == TRUE){
## 24 |
## 25 |   }
## 26 |
## 27 |   # create the output_dir (if it doesn't exist)
## 28 |   fs::dir_create(glue::glue(argset$output_dir))
## 29 |
## 30 |   q <- ggplot(data$data, aes(x = date, ymin = temp_min, ymax = temp_max))
## 31 |   q <- q + geom_ribbon(alpha = 0.5)
## 32 |
## 33 |   ggsave(
## 34 |     filename = glue::glue(argset$output_absolute_path),
## 35 |     plot = q
## 36 |   )
## 37 |
## 38 |   # special case that runs after everything
## 39 |   # copy to anon_web?
## 40 |   if(argset$last_analysis == TRUE){
## 41 |
## 42 |   }
## 43 | }
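The "{argset$...}" templates from the universal argset are only resolved inside the action, via glue::glue(). A base-R sketch of that resolution step, with a hand-rolled interpolate() standing in for glue::glue() (hypothetical helper; the path values are illustrative):

```r
# Hand-rolled stand-in for glue::glue(): replaces every
# "{argset$<name>}" token in the template with the argset value.
interpolate <- function(template, argset) {
  for (name in names(argset)) {
    token <- paste0("{argset$", name, "}")
    template <- gsub(token, argset[[name]], template, fixed = TRUE)
  }
  template
}

argset <- list(
  location_code = "county03",
  output_dir    = "/tmp/weather_plots"
)

# As in the task config: the filename is stored as a template and
# resolved per plan, so each county gets its own output file.
filename <- interpolate("weather_{argset$location_code}.png", argset)
filename                                 # "weather_county03.png"
file.path(argset$output_dir, filename)   # "/tmp/weather_plots/weather_county03.png"
```

This is why the same universal_argset can be shared across all eleven plans: the templated values stay generic at configuration time and only become concrete once an argset for a specific plan is in hand.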

## 4.7 Final package

If you save your files, restart the R session, and load your sc implementation again, you can see which schemas have been loaded by running sc::tm_get_schema_names(). Note that the schemas you created are now included.

sc::tm_get_schema_names()
## [1] "config_last_updated"                                           "config_structure_time"
## [3] "rundate"                                                       "config_datetime"
## [5] "anon_example_weather_rawdata"                                  "anon_example_weather_data"
## [7] "anon_example_income"                                           "anon_example_house_prices"
## [9] "anon_example_house_prices_outliers_after_adjusting_for_income"

You can also see which tasks have been loaded by running sc::tm_get_task_names(). These tasks are included in the skeleton.

sc::tm_get_task_names()
## [3] "weather_export_plots"                                           "household_incomes_and_house_prices_import_data"
## [5] "household_incomes_and_house_prices_fit_model_and_find_outliers" "household_incomes_and_house_prices_plot"

### 4.7.1 Running

You can now run these tasks in your console if you want. Note that we use scskeleton::tm_run_task instead of sc::tm_run_task. This ensures that your implementation package's .onLoad function has been called, which authenticates you.

scskeleton::tm_run_task("weather_download_and_import_rawdata")
scskeleton::tm_run_task("weather_export_plots")

Congratulations! You have now successfully finished your first tutorial on Sykdomspulsen Core.

## 4.8 What now?

After Tutorial 1, we expect that you understand the four fundamental parts of developing a task:

1. Schemas
2. Task definition (task_from_config)
3. data_selector_fn
4. action_fn

We also expect that you are able to:

1. Run a task using tm_run_task
2. Use sc::tm_get_plans_argsets_as_dt to identify which index_plan and index_analysis correspond to the plan/analysis you are interested in (e.g. Oslo)
3. Run the inside code of a data_selector_fn for different index_plans as if it were an interactive script
4. Run the inside code of an action_fn for different index_plans and index_analyses as if it were an interactive script