1 DB Schemas

1.1 Introduction

A database schema is our way of representing how the database is constructed. In short, you can think of these as database tables.

1.2 Database servers

Normally, an implementation of Sykdomspulsen Core would have two database servers that run parallel systems. One database server is auto and the other is interactive.

If you run code in RStudio Workbench or on Airflow interactive, you should be automatically be connected to the interactive database server. If you run code on Airflow auto, you should be automatically be connected to the auto database server. This is something that your implementation will have to solve.

1.3 Access level (anon/restr/redirect)

Within each database server, there are multiple databases with different access levels and censoring requirements.

Censoring is performed via the db schema.

1.3.1 anon

The “anonymous” database contains data that is anonymous. All team members should have access to this database.

1.3.2 restr

The “restricted” database contains data that is:

  • Indirectly identifiable
  • Anonymous

Only a restricted number of team members should have access to this database.

1.3.3 redirect

This is not technically a database, however, it is treated as one.

If a person creates a db schema that exists in both the anonymous and restricted databases, then Sykdomspulsen Core will automatically detect the highest level of access and connect to that database when working with redirect schemas.

1.4 Creating your own

Sykdomspulsen Core requires a lot of boilerplate code. It is strongly recommended that you use the RStudio Addins menu to help you quickly insert code templates.

We will generate three database schemas:

  • restr_example (specified via name_access)
  • anon_example (specified via name_access)
  • redirect_example (automatically created when both restr and anon are used)
sc::add_schema_v8(
  name_access = c("restr", "anon"),
  name_grouping = "example",
  name_variant = NULL,
  db_configs = sc::config$db_configs,
  field_types =  c(
    "granularity_time" = "TEXT",
    "granularity_geo" = "TEXT",
    "country_iso3" = "TEXT",
    "location_code" = "TEXT",
    "border" = "INTEGER",
    "age" = "TEXT",
    "sex" = "TEXT",
    
    "date" = "DATE",
    
    "isoyear" = "INTEGER",
    "isoweek" = "INTEGER",
    "isoyearweek" = "TEXT",
    "season" = "TEXT",
    "seasonweek" = "DOUBLE",
    
    "calyear" = "INTEGER",
    "calmonth" = "INTEGER",
    "calyearmonth" = "TEXT",

    "value_n" = "INTEGER"
  ),
  keys = c(
    "granularity_time",
    "location_code",
    "date",
    "age",
    "sex"
  ),
  censors = list(
    restr = list(
      value_n = sc::censor_function_factory_nothing("value_n")
    ),
    anon = list(
      value_n = sc::censor_function_factory_values_0_4("value_n")
    )
  ),
  validator_field_types = sc::validator_field_types_sykdomspulsen,
  validator_field_contents = sc::validator_field_contents_sykdomspulsen,
  info = "This db table is used for..."
)

This schema has a few main parts.

1.4.1 Naming

The db schemas and tables will be given the names: name_access_name_grouping_name_variant

In this example, there will be three db schemas:

  • restr_example (accessible at sc::config$schemas$restr_example)
  • anon_example (accessible at sc::config$schemas$anon_example)
  • redirect_example (accessible at sc::config$schemas$redirect_example)

Corresponding to two db tables:

  • restr_example
  • anon_example

1.4.1.1 name_access

Either restr or anon

1.4.1.2 name_grouping

A descriptive name

1.4.1.3 name_variant

A descriptive name

1.4.2 db_configs

A list that contains information about the database:

names(sc::config$db_configs)
## [1] "restr"         "anon"          "specific_daar" "config"

1.4.3 db_field_types

A vector containing the names and variable types of the columns of the database table.

In the vast majority of cases, the first 16 columns are standardized and will always be the same.

Permitted variable types are:

  • TEXT
  • DOUBLE
  • INTEGER
  • BOOLEAN
  • DATE
  • DATETIME

1.4.4 keys

The columns that will form the primary key of the database table (i.e. identify unique rows).

1.4.5 censors

1.4.6 validator_field_types

A validator that is useful for ensuring that your database table names are consistent with predetermined rules. For example, in Sykdomspulsen we have decided that we always want the first 16 columns to be:

  • granularity_time
  • granularity_geo
  • country_iso3
  • location_code
  • border
  • age
  • sex
  • date
  • isoyear
  • isoweek
  • isoyearweek
  • season
  • seasonweek
  • calyear
  • calmonth
  • calyearmonth

While developing new code we found that it was difficult to force all developers to remember to include these 16 columns in the correct order. The validator sc::validator_field_types_sykdomspulsen ensures that the first 16 columns are as expected, and otherwise the developer will not be able to run their code.

validator_field_contents is a validator that ensures that the contents of your data is correct. We experienced that there were issues with granularity_time sometimes containing the value week and sometimes containing the value weekly. To maintain consistency in our data, the validator sc::validator_field_contents_sykdomspulsen will throw an error if it observes non-accepted values for certain variables.

1.5 Loading data into a db schema

Checklist:

  1. Remember that “keys” (as defined in sc::add_schema_v8) defines the uniquely identifying rows of data that are allowed in the db table
  2. Use sc::fill_in_missing_v8(d)
  3. Choose your method of loading the data (upsert/insert/drop_all_rows_and_then_upsert_data)

We check to see what schemas are available:

stringr::str_subset(names(sc::config$schemas), "_example$")
## [1] "restr_example"    "anon_example"     "redirect_example"

We then create a fictional dataset and work with it.

options(width = 150)
# fictional dataset
d <- data.table(
  granularity_time = "day",
  granularity_geo = "nation",
  country_iso3 = "nor",
  location_code = "norge",
  border = 2020,
  age = "total",
  sex = "total",
  
  date = c(as.Date("1990-01-07"),as.Date("1990-01-08")),
  
  isoyear = 1990,
  isoweek = 1,
  isoyearweek = "1990-01",
  season = "1990/1991",
  seasonweek = 24,
  
  calyear = NA,
  calmonth = NA,
  calyearmonth = NA,
  
  value_n = c(3,6)
)

# display the raw data
d[]
##    granularity_time granularity_geo country_iso3 location_code border   age   sex       date isoyear isoweek isoyearweek    season seasonweek calyear
## 1:              day          nation          nor         norge   2020 total total 1990-01-07    1990       1     1990-01 1990/1991         24      NA
## 2:              day          nation          nor         norge   2020 total total 1990-01-08    1990       1     1990-01 1990/1991         24      NA
##    calmonth calyearmonth value_n
## 1:       NA           NA       3
## 2:       NA           NA       6

# always fill in missing data!
sc::fill_in_missing_v8(d)

# we have four options to get the data into the db table
# remember that "keys" defines the uniquely identifying rows of data that are allowed in the db table!
# - upsert means "update if data exists, otherwise append"
# - insert means "append" (data cannot already exist)

sc::config$schemas$redirect_example$upsert_data(d)
## Creating table restr_example
## Creating table anon_example
#sc::config$schemas$redirect_example$insert_data(d)
#sc::config$schemas$redirect_example$drop_all_rows_and_then_upsert_data(d)
#sc::config$schemas$redirect_example$drop_all_rows_and_then_insert_data(d)

1.6 Accessing the data in a db schema

Checklist:

  1. sc::mandatory_db_filter
  2. dplyr::select

We extract data from db schemas using dplyr with a dbplyr backend.

options(width = 150)
sc::config$schemas$redirect_example$tbl() %>%
  sc::mandatory_db_filter(
    granularity_time = "day",
    granularity_time_not = NULL,
    granularity_geo = NULL,
    granularity_geo_not = NULL,
    country_iso3 = NULL,
    location_code = "norge",
    age = "total",
    age_not = NULL,
    sex = "total",
    sex_not = NULL
  ) %>%
  dplyr::select(
    granularity_time,
    location_code,
    date,
    value_n,
    value_n_censored
  ) %>%
  dplyr::collect() %>%
  as.data.table() %>%
  print()
##    granularity_time location_code       date value_n value_n_censored
## 1:              day         norge 1990-01-07       3            FALSE
## 2:              day         norge 1990-01-08       6            FALSE

We can observe the effects of censoring as defined in sc::add_schema_v8

options(width = 150)
sc::config$schemas$restr_example$tbl() %>%
  sc::mandatory_db_filter(
    granularity_time = "day",
    granularity_time_not = NULL,
    granularity_geo = NULL,
    granularity_geo_not = NULL,
    country_iso3 = NULL,
    location_code = "norge",
    age = "total",
    age_not = NULL,
    sex = "total",
    sex_not = NULL
  ) %>%
  dplyr::select(
    granularity_time,
    location_code,
    date,
    value_n,
    value_n_censored
  ) %>%
  dplyr::collect() %>%
  as.data.table() %>%
  print()
##    granularity_time location_code       date value_n value_n_censored
## 1:              day         norge 1990-01-07       3            FALSE
## 2:              day         norge 1990-01-08       6            FALSE

sc::config$schemas$anon_example$tbl() %>%
  sc::mandatory_db_filter(
    granularity_time = "day",
    granularity_time_not = NULL,
    granularity_geo = NULL,
    granularity_geo_not = NULL,
    country_iso3 = NULL,
    location_code = "norge",
    age = "total",
    age_not = NULL,
    sex = "total",
    sex_not = NULL
  ) %>%
  dplyr::select(
    granularity_time,
    location_code,
    date,
    value_n,
    value_n_censored
  ) %>%
  dplyr::collect() %>%
  as.data.table() %>%
  print()
##    granularity_time location_code       date value_n value_n_censored
## 1:              day         norge 1990-01-07       0             TRUE
## 2:              day         norge 1990-01-08       6            FALSE

1.7 Accessing the data in ad-hoc analyses

When doing ad-hoc analyses, you may access the database tables via the helper function sc::tbl

IT IS STRICTLY FORBIDDEN TO USE THIS INSIDE SYKDOMSPULSEN TASKS!!!

This is because sc::tbl:

  • is NOT SAFE to use in parallel programming
  • bypasses the input/output control mechanisms that we apply in sc::task_from_config_v8
options(width = 150)
sc::tbl("restr_example") %>%
  sc::mandatory_db_filter(
    granularity_time = "day",
    granularity_time_not = NULL,
    granularity_geo = NULL,
    granularity_geo_not = NULL,
    country_iso3 = NULL,
    location_code = "norge",
    age = "total",
    age_not = NULL,
    sex = "total",
    sex_not = NULL
  ) %>%
  dplyr::select(
    granularity_time,
    location_code,
    date,
    value_n,
    value_n_censored
  ) %>%
  dplyr::collect() %>% 
  as.data.table() %>% 
  print()
##    granularity_time location_code       date value_n value_n_censored
## 1:              day         norge 1990-01-07       3            FALSE
## 2:              day         norge 1990-01-08       6            FALSE

1.8 Exploring data in schemas

DB Schemas obviously contain a lot of data. It can be very overwhelming to try and understand what is inside the schema.

options(width = 150)

# Get the first few lines of the schema (use $tbl())
sc::config$schemas$restr_example$tbl()
## # Source:   table<restr_example> [?? x 18]
## # Database: Microsoft SQL Server 12.00.6433[FHI\RIWH@OFY-GN-SQL01/sykdomspulsen_interactive_restr]
##   granularity_time granularity_geo country_iso3 location_code border age   sex   date       isoyear isoweek isoyearweek season    seasonweek calyear
##   <chr>            <chr>           <chr>        <chr>          <int> <chr> <chr> <date>       <int>   <int> <chr>       <chr>          <dbl>   <int>
## 1 day              nation          nor          norge           2020 total total 1990-01-07    1990       1 1990-01     1990/1991         24      NA
## 2 day              nation          nor          norge           2020 total total 1990-01-08    1990       1 1990-01     1990/1991         24      NA
## # … with 4 more variables: calmonth <int>, calyearmonth <chr>, value_n <int>, value_n_censored <lgl>

# Get a summary of the schema (referencing the schema directly)
sc::config$schemas$restr_example
## [sykdomspulsen_interactive_restr].[dbo].[restr_example]   (connected)
## 
## granularity_time (TEXT):
##  - day (n = 2)
## granularity_geo (TEXT):
##  - nation (n = 2)
## country_iso3 (TEXT):
##  - nor (n = 2)
## location_code (TEXT)
## border (INTEGER):
##  - 2020 (n = 2)
## age (TEXT):
##  - total (n = 2)
## sex (TEXT):
##  - total (n = 2)
## date (DATE)
## isoyear (INTEGER):
##  - 1990 (n = 2)
## isoweek (INTEGER)
## isoyearweek (TEXT)
## season (TEXT):
##  - 1990/1991 (n = 2)
## seasonweek (DOUBLE)
## calyear (INTEGER)
## calmonth (INTEGER)
## calyearmonth (TEXT)
## value_n (INTEGER)
## value_n_censored (BOOLEAN)

# Get a summary of a variable inside the schema via 'hashing the data structure'
sc::config$schemas$restr_example %>%
  spltidy::hash_data_structure("value_n") %>%
  plot()
# This can also be done directly on a dbplyr table
sc::tbl("restr_example") %>%
  spltidy::hash_data_structure("value_n") %>%
  plot()