---
title: "Dataset Summary"
author: "Daniel Falster & Susie Zajitschek"
date: "06/07/2018"
output: 
  html_document:
    fig_height: 6
    fig_width: 10
    df_print: paged
    rows.print: 10
    code_folding: show
    theme: yeti
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE, echo=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE)
root.dir = rprojroot::find_root("README.md")
knitr::opts_knit$set(root.dir = root.dir)
```

```{r}
library(readr)
library(dplyr)
library(skimr)
library(ggplot2)
library(scales)
library(viridis)
library(knitr)
library(pander)
library(kableExtra)
source("R/data_load_clean.R")
```

# Loading data

Read in the cleaned data (variable types are preserved in the `.rds` file):

```{r}
data <- readRDS("export/data_clean.rds")
```

Note: the cleaned data were generated by running

```{r, eval=FALSE}
data <- load_raw("data/dr7.0_all_control_data.csv") %>% clean_raw_data()
```

using the following function:
```{r}
clean_raw_data
```

# Data overview

Number of rows & columns:

```{r}
data %>% dim()
```

There are `r data %>% names() %>% length()` variables in the dataset:

```{r}
data %>% names()
```

Now use `skimr` to take a quick look at all variables:

```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```

Next we'll look at some specific variables of potential importance.

# Production center

Contributions by `production_center`:
```{r}
x <- data %>% group_by(production_center) %>% summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip()

x 
```


# Strains

There are several strains under the variable `strain_name`:

```{r}
data$strain_name %>% unique() %>% length()
```

```{r}
data$strain_name %>% table() %>% sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```  


There are also several unique values under the variable `strain_accession_id`:

```{r}
data$strain_accession_id %>% unique() %>% length()
```

```{r}
data$strain_accession_id %>% table() %>% sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```  
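
The frequency tables above are built with `table()` and `sort()`. As an alternative sketch, `dplyr::count()` with `sort = TRUE` returns the counts directly as a data frame. The toy values below are hypothetical and only illustrate the idiom; note that `count()` keeps `NA` as its own row, whereas `table()` drops it by default.

```r
library(dplyr)

# Toy data: hypothetical strain labels, not taken from the real dataset
toy <- tibble(strain_name = c("S1", "S1", "S1", "S2", "S2", "S3"))

# One row per strain, most frequent first
strain_counts <- toy %>%
  count(strain_name, sort = TRUE, name = "count")

strain_counts
```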

# Weights

Overall distribution of weights:
```{r}
ggplot(data, aes(x=weight)) + 
  geom_histogram(bins=50)
```

Weights by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=weight, fill=sex)) + 
  geom_histogram(bins=50) + 
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

# Ages

There seems to be an issue with age: some records have strongly negative values, so the range in the data is implausibly wide:
```{r}
range(data$age_in_days, na.rm=TRUE)
ggplot(data, aes(x=age_in_days)) + 
  geom_histogram(bins=50)
```

So for now we'll filter those out, keeping only ages between 0 and 500 days, to give a reasonable distribution of ages:

```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)

ggplot(data, aes(x=age_in_days)) + 
  geom_histogram(bins=50)
```
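
Note that `filter()` keeps only rows where the condition is `TRUE`, so records with a missing `age_in_days` are dropped as well. A minimal sketch on a toy data frame (the values are made up for illustration, and the `status` labels are hypothetical) shows one way to tally what such a filter removes:

```r
library(dplyr)

# Toy data: hypothetical ages, not taken from the real dataset
toy <- tibble(age_in_days = c(-4000, 0, 35, 120, 499, 500, 720, NA))

# Classify each record using the same bounds as the filter
status_counts <- toy %>%
  mutate(status = case_when(
    is.na(age_in_days) ~ "missing",
    age_in_days <= 0   ~ "non-positive",
    age_in_days >= 500 ~ "too old",
    TRUE               ~ "kept"
  )) %>%
  count(status)

status_counts

# The filter itself retains only the "kept" rows
kept <- toy %>% filter(age_in_days > 0 & age_in_days < 500)
nrow(kept)
```

Running a tally like this before overwriting `data` makes it easy to report how many records were excluded and why.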


Age by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=age_in_days, fill=sex)) + 
  geom_histogram(bins=50) + 
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

Age vs weight by sex:
```{r}
data %>%
  filter(sex %in% c("male", "female")) %>% 
  ggplot(aes(x=age_in_days, y=weight)) + 
  geom_hex() + 
  viridis::scale_fill_viridis() + 
  coord_fixed() +
  facet_wrap( ~ sex, ncol=1)
```

# Procedures

Contributions by `procedure_name`:

```{r, fig.height=12}
x <- data %>% group_by(procedure_name) %>% summarise(n=n())
ggplot(x, aes(reorder(procedure_name, n), n)) +
  geom_col() + coord_flip()
```

```{r, results='asis'}
data$procedure_name %>% table() %>% sort(decreasing = TRUE) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "400px")
```

Note the uneven distribution of procedures by `production_center`:

```{r, fig.height=25}

x <- data %>% 
  group_by(production_center, procedure_name) %>% 
  summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip() +
  facet_wrap( ~ procedure_name, ncol=4)
```


```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```
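
One way to put a number on this unevenness is to count how many distinct procedures each center contributed. The sketch below uses a toy data frame with hypothetical center and procedure names; `coverage`, `n_records` and `n_procedures` are illustrative names, not part of the dataset.

```r
library(dplyr)

# Toy data: hypothetical centers and procedures
toy <- tibble(
  production_center = c("Center A", "Center A", "Center A",
                        "Center B", "Center B", "Center C"),
  procedure_name    = c("Body Weight", "Body Weight", "Open Field",
                        "Body Weight", "Grip Strength", "Body Weight")
)

# Records and distinct procedures per center
coverage <- toy %>%
  group_by(production_center) %>%
  summarise(
    n_records    = n(),
    n_procedures = n_distinct(procedure_name)
  ) %>%
  arrange(desc(n_procedures), production_center)

coverage
```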



# Parameters

There are a lot of unique values under the variable `parameter_name`:

```{r}
data$parameter_name %>% unique() %>% length()
```

```{r}
data$parameter_name %>% table() %>% sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
``` 

# Individuals

Individuals are identified by the variable `biological_sample_id`. Based on this, there are `r data$biological_sample_id %>% unique() %>% length()` unique individuals, and there are multiple records per individual. For example, here are the records for `biological_sample_id=107609`:

```{r}
data %>%
  filter(biological_sample_id == "107609") %>%
  select(sex, production_center, biological_sample_id, age_in_days, weight, parameter_name) %>%
  arrange(age_in_days) %>%
  data.frame()
```
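
To see how records are spread across individuals more generally, one could count rows per `biological_sample_id`. This is a minimal sketch on toy data; the IDs and parameter names are hypothetical, and `n_records` is an illustrative column name.

```r
library(dplyr)

# Toy data: hypothetical IDs and parameters
toy <- tibble(
  biological_sample_id = c("A1", "A1", "A1", "B2", "B2", "C3"),
  parameter_name       = c("Body weight", "Heart rate", "Glucose",
                           "Body weight", "Heart rate", "Body weight")
)

# Number of records per individual, largest first
records_per_individual <- toy %>%
  count(biological_sample_id, name = "n_records", sort = TRUE)

records_per_individual

# Distribution of records per individual
summary(records_per_individual$n_records)
```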