---
title: "Dataset Summary"
author: "Daniel Falster & Susie Zajitschek"
date: "06/07/2018"
output: 
  html_document:
    fig_height: 6
    fig_width: 10
    df_print: paged
    rows.print: 10
    code_folding: hide
    theme: yeti
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE, echo=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE)
root.dir = rprojroot::find_root("README.md")
knitr::opts_knit$set(root.dir = root.dir)

library(readr)
library(dplyr)
library(skimr)
library(ggplot2)
library(scales)
library(viridis)
library(knitr)
library(kableExtra)
```

Here is an initial exploration of the mouse data we are working on. 

The data is big, but not so large that we can't use our standard tools on a desktop. I'd suggest using packages from the [tidyverse](https://www.tidyverse.org/) family, in particular `readr`, `dplyr`, `ggplot2`, `skimr`, `scales`, `viridis`.

#  Dataset overview

Read in data, specifying variable types:

```{r cars}
data_raw <- read_csv("data/dr7.0_all_control_data.csv", 
                  col_types = cols(
                      .default = col_character(),
                      project_id = col_character(),
                      id = col_character(),
                      parameter_id = col_character(),
                      age_in_days = col_integer(),
                      date_of_experiment = col_datetime(format = ""),
                      weight = col_double(),
                      phenotyping_center_id = col_character(),
                      production_center_id = col_character(),
                      weight_date = col_datetime(format = ""),
                      date_of_birth = col_datetime(format = ""),
                      procedure_id = col_character(),
                      pipeline_id = col_character(),
                      biological_sample_id = col_character(),
                      biological_model_id = col_character(),
                      weight_days_old = col_integer(),
                      datasource_id = col_character(),
                      experiment_id = col_character(),
                      data_point = col_double(),
                      age_in_weeks = col_integer(),
                      `_version_` = col_character()
                      )
                )
```


Number of rows & columns:

```{r}
data_raw %>% dim()
```

Of the `r data %>% names() %>% length()` variables in the dataset, lots of variables are full of NAs:

```{r}
n_records <- data_raw %>% nrow()
NA_count <- data_raw %>%
  summarise_all(funs(sum(is.na(.))/n_records*100)) %>% 
  unlist() %>% sort(decreasing=TRUE) %>%
  tibble(variable=names(.), percent_NA = .)

# make a plot
ggplot(NA_count, aes(percent_NA)) + geom_histogram(bins=50)
```  

Let's remove those variables that have all NAs:

```{r}
(all_NAs <- NA_count$variable[NA_count$percent_NA == 100])
data <- data_raw %>% select(-one_of(all_NAs))
```

Now we `r data %>% names() %>% length()` variables in the dataset:

```{r}
data %>% names()
```

Now use `skimr` to take a quick look of all variables:

```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```

Next we'll look at some specific variables of potential importance.

# Production center

Contributions by `production_center`:
```{r}
x <- data %>% group_by(production_center) %>% summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip()

x 
```

# Weights

Overall distribution of weights:
```{r}
ggplot(data, aes(x=weight)) + 
  geom_histogram(bins=50)
```

Weights by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=weight, fill=sex)) + 
  geom_histogram(bins=50) + 
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

# Ages

There seems to be an issue with some very negative values of age. The range in the raw data is too wide:
```{r}
range(data$age_in_days, na.rm=TRUE)
ggplot(data, aes(x=age_in_days)) + 
  geom_histogram(bins=50)
```

So for now we'll filter those out, to give an reasonable distribution of ages:

```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)

ggplot(data, aes(x=age_in_days)) + 
  geom_histogram(bins=50)
```


Age by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=age_in_days, fill=sex)) + 
  geom_histogram(bins=50) + 
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

Age vs weight by sex:
```{r}
data %>%
  filter(sex %in% c("male", "female")) %>% 
  ggplot(aes(x=age_in_days, y=weight)) + 
  geom_hex() + 
  viridis::scale_fill_viridis() + 
  coord_fixed() +
  facet_wrap( ~ sex, ncol=1)
```

# Procedures

Contributions by `procedure_name`:

```{r, fig.height=12}
x <- data %>% group_by(procedure_name) %>% summarise(n=n())
ggplot(x, aes(reorder(procedure_name, n), n)) +
  geom_col() + coord_flip()
```

```{r, results='asis'}
data$procedure_name %>% table() %>% sort(decreasing = TRUE) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "400px")
```

Note the uneven distribution of procdures by production_center:

```{r, fig.height=25}

x <- data %>% 
  group_by(production_center, procedure_name) %>% 
  summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip() +
  facet_wrap( ~ procedure_name, ncol=4)
```


```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```


# Parameters

There are a lot of unique values under the variable `parameter_name`:

```{r}
data$parameter_name %>% unique() %>% length()
```

```{r}
data$parameter_name %>% table() %>% sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```  

# Individuals

There seem to be multiple records for an individual, which is identified by the varaible `biological_sample_id`. Based on this there ar `r data$biological_sample_id %>% unique() %>% length()` unique individuals. And there are multiple records per individual. For example, here are records for  `biological_sample_id=107609`:

```{r}
select(filter(data_raw, biological_sample_id == "107609"), sex, production_center, external_sample_id, biological_sample_id, age_in_days, weight, parameter_name) %>% arrange(age_in_days) %>% data.frame()
```




```{r}

data_i <- data %>% 
  filter(parameter_name == "Body weight") %>%
  group_by(biological_sample_id) %>% 
  summarise(
    production_center = production_center[1],
    project_name = project_name[1],
    sex = sex[1],
    n = n(),
    med_age= median(age_in_days)
    )

data_i
```

# Datasource

more recently, Jeremy states: "I provided you too much data (!) which should not be included.  A quarter to a third of the data is from legacy projects and should be excluded.  As I mentioned, exclude anything without datasource_name = `IMPC`"

By source: 
```
data_raw$datasource_name %>% table()
```

So reducing to IMPC reduces the data by one third.

It also greatly reduces the number of parameters:
```{r}
filter(data_raw, datasource_name == 'IMPC') %>% pull(parameter_name) %>% unique() %>% length()
```

```
data_raw %>% pull(parameter_name) %>% unique() %>% length()
```

# Issues:


1.  `procedure_name` are values eding in `(GMC)` different?
2.  Spelling / case differences: e.g. `Body weight` vs `Body Weight`