---
title: "Dataset Summary"
author: "Daniel Falster & Susie Zajitschek"
date: "06/07/2018"
output:
  html_document:
    fig_height: 6
    fig_width: 10
    df_print: paged
    rows.print: 10
    code_folding: show
    theme: yeti
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE, echo=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE)
root.dir = rprojroot::find_root("README.md")
knitr::opts_knit$set(root.dir = root.dir)
```

```{r}
library(readr)
library(dplyr)
library(skimr)
library(ggplot2)
library(scales)
library(viridis)
library(knitr)
library(pander)
library(kableExtra)

source("R/data_load_clean.R")
```

# Loading data

Read in the cleaned data (stored as RDS, so variable types are preserved):

```{r}
data <- readRDS("export/data_clean.rds")
```

Note: the cleaned data were generated by running

```{r, eval=FALSE}
data <- load_raw("data/dr7.0_all_control_data.csv") %>%
  clean_raw_data()
```

where `clean_raw_data` is the following function:

```{r}
clean_raw_data
```

# Data overview

Number of rows & columns:

```{r}
data %>% dim()
```

There are `r data %>% names() %>% length()` variables in the dataset:

```{r}
data %>% names()
```

Now use `skimr` to take a quick look at all variables:

```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```

Next we'll look at some specific variables of potential importance.

# Production center

Contributions by `production_center`:

```{r}
x <- data %>%
  group_by(production_center) %>%
  summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() +
  coord_flip()

x
```

# Strains

There are several strains under the variable `strain_name`:

```{r}
data$strain_name %>% unique() %>% length()
```

```{r}
data$strain_name %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

There are also several values under the variable `strain_accession_id`:

```{r}
data$strain_accession_id %>% unique() %>% length()
```

```{r}
data$strain_accession_id %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

# Weights

Overall distribution of weights:

```{r}
ggplot(data, aes(x=weight)) +
  geom_histogram(bins=50)
```

Weights by center and sex:

```{r, fig.height=12}
ggplot(data, aes(x=weight, fill=sex)) +
  geom_histogram(bins=50) +
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

# Ages

There seems to be an issue with some very negative values of age. The range in the raw data is too wide:

```{r}
range(data$age_in_days, na.rm=TRUE)

ggplot(data, aes(x=age_in_days)) +
  geom_histogram(bins=50)
```

So for now we'll filter those out, keeping only ages between 0 and 500 days, to give a reasonable distribution of ages:

```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)

ggplot(data, aes(x=age_in_days)) +
  geom_histogram(bins=50)
```

Age by center and sex:

```{r, fig.height=12}
ggplot(data, aes(x=age_in_days, fill=sex)) +
  geom_histogram(bins=50) +
  scale_y_log10() +
  facet_wrap( ~ production_center, ncol=1)
```

Age vs weight by sex:

```{r}
data %>%
  filter(sex %in% c("male", "female")) %>%
  ggplot(aes(x=age_in_days, y=weight)) +
  geom_hex() +
  viridis::scale_fill_viridis() +
  coord_fixed() +
  facet_wrap( ~ sex, ncol=1)
```

# Procedures

Contributions by `procedure_name`:

```{r, fig.height=12}
x <- data %>%
  group_by(procedure_name) %>%
  summarise(n=n())

ggplot(x, aes(reorder(procedure_name, n), n)) +
  geom_col() +
  coord_flip()
```

```{r, results='asis'}
data$procedure_name %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "400px")
```

Note the uneven distribution of procedures by `production_center`:

```{r, fig.height=25}
x <- data %>%
  group_by(production_center, procedure_name) %>%
  summarise(n=n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() +
  coord_flip() +
  facet_wrap( ~ procedure_name, ncol=4)
```

```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

# Parameters

There are a lot of unique values under the variable `parameter_name`:

```{r}
data$parameter_name %>% unique() %>% length()
```

```{r}
data$parameter_name %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  tibble(variable=names(.), count = .) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

# Individuals

Each individual is identified by the variable `biological_sample_id`. Based on this there are `r data$biological_sample_id %>% unique() %>% length()` unique individuals, and there seem to be multiple records per individual. For example, here are records for `biological_sample_id=107609`:

```{r}
data %>%
  filter(biological_sample_id == "107609") %>%
  select(sex, production_center, biological_sample_id, age_in_days, weight, parameter_name) %>%
  arrange(age_in_days) %>%
  data.frame()
```
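
The observation that individuals have multiple records could be quantified by counting rows per `biological_sample_id`. The following is only a minimal sketch: it uses a small made-up tibble (`toy`) in place of `data`, and the IDs, ages, and weights in it are hypothetical, not values from the dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for `data`; all values are made up
toy <- tibble::tibble(
  biological_sample_id = c("a1", "a1", "a1", "b2", "b2", "c3"),
  age_in_days = c(30, 60, 90, 45, 45, 100),
  weight = c(15.2, 20.1, 24.8, 18.0, 18.0, 26.3)
)

# Number of records for each individual, largest first
records_per_individual <- toy %>%
  count(biological_sample_id, name = "n_records", sort = TRUE)

# Summary of how many records individuals have
records_summary <- records_per_individual %>%
  summarise(
    n_individuals = n(),
    min_records = min(n_records),
    median_records = median(n_records),
    max_records = max(n_records)
  )

records_per_individual
records_summary
```

Swapping `toy` for `data` would give the corresponding counts for the full dataset, assuming `biological_sample_id` uniquely identifies an individual across production centers.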