--- title: "Dataset Summary" author: "Daniel Falster & Susie Zajitschek" date: "06/07/2018" output: html_document: fig_height: 6 fig_width: 10 df_print: paged rows.print: 10 code_folding: hide theme: yeti toc: yes toc_depth: 3 toc_float: collapsed: false smooth_scroll: true editor_options: chunk_output_type: console --- ```{r setup, include=FALSE, echo=TRUE} knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE) root.dir = rprojroot::find_root("README.md") knitr::opts_knit$set(root.dir = root.dir) library(readr) library(dplyr) library(skimr) library(ggplot2) library(scales) library(viridis) library(knitr) library(kableExtra) ``` Here is an initial exploration of the mouse data we are working on. The data is big, but not so large that we can't use our standard tools on a desktop. I'd suggest using packages from the [tidyverse](https://www.tidyverse.org/) family, in particular `readr`, `dplyr`, `ggplot2`, `skimr`, `scales`, `viridis`. # Dataset overview Read in data, specifying variable types: ```{r cars} data_raw <- read_csv("data/dr7.0_all_control_data.csv", col_types = cols( .default = col_character(), project_id = col_character(), id = col_character(), parameter_id = col_character(), age_in_days = col_integer(), date_of_experiment = col_datetime(format = ""), weight = col_double(), phenotyping_center_id = col_character(), production_center_id = col_character(), weight_date = col_datetime(format = ""), date_of_birth = col_datetime(format = ""), procedure_id = col_character(), pipeline_id = col_character(), biological_sample_id = col_character(), biological_model_id = col_character(), weight_days_old = col_integer(), datasource_id = col_character(), experiment_id = col_character(), data_point = col_double(), age_in_weeks = col_integer(), `_version_` = col_character() ) ) ``` Number of rows & columns: ```{r} data_raw %>% dim() ``` Of the `r data %>% names() %>% length()` variables in the dataset, lots of variables are full of NAs: ```{r} 
n_records <- data_raw %>% nrow()

NA_count <- data_raw %>%
  summarise_all(funs(sum(is.na(.)) / n_records * 100)) %>%
  unlist() %>%
  sort(decreasing = TRUE) %>%
  {tibble(variable = names(.), percent_NA = .)}

# make a plot
ggplot(NA_count, aes(percent_NA)) + geom_histogram(bins = 50)
```

Let's remove those variables that are all NAs:

```{r}
(all_NAs <- NA_count$variable[NA_count$percent_NA == 100])

data <- data_raw %>% select(-one_of(all_NAs))
```

Now we have `r data %>% names() %>% length()` variables in the dataset:

```{r}
data %>% names()
```

Now use `skimr` to take a quick look at all variables:

```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```

Next we'll look at some specific variables of potential importance.

# Production center

Contributions by `production_center`:

```{r}
x <- data %>% group_by(production_center) %>% summarise(n = n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip()

x
```

# Weights

Overall distribution of weights:

```{r}
ggplot(data, aes(x = weight)) + geom_histogram(bins = 50)
```

Weights by center and sex:

```{r, fig.height=12}
ggplot(data, aes(x = weight, fill = sex)) +
  geom_histogram(bins = 50) +
  scale_y_log10() +
  facet_wrap(~production_center, ncol = 1)
```

# Ages

There seems to be an issue with some very negative values of age.
The range in the raw data is too wide:

```{r}
range(data$age_in_days, na.rm = TRUE)

ggplot(data, aes(x = age_in_days)) + geom_histogram(bins = 50)
```

So for now we'll filter those out, to give a reasonable distribution of ages:

```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)

ggplot(data, aes(x = age_in_days)) + geom_histogram(bins = 50)
```

Age by center and sex:

```{r, fig.height=12}
ggplot(data, aes(x = age_in_days, fill = sex)) +
  geom_histogram(bins = 50) +
  scale_y_log10() +
  facet_wrap(~production_center, ncol = 1)
```

Age vs weight by sex:

```{r}
data %>%
  filter(sex %in% c("male", "female")) %>%
  ggplot(aes(x = age_in_days, y = weight)) +
  geom_hex() +
  viridis::scale_fill_viridis() +
  coord_fixed() +
  facet_wrap(~sex, ncol = 1)
```

# Procedures

Contributions by `procedure_name`:

```{r, fig.height=12}
x <- data %>% group_by(procedure_name) %>% summarise(n = n())

ggplot(x, aes(reorder(procedure_name, n), n)) +
  geom_col() + coord_flip()
```

```{r, results='asis'}
data$procedure_name %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "400px")
```

Note the uneven distribution of procedures by `production_center`:

```{r, fig.height=25}
x <- data %>% group_by(production_center, procedure_name) %>% summarise(n = n())

ggplot(x, aes(reorder(production_center, n), n)) +
  geom_col() + coord_flip() +
  facet_wrap(~procedure_name, ncol = 4)
```

```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

# Parameters

There are a lot of unique values under the variable `parameter_name`:

```{r}
data$parameter_name %>% unique() %>% length()
```

```{r}
data$parameter_name %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  {tibble(variable = names(.), count = .)} %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "500px")
```

# Individuals

There seem to be multiple records per individual, where an individual is identified by the variable `biological_sample_id`. On this basis there are `r data$biological_sample_id %>% unique() %>% length()` unique individuals, with multiple records each. For example, here are the records for `biological_sample_id = 107609`:

```{r}
data_raw %>%
  filter(biological_sample_id == "107609") %>%
  select(sex, production_center, external_sample_id, biological_sample_id,
         age_in_days, weight, parameter_name) %>%
  arrange(age_in_days) %>%
  data.frame()
```

```{r}
data_i <- data %>%
  filter(parameter_name == "Body weight") %>%
  group_by(biological_sample_id) %>%
  summarise(
    production_center = production_center[1],
    project_name = project_name[1],
    sex = sex[1],
    n = n(),
    med_age = median(age_in_days)
  )

data_i
```

# Datasource

More recently, Jeremy states: "I provided you too much data (!) which should not be included. A quarter to a third of the data is from legacy projects and should be excluded. As I mentioned, exclude anything without datasource_name = `IMPC`"

By source:

```{r}
data_raw$datasource_name %>% table()
```

So reducing to IMPC reduces the data by one third. It also greatly reduces the number of parameters:

```{r}
filter(data_raw, datasource_name == "IMPC") %>%
  pull(parameter_name) %>%
  unique() %>%
  length()
```

```{r}
data_raw %>% pull(parameter_name) %>% unique() %>% length()
```

# Issues

1. `procedure_name`: are values ending in `(GMC)` different?
2. Spelling / case differences: e.g. `Body weight` vs `Body Weight`
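As a starting point for investigating both issues, here is a quick sketch (assuming `data` as prepared above; the patterns checked are just the two examples noted). It flags parameter names that differ only by case, and lists procedure names ending in `(GMC)` so they can be compared against their non-GMC counterparts:

```{r}
# Flag parameter names that collide when case is ignored,
# e.g. "Body weight" vs "Body Weight".
data %>%
  distinct(parameter_name) %>%
  mutate(name_lower = tolower(parameter_name)) %>%
  group_by(name_lower) %>%
  filter(n() > 1) %>%
  arrange(name_lower)

# List procedure names ending in "(GMC)".
data %>%
  distinct(procedure_name) %>%
  filter(grepl("\\(GMC\\)$", procedure_name))
```

If the case-insensitive duplicates turn out to be the same measurement, collapsing `parameter_name` to a canonical form before grouping would avoid splitting counts across spelling variants.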