class: center, middle, inverse, title-slide .title[ # Functions and Data Types ] .author[ ### Yen-Chung Chen & Cassandra Buzby ] .date[ ### 2022-06-30 ] --- # Outline 1. Recap on data filtering and functions -- 2. Data types and associated horror stories --- # Introduction to RNA-Seq .pull-left[ ![](https://nyusurp.github.io/img/RNAseq.png) ] .pull-right[ {{content}} ] -- 1. cDNA synthesis {{content}} -- 2. Fragmentation {{content}} -- 3. Sequence & align {{content}} -- 4. Quantify -- .center[Is it fair to just count the fragments?] --- # Normalization in RNA-Seq .pull-left[ {{content}} ] -- ## Issues {{content}} -- 1. The more you seq, the more you get. {{content}} -- 2. The longer the gene is, the more you get. -- .pull-right[ ## Solutions {{content}} ] -- 1. Account for seq depth.by normalize with seq depth {{content}} -- 2. Normalize every gene with its length --- # Depth normalization What you would essentially do is: `$$Normalized\space{}count = \frac{Per\mbox{-}gene\space{}count}{Total\space{}count}$$` -- What is in the raw data? ```r # Use read.delim instead of read.csv for **tab**-delimited files rawcount = read.delim("BR-A-Control_counts.txt") ``` --- # Always know your data ```r head(rawcount) ``` ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ## 3 ENSMUSG00000000028 6 ## 4 ENSMUSG00000000031 109 ## 5 ENSMUSG00000000037 43 ## 6 ENSMUSG00000000049 0 ``` --- # Plan ahead: What are our goals? 1. Normalize read counts by sequencing depth = total reads we got from a sample. -- 2. Normalize again with transcript length. --- # `normalize_by_depth()` How do we define a function that normalize a vector of counts by total counts? -- ```r # What is our input? normalize_by_depth = function(input) { # How do we compute total count? # Let's divide everything by the total count calculated above # Provide an output } ``` --- # Good practice: Start humble So you can capture errors right away. -- ```r # Take a small fraction of the data so we can test our function test_count = head(rawcount) print(test_count) ``` ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ## 3 ENSMUSG00000000028 6 ## 4 ENSMUSG00000000031 109 ## 5 ENSMUSG00000000037 43 ## 6 ENSMUSG00000000049 0 ``` --- # Get the depth -- ```r # What is our input? normalize_by_depth = function(input) { # How do we compute total count? * depth = sum(input$count) # Let's divide everything by the total count calculated above # Provide an output return(depth) } ``` --- # Test the function .pull-left[ ### Test data ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ## 3 ENSMUSG00000000028 6 ## 4 ENSMUSG00000000031 109 ## 5 ENSMUSG00000000037 43 ## 6 ENSMUSG00000000049 0 ``` ] -- .pull-right[ ### Ground fact 1472 + 6 + 109 + 43 = 1630 ] -- ### Testing ```r normalize_by_depth(test_count) == 1472 + 6 + 109 + 43 ``` ``` ## [1] TRUE ``` --- # Divide everything by total count -- ```r # What is our input? normalize_by_depth = function(input) { # How do we compute total count? depth = sum(input$count) # Let's divide everything by the total count calculated above * normalized_count = input$count/depth input$normalized_count = normalized_count # Provide an output * return(input) } ``` --- # Test again .pull-left[ ### Test data ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ``` ] -- .pull-right[ ### Ground fact Total count is 1630. ] -- Thus, for the first gene we will expect 1472/1630=0.9030675, while the second gene we are expecting 0/1630 = 0. -- ```r normalize_by_depth(test_count) ``` ``` ## id count normalized_count ## 1 ENSMUSG00000000001 1472 0.903067485 ## 2 ENSMUSG00000000003 0 0.000000000 ## 3 ENSMUSG00000000028 6 0.003680982 ## 4 ENSMUSG00000000031 109 0.066871166 ## 5 ENSMUSG00000000037 43 0.026380368 ## 6 ENSMUSG00000000049 0 0.000000000 ``` --- # Counts per million (CPM) `$$CPM = \frac{Count\space{}per\space{}gene}{Total\space{}count}\times{}10^{6}$$` -- Let's multiply the result by `\(10^{6}\)` (10^6) before returning it in our function. -- ```r # What is our input? normalize_by_depth = function(input) { # How do we compute total count? depth = sum(input$count) # Let's divide everything by the total count calculated above # and multiply by 10^6 to get CPM * normalized_count = input$count/depth * 10^6 input$normalized_count = normalized_count # Provide an output return(input) } ``` --- # Is it working? For the first gene, we used to getting 0.9030675 before multiplying with `\(10^{6}\)`. -- ```r normalize_by_depth(test_count) ``` ``` ## id count normalized_count ## 1 ENSMUSG00000000001 1472 903067.485 ## 2 ENSMUSG00000000003 0 0.000 ## 3 ENSMUSG00000000028 6 3680.982 ## 4 ENSMUSG00000000031 109 66871.166 ## 5 ENSMUSG00000000037 43 26380.368 ## 6 ENSMUSG00000000049 0 0.000 ``` --- # What is our other goals? <p style = "color: #aaaaaa;"> 1. Normalize read counts by sequencing depth = total reads we got from a sample. </p> <p style = "color: #000000;"> 2. Normalize again with transcript length. </p> --- # `get_tx_length()` A gene model file contains the starting and ending coordinates of genes, transcripts, and exons. -- ```r # Load gene model file (pre-processed) gene_model = read.csv("mouse_gene_model.csv") head(gene_model) ``` ``` ## id start end ## 1 ENSMUSG00000102628 150956201 150958296 ## 2 ENSMUSG00000100595 150983666 150984611 ## 3 ENSMUSG00000097426 151012258 151012971 ## 4 ENSMUSG00000097426 151013347 151013531 ## 5 ENSMUSG00000104478 108344807 108347562 ## 6 ENSMUSG00000104385 6980784 6981446 ``` --- # Getting transcript length for each gene How do we define a function that get us lengths for each gene? -- ```r library(dplyr) # What is our input? get_tx_length = function(input) { # How do we compute the sum of all exons of a gene? # Provide an output return(tx_length) } ``` --- # Hint In `dplyr` term, you might want to: 1. First, define a new column called `length` defined as `end` - `start` + 1. -- 2. Group the table by `id` -- 3. Summarize the table by calculating the `sum()` of column `length` --- # `mutate()` ```r # How do we compute the length of each exons? gene_model %>% head() %>% mutate(length = (end - start + 1)) ``` ``` ## id start end length ## 1 ENSMUSG00000102628 150956201 150958296 2096 ## 2 ENSMUSG00000100595 150983666 150984611 946 ## 3 ENSMUSG00000097426 151012258 151012971 714 ## 4 ENSMUSG00000097426 151013347 151013531 185 ## 5 ENSMUSG00000104478 108344807 108347562 2756 ## 6 ENSMUSG00000104385 6980784 6981446 663 ``` --- # Sum of length per id -- ```r # What is our input? get_tx_length = function(input) { # How do we compute the sum of all exons of a gene? tx_length = input %>% mutate(length = (end - start + 1)) %>% * group_by(id) %>% * summarize(gene_length = sum(length)) # Provide an output return(tx_length) } ``` --- # What does the output look like? ```r gene_model %>% head() %>% get_tx_length() ``` ``` ## # A tibble: 5 × 2 ## id gene_length ## <chr> <dbl> ## 1 ENSMUSG00000097426 899 ## 2 ENSMUSG00000100595 946 ## 3 ENSMUSG00000102628 2096 ## 4 ENSMUSG00000104385 663 ## 5 ENSMUSG00000104478 2756 ``` -- .center[Does this look right?] --- # Use ENSMUSG00000097426 as a test -- ```r # Doing it differently with base R gene_of_interest = gene_model[gene_model$id == "ENSMUSG00000097426", ] print(gene_of_interest) ``` ``` ## id start end ## 3 ENSMUSG00000097426 151012258 151012971 ## 4 ENSMUSG00000097426 151013347 151013531 ``` -- ```r gene_of_interest$length = gene_of_interest$end - gene_of_interest$start + 1 print(gene_of_interest) ``` ``` ## id start end length ## 3 ENSMUSG00000097426 151012258 151012971 714 ## 4 ENSMUSG00000097426 151013347 151013531 185 ``` -- ```r sum(gene_of_interest$length) ``` ``` ## [1] 899 ``` --- # Merging data with a shared column .pull-left[ ## Count table ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ## 3 ENSMUSG00000000028 6 ## 4 ENSMUSG00000000031 109 ## 5 ENSMUSG00000000037 43 ## 6 ENSMUSG00000000049 0 ``` ] -- .pull-right[ ## Gene length ```r gene_model %>% head() %>% get_tx_length() ``` ``` ## # A tibble: 5 × 2 ## id gene_length ## <chr> <dbl> ## 1 ENSMUSG00000097426 899 ## 2 ENSMUSG00000100595 946 ## 3 ENSMUSG00000102628 2096 ## 4 ENSMUSG00000104385 663 ## 5 ENSMUSG00000104478 2756 ``` ] -- .center[Everything is good except for the order...] --- # `left_join(x, y, by = column)` - `left_join(x, y, by)` will merge each row according to a shared column. -- - In our case, that means we merge rows that have the same `id`. --- # Minimal example of join .pull-left[ ```r # Using small test data print(test_count) ``` ``` ## id count ## 1 ENSMUSG00000000001 1472 ## 2 ENSMUSG00000000003 0 ## 3 ENSMUSG00000000028 6 ## 4 ENSMUSG00000000031 109 ## 5 ENSMUSG00000000037 43 ## 6 ENSMUSG00000000049 0 ``` ] -- .pull-right[ ```r test_gene_model = gene_model %>% # Only keep rows that are present in test_count filter(id %in% test_count$id) # Compute gene length with our custom function test_gene_length = test_gene_model %>% get_tx_length() print(test_gene_length) ``` ``` ## # A tibble: 6 × 2 ## id gene_length ## <chr> <dbl> ## 1 ENSMUSG00000000001 3262 ## 2 ENSMUSG00000000003 902 ## 3 ENSMUSG00000000028 2143 ## 4 ENSMUSG00000000031 1217 ## 5 ENSMUSG00000000037 4842 ## 6 ENSMUSG00000000049 1190 ``` ] --- # `left_join()` time! ```r left_join(x = test_count, y = test_gene_length, by = "id") ``` ``` ## id count gene_length ## 1 ENSMUSG00000000001 1472 3262 ## 2 ENSMUSG00000000003 0 902 ## 3 ENSMUSG00000000028 6 2143 ## 4 ENSMUSG00000000031 109 1217 ## 5 ENSMUSG00000000037 43 4842 ## 6 ENSMUSG00000000049 0 1190 ``` -- `left_join()` will take care of row orders even if they are messy! -- ```r left_join(x = test_count[c(1, 3, 5, 2, 4, 6), ], y = test_gene_model, by = "id") ``` ``` ## id count start end ## 1 ENSMUSG00000000001 1472 108053204 108053462 ## 2 ENSMUSG00000000001 1472 108031111 108031153 ## 3 ENSMUSG00000000001 1472 108030858 108030999 ## 4 ENSMUSG00000000001 1472 108025617 108025774 ## 5 ENSMUSG00000000001 1472 108023079 108023207 ## 6 ENSMUSG00000000001 1472 108019789 108019918 ## 7 ENSMUSG00000000001 1472 108019251 108019404 ## 8 ENSMUSG00000000001 1472 108016719 108016928 ## 9 ENSMUSG00000000001 1472 108014596 108016632 ## 10 ENSMUSG00000000028 6 18630554 18630722 ## 11 ENSMUSG00000000028 6 18630265 18630459 ## 12 ENSMUSG00000000028 6 18630055 18630114 ## 13 ENSMUSG00000000028 6 18629139 18629231 ## 14 ENSMUSG00000000028 6 18627482 18627619 ## 15 ENSMUSG00000000028 6 18626030 18626173 ## 16 ENSMUSG00000000028 6 18624132 18624187 ## 17 ENSMUSG00000000028 6 18620551 18620599 ## 18 ENSMUSG00000000028 6 18617426 18617487 ## 19 ENSMUSG00000000028 6 18616099 18616149 ## 20 ENSMUSG00000000028 6 18614575 18614694 ## 21 ENSMUSG00000000028 6 18613845 18613976 ## 22 ENSMUSG00000000028 6 18613512 18613610 ## 23 ENSMUSG00000000028 6 18611928 18612089 ## 24 ENSMUSG00000000028 6 18605680 18605818 ## 25 ENSMUSG00000000028 6 18605519 18605602 ## 26 ENSMUSG00000000028 6 18603823 18603941 ## 27 ENSMUSG00000000028 6 18603556 18603632 ## 28 ENSMUSG00000000028 6 18600646 18600712 ## 29 ENSMUSG00000000028 6 18599197 18599323 ## 30 ENSMUSG00000000037 43 159945768 159946244 ## 31 ENSMUSG00000000037 43 159954486 159954531 ## 32 ENSMUSG00000000037 43 159960243 159960313 ## 33 ENSMUSG00000000037 43 159961033 159961267 ## 34 ENSMUSG00000000037 43 159970260 159970348 ## 35 ENSMUSG00000000037 43 159975200 159975443 ## 36 ENSMUSG00000000037 43 159985010 159985093 ## 37 ENSMUSG00000000037 43 159992191 159992265 ## 38 ENSMUSG00000000037 43 159992605 159992679 ## 39 ENSMUSG00000000037 43 159996237 159996320 ## 40 ENSMUSG00000000037 43 159998176 159998259 ## 41 ENSMUSG00000000037 43 160004662 160004745 ## 42 ENSMUSG00000000037 43 160006153 160006236 ## 43 ENSMUSG00000000037 43 160007457 160007540 ## 44 ENSMUSG00000000037 43 160008873 160008956 ## 45 ENSMUSG00000000037 43 160010369 160010452 ## 46 ENSMUSG00000000037 43 160011673 160011756 ## 47 ENSMUSG00000000037 43 160013087 160013170 ## 48 ENSMUSG00000000037 43 160014396 160014479 ## 49 ENSMUSG00000000037 43 160017609 160017692 ## 50 ENSMUSG00000000037 43 160020476 160020753 ## 51 ENSMUSG00000000037 43 160022746 160022860 ## 52 ENSMUSG00000000037 43 160024984 160025187 ## 53 ENSMUSG00000000037 43 160026362 160026544 ## 54 ENSMUSG00000000037 43 160029253 160029363 ## 55 ENSMUSG00000000037 43 160039562 160041192 ## 56 ENSMUSG00000000003 0 76897015 76897229 ## 57 ENSMUSG00000000003 0 76891581 76891720 ## 58 ENSMUSG00000000003 0 76888625 76888692 ## 59 ENSMUSG00000000003 0 76886503 76886613 ## 60 ENSMUSG00000000003 0 76886121 76886222 ## 61 ENSMUSG00000000003 0 76885466 76885517 ## 62 ENSMUSG00000000003 0 76881507 76881720 ## 63 ENSMUSG00000000031 109 142130566 142130647 ## 64 ENSMUSG00000000031 109 142130350 142130484 ## 65 ENSMUSG00000000031 109 142129268 142130267 ## 66 ENSMUSG00000000049 0 108286119 108286233 ## 67 ENSMUSG00000000049 0 108286642 108286818 ## 68 ENSMUSG00000000049 0 108288125 108288221 ## 69 ENSMUSG00000000049 0 108295637 108295713 ## 70 ENSMUSG00000000049 0 108298117 108298305 ## 71 ENSMUSG00000000049 0 108299957 108300136 ## 72 ENSMUSG00000000049 0 108302821 108303018 ## 73 ENSMUSG00000000049 0 108305066 108305222 ``` --- # One function to do it all ```r normalize_rnaseq = function(count, gene_model){ # 1. Normalize read counts by sequencing depth = total reads we got from a sample. # (This gives CPM) # 2. Normalize again with transcript length. # (CPM/Gene length = Transcript per million (TPM)) return(normalized_count) } ``` --- # The easy ```r normalize_rnaseq = function(count, gene_model){ # 1. Normalize read counts by sequencing depth = total reads we got from a sample. # (This gives CPM) * cpm = normalize_by_depth(count) # 2. Normalize again with transcript length. # (CPM/Gene length = Transcript per million (TPM)) return(normalized_count) } ``` -- ```r normalize_rnaseq = function(count, gene_model){ # 1. Normalize read counts by sequencing depth = total reads we got from a sample. # (This gives CPM) cpm = normalize_by_depth(count) # 2. Normalize again with transcript length. # (CPM/Gene length = Transcript per million (TPM)) * gene_lengths = get_tx_length(gene_model) return(normalized_count) } ``` --- # The join ```r normalize_rnaseq = function(count, gene_model){ # 1. Normalize read counts by sequencing depth = total reads we got from a sample. # (This gives CPM) cpm = normalize_by_depth(count) # 2. Normalize again with transcript length. # (CPM/Gene length = Transcript per million (TPM)) gene_lengths = get_tx_length(gene_model) # Make a master table containing both CPM and length per gene * normalized_count = left_join(cpm, gene_lengths, by = "id") return(normalized_count) } ``` --- # The final normalization (CPM/gene length) ```r normalize_rnaseq = function(count, gene_model){ # 1. Normalize read counts by sequencing depth = total reads we got from a sample. # (This gives CPM) cpm = normalize_by_depth(count) # 2. Normalize again with transcript length. # (CPM/Gene length = Transcript per million (TPM)) gene_lengths = get_tx_length(gene_model) # Make a master table containing both CPM and length per gene normalized_count = left_join(cpm, gene_lengths, by = "id") %>% # Divide cpm by gene_length * mutate(tpm = count/length) return(normalized_count) } ``` --- # Final test ```r normalize_rnaseq(test_count, test_gene_model) ``` ``` ## id count normalized_count gene_length tpm ## 1 ENSMUSG00000000001 1472 903067.485 3262 0.451256898 ## 2 ENSMUSG00000000003 0 0.000 902 0.000000000 ## 3 ENSMUSG00000000028 6 3680.982 2143 0.002799813 ## 4 ENSMUSG00000000031 109 66871.166 1217 0.089564503 ## 5 ENSMUSG00000000037 43 26380.368 4842 0.008880628 ## 6 ENSMUSG00000000049 0 0.000 1190 0.000000000 ``` --- # Everything at once -- ```r big_tpm_table = normalize_rnaseq(count = rawcount, gene_model = gene_model) head(big_tpm_table) ``` ``` ## id count normalized_count gene_length tpm ## 1 ENSMUSG00000000001 1472 47.666785 3262 0.451256898 ## 2 ENSMUSG00000000003 0 0.000000 902 0.000000000 ## 3 ENSMUSG00000000028 6 0.194294 2143 0.002799813 ## 4 ENSMUSG00000000031 109 3.529674 1217 0.089564503 ## 5 ENSMUSG00000000037 43 1.392440 4842 0.008880628 ## 6 ENSMUSG00000000049 0 0.000000 1190 0.000000000 ``` -- ```r dim(big_tpm_table) ``` ``` ## [1] 37996 5 ``` --- # Data types .pull-left[ ## Basic/atomic - character - numeric (integer and double) - factor - date and time ] -- .pull-right[ ## Compound {{content}} ] -- - vector: several elements of **the same type** {{content}} -- - array: several elements (>= 2D) of **the same type** {{content}} -- - data.frame: 2D table where **each column can be a different type** {{content}} -- - list: Each element can be anything (vectors, arrays, data.frames...) --- # How to take care of data types - Test early, test often, test with confidence. -- - Think about what type of data is your input and output if possible. -- - Check the types again with `typeof()` whenever you convert it. -- - Failed conversion often results in `NA`. Check with `is.na()`. --- # `character`s are friendly most of the time -- ## Hidden character ```r fake_num = c("1", "3", "5", "7", "9", "l0") ``` -- ```r typeof(fake_num) ``` ``` ## [1] "character" ``` -- ```r as.numeric(fake_num) ``` ``` ## Warning: NAs introduced by coercion ``` ``` ## [1] 1 3 5 7 9 NA ``` --- # Alphabetical and numerical sorting ```r chr_vec = c("5", "8", "6", "10", "7", "9") ``` -- ```r # You might not expect it to sort like this sort(chr_vec) ``` ``` ## [1] "10" "5" "6" "7" "8" "9" ``` -- ```r # If they are numbers, they sort differently sort(as.numeric(chr_vec)) ``` ``` ## [1] 5 6 7 8 9 10 ``` --- # Numeric type: Precision can be dangerous... ```r 0.1 + 0.2 == 0.3 ``` -- ``` ## [1] FALSE ``` There's a website called [https://0.30000000000000004.com/](https://0.30000000000000004.com/) that explains this in detail. But briefly, any number that is not an integer has limited precision, and [propagation of error](https://en.wikipedia.org/wiki/Propagation_of_uncertainty) is a thing. --- # A more robust way comparing non-integers ```r # Define an error margin that you want to tolerate error_margin = 10^-8 # and then decide if the difference is within the margin (0.1 + 0.2) - 0.3 < error_margin ``` ``` ## [1] TRUE ``` --- # Factor: Ordered categories ```r # Categories as characters works most of the time, but... month_tbl = data.frame( month = c( "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" ), length = c( 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ) ) str(month_tbl) ``` ``` ## 'data.frame': 12 obs. of 2 variables: ## $ month : chr "January" "February" "March" "April" ... ## $ length: num 31 28 31 30 31 30 31 31 30 31 ... ``` --- # Most functions will sort alphabetically ```r library(ggplot2) month_tbl %>% qplot(data = ., x = month, y = length, geom = "point") + # This adjust the axis text to make the text more visible theme(axis.text.x = element_text(size = 20, angle = 60, hjust = 1)) ``` ![](function_and_data_types_files/figure-html/unnamed-chunk-50-1.png)<!-- --> --- # Unless you convert it to a factor ```r month_tbl$month = factor( month_tbl$month, # R will respect the levels you set here levels = c( "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" ) ) month_tbl %>% qplot(data = ., x = month, y = length, geom = "point") + # This adjust the axis text to make the text more visible theme(axis.text.x = element_text(size = 20, angle = 60, hjust = 1)) ``` ![](function_and_data_types_files/figure-html/unnamed-chunk-51-1.png)<!-- --> --- # Use factors when... - You know it's ordinal -- - When you are building statistical models with categorical variables (ANOVA et al.) --- # Date and time: Otherwise character ```r # This shouldn't be surprising by now random_dates = c("12-25-2022", "07-04-1989", "01-01-2077") sort(random_dates) ``` ``` ## [1] "01-01-2077" "07-04-1989" "12-25-2022" ``` -- ```r # You can tell R how the date is represented to chronologically sort library(lubridate) good_dates = mdy(random_dates) sort(good_dates) ``` ``` ## ## Attaching package: 'lubridate' ``` ``` ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union ``` ``` ## [1] "1989-07-04" "2022-12-25" "2077-01-01" ``` --- # Take-home messages - Data types exist to help R figure out what to do with your data. -- - A common source of bugs since types are counterintuitive. -- - Proactively thinking about whether your functions assume specific data types help prevent type-related errors. -- - Learn to think about type when you test your code. -- - Incorporating type checks (is.*()) in your code might be worth learning in the future. ```r is.numeric("apple") ``` ``` ## [1] FALSE ```