Functions and Data Types

class: center, middle, inverse, title-slide

.title[
# Functions and Data Types
]
.author[
### Yen-Chung Chen & Cassandra Buzby
]
.date[
### 2022-06-30
]

---

# Outline

1. Recap on data filtering and functions

2. Data types and associated horror stories

---
# Introduction to RNA-Seq

.pull-left[
  ![](https://nyusurp.github.io/img/RNAseq.png)
]

.pull-right[
  {{content}}
]

1. cDNA synthesis
{{content}}

2. Fragmentation
{{content}}

3. Sequence & align
{{content}}

4. Quantify

.center[Is it fair to just count the fragments?]

---
# Normalization in RNA-Seq

.pull-left[
  {{content}}
]

## Issues
{{content}}

1. The more you seq, the more you get.
{{content}}

2. The longer the gene is, the more you get.

.pull-right[
## Solutions
{{content}}
]

1. Account for seq depth.by normalize with seq depth
{{content}}

2. Normalize every gene with its length

---
# Depth normalization

What you would essentially do is:
`$$Normalized\space{}count = \frac{Per\mbox{-}gene\space{}count}{Total\space{}count}$$`

What is in the raw data?

```r
# Use read.delim instead of read.csv for **tab**-delimited files
rawcount = read.delim("BR-A-Control_counts.txt")
```

---
# Always know your data

```r
head(rawcount)
```

```
##                   id count
## 1 ENSMUSG00000000001  1472
## 2 ENSMUSG00000000003     0
## 3 ENSMUSG00000000028     6
## 4 ENSMUSG00000000031   109
## 5 ENSMUSG00000000037    43
## 6 ENSMUSG00000000049     0
```

---
# Plan ahead: What are our goals?

1. Normalize read counts by sequencing depth = total reads we got from a sample.

2. Normalize again with transcript length.

---
# `normalize_by_depth()`

How do we define a function that normalize a vector of counts by total counts?

```r
# What is our input?
normalize_by_depth = function(input) {
  # How do we compute total count?
  
  # Let's divide everything by the total count calculated above
  
  # Provide an output
  
}
```

---
# Good practice: Start humble

So you can capture errors right away.

```r
# Take a small fraction of the data so we can test our function
test_count = head(rawcount)
print(test_count)
```

---
# Get the depth

```r
# What is our input?
normalize_by_depth = function(input) {
  # How do we compute total count?
*  depth = sum(input$count)
  # Let's divide everything by the total count calculated above
  
  # Provide an output
  return(depth)
}
```

---
# Test the function
.pull-left[
### Test data

]

--
.pull-right[
### Ground fact

1472 + 6 + 109 + 43 = 1630
]

--
### Testing

```r
normalize_by_depth(test_count) == 1472 + 6 + 109 + 43
```

```
## [1] TRUE
```

---
# Divide everything by total count

```r
# What is our input?
normalize_by_depth = function(input) {
  # How do we compute total count?
  depth = sum(input$count)
  
  # Let's divide everything by the total count calculated above
*  normalized_count = input$count/depth
  input$normalized_count = normalized_count
  
  # Provide an output
*  return(input)
}
```

---
# Test again

.pull-left[
### Test data

```
##                   id count
## 1 ENSMUSG00000000001  1472
## 2 ENSMUSG00000000003     0
```

]

--
.pull-right[
### Ground fact

Total count is 1630.
]

Thus, for the first gene we will expect 1472/1630=0.9030675, while the 
second gene we are expecting 0/1630 = 0.

```r
normalize_by_depth(test_count)
```

```
##                   id count normalized_count
## 1 ENSMUSG00000000001  1472      0.903067485
## 2 ENSMUSG00000000003     0      0.000000000
## 3 ENSMUSG00000000028     6      0.003680982
## 4 ENSMUSG00000000031   109      0.066871166
## 5 ENSMUSG00000000037    43      0.026380368
## 6 ENSMUSG00000000049     0      0.000000000
```

---
# Counts per million (CPM)

`$$CPM = \frac{Count\space{}per\space{}gene}{Total\space{}count}\times{}10^{6}$$`

Let's multiply the result by `$10^{6}$` (10^6) before returning it in our
function.

```r
# What is our input?
normalize_by_depth = function(input) {
  # How do we compute total count?
  depth = sum(input$count)
  
  # Let's divide everything by the total count calculated above
  # and multiply by 10^6 to get CPM
* normalized_count = input$count/depth * 10^6
  input$normalized_count = normalized_count
  
  # Provide an output
  return(input)
}
```

---
# Is it working?

For the first gene, we used to getting 0.9030675 before multiplying with
`$10^{6}$`.

```r
normalize_by_depth(test_count)
```

```
##                   id count normalized_count
## 1 ENSMUSG00000000001  1472       903067.485
## 2 ENSMUSG00000000003     0            0.000
## 3 ENSMUSG00000000028     6         3680.982
## 4 ENSMUSG00000000031   109        66871.166
## 5 ENSMUSG00000000037    43        26380.368
## 6 ENSMUSG00000000049     0            0.000
```

---
# What is our other goals?

<p style = "color: #aaaaaa;">
1. Normalize read counts by sequencing depth = total reads we got from a sample.
</p>

<p style = "color: #000000;">
2. Normalize again with transcript length.
</p>

---
# `get_tx_length()`

A gene model file contains the starting and ending coordinates of genes,
transcripts, and exons.

```r
# Load gene model file (pre-processed)
gene_model = read.csv("mouse_gene_model.csv")
head(gene_model)
```

```
##                   id     start       end
## 1 ENSMUSG00000102628 150956201 150958296
## 2 ENSMUSG00000100595 150983666 150984611
## 3 ENSMUSG00000097426 151012258 151012971
## 4 ENSMUSG00000097426 151013347 151013531
## 5 ENSMUSG00000104478 108344807 108347562
## 6 ENSMUSG00000104385   6980784   6981446
```

---
# Getting transcript length for each gene

How do we define a function that get us lengths for each gene?

```r
library(dplyr)

# What is our input?
get_tx_length = function(input) {
  # How do we compute the sum of all exons of a gene?
  
  # Provide an output
  return(tx_length)
}
```

---
# Hint

In `dplyr` term, you might want to:

1. First, define a new column called `length` defined as `end` - `start` + 1.

2. Group the table by `id`

3. Summarize the table by calculating the `sum()` of column `length`

---
# `mutate()`

```r
# How do we compute the length of each exons?
gene_model %>%
  head() %>%
  mutate(length = (end - start + 1))
```

```
##                   id     start       end length
## 1 ENSMUSG00000102628 150956201 150958296   2096
## 2 ENSMUSG00000100595 150983666 150984611    946
## 3 ENSMUSG00000097426 151012258 151012971    714
## 4 ENSMUSG00000097426 151013347 151013531    185
## 5 ENSMUSG00000104478 108344807 108347562   2756
## 6 ENSMUSG00000104385   6980784   6981446    663
```

---
# Sum of length per id

```r
# What is our input?
get_tx_length = function(input) {
  # How do we compute the sum of all exons of a gene?
  tx_length = input %>%
    mutate(length = (end - start + 1)) %>%
*    group_by(id) %>%
*    summarize(gene_length = sum(length))
  # Provide an output
  return(tx_length)
}
```

---
# What does the output look like?

```r
gene_model %>%
  head() %>%
  get_tx_length()
```

```
## # A tibble: 5 × 2
##   id                 gene_length
##   <chr>                    <dbl>
## 1 ENSMUSG00000097426         899
## 2 ENSMUSG00000100595         946
## 3 ENSMUSG00000102628        2096
## 4 ENSMUSG00000104385         663
## 5 ENSMUSG00000104478        2756
```

.center[Does this look right?]

---
# Use ENSMUSG00000097426 as a test

```r
# Doing it differently with base R
gene_of_interest = gene_model[gene_model$id == "ENSMUSG00000097426", ]
print(gene_of_interest)
```

```
##                   id     start       end
## 3 ENSMUSG00000097426 151012258 151012971
## 4 ENSMUSG00000097426 151013347 151013531
```

```r
gene_of_interest$length = gene_of_interest$end - gene_of_interest$start + 1
print(gene_of_interest)
```

```
##                   id     start       end length
## 3 ENSMUSG00000097426 151012258 151012971    714
## 4 ENSMUSG00000097426 151013347 151013531    185
```

```r
sum(gene_of_interest$length)
```

```
## [1] 899
```

---
# Merging data with a shared column

.pull-left[
## Count table

]

--
.pull-right[
## Gene length

```r
gene_model %>%
  head() %>%
  get_tx_length()
```

.center[Everything is good except for the order...]

---
# `left_join(x, y, by = column)`

- `left_join(x, y, by)` will merge each row according to a shared column.

- In our case, that means we merge rows that have the same `id`.

---
# Minimal example of join

.pull-left[

```r
# Using small test data
print(test_count)
```

.pull-right[

```r
test_gene_model = gene_model %>%
  # Only keep rows that are present in test_count
  filter(id %in% test_count$id)

# Compute gene length with our custom function
test_gene_length = test_gene_model %>%
  get_tx_length()
print(test_gene_length)
```

```
## # A tibble: 6 × 2
##   id                 gene_length
##   <chr>                    <dbl>
## 1 ENSMUSG00000000001        3262
## 2 ENSMUSG00000000003         902
## 3 ENSMUSG00000000028        2143
## 4 ENSMUSG00000000031        1217
## 5 ENSMUSG00000000037        4842
## 6 ENSMUSG00000000049        1190
```
]

---
# `left_join()` time!

```r
left_join(x = test_count, y = test_gene_length, by = "id")
```

```
##                   id count gene_length
## 1 ENSMUSG00000000001  1472        3262
## 2 ENSMUSG00000000003     0         902
## 3 ENSMUSG00000000028     6        2143
## 4 ENSMUSG00000000031   109        1217
## 5 ENSMUSG00000000037    43        4842
## 6 ENSMUSG00000000049     0        1190
```

`left_join()` will take care of row orders even if they are messy!

```r
left_join(x = test_count[c(1, 3, 5, 2, 4, 6), ], y = test_gene_model, by = "id")
```

```
##                    id count     start       end
## 1  ENSMUSG00000000001  1472 108053204 108053462
## 2  ENSMUSG00000000001  1472 108031111 108031153
## 3  ENSMUSG00000000001  1472 108030858 108030999
## 4  ENSMUSG00000000001  1472 108025617 108025774
## 5  ENSMUSG00000000001  1472 108023079 108023207
## 6  ENSMUSG00000000001  1472 108019789 108019918
## 7  ENSMUSG00000000001  1472 108019251 108019404
## 8  ENSMUSG00000000001  1472 108016719 108016928
## 9  ENSMUSG00000000001  1472 108014596 108016632
## 10 ENSMUSG00000000028     6  18630554  18630722
## 11 ENSMUSG00000000028     6  18630265  18630459
## 12 ENSMUSG00000000028     6  18630055  18630114
## 13 ENSMUSG00000000028     6  18629139  18629231
## 14 ENSMUSG00000000028     6  18627482  18627619
## 15 ENSMUSG00000000028     6  18626030  18626173
## 16 ENSMUSG00000000028     6  18624132  18624187
## 17 ENSMUSG00000000028     6  18620551  18620599
## 18 ENSMUSG00000000028     6  18617426  18617487
## 19 ENSMUSG00000000028     6  18616099  18616149
## 20 ENSMUSG00000000028     6  18614575  18614694
## 21 ENSMUSG00000000028     6  18613845  18613976
## 22 ENSMUSG00000000028     6  18613512  18613610
## 23 ENSMUSG00000000028     6  18611928  18612089
## 24 ENSMUSG00000000028     6  18605680  18605818
## 25 ENSMUSG00000000028     6  18605519  18605602
## 26 ENSMUSG00000000028     6  18603823  18603941
## 27 ENSMUSG00000000028     6  18603556  18603632
## 28 ENSMUSG00000000028     6  18600646  18600712
## 29 ENSMUSG00000000028     6  18599197  18599323
## 30 ENSMUSG00000000037    43 159945768 159946244
## 31 ENSMUSG00000000037    43 159954486 159954531
## 32 ENSMUSG00000000037    43 159960243 159960313
## 33 ENSMUSG00000000037    43 159961033 159961267
## 34 ENSMUSG00000000037    43 159970260 159970348
## 35 ENSMUSG00000000037    43 159975200 159975443
## 36 ENSMUSG00000000037    43 159985010 159985093
## 37 ENSMUSG00000000037    43 159992191 159992265
## 38 ENSMUSG00000000037    43 159992605 159992679
## 39 ENSMUSG00000000037    43 159996237 159996320
## 40 ENSMUSG00000000037    43 159998176 159998259
## 41 ENSMUSG00000000037    43 160004662 160004745
## 42 ENSMUSG00000000037    43 160006153 160006236
## 43 ENSMUSG00000000037    43 160007457 160007540
## 44 ENSMUSG00000000037    43 160008873 160008956
## 45 ENSMUSG00000000037    43 160010369 160010452
## 46 ENSMUSG00000000037    43 160011673 160011756
## 47 ENSMUSG00000000037    43 160013087 160013170
## 48 ENSMUSG00000000037    43 160014396 160014479
## 49 ENSMUSG00000000037    43 160017609 160017692
## 50 ENSMUSG00000000037    43 160020476 160020753
## 51 ENSMUSG00000000037    43 160022746 160022860
## 52 ENSMUSG00000000037    43 160024984 160025187
## 53 ENSMUSG00000000037    43 160026362 160026544
## 54 ENSMUSG00000000037    43 160029253 160029363
## 55 ENSMUSG00000000037    43 160039562 160041192
## 56 ENSMUSG00000000003     0  76897015  76897229
## 57 ENSMUSG00000000003     0  76891581  76891720
## 58 ENSMUSG00000000003     0  76888625  76888692
## 59 ENSMUSG00000000003     0  76886503  76886613
## 60 ENSMUSG00000000003     0  76886121  76886222
## 61 ENSMUSG00000000003     0  76885466  76885517
## 62 ENSMUSG00000000003     0  76881507  76881720
## 63 ENSMUSG00000000031   109 142130566 142130647
## 64 ENSMUSG00000000031   109 142130350 142130484
## 65 ENSMUSG00000000031   109 142129268 142130267
## 66 ENSMUSG00000000049     0 108286119 108286233
## 67 ENSMUSG00000000049     0 108286642 108286818
## 68 ENSMUSG00000000049     0 108288125 108288221
## 69 ENSMUSG00000000049     0 108295637 108295713
## 70 ENSMUSG00000000049     0 108298117 108298305
## 71 ENSMUSG00000000049     0 108299957 108300136
## 72 ENSMUSG00000000049     0 108302821 108303018
## 73 ENSMUSG00000000049     0 108305066 108305222
```

---
# One function to do it all

```r
normalize_rnaseq = function(count, gene_model){
  # 1. Normalize read counts by sequencing depth = total reads we got from a sample.
  # (This gives CPM)

# 2. Normalize again with transcript length.
  # (CPM/Gene length = Transcript per million (TPM))

return(normalized_count)
}
```

---
# The easy

```r
normalize_rnaseq = function(count, gene_model){
  # 1. Normalize read counts by sequencing depth = total reads we got from a sample.
  # (This gives CPM)
*  cpm = normalize_by_depth(count)
  # 2. Normalize again with transcript length.
  # (CPM/Gene length = Transcript per million (TPM))

return(normalized_count)
}
```

---
# The join

```r
normalize_rnaseq = function(count, gene_model){
  # 1. Normalize read counts by sequencing depth = total reads we got from a sample.
  # (This gives CPM)
  cpm = normalize_by_depth(count)
  # 2. Normalize again with transcript length.
  # (CPM/Gene length = Transcript per million (TPM))
  gene_lengths = get_tx_length(gene_model)
  
  # Make a master table containing both CPM and length per gene
*  normalized_count = left_join(cpm, gene_lengths, by = "id")
  
  return(normalized_count)
}
```

---
# The final normalization (CPM/gene length)

```r
normalize_rnaseq = function(count, gene_model){
  # 1. Normalize read counts by sequencing depth = total reads we got from a sample.
  # (This gives CPM)
  cpm = normalize_by_depth(count)
  # 2. Normalize again with transcript length.
  # (CPM/Gene length = Transcript per million (TPM))
  gene_lengths = get_tx_length(gene_model)
  
  # Make a master table containing both CPM and length per gene
  normalized_count = left_join(cpm, gene_lengths, by = "id") %>%
  # Divide cpm by gene_length
*    mutate(tpm = count/length)
  
  return(normalized_count)
}
```

---
# Final test

```r
normalize_rnaseq(test_count, test_gene_model)
```

```
##                   id count normalized_count gene_length         tpm
## 1 ENSMUSG00000000001  1472       903067.485        3262 0.451256898
## 2 ENSMUSG00000000003     0            0.000         902 0.000000000
## 3 ENSMUSG00000000028     6         3680.982        2143 0.002799813
## 4 ENSMUSG00000000031   109        66871.166        1217 0.089564503
## 5 ENSMUSG00000000037    43        26380.368        4842 0.008880628
## 6 ENSMUSG00000000049     0            0.000        1190 0.000000000
```

---
# Everything at once

```r
big_tpm_table = normalize_rnaseq(count = rawcount, gene_model = gene_model)
head(big_tpm_table)
```

```
##                   id count normalized_count gene_length         tpm
## 1 ENSMUSG00000000001  1472        47.666785        3262 0.451256898
## 2 ENSMUSG00000000003     0         0.000000         902 0.000000000
## 3 ENSMUSG00000000028     6         0.194294        2143 0.002799813
## 4 ENSMUSG00000000031   109         3.529674        1217 0.089564503
## 5 ENSMUSG00000000037    43         1.392440        4842 0.008880628
## 6 ENSMUSG00000000049     0         0.000000        1190 0.000000000
```

```r
dim(big_tpm_table)
```

```
## [1] 37996     5
```

---
# Data types

.pull-left[
## Basic/atomic
- character
- numeric (integer and double)
- factor
- date and time
]

.pull-right[
## Compound
{{content}}
]

--
- vector: several elements of **the same type**
{{content}}

--
- array: several elements (>= 2D) of **the same type**
{{content}}

--
- data.frame: 2D table where **each column can be a different type**
{{content}}

--
- list: Each element can be anything (vectors, arrays, data.frames...)

---
# How to take care of data types

- Test early, test often, test with confidence.

- Think about what type of data is your input and output if possible.

- Check the types again with `typeof()` whenever you convert it.

- Failed conversion often results in `NA`. Check with `is.na()`.

---
# `character`s are friendly most of the time

## Hidden character

```r
fake_num = c("1", "3", "5", "7", "9", "l0")
```

```r
typeof(fake_num)
```

```
## [1] "character"
```

```r
as.numeric(fake_num)
```

```
## Warning: NAs introduced by coercion
```

```
## [1]  1  3  5  7  9 NA
```

---
# Alphabetical and numerical sorting

```r
chr_vec = c("5", "8", "6", "10", "7", "9")
```

```r
# You might not expect it to sort like this
sort(chr_vec)
```

```
## [1] "10" "5"  "6"  "7"  "8"  "9"
```

```r
# If they are numbers, they sort differently
sort(as.numeric(chr_vec))
```

```
## [1]  5  6  7  8  9 10
```

---
# Numeric type: Precision can be dangerous...

```r
0.1 + 0.2 == 0.3
```
--

```
## [1] FALSE
```

There's a website called [https://0.30000000000000004.com/](https://0.30000000000000004.com/)
that explains this in detail.

But briefly, any number that is not an integer has limited precision, and 
[propagation of error](https://en.wikipedia.org/wiki/Propagation_of_uncertainty) is a thing.

---
# A more robust way comparing non-integers

```r
# Define an error margin that you want to tolerate
error_margin = 10^-8

# and then decide if the difference is within the margin
(0.1 + 0.2) - 0.3 < error_margin
```

```
## [1] TRUE
```

---
# Factor: Ordered categories

```r
# Categories as characters works most of the time, but...
month_tbl = data.frame(
  month = c(
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
  ),
  length = c(
    31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
  )
)
str(month_tbl)
```

```
## 'data.frame':	12 obs. of  2 variables:
##  $ month : chr  "January" "February" "March" "April" ...
##  $ length: num  31 28 31 30 31 30 31 31 30 31 ...
```
---
# Most functions will sort alphabetically

```r
library(ggplot2)
month_tbl %>%
  qplot(data = ., x = month, y = length, geom = "point") +
  # This adjust the axis text to make the text more visible
  theme(axis.text.x = element_text(size = 20, angle = 60, hjust = 1))
```

![](function_and_data_types_files/figure-html/unnamed-chunk-50-1.png)

---
# Unless you convert it to a factor

```r
month_tbl$month = factor(
  month_tbl$month,
  # R will respect the levels you set here
  levels = c(
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
  )
)

month_tbl %>%
  qplot(data = ., x = month, y = length, geom = "point") +
  # This adjust the axis text to make the text more visible
  theme(axis.text.x = element_text(size = 20, angle = 60, hjust = 1))
```

![](function_and_data_types_files/figure-html/unnamed-chunk-51-1.png)

---
# Use factors when...

- You know it's ordinal

- When you are building statistical models with categorical variables (ANOVA et al.)

---
# Date and time: Otherwise character

```r
# This shouldn't be surprising by now
random_dates = c("12-25-2022", "07-04-1989", "01-01-2077")
sort(random_dates)
```

```
## [1] "01-01-2077" "07-04-1989" "12-25-2022"
```

```r
# You can tell R how the date is represented to chronologically sort
library(lubridate)
good_dates = mdy(random_dates)
sort(good_dates)
```

```
## 
## Attaching package: 'lubridate'
```

```
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
```

```
## [1] "1989-07-04" "2022-12-25" "2077-01-01"
```

---
# Take-home messages

- Data types exist to help R figure out what to do with your data.

- A common source of bugs since types are counterintuitive.

- Proactively thinking about whether your functions assume specific data types
help prevent type-related errors.

- Learn to think about type when you test your code.

- Incorporating type checks (is.*()) in your code might be worth learning in
the future.

```r
is.numeric("apple")
```

```
## [1] FALSE
```