---
title: 'Computing in R: exercises'
author: "Perry Moerland"
#date: '`r format(Sys.time(), "%d %B, %Y")`'
date: "June 12-16, 2023"
urlcolor: blue
output:
unilur::tutorial_pdf_solution:
number_sections: true
unilur::tutorial_html:
toc: true
toc_float: true
theme: default
number_sections: true
unilur::tutorial_html_solution:
toc: true
toc_float: true
solution_suffix: ""
theme: default
number_sections: true
unilur::tutorial_pdf:
number_sections: true
---
```{r setup, include=FALSE, purl=FALSE}
knitr::opts_template$set(solution = list(box.title = "Answer",
box.body = list(fill = "white"),
box.collapse = FALSE))
knitr::opts_chunk$set(echo=TRUE,warning=FALSE,comment="#",fig.path='Figures/Figure-',fig.align='center',fig.width=7,fig.height=7,out.width="50%")
```
The objective of these computer exercises is to become familiar with the basic structure of R and to learn some more sophisticated tips and tricks of R. The R code to perform the analyses, together with the output, is given in a separate file. However, you are strongly advised to write the code yourself. Questions are posed to reflect on what you are doing and to study output and results in more detail.
# Basics
If you have not done so already, start R via **Starten - Alle programma's - R - R 4.3.0** (older versions of R should also be fine). This opens the R console. R is a command line driven environment. This means that you have to type in commands (line-by-line) for it to compute or calculate something. In its simplest form R can therefore be used as a pocket calculator:
```{r twotwo}
2+2
```
**Question 1. (a)**
Look up today's exchange rate of the euro versus the US\$ on the Internet. Use R to calculate how many US\$ is 15 euro.
```{r conversion,solution=TRUE}
# This is more or less the current exchange rate
15 * 1.0720699
```
**(b)**
Round the number you found to the second decimal using function `round` (use `help` or `?` to inspect the arguments of the function) and assign the result to an object `curr`.
```{r rounding,solution=TRUE}
curr <- round(15 * 1.0658, digits = 2)
curr
```
**(c)**
Use `mode`, `str`, and `summary` to inspect the object you created. These functions give a compact display of the type of object and its contents.
```{r summary,solution=TRUE}
mode(curr)
str(curr)
summary(curr)
```
Luckily R is more than just a calculator, it is a programming language with most of the elements that every programming language has: statements, objects, types, classes _etc._
# R syntax: vectors
**Question 2. (a)**
One of the simplest data structures in R is a vector. Use the function `seq` to generate a vector `vec` that consists of all numbers from 11 to 30 (see `?seq` for documentation).
```{r vec,solution=TRUE}
vec <- seq(11,30)
vec
```
**(b)**
Select the 7th element of the vector.
```{r vec7,solution=TRUE}
vec[7]
```
**(c)**
Select all elements except the 15th.
```{r vec-15,solution=TRUE}
# Use '-' to exclude elements
vec[-15]
```
**(d)**
Select the 2nd and 5th element of the vector.
```{r vec2and5,solution=TRUE}
# Use the concatenate function 'c'
vec[c(2,5)]
```
**(e)**
Select only the odd valued elements of the vector `vec`. Do this is in two steps: first create the appropriate vector of indices `index` using the function `seq`, then use `index` to make the selection.
```{r odd,solution=TRUE}
# Instead of "hard coding" the length of the vector by specifying the value 20,
# you can calculate it using the function 'length'. This makes your code more
# generic and less error prone
index <- seq(1,length(vec),by=2)
vec[index]
```
Until now you have only used the R console. However, RStudio (**Starten - Alle programma's - R - RStudio**) provides a much nicer and richer interface when working with R. Although all exercises can be made within the basic R environment, we highly recommend to use RStudio.
**Question 3. (a)**
What are the windows that RStudio consists of?
```{block RStudio,solution=TRUE}
The most important ones are Script, Console, Environment, History, Files, Plots, Packages, Help, Viewer.
```
From now on we advise you to use RStudio. You can use either the Console or the Script window (the upper left window). We recommend you to use the Script window, since this allows you to easily save your code to a file for later usage.
**(b)**
Use the elementary functions `/`, `-`, `^` and the functions `sum` and `length` to calculate the mean $\bar{x}=\sum_i x_i/n$ and the standard deviation $\sqrt{\sum_i(x_i-\bar{x})^2/(n-1)}$
of the vector `vec` of all numbers from 11 to 30. You can verify your answer by using the built-in R functions `mean` and `sd`.
```{r sd,solution=TRUE}
# Don't call the variable 'mean' since the function 'mean' already exists
xmean <- sum(vec)/length(vec)
xmean
mean(vec)
# Calculate the standard deviation in three (baby) steps
numerator <- sum((vec-xmean)^2)
denominator <- (length(vec)-1)
xstd <- sqrt(numerator/denominator)
xstd
sd(vec)
```
**(c)**
Once you completed the analysis, have a look at the Environment and History windows. Do you understand their contents?
**Question 4.**
R comes with many predefined datasets. You can type `data()` to get the whole list. The `islands` dataset gives the areas of the world's major landmasses exceeding 10,000 square miles. It can be loaded by typing:
```{r islands}
# This could in this case actually be skipped since the package datasets is
# already loaded
data(islands)
```
**(a)**
Inspect the object using the functions `head`, `str` _etc._ Also have a look at the help file using `help(islands)`.
**(b)**
How many landmasses in the world exceeding 10,000 square miles are there?
```{r length,solution=TRUE}
# From ?islands: "Description: The areas in thousands of square miles of
# the landmasses which exceed 10,000 square miles." So all of them:
length(islands)
```
**(c)**
Make a Boolean vector that has the value `TRUE` for all landmasses exceeding 20,000 square miles.
```{r sel-islands1,solution=TRUE}
# Remember that area was given in thousands of square miles
islands.more20 <- (islands > 20)
```
**(d)**
Select the islands with landmasses exceeding 20,000 square miles.
```{r sel-islands2,solution=TRUE}
# Use the Boolean vector islands.more20 to select the ones with value TRUE
islands[islands.more20]
```
**(e)**
Make a character vector that only contains the names of the islands.
```{r names-islands,solution=TRUE}
islands.names <- names(islands)
```
**(f)**
The Moluccas have mistakenly been counted as a single island. The largest island of the Moluccas, Halmahera, only has about 7,000 square miles. Remove Moluccas from the data. Hint: an elegant solution uses the Boolean operator `!=`.
```{r moluccas,solution=TRUE}
notMoluccas <- islands.names != "Moluccas"
islands.withoutMoluccas <- islands[notMoluccas]
# Another solution is:
islands.withoutMoluccas <- subset(islands,islands.names!="Moluccas")
```
# R syntax: matrices
**Question 5. (a)**
Create a character matrix `M` with 6 rows and three columns containing the first eighteen letters of the Roman alphabet (hint: see `?letters`). Fill the matrix column-wise.
```{r matrix-create,solution=TRUE}
letters
M <- matrix(letters[1:18], nrow = 6, ncol = 3)
M
```
**(b)**
Which is the letter on the fifth row and second column?
```{r matrix-select,solution=TRUE}
M[5, 2]
```
**(c)**
Replace the letters in the last column of `M` with the last six letters of the Roman alphabet.
```{r matrix-replace,solution=TRUE}
M[,3] <- letters[21:26]
M
```
**(d)**
Make a character vector containing all the vowels and use this vector to replace the vowels in `M` with the character string "vowel".
```{r matrix-vowel,solution=TRUE}
# Let's assume that "y" is also a vowel
vowels <- c("a","e","i","o","u","y")
M[M %in% vowels] <- "vowel"
M
```
**(e)**
Now replace all consonants in `M` with the character string "consonant".
```{r matrix-consonant,solution=TRUE}
# '!=': not equal to
M[M!="vowel"] <- "consonant"
M
```
**(f)**
Count the number of elements in `M` that are equal to "consonant" and "vowel" respectively.
```{r matrix-count,solution=TRUE}
sum(M=="consonant")
sum(M=="vowel")
```
# Data frames, data import, missing values and factors
**Question 6.**
We are going to use yet another predefined dataset: `mtcars`. These data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). See `?mtcars` for the definition of the variables contained in it.
**(a)**
What kind of data structure is the `mtcars` data?
```{r cars-class,solution=TRUE}
# See '?mtcars', it is a data frame. On the command line you can use 'class' or 'str'
class(mtcars)
str(mtcars)
```
**(b)**
How many car models can ride more than 25 miles per gallon?
```{r cars-mpg,solution=TRUE}
sum(mtcars$mpg > 25)
# Other solutions using different ways of extracting a column from a data frame
sum(mtcars[,"mpg"] > 25)
sum(mtcars["mpg"] > 25)
sum(mtcars[,1] > 25)
```
**(c)**
What are the names of the car models that can ride more than 25 miles per gallon?
```{r cars-mpg-names,solution=TRUE}
rownames(mtcars[mtcars$mpg > 25,])
```
**(d)**
Select the car models that can ride less than 22 miles per gallon and have more than 6 cylinders.
```{r cars-extract,solution=TRUE}
# Use the Boolean operator '&' (logical AND)
mtcars[(mtcars$mpg < 22) & (mtcars$cyl > 6), ]
```
**(e)**
In addition to the previous selection, also select the car models that either have more than 3 carburetors or weigh less than 3500 lbs. How many car models satisfy these conditions?
```{r cars-extract2,solution=TRUE}
# Use the Boolean operator '|' (logical OR).
# Note that the weight is specified per 1000 lbs (see ?mtcars)
mtcars[(mtcars$mpg < 22) & (mtcars$cyl > 6) & ((mtcars$carb > 3) | (mtcars$wt) < 3.5), ]
# The number of car models is the number of rows, that is the first dimension
dim(mtcars[(mtcars$mpg < 22) & (mtcars$cyl > 6) & ((mtcars$carb > 3) | (mtcars$wt) < 3.5), ])
# The number of rows can also be retrieved with 'nrow'
nrow(mtcars[(mtcars$mpg < 22) & (mtcars$cyl > 6) & ((mtcars$carb > 3) | (mtcars$wt) < 3.5), ])
```
**(f)**
The variable `am` is a numeric variable that indicates the type of transmission:
* 0: automatic
* 1: manual
Add a variable called `am_char` to the `mtcars` data set where you recode the variable `am` and replace 0 by "automatic" and 1 by "manual".
```{r cars-am-recode,solution=TRUE}
mtcars$am_char[mtcars$am == 0] <- "automatic"
mtcars$am_char[mtcars$am == 1] <- "manual"
mtcars[,c("am","am_char")]
```
We will work with a data set that gives the survival status of passengers on the Titanic. See [titanic3info.txt](https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3info.txt) for a description of the variables. Some background information on the data can be found on [titanic.html](https://hbiostat.org/data/repo/titanic.html).
**Question 7. (a)**
Download the `titanic` data set `titanic3.dta` in STATA format from the [course website](https://bioinformaticslaboratory.eu/gs-computing-in-r/) and import it into R using the function `read.dta`. Do not use the function `read_dta` from the **haven** package nor the 'Import Dataset' option in RStudio. This leads to slight differences in the imported data which may lead to problems in some of the exercises. Give the data set an appropriate name as an R object. E.g.,\ we can call it `titanic3` (but feel free to give it another name). Before importing the data, you first have to load the **foreign** package in which the function `read.dta` is defined. Moreover, you probably will need to point R to the right folder where you saved the file `titanic3.dta`. You can do this by changing the so-called _working directory_, which will be explained later. Using the basic R environment you can do this via the menu (**File - Change dir**), and similarly for RStudio (**Session - Set Working Directory - Choose Directory**).
```{r stata-imp,echo=FALSE}
library(foreign)
titanic3 <- read.dta("titanic3.dta", convert.underscore=TRUE)
```
```{r stata-imp,eval=FALSE,solution=TRUE}
```
We take a look at the values in the `titanic` data set.
**(b)**
First, make the whole data set visible in a spreadsheet like format (how this is done depends on whether base R or RStudio is used).
```{block answers2,solution=TRUE}
This can be done using `View` on the command line. You can also use the menu: in R use **Edit - Data editor**, in RStudio click on the name of the data set in the **Environment** window.
```
**(c)**
Often, it is sufficient to show a few records of the total data set. This can be done via selections on the rows, but another option is to use the functions `head` and `tail`. Use these functions to inspect the first 4 and last 4 records (instead of 6 which is the default).
```{r head,solution=TRUE}
head(titanic3, n = 4)
tail(titanic3, n = 4)
```
The function `dim` can be used to find the number of rows (passengers in this case) and columns (variables) in the data.
```{r dim,eval=FALSE}
dim(titanic3)
```
An alternative is the function `str`. This function shows information per column (type of variable, first records, levels in case of factor variables) as well as the total number of rows and columns.
```{r str,eval=FALSE}
str(titanic3)
```
**(d)**
Issue both commands and study the output.
```{r dim,solution=TRUE}
```
**(e)**
Summarize the data set by using the `summary` function.
```{r summarydf,solution=TRUE,samepage=TRUE}
# Here the summary is shown for four of the variables only
summary(titanic3[,c("survived","pclass","home.dest","dob")])
```
This gives a summary of all the variables in the data set. We can see that for categorical variables like `pclass` a frequency distribution is given, whereas for continuous variables like
`age` numerical summaries are given. Still, for some variables we do not automatically get what we would like or expect:
* The variable `survived` is dichotomous with values zero and one, which are interpreted as numbers.
* The categorical variable `pclass` is represented differently from the categorical variable `home.dest`.
* The variable `dob`, which gives the date of birth, is represented by large negative numbers.
In subsequent exercises, we will shed further light on these anomalies and we will try to repair some of these.
**Question 8.**
We can also summarize specific columns (variables) of a `data.frame`. There are many ways to summarize a variable, depending on its structure and variability. For continuous variables, the same `summary` function can be used, but other options are the functions `mean`, `quantile`, `IQR`, `sd` and `var`. For categorical summaries, one may use `summary` and `table`. Note that missing values are treated differently depending on the function used.
**(a)**
Summarize the age variable in the `titanic` data set (the age column is selected via `titanic3$age`). Give the 5\%, 25\%, 50\%, 75\% and 95\% quantiles and the inter-quartile range. You may have to take a look at the help files for the functions.
```{r summaries,solution=TRUE}
# Note that you have to convert percentages into probabilities
quantile(titanic3$age, probs=c(0.05,0.25,0.5,0.75,0.95), na.rm=TRUE)
IQR(titanic3$age, na.rm=TRUE)
```
**(b)**
Summarize the `survived` variable using the `table` function. Also give a two-by-two table for sex and survival status using the same `table` function.
```{r tables,solution=TRUE}
table(titanic3$survived)
table(titanic3$sex, titanic3$survived)
```
**Question 9 (OPTIONAL).**
Instead of the file in STATA format, we also made available part of the `titanic` data set in tab-delimited format (`titanic3select.txt`). Download this file from the [course website](https://bioinformaticslaboratory.eu/gs-computing-in-r/) and then import it using the command `read.table`. Note that this file has been manipulated in Excel and that importing the data is not as straightforward as you would have hoped. You will need to tweak the arguments of `read.table` and/or make some changes to the file manually.
```{block read-titanic-txt,solution=TRUE}
Running the following line of code:
`titanic3.select <- read.table("titanic3select.txt",sep="\t",header=TRUE)`
throws an error message:
`Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 16 did not have 18 elements
Error during wrapup: cannot open the connection`
This error message can in principle be corrected by adding the argument `fill=TRUE`, which adds blank fields if rows have unequal length. However, running
`titanic3.select <- read.table("titanic3select.txt",sep="\t",header=TRUE,fill=TRUE)`
throws a warning message:
`Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string`
It turns out that the data was not correctly imported, which is often easily diagnosed by inspecting the dimensions:
`dim(titanic3.select)`
Indeed the number of rows is 25 instead of the expected 30. This is caused by a trailing ' at row 25 for variable `home_dest`. This can be corrected by changing the default set of quoting characters:
`titanic3.select <- read.table("titanic3select.txt", sep="\t", header=TRUE, fill=TRUE, quote="\"")`
A closer inspection of the data shows that the last column only contains NAs caused by trailing spaces. On the course website there is a [corrected version](https://bioinformaticslaboratory.eu/wp-content/uploads/gs-computing-in-r/titanic3select_corrected.txt) of the file where the last column and the trailing ' at row 25 were deleted.
`titanic3.select <- read.table("titanic3select_corrected.txt",sep="\t",header=TRUE)`
```
**Question 10.**
We have a further look at the output from the `summary` function, which was not always what we would like to see. First, give the commands (`sapply` will be explained later):
```{r diffchar-fact}
sapply(titanic3, mode)
sapply(titanic3, is.factor)
```
We see that `survived` has mode "numeric" and is not a `factor` (`is.factor(titanic3$survived)` gives `FALSE`). It is interpreted in the same way as the truly continuous variable `age`.
**(a)**
Add a variable `status` to the `titanic` data set, which gives the survival status of the passengers as a factor variable with labels "no" and "yes" describing whether individuals survived. Make the value "no" the first level (see [titanic3info.txt](https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3info.txt) for the reason).
```{r factor-survived,solution=TRUE}
# titanic3info.txt: Survival (0 = No; 1 = Yes)
titanic3$status <- factor(titanic3$survived, labels=c("no","yes"))
```
**(b)**
Variable `pclass` has mode "numeric" and is a `factor`, whereas `home.dest` has mode "character" and is not a `factor`. Give the commands
```{r as-numeric,eval=FALSE}
as.numeric(head(titanic3$pclass))
as.numeric(head(titanic3$home.dest))
```
and explain the difference in output.
```{r as-numeric,solution=TRUE,warning=FALSE}
```
We do not change the variables in this dataset that have mode "character". They are variables that have many different values, and there is no reason to convert them into factors.
**Question 11. (a)**
Take a look at the name (`name`), and the home town/destination (`home.dest`) of all passengers who were older than 70 years. Use the appropriate selection functions.
```{r select-old,solution=TRUE}
# Note that the function 'subset' also leaves out the passsengers for which the
# variable 'age' is missing
subset(titanic3,age>70)[,c("name","home.dest")]
# Another solution is
subset(titanic3,age>70,select=c(name,home.dest))
# Yet another solution is
titanic3[titanic3$age>70 & !is.na(titanic3$age),c("name","home.dest")]
```
**(b)**
There is one person from Uruguay in this group. Select the single record from that person. Did this person travel with relatives?
```{r Uruguay,solution=TRUE}
subset(titanic3, name=="Artagaveytia, Mr. Ramon")
# No he didn't travel with relatives, since the variable 'family' is 'no'
```
**(c)**
Make a table of survivor status by sex, but only for the first class passengers. Use the `xtabs` function.
```{r table-class,solution=TRUE}
xtabs(~sex+status, data=titanic3, subset=(pclass=="1st"))
```
# Graphics
## Basic graphics
The graphics subsystem of R offers a very flexible toolbox for high-quality graphing. There are typically
three steps to producing useful graphics:
* Creating the basic plot
* Enhancing the plot with labels, legends, colors _etc._
* Exporting the plot from R for use elsewhere
In the graphics model that R uses, a figure region consists of a central plotting region surrounded by margins. The basic graphics command is `plot`. The following piece of code illustrates the most
common options:
```{r plot}
plot(3,3,main="Main title",sub="subtitle",xlab="x-label",ylab="y-label")
text(3.5,3.5,"text at (3.5,3.5)")
abline(h=3.5,v=3.5)
for (side in 1:4) mtext(-1:4,side=side,at=2.3,line=-1:4)
mtext(paste("side",1:4),side=1:4,line=-1,font=2)
```
In this figure one can see that
* Sides are labelled clockwise starting with _x_-axis
* Coordinates in the margins are specified in _lines of text_
* Default margins are not all wide enough to hold all numbers -1,...,4
You might want to use the `help` function to investigate some of the other functions and options.
Plots can be further finetuned with the `par` function. For instance, the default margin sizes can be changed using `par`. The default settings are
```{r}
par("mar")
```
This explains why only side 1 in the figure had a wide enough margin. This can be remedied by setting
```{r parmar5,eval=FALSE}
par(mar=c(5,5,5,5))
```
before plotting the figure.
**Question 12. (a)**
Try making this figure yourself by executing the code shown above.
**(b)**
Save the figure you just made to a file. For this you have to know that R sends graphics to a _device_. The default device on most operating systems is the screen, for example the "Plots" window in RStudio. Saving a figure, therefore, requires changing R's current device. See `help(device)` for the options. Save the figure you just made to a `png` and a `pdf` file. Don't forget to close the device afterwards.
```{r png,eval=FALSE,solution=TRUE}
png(file="figure.png")
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()
pdf(file="figure.pdf")
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()
```
**(c)**
When saving a figure to file, default values for the figure size are used. Save the figure to a `pdf` file with width and height of 21 inches.
```{r pdf,eval=FALSE,solution=TRUE}
pdf(file="figureLarge.pdf",width=21,height=21)
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()
```
**Question 13.**
The `quakes` data set gives location and severity of earthquakes off Fuji. First load the data:
```{r quakes}
data(quakes)
```
**(a)** Make a scatterplot of latitude versus longitude. You should obtain a graph similar to:
```{r plot-quakes1,echo=FALSE}
plot(quakes$long, quakes$lat)
```
```{r plot-quakes1,eval=FALSE,solution=TRUE}
```
**(b)**
Use `cut` to divide the magnitude of quakes into three categories (cutoffs at 4.5 and 5) and use `ifelse` to divide depth into two categories (cutoff at 400). Hint: have a careful look at the (admittedly complicated) help page of `cut`, in particular the arguments `breaks` and `include.lowest`.
```{r mag-quakes,echo=FALSE}
mag.cat <- with(quakes,cut(mag, breaks = c(4, 4.5, 5, 7),
include.lowest = TRUE ))
# More generic using the functions 'min' and 'max' that return the
# minimum and maximum, respectively, of a vector
mag.cat <- with(quakes,cut(mag, breaks = c(min(mag), 4.5, 5, max(mag)),
include.lowest = TRUE ))
levels(mag.cat) <- c("low", "medium", "high")
depth.cat <- factor(ifelse(quakes$depth>400, "deep", "shallow"))
```
```{r mag-quakes,eval=FALSE,solution=TRUE}
```
**(c)**
Redraw the plot, with symbols by magnitude and colors by depth. You should obtain a graph similar to:
```{r plot-quakes2,echo=FALSE}
# Note that for 'col' the factor 'depth.cat' is interpreted as numeric (deep=1 and
# shallow=2) and that the first two colours of palette() are used. Strangely enough
# for 'pch' this is not the case and the factor ''mag.cat' has to be converted to a
# numeric vector using the function 'as.numeric'.
plot(quakes$long, quakes$lat, pch=as.numeric(mag.cat), col=depth.cat)
```
```{r plot-quakes2,eval=FALSE,solution=TRUE}
```
**(d)**
The magnitude of the earthquakes is given in Richter scale. Calculate the energy released in each earthquake as $10^{(3/2)}$ to the power of the Richter measurement.
```{r energy,echo=FALSE}
energy <- (10^(3/2))^quakes$mag
```
```{r energy,eval=FALSE,solution=TRUE}
```
**(e)**
The `cex` argument of `plot` can be used to scale the plot symbols. We will scale the plot symbols so that the surface of the plot symbol is proportional to the released energy. Calculate plot symbol size as the square root of the energy divided by the median square root of the energy (to get median symbol size 1).
```{r symbolsize,echo=FALSE}
symbolsize <- sqrt(energy)/median(sqrt(energy))
```
```{r symbolsize,eval=FALSE,solution=TRUE}
```
**(f)**
Plot the magnitude of earthquakes again, but with plot symbols sized by energy. Keep the coloring by depth. You should obtain a graph similar to:
```{r plot-quakes3,echo=FALSE}
plot(quakes$long, quakes$lat, cex=symbolsize, col=depth.cat)
```
```{r plot-quakes3,eval=FALSE,solution=TRUE}
```
**Question 14.**
Different plots can be combined in a single window (device) via , e.g., `par(mfrow=c(..))` or `layout(..)`. Combine the histogram and the two boxplots for the `titanic` data from the lecture into a single plot. Find a nice layout of your final plot, see help files for setting proper margins, etc. You should obtain a graph similar to:
```{r plot-titanic,echo=FALSE}
par(mfrow=c(3,1))
# Make the margins smaller than default
par(mar=c(2,4,2,2))
hist(titanic3$age,breaks=15,freq=FALSE)
boxplot(titanic3$fare,ylim=c(0,300),ylab="fare")
boxplot(fare~pclass,data=titanic3,ylim=c(0,300),ylab="fare")
```
```{r plot-titanic,eval=FALSE,solution=TRUE}
```
# Structure of R
**Question 15.**
Investigate which environments are in the search path. Take a look at the objects that exist in the workspace. Which is the current working directory?
```{r search,solution=TRUE}
# You will get different results depending on your configuration
search()
ls()
getwd()
```
**Question 16.**
The function `mean` is defined in the **base** package, which is included in the search path at startup. From the `search` command, we can see where in the search path the **base** package is located. Issue the command
```{r ls,eval=FALSE}
ls("package:base", pattern="mean")
```
Instead of the argument `package:base`, we could have given the position in the search path of the **base** package (that is `ls(10, pa="mean")` if base is in position 10). Have a look at the different names of the `mean` function.
**Question 17.**
Save the `titanic` data set in R binary format with extension ".RData".
```{r save,eval=FALSE,solution=TRUE}
save(titanic3, file="Titanic.RData")
```
# Data manipulation
**Question 18.**
Sort the `titanic` data set according to the age of the passengers and store the result in a separate object, e.g.\ named `data.sorted`. Have a look at the 10 youngest and the 10 oldest
individuals. For reasons of space, restrict to the first 5 columns when showing the results. What do you notice with respect to passenger class and age?
```{r arrange,solution=TRUE,warning=FALSE,message=FALSE}
# If not installed yet, first install the package dplyr that contains the function
# arrange
#install.packages("dplyr")
library(dplyr)
data.sorted <- arrange(titanic3, age)
head(data.sorted[,1:5],10)
# Use is.na to exclude the passengers for whom 'age' is missing
tail(subset(data.sorted,!is.na(age))[,1:5],10)
```
**Question 19.**
Give a summary of the fare paid for the three passenger classes separately, using the `summary` function, the `subset` function for selecting the appropriate rows, as well as one of the mechanisms for selecting columns.
```{r summary-fare1,solution=TRUE}
summary(subset(titanic3, pclass=="1st")$fare)
summary(subset(titanic3, pclass=="2nd")$fare)
summary(subset(titanic3, pclass=="3rd")$fare)
```
```{block summary-fare2,solution=TRUE,box.title="Remark"}
Instead of writing three lines of code, applying the `summary` function to each of the three subsets, we could write a `for` loop. When using a `for` loop, we need to explicitly write the `print` command:
`for(i in 1:3){print(summary(subset(titanic3, pclass==levels(pclass)[i])$fare))}`
The function `aggregate` is yet another option:
`aggregate(fare~pclass,data=titanic3,FUN=summary)`
Later, we will see even more efficient ways.
```
**Question 20.**
Create an extra variable that categorizes the variable `sibsp` (the number of siblings/spouses aboard) into three groups: 0, 1 and more than 1. Also, create an extra factor variable named `paid`, which shows whether the individual paid for the trip (i.e. whether fare >0). Preferably, use the `within` or the `transform` function. Check whether the results are correct.
```{r transform,solution=TRUE}
titanic3 <- within(titanic3, {
sibspcat <- cut(sibsp, breaks=c(0,1,2,9),include.lowest=TRUE,
right=FALSE, labels=c("0","1","2-8"))
paid <- factor(fare>0, labels=c("no","yes"))
} )
titanic3 <- transform(titanic3,
sibspcat = cut(sibsp, breaks=c(0,1,2,9),include.lowest=TRUE,
right=FALSE, labels=c("0","1","2-8")),
paid = factor(fare>0, labels=c("no","yes"))
)
```
# Documentation and help
**Question 21.**
R is also very flexible with respect to manipulation of character data, such as words. In this exercise, you will see an example.
**(a)**
Create a character vector of length three that consists of the words "Academic", "Medical" and "Center". Give the object the name `AMC`. Check the _mode_ of the object.
```{r AMC-1,echo=FALSE,results="hide"}
AMC <- c("Academic","Medical","Center") # a character vector named AMC
mode(AMC)
is.character(AMC)
```
```{r AMC-1,solution=TRUE}
```
**(b)**
Next, we want to abbreviate the word and obtain `AMC` as output. Try to find the appropriate functions using the commands
```{r AMC-2,eval=FALSE}
help.search("abbreviate")
help.search("combine")
help.search("quote")
```
```{r AMC-3,solution=TRUE}
#help(abbreviate)
abbreviate(AMC,1) # selects first character
#help(paste)
paste(abbreviate(AMC,1),collapse="") # gives AMC
```
Finally, if you want to remove the quotes give the following command
```{r AMC-4,results="hide"}
noquote(paste(abbreviate(AMC,1),collapse="")) # removes quotes
```
**Question 22.**
Try to find more information on a topic of interest and how R can be of help. If you do not have any idea, you can search for the keyword "crosstab".
```{r crosstab,eval=FALSE,solution=TRUE}
help(crosstab)
help.search("crosstab")
RSiteSearch("crosstab")
install.packages("sos")
library(sos)
findFn("crosstab")
```
# Statistical analysis
**Question 23.**
Fit a linear model that predicts fare as a function of age. Since fare has a very skewed distribution, we use the transformed variable `log10(fare+1)`. Consider the following issues.
**(a)**
Fit the model and store the results in an R object. Summarize the model using the `summary` function.
```{r linmodel-fit,solution=TRUE}
fit.fare <- lm(log10(fare+1) ~ age, data=titanic3)
summary(fit.fare)
```
**(b)**
One of the assumptions of a standard linear model is that the residuals have approximately normal distribution. Make a histogram of the residuals, using the functions `resid` and `hist`. You should obtain a graph similar to:
```{r hist-resid,echo=FALSE}
fit.fare <- lm(log10(fare+1) ~ age, data=titanic3)
hist(resid(fit.fare))
```
```{r hist-resid,eval=FALSE,solution=TRUE}
```
**(c)**
Make a plot of the residuals against the fitted values, using (with `fit.fare` the name of the object that contains the linear model fit):
```{r resid,eval=FALSE}
plot(resid(fit.fare)~fitted(fit.fare))
```
**(d)**
Make a scatterplot of fare against age and add the linear regression line. A fitted regression line is added to a plot via `abline(fit.fare)`. You should obtain a graph similar to:
```{r fareage,echo=FALSE}
plot(log10(fare+1)~age,data=titanic3)
abline(fit.fare)
```
```{r fareage,eval=FALSE,solution=TRUE}
```
**(e)**
Does the object have a class? If so, which are the generic functions that have a method for this class?
```{r linmodel-class,solution=TRUE}
class(fit.fare)
methods(class="lm")
```
# Programming and ply functions
Functions from the `apply` family are convenient shorthands for repetitions.
**Question 24.**
Use `apply` to calculate the mean of the variables `age`, `fare`, and `body` of `titanic3`.
```{r apply,solution=TRUE}
apply(subset(titanic3,select=c("age","fare","body")),2,mean,na.rm=TRUE)
```
**Question 25.**
The `chickwts` data describes chicken weights by feed type. First load the data:
```{r chickwts}
data(chickwts)
```
**(a)**
Calculate the mean weight for each feed type.
```{r mean-weight,solution=TRUE}
tapply(chickwts$weight, chickwts$feed, mean)
```
**(b)**
Count the number of chicks with weight over 300 grams,
```{r number-chicks,solution=TRUE}
sum(chickwts$weight > 300)
```
Further abstraction of the R code is possible through _functions_. Functions lead to more compact code that is easier to understand and also avoid duplication of the same code over and over again.
```{r func,eval=FALSE}
name <- function(arg_1,arg_2, ...){
expr
}
```
* `expr` is, in general, a grouped expression containing the arguments `arg_1, arg_2` ...
* A call to the function is of the form `name(expr_1,expr_2, ...)`
* The value of the expression `expr` is the value returned by the function
**Question 26. (a)**
For the `chickwts` data, write a function that takes a vector `x` as input and returns the number of observations in `x` greater than 300.
```{r gt-chicks,solution=TRUE}
greater300 <- function(x) {
sum(x > 300)
}
```
**(b)**
Calculate the number of chicks with weight over 300 grams for each feed type.
```{r gt-chicks2,solution=TRUE}
tapply(chickwts$weight, chickwts$feed, greater300)
```