Introduction to R
This course is offered within the context of the IMPRS BeSmart Summerschool.
- Asynchronous teaching
- Videos can be found here
- Exercises: See below. Participants submit answers each day before exercises start.
- Synchronous teaching
- Daily exercises (16.8.-20.8.), 11:00-12:00.
During synchronous teaching we will use RStudio and the software mentioned below.
- Objectives
- R is a powerful statistical programming language. The course should enable participants to understand the basic structure of this language.
- Handout
- Topics:
- Basics
- Installing R, RStudio, Packages
- Data Types, Numbers, Vectors, Matrices, Arrays.
- More on Data Types
- Missings, Characters, Factors.
- Lists, Data frames
- Randomness
- Data and Functions
- Example datasets.
- Functions.
- Closures.
- Graphs and Files
- Introduction to Graphs.
- Graphs for Univariate and Bivariate Data.
- Files, Reading and Writing Data.
- Control Structures, Structuring Data
- Pipes
- Conditions, Loops, Repetition.
- Structuring Data, Grouping, Summarising, Mutating.
- Selecting Variables, Sorting, Joining, Reshaping Data.
- Tables, Regression.
- Basics
- Software
- For our practical examples (during the entire course) we will use the software environment R. I think that it is helpful to coordinate on one environment. R is free, it is very powerful, and it is popular in the field.
- Documentation for R is provided through the built-in help. You can also find support on the R Homepage.
You might find the following useful:
- The R Guide, Jason Owen (Easy to read, explains R with the help of examples from basic statistics)
- Simple R, John Verzani (Explains R with the help of examples from basic statistics)
- Einführung in R, Günther Sawitzki (In German. Rather compact introduction.)
- Econometrics in R, Grant V. Farnsworth (The introduction to R is rather compact and pragmatic.)
- An Introduction to R, W. N. Venables and D. M. Smith (The focus is more on R as a programming language.)
- The R language definition (Concentrates only on R as a programming language.)
- You can download R from the homepage of the R-project.
- Installing R with Microsoft Windows:
- Download and start the installer. Install R on your local drive. Installing on a network drive or in the cloud (Dropbox, OneDrive, ...) is possible but not recommended.
- Installing R with GNU-Linux:
- Follow the advice to install R for your distribution.
- Installing R with MacOS X:
- Here is a guide to install R with MacOS X.
- In the lecture we use RStudio as a front end.
- We will use the following packages: car, Ecdat, foreign, Hmisc, tidyverse, lattice. If, e.g., the command library(Ecdat) generates an error message (Error in library(Ecdat) : there is no package called 'Ecdat'), you have to install the package.
- Installing packages with Microsoft Windows:
- With RStudio: Use the tab “Install”. Otherwise: Start Rgui.exe and install packages from the menu Packages / Install Packages.
- Installing packages from GNU-Linux or MacOS X:
- From within R use the command install.packages("Ecdat"), e.g., to install the package Ecdat.
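The check for missing packages described above can be sketched in R as follows. This is a minimal sketch and not part of the original course material: the helper name missing_packages is hypothetical, and the install.packages() call is left commented out so that nothing is downloaded unless you decide to run it.

```r
# Packages used in the course (list taken from the text above).
course_packages <- c("car", "Ecdat", "foreign", "Hmisc", "tidyverse", "lattice")

# Hypothetical helper: return those packages that are not installed yet.
# requireNamespace() returns FALSE (instead of stopping with an error)
# if a package cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(course_packages)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Once to_install is empty, library(Ecdat) and the other library() calls should run without the error message quoted above.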
Exercises
Please send your answers to the following questions as an email to oliver@kirchkamp.de. Don’t attach any files to your email.
Exercise 1. Submit before Mon., 16.8., 10:30.
Install R and RStudio. Also install the package Ecdat from within RStudio. The command help(package="Ecdat") gives you a list of the datasets that are provided by the package Ecdat.
Can you find a dataset whose name starts with the same letter as your last name and which contains at least one variable that is a number? If there is no matching dataset, find one with the next letter in the alphabet. After the letter Z, continue with the letter A.
In your answer include your name and the name of the dataset.
How many rows and how many columns does the dataset have?
Choose one variable in the dataset which is a number. With which R command can you calculate the mean of this variable?
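As an alternative to reading the help page, the search for a dataset with a given initial letter can also be done from the R prompt. The following is only a sketch: the helper name datasets_starting_with is hypothetical, and the example uses the built-in package datasets (and its dataset airquality) rather than Ecdat, so that it runs even before Ecdat is installed.

```r
# Hypothetical helper: names of the datasets in a package that start with
# a given letter (upper or lower case).
datasets_starting_with <- function(letter, package = "datasets") {
  # data(package = ...) returns an object whose component "results" is a
  # character matrix with one row per dataset; column "Item" holds the names.
  items <- data(package = package)$results[, "Item"]
  # Some entries look like "beaver1 (beavers)"; keep only the dataset name.
  items <- sub(" .*$", "", items)
  items[toupper(substr(items, 1, 1)) == toupper(letter)]
}

# Example with the built-in package "datasets".
# For the exercise, use package = "Ecdat" once that package is installed.
a_sets <- datasets_starting_with("a")
print(a_sets)

# Compact overview of one dataset (size and type of each variable):
str(airquality)

# TRUE for each variable that is a number:
sapply(airquality, is.numeric)
```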
Exercise 2. Submit before Tue., 17.8., 10:30.
Following the same strategy as in the previous exercise: Find a dataset whose name starts with the same letter as your last name and that contains either a character variable or a variable that is a factor. If there is no matching dataset, proceed alphabetically, until you have found one that contains either a character variable or a variable that is a factor. Once you have reached the letter Z, continue with the letter A.
In your answer include your name and the name of the dataset.
How many variables in the dataset are characters? How many are factors?
How can you find out whether any variable contains any missing values?
Exercise 3. Submit before Wed., 18.8., 10:30.
Now find a dataset that matches your first name and that includes at least one variable that is a number.
In your answer, include your name, the name of the dataset, and the name of the variable.
Find the median of this variable.
Read the help page for the function quantile. How can you use the quantile function to find the median of the above variable?
Write a function Quantile that behaves similarly to quantile, except that it has different defaults. The function Quantile should, if the parameter probs is not specified, only return the minimum, the maximum and the median.
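The last part of this exercise asks for a wrapper that changes the defaults of an existing function. The general pattern can be sketched with a different function, here mean. The name Mean and the choice of na.rm = TRUE as the new default are assumptions made for illustration only; this is not the solution to the exercise.

```r
# Hypothetical wrapper: behaves like mean(), except that missing values
# are dropped by default (mean() itself uses na.rm = FALSE).
Mean <- function(x, na.rm = TRUE, ...) {
  # All other arguments (e.g. trim) are passed on unchanged via "...".
  mean(x, na.rm = na.rm, ...)
}

x <- c(1, NA, 3)
mean(x)                 # NA, because of the missing value
Mean(x)                 # 2, the new default drops the NA
Mean(x, na.rm = FALSE)  # NA, the caller can still override the default
```

Because the wrapper passes everything else through "...", callers can keep using the original function's other arguments without the wrapper having to list them.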
Exercise 4. Submit before Thu., 19.8., 10:30.
Again, find a dataset that matches your first name and that includes at least two variables that are numbers.
In your answer, include your name, the name of the dataset, and the name of the two variables.
With which command can you produce a graph that shows the joint distribution of these two variables?
With which command can you produce a graph that shows only the distribution of the first variable?
In your answer, only include the commands, not the graph!
Exercise 5. Submit before Fri., 20.8., 10:30.
Find a dataset that matches your last name and that includes at least one variable that is a number, and a second variable that has fewer than 12 different values. I will call these (fewer than 12) values “cases”.
In your answer, include your name, the name of the dataset, and the name of the two variables.
With which command can you provide a table that, for each case, shows the mean, the median, and the difference between the mean and the median?
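To find a variable with fewer than 12 different values, it can help to count the distinct values of every variable in a candidate dataset. The following sketch does this for the built-in dataset mtcars; the helper name n_distinct_values is hypothetical, and the treatment of missing values (they are not counted as a value) is an assumption.

```r
# Hypothetical helper: number of distinct (non-missing) values per variable.
n_distinct_values <- function(df) {
  vapply(df, function(v) length(unique(v[!is.na(v)])), integer(1))
}

# Example with the built-in dataset mtcars:
counts <- n_distinct_values(mtcars)
print(counts)

# Variables with fewer than 12 distinct values are candidates for "cases":
names(counts)[counts < 12]
```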