Data Wrangling, workflow and replicability
This course is offered within the context of the IMPRS BeSmart Summerschool.
- Material
- Videos can be found here. Participants come with different backgrounds, so I suggest that everyone watch the videos before the course starts. This should make for a more productive discussion during the course.
- Here are the slides I use in the videos. Here are the R-commands.
- Not part of the course, but related: Here you find more information on Workflow of statistical data analysis.
- Also not part of the course: Here you find a brief introduction to R.
- Synchronous teaching
- 10 August + 11 August
- Motivation
- A significant part of statistical analysis consists of preparing the data. The researcher must understand the raw data, structure it, and clean it. Causal inference is often only a small part of the work.
In this course we study which steps of data preparation are necessary for a paper like Christoph Engel. “Lucky you: Your case is heard by a seasoned panel—Panel effects in the German Constitutional Court.” Journal of Empirical Legal Studies. 2022. 1179-1221.
We will first discuss a number of tools. We will then give an example of how to apply these tools.
- Finding and replacing text, regular expressions.
- Reading times and dates.
- Working with HTML.
- Working with irregularly structured datasets, repetition.
- Applying these tools to read and clean data from the Federal Constitutional Court.
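To give a flavour of how the first two tools combine, here is a minimal sketch in Python using only the standard library. The input lines, their layout, and the names `raw_lines`, `PATTERN` and `parse_line` are assumptions made up for illustration; they are not taken from the course material or from the actual court data.

```python
import re
from datetime import datetime

# Hypothetical raw lines; the layout is an assumption for illustration only.
raw_lines = [
    "Beschluss vom 12.03.2019 - 1 BvR 123/18",
    "Urteil  vom 5.11.2020 -  2 BvE 4/19",
]

# Named groups: decision type, date (day.month.year), and docket number.
PATTERN = re.compile(
    r"(?P<type>\w+)\s+vom\s+(?P<date>\d{1,2}\.\d{1,2}\.\d{4})\s*-\s*(?P<docket>.+)"
)


def parse_line(line):
    """Return a dict with type, date and docket, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        # Convert the text date into a proper date object.
        "date": datetime.strptime(match.group("date"), "%d.%m.%Y").date(),
        # Find and replace: collapse runs of whitespace into a single space.
        "docket": re.sub(r"\s+", " ", match.group("docket")).strip(),
    }


records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

Returning `None` for lines that do not match, rather than raising an error, makes it easy to count and inspect the lines that need further cleaning.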
- Tools
- We will first outline the example in R. Participants should have installed R, an IDE for R (e.g. RStudio), and the libraries lubridate, stringr, dplyr, tidyr, parallel, ggplot2, mgcv, tidymv, httr, xml2, rvest, xtable.
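If time permits, we will also discuss how to solve the problem with Python. For this part, participants should have installed Python, an IDE for Python, and the libraries pandas, numpy, and dfply. The modules time and locale are also used; they are part of the Python standard library and need no separate installation.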