Guidelines for the replication project

Learning outcomes

Carry out an empirical research project, from the data manipulation stage to the interpretation of the results
Connect a paper to its relevant literature
Innovate by proposing interesting extensions and additional analyses
Write clean, documented scripts

Four stages

Stage 1: Find the paper and the data

You should first find a paper and the associated data. The main criteria are:

how interesting you find the question tackled by the paper,
how well the data and the paper suit the objectives of the replication project,
whether it is a labour economics paper (this is a requirement).

Given that you have to demonstrate your skills in the different stages (data cleaning, estimation), avoid projects that are too easy in these respects. Also, because you have to propose an extension, look for papers that you find inspiring and that you believe you have a chance of improving on.

You will need to get your proposal explicitly approved by the module teachers before starting work.

Stage 2: Relate the paper to the literature

Usually, (good) papers provide a literature review to explain which previous papers are closest and how they contribute to the literature. You will complement this literature review by:

adding any previous or contemporary paper that was omitted by the author(s),
discussing the follow-up literature: which papers have later built on that paper, how they have improved on it, and whether its conclusions still hold at present.

Stage 3: Replicate the paper

Read the paper and replicate it. For the sake of space, you can choose to replicate only the most important results. You will start by making exactly the same choices of sample and specification as in the original paper. Then, you will vary those choices to assess the robustness of the paper. Crucially, any discrepancy between the findings reported in the original paper and those you obtain should be investigated and commented on. You will realise that the results of most papers tend to depend on the sample or the specification chosen for the estimation.

A good replication does not aim to kill the original paper but to provide a balanced assessment of the robustness of the results. Does any minor deviation from the specification and sample chosen by the author(s) make the results disappear? What are the dimensions in which the results seem to be particularly robust or fragile?

Because this exercise requires varying the sample and the specification, avoid papers for which only the final sample is available or that only provide the covariates included in the final estimation. Prefer papers that start from data that are publicly available (or that you have access to), as you can then replicate the full data cleaning process as well as the estimation.
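
One possible way to organise the robustness exercise described in this stage is a grid that re-estimates the same coefficient for every combination of sample and specification. The sketch below is only an illustration under strong assumptions: the data are simulated, the variable names (age, schooling, log_wage), the sample restrictions and the two specifications are all hypothetical, and the estimator is a hand-written bivariate OLS slope rather than the method of any particular paper.

```python
"""Robustness grid: one coefficient, several samples and specifications.

Everything below (data, variable names, restrictions) is hypothetical
and simulated for illustration; nothing comes from a real paper.
"""
import math
import random


def ols_slope(x, y):
    """Slope of a bivariate OLS regression of y on x, with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Simulated toy data: log wage depends linearly on years of schooling.
rng = random.Random(42)
rows = []
for _ in range(500):
    schooling = rng.randint(8, 20)
    rows.append({
        "age": rng.randint(18, 65),
        "schooling": schooling,
        "log_wage": 1.0 + 0.08 * schooling + rng.gauss(0, 0.3),
    })

# Dimension 1: sample restrictions (the first one mimics the "original" choice).
samples = {
    "baseline sample": lambda r: 25 <= r["age"] <= 55,
    "all ages": lambda r: True,
    "under 40 only": lambda r: r["age"] < 40,
}

# Dimension 2: specifications, here two definitions of the outcome variable.
specifications = {
    "log wage": lambda r: r["log_wage"],
    "wage in levels": lambda r: math.exp(r["log_wage"]),
}

# Estimate the schooling coefficient for every sample/specification pair.
results = {}
for sample_label, keep in samples.items():
    subset = [r for r in rows if keep(r)]
    x = [r["schooling"] for r in subset]
    for spec_label, outcome in specifications.items():
        y = [outcome(r) for r in subset]
        results[(sample_label, spec_label)] = (len(subset), ols_slope(x, y))

for (sample_label, spec_label), (n_obs, slope) in results.items():
    print(f"{sample_label:<16} {spec_label:<15} n={n_obs:>3} slope={slope:.3f}")
```

Keeping the restrictions and specifications in dictionaries makes each deviation from the original paper explicit and easy to report side by side. In a real project, the estimation step would be replaced by the estimator used in the paper.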

Stage 4: Carry out an original analysis

In the last stage, you will propose an additional, original analysis that extends the paper in an interesting way. This is the most difficult part of the project and the one in which you have the opportunity to be creative.
There is no fixed guideline for this part but, as with any piece of research, you will have to:

state the research question in a clear way,
provide a short literature review specific to the question you investigate,
explain your contributions,
present your methodology (assumptions…) and your results.

Practical advice

You should consider this project as a rehearsal for your future applied projects. As such, you will pay special attention to the following aspects.

Coding and scripts

The project will be coded in R or Python. The scripts should:

be written so that they run in whatever folder they are placed in (for instance, by using relative paths),
be documented so that the overall structure and the details of each command are clear. For data cleaning steps, the choices will be explained in plain English in a comment preceding (or following) the line where the change is implemented,
be properly indented,
produce tables and figures, ready to be copied into a report/paper, as an output.
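
A minimal sketch of a script that follows these points, written in Python with the standard library only. The file names, the variables (age, hourly_wage), the toy records and the cleaning rules are hypothetical and serve only to illustrate relative paths, commented cleaning steps and a table written to disk.

```python
"""Skeleton of a replication script. All names and values are hypothetical."""
import csv
from pathlib import Path

# Build every path from the script's own location, never from an absolute
# path, so the project runs unchanged in whatever folder it is placed.
try:
    ROOT = Path(__file__).resolve().parent
except NameError:  # __file__ is undefined in interactive sessions
    ROOT = Path.cwd()
OUTPUT_DIR = ROOT / "output"

# Toy records standing in for the raw data (invented values).
RAW_ROWS = [
    {"id": 1, "age": 34, "hourly_wage": 15.0},
    {"id": 2, "age": 17, "hourly_wage": 8.0},
    {"id": 3, "age": 45, "hourly_wage": None},
    {"id": 4, "age": 52, "hourly_wage": 21.0},
]


def clean(rows):
    """Apply the sample restrictions, giving the reason for each one."""
    # Hypothetical rule: keep respondents aged 18-64, assumed here to be
    # the age range studied in the original paper.
    rows = [r for r in rows if 18 <= r["age"] <= 64]
    # Drop observations with a missing wage: the outcome must be observed.
    rows = [r for r in rows if r["hourly_wage"] is not None]
    return rows


def summarise(rows):
    """Return one summary row: number of observations and mean hourly wage."""
    wages = [r["hourly_wage"] for r in rows]
    return {
        "n_obs": len(wages),
        "mean_hourly_wage": round(sum(wages) / len(wages), 2),
    }


def write_table(summary, path):
    """Save the summary as a CSV file that can be imported into the report."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)


if __name__ == "__main__":
    write_table(summarise(clean(RAW_ROWS)), OUTPUT_DIR / "summary_table.csv")
```

Each cleaning rule sits on its own line under a comment stating the reason, so a reader can see what was dropped and why. In a real project, the toy records would be replaced by a function reading the raw data from a folder located relative to ROOT.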

In R, following the Tidyverse Style Guide is good practice. While it is not compulsory, it is advised to use dplyr for the data cleaning process and ggplot2 to produce the figures. It is also advised to use a version control tool such as Git, together with a hosting platform (for instance GitHub or GitLab), to make the code available.

Report

The report will be under 5,000 words, excluding tables, figures, the reference list and appendices (if any). It will take the form of a paper, with a title, an abstract, a body, a reference list and, if necessary, appendices. I will provide a template that students are welcome to use. Roughly 500-1000 words for the literature review, 2000-2500 words for the replication exercise and 2000 words for the extension is a sensible split (but any other split is possible if it makes sense in a given case).

It is strongly advised to draft the report in LaTeX. Overleaf, for instance, provides a good interface to LaTeX, saves your work in the cloud and does not require any local installation of LaTeX.

The report will be carefully and concisely drafted. It is a piece of research and should be written as such. In particular, standard procedures or methods do not need to be detailed.

Assessment criteria

The replications need to be correct, fair, relevant, and well executed. The following criteria will be assessed (in decreasing order of importance):

the relevance and fairness of the replication exercise. Did the student consider all dimensions and interpret any discrepancies correctly?
the relevance of the extension (and to a lesser extent its difficulty),
the depth and the concision of the literature review: no need to summarise each single article, what matters is the overall picture,
the concision of the report and, to some extent, the cleanliness of the typesetting,
whether the student was able to come up with a relevant paper/dataset on their own,
the cleanliness of the scripts and the extent to which they are documented, indented and easy to read and to run,
the difficulty of the data cleaning process. Did the student start with a public dataset and redo the whole cleaning process? Or did they start with a file that was already quite clean and make just a couple of tweaks?
the difficulty of the estimation (did the student use only basic estimation functions?).