Week 1: Concepts

What is Reproducible Research about?

Just as a score precisely describes a piece of music, the data analysis in a research project must be described in a way that lets others reproduce it.

How do you develop the score for data analysis? We want to communicate:

  • What was done.
  • How to reproduce it.

There is no universally agreed standard for communicating these ideas, and there are a variety of ways to do it. We will focus on writing dynamic documents with embedded code, and on sharing the data that the code takes as input, so that other people can reproduce the work that you're doing.

Concept and Ideas

Replication

Sometimes a study cannot be replicated because of restrictions or conditions related to:

  • Point in time when the study was done
  • Money
  • Other

Reproducibility is a middle ground between replicating the whole study and doing nothing. You make publicly available:

  • The original data/metadata
  • The computational methods to process it, which will encompass all of the steps of computational analysis, including all data preprocessing steps.

These allow other people to look at the data, run the analysis you did, and validate your findings. In this sense, reproducible research is about validating the data analysis done in a given piece of research.

What Do We Need?

  • Analytic data are available
  • Analytic code is available
  • Documentation of code and data
  • Standard means of distribution

The Players

  • Authors
  • Readers

Literate (Statistical) Programming

  • An article is a stream of text and code.
  • Each code chunk loads data and computes and formats results (tables, figures, etc.)
  • Article text explains what is going on.
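
As a rough illustration of this idea (not how knitr or any other tool actually works), the Python sketch below treats an article as a list of text and code chunks and "weaves" it by running each code chunk and placing its printed output right after it. The chunk format and the `weave` function are hypothetical names invented for this example.

```python
import contextlib
import io

# A toy "article": a stream of text and code chunks (hypothetical format).
article = [
    ("text", "We compute the mean of three measurements."),
    ("code", "values = [2.0, 4.0, 6.0]\nprint(sum(values) / len(values))"),
    ("text", "The mean is reported above."),
]


def weave(chunks):
    """Run each code chunk and interleave its printed output with the text."""
    env = {}  # shared namespace, so later chunks can see earlier variables
    out = []
    for kind, body in chunks:
        if kind == "text":
            out.append(body)
        else:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):
                exec(body, env)  # run the chunk, capturing what it prints
            out.append(body)
            out.append("## " + buf.getvalue().strip())
    return "\n".join(out)


woven = weave(article)
print(woven)
```

Because the results are computed when the document is built, the numbers in the article cannot drift away from the code that produced them.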

Summary

  • Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate.
  • Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available.
  • There is a growing number of tools for creating reproducible documents.

First Steps in a data analysis

  • Define the question
  • Define the ideal data set
  • Determine what data you can access
  • Obtain the data
  • Clean the data

Define the question

This is the most powerful dimension reduction tool.

If you narrow your question down as specifically as possible, you can set aside many variables that are unrelated to it. A specific question can greatly simplify your problem.

Define the ideal data set

What is the ideal data set for a given problem? The type of data set may depend on your goal.

  • Exploratory - a random sample with many variables measured
  • Inferential - the right population, randomly sampled
  • Predictive - a training and test data set from the same population
  • Causal - data from a randomized study
  • Mechanistic - data about all components of the system

Sources of your data

Be sure to respect the terms of use of your source.

  • The web
  • A vendor

If the data don't exist, you may need to generate them yourself.

Clean the data

Raw data very often needs to be processed. If it is pre-processed, make sure you understand how.

Record the steps you took to clean the data. The most explicit way to do that is to keep the code that takes the raw data and produces the clean data.
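
A minimal sketch of what such a cleaning script might look like in Python is shown below. The raw data, the column names (`id`, `height_cm`) and the three cleaning rules are made up for illustration; the point is only that every step from raw to clean is written down as code.

```python
import csv
import io

# Hypothetical raw data: stray whitespace, a missing value, a duplicate row.
RAW = """id,height_cm
1, 170
2,
3,165
3,165
"""


def clean(raw_text):
    """Return cleaned rows: complete, de-duplicated, and typed."""
    seen = set()
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        rid = rec["id"].strip()
        height = (rec["height_cm"] or "").strip()
        if not height:   # step 1: drop rows with a missing height
            continue
        if rid in seen:  # step 2: drop rows that repeat an id
            continue
        seen.add(rid)
        # step 3: convert the text fields to numbers
        rows.append({"id": int(rid), "height_cm": float(height)})
    return rows


cleaned = clean(RAW)
print(cleaned)
```

In a real project the raw text would be read from a file in the raw data folder and the result written to the processed data folder, leaving the raw file untouched.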

Determine if the data are good enough

When the data is clean enough, ask yourself whether it is good enough to solve your problem. The data may lack the required variables or characteristics, or its source or sampling may not fit your question.

If the data is not good enough, you must change your data or your question.

Last Steps in a data analysis

  • Split the data into training and test sets (for predictions)
  • Exploratory data analysis
  • Statistical prediction/modeling
  • Interpret results
  • Challenge results
  • Synthesize/write up results
  • Create reproducible code

Training and Test Data Sets

If your question is of the predictive type, you must split your data into training and test sets.
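
One simple way to make such a split is sketched below in Python. The 25% test fraction, the seed value and the `train_test_split` helper are assumptions chosen for this example; fixing the seed is what lets someone else obtain exactly the same split.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it in two."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(rows)      # copy, so the original order is kept
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 observations
train, test = train_test_split(data)
print(len(train), len(test))
```

The test set should be put aside and used only to evaluate the final model.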

Exploratory data analysis

You want to get to know your data. Look at summaries, check for missing data and why it is missing, make some plots to understand the relations between variables, and run more detailed exploratory analyses such as clustering.
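
A first pass at summaries and missing-value checks could look like the Python sketch below. The records and the `summarize` helper are hypothetical; in practice a data analysis library or R's `summary()` would do this work.

```python
import statistics

# Hypothetical data set; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
    {"age": None, "income": 48000},
]


def summarize(rows, column):
    """Count missing values and summarize the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    return {
        "missing": len(rows) - len(observed),
        "min": min(observed),
        "median": statistics.median(observed),
        "max": max(observed),
    }


summary = {col: summarize(records, col) for col in ("age", "income")}
for col, stats in summary.items():
    print(col, stats)
```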

Statistical prediction/modeling

The exact methods depend on your question and your data. You may need to transform variables to meet the requirements of the analysis procedure you have chosen.

Your model must account for the sources of uncertainty, and you must report measures of that uncertainty.
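
As one small example of reporting a measure of uncertainty, the Python sketch below gives an estimated mean together with its standard error and an approximate 95% interval. The measurements are invented, and the interval uses the normal quantile 1.96 as a simplifying assumption; other questions call for other measures.

```python
import math
import statistics

# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]

mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(len(sample))
# Rough 95% interval using the normal quantile 1.96; with so few points
# a t quantile would give a somewhat wider interval.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```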

Interpret results

Use precise, appropriate language. Be careful when you use words like:

  • describes
  • correlates with
  • leads to/causes
  • predicts

Explain why the variables and your model fit and answer the question, so that the reader understands the result.

Challenge results

Challenge every aspect of your analysis to find, and possibly amend, any omission or misinterpretation you have made. Challenge:

  • The question
  • The Data source
  • The Processing
  • The Analysis
  • The Conclusions
  • The measures of uncertainty
  • The variables included in the model

Think of potential alternative analyses.

Synthesize/write up results

  • Start with the question as the framework for the following results.
  • Summarize the analysis steps relevant to tell the story or to address a challenge.
  • Order the analysis so that the story is easy to follow, rather than chronologically.
  • Include the figures or graphs that contribute most to the story.

Create reproducible code

Use knitr to build the article that tells the story.

Organizing a Data Analysis

  • Data
    • Raw data
    • Processed data
  • Figures
    • Exploratory figures
    • Final figures
  • R code
    • Raw / unused scripts
    • Final scripts
    • R Markdown files
  • Text
    • README files
    • Text of analysis / report
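
The layout above can be set up with a few lines of code. The Python sketch below creates one possible folder skeleton; the folder names, the `create_project` helper and the placeholder `README.md` are assumptions for illustration, not a standard.

```python
import tempfile
from pathlib import Path

# Folder names are one possible naming of the layout above, not a standard.
LAYOUT = [
    "data/raw",
    "data/processed",
    "figures/exploratory",
    "figures/final",
    "code/raw",
    "code/final",
    "code/rmarkdown",
    "text",
]


def create_project(root):
    """Create the folder skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    (root / "text" / "README.md").touch()  # placeholder README
    return created


# A temporary directory is used here so the example leaves no clutter.
project_root = Path(tempfile.mkdtemp()) / "my_analysis"
created = create_project(project_root)
print(sorted(p.relative_to(project_root).as_posix() for p in created))
```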