Getting Started with R for Data Analysis: A Data Scientist’s Guide

Why R for Data Analysis?
As someone who's worked with both Python and R, I can confidently say R is the best tool for statistical analysis and visualization. Here's why:
- Built for statistics (created by statisticians, for statisticians)
- Unmatched visualization (ggplot2 is still the gold standard)
- Thousands of specialized packages (CRAN has 18,000+ packages)
- Preferred in academia and research (biology, psychology, economics)
- Reproducible research with RMarkdown and Shiny apps
1. Setting Up Your R Environment
Installation Essentials
- Base R: Download from r-project.org
- RStudio (highly recommended): rstudio.com
- Essential Packages: install.packages(c("tidyverse", "data.table", "ggplot2", "dplyr", "tidyr", "lubridate"))
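Re-running install.packages() reinstalls packages you already have, so it can help to install only what is missing. The following is a minimal sketch using base R only; missing_packages is a hypothetical helper name chosen for this example, not part of any package.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("tidyverse", "data.table", "lubridate")
to_install <- missing_packages(wanted)
print(to_install)

# Install only in an interactive session, and only if something is missing.
# This step needs internet access and a configured CRAN mirror.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The interactive() guard is a design choice for this sketch: it keeps scripts run in batch mode from triggering downloads unexpectedly.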
RStudio Layout Explained
- Script Editor: Write and save your code
- Console: Execute commands interactively
- Environment: See your loaded data
- Plots/Help: Visualizations and documentation
2. R Basics Every Analyst Needs
Data Structures
    # Vectors (homogeneous)
    ages <- c(25, 30, 35, 40)

    # Data frames (like Excel tables)
    df <- data.frame(
      name = c("Alice", "Bob", "Charlie"),
      age = c(25, 30, 35)
    )
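The "homogeneous" note on vectors has a practical consequence: mixing types in one vector silently converts everything to a common type, while a data frame keeps a separate type per column. A small sketch with made-up values:

```r
# Mixing types in c() coerces every element to one common type
mixed <- c(25, "thirty", TRUE)
class(mixed)   # "character"

# A data frame is a list of equal-length vectors, so each column keeps its own type
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age  = c(25, 30, 35)
)
sapply(df, class)   # one class per column
str(df)             # compact overview of the structure
```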
Essential Operations
    # Subsetting
    df[df$age > 30, ]

    # Adding columns
    df$age_next_year <- df$age + 1
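The bracket syntax follows the pattern df[rows, columns], where an empty slot means "all". The sketch below rebuilds the same small data frame and shows a few variations. The drop = FALSE argument is worth knowing, because selecting a single column otherwise returns a plain vector.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age  = c(25, 30, 35)
)

# Rows where age > 30, all columns
older <- df[df$age > 30, ]

# A single column comes back as a plain vector by default
names_only <- df[df$age >= 30, "name"]

# drop = FALSE keeps the result as a data frame
names_df <- df[df$age >= 30, "name", drop = FALSE]

# A logical column derived from a condition
df$is_over_30 <- df$age > 30
```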
3. The Tidyverse Revolution
Hadley Wickham's tidyverse packages have transformed R programming:
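    library(tidyverse)

    # The pipeline groups by gender, which df does not have yet,
    # so add an illustrative column first
    df$gender <- c("F", "M", "M")

    # Modern data manipulation
    df %>%
      filter(age > 30) %>%
      group_by(gender) %>%
      summarize(avg_age = mean(age))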
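The %>% pipe passes the result on its left as the first argument of the call on its right, which is why the steps read top to bottom. The sketch below writes the same computation both ways; the people data frame and its values are made up for illustration.

```r
library(dplyr)

# Illustrative data; the values are made up for this example
people <- data.frame(
  name   = c("Alice", "Bob", "Charlie", "Dana"),
  age    = c(25, 30, 35, 40),
  gender = c("F", "M", "M", "F")
)

# Nested calls read inside-out
nested <- summarize(
  group_by(filter(people, age > 25), gender),
  avg_age = mean(age)
)

# The piped version runs the same steps in reading order
piped <- people %>%
  filter(age > 25) %>%
  group_by(gender) %>%
  summarize(avg_age = mean(age))

all.equal(nested$avg_age, piped$avg_age)
```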
Key Tidyverse Packages
| Package | Purpose | Example Use |
|---|---|---|
| dplyr | Data manipulation | filter(), mutate(), summarize() |
| ggplot2 | Visualization | ggplot() + geom_point() |
| tidyr | Data cleaning | pivot_longer(), drop_na() |
| readr | Fast data import | read_csv() |
| lubridate | Date handling | ymd(), floor_date() |
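To see how several of these functions combine, here is a short sketch that reshapes a wide table, drops a missing value, and totals revenue by month. The sales_wide table and its column names are invented for the example.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Illustrative data: one row per store, one column per day
sales_wide <- tibble(
  store = c("A", "B"),
  `2024-01-15` = c(100, 80),
  `2024-02-03` = c(120, NA)
)

monthly <- sales_wide %>%
  # wide to long: one row per store and date
  pivot_longer(-store, names_to = "date", values_to = "revenue") %>%
  # remove the observation with missing revenue
  drop_na(revenue) %>%
  # parse the date string, then round down to the first of the month
  mutate(month = floor_date(ymd(date), "month")) %>%
  group_by(month) %>%
  summarize(total = sum(revenue))

monthly
```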
4. Your First Complete Analysis
Let's analyze the built-in mtcars dataset:
    library(tidyverse)  # provides glimpse() and ggplot()

    # Load data
    data(mtcars)

    # Explore
    summary(mtcars)
    glimpse(mtcars)

    # Visualization
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      geom_smooth(method = "lm") +
      labs(title = "Car Weight vs MPG")

    # Statistical test
    cor.test(mtcars$wt, mtcars$mpg)
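The line drawn by geom_smooth(method = "lm") is an ordinary linear regression, and you can fit the same model directly to read off its numbers. A minimal sketch using base R only:

```r
data(mtcars)

# Fit mpg as a linear function of weight (wt is recorded in 1000 lbs)
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                # intercept and slope of the fitted line
summary(fit)$r.squared   # share of the variance in mpg explained by wt

# With a single predictor, R-squared equals the squared Pearson correlation
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(summary(fit)$r.squared, r^2)
```

The negative slope says that heavier cars tend to get fewer miles per gallon, which matches the direction of the correlation reported by cor.test().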
5. Importing Real-World Data
From CSV
    library(readr)
    sales <- read_csv("sales_data.csv")
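By default read_csv() guesses each column's type. Declaring the types yourself makes the import more predictable. The sketch below writes a small temporary file first so that it runs on its own; the column names (order_id, amount, date) are assumptions, not the layout of any real sales_data.csv.

```r
library(readr)

# Write a small illustrative file so the example is self-contained
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    order_id = c("001", "002"),
    amount   = c(19.99, 5.5),
    date     = c("2024-01-15", "2024-02-03")
  ),
  path
)

# Declare column types explicitly instead of relying on guessing
sales <- read_csv(
  path,
  col_types = cols(
    order_id = col_character(),
    amount   = col_double(),
    date     = col_date(format = "%Y-%m-%d")
  )
)

str(sales)
```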
From Excel
    library(readxl)
    survey_data <- read_excel("survey_results.xlsx", sheet = 2)
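Selecting a sheet by position can break if someone reorders the workbook, so it may be safer to list the sheets and pick one by name. This sketch uses the example workbook bundled with readxl as a stand-in for a real file.

```r
library(readxl)

# readxl ships with example workbooks; this one stands in for a real file
path <- readxl_example("datasets.xlsx")

# List the available sheets before picking one
excel_sheets(path)

# Select a sheet by name rather than by position
cars <- read_excel(path, sheet = "mtcars")
head(cars)
```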
From Databases
    library(DBI)

    # Connect to a SQLite database file
    con <- dbConnect(RSQLite::SQLite(), "database.db")
    results <- dbGetQuery(con, "SELECT * FROM customers")

    # Close the connection when finished
    dbDisconnect(con)
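To try the database workflow without an existing file, you can use an in-memory SQLite database. The sketch below loads mtcars into a table named cars (a name chosen for this example) and runs a parameterized query, which binds values to placeholders rather than pasting them into the SQL string.

```r
library(DBI)

# In-memory SQLite database, so nothing is written to disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a built-in dataset as a table to have something to query
dbWriteTable(con, "cars", mtcars)

# Parameterized query: the value in params is bound to the ? placeholder
four_cyl <- dbGetQuery(
  con,
  "SELECT mpg, cyl, wt FROM cars WHERE cyl = ?",
  params = list(4)
)

# Release the connection when done
dbDisconnect(con)

nrow(four_cyl)
```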
6. Common Pitfalls & Solutions
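Problem: Factors causing unexpected behavior
Fix: Use stringsAsFactors = FALSE when creating data frames. Since R 4.0.0 this is already the default for data.frame() and read.csv(), so the setting matters mainly on older versions.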
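A classic example of the factor problem is converting a factor of number-like strings to numeric. The sketch below shows why the result can be surprising and one common way around it.

```r
# A factor stores integer codes plus a sorted set of labels (levels)
scores <- factor(c("30", "10", "20"))

as.numeric(scores)                 # 3 1 2: the internal codes, not the values
as.numeric(as.character(scores))   # 30 10 20: convert the labels instead

# Keeping text as character avoids the problem entirely
df <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
class(df$id)                       # "character"
```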
Problem: Memory issues with large datasets
Fix: Use data.table instead of data.frame
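Part of what makes data.table memory-friendly is that it can modify a table by reference instead of copying it. Here is a brief sketch on mtcars; the kpl column (kilometers per liter) is invented for the example.

```r
library(data.table)

cars_dt <- as.data.table(mtcars)

# := adds a column by reference, without copying the whole table
cars_dt[, kpl := mpg * 0.425144]

# General form dt[i, j, by]: filter rows, compute, group
avg_by_cyl <- cars_dt[wt > 2, .(avg_kpl = mean(kpl), n = .N), by = cyl][order(cyl)]

avg_by_cyl
```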
Problem: Package conflicts
Fix: Use the conflicted package or explicit package::function() calls
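The best-known conflict is filter(): loading dplyr masks the filter() function from the stats package, which does something entirely different. A minimal sketch of the explicit-namespace approach:

```r
library(dplyr)   # masks stats::filter() and stats::lag()

# Explicit namespaces make it unambiguous which filter() is meant
four_cyl <- dplyr::filter(mtcars, cyl == 4)       # keep matching rows
smoothed <- stats::filter(1:5, rep(1 / 3, 3))     # centered moving average

nrow(four_cyl)
smoothed
```

With the conflicted package, a call such as conflict_prefer("filter", "dplyr") declares the preference once per session instead of at every call site.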
7. Where to Go Next
Learning Resources
Books:
- “R for Data Science” (free online: r4ds.had.co.nz)
- “The Art of R Programming”
Courses:
- DataCamp's “R Programmer” track
- Coursera's “Data Science Specialization” (Johns Hopkins)
Practice:
- TidyTuesday (weekly data challenges)
- Kaggle R Notebooks
Advanced Topics to Explore
- Functional programming with purrr
- Interactive dashboards with Shiny
- Machine learning with caret or tidymodels
- Geospatial analysis with sf and leaflet
Final Thoughts
R has a steeper learning curve than Python for beginners, but nothing beats it for statistical depth and visualization quality. Start with small projects, master the tidyverse, and you'll soon appreciate why R remains dominant in data-heavy fields like:
- Academic research
- Biostatistics
- Financial modeling
- Social sciences
Pro Tip: Use RMarkdown (rmarkdown.rstudio.com) to combine code, results, and narrative in reproducible reports.