Data Insights Section — 24-Hour Crash Course

Hour 22 of 24 — Statistics & Probability for DI

Master descriptive stats, correlation, regression, and probability to dominate the Data Insights section.

Core Concepts

1. Descriptive Statistics

Measures of Center

  • Mean: Sum of values ÷ count. Sensitive to outliers.
  • Median: Middle value when sorted. Resistant to outliers.
  • Mode: Most frequent value. Can be multimodal.

Measures of Spread

  • Range: Max − Min. Very sensitive to outliers.
  • IQR: Q3 − Q1. Robust measure of middle 50% spread.
  • Std Dev (SD): Roughly the typical distance of values from the mean (the square root of the average squared deviation). Larger SD = more spread out.
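
As a quick check on the definitions above, here is a minimal sketch using Python's standard `statistics` module. The dataset is invented for illustration. Quartile conventions vary between textbooks and tools, so the IQR produced by the default method of `statistics.quantiles` may differ slightly from a hand calculation.

```python
import statistics

# Hypothetical dataset, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # sum of values / count
median = statistics.median(data)    # average of the two middle values (even count)
modes = statistics.multimode(data)  # list of the most frequent value(s)

data_range = max(data) - min(data)           # max minus min
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points (default method)
iqr = q3 - q1                                # spread of the middle 50%
sd = statistics.pstdev(data)                 # population standard deviation

print(mean, median, modes, data_range, iqr, sd)
```

Swapping the 9 for a much larger value would change the mean, range, and SD noticeably while leaving the median untouched.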

Key Relationship: Mean vs. Median

  • Left skew: Mean < Median
  • Symmetric: Mean ≈ Median
  • Right skew: Mean > Median

2. Correlation & Regression

Correlation Coefficient (r)

  • Ranges from −1 to +1.
  • r = +1: Perfect positive linear relationship.
  • r = 0: No linear relationship.
  • r = −1: Perfect negative linear relationship.
  • |r| > 0.7 is generally considered strong.

Regression Line: y = mx + b

  • m (slope): For every 1-unit increase in x, y changes by m units.
  • b (intercept): Predicted value of y when x = 0.
  • Never extrapolate far beyond the data range.
  • Correlation ≠ Causation. Always.
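
To make r, slope, and intercept concrete, the sketch below computes them from scratch for a small invented dataset. The function names `pearson_r` and `fit_line` and the data are assumptions made for this illustration; the formulas are the standard Pearson correlation and least-squares line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = mx + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data: study hours (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)
m, b = fit_line(hours, scores)
print(f"r = {r:.3f}, slope = {m:.2f}, intercept = {b:.2f}")
```

Here the slope says each extra hour is associated with a higher predicted score; it says nothing about whether studying caused the difference.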

3. Probability

Core Rules

  • P(A): Favorable outcomes ÷ total outcomes (when all outcomes are equally likely).
  • P(not A): 1 − P(A).
  • P(A and B): P(A) × P(B) if independent.
  • P(A or B): P(A) + P(B) − P(A and B).
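
The four rules can be verified by brute-force counting on a single fair die. The events A ("even") and B ("greater than 4") are chosen only for illustration; exact fractions avoid rounding noise.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # faces of one fair die
A = {x for x in outcomes if x % 2 == 0}     # event A: roll is even
B = {x for x in outcomes if x > 4}          # event B: roll is 5 or 6

def prob(event):
    """Favorable outcomes / total outcomes (all faces equally likely)."""
    return Fraction(len(event), len(outcomes))

p_a, p_b = prob(A), prob(B)
p_not_a = 1 - p_a                # complement rule
p_a_and_b = prob(A & B)          # intersection: only the face 6
p_a_or_b = prob(A | B)           # union: faces 2, 4, 5, 6

# Addition rule: subtract the overlap so the face 6 is not counted twice.
assert p_a_or_b == p_a + p_b - p_a_and_b
# These two events happen to be independent, so the product rule holds.
assert p_a_and_b == p_a * p_b

print(p_a, p_b, p_not_a, p_a_and_b, p_a_or_b)
```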

Conditional Probability

  • P(A|B): Probability of A given B has occurred.
  • P(A|B) = P(A and B) ÷ P(B).
  • If A and B are independent: P(A|B) = P(A).
  • Use two-way tables to extract conditional probabilities efficiently.
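
A two-way table lookup might be sketched as follows. The table, its labels ("plan A", "renewed", and so on), and its counts are all hypothetical; the point is that the conditional probability uses the row total as its denominator, while the joint probability uses the grand total.

```python
from fractions import Fraction

# Hypothetical two-way table of counts (rows: plan, columns: outcome).
table = {
    "plan A": {"renewed": 30, "cancelled": 20},
    "plan B": {"renewed": 10, "cancelled": 40},
}
total = sum(sum(row.values()) for row in table.values())
row_a_total = sum(table["plan A"].values())

p_plan_a = Fraction(row_a_total, total)                     # P(plan A)
p_a_and_renewed = Fraction(table["plan A"]["renewed"], total)  # joint probability

# Definition: P(renewed | plan A) = P(plan A and renewed) / P(plan A)
p_renewed_given_a = p_a_and_renewed / p_plan_a

# Shortcut from the table: restrict attention to the plan A row.
shortcut = Fraction(table["plan A"]["renewed"], row_a_total)

# Independence check: does knowing the plan change P(renewed)?
p_renewed = Fraction(sum(row["renewed"] for row in table.values()), total)
independent = p_renewed_given_a == p_renewed

print(p_renewed_given_a, shortcut, p_renewed, independent)
```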

4. Expected Value

Expected Value E(X) = ∑ [xi × P(xi)] for all outcomes.

It represents the long-run average — not what must happen on any single trial.

Example: Roll a fair die. E(X) = (1+2+3+4+5+6)/6 = 3.5. You never roll 3.5, but over many rolls the average converges to 3.5.
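
The die example can be checked two ways: exactly from the formula, and approximately by simulation. This is a minimal sketch; the seed and the number of simulated rolls are arbitrary choices.

```python
import random
from fractions import Fraction

# Exact expected value: sum of (outcome x probability) over all six faces.
expected = sum(Fraction(face) * Fraction(1, 6) for face in range(1, 7))

# Simulation: the sample mean of many rolls should land close to 3.5.
rng = random.Random(0)                      # fixed seed for repeatability
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print(expected, float(expected), round(sample_mean, 3))
```

No individual entry in `rolls` equals 3.5, yet the average settles near it as the number of rolls grows.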

Visual Concepts

Scatter Plot (r = 0.92)

[Figure: scatter plot with axes labeled Study Hours and Test Score, r = 0.92; one point is marked as an outlier.]

Correlation ≠ Causation

High r means strong linear association, not that X causes Y.

Probability Tree: Two Coin Flips

  • First flip: H (1/2) or T (1/2)
  • H, then H (1/2): HH, P = 1/4
  • H, then T (1/2): HT, P = 1/4
  • T, then H (1/2): TH, P = 1/4
  • T, then T (1/2): TT, P = 1/4

All outcomes sum to 1: 4 × (1/4) = 1

Multiply along branches. Outcomes at right are equally likely.
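
The "multiply along branches" rule for the two-flip tree can be sketched by enumerating every path. The dictionary names here are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Probability of each branch on a single fair flip.
branch = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Walk every path through the two-level tree and multiply along its branches.
paths = {}
for first, second in product(branch, repeat=2):
    paths[first + second] = branch[first] * branch[second]

for outcome, p in paths.items():
    print(outcome, p)

# The leaf probabilities must add up to 1.
assert sum(paths.values()) == 1
```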

Worked Examples

Example 1

The salaries (in $000s) of 7 employees are: 42, 45, 48, 50, 52, 55, 120. What is the best measure of center for this data, and why?

Step 1: Calculate the mean. Sum = 42+45+48+50+52+55+120 = 412. Mean = 412 ÷ 7 ≈ 58.9

Step 2: Find the median. Sorted data: 42, 45, 48, 50, 52, 55, 120. Middle value (4th of 7) = 50.

Step 3: Compare. The outlier (120) pulls the mean up to 58.9 — higher than 6 of 7 values. The median (50) accurately represents the "typical" employee salary.

Answer: Median ($50,000) is better. It is resistant to the outlier (120) that distorts the mean. The mean overstates the typical salary by nearly $9,000.
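
The salary comparison can be reproduced in a few lines of Python, which also shows how both measures behave once the outlier is dropped. This is only a sketch for checking the arithmetic; variable names are illustrative.

```python
import statistics

salaries = [42, 45, 48, 50, 52, 55, 120]    # in $000s, from the example

mean_salary = statistics.mean(salaries)      # 412 / 7
median_salary = statistics.median(salaries)  # 4th of 7 sorted values
gap = mean_salary - median_salary            # how far the mean overshoots

# Same measures with the outlier (120) removed.
without_outlier = salaries[:-1]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(round(mean_salary, 1), median_salary, round(gap, 1))
print(round(mean_trimmed, 1), median_trimmed)
```

Removing the outlier moves the mean by about 10 (thousand dollars) but the median by only 1, which is the sense in which the median is resistant.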

Example 2

A scatter plot of advertising spend vs. sales shows r = 0.88, with regression line: Sales = 3.2(Ad Spend) + 15. If a company spends $20K on ads, what are the predicted sales, and what does r = 0.88 tell us?

Step 1: Plug x = 20 into the regression equation.

Sales = 3.2(20) + 15 = 64 + 15 = $79K predicted sales.

Step 2: Interpret the slope. For every additional $1K in advertising, predicted sales increase by $3.2K.

Step 3: Interpret r = 0.88. This means a strong positive linear association between ad spend and sales. It does NOT mean advertising causes sales to increase — a confounding variable (like seasonal demand) could explain both.

Answer: Predicted sales = $79K. r = 0.88 means strong positive correlation, not causation. Never claim X causes Y from r alone.
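
The prediction and the slope interpretation can be expressed as a tiny function. The name `predict_sales` is hypothetical; the coefficients come from the regression line given in the question, with both quantities in $K.

```python
def predict_sales(ad_spend_k, slope=3.2, intercept=15.0):
    """Predicted sales ($K) from ad spend ($K) using Sales = 3.2(Ad Spend) + 15."""
    return slope * ad_spend_k + intercept

predicted = predict_sales(20)

# Slope check: one more $1K of ad spend changes the prediction by the slope.
change_per_1k = predict_sales(21) - predict_sales(20)

print(round(predicted, 2), round(change_per_1k, 2))
```

The function only describes the fitted line; it cannot tell you whether ad spend causes the sales.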

Example 3

A game pays $10 if you roll a 6, $2 if you roll a 4 or 5, and $0 otherwise. The game costs $2 to play. What is the expected profit per game?

Step 1: List outcomes and probabilities.

  • Roll 6: P = 1/6, payout = $10
  • Roll 4 or 5: P = 2/6, payout = $2
  • Roll 1, 2, or 3: P = 3/6, payout = $0

Step 2: E(payout) = (1/6)(10) + (2/6)(2) + (3/6)(0) = 10/6 + 4/6 + 0 = 14/6 ≈ $2.33

Step 3: Subtract the $2 cost. Expected profit = $2.33 − $2.00 = $0.33 per game.

Answer: Expected profit = +$0.33. This game is favorable in the long run, but on any single game you may lose $2. Expected value is a long-run average.
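
The three steps above can be sketched with exact fractions, which shows that the $0.33 figure is 1/3 of a dollar rounded. The dictionary layout is just one convenient way to encode the payout rule.

```python
from fractions import Fraction

# Payout in dollars for each face of a fair die, per the game's rules.
payouts = {1: 0, 2: 0, 3: 0, 4: 2, 5: 2, 6: 10}
cost = 2  # dollars to play

# Each face has probability 1/6, so weight every payout by 1/6 and add.
expected_payout = sum(Fraction(amount, 6) for amount in payouts.values())
expected_profit = expected_payout - cost

print(expected_payout, expected_profit, round(float(expected_profit), 2))
```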

GMAT Traps to Avoid

Trap 1: Confusing Correlation with Causation

A correlation coefficient of r = 0.9 means a strong association, NOT that one variable causes the other. GMAT answer choices often include tempting causal language; reject it unless the passage describes a controlled experiment.

Trap 2: Using Mean When Outliers Exist

The median is resistant to outliers; the mean is not. When a dataset has extreme values, the median better represents the "typical" value. GMAT questions frequently test whether you select the appropriate measure of center.

Trap 3: Misinterpreting Expected Value

"Expected value" means the long-run average over many repetitions, not what will happen on the next trial. A positive expected value game can still result in a loss on any given play. Don't confuse expected value with certainty.

Trap 4: Forgetting P(A or B) Formula

P(A or B) = P(A) + P(B) − P(A and B). Forgetting to subtract the intersection causes double-counting. For mutually exclusive events, P(A and B) = 0, so P(A or B) = P(A) + P(B).

Trap 5: Extrapolating Regression Beyond Data Range

The regression line is only reliable within (or near) the range of observed data. Predicting far outside that range is invalid — the GMAT may ask you to identify this overreach as a flaw in reasoning.

Practice Questions

12 GMAT-style questions. Each is followed by its answer and a full explanation.

Q1. The ages of five managers are 34, 38, 41, 44, and 78. Which of the following best describes how the outlier (78) affects the dataset?

(A) It increases the mean but not the median.

(B) It increases both the mean and the median equally.

(C) It increases the median but not the mean.

(D) It has no effect on either the mean or the median.

(E) It decreases the standard deviation.

Answer: (A)

Mean = (34+38+41+44+78)/5 = 235/5 = 47. Without 78, mean = (34+38+41+44)/4 = 39.25. The outlier raised the mean significantly.

Median of {34, 38, 41, 44, 78} = 41 (3rd value). Without 78, the median of {34, 38, 41, 44} is (38 + 41)/2 = 39.5. The outlier moves the median by only 1.5, versus 7.75 for the mean, so (A) is the best description: outliers inflate the mean but barely shift the median. The standard deviation increases with 78, not decreases, eliminating (E).

Q2. A dataset has mean = 50 and median = 65. Which of the following best describes the distribution?

(A) Symmetric.

(B) Right-skewed (positively skewed).

(C) Left-skewed (negatively skewed).

(D) Bimodal.

(E) Cannot be determined.

Answer: (C)

When mean < median, the distribution is typically left-skewed (negatively skewed): low outliers pull the mean below the median, and the long tail extends to the left. Rule of thumb: Mean < Median → Left skew. Mean > Median → Right skew. Mean ≈ Median → Roughly symmetric.

Q3. A set of test scores has a mean of 70 and a standard deviation of 5. A new set is created by adding 10 points to every score. What are the new mean and standard deviation?

(A) Mean = 70, SD = 15

(B) Mean = 80, SD = 15

(C) Mean = 80, SD = 5

(D) Mean = 70, SD = 5

(E) Mean = 80, SD = 10

Answer: (C)

Adding a constant to every value shifts the mean by that constant: new mean = 70 + 10 = 80. However, standard deviation measures spread — the distances between values do not change when you add a constant to all. So SD remains 5. Only scaling (multiplying) changes the SD.
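
The shift-versus-scale rule can be confirmed numerically. The four scores below are hypothetical, picked so that the mean is 70 and the population SD is 5 as in the question.

```python
import statistics

# Hypothetical scores with mean 70 and population SD 5.
scores = [65, 75, 65, 75]

shifted = [s + 10 for s in scores]   # add 10 points to every score
scaled = [s * 2 for s in scores]     # double every score

for label, values in [("original", scores), ("shifted", shifted), ("scaled", scaled)]:
    print(label, statistics.mean(values), statistics.pstdev(values))
```

Shifting moves the mean and leaves the SD alone; scaling multiplies both the mean and the SD by the same factor.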

Q4. A study finds that cities with more ice cream sales have higher drowning rates (r = 0.85). A researcher concludes that ice cream causes drowning. Which is the strongest objection to this conclusion?

(A) The sample size may be too small.

(B) A confounding variable (hot weather) likely explains both trends.

(C) The correlation coefficient is not high enough to draw conclusions.

(D) Drowning rates are not a reliable variable to measure.

(E) The researcher should have used a larger dataset.

Answer: (B)

This is the classic correlation-does-not-imply-causation example. Hot weather (a lurking/confounding variable) causes both increased ice cream consumption AND more swimming → more drownings. The strong r = 0.85 shows association, not causation. The correct objection targets the causal claim, not the sample size or measurement reliability.

Q5. A regression line for predicting monthly revenue (Y, in $000s) from number of salespeople (X) is: Y = 12X + 40. What does the slope of 12 mean?

(A) Revenue starts at $12K when there are no salespeople.

(B) Each additional salesperson is associated with $12K more monthly revenue on average.

(C) Revenue increases by $40K for each salesperson added.

(D) The company needs 12 salespeople to break even.

(E) Revenue and salespeople have a correlation of 0.12.

Answer: (B)

The slope (12) represents the predicted change in Y for each one-unit increase in X. So for each additional salesperson, monthly revenue is predicted to increase by $12K. The intercept (40) is the predicted revenue ($40K) when X = 0; it is not the per-salesperson change, which eliminates (A) and (C). Slope and r are different statistics, which eliminates (E).

Q6. Two variables have r = −0.93. Which statement is most accurate?

(A) The variables have almost no relationship.

(B) As X increases, Y tends to decrease strongly, in a linear pattern.

(C) X causes Y to decrease.

(D) The relationship is weak because the correlation is negative.

(E) Exactly 93% of variation in Y is explained by X.

Answer: (B)

r = −0.93 is a strong negative correlation. The sign indicates direction (negative = as X rises, Y falls). The magnitude |r| = 0.93 indicates strength (strong). It does NOT indicate causation (eliminates C). Negative does not mean weak (eliminates D). r² = (−0.93)² ≈ 0.865, meaning about 86.5% of variation is explained, not 93% (eliminates E).

Q7. A bag contains 4 red and 6 blue marbles. If two marbles are drawn without replacement, what is the probability that both are red?

(A) 2/15

(B) 4/25

(C) 1/6

(D) 2/5

(E) 1/15

Answer: (A)

P(first red) = 4/10 = 2/5. After drawing one red, 3 red and 6 blue remain (9 total). P(second red | first red) = 3/9 = 1/3. P(both red) = (2/5) × (1/3) = 2/15. Drawing without replacement changes the second probability, so you cannot simply square 4/10 (that gives 4/25, choice B).
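
Two independent routes to the same answer, sketched in Python: multiplying the sequential probabilities, and counting pairs with combinations. The with-replacement value is included only to show where the trap answer comes from.

```python
from fractions import Fraction
from math import comb

red, blue = 4, 6
total = red + blue

# Route 1: multiply along the draws, shrinking the bag after the first red.
sequential = Fraction(red, total) * Fraction(red - 1, total - 1)

# Route 2: count pairs. Ways to pick 2 reds / ways to pick any 2 marbles.
counting = Fraction(comb(red, 2), comb(total, 2))

# Trap: treating the draws as if the first marble were put back.
with_replacement = Fraction(red, total) ** 2

print(sequential, counting, with_replacement)
```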

Q8. In a group of 100 employees, 40 have MBAs and 30 have engineering degrees. 10 have both. What is the probability that a randomly selected employee has an MBA or an engineering degree?

(A) 0.30

(B) 0.40

(C) 0.60

(D) 0.70

(E) 0.80

Answer: (C)

Use the addition rule: P(A or B) = P(A) + P(B) − P(A and B). P(MBA) = 40/100 = 0.40. P(Eng) = 30/100 = 0.30. P(Both) = 10/100 = 0.10. P(MBA or Eng) = 0.40 + 0.30 − 0.10 = 0.60. Forgetting to subtract P(both) would give 0.70, the most common wrong answer.

Q9. Of 200 customers, 80 are female. 50 of the female customers purchased premium products. What is the probability that a randomly chosen customer is female AND purchased premium?

(A) 0.25

(B) 0.40

(C) 0.625

(D) 0.50

(E) 0.10

Answer: (A)

P(female AND premium) = 50/200 = 0.25. This is straightforward joint probability from the raw count. Note: P(premium | female) = 50/80 = 0.625 is a conditional probability, which would be answer (C) — a common confusion. The question asks for joint probability from the total, not conditional.

Q10. An insurance company sells policies with the following payout structure: $0 payout (probability 0.90), $5,000 payout (probability 0.08), $50,000 payout (probability 0.02). What is the expected payout per policy?

(A) $400

(B) $1,000

(C) $1,400

(D) $18,333

(E) $2,200

Answer: (C)

E(payout) = (0)(0.90) + (5,000)(0.08) + (50,000)(0.02) = 0 + 400 + 1,000 = $1,400. The insurance company must charge more than $1,400 per policy to be profitable. Each term is: outcome × its probability. Sum all terms.

Q11. A sales rep has a 30% chance of closing any given deal, independently. What is the probability of closing at least one deal in 3 attempts?

(A) 0.30

(B) 0.51

(C) 0.657

(D) 0.90

(E) 0.343

Answer: (C)

Use the complement rule. P(at least one) = 1 − P(none). P(no deal in one attempt) = 0.70. P(no deals in 3 attempts) = 0.70 × 0.70 × 0.70 = 0.343. P(at least one) = 1 − 0.343 = 0.657. This complement approach avoids adding up many cases (1 deal, 2 deals, 3 deals).
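
To see that the complement shortcut agrees with the long way, this sketch computes both. The case-by-case sum uses the standard binomial formula, which the explanation above does not spell out.

```python
from math import comb

p = 0.30   # chance of closing any single deal
n = 3      # number of independent attempts

# Shortcut: 1 minus the probability of closing no deals at all.
via_complement = 1 - (1 - p) ** n

# Long way: add P(exactly 1) + P(exactly 2) + P(exactly 3) deals closed.
via_cases = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1))

print(round(via_complement, 3), round(via_cases, 3))
```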

Q12. A company is choosing between two projects. Project A yields $200K profit with probability 0.6 or a $50K loss with probability 0.4. Project B yields $100K profit with probability 0.9 or a $20K loss with probability 0.1. Which project has the higher expected value?

(A) Project A, by $20K.

(B) Project B, by $8K.

(C) Project A, by $8K.

(D) They have equal expected values.

(E) Project B, by $20K.

Answer: (C)

E(Project A) = (200,000)(0.6) + (−50,000)(0.4) = 120,000 − 20,000 = $100,000.

E(Project B) = (100,000)(0.9) + (−20,000)(0.1) = 90,000 − 2,000 = $92,000.

Project A has the higher expected value by $100K − $92K = $8K. Despite the higher risk (possible $50K loss), Project A is better in expected value terms.

Quick Reference Card

// Statistics & Probability — Key Formulas

// DESCRIPTIVE STATISTICS

Mean = (sum of all values) / n

Median = middle value when sorted (average of two middle if n is even)

Range = Max - Min

IQR = Q3 - Q1 (robust to outliers)

SD increases if spread increases; adding constant to all values leaves SD unchanged

// SKEWNESS RULE

Mean < Median → Left (negative) skew

Mean > Median → Right (positive) skew

Mean ≈ Median → Symmetric

// CORRELATION & REGRESSION

-1 ≤ r ≤ +1; |r| > 0.7 → strong; r = 0 → no linear relationship

Regression line: y = mx + b (m=slope, b=y-intercept)

Slope m = change in y per 1-unit increase in x

CRITICAL: r (correlation) ≠ causation. EVER.

// PROBABILITY

P(A) = favorable / total; 0 ≤ P(A) ≤ 1

P(not A) = 1 - P(A)

P(A and B) = P(A) × P(B) [if independent]

P(A or B) = P(A) + P(B) - P(A and B) [addition rule]

P(A|B) = P(A and B) / P(B) [conditional probability]

If mutually exclusive: P(A and B) = 0

If independent: P(A|B) = P(A)

// COMPLEMENT RULE (KEY SHORTCUT)

P(at least one) = 1 - P(none) ← use this to avoid adding many cases

// EXPECTED VALUE

E(X) = Σ [x_i × P(x_i)] for all outcomes i

= long-run average; NOT the guaranteed result of one trial

Positive expected profit (E(payout) - cost) = favorable game in long run