DI Lesson 8: Scatter Plots & Correlation • GMATCourse.mba

Scatter Plot Basics

A scatter plot displays individual data points, each representing one observation on two variables. The x-axis shows one variable; the y-axis shows the other. The overall pattern of points reveals the relationship between them.

[Figure: animated scatter plot, Ad Spend vs. Sales. Yellow dot = outlier (unusually high sales for its spend level).]

Correlation: Direction and Strength

↗

Positive Correlation

As X increases, Y tends to increase. Points cluster in an upward-sloping band.

↘

Negative Correlation

As X increases, Y tends to decrease. Points cluster in a downward-sloping band.

⬚

No Correlation

Points are scattered randomly. Knowing X tells you nothing about Y.

Strength of Correlation

Strong = points tightly clustered around the trend line. Weak = points spread widely around it. Perfect = all points lie exactly on one sloped straight line (r = +1 or r = −1).
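
The difference between strong, weak, and perfect correlation can be checked numerically. The sketch below is only an illustration: the three data sets are made up, and `pearson_r` is a hand-written helper (not part of the lesson) that applies the standard Pearson formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, chosen only to illustrate the three cases.
xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # hugs the line y = 2x
loose = [5.0, 1.0, 9.0, 4.0, 12.0, 6.0]   # upward drift, wide scatter
perfect = [2 * x for x in xs]             # every point on y = 2x

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)
r_perfect = pearson_r(xs, perfect)

print(f"tight:   r = {r_tight:.3f}")
print(f"loose:   r = {r_loose:.3f}")
print(f"perfect: r = {r_perfect:.3f}")
```

The tight band gives an r very close to 1, the loose one lands below 0.5, and the perfect line gives r = 1.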

Outliers and Clusters

Outliers

Points far from the main cluster or trend line. They may represent data errors, exceptional cases, or the most interesting observations in the data set.

Clusters

Groups of points that are close together. Multiple clusters may indicate sub-populations (e.g., small vs. large companies in the same dataset).

Line of Best Fit Interpretation

What the Line of Best Fit Tells You

▸ Slope direction: a positive slope means positive correlation; a negative slope means negative correlation

▸ Y-intercept: Predicted Y value when X = 0

▸ Points above line: Actual value > predicted (overperforming)

▸ Points below line: Actual value < predicted (underperforming)

▸ Causation: the line never proves it (correlation ≠ causation)
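
The points above can be sketched with a small least-squares fit. The ad spend and sales numbers are invented for illustration, and `fit_line` is a hypothetical helper that uses the usual least-squares formulas for slope and intercept.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (x) and sales (y).
ad_spend = [10, 20, 30, 40, 50]
sales = [30, 40, 55, 80, 90]

slope, intercept = fit_line(ad_spend, sales)

positions = []
for x, y in zip(ad_spend, sales):
    predicted = slope * x + intercept
    residual = y - predicted  # actual minus predicted
    if residual > 0:
        positions.append("above")   # overperforming
    elif residual < 0:
        positions.append("below")   # underperforming
    else:
        positions.append("on")

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(positions)
```

For this data the slope is positive (about 1.6), the intercept (about 11) is the predicted sales at zero ad spend, and each point is labeled by the sign of its residual.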

10 Scatter Plot Traps

⚠ Correlation ≠ causation

A positive correlation between ad spend and sales does NOT prove that advertising caused sales to increase.

⚠ Outliers distort the line of best fit

A single extreme outlier can significantly pull the best-fit line in its direction, misrepresenting the bulk of the data.
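
A rough numerical sketch of this trap, using made-up data: six points lie exactly on y = 2x, and a single hypothetical outlier at (7, 40) is then added. `fit_slope` is an illustrative helper, not something from the lesson.

```python
def fit_slope(xs, ys):
    """Least-squares slope of the best-fit line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

# Hypothetical data: six points exactly on y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
slope_clean = fit_slope(xs, ys)

# Same data plus one extreme outlier at (7, 40).
slope_with_outlier = fit_slope(xs + [7], ys + [40])

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_with_outlier:.2f}")
```

One added point more than doubles the slope here, even though the other six points have not changed.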

⚠ Strong correlation ≠ exact prediction

Even with a strong correlation, individual points still deviate from the line. The line predicts the average Y for a given X, not the exact value of any single point.

⚠ Negative correlation is not "bad"

A negative correlation simply means that as X increases, Y tends to decrease (for example, price vs. quantity demanded). It's not inherently a problem.

⚠ Interpolation vs. extrapolation

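Interpolating (predicting within the observed data range) is generally reliable. Extrapolating (predicting outside that range) is speculative and prone to error, because the trend may not continue beyond the observed data.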
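
One way to see the difference is to fit a line to data that secretly follows a curve. In the sketch below, `true_y` is a hypothetical relationship (y = x²) assumed purely for illustration; real data may behave differently.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def true_y(x):
    """Hypothetical curved relationship, assumed only for this example."""
    return x ** 2

# Observed data covers x from 1 to 5 only.
xs = [1, 2, 3, 4, 5]
ys = [true_y(x) for x in xs]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return slope * x + intercept

interp_error = abs(predict(3.5) - true_y(3.5))  # inside the data range
extrap_error = abs(predict(20) - true_y(20))    # far outside the data range

print(f"error at x = 3.5: {interp_error:.2f}")
print(f"error at x = 20:  {extrap_error:.2f}")
```

Inside the observed range the line is off by less than 2 units; at x = 20 it misses by hundreds.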

⚠ Multiple clusters can hide the true relationship

If data has two sub-groups, the combined scatter plot may show no correlation while each sub-group does have one.
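
A deliberately idealized sketch of this trap: two made-up sub-groups, each perfectly correlated on its own, are positioned so that the combined correlation cancels out. The data and the `pearson_r` helper are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical sub-group A: low x, relatively high y, rising trend.
a_x, a_y = [1, 2, 3, 4], [7, 8, 9, 10]
# Hypothetical sub-group B: high x, slightly lower y, rising trend.
b_x, b_y = [6, 7, 8, 9], [6, 7, 8, 9]

r_a = pearson_r(a_x, a_y)
r_b = pearson_r(b_x, b_y)
r_combined = pearson_r(a_x + b_x, a_y + b_y)

print(f"group A:  r = {r_a:.2f}")
print(f"group B:  r = {r_b:.2f}")
print(f"combined: r = {r_combined:.2f}")
```

Each group rises perfectly on its own, yet the pooled data shows no linear correlation at all.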

⚠ Y-intercept vs. data minimum

The y-intercept of the best-fit line is the predicted value when X=0, not necessarily the lowest data point.

⚠ R² value interpretation

An R² of 0.8 means 80% of Y's variation is explained by X. It does NOT mean 80% of predictions are correct.
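
The following sketch computes R² directly as 1 − (unexplained variation / total variation) for a least-squares line, using made-up ad spend and sales figures. It also counts how many predictions are exactly right, to show that R² is not a hit rate.

```python
# Hypothetical data: ad spend (x) and sales (y).
xs = [10, 20, 30, 40, 50]
ys = [30, 40, 55, 80, 90]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
predicted = [slope * x + intercept for x in xs]

# Variation left over after the fit vs. total variation in y.
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_total = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_residual / ss_total

# Number of points the line predicts exactly (within rounding error).
exact_hits = sum(1 for y, p in zip(ys, predicted) if abs(y - p) < 1e-9)

print(f"R^2 = {r_squared:.3f}")
print(f"exactly correct predictions: {exact_hits} of {n}")
```

Here R² is about 0.98, yet none of the five predictions is exactly correct: R² measures the share of variation explained, not the share of correct predictions.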

⚠ Direction of causation

Even if A and B are causally related, A might cause B or B might cause A (reverse causation). A scatter plot cannot distinguish the direction.
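
One reason a scatter plot cannot reveal direction is that the correlation coefficient is symmetric: swapping the axes gives the same r. A minimal check with made-up values, using an illustrative `pearson_r` helper:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements of two variables, A and B.
a = [3, 5, 8, 10, 14]
b = [20, 24, 31, 30, 42]

r_ab = pearson_r(a, b)  # A on the x-axis, B on the y-axis
r_ba = pearson_r(b, a)  # axes swapped

print(f"r(A, B) = {r_ab:.4f}")
print(f"r(B, A) = {r_ba:.4f}")
```

Both calls return the same value, so r carries no information about which variable drives the other.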

⚠ Apparent pattern in random data

Purely random data can appear to show a pattern by chance. Small samples are especially deceptive: with only a handful of points, an apparent trend may be nothing more than noise.
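
A small simulation can make this concrete. The sketch below draws pairs of unrelated random variables many times and records the largest correlation that appears by chance, once for tiny samples and once for large ones. The sample sizes, trial count, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def largest_chance_r(sample_size, trials, rng):
    """Largest |r| seen across many samples of unrelated random x and y."""
    largest = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(sample_size)]
        ys = [rng.random() for _ in range(sample_size)]
        largest = max(largest, abs(pearson_r(xs, ys)))
    return largest

rng = random.Random(8)  # fixed seed so the run is repeatable
small = largest_chance_r(sample_size=5, trials=200, rng=rng)
large = largest_chance_r(sample_size=500, trials=200, rng=rng)

print(f"largest chance |r| with 5 points:   {small:.2f}")
print(f"largest chance |r| with 500 points: {large:.2f}")
```

With 5 points per sample, chance alone tends to produce some very high correlations; with 500 points, the largest chance correlation stays small.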

✦

10 Practice Questions

Q1 of 10

GI~550

A scatter plot shows hours studied (x-axis) and exam score (y-axis). The points form a tight upward-sloping band. This indicates:

Explanation: A strong positive correlation. Points tightly clustered in an upward-sloping band indicate a strong positive correlation. Note: correlation does not prove causation.

Q2 of 10

GI~600

A scatter plot shows a point far above the line of best fit. This outlier represents:

Explanation: Actual Y is higher than predicted Y. Points above the line of best fit have higher actual values than what the model predicts for their x-coordinate. This doesn't necessarily mean the data is wrong — it may represent an exceptional case.

Q3 of 10

GI~650

A scatter plot of ice cream sales vs. drowning incidents shows a strong positive correlation. This means:

Explanation: Both are likely driven by hot weather. This is a classic example of a confounding variable. Hot weather leads to both more ice cream consumption and more swimming (and therefore more drownings). Correlation does not imply causation.

Q4 of 10

GI~600

A scatter plot has x-axis: Price ($) and y-axis: Quantity Demanded. The points slope downward from left to right. This indicates:

Explanation: Negative correlation. Downward-sloping scatter = as price (X) increases, quantity demanded (Y) decreases. This is the standard demand curve relationship — a negative correlation.

Q5 of 10

GI~750

A scatter plot has two distinct clusters: one in the lower-left and one in the upper-right. Within each cluster, there is no correlation. But overall, the data shows a positive correlation. This phenomenon is called:

Explanation: Simpson's Paradox. When the combined data shows a trend that differs from (or is opposite to) the trend within each subgroup, the effect is called Simpson's Paradox. Here the overall positive correlation appears only because the two clusters sit at different baseline levels, so it says nothing about the relationship within either cluster.
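
The situation in Q5 can be reproduced with two made-up square clusters. Within each square the correlation is zero, but pooling them produces a strong positive r. The coordinates and the `pearson_r` helper are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical lower-left cluster: four corners of a small square.
low_x, low_y = [1, 1, 2, 2], [1, 2, 1, 2]
# Hypothetical upper-right cluster: the same square shifted up and right.
high_x, high_y = [8, 8, 9, 9], [8, 9, 8, 9]

r_low = pearson_r(low_x, low_y)
r_high = pearson_r(high_x, high_y)
r_all = pearson_r(low_x + high_x, low_y + high_y)

print(f"lower-left cluster:  r = {r_low:.2f}")
print(f"upper-right cluster: r = {r_high:.2f}")
print(f"all points together: r = {r_all:.2f}")
```

The pooled r of about 0.98 comes entirely from the gap between the clusters, not from any trend inside them.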

Q6 of 10

GI~600

On a scatter plot, all points lie exactly on a straight line sloping upward. The correlation coefficient (r) is:

Explanation: r = +1. A perfect positive linear relationship has r = +1. All points on a single upward-sloping line is the definition of a perfect positive correlation.

Q7 of 10

GI~700

A scatter plot shows employee satisfaction (x) vs. productivity (y). The line of best fit has a slope of 0.8. A point at x=50 has actual y=55. The predicted y at x=50 (assuming y-intercept = 10) is:

Explanation: 50. Predicted y = 0.8(50) + 10 = 40 + 10 = 50. The actual value (55) is greater than the predicted value (50), so this point lies 5 units above the line of best fit.

Q8 of 10

GI~700

A business uses a scatter plot to analyze whether advertising spend predicts sales. The R² value is 0.25. This means:

Explanation: Advertising explains 25% of the variation in sales. R² = 0.25 means the model (advertising spend) accounts for 25% of the variability in sales. The other 75% is left unexplained by the model and reflects other factors or random variation. R² does not measure causation, and it is not the percentage of predictions that are correct.

Q9 of 10

GI~700

Two scatter plots show the same data but with different axis scales. Plot 1: x-axis spans 0-100, y-axis 0-1000. Plot 2: x-axis spans 0-10, y-axis 0-10000. The correlation between the variables is:

Explanation: The same — correlation is scale-invariant. The correlation coefficient (r) is calculated from standardized values and is unaffected by axis scale or units. The visual appearance changes with scale, but the underlying correlation is identical.
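
A quick check of this claim, with invented numbers: the same observations are expressed in different units (x divided by 10, y multiplied by 10), and r is computed for both versions. Note that this holds for rescaling by positive factors; multiplying one variable by a negative number would flip the sign of r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data in the original units.
xs = [10, 25, 40, 60, 85]
ys = [120, 300, 350, 640, 900]

# Same observations in different units.
xs_rescaled = [x / 10 for x in xs]
ys_rescaled = [y * 10 for y in ys]

r_original = pearson_r(xs, ys)
r_rescaled = pearson_r(xs_rescaled, ys_rescaled)

print(f"original units: r = {r_original:.4f}")
print(f"rescaled units: r = {r_rescaled:.4f}")
```

The plot would look stretched or squashed after rescaling, but the two r values agree.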

Q10 of 10

GI~650

A scatter plot shows 15 companies with Operating Cost (x) and Net Profit (y). There is a negative correlation (r = −0.85). A new company has high operating costs. A reasonable prediction based on this data is:

Explanation: The new company will likely have lower profits. A strong negative correlation (r = −0.85) means high X values tend to correspond with low Y values. With high operating costs, the model predicts lower net profit — though this is a probabilistic prediction, not a certainty.

Lesson Summary

Direction: upward = positive, downward = negative

The overall slope of the cluster tells you the sign of the correlation.

Strength: tightness of cluster around the line

A tight, narrow cluster = strong correlation. A wide spread = weak correlation.

Outliers sit far from the main cluster

They have unusually high or low values for at least one variable relative to the trend.

Correlation NEVER proves causation

Even a perfect correlation (r=1) only tells you variables move together — not why.
