y = mx + b: the sign of the slope m shows the direction of the correlation; b is the y-intercept. Points above the line (actual > predicted) are overperforming. R² measures the strength of the linear relationship. Never conflate correlation with causation.
Lesson 12 builds on Lesson 8 with advanced scatter plot scenarios: multiple clusters, curved trends, and interpreting lines of best fit with annotated equations.
Two distinct clusters may represent different sub-groups. The best-fit line across all data may be misleading if sub-groups have different relationships.
One extreme outlier can significantly skew the best-fit line. If the question says "excluding the outlier," the slope and intercept change.
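A minimal sketch of the outlier effect, assuming ordinary least squares as the fitting method and made-up data (the points and the `fit_line` helper are illustrative, not from the lesson):

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: five points exactly on y = 2x, plus one extreme outlier.
main_points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
outlier = (6, 0)

slope_all, intercept_all = fit_line(main_points + [outlier])
slope_main, intercept_main = fit_line(main_points)

print(f"all points:        y = {slope_all:.2f}x + {intercept_all:.2f}")
print(f"excluding outlier: y = {slope_main:.2f}x + {intercept_main:.2f}")
```

With this data, one outlier pulls the slope from 2 down to about 0.29 and lifts the intercept from 0 to about 4.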
The line minimizes total error (the sum of squared vertical distances to the points), so it may not pass through any actual data point.
An outlier represents a real observation with unusual values. Don't automatically label it as a mistake.
Two separate clusters can create an artificial overall correlation not present within either cluster.
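The cluster effect can be checked numerically. The sketch below uses two invented clusters and a hand-written `pearson_r` helper (both are assumptions for illustration) to compare the correlation inside each cluster with the correlation across all points:

```python
from math import sqrt

def pearson_r(points):
    """Pearson correlation coefficient; returns 0.0 if either variable is constant."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Hypothetical clusters: each is a small square of points with no internal trend.
low_cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]
high_cluster = [(9, 9), (9, 10), (10, 9), (10, 10)]

r_low = pearson_r(low_cluster)
r_high = pearson_r(high_cluster)
r_all = pearson_r(low_cluster + high_cluster)

print(f"within low cluster:  r = {r_low:.2f}")
print(f"within high cluster: r = {r_high:.2f}")
print(f"all points combined: r = {r_all:.2f}")
```

Each cluster has r = 0 on its own, yet the combined data gives r of roughly 0.98: the gap between the clusters, not a trend within them, drives the overall correlation.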
If points follow a curve (exponential, logarithmic), a straight best-fit line misrepresents the relationship.
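One way to spot a curved trend is to look at the signs of the residuals from a straight-line fit. The sketch below assumes least-squares fitting and uses made-up exponential data; `fit_line` is an illustrative helper:

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical exponential data: y = 2**x for x = 0..4.
points = [(x, 2 ** x) for x in range(5)]

slope, intercept = fit_line(points)
residuals = [y - (slope * x + intercept) for x, y in points]
signs = ["+" if r > 0 else "-" for r in residuals]

print(f"fitted line: y = {slope:.1f}x {intercept:+.1f}")
print("residual signs, left to right:", " ".join(signs))
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter around the line, signals that a straight line is the wrong shape for the data.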
Being below the line just means actual < predicted. The y-value itself may still be positive.
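A short worked example, using the lesson's line y = 0.65x + 10 and a hypothetical observation at (20, 15) chosen to sit below the line:

```python
def predict(x, slope=0.65, intercept=10):
    """Predicted y from the example line y = 0.65x + 10."""
    return slope * x + intercept

# Hypothetical observation, not taken from the lesson's data.
x_obs, y_obs = 20, 15

y_hat = predict(x_obs)
residual = y_obs - y_hat  # negative means the point is below the line

print(f"predicted y = {y_hat:.1f}, actual y = {y_obs}, residual = {residual:.1f}")
```

The residual is negative (about -8) while the actual y-value, 15, is still positive: "below the line" describes the gap to the prediction, not the sign of y.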
Slope = rate of change. Average y = sum of y-values / n. These are different.
Using y = 0.65x + 10 to predict y at x = 500 (far beyond data range) is unreliable.
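The extrapolation risk can be made explicit by checking x against the observed range before trusting a prediction. In this sketch the range 0 to 100 is an assumption (the lesson does not state one), and `predict_with_range_check` is a hypothetical helper:

```python
def predict_with_range_check(x, x_min, x_max, slope=0.65, intercept=10):
    """Return (predicted_y, is_extrapolation) for the line y = slope*x + intercept."""
    y_hat = slope * x + intercept
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation

# Assumed observed x-range; replace with the range actually shown on the plot.
X_MIN, X_MAX = 0, 100

for x in (60, 500):
    y_hat, flagged = predict_with_range_check(x, X_MIN, X_MAX)
    note = "outside observed range, treat with caution" if flagged else "within observed range"
    print(f"x = {x}: predicted y = {y_hat:.1f} ({note})")
```

The formula still returns a number at x = 500 (about 335), but nothing in the data supports the assumption that the same linear trend holds that far out.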
The x-axis conventionally shows the independent variable, but correlation is symmetric: it does not say which variable drives the other. Verify from context.
R² is a measure of fit quality, not prediction accuracy. For example, R² = 0.90 still leaves 10% of the variance in y unexplained.
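A minimal sketch of how R² relates to unexplained variance, assuming a least-squares line and the standard definition R² = 1 - SS_res / SS_tot. Both data sets are invented for illustration:

```python
def r_squared(points):
    """R^2 of the least-squares line through points: 1 - SS_res / SS_tot."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

# Hypothetical data: both sets trend upward, but with different amounts of scatter.
tight = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.0)]
loose = [(1, 5.0), (2, 1.0), (3, 9.0), (4, 4.0), (5, 11.0)]

r2_tight = r_squared(tight)
r2_loose = r_squared(loose)

for name, r2 in (("tight", r2_tight), ("loose", r2_loose)):
    print(f"{name}: R^2 = {r2:.2f}, unexplained share = {1 - r2:.0%}")
```

Both fitted lines slope upward, but the loose set leaves most of the variance in y unexplained, so individual predictions from its line are far less dependable.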
Two points close together horizontally may have very different y-values. Check both axes.
A scatter plot has best-fit line y = 0.65x + 10. At x = 40, the predicted y is:
A data point at (40, 55) is plotted on the scatter plot with equation y = 0.65x + 10. This point is:
The scatter plot shows a positive correlation overall, but two distinct clusters: one at low x/low y, one at high x/high y. Within each cluster, the points are scattered randomly. This situation describes:
If all outliers are removed from a scatter plot and the correlation strengthens significantly, what does this suggest?
A scatter plot has 30 points. 25 follow a tight positive correlation. 5 are scattered randomly at high x-values. The best-fit line for all 30 points vs. just the 25 main points would differ in:
A GI question asks "Based on the best-fit line, what is the predicted y when x = 0?" This is asking for:
Two scatter plots have the same slope in their best-fit lines but different R² values (Plot A: R²=0.9, Plot B: R²=0.3). Which has a stronger correlation?
A scatter plot uses hours of exercise (x) and resting heart rate (y). The slope of the best-fit line is negative. This means:
A line of best fit has equation y = −2x + 80. At x = 30, the predicted y is:
A scatter plot shows 20 companies: advertising spend (x) and customer satisfaction (y). The correlation is near zero (r ≈ 0). A manager says "We should increase ad spend to improve satisfaction." This recommendation is:
The equation eliminates visual estimation. Use it whenever provided.
Below the line: actual < predicted. The gap (actual − predicted) is the point's "residual", the model's error for that point; it is negative below the line and positive above it.
Higher R² = more variance explained = stronger linear relationship.
Remove outliers only if explicitly told to. They represent real observations.