DI Lesson 12: Scatter Plot Analysis & Trend Logic • GMATCourse.mba

Scatter Plot Deep Dive

Lesson 12 builds on Lesson 8 with advanced scatter plot scenarios: multiple clusters, curved trends, and interpreting lines of best fit with annotated equations.

Scatter with Best-Fit Line and Equation

[Figure: scatter plot with best-fit line and equation. Legend: yellow point = outlier (underperformed despite adequate training).]

Best-Fit Line Equations

y = mx + b

m (slope) = change in y per unit change in x. Positive m = positive correlation; negative m = negative correlation.

b (intercept) = predicted y when x = 0.

Using the Equation

If y = 0.65x + 10 and x = 50: predicted y = 0.65(50)+10 = 42.5
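The plug-in step above can be sketched in a few lines of Python. The function name `predict` is a hypothetical helper chosen for illustration; only the equation and the numbers come from the lesson.

```python
def predict(x, slope, intercept):
    """Return the predicted y on the line y = slope * x + intercept."""
    return slope * x + intercept

# Worked example from the text: y = 0.65x + 10 at x = 50.
y_hat = predict(50, slope=0.65, intercept=10)
print(round(y_hat, 2))
```

The same helper works for any line given in a question: pass the slope and intercept shown on the chart and the x-value asked about.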

Clusters, Outliers, and Sub-Populations

Multiple Clusters

Two distinct clusters may represent different sub-groups. The best-fit line across all data may be misleading if sub-groups have different relationships.
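A small sketch can make the sub-group effect concrete. The data below is invented for illustration, and `pearson_r` is a hypothetical helper that computes the standard Pearson correlation coefficient; neither appears in the lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: two square-shaped clusters with no internal trend.
low_x, low_y = [1, 1, 3, 3], [1, 3, 1, 3]
high_x, high_y = [11, 11, 13, 13], [11, 13, 11, 13]

r_low = pearson_r(low_x, low_y)                     # within the low cluster
r_high = pearson_r(high_x, high_y)                  # within the high cluster
r_all = pearson_r(low_x + high_x, low_y + high_y)   # both clusters pooled

print(r_low, r_high, round(r_all, 3))
```

With this data the correlation inside each cluster is 0, yet the pooled correlation is about 0.96. The pooled line describes the gap between the groups, not a relationship inside either one.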

Outlier Impact

One extreme outlier can significantly skew the best-fit line. If the question says "excluding the outlier," the slope and intercept of the best-fit line will generally change.
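To see how far one point can move the line, here is a minimal sketch that fits a line with and without an outlier. It assumes the best-fit line is an ordinary least-squares fit; `fit_line` is a hypothetical helper and the data is made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: five points exactly on y = 2x, plus one extreme outlier at (6, 0).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

without_outlier = fit_line(xs, ys)
with_outlier = fit_line(xs + [6], ys + [0])

print("without outlier:", without_outlier)
print("with outlier:   ", with_outlier)
```

Without the outlier the fit recovers slope 2 and intercept 0. Adding the single point (6, 0) drops the slope to roughly 0.29 and raises the intercept to 4, so both parameters move.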

Scatter Plot Strategy

5-Second Orientation

① What is on x-axis? What is on y-axis?

② What is the overall direction of the cluster? (Up/down/flat)

③ How tight is the cluster? (Correlation strength)

④ Are there outliers? (Points far from the main band)

⑤ Is there a best-fit line equation provided?

10 Scatter Plot Traps

⚠ Best-fit line doesn't pass through most points

The line minimizes overall error (typically the sum of squared vertical distances to the points), so it may not pass through a single actual data point.

⚠ Outliers are not errors

An outlier represents a real observation with unusual values. Don't automatically label it as a mistake.

⚠ Strong overall correlation ≠ strong correlation within clusters

Two separate clusters can create an artificial overall correlation not present within either cluster.

⚠ Curved trends: linear fit is wrong

If points follow a curve (exponential, logarithmic), a straight best-fit line misrepresents the relationship.
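One way to spot a curve hiding behind a straight fit is to look at the signs of the residuals (actual minus predicted). The sketch below uses invented data on the curve y = x²; the line y = 4x − 2 is the least-squares line for these five points, worked out by hand rather than taken from the lesson.

```python
# Invented data following the curve y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# Least-squares line for these five points, worked out by hand: y = 4x - 2.
slope, intercept = 4, -2

# Residual = actual y minus the y predicted by the straight line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)  # [2, -1, -2, -1, 2]
```

The residuals are positive at both ends and negative in the middle. That U-shaped pattern, rather than a random mix of signs, is the signal that a straight line is the wrong model.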

⚠ Points below the line are not "negative"

Being below the line just means actual < predicted. The y-value itself may still be positive.
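The distinction can be checked numerically. This sketch uses the lesson's line y = 0.65x + 10 with two hypothetical points at x = 40; `residual` is an illustrative helper name.

```python
def residual(x, actual_y, slope=0.65, intercept=10):
    """Actual minus predicted y for the line y = slope * x + intercept."""
    return actual_y - (slope * x + intercept)

# Hypothetical points at x = 40, where the line predicts y = 36.
for actual in (55, 30):
    gap = residual(40, actual)
    side = "above" if gap > 0 else "below" if gap < 0 else "on"
    print(f"y = {actual}: {side} the line, residual {gap:+.1f}")
```

The point at y = 30 sits below the line with a negative residual, but its y-value is still positive. The sign of the residual says nothing about the sign of y.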

⚠ Slope of line ≠ average value of y

Slope = rate of change. Average y = sum of y-values / n. These are different.

⚠ Extrapolation outside data range

Using y = 0.65x + 10 to predict y at x = 500 (far beyond data range) is unreliable.
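A simple guard against this trap is to compare the requested x with the range of x-values actually plotted. The sketch below assumes a hypothetical data range of 10 to 60; `predict_with_range_check` and `observed_xs` are illustrative names, not part of the lesson.

```python
def predict_with_range_check(x, observed_xs, slope=0.65, intercept=10):
    """Return (predicted_y, is_extrapolation) for y = slope * x + intercept."""
    is_extrapolation = x < min(observed_xs) or x > max(observed_xs)
    return slope * x + intercept, is_extrapolation

# Hypothetical x-values actually plotted on the chart.
observed_xs = [10, 20, 30, 40, 50, 60]

print(predict_with_range_check(50, observed_xs))   # inside the data range
print(predict_with_range_check(500, observed_xs))  # far outside the data range
```

The equation still returns a number at x = 500, but the flag marks it as an extrapolation: nothing in the plotted data supports the line that far out.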

⚠ Two variables plotted — which causes which?

The x-axis conventionally holds the independent variable, but correlation is symmetric and does not show which variable drives the other. Check the context.

⚠ R² = 0.9 means 90% of variation explained

R² measures how well the line fits, not how accurate each prediction is. With R² = 0.9, 10% of the variation in y remains unexplained.
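For readers who want to see where the number comes from, here is a minimal sketch of the usual definition, R² = 1 − SS_res / SS_tot. The data is invented and `r_squared` is a hypothetical helper; the line y = 0.6x + 2.2 is the least-squares line for this data, worked out by hand.

```python
def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - SS_res / SS_tot for the line y = slope * x + intercept."""
    mean_y = sum(ys) / len(ys)
    # Unexplained variation: squared gaps between actual and predicted y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared gaps between actual y and the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented data; its least-squares line, worked out by hand, is y = 0.6x + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys, slope=0.6, intercept=2.2)
print(round(r2, 2))
```

Here R² comes out to 0.6, so the line explains 60% of the variation in y. Individual predictions can still miss: at x = 3 the line predicts 4 while the actual value is 5.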

⚠ Cluster proximity = same value

Two points close together horizontally may have very different y-values. Check both axes.

10 Practice Questions

Q1 of 10

GI~600

A scatter plot has best-fit line y = 0.65x + 10. At x = 40, the predicted y is:

Explanation: 36. y = 0.65(40) + 10 = 26 + 10 = 36.

Q2 of 10

GI~650

A data point at (40, 55) is plotted on the scatter plot with equation y = 0.65x + 10. This point is:

Explanation: Above the best-fit line. Predicted y at x=40 is 36. Actual y = 55 > 36. The point is above the line — this observation performed better than the model predicted.

Q3 of 10

GI~750

The scatter plot shows a positive correlation overall, but two distinct clusters: one at low x/low y, one at high x/high y. Within each cluster, the points are scattered randomly. This situation describes:

Explanation: Simpson's Paradox / Ecological Fallacy. The overall trend appears positive, but within each sub-group there is no correlation. The apparent trend comes from the gap between the two clusters, not from a relationship that holds inside either group.

Q4 of 10

GI~700

If all outliers are removed from a scatter plot and the correlation strengthens significantly, what does this suggest?

Explanation: Outliers weakened the correlation. If removing outliers strengthens the correlation, those outliers were pulling the best-fit line away from the main trend. The bulk of the data has a stronger pattern than the full set showed.

Q5 of 10

GI~700

A scatter plot has 30 points. 25 follow a tight positive correlation. 5 are scattered randomly at high x-values. The best-fit line for all 30 points vs. just the 25 main points would differ in:

Explanation: Both slope and intercept. The 5 scattered high-x points pull the right end of the best-fit line away from the main trend, changing the slope, and the intercept shifts with it. Both parameters of the line equation change.

Q6 of 10

GI~550

A GI question asks "Based on the best-fit line, what is the predicted y when x = 0?" This is asking for:

Explanation: The y-intercept. When x = 0, y = m(0) + b = b. The y-intercept is the predicted value of y when the x-variable equals zero.

Q7 of 10

GI~650

Two scatter plots have the same slope in their best-fit lines but different R² values (Plot A: R²=0.9, Plot B: R²=0.3). Which has a stronger correlation?

Explanation: Plot A — higher R² means tighter cluster. R² measures how well the best-fit line explains the data's variation. R² = 0.9 means 90% of variance explained (strong correlation). R² = 0.3 means only 30% explained (weak correlation), even if the slope is the same.

Q8 of 10

GI~600

A scatter plot uses hours of exercise (x) and resting heart rate (y). The slope of the best-fit line is negative. This means:

Explanation: More exercise is associated with lower resting heart rate. A negative slope means as x (exercise) increases, y (resting heart rate) tends to decrease. This is a negative correlation — correlation, not proven causation.

Q9 of 10

GI~600

A line of best fit has equation y = −2x + 80. At x = 30, the predicted y is:

Explanation: 20. y = −2(30) + 80 = −60 + 80 = 20.

Q10 of 10

GI~650

A scatter plot shows 20 companies: advertising spend (x) and customer satisfaction (y). The correlation is near zero (r ≈ 0). A manager says "We should increase ad spend to improve satisfaction." This recommendation is:

Explanation: Not supported by the data. r ≈ 0 means no linear relationship between advertising spend and satisfaction. The scatter plot provides no evidence that increasing ad spend improves satisfaction. The recommendation is not data-supported.

Lesson Summary

y = mx + b: plug in x to predict y

The equation eliminates visual estimation. Use it whenever provided.

Points above line: actual > predicted

Below line: actual < predicted. The gap (actual − predicted) is the "residual," the model's error for that point.

R² measures fit quality

Higher R² = more variance explained = stronger linear relationship.

Outliers: real data, not errors

Remove them only if explicitly told to. They represent real observations.
