In the previous article, we drew graphs using subplots in Matplotlib.
In this article, we will perform a linear regression analysis using the Polynomial class from numpy.polynomial.
1. What is Linear Regression?
According to Wikipedia, linear regression analysis is described as follows:
In statistics, linear regression is a regression analysis technique that models the linear correlation between a dependent variable y and one or more independent variables X.
-Wikipedia-
In simple terms, it is the straight line that best follows the overall trend of the values displayed on a graph.
Drawing this line over the data makes the correlation between two values clearer when it is only vaguely visible to the eye.
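To make the idea of a best-fitting line concrete, here is a minimal sketch that computes a slope and intercept by ordinary least squares. The five (x, y) pairs are made-up values for illustration, not taken from the Kaggle data, and the variable names are my own.

```python
import numpy as np

# Made-up sample points (illustrative only, not from the Kaggle dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Ordinary least squares for a straight line y = intercept + slope * x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # approximately 0.6 and 2.2
```

The line y = 2.2 + 0.6x is the one that minimizes the sum of squared vertical distances to these five points.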

The above graph visualizes data downloaded from Kaggle.
The dataset includes parental education, race, and student achievement, and the graph plots the relationship between math and reading scores.
You can download the data from the file provided below or through the link. The file below is localized in Korean, so choose whichever is convenient for you.
2. Drawing the Graph
First, let's draw a scatter plot of students' math and reading scores like the one above.
Start with the same code we used in the previous article, and just change the path of the file being loaded.
import pandas as pd
# Import modules and set Korean font
import matplotlib.pyplot as plt
import matplotlib
# Font settings for MacOS
# matplotlib.rcParams["font.family"] = "AppleGothic"
# Font settings for Windows
matplotlib.rcParams["font.family"] = "Malgun Gothic"
# Set font size
matplotlib.rcParams["font.size"] = 13
# Resolve minus sign output issue
plt.rcParams['axes.unicode_minus'] = False
score = pd.read_excel("./StudentsPerformance.xlsx")
score.head(3)
Now pass the math and reading score columns as arguments to plt.scatter.
plt.scatter(score["수학점수"], score["읽기점수"])
The default graph is hard to read, so let's style it a bit more.
The version below adds color, transparency, and axis labels.
plt.scatter(score["수학점수"], score["읽기점수"], alpha=0.4, color="green")
plt.xlabel("Math Score")
plt.ylabel("Reading Score")
With the basic graph completed, let's perform linear regression analysis using numpy.
3. Polynomial
Call Polynomial.fit from numpy.polynomial, passing the x values, the y values, and the degree of the polynomial as arguments.
Let's fit a first-degree function to the math and reading scores.
from numpy.polynomial import Polynomial
f = Polynomial.fit(score["수학점수"], score["읽기점수"], 1)
When called this way, Polynomial.fit returns the fitted linear function.
So f is a callable that takes x values as arguments.
Let's check a predicted value by calling it as shown below.
from numpy.polynomial import Polynomial
f = Polynomial.fit(score["수학점수"], score["읽기점수"], 1)
f(40)
The predicted reading score for a student with a math score of 40 is 40.
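One detail worth knowing: Polynomial.fit rescales the x values to the window [-1, 1] internally, so f.coef holds coefficients for the rescaled variable rather than the slope and intercept in score units. Here is a sketch of how to recover them with convert(), using made-up data that lies exactly on y = 10 + 0.9x. The arrays and variable names are illustrative and do not come from the article's file.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up points lying exactly on y = 10 + 0.9 * x (illustrative only)
x = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
y = 10 + 0.9 * x

line = Polynomial.fit(x, y, 1)

# Coefficients in the internally rescaled variable, lowest degree first
print(line.coef)

# convert() maps the polynomial back to the original x units
intercept, slope = line.convert().coef
print(intercept, slope)  # close to 10 and 0.9

# Calling the fitted polynomial works in original units either way
print(line(40))  # close to 46
```

Applied to the fit in this article, f.convert().coef should give the intercept and slope of the regression line in score units.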
Now let's draw the graph.
4. Linear Regression Graph
In the data, students' math and reading scores are not sorted from 0 to 100.
plt.plot connects points in the order they are given, so if you pass the unsorted math scores to f, the line jumps back and forth and the graph gets more jumbled as the degree increases.
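As an alternative to generating new x values, you could sort the existing ones before evaluating the curve. This is a sketch of that idea on made-up scores, not the approach this article takes. The arrays and the degree of 2 are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up, unsorted scores (illustrative only)
math_scores = np.array([70.0, 20.0, 90.0, 40.0, 55.0])
reading_scores = np.array([72.0, 30.0, 85.0, 45.0, 60.0])

curve = Polynomial.fit(math_scores, reading_scores, 2)

# plt.plot joins points in the order given, so sort x first
x_sorted = np.sort(math_scores)
y_sorted = curve(x_sorted)

print(x_sorted)  # [20. 40. 55. 70. 90.]
# plt.plot(x_sorted, y_sorted) would now trace the curve left to right
```

Sorting only draws the curve where data points exist, so the line can look uneven where scores are sparse.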


To avoid this, generate evenly spaced numbers from 0 to 100 and use them as the x values for the fitted function when drawing the graph.
numpy's linspace takes the starting point, the ending point, and the number of values to generate as arguments.
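A small sketch of what linspace produces. Note that the end point is included by default. The count of 5 is just to keep the output readable.

```python
import numpy as np

# Five evenly spaced values from 0 to 100, end point included
x_small = np.linspace(0, 100, 5)
print(x_small)  # [  0.  25.  50.  75. 100.]

# A call with 200 points covers the same range more finely
x = np.linspace(0, 100, 200)
print(len(x), x[0], x[-1])  # 200 0.0 100.0
```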
import numpy as np
x = np.linspace(0,100,200)
plt.plot(x,f(x))
If you check the values of x, you can see they are evenly spaced from 0 to 100.
Plotting f(x) against them gives the graph below.


5. Completing the Graph
Now just overlay the two graphs to finish. The "r--" format string draws the regression line as a red dashed line.
plt.scatter(score["수학점수"], score["읽기점수"], alpha=0.4, color="green")
plt.xlabel("Math Score")
plt.ylabel("Reading Score")
plt.plot(x, f(x),"r--")
6. Concluding the Article
In this article, we drew graphs to see what kind of relationship two values have through linear regression analysis.
In the next article, I'll share some thoughts on how to fill out high school portfolios using data visualization.