Comparing and Analyzing Student Engagement Between Q2 2021 and Q2 2022
Background:
Throughout this Tracking User Engagement with SQL, Excel, and Python project, I worked with a real dataset from company’s data. The project required me to analyze whether the new additions to the platform (new courses, exams, and career tracks) have increased student engagement.
The following information were given:
Holder (student ID) and issuance date of certificates issued in Q2 2022
Student ID and registration date of students registered between January 1, 2020 and June 30, 2022
Student ID, product type, purchase date, and refund date (if applicable) of purchases made between January 1, 2020 and June 30, 2022
Student watching (student ID), time watched, and date of courses watched in Q2 2021 and Q2 2022
Results:
Data Preparation
After the data preparation with SQL and Python, the distribution were plotted.
As the plots reveal, all distributions of the minutes students watched were skewed to the right. This suggests some outliers in the data who have watched much more than most of the students on the platform. Their presence in the data may skew all metrics, such as the mean, median, and standard deviation.
After removing outliers, new distribution were plotted:
New datasets with removal of outliers were then used for the later analysis.
For free-plan students who watched in Q2 2021, the mean minutes watched (14.20) are significantly higher than the median (2.79). This suggests a right-skewed distribution, indicating that a few students watched much more than others. A similar situation is observed for free-plan students who watched in Q2 2022, with the mean (16.03) being higher than the median (4.98), indicating right skewness. The same applies to paying students who watched in Q2 2021 and those who watched in Q2 2022, where the mean is higher than the median, indicating right skewness.
The results meet expectations because they align with the shapes of the data distribution. The right skewness observed in the distributions would lead us to expect a higher mean than median due to the influence of high-value outliers, which these statistics confirm. The difference in these metrics between free-plan and paying subscribers also makes sense because we expect paying students to generally watch more content than free-plan ones, leading to higher means and medians.
Observations:
For free-plan students, there’s an increase in engagement from Q2 2021 to Q2 2022, as the confidence interval for the later period (15.41 – 16.66 minutes) is slightly higher than for the earlier one (13.55 – 14.87 minutes).
Students with paid memberships watch substantially more than those without. This is evident by comparing the confidence intervals of the two groups in Q2 2021: 13.55 – 14.87 minutes for non-subscribers and 339.60 – 380.61 minutes for subscribers.
Among the paid subscribers, there’s a decrease in engagement from Q2 2021 to Q2 2022, as the confidence interval for the later period (276.54 – 307.90 minutes) is lower than for the earlier one (339.60 – 380.61 minutes).
Data Prediction with Python
Linear Regression was used for predicting the user engagement in this project.
The linear equation would be:
y = 1.056 + 0.001740x
R-squared value was:
0.3052942915158895
This suggests that about 30% of the variability in the target variable (the number of certificates issued) is explained by the input variable (the number of minutes watched). This model does not account for the other 70%.
To visualize the model performance, a scatter plot was done in which the x-axis represented the actual test values, and the y-axis was the predicted values.
Hey there! If you're curious, why not swing by my GitHub link below? Check it out for more details!