Predicting Student Academic Performance Using Machine Learning
Comprehensive Analysis of 80,000+ Student Records
This project is a data science initiative to predict student academic performance using machine learning. By analyzing 80,000 synthetically generated student records, we identified critical factors that influence academic success and developed predictive models that explain roughly 87% of the variance in exam scores (R² ≈ 0.87).
Our dataset was synthetically generated to simulate realistic student behavior patterns while maintaining statistical validity. This approach ensures comprehensive coverage of diverse student profiles and scenarios relevant to educational outcomes.
| Metric | Value | Description |
|---|---|---|
| Total Records | 80,000 | Student profiles included |
| Total Features | 31 | Attributes per student |
| Missing Values | 0 | No missing or null entries |
| Target Variable | Exam Score | Range from 36 to 100 |
| Features After Selection | 13 | Significant predictors retained |
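As a sanity check on the data-quality figures above, a minimal sketch of how the row, feature, and missing-value counts can be verified with pandas (a hypothetical miniature frame stands in for the real 80,000-row file, whose path is not given here):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real 80,000-row dataset.
df = pd.DataFrame({
    "previous_gpa": [3.1, 2.8, 3.9, 3.4],
    "study_hours_per_day": [4.0, 2.5, 6.0, 3.5],
    "exam_score": [88, 74, 97, 85],
})

n_rows, n_features = df.shape          # records x attributes
n_missing = int(df.isna().sum().sum()) # total missing cells across all columns
print(n_rows, n_features, n_missing)
```

On the full dataset the same three lines would report 80,000 rows, 31 columns, and 0 missing values.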
- **Academic and study factors:** study_hours_per_day, attendance_percentage, previous_gpa, semester, time_management_score, study_environment, access_to_tutoring, and related metrics
- **Health and wellbeing:** mental_health_rating, stress_level, exam_anxiety_score, motivation_level, sleep_hours, diet_quality, exercise_frequency
- **Demographics and lifestyle:** age, gender, part_time_job, extracurricular_participation, social_activity, learning_style, dropout_risk
- **Family background:** parental_education_level, family_income_range, parental_support_level
Our research followed a structured, scientifically rigorous data science pipeline to ensure robust, reliable results throughout the analysis.
Our exploratory analysis revealed important patterns and relationships in student data that directly impact academic performance.
| Variable | Mean | Std Dev | Range | Distribution Type |
|---|---|---|---|---|
| Age | 22.0 years | 3.7 | 16-28 | Normal |
| Study Hours per Day | 4.17 hours | 2.0 | 0-12 | Right-skewed |
| Sleep Hours | 7.02 hours | 1.47 | 4-12 | Normal |
| Exam Score | 89.14 out of 100 | 11.59 | 36-100 | Left-skewed |
| Stress Level | 5.01 out of 10 | 1.95 | 1-10 | Normal |
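The distribution shapes in the table can be checked numerically via skewness: a right-skewed variable has positive skew, a left-skewed one negative. A sketch on synthetic stand-ins (the generating distributions are assumptions, chosen only to mimic the reported shapes, not the actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins mirroring the reported distributions (assumed shapes).
sleep = rng.normal(7.02, 1.47, 5000).clip(4, 12)           # roughly normal
study = rng.gamma(2.0, 2.0, 5000).clip(0, 12)              # right-skewed
scores = (100 - rng.gamma(2.0, 5.0, 5000)).clip(36, 100)   # left-skewed

summary = pd.DataFrame({
    "sleep_hours": sleep,
    "study_hours_per_day": study,
    "exam_score": scores,
}).agg(["mean", "std", "skew"]).round(2)
print(summary)
```

Positive skew for study hours and negative skew for exam scores match the "Right-skewed" and "Left-skewed" labels above.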
- Previous GPA (strongest predictor)
- Motivation Level
- Exam Anxiety Score
We employed rigorous statistical testing to identify the most significant predictors of academic performance, reducing the feature set from 31 to 13 through evidence-based selection.
| Rank | Feature | Correlation with Score | P-Value | Significance |
|---|---|---|---|---|
| 1 | Previous GPA | 0.9329 | <0.001 | Strongest Predictor |
| 2 | Motivation Level | 0.2503 | <0.001 | Significant |
| 3 | Study Hours per Day | 0.2415 | <0.001 | Significant |
| 4 | Screen Time | 0.1698 | <0.001 | Moderate |
| 5 | Exam Anxiety Score | -0.2359 | <0.001 | Significant (Negative) |
| 6 | Stress Level | -0.1186 | <0.001 | Moderate (Negative) |
| 7 | Sleep Hours | 0.0908 | <0.001 | Significant |
| 8 | Exercise Frequency | 0.0870 | <0.001 | Significant |
| 9 | Mental Health Rating | 0.0106 | <0.01 | Significant |
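Each row of the table pairs a Pearson correlation with a p-value; `scipy.stats.pearsonr` returns both at once. A sketch on hypothetical data in which GPA strongly raises scores and exam anxiety weakly lowers them (the coefficients are illustrative assumptions, not fitted values):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 2000
# Hypothetical generative story: GPA strongly drives score, anxiety weakly lowers it.
gpa = rng.normal(3.0, 0.5, n)
anxiety = rng.normal(5.0, 2.0, n)
score = 20 * gpa - 0.8 * anxiety + rng.normal(0, 5, n)

r_gpa, p_gpa = pearsonr(gpa, score)      # strong positive correlation
r_anx, p_anx = pearsonr(anxiety, score)  # weak negative correlation
print(round(r_gpa, 3), round(r_anx, 3))
```

With 80,000 records even small correlations (e.g. the 0.0106 for Mental Health Rating) reach p < 0.01, which is why the table separates statistical significance from effect size.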
| Feature | Chi-Square Statistic | P-Value | Significance Level |
|---|---|---|---|
| Access to Tutoring | 420.48 | <0.001 | Highly Significant |
| Dropout Risk | 332.68 | <0.001 | Highly Significant |
| Study Environment | 250.92 | <0.001 | Highly Significant |
| Major | 6.28 | <0.05 | Significant |
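For the categorical features, significance comes from a chi-square test of independence on a contingency table of feature category versus score band. A sketch with made-up counts for tutoring access (illustrative only, not from the dataset):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: tutoring access vs. score band.
#                        high score  low score
contingency = np.array([[1200,       300],    # has tutoring
                        [ 800,       700]])   # no tutoring

chi2, p, dof, expected = chi2_contingency(contingency)
print(round(chi2, 1), dof, p < 0.001)
```

A large chi-square statistic with p < 0.001, as for Access to Tutoring above, indicates the feature's distribution differs markedly across score bands.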
We implemented and compared five widely used regression algorithms to identify the best predictive model for student academic performance.
A baseline linear model that fits a hyperplane through the data using ordinary least squares. Highly interpretable and computationally efficient, it serves as our benchmark model.
L2 regularized regression with alpha=1.0 that penalizes large coefficients to prevent overfitting. Particularly effective at handling multicollinearity present in encoded categorical variables.
L1 regularized regression with alpha=0.01 that performs automatic feature selection by shrinking insignificant coefficients to zero, helping identify the most important predictors.
Ensemble of 100 decision trees trained on random subsets of data. Robust to outliers, handles non-linear relationships, and provides built-in feature importance rankings.
Uses radial basis function (RBF) kernel to map data to higher dimensions and find optimal regression hyperplane. Effective for non-linear pattern recognition.
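The five models above can be fitted and compared with scikit-learn. A sketch on synthetic, mostly linear data, mirroring the report's setting where linear models do well (the feature count, coefficients, and noise level are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 1500
X = rng.normal(size=(n, 5))
# Mostly linear target, so the linear family should score highest.
y = X @ np.array([8.0, 2.0, 1.5, -2.0, 0.5]) + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Ridge":  Ridge(alpha=1.0),               # L2 penalty, as described above
    "Lasso":  Lasso(alpha=0.01),              # L1 penalty, as described above
    "RF":     RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR":    SVR(kernel="rbf"),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, r2 in scores.items():
    print(f"{name}: R2 = {r2:.3f}")
```

With a modest L2 penalty on 1,200 training rows, Ridge lands essentially on top of plain least squares, which is also what the tied first place in the results table reflects.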
All models demonstrated strong predictive performance, with linear models slightly outperforming non-linear alternatives on this dataset.
| Model | R² Score | MAE | RMSE | Ranking |
|---|---|---|---|---|
| Multiple Linear Regression | 0.8705 | 0.1305 | 0.2757 | 1st Place (Tied) |
| Ridge Regression | 0.8705 | 0.1305 | 0.2757 | 1st Place (Tied) |
| Lasso Regression | 0.8703 | 0.1307 | 0. |
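For reference, the three metrics in the table can be computed directly from predictions; note that RMSE ≥ MAE always holds for the same prediction set, a useful consistency check when reading results tables. A sketch with hypothetical true and predicted scores:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R², RMSE, and MAE as used in the model comparison table."""
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot              # fraction of variance explained
    rmse = np.sqrt(np.mean(resid ** 2))   # penalizes large errors more
    mae = np.mean(np.abs(resid))          # average absolute error
    return r2, rmse, mae

# Hypothetical exam scores and model predictions.
y_true = np.array([88.0, 74.0, 97.0, 85.0, 62.0])
y_pred = np.array([86.5, 76.0, 95.0, 84.0, 65.0])
r2, rmse, mae = regression_metrics(y_true, y_pred)
print(round(r2, 3), round(rmse, 3), round(mae, 3))
```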