Data Science Research Project
Done by: Amirtha Ganesh R
Student Academic Performance Prediction System

Racing for the 'Score'

Predicting Student Academic Performance Using Machine Learning

Comprehensive Analysis of 80,000 Student Records

Project Overview

This project uses machine learning to predict student academic performance. By analyzing 80,000 synthetically generated student records, we identified the factors most strongly associated with exam scores and built regression models that explain about 87% of the variance in those scores (R² ≈ 0.87).

Primary Objective: Develop an accurate, interpretable model to predict exam scores while identifying actionable insights for educational institutions to improve student outcomes and support at-risk students.

Project Scope

Student Records Analyzed: 80,000
Features Examined: 31
ML Models Tested: 5
Model Accuracy (R²): 87%

Dataset & Features

Our dataset was synthetically generated to simulate realistic student behavior patterns while maintaining statistical validity. This approach ensures comprehensive coverage of diverse student profiles and scenarios relevant to educational outcomes.

Dataset Overview

| Metric | Value | Description |
| --- | --- | --- |
| Total Records | 80,000 | Student profiles included |
| Total Features | 31 | Attributes per student |
| Missing Values | 0 | Dataset is 100% complete |
| Target Variable | Exam Score | Range from 36 to 100 |
| Features After Selection | 21 | Columns after encoding the 13 selected predictors |

Feature Categories

Academic & Study (9 features)

study_hours_per_day, attendance_percentage, previous_gpa, semester, time_management_score, study_environment, access_to_tutoring, and related metrics

Psychological & Health (7 features)

mental_health_rating, stress_level, exam_anxiety_score, motivation_level, sleep_hours, diet_quality, exercise_frequency

Lifestyle & Demographics (7 features)

age, gender, part_time_job, extracurricular_participation, social_activity, learning_style, dropout_risk

Family & Support (3 features)

parental_education_level, family_income_range, parental_support_level

Research Methodology

Our research followed a structured data science pipeline in four phases.

Phase 1: Data Preparation

Phase 2: Exploratory Analysis

Phase 3: Feature Engineering

Phase 4: Model Development

Exploratory Data Analysis

Our exploratory analysis revealed important patterns and relationships in student data that directly impact academic performance.

Key Statistical Findings

| Variable | Mean | Std Dev | Range | Distribution Type |
| --- | --- | --- | --- | --- |
| Age | 22.0 years | 3.7 | 16-28 | Normal |
| Study Hours per Day | 4.17 hours | 2.0 | 0-12 | Right-skewed |
| Sleep Hours | 7.02 hours | 1.47 | 4-12 | Normal |
| Exam Score | 89.14 out of 100 | 11.59 | 36-100 | Left-skewed |
| Stress Level | 5.01 out of 10 | 1.95 | 1-10 | Normal |
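
The summary statistics and distribution labels above can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the population form of the standard deviation and a moment-based skewness, and the `tolerance` cut-off used to label a distribution is a hypothetical choice, not a value taken from this project.

```python
import math

def describe(values):
    """Return mean, population standard deviation, and moment-based skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return mean, std, skewness

def distribution_label(skewness, tolerance=0.5):
    """Label a distribution from its skewness (tolerance is a hypothetical cut-off)."""
    if skewness > tolerance:
        return "Right-skewed"
    if skewness < -tolerance:
        return "Left-skewed"
    return "Roughly symmetric"

# Toy exam scores (made up): most students score high, one scores low,
# so the long tail points to the left.
toy_scores = [100, 98, 97, 95, 60]
mean, std, skew = describe(toy_scores)
label = distribution_label(skew)
```

A score distribution capped at 100 with a mean near the cap, as in the table, will typically come out left-skewed under this kind of check.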

Correlation Analysis with Exam Scores

Strongest Positive Correlation: 0.93 (Previous GPA, the strongest predictor)
Moderate Positive Correlation: 0.25 (Motivation Level)
Moderate Negative Correlation: -0.24 (Exam Anxiety Score)

Feature Engineering & Selection

We used statistical tests (correlation with the exam score for numerical features, chi-square tests for categorical features) to identify the most significant predictors of academic performance, reducing the feature set from 31 to 13.

Selected Numerical Features (9 Total)

| Rank | Feature | Correlation with Score | P-Value | Significance |
| --- | --- | --- | --- | --- |
| 1 | Previous GPA | 0.9329 | <0.001 | Strongest Predictor |
| 2 | Motivation Level | 0.2503 | <0.001 | Significant |
| 3 | Study Hours per Day | 0.2415 | <0.001 | Significant |
| 4 | Screen Time | 0.1698 | <0.001 | Moderate |
| 5 | Exam Anxiety Score | -0.2359 | <0.001 | Significant (Negative) |
| 6 | Stress Level | -0.1186 | <0.001 | Moderate (Negative) |
| 7 | Sleep Hours | 0.0908 | <0.001 | Significant |
| 8 | Exercise Frequency | 0.0870 | <0.001 | Significant |
| 9 | Mental Health Rating | 0.0106 | <0.01 | Significant |
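
To make the correlation and p-value columns concrete, here is a minimal sketch of a Pearson correlation with an approximate two-sided significance test. It assumes a normal approximation to the t distribution, which is reasonable only for large samples such as the 80,000 rows used here; the function names are illustrative and not taken from the project code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def approx_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and treats t as standard normal,
    an approximation that is only appropriate for large n. Requires |r| < 1.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

# With 80,000 rows even a very small correlation can clear the 0.01 threshold.
p_small_r = approx_p_value(0.0106, 80_000)
```

This illustrates why a feature such as Mental Health Rating can be statistically significant while contributing very little predictive signal: at this sample size, significance and effect size are quite different questions.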

Selected Categorical Features (4 Total)

| Feature | Chi-Square Statistic | P-Value | Significance Level |
| --- | --- | --- | --- |
| Access to Tutoring | 420.48 | <0.001 | Highly Significant |
| Dropout Risk | 332.68 | <0.001 | Highly Significant |
| Study Environment | 250.92 | <0.001 | Highly Significant |
| Major | 6.28 | <0.05 | Significant |
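
A chi-square test of independence compares observed category counts with the counts expected if the two variables were unrelated. The sketch below computes the statistic and its degrees of freedom from raw labels. It assumes the exam score has been grouped into bands (for example "high" and "low") before testing, since the test needs a categorical outcome; how the score was actually discretized in this project is not stated, and the toy labels are made up.

```python
from collections import Counter

def chi_square_statistic(categories, outcomes):
    """Chi-square statistic and degrees of freedom for two categorical sequences."""
    n = len(categories)
    observed = Counter(zip(categories, outcomes))  # joint counts
    row_totals = Counter(categories)
    col_totals = Counter(outcomes)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n  # count expected under independence
            stat += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Toy data: tutored students land in the "high" band more often.
tutoring = ["yes"] * 4 + ["no"] * 4
score_band = ["high", "high", "high", "low", "low", "low", "low", "high"]
stat, dof = chi_square_statistic(tutoring, score_band)
```

The statistic is then compared against a chi-square distribution with `dof` degrees of freedom to obtain a p-value like those in the table.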

Data Preprocessing Results: The final dataset contains 80,000 rows with 21 features after encoding. All numerical features were normalized using standard scaling. Categorical features were processed with label encoding for binary variables and one-hot encoding for multi-class variables.
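
The preprocessing described above could be sketched with scikit-learn as follows. This is an illustration under assumptions, not the project's actual code: the column names come from the feature lists in this report, but the toy rows and category values are hypothetical, and `OrdinalEncoder` stands in for label encoding because it is scikit-learn's feature-side encoder for mapping categories to integers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy rows; the category values are made up for illustration.
df = pd.DataFrame({
    "previous_gpa": [3.1, 3.8, 2.5, 3.4],
    "study_hours_per_day": [4.0, 6.5, 2.0, 5.0],
    "access_to_tutoring": ["Yes", "No", "No", "Yes"],          # binary
    "study_environment": ["Library", "Dorm", "Cafe", "Dorm"],  # multi-class
})

preprocess = ColumnTransformer(
    transformers=[
        # Standard scaling: zero mean, unit variance per numerical column.
        ("scale", StandardScaler(), ["previous_gpa", "study_hours_per_day"]),
        # Binary category mapped to 0/1.
        ("binary", OrdinalEncoder(), ["access_to_tutoring"]),
        # Multi-class category expanded to one indicator column per class.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["study_environment"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
```

Here two scaled columns, one binary column, and three one-hot columns give six output columns, which mirrors how a smaller set of selected features grows once categorical variables are encoded.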

Machine Learning Models

We implemented and compared five regression algorithms, three linear and two non-linear, to identify the best predictive model for student academic performance.

Linear Regression Models

Multiple Linear Regression (MLR)

A baseline linear model that fits a hyperplane through the data using ordinary least squares. Highly interpretable and computationally efficient, it serves as our benchmark model.
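
As a minimal sketch of ordinary least squares, the snippet below fits an intercept and weights with NumPy's least-squares solver. The synthetic data and the helper name `fit_ols` are assumptions made for illustration; they are not the project's data or code.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ intercept + X @ weights by ordinary least squares."""
    # Prepend a column of ones so the intercept is estimated with the weights.
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef[0], coef[1:]

# Noiseless toy data generated from a known hyperplane.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

intercept, weights = fit_ols(X, y)
```

Because the toy target is an exact linear function of the inputs, the fit recovers the generating intercept and weights up to floating-point error.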

Ridge Regression

L2-regularized regression with alpha=1.0 that penalizes large coefficients to reduce overfitting. Particularly effective at handling the multicollinearity present in encoded categorical variables.
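
The effect of the L2 penalty can be seen in the closed-form ridge solution. The sketch below assumes centered data with an unpenalized intercept, a common convention; the helper name and the toy data with two nearly identical features are illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is not penalized."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc.
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return intercept, weights

# Two nearly collinear features: a typical multicollinearity scenario.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + 0.01 * rng.normal(size=300)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=300)

_, w_ols = fit_ridge(X, y, alpha=0.0)    # alpha=0 reduces to least squares
_, w_ridge = fit_ridge(X, y, alpha=1.0)  # alpha=1.0 as quoted in the text
```

With collinear inputs the unpenalized weights can swing to large offsetting values, while the penalized weights stay smaller in norm and closer to each other.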

Lasso Regression

L1-regularized regression with alpha=0.01 that performs automatic feature selection by shrinking the coefficients of weak predictors to exactly zero, helping identify the most important predictors.
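
The way an L1 penalty drives coefficients to exactly zero is easiest to see in coordinate descent with soft-thresholding. The sketch below minimizes (1/(2n))·||y - Xw||² + alpha·||w||₁ on centered data. It is a teaching sketch with a fixed number of passes and no convergence check, and the toy data are an assumption for illustration.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it falls inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def fit_lasso(X, y, alpha=0.01, n_iter=200):
    """Lasso weights via cyclic coordinate descent on centered data."""
    n_samples, n_features = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = np.zeros(n_features)
    col_sq = (Xc ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back.
            partial_residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Toy data: only the first of three features drives the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0]

weights = fit_lasso(X, y, alpha=0.01)
```

The two irrelevant features receive weights of exactly zero, while the informative feature keeps a weight slightly below its true value of 2 because of the shrinkage.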

Non-Linear Regression Models

Random Forest Regressor

Ensemble of 100 decision trees trained on random subsets of data. Robust to outliers, handles non-linear relationships, and provides built-in feature importance rankings.
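
A random forest of 100 trees and its feature importance ranking could be sketched with scikit-learn as below. The synthetic data, the random seed, and all hyperparameters other than the number of trees are assumptions; the project's actual settings are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: the target depends non-linearly on feature 0 only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = 10 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

# n_estimators=100 matches the text; other settings are library defaults.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features.
importances = forest.feature_importances_
top_feature = int(importances.argmax())
```

The importances concentrate on the one feature that actually drives the target, even though the relationship is quadratic rather than linear.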

Support Vector Regressor (SVR)

Uses a radial basis function (RBF) kernel to implicitly map the data into a higher-dimensional space and fit a regression function there. Effective for capturing non-linear patterns.
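
The RBF kernel measures similarity as exp(-gamma · ||a - b||²), so nearby points have a similarity close to 1 and distant points close to 0. The sketch below shows the kernel value directly and fits scikit-learn's `SVR` on a toy non-linear curve. The data and the use of library defaults for `C`, `epsilon`, and `gamma` are assumptions, since the project's hyperparameters are not given.

```python
import numpy as np
from sklearn.svm import SVR

def rbf_kernel_value(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2); equals 1 when a == b."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * (diff @ diff)))

# Toy non-linear target: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=300)

model = SVR(kernel="rbf")  # C, epsilon, gamma left at library defaults (assumption)
model.fit(X, y)
train_r2 = model.score(X, y)
```

Because the kernel depends on distances between points, SVR is sensitive to feature scale, which is one reason standard scaling of the numerical features matters for this model.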

Results & Performance

All models demonstrated strong predictive performance, with linear models slightly outperforming non-linear alternatives on this dataset.
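
The comparison in this section relies on three standard regression metrics: R² (share of variance explained), RMSE (root mean squared error), and MAE (mean absolute error). A minimal sketch of how they are computed is shown below; the function name and toy values are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Toy example: three exact predictions and one miss by 2 points.
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
```

For any set of residuals, MAE never exceeds RMSE, and both are expressed in the units of the target, so their values depend on whether the exam score was scaled before modeling.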

Model Performance Comparison

Model R² Score RMSE MAE Ranking
Multiple Linear Regression 0.8705 0.1305 0.2757 1st Place (Tied)
Ridge Regression 0.8705 0.1305 0.2757 1st Place (Tied)
Lasso Regression 0.8703 0.1307 0.