Racing for the Score - Student Performance Prediction

Project Overview

This project uses machine learning to predict student academic performance. By analyzing 80,000 synthetically generated student records, we identified the factors most strongly associated with exam scores and built regression models that explain about 87% of the variance in those scores (R² ≈ 0.87).

Primary Objective: Develop an accurate, interpretable model to predict exam scores, and identify actionable insights that help educational institutions improve student outcomes and support at-risk students.

Project Scope

Comprehensive exploratory data analysis across 31 features
Statistical feature selection using Pearson and Chi-Square tests
Implementation and comparison of 5 regression algorithms
Rigorous assumption validation for linear models
Actionable recommendations for educational stakeholders

Student Records Analyzed	80,000
Features Examined	31
ML Models Tested	5
Best Model Fit (R²)	0.87

Dataset & Features

Our dataset was synthetically generated to simulate realistic student behavior patterns. Synthetic generation lets the data cover a wide range of student profiles and scenarios relevant to educational outcomes.

Dataset Overview

Metric	Value	Description
Total Records	80,000	Student profiles included
Total Features	31	Attributes per student
Missing Values	0	No missing values in any field
Target Variable	Exam Score	Range from 36 to 100
Features After Selection and Encoding	21	13 significant predictors, expanded to 21 columns by encoding

Feature Categories

Academic & Study (9 features)

study_hours_per_day, attendance_percentage, previous_gpa, semester, time_management_score, study_environment, access_to_tutoring, and related metrics

Psychological & Health (7 features)

mental_health_rating, stress_level, exam_anxiety_score, motivation_level, sleep_hours, diet_quality, exercise_frequency

Lifestyle & Demographics (7 features)

age, gender, part_time_job, extracurricular_participation, social_activity, learning_style, dropout_risk

Family & Support (3 features)

parental_education_level, family_income_range, parental_support_level

Research Methodology

Our research followed a structured four-phase data science pipeline: data preparation, exploratory analysis, feature engineering, and model development.

Phase 1: Data Preparation

Data Cleaning: Removed irrelevant identifiers such as student ID
Completeness Verification: Confirmed 0 missing values across all fields
Data Type Validation: Verified correct data types for all attributes
Statistical Summary: Generated comprehensive descriptive statistics
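
The four preparation steps above can be sketched with pandas. This is a minimal illustration on a hypothetical four-row frame, not the project's actual code; the column names are borrowed from the feature lists in this report, and the "Yes"/"No" levels are assumed.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has 80,000 rows and 31 features.
raw = pd.DataFrame({
    "student_id": [101, 102, 103, 104],
    "study_hours_per_day": [4.0, 2.5, 6.0, 3.5],
    "previous_gpa": [3.6, 2.9, 3.9, 3.2],
    "access_to_tutoring": ["Yes", "No", "Yes", "No"],  # assumed level names
    "exam_score": [92, 71, 100, 84],
})

# Data cleaning: drop the identifier, which carries no predictive signal.
students = raw.drop(columns=["student_id"])

# Completeness verification: total number of missing cells across all fields.
missing_cells = int(students.isna().sum().sum())

# Data type validation: the dtype pandas inferred for each column.
dtypes = students.dtypes

# Statistical summary: descriptive statistics for the numerical columns.
summary = students.describe()

print(missing_cells)
print(dtypes)
print(summary)
```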

Phase 2: Exploratory Analysis

Univariate Analysis: Distribution analysis on all 30 features
Correlation Analysis: Pearson correlation with target variable
Outlier Detection: Identified and documented unusual data points
Visualization: Generated 47 comprehensive visualizations

Phase 3: Feature Engineering

Statistical Testing: Applied Pearson test for numerical and Chi-Square for categorical variables
Feature Selection: Retained only statistically significant features (p-value < 0.05)
Encoding: Applied label encoding for binary and one-hot encoding for multi-class variables
Scaling: Normalized numerical features using standard scaling (mean=0, std=1)
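
The encoding and scaling steps above might look like the following sketch. It assumes pandas and uses a hypothetical toy frame; the category levels ("Yes"/"No", "Library", "Dorm", "Cafe") are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the features named in this report.
students = pd.DataFrame({
    "previous_gpa": [3.6, 2.9, 3.9, 3.2],
    "access_to_tutoring": ["Yes", "No", "Yes", "No"],              # binary
    "study_environment": ["Library", "Dorm", "Cafe", "Library"],   # multi-class
})

# Label encoding for the binary variable: map its two levels to 0 and 1.
students["access_to_tutoring"] = students["access_to_tutoring"].map({"No": 0, "Yes": 1})

# One-hot encoding for the multi-class variable: one indicator column per level.
students = pd.get_dummies(students, columns=["study_environment"], dtype=int)

# Standard scaling: subtract the mean and divide by the population standard
# deviation (ddof=0), so the column ends up with mean 0 and std 1.
gpa = students["previous_gpa"]
students["previous_gpa"] = (gpa - gpa.mean()) / gpa.std(ddof=0)

print(students)
```

One-hot encoding is what expands a small set of selected predictors into a larger number of model columns, since each multi-class variable contributes one column per level.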

Phase 4: Model Development

Data Splitting: 80% training data, 20% testing data with random_state=42
Model Training: Implemented 5 different regression algorithms
Hyperparameter Tuning: Optimized parameters for each model
Assumption Validation: Performed rigorous statistical assumption checks
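
As a rough sketch of the splitting and training steps above, the snippet below uses scikit-learn on randomly generated stand-in data. The library choice, the synthetic data, and the coefficient values are assumptions for illustration; only the 80/20 split and random_state=42 come from this report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 rows, 3 already-scaled features, a linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([0.9, 0.25, -0.24]) + rng.normal(scale=0.3, size=1000)

# Data splitting: 80% training, 20% testing, reproducible via random_state=42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model training: fit on the training portion only.
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out portion; score() returns R² for regressors.
test_r2 = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(test_r2, 3))
```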

Exploratory Data Analysis

Our exploratory analysis revealed important patterns and relationships in student data that directly impact academic performance.

Key Statistical Findings

Variable	Mean	Std Dev	Range	Distribution Type
Age	22.0 years	3.7	16-28	Normal
Study Hours per Day	4.17 hours	2.0	0-12	Right-skewed
Sleep Hours	7.02 hours	1.47	4-12	Normal
Exam Score	89.14 out of 100	11.59	36-100	Left-skewed
Stress Level	5.01 out of 10	1.95	1-10	Normal

Correlation Analysis with Exam Scores

Strongest Positive Correlation

0.93

Previous GPA - Strongest predictor

Moderate Positive Correlation

0.25

Motivation Level

Moderate Negative Correlation

-0.24

Exam Anxiety Score
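
Correlations like these, together with their p-values, can be computed with scipy.stats.pearsonr. The sketch below uses randomly generated stand-in data with made-up effect sizes, so its numbers will not match the figures in this report; it only illustrates the procedure and the p < 0.05 filter.

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-in data: one strong positive driver and one weak negative driver.
rng = np.random.default_rng(42)
n = 5000
previous_gpa = rng.normal(3.0, 0.5, n)
exam_anxiety_score = rng.normal(5.0, 2.0, n)
exam_score = 20 * previous_gpa - 1.0 * exam_anxiety_score + rng.normal(0, 3, n)

features = {
    "previous_gpa": previous_gpa,
    "exam_anxiety_score": exam_anxiety_score,
}

# Pearson correlation of each feature with the target, with its two-sided p-value.
correlations = {}
for name, values in features.items():
    r, p_value = pearsonr(values, exam_score)
    correlations[name] = (r, p_value)

# Keep only features whose correlation is significant at the 5% level.
selected = [name for name, (r, p) in correlations.items() if p < 0.05]

for name, (r, p) in correlations.items():
    print(f"{name}: r={r:.3f}, p={p:.3g}")
print(selected)
```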

Feature Engineering & Selection

We used statistical significance tests to identify the strongest predictors of academic performance, reducing the feature set from 31 to 13 (9 numerical and 4 categorical).

Selected Numerical Features (9 Total)

Rank	Feature	Correlation with Score	P-Value	Significance
1	Previous GPA	0.9329	<0.001	Strongest Predictor
2	Motivation Level	0.2503	<0.001	Significant
3	Study Hours per Day	0.2415	<0.001	Significant
4	Exam Anxiety Score	-0.2359	<0.001	Significant (Negative)
5	Screen Time	0.1698	<0.001	Moderate
6	Stress Level	-0.1186	<0.001	Moderate (Negative)
7	Sleep Hours	0.0908	<0.001	Significant
8	Exercise Frequency	0.0870	<0.001	Significant
9	Mental Health Rating	0.0106	<0.01	Significant
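
A correlation as small as 0.0106 can still be statistically significant because the sample is very large. As a worked check, assuming the standard t-test for a Pearson correlation and that all 80,000 records entered the test (the report does not state either), the test statistic for Mental Health Rating is:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.0106\,\sqrt{79{,}998}}{\sqrt{1-0.0106^{2}}}
  \approx \frac{0.0106 \times 282.84}{0.99994}
  \approx 3.00
```

With 79,998 degrees of freedom the t-distribution is effectively normal, so t ≈ 3.00 corresponds to a two-sided p-value of roughly 0.003, consistent with the p < 0.01 reported in the table. Statistical significance here says little about practical importance: squaring r shows that this feature alone accounts for about 0.01% of the variance in exam scores.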

Selected Categorical Features (4 Total)

Feature	Chi-Square Statistic	P-Value	Significance Level
Access to Tutoring	420.48	<0.001	Highly Significant
Dropout Risk	332.68	<0.001	Highly Significant
Study Environment	250.92	<0.001	Highly Significant
Major	6.28	<0.05	Significant
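
A Chi-Square test of independence needs a contingency table, and the report does not say how the continuous exam score was discretized. The sketch below assumes quartile binning and uses scipy.stats.chi2_contingency on randomly generated stand-in data; the binning choice, the "Yes"/"No" levels, and the effect size are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in data: tutored students score higher on average in this toy example.
rng = np.random.default_rng(7)
n = 2000
tutoring = rng.choice(["No", "Yes"], size=n)
exam_score = np.where(tutoring == "Yes", 80.0, 70.0) + rng.normal(0, 10, n)

# Assumption: bin the continuous score into quartiles to get a categorical target.
score_band = pd.qcut(exam_score, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Contingency table: counts of students per (tutoring level, score quartile).
table = pd.crosstab(
    pd.Series(tutoring, name="access_to_tutoring"),
    pd.Series(score_band, name="score_band"),
)

# Chi-Square test of independence between the feature and the binned score.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}")
```

The statistic depends on the binning, so values obtained this way are not directly comparable with those in the table unless the same bins are used.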

Data Preprocessing Results: The final dataset contains 80,000 rows and 21 features after encoding. All numerical features were normalized using standard scaling. Categorical features were processed with label encoding for binary variables and one-hot encoding for multi-class variables.

Machine Learning Models

We implemented and compared five widely used regression algorithms to identify the best predictive model for student academic performance.

Linear Regression Models

Multiple Linear Regression (MLR)

A baseline linear model that fits a hyperplane through the data using ordinary least squares. Highly interpretable and computationally efficient, it serves as our benchmark model.

Ridge Regression

L2-regularized regression (alpha=1.0) that penalizes large coefficients to reduce overfitting. It is particularly effective at handling the multicollinearity introduced by encoded categorical variables.

Lasso Regression

L1-regularized regression (alpha=0.01) that performs automatic feature selection by shrinking the coefficients of weak predictors to exactly zero, which helps identify the most important predictors.
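
The difference between the three linear models comes down to the penalty added to the least-squares objective. In generic form, with design matrix X, target y, coefficients β, n training rows, and regularization strength α:

```latex
\begin{aligned}
\hat{\beta}_{\text{MLR}}   &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 \\
\hat{\beta}_{\text{Ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2 \\
\hat{\beta}_{\text{Lasso}} &= \arg\min_{\beta}\; \tfrac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
\end{aligned}
```

The scaling of the squared-error term differs between libraries; the forms above follow scikit-learn's convention, which is an assumption about the tooling used here. Because of that difference, alpha=1.0 for Ridge and alpha=0.01 for Lasso are not on the same scale and should not be compared directly.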

Non-Linear Regression Models

Random Forest Regressor

An ensemble of 100 decision trees, each trained on a random subset of the data. It is robust to outliers, handles non-linear relationships, and provides built-in feature importance rankings.

Support Vector Regressor (SVR)

Uses a radial basis function (RBF) kernel to implicitly map the data into a higher-dimensional space and fit a regression function there. Effective at capturing non-linear patterns.
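
The five models described above could be set up as follows in scikit-learn. This is a sketch under assumptions: the library is not named in the report, the data is randomly generated, and random_state=42 on the forest is added only for reproducibility. The hyperparameters alpha=1.0, alpha=0.01, 100 trees, and the RBF kernel are the ones stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.svm import SVR

# The five regressors, configured with the hyperparameters given in this report.
models = {
    "Multiple Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.01),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "Support Vector Regressor": SVR(kernel="rbf"),
}

# Stand-in data: 300 rows, 4 scaled features, a linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.9, 0.25, -0.24, 0.1]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

# Fit each model on the same training rows and record its test-set R².
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, r2 in scores.items():
    print(f"{name}: R²={r2:.3f}")
```

Keeping the models in one dictionary makes it straightforward to train and evaluate them on identical splits, which is what makes the comparison fair.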

Results & Performance

All models demonstrated strong predictive performance, with linear models slightly outperforming non-linear alternatives on this dataset.
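
For reference, the usual regression metrics can be computed from predictions as in the sketch below. It uses only the Python standard library and four made-up score pairs; the function name regression_metrics is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MSE, RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in residuals) / n,
    }


# Made-up example: four true exam scores and four predictions.
metrics = regression_metrics([90.0, 80.0, 70.0, 100.0], [88.0, 83.0, 70.0, 99.0])
print(metrics)
```

Two properties are useful when reading a metrics table. MAE can never exceed RMSE on the same predictions, and when the target has been standardized to unit variance, MSE is approximately 1 minus R².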

Model Performance Comparison

Model	R² Score	MSE	MAE	Ranking
Multiple Linear Regression	0.8705	0.1305	0.2757	1st Place (Tied)
Ridge Regression	0.8705	0.1305	0.2757	1st Place (Tied)
Lasso Regression	0.8703	0.1307	0.
