Athlete Biographical Data Analysis
This project involves cleaning, transforming, and analyzing a dataset of international athletes using Python and Pandas.
Dataset
- File: bios.csv
- Type: CSV
- Fields include athlete name, height, birthplace, birthdate, and more
Tools Used
- Python 3.10.11
- Pandas
- Jupyter Notebook
Project Steps
- Data Loading & Initial Exploration
  - Loaded CSV into a DataFrame using pd.read_csv()
  - Checked null values, structure, and sample rows
- Filtering
  - Filtered by height > 215 cm
  - Filtered by birthplace (country, city)
  - Regex filtering for names (e.g., "Keith", "Patrick")
- Feature Engineering
  - Extracted first_name from full names
  - Converted born_date to datetime format
  - Created born_year column
- Output
  - Saved cleaned dataset as bios_new.csv
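The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names name, height_cm, born_city, and born_country are assumptions (only first_name, born_date, and born_year are named in this README), the helper functions are hypothetical, and the sample rows are invented so the snippet runs without bios.csv.

```python
import io

import pandas as pd

# Tiny stand-in for bios.csv. The rows are invented, and every column name
# except born_date is an assumption made for this sketch.
SAMPLE_CSV = """name,height_cm,born_city,born_country,born_date
Keith Example,218,Seattle,USA,1990-04-12
Patrick Sample,201,Lyon,FRA,1985-11-03
Maria Placeholder,172,Seattle,USA,
Jon Keithson,216,Oslo,NOR,1978-01-30
"""


def load(source):
    """Load the CSV and print structure, null counts, and sample rows."""
    df = pd.read_csv(source)
    df.info()
    print(df.isna().sum())
    print(df.head())
    return df


def tall(df, min_cm=215):
    """Rows whose height is strictly greater than min_cm."""
    return df[df["height_cm"] > min_cm]


def born_in(df, country, city):
    """Rows matching both birth country and birth city."""
    return df[(df["born_country"] == country) & (df["born_city"] == city)]


def first_name_matches(df, pattern=r"^(?:Keith|Patrick)\b"):
    """Rows whose name starts with one of the given first names (regex)."""
    return df[df["name"].str.contains(pattern, regex=True, na=False)]


def add_features(df):
    """Add first_name and born_year, and parse born_date as datetime."""
    out = df.copy()
    out["first_name"] = out["name"].str.split(" ").str[0]
    # Unparseable or missing dates become NaT instead of raising.
    out["born_date"] = pd.to_datetime(out["born_date"], errors="coerce")
    out["born_year"] = out["born_date"].dt.year
    return out


df = load(io.StringIO(SAMPLE_CSV))  # with the real file: load("bios.csv")
bios_new = add_features(df)

# Without a path, to_csv returns the CSV text; passing "bios_new.csv"
# instead would write the file to disk.
csv_text = bios_new.to_csv(index=False)
```

Because one sample row has no birth date, born_year contains a missing value and pandas stores the column as floats; a nullable integer dtype such as "Int64" is one way to keep whole-number years if that matters.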
Key Insights
- Tallest athletes identified across countries
- Seattle and specific U.S. states stood out as major birthplace regions
- Extracted clean year-based insights from raw date data
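As a rough illustration of how insights like these could be derived with Pandas, the sketch below finds the tallest athlete per country and counts the most common birth cities. The column names (name, born_country, born_city, height_cm) are assumptions and the rows are invented; the notebook may compute these differently.

```python
import pandas as pd

# Invented rows; the column names are assumptions for this sketch.
athletes = pd.DataFrame(
    {
        "name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
        "born_country": ["USA", "USA", "FRA", "NOR"],
        "born_city": ["Seattle", "Seattle", "Lyon", "Oslo"],
        "height_cm": [200, 218, 190, 216],
    }
)

# Drop rows without a height so idxmax only sees real measurements.
with_height = athletes.dropna(subset=["height_cm"])

# idxmax returns, for each country, the row label of the tallest athlete;
# .loc then pulls those rows out of the frame.
tallest_idx = with_height.groupby("born_country")["height_cm"].idxmax()
tallest = with_height.loc[tallest_idx, ["born_country", "name", "height_cm"]]

# Birth cities ranked by how many athletes were born there.
top_cities = athletes["born_city"].value_counts().head(3)

print(tallest)
print(top_cities)
```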
Files in This Repo
- README.md
- bios.csv: Original dataset
- bios_analysis.ipynb: Jupyter Notebook with analysis
- bios_new.csv: Cleaned and enriched dataset
Author
Contributed by: Divya Bariya (hands-on implementation based on publicly available learning content)