Data Wrangling and Exploration: Nutrition Facts

Summary

This project explores and cleans a large dataset from Open Food Facts, which includes nutritional and product information for over 385.000 food items. The main focus is on fixing missing or incorrect values, organizing the data for analysis, and creating clear visual summaries to understand patterns across products and countries.

Dataset Overview

385.384 samples and 99 columns
Includes nutritional info (e.g. fat, sugar, protein), product names, countries, and units
Many missing or inconsistent values (e.g. missing vitamins, extreme outliers)
Data types grouped by structure: per 100g, per portion, unit type, and other metadata

Key Steps

1. Data Cleaning

Removed duplicate rows and columns with too many missing values
Filled in missing unit labels using the most frequent values (e.g. “g”, “mg”)
Labeled missing ingredients and product names to keep track of gaps
Replaced obviously incorrect values (e.g. 500g of fat per 100g product) with NaN
Standardized energy values (kJ to kcal) and fixed inconsistencies

2. Exploratory Analysis

Analyzed categorical variables like country and unit type using bar plots
Created histograms and descriptive stats for numeric values (e.g. sugar, fat, energy)
Found right-skewed distributions and extreme outliers in multiple features