The term "broken sample" typically refers to a row in a dataset that is incomplete, corrupted, or erroneous. Here's a structured approach to handling such data:
Common Causes of Broken Samples:
- Missing Values: Gaps in the data (e.g., empty cells, NaN).
- Outliers: Extreme values that deviate sharply from the rest of the distribution.
- Incorrect Labels: Misclassified data points.
- Data Corruption: Due to transmission/storage errors.
- Sensor Failures: In IoT/scientific data, malfunctioning devices produce invalid readings.
Solutions to Handle Broken Samples:
Identify Broken Samples
- Missing Values: Use `.isnull().sum()` in Python (Pandas) to detect gaps.
- Outliers: Apply statistical methods (Z-score, IQR) or visualization (box plots).
- Incorrect Labels: Cross-validation, manual inspection, or label consistency checks.
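The first two checks above can be sketched in a few lines of Pandas/NumPy; the toy column name `reading` and the data are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap and one extreme value
df = pd.DataFrame({"reading": [10.0, 11.0, 9.5, np.nan, 120.0]})

# Missing values per column
missing = df.isnull().sum()

# Z-score outlier flags (|z| > 3), computed on the non-missing entries;
# NaN rows simply evaluate to False in the comparison
z = (df["reading"] - df["reading"].mean()) / df["reading"].std()
outlier_mask = z.abs() > 3
```

Note that on very small samples the Z-score is bounded and may not flag anything; the IQR rule (shown in the workflow below) is often more useful there.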
Preprocessing Techniques
- For Missing Values:
- Imputation: Replace missing values with the mean, median, mode, or model predictions (e.g., `SimpleImputer` in Scikit-learn).
- Advanced: Use algorithms like KNN imputation or iterative imputation.
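The KNN variant can be sketched with scikit-learn's `KNNImputer`; the toy frame here is illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [10.0, 20.0, 30.0, 40.0]})

# Each gap is filled with the mean of that feature over the
# k nearest rows (distance computed on the shared non-missing columns)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Here the missing `a` in row 2 is filled with the mean of its two nearest neighbors' `a` values, (2.0 + 4.0) / 2 = 3.0.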
- For Outliers:
- Capping: Clip values to a percentile (e.g., 1st-99th).
- Transformation: Apply log/Box-Cox to reduce skewness.
- Removal: Delete samples if outliers are erroneous.
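Capping to the 1st-99th percentile range is a one-liner with Pandas `clip`; the series below is a toy example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])

# Cap values outside the 1st-99th percentile range
lo, hi = s.quantile(0.01), s.quantile(0.99)
capped = s.clip(lower=lo, upper=hi)
```

The extreme value 100 is pulled in to the 99th-percentile bound, while values inside the range are untouched.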
- For Incorrect Labels:
- Relabeling: Use majority voting or expert correction.
- Robust Models: Use algorithms that tolerate some label noise, such as Random Forests or SVMs.
Model Selection
- Tree-Based Models: XGBoost, LightGBM handle missing values natively.
- Ensemble Methods: Bagging/Boosting reduce sensitivity to broken samples.
- Neural Networks: Use dropout layers to mitigate noise impact.
Validation Strategies
- Cross-Validation: Ensure robustness (e.g., K-fold CV).
- Synthetic Data: Generate samples via GANs if data is scarce.
- Domain-Specific Rules: E.g., reject negative sensor readings.
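K-fold cross-validation is a one-liner with `cross_val_score`; the Iris dataset here is just a stand-in for your cleaned data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: five accuracy scores, one per held-out fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
```

A large spread between folds is itself a warning sign that broken samples (or too-aggressive cleaning) are skewing the data.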
Example Workflow in Python:
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("broken_sample.csv")

# Identify missing values
print(data.isnull().sum())

# Impute missing values in the features (keep the target column aside)
X = data.drop("target", axis=1)
y = data["target"]
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imputer.fit_transform(X),
                         columns=X.columns, index=X.index)

# Detect outliers (IQR method) on the features only
Q1 = X_imputed.quantile(0.25)
Q3 = X_imputed.quantile(0.75)
IQR = Q3 - Q1
outliers = (X_imputed < (Q1 - 1.5 * IQR)) | (X_imputed > (Q3 + 1.5 * IQR))
keep = ~outliers.any(axis=1)

# Train model on the cleaned rows
model = RandomForestClassifier()
model.fit(X_imputed[keep], y[keep])
```
Key Considerations:
- Domain Context: In medical data, removal may be safer than imputation.
- Data Volume: Large datasets tolerate more aggressive cleaning.
- Bias Introduction: Imputation can skew distributions; validate with domain experts.
When to Discard Samples:
- More than 30% missing values in a row.
- Critical errors (e.g., negative age, impossible timestamps).
- Label inconsistencies confirmed via manual review.
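The 30%-missing rule can be enforced with `dropna(thresh=...)`, which keeps only rows with at least a minimum number of non-missing values; the toy frame is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, np.nan, 9.0]})

# Keep rows with at least 70% of values present (i.e. < 30% missing)
min_present = int(np.ceil(0.7 * df.shape[1]))
kept = df.dropna(thresh=min_present)
```

With three columns this requires all three values present, so only the fully populated row survives.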
By systematically addressing broken samples, you ensure model reliability and prevent skewed results. Always document preprocessing steps for reproducibility.