
  Blog    |     February 05, 2026

Fake label discovery is the task of identifying and correcting mislabeled data points in a dataset. Fixing these errors is crucial for model performance, especially when labels are noisy or were collected without careful review. Below is a step-by-step solution using Python and scikit-learn, followed by an explanation of the approach.

Solution Code

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def find_fake_labels(X, y, k=10, confidence_threshold=0.9, random_state=42):
    """
    Identifies potential fake labels (mislabeled data points) using out-of-fold predictions.

    Args:
        X (np.ndarray): Feature matrix of shape (n_samples, n_features).
        y (np.ndarray): Labels of shape (n_samples,), encoded as 0..n_classes-1
            so they align with the column order of predict_proba.
        k (int): Number of top candidates to return. Default is 10.
        confidence_threshold (float): Minimum confidence for a prediction to be
            considered. Default is 0.9.
        random_state (int): Random seed for reproducibility. Default is 42.

    Returns:
        list: Indices of the top k most likely mislabeled samples.
    """
    # Initialize model
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)

    # Generate out-of-fold predictions: each sample is scored by a model
    # trained on folds that did not contain it
    probs = cross_val_predict(
        model, X, y,
        cv=5,
        method='predict_proba',
        n_jobs=-1
    )

    # Get predicted classes and confidence scores
    pred_classes = np.argmax(probs, axis=1)
    confidences = np.max(probs, axis=1)

    # Identify samples whose confident prediction disagrees with the given label
    mislabeled_mask = (pred_classes != y) & (confidences >= confidence_threshold)
    mislabeled_indices = np.where(mislabeled_mask)[0]

    # Calculate margins (confidence difference between top two predictions)
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]  # Sort descending
    margins = sorted_probs[:, 0] - sorted_probs[:, 1]

    # Combine indices, confidences, and margins
    candidates = list(zip(
        mislabeled_indices,
        confidences[mislabeled_mask],
        margins[mislabeled_mask]
    ))

    # Sort by confidence (descending), then margin (descending)
    candidates.sort(key=lambda x: (-x[1], -x[2]))

    # Return top k indices
    return [idx for idx, _, _ in candidates[:k]]


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate synthetic data with 5% label noise
    X, y = make_classification(
        n_samples=1000,
        n_features=20,
        n_classes=3,
        n_informative=10,
        flip_y=0.05,
        random_state=42
    )

    # Find top 10 fake labels
    fake_indices = find_fake_labels(X, y, k=10)
    print("Top 10 likely mislabeled indices:", fake_indices)
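A useful sanity check is to inject label flips at known positions and count how many of the flagged candidates are genuine. The sketch below inlines the same out-of-fold logic; the sample sizes, flip count, and looser 0.6 threshold are arbitrary demo choices, not part of the solution above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Clean synthetic data, then flip 30 labels at positions we record
X, y_clean = make_classification(n_samples=600, n_features=20, n_classes=3,
                                 n_informative=10, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_clean), size=30, replace=False)
y_noisy = y_clean.copy()
for i in flipped:
    y_noisy[i] = (y_noisy[i] + rng.integers(1, 3)) % 3  # always a different class

# Same out-of-fold procedure as find_fake_labels, inlined for the demo
probs = cross_val_predict(RandomForestClassifier(n_estimators=100, random_state=0),
                          X, y_noisy, cv=5, method="predict_proba", n_jobs=-1)
pred = probs.argmax(axis=1)
conf = probs.max(axis=1)
mask = (pred != y_noisy) & (conf >= 0.6)

# Top 20 flagged indices by confidence, then overlap with the true flips
flagged = np.where(mask)[0][np.argsort(-conf[mask])][:20]
hits = len(set(flagged) & set(flipped))
print(f"{hits} of the top {len(flagged)} flagged samples are genuine flips")
```

On separable synthetic data like this, most high-confidence disagreements should coincide with the injected flips.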

Key Steps Explained

  1. Model Training with Cross-Validation:

    • Use cross_val_predict with method='predict_proba' to generate out-of-fold predictions. This avoids overfitting by training on folds not containing the sample during prediction.
  2. Identify Mislabeled Candidates:

    • Compare predicted classes (pred_classes) with true labels (y). Discrepancies indicate potential mislabeling.
    • Apply a confidence_threshold (default: 0.9) to ensure the model is confident in its prediction.
  3. Confidence and Margin Calculation:

    • Confidence: Probability of the predicted class (np.max(probs, axis=1)).
    • Margin: Difference between the top two predicted probabilities. A high margin means the model clearly prefers its top prediction over the runner-up, which filters out ambiguous samples and reduces false positives.
  4. Rank and Return Candidates:

    • Sort candidates by confidence (descending) and margin (descending).
    • Return the top k indices of likely mislabeled samples.
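The ranking in steps 3-4 can be traced on a tiny hand-made probability matrix (the numbers below are invented for illustration, not model output):

```python
import numpy as np

y = np.array([0, 1, 2, 0])
probs = np.array([
    [0.97, 0.02, 0.01],  # agrees with label 0 -> not a candidate
    [0.95, 0.03, 0.02],  # labeled 1, predicted 0: high confidence and margin
    [0.50, 0.45, 0.05],  # labeled 2, predicted 0, but conf < 0.9 -> filtered out
    [0.05, 0.92, 0.03],  # labeled 0, predicted 1: strong candidate
])

pred = probs.argmax(axis=1)
conf = probs.max(axis=1)
top2 = np.sort(probs, axis=1)[:, ::-1]
margin = top2[:, 0] - top2[:, 1]

mask = (pred != y) & (conf >= 0.9)
candidates = sorted(zip(np.where(mask)[0], conf[mask], margin[mask]),
                    key=lambda t: (-t[1], -t[2]))
print([int(i) for i, _, _ in candidates])  # → [1, 3]
```

Row 2 disagrees with its label but is dropped by the threshold, so only rows 1 and 3 survive, ordered by confidence.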

Why This Approach Works

  • Out-of-Fold Predictions: Each sample is scored by a model that never saw it during training, so confident disagreements reflect the data distribution rather than memorized labels.
  • Confidence Threshold: Filters out low-confidence predictions, reducing false discoveries.
  • Margin Metric: Helps distinguish between ambiguous samples (low margin) and clear mislabelings (high margin).
  • Random Forest: Robust to noise and non-linear relationships, suitable for diverse datasets.
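The out-of-fold point can be checked directly: fitting and predicting on the same data lets the forest memorize the noisy labels, while out-of-fold predictions cannot. A small sketch (dataset parameters here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, flip_y=0.1, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

in_sample = model.fit(X, y).predict(X)              # memorizes the noise
out_of_fold = cross_val_predict(model, X, y, cv=5)  # each sample predicted unseen

print("in-sample disagreements:  ", (in_sample != y).sum())
print("out-of-fold disagreements:", (out_of_fold != y).sum())
```

The in-sample count is near zero (the forest reproduces whatever labels it was given, including wrong ones), so in-sample disagreement is useless for finding label errors; the out-of-fold count is where the candidates come from.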

Practical Considerations

  • Threshold Tuning: Adjust confidence_threshold based on dataset noise levels (e.g., 0.8 for noisy data).
  • Model Choice: Replace RandomForestClassifier with other models (e.g., LogisticRegression) if needed.
  • Scalability: For large datasets, reduce cv folds (e.g., cv=3), subsample the data, or switch to a lighter model; for streaming data, use an estimator that supports partial_fit.
  • Validation: Always review top candidates manually to confirm label errors.
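As one example of the Model Choice point, any estimator that exposes predict_proba can be dropped in. The pipeline below is an illustrative choice, not part of the original code; linear models generally benefit from feature scaling, so LogisticRegression is paired with StandardScaler:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, flip_y=0.05, random_state=42)

# A pipeline works anywhere an estimator does, including cross_val_predict;
# scaling is fit per-fold, so there is no leakage into the held-out samples
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
probs = cross_val_predict(model, X, y, cv=3, method="predict_proba", n_jobs=-1)
print(probs.shape)  # → (500, 3)
```

The resulting probs array slots directly into the confidence/margin logic of find_fake_labels.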

This method efficiently surfaces high-confidence mislabeled samples, enabling targeted data cleaning for improved model accuracy.

