Breast Cancer Wisconsin Classification with PyTorch | Machine Learning Project
Table Of Contents
- 1. Introduction & Problem Framing
- Quick Facts
- 2. Imports & Environment Setup
- 3. Load & Inspect the Dataset
- 4. Data Preparation & Train/Test Splits
- 5. Exploratory Data Analysis (EDA)
- 6. Evaluation Strategy & Metrics
- 7. Baseline Model: Logistic Regression
- Evaluation Metrics (Validation Set)
- Summary
- 8. Baseline Results & What They Suggest
- Confusion matrix (threshold = 0.5)
- Why the baseline can be this strong on this dataset
- Next steps
- 9. Interpreting the Baseline Model
- Summary
- 10. Error Analysis
- 11. Improved Model: Tree-Based Methods
- 12. Model Comparison
- 13. Optimization Experiments
- Summary
- Final Evaluation on Held-Out Test Set
- Calibration (Probability Quality)
- Limitations
- Confidence & Stability (Cross-Validation)
- What I’d Do Next in Production
End-to-End Machine Learning Workflow with Classical Models on Tabular Data
Portfolio goals: clarity, reasoning, tradeoffs, and clean experimentation over squeezing out maximum accuracy.
1. Introduction & Problem Framing
In this notebook, we build and evaluate models that classify breast tumors as benign or malignant using the Breast Cancer Wisconsin dataset.
Quick Facts
- Dataset: Breast Cancer Wisconsin (UCI)
- Task: Binary classification
- Models: Logistic Regression, HistGradientBoosting (optional small MLP)
- Frameworks: scikit-learn (PyTorch optional)
- Evaluation: Accuracy, F1, ROC-AUC
- Focus: Model comparison & feature scaling
Why this dataset?
- It's a classic, well-understood supervised learning problem
- It's tabular (unlike MNIST), so we can demonstrate tabular best practices: pipelines, leakage prevention, thresholding
Workflow philosophy:
- Start with simple baselines
- Use a consistent evaluation protocol
- Analyze mistakes and tradeoffs
2. Imports & Environment Setup
We'll import:
- Core numeric + plotting libraries
- scikit-learn modeling + evaluation utilities
- (Optional) PyTorch for a small MLP comparison
Note: For small tabular datasets, GPU acceleration is usually not the bottleneck.
import os
import math
import random
from dataclasses import dataclass
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    classification_report,
    roc_curve,
    precision_recall_curve,
    auc,
)
# Optional: PyTorch (only needed for the optional MLP comparison)
try:
    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader
    TORCH_AVAILABLE = True
except Exception:
    TORCH_AVAILABLE = False
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.grid"] = True
def get_torch_device():
    """Return best available torch device (CUDA > MPS > CPU)."""
    if not TORCH_AVAILABLE:
        return None
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Apple Silicon
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
device = get_torch_device()
print("Torch available:", TORCH_AVAILABLE)
print("Torch device:", device)
3. Load & Inspect the Dataset
We will:
- Load the dataset
- Convert it into a pandas DataFrame
- Inspect shape, columns, and class balance
Key questions:
- Are there missing values?
- Are classes imbalanced?
- Do features look like they need scaling?
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
target_names = list(data.target_names) # usually ['malignant', 'benign']
print("Target names:", target_names)
print("X shape:", X.shape)
print("y distribution:\n", y.value_counts(), "\n")
display(X.head())
display(X.describe().T.head(10))
print("Missing values (total):", int(X.isna().sum().sum()))
4. Data Preparation & Train/Test Splits
Important tabular best practice:
- Use stratified splits (preserve class ratio)
- Use pipelines so scaling is fit only on the training data
We'll create:
- Train set
- Validation set
- Test set (held out until the end)
# Split off the test set (20% of all rows); the remainder is train+val
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
# Split the remainder into train and val (25% of the remaining 80% = 20% of all rows)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=SEED, stratify=y_train
)
print("Train:", X_train.shape, " Val:", X_val.shape, " Test:", X_test.shape)
print("Train class balance:\n", y_train.value_counts(normalize=True))
print("Val class balance:\n", y_val.value_counts(normalize=True))
print("Test class balance:\n", y_test.value_counts(normalize=True))
All splits are stratified to preserve class balance and reduce evaluation variance, which is especially important for medical classification tasks.
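As a quick check on the two-step split, the arithmetic below works out the resulting set sizes. It is a sketch that assumes the scikit-learn copy of the dataset has 569 rows and that the held-out share of each split is rounded up; the authoritative numbers are the shapes printed by the cell above.

```python
import math

# Assumption: the scikit-learn copy of the dataset has 569 rows.
n_samples = 569

# First split: test_size=0.2 of all rows (held-out share rounded up).
n_test = math.ceil(0.2 * n_samples)
n_train_val = n_samples - n_test

# Second split: test_size=0.25 of the remaining rows becomes validation.
n_val = math.ceil(0.25 * n_train_val)
n_train = n_train_val - n_val

print("train / val / test:", n_train, n_val, n_test)
print("fractions:", [round(n / n_samples, 3) for n in (n_train, n_val, n_test)])
```

Under these assumptions the result is roughly a 60/20/20 split, which leaves just over a hundred rows each for validation and test. Metrics on sets that small move noticeably when a single prediction changes.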
5. Exploratory Data Analysis (EDA)
We'll do lightweight EDA to build intuition:
- Feature distributions
- Correlations
- A couple of class-conditional comparisons
Goal: understand the data just enough to make model choices feel justified.
# Correlation heatmap (matplotlib-only)
corr = X_train.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
plt.imshow(corr, aspect="auto")
plt.title("Feature Correlation Heatmap (Train)")
plt.colorbar()
plt.tight_layout()
plt.show()
# Simple distribution plot for a few features
feature_subset = ["mean radius", "mean texture", "mean perimeter", "mean area"]
X_plot = X_train[feature_subset].copy()
X_plot["target"] = y_train.values
for feat in feature_subset:
    plt.figure()
    plt.hist(X_plot.loc[X_plot["target"] == 0, feat], bins=30, alpha=0.6, label=target_names[0])
    plt.hist(X_plot.loc[X_plot["target"] == 1, feat], bins=30, alpha=0.6, label=target_names[1])
    plt.title(f"Distribution: {feat} (Train)")
    plt.legend()
    plt.tight_layout()
    plt.show()
6. Evaluation Strategy & Metrics
We'll report multiple metrics because accuracy alone can hide failure modes.
Metrics:
- Accuracy (easy baseline)
- ROC-AUC (ranking quality across thresholds)
- Precision / Recall / F1 (threshold-dependent)
- Confusion matrix (interpretable error counts)
Clinical context:
- In many medical settings, false negatives can be more costly than false positives.
- We'll later discuss threshold tuning as a principled way to control that tradeoff.
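To make the point about accuracy hiding failure modes concrete, here is a small dependency-free sketch. The labels are toy values invented for illustration (not taken from the dataset), and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # one malignant case is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec_m, rec_m, f1_m = binary_metrics(y_true, y_pred, positive=0)

print("accuracy:", accuracy)
print("malignant precision / recall / f1:", prec_m, round(rec_m, 3), round(f1_m, 3))
```

In this toy case accuracy is 90%, yet one in three malignant cases is missed. Per-class recall exposes that failure mode where a single accuracy figure does not.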
def plot_roc(y_true, y_proba, title="ROC Curve"):
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()
def plot_precision_recall(y_true, y_proba, title="Precision-Recall Curve"):
    precision, recall, _ = precision_recall_curve(y_true, y_proba)
    pr_auc = auc(recall, precision)
    plt.figure()
    plt.plot(recall, precision, label=f"AUC = {pr_auc:.3f}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()
def show_confusion_matrix(y_true, y_pred, title="Confusion Matrix"):
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
    disp.plot(values_format="d")
    plt.title(title)
    plt.tight_layout()
    plt.show()
7. Baseline Model: Logistic Regression
We start with Logistic Regression because:
- It's a strong baseline for tabular data
- It's fast and stable
- It's relatively interpretable (coefficients)
We also put scaling into a Pipeline to prevent leakage.
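To see why fitting the scaler on training data only matters, here is a small dependency-free sketch. The numbers are invented, and `fit_scaler` and `transform` are hypothetical stand-ins for what a standard scaler does (mean and population standard deviation), not scikit-learn calls.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn mean and population standard deviation from the given values only."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# Invented numbers: the validation values are deliberately shifted
train = [10.0, 12.0, 14.0, 16.0]
val = [30.0, 32.0]

# Correct: statistics come from the training data only
mu_train, sigma_train = fit_scaler(train)
val_scaled = transform(val, mu_train, sigma_train)

# Leaky: statistics computed on train + validation together
mu_leaky, sigma_leaky = fit_scaler(train + val)
val_scaled_leaky = transform(val, mu_leaky, sigma_leaky)

print("train-only stats:", mu_train, round(sigma_train, 3), [round(v, 2) for v in val_scaled])
print("leaky stats:     ", mu_leaky, round(sigma_leaky, 3), [round(v, 2) for v in val_scaled_leaky])
```

With leaky statistics the shifted validation values look far less unusual than they really are, because information about them has already entered the preprocessing. Putting the scaler inside the pipeline avoids this: the scaler is refit on the training portion each time the pipeline is fit, including inside cross-validation.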
logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        max_iter=5000,
        solver="lbfgs",
        random_state=SEED,
    )),
])
logreg.fit(X_train, y_train)
# Column 1 of predict_proba is P(class 1), i.e. P(benign) for this dataset
val_proba = logreg.predict_proba(X_val)[:, 1]
val_pred = (val_proba >= 0.5).astype(int)
Evaluation Metrics (Validation Set)
We evaluate models using multiple complementary metrics:
- Accuracy as a baseline sanity check
- ROC-AUC to measure ranking quality independent of threshold
- Precision / Recall / F1 to understand class-specific tradeoffs
- Confusion Matrix to inspect concrete error counts
Unless otherwise stated, metrics in this section are computed on the validation set.
print("Validation accuracy:", accuracy_score(y_val, val_pred))
print("Validation ROC-AUC:", roc_auc_score(y_val, val_proba))
print("\nClassification report (val):\n")
print(classification_report(y_val, val_pred, target_names=target_names))
show_confusion_matrix(y_val, val_pred, title="LogReg Confusion Matrix (Val, threshold=0.5)")
plot_roc(y_val, val_proba, title="LogReg ROC (Val)")
plot_precision_recall(y_val, val_proba, title="LogReg Precision-Recall (Val)")
Summary
At threshold=0.5, the baseline misses 1 malignant case on validation. In a medical screening context this error type is often more costly than a false positive, so later we’ll revisit threshold selection to trade precision for higher malignant recall.
8. Baseline Results & What They Suggest
The Logistic Regression baseline performs extremely well on the validation set, with ROC-AUC near 1.0. This suggests the model ranks malignant vs benign cases cleanly across thresholds (not just at a single decision cutoff).
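One way to read "ranks cleanly" is that ROC-AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. Below is a dependency-free sketch with invented scores; `pairwise_auc` is a hypothetical helper, not the scikit-learn implementation.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented scores: one positive (0.40) is ranked below one negative (0.60)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.20, 0.10]

auc_value = pairwise_auc(y_true, scores)
print("pairwise AUC:", round(auc_value, 3))  # 8 of 9 pairs are ordered correctly
```

Because this quantity depends only on the ordering of scores, it says nothing about which threshold to use. That is why the confusion matrix at a specific cutoff is inspected separately.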
Confusion matrix (threshold = 0.5)
At the default threshold of 0.5, the model makes very few mistakes. The most important error type to inspect is a false negative (malignant predicted as benign). In many screening or diagnostic-support contexts, false negatives are more costly than false positives, so later we revisit threshold selection to trade precision for higher malignant recall.
Why the baseline can be this strong on this dataset
The Breast Cancer Wisconsin dataset is known to be relatively “well-behaved”:
- Features are engineered summary statistics with strong signal.
- The dataset is small and clean with low missingness.
- Many features are highly predictive even for linear models.
This makes strong baseline performance plausible — but it also means improvements from more complex models may be incremental.
Despite strong aggregate performance, these metrics alone are insufficient for decision-making in a medical context. A small number of false negatives can be unacceptable, and high ROC-AUC does not guarantee an appropriate operating point.
Therefore, in the following sections we:
- Interpret which features drive predictions,
- Examine misclassifications and borderline cases,
- Explicitly tune the decision threshold to reflect domain priorities.
Next steps
Even with excellent aggregate metrics, it’s still important to:
- Interpret what the model learned (feature effects / importance).
- Analyze mistakes and borderline cases.
- Select an operating threshold aligned with the domain’s priorities (e.g., minimizing false negatives).
9. Interpreting the Baseline Model
We'll inspect the Logistic Regression coefficients.
Caveat:
- Coefficients are easiest to interpret when features are standardized.
- High correlation among features can make coefficients unstable.
Note: In this dataset, class 0 corresponds to malignant and class 1 to benign.
Positive coefficients increase the log-odds of predicting benign, while negative coefficients push predictions toward malignant.
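As a worked example of the sign convention, the sketch below shows how a single coefficient changes the odds of benign. The intercept and coefficient are hypothetical values chosen for illustration, not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration (not from the fitted model)
intercept = 0.5
coef = -1.2  # negative: pushes predictions toward class 0 (malignant)

# Effect of raising one standardized feature from 0 to 1 (one standard deviation),
# holding every other feature fixed
p_benign_before = sigmoid(intercept + coef * 0.0)
p_benign_after = sigmoid(intercept + coef * 1.0)
odds_ratio = math.exp(coef)

print("P(benign) before:", round(p_benign_before, 3))
print("P(benign) after: ", round(p_benign_after, 3))
print("odds multiplier: ", round(odds_ratio, 3))
```

The odds of benign are multiplied by exp(coef) for each one-standard-deviation increase. A coefficient of -1.2 therefore cuts the odds to about 30% of their previous value, whatever the starting probability.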
# Extract coefficients from the pipeline
coef = logreg.named_steps["model"].coef_.ravel()
coef_df = pd.DataFrame({
"feature": X_train.columns,
"coef": coef,
"abs_coef": np.abs(coef),
}).sort_values("abs_coef", ascending=False)
display(coef_df.head(15))
plt.figure()
top = coef_df.head(15).iloc[::-1]
plt.barh(top["feature"], top["coef"])
plt.title("Top Logistic Regression Coefficients (by |coef|)")
plt.tight_layout()
plt.show()
Summary
Because many features are correlated, coefficient signs and magnitudes should be read as conditional effects (holding the other standardized features constant), not as standalone importance.
As noted above, class 0 is malignant and class 1 is benign, so a positive coefficient raises the log-odds of benign and a negative coefficient pushes predictions toward malignant.
10. Error Analysis
We'll look at:
- Most confident wrong predictions
- Borderline cases near the decision threshold
This helps answer:
- Are errors random noise or systematic?
- Is there a threshold that better matches our priorities?
# Most confident wrong predictions on validation set
val_df = X_val.copy()
val_df["y_true"] = y_val.values
val_df["y_proba"] = val_proba
val_df["y_pred"] = val_pred
val_df["correct"] = (val_df["y_true"] == val_df["y_pred"])
wrong = val_df[~val_df["correct"]].copy()
wrong["confidence"] = np.where(wrong["y_pred"] == 1, wrong["y_proba"], 1 - wrong["y_proba"])
wrong = wrong.sort_values("confidence", ascending=False)
display(wrong[["y_true", "y_pred", "y_proba", "confidence"]].head(10))
# Borderline cases near 0.5
border = val_df.copy()
border["dist_from_0.5"] = np.abs(border["y_proba"] - 0.5)
border = border.sort_values("dist_from_0.5", ascending=True)
display(border[["y_true", "y_pred", "y_proba", "dist_from_0.5"]].head(10))
While the absolute number of errors is small, inspecting these cases helps validate that mistakes are not driven by obvious data issues or leakage.
11. Improved Model: Tree-Based Methods
Next we try a more expressive model that can capture nonlinearities and feature interactions.
Candidates:
- RandomForestClassifier
- GradientBoostingClassifier
- HistGradientBoostingClassifier (often strong, fast)
We'll keep tuning modest and focus on clean comparison.
We initially use a default decision threshold of 0.5, and later revisit this choice during threshold optimization.
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=SEED)
hgb.fit(X_train, y_train)
val_proba_hgb = hgb.predict_proba(X_val)[:, 1]
val_pred_hgb = (val_proba_hgb >= 0.5).astype(int)
print("HGB Validation accuracy:", accuracy_score(y_val, val_pred_hgb))
print("HGB Validation ROC-AUC:", roc_auc_score(y_val, val_proba_hgb))
print("\nClassification report (val):\n")
print(classification_report(y_val, val_pred_hgb, target_names=target_names))
show_confusion_matrix(y_val, val_pred_hgb, title="HGB Confusion Matrix (Val, threshold=0.5)")
plot_roc(y_val, val_proba_hgb, title="HGB ROC (Val)")
plot_precision_recall(y_val, val_proba_hgb, title="HGB Precision-Recall (Val)")
12. Model Comparison
Compare Logistic Regression vs Tree-based model:
- ROC-AUC, PR-AUC
- Confusion matrices
- Error analysis patterns
Discuss tradeoffs:
- Interpretability vs flexibility
- Stability / sensitivity to preprocessing
comparison = pd.DataFrame([
    {
        "model": "Logistic Regression",
        "val_accuracy": accuracy_score(y_val, val_pred),
        "val_roc_auc": roc_auc_score(y_val, val_proba),
    },
    {
        "model": "HistGradientBoosting",
        "val_accuracy": accuracy_score(y_val, val_pred_hgb),
        "val_roc_auc": roc_auc_score(y_val, val_proba_hgb),
    },
]).sort_values("val_roc_auc", ascending=False)
display(comparison)
Based on validation performance and qualitative considerations, we select Logistic Regression as the final model for test evaluation.
While the tree-based model captures nonlinear interactions, it does not outperform the linear baseline on this dataset. Given the strong performance, interpretability, and stability of Logistic Regression, increased model complexity does not appear justified here.
We therefore proceed with the tuned Logistic Regression pipeline and a validation-selected decision threshold for final evaluation.
13. Optimization Experiments
We’ll do small, principled optimization experiments:
- Cross-validated hyperparameter search
- Decision-threshold tuning
Key idea:
Optimization should be measured, justified, and repeatable.
from sklearn.model_selection import GridSearchCV
# Example: regularization sweep for Logistic Regression
param_grid = {
    "model__C": [0.01, 0.1, 1.0, 10.0, 100.0]
}
logreg_grid = GridSearchCV(
    estimator=logreg,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED),
    n_jobs=-1,
)
logreg_grid.fit(X_train, y_train)
print("Best params:", logreg_grid.best_params_)
print("Best CV ROC-AUC:", logreg_grid.best_score_)
best_logreg = logreg_grid.best_estimator_
val_proba_best = best_logreg.predict_proba(X_val)[:, 1]
# Threshold tuning: sweep thresholds on the validation set
# Tune threshold using P(malignant) directly (clearer for clinical framing)
# Identify index for malignant class in predict_proba columns
# In sklearn, classes_ is sorted: typically [0, 1] => [malignant, benign]
classes = best_logreg.named_steps["model"].classes_
idx_malignant = int(np.where(classes == 0)[0][0])  # class 0 = malignant
proba_malignant = best_logreg.predict_proba(X_val)[:, idx_malignant]
y_true_malignant = (y_val.values == 0).astype(int)  # 1 means malignant (custom)
thresholds = np.linspace(0.05, 0.95, 19)
rows = []
for thr in thresholds:
    # predict malignant if P(malignant) >= thr
    pred_malignant = (proba_malignant >= thr).astype(int)  # 1 means malignant (custom)
    # compute precision/recall for malignant as "positive"
    tp = np.sum((pred_malignant == 1) & (y_true_malignant == 1))
    fp = np.sum((pred_malignant == 1) & (y_true_malignant == 0))
    fn = np.sum((pred_malignant == 0) & (y_true_malignant == 1))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    rows.append({"thr": thr, "precision_malig": precision, "recall_malig": recall, "f1_malig": f1})
df_thr = pd.DataFrame(rows)
# Policy: pick the threshold maximizing malignant recall subject to precision >= min_precision
min_precision = 0.90
candidates = df_thr[df_thr["precision_malig"] >= min_precision]
assert not candidates.empty, "No threshold satisfies the precision constraint."
best_row = candidates.sort_values("recall_malig", ascending=False).head(1)
display(df_thr)
display(best_row)
# Freeze the threshold choice derived from validation (no manual editing needed)
thr_star = float(best_row["thr"].iloc[0])
print(f"Chosen threshold policy: maximize recall_malig subject to precision_malig >= {min_precision:.2f}")
print(f"Chosen thr_star (on P(malignant)): {thr_star:.2f}")
from sklearn.inspection import permutation_importance
perm = permutation_importance(
    best_logreg, X_val, y_val,
    n_repeats=20,
    random_state=SEED,
    n_jobs=-1
)
imp = pd.DataFrame({
    "feature": X_val.columns,
    "importance_mean": perm.importances_mean,
    "importance_std": perm.importances_std
}).sort_values("importance_mean", ascending=False)
plt.figure()
top = imp.head(15).iloc[::-1]
plt.barh(top["feature"], top["importance_mean"])
plt.title("Permutation Importance (Validation) — Logistic Regression Pipeline")
plt.tight_layout()
plt.show()
Summary
Rather than fixing the decision threshold at 0.5, we explicitly tune it using validation data.
Treating malignant as the positive class, we examine the precision–recall tradeoff across thresholds.
Lower thresholds reduce false negatives (higher recall) at the cost of more false positives, while higher thresholds do the opposite.
For this analysis, we select a threshold of 0.25 on P(malignant), which achieves ~98% malignant recall while maintaining >90% precision on the validation set.
This operating point reflects a screening-oriented objective where missing malignant cases is more costly than over-flagging benign ones.
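The selection policy (maximize malignant recall subject to a precision floor) can be sketched without any libraries. The probabilities and labels below are toy values invented for illustration, and `sweep` and `pick_threshold` are hypothetical helpers. The sketch shows the mechanics only and does not reproduce the notebook's numbers.

```python
def sweep(y_true_malig, p_malig, thresholds):
    """Return (threshold, precision, recall) rows with malignant as the positive class."""
    rows = []
    for thr in thresholds:
        pred = [1 if p >= thr else 0 for p in p_malig]
        tp = sum(1 for t, q in zip(y_true_malig, pred) if t == 1 and q == 1)
        fp = sum(1 for t, q in zip(y_true_malig, pred) if t == 0 and q == 1)
        fn = sum(1 for t, q in zip(y_true_malig, pred) if t == 1 and q == 0)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((thr, precision, recall))
    return rows

def pick_threshold(rows, min_precision):
    """Highest-recall threshold among those meeting the precision floor, else None."""
    candidates = [r for r in rows if r[1] >= min_precision]
    if not candidates:
        return None
    # Break recall ties in favor of higher precision
    return max(candidates, key=lambda r: (r[2], r[1]))[0]

# Toy data (hypothetical): 1 = malignant, 0 = benign
y_true_malig = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_malig = [0.90, 0.80, 0.60, 0.30, 0.35, 0.20, 0.10, 0.10, 0.05, 0.05]

rows = sweep(y_true_malig, p_malig, thresholds=[0.3, 0.5, 0.7])
for thr, precision, recall in rows:
    print(f"thr={thr:.1f} precision={precision:.2f} recall={recall:.2f}")
print("chosen with precision >= 0.90:", pick_threshold(rows, 0.90))
print("chosen with precision >= 0.80:", pick_threshold(rows, 0.80))
```

Relaxing the precision floor lets the policy move to a lower threshold and recover the missed malignant case, at the price of one extra false positive. The explicit `None` return is a design choice for this sketch: it makes the "no threshold qualifies" case visible instead of failing later on an empty selection.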
plt.figure()
plt.plot(df_thr["thr"], df_thr["precision_malig"], label="Precision (malignant)")
plt.plot(df_thr["thr"], df_thr["recall_malig"], label="Recall (malignant)")
plt.xlabel("Decision threshold (P(malignant))")
plt.ylabel("Metric value")
plt.title("Precision–Recall Tradeoff for Malignant Class (Validation)")
plt.legend()
plt.tight_layout()
plt.show()
Final Evaluation on Held-Out Test Set
At this point, we freeze our modeling decisions (model family + hyperparameters + threshold) using training/validation only.
We now evaluate exactly once on the held-out test set to estimate generalization performance.
Final choice: We use the tuned Logistic Regression model as the final model because it achieves near-ceiling performance with strong interpretability; the more complex model did not provide a meaningful improvement under the chosen evaluation criteria.
# --- Final test-set evaluation (uses frozen model + frozen threshold) ---
# Sanity checks (fail fast if something is missing)
assert "best_logreg" in globals(), "best_logreg not found — did GridSearchCV run?"
assert "thr_star" in globals(), "thr_star not found — did threshold selection run?"
assert "X_test" in globals() and "y_test" in globals(), "Test split not found."
assert "target_names" in globals(), "target_names not found."
final_model = best_logreg
# Compute P(malignant) on test (sklearn breast cancer convention: 0=malignant, 1=benign)
classes = final_model.named_steps["model"].classes_
idx_malignant = int(np.where(classes == 0)[0][0])
proba_malig_test = final_model.predict_proba(X_test)[:, idx_malignant]
# Predict malignant if P(malignant) >= thr_star
y_true_malig_test = (y_test.values == 0).astype(int) # 1 means malignant (custom)
y_pred_malig_test = (proba_malig_test >= thr_star).astype(int) # 1 means malignant (custom)
# Convert back to original labels for reporting: 0=malignant, 1=benign
y_pred_test = np.where(y_pred_malig_test == 1, 0, 1)
print("FINAL MODEL:", "best_logreg (tuned Logistic Regression pipeline)")
print(f"FINAL THRESHOLD POLICY: maximize recall_malig subject to precision_malig >= {min_precision:.2f} (on validation)")
print(f"thr_star (threshold on P(malignant)) = {thr_star:.2f}\n")
print("Test ROC-AUC (malignant as positive):", roc_auc_score(y_true_malig_test, proba_malig_test))
print("\nClassification report (test):\n")
print(classification_report(y_test, y_pred_test, target_names=target_names))
show_confusion_matrix(y_test, y_pred_test, title=f"Final Model Confusion Matrix (Test, thr={thr_star:.2f})")
if "plot_roc" in globals():
    plot_roc(y_true_malig_test, proba_malig_test, title="Final ROC (Test) — Malignant Positive")
if "plot_precision_recall" in globals():
    plot_precision_recall(y_true_malig_test, proba_malig_test, title="Final PR (Test) — Malignant Positive")
Calibration (Probability Quality)
ROC-AUC measures ranking quality, but it does not guarantee that predicted probabilities are well-calibrated.
In many decision-making contexts (e.g., threshold policies, risk scoring), we want predicted probabilities to reflect empirical frequencies. Here we evaluate calibration on the validation set using reliability curves and Brier score, and apply post-hoc calibration.
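For intuition about the two calibration tools, here is a dependency-free sketch of the Brier score and a reliability table. It uses toy values, and it bins by equal width for simplicity, whereas the notebook code uses quantile bins. `brier_score` and `reliability_bins` are hypothetical helpers.

```python
def brier_score(y_true, p):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)

def reliability_bins(y_true, p, n_bins=2):
    """Equal-width bins: (mean predicted probability, observed positive fraction, count)."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y_true, p):
        idx = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((yi, pi))
    table = []
    for b in bins:
        if b:  # skip empty bins
            mean_pred = sum(pi for _, pi in b) / len(b)
            frac_pos = sum(yi for yi, _ in b) / len(b)
            table.append((mean_pred, frac_pos, len(b)))
    return table

# Toy values (hypothetical): 1 = malignant
y_true = [1, 1, 0, 0]
p = [0.9, 0.6, 0.4, 0.1]

score = brier_score(y_true, p)
table = reliability_bins(y_true, p, n_bins=2)
print("Brier score:", round(score, 4))
for mean_pred, frac_pos, count in table:
    print(f"mean predicted={mean_pred:.2f} observed fraction={frac_pos:.2f} n={count}")
```

In this toy case the ranking is perfect, yet the bins show predictions of about 0.25 and 0.75 where the observed fractions are 0 and 1. The scores are under-confident even though ROC-AUC would be 1.0, which is the gap between ranking quality and probability quality.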
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
# --- Helper: get P(malignant) from any fitted sklearn estimator with predict_proba ---
def proba_malignant(estimator, X):
    # Works for pipelines too
    if hasattr(estimator, "named_steps") and "model" in estimator.named_steps:
        classes = estimator.named_steps["model"].classes_
    else:
        classes = estimator.classes_
    idx_malig = int(np.where(classes == 0)[0][0])  # class 0 = malignant in this dataset
    return estimator.predict_proba(X)[:, idx_malig]
# Binary targets for calibration metrics (1 = malignant)
y_val_malig = (y_val.values == 0).astype(int)
# --- BEFORE calibration ---
val_proba_malig_raw = proba_malignant(best_logreg, X_val)
brier_raw = brier_score_loss(y_val_malig, val_proba_malig_raw)
# Reliability curve
frac_pos_raw, mean_pred_raw = calibration_curve(
    y_val_malig, val_proba_malig_raw, n_bins=10, strategy="quantile"
)
print(f"Brier score (val) BEFORE calibration: {brier_raw:.4f}")
plt.figure()
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.plot(mean_pred_raw, frac_pos_raw, marker="o", label="best_logreg (raw)")
plt.xlabel("Mean predicted probability (P(malignant))")
plt.ylabel("Fraction of positives (malignant)")
plt.title("Calibration Curve (Validation) — Before Calibration")
plt.legend()
plt.tight_layout()
plt.show()
# --- AFTER calibration ---
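# We calibrate the already-fit model using the validation set as a held-out calibration set.
# (Note: this consumes the validation set for calibration; avoid re-tuning thresholds on this same data afterwards.)
# The validation Brier scores below are computed on the same data the calibrator was fit on, so they are optimistic.
# Recent scikit-learn releases deprecate cv="prefit" in favor of wrapping the fitted model in
# sklearn.frozen.FrozenEstimator; adjust this call if your installed version warns or errors.
cal_sigmoid = CalibratedClassifierCV(best_logreg, method="sigmoid", cv="prefit")
cal_sigmoid.fit(X_val, y_val)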
val_proba_malig_sig = proba_malignant(cal_sigmoid, X_val)
brier_sig = brier_score_loss(y_val_malig, val_proba_malig_sig)
frac_pos_sig, mean_pred_sig = calibration_curve(
    y_val_malig, val_proba_malig_sig, n_bins=10, strategy="quantile"
)
print(f"Brier score (val) AFTER calibration (sigmoid): {brier_sig:.4f}")
plt.figure()
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.plot(mean_pred_raw, frac_pos_raw, marker="o", label="raw")
plt.plot(mean_pred_sig, frac_pos_sig, marker="o", label="sigmoid calibrated")
plt.xlabel("Mean predicted probability (P(malignant))")
plt.ylabel("Fraction of positives (malignant)")
plt.title("Calibration Curve (Validation) — Raw vs Calibrated")
plt.legend()
plt.tight_layout()
plt.show()
# Optional: try isotonic (can overfit on small data; compare Brier)
cal_iso = CalibratedClassifierCV(best_logreg, method="isotonic", cv="prefit")
cal_iso.fit(X_val, y_val)
val_proba_malig_iso = proba_malignant(cal_iso, X_val)
brier_iso = brier_score_loss(y_val_malig, val_proba_malig_iso)
print(f"Brier score (val) AFTER calibration (isotonic): {brier_iso:.4f}")
# Pick a calibrated model (simple rule: lower Brier on validation)
calibrated_model = cal_sigmoid if brier_sig <= brier_iso else cal_iso
print("Chosen calibrated model:", "sigmoid" if calibrated_model is cal_sigmoid else "isotonic")
y_test_malig = (y_test.values == 0).astype(int)
test_proba_raw = proba_malignant(best_logreg, X_test)
test_proba_cal = proba_malignant(calibrated_model, X_test)
print("Brier score (test) raw: ", brier_score_loss(y_test_malig, test_proba_raw))
print("Brier score (test) calibrated:", brier_score_loss(y_test_malig, test_proba_cal))
frac_pos_raw_t, mean_pred_raw_t = calibration_curve(y_test_malig, test_proba_raw, n_bins=10, strategy="quantile")
frac_pos_cal_t, mean_pred_cal_t = calibration_curve(y_test_malig, test_proba_cal, n_bins=10, strategy="quantile")
plt.figure()
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.plot(mean_pred_raw_t, frac_pos_raw_t, marker="o", label="raw")
plt.plot(mean_pred_cal_t, frac_pos_cal_t, marker="o", label="calibrated")
plt.xlabel("Mean predicted probability (P(malignant))")
plt.ylabel("Fraction of positives (malignant)")
plt.title("Calibration Curve (Test) — Raw vs Calibrated")
plt.legend()
plt.tight_layout()
plt.show()
Limitations
This notebook is intentionally focused on clarity and disciplined evaluation rather than claiming real-world clinical readiness. Several limitations are worth noting:
- Dataset size and curation: The Breast Cancer Wisconsin dataset is small, clean, and well-curated. Many features are engineered summary statistics with strong signal, which makes the problem easier than many real-world medical datasets.
- Feature availability: All features are numeric and preprocessed; this avoids challenges common in practice such as missing values, noisy measurements, or data collected across heterogeneous devices or institutions.
- Threshold policy context: The chosen decision threshold reflects a demonstration policy (maximize malignant recall subject to precision ≥ 0.90). In real deployments, thresholds should be selected using explicit cost, risk, and stakeholder considerations.
- No external validation: Performance is reported on a held-out test split from the same dataset distribution. Generalization to new populations or institutions is not assessed.
These limitations do not invalidate the analysis, but they do constrain how results should be interpreted.
Confidence & Stability (Cross-Validation)
Because the dataset is relatively small, single train/validation/test splits can introduce variance. To assess the stability of the final model, we evaluate performance using stratified K-fold cross-validation on the training data.
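To illustrate what "stratified" means for the folds, here is a dependency-free sketch that deals the indices of each class round-robin into folds. It is a simplification: it does not shuffle, unlike the `shuffle=True` setting used in this notebook. `stratified_folds` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each index to a fold, dealing indices round-robin within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)
    return folds

# Toy labels (hypothetical): 10 malignant (0) and 20 benign (1)
labels = [0] * 10 + [1] * 20
folds = stratified_folds(labels, n_splits=5)

for k, fold in enumerate(folds):
    n_malig = sum(1 for i in fold if labels[i] == 0)
    print(f"fold {k}: size={len(fold)} malignant={n_malig} benign={len(fold) - n_malig}")
```

Every fold keeps the same 1:2 class ratio as the full toy set, so per-fold metrics are compared on similar class mixes. That makes the fold-to-fold standard deviation easier to interpret.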
from sklearn.model_selection import StratifiedKFold, cross_validate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
# Note: the string scorers "precision", "recall", and "f1" treat label 1 (benign) as the
# positive class, unlike the malignant-as-positive framing used for threshold tuning.
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}
cv_results = cross_validate(
    best_logreg,
    X_train,
    y_train,
    cv=cv,
    scoring=scoring,
    n_jobs=-1,
)
cv_summary = pd.DataFrame({
    metric: [
        cv_results[f"test_{metric}"].mean(),
        cv_results[f"test_{metric}"].std(),
    ]
    for metric in scoring
}, index=["mean", "std"]).T
display(cv_summary)
The relatively small standard deviations suggest that the model’s performance is stable across folds, and that reported results are not driven by a single favorable split.
What I’d Do Next in Production
If this model were part of a real system, the next steps would focus less on model complexity and more on robustness, alignment, and monitoring:
- External validation: Evaluate performance on data from a different source or institution to assess distribution shift and generalization.
- Explicit cost modeling: Work with stakeholders to define the relative costs of false negatives vs false positives and choose thresholds accordingly.
- Calibration monitoring: Track probability calibration over time, especially if class prevalence or data collection processes change.
- Data quality checks: Add checks for missing values, out-of-range inputs, and feature drift.
- Human-in-the-loop workflows: Integrate predictions as decision support rather than automation, particularly for high-risk cases.
- Periodic re-training: Establish criteria for retraining or recalibration as new labeled data becomes available.
In many real-world settings, these system-level considerations have more impact than incremental model improvements.