ML Debugging Interview Questions: How Data Scientists Should Answer Them

CrackJobs Team · 5/3/2026 · 5 min read


Why ML debugging questions are different

Most data science interview prep focuses on theory: explain gradient descent, derive the bias-variance trade-off, describe how XGBoost handles missing values. ML debugging questions are different. They're scenario-based, open-ended, and test whether you can reason about a model that's misbehaving in a specific way.

The failure mode for most candidates is answering debugging questions with theory. "This could be overfitting, or it could be underfitting, or it could be a data issue" is not a debugging answer — it's a list of possibilities with no structure. Interviewers are looking for someone who would know what to do at 9pm when a production model starts degrading.

The four categories of ML debugging questions

1. Model performance degradation

Example: "Your model's AUC was 0.87 in validation, but 0.71 in production. What do you do?"

This is usually a train–serve skew or drift problem. Structure your answer in this order:

  • Data: Is the production data distribution different from training? Check feature distributions in production vs. training set. Focus on: categorical feature value distributions, numerical feature ranges, missing value rates, and class balance.
  • Pipeline: Is the feature engineering applied identically in training and serving? A single preprocessing step that runs in training but not serving (or vice versa) is enough to collapse model performance.
  • Label drift: Has the definition of the target variable changed? This is especially common in fraud detection and content moderation where labelling policies evolve.
  • Time: How long has the model been in production? Models degrade over time as the world changes. If the model is 18 months old, retraining on recent data is the first experiment.
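The first check above — comparing feature distributions between training and production — can be sketched with a population stability index (PSI), a standard drift metric. A minimal version, assuming you have raw samples of one numerical feature from each environment (the thresholds and toy data below are illustrative):

```python
import numpy as np

def population_stability_index(train, prod, bins=10):
    """Compare a feature's training vs. production distribution.

    Rule of thumb: PSI < 0.1 is usually read as stable, 0.1-0.25 as
    moderate drift, and > 0.25 as drift worth investigating.
    """
    # Bin edges come from the training distribution so both samples
    # are evaluated on the same grid; production values outside the
    # training range are clipped into the outer bins.
    edges = np.percentile(train, np.linspace(0, 100, bins + 1))
    train_frac = np.histogram(np.clip(train, edges[0], edges[-1]), bins=edges)[0] / len(train)
    prod_frac = np.histogram(np.clip(prod, edges[0], edges[-1]), bins=edges)[0] / len(prod)

    # Avoid log(0) in empty bins.
    train_frac = np.clip(train_frac, 1e-6, None)
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
stable = rng.normal(0, 1, 10_000)     # same distribution as training
shifted = rng.normal(1.0, 1, 10_000)  # mean shift in production

print(population_stability_index(train, stable))   # small: no drift
print(population_stability_index(train, shifted))  # large: flags drift
```

Running this per feature and ranking by PSI is a concrete first experiment to name in the interview: it turns "check the distributions" into a specific, comparable number.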

2. Unexpected model behaviour on specific inputs

Example: "Your recommendation model works well globally, but performs poorly for new users. Why, and what do you do?"

This is a cold-start problem. The debugger's approach:

  • Confirm the scope: is "new users" defined as 0 interactions, fewer than 5, or something else? The threshold matters for your fix.
  • Identify what features the model is relying on for established users (usually interaction history) and what it falls back to for new users (usually demographic or contextual signals).
  • The fix is typically a separate model or rule-based fallback for the cold-start segment, not patching the main model.
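The routing idea in the last bullet can be sketched in a few lines. All names here (the threshold, the model callables) are hypothetical stand-ins — in practice the threshold should come from where the main model's quality measurably drops off:

```python
COLD_START_THRESHOLD = 5  # tune to where personalized quality degrades

def recommend(user_id, interaction_counts, personalized_model, fallback_model, k=10):
    """Route cold-start users to a fallback instead of the main model."""
    n_interactions = interaction_counts.get(user_id, 0)
    if n_interactions < COLD_START_THRESHOLD:
        # No usable history: serve a popularity or demographic/contextual
        # ranker rather than noisy personalization.
        return fallback_model(k)
    return personalized_model(user_id, k)

# Toy stand-ins for the two rankers.
popular_items = ["item_a", "item_b", "item_c"]
fallback = lambda k: popular_items[:k]
personal = lambda user_id, k: [f"personal_{i}" for i in range(k)]

counts = {"alice": 42, "bob": 1}
print(recommend("alice", counts, personal, fallback, k=2))  # personalized
print(recommend("bob", counts, personal, fallback, k=2))    # fallback
print(recommend("carol", counts, personal, fallback, k=2))  # unseen user -> fallback
```

Note that an unseen user (no entry in the interaction counts) falls through to the fallback by default — that edge case is exactly the kind of detail worth saying out loud in an interview.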

3. Training instability

Example: "Your neural network loss is not converging — it oscillates and never stabilises. What do you check?"

Go in this order:

  • Learning rate: Too high causes oscillation; too low causes slow convergence or getting stuck. Try reducing by 10x and observe whether the loss curve stabilises.
  • Batch size: Very small batches create noisy gradient estimates. Increase batch size and check if training becomes more stable.
  • Data normalisation: Unnormalised input features with very different scales cause gradient instability. Confirm all inputs are normalised.
  • Gradient clipping: For RNNs and deep networks, exploding gradients are a common cause of oscillation. Check gradient norms during training.
  • Loss function: Is the loss function appropriate for the task? A classification loss applied to a regression problem will cause instability.
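The gradient-clipping check can be made concrete with a small NumPy sketch of clip-by-global-norm (the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_` or TensorFlow's `tf.clip_by_global_norm`). Logging the pre-clip norm every step is the diagnostic: a norm that spikes by orders of magnitude points at exploding gradients.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm. Returns (clipped_grads, pre_clip_norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, float(global_norm)
    scale = max_norm / global_norm
    return [g * scale for g in grads], float(global_norm)

healthy = [np.full(4, 0.1)]      # small, stable gradient
exploding = [np.full(4, 100.0)]  # the kind of spike that causes oscillation

_, norm_h = clip_by_global_norm(healthy, max_norm=1.0)
clipped, norm_e = clip_by_global_norm(exploding, max_norm=1.0)
print(norm_h, norm_e)  # the logged norms tell you which run is in trouble
```

The same "log it, then clamp it" structure applies to the learning-rate check: reduce by 10x, rerun, and compare the loss curves rather than guessing.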

4. Overfitting and underfitting

Example: "Your model achieves 98% training accuracy but 61% validation accuracy. Walk me through your diagnosis and fix."

This is classic overfitting. The strong answer goes beyond naming the problem:

  • Diagnose severity: A 37-point train/val gap is severe, not marginal. This suggests either extremely few training samples, a very complex model, or data leakage.
  • Check for data leakage first: Before tuning regularisation, check whether any feature was computed with knowledge of the target variable. Leakage produces 98%+ training accuracy in models that have no real signal.
  • If no leakage — regularise: Add L1/L2 regularisation, dropout (if neural net), or reduce model complexity. For tree-based models, tune max_depth and min_samples_leaf.
  • Add data: If regularisation isn't enough, the model needs more training examples. Collect more data or use data augmentation.
  • Cross-validate: Confirm the gap persists across multiple folds — a single train/val split can produce misleading results.
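The leakage check above can be sketched as a correlation scan over features. The feature names and the 0.95 threshold here are illustrative, and near-perfect correlation with the target is a red flag to investigate, not proof of leakage on its own:

```python
import numpy as np

def leakage_suspects(X, y, feature_names, threshold=0.95):
    """Flag features whose absolute correlation with the target is
    implausibly high -- a common signature of leakage."""
    suspects = []
    for j, name in enumerate(feature_names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            suspects.append((name, round(float(r), 3)))
    return suspects

rng = np.random.default_rng(42)
n = 1000
y = rng.integers(0, 2, n).astype(float)
X = np.column_stack([
    rng.normal(size=n),          # legitimate noisy feature
    y + rng.normal(0, 0.01, n),  # leaked: computed from the target
])
# Hypothetical feature names for the sketch.
print(leakage_suspects(X, y, ["tenure_days", "chargeback_flag"]))
```

A scan like this takes minutes and should come before any regularisation tuning — if a leaked feature is driving the 98% training accuracy, no amount of dropout fixes the real problem.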

The debugging framework interviewers want to see

For any ML debugging question, structure your answer with this pattern:

  1. Reproduce: How do you confirm the problem is real? (Not a logging bug, not a one-off data issue)
  2. Scope: Is it affecting all predictions or a specific subset? Specific inputs, time periods, or user segments?
  3. Hypothesise: State your top 2–3 hypotheses for root cause, ordered by likelihood.
  4. Test: For each hypothesis, describe the specific check you'd run. Be concrete — name the metric, the query, or the experiment.
  5. Fix and validate: Once you've identified the cause, describe the fix and how you'd confirm it worked.
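Step 2 (scoping) often reduces to slicing the metric by segment — device type, user cohort, time window. A toy sketch with made-up data, showing how a global number can hide a segment-specific failure:

```python
from collections import defaultdict

def error_rate_by_segment(records):
    """Break a global metric down by segment.
    Each record is a (segment, was_error) pair."""
    totals, errors = defaultdict(int), defaultdict(int)
    for segment, was_error in records:
        totals[segment] += 1
        errors[segment] += int(was_error)
    return {seg: errors[seg] / totals[seg] for seg in totals}

# Toy prediction log: the ~17% global error rate hides that
# mobile is where the problem lives.
records = (
    [("desktop", False)] * 95 + [("desktop", True)] * 5 +
    [("mobile", False)] * 70 + [("mobile", True)] * 30
)
rates = error_rate_by_segment(records)
print(rates)  # desktop: 5% errors, mobile: 30%
```

Naming the exact slice you'd run ("error rate grouped by platform over the last 7 days") is what separates "I'd look at the data" from a testable hypothesis.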

The key is step 3 — most candidates skip from "I'd look at the data" straight to "I'd retrain the model." Interviewers want to hear you generate specific, testable hypotheses before running anything.

Production ML debugging vs. interview ML debugging

In production, you have logs, dashboards, and colleagues. In an interview, you have a scenario and a blank whiteboard. The skill being tested is whether you can reason systematically without any of those tools.

The best way to build that reasoning is to practise debugging scenarios out loud, with someone who can interrupt you and say "that hypothesis is right, but how would you actually test it?" That's what a data science mock interview gives you — a real ML engineer who has debugged real production models, pushing back on your reasoning in real time.

ML debugging questions to practice

  • Your fraud detection model's precision dropped from 84% to 67% overnight. What happened?
  • A feature that was highly predictive in training has near-zero importance in your production model. Why?
  • Your model performs well on desktop users but poorly on mobile users. How do you diagnose this?
  • Your A/B test shows the new model outperforms the baseline, but revenue is down. What do you investigate?
  • Your model's predictions have become systematically biased toward one class over the last month. What's causing this?