2026

DistilBERT Sports Discourse Classifier

Python, PyTorch, Hugging Face Transformers

Problem

Sports discussion is not just about topic; it is also about argument style. I wanted to classify whether a comment was analysis, a hot take, or a reaction, then compare a small fine-tuned model against a much larger zero-shot baseline.

What I built

I fine-tuned distilbert-base-uncased on a 212-example dataset I built by hand, training it to classify sports discussion by argument style rather than by topic. It reached 87.5% accuracy and a 0.876 macro-F1 score. I then ran a zero-shot Llama-3.3-70B baseline on the same held-out test set, and my fine-tuned model lost: the baseline scored 90.6% to its 87.5%. Instead of leaving that out, I dug into why. Confusion matrices and per-class error analysis showed that my smaller model had learned a keyword shortcut instead of actually reasoning about argument structure. Most student projects only show the win. I think showing the honest result, and knowing exactly why it happened, says more about how I actually think.
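For readers who want to see what the two headline metrics measure, here is a minimal sketch in plain Python. It is not the project's evaluation code: the label strings and the six toy predictions are made up for illustration, and the function names are my own.

```python
from collections import Counter

# Assumed label strings; the project names the classes only as
# analysis, hot take, and reaction.
LABELS = ["analysis", "hot_take", "reaction"]


def confusion_matrix(gold, pred, labels=LABELS):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def accuracy(matrix):
    """Fraction of examples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_f1(matrix):
    """F1 for each class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # gold i, predicted something else
        fp = sum(row[i] for row in matrix) - tp   # predicted i, gold something else
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(matrix):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(matrix)
    return sum(scores) / len(scores)


# Toy data, invented for this example only.
gold = ["analysis", "analysis", "hot_take", "hot_take", "reaction", "reaction"]
pred = ["analysis", "hot_take", "hot_take", "hot_take", "reaction", "analysis"]

matrix = confusion_matrix(gold, pred)
for label, row, f1 in zip(LABELS, matrix, per_class_f1(matrix)):
    print(f"{label:>9}  {row}  F1={f1:.3f}")
print(f"accuracy={accuracy(matrix):.3f}  macro-F1={macro_f1(matrix):.3f}")
```

Because macro-F1 averages the classes without weighting by frequency, one weak class pulls it down even when overall accuracy looks fine, which is why the per-class rows are worth reading alongside the headline numbers.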

Impact

Reached 87.5% accuracy and 0.876 macro-F1 on a hand-built evaluation set.
Benchmarked on the same held-out test set against a zero-shot Llama-3.3-70B baseline, which scored 90.6%.
Used confusion matrices and per-class error analysis to explain why the smaller model fell short.
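One way to test for a keyword shortcut is to compare, for comments containing a given cue word, how often the gold label is a class with how often the model predicts that class. The sketch below is a hypothetical probe, not the analysis actually run for this project: the cue word, the example comments, and the function name cue_rates are all invented for illustration.

```python
import re


def cue_rates(examples, cue, label):
    """Compare gold vs. predicted rate of `label` among texts containing `cue`.

    examples: list of (text, gold_label, predicted_label) tuples.
    Returns None when no text contains the cue word.
    """
    pattern = re.compile(r"\b" + re.escape(cue) + r"\b", re.IGNORECASE)
    hits = [(g, p) for text, g, p in examples if pattern.search(text)]
    if not hits:
        return None
    n = len(hits)
    return {
        "n": n,
        "gold_rate": sum(g == label for g, _ in hits) / n,
        "pred_rate": sum(p == label for _, p in hits) / n,
    }


# Invented examples; none of these come from the real dataset.
examples = [
    ("He is the most overrated player in the league", "hot_take", "hot_take"),
    ("Overrated? His on-off numbers say otherwise", "analysis", "hot_take"),
    ("People call him overrated, but his usage rate dropped 4 points", "analysis", "hot_take"),
    ("What a finish, I can't believe it", "reaction", "reaction"),
]

rates = cue_rates(examples, "overrated", "hot_take")
print(rates)
```

A predicted rate far above the gold rate for the same cue word suggests the model is reacting to the word itself rather than to how the argument is built. On a test set this small, any such gap is a lead to check by reading the individual errors, not a conclusion on its own.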
