Classifying stress in text is a deceptively tricky problem. The language people use when stressed on Reddit looks superficially similar to language used in many other emotional contexts. Getting a model to reliably distinguish stress from general negativity requires picking up on subtle semantic and contextual cues.
Why BERT won
The RNN model showed a systematic bias toward false negatives: it missed genuinely stressed posts. The vanilla Transformer had the opposite problem, a false-positive bias that flagged too many posts as stressed. BERT was the only architecture that achieved balanced classification across both classes. The difference is transfer learning: BERT's pre-training supplies contextual embeddings that capture how words relate to each other in context, something the smaller architectures could not learn from a limited training set.
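The two failure modes above can be made concrete by comparing per-class error rates from a confusion matrix. A minimal sketch with scikit-learn, using made-up labels just to illustrate the diagnostic (the `rnn_pred` and `trf_pred` arrays are hypothetical, not the actual model outputs):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def error_bias(y_true, y_pred):
    """Return (false_negative_rate, false_positive_rate) for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fnr = fn / (fn + tp)  # stressed posts the model missed
    fpr = fp / (fp + tn)  # non-stressed posts flagged as stressed
    return fnr, fpr

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
rnn_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])  # misses stressed posts (high FNR)
trf_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # over-flags calm posts (high FPR)

print(error_bias(y_true, rnn_pred))  # → (0.5, 0.0)
print(error_bias(y_true, trf_pred))  # → (0.0, 0.5)
```

A balanced classifier keeps both rates low and comparable; a large gap between them is exactly the bias described above.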
The numbers
Fine-tuned BERT: Precision = 0.88, Recall = 0.87, F1 = 0.88, ROC-AUC = 0.94. The AUC is the number I'm most proud of: 0.94 means that, given a random stressed post and a random non-stressed post, the model ranks the stressed one higher 94% of the time. Unlike raw accuracy, that holds regardless of class balance, so the model is genuinely discriminating, not just getting lucky on the label distribution.
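All four metrics come straight out of scikit-learn given held-out labels and predicted probabilities. A self-contained sketch on tiny synthetic data (the labels and scores below are invented for illustration, not the project's real predictions):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])           # 1 = stressed
y_score = np.array([0.9, 0.8, 0.7, 0.3,               # predicted P(stressed)
                    0.6, 0.2, 0.1, 0.05])
y_pred = (y_score >= 0.5).astype(int)                  # threshold at 0.5

print(f"Precision = {precision_score(y_true, y_pred):.2f}")  # → 0.75
print(f"Recall    = {recall_score(y_true, y_pred):.2f}")     # → 0.75
print(f"F1        = {f1_score(y_true, y_pred):.2f}")         # → 0.75
print(f"ROC-AUC   = {roc_auc_score(y_true, y_score):.4f}")   # → 0.9375
```

Note that ROC-AUC takes the raw scores, not the thresholded predictions: it measures ranking quality across all possible thresholds, which is why it can stay high even when a particular cutoff gives mediocre precision and recall.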
What I'd do differently
More careful error analysis earlier. I spent time tuning hyperparameters before I looked closely at which posts were being misclassified. When I did, the pattern was obvious: sarcastic posts were the hard cases for every model. That insight should have driven the project from day one.
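The error-analysis pass I'd now do on day one is simple to automate: rank the misclassified posts by how confident the model was when it got them wrong, and read the top of that list. A pure-Python sketch with hypothetical data and function names (sarcastic posts would surface here):

```python
def top_errors(texts, y_true, y_prob, k=3):
    """Return the k most confidently wrong predictions for manual review.

    y_prob holds the model's predicted P(stressed); a prediction counts
    as wrong when thresholding at 0.5 disagrees with the true label.
    """
    errs = []
    for text, truth, p in zip(texts, y_true, y_prob):
        pred = int(p >= 0.5)
        if pred != truth:
            # |p - 0.5| measures how far from the decision boundary
            # (i.e. how confident) the wrong prediction was
            errs.append((abs(p - 0.5), text, truth, pred))
    errs.sort(reverse=True)
    return [(text, truth, pred) for _, text, truth, pred in errs[:k]]

texts = ["sarcastic rant", "calm update", "panicked post", "neutral note"]
y_true = [1, 0, 1, 0]
y_prob = [0.1, 0.95, 0.45, 0.4]  # invented model confidences

for text, truth, pred in top_errors(texts, y_true, y_prob, k=2):
    print(f"{text!r}: true={truth}, predicted={pred}")
```

Reading even a dozen of these per model is usually enough to spot a pattern like the sarcasm problem, and it costs far less than a hyperparameter sweep.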