In natural language processing, a recently popular line of work explores how best to report the experimental results of neural networks. One exemplar publication, titled “Show Your Work: Improved Reporting of Experimental Results” (Dodge et al., 2019), advocates for reporting the expected validation effectiveness of the best-tuned model as a function of the computational budget. In the present work, we critically examine this paper. As far as statistical generalizability is concerned, we find unspoken pitfalls and caveats with this approach. We analytically show that their estimator is biased and relies on error-prone assumptions. We find that the estimator favors negative errors and yields poor bootstrapped confidence intervals. We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation. Our codebase is at https://github.com/castorini/meanmax.
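To make the contrast between a biased and an unbiased estimator concrete, the following is a minimal sketch and not the authors' implementation. The abstract gives no formulas, so both forms here are assumptions: the first is a plug-in estimate of the expected maximum of `budget` draws based on the empirical CDF, and the second averages the maximum over all size-`budget` subsets of the observed scores (a standard U-statistic). The function names and the example scores are hypothetical.

```python
from math import comb


def expected_max_plugin(scores, budget):
    """Plug-in estimate of E[max of `budget` draws] from the empirical CDF.

    Assumed form: the i-th smallest score is weighted by
    (i/n)^budget - ((i-1)/n)^budget.
    """
    v = sorted(scores)
    n = len(v)
    return sum(
        v[i - 1] * ((i / n) ** budget - ((i - 1) / n) ** budget)
        for i in range(1, n + 1)
    )


def expected_max_subsets(scores, budget):
    """Mean of the maximum over all size-`budget` subsets of the scores.

    Assumed form: the i-th smallest score is the maximum of
    C(i-1, budget-1) of the C(n, budget) subsets.
    """
    v = sorted(scores)
    n = len(v)
    if not 1 <= budget <= n:
        raise ValueError("budget must be between 1 and the number of scores")
    weighted = sum(
        v[i - 1] * comb(i - 1, budget - 1) for i in range(budget, n + 1)
    )
    return weighted / comb(n, budget)


if __name__ == "__main__":
    # Hypothetical validation scores from three hyperparameter trials.
    scores = [0.1, 0.2, 0.3]
    for budget in (1, 2, 3):
        print(
            budget,
            round(expected_max_plugin(scores, budget), 4),
            round(expected_max_subsets(scores, budget), 4),
        )
```

With a budget of one trial, both functions reduce to the sample mean. For larger budgets in this small example, the plug-in value falls below the subset average, which is in line with the abstract's remark that the examined estimator favors negative errors.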
jhan014 at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media
Jiahui Han | Shengtan Wu | Xinyu Liu
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we present two methods to identify and categorize offensive language on Twitter. In the first method, we establish a probabilistic model to evaluate the offensiveness level and target level of a sentence according to the different sub-tasks. In the second method, we develop a deep neural network consisting of bidirectional recurrent layers with Gated Recurrent Unit (GRU) cells and fully connected layers. Comparing the two methods, we find that each has its own advantages and drawbacks, while both achieve similar accuracy.