AbstractWe introduce distributed NLI, a new NLU task with a goal to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation methods, namely, Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation, models can capture human judgement distribution more effectively than the softmax baseline. We show that MC Dropout is able to achieve decent performance without any distribution annotations while Re-Calibration can give further improvements with extra distribution annotations, suggesting the value of multiple annotations for one example in modeling the distribution of human judgements. Despite these improvements, the best results are still far below the estimated human upper-bound, indicating that predicting the distribution of human judgements is still an open, challenging problem with a large room for improvements. We showcase the common errors for MC Dropout and Re-Calibration. Finally, we give guidelines on the usage of these methods with different levels of data availability and encourage future work on modeling the human opinion distribution for language reasoning.