Top 3 Papers from NeurIPS 2020

Is normalization indispensable for training deep neural network?

When training a classifier or detector, batch normalization (BatchNorm) is almost always applied to help with vanishing or exploding activations and gradients. However, BatchNorm can hurt model performance when training with a very small batch size. When available GPU memory forces us to train with small batches, we would normally adopt synchronized BatchNorm, which aggregates the BatchNorm statistics across multiple GPUs to mitigate the small-batch problem. This comes at the cost of longer training time, because synchronization between GPUs is inefficient and lowers GPU utilization.
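To make the small-batch issue concrete, here is a toy sketch that is not taken from the paper. It assumes activations drawn from a standard normal distribution and uses the per-batch mean as a stand-in for the statistics BatchNorm estimates; the function name `batch_mean_spread` is made up for illustration.

```python
import random
import statistics


def batch_mean_spread(batch_size, num_batches=500, seed=0):
    """Std. dev. of the per-batch mean when activations are drawn from N(0, 1)."""
    rng = random.Random(seed)
    batch_means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(num_batches)
    ]
    return statistics.pstdev(batch_means)


# The true mean is 0 in both cases; only the noise of the estimate differs.
small = batch_mean_spread(batch_size=2)
large = batch_mean_spread(batch_size=64)
print(f"spread of batch mean, batch size 2:  {small:.3f}")
print(f"spread of batch mean, batch size 64: {large:.3f}")
```

The estimate from tiny batches is several times noisier, which is the noise that synchronized BatchNorm reduces by pooling statistics over more samples.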

This paper tackles the vanishing/exploding problem in a new way, without normalization: a new residual operation, ‘RescaleNet’, which adds one more hyperparameter to stabilize ResNet-like layers. The change to the original residual connection is minimal, and in the paper's detection/segmentation experiments RescaleNet performs better than models trained with BatchNorm.
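The post does not spell out the rescaled residual, so the following is only a sketch under stated assumptions. It assumes the update has the form x_k = alpha_k * x_(k-1) + beta_k * F(x_(k-1)) with alpha_k^2 + beta_k^2 = 1, where a hypothetical scalar `c` plays the role of the extra hyperparameter. The residual branch is a random linear map, which is a toy stand-in for a real convolutional block.

```python
import math
import random


def make_branch(dim, rng):
    """Toy residual branch F(x) = W x with W_ij ~ N(0, 1/dim).

    This keeps the mean square of F(x) close to that of x.
    """
    scale = 1.0 / math.sqrt(dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weights]


def mean_square(batch):
    """Average squared activation over a batch of vectors."""
    return sum(v * v for x in batch for v in x) / (len(batch) * len(batch[0]))


def run_blocks(batch, branches, rescale, c=1.0):
    """Apply residual blocks in sequence and return the final mean square."""
    for k, branch in enumerate(branches, start=1):
        if rescale:
            # Assumed schedule (not confirmed by the post): alpha^2 + beta^2 = 1.
            alpha = math.sqrt((k - 1 + c) / (k + c))
            beta = math.sqrt(1.0 / (k + c))
        else:
            # Plain residual connection: x + F(x).
            alpha = beta = 1.0
        batch = [
            [alpha * xi + beta * fi for xi, fi in zip(x, branch(x))]
            for x in batch
        ]
    return mean_square(batch)


rng = random.Random(0)
dim, batch_size, depth = 32, 32, 12
inputs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
branches = [make_branch(dim, rng) for _ in range(depth)]

plain_var = run_blocks(inputs, branches, rescale=False)
rescaled_var = run_blocks(inputs, branches, rescale=True)
print(f"plain residual, mean square after {depth} blocks:    {plain_var:.1f}")
print(f"rescaled residual, mean square after {depth} blocks: {rescaled_var:.2f}")
```

In this toy setting the plain residual roughly doubles the activation scale at every block, while the rescaled version keeps it near its starting value without any normalization layer.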

Uncertainty-aware Self-training for Few-shot Text Classification

Pseudo-labeling is widely used for model training in semi-supervised learning. Given a small amount of labeled data for each class, a good pre-trained base model, and a large pool of unlabeled data, the goal is to make better use of all the data through self-training. The authors propose three key ideas: better uncertainty estimation on the pre-trained (teacher) model, better sample selection based on that uncertainty estimate, and confident learning on the student side, which puts more emphasis on low-variance examples.

The authors use Monte Carlo Dropout for uncertainty estimation, which gives deep neural network inference a Bayesian interpretation. Example selection then uses the uncertainty estimates to choose the most and least confused examples for self-training.
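A minimal sketch of the idea, not the authors' implementation. It assumes the teacher has already been run several times on each unlabeled example with dropout left on, so each example comes with a list of class-probability vectors. The uncertainty score used here (entropy of the mean prediction minus the mean entropy of the individual predictions) is one common choice and is an assumption, as are the function names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_dropout_summary(passes):
    """Summarize T stochastic forward passes for one example.

    passes: list of T class-probability vectors, one per dropout pass.
    Returns (pseudo_label, mean_probs, uncertainty).
    """
    num_passes = len(passes)
    num_classes = len(passes[0])
    mean_probs = [sum(p[c] for p in passes) / num_passes for c in range(num_classes)]
    pseudo_label = max(range(num_classes), key=lambda c: mean_probs[c])
    # High when the passes disagree with each other, near zero when they agree.
    uncertainty = entropy(mean_probs) - sum(entropy(p) for p in passes) / num_passes
    return pseudo_label, mean_probs, uncertainty


def select_for_self_training(pool, k):
    """Return indices of the k least and the k most uncertain examples."""
    scores = [mc_dropout_summary(passes)[2] for passes in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i])
    return ranked[:k], ranked[-k:]


# Hypothetical teacher outputs for three unlabeled examples, three passes each.
pool = [
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],      # passes agree
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],      # passes disagree
    [[0.2, 0.8], [0.3, 0.7], [0.25, 0.75]],    # passes mostly agree
]
labels = [mc_dropout_summary(passes)[0] for passes in pool]
easy, hard = select_for_self_training(pool, k=1)
print("pseudo-labels:", labels)
print("least confused:", easy, "most confused:", hard)
```

The selected examples and their pseudo-labels would then be fed to the student, optionally weighted so that low-variance examples count more.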

Though the proposed method has only been evaluated on text models, we think it has the potential to make a difference in vision-related tasks as well.

Distribution Matching for Crowd Counting

Crowd counting can be viewed as a workflow application and done by detecting and then counting, but this does not perform well because detectors are not robust to heavy occlusion. If it is instead viewed as a use case that needs a purpose-built model, there are several ways to tackle it: a regression model that takes in pixels and outputs a count, density map estimation, or distribution matching.

This paper views crowd counting as a distribution matching problem. The model takes in an image and outputs a map of density values, and summing the density map gives the final count estimate. The approach described by the authors can be applied to any network architecture. By combining the counting loss, the optimal transport loss, and the total variation loss, the model outperforms state-of-the-art methods by a large margin, especially on large-scale and challenging datasets.
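The following sketch illustrates two of the pieces described above on a tiny 2x2 example: reading the count off a density map, and comparing a predicted map with a ground-truth dot map. The loss definitions are simplified assumptions (absolute count difference, and total variation distance between the maps normalized to sum to one), and the optimal transport term is left out. The function names are made up for illustration.

```python
def total_count(density_map):
    """The count estimate is the sum of all density values."""
    return sum(sum(row) for row in density_map)


def normalize(density_map):
    """Flatten the map and scale it to sum to 1 (all zeros if the map is empty)."""
    total = total_count(density_map)
    flat = [v for row in density_map for v in row]
    return [v / total for v in flat] if total > 0 else [0.0 for _ in flat]


def counting_loss(pred, target):
    """Absolute difference between predicted and ground-truth counts."""
    return abs(total_count(pred) - total_count(target))


def total_variation_loss(pred, target):
    """Total variation distance between the two normalized maps."""
    p, q = normalize(pred), normalize(target)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 2x2 prediction and a dot map with one person in each left cell.
pred = [[0.5, 0.5], [1.0, 0.0]]
target = [[1.0, 0.0], [1.0, 0.0]]

print(total_count(pred))                   # 2.0
print(counting_loss(pred, target))         # 0.0
print(total_variation_loss(pred, target))  # 0.25
```

The example shows why the count alone is not enough: the predicted count is exactly right, yet the density is partly in the wrong place, which only the distribution-level term picks up.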

To start using Clarifai’s Computer Vision, NLP and AI platform, sign up for a free API key and get 1000 free operations per month.

About Clarifai

Clarifai is a leading provider of artificial intelligence for unstructured image, video, and text data. It delivers deep learning to its enterprise and public sector customers through its end-to-end computer vision and NLP platform, which manages the entire AI lifecycle.

Founded in 2013 by Matt Zeiler, Ph.D., Clarifai has been a market leader in computer vision AI since winning the top five places in image classification at the 2013 ImageNet Challenge. Clarifai, headquartered in Delaware, has raised $40M from top technology investors and is continuing to grow with more than 90 employees and offices in New York City, San Francisco, Washington, D.C., and Tallinn, Estonia. For more information, please visit

Originally published at
