# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: information plasticity, generative adversarial user model, multivariate time series, network inference, certified robustness ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "data augmentation" has 257 results

##### RETHINKING SELF-DRIVING : MULTI -TASK KNOWLEDGE FOR BETTER GENERALIZATION AND ACCIDENT EXPLANATION ABILITY

tl;dr we proposed a new self-driving model which is composed of perception module for see and think and driving module for behave to acquire better generalization and accident explanation ability.

Current end-to-end deep learning driving models have two problems: (1) Poor generalization ability of unobserved driving environment when diversity of train- ing driving dataset is limited (2) Lack of accident explanation ability when driving models don’t work as expected. To tackle these two problems, rooted on the be- lieve that knowledge of associated easy task is benificial for addressing difficult task, we proposed a new driving model which is composed of perception module for see and think and driving module for behave, and trained it with multi-task perception-related basic knowledge and driving knowledge stepwisely. Specifi- cally segmentation map and depth map (pixel level understanding of images) were considered as what & where and how far knowledge for tackling easier driving- related perception problems before generating final control commands for difficult driving task. The results of experiments demonstrated the effectiveness of multi- task perception knowledge for better generalization and accident explanation abil- ity. With our method the average sucess rate of finishing most difficult navigation tasks in untrained city of CoRL test surpassed current benchmark method for 15 percent in trained weather and 20 percent in untrained weathers.

tl;dr A new cyclic adversarial learning augmented with auxiliary task model which improves domain adaptation performance in low resource supervised and unsupervised situations

##### Stop memorizing: A data-dependent regularization framework for intrinsic pattern learning

tl;dr we propose a new framework for data-dependent DNN regularization that can prevent DNNs from overfitting random data or random labels.

Deep neural networks (DNNs) typically have enough capacity to fit random data by brute force even when conventional data-dependent regularizations focusing on the geometry of the features are imposed. We find out that the reason for this is the inconsistency between the enforced geometry and the standard softmax cross entropy loss. To resolve this, we propose a new framework for data-dependent DNN regularization, the Geometrically-Regularized-Self-Validating neural Networks (GRSVNet). During training, the geometry enforced on one batch of features is simultaneously validated on a separate batch using a validation loss consistent with the geometry. We study a particular case of GRSVNet, the Orthogonal-Low-rank Embedding (OLE)-GRSVNet, which is capable of producing highly discriminative features residing in orthogonal low-rank subspaces. Numerical experiments show that OLE-GRSVNet outperforms DNNs with conventional regularization when trained on real data. More importantly, unlike conventional DNNs, OLE-GRSVNet refuses to memorize random data or random labels, suggesting it only learns intrinsic patterns by reducing the memorizing capacity of the baseline DNN.

##### On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

tl;dr The SVRG method fails on modern deep learning problems

The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.

##### SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY

tl;dr We present a new approach, SNIP, that is simple, versatile and interpretable; it prunes irrelevant connections for a given task at single-shot prior to training and is applicable to a variety of neural network models without modifications.

Pruning large neural networks while maintaining the performance is often highly desirable due to the reduced space and time complexity. In existing methods, pruning is incorporated within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization. Specifically, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task even before training. This eliminates the need for both pretraining as well as the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on image classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.

##### Invariant-covariant representation learning

tl;dr This paper presents a novel latent-variable generative modelling technique that enables the representation of global information into one latent variable and local information into another latent variable.

Representations learnt through deep neural networks tend to be highly informative, but opaque in terms of what information they learn to encode. We introduce an approach to probabilistic modelling that learns to represent data with two separate deep representations: an invariant representation that encodes the information of the class from which the data belongs, and a covariant representation that encodes the symmetry transformation defining the particular data point within the class manifold (covariant in the sense that the representation varies naturally with symmetry transformations). This approach to representation learning is conceptually transparent, easy to implement, and in-principle generally applicable to any data comprised of discrete classes of continuous distributions (e.g. objects in images, topics in language, individuals in behavioural data). We demonstrate qualitatively compelling representation learning and competitive quantitative performance, in both supervised and semi-supervised settings, versus comparable modelling approaches in the literature with little fine tuning.

##### Open-Ended Content-Style Recombination Via Leakage Filtering

No tl;dr =[

We consider visual domains in which a class label specifies the content of an image, and class-irrelevant properties that differentiate instances constitute the style. We present a domain-independent method that permits the open-ended recombination of style of one image with the content of another. Open ended simply means that the method generalizes to style and content not present in the training data. The method starts by constructing a content embedding using an existing deep metric-learning technique. This trained content encoder is incorporated into a variational autoencoder (VAE), paired with a to-be-trained style encoder. The VAE reconstruction loss alone is inadequate to ensure a decomposition of the latent representation into style and content. Our method thus includes an auxiliary loss, leakage filtering, which ensures that no style information remaining in the content representation is used for reconstruction and vice versa. We synthesize novel images by decoding the style representation obtained from one image with the content representation from another. Using this method for data-set augmentation, we obtain state-of-the-art performance on few-shot learning tasks.

##### IncSQL: Training Incremental Text-to-SQL Parsers with Non-Deterministic Oracles

tl;dr We design incremental sequence-to-action parsers for text-to-SQL task and achieve SOTA results. We further improve by using non-deterministic oracles to allow multiple correct action sequences.

We present a sequence-to-action parsing approach for the natural language to SQL task that incrementally fills the slots of a SQL query with feasible actions from a pre-defined inventory. To account for the fact that typically there are multiple correct SQL queries with the same or very similar semantics, we draw inspiration from syntactic parsing techniques and propose to train our sequence-to-action models with non-deterministic oracles. We evaluate our models on the WikiSQL dataset and achieve an execution accuracy of 83.7% on the test set, a 2.1% absolute improvement over the models trained with traditional static oracles assuming a single correct target SQL query. When further combined with the execution-guided decoding strategy, our model sets a new state-of-the-art performance at an execution accuracy of 87.1%.

##### Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication

No tl;dr =[

Currently, progressively larger deep neural networks are trained on ever growing data corpora. In result, distributed training schemes are becoming increasingly relevant. A major issue in distributed training is the limited communication bandwidth between contributing nodes or prohibitive communication cost in general. %These challenges become even more pressing, as the number of computation nodes increases. To mitigate this problem we propose Sparse Binary Compression (SBC), a compression framework that allows for a drastic reduction of communication cost for distributed training. SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight update encoding to push compression gains to new limits. By doing so, our method also allows us to smoothly trade-off gradient sparsity and temporal sparsity to adapt to the requirements of the learning task. %We use tools from information theory to reason why SBC can achieve the striking compression rates observed in the experiments. Our experiments show, that SBC can reduce the upstream communication on a variety of convolutional and recurrent neural network architectures by more than four orders of magnitude without significantly harming the convergence speed in terms of forward-backward passes. For instance, we can train ResNet50 on ImageNet in the same number of iterations to the baseline accuracy, using $\times 3531$ less bits or train it to a $1\%$ lower accuracy using $\times 37208$ less bits. In the latter case, the total upstream communication required is cut from 125 terabytes to 3.35 gigabytes for every participating client. Our method also achieves state-of-the-art compression rates in a Federated Learning setting with 400 clients.

##### Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

No tl;dr =[

We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Different from the existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, DIMNet learns a shared representation for different modalities by mapping them individually to their common covariates. These shared representations can then be used to find the correspondences between the modalities. We show empirically that DIMNet is able to achieve better performance than the current state-of-the-art methods, with the additional benefits of being conceptually simpler and less data-intensive.

##### A Biologically Inspired Visual Working Memory for Deep Networks

tl;dr A biologically inspired working memory that can be integrated in recurrent visual attention models for state of the art performance

The ability to look multiple times through a series of pose-adjusted glimpses is fundamental to human vision. This critical faculty allows us to understand highly complex visual scenes. Short term memory plays an integral role in aggregating the information obtained from these glimpses and informing our interpretation of the scene. Computational models have attempted to address glimpsing and visual attention but have failed to incorporate the notion of memory. We introduce a novel, biologically inspired visual working memory architecture that we term the Hebb-Rosenblatt memory. We subsequently introduce a fully differentiable Short Term Attentive Working Memory model (STAWM) which uses transformational attention to learn a memory over each image it sees. The state of our Hebb-Rosenblatt memory is embedded in STAWM as the weights space of a layer. By projecting different queries through this layer we can obtain goal-oriented latent representations for tasks including classification and visual reconstruction. Our model obtains highly competitive classification performance on MNIST and CIFAR-10. As demonstrated through the CelebA dataset, to perform reconstruction the model learns to make a sequence of updates to a canvas which constitute a parts-based representation. Classification with the self supervised representation obtained from MNIST is shown to be in line with the state of the art models (none of which use a visual attention mechanism). Finally, we show that STAWM can be trained under the dual constraints of classification and reconstruction to provide an interpretable visual sketchpad which helps open the black-box' of deep learning.

##### Visual Reasoning by Progressive Module Networks

No tl;dr =[

Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn – most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline.

##### Probabilistic Binary Neural Networks

tl;dr We introduce a stochastic training method for training Binary Neural Network with both binary weights and activations.

Low bit-width weights and activations are an effective way of combating the increasing need for both memory and compute power of Deep Neural Networks. In this work, we present a probabilistic training method for Neural Network with both binary weights and activations, called PBNet. By embracing stochasticity during training, we circumvent the need to approximate the gradient of functions for which the derivative is zero almost always, such as $\textrm{sign}(\cdot)$, while still obtaining a fully Binary Neural Network at test time. Moreover, it allows for anytime ensemble predictions for improved performance and uncertainty estimates by sampling from the weight distribution. Since all operations in a layer of the PBNet operate on random variables, we introduce stochastic versions of Batch Normalization and max pooling, which transfer well to a deterministic network at test time. We evaluate two related training methods for the PBNet: one in which activation distributions are propagated throughout the network, and one in which binary activations are sampled in each layer. Our experiments indicate that sampling the binary activations is an important element for stochastic training of binary Neural Networks.

##### A Proposed Hierarchy of Deep Learning Tasks

tl;dr We use 50 GPU years of compute time to study how deep learning scales with more data, and propose a new way to organize the space of problems by difficulty.

As the pace of deep learning innovation accelerates, it becomes increasingly important to organize the space of problems by relative difficultly. Looking to other fields for inspiration, we see analogies to the Chomsky Hierarchy in computational linguistics and time and space complexity in theoretical computer science. As a complement to prior theoretical work on the data and computational requirements of learning, this paper presents an empirical approach. We introduce a methodology for measuring validation error scaling with data and model size and test tasks in natural language, vision, and speech domains. We find that power-law validation error scaling exists across a breadth of factors and that model size scales sublinearly with data size, suggesting that simple learning theoretic models offer insights into the scaling behavior of realistic deep learning settings, and providing a new perspective on how to organize the space of problems. We measure the power-law exponent---the "steepness" of the learning curve---and propose using this metric to sort problems by degree of difficulty. There is no data like more data, but some tasks are more effective at taking advantage of more data. Those that are more effective are easier on the proposed scale. Using this approach, we can observe that studied tasks in speech and vision domains scale faster than those in the natural language domain, offering insight into the observation that progress in these areas has proceeded more rapidly than in natural language.

##### Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes

tl;dr Finite-width SGD trained CNNs vs. infinitely wide fully Bayesian CNNs. Who wins?

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating an FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers. Surprisingly, in the absence of pooling layers, the corresponding GP is identical for CNNs with and without weight sharing. This means that translation equivariance in SGD-trained finite CNNs has no corresponding property in the Bayesian treatment of the infinite-width limit -- a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that in some scenarios, while the performance of trained finite CNNs becomes similar to that of the corresponding GP with increasing channel count, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs. Finally, we introduce a Monte Carlo method to estimate the GP corresponding to a NN architecture, even in cases where the analytic form has too many terms to be computationally feasible.

##### S3TA: A Soft, Spatial, Sequential, Top-Down Attention Model

We present a soft, spatial, sequential, top-down attention model (S3TA). This model uses a soft attention mechanism to bottleneck its view of the input. A recurrent core is used to generate query vectors, which actively select information from the input by correlating the query with input- and space-dependent key maps at different spatial locations. We demonstrate the power and interpretabilty of this model under two settings. First, we build an agent which uses this attention model in RL environments and show that we can achieve performance competitive with state-of-the-art models while producing attention maps that elucidate some of the strategies used to solve the task. Second, we use this model in supervised learning tasks and show that it also achieves competitive performance and provides interpretable attention maps that show some of the underlying logic in the model's decision making.

##### Backdrop: Stochastic Backpropagation

tl;dr We introduce backdrop, intuitively described as dropout acting on the backpropagation pipeline and find significant improvements in generalization for problems with non-decomposable losses and problems with multi-scale, hierarchical data structure.

We introduce backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multi-scale, hierarchical structure. Backdrop can also be applied to problems with non-decomposable loss functions where standard SGD methods are not well suited. We perform a number of experiments and demonstrate that backdrop leads to significant improvements in generalization.

##### Overfitting Detection of Deep Neural Networks without a Hold Out Set

tl;dr We introduce and analyze several criteria for detecting overfitting.

Overfitting is an ubiquitous problem in neural network training and usually mitigated using a holdout data set. Here we challenge this rationale and investigate criteria for overfitting without using a holdout data set. Specifically, we train a model for a fixed number of epochs multiple times with varying fractions of randomized labels and for a range of regularization strengths. A properly trained model should not be able to attain an accuracy greater than the fraction of properly labeled data points. Otherwise the model overfits. We introduce two criteria for detecting overfitting and one to detect underfitting. We analyze early stopping, the regularization factor, and network depth. In safety critical applications we are interested in models and parameter settings which perform well and are not likely to overfit. The methods of this paper allow characterizing and identifying such models.

##### ALISTA: Analytic Weights Are As Good As Learned Weights in LISTA

No tl;dr =[

Deep neural networks based on unfolding an iterative algorithm, for example, LISTA (learned iterative shrinkage thresholding algorithm), have been an empirical success for sparse signal recovery. The weights of these neural networks are currently determined by data-driven “black-box” training. In this work, we propose Analytic LISTA (ALISTA), where the weight matrix in LISTA is computed as the solution to a data-free optimization problem, leaving only the stepsize and threshold parameters to data-driven learning. This signiﬁcantly simpliﬁes the training. Speciﬁcally, the data-free optimization problem is based on coherence minimization. We show our ALISTA retains the optimal linear convergence proved in (Chen et al., 2018) and has a performance comparable to LISTA. Furthermore, we extend ALISTA to convolutional linear operators, again determined in a data-free manner. We also propose a feed-forward framework that combines the data-free optimization and ALISTA networks from end to end, one that can be jointly trained to gain robustness to small perturbations in the encoding model.

##### Déjà Vu: An Empirical Evaluation of the Memorization Properties of Convnets

tl;dr We analyze the memorization properties by a convnet of the training set and propose several use-cases where we can extract some information about the training set.

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting. This paper considers the related question of membership inference'', where the goal is to determine if an image was used during training. We consider it under three complementary angles. We first analyze explicit memorization and extend classical random label experiments to the problem of learning a model that predicts if an image belongs to an arbitrary set. We then show how to detect if a dataset was used to train a model, and in particular whether some validation images were used at train time. Finally, we propose a new approach to infer membership when a few of the top layers are not available or have been fine-tuned, and show that lower layers still carry information about the training samples. To support our findings, we conduct large-scale experiments on Imagenet and subsets of YFCC-100M with modern architectures such as VGG and Resnet.

##### Neural Rendering Model: Joint Generation and Prediction for Semi-Supervised Learning

tl;dr We develop a new deep generative model for semi-supervised learning and propose a new Max-Min cross-entropy for training CNNs.

Unsupervised and semi-supervised learning are important problems that are especially challenging with complex data like natural images. Progress on these problems would accelerate if we had access to appropriate generative models under which to pose the associated inference tasks. Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Neural Rendering Model (NRM), a new hierarchical probabilistic generative model whose inference calculations correspond to those in a CNN. The NRM introduces a small set of latent variables at each level of the model and enforces dependencies among all the latent variables via a conjugate prior distribution. The conjugate prior yields a new regularizer for learning based on the paths rendered in the generative model for training CNNs–the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization both in theory and in practice. Likelihood estimation in the NRM yields the new Max-Min cross entropy training loss, which suggests a new deep network architecture–the Max- Min network–which exceeds or matches the state-of-art for semi-supervised and supervised learning on SVHN, CIFAR10, and CIFAR100.

##### Three Mechanisms of Weight Decay Regularization

tl;dr We investigate weight decay regularization for different optimizers and identify three distinct mechanisms by which weight decay improves generalization.

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of L2 regularization. Literal weight decay has been shown to outperform L2 regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and KFAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) regularizing approximated input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.

##### Local Critic Training of Deep Neural Networks

tl;dr We propose a new learning algorithm of deep neural networks, which unlocks the layer-wise dependency of backpropagation.

This paper proposes a novel approach to train deep neural networks by unlocking the layer-wise dependency of backpropagation training. The approach employs additional modules called local critic networks besides the main network model to be trained, which are used to obtain error gradients without complete feedforward and backward propagation processes. We propose a cascaded learning strategy for these local networks. In addition, the approach is also useful from multi-model perspectives, including structural optimization of neural networks, computationally efficient progressive inference, and ensemble classification for performance improvement. Experimental results show the effectiveness of the proposed approach and suggest guidelines for determining appropriate algorithm parameters.

##### Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search

tl;dr A novel differentiable neural architecture search framework for mixed quantization of ConvNets.

Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources. However, existing quantization methods often represent all weights and activations with the same precision (bit-width). In this paper, we explore a new dimension of the design space: quantizing different layers with different bit-widths. We formulate this problem as a neural architecture search problem and propose a novel differentiable neural architecture search (DNAS) framework to efficiently explore its exponential search space with gradient-based optimization. Experiments show we surpass the state-of-the-art compression of ResNet on CIFAR-10 and ImageNet. Our quantized models with 21.1x smaller model size or 103.9x lower computational cost can still outperform baseline quantized or even full precision models.

##### Knowledge Flow: Improve Upon Your Teachers

tl;dr ‘Knowledge Flow’ trains a deep net (student) by injecting information from multiple nets (teachers). The student is independent upon training and performs very well on learned tasks irrespective of the setting (reinforcement or supervised learning).

A zoo of deep nets is available these days for almost any given task, and it is increasingly unclear which net to start with when addressing a new task, or which net to use as an initialization for fine-tuning a new model. To address this issue, in this paper, we develop knowledge flow which moves ‘knowledge’ from multiple deep nets, referred to as teachers, to a new deep net model, called the student. The structure of the teachers and the student can differ arbitrarily and they can be trained on entirely different tasks with different output spaces too. Upon training with knowledge flow the student is independent of the teachers. We demonstrate our approach on a variety of supervised and reinforcement learning tasks, outperforming fine-tuning and other ‘knowledge exchange’ methods.

##### GEOMETRIC AUGMENTATION FOR ROBUST NEURAL NETWORK CLASSIFIERS

tl;dr We develop a statistical-geometric unsupervised learning augmentation framework for deep neural networks to make them robust to adversarial attacks.

We introduce a novel geometric perspective and unsupervised model augmentation framework for transforming traditional deep (convolutional) neural networks into adversarially robust classifiers. Class-conditional probability densities based on Bayesian nonparametric mixtures of factor analyzers (BNP-MFA) over the input space are used to design soft decision labels for feature to label isometry. Classconditional distributions over features are also learned using BNP-MFA to develop plug-in maximum a posterior (MAP) classifiers to replace the traditional multinomial logistic softmax classification layers. This novel unsupervised augmented framework, which we call geometrically robust networks (GRN), is applied to CIFAR-10, CIFAR-100, and to Radio-ML (a time series dataset for radio modulation recognition). We demonstrate the robustness of GRN models to adversarial attacks from fast gradient sign method, Carlini-Wagner, and projected gradient descent.

##### Rotation Equivariant Networks via Conic Convolution and the DFT

tl;dr We propose conic convolution and the 2D-DFT to encode rotation equivariance into an neural network.

Performance of neural networks can be significantly improved by encoding known invariance for particular tasks. Many image classification tasks, such as those related to cellular imaging, exhibit invariance to rotation. In particular, to aid convolutional neural networks in learning rotation invariance, we consider a simple, efficient conic convolutional scheme that encodes rotational equivariance, along with a method for integrating the magnitude response of the 2D-discrete-Fourier transform (2D-DFT) to encode global rotational invariance. We call our new method the Conic Convolution and DFT Network (CFNet). We evaluated the efficacy of CFNet as compared to a standard CNN and group-equivariant CNN (G-CNN) for several different image classification tasks and demonstrated improved performance, including classification accuracy, computational efficiency, and its robustness to hyperparameter selection. Taken together, we believe CFNet represents a new scheme that has the potential to improve many imaging analysis applications.

##### A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations

tl;dr We show that CNNs are not robust to simple rotations and translation and explore methods of improving this.

We show that simple spatial transformations, namely translations and rotations alone, suffice to fool neural networks on a significant fraction of their inputs in multiple image classification tasks. Our results are in sharp contrast to previous work in adversarial robustness that relied on more complicated optimization ap- proaches unlikely to appear outside a truly adversarial context. Moreover, the misclassifying rotations and translations are easy to find and require only a few black-box queries to the target model. Overall, our findings emphasize the need to design robust classifiers even for natural input transformations in benign settings.

##### Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation

tl;dr Presents new architecture which leverages information globalization power of u-nets in a deeper networks and performs well across tasks without any bells and whistles.

Many imaging tasks require global information about all pixels in an image. Conventional bottom-up classification networks globalize information by decreasing resolution; features are pooled and down-sampled into a single output. But for semantic segmentation and object detection tasks, a network must provide higher-resolution pixel-level outputs. To globalize information while preserving resolution, many researchers propose the inclusion of sophisticated auxiliary blocks, but these come at the cost of a considerable increase in network size and computational cost. This paper proposes stacked u-nets (SUNets), which iteratively combine features from different resolution scales while maintaining resolution. SUNets leverage the information globalization power of u-nets in a deeper net- work architectures that is capable of handling the complexity of natural images. SUNets perform extremely well on semantic segmentation tasks using a small number of parameters.

##### Meta-Learning with Latent Embedding Optimization

tl;dr Latent Embedding Optimization (LEO) is a novel gradient-based meta-learner with state-of-the-art performance on the challenging 5-way 1-shot and 5-shot miniImageNet and tieredImageNet classification tasks.

Gradient-based meta-learning techniques are both widely applicable and proficient at solving challenging few-shot learning and fast adaptation problems. However, they have practical difficulties when operating on high-dimensional parameter spaces in extreme low-data regimes. We show that it is possible to bypass these limitations by learning a data-dependent latent generative representation of model parameters, and performing gradient-based meta-learning in this low-dimensional latent space. The resulting approach, latent embedding optimization (LEO), decouples the gradient-based adaptation procedure from the underlying high-dimensional space of model parameters. Our evaluation shows that LEO can achieve state-of-the-art performance on the competitive miniImageNet and tieredImageNet few-shot classification tasks. Further analysis indicates LEO is able to capture uncertainty in the data and model parameters, and can perform adaptation more effectively by optimizing in latent space.

##### Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

tl;dr We obtain non-vacuous generalization bounds on ImageNet-scale deep neural networks by combining an original PAC-Bayes bound and an off-the-shelf neural network compression method.

Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be compressed to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size that, combined with off-the-shelf compression algorithms, leads to state-of-the-art generalization guarantees. In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. Additionally, we show that compressibility of models that tend to overfit is limited. Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network.

##### Neural Probabilistic Motor Primitives for Humanoid Control

tl;dr Neural Probabilistic Motor Primitives compress motion capture tracking policies into one flexible model capable of one-shot imitation and reuse as a low-level controller.

Transferring functional properties from one or multiple expert policies to a student policy is an important challenge in control. Expert robustness is of particular interest; we would like to not only transfer the expert behavior but also its ability to recover from perturbations. With this in mind, we explore approaches for policy cloning and propose linear feedback policy cloning as a simple option for certain settings. We show that it can be surprisingly straightforward to clone ex-pert policies for seemingly complex behaviors without the student requiring any environment interactions. We then propose a latent-variable architecture that bottlenecks a sensory-motor primitive space, which, again, can be trained entirely offline to compress thousands of expert policies. We show this resulting neural probabilistic motor primitive system produces robust one-shot imitation of whole-body humanoid behaviors. In addition, we analyze the resulting latent space and demonstrate the ability to reuse this system. We encourage readers to view a supplementary video (https://youtu.be/44tPXdUCc-g ) summarizing our results.

##### Deepström Networks

tl;dr A new neural architecture where top dense layers of standard convolutional architectures are replaced with an approximation of a kernel function by relying on the Nyström approximation.

Recent work has focused on combining kernel methods and deep learning. With this in mind, we introduce Deepström networks -- a new architecture of neural networks which we use to replace top dense layers of standard convolutional architectures with an approximation of a kernel function by relying on the Nyström approximation. Our approach is easy highly flexible. It is compatible with any kernel function and it allows exploiting multiple kernels. We show that Deepström networks reach state-of-the-art performance on standard datasets like SVHN and CIFAR100. One benefit of the method lies in its limited number of learnable parameters which make it particularly suited for small training set sizes, e.g. from 5 to 20 samples per class. Finally we illustrate two ways of using multiple kernels, including a multiple Deepström setting, that exploits a kernel on each feature map output by the convolutional part of the model.

##### Label Smoothing and Logit Squeezing: A Replacement for Adversarial Training?

Adversarial training is one of the strongest defenses against adversarial attacks, but it requires adversarial examples to be generated for every mini-batch during optimization. The expense of producing these examples during training often precludes adversarial training from use on large and high-resolution image datasets. In this study, we explore the mechanisms by which adversarial training improves classifier robustness, and show that these mechanisms can be effectively mimicked using simple regularization methods, including label smoothing and logit squeezing. Remarkably, using these simple regularization methods in combination with Gaussian noise injection, we are able to achieve strong adversarial robustness -- often exceeding that of adversarial training -- using no adversarial examples.

##### An Empirical Study of Example Forgetting during Deep Neural Network Learning

tl;dr We show that catastrophic forgetting occurs within what is considered to be a single task and find that examples that are not prone to forgetting can be removed from the training set without loss of generalization.

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a forgetting event'' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

##### FEED: Feature-level Ensemble Effect for knowledge Distillation

No tl;dr =[

This paper proposes a versatile and powerful training algorithm named Feature-level Ensemble Effect for knowledge Distillation(FEED), which is inspired by the work of factor transfer. The factor transfer is one of the knowledge transfer methods that improves the performance of a student network with a strong teacher network. It transfers the knowledge of a teacher in the feature map level using high-capacity teacher network, and our training algorithm FEED is an extension of it. FEED aims to transfer ensemble knowledge, using either multiple teachers in parallel or multiple training sequences. Adapting the peer-teaching framework, we introduce a couple of training algorithms that transfer ensemble knowledge to the student at the feature map level, both of which help the student network find more generalized solutions in the parameter space. Experimental results on CIFAR-100 and ImageNet show that our method, FEED, has clear performance enhancements,without introducing any additional parameters or computations at test time.

##### Tangent-Normal Adversarial Regularization for Semi-supervised Learning

tl;dr We propose a novel manifold regularization strategy based on adversarial training, which can significantly improve the performance of semi-supervised learning.

The ever-increasing size of modern datasets combined with the difficulty of obtaining label information has made semi-supervised learning of significant practical importance in modern machine learning applications. In comparison to supervised learning, the key difficulty in semi-supervised learning is how to make full use of the unlabeled data. In order to utilize manifold information provided by unlabeled data, we propose a novel regularization called the tangent-normal adversarial regularization, which is composed by two parts. The two parts complement with each other and jointly enforce the smoothness along two different directions that are crucial for semi-supervised learning. One is applied along the tangent space of the data manifold, aiming to enforce local invariance of the classifier on the manifold, while the other is performed on the normal space orthogonal to the tangent space, intending to impose robustness on the classifier against the noise causing the observed data deviating from the underlying data manifold. Both of the two regularizers are achieved by the strategy of virtual adversarial training. Our method has achieved state-of-the-art performance on semi-supervised learning tasks on both artificial dataset and practical datasets.

##### Critical Learning Periods in Deep Networks

tl;dr Sensory deficits in early training phases can lead to irreversible performance loss in both artificial and neuronal networks, suggesting information phenomena as the common cause, and point to the importance of the initial transient and forgetting.

Similar to humans and animals, deep artificial neural networks exhibit critical periods during which a temporary stimulus deficit can impair the development of a skill. The extent of the impairment depends on the onset and length of the deficit window, as in animal models, and on the size of the neural network. Deficits that do not affect low-level statistics, such as vertical flipping of the images, have no lasting effect on performance and can be overcome with further training. To better understand this phenomenon, we use the Fisher Information of the weights to measure the effective connectivity between layers of a network during training. Counterintuitively, information raises rapidly in the early phases of training, and then decreases, preventing redistribution of information resources in a phenomenon we refer to as a loss of "Information Plasticity". Our analysis suggests that the first few epochs are critical for the creation of strong connections that are optimal relative to the input data distribution. Once such strong connections are created, they do not appear to change during additional training. These findings suggest that the initial learning transient, under-scrutinized compared to asymptotic behavior, plays a key role in determining the outcome of the training process. Our findings, combined with recent theoretical results in the literature, also suggest that forgetting (decrease of information in the weights) is critical to achieving invariance and disentanglement in representation learning. Finally, critical periods are not restricted to biological systems, but can emerge naturally in learning systems, whether biological or artificial, due to fundamental constrains arising from learning dynamics and information processing.

##### Where and when to look? Spatial-temporal attention for action recognition in videos

No tl;dr =[

Inspired by the observation that humans are able to process videos efficiently by only paying attention when and where it is needed, we propose a novel spatial-temporal attention mechanism for video-based action recognition. For spatial attention, we learn a saliency mask to allow the model to focus on the most salient parts of the feature maps. For temporal attention, we employ a soft temporal attention mechanism to identify the most relevant frames from an input video. Further, we propose a set of regularizers that ensure that our attention mechanism attends to coherent regions in space and time. Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally. We demonstrate the efficacy of our approach on three public video action recognition datasets. The proposed approach leads to state-of-the-art performance on all of them, including the new large-scale Moments in Time dataset. Furthermore, we quantitatively and qualitatively evaluate our model's ability to accurately localize discriminative regions spatially and critical frames temporally. This is despite our model only being trained with per video classification labels.

##### Zero-shot Learning for Speech Recognition with Universal Phonetic Model

tl;dr We apply zero-shot learning for speech recognition to recognize unseen phonemes

There are more than 7,000 languages in the world, but due to the lack of training sets, only a small number of them have speech recognition systems. Multilingual speech recognition provides a solution if at least some audio training data is available. Often, however, phoneme inventories differ between the training languages and the target language, making this approach infeasible. In this work, we address the problem of building an acoustic model for languages with zero audio resources. Our model is able to recognize unseen phonemes in the target language, if only a small text corpus is available. We adopt the idea of zero-shot learning, and decompose phonemes into corresponding phonetic attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over phonetic attributes, and then compute phoneme distributions with a customized acoustic model. We extensively evaluate our model on 20 languages, and find that on average, it achieves 9.9% better phone error rate over the baseline model.

##### Decoupled Weight Decay Regularization

No tl;dr =[

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it weight decay'' in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments will be available after the review process.

##### Learning deep representations by mutual information estimation and maximization

tl;dr We learn deep representation by maximizing mutual information, leveraging structure in the objective, and are able to compute with fully supervised classifiers with comparable architectures

In this work, we perform unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality of the input to the objective can greatly influence a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and competes with fully-supervised learning on several classification tasks. DIM opens new avenues for unsupervised learning of representations and is an important step towards the flexible formulation of representation-learning objectives for specific end-goals.

##### Causal importance of orientation selectivity for generalization in image recognition

No tl;dr =[

Although both our brain and deep neural networks (DNNs) can perform high-level sensory-perception tasks such as image or speech recognition, the inner mechanism of these hierarchical information-processing systems is poorly understood in both neuroscience and machine learning. Recently, Morcos et al. (2018) examined the effect of class-selective units in DNNs, i.e., units with high-level selectivity, on network generalization, concluding that hidden units that are selectively activated by specific input patterns may harm the network's performance. In this study, we revisit their hypothesis, considering units with selectivity for lower-level features, and argue that selective units are not always harmful to the network performance. Specifically, by using DNNs trained for image classification, we analyzed the orientation selectivity of individual units. Orientation selectivity is a low-level selectivity widely studied in visual neuroscience, in which, when images of bars with several orientations are presented to the eye, many neurons in the visual cortex respond selectively to a specific orientation. We found that orientation-selective units exist in both lower and higher layers of DNNs, as in our brain. In particular, units in the lower layers become more orientation-selective as the generalization performance improves during the course of training of the DNNs. Consistently, networks that generalize better are more orientation-selective in the lower layers. We finally reveal that ablating these selective units in the lower layers substantially degrades the generalization performance. These results suggest to the machine-learning community that, contrary to the triviality of units with high-level selectivity, lower-layer units with selectivity for low-level features are indispensable for generalization, and for neuroscientists, orientation selectivity does play a causally important role in object recognition.

##### Learning Discriminators as Energy Networks in Adversarial Learning

tl;dr We propose a novel adversarial learning framework for structured prediction, in which discriminative models can be used to refine structured prediction models at the inference stage.

We propose a novel framework for structured prediction via adversarial learning. Existing adversarial learning methods involve two separate networks, i.e., the structured prediction models and the discriminative models, in the training. The information captured by discriminative models complements that in the structured prediction models, but few existing researches have studied on utilizing such information to improve structured prediction models at the inference stage. In this work, we propose to refine the predictions of structured prediction models by effectively integrating discriminative models into the prediction. Discriminative models are treated as energy-based models. Similar to the adversarial learning, discriminative models are trained to estimate scores which measure the quality of predicted outputs, while structured prediction models are trained to predict contrastive outputs with maximal energy scores. In this way, the gradient vanishing problem is ameliorated, and thus we are able to perform inference by following the ascent gradient directions of discriminative models to refine structured prediction models. The proposed method is able to handle a range of tasks, \emph{e.g.}, multi-label classification and image segmentation. Empirical results on these two tasks validate the effectiveness of our learning method.

##### Efficient Multi-Objective Neural Architecture Search via Lamarckian Evolution

tl;dr We propose a method for efficient Multi-Objective Neural Architecture Search based on Lamarckian inheritance and evolutionary algorithms.

Architecture search aims at automatically finding neural architectures that are competitive with architectures designed by human experts. While recent approaches have achieved state-of-the-art predictive performance for image recognition, they are problematic under resource constraints for two reasons: (1) the neural architectures found are solely optimized for high predictive performance, without penalizing excessive resource consumption; (2)most architecture search methods require vast computational resources. We address the first shortcoming by proposing LEMONADE, an evolutionary algorithm for multi-objective architecture search that allows approximating the Pareto-front of architectures under multiple objectives, such as predictive performance and number of parameters, in a single run of the method. We address the second shortcoming by proposing a Lamarckian inheritance mechanism for LEMONADE which generates children networks that are warmstarted with the predictive performance of their trained parents. This is accomplished by using (approximate) network morphism operators for generating children. The combination of these two contributions allows finding models that are on par or even outperform different-sized NASNets, MobileNets, MobileNets V2 and Wide Residual Networks on CIFAR-10 and ImageNet64x64 within only one week on eight GPUs, which is about 20-40x less compute power than previous architecture search methods that yield state-of-the-art performance.

tl;dr Learning to synthesize raw waveform audio with GANs

While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to audio generation. Unlike for images, a barrier to success is that the best discriminative representations for audio tend to be non-invertible, and thus cannot be used to synthesize listenable outputs. In this paper we introduce WaveGAN, a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio. Our experiments demonstrate that WaveGAN can produce intelligible words from a small vocabulary of speech, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. Qualitatively, we find that human judges prefer the sound quality of generated examples from WaveGAN over those from a method which naïvely apply GANs on image-like audio feature representations.

tl;dr We propose a framework to combine decision trees and neural networks, and show on image classification tasks that it enjoys the complementary benefits of the two approaches, while addressing the limitations of prior work.

Deep neural networks and decision trees operate on largely separate paradigms; typically, the former performs representation learning with pre-specified architectures, while the latter is characterised by learning hierarchies over pre-specified features with data-driven architectures. We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). ANTs allow increased interpretability via hierarchical clustering, e.g., learning meaningful class associations, such as separating natural vs. man-made objects. We demonstrate this whilst achieving over 99% and 90% accuracy on the MNIST and CIFAR-10 datasets. Furthermore, ANT optimisation naturally adapts the architecture to the size and complexity of the training data.

##### PA-GAN: Improving GAN Training by Progressive Augmentation

tl;dr We introduce a new technique - progressive augmentation of GANs (PA-GAN) - that helps to improve the overall stability of GAN training.

Despite recent progress, Generative Adversarial Networks (GANs) still suffer from training instability, requiring careful consideration of architecture design choices and hyper-parameter tuning. The reason for this fragile training behaviour is partially due to the discriminator performing well very quickly; its loss converges to zero, providing no reliable backpropagation signal to the generator. In this work we introduce a new technique - progressive augmentation of GANs (PA-GAN) - that helps to overcome this fundamental limitation and improve the overall stability of GAN training. The key idea is to gradually increase the task difficulty of the discriminator by progressively augmenting its input space, thus enabling continuous learning of the generator. We show that the proposed progressive augmentation preserves the original GAN objective, does not bias the optimality of the discriminator and encourages the healthy competition between the generator and discriminator, leading to a better-performing generator. We experimentally demonstrate the effectiveness of the proposed approach on multiple benchmarks (MNIST, Fashion-MNIST, CIFAR10, CELEBA) for the image generation task.

tl;dr An adaptve convolutional kernel, that includes non-linear transformations obtaining similar results as the state of the art algorithms, while yielding a reduction in required memory up to 16x in the CIFAR10

The quest for increased visual recognition performance has led to the development of highly complex neural networks with very deep topologies. To avoid high computing resource requirements of such complex networks and to enable operation on devices with limited resources, this paper introduces adaptive kernels for convolutional layers. Motivated by the non-linear perception response in human visual cells, the input image is used to define the weights of a dynamic kernel called Adaptive kernel. This new adaptive kernel is used to perform a second convolution of the input image generating the output pixel. Adaptive kernels enable accurate recognition with lower memory requirements; This is accomplished through reducing the number of kernels and the number of layers needed in the typical CNN configuration, in addition to reducing the memory used, increasing 2X the training speed and the number of activation function evaluations. Our experiments show a reduction of 70X in the memory used for MNIST, maintaining 99% accuracy and 16X memory reduction for CIFAR10 with 92.5% accuracy.

##### Learning Protein Structure with a Differentiable Simulator

tl;dr We use an unrolled simulator of a neural energy function as an end-to-end differentiable model of protein structure and show it can hierarchically generalize to unseen fold types.

The Boltzmann distribution is a natural model for many systems, from brains to materials and biomolecules, but is often of limited utility for fitting data because Monte Carlo algorithms are unable simulate it in available time. This gap between the expressive capabilities and sampling practicalities of energy-based models is exemplified by the protein folding problem, since energy landscapes underlie contemporary knowledge of protein biophysics but computer simulations are still unable to fold all but the smallest proteins from first-principles. In this work we bridge the gap between the expressive capacity of energy functions and the practical capabilities of their simulators by using an unrolled Monte Carlo simulation as a model for data. We compose a neural energy function with a novel and efficient simulator based on Langevin dynamics to build an end-to-end-differentiable model of atomic protein structure given amino acid sequence information. We introduce techniques for stabilizing backpropagation under long roll-outs and demonstrate the model's capacity to make multimodal predictions and to generalize to unobserved protein fold types when trained on a large corpus of protein structures.

##### Learning to Augment Influential Data

No tl;dr =[

Data augmentation is a technique to reduce overfitting and to improve generalization by increasing the number of labeled data by performing label preserving transformations; however, it is currently conducted in a trial and error manner. A composition of predefined transformations such as rotation, scaling, and cropping is performed on training samples, and its effect on performance over test samples can only be empirically evaluated and cannot be predicted. This paper considers an influence function that predicts how generalization is affected by a particular augmented training sample in terms of validation loss, without comparing the performance that includes and excludes the sample in the training process. A differentiable augmentation model that generalizes the conventional composition of predefined transformations is proposed. The differentiable augmentation model and reformulation of the influence function allow the augmented model parameters to be updated by backpropagation to minimize the validation loss. The experimental results show that the proposed method provides better generalization over conventional data augmentation methods.

##### The role of over-parametrization in generalization of neural networks

tl;dr We suggest a generalization bound that could potentially explain the improvement in generalization with over-parametrization.

Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.

##### ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.

tl;dr ImageNet-trained CNNs are biased towards object texture (instead of shape like humans). Overcoming this bias (using a novel data augmentation) yields improved detection performance and previously unseen robustness to image distortions.

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies hint to a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on our novel Stylized-ImageNet dataset. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

##### Mixture of Pre-processing Experts Model for Noise Robust Deep Learning on Resource Constrained Platforms

No tl;dr =[

Deep learning on an edge device requires energy efficient operation due to ever diminishing power budget. Intentional low quality data during the data acquisition for longer battery life, and natural noise from the low cost sensor degrade the quality of target output which hinders adoption of deep learning on an edge device. To overcome these problems, we propose simple yet efficient mixture of pre-processing experts (MoPE) model to handle various image distortions including low resolution and noisy images. We also propose to use adversarially trained auto encoder as a pre-processing expert for the noisy images. We evaluate our proposed method for various machine learning tasks including object detection on MS-COCO 2014 dataset, multiple object tracking problem on MOT-Challenge dataset, and human activity recognition on UCF 101 dataset. Experimental results show that the proposed method achieves better detection, tracking and activity recognition accuracies under noise without sacrificing accuracies for the clean images. The overheads of our proposed MoPE are 0.67% and 0.17% in terms of memory and computation compared to the baseline object detection network.

##### Meta-Learning with Individualized Feature Space for Few-Shot Classification

No tl;dr =[

Meta-learning provides a promising learning framework to address few-shot classification tasks. In existing meta-learning methods, the meta-learner is designed to learn about model optimization, parameter initialization, or similarity metric. Differently, in this paper, we propose to learn how to create an individualized feature embedding specific to a given query image for better classifying, i.e., given a query image, a specific feature embedding tailored for its characteristics is created accordingly, leading to an individualized feature space in which the query image can be more accurately classified.  Specifically, we introduce a kernel generator as meta-learner to learn to construct feature embedding for query images. The kernel generator acquires meta-knowledge of generating adequate convolutional kernels for different query images during training, which can generalize to unseen categories without fine-tuning. In two standard few-shot classification data sets, i.e. Omniglot, and \emph{mini}ImageNet, our method shows highly competitive performance.

##### Toward Understanding the Impact of Staleness in Distributed Machine Learning

tl;dr Empirical and theoretical study of the effects of staleness in non-synchronous execution on machine learning algorithms.

Most distributed machine learning (ML) systems store a copy of the model parameters locally on each machine to minimize network communication. In practice, in order to reduce synchronization waiting time, these copies of the model are not necessarily updated in lock-steps, and can become stale. Despite much development in large-scale ML, the effect of staleness on the learning efficiency is inconclusive, mainly because it is challenging to control or monitor the staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms, and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of SGD in non-convex optimization under staleness, matching the best known convergence rate of O(1/\sqrt{T}).

##### Logit Regularization Methods for Adversarial Robustness

tl;dr Logit regularization methods help explain and improve state of the art adversarial defenses

While great progress has been made at making neural networks effective across a wide range of tasks, many are surprisingly vulnerable to small, carefully chosen perturbations of their input, known as adversarial examples. In this paper, we advocate for and experimentally investigate the use of logit regularization techniques as an adversarial defense, which can be used in conjunction with other methods for creating adversarial robustness at little to no cost. We demonstrate that much of the effectiveness of one recent adversarial defense mechanism can be attributed to logit regularization and show how to improve its defense against both white-box and black-box attacks, in the process creating a stronger black-box attacks against PGD-based models.

##### Point Cloud GAN

tl;dr We propose a GAN variant which learns to generate point clouds. Different studies have been explores, including tighter Wasserstein distance estimate, conditional generation, generalization to unseen point clouds and image to point cloud.

Generative Adversarial Networks (GAN) can achieve promising performance on learning complex data distributions on different types of data. In this paper, we first show a straightforward extension of existing GAN algorithm is not applicable to point clouds, because the constraint required for discriminators is undefined for set data. We propose a two fold modification to GAN algorithm for learning to generate point clouds (PC-GAN). First, we combine ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process. A key component of our method is that we train a posterior inference network for the hidden variables. Second, PC-GAN defines a generic framework that can incorporate many existing GAN algorithms. We further propose a sandwiching objective, which results in a tighter Wasserstein distance estimate than the commonly used dual form in WGAN. We validate our claims on ModelNet40 benchmark dataset. PC-GAN trained by the sandwiching objective achieves better results on test data than the existing methods by comparing with true meshes quantitatively. We also conduct studies on several tasks, including generalization on unseen point clouds, latent space interpolation, classification, and image to point clouds, to demonstrate the versatility of the proposed PC-GAN.

##### L2-Nonexpansive Neural Networks

No tl;dr =[

This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness, with a new regularization scheme for linear layers, new ways to adapt nonlinearities and a new loss function. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. They generalize better than ordinary networks from noisy data with partially random labels. Their outputs are quantitatively meaningful and indicate levels of confidence and generalization, among other desirable properties.

##### AIM: Adversarial Inference by Matching Priors and Conditionals

No tl;dr =[

Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel framework, Adversarial Inference by Matching priors and conditionals (AIM), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of AIM on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results show that AIM significantly improves both reconstruction and generation compared with other adversarial inference models.

##### Efficient Augmentation via Data Subsampling

tl;dr Selectively augmenting difficult to classify points results in efficient training.

Data augmentation is commonly used to encode invariances in learning methods. However, this process is often performed in an inefficient manner, as artificial examples are created by applying a number of transformations to all points in the training set. The resulting explosion of the dataset size can be an issue in terms of storage and training costs, as well as in selecting and tuning the optimal set of transformations to apply. In this work, we demonstrate that it is possible to significantly reduce the number of data points included in data augmentation while realizing the same accuracy and invariance benefits of augmenting the entire dataset. We propose a novel set of subsampling policies, based on model influence and loss, that can achieve a 90% reduction in augmentation set size while maintaining the accuracy gains of standard data augmentation.

##### Hiding Objects from Detectors: Exploring Transferrable Adversarial Patterns

tl;dr We focus on creating universal adversaries to fool object detectors and hide objects from the detectors.

Adversaries in neural networks have drawn much attention since their first debut. While most existing methods aim at deceiving image classification models into misclassification or crafting attacks for specific object instances in the object setection tasks, we focus on creating universal adversaries to fool object detectors and hide objects from the detectors. The adversaries we examine are universal in three ways: (1) They are not specific for specific object instances; (2) They are image-independent; (3) They can further transfer to different unknown models. To achieve this, we propose two novel techniques to improve the transferability of the adversaries: \textit{piling-up} and \textit{monochromatization}. Both techniques prove to simplify the patterns of generated adversaries, and ultimately result in higher transferability.

##### The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects

tl;dr We provide theoretical and empirical analysis on the role of anisotropic noise introduced by stochastic gradient on escaping from minima.

Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we theoretically study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in term of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We verify our understanding through comparing this anisotropic diffusion with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics) and other types of position-dependent noise.

##### Augment your batch: better training with larger batches

tl;dr Improve accuracy by large batches composed of multiple instances of each sample at the same batch

Recently, there is regained interest in large batch training of neural networks, both of theory and practice. New insights and methods allowed certain models to be trained using large batches with no adverse impact on performance. Most works focused on accelerating wall clock training time by modifying the learning rate schedule, without introducing accuracy degradation. We propose to use large batch training to boost accuracy and accelerate convergence by combining it with data augmentation. Our method, "batch augmentation", suggests using multiple instances of each sample at the same large batch. We show empirically that this simple yet effective method improves convergence and final generalization accuracy. We further suggest possible reasons for its success.

##### Laplacian Networks: Bounding Indicator Function Smoothness for Neural Networks Robustness

No tl;dr =[

For the past few years, Deep Neural Network (DNN) robustness has become a question of paramount importance. As a matter of fact, in sensitive settings misclassification can lead to dramatic consequences. Such misclassifications are likely to occur when facing adversarial attacks, hardware failures or limitations, and imperfect signal acquisition. To address this question, authors have proposed different approaches aiming at increasing the robustness of DNNs, such as adding regularizers or training using noisy examples. In this paper we propose a new regularizer built upon the Laplacian of similarity graphs obtained from the representation of training data at each layer of the DNN architecture. This regularizer penalizes large changes (across consecutive layers in the architecture) in the distance between examples of different classes, and as such enforces smooth variations of the class boundaries. Since it is agnostic to the type of deformations that are expected when predicting with the DNN, the proposed regularizer can be combined with existing ad-hoc methods. We provide theoretical justification for this regularizer and demonstrate its effectiveness to improve robustness of DNNs on classical supervised learning vision datasets.

##### Low-Cost Parameterizations of Deep Convolutional Neural Networks

tl;dr This paper introduces efficient and economic parametrizations of convolutional neural networks motivated by partial differential equations

Convolutional Neural Networks (CNNs) filter the input data using a series of spatial convolution operators with compactly supported stencils and point-wise nonlinearities. Commonly, the convolution operators couple features from all channels. For wide networks, this leads to immense computational cost in the training of and prediction with CNNs. In this paper, we present novel ways to parameterize the convolution more efficiently, aiming to decrease the number of parameters in CNNs and their computational complexity. We propose new architectures that use a sparser coupling between the channels and thereby reduce both the number of trainable weights and the computational cost of the CNN. Our architectures arise as new types of residual neural network (ResNet) that can be seen as discretizations of a Partial Differential Equations (PDEs) and thus have predictable theoretical properties. Our first architecture involves a convolution operator with a special sparsity structure, and is applicable to a large class of CNNs. Next, we present an architecture that can be seen as a discretization of a diffusion reaction PDE, and use it with three different convolution operators. We outline in our experiments that the proposed architectures, although considerably reducing the number of trainable weights, yield comparable accuracy to existing CNNs that are fully coupled in the channel dimension.

##### Stochastic Optimization of Sorting Networks via Continuous Relaxations

tl;dr We provide a continuous relaxation to the sorting operator, enabling end-to-end, gradient-based stochastic optimization.

Sorting input objects is an important step within many machine learning pipelines. However, the sorting operator is non-differentiable w.r.t. its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of "unimodal matrices". Further, we use this relaxation to enable more efficient stochastic optimization over the combinatorially large space of permutations. In particular, we derive a reparameterized gradient estimator for the widely used Plackett-Luce family of distributions. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects.

##### DL2: Training and Querying Neural Networks with Logic

tl;dr A differentiable loss for logic constraints for training and querying neural networks.

We present DL2, a system for training and querying neural networks with logical constraints. The key idea is to translate these constraints into a differentiable loss with desirable mathematical properties and to then either train with this loss in an iterative manner or to use the loss for querying the network for inputs subject to the constraints. We empirically demonstrate that DL2 is effective in both training and querying scenarios, across a range of constraints and data sets.

##### UNSUPERVISED MONOCULAR DEPTH ESTIMATION WITH CLEAR BOUNDARIES

tl;dr This paper propose a mask method which solves the previous blurred results of unsupervised monocular depth estimation caused by occlusion

Unsupervised monocular depth estimation has made great progress after deep learning is involved. Training with binocular stereo images is considered as a good option as the data can be easily obtained. However, the depth or disparity prediction results show poor performance for the object boundaries. The main reason is related to the handling of occlusion areas during the training. In this paper, we propose a novel method to overcome this issue. Exploiting disparity maps property, we generate an occlusion mask to block the back-propagation of the occlusion areas during image warping. We also design new networks with flipped stereo images to induce the networks to learn occluded boundaries. It shows that our method achieves clearer boundaries and better evaluation results on KITTI driving dataset and Virtual KITTI dataset.

##### Fast adversarial training for semi-supervised learning

tl;dr We propose a fast and efficient semi-supervised learning method using adversarial training.

##### Slimmable Neural Networks

tl;dr We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime.

We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width multipliers, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve similar (and in many cases better) ImageNet classification accuracy than individually trained models of MobileNet v1, MobileNet v2, ShuffleNet and ResNet-50 at different widths respectively. We also demonstrate better performance of slimmable models compared with individual ones across a wide range of applications including COCO bounding-box object detection, instance segmentation and person keypoint detection without tuning hyper-parameters. Lastly we visualize and discuss the learned features of slimmable networks. Code and models will be released.

##### Analysing Mathematical Reasoning Abilities of Neural Models

tl;dr A dataset for testing mathematical reasoning (and algebraic generalization), and results on current sequence-to-sequence models.

Mathematical reasoning---a core ability within human intelligence---presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar system, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test spits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.

##### Prior Networks for Detection of Adversarial Attacks

tl;dr We show that it is possible to successfully detect a range of adversarial attacks using measures of uncertainty derived from Prior Networks.

No tl;dr =[

##### LARGE BATCH SIZE TRAINING OF NEURAL NETWORKS WITH ADVERSARIAL TRAINING AND SECOND-ORDER INFORMATION

tl;dr Large batch size training using adversarial training and second order information

Stochastic Gradient Descent (SGD) methods using randomly selected batches are widely-used to train neural network (NN) models. Performing design exploration to find the best NN for a particular task often requires extensive training with different models on a large dataset, which is very computationally expensive. The most straightforward method to accelerate this computation is to distribute the batch of SGD over multiple processors. However, large batch training often times leads to degradation in accuracy, poor generalization, and even poor robustness to adversarial attacks. Existing solutions for large batch training either do not work or require massive hyper-parameter tuning. To address this issue, we propose a novel large batch training method which combines recent results in adversarial training (to regularize against sharp minima'') and second order optimization (to use curvature information to change batch size adaptively during training). We extensively evaluate our method on Cifar-10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple NNs, including residual networks as well as compressed networks such as SqueezeNext. Our new approach exceeds the performance of the existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $3\times$, respectively). We emphasize that this is achieved without any additional hyper-parameter tuning to tailor our method to any of these experiments.

##### Minimum Divergence vs. Maximum Margin: an Empirical Comparison on Seq2Seq Models

No tl;dr =[

Sequence to sequence (seq2seq) models have become a popular framework for neural sequence prediction. While traditional seq2seq models are trained by Maximum Likelihood Estimation (MLE), much recent work has made various attempts to optimize evaluation scores directly to solve the mismatch between training and evaluation, since model predictions are usually evaluated by a task specific evaluation metric like BLEU or ROUGE scores instead of perplexity. This paper for the first time puts this existing work into two categories, a) minimum divergence, and b) maximum margin. We introduce a new training criterion based on the analysis of existing work, and empirically compare models in the two categories. Our experimental results show that training criteria based on the idea of minimum divergence can usually work better than maximum margin methods, on both the tasks of machine translation and sentence summarization.

##### Convolutional CRFs for Semantic Segmentation

tl;dr We propose Convolutional CRFs a fast, powerful and trainable alternative to Fully Connected CRFs.

For the challenging semantic image segmentation task the best performing models have traditionally combined the structured modelling capabilities of Conditional Random Fields (CRFs) with the feature extraction power of CNNs. In more recent works however, CRF post-processing has fallen out of favour. We argue that this is mainly due to the slow training and inference speeds of CRFs, as well as the difficulty of learning the internal CRF parameters. To overcome both issues we propose to add the assumption of conditional independence to the framework of fully-connected CRFs. This allows us to reformulate the inference in terms of convolutions, which can be implemented highly efficiently on GPUs.Doing so speeds up inference and training by two orders of magnitude. All parameters of the convolutional CRFs can easily be optimized using backpropagation. Towards the goal of facilitating further CRF research we have made our implementations publicly available.

##### Trace-back along capsules and its application on semantic segmentation

tl;dr A capsule-based semantic segmentation, which the probabilities of the class labels are traced back through capsule layers.

In this paper, we propose a capsule-based neural network model to solve the semantic segmentation problem. By taking advantage of the extractable part-whole dependencies available in capsule layers, we derive the probabilities of the class labels for individual capsules through a layer-by-layer recursive procedure. We model this procedure as a traceback layer, and take it as a central piece to build an end-to-end segmentation network. In addition to object boundaries, image-level class labels are also explicitly sought in our model, which poses a significant advantage over the state-of-the-art fully convolutional network (FCN) solutions. Experiments conducted on modified MNIST and neuroimages demonstrate that our model considerably enhance the segmentation performance compared to the leading FCN variant.

##### Bamboo: Ball-Shape Data Augmentation Against Adversarial Attacks from All Directions

tl;dr The first data augmentation method specially designed for improving the general robustness of DNN without any hypothesis on the attacking algorithms.

Deep neural networks (DNNs) are widely adopted in real-world cognitive applications because of their high accuracy. The robustness of DNN models, however, has been recently challenged by adversarial attacks where small disturbance on input samples may result in misclassification. State-of-the-art defending algorithms, such as adversarial training or robust optimization, improve DNNs' resilience to adversarial attacks by paying high computational costs. Moreover, these approaches are usually designed to defend one or a few known attacking techniques only. The effectiveness to defend other types of attacking methods, especially those that have not yet been discovered or explored, cannot be guaranteed. This work aims for a general approach of enhancing the robustness of DNN models under adversarial attacks. In particular, we propose Bamboo -- the first data augmentation method designed for improving the general robustness of DNN without any hypothesis on the attacking algorithms. Bamboo augments the training data set with a small amount of data uniformly sampled on a fixed radius ball around each training data and hence, effectively increase the distance between natural data points and decision boundary. Our experiments show that Bamboo substantially improve the general robustness against arbitrary types of attacks and noises, achieving better results comparing to previous adversarial training methods, robust optimization methods and other data augmentation methods with the same amount of data points.

##### Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks

tl;dr Interpretation by Identifying model-learned features that serve as indicators for the task of interest. Explain model decisions by highlighting the response of these features in test data. Evaluate explanations objectively with a controlled dataset.

Visual Interpretation and explanation of deep models is critical towards wide adoption of systems that rely on them. In this paper, we propose a novel scheme for both interpretation as well as explanation in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without relying on additional annotations. We interpret the model through average visualizations of this reduced set of features. Then, at test time, we explain the network prediction by accompanying the predicted class label with supporting visualizations derived from the identified features. In addition, we propose a method to address the artifacts introduced by strided operations in deconvNet-based visualizations. Moreover, we introduce an8Flower , a dataset specifically designed for objective quantitative evaluation of methods for visual explanation. Experiments on the MNIST , ILSVRC 12, Fashion 144k and an8Flower datasets show that our method produces detailed explanations with good coverage of relevant features of the classes of interest.

##### How to train your MAML

tl;dr MAML is great, but it has many problems, we solve many of those problems and as a result we learn most hyper parameters end to end, speed-up training and inference and set a new SOTA in few-shot learning

The field of few-shot learning has recently seen substantial advancements. Most of these advancements came from casting few-shot learning as a meta-learning problem.Model Agnostic Meta Learning or MAML is currently one of the best approaches for few-shot learning via meta-learning. MAML is simple, elegant and very powerful, however, it has a variety of issues, such as being very sensitive to neural network architectures, often leading to instability during training, requiring arduous hyperparameter searches to stabilize training and achieve high generalization and being very computationally expensive at both training and inference times. In this paper, we propose various modifications to MAML that not only stabilize the system, but also substantially improve the generalization performance, convergence speed and computational overhead of MAML, which we call MAML++.

##### Convolutional Neural Networks combined with Runge-Kutta Methods

No tl;dr =[

A convolutional neural network for image classification can be constructed mathematically since it is inspired by the ventral stream in visual cortex which can be regarded as a multi-period dynamical system. In this paper, a novel approach is proposed to construct network models from the dynamical systems view. Since a pre-activation residual network can be deemed an approximation of a time-dependent dynamical system using the Euler method, higher order Runge-Kutta methods (RK methods) can be utilized to build network models in order to achieve higher accuracy. The model constructed in such a way is referred to as the Runge-Kutta Convolutional Neural Network (RKNet). RK methods also provide an interpretation of Dense Convolutional Networks (DenseNets) and Convolutional Neural Networks with Alternately Updated Clique (CliqueNets) from the dynamical systems view. The proposed methods are evaluated on the benchmark datasets: CIFAR-10/100 and ImageNet. The experimental results are consistent with the theoretical properties of RK methods and support the dynamical systems interpretation. Moreover, the experimental results show that the RKNets are superior to the state-of-the-art network models on CIFAR-10 and be comparable with them on CIFAR-100 and ImageNet.

##### Inference of unobserved event streams with neural Hawkes particle smoothing

No tl;dr =[

Events that we observe in the world may be caused by other, unobserved events. We consider sequences of discrete events in continuous time. When only some of the events are observed, we propose particle smoothing to infer the missing events. Particle smoothing is an extension of particle filtering in which proposed events are conditioned on the future as well as the past. For our setting, we develop a novel proposal distribution that is a type of continuous-time bidirectional LSTM. We use the sampled particles in an approximate minimum Bayes risk decoder that outputs a single low-risk prediction of the missing events. We experiment in multiple synthetic and real domains, modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner, 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events. In particular, particle smoothing consistently improves upon particle filtering, showing the benefit of training a bidirectional proposal distribution.

##### Empirical Study of Easy and Hard Examples in CNN Training

tl;dr Unknown properties of easy and hard examples are shown, and they come from biases in a dataset and SGD.

Deep Neural Networks (DNNs) generalize well despite their massive size and capability of memorizing all examples. There is a hypothesis that DNNs start learning from simple patterns based on the observations that are consistently well-classified at early epochs (i.e., easy examples) and examples misclassified (i.e., hard examples). However, despite the importance of understanding the learning dynamics of DNNs, properties of easy and hard examples are not fully investigated. In this paper, we study the similarities of easy and hard examples respectively among different CNNs, assessing those examples’ contributions to generalization. Our results show that most easy examples are identical among different CNNs, as they share similar dataset-dependent patterns (e.g., colors, structures, and superficial cues in high-frequency). Moreover, while hard examples tend to contribute more to generalization than easy examples, removing a large number of easy examples leads to poor generalization, and we find that most misclassified examples in validation dataset are hard examples. By analyzing intriguing properties of easy and hard examples, we discover that the reason why easy and hard examples have such properties can be explained by biases in a dataset and Stochastic Gradient Descent (SGD).

##### Pooling Is Neither Necessary nor Sufficient for Appropriate Deformation Stability in CNNs

tl;dr We find that pooling alone does not determine deformation stability in CNNs and that filter smoothness plays an important role in determining stability.

Many of our core assumptions about how neural networks operate remain empirically untested. One common assumption is that convolutional neural networks need to be stable to small translations and deformations to solve image recognition tasks. For many years, this stability was baked into CNN architectures by incorporating interleaved pooling layers. Recently, however, interleaved pooling has largely been abandoned. This raises a number of questions: Are our intuitions about deformation stability right at all? Is it important? Is pooling necessary for deformation invariance? If not, how is deformation invariance achieved in its absence? In this work, we rigorously test these questions, and find that deformation stability in convolutional networks is more nuanced than it first appears: (1) Deformation invariance is not a binary property, but rather that different tasks require different degrees of deformation stability at different layers. (2) Deformation stability is not a fixed property of a network and is heavily adjusted over the course of training, largely through the smoothness of the convolutional filters. (3) Interleaved pooling layers are neither necessary nor sufficient for achieving the optimal form of deformation stability for natural image classification. (4) Pooling confers \emph{too much} deformation stability for image classification at initialization, and during training, networks have to learn to \emph{counteract} this inductive bias. Together, these findings provide new insights into the role of interleaved pooling and deformation invariance in CNNs, and demonstrate the importance of rigorous empirical testing of even our most basic assumptions about the working of neural networks.

##### Active Learning with Partial Feedback

No tl;dr =[

While many active learning papers assume that the learner can simply ask for a label and receive it, real annotation often presents a mismatch between the form of a label (say, one among many classes), and the form of an annotation (typically yes/no binary feedback). To annotate examples corpora for multiclass classification, we might need to ask multiple yes/no questions, exploiting a label hierarchy if one is available. To address this more realistic setting, we propose active learning with partial feedback (ALPF), where the learner must actively choose both which example to label and which binary question to ask. At each step, the learner selects an example, asking if it belongs to a chosen (possibly composite) class. Each answer eliminates some classes, leaving the learner with a partial label. The learner may then either ask more questions about the same example (until an exact label is uncovered) or move on immediately, leaving the first example partially labeled. Active learning with partial labels requires (i) a sampling strategy to choose (example, class) pairs, and (ii) learning from partial labels between rounds. Experiments on Tiny ImageNet demonstrate that our most effective method improves 26% (relative) in top-1 classification accuracy compared to i.i.d. baselines and standard active learners given 30% of the annotation budget that would be required (naively) to annotate the dataset. Moreover, ALPF-learners fully annotate TinyImageNet at 42% lower cost. Surprisingly, we observe that accounting for per-example annotation costs can alter the conventional wisdom that active learners should solicit labels for hard examples.

##### Clean-Label Backdoor Attacks

tl;dr We show how to successfully perform backdoor attacks without changing training labels.

Deep neural networks have been recently demonstrated to be vulnerable to backdoor attacks. Specifically, by altering a small set of training examples, an adversary can install a backdoor that is able to be used during inference to fully control the model's behavior. While the attack is very powerful, it crucially relies on the adversary being able to introduce arbitrary, often clearly mislabeled, inputs to the training set and can thus be foiled even by fairly rudimentary data sanitization. In this paper, we introduce a new approach to executing backdoor attacks. This approach utilizes adversarial examples and GAN-generated data. The key feature is that the resulting poisoned inputs appear to be consistent with their label and thus seem benign even upon human inspection.

##### On the loss landscape of a class of deep neural networks with no bad local valleys

No tl;dr =[

We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.

##### Improved resistance of neural networks to adversarial images through generative pre-training

tl;dr Generative pre-training with mean field Boltzmann machines increases robustness against adversarial images in neural networks.

We train a feed forward neural network with increased robustness against adversarial attacks compared to conventional training approaches. This is achieved using a novel pre-trained building block based on a mean field description of a Boltzmann machine. On the MNIST dataset the method achieves strong adversarial resistance without data augmentation or adversarial training. We show that the increased adversarial resistance is correlated with the generative performance of the underlying Boltzmann machine.

##### Towards Understanding Regularization in Batch Normalization

No tl;dr =[

Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work understands these phenomena theoretically. We analyze BN by using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impacts of BN in three aspects. First, by viewing BN as an implicit regularizer, BN can be decomposed into population normalization (PN) and gamma decay as an explicit regularization. Second, learning dynamics of BN and the regularization show that training converged with large maximum and effective learning rate. Third, generalization of BN is explored by using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks share the same traits of regularization as the above analyses.

##### Predicting the Generalization Gap in Deep Networks with Margin Distributions

tl;dr We develop a new scheme to predict the generalization gap in deep networks with high accuracy.

As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how generalization gap should be predicted from the training data and network parameters. In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of margin distribution, which are the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and the CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of margin distribution rather than just the margin (closest distance to decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.

##### Theoretical and Empirical Study of Adversarial Examples

No tl;dr =[

Many techniques are developed to defend against adversarial examples at scale. So far, the most successful defenses generate adversarial examples during each training step and add them to the training data. Yet, this brings significant computational overhead. In this paper, we investigate defenses against adversarial attacks. First, we propose feature smoothing, a simple data augmentation method with little computational overhead. Essentially, feature smoothing trains a neural network on virtual training data as an interpolation of features from a pair of samples, with the new label remaining the same as the dominant data point. The intuition behind feature smoothing is to generate virtual data points as close as adversarial examples, and to avoid the computational burden of generating data during training. Our experiments on MNIST and CIFAR10 datasets explore different combinations of known regularization and data augmentation methods and show that feature smoothing with logit squeezing performs best for both adversarial and clean accuracy. Second, we propose an unified framework to understand the connections and differences among different efficient methods by analyzing the biases and variances of decision boundary. We show that under some symmetrical assumptions, label smoothing, logit squeezing, weight decay, mix up and feature smoothing all produce an unbiased estimation of the decision boundary with smaller estimated variance. All of those methods except weight decay are also stable when the assumptions no longer hold.

##### Learning Domain-Invariant Representation under Domain-Class Dependency

tl;dr Address the trade-off caused by the dependency of classes on domains in domain generalization

Learning domain-invariant representation is a dominant approach for domain generalization, where we need to build a classifier that is robust toward domain shifts induced by change of users, acoustic or lighting conditions, etc. However, prior domain-invariance-based methods overlooked the underlying dependency of classes (target variable) on domains during optimization, which causes the trade-off between classification accuracy and domain-invariance, and often interferes with the domain generalization performance. This study first provides the notion of domain generalization under domain-class dependency and elaborates on the importance of considering the dependency by expanding the analysis of Xie et al. (2017). We then propose a method, invariant feature learning under optimal classifier constrains (IFLOC), which explicitly considers the dependency and maintains accuracy while improving domain-invariance. Specifically, the proposed method regularizes the representation so that it has as much domain information as the class labels, unlike prior methods that remove all domain information. Empirical validations show the superior performance of IFLOC to baseline methods, supporting the importance of the domain-class dependency in domain generalization and the efficacy of the proposed method for overcoming the issue.

##### Self-Binarizing Networks

tl;dr A method to binarize both weights and activations of a deep neural network that is efficient in computation and memory usage and performs better than the state-of-the-art.

We present a method to train self-binarizing neural networks, that is, networks that evolve their weights and activations during training to become binary. To obtain similar binary networks, existing methods rely on the sign activation function. This function, however, has no gradients for non-zero values, which makes standard backpropagation impossible. To circumvent the difficulty of training a network relying on the sign activation function, these methods alternate between floating-point and binary representations of the network during training, which is sub-optimal and inefficient. We approach the binarization task by training on a unique representation involving a smooth activation function, which is iteratively sharpened during training until it becomes a binary representation equivalent to the sign activation function. Additionally, we introduce a new technique to perform binary batch normalization that simplifies the conventional batch normalization by transforming it into a simple comparison operation. This is unlike existing methods, which are forced to the retain the conventional floating-point-based batch normalization. Our binary networks, apart from displaying advantages of lower memory and computation as compared to conventional floating-point and binary networks, also show higher classification accuracy than existing state-of-the-art methods on multiple benchmark datasets.

##### Why do deep convolutional networks generalize so poorly to small image transformations?

tl;dr Modern deep CNNs are not invariant to translations, scalings and other realistic image transformations, and this lack of invariance is related to the subsampling operation and the biases contained in image datasets.

Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, we see these failures to generalize more frequently in more modern networks. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets makes it unlikely that CNNs will learn to be invariant to these transformations. Taken together our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.

##### Large-Scale Visual Speech Recognition

No tl;dr =[

This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.

##### Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

tl;dr We propose ImageNet-C to measure classifier corruption robustness and ImageNet-P to measure perturbation robustness

In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier's robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize.

##### PPD: Permutation Phase Defense Against Adversarial Examples in Deep Learning

tl;dr Permutation phase defense is proposed as a novel method to guard against adversarial attacks in deep learning.

Deep neural networks have demonstrated cutting edge performance on various tasks including classification. However, it is well known that adversarially designed imperceptible perturbation of the input can mislead advanced classifiers. In this paper, Permutation Phase Defense (PPD), is proposed as a novel method to resist adversarial attacks. PPD combines random permutation of the image with phase component of its Fourier transform. The basic idea behind this approach is to turn adversarial defense problems analogously into symmetric cryptography, which relies solely on safekeeping of the keys for security. In PPD, safe keeping of the selected permutation ensures effectiveness against adversarial attacks. Testing PPD on MNIST and CIFAR-10 datasets yielded state-of-the-art robustness against the most powerful adversarial attacks currently available.

##### Towards a better understanding of Vector Quantized Autoencoders

tl;dr Understand the VQ-VAE discrete autoencoder systematically using EM and use it to design non-autogressive translation model matching a strong autoregressive baseline.

Deep neural networks with discrete latent variables offer the promise of better symbolic reasoning, and learning abstractions that are more useful to new tasks. There has been a surge in interest in discrete latent variable models, however, despite several recent improvements, the training of discrete latent variable models has remained challenging and their performance has mostly failed to match their continuous counterparts. Recent work on vector quantized autoencoders (VQ-VAE) has made substantial progress in this direction, with its perplexity almost matching that of a VAE on datasets such as CIFAR-10. In this work, we investigate an alternate training technique for VQ-VAE, inspired by its connection to the Expectation Maximization (EM) algorithm. Training the discrete bottleneck with EM helps us achieve better image generation results on CIFAR-10, and together with knowledge distillation, allows us to develop a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference.

tl;dr We improve the performance of ADDA by incorporating task knowledge into the adversarial loss functions and treating the discriminator as a denoising autoencoder.

##### On Regularization and Robustness of Deep Neural Networks

No tl;dr =[

Despite their success, deep neural networks suffer from several drawbacks: they lack robustness to small changes of input data known as "adversarial examples" and training them with small amounts of annotated data is challenging. In this work, we study the connection between regularization and robustness by viewing neural networks as elements of a reproducing kernel Hilbert space (RKHS) of functions and by regularizing them using the RKHS norm. Even though this norm cannot be computed, we consider various approximations based on upper and lower bounds. These approximations lead to new strategies for regularization, but also to existing ones such as spectral norm penalties or constraints, gradient penalties, or adversarial training. Besides, the kernel framework allows us to obtain margin-based bounds on adversarial generalization. We study the obtained algorithms for learning on small datasets, learning adversarially robust models, and discuss implications for learning implicit generative models.

##### Learning Unsupervised Learning Rules

tl;dr We learn an unsupervised learning algorithm that produces useful representations from a set of supervised tasks. At test-time, we apply this algorithm to new tasks without any supervision and show performance comparable to a VAE.

##### How Training Data Affect the Accuracy and Robustness of Image Classification Models

No tl;dr =[

Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, we find that model accuracy improves monotonically with increased training data. Similarly, model robustness also improves, but starts to deteriorate when training data continue to increase. The occurrence of turning points depends on the deep neural network as well as the dataset on which it is trained.

##### Knowledge Distillation from Few Samples

tl;dr This paper proposes a novel and simple method for knowledge distillation from few samples.

Current knowledge distillation methods require full training data to distill knowledge from a large "teacher" network to a compact "student" network by matching certain statistics between "teacher" and "student" such as softmax outputs and feature responses. This is not only time-consuming but also inconsistent with human cognition in which children can learn knowledge from adults with few examples. This paper proposes a novel and simple method for knowledge distillation from few samples. Taking the assumption that both "teacher" and "student" have the same feature map sizes at each corresponding block, we add a 1x1 conv-layer at the end of each block in the student network, and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer with limited samples. We prove that the added layer can be absorbed into the previous conv-layer so that no extra parameters and computation are introduced. Experiments show that the proposed method can recover a student network's top-1 accuracy on ImageNet from 0.2% to 62.7% with just 1000 samples in a few minutes, and is effective on various ways for constructing student networks.

##### A Modern Take on the Bias-Variance Tradeoff in Neural Networks

tl;dr We revisit empirically and theoretically the bias-variance tradeoff for neural networks to shed more light on their generalization properties.

We revisit the bias-variance tradeoff for neural networks in light of modern empirical findings. The traditional bias-variance tradeoff in machine learning suggests that as model complexity grows, variance increases. Classical bounds in statistical learning theory point to the number of parameters in a model as a measure of model complexity, which means the tradeoff would indicate that variance increases with the size of neural networks. However, we empirically find that variance due to training set sampling is roughly constant (with both width and depth) in practice. Variance caused by the non-convexity of the loss landscape is different. We find that it decreases with width and increases with depth, in our setting. We provide theoretical analysis, in a simplified setting inspired by linear models, that is consistent with our empirical findings for width. We view bias-variance as a useful lens to study generalization through and encourage further theoretical explanation from this perspective.

##### Recycling the discriminator for improving the inference mapping of GAN

No tl;dr =[

Generative adversarial networks (GANs) have achieved outstanding success in generating the high-quality data. Focusing on the generation process, existing GANs learn a unidirectional mapping from the latent vector to the data. Later, various studies point out that the latent space of GANs is semantically meaningful and can be utilized in advanced data analysis and manipulation. In order to analyze the real data in the latent space of GANs, it is necessary to investigate the inverse generation mapping from the data to the latent vector. To tackle this problem, the bidirectional generative models introduce an encoder to establish the inverse path of the generation process. Unfortunately, this effort leads to the degradation of generation quality because the imperfect generator rather interferes the encoder training and vice versa. In this paper, we propose an effective algorithm to infer the latent vector based on existing unidirectional GANs by preserving their generation quality. It is important to note that we focus on increasing the accuracy and efficiency of the inference mapping but not influencing the GAN performance (i.e., the quality or the diversity of the generated sample). Furthermore, utilizing the proposed inference mapping algorithm, we suggest a new metric for evaluating the GAN models by measuring the reconstruction error of unseen real data. The experimental analysis demonstrates that the proposed algorithm achieves more accurate inference mapping than the existing method and provides the robust metric for evaluating GAN performance.

##### Reliable Uncertainty Estimates in Deep Neural Networks using Noise Contrastive Priors

tl;dr We train neural networks to be uncertain on noisy inputs to avoid overconfident predictions outside of the training distribution.

Obtaining reliable uncertainty estimates of neural network predictions is a long standing challenge. Bayesian neural networks have been proposed as a solution, but it remains open how to specify the prior. In particular, the common practice of a standard normal prior in weight space imposes only weak regularities, causing the function posterior to possibly generalize in unforeseen ways on out-of-distribution inputs. We propose noise contrastive priors (NCPs). The key idea is to train the model to output high uncertainty for data points outside of the training distribution. NCPs do so using an input prior, which adds noise to the inputs of the current mini batch, and an output prior, which is a wide distribution given these inputs. NCPs are compatible with any model that represents predictive uncertainty, are easy to scale, and yield reliable uncertainty estimates throughout training. Empirically, we show that NCPs prevent overfitting outside of the training distribution and result in uncertainty estimates that are useful for active learning. We demonstrate the scalability of our method on the flight delays data set, where we significantly improve upon previously published results.

##### How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification

No tl;dr =[

Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, besides confirming that the model accuracy improves as the amount of training data increases, we also observe that the model robustness improves initially, but there exists a turning point after which robustness starts to decrease. How and when such turning points occur vary for different neural networks and different datasets.

tl;dr A new nonlinear defense against adversarial attacks.

The linear and non-flexible nature of deep convolutional models makes them vulnerable to carefully crafted adversarial perturbations. To tackle this problem, in this paper, we propose a nonlinear radial basis convolutional feature transformation by learning the Mahalanobis distance function that maps the input convolutional features from the same class into tight clusters. In such a space, the clusters become compact and well-separated, which prevent small adversarial perturbations from forcing a sample to cross the decision boundary. We test the proposed method on three publicly available image classification and segmentation data-sets namely, MNIST, ISBI ISIC skin lesion, and NIH ChestX-ray14. We evaluate the robustness of our method to different gradient (targeted and untargeted) and non-gradient based attacks and compare it to several non-gradient masking defense strategies. Our results demonstrate that the proposed method can boost the performance of deep convolutional neural networks against adversarial perturbations without accuracy drop on clean data.

##### MARGINALIZED AVERAGE ATTENTIONAL NETWORK FOR WEAKLY-SUPERVISED LEARNING

tl;dr A novel marginalized average attentional network for weakly-supervised temporal action localization

In weakly-supervised temporal action localization, previous works suffer from overestimating the most salient regions and fail to locate dense and integral regions for each entire action. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples the subsets from the video snippet features based on the latent discriminative probabilities and takes the expectation over all the subset features. Theoretically, we prove that the learned latent discriminative probabilities reduce the difference of responses between the most salient regions and the others, and thus MAAN generates better class activation sequences to identify more dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that our MAAN achieves superior performance on weakly-supervised temporal action localization task.

##### UNSUPERVISED CONVOLUTIONAL NEURAL NETWORKS FOR ACCURATE VIDEO FRAME INTERPOLATION WITH INTEGRATION OF MOTION COMPONENTS

No tl;dr =[

Optical flow and video frame interpolation are considered as a chicken-egg problem such that one problem affects the other and vice versa. This paper presents a deep neural network that integrates the flow network into the frame interpolation problem, with end-to-end learning. The proposed approach exploits the relationship between the two problems for quality enhancement of interpolation frames. Unlike recent convolutional neural networks, the proposed approach learns motions from natural video frames without graphical ground truth flows for training. This makes the network learn from extensive data and improve the performance. The motion information from the flow network guides interpolator networks to be trained to synthesize the interpolated frame accurately from motion scenarios. In addition, diverse datasets to cover various challenging cases that previous interpolations usually fail in is used for comparison. In all experimental datasets, the proposed network achieves better performance than state-of-art CNN based interpolations. With Middebury benchmark, compared with the top-ranked algorithm, the proposed network reduces an average interpolation error by about 9.3%. The proposed interpolation is ranked the 1st in Standard Deviation (SD) interpolation error, the 2nd in Average Interpolation Error among over 150 algorithms listed in the Middlebury interpolation benchmark.

##### Improved robustness to adversarial examples using Lipschitz regularization of the loss

tl;dr Improvements to adversarial robustness, as well as provable robustness guarantees, are obtained by augmenting adversarial training with a tractable Lipschitz regularization

Adversarial training is an effective method for improving robustness to adversarial attacks. We show that adversarial training using the Fast Signed Gradient Method can be interpreted as a form of regularization. We implemented a more effective form of adversarial training, which in turn can be interpreted as regularization of the loss in the 2-norm, $\|\nabla_x \ell(x)\|_2$. We obtained further improvements to adversarial robustness, as well as provable robustness guarantees, by augmenting adversarial training with Lipschitz regularization.

##### A Closer Look at Few-shot Classification

tl;dr A detailed empirical study in few-shot classification that revealing challenges in standard evaluation setting and showing a new direction.

Few-shot classiﬁcation aims to learn a classiﬁer to recognize unseen classes during training with limited labeled examples. While signiﬁcant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difﬁcult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classiﬁcation algorithms, with results showing that deeper backbones signiﬁcantly reduce the gap across methods including the baseline, 2) a slightly modiﬁed baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classiﬁcation algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic, cross-domain evaluation setting, we show that a baseline method with a standard ﬁne-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.

##### An Alarm System for Segmentation Algorithm Based on Shape Model

tl;dr We use VAE to capture the shape feature for automatic segmentation evaluation

It is usually hard for a learning system to predict correctly on the rare events, and there is no exception for segmentation algorithms. Therefore, we hope to build an alarm system to set off alarms when the segmentation result is possibly unsatisfactory. One plausible solution is to project the segmentation results into a low dimensional feature space, and then learn classifiers/regressors in the feature space to predict the qualities of segmentation results. In this paper, we form the feature space using shape feature which is a strong prior information shared among different data, so it is capable to predict the qualities of segmentation results given different segmentation algorithms on different datasets. The shape feature of a segmentation result is captured using the value of loss function when the segmentation result is tested using a Variational Auto-Encoder(VAE). The VAE is trained using only the ground truth masks, therefore the bad segmentation results with bad shapes become the rare events for VAE and will result in large loss value. By utilizing this fact, the VAE is able to detect all kinds of shapes that are out of the distribution of normal shapes in ground truth (GT). Finally, we learn the representation in the one-dimensional feature space to predict the qualities of segmentation results. We evaluate our alarm system on several recent segmentation algorithms for the medical segmentation task. The segmentation algorithms perform differently on different datasets, but our system consistently provides reliable prediction on the qualities of segmentation results.

##### Estimating Information Flow in DNNs

tl;dr Deterministic deep neural networks do not discard information, but they do cluster their inputs.

We study the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information I(X;T) between the input X and internal representations T decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true I(X;T) over these networks is provably either constant (discrete X) or infinite (continuous X). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which I(X;T) is a meaningful quantity that depends on the network's parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for I(X;T) in noisy DNNs and observe compression in various models. By relating I(X;T) in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the T space. Finally, we return to the estimator of I(X;T) employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.

##### Provable Guarantees on Learning Hierarchical Generative Models with Deep CNNs

tl;dr A generative model for deep CNNs with provable theoretical guarantees that actually works

Learning deep networks is computationally hard in the general case. To show any positive theoretical results, one must make assumptions on the data distribution. Current theoretical works often make assumptions that are very far from describing real data, like sampling from Gaussian distribution or linear separability of the data. We describe an algorithm that learns convolutional neural network, assuming the data is sampled from a deep generative model that generates images level by level, where lower resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assuming the data is indeed generated according to this model (as well as additional assumptions). While we do not pretend to claim that the assumptions are realistic for natural images, we do believe that they capture some true properties of real data. Furthermore, we show that on CIFAR-10, the algorithm we analyze achieves results in the same ballpark with vanilla convolutional neural networks that are trained with SGD.

##### Relaxed Quantization for Discretized Neural Networks

tl;dr We introduce a technique that allows for gradient based training of quantized neural networks.

Neural network quantization has become an important research area due to its great impact on deployment of large models on resource constrained devices. In order to train networks that can be effectively discretized without loss of performance, we introduce a differentiable quantization procedure. Differentiability can be achieved by transforming continuous distributions over the weights and activations of the network to categorical distributions over the quantization grid. These are subsequently relaxed to continuous surrogates that can allow for efficient gradient-based optimization. We further show that stochastic rounding can be seen as a special case of the proposed approach and that under this formulation the quantization grid itself can also be optimized with gradient descent. We experimentally validate the performance of our method on MNIST, CIFAR 10 and Imagenet classification.

##### Classification of Building Noise Type/Position via Supervised Learning

tl;dr This paper presents noise type/position classification of various impact noises generated in a building which is a serious conflict issue in apartment complexes

This paper presents noise type/position classification of various impact noises generated in a building which is a serious conflict issue in apartment complexes. For this study, a collection of floor impact noise dataset is recorded with a single microphone. Noise types/positions are selected based on a report by the Floor Management Center under Korea Environmental Corporation. Using a convolutional neural networks based classifier, the impact noise signals converted to log-scaled Mel-spectrograms are classified into noise types or positions. Also, our model is evaluated on a standard environmental sound dataset ESC-50 to show extensibility on environmental sound classification.

##### MuMoMAML: Model-Agnostic Meta-Learning for Multimodal Task Distributions

tl;dr We proposed a meta-learner that generalizes across a multimodal task distribution by identifying the modes of the distribution and modulating its meta-learned prior accordingly, allowing further efficient adaptation through gradient updates.

##### Tree-Structured Recurrent Switching Linear Dynamical Systems for Multi-Scale Modeling

No tl;dr =[

Many real-world systems studied are governed by complex, nonlinear dynamics. By modeling these dynamics, we can gain insight into how these systems work, make predictions about how they will behave, and develop strategies for controlling them. While there are many methods for modeling nonlinear dynamical systems, existing techniques face a trade off between offering interpretable descriptions and making accurate predictions. Here, we develop a class of models that aims to achieve both simultaneously, smoothly interpolating between simple descriptions and more complex, yet also more accurate models. Our probabilistic model achieves this multi-scale property through of a hierarchy of locally linear dynamics that jointly approximate global nonlinear dynamics. We call it the tree-structured recurrent switching linear dynamical system. To fit this model, we present a fully-Bayesian sampling procedure using P\'{o}lya-Gamma data augmentation to allow for fast and conjugate Gibbs sampling. Through a variety of synthetic and real examples, we show how these models outperform existing methods in both interpretability and predictive capability.

##### Better Generalization with On-the-fly Dataset Denoising

tl;dr We introduce a fast and easy-to-implement algorithm that is robust to dataset noise.

Memorization in over-parameterized neural networks can severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are to hard avoid in extremely large datasets. We address this problem using the implicit regularization effect of stochastic gradient descent with large learning rates, which we find to be able to separate clean and mislabeled examples with remarkable success using loss statistics. We leverage this to identify and on-the-fly discard mislabeled examples using a threshold on their losses. This leads to On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples, while introducing almost zero computational overhead. Empirical results demonstrate the effectiveness of ODD on several datasets containing artificial and real-world mislabeled examples.

##### How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification

No tl;dr =[

Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, besides confirming that the model accuracy improves as the amount of training data increases, we also observe that the model robustness improves initially, but there exists a turning point after which robustness starts to deteriorate. How and when such turning points occur vary for different neural networks and different datasets.

##### Complement Objective Training

tl;dr We propose Complement Objective Training (COT), a new training paradigm that optimizes both the primary and complement objectives for effectively learning the parameters of neural networks.

Learning with a primary objective, such as softmax cross entropy for classification and sequence generation, has been the norm for training deep neural networks for years. Although being a widely-adopted approach, using cross entropy as the primary objective exploits mostly the information from the ground-truth class for maximizing data likelihood, and largely ignores information from the complement (incorrect) classes. We argue that, in addition to the primary objective, training also using a complement objective that leverages information from the complement classes can be effective in improving model performance. This motivates us to study a new training paradigm that maximizes the likelihood of the ground-truth class while neutralizing the probabilities of the complement classes. We conduct extensive experiments on multiple tasks ranging from computer vision to natural language understanding. The experimental results confirm that, compared to the conventional training with just one primary objective, training also with the complement objective further improves the performance of the state-of-the-art models across all tasks. In addition to the accuracy improvement, we also show that models trained with both primary and complement objectives are more robust to adversarial attacks.

##### Unsupervised Conditional Generation using noise engineered mode matching GAN

tl;dr A GAN model where an inversion mapping from the generated data space to an engineered latent space is learned such that properties of the data generating distribution are matched to those of the latent distribution.

Conditional generation refers to the process of sampling from an unknown distribution conditioned on semantics of the data. This can be achieved by augmenting the generative model with the desired semantic labels, albeit it is not straightforward in an unsupervised setting where the semantic label of every data sample is unknown. In this paper, we address this issue by proposing a method that can generate samples conditioned on the properties of a latent distribution engineered in accordance with a certain data prior. In particular, a latent space inversion network is trained in tandem with a generative adversarial network such that the modal properties of the latent space distribution are induced in the data generating distribution. We demonstrate that our model despite being fully unsupervised, is effective in learning meaningful representations through its mode matching property. We validate our method on multiple unsupervised tasks such as conditional generation, dataset attribute discovery and inference using three real world image datasets namely MNIST, CIFAR-10 and CELEB-A and show that the results are comparable to the state-of-the-art methods.

##### PRUNING WITH HINTS: AN EFFICIENT FRAMEWORK FOR MODEL ACCELERATION

tl;dr This is a work aiming for boosting all the existing pruning and mimic method.

In this paper, we propose an efficient framework to accelerate convolutional neural networks. We utilize two types of acceleration methods: pruning and hints. Pruning can reduce model size by removing channels of layers. Hints can improve the performance of student model by transferring knowledge from teacher model. We demonstrate that pruning and hints are complementary to each other. On one hand, hints can benefit pruning by maintaining similar feature representations. On the other hand, the model pruned from teacher networks is a good initialization for student model, which increases the transferability between two networks. Our approach performs pruning stage and hints stage iteratively to further improve the performance. Furthermore, we propose an algorithm to reconstruct the parameters of hints layer and make the pruned model more suitable for hints. Experiments were conducted on various tasks including classification and pose estimation. Results on CIFAR-10, ImageNet and COCO demonstrate the generalization and superiority of our framework.

##### MixFeat: Mix Feature in Latent Space Learns Discriminative Space

tl;dr We provide a novel method named MixFeat, which directly makes the latent space discriminative.

Deep learning methods perform well in various tasks. However, the over-fitting problem remains, where the performance decreases for unknown data. We here provide a novel method named MixFeat, which directly makes the latent space discriminative. MixFeat mixes two feature maps in each latent space and uses one of their labels for learning. We report improved results obtained using existing network models with MixFeat on CIFAR-10/100 datasets. In addition, we show that MixFeat effectively reduces the over-fitting problem even in the case that the training dataset is small or contains errors. We argue that MixFeat is complementary with existing methods that mix both images and labels, in that MixFeat is suitable for discrimination tasks while existing methods are suitable for regression tasks. MixFeat is easy to implement and can be added to various network models without additional computational cost in the inference phase.

##### Outlier Detection from Image Data

tl;dr A novel approach that detects outliers from image data, while at the same time preserving the classification accuracy of the multi-class classification problem

Modern applications from Autonomous Vehicles to Video Surveillance generate massive amounts of image data. In this work we propose a novel image outlier detection approach (IOD for short) that leverages the cutting-edge image classifier to discover outliers without using any labeled outlier. We observe that although intuitively the confidence that a convolutional neural network (CNN) has that an image belongs to a particular class could serve as outlierness measure to each image, directly applying this confidence to detect outlier does not work well. This is because CNN often has high confidence on an outlier image that does not belong to any target class due to its generalization ability that ensures the high accuracy in classification. To solve this issue, we propose a Deep Neural Forest-based approach that harmonizes the contradictory requirements of accurately classifying images and correctly detecting the outlier images. Our experiments using several benchmark image datasets including MNIST, CIFAR-10, CIFAR-100, and SVHN demonstrate the effectiveness of our IOD approach for outlier detection, capturing more than 90% of outliers generated by injecting one image dataset into another, while still preserving the classification accuracy of the multi-class classification problem.

##### The Limitations of Adversarial Training and the Blind-Spot Attack

tl;dr We show that the effectiveness of adversarial training procedure on test set has a strong correlation with the distance between the test point and the manifold of training data.

The adversarial training procedure proposed by Madry et. al. is one of the most effective methods to defend against adversarial examples on deep neuron networks (DNNs). Despite being very effective on MNIST, adversarial training on larger datasets like CIFAR and ImageNet achieves much worse results. In our paper, we shed some lights on the practicality and hardness of adversarial training by first showing that the effectiveness of adversarial training procedure on test set has a strong correlation with the distance between the test point and the manifold of training data. The test examples that are relatively far away from the distribution of training dataset are more likely to be vulnerable to adversarial examples. Consequentially, adversarial training based defense is susceptible to a new class of attacks (“blind-spot attack”) where the input image resides in a “blind-spot” in the empirical distribution of training data but is still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high dimensional and complex data manifold (CIFAR, ImageNet, etc), the existence of blind-spots in adversarial training makes the defense on any valid test examples almost impossible due to the curse of dimensionality.

tl;dr We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks against adversarial perturbations.

We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks vis-a-vis adversarial perturbations. Our regularizer can be derived as a controlled approximation from first principles, leveraging the fundamental link between training with noise and regularization. It adds very little computational overhead during learning and is simple to implement generically in standard deep learning frameworks. Our experiments provide strong evidence that structured gradient regularization can act as an effective first line of defense against attacks based on long-range correlated signal corruptions.

##### Deep Anomaly Detection with Outlier Exposure

tl;dr We teach anomaly detection methods to learn heuristics for spotting new anomalies; experiments are in NLP and vision settings

It is important to detect and handle anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data commonly used by deep learning systems are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). In extensive experiments in vision and natural language processing settings, we find that Outlier Exposure significantly improves the performance of existing anomaly detectors, including detectors based on density estimation, and that OE improves classifier calibration in the presence of anomalous inputs. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.

##### ProxQuant: Quantized Neural Networks via Proximal Operators

tl;dr A principled framework for model quantization using the proximal gradient method.

To make deep neural networks feasible in resource-constrained environments (such as mobile devices), it is beneficial to quantize models by using low-precision weights. One common technique for quantizing neural networks is the straight-through gradient method, which enables back-propagation through the quantization mapping. Despite its empirical success, little is understood about why the straight-through gradient method works. Building upon a novel observation that the straight-through gradient method is in fact identical to the well-known Nesterov’s dual-averaging algorithm on a quantization constrained optimization problem, we propose a more principled alternative approach, called ProxQuant , that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method. ProxQuant does back-propagation on the underlying full-precision vector and applies an efficient prox-operator in between stochastic gradient steps to encourage quantizedness. For quantizing ResNets and LSTMs, ProxQuant outperforms state-of-the-art results on binary quantization and is on par with state-of-the-art on multi-bit quantization. For binary quantization, our analysis shows both theoretically and experimentally that ProxQuant is more stable than the straight-through gradient method (i.e. BinaryConnect), challenging the indispensability of the straight-through gradient method and providing a powerful alternative.

##### Unseen Action Recognition with Multimodal Learning

No tl;dr =[

In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos. We compare the effect of placing various constraints on the embedding space using paired text and video data. Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such shared embedding space benefits three difficult tasks (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.

##### Towards the first adversarially robust neural network model on MNIST

No tl;dr =[

Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful L-inf defense by Madry et~al. (1) has lower L0 robustness than undefended networks and still highly susceptible to L2 perturbations, (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great length to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) by designing a new attack that exploits the structure of our defended model and (c) by devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0). The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L-inf perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class.

##### Evolutionary-Neural Hybrid Agents for Architecture Search

tl;dr We propose a class of Evolutionary-Neural hybrid agents, that retain the best qualities of the two approaches.

Neural Architecture Search has recently shown potential to automate the design of Neural Networks. The use of Neural Network agents trained with Reinforcement Learning can offer the possibility to learn complex patterns, as well as the ability to explore a vast and compositional search space. On the other hand, evolutionary algorithms offer the greediness and sample efficiency needed for such an application, as each sample requires a considerable amount of resources. We propose a class of Evolutionary-Neural hybrid agents (Evo-NAS), that retain the best qualities of the two approaches. We show that the Evo-NAS agent can outperform both Neural and Evolutionary agents, both on a synthetic task, and on architecture search for a suite of text classification datasets.

##### Hyper-Regularization: An Adaptive Choice for the Learning Rate in Gradient Descent

No tl;dr =[

We present a novel approach for adaptively selecting the learning rate in gradient descent methods. Specifically, we impose a regularization term on the learning rate via a generalized distance, and cast the joint updating process of the parameter and the learning rate into a maxmin problem. Some existing schemes such as AdaGrad (diagonal version) and WNGrad can be rederived from our approach. Based on our approach, the updating rules for the learning rate do not rely on the smoothness constant of optimization problems and are robust to the initial learning rate. We theoretically analyze our approach in full batch and online learning settings, which achieves comparable performances with other first-order gradient-based algorithms in terms of accuracy as well as convergence rate.

##### Distributionally Robust Optimization Leads to Better Generalization: on SGD and Beyond

No tl;dr =[

In this paper, we adopt distributionally robust optimization (DRO) in hope to achieve a better generalization in deep learning tasks. We establish the generalization guarantees and analyze the localized Rademacher complexity for DRO, and conduct experiments to show that DRO obtains a better performance. We reveal the profound connection between SGD and DRO, i.e., selecting a batch can be viewed as choosing a distribution over the training set. From this perspective, we prove that SGD is prone to escape from bad stationary points and small batch SGD outperforms large batch SGD. We give an upper bound for the robust loss when SGD converges and keeps stable. We propose a novel Weighted SGD (WSGD) algorithm framework, which assigns high-variance weights to the data of the current batch. We devise a practical implement of WSGD that can directly optimize the robust loss. We test our algorithm on CIFAR-10 and CIFAR-100, and WSGD achieves significant improvements over the conventional SGD.

##### Quasi-hyperbolic momentum and Adam for deep learning

tl;dr Mix plain SGD and momentum (or do something similar with Adam) for great profit.

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. PyTorch code is immediately available.

##### Learning with Random Learning Rates.

tl;dr We test stochastic gradient descent with random per-feature learning rates in neural networks, and find performance comparable to using SGD with the optimal learning rate, alleviating the need for learning rate tuning.

Hyperparameter tuning is a bothersome step in the training of deep learning mod- els. One of the most sensitive hyperparameters is the learning rate of the gradient descent. We present the All Learning Rates At Once (Alrao) optimization method for neural networks: each unit or feature in the network gets its own learning rate sampled from a random distribution spanning several orders of magnitude. This comes at practically no computational cost. Perhaps surprisingly, stochastic gra- dient descent (SGD) with Alrao performs close to SGD with an optimally tuned learning rate, for various architectures and problems. Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively. This text comes with a PyTorch implementation of the method, which can be plugged on an existing PyTorch model.

##### Multilingual Neural Machine Translation with Knowledge Distillation

tl;dr We proposed a knowledge distillation based method to boost the accuracy of multilingual neural machine translation.

Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving. However, traditional multilingual translation usually yields inferior accuracy compared with the counterpart using individual models for each language pair, due to language diversity and model capacity limitations. In this paper, we propose a distillation-based approach to boost the accuracy of multilingual machine translation. Specifically, individual models are first trained and regarded as teachers, and then the multilingual model is trained to fit the training data and match the outputs of individual models simultaneously through knowledge distillation. Experiments on IWSLT, WMT and Ted talk translation datasets demonstrate the effectiveness of our method. Particularly, we show that one model is enough to handle multiple languages (up to 44 languages in our experiment), with comparable or even better accuracy than individual models.

##### ATTENTION INCORPORATE NETWORK: A NETWORK CAN ADAPT VARIOUS DATA SIZE

No tl;dr =[

In traditional neural networks for image processing, the inputs of the neural networks should be the same size such as 224×224×3. But how can we train the neural net model with different input size? A common way to do is image deformation which accompany a problem of information loss (e.g. image crop or wrap). In this paper we propose a new network structure called Attention Incorporate Network(AIN). It solve the problem of different size of input images and extract the key features of the inputs by attention mechanism, pay different attention depends on the importance of the features not rely on the data size. Experimentally, AIN achieve a higher accuracy, better convergence comparing to the same size of other network structure.

##### Reducing Overconfident Errors outside the Known Distribution

tl;dr Deep networks are more likely to be confidently wrong when testing on unexpected data. We propose two methods to reduce confident errors on unknown input distributions, and an experimental methodology to study the problem.

Intuitively, unfamiliarity should lead to lack of confidence. In reality, current algorithms often make highly confident yet wrong predictions when faced with unexpected test samples from an unknown distribution different from training. Unlike domain adaptation methods, we cannot gather an "unexpected dataset" prior to test, and unlike novelty detection methods, a best-effort original task prediction is still expected. We propose two simple solutions that reduce overconfident errors of samples from an unknown novel distribution without drastically increasing evaluation time: (1) G-distillation, training an ensemble of classifiers and then distill into a single model using both labeled and unlabeled examples, or (2) NCR, reducing prediction confidence based on its novelty detection score. Experimentally, we investigate the overconfidence problem and evaluate our solution by creating "familiar" and "novel" test splits, where "familiar" are identically distributed with training and "novel" are not. We show that our solution yields more appropriate prediction confidences, on familiar and novel data, compared to single models and ensembles distilled on training data only. For example, our G-distillation reduces confident errors in gender recognition by 94% on demographic groups different from the training data.

##### Bayesian Deep Learning via Stochastic Gradient MCMC with a Stochastic Approximation Adaptation

tl;dr a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables

We propose a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables. Inspired by dropout, a popular tool for regularization and model ensemble, we assign sparse priors to the weights in deep neural networks (DNN) in order to achieve automatic dropout'' and avoid over-fitting. By alternatively sampling from posterior distribution through stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and optimizing latent variables via stochastic approximation (SA), the trajectory of the target weights is proved to converge to the true posterior distribution conditioned on optimal latent variables. This ensures a stronger regularization on the over-fitted parameter space and more accurate uncertainty quantification on the decisive variables. Simulations from large-p-small-n regressions showcase the robustness of this method when applied to models with latent variables. Additionally, its application on the convolutional neural networks (CNN) leads to state-of-the-art performance on MNIST and Fashion MNIST datasets and improved resistance to adversarial attacks.

##### A Direct Approach to Robust Deep Learning Using Adversarial Networks

tl;dr Jointly train an adversarial noise generating network with a classification network to provide better robustness to adversarial attacks.

Deep neural networks have been shown to perform well in many classical machine learning problems, especially in image classification tasks. However, researchers have found that neural networks can be easily fooled, and they are surprisingly sensitive to small perturbations imperceptible to humans. Carefully crafted input images (adversarial examples) can force a well-trained neural network to provide arbitrary outputs. Including adversarial examples during training is a popular defense mechanism against adversarial attacks. In this paper we propose a new defensive mechanism under the generative adversarial network~(GAN) framework. We model the adversarial noise using a generative network, trained jointly with a classification discriminative network as a minimax game. We show empirically that our adversarial network approach works well against black box attacks, with performance on par with state-of-art methods such as ensemble adversarial training and adversarial training with projected gradient descent.

##### Exemplar Guided Unsupervised Image-to-Image Translation with Semantic Consistency

tl;dr We propose the Exemplar Guided & Semantically Consistent Image-to-image Translation (EGSC-IT) network which conditions the translation process on an exemplar image in the target domain.

Image-to-image translation has recently received significant attention due to advances in deep learning. Most works focus on learning either a one-to-one mapping in an unsupervised way or a many-to-many mapping in a supervised way. However, a more practical setting is many-to-many mapping in an unsupervised way, which is harder due to the lack of supervision and the complex inner- and cross-domain variations. To alleviate these issues, we propose the Exemplar Guided & Semantically Consistent Image-to-image Translation (EGSC-IT) network which conditions the translation process on an exemplar image in the target domain. We assume that an image comprises of a content component which is shared across domains, and a style component specific to each domain. Under the guidance of an exemplar from the target domain we apply Adaptive Instance Normalization to the shared content component, which allows us to transfer the style information of the target domain to the source domain. To avoid semantic inconsistencies during translation that naturally appear due to the large inner- and cross-domain variations, we introduce the concept of feature masks that provide coarse semantic guidance without requiring the use of any semantic labels. Experimental results on various datasets show that EGSC-IT does not only translate the source image to diverse instances in the target domain, but also preserves the semantic consistency during the process.

##### ARM: Augment-REINFORCE-Merge Gradient for Stochastic Binary Networks

No tl;dr =[

To backpropagate the gradients through stochastic binary layers, we propose the augment-REINFORCE-merge (ARM) estimator that is unbiased and has low variance. Exploiting data augmentation, REINFORCE, and reparameterization, the ARM estimator achieves adaptive variance reduction for Monte Carlo integration by merging two expectations via common random numbers. The variance-reduction mechanism of the ARM estimator can also be attributed to antithetic sampling in an augmented space. Experimental results show the ARM estimator provides state-of-the-art performance in auto-encoding variational Bayes and maximum likelihood inference, for discrete latent variable models with one or multiple stochastic binary layers. Python code is available at https://github.com/ABC-anonymous-1.

##### TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

tl;dr We present the TimbreTron, a pipeline for perfoming high-quality timbre transfer on musicalwaveforms using CQT-domain style transfer.

In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, an audio processing pipeline which combines three powerful ideas from different domains: Constant Q Transform (CQT) spectrogram for audio representation, a variant of CycleGAN for timbre transfer and WaveNet-Synthesizer for high quality audio generation. We verified that CQT TimbreTron in principle and in practice is more suitable than its STFT counterpart, even though STFT is more commonly used for audio representation. Based on human perceptual evaluations, we confirmed that timbre was transferred recognizably while the musical content was preserved by TimbreTron.

##### Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization

tl;dr We describe a dynamic sparse reparameterization technique that allow training of a small sparse network to generalize on par with, or better than, a full-sized dense model compressed to the same size.

Modern deep neural networks are highly overparameterized, and often of huge sizes. A number of post-training model compression techniques, such as distillation, pruning and quantization, can reduce the size of network parameters by a substantial fraction with little loss in performance. However, training a small network of the post-compression size de novo typically fails to reach the same level of accuracy achieved by compression of a large network, leading to a widely-held belief that gross overparameterization is essential to effective learning. In this work, we argue that this is not necessarily true. We describe a dynamic sparse reparameterization technique that closed the performance gap between a model compressed by pruning and a model of the post-compression size trained de novo. We applied our method to training deep residual networks and showed that it outperformed existing static reparameterization techniques, yielding the best accuracy for a given parameter budget for training. Compared to other dynamic reparameterization methods that reallocate non-zero parameters during training, our approach broke free from a few key limitations and achieved much better performance at lower computational cost. Our method is not only of practical value for training under stringent memory constraints, but also potentially informative to theoretical understanding of generalization properties of overparameterized deep neural networks.

##### Minimal Images in Deep Neural Networks: Fragile Object Recognition in Natural Images

No tl;dr =[

The human ability to recognize objects is impaired when the object is not shown in full. "Minimal images" are the smallest regions of an image that remain recognizable for humans. Ullman et al. (2016) show that a slight modification of the location and size of the visible region of the minimal image produces a sharp drop in human recognition accuracy. In this paper, we demonstrate that such drops in accuracy due to changes of the visible region are a common phenomenon between humans and existing state-of-the-art deep neural networks (DNNs), and are much more prominent in DNNs. We found many cases where DNNs classified one region correctly and the other incorrectly, though they only differed by one row or column of pixels, and were often bigger than the average human minimal image size. We show that this phenomenon is independent from previous works that have reported lack of invariance to minor modifications in object location in DNNs. Our results thus reveal a new failure mode of DNNs that also affects humans to a much lesser degree. They expose how fragile DNN recognition ability is in natural images even without adversarial patterns being introduced. Bringing the robustness of DNNs in natural images to the human level remains an open challenge for the community.

##### Adversarial Examples Are a Natural Consequence of Test Error in Noise

tl;dr Small adversarial perturbations should be expected given observed error rates of models outside the natural data distribution.

Maliciously constructed inputs, or adversarial examples, can fool trained machine learning models. Over the last few years, adversarial examples have captured the attention of the research community, especially in the case where the adversary is restricted to making only small modifications of a correctly handled input. When it was first discovered that neural networks are sensitive to small perturbations, many researchers found this surprising and proposed several hypotheses to explain it. In this work, we show that this sensitivity and the poor performance of classification models (relative to humans) on noisy images are two manifestations of the same underlying phenomenon. Nearby errors simply lie on the boundary of a large set of errors whose volume can be measured using test error in additive noise. We present compelling new evidence in favor of this interpretation before discussing some preexisting results which also support our perspective. The relationship between nearby errors and failure to generalize in noise has implications for the adversarial defense literature, as it suggests that defenses which fail to reduce test error in noise will also fail to defend against small adversarial perturbations. This yields a computationally tractable evaluation metric for defenses to consider: test error in noisy image distributions.

tl;dr The AMM is the first fully adversarially optimized method to model the conditional dependence between categorical and continuous latent variables.

The Adversarially Learned Mixture Model (AMM) is a generative model for unsupervised or semi-supervised data clustering. The AMM is the first adversarially optimized method to model the conditional dependence between inferred continuous and categorical latent variables. Experiments on the MNIST and SVHN datasets show that the AMM allows for semantic separation of complex data when little or no labeled data is available. The AMM achieves unsupervised clustering error rates of 3.32% and 20.4% on the MNIST and SVHN datasets, respectively. A semi-supervised extension of the AMM achieves a classification error rate of 5.60% on the SVHN dataset.

##### Uncertainty-guided Lifelong Learning in Bayesian Networks

tl;dr We formulate lifelong learning in the Bayesian-by-Backprop framework, exploiting the parameter uncertainty in two settings: for pruning network parameters and in importance weight based continual learning.

##### Learning data-derived privacy preserving representations from information metrics

tl;dr Learning privacy-preserving transformations from data. A collaborative approach

It is clear that users should own and control their data and privacy. Utility providers are also becoming more interested in guaranteeing data privacy. Therefore, users and providers can and should collaborate in privacy protecting challenges, and this paper addresses this new paradigm. We propose a framework where the user controls what characteristics of the data they wants to share (utility) and what they want to keep private (secret), without necessarily asking the utility provider to change its existing machine learning algorithms. We first analyze the space of privacy-preserving representations, and derive natural information-theoretic bounds on the utility-privacy trade-off when disclosing a sanitized version of the data X. We present explicit architectures to learn privacy-preserving representations that approach this bound in a data-driven fashion. We then describe important use-case scenarios where the utility providers are willing to collaborate, at least partially, with the sanitization process. In this setting, we limit the possible sanitization functions to space-preserving transformations, meaning the sanitation maps the data to the same space as the original data, and the utility provider can then use the exact same (existing) algorithm for the original and sanitized data, a novel critical attribute to help service providers to collaborate. We illustrate this framework and show how we can maintain utility while protecting secret information even in cases where the joint distribution of the data X and the utility and secret variables U and S are unknown; and where the difficulty of inferring the utility variable U is much higher than the task of inferring the secret variable S. This is done through the implementation of three use cases; subject-within-subject, where we tackle the problem of having an identity detector (from facial images) that works only on a consenting subset of users, an important application, for example, for mobile devices activated by face recognition, helping them to become private to the environment instead of always listening as they currently act; gender-and-subject, where we want to preserve facial verification (hard) while hiding the gender attribute (easy) for users who choose to do so; and emotion-and-gender, where we tackle the issue of hiding independent variables, as is the case of hiding gender while preserving emotion detection.

##### Dynamic Early Terminating of Multiply Accumulate Operations for Saving Computation Cost in Convolutional Neural Networks

No tl;dr =[

Deep learning has been attracting enormous attention from academia as well as industry due to its great success in many artificial intelligence applications. As more applications are developed, the need for implementing a complex neural network model on an energy-limited edge device becomes more critical. To this end, this paper proposes a new optimization method to reduce the computation efforts of convolutional neural networks. The method takes advantage of the fact that some convolutional operations are actually wasteful since their outputs are pruned by the following activation or pooling layers. Basically, a convolutional filter conducts a series of multiply-accumulate (MAC) operations. We propose to set a checkpoint in the MAC process to determine whether a filter could terminate early based on the intermediate result. Furthermore, a fine-tuning process is conducted to recover the accuracy drop due to the applied checkpoints. The experimental results show that the proposed method can save approximately 50% MAC operations with less than 1% accuracy drop for CIFAR-10 example model and Network in Network on the CIFAR-10 and CIFAR-100 datasets. Additionally, compared with the state-of- the-art method, the proposed method is more effective on the CIFAR-10 dataset and is competitive on the CIFAR-100 dataset.

##### Localized random projections challenge benchmarks for bio-plausible deep learning

tl;dr Spiking networks using localized random projections and STDP challenge current MNIST benchmark models for bio-plausible deep learning

Similar to models of brain-like computation, artificial deep neural networks rely on distributed coding, parallel processing and plastic synaptic weights. Training deep neural networks with the error-backpropagation algorithm, however, is considered bio-implausible. An appealing alternative to training deep neural networks is to use one or a few hidden layers with fixed random weights or trained with an unsupervised, local learning rule and train a single readout layer with a supervised, local learning rule. We find that a network of leaky-integrate-andfire neurons with fixed random, localized receptive fields in the hidden layer and spike timing dependent plasticity to train the readout layer achieves 98.1% test accuracy on MNIST, which is close to the optimal result achievable with error-backpropagation in non-convolutional networks of rate neurons with one hidden layer. To support the design choices of the spiking network, we systematically compare the classification performance of rate networks with a single hidden layer, where the weights of this layer are either random and fixed, trained with unsupervised Principal Component Analysis or Sparse Coding, or trained with the backpropagation algorithm. This comparison revealed, first, that unsupervised learning does not lead to better performance than fixed random projections for large hidden layers on digit classification (MNIST) and object recognition (CIFAR10); second, networks with random projections and localized receptive fields perform significantly better than networks with all-to-all connectivity and almost reach the performance of networks trained with the backpropagation algorithm. The performance of these simple random projection networks is comparable to most current models of bio-plausible deep learning and thus provides an interesting benchmark for future approaches.

##### Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers

tl;dr We use snapshots from the training process to improve any uncertainty estimation method of a DNN classifier.

We consider the problem of uncertainty estimation in the context of (non-Bayesian) deep neural classification. In this context, all known methods are based on extracting uncertainty signals from a trained network optimized to solve the classification problem at hand. We demonstrate that such techniques tend to introduce biased estimates for instances whose predictions are supposed to be highly confident. We argue that this deficiency is an artifact of the dynamics of training with SGD-like optimizers, and it has some properties similar to overfitting. Based on this observation, we develop an uncertainty estimation algorithm that selectively estimates the uncertainty of highly confident points, using earlier snapshots of the trained model, before their estimates are jittered (and way before they are ready for actual classification). We present extensive experiments indicating that the proposed algorithm provides uncertainty estimates that are consistently better than all known methods.

##### Teaching to Teach by Structured Dark Knowledge

tl;dr We newly proposed teaching to teach, to educate a better teacher to teach a better student by introducing structured dark knowledge.

To educate hyper deep learners, \emph{Curriculum Learnings} (CLs) require either human heuristic participation or self-deciding the difficulties of training instances. These coaching manners are blind to the coherent structures among examples, categories, and tasks, which are pregnant with more knowledgeable curriculum-routed teachers. In this paper, we propose a general methodology \emph{Teaching to Teach} (T2T). T2T is facilitated by \emph{Structured Dark Knowledge} (SDK) that constitutes a communication protocol between structured knowledge prior and teaching strategies. On one hand, SDK adaptively extracts structured knowledge by selecting a training subset consistent with the previous teaching decisions. On the other hand, SDK teaches curriculum-agnostic teachers by transferring this knowledge to update their teaching policy. This virtuous cycle can be flexibly-deployed in most existing CL platforms and more importantly, very generic across various structured knowledge characteristics, e.g., diversity, complementarity, and causality. We evaluate T2T across different learners, teachers, and tasks, which significantly demonstrates that structured knowledge can be inherited by the teachers to further benefit learners' training.

##### Learning and Planning with a Semantic Model

tl;dr We propose a hybrid model-based & model-free approach using semantic information to improve DRL generalization in man-made environments.

Building deep reinforcement learning agents that can generalize and adapt to unseen environments remains a fundamental challenge for AI. This paper describes progresses on this challenge in the context of man-made environments, which are visually diverse but contain intrinsic semantic regularities. We propose a hybrid model-based and model-free approach, LEArning and Planning with Semantics (LEAPS), consisting of a multi-target sub-policy that acts on visual inputs, and a Bayesian model over semantic structures. When placed in an unseen environment, the agent plans with the semantic model to make high-level decisions, proposes the next sub-target for the sub-policy to execute, and updates the semantic model based on new observations. We perform experiments in visual navigation tasks using House3D, a 3D environment that contains diverse human-designed indoor scenes with real-world objects. LEAPS outperforms strong baselines that do not explicitly plan using the semantic content.

##### Beyond Pixel Norm-Balls: Parametric Adversaries using an Analytically Differentiable Renderer

tl;dr Enabled by a novel differentiable renderer, we propose a new metric that has real-world implications for evaluating adversarial machine learning algorithms, resolving the lack of realism of the existing metric based on pixel norms.

Many machine learning image classifiers are vulnerable to adversarial attacks, inputs with perturbations designed to intentionally trigger misclassification. Current adversarial methods directly alter pixel colors and evaluate against pixel norm-balls: pixel perturbations smaller than a specified magnitude, according to a measurement norm. This evaluation, however, has limited practical utility since perturbations in the pixel space do not correspond to underlying real-world phenomena of image formation that lead to them and has no security motivation attached. Pixels in natural images are measurements of light that has interacted with the geometry of a physical scene. As such, we propose a novel evaluation measure, parametric norm-balls, by directly perturbing physical parameters that underly image formation. One enabling contribution we present is a physically-based differentiable renderer that allows us to propagate pixel gradients to the parametric space of lighting and geometry. Our approach enables physically-based adversarial attacks, and our differentiable renderer leverages models from the interactive rendering literature to balance the performance and accuracy trade-offs necessary for a memory-efficient and scalable adversarial data augmentation workflow.

##### D-GAN: Divergent generative adversarial network for positive unlabeled learning and counter-examples generation

tl;dr A new two-stage positive unlabeled learning approach with GAN

Positive Unlabeled learning task remains an interesting challenge in the context of image analysis. Recent approaches suggest to exploit the GANs abilities to answer this problem. In this paper, we propose a new approach named Divergent-GAN (D-GAN). It keeps the light adversarial architecture of the PGAN method, with a better robustness counter the varying images complexity, while simultaneously allowing the same functionalities as the GenPU method, like the generation of relevant counter-examples. However, this is achieved without the need of prior knowledge, nor an onerous architecture and framework. Its functionning is based on the combination between the behaviour principles of Positive Unlabeled learning classification and the adversarial GAN training. Experimental results show that this divergent adversarial framework outperforms the state of the art PU learning in terms of prediction accuracy, training robustness, and its ability to work on both simple and complex real images. Combined with an additional generator, the proposed approach even allows to accomplish noisy labeled learning, and thus opening new application perspectives for GANs architectures.

##### A quantifiable testing of global translational invariance in Convolutional and Capsule Networks

tl;dr Testing of global translational invariance in Convolutional and Capsule Networks

We design simple and quantifiable testing of global translation-invariance in deep learning models trained on the MNIST dataset. Experiments on convolutional and capsules neural networks show that both models have poor performance in dealing with global translation-invariance; however, the performance improved by using data augmentation. Although the capsule network is better on the MNIST testing dataset, the convolutional neural network generally has better performance on the translation-invariance.

##### Multi-class classification without multi-class labels

No tl;dr =[

This work presents a new strategy for multi-class classification that requires no class-specific labels, but instead leverages pairwise similarity between examples, which is a weaker form of annotation. The proposed method, meta classification learning, optimizes a binary classifier for pairwise similarity prediction and through this process learns a multi-class classifier as a submodule. We formulate this approach, present a probabilistic graphical model for it, and derive a surprisingly simple loss function that can be used to learn neural network-based models. We then demonstrate that this same framework generalizes to the supervised, unsupervised cross-task, and semi-supervised settings. Our method is evaluated against state of the art in all three learning paradigms and shows a superior or comparable accuracy, providing evidence that learning multi-class classification without multi-class labels is a viable learning option.

##### DANA: Scalable Out-of-the-box Distributed ASGD Without Retuning

tl;dr A new distributed asynchronous SGD algorithm that achieves state-of-the-art accuracy on existing architectures without any additional tuning or overhead.

Distributed computing can significantly reduce the training time of neural networks. Despite its potential, however, distributed training has not been widely adopted: scaling the training process is difficult, and existing SGD methods require substantial tuning of hyperparameters and learning schedules to achieve sufficient accuracy when increasing the number of workers. In practice, such tuning can be prohibitively expensive given the huge number of potential hyperparameter configurations and the effort required to test each one. We propose DANA, a novel approach that scales out-of-the-box to large clusters using the same hyperparameters and learning schedule optimized for training on a single worker, while maintaining similar final accuracy without additional overhead. DANA estimates the future value of model parameters by adapting Nesterov Accelerated Gradient to a distributed setting, and so mitigates the effect of gradient staleness, one of the main difficulties in scaling SGD to more workers. Evaluation on three state-of-the-art network architectures and three datasets shows that DANA scales as well as or better than existing work without having to tune any hyperparameters or tweak the learning schedule. For example, DANA achieves 75.73% accuracy on ImageNet when training ResNet-50 with 16 workers, similar to the non-distributed baseline.

##### Infinitely Deep Infinite-Width Networks

tl;dr We propose a method for the construction of arbitrarily deep infinite-width networks, based on which we derive a novel weight initialisation scheme for finite-width networks and demonstrate its competitive performance.

Infinite-width neural networks have been extensively used to study the theoretical properties underlying the extraordinary empirical success of standard, finite-width neural networks. Nevertheless, until now, infinite-width networks have been limited to at most two hidden layers. To address this shortcoming, we study the initialisation requirements of these networks and show that the main challenge for constructing them is defining the appropriate sampling distributions for the weights. Based on these observations, we propose a principled approach to weight initialisation that correctly accounts for the functional nature of the hidden layer activations and facilitates the construction of arbitrarily many infinite-width layers, thus enabling the construction of arbitrarily deep infinite-width networks. The main idea of our approach is to iteratively reparametrise the hidden-layer activations into appropriately defined reproducing kernel Hilbert spaces and use the canonical way of constructing probability distributions over these spaces for specifying the required weight distributions in a principled way. Furthermore, we examine the practical implications of this construction for standard, finite-width networks. In particular, we derive a novel weight initialisation scheme for standard, finite-width networks that takes into account the structure of the data and information about the task at hand. We demonstrate the effectiveness of this weight initialisation approach on the MNIST, CIFAR-10 and Year Prediction MSD datasets.

##### Learning to Drive by Observing the Best and Synthesizing the Worst

tl;dr This work explores how far we can take (supervised) imitation learning for the task of driving a car.

Our goal is to train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle. We find that standard behavior cloning is insufficient for handling complex driving scenarios, even when we leverage a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough. We propose exposing the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road. Rather than purely imitating all data, we augment the imitation loss with additional losses that penalize undesirable events and encourage progress -- the perturbations then provide an important signal for these losses and lead to robustness of the learned model. We show that the model can handle complex situations in simulation, and present ablation experiments that emphasize the importance of each of our proposed changes and show that the model is responding to the appropriate causal factors. Finally, we demonstrate the model driving a car in the real world ( https://sites.google.com/view/learn-to-drive ).

##### Feed-forward Propagation in Probabilistic Neural Networks with Categorical and Max Layers

tl;dr Approximating mean and variance of the NN output over noisy input / dropout / uncertain parameters. Analytic approximations for argmax, softmax and max layers.

Probabilistic Neural Networks take into account various sources of stochasticity: input noise, dropout, stochastic neurons, parameter uncertainties modeled as random variables. In this paper we revisit the feed-forward propagation method that allows one to estimate for each neuron its mean and variance w.r.t. mentioned sources of stochasticity. In contrast, standard NNs propagate only point estimates, discarding the uncertainty. Methods propagating also the variance have been proposed by several authors in different context. The presented view attempts to clarify the assumptions and derivation behind such methods, relate it to classical NNs and broaden the scope of its applicability. The main technical innovations are new posterior approximations for argmax and max-related transforms, that allows for applicability in networks with softmax and max-pooling layers as well as leaky ReLU activations. We evaluate the accuracy of the approximation and suggest a simple calibration. Applying the method to networks with dropout allows for faster training and gives improved test likelihoods without the need of sampling.

##### Probabilistic Model-Based Dynamic Architecture Search

tl;dr We present an efficient neural network architecture search method based on stochastic natural gradient method via probabilistic modeling.

The architecture search methods for convolutional neural networks (CNNs) have shown promising results. These methods require significant computational resources, as they repeat the neural network training many times to evaluate and search the architectures. Developing the computationally efficient architecture search method is an important research topic. In this paper, we assume that the structure parameters of CNNs are categorical variables, such as types and connectivities of layers, and they are regarded as the learnable parameters. Introducing the multivariate categorical distribution as the underlying distribution for the structure parameters, we formulate a differentiable loss for the training task, where the training of the weights and the optimization of the parameters of the distribution for the structure parameters are coupled. They are trained using the stochastic gradient descent, leading to the optimization of the structure parameters within a single training. We apply the proposed method to search the architecture for two computer vision tasks: image classification and inpainting. The experimental results show that the proposed architecture search method is fast and can achieve comparable performance to the existing methods.

##### Robustness and Equivariance of Neural Networks

tl;dr Robustness to rotations comes at the cost of robustness of pixel-wise adversarial perturbations.

Neural networks models are known to be vulnerable to geometric transformations as well as small pixel-wise perturbations of input. Convolutional Neural Networks (CNNs) are translation-equivariant but can be easily fooled using rotations and small pixel-wise perturbations. Moreover, CNNs require sufficient translations in their training data to achieve translation-invariance. Recent work by Cohen & Welling (2016), Worrall et al. (2016), Kondor & Trivedi (2018), Cohen & Welling (2017), Marcos et al. (2017), and Esteves et al. (2018) has gone beyond translations, and constructed rotation-equivariant or more general group-equivariant neural network models. In this paper, we do an extensive empirical study of various rotation-equivariant neural network models to understand how effectively they learn rotations. This includes Group-equivariant Convolutional Networks (GCNNs) by Cohen & Welling (2016), Harmonic Networks (H-Nets) by Worrall et al. (2016), Polar Transformer Networks (PTN) by Esteves et al. (2018) and Rotation equivariant vector field networks by Marcos et al. (2017). We empirically compare the ability of these networks to learn rotations efficiently in terms of their number of parameters, sample complexity, rotation augmentation used in training. We compare them against each other as well as Standard CNNs. We observe that as these rotation-equivariant neural networks learn rotations, they instead become more vulnerable to small pixel-wise adversarial attacks, e.g., Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), in comparison with Standard CNNs. In other words, robustness to geometric transformations in these models comes at the cost of robustness to small pixel-wise perturbations.

##### A Unified View of Deep Metric Learning via Gradient Analysis

No tl;dr =[

Loss functions play a pivotal role in deep metric learning (DML). A large variety of loss functions have been proposed in DML recently. However, it remains difficult to answer this question: what are the intrinsic differences among these loss functions?This paper answers this question by proposing a unified perspective to rethink deep metric loss functions. We show theoretically that most DML methods in deep metric learning, in view of gradient equivalence, are essentially weight assignment strategies of training pairs. Based on this unified view, we revisit several typical DML methods and disclose their hidden drawbacks. Moreover, we point out the key components of an effective DML approach which drives us to propose our weight assignment framework. We evaluate our method on image retrieval tasks, and show that it outperforms the state-of-the-art DML approaches by a significant margin on the CUB-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval datasets.

##### Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet

tl;dr Aggregating class evidence from many small image patches suffices to solve ImageNet, yields more interpretable models and can explain aspects of the decision-making of popular DNNs.

Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 32 x32 px features and Alexnet performance for 16 x 16 px features). The constraint on local features makes it straight-forward to analyse how exactly each feature of the image influences the classification. Furthermore, the BagNets behave similar to state-of-the art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts, suggesting that modern DNNs approximately follow a similar bag-of-feature strategy.

##### Volumetric Convolution: Automatic Representation Learning in Unit Ball

tl;dr A novel convolution operator for automatic representation learning inside unit ball

Convolution is an efficient technique to obtain abstract feature representations using hierarchical layers in deep networks. Although performing convolution in Euclidean geometries is fairly straightforward, its extension to other topological spaces---such as a sphere S^2 or a unit ball B^3---entails unique challenges. In this work, we propose a novel "volumetric convolution" operation that can effectively convolve arbitrary functions in B^3. We develop a theoretical framework for "volumetric convolution" based on Zernike polynomials and efficiently implement it as a differentiable and an easily pluggable layer for deep networks. Furthermore, our formulation leads to derivation of a novel formula to measure the symmetry of a function in B^3 around an arbitrary axis, that is useful in 3D shape analysis tasks. We demonstrate the efficacy of proposed volumetric convolution operation on a possible use-case i.e., 3D object recognition task.

tl;dr we propose a novel activation function, ConvReLU, that can better mimic brain neuron activation behaviors and overcome the dying ReLU problem.

Rectified linear units (ReLUs) are currently the most popular activation function used in neural networks. Although ReLUs can solve the gradient vanishing problem and accelerate training convergence, it suffers from the dying ReLU problem in which some neurons are never activated if the weights are not updated properly. In this work, we propose a novel activation function, known as the adaptive convolutional ReLU (ConvReLU), that can better mimic brain neuron activation behaviors and overcome the dying ReLU problem. With our novel parameter sharing scheme, ConvReLUs can be applied to convolution layers that allow each input neuron to be activated by different trainable thresholds without involving a large number of extra parameters. We employ the zero initialization scheme in ConvReLU to encourage trainable thresholds to be close to zero. Finally, we develop a partial replacement strategy that only replaces the ReLUs in the early layers of the network. This resolves the dying ReLU problem and retains sparse representations for linear classifiers. Experimental results demonstrate that our proposed ConvReLU has consistently better performance compared to ReLU, LeakyReLU, and PReLU. In addition, the partial replacement strategy is shown to be effective not only for our ConvReLU but also for LeakyReLU and PReLU.

##### On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

tl;dr SGD is steered early on in training towards a region in which its step is too large compared to curvature, which impacts the rest of training.

Training of deep neural networks with Stochastic Gradient Descent (SGD) typically ends in regions of the weight space, where both the generalization properties and the flatness of the local loss curvature depend on the learning rate and the batch size. We discover that a related phenomena happens in the early phase of training and study its consequences. Initially, SGD visits increasingly sharp regions of the loss surface, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. At this early peak value, an SGD step is on average too large to minimize the loss along the directions corresponding to the largest eigenvalues of the Hessian (i.e. the sharpest directions). To query the importance of this phenomena for training, we study a variant of SGD using a reduced learning rate along the sharpest directions and show that it can improve training speed while finding both sharper and better--generalizing solution, compared to vanilla SGD. Overall, our results show that the SGD dynamics along the sharpest directions influence the regions of the weight space visited, the overall training speed, and generalization ability.

##### Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations

tl;dr Better adversarial training by learning to map back to the data manifold with autoencoders in the hidden states.

Deep networks have achieved impressive results across a variety of important tasks. However, a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose \emph{Fortified Networks}, a simple transformation of existing networks, which “fortifies” the hidden layers in a deep network by identifying when the hidden states are off of the data manifold, and maps these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks and our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the problem of deceptively good results due to degraded quality in the gradient signal (the gradient masking problem) and (iii) show the advantage of doing this fortification in the hidden layers instead of the input space. We demonstrate improvements in adversarial robustness on three datasets (MNIST, Fashion MNIST, CIFAR10), across several attack parameters, both white-box and black-box settings, and the most widely studied attacks (FGSM, PGD, Carlini-Wagner). We show that these improvements are achieved across a wide variety of hyperparameters.

##### Adversarial Attacks for Optical Flow-Based Action Recognition Classifiers

tl;dr The paper describes adversarial attacks for action recognition classifiers that explicitly attack along the time dimension.

The success of deep learning research has catapulted deep models into production systems that our society is becoming increasingly dependent on, especially in the image and video domains. However, recent work has shown that these largely uninterpretable models exhibit glaring security vulnerabilities in the presence of an adversary. In this work, we develop a powerful untargeted adversarial attack for action recognition systems in both white-box and black-box settings. Action recognition models differ from image-classification models in that their inputs contain a temporal dimension, which we explicitly target in the attack. Drawing inspiration from image classifier attacks, we create new attacks which achieve state-of-the-art success rates on a two-stream classifier trained on the UCF-101 dataset. We find that our attacks can significantly degrade a model’s performance with sparsely and imperceptibly perturbed examples. We also demonstrate the transferability of our attacks to black-box action recognition systems.

##### RANDOM MASK: Towards Robust Convolutional Neural Networks

No tl;dr =[

Robustness of neural networks has recently been highlighted by the adversarial examples, i.e., inputs added with well-designed perturbations which are imperceptible to humans but can cause the network to give incorrect outputs. In this paper, we design a new CNN architecture that by itself has good robustness. We introduce a simple but powerful technique, Random Mask, to modify existing CNN structures. We show that CNN with Random Mask achieves state-of-the-art performance against black-box adversarial attacks without applying any adversarial training. We next investigate the adversarial examples which “fool” a CNN with Random Mask. Surprisingly, we find that these adversarial examples often “fool” humans as well. This raises fundamental questions on how to define adversarial examples and robustness properly.

##### Alignment Based Mathching Networks for One-Shot Classification and Open-Set Recognition

No tl;dr =[

Deep learning for object classification relies heavily on convolutional models. While effective, CNNs are rarely interpretable after the fact. An attention mechanism can be used to highlight the area of the image that the model focuses on thus offering a narrow view into the mechanism of classification. We expand on this idea by forcing the method to explicitly align images to be classified to reference images representing the classes. The mechanism of alignment is learned and therefore does not require that the reference objects are anything like those being classified. Beyond explanation, our exemplar based cross-alignment method enables classification with only a single example per category (one-shot). Our model cuts the 5-way, 1-shot error rate in Omniglot from 2.1\% to 1.4\% and in MiniImageNet from 53.5\% to 46.5\% while simultaneously providing point-wise alignment information providing some understanding on what the network is capturing. This method of alignment also enables the recognition of an unsupported class (open-set) in the one-shot setting while maintaining an F1-score of above 0.5 for Omniglot even with 19 other distracting classes while baselines completely fail to separate the open-set class in the one-shot setting.

##### Faster Training by Selecting Samples Using Embeddings

tl;dr Training is sped up by using a dataset that has been subsampled through embedding analysis.

Long training times have increasingly become a burden for researchers by slowing down the pace of innovation, with some models taking days or weeks to train. In this paper, a new, general technique is presented that aims to speed up the training process by using a thinned-down training dataset. By leveraging autoencoders and the unique properties of embedding spaces, we are able to filter training datasets to include only those samples that matter the most. Through evaluation on a standard CIFAR-10 image classification task, this technique is shown to be effective. With this technique, training times can be reduced with a minimal loss in accuracy. Conversely, given a fixed training time budget, the technique was shown to improve accuracy by over 50%. This technique is a practical tool for achieving better results with large datasets and limited computational budgets.

##### Making Convolutional Networks Shift-Invariant Again

tl;dr Modern networks are not shift-invariant, due to naive downsampling; we apply a signal processing tool -- anti-aliasing low-pass filtering before downsampling -- to improve shift-invariance

Modern convolutional networks are not shift-invariant, despite their convolutional nature: small shifts in the input can cause drastic changes in the internal feature maps and output. In this paper, we isolate the cause -- the downsampling operation in convolutional and pooling layers -- and apply the appropriate signal processing fix -- low-pass filtering before downsampling. This simple architectural modification boosts the shift-equivariance of the internal representations and consequently, shift-invariance of the output. Importantly, this is achieved while maintaining downstream classification performance. In addition, incorporating the inductive bias of shift-invariance largely removes the need for shift-based data augmentation. Lastly, we observe that the modification induces spatially-smoother learned convolutional kernels. Our results suggest that this classical signal processing technique has a place in modern deep networks.

##### Hallucinations in Neural Machine Translation

tl;dr We introduce and analyze the phenomenon of "hallucinations" in NMT, or spurious translations unrelated to source text, and propose methods to reduce its frequency.

Neural machine translation (NMT) systems have reached state of the art performance in translating text and are in wide deployment. Yet little is understood about how these systems function or break. Here we show that NMT systems are susceptible to producing highly pathological translations that are completely untethered from the source material, which we term hallucinations. Such pathological translations are problematic because they are are deeply disturbing of user trust and easy to find with a simple search. We describe a method to generate hallucinations and show that many common variations of the NMT architecture are susceptible to them. We study a variety of approaches to reduce the frequency of hallucinations, including data augmentation, dynamical systems and regularization techniques, showing that data augmentation significantly reduces hallucination frequency. Finally, we analyze networks that produce hallucinations and show that there are signatures in the attention matrix as well as in the stability measures of the decoder.

##### Towards Resisting Large Data Variations via Introspective Learning

tl;dr We propose a principled approach that endows classifiers with the ability to resist larger variations between training and testing data in an intelligent and efficient manner.

Learning deep networks which can resist large variations between training andtesting data is essential to build accurate and robust image classifiers. Towardsthis end, a typical strategy is to apply data augmentation to enlarge the trainingset. However, standard data augmentation is essentially a brute-force strategywhich is inefficient, as it performs all the pre-defined transformations to everytraining sample. In this paper, we propose a principled approach to train networkswith significantly improved resistance to large variations between training andtesting data. This is achieved by embedding a learnable transformation moduleinto the introspective networks (Jin et al., 2017; Lazarow et al., 2017; Lee et al.,2018), which is a convolutional neural network (CNN) classifier empowered withgenerative capabilities. Our approach alternatively synthesizes pseudo-negativesamples with learned transformations and enhances the classifier by retraining itwith synthesized samples. Experimental results verify that our approach signif-icantly improves the ability of deep networks to resist large variations betweentraining and testing data and achieves classification accuracy improvements onseveral benchmark datasets, including MNIST, affNIST, SVHN and CIFAR-10.

##### A Mean Field Theory of Batch Normalization

tl;dr Batch normalization causes exploding gradients in vanilla feedforward networks.

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.

##### Visual Imitation Learning with Recurrent Siamese Networks

tl;dr Learning a vision-based recurrent distance function to allow agents to imitate behaviours from noisy video data.

People are incredibly skilled at imitating others by simply observing them. They achieve this even in the presence of significant morphological differences and capabilities. Further, people are able to do this from raw perceptions of the actions of others, without direct access to the abstracted demonstration actions and with only partial state information. People therefore solve a difficult problem of understanding the salient features of both observations of others and the relationship to their own state when learning to imitate specific tasks. However, we can attempt to reproduce a similar demonstration via trail and error and through this gain more understanding of the task space. To reproduce this ability an agent would need to both learn how to recognize the differences between itself and some demonstration and at the same time learn to minimize the distance between its own performance and that of the demonstration. In this paper we propose an approach using only visual information to learn a distance metric between agent behaviour and a given video demonstration. We train an RNN-based siamese model to compute distances in space and time between motion clips while training an RL policy to minimize this distance. Furthermore, we examine a particularly challenging form of this problem where the agent must learn an imitation based task given a single demonstration. We demonstrate our approach in the setting of deep learning based control for physical simulation of humanoid walking in both 2D with $10$ degrees of freedom (DoF) and 3D with $38$ DoF.

##### Manifold regularization with GANs for semi-supervised learning

No tl;dr =[

Generative Adversarial Networks are powerful generative models that can model the manifold of natural images. We leverage this property to perform manifold regularization by approximating a variant of the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the semi-supervised feature-matching GAN we achieve state-of-the-art results for GAN-based semi-supervised learning on CIFAR-10 and SVHN benchmarks, with a method that is significantly easier to implement than competing methods. We find that manifold regularization improves the quality of generated images, and is affected by the quality of the GAN used to approximate the regularizer.

tl;dr We suggest a smart batch selection technique called Ada-Boundary.

Neural networks can converge faster with help from a smarter batch selection strategy. In this regard, we propose Ada-Boundary, a novel adaptive-batch selection algorithm that constructs an effective mini-batch according to a learner’s level. Our key idea is to automatically derive the learner’s level using the decision boundary which evolves as the learning progresses. Thus, the samples near the current decision boundary are considered as the most effective to expedite convergence. Taking advantage of our design, Ada-Boundary maintains its dominance in various degrees of training difficulty. We demonstrate the advantage of Ada-Boundary by extensive experiments using two convolutional neural networks for three benchmark data sets. The experiment results show that Ada-Boundary improves the training time by up to 31.7% compared with the state-of-the-art strategy and by up to 33.5% compared with the baseline strategy.

##### Generative Feature Matching Networks

tl;dr A new non-adversarial feature matching-based approach to train generative models that achieves state-of-the-art results.

We propose a non-adversarial feature matching-based approach to train generative models. Our approach, Generative Feature Matching Networks (GFMN), leverages pretrained neural networks such as autoencoders and ConvNet classifiers to perform feature extraction. We perform an extensive number of experiments with different challenging datasets, including Imagenet. Our experimental results demonstrate that, due to the expressiveness of the features from pretrained Imagenet classifiers, even by just matching first order statistics, our approach can achieve state-of-the-art results for challenging benchmarks such as CIFAR10 and STL10.

##### Variational Smoothing in Recurrent Neural Network Language Models

No tl;dr =[

We present a new theoretical perspective of data noising in recurrent neural network language models (Xie et al., 2017). We show that each variant of data noising is an instance of Bayesian recurrent neural networks with a particular variational distribution (i.e., a mixture of Gaussians whose weights depend on statistics derived from the corpus such as the unigram distribution). We use this insight to propose a more principled method to apply at prediction time and propose natural extensions to data noising under the variational framework. In particular, we propose variational smoothing with tied input and output embedding matrices and an element-wise variational smoothing method. We empirically verify our analysis on two benchmark language modeling datasets and demonstrate performance improvements over existing data noising methods.

##### IMAGE DEFORMATION META-NETWORK FOR ONE-SHOT LEARNING

No tl;dr =[

Humans can robustly learn novel visual concepts even when images undergo various deformations and loose certain information. Incorporating this ability to synthesize deformed instances of new concepts might help visual recognition systems perform better one-shot learning, i.e., learning concepts from one or few examples. Our key insight is that, while the deformed images might not be visually realistic, they still maintain critical semantic information and contribute significantly in formulating classifier decision boundaries. Inspired by the recent progress on meta-learning, we combine a meta-learner with an image deformation network that produces additional training examples, and optimize both models in an endto- end manner. The deformation network learns to synthesize images by fusing a pair of images—a probe image that keeps the visual content and a gallery image that diversifies the deformations. We demonstrate results on the widely used oneshot learning benchmarks (miniImageNet and ImageNet 1K challenge datasets), which significantly outperform the previous state-of-the-art approaches.

##### A CASE STUDY ON OPTIMAL DEEP LEARNING MODEL FOR UAVS

tl;dr case study on optimal deep learning model for UAVs

Over the passage of time Unmanned Autonomous Vehicles (UAVs), especially Autonomous flying drones grabbed a lot of attention in Artificial Intelligence. Since electronic technology is getting smaller, cheaper and more efficient, huge advancement in the study of UAVs has been observed recently. From monitoring floods, discerning the spread of algae in water bodies to detecting forest trail, their application is far and wide. Our work is mainly focused on autonomous flying drones where we establish a case study towards efficiency, robustness and accuracy of UAVs where we showed our results well supported through experiments. We provide details of the software and hardware architecture used in the study. We further discuss about our implementation algorithms and present experiments that provide a comparison between three different state-of-the-art algorithms namely TrailNet, InceptionResnet and MobileNet in terms of accuracy, robustness, power consumption and inference time. In our study, we have shown that MobileNet has produced better results with very less computational requirement and power consumption. We have also reported the challenges we have faced during our work as well as a brief discussion on our future work to improve safety features and performance.

##### Robustness May Be at Odds with Accuracy

tl;dr We show that adversarial robustness might come at the cost of standard classification performance, but also yields unexpected benefits.

We show that there exists an inherent tension between the goal of adversarial robustness and that of standard generalization. Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy. We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists even in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed in practice. Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers. These differences, in particular, seem to result in unexpected benefits: the representations learned by robust models tend to align better with salient data characteristics and human perception.

##### How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification

No tl;dr =[

Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, besides confirming that the model accuracy improves as the amount of training data increases, we also observe that the model robustness improves initially, but there exists a turning point after which robustness starts to deteriorate. How and when such turning points occur vary for different neural networks and different datasets.

##### Invariance and Inverse Stability under ReLU

tl;dr We analyze the invertibility of deep neural networks by studying preimages of ReLU-layers and the stability of the inverse.

We flip the usual approach to study invariance and robustness of neural networks by considering the non-uniqueness and instability of the inverse mapping. We provide theoretical and numerical results on the inverse of ReLU-layers. First, we derive a necessary and sufficient condition on the existence of invariance that provides a geometric interpretation. Next, we move to robustness via analyzing local effects on the inverse. To conclude, we show how this reverse point of view not only provides insights into key effects, but also enables to view adversarial examples from different perspectives.

##### Feature Intertwiners

tl;dr A feature intertwiner module to leverage features from one accurate set to help the learning of another less reliable set.

A well-trained model should classify objects with unanimous score for every category. This requires the high-level semantic features should be alike among samples, despite a wide span in resolution, texture, deformation, etc. Previous works focus on re-designing the loss function or proposing new regularization constraints on the loss. In this paper, we address this problem via a new perspective. For each category, it is assumed that there are two sets in the feature space: one with more reliable information and the other with less reliable source. We argue that the reliable set could guide the feature learning of the less reliable set during training - in spirit of student mimicking teacher’s behavior and thus pushing towards a more compact class centroid in the high-dimensional space. Such a scheme also benefits the reliable set since samples become more closer within the same category - implying that it is easilier for the classifier to identify. We refer to this mutual learning process as feature intertwiner and embed the spirit into object detection. It is well-known that objects of low resolution are more difficult to detect due to the loss of detailed information during network forward pass. We thus regard objects of high resolution as the reliable set and objects of low resolution as the less reliable set. Specifically, an intertwiner is achieved by minimizing the distribution divergence between two sets. We design a historical buffer to represent all previous samples in the reliable set and utilize them to guide the feature learning of the less reliable set. The design of obtaining an effective feature representation for the reliable set is further investigated, where we introduce the optimal transport (OT) algorithm into the framework. Samples in the less reliable set are better aligned with the reliable set with aid of OT metric. Incorporated with such a plug-and-play intertwiner, we achieve an evident improvement over previous state-of-the-arts on the COCO object detection benchmark.

##### Aggregated Momentum: Stability Through Passive Damping

tl;dr We introduce a simple variant of momentum optimization which is able to outperform classical momentum, Nesterov, and Adam on deep learning tasks with minimal hyperparameter tuning.

Momentum is a simple and widely used trick which allows gradient-based optimizers to pick up speed along low curvature directions. Its performance depends crucially on a damping coefficient. Largecamping coefficients can potentially deliver much larger speedups, but are prone to oscillations and instability; hence one typically resorts to small values such as 0.5 or 0.9. We propose Aggregated Momentum (AggMo), a variant of momentum which combines multiple velocity vectors with different damping coefficients. AggMo is trivial to implement, but significantly dampens oscillations, enabling it to remain stable even for aggressive damping coefficients such as 0.999. We reinterpret Nesterov's accelerated gradient descent as a special case of AggMo and analyze rates of convergence for quadratic objectives. Empirically, we find that AggMo is a suitable drop-in replacement for other momentum methods, and frequently delivers faster convergence with little to no tuning.

##### MANIFOLDNET: A DEEP NEURAL NETWORK FOR MANIFOLD-VALUED DATA

No tl;dr =[

Developing deep neural networks (DNNs) for manifold-valued data sets has gained much interest of late in the deep learning research community. Examples of manifold-valued data include data from omnidirectional cameras on automobiles, drones etc., diffusion magnetic resonance imaging, elastography and others. In this paper, we present a novel theoretical framework for DNNs to cope with manifold-valued data inputs. In doing this generalization, we draw parallels to the widely popular convolutional neural networks (CNNs). We call our network the ManifoldNet. As in vector spaces where convolutions are equivalent to computing the weighted mean of functions, an analogous definition for manifold-valued data can be constructed involving the computation of the weighted Fr\'{e}chet Mean (wFM). To this end, we present a provably convergent recursive computation of the wFM of the given data, where the weights makeup the convolution mask, to be learned. Further, we prove that the proposed wFM layer achieves a contraction mapping and hence the ManifoldNet does not need the additional non-linear ReLU unit used in standard CNNs. Operations such as pooling in traditional CNN are no longer necessary in this setting since wFM is already a pooling type operation. Analogous to the equivariance of convolution in Euclidean space to translations, we prove that the wFM is equivariant to the action of the group of isometries admitted by the Riemannian manifold on which the data reside. This equivariance property facilitates weight sharing within the network. We present experiments, using the ManifoldNet framework, to achieve video classification and image reconstruction using an auto-encoder+decoder setting. Experimental results demonstrate the efficacy of ManifoldNet in the context of classification and reconstruction accuracy.

##### Learning Neural Random Fields with Inclusive Auxiliary Generators

tl;dr We develop a new approach to learning neural random fields and show that the new approach obtains state-of-the-art sample generation quality and achieves strong semi-supervised learning results on par with state-of-the-art deep generative models.

Neural random fields (NRFs), which are defined by using neural networks to implement potential functions in undirected models, provide an interesting family of model spaces for machine learning. In this paper we develop a new approach to learning NRFs with inclusive-divergence minimized auxiliary generator - the inclusive-NRF approach. The new approach enables us to flexibly use NRFs in unsupervised, supervised and semi-supervised settings and successfully train them in a black-box manner. Empirically, inclusive-NRFs achieve state-of-the-art sample generation quality on CIFAR-10 in both unsupervised and supervised settings. Semi-supervised inclusive-NRFs show strong classification results on par with state-of-the-art generative model based semi-supervised learning methods, and simultaneously achieve superior generation, on the widely benchmarked datasets - MNIST, SVHN and CIFAR-10.

##### Selective Self-Training for semi-supervised Learning

tl;dr Our proposed algorithm does not use all of the unlabeled data for the training, and it rather uses them selectively.

Most of the conventional semi-supervised learning (SSL) methods assume that the classes of unlabeled data are contained in the set of classes of labeled data. In addition, these methods do not discriminate unlabeled samples and use all the unlabeled data for learning, which is not suitable for realistic situations. In this paper, we propose an SSL method called selective self-training (SST), which selectively decides whether to include each unlabeled sample in the training process or not. It is also designed to be applied to a more realistic situation where classes of unlabeled data are different from the ones of the labeled data. For the conventional SSL problems of fixed classes, the proposed method not only performs comparable to other conventional SSL algorithms, but also can be combined with other SSL algorithms. For the new SSL problems of increased classes where the conventional methods cannot be applied, the proposed method does not show any performance degradation even if the classes of unlabeled data are different from those of the labeled data.

##### A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

tl;dr We use empirical tools of mode connectivity and SVCCA to investigate neural network training heuristics of learning rate restarts, warmup and knowledge distillation.

The convergence rate and final performance of common deep learning models have significantly benefited from recently proposed heuristics such as learning rate schedules, knowledge distillation, skip connections and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining the efficacy of these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit the empirical analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz. mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons why the heuristics succeed. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed in the deeper layers.

##### Learning From the Experience of Others: Approximate Empirical Bayes in Neural Networks

No tl;dr =[

Learning deep neural networks could be understood as the combination of representation learning and learning halfspaces. While most previous work aims to diversify representation learning by data augmentations and regularizations, we explore the opposite direction through the lens of empirical Bayes method. Specifically, we propose a matrix-variate normal prior whose covariance matrix has a Kronecker product structure to capture the correlations in learning different neurons through backpropagation. The prior encourages neurons to learn from the experience of others, hence it provides an effective regularization when training large networks on small datasets. To optimize the model, we design an efficient block coordinate descent algorithm with analytic solutions. Empirically, we show that the proposed method helps the network converge to better local optima that also generalize better, and we verify the effectiveness of the approach on both multiclass classification and multitask regression problems with various network structures.

##### Self-Aware Visual-Textual Co-Grounded Navigation Agent

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instruction in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which instruction is needed next, which way to go, and its navigation progress towards the goal. In this paper, we introduce a self-aware agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress. We test our self- aware agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state-of-art by a significant margin (8% absolute increase in success rate on the unseen test set).

##### Adversarial Sampling for Active Learning

tl;dr ASAL is a pool based active learning method that generates high entropy samples and retrieves matching samples from the pool in sub-linear time.

This paper proposes ASAL, a new pool based active learning method that generates high entropy samples. Instead of directly annotating the synthetic samples, ASAL searches similar samples from the pool and includes them for training. Hence, the quality of new samples is high and annotations are reliable. ASAL is particularly suitable for large data sets because it achieves a better run-time complexity (sub-linear) for sample selection than traditional uncertainty sampling (linear). We present a comprehensive set of experiments on two data sets and show that ASAL outperforms similar methods and clearly exceeds the established baseline (random sampling). In the discussion section we analyze in which situations ASAL performs best and why it is sometimes hard to outperform random sample selection. To the best of our knowledge this is the first adversarial active learning technique that is applied for multiple class problems using deep convolutional classifiers and demonstrates superior performance than random sample selection.

##### Shallow Learning For Deep Networks

tl;dr We build CNNs layer by layer without end to end training and show that this kind of approach can scale to Imagenet, while having multiple favorable properties.

Shallow supervised 1-hidden layer neural networks have a number of favorable properties that make them easier to interpret, analyze, and optimize than their deep counterparts, but lack their representational power. Here we use 1-hiddenlayer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. Contrary to previous approaches using shallow networks, we focus on problems where deep learning is reportedas critical for success. We thus study CNNs on two large-scale image recognition tasks: ImageNet and CIFAR-10. Using a simple set of ideas for architecture and training we find that solving sequential 1-hidden-layer auxiliary problemsleads to a CNN that exceeds AlexNet performance on ImageNet. Extending ourtraining methodology to construct individual layers by solving 2-and-3-hiddenlayer auxiliary problems, we obtain an 11-layer network that exceeds VGG-11 on ImageNet obtaining 89.8% top-5 single crop. To our knowledge, this is the first competitive alternative to end-to-end training of CNNs that can scale to ImageNet. We conduct a wide range of experiments to study the properties this induces on the intermediate layers.

##### Unsupervised Learning via Meta-Learning

tl;dr An unsupervised learning method that uses meta-learning to enable efficient learning of downstream image classification tasks, outperforming state-of-the-art methods.

A central goal of unsupervised learning is to acquire representations from unlabeled data or experience that can be used for more effective learning of downstream tasks from modest amounts of labeled data. Many prior unsupervised learning works aim to do so by developing proxy objectives based on reconstruction, disentanglement, prediction, and other metrics. Instead, we develop an unsupervised learning method that explicitly optimizes for the ability to learn a variety of tasks from small amounts of data. To do so, we construct tasks from unlabeled data in an automatic way and run meta-learning over the constructed tasks. Surprisingly, we find that relatively simple mechanisms for task design, such as clustering unsupervised representations, lead to good performance on a variety of downstream tasks. Our experiments across four image datasets indicate that our unsupervised meta-learning approach acquires a learning algorithm without any labeled data that is applicable to a wide range of downstream classification tasks, improving upon the representation learned by four prior unsupervised learning methods.

##### Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

tl;dr We use a hypernetwork to predict optimal weights given hyperparameters, and jointly train everything together.

Hyperparameter optimization is a bi-level optimization problem, where the optimal parameters on the training set depend on the current hyperparameters. The best-response function which maps hyperparameters to these optimal parameters allows gradient-based hyperparameter optimization but is difficult to represent and compute when the parameters are high dimensional, as in neural networks. We develop efficient best-response approximations for neural networks by applying insights from the structure of the optimal response in a Jacobian-regularized two-layer linear network to deep, nonlinear networks. The approximation works by scaling and shifting the hidden units by amounts which depend on the current hyperparameters. We use our approximation for a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response in a neighborhood around the current hyperparameters and optimizing the hyperparameters using the approximate best-response. We show this method outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks.

##### Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning

tl;dr This paper introduces Morpho-MNIST, a collection of shape metrics and perturbations, in a step towards quantitative evaluation of representation learning in computer vision.

Revealing latent structure in data is an active field of research, having brought exciting new models such as variational autoencoders and generative adversarial networks, and is essential to push machine learning towards unsupervised knowledge discovery. However, a major challenge is the lack of suitable benchmarks for an objective and quantitative evaluation of learned representations. To address this issue we introduce Morpho-MNIST. We extend the popular MNIST dataset by adding a morphometric analysis enabling quantitative comparison of different models, identification of the roles of latent variables, and characterisation of sample diversity. We further propose a set of quantifiable perturbations to assess the performance of unsupervised and supervised methods on challenging tasks such as outlier detection and domain adaptation.

##### DEEP-TRIM: REVISITING L1 REGULARIZATION FOR CONNECTION PRUNING OF DEEP NETWORK

tl;dr We revisit the simple idea of pruning connections of DNNs through $\ell_1$ regularization achieving state-of-the-art results on multiple datasets with theoretic guarantees.

State-of-the-art deep neural networks (DNNs) typically have tens of millions of parameters, which might not fit into the upper levels of the memory hierarchy, thus increasing the inference time and energy consumption significantly, and prohibiting their use on edge devices such as mobile phones. The compression of DNN models has therefore become an active area of research recently, with \emph{connection pruning} emerging as one of the most successful strategies. A very natural approach is to prune connections of DNNs via $\ell_1$ regularization, but recent empirical investigations have suggested that this does not work as well in the context of DNN compression. In this work, we revisit this simple strategy and analyze it rigorously, to show that: (a) any \emph{stationary point} of an $\ell_1$-regularized layerwise-pruning objective has its number of non-zero elements bounded by the number of penalized prediction logits, regardless of the strength of the regularization; (b) successful pruning highly relies on an accurate optimization solver, and there is a trade-off between compression speed and distortion of prediction accuracy, controlled by the strength of regularization. Our theoretical results thus suggest that $\ell_1$ pruning could be successful provided we use an accurate optimization solver. We corroborate this in our experiments, where we show that simple $\ell_1$ regularization with an Adamax-L1(cumulative) solver gives pruning ratio competitive to the state-of-the-art.

##### Neural Collobrative Networks

No tl;dr =[

This paper presents a conceptually general and modularized neural collaborative network (NCN), which overcomes the limitations of the traditional convolutional neural networks (CNNs) in several aspects. Firstly, our NCN can directly handle non-Euclidean data without any pre-processing (e.g., graph normalizations) by defining a simple yet basic unit named neuron array for feature representation. Secondly, our NCN is capable of achieving both rotational equivariance and invariance properties via a simple yet powerful neuron collaboration mechanism, which imposes a glocal'' operation to capture both global and local information among neuron arrays within each layer. Thirdly, compared to the state-of-the-art networks that using large CNN kernels, our NCN with considerably fewer parameters can also achieve their strengths in feature learning by only exploiting highly efficient 1x1 convolution operations. Extensive experimental analyses on learning feature representation, handling novel viewpoints, and handling non-euclidean data demonstrate that our NCN can not only achieve state-of-the-art performance but also overcome the limitation of the conventional CNNs. The source codes will be released to facilite future researches after the review period for ensuring the anonymity.

##### Characterizing Audio Adversarial Examples Using Temporal Dependency

tl;dr Adversarial audio discrimination using temporal dependency

Recent studies have highlighted adversarial examples as a ubiquitous threat to different neural network models and many downstream applications. Nonetheless, as unique data properties have inspired distinct and powerful learning principles, this paper aims to explore their potentials towards mitigating adversarial inputs. In particular, our results reveal the importance of using the temporal dependency in audio data to gain discriminate power against adversarial examples. Tested on the automatic speech recognition (ASR) tasks and three recent audio adversarial attacks, we find that (i) input transformation developed from image adversarial defense provides limited robustness improvement and is subtle to advanced attacks; (ii) temporal dependency can be exploited to gain discriminative power against audio adversarial examples and is resistant to adaptive attacks considered in our experiments. Our results not only show promising means of improving the robustness of ASR systems, but also offer novel insights in exploiting domain-specific data properties to mitigate negative effects of adversarial examples.

##### Equi-normalization of Neural Networks

tl;dr Fast iterative algorithm to balance the energy of a network while staying in the same functional equivalence class

Modern neural networks are over-parametrized. In particular, each rectified linear hidden unit can be modified by a multiplicative factor by adjusting input and out- put weights, without changing the rest of the network. Inspired by the Sinkhorn-Knopp algorithm, we introduce a fast iterative method for minimizing the l2 norm of the weights, equivalently the weight decay regularizer. It provably converges to a unique solution. Interleaving our algorithm with SGD during training improves the test accuracy. For small batches, our approach offers an alternative to batch- and group- normalization on CIFAR-10 and ImageNet with a ResNet-18.

##### Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout

tl;dr Realizing the drawbacks when applying original dropout on DenseNet, we craft the design of dropout method from three aspects, the idea of which could also be applied on other CNN models.

Recently convolutional neural networks (CNNs) achieve great accuracy in visual recognition tasks. DenseNet becomes one of the most popular CNN models due to its effectiveness in feature-reuse. However, like other CNN models, DenseNets also face overfitting problem if not severer. Existing dropout method can be applied but not as effective due to the introduced nonlinear connections. In particular, the property of feature-reuse in DenseNet will be impeded, and the dropout effect will be weakened by the spatial correlation inside feature maps. To address these problems, we craft the design of a specialized dropout method from three aspects, dropout location, dropout granularity, and dropout probability. The insights attained here could potentially be applied as a general approach for boosting the accuracy of other CNN models with similar nonlinear connections. Experimental results show that DenseNets with our specialized dropout method yield better accuracy compared to vanilla DenseNet and state-of-the-art CNN models, and such accuracy boost increases with the model depth.

##### Jumpout: Improved Dropout for Deep Neural Networks with Rectified Linear Units

tl;dr Jumpout applies three simple yet effective modifications to dropout, based on novel understandings about the generalization performance of DNN with ReLU in local regions.

Dropout is a simple yet effective technique to improve the generalization performance and prevent overfitting in deep neural networks (DNNs). In this paper, we discuss three novel observations about dropout to better understand the generalization of DNNs with rectified linear unit (ReLU) activations: 1) dropout is a smoothing technique that encourages each local linear model of a DNN to be trained on data points from nearby regions; 2) a constant dropout rate can result in effective neural-deactivation rates that are significantly different for layers with different fractions of activated neurons; and 3) the rescaling factor of dropout causes an inconsistency to occur between the normalization during training and testing conditions when batch normalization is also used. The above leads to three simple but nontrivial improvements to dropout resulting in our proposed method "Jumpout." Jumpout samples the dropout rate using a monotone decreasing distribution (such as the right part of a truncated Gaussian), so the local linear model at each data point is trained, with high probability, to work better for data points from nearby than from more distant regions. Instead of tuning a dropout rate for each layer and applying it to all samples, jumpout moreover adaptively normalizes the dropout rate at each layer and every training sample/batch, so the effective dropout rate applied to the activated neurons are kept the same. Moreover, we rescale the outputs of jumpout for a better trade-off that keeps both the variance and mean of neurons more consistent between training and test phases, which mitigates the incompatibility between dropout and batch normalization. Compared to the original dropout, jumpout shows significantly improved performance on CIFAR10, CIFAR100, Fashion- MNIST, STL10, SVHN, ImageNet-1k, etc., while introducing negligible additional memory and computation costs.

tl;dr Auxiliary tasks need to match the main task to improve learning; we propose to use cosine distance between gradients of an unknown auxiliary task to protect from negative interference with learning the main task.

##### Pearl: Prototype lEArning via Rule Lists

tl;dr a method combining rule list learning and prototype learning

Deep neural networks have demonstrated promising classification performance on many healthcare applications. However, the interpretability of those models are often lacking. On the other hand, classical interpretable models such as rule lists or decision trees do not lead to the same level of accuracy as deep neural networks. Despite their interpretable structures, the resulting rules are often too complex to be interpretable (due to the potentially large depth of rule lists). In this work, we present PEARL, short for Prototype lEArning via Rule Lists, which iteratively use rule lists to guide a neural network to learn representative data prototypes. The resulting prototype neural network provides accurate prediction, and the prediction can be easily explained by prototype and its guiding rule lists. Thanks to the prediction power of neural networks, the rule lists defining prototypes are more concise and hence provide better interpretability. On two real-world electronic healthcare records (EHR) datasets, PEARL consistently outperforms all baselines, achieving performance improvement over conventional rule learning by up to 28% and over prototype learning by up to 3%. Experimental results also show the resulting interpretation of PEARL is simpler than the standard rule learning.

##### Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

tl;dr We train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure, enabled by the new MAESTRO dataset.

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.

##### Learn From Neighbour: A Curriculum That Train Low Weighted Samples By Imitating

No tl;dr =[

Deep neural networks, which gain great success in a wide spectrum of applications, are often time, compute and storage hungry. Curriculum learning proposed to boost training of network by a syllabus from easy to hard. However, the relationship between data complexity and network training is unclear: why hard example harm the performance at beginning but helps at end. In this paper, we aim to investigate on this problem. Similar to internal covariate shift in network forward pass, the distribution changes in weight of top layers also affects training of preceding layers during the backward pass. We call this phenomenon inverse "internal covariate shift". Training hard examples aggravates the distribution shifting and damages the training. To address this problem, we introduce a curriculum loss that consists of two parts: a) an adaptive weight that mitigates large early punishment; b) an additional representation loss for low weighted samples. The intuition of the loss is very simple. We train top layers on "good" samples to reduce large shifting, and encourage "bad" samples to learn from "good" sample. In detail, the adaptive weight assigns small values to hard examples, reducing the influence of noisy gradients. On the other hand, the less-weighted hard sample receives the proposed representation loss. Low-weighted data gets nearly no training signal and can stuck in embedding space for a long time. The proposed representation loss aims to encourage their training. This is done by letting them learn a better representation from its superior neighbours but not participate in learning of top layers. In this way, the fluctuation of top layers is reduced and hard samples also received signals for training. We found in this paper that curriculum learning needs random sampling between tasks for better training. Our curriculum loss is easy to combine with existing stochastic algorithms like SGD. Experimental result shows an consistent improvement over several benchmark datasets.

##### Learning Deep Embeddings in Krein Spaces

tl;dr We propose a solution that realizes deep embeddings in Krein spaces.

The non-linear embedding achieved by a Siamese network is indeed a realization of a Hilbert space, \ie, a metric space with a positive definite inner product. Krein spaces generalize the notion of Hilbert spaces to geometrical structures with indefinite inner products. As a result, distances and norms in a Krein space can become negative. The negative spectral of an inner product is usually attributed to observation noise, though such a claim has never been fully studied, nor proved. Seeking how Krein spaces can be constructed from data, we propose a simple and innocent-looking modification to Siamese networks, equipping them with the power to realize indefinite inner-products. This provides a data-driven technique to decide whether the negative spectrum of an inner-product is helpful or not. We empirically show that our Krein embeddings outperform Hilbert space embeddings on recognition tasks.

##### Adversarial Defense Via Data Dependent Activation Function and Total Variation Minimization

tl;dr We proposal strategies for adversarial defense based on data dependent activation function, total variation minimization, and training data augmentation

We improve the robustness of deep neural nets to adversarial attacks by using an interpolating function as the output activation. This data-dependent activation function remarkably improves both classification accuracy and stability to adversarial perturbations. Together with the total variation minimization of adversarial images and augmented training, under the strongest attack, we achieve up to 20.6%, 50.7%, and 68.7% accuracy improvement w.r.t. the fast gradient sign method, iterative fast gradient sign method, and Carlini-WagnerL2attacks, respectively. Our defense strategy is additive to many of the existing methods. We give an intuitive explanation of our defense strategy via analyzing the geometry of the feature space. For reproducibility, the code will be available on GitHub.

##### Probabilistic Knowledge Graph Embeddings

tl;dr Scalable hyperparameter learning for knowledge graph embedding models using variational EM

We develop a probabilistic extension of state-of-the-art embedding models for link prediction in relational knowledge graphs. Knowledge graphs are collections of relational facts, where each fact states that a certain relation holds between two entities, such as people, places, or objects. We argue that knowledge graphs should be treated within a Bayesian framework because even large knowledge graphs typically contain only few facts per entity, leading effectively to a small data problem where parameter uncertainty matters. We introduce a probabilistic reinterpretation of the DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) models and employ variational inference to estimate a lower bound on the marginal likelihood of the data. We find that the main benefit of the Bayesian approach is that it allows for efficient, gradient based optimization over hyperparameters, which would lead to divergences in a non-Bayesian treatment. Models with such learned hyperparameters improve over the state-of-the-art by a significant margin, as we demonstrate on several benchmarks.

##### Exponentially Decaying Flows for Optimization in Deep Learning

tl;dr Introduction of a new optimization method and its application to deep learning.

The field of deep learning has been craving for an optimization method that shows outstanding property for both optimization and generalization. We propose a method for mathematical optimization based on flows along geodesics, that is, the shortest paths between two points, with respect to the Riemannian metric induced by a non-linear function. In our method, the flows refer to Exponentially Decaying Flows (EDF), as they can be designed to converge on the local solutions exponentially. In this paper, we conduct experiments to show its high performance on optimization benchmarks (i.e., convergence properties), as well as its potential for producing good machine learning benchmarks (i.e., generalization properties).

##### Music Transformer

tl;dr We show the first successful use of Transformer in generating music that exhibits long-term structure.

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity is quadratic in the sequence length. We propose an algorithm that reduces the intermediate memory requirements to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long (thousands of steps) compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-competition, and obtain state-of-the-art results on the latter.

##### Like What You Like: Knowledge Distill via Neuron Selectivity Transfer

tl;dr We treat knowledge distill as a distribution matching problem and adopt Maximum Mean Discrepancy to minimize the distances between student features and teacher features.

Despite deep neural networks have demonstrated extraordinary power in various applications, their superior performances are at expense of high storage and computational costs. Consequently, the acceleration and compression of neural networks have attracted much attention recently. Knowledge Transfer (KT), which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the popular solutions. In this paper, we propose a novel knowledge transfer method by treating it as a distribution matching problem. Particularly, we match the distributions of neuron selectivity patterns between teacher and student networks. To achieve this goal, we devise a new KT loss function by minimizing the Maximum Mean Discrepancy (MMD) metric between these distributions. Combined with the original loss function, our method can significantly improve the performance of student networks. We validate the effectiveness of our method across several datasets, and further combine it with other KT methods to explore the best possible results. Last but not least, we fine-tune the model to other tasks such as object detection. The results are also encouraging, which confirm the transferability of the learned features.

##### DeepOBS: A Deep Learning Optimizer Benchmark Suite

tl;dr We provide a software package that drastically simplifies, automates, and improves the evaluation of deep learning optimizers.

Because the choice and tuning of the optimizer affects the speed, and ultimately the performance of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It is written in TensorFlow and available open source.

##### Learning Implicitly Recurrent CNNs Through Parameter Sharing

tl;dr We propose a method that enables CNN folding to create recurrent connections

We introduce a parameter sharing scheme, in which different layers of a convolutional neural network (CNN) are defined by a learned linear combination of parameter tensors from a global bank of templates. Restricting the number of templates yields a flexible hybridization of traditional CNNs and recurrent networks. Compared to traditional CNNs, we demonstrate substantial parameter savings on standard image classification tasks, while maintaining accuracy. Our simple parameter sharing scheme, though defined via soft weights, in practice yields trained networks with near strict recurrent structure; with negligible side effects, they convert into networks with actual recurrent loops. Training these networks thus implicitly involves discovery of suitable recurrent architectures. As a consequence, our hybrid networks are not only more parameter efficient, but also learn some tasks faster. Specifically, on synthetic tasks which are algorithmic in nature, our hybrid networks both train faster and extrapolate better on test examples outside the span of the training set.

##### Online Hyperparameter Adaptation via Amortized Proximal Optimization

tl;dr We introduce amortized proximal optimization (APO), a method to adapt a variety of optimization hyperparameters online during training, including learning rates, damping coefficients, and gradient variance exponents.

Effective performance of neural networks depends criticially on effective tuning of optimization hyperparameters, especially learning rates (and schedules thereof). We present Amortized Proximal Optimization, which takes the perspective that each optimization step should approximately minimize a proximal objective (similar to the ones used to motivate natural gradient and trust region policy optimization). Optimization hyperparameters are adapted to best minimize the proximal objective after one weight update. We show that an idealized version of APO (where an oracle minimizes the proximal objective exactly) achieves second-order convergence rates for neural networks. APO incurs minimal computational overhead. We experiment with using APO to adapt a variety of optimization hyperparameters online during training, including (possibly layer-specific) learning rates, damping coefficients, and gradient variance exponents. For a variety of network architectures and optimization algorithms (including SGD, RMSprop, and K-FAC), we show that with minimal tuning, APO performs competitively with carefully tuned optimizers.

##### Improving machine classification using human uncertainty measurements

tl;dr improving classifiers using human uncertainty measurements

As deep CNN classifier performance using ground-truth labels has begun to asymptote at near-perfect levels, a key aim for the field is to extend training paradigms to capture further useful structure in natural image data and improve model robustness and generalization. In this paper, we present a novel natural image benchmark for making this extension, which we call CIFAR10H. This new dataset comprises a human-derived, full distribution over labels for each image of the CIFAR10 test set, offering the ability to assess the generalization of state-of-the-art CIFAR10 models, as well as investigate the effects of including this information in model training. We show that classification models trained on CIFAR10 do not generalize as well to our dataset as it does to traditional extensions, and that models fine-tuned using our label information are able to generalize better to related datasets, complement popular data augmentation schemes, and provide robustness to adversarial attacks. We explain these improvements in terms of better empirical approximations to the expected loss function over natural images and their categories in the visual world.

##### Manifold Mixup: Learning Better Representations by Interpolating Hidden States

tl;dr A method for learning better representations, that acts as a regularizer and despite its no significant additional computation cost , achieves improvements over strong baselines on Supervised and Semi-supervised Learning tasks.

Deep networks often perform well on the data distribution on which they are trained, yet give incorrect (and often very confident) answers when evaluated on points from off of the training distribution. This is exemplified by the adversarial examples phenomenon but can also be seen in terms of model generalization and domain shift. Ideally, a model would assign lower confidence to points unlike those from the training distribution. We propose a regularizer which addresses this issue by training with interpolated hidden states and encouraging the classifier to be less confident at these points. Because the hidden states are learned, this has an important effect of encouraging the hidden states for a class to be concentrated in such a way so that interpolations within the same class or between two different classes do not intersect with the real data points from other classes. This has a major advantage in that it avoids the underfitting which can result from interpolating in the input space. We prove that the exact condition for this problem of underfitting to be avoided by Manifold Mixup is that the dimensionality of the hidden states exceeds the number of classes, which is often the case in practice. Additionally, this concentration can be seen as making the features in earlier layers more discriminative. We show that despite requiring no significant additional computation, Manifold Mixup achieves large improvements over strong baselines in supervised learning, robustness to single-step adversarial attacks, semi-supervised learning, and Negative Log-Likelihood on held out samples.

##### MahiNet: A Neural Network for Many-Class Few-Shot Learning with Class Hierarchy

tl;dr A memory-augmented neural network that addresses many-class few-shot problem by leveraging class hierarchy in both supervised learning and meta-learning.

We study many-class few-shot (MCFS) problem in both supervised learning and meta-learning scenarios. Compared to the well-studied many-class many-shot and few-class few-shot problems, MCFS problem commonly occurs in practical applications but is rarely studied. MCFS brings new challenges because it needs to distinguish between many classes, but only a few samples per class are available for training. In this paper, we propose memory-augmented hierarchical-classification network (MahiNet)'' for MCFS learning. It addresses the many-class'' problem by exploring the class hierarchy, e.g., the coarse-class label that covers a subset of fine classes, which helps to narrow down the candidates for the fine class and is cheaper to obtain. MahiNet uses a convolutional neural network (CNN) to extract features, and integrates a memory-augmented attention module with a multi-layer perceptron (MLP) to produce the probabilities over coarse and fine classes. While the MLP extends the linear classifier, the attention module extends a KNN classifier, both together targeting the few-shot'' problem. We design different training strategies of MahiNet for supervised learning and meta-learning. Moreover, we propose two novel benchmark datasets ''mcfsImageNet'' (as a subset of ImageNet) and ''mcfsOmniglot'' (re-splitted Omniglot) specifically for MCFS problem. In experiments, we show that MahiNet outperforms several state-of-the-art models on MCFS classification tasks in both supervised learning and meta-learning scenarios.

##### Penetrating the Fog: the Path to Efficient CNN Models

tl;dr We are the first in the field to show how to craft an effective sparse kernel design from three aspects: composition, performance and efficiency.

With the increasing demand to deploy convolutional neural networks (CNNs) on mobile platforms, the sparse kernel approach was proposed, which could save more parameters than the standard convolution while maintaining accuracy. However, despite the great potential, no prior research has pointed out how to craft an sparse kernel design with such potential (i.e., effective design), and all prior works just adopt simple combinations of existing sparse kernels such as group convolution. Meanwhile due to the large design space it is also impossible to try all combinations of existing sparse kernels. In this paper, we are the first in the field to consider how to craft an effective sparse kernel design by eliminating the large design space. Specifically, we present a sparse kernel scheme to illustrate how to reduce the space from three aspects. First, in terms of composition we remove designs composed of repeated layers. Second, to remove designs with large accuracy degradation, we find an unified property named~\emph{information field} behind various sparse kernel designs, which could directly indicate the final accuracy. Last, we remove designs in two cases where a better parameter efficiency could be achieved. Additionally, we provide detailed efficiency analysis on the final 4 designs in our scheme. Experimental results validate the idea of our scheme by showing that our scheme is able to find designs which are more efficient in using parameters and computation with similar or higher accuracy.

##### Rethinking the Value of Network Pruning

tl;dr In network pruning, fine-tuning a pruned model only gives comparable or worse performance than training it from scratch.

Network pruning is widely used for reducing the heavy computational cost of deep networks. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning, and each of the three stages is considered as indispensable. In this work, we make several surprising observations which contradict common beliefs. For all the six state-of-the-art pruning algorithms we examined, fine-tuning a pruned model only gives comparable or even worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined architecture of the target pruned network, one can completely get rid of the pipeline and directly train the target network from scratch. Our observations are consistent for a wide variety of pruning algorithms with multiple network architectures, datasets and tasks. Our results have several implications: 1) training an over-parameterized model is not necessary to obtain an efficient final model, 2) learned important'' weights of the large model are not necessarily helpful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited important'' weights, is what leads to the efficiency benefit in the final model, which suggests that some pruning algorithms could be seen as performing network architecture search.

##### AIM: Adversarial Inference by Matching Priors and Conditionals

No tl;dr =[

Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel approach, Adversarial Inference by Matching priors and conditionals (AIM), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of AIM on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results demonstrate that AIM significantly improves both reconstruction and generation as compared to other adversarial inference models.

##### Local Binary Pattern Networks for Character Recognition

No tl;dr =[

Memory and computation efficient deep learning architectures are crucial to the continued proliferation of machine learning capabilities to new platforms and systems. Binarization of operations in convolutional neural networks has shown promising results in reducing the model size and computing efficiency. In this paper, we tackle the character recognition problem using a strategy different from the existing literature by proposing local binary pattern networks or LBPNet that can learn and perform bit-wise operations in an end-to-end fashion. LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations, providing important means to improve memory and speed efficiency that is particularly suited for small footprint devices and hardware accelerators. These operations can be implemented efficiently on different platforms including direct hardware implementation. LBPNet demonstrates its particular advantage on the character classification task where the content is composed of strokes. We applied LBPNet to benchmark datasets like MNIST, SVHN, DHCD, ICDAR, and Chars74K and observed encouraging results.

tl;dr We design an adversarial training method to Bayesian neural networks, showing a much stronger defense to white-box adversarial attacks

We present a new algorithm to train a robust neural network against adversarial attacks. Our algorithm is motivated by the following two ideas. First, although recent work has demonstrated that fusing randomness can improve the robustness of neural networks (Liu 2017), we noticed that adding noise blindly to all the layers is not the optimal way to incorporate randomness. Instead, we model randomness under the framework of Bayesian Neural Network (BNN) to formally learn the posterior distribution of models in a scalable way. Second, we formulate the mini-max problem in BNN to learn the best model distribution under adversarial attacks, leading to an adversarial-trained Bayesian neural net. Experiment results demonstrate that the proposed algorithm achieves state-of-the-art performance under strong attacks. On CIFAR-10 with VGG network, our model leads to 14% accuracy improvement compared with adversarial training (Madry 2017) and random self-ensemble (Liu, 2017) under PGD attack with 0.035 distortion, and the gap becomes even larger on a subset of ImageNet.

##### Optimal Completion Distillation for Sequence Learning

tl;dr Optimal Completion Distillation (OCD) is a training procedure for optimizing sequence to sequence models based on edit distance which achieves state-of-the-art on end-to-end Speech Recognition tasks.

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pre-training or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution which puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.8\%$ WER, respectively.

##### Explainable Adversarial Learning: Implicit Generative Modeling of Random Noise during Training for Adversarial Robustness

tl;dr Noise modeling at the input during discriminative training improves adversarial robustness. Propose PCA based evaluation metric for adversarial robustness

We introduce Explainable Adversarial Learning, ExL, an approach for training neural networks that are intrinsically robust to adversarial attacks. We find that the implicit generative modeling of random noise with the same loss function used during posterior maximization, improves a model's understanding of the data manifold furthering adversarial robustness. We prove our approach's efficacy and provide a simplistic visualization tool for understanding adversarial data, using Principal Component Analysis. Our analysis reveals that adversarial robustness, in general, manifests in models with higher variance along the high-ranked principal components. We show that models learnt with our approach perform remarkably well against a wide-range of attacks. Furthermore, combining ExL with state-of-the-art adversarial training extends the robustness of a model, even beyond what it is adversarially trained for, in both white-box and black-box attack scenarios.

##### Deep Learning 3D Shapes Using Alt-az Anisotropic 2-Sphere Convolution

tl;dr A method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on 2-sphere.

The ground-breaking performance obtained by deep convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to extend it for 3D geometric tasks. One of the main challenge in applying CNNs to 3D shape analysis is how to define a natural convolution operator on non-euclidean surfaces. In this paper, we present a method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on 2-sphere. A cascade set of geodesic disk filters rotate on the 2-sphere and collect multi-level spherical patterns to extract non-trivial features for various 3D shape analysis tasks. We demonstrate theoretically and experimentally that our proposed method has the possibility to bridge the gap between 2D images and 3D shapes with the desired rotation equivariance/invariance, and its effectiveness is evaluated in applications of non-rigid/ rigid shape classification and shape retrieval.

##### Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep Learning

tl;dr One of theoretical issues in deep learning

Although stochastic gradient descent (SGD) is a driving force behind the recent success of deep learning, our understanding of its dynamics in a high-dimensional parameter space is limited. In recent years, some researchers have used the stochasticity of minibatch gradients, or the signal-to-noise ratio, to better characterize the learning dynamics of SGD. Inspired from these work, we here analyze SGD from a geometrical perspective by inspecting the stochasticity of the norms and directions of minibatch gradients. We propose a model of the directional concentration for minibatch gradients through von Mises-Fisher (VMF) distribution, and show that the directional uniformity of minibatch gradients increases over the course of SGD. We empirically verify our result using deep convolutional networks and observe a higher correlation between the gradient stochasticity and the proposed directional uniformity than that against the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients is a major factor behind SGD.

##### There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

tl;dr Consistency-based models for semi-supervised learning do not converge to a single point but continue to explore a diverse set of plausible solutions on the perimeter of a flat region. Weight averaging helps improve generalization performance.

Presently the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. The consistency loss dramatically improves generalization performance over supervised-only training; however, we show that SGD struggles to converge on the consistency loss and continues to make large steps that lead to changes in predictions on the test data. We show that averaging weights can significantly improve their generalization performance. Motivated by these observations, we propose to train consistency-based methods with Stochastic Weight Averaging (SWA), a recent approach which averages weights along the trajectory of SGD with a modified learning rate schedule. We also propose fast-SWA, which further accelerates convergence by averaging multiple points within each cycle of a cyclical learning rate schedule. With weight averaging, we achieve the best known semi-supervised results on CIFAR-10 and CIFAR-100 over many different settings of training labels. For example, we achieve 5.0% error on CIFAR-10 with only 4000 labels, compared to the previous best result in the literature of 6.3%.

##### DELTA: DEEP LEARNING TRANSFER USING FEATURE MAP WITH ATTENTION FOR CONVOLUTIONAL NETWORKS

tl;dr improving deep transfer learning with regularization using attention based feature maps

Transfer learning through fine-tuning a pre-trained neural network with an extremely large dataset, such as ImageNet, can significantly accel- erate training while the accuracy is frequently bottlenecked by the lim- ited dataset size of the new target task. To solve the problem, some regularization methods, constraining the outer layer weights of the tar- get network using the starting point as references (SPAR), have been studied. In this paper, we propose a novel regularized transfer learn- ing framework DELTA, namely DEep Learning Transfer using Fea- ture Map with Attention. Instead of constraining the weights of neu- ral network, DELTA aims to preserve the outer layer outputs of the target network. Specifically, in addition to minimizing the empirical loss, DELTA intends to align the outer layer outputs of two networks, through constraining a subset of feature maps that are precisely selected by attention that has been learned in an supervised learning manner. We evaluate DELTA with the state-of-the-art algorithms, including L2 and L2-SP. The experiment results show that our proposed method outper- forms these baselines with higher accuracy for new tasks.

##### SHE2: Stochastic Hamiltonian Exploration and Exploitation for Derivative-Free Optimization

tl;dr a new derivative-free optimization algorithms derived from Nesterov's accelerated gradient methods and Hamiltonian dynamics

Derivative-free optimization (DFO) using trust region methods is frequently used for machine learning applications, such as (hyper-)parameter optimization without the derivatives of objective functions known. Inspired by the recent work in continuous-time minimizers, our work models the common trust region methods with the exploration-exploitation using a dynamical system coupling a pair of dynamical processes. While the first exploration process searches the minimum of the blackbox function through minimizing a time-evolving surrogation function, another exploitation process updates the surrogation function time-to-time using the points traversed by the exploration process. The efficiency of derivative-free optimization thus depends on ways the two processes couple. In this paper, we propose a novel dynamical system, namely \ThePrev---\underline{S}tochastic \underline{H}amiltonian \underline{E}xploration and \underline{E}xploitation, that surrogates the subregions of blackbox function using a time-evolving quadratic function, then explores and tracks the minimum of the quadratic functions using a fast-converging Hamiltonian system. The \ThePrev\ algorithm is later provided as a discrete-time numerical approximation to the system. To further accelerate optimization, we present \TheName\ that parallelizes multiple \ThePrev\ threads for concurrent exploration and exploitation. Experiment results based on a wide range of machine learning applications show that \TheName\ outperform a boarder range of derivative-free optimization algorithms with faster convergence speed under the same settings.

##### The Case for Full-Matrix Adaptive Regularization

Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Due to the large number of parameters of machine learning problems, full-matrix preconditioning methods are prohibitively expensive. We show how to modify full-matrix adaptive regularization in order to make it practical and effective. We also provide novel theoretical analysis for adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, consists of efficient inverse computation of square roots of low-rank matrices. Our preliminary experiments underscore improved convergence rate of GGT across a variety of synthetic tasks and standard deep learning benchmarks.

##### Incremental Few-Shot Learning with Attention Attractor Networks

No tl;dr =[

Machine learning classifiers are often trained to recognize a set of pre-defined classes. However, in many real applications, it is often desirable to have the flexibility of learning additional concepts, without re-training on the full training set. This paper addresses this problem, incremental few-shot learning, where a regular classification network has already been trained to recognize a set of base classes; and several extra novel classes are being considered, each with only a few labeled examples. After learning the novel classes, the model is then evaluated on the overall performance of both base and novel classes. To this end, we propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes. In each episode, we train a set of new weights to recognize novel classes until they converge, and we show that the technique of recurrent back-propagation can back-propagate through the optimization process and facilitate the learning of the attractor network regularizer. We demonstrate that the learned attractor network can recognize novel classes while remembering old classes without the need to review the original training set, outperforming baselines that do not rely on an iterative optimization process.

##### Few-Shot Learning by Exploiting Object Relation

tl;dr Few-shot learning by exploiting the object-level relation to learn the image-level relation (similarity)

Few-shot learning trains image classifiers over datasets with few examples per category. It poses challenges for the optimization algorithms, which typically require many examples to fine-tune the model parameters for new categories. Distance-learning-based approaches avoid the optimization issue by embedding the images into a metric space and applying the nearest neighbor classifier for new categories. In this paper, we propose to exploit the object-level relation to learn the image relation feature, which is converted into a distance directly. For a new category, even though its images are not seen by the model, some objects may appear in the training images. Hence, object-level relation is useful for inferring the relation of images from unseen categories. Consequently, our model generalizes well for new categories without fine-tuning. Experimental results on benchmark datasets show that our approach outperforms state-of-the-art methods.

##### FEATURE PRIORITIZATION AND REGULARIZATION IMPROVE STANDARD ACCURACY AND ADVERSARIAL ROBUSTNESS

tl;dr We propose a model that employs feature prioritization and regularization to improve the adversarial robustness and the standard accuracy.

Adversarial training has been successfully applied to build robust models at a certain cost. While the robustness of a model increases, the standard classification accuracy declines. This phenomenon is suggested to be an inherent trade-off between standard accuracy and robustness. We propose a model that employs feature prioritization by a nonlinear attention module and L2 regularization as implicit denoising to improve the adversarial robustness and the standard accuracy relative to adversarial training. Focusing sharply on the regions of interest, the attention maps encourage the model to rely heavily on features extracted from the most relevant areas while suppressing the unrelated background. Penalized by a regularizer, the model extracts similar features for the natural and adversarial images, effectively ignoring the added perturbation. In addition to qualitative evaluation, we also propose a novel experimental strategy that quantitatively demonstrates that our model is almost ideally aligned with salient data characteristics. Additional experimental results illustrate the power of our model relative to the state of the art methods

##### CONTROLLING COVARIATE SHIFT USING EQUILIBRIUM NORMALIZATION OF WEIGHTS

tl;dr An alternative normalization technique to batch normalization

We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs. The proposed technique keeps the contribution of positive and negative weights to the layer output in equilibrium. We validate our method on a set of standard benchmarks including CIFAR-10/100, SVHN and ILSVRC 2012 ImageNet.

##### Compositional GAN: Learning Conditional Image Composition

tl;dr We develop a novel approach to model object compositionality in images in a GAN framework.

Generative Adversarial Networks (GANs) can produce images of surprising complexity and realism, but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene. Capturing such complex interactions between different objects in the world, including their relative scaling, spatial layout, occlusion, or viewpoint transformation is a challenging problem. In this work, we propose to model object composition in a GAN framework as a self-consistent composition-decomposition network. Our model is conditioned on the object images from their marginal distributions and can generate a realistic image from their joint distribution. We evaluate our model through qualitative experiments and user evaluations in scenarios when either paired or unpaired examples for the individual object images and the joint scenes are given during training. Our results reveal that the learned model captures potential interactions between the two object domains given as input to output new instances of composed scene at test time in a reasonable fashion.

##### Differentiable Learning-to-Normalize via Switchable Normalization

No tl;dr =[

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks (see Fig.1). Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, and Kinetics. Analyses of SN are also presented. We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN will be released.

##### Hierarchical Generative Modeling for Controllable Speech Synthesis

tl;dr Building a TTS model with Gaussian Mixture VAEs enables fine-grained control of speaking style, noise condition, and more.

This paper proposes a neural end-to-end text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.

##### Learning Factorized Multimodal Representations

tl;dr We propose a model to learn factorized multimodal representations that are discriminative, generative, and interpretable.

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

##### Show, Attend and Translate: Unsupervised Image Translation with Self-Regularization and Attention

tl;dr We propose a simple generative model for unsupervised image translation and saliency detection.

Image translation between two domains is a class of problems aiming to learn mapping from an input image in the source domain to an output image in the target domain. It has been applied to numerous applications, such as data augmentation, domain adaptation, and unsupervised training. When paired training data is not accessible, image translation becomes an ill-posed problem. We constrain the problem with the assumption that the translated image needs to be perceptually similar to the original image and also appears to be drawn from the new domain, and propose a simple yet effective image translation model consisting of a single generator trained with a self-regularization term and an adversarial term. We further notice that existing image translation techniques are agnostic to the subjects of interest and often introduce unwanted changes or artifacts to the input. Thus we propose to add an attention module to predict an attention map to guide the image translation process. The module learns to attend to key parts of the image while keeping everything else unaltered, essentially avoiding undesired artifacts or changes. The predicted attention map also opens door to applications such as unsupervised segmentation and saliency detection. Extensive experiments and evaluations show that our model while being simpler, achieves significantly better performance than existing image translation methods.

##### Interpreting Adversarial Robustness: A View from Decision Surface in Input Space

No tl;dr =[

One popular hypothesis of neural network generalization is that the flat local minima of loss surface in parameter space leads to good generalization. However, we demonstrate that loss surface in parameter space has no obvious relationship with generalization, especially under adversarial settings. Through visualizing decision surfaces in both parameter space and input space, we instead show that the geometry property of decision surface in input space correlates well with the adversarial robustness. We then propose an adversarial robustness indicator, which can evaluate a neural network's intrinsic robustness property without testing its accuracy under adversarial attacks. Guided by it, we further propose our robust training method. Without involving adversarial training, our method could enhance network's intrinsic adversarial robustness against various adversarial attacks.

##### Context Dependent Modulation of Activation Function

tl;dr We propose a modification to traditional Artificial Neural Networks motivated by the biology of neurons to enable the shape of the activation function to be context dependent.

We propose a modification to traditional Artificial Neural Networks (ANNs), which provides the ANNs with new aptitudes motivated by biological neurons. Biological neurons work far beyond linearly summing up synaptic inputs and then transforming the integrated information. A biological neuron change firing modes accordingly to peripheral factors (e.g., neuromodulators) as well as intrinsic ones. Our modification connects a new type of ANN nodes, which mimic the function of biological neuromodulators and are termed modulators, to enable other traditional ANN nodes to adjust their activation sensitivities in run-time based on their input patterns. In this manner, we enable the slope of the activation function to be context dependent. This modification produces statistically significant improvements in comparison with traditional ANN nodes in the context of Convolutional Neural Networks and Long Short-Term Memory networks.

##### Neuron Hierarchical Networks

tl;dr By breaking the layer hierarchy, we propose a 3-step approach to the construction of neuron-hierarchy networks that outperform NAS, SMASH and hierarchical representation with fewer parameters and shorter searching time.

In this paper, we propose a neural network framework called neuron hierarchical network (NHN), that evolves beyond the hierarchy in layers, and concentrates on the hierarchy of neurons. We observe mass redundancy in the weights of both handcrafted and randomly searched architectures. Inspired by the development of human brains, we prune low-sensitivity neurons in the model and add new neurons to the graph, and the relation between individual neurons are emphasized and the existence of layers weakened. We propose a process to discover the best base model by random architecture search, and discover the best locations and connections of the added neurons by evolutionary search. Experiment results show that the NHN achieves higher test accuracy on Cifar-10 than state-of-the-art handcrafted and randomly searched architectures, while requiring much fewer parameters and less searching time.

tl;dr We proposed "Difference-Seeking Generative Adversarial Network" (DSGAN) model to learn the target distribution which is hard to collect training data.

We propose a novel algorithm, Difference-Seeking Generative Adversarial Network (DSGAN), developed from traditional GAN. DSGAN considers the scenario that the training samples of target distribution, $p_{t}$, are difficult to collect. Suppose there are two distributions $p_{\bar{d}}$ and $p_{d}$ such that the density of the target distribution can be the differences between the densities of $p_{\bar{d}}$ and $p_{d}$. We show how to learn the target distribution $p_{t}$ only via samples from $p_{d}$ and $p_{\bar{d}}$ (relatively easy to obtain). DSGAN has the flexibility to produce samples from various target distributions (e.g. the out-of-distribution). Two key applications, semi-supervised learning and adversarial training, are taken as examples to validate the effectiveness of DSGAN. We also provide theoretical analyses about the convergence of DSGAN.

##### In Your Pace: Learning the Right Example at the Right Time

tl;dr We provide a formal definition of curriculum learning for deep neural networks, empirically showing how it can improve learning performance without additional human supervision and in a problem-free manner.

Training neural networks is traditionally done by sequentially providing random mini-batches sampled uniformly from the entire dataset. In our work, we show that sampling mini-batches non-uniformly can both enhance the speed of learning and improve the final accuracy of the trained network. Specifically, we decompose the problem using the principles of curriculum learning: first, we sort the data by some difficulty measure; second, we sample mini-batches with a gradually increasing level of difficulty. We focus on CNNs trained on image recognition. Initially, we define the difficulty of a training image using transfer learning from some competitive "teacher" network trained on the Imagenet database, showing improvement in learning speed and final performance for both small and competitive networks, using the CIFAR-10 and the CIFAR-100 datasets. We then suggest a bootstrap alternative to evaluate the difficulty of points using the same network without relying on a "teacher" network, thus increasing the applicability of our suggested method. We compare this approach to a related version of Self-Paced Learning, showing that our method benefits learning while SPL impairs it.

##### Single Shot Neural Architecture Search Via Direct Sparse Optimization

tl;dr single shot neural architecture search via direct sparse optimization

Recently Neural Architecture Search (NAS) has aroused great interest in both academia and industry, however it remains challenging because of its huge and non-continuous search space. Instead of applying evolutionary algorithm or reinforcement learning as previous works, this paper proposes a Direct Sparse Optimization NAS (DSO-NAS) method. In DSO-NAS, we provide a novel model pruning view to NAS problem. In specific, we start from a completely connected block, and then introduce scaling factors to scale the information flow between operations. Next, we impose sparse regularizations to prune useless connections in the architecture. Lastly, we derive an efficient and theoretically sound optimization method to solve it. Our method enjoys both advantages of differentiability and efficiency, therefore can be directly applied to large datasets like ImageNet. Particularly, On CIFAR-10 dataset, DSO-NAS achieves an average test error 2.84%, while on the ImageNet dataset DSO-NAS achieves 25.4% test error under 600M FLOPs with 8 GPUs in 18 hours.

In this paper, we are interested in two seemingly different concepts: \textit{adversarial training} and \textit{generative adversarial networks (GANs)}. Particularly, how these techniques work to improve each other. To this end, we analyze the limitation of adversarial training as a defense method, starting from questioning how well the robustness of a model can generalize. Then, we successfully improve the generalizability via data augmentation by the fake'' images sampled from generative adversarial network. After that, we are surprised to see that the resulting robust classifier leads to a better generator, for free. We intuitively explain this interesting phenomenon and leave the theoretical analysis for future work. Motivated by these observations, we propose a system that combines generator, discriminator, and adversarial attacker together in a single network. After end-to-end training and fine tuning, our method can simultaneously improve the robustness of classifiers, measured by accuracy under strong adversarial attacks, and the quality of generators, evaluated both aesthetically and quantitatively. In terms of the classifier, we achieve better robustness than the state-of-the-art adversarial training algorithm proposed in (Madry \textit{et al.}, 2017), while our generator achieves competitive performance compared with SN-GAN (Miyato and Koyama, 2018).