Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: channel pruning, wake-sleep, topic model, model-based meta-learning, permutation phase defense ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

"Random selection" has 100 results

DADAM: A consensus-based distributed adaptive gradient method for online optimization    

No tl;dr =[

Online and stochastic optimization methods such as SGD, ADAGRAD and ADAM are key algorithms in solving large-scale machine learning problems including deep learning. A number of schemes that are based on communications of nodes with a central server have been recently proposed in the literature to parallelize them. A bottleneck of such centralized algorithms lies on the high communication cost incurred by the central node. In this paper, we present a new consensus-based distributed adaptive moment estimation method (DADAM) for online optimization over a decentralized network that enables data parallelization, as well as decentralized computation. Such a framework note only can be extremely useful for learning agents with access to only local data in a communication constrained environment, but as shown in this work also outperform centralized adaptive algorithms such as ADAM for certain realistic classes of loss functions. We analyze the convergence properties of the proposed algorithm and provide a \textit{dynamic regret} bound on the convergence rate of adaptive moment estimation methods in both stochastic and deterministic settings. Empirical results demonstrate that DADAM works well in practice and compares favorably to competing online optimization methods.

Probabilistic Planning with Sequential Monte Carlo    

tl;dr Leveraging control as inference and Sequential Monte Carlo methods, we proposed a probabilistic planning algorithm.

Planning is a fundamental ability of intelligent agents, which allows them to reason about their future behavior and efficiently learn new tasks. Despite its importance, planning remains an open problem in complex environments. Current challenges include model compounding errors in roll-outs and exponential search space over trajectories. In this work, we propose a novel formulation of planning, which views it as a probabilistic inference problem over future optimal trajectories. This enabled us to use sampling methods, and thus, tackle planning in continuous and high-dimensional spaces using a fixed computational budget. We designed a new algorithm, Sequential Monte Carlo Planning (SMCP), by leveraging modern methods in Sequential Monte Carlo (SMC), Bayesian smoothing, and control as inference. Following, we draw parallels between our algorithm and popular planning methods such as Monte Carlo Tree Search (MCTS) (Kearns et al., 2002) and Random Shooting methods (Rubinstein & Kroese, 2004). Furthermore, we present experimental results competitive to state-of-the-art methods on standard continuous control tasks. To conclude, we provide direction for future research in hopes of encouraging new lines of work in this domain.

Universal discriminative quantum neural networks    

No tl;dr =[

Quantum mechanics fundamentally forbids deterministic discrimination of quantum states and 9b8d to discriminate among various pure and mixed quantum states exhibiting a trade-off between minimizing erroneous and inconclusive outcomes with comparable performance to theoretically optimal POVMs. We train the circuit on different classes of quantum data and evaluate the generalization error on unseen mixed quantum states. This generalization power hence distinguishes our work from standard circuit optimization and provides an example of quantum machine learning for a task that has inherently no classical analog.

Feature Intertwiners    

tl;dr A feature intertwiner module to leverage features from one accurate set to help the learning of another less reliable set.

A well-trained model should classify objects with unanimous score for every category. This requires the high-level semantic features should be alike among samples, despite a wide span in resolution, texture, deformation, etc. Previous works focus on re-designing the loss function or proposing new regularization constraints on the loss. In this paper, we address this problem via a new perspective. For each category, it is assumed that there are two sets in the feature space: one with more reliable information and the other with less reliable source. We argue that the reliable set could guide the feature learning of the less reliable set during training - in spirit of student mimicking teacher’s behavior and thus pushing towards a more compact class centroid in the high-dimensional space. Such a scheme also benefits the reliable set since samples become more closer within the same category - implying that it is easilier for the classifier to identify. We refer to this mutual learning process as feature intertwiner and embed the spirit into object detection. It is well-known that objects of low resolution are more difficult to detect due to the loss of detailed information during network forward pass. We thus regard objects of high resolution as the reliable set and objects of low resolution as the less reliable set. Specifically, an intertwiner is achieved by minimizing the distribution divergence between two sets. We design a historical buffer to represent all previous samples in the reliable set and utilize them to guide the feature learning of the less reliable set. The design of obtaining an effective feature representation for the reliable set is further investigated, where we introduce the optimal transport (OT) algorithm into the framework. Samples in the less reliable set are better aligned with the reliable set with aid of OT metric. Incorporated with such a plug-and-play intertwiner, we achieve an evident improvement over previous state-of-the-arts on the COCO object detection benchmark.

Domain Adaptation for Structured Output via Disentangled Patch Representations    

tl;dr A domain adaptation method for structured output via learning patch-level discriminative feature representations

Predicting structured outputs such as semantic segmentation relies on expensive per-pixel annotations to learn strong supervised models like convolutional neural networks. However, these models trained on one data domain may not generalize well to other domains unequipped with annotations for model finetuning. To avoid the labor-intensive process of annotation, we develop a domain adaptation method to adapt the source data to the unlabeled target domain. To this end, we propose to learn discriminative feature representations of patches based on label histograms in the source domain, through the construction of a disentangled space. With such representations as guidance, we then use an adversarial learning scheme to push the feature representations in target patches to the closer distributions in source ones. In addition, we show that our framework can integrate a global alignment process with the proposed patch-level alignment and achieve state-of-the-art performance on semantic segmentation. Extensive ablation studies and experiments are conducted on numerous benchmark datasets with various settings, such as synthetic-to-real and cross-city scenarios.

A Collaborative Low-Rank Reconstructive Layer for Domain Shift Problems    

No tl;dr =[

In many situations, the data one has access to at test time follows a different distribution from the training data. Over the years, this problem has been tackled by domain adaptation and domain generalization techniques. In this context, the most popular approach consists of learning features whose distributions are the same in all observed domains. However, with high-dimensional features, such as those of deep networks, we argue that matching distributions is an under-constrained problem; a network can learn to add noise in the representation to artificially achieve this goal. To address this, we propose to learn a collaborative low-rank representation of the data encoding information from all observed domains and well-suited for classification. We formulate this as a general layer that can either be used on its own within any standard base network, or incorporated into existing domain adaptation and generalization frameworks. We empirically show that it consistently boosts the performance of the baseline models that do not exploit our new layer, thus allowing us to achieve state-of-the-art results on several visual domain adaptation and generalization benchmark datasets.

Unicorn: Continual learning with a universal, off-policy agent    

tl;dr Agents learning jointly and off-policy about many tasks make progress on challenging continual learning domains.

Some real-world domains are best characterized as a single task, but for others this perspective is limiting. Instead, some tasks continually grow in complexity, in tandem with the agent's competence. In continual learning there are no explicit task boundaries or curricula. As learning agents have become more powerful, continual learning remains one of the frontiers that has resisted quick progress. To test continual learning capabilities we consider a challenging 3D domain with an implicit sequence of tasks and sparse rewards. We propose a novel agent architecture called Unicorn, which demonstrates strong continual learning and outperforms several baseline agents on the proposed domain. The agent achieves this by jointly representing and efficiently learning multiple policies for multiple goals, using a parallel off-policy learning setup.

Learning deep representations by mutual information estimation and maximization    

tl;dr We learn deep representation by maximizing mutual information, leveraging structure in the objective, and are able to compute with fully supervised classifiers with comparable architectures

In this work, we perform unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality of the input to the objective can greatly influence a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and competes with fully-supervised learning on several classification tasks. DIM opens new avenues for unsupervised learning of representations and is an important step towards the flexible formulation of representation-learning objectives for specific end-goals.


tl;dr This is a work aiming for boosting all the existing pruning and mimic method.

In this paper, we propose an efficient framework to accelerate convolutional neural networks. We utilize two types of acceleration methods: pruning and hints. Pruning can reduce model size by removing channels of layers. Hints can improve the performance of student model by transferring knowledge from teacher model. We demonstrate that pruning and hints are complementary to each other. On one hand, hints can benefit pruning by maintaining similar feature representations. On the other hand, the model pruned from teacher networks is a good initialization for student model, which increases the transferability between two networks. Our approach performs pruning stage and hints stage iteratively to further improve the performance. Furthermore, we propose an algorithm to reconstruct the parameters of hints layer and make the pruned model more suitable for hints. Experiments were conducted on various tasks including classification and pose estimation. Results on CIFAR-10, ImageNet and COCO demonstrate the generalization and superiority of our framework.

Active Learning with Partial Feedback    

No tl;dr =[

While many active learning papers assume that the learner can simply ask for a label and receive it, real annotation often presents a mismatch between the form of a label (say, one among many classes), and the form of an annotation (typically yes/no binary feedback). To annotate examples corpora for multiclass classification, we might need to ask multiple yes/no questions, exploiting a label hierarchy if one is available. To address this more realistic setting, we propose active learning with partial feedback (ALPF), where the learner must actively choose both which example to label and which binary question to ask. At each step, the learner selects an example, asking if it belongs to a chosen (possibly composite) class. Each answer eliminates some classes, leaving the learner with a partial label. The learner may then either ask more questions about the same example (until an exact label is uncovered) or move on immediately, leaving the first example partially labeled. Active learning with partial labels requires (i) a sampling strategy to choose (example, class) pairs, and (ii) learning from partial labels between rounds. Experiments on Tiny ImageNet demonstrate that our most effective method improves 26% (relative) in top-1 classification accuracy compared to i.i.d. baselines and standard active learners given 30% of the annotation budget that would be required (naively) to annotate the dataset. Moreover, ALPF-learners fully annotate TinyImageNet at 42% lower cost. Surprisingly, we observe that accounting for per-example annotation costs can alter the conventional wisdom that active learners should solicit labels for hard examples.

The Unusual Effectiveness of Averaging in GAN Training    

No tl;dr =[

We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the first to our knowledge theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one. We establish experimentally that both techniques are strikingly effective in the non convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets -- mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet -- to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images.


No tl;dr =[

This paper revisits the Random Walk model for sentence embedding in the context of non-extensive statistics. We propose a non-extensive algebra to compute the discourse vector. We argue that by doing so we are taking into account high non-linearity in the semantic space. Furthermore, we show that by considering a non-extensive algebra, the compounding effect of the vector length is mitigated. Overall, we show that the proposed model leads to good sentence embedding. We evaluate the embedding method on textual similarity tasks.

SSoC: Learning Spontaneous and Self-Organizing Communication for Multi-Agent Collaboration    

tl;dr This paper proposes a spontaneous and self-organizing communication (SSoC) learning scheme for multi-agent RL tasks.

Multi-agent collaboration is required by numerous real-world problems. Although distributed setting is usually adopted by practical systems, local range communication and information aggregation still matter in fulfilling complex tasks. For multi-agent reinforcement learning, many previous studies have been dedicated to design an effective communication architecture. However, existing models usually suffer from an ossified communication structure, e.g., most of them predefine a particular communication mode by specifying a fixed time frequency and spatial scope for agents to communicate regardless of necessity. Such design is incapable of dealing with multi-agent scenarios that are capricious and complicated, especially when only partial information is available. Motivated by this, we argue that the solution is to build a spontaneous and self-organizing communication (SSoC) learning scheme. By treating the communication behaviour as an explicit action, SSoC learns to organize communication in an effective and efficient way. Particularly, it enables each agent to spontaneously decide when and who to send messages based on its observed states. In this way, a dynamic inter-agent communication channel is established in an online and self-organizing manner. The agents also learn how to adaptively aggregate the received messages and its own hidden states to execute actions. Various experiments have been conducted to demonstrate that SSoC really learns intelligent message passing among agents located far apart. With such agile communications, we observe that effective collaboration tactics emerge which have not been mastered by the compared baselines.

Generative replay with feedback connections as a general strategy for continual learning    

tl;dr A structured comparison of recent methods for continual learning that turns into an argument for and extension of generative replay.

Standard artificial neural networks suffer from the well-known issue of catastrophic forgetting, making continual or lifelong learning problematic. Recently, numerous methods have been proposed for continual learning, but due to differences in evaluation protocols it is difficult to directly compare their performance. To enable more meaningful comparisons, we identified three distinct continual learning scenarios based on whether task identity is known and, if it is not, whether it needs to be inferred. Performing the split and permuted MNIST task protocols according to each of these scenarios, we found that regularization-based approaches (e.g., elastic weight consolidation) failed when task identity needed to be inferred. In contrast, generative replay combined with distillation (i.e., using class probabilities as ''soft targets'') achieved superior performance in all three scenarios. In addition, we reduced the computational cost of generative replay by integrating the generative model into the main model by equipping it with generative feedback connections. This Replay-through-Feedback approach substantially shortened training time with no or negligible loss in performance. We believe this to be an important first step towards making the powerful technique of generative replay scalable to real-world continual learning applications.

S-System, Geometry, Learning, and Optimization: A Theory of Neural Networks    

tl;dr We present a formal measure-theoretical theory of neural networks (NN) that quantitatively shows NNs renormalize on semantic difference, and under practical conditions large size deep nonlinear NNs can optimize objective functions to zero losses.

We present a formal measure-theoretical theory of neural networks (NN) built on {\it probability coupling theory}. Particularly, we present an algorithm framework, Hierarchical Measure Group and Approximate System (HMGAS), nicknamed S-System, of which NNs are special cases. In addition to many other results, the framework enables us to prove that 1) NNs implement {\it renormalization group (RG)} using information geometry, which points out that the large scale property to renormalize is dual Bregman divergence and completes the analog between NNs and RG; 2) and under a set of {\it realistic} boundedness and diversity conditions, for {\it large size nonlinear deep} NNs with a class of losses, including the hinge loss, all local minima are global minima with zero loss errors, using random matrix theory.

Robustness Certification with Refinement    

tl;dr We refine the over-approximation results from incomplete verifiers using MILP solvers to prove more robustness properties than state-of-the-art.

We present a novel approach for verification of neural networks which combines scalable over-approximation methods with precise (mixed integer) linear programming. This results in significantly better precision than state of the art verifiers on feed forward neural networks with piecewise linear activation functions.

Information asymmetry in KL-regularized RL    

tl;dr Limiting state information for the default policy can improvement performance, in a KL-regularized RL framework where both agent and default policy are optimized together

Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning. Please watch the video demonstrating learned experts and default policies on several continuous control tasks ( ).

Auxiliary Variational MCMC    

No tl;dr =[

We introduce Auxiliary Variational MCMC, a novel framework for learning MCMC kernels that combines recent advances in variational inference with insights drawn from traditional auxiliary variable MCMC methods such as Hamiltonian Monte Carlo. Our framework exploits low dimensional structure in the target distribution in order to learn a more efficient MCMC sampler. The resulting sampler is able to suppress random walk behaviour and mix between modes efficiently, without the need to compute gradients of the target distribution. We test our sampler on a number of challenging distributions, where the underlying structure is known, and on the task of posterior sampling in Bayesian logistic regression. Code to reproduce all experiments is available at

Capsules Graph Neural Network    

tl;dr Inspired by CapsNet, we propose a novel architecture for graph embeddings on the basis of node features extracted from GNN.

The high-quality node embeddings learned by GNN have been applied to a wide range of node-based graph structured applications and some of them have achieved state-of-the-art (SOTA) performance. However, when applying node embeddings learned from GNN to generate graph embeddings, the scalar node representation typically used in GNN may not suffice to preserve the node/graph relationships, resulting in sub-optimal graph embeddings. Inspired by the Capsule Neural Network (CapsNet), we propose the Capsules Graph Neural Network (CapsGNN), which adopts the concept of capsules to address the weakness in existing GNN-based graph embeddings algorithms. By extracting node features in the form of capsules, routing mechanism can be utilized to capture important statistic information at the graph level. As a result, our model generates multiple embeddings for each graph to capture significant graph properties from different aspects. The attention module incorporated in CapsGNN is used to tackle graphs with various sizes which also enables the model to focus on critical parts of the graphs. Our extensive evaluations with 9 graph-structured datasets demonstrate that CapsGNN has a high potential for large graph data analysis and powerful capability in capturing macroscopic properties of the whole graph. It outperforms other SOTA techniques on several graph classification tasks.

Learning powerful policies and better generative models by interaction    

tl;dr In this paper, we formulate a way to ensure consistency between the predictions of dynamics model and the real observations from the environment. Thus allowing us to learn powerful policies, as well as better dynamics models.

Model-based reinforcement learning approaches have the promise of being sample efficient. Much of the progress in learning dynamics models in RL has been made by learning models via supervised learning. There is enough evidence that humans build a model of the environment, not only by observing the environment but also by interacting with the environment. Interaction with the environment allows humans to carry out \textit{experiments}: taking actions that help uncover true casual relationships which in turn can be used for building better dynamics models. Analogously, we would expect such interaction to be helpful for a learning agent while it learns to model the dynamics of the environment. In this paper, we build upon this intuition, by using an auxiliary cost function to ensure consistency between what the agent observes (by actually performing actions in the real world) and what it hallucinates (by imagining to have taken actions in the environment). Our empirical analysis shows that the proposed approach helps to train powerful policies as well as better dynamics models.

PolyCNN: Learning Seed Convolutional Filters    

tl;dr PolyCNN only needs to learn one seed convolutional filter at each layer. This is an efficient variant of traditional CNN, with on-par performance.

In this work, we propose the polynomial convolutional neural network (PolyCNN), as a new design of a weight-learning efficient variant of the traditional CNN. The biggest advantage of the PolyCNN is that at each convolutional layer, only one convolutional filter is needed for learning the weights, which we call the seed filter, and all the other convolutional filters are the polynomial transformations of the seed filter, which is termed as an early fan-out. Alternatively, we can also perform late fan-out on the seed filter response to create the number of response maps needed to be input into the next layer. Both early and late fan-out allow the PolyCNN to learn only one convolutional filter at each layer, which can dramatically reduce the model complexity by saving 10x to 50x parameters during learning. While being efficient during both training and testing, the PolyCNN does not suffer performance due to the non-linear polynomial expansion which translates to richer representational power within the convolutional layers. By allowing direct control over model complexity, PolyCNN provides a flexible trade-off between performance and efficiency. We have verified the on-par performance between the proposed PolyCNN and the standard CNN on several visual datasets, such as MNIST, CIFAR-10, SVHN, and ImageNet.

Improving Sample-based Evaluation for Generative Adversarial Networks    

tl;dr This paper improves existing sample-based evaluation for GANs and contains some insightful experiments.

In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks which transfer the representation of ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To our best knowledge, we are the first to provide counter examples where FID gives inconsistent results with human judgments. It is shown in the experiments that our framework is able to overcome the shortness of FID and improves robustness. Code will be made available.

Efficiently testing local optimality and escaping saddles for ReLU networks    

tl;dr A theoretical algorithm for testing local optimality and extracting descent directions at nondifferentiable points of empirical risks of one-hidden-layer ReLU networks.

We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.


tl;dr We propose an easy to implement, yet effective method for neural network compression. PFA exploits the intrinsic correlation between filter responses within network layers to recommend a smaller network footprints.

Principal Filter Analysis (PFA) is an easy to implement, yet effective method for neural network compression. PFA exploits the intrinsic correlation between filter responses within network layers to recommend a smaller network footprint. We propose two compression algorithms: the first allows a user to specify the proportion of the original spectral energy that should be preserved in each layer after compression, while the second is a heuristic that leads to a parameter-free approach that automatically selects the compression used at each layer. Both algorithms are evaluated against several architectures and datasets, and we show considerable compression rates without compromising accuracy, e.g., for VGG-16 on CIFAR-10 and CIFAR-100 PFA achieves a compression rate of 8x and 3x with an accuracy gain of 0.4% points and 1.4% points, respectively. In our tests we also demonstrate that networks compressed with PFA achieve an accuracy that is very close to the empirical upper bound for a given compression ratio. Finally, we show how PFA is an effective tool for simultaneous compression and domain adaptation.

Towards Robust, Locally Linear Deep Networks    

No tl;dr =[

Deep networks realize complex mappings that are often understood by their locally linear behavior around or at points of interest. For example, we use the derivative of the mapping with respect to its inputs for sensitivity analysis, or to explain (obtain coordinate relevance for) a prediction. One key challenge is that such derivates are themselves inherently unstable. In this paper, we propose a new learning problem to encourage deep networks to have stable derivatives over larger regions. While the problem is challenging in general, we focus on networks with piecewise linear activation functions. Our algorithm consists of an inference step that identifies a region around a point where linear approximation is provably stable, and an optimization step to expand such regions. We propose a novel relaxation to scale the algorithm to realistic models. We demonstrate algorithm and the resulting solutions with residual and recurrent networks on image and sequence datasets.

Learning agents with prioritization and parameter noise in continuous state and action space    

tl;dr Improving the performance of an RL agent in the continuous action and state space domain by using prioritised experience replay and parameter noise.

Reinforcement Learning (RL) problem can be solved in two different ways - the Value function-based approach and the policy optimization-based approach - to eventually arrive at an optimal policy for the given environment. One of the recent breakthroughs in reinforcement learning is the use of deep neural networks as function approximators to approximate the value function or q-function in a reinforcement learning scheme. This has led to results with agents automatically learning how to play games like alpha-go showing better-than-human performance. Deep Q-learning networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are two such methods that have shown state-of-the-art results in recent times. Among the many variants of RL, an important class of problems is where the state and action spaces are continuous --- autonomous robots, autonomous vehicles, optimal control are all examples of such problems that can lend themselves naturally to reinforcement based algorithms, and have continuous state and action spaces. In this paper, we adapt and combine approaches such as DQN and DDPG in novel ways to outperform the earlier results for continuous state and action space problems. We believe these results are a valuable addition to the fast-growing body of results on Reinforcement Learning, more so for continuous state and action space problems.

Déjà Vu: An Empirical Evaluation of the Memorization Properties of Convnets    

tl;dr We analyze the memorization properties by a convnet of the training set and propose several use-cases where we can extract some information about the training set.

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting. This paper considers the related question of ``membership inference'', where the goal is to determine if an image was used during training. We consider it under three complementary angles. We first analyze explicit memorization and extend classical random label experiments to the problem of learning a model that predicts if an image belongs to an arbitrary set. We then show how to detect if a dataset was used to train a model, and in particular whether some validation images were used at train time. Finally, we propose a new approach to infer membership when a few of the top layers are not available or have been fine-tuned, and show that lower layers still carry information about the training samples. To support our findings, we conduct large-scale experiments on Imagenet and subsets of YFCC-100M with modern architectures such as VGG and Resnet.

Holographic and other Point Set Distances for Machine Learning    

tl;dr Permutation-invariant loss function for point set prediction.

We introduce an analytic distance function for moderately sized point sets of known cardinality that is shown to have very desirable properties, both as a loss function as well as a regularizer for machine learning applications. We compare our novel construction to other point set distance functions and show proof of concept experiments for training neural networks end-to-end on point set prediction tasks such as object detection.

Principled Deep Neural Network Training through Linear Programming    

tl;dr Using linear programming we show that the computational complexity of approximate Deep Neural Network training depends polynomially on the data size for several architectures

Deep Learning has received significant attention due to its impressive performance in many state-of-the-art learning tasks. Unfortunately, while very powerful, Deep Learning is not well understood theoretically and in particular only recently results for the complexity of training deep neural networks have been obtained. In this work we show that large classes of deep neural networks with various architectures (e.g., DNNs, CNNs, Binary Neural Networks, and ResNets), activation functions (e.g., ReLUs and leaky ReLUs), and loss functions (e.g., Hinge loss, Euclidean loss, etc) can be trained to near optimality with desired target accuracy using linear programming in time that is exponential in the size of the architecture and polynomial in the size of the data set; this is the best one can hope for due to the NP-Hardness of the problem and in line with previous work. In particular, we obtain polynomial time algorithms for training for a given fixed network architecture. Our work applies more broadly to empirical risk minimization problems which allows us to generalize various previous results and obtain new complexity results for previously unstudied architectures in the proper learning setting.

Constraining Action Sequences with Formal Languages for Deep Reinforcement Learning    

tl;dr We constrain an agent's actions during reinforcement learning, for safety or to enhance exploration.

We study the problem of deep reinforcement learning where the agent's action sequences are constrained, e.g., prohibition of dithering or overactuating action sequences that might damage a robot, drone, or other physical device. Our model focuses on constraints that can be described by automata such as DFAs or PDAs. We then propose multiple approaches to augment the state descriptions of the Markov decision process (MDP) with summaries of recent action histories. We empirically evaluate these methods applying DQN to three Atari games, training with reward shaping. We found that our approaches are effective in significantly reducing, and even eliminating, constraint violations while maintaining high reward. We also observed that the total reward achieved by an agent can be highly sensitive to how much the constraints encourage or discourage exploration of potentially effective actions during training, and, in addition to helping ensure safe policies, the use of constraints can enhance exploration during training.

MahiNet: A Neural Network for Many-Class Few-Shot Learning with Class Hierarchy    

tl;dr A memory-augmented neural network that addresses many-class few-shot problem by leveraging class hierarchy in both supervised learning and meta-learning.

We study many-class few-shot (MCFS) problem in both supervised learning and meta-learning scenarios. Compared to the well-studied many-class many-shot and few-class few-shot problems, MCFS problem commonly occurs in practical applications but is rarely studied. MCFS brings new challenges because it needs to distinguish between many classes, but only a few samples per class are available for training. In this paper, we propose ``memory-augmented hierarchical-classification network (MahiNet)'' for MCFS learning. It addresses the ``many-class'' problem by exploring the class hierarchy, e.g., the coarse-class label that covers a subset of fine classes, which helps to narrow down the candidates for the fine class and is cheaper to obtain. MahiNet uses a convolutional neural network (CNN) to extract features, and integrates a memory-augmented attention module with a multi-layer perceptron (MLP) to produce the probabilities over coarse and fine classes. While the MLP extends the linear classifier, the attention module extends a KNN classifier, both together targeting the ``few-shot'' problem. We design different training strategies of MahiNet for supervised learning and meta-learning. Moreover, we propose two novel benchmark datasets ''mcfsImageNet'' (as a subset of ImageNet) and ''mcfsOmniglot'' (re-splitted Omniglot) specifically for MCFS problem. In experiments, we show that MahiNet outperforms several state-of-the-art models on MCFS classification tasks in both supervised learning and meta-learning scenarios.

What a difference a pixel makes: An empirical examination of features used by CNNs for categorisation    

tl;dr This study highlights a key difference between human vision and CNNs: while object recognition in humans relies on analysing shape, CNNs do not have such a shape-bias.

Convolutional neural networks (CNNs) were inspired by human vision and, in some settings, achieve a performance comparable to human object recognition. This has lead to the speculation that both systems use similar mechanisms to perform recognition. In this study, we conducted a series of simulations that indicate that there is a fundamental difference between human vision and CNNs: while object recognition in humans relies on analysing shape, CNNs do not have such a shape-bias. We teased apart the type of features selected by the model by modifying the CIFAR-10 dataset so that, in addition to containing objects with shape, the images concurrently contained non-shape features, such as a noise-like mask. When trained on these modified set of images, the model did not show any bias towards selecting shapes as features. Instead it relied on whichever feature allowed it to perform the best prediction -- even when this feature was a noise-like mask or a single predictive pixel amongst 50176 pixels. We also found that regularisation methods, such as batch normalisation or Dropout, did not change this behaviour and neither did past or concurrent experience with images from other datasets.

Quasi-hyperbolic momentum and Adam for deep learning    

tl;dr Mix plain SGD and momentum (or do something similar with Adam) for great profit.

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. PyTorch code is immediately available.

Model Comparison for Semantic Grouping    

tl;dr Competitive alternative to sentence embeddings in the task of semantic similarity using model comparison

We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate this as a model comparison task in which we contrast a generative model that encodes similarity between the two groups versus one that does not. We illustrate how this framework can be used for the Semantic Text Similarity (STS) task using clear assumptions about how the embeddings of words are generated. We apply information criteria based model comparison to overcome the shortcomings of Bayesian model comparison, whilst still penalising model complexity. We achieve competitive results by applying the proposed framework with an appropriate choice of likelihood on the STS datasets.

Information-Directed Exploration for Deep Reinforcement Learning    

tl;dr We develop a practical extension of Information-Directed Sampling for Reinforcement Learning, which accounts for parametric uncertainty and heteroscedasticity in the return distribution for exploration.

Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.

DARTS: Differentiable Architecture Search    

tl;dr We propose a differentiable architecture search algorithm for both convolutional and recurrent networks, achieving competitive performance with the state of the art using orders of magnitude less computation resources.

This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.

Siamese Capsule Networks    

tl;dr A variant of capsule networks that can be used for pairwise learning tasks. Results shows that Siamese Capsule Networks work well in the few shot learning setting.

Capsule Networks have shown encouraging results on defacto benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. Although, they are yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations and (2) there are very few instances per class to learn from and (3) where point-wise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce Siamese Capsule Networks, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with l2-normalized capsule encoded pose features. We find that Siamese Capsule Networks perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects.

Human-Guided Column Networks: Augmenting Deep Learning with Advice    

tl;dr Guiding relation-aware deep models towards better learning with human knowledge.

While extremely successful in several applications, especially with low-level representations; sparse, noisy samples and structured domains (with multiple objects and interactions) are some of the open challenges in most deep models. Column Networks, a deep architecture, can succinctly capture such domain structure and interactions, but may still be prone to sub-optimal learning from sparse and noisy samples. Inspired by the success of human-advice guided learning in AI, especially in data-scarce domains, we propose Knowledge-augmented Column Networks that leverage human advice/knowledge for better learning with noisy/sparse samples. Our experiments demonstrate how our approach leads to either superior overall performance or faster convergence.


tl;dr We present a general technique toward 8-bit low precision inference of convolutional neural networks.

High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents a general technique toward 8-bit low precision inference of convolutional neural networks, including 1) channel-wise scale factors of weights, especially for depthwise convolution, 2) Winograd convolution, and 3) topology-wise 8-bit support. We experiment the techniques on top of a widely-used deep learning framework. The 8-bit optimized model is automatically generated with a calibration process from FP32 model without the need of fine-tuning or retraining. We perform a systematical and comprehensive study on 18 widely-used convolutional neural networks and demonstrate the effectiveness of 8-bit low precision inference across a wide range of applications and use cases, including image classification, object detection, image segmentation, and super resolution. We show that the inference throughput and latency are improved by 1.6X and 1.5X respectively with minimal within 0.6%1to no loss in accuracy from FP32 baseline. We believe the methodology can provide the guidance and reference design of 8-bit low precision inference for other frameworks. All the code and models will be publicly available soon.

Wizard of Wikipedia: Knowledge-Powered Conversational Agents    

tl;dr We build knowledgeable conversational agents by conditioning on Wikipedia + a new supervised task.

In open-domain dialogue intelligent agents should exhibit the use of knowledge, however there are few convincing demonstrations of this to date. The most popular sequence to sequence models typically “generate and hope” generic utterances that can be memorized in the weights of the model when mapping from input utterance(s) to output, rather than employing recalled knowledge as context. Use of knowledge has so far proved difficult, in part because of the lack of a supervised learning benchmark task which exhibits knowledgeable open dialogue with clear grounding. To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.

Feature Transformers: A Unified Representation Learning Framework for Lifelong Learning    

tl;dr Single generic mathematical framework for lifelong learning paradigms with data privacy

Despite the recent advances in representation learning, lifelong learning continues to be one of the most challenging and unconquered problems. Catastrophic forgetting and data privacy constitute two of the important challenges for a successful lifelong learner. Further, existing techniques are designed to handle only specific manifestations of lifelong learning, whereas a practical lifelong learner is expected to switch and adapt seamlessly to different scenarios. In this paper, we present a single, unified mathematical framework for handling the myriad variants of lifelong learning, while alleviating these two challenges. We utilize an external memory to store only the features representing past data and learn richer and newer representations incrementally through transformation neural networks - feature transformers. We define, simulate and demonstrate exemplary performance on a realistic lifelong experimental setting using the MNIST rotations dataset, paving the way for practical lifelong learners. To illustrate the applicability of our method in data sensitive domains like healthcare, we study the pneumothorax classification problem from X-ray images, achieving near gold standard performance. We also benchmark our approach with a number of state-of-the art methods on MNIST rotations and iCIFAR100 datasets demonstrating superior performance.

The role of over-parametrization in generalization of neural networks    

tl;dr We suggest a generalization bound that could potentially explain the improvement in generalization with over-parametrization.

Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.

Efficient Multi-Objective Neural Architecture Search via Lamarckian Evolution    

tl;dr We propose a method for efficient Multi-Objective Neural Architecture Search based on Lamarckian inheritance and evolutionary algorithms.

Architecture search aims at automatically finding neural architectures that are competitive with architectures designed by human experts. While recent approaches have achieved state-of-the-art predictive performance for image recognition, they are problematic under resource constraints for two reasons: (1) the neural architectures found are solely optimized for high predictive performance, without penalizing excessive resource consumption; (2)most architecture search methods require vast computational resources. We address the first shortcoming by proposing LEMONADE, an evolutionary algorithm for multi-objective architecture search that allows approximating the Pareto-front of architectures under multiple objectives, such as predictive performance and number of parameters, in a single run of the method. We address the second shortcoming by proposing a Lamarckian inheritance mechanism for LEMONADE which generates children networks that are warmstarted with the predictive performance of their trained parents. This is accomplished by using (approximate) network morphism operators for generating children. The combination of these two contributions allows finding models that are on par or even outperform different-sized NASNets, MobileNets, MobileNets V2 and Wide Residual Networks on CIFAR-10 and ImageNet64x64 within only one week on eight GPUs, which is about 20-40x less compute power than previous architecture search methods that yield state-of-the-art performance.

Experience replay for continual learning    

tl;dr We show that, in continual learning settings, catastrophic forgetting can be avoided by applying off-policy RL to a mixture of new and replay experience, with a behavioral cloning loss.

Continual learning is the problem of learning new tasks or knowledge while protecting old knowledge and ideally generalizing from old experience to learn new tasks faster. Neural networks trained by stochastic gradient descent often degrade on old tasks when trained successively on new tasks with different data distributions. This phenomenon, referred to as catastrophic forgetting, is considered a major hurdle to learning with non-stationary data or sequences of new tasks, and prevents networks from continually accumulating knowledge and skills. We examine this issue in the context of reinforcement learning, in a setting where an agent is exposed to tasks in a sequence. Unlike most other work, we do not provide an explicit indication to the model of task boundaries, which is the most general circumstance for a learning agent exposed to continuous experience. While various methods to counteract catastrophic forgetting have recently been proposed, we explore a straightforward, general, and seemingly overlooked solution - that of using experience replay buffers for all past events - with a mixture of on- and off-policy learning, leveraging behavioral cloning. We show that this strategy can still learn new tasks quickly yet can substantially reduce catastrophic forgetting in both Atari and DMLab domains, even matching the performance of methods that require task identities. When buffer storage is constrained, we confirm that a simple mechanism for randomly discarding data allows a limited size buffer to perform almost as well as an unbounded one.

Differentiable Greedy Networks    

tl;dr We propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization.

Optimal selection of a subset of items from a given set is a hard problem that requires combinatorial optimization. In this paper, we propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization. We focus on the task of identifying a relevant set of sentences for claim verification in the context of the FEVER task. Conventional methods for this task look at sentences on their individual merit and thus do not optimize the informativeness of sentences as a set. We show that our proposed method which builds on the idea of unfolding a greedy algorithm into a computational graph allows both interpretability and gradient based training. The proposed differentiable greedy network (DGN) outperforms discrete optimization algorithms as well as other baseline methods in terms of precision and recall.

The meaning of "most" for visual question answering models    

tl;dr Psychology-inspired evaluation of quantifier understanding for visual question answering models

The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system.

Learning Global Additive Explanations for Neural Nets Using Model Distillation    

tl;dr We propose to leverage model distillation to learn global additive explanations in the form of feature shapes (that are more expressive than feature attributions) for models such as neural nets trained on tabular data.

Interpretability has largely focused on local explanations, i.e. explaining why a model made a particular prediction for a sample. These explanations are appealing due to their simplicity and local fidelity. However, they do not provide information about the general behavior of the model. We propose to leverage model distillation to learn global additive explanations that describe the relationship between input features and model predictions. These global explanations take the form of feature shapes, which are more expressive than feature attributions. Through careful experimentation, we show qualitatively and quantitatively that global additive explanations are able to describe model behavior and yield insights about models such as neural nets. A visualization of our approach applied to a neural net as it is trained is available at

Approximation and non-parametric estimation of ResNet-type convolutional neural networks via block-sparse fully-connected neural networks    

tl;dr It is shown that ResNet-type CNNs are a universal approximator and its expression ability is not worse than fully connected neural networks (FNNs) with a \textit{block-sparse} structure even if the size of each layer in the CNN is fixed.

We develop new approximation and statistical learning theories of convolutional neural networks (CNNs) via the ResNet-type structure where the channel size, width, and filter size are fixed. It is shown that a ResNet-type CNN is a universal approximator and its expression ability is no worse than fully connected neural networks (FNNs) with a \textit{block-sparse} structure even if the size of each layer in the CNN is fixed. Our result is general in the sense that we can automatically translate any approximation rate achieved by block-sparse FNNs into that by CNNs. Thanks to the general theory, it is shown that learning on CNNs satisfies optimality in approximation and estimation of several important function classes. As applications, we consider two types of function classes to be estimated: the Barron class and the H\"older class. We prove the regularized empirical risk minimization (ERM) estimator can achieve the same rate as FNNs even the channel size, filter size, and width of CNNs are constant with respect to the sample size. This is minimax optimal (up to logarithmic factors) for the H\"older class. Our proof is based on sophisticated evaluations of the covering number of CNNs and the non-trivial parameter rescaling technique to control the Lipschitz constant of CNNs to be constructed.


tl;dr Develop new techniques that rely on patch reordering to enable detailed analysis of data-set relationship to training and generalization performances.

We present a framework for automatically ordering image patches that enables in-depth analysis of dataset relationship to learnability of a classification task using convolutional neural network. An image patch is a group of pixels residing in a continuous area contained in the sample. Our preliminary experimental results show that an informed smart shuffling of patches at a sample level can expedite training by exposing important features at early stages of training. In addition, we conduct systematic experiments and provide evidence that CNN’s generalization capabilities do not correlate with human recognizable features present in training samples. We utilized the framework not only to show that spatial locality of features within samples do not correlate with generalization, but also to expedite convergence while achieving similar generalization performance. Using multiple network architectures and datasets, we show that ordering image regions using mutual information measure between adjacent patches, enables CNNs to converge in a third of the total steps required to train the same network without patch ordering.

Classification in the dark using tactile exploration    

tl;dr In this work, we study the problem of learning representations to identify novel objects by exploring objects using tactile sensing. Key point here is that the query is provided in image domain.

Combining information from different sensory modalities to execute goal directed actions is a key aspect of human intelligence. Specifically, human agents are very easily able to translate the task communicated in one sensory domain (say vision) into a representation that enables them to complete this task when they can only sense their environment using a separate sensory modality (say touch). In order to build agents with similar capabilities, in this work we consider the problem of a retrieving a target object from a drawer. The agent is provided with an image of a previously unseen object and it explores objects in the drawer using only tactile sensing to retrieve the object that was shown in the image without receiving any visual feedback. Success at this task requires close integration of visual and tactile sensing. We present a method for performing this task in a simulated environment using an anthropomorphic hand. We hope that future research in the direction of combining sensory signals for acting will find the object retrieval from a drawer to be a useful benchmark problem

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding    

tl;dr We present a multi-task benchmark and analysis platform for evaluating generalization in natural language understanding systems.

For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.

Inferring Reward Functions from Demonstrators with Unknown Biases    

tl;dr When we infer preferences from behavior, we can try to improve accuracy by jointly learning a bias model and preferences, though this requires new assumptions to make progress.

Our goal is to infer reward functions from demonstrations. In order to infer the correct reward function, we must account for the systematic ways in which the demonstrator is suboptimal. Prior work in inverse reinforcement learning can account for specific, known biases, but cannot handle demonstrators with unknown biases. In this work, we explore the idea of learning the demonstrator's planning algorithm (including their unknown biases), along with their reward function. What makes this challenging is that any demonstration could be explained either by positing a term in the reward function, or by positing a particular systematic bias. We explore what assumptions are sufficient for avoiding this impossibility result: either access to tasks with known rewards which enable estimating the planner separately, or that the demonstrator is sufficiently close to optimal that this can serve as a regularizer. In our exploration with synthetic models of human biases, we find that it is possible to adapt to different biases and perform better than assuming a fixed model of the demonstrator, such as Boltzmann rationality.

On Meaning-Preserving Adversarial Perturbations for Sequence-to-Sequence Models    

tl;dr How you should evaluate adversarial attacks on seq2seq

Adversarial examples have been shown to be an effective way of assessing the robustness of neural sequence-to-sequence (seq2seq) models, by applying perturbations to the input of a model leading to large degradation in performance. However, these perturbations are only indicative of a weakness in the model if they do not change the semantics of the input in a way that would change the expected output. Using the example of machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models taking meaning preservation into account and demonstrate that existing methods may not preserve meaning in general. Based on these findings, we propose new constraints for attacks on word-based MT systems and show, via human and automatic evaluation, that they produce more semantically similar adversarial inputs. Furthermore, we show that performing adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness without hurting test performance.

Variance Reduction for Reinforcement Learning in Input-Driven Environments    

tl;dr For environments dictated partially by external input processes, we derive an input-dependent baseline that provably reduces the variance for policy gradient methods and improves the policy performance in a wide range of RL tasks.

We consider reinforcement learning in input-driven environments, where an exogenous, stochastic input process affects the dynamics of the system. Input processes arise in many applications, including queuing systems, robotics control with disturbances, and object tracking. Since the state dynamics and rewards depend on the input process, the state alone provides limited information for the expected future returns. Therefore, policy gradient methods with standard state-dependent baselines suffer high variance during training. We derive a bias-free, input-dependent baseline to reduce this variance, and analytically show its benefits over state-dependent baselines. We then propose a meta-learning approach to overcome the complexity of learning a baseline that depends on a long sequence of inputs. Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies.

Deep Frank-Wolfe For Neural Network Optimization    

No tl;dr =[

Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm or its adaptive variants. However, SGD requires a hand-designed schedule for the learning rate. In addition, its adaptive variants tend to produce solutions that generalize less well on unseen data than SGD with a hand-designed schedule. We present an optimization method that offers the best of both worlds: our algorithm yields good generalization performance while requiring only one hyper-parameter. Our approach is based on a composite proximal framework, which exploits the compositional nature of deep neural networks and can leverage powerful convex optimization algorithms by design. Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes an optimal step-size in closed-form at each time-step. We further show that the descent direction is given by a simple backward pass in the network, yielding the same computational cost per iteration as SGD. We customize the algorithm in two ways to further improve its performance. First, we use a descent direction that smoothes the loss function to better condition the problem. Second, we combine our proximal algorithm with Nesterov momentum to benefit from acceleration. We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while converging faster.

Domain Adaptive Transfer Learning    

No tl;dr =[

Transfer learning is a widely used method to build high performing computer vision models. In this paper, we study the efficacy of transfer learning by examining how the choice of data impacts performance. We find that more pre-training data does not always help, and transfer performance depends on a judicious choice of pre-training data. These findings are important given the continued increase in dataset sizes. We further propose domain adaptive transfer learning, a simple and effective pre-training method using importance weights computed based on the target dataset. Our methods achieve state-of-the-art results on multiple fine-grained classification datasets and are well-suited for use in practice.

CGNF: Conditional Graph Neural Fields    

No tl;dr =[

Graph convolutional networks have achieved tremendous success in the tasks of graph node classification. These models could learn a better node representation through encoding the graph structure and node features. However, the correlation between the node labels are not considered. In this paper, we propose a novel architecture for graph node classification, named conditional graph neural fields (CGNF). By integrating the conditional random fields (CRF) in the graph convolutional networks, we explicitly model a joint probability of the entire set of node labels, thus taking advantage of neighborhood label information in the node label prediction task. Our model could have both the representation capacity of graph neural networks and the prediction power of CRFs. Experiments on several graph datasets demonstrate effectiveness of CGNF.

AntisymmetricRNN: A dynamical System View on Recurrent Neural Networks    

No tl;dr =[

Recurrent neural networks have gained widespread use in modeling sequential data. Learning long-term dependencies using these models remains difficult though, due to exploding or vanishing gradients. In this paper, we draw connections between recurrent networks and ordinary differential equations. A special form of recurrent network called the AntisymmetricRNN is proposed under this theoretical framework, which is able to capture long-term dependencies thanks to the stability property of its underlying differential equation. In comparison to existing approaches for improving RNN trainability, which often incur significant computation overhead, AntisymmetricRNN achieves the same goal by design. We showcase the advantage of this new architecture through extensive simulations and experiments. AntisymmetricRNN exhibits much more predictable dynamics. It outperforms regular LSTM models on tasks requiring long-term memory, and matches the performance on tasks where short-term dependencies dominate despite being much simpler.

Super-Resolution via Conditional Implicit Maximum Likelihood Estimation    

tl;dr We propose a new method for image super-resolution based on IMLE.

Single-image super-resolution (SISR) is a canonical problem with diverse applications. Leading methods like SRGAN produce images that contain various artifacts, such as high-frequency noise, hallucinated colours and shape distortions, which adversely affect the realism of the result. In this paper, we propose an alternative approach based on an extension of the method of Implicit Maximum Likelihood Estimation (IMLE). We demonstrate greater effectiveness at noise reduction and preservation of the original colours and shapes, yielding more realistic super-resolved images.

The Expressive Power of Gated Recurrent Units as a Continuous Dynamical System    

tl;dr We classify the the dynamical features one and two GRU cells can and cannot capture in continuous time, and verify our findings experimentally with k-step time series prediction.

Gated recurrent units (GRUs) were inspired by the common gated recurrent unit, long short-term memory (LSTM), as a means of capturing temporal structure with less complex memory unit architecture. Despite their incredible success in tasks such as natural and artificial language processing, speech, video, and polyphonic music, very little is understood about the specific dynamic features representable in a GRU network. As a result, it is difficult to know a priori how successful a GRU-RNN will perform on a given data set. In this paper, we develop a new theoretical framework to analyze one and two dimensional GRUs as a continuous dynamical system, and classify the dynamic features obtainable with such system. In addition, we show that a two dimensional GRU cannot mimic the dynamics of a ring attractor, or more generally, any line attractor without near zero constant curvature in phase space. These results were then experimentally verified by means of time series prediction.

Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks    

tl;dr We present an analytical framework to determine accumulation bit-width requirements in all three deep learning training GEMMs and verify the validity and tightness of our method via benchmarking experiments.

Efforts to reduce the numerical precision of computations in deep learning training have yielded systems that aggressively quantize weights and activations, yet employ wide high-precision accumulators for partial sums in inner-product operations to preserve the quality of convergence. The absence of any framework to analyze the precision requirements of partial sum accumulations results in conservative design choices. This imposes an upper-bound on the reduction of complexity of multiply-accumulate units. We present a statistical approach to analyze the impact of reduced accumulation precision on deep learning training. Observing that a bad choice for accumulation precision results in loss of information that manifests itself as a reduction in variance in an ensemble of partial sums, we derive a set of equations that relate this variance to the length of accumulation and the minimum number of bits needed for accumulation. We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. Overall this analysis enables precise tailoring of computation hardware to the application, yielding area- and power-optimal systems.

Characterizing Audio Adversarial Examples Using Temporal Dependency    

tl;dr Adversarial audio discrimination using temporal dependency

Recent studies have highlighted adversarial examples as a ubiquitous threat to different neural network models and many downstream applications. Nonetheless, as unique data properties have inspired distinct and powerful learning principles, this paper aims to explore their potentials towards mitigating adversarial inputs. In particular, our results reveal the importance of using the temporal dependency in audio data to gain discriminate power against adversarial examples. Tested on the automatic speech recognition (ASR) tasks and three recent audio adversarial attacks, we find that (i) input transformation developed from image adversarial defense provides limited robustness improvement and is subtle to advanced attacks; (ii) temporal dependency can be exploited to gain discriminative power against audio adversarial examples and is resistant to adaptive attacks considered in our experiments. Our results not only show promising means of improving the robustness of ASR systems, but also offer novel insights in exploiting domain-specific data properties to mitigate negative effects of adversarial examples.

Empirical observations on the instability of aligning word vector spaces with GANs    

tl;dr An empirical investigation of GAN-based alignment of word vector spaces, focusing on cases, where linear transformations provably exist, but training is unstable.

Unsupervised bilingual dictionary induction (UBDI) is useful for unsupervised machine translation and for cross-lingual transfer of models into low-resource languages. One approach to UBDI is to align word vector spaces in different languages using Generative adversarial networks (GANs) with linear generators, achieving state-of-the-art performance for several language pairs. For some pairs, however, GAN-based induction is unstable or completely fails to align the vector spaces. We focus on cases where linear transformations provably exist, but the performance of GAN-based UBDI depends heavily on the model initialization. We show that the instability depends on the shape and density of the vector sets, but not on noise; it is the result of local optima, but neither over-parameterization nor changing the batch size or the learning rate consistently reduces instability. Nevertheless, we can stabilize GAN-based UBDI through best-of-N model selection, based on an unsupervised stopping criterion.


No tl;dr =[

We propose Distributional Concavity (DC) regularization for GANs, a functional gradient-based method that promotes the entropy of the generator distribution and works against mode collapse. Our DC regularization is an easy-to-implement method that can be used in combination with the current state of the art methods like Spectral Normalization and WGAN-GP to further improve the performance. We will not only show that our DC regularization can achieve highly competi- tive results on ImageNet and CIFAR datasets in terms of Inception score and FID score, but also provide a mathematical guarantee that our method can always in- crease the entropy of the generator distribution. We will also show an intimate theoretical connection between our method and the theory of optimal transport.

L2-Nonexpansive Neural Networks    

No tl;dr =[

This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness, with a new regularization scheme for linear layers, new ways to adapt nonlinearities and a new loss function. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. They generalize better than ordinary networks from noisy data with partially random labels. Their outputs are quantitatively meaningful and indicate levels of confidence and generalization, among other desirable properties.

Accelerated Sparse Recovery Under Structured Measurements    

No tl;dr =[

Extensive work on compressed sensing has yielded a rich collection of sparse recovery algorithms, each making different tradeoffs between recovery condition and computational efficiency. In this paper, we propose a unified framework for accelerating various existing sparse recovery algorithms without sacrificing recovery guarantees by exploiting structure in the measurement matrix. Unlike fast algorithms that are specific to particular choices of measurement matrices where the columns are Fourier or wavelet filters for example, the proposed approach works on a broad range of measurement matrices that satisfy a particular property. We precisely characterize this property, which quantifies how easy it is to accelerate sparse recovery for the measurement matrix in question. We also derive the time complexity of the accelerated algorithm, which is sublinear in the signal length in each iteration. Moreover, we present experimental results on real world data that demonstrate the effectiveness of the proposed approach in practice.

Bounce and Learn: Modeling Scene Dynamics with Real-World Bounces    

No tl;dr =[

We introduce an approach to model surface properties governing bounces in everyday scenes. Our model learns end-to-end, starting from sensor inputs, to predict post-bounce trajectories and infer two underlying physical properties that govern bouncing - restitution and effective collision normals. Our model, Bounce and Learn, comprises two modules -- a Physics Inference Module (PIM) and a Visual Inference Module (VIM). VIM learns to infer physical parameters for locations in a scene given a single still image, while PIM learns to model physical interactions for the prediction task given physical parameters and observed pre-collision 3D trajectories. To achieve our results, we introduce the Bounce Dataset comprising 5K RGB-D videos of bouncing trajectories of a foam ball to probe surfaces of varying shapes and materials in everyday scenes including homes and offices. Our proposed model learns from our collected dataset of real-world bounces and is bootstrapped with additional information from simple physics simulations. We show qualitative and quantitative results on our newly collected dataset and outline open challenges for learning to model real-world bounces.

Expressiveness in Deep Reinforcement Learning    

No tl;dr =[

Representation learning in reinforcement learning (RL) algorithms focuses on extracting useful features for choosing good actions. Expressive representations are essential for learning well-performed policies. In this paper, we study the relationship between the state representation assigned by the state extractor and the performance of the RL agent. We observe that representations assigned by the better state extractor are more scattered than which assigned by the worse one. Moreover, RL agents achieving high performances always have high rank matrices which are composed by their representations. Based on our observations, we formally define expressiveness of the state extractor as the rank of the matrix composed by representations. Therefore, we propose to promote expressiveness so as to improve algorithm performances, and we call it Expressiveness Promoted DRL. We apply our method on both policy gradient and value-based algorithms, and experimental results on 55 Atari games show the superiority of our proposed method.

Graph Matching Networks for Learning the Similarity of Graph Structured Objects    

tl;dr We tackle the problem of similarity learning for structured objects with applications in particular in computer security, and propose a new model graph matching networks that excels on this task.

This paper addresses the challenging problem of retrieval and matching of graph structured objects, and makes two key contributions. First, we demonstrate how Graph Neural Networks (GNN), which have emerged as an effective model for various supervised prediction problems defined on structured data, can be trained to produce embedding of graphs in vector spaces that enables efficient similarity reasoning. Second, we propose a novel Graph Matching Network model that, given a pair of graphs as input, computes a similarity score between them by jointly reasoning on the pair through a new cross-graph attention-based matching mechanism. We demonstrate the effectiveness of our models on different domains including the challenging problem of control-flow-graph based function similarity search that plays an important role in the detection of vulnerabilities in software systems. The experimental analysis demonstrates that our models are not only able to exploit structure in the context of similarity learning but they can also outperform domain-specific baseline systems that have been carefully hand-engineered for these problems.

Label Smoothing and Logit Squeezing: A Replacement for Adversarial Training?    

tl;dr Achieving strong adversarial robustness and exceeding adversarial training without training on adversarial examples

Adversarial training is one of the strongest defenses against adversarial attacks, but it requires adversarial examples to be generated for every mini-batch during optimization. The expense of producing these examples during training often precludes adversarial training from use on large and high-resolution image datasets. In this study, we explore the mechanisms by which adversarial training improves classifier robustness, and show that these mechanisms can be effectively mimicked using simple regularization methods, including label smoothing and logit squeezing. Remarkably, using these simple regularization methods in combination with Gaussian noise injection, we are able to achieve strong adversarial robustness -- often exceeding that of adversarial training -- using no adversarial examples.

Learning space time dynamics with PDE guided neural networks    

No tl;dr =[

Spatio-Temporal processes bear a central importance in many applied scientific fields. Generally, differential equations are used to describe these processes. In this work, we address the problem of learning spatio-temporal dynamics with neural networks when only partial information on the system's state is available. Taking inspiration from the dynamical system approach, we outline a general framework in which complex dynamics generated by families of differential equations can be learned in a principled way. Two models are derived from this framework. We demonstrate how they can be applied in practice by considering the problem of forecasting fluid flows. We show how the underlying equations fit into our formalism and evaluate our method by comparing with standard baselines.

Unifying Bilateral Filtering and Adversarial Training for Robust Neural Networks    

tl;dr We adapt bilateral filtering as a layer in a neural network which improves robustness to adversarial examples using nonlocal filtering.

Recent analysis of deep neural networks has revealed their vulnerability to carefully structured adversarial examples. Many effective algorithms exist to craft these adversarial examples, but performant defenses seem to be far away. In this work, we explore the use of edge-aware bilateral filtering as a projection back to the space of natural images. We show that bilateral filtering is an effective defense in multiple attack settings, where the strength of the adversary gradually increases. In the case of adversary who has no knowledge of the defense, bilateral filtering can remove more than 90% of adversarial examples from a variety of different attacks. To evaluate against an adversary with complete knowledge of our defense, we adapt the bilateral filter as a trainable layer in a neural network and show that adding this layer makes ImageNet images significantly more robust to attacks. When trained under a framework of adversarial training, we show that the resulting model is hard to fool with even the best attack methods.

Pseudosaccades: A simple ensemble scheme for improving classification performance of deep nets    

tl;dr Inspired by saccades we describe a simple, cheap, effective way to improve deep net performance on an image labelling task.

We describe a simple ensemble approach that, unlike conventional ensembles, uses multiple random data sketches (‘pseudosaccades’) rather than multiple classifiers to improve classification performance. Using this simple, but novel, approach we obtain statistically significant improvements in classification performance on AlexNet, GoogLeNet, ResNet-50 and ResNet-152 baselines on Imagenet data – e.g. of the order of 0.3% to 0.6% in Top-1 accuracy and similar improvements in Top-k accuracy – essentially nearly for free.

Uncertainty-guided Lifelong Learning in Bayesian Networks    

tl;dr We formulate lifelong learning in the Bayesian-by-Backprop framework, exploiting the parameter uncertainty in two settings: for pruning network parameters and in importance weight based continual learning.

The ability to learn in a setting where tasks arrive in a sequence without access to previous task data is difficult for learning algorithms when restricted in capacity. In this lifelong learning setting a single model is challenged to learning a new task, while at the same time not forgetting about previous tasks and freeing up capacity for future tasks. We argue that the ability to identify network parameters which are most critical for a learned task plays a critical role to decide which ones to remember. In this work we propose to rely on Bayesian Networks, which inherently model the distribution of a parameter rather than a single value of a parameter. More specifically, we formulate lifelong learning in the Bayesian-by-Backprop framework, exploiting the parameter uncertainty for two lifelong learning directions. First, weight pruning, where a hard selection is made on which parameters to select per task and, second, weight regularization which can be seen as a softer version to keep important parameters, respectively. We show the benefit of our approach using diverse object classification datasets in both cases.


tl;dr A deep learning based graph classification method plus a new adaptive method for graph pooling.

Most of the existing Graph Neural Networks (GNNs) are the mere extension of the Convolutional Neural Networks (CNNs) to graphs. Generally, they consist of several steps of message passing between the nodes followed by a global indiscriminate feature pooling function. However, most of the times the nodes are unlabeled or their labels (or the given feature vectors of the nodes) provide no information about the similarity between the nodes and the locations of the nodes in the graph. Accordingly, message passing may not propagate helpful information throughout the graph. We show that this conventional approach fails to learn to solve even simple graph classification tasks. We alleviate this serious shortcoming of the GNNs by making them a two step method where in the second step, the message passing block is given the continuous features obtained by the embedding algorithm in the first step. The GNN learns to solve the given task by inferring the topological structure of the graph encoded in the spatial distribution of the embedded vectors. The second challenge we address in this paper is designing a pooling algorithm applicable to graphs. We turn the problem of graph down-sampling into a column sampling problem, i.e., the sampling algorithm samples a subset of the nodes whose feature vectors preserve the spatial distribution of all the feature vectors. We apply the proposed approach to several established benchmark data sets and it is shown that the proposed geometrical approach strongly improves the state-of-the-art for several data-sets.

Improved resistance of neural networks to adversarial images through generative pre-training    

tl;dr Generative pre-training with mean field Boltzmann machines increases robustness against adversarial images in neural networks.

We train a feed forward neural network with increased robustness against adversarial attacks compared to conventional training approaches. This is achieved using a novel pre-trained building block based on a mean field description of a Boltzmann machine. On the MNIST dataset the method achieves strong adversarial resistance without data augmentation or adversarial training. We show that the increased adversarial resistance is correlated with the generative performance of the underlying Boltzmann machine.

Direct Optimization through $\arg \max$ for Discrete Variational Auto-Encoder    

No tl;dr =[

Reparameterization of variational auto-encoders is an effective method for reducing the variance of their gradient estimates. However, when the latent variables are discrete, a reparameterization is problematic due to discontinuities in the discrete space. In this work, we extend the direct loss minimization technique to discrete variational auto-encoders. We first reparameterize a discrete random variable using the $\arg \max$ function of the Gumbel-Max perturbation model. We then use direct optimization to propagate gradients through the non-differentiable $\arg \max$ using two perturbed $\arg \max$ operations.

On the Trajectory of Stochastic Gradient Descent in the Information Plane    

tl;dr We look at SGD as a trajectory in the space of probability measures, show its connection to Markov processes, propose a simple Markov model of SGD learning, and experimentally compare it with SGD using information theoretic quantities.

Studying the evolution of information theoretic quantities during Stochastic Gradient Descent (SGD) learning of Artificial Neural Networks (ANNs) has gained popularity in recent years. Nevertheless, this type of experiments require estimating mutual information and entropy which becomes intractable for moderately large problems. In this work we propose an experimental framework for understanding SGD based only on the output labels of ANNs. We look at SGD learning as a trajectory in the space of probability measures and define a notion of shortest learning path using a total variation metric. Using this formulation we provide a connection between learning and Markov processes that allows us characterize the trajectory of information theoretic quantities during learning. In addition, a simple Markov chain model for SGD learning, that moves along the shortest learning path is constructed and compared with SGD through empirical simulations. Experiments show that SGD moves in a similar trajectory as a Markov chain along the shortest learning path.


No tl;dr =[

To be effective in sequential data processing, Recurrent Neural Networks (RNNs) are required to perform data processing as well as to keep track of past events by creating memories. Consequently RNNs are harder to train than their not recurrent counterparts. In this paper, we investigate the representation of memories formed in trained RNN internal state under various training protocols. It is not clear whether there is a trade-off between learning discriminative task, learning actions and learning to memorize. Having observed the RNN’s apparently consistent performance regardless of training protocol, we expected the internal dynamics to be similar as well. Instead we were surprised to discover substantial differences, leading to differences in the ability to generalize for unforeseen tasks or conditions. In an attempt to understand these differences we proposed a method for tracking the formation of memories along the course of training, and indeed results on memory shaping process were obtained.

Dirichlet Variational Autoencoder    

No tl;dr =[

This paper proposes Dirichlet Variational Autoencoder (DirVAE) using a Dirichlet prior for a continuous latent variable that exhibits the characteristic of the categorical probabilities. To infer the parameters of DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reshape the component collapsing issue by investigating two problem sources, which are decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing; while Gaussian VAE exhibits the decoder weight collapsing and Stick-Breaking VAE shows the latent value collapsing. The experimental results show that 1) DirVAE models the latent representation result with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with no collapsing issues which the baseline models suffer from. Also, we show that the learned latent representation from the DirVAE achieves the best classification accuracy in the semi-supervised and the supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrated that the DirVAE augmented topic models show better performances in most cases.

LayoutGAN: Generating Graphic Layouts with Wireframe Discriminator    

No tl;dr =[

Layouts are important for graphic design and scene generation. We propose a novel generative adversarial network, named as LayoutGAN, that synthesizes graphic layouts by modeling semantic and geometric relations of 2D elements. The generator of LayoutGAN takes as input a set of randomly placed 2D graphic elements and uses self-attention modules to refine their semantic and geometric parameters jointly to produce a meaningful layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in visual domain. We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design.

RelGAN: Relational Generative Adversarial Networks for Text Generation    

No tl;dr =[

Generative adversarial networks (GANs) have achieved great success at generating realistic images. However, the text generation still remains a challenging task for modern GAN architectures. In this work, we propose RelGAN, a new GAN architecture for text generation, consisting of three main components: a relational memory based generator for the long-distance dependency modeling, the Gumbel-Softmax relaxation for training GANs on discrete data, and multiple embedded representations in the discriminator to provide a more informative signal for the generator updates. Our experiments show that RelGAN outperforms current state-of-the-art models in terms of sample quality and diversity, and we also reveal via ablation studies that each component of RelGAN contributes critically to its performance improvements. Moreover, a key advantage of our method, that distinguishes it from other GANs, is the ability to control the trade-off between sample quality and diversity via the use of a single adjustable parameter. Finally, RelGAN is the first architecture that makes GANs with Gumbel-Softmax relaxation succeed in generating realistic text.

Convolutional Neural Networks combined with Runge-Kutta Methods    

No tl;dr =[

A convolutional neural network for image classification can be constructed mathematically since it is inspired by the ventral stream in visual cortex which can be regarded as a multi-period dynamical system. In this paper, a novel approach is proposed to construct network models from the dynamical systems view. Since a pre-activation residual network can be deemed an approximation of a time-dependent dynamical system using the Euler method, higher order Runge-Kutta methods (RK methods) can be utilized to build network models in order to achieve higher accuracy. The model constructed in such a way is referred to as the Runge-Kutta Convolutional Neural Network (RKNet). RK methods also provide an interpretation of Dense Convolutional Networks (DenseNets) and Convolutional Neural Networks with Alternately Updated Clique (CliqueNets) from the dynamical systems view. The proposed methods are evaluated on the benchmark datasets: CIFAR-10/100 and ImageNet. The experimental results are consistent with the theoretical properties of RK methods and support the dynamical systems interpretation. Moreover, the experimental results show that the RKNets are superior to the state-of-the-art network models on CIFAR-10 and be comparable with them on CIFAR-100 and ImageNet.

Understanding GANs via Generalization Analysis for Disconnected Support    

tl;dr We investigate the generalization performance of GANs and show how GANs outperform others with a specific property of data.

This paper provides theoretical analysis of generative adversarial networks (GANs) to explain its advantages over other standard methods of learning probability measures. GANs learn a probability through observations, using the objective function with a generator and a discriminator. While many empirical results indicate that GANs can generate realistic samples, the reason for such successful performance remains unelucidated. This paper focuses the situation where the target probability measure satisfies the disconnected support property, which means a separate support of a probability, and relates it with the advantage of GANs. It is theoretically shown that, unlike other popular models, GANs do not suffer from the decrease of generalization performance caused by the disconnected support property. We rigorously quantify the generalization performance of GANs of a given architecture, and compare it with the performance of the other models. Based on the theory, we also provide a guideline for selecting deep network architecture for GANs. We demonstrate some numerical examples which support our results.

The Forward-Backward Embedding of Directed Graphs    

No tl;dr =[

We introduce a novel embedding of directed graphs derived from the singular value decomposition (SVD) of the normalized adjacency matrix. Specifically, we show that, after proper normalization of the singular vectors, the distances between vectors in the embedding space are proportional to the mean commute times between the corresponding nodes by a forward-backward random walk in the graph, which follows the edges alternately in forward and backward directions. In particular, two nodes having many common successors in the graph tend to be represented by close vectors in the embedding space. More formally, we prove that our representation of the graph is equivalent to the spectral embedding of some co-citation graph, where nodes are linked with respect to their common set of successors in the original graph. The interest of our approach is that it does not require to build this co-citation graph, which is typically much denser than the original graph. Experiments on real datasets show the efficiency of the approach.

Deep Generative Models for learning Coherent Latent Representations from Multi-Modal Data    

tl;dr Deriving a general formulation of a multi-modal VAE from the joint marginal log-likelihood.

The application of multi-modal generative models by means of a Variational Auto Encoder (VAE) is an upcoming research topic for sensor fusion and bi-directional modality exchange. This contribution gives insights into the learned joint latent representation and shows that expressiveness and coherence are decisive properties for multi-modal data sets. Furthermore, we propose a multi-modal VAE derived from the full joint marginal log-likelihood that is able to learn the most meaningful representation for ambiguous observations. Since the properties of multi-modal sensor setups are essential for our approach but hardly available, we also propose a technique to generate correlated data sets from uni-modal ones.

Stable Opponent Shaping in Differentiable Games    

tl;dr Opponent shaping is a powerful approach to multi-agent learning but can prevent convergence; our SOS algorithm fixes this with strong guarantees in all differentiable games.

A growing number of learning methods are actually games which optimise multiple, interdependent objectives in parallel -- from GANs and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful approach to improve learning dynamics in such games, accounting for the fact that the 'environment' includes agents adapting to one another's updates. Learning with Opponent-Learning Awareness (LOLA) is a recent algorithm which exploits this dynamic response and encourages cooperation in settings like the Iterated Prisoner's Dilemma. Although experimentally successful, we show that LOLA can exhibit 'arrogant' behaviour directly at odds with convergence. In fact, remarkably few algorithms have theoretical guarantees applying across all differentiable games. In this paper we present Stable Opponent Shaping (SOS), a new method that interpolates between LOLA and a stable variant named LookAhead. We prove that LookAhead locally converges and avoids strict saddles in all differentiable games, the strongest results in the field so far. SOS inherits these desirable guarantees, while also shaping the learning of opponents and consistently either matching or outperforming LOLA experimentally.

Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks    

tl;dr We introduce a new inductive bias that integrates tree structures in recurrent neural networks.

Recurrent neural network (RNN) models are widely used for processing sequential data governed by a latent tree structure. Previous work shows that RNN models (especially Long Short-Term Memory (LSTM) based models) could learn to exploit the underlying tree structure. However, its performance consistently lags behind that of tree-based models. This work proposes a new inductive bias Ordered Neurons, which enforces an order of updating frequencies between hidden state neurons. We show that the ordered neurons could explicitly integrate the latent tree structure into recurrent models. To this end, we propose a new RNN unit: ON-LSTM, which achieve good performances on four different tasks: language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference.

Adaptive Posterior Learning    

tl;dr We introduce a model which generalizes quickly from few observations by storing surprising information and attending over the most relevant data at each time point.

The ability to generalize quickly from few observations is crucial for intelligent systems. In this paper we introduce APL, an algorithm that approximates probability distributions by remembering the most surprising observations it has encountered. These past observations are recalled from an external memory module and processed by a decoder network that can combine information from different memory slots to generalize beyond direct recall. We show this algorithm can perform as well as state of the art baselines on few-shot classification benchmarks with a smaller memory footprint. In addition, its memory compression allows it to scale to thousands of unknown labels. Finally, we introduce a meta-learning reasoning task which is more challenging than direct classification. In this setting, APL is able to generalize with fewer than one example per class via deductive reasoning.

Adversarially Learned Mixture Model    

tl;dr The AMM is the first fully adversarially optimized method to model the conditional dependence between categorical and continuous latent variables.

The Adversarially Learned Mixture Model (AMM) is a generative model for unsupervised or semi-supervised data clustering. The AMM is the first adversarially optimized method to model the conditional dependence between inferred continuous and categorical latent variables. Experiments on the MNIST and SVHN datasets show that the AMM allows for semantic separation of complex data when little or no labeled data is available. The AMM achieves unsupervised clustering error rates of 3.32% and 20.4% on the MNIST and SVHN datasets, respectively. A semi-supervised extension of the AMM achieves a classification error rate of 5.60% on the SVHN dataset.

Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures    

tl;dr We show that rare but catastrophic failures may be missed entirely by random testing, which poses issues for safe deployment. Our proposed approach for adversarial testing fixes this.

This paper addresses the problem of evaluating learning systems in safety critical domains such as autonomous driving, where failures can have catastrophic consequences. To this end, we focus on two problems: searching for scenarios when learned agents fail and the related problem of assessing their probability of failure. The standard method for agent evaluation in reinforcement learning, Vanilla Monte Carlo, can severely underestimate agent failure probabilities, leading to the deployment of unsafe agents. In our experiments, we observe this even after allocating equal compute to training and evaluation. To address this shortcoming, we draw upon the rare event probability estimation literature and propose an adversarial evaluation approach. Our approach focuses evaluation on difficult scenarios that are selected adversarially, while still providing unbiased estimates of failure probabilities. To do this, we propose a continuation approach to learning a failure probability predictor. This leverages data from related agents to overcome issues of data sparsity and allows the adversary to reuse data gathered for training the agent. We demonstrate the efficacy of adversarial evaluation on two complex reinforcement learning domains (humanoid control and simulated driving). Experimental results show that our methods can find catastrophic failures and estimate failures rates of agents multiple orders of magnitude faster (hours instead of days) than standard evaluation schemes.

Visual Imitation with a Minimal Adversary    

tl;dr Imitation from pixels, with sparse or no reward, using off-policy RL and a tiny adversarially-learned reward function.

High-dimensional sparse reward tasks present major challenges for reinforcement learning agents. In this work we use imitation learning to address two of these challenges: how to learn a useful representation of the world e.g. from pixels, and how to explore efficiently given the rarity of a reward signal? We show that adversarial imitation can work well even in this high dimensional observation space. Surprisingly the adversary itself, acting as the learned reward function, can be tiny, comprising as few as 128 parameters, and can be easily trained using the most basic GAN formulation. Our approach removes limitations present in most contemporary imitation approaches: requiring no demonstrator actions (only video), no special initial conditions or warm starts, and no explicit tracking of any single demo. The proposed agent can solve a challenging robot manipulation task of block stacking from only video demonstrations and sparse reward, in which the non-imitating agents fail to learn completely. Furthermore, our agent learns much faster than competing approaches that depend on hand-crafted, staged dense reward functions, and also better compared to standard GAIL baselines. Finally, we develop a new adversarial goal recognizer that in some cases allows the agent to learn stacking without any task reward, purely from imitation.

Second-Order Adversarial Attack and Certifiable Robustness    

No tl;dr =[

Adversarial training has been recognized as a strong defense against adversarial attacks. In this paper, we propose a powerful second-order attack method that reduces the accuracy of the defense model by Madry et al. (2017). We demonstrate that adversarial training overfits to the choice of the norm in the sense that it is only robust to the attack used for adversarial training, thus suggesting it has not achieved universal robustness. The effectiveness of our attack method motivates an investigation of provable robustness of a defense model. To this end, we introduce a framework that allows one to obtain a certifiable lower bound on the prediction accuracy against adversarial examples. We conduct experiments to show the effectiveness of our attack method. At the same time, our defense model achieves significant improvements compared to previous works under our proposed attack.

A Forensic Representation to Detect Non-Trivial Image Duplicates, and How it Applies to Semantic Segmentation    

tl;dr A forensic metric to determine if a given image is a copy (with possible manipulation) of another image from a given dataset.

Manipulation and re-use of images in scientific publications is a recurring problem, at present lacking a scalable solution. Existing tools for detecting image duplication are mostly manual or semi-automated, despite the fact that generating data for a learning-based approach is straightforward, as we here illustrate. This paper addresses the problem of determining if, given two images, one is a manipulated version of the other by means of certain geometric and statistical manipulations, e.g. copy, rotation, translation, scale, perspective transform, histogram adjustment, partial erasing, and compression artifacts. We propose a solution based on a 3-branch Siamese Convolutional Neural Network. The ConvNet model is trained to map images into a 128-dimensional space, where the Euclidean distance between duplicate (respectively, unique) images is no greater (respectively, greater) than 1. Our results suggest that such an approach can serve as tool to improve surveillance of the published and in-peer-review literature for image manipulation. We also show that as a byproduct the network learns useful representations for semantic segmentation, with performance comparable to that of domain-specific models.

Multi-Modal Generative Adversarial Networks for Diverse Datasets    

tl;dr Multi modal Guassian distribution of latent space in GAN models improves performance and allows to trade-off quality vs. diversity

Generative Adversarial Networks (GANs) have been shown to produce realistically looking synthetic images with remarkable success, yet their performance seems less impressive when the training set is highly diverse. In order to provide a better fit to the target data distribution when the dataset includes many different classes, we propose a variant of the basic GAN model, a Multi-Modal Gaussian-Mixture GAN (GM-GAN), where the probability distribution over the latent space is a mixture of Gaussians. We also propose a supervised variant which is capable of conditional sample synthesis. In order to evaluate the model's performance, we propose a new scoring method which separately takes into account two (typically conflicting) measures - diversity vs. quality of the generated data. Through a series of experiments, using both synthetic and real-world datasets, we quantitatively show that GM-GANs outperform baselines, both when evaluated using the commonly used Inception Score, and when evaluated using our own alternative scoring method. In addition, we qualitatively demonstrate how the unsupervised variant of GM-GAN tends to map latent vectors sampled from different Gaussians in the latent space to samples of different classes in the data space. We show how this phenomenon can be exploited for the task of unsupervised clustering, and provide quantitative evaluation showing the superiority of our method for the unsupervised clustering of image datasets. Finally, we demonstrate a feature which further sets our model apart from other GAN models: the option to control the quality-diversity trade-off by altering, post-training, the probability distribution of the latent space. This allows one to sample higher quality and lower diversity samples, or vice versa, according to one's needs.

Rethinking the Value of Network Pruning    

tl;dr In network pruning, fine-tuning a pruned model only gives comparable or worse performance than training it from scratch.

Network pruning is widely used for reducing the heavy computational cost of deep networks. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning, and each of the three stages is considered as indispensable. In this work, we make several surprising observations which contradict common beliefs. For all the six state-of-the-art pruning algorithms we examined, fine-tuning a pruned model only gives comparable or even worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined architecture of the target pruned network, one can completely get rid of the pipeline and directly train the target network from scratch. Our observations are consistent for a wide variety of pruning algorithms with multiple network architectures, datasets and tasks. Our results have several implications: 1) training an over-parameterized model is not necessary to obtain an efficient final model, 2) learned ``important'' weights of the large model are not necessarily helpful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited ``important'' weights, is what leads to the efficiency benefit in the final model, which suggests that some pruning algorithms could be seen as performing network architecture search.

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware    

tl;dr We accelerate secure DNN inference in trusted execution environments (by a factor 4x-20x) by selectively outsourcing the computation of linear layers to a faster yet untrusted co-processor.

As Machine Learning (ML) gets applied to security-critical or sensitive domains, there is a growing need for integrity and privacy for outsourced ML computations. A pragmatic solution comes from Trusted Execution Environments (TEEs), which use hardware and software protections to isolate sensitive computations from the untrusted software stack. However, these isolation guarantees come at a price in performance, compared to untrusted alternatives. This paper initiates the study of high performance execution of Deep Neural Networks (DNNs) in TEEs by efficiently partitioning DNN computations between trusted and untrusted devices. Building upon an efficient outsourcing scheme for matrix multiplication, we propose Slalom, a framework that securely delegates execution of all linear layers in a DNN from a TEE (e.g., Intel SGX or Sanctum) to a faster, yet untrusted, co-located processor. We evaluate Slalom by executing DNNs in an Intel SGX enclave, which selectively delegates work to an untrusted GPU. For two canonical DNNs, VGG16 and MobileNet, we obtain 20x and 6x increases in throughput for verifiable inference, and 11x and 4x for verifiable and private inference.

Self-Supervised Generalisation with Meta Auxiliary Learning    

tl;dr We propose Meta AuXiliary Learning (MAXL), a learning framework which can automatically generate auxiliary tasks to improve generalisation of the principal task in a self-supervised manner.

Auxiliary learning has been shown to improve the generalisation performance of a principal task. But typically, this requires manually-defined auxiliary tasks based on domain knowledge. In this paper, we consider that it may be possible to automatically learn these auxiliary tasks to best suit the principal task, towards optimum auxiliary tasks without any human knowledge. We propose a novel method, Meta Auxiliary Learning (MAXL), which we design for the task of image classification, where the auxiliary task is hierarchical sub-class image classification. The role of the meta learner is to determine sub-class target labels to train a multi-task evaluator, such that these labels improve the generalisation performance on the principal task. Experiments on three different CIFAR datasets show that MAXL outperforms baseline auxiliary learning methods, and is competitive even with a method which uses human-defined sub-class hierarchies. MAXL is self-supervised and general, and therefore offers a promising new direction towards automated generalisation.

Guaranteed Recovery of One-Hidden-Layer Neural Networks via Cross Entropy    

tl;dr We provide the first theoretical analysis of guaranteed recovery of one-hidden-layer neural networks under cross entropy loss for classification problems.

We study model recovery for data classification, where the training labels are generated from a one-hidden-layer fully -connected neural network with sigmoid activations, and the goal is to recover the weight vectors of the neural network. We prove that under Gaussian inputs, the empirical risk function using cross entropy exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large. This implies that if initialized in this neighborhood, which can be achieved via the tensor method, gradient descent converges linearly to a critical point that is provably close to the ground truth without requiring a fresh set of samples at each iteration. To the best of our knowledge, this is the first global convergence guarantee established for the empirical risk minimization using cross entropy via gradient descent for learning one-hidden-layer neural networks, at the near-optimal sample and computational complexity with respect to the network input dimension.

MuMoMAML: Model-Agnostic Meta-Learning for Multimodal Task Distributions    

tl;dr We proposed a meta-learner that generalizes across a multimodal task distribution by identifying the modes of the distribution and modulating its meta-learned prior accordingly, allowing further efficient adaptation through gradient updates.

Gradient-based meta-learners such as MAML (Finn et al., 2017) are able to learn a meta-prior from similar tasks to adapt to novel tasks from the same distribution with few gradient updates. However, such frameworks seek a common initialization shared across the entire task distribution, substantially limiting the diversity of the task distributions that they are able to learn from. In this paper, we aim to augment existing gradient-based meta-learners with the capability to identify the modes of a task distribution and adapt quickly through gradient updates given tasks sampled from a multimodal task distribution. Specifically, we propose a multimodal MAML algorithm (MuMoMAML), which is able to modulate its meta-learned prior according to the identified task modes, allowing further fast adaptation. We evaluate the proposed algorithm on a diverse set of problems including regression, few-shot image classification, and reinforcement learning. The results demonstrate the effectiveness of our model in efficiently acquiring a meta-learned prior under a multimodal task distribution.