Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)


Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: social dilemma, point set, on-policy learning, lstm, mixture model ..?


Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

"Random selection" has 100 results


L2-Nonexpansive Neural Networks    

No tl;dr =[

This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness, with a new regularization scheme for linear layers, new ways to adapt nonlinearities and a new loss function. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. They generalize better than ordinary networks from noisy data with partially random labels. Their outputs are quantitatively meaningful and indicate levels of confidence and generalization, among other desirable properties.
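
The abstract above does not spell out the regularization scheme, so here is only a minimal generic sketch (not the paper's method) of one standard way to make a single linear layer L2-nonexpansive: estimate the weight matrix's largest singular value by power iteration and rescale so it is at most 1.

```python
import numpy as np

def project_nonexpansive(W, n_iter=50):
    """Rescale W so that ||W x||_2 <= ||x||_2 for every x (spectral norm <= 1)."""
    v = np.random.randn(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):            # power iteration on W^T W
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(W @ v)      # estimated largest singular value
    return W / max(sigma, 1.0)         # only shrink, never expand

W = np.random.randn(64, 128)
W_proj = project_nonexpansive(W)
x, y = np.random.randn(128), np.random.randn(128)
assert np.linalg.norm(W_proj @ (x - y)) <= np.linalg.norm(x - y) + 1e-6
```

Composing such layers with 1-Lipschitz nonlinearities keeps the whole network nonexpansive, which is the property the paper exploits for robustness.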


AD-VAT: An Asymmetric Dueling mechanism for learning Visual Active Tracking    

tl;dr We propose AD-VAT, where the tracker and the target object, viewed as two learnable agents, are opponents and can mutually enhance during training.

Visual active tracking (VAT) aims at following a target object by autonomously controlling the motion system of a tracker given visual observations. In this paper, we propose a novel method which adopts an asymmetric dueling mechanism for learning visual active tracking, namely AD-VAT. In AD-VAT, the target and the tracker are mutual opponents, i.e., the tracker tries to lock onto the target, and the target tries to escape from the tracker. In the implementation, both the tracker and the target are approximated by deep networks, and their policies that map environment observations to control actions can be learned via reinforcement learning in an end-to-end manner. The tracker and the target are asymmetric in observations, network structures and reward functions. Different from the tracker, the target is modeled with a tracker-aware network, i.e., besides its own observation, the tracker's observations and actions are also fed as input to the network. In addition, it learns to predict the tracker's reward as an auxiliary task. We argue that such an asymmetric adversarial mechanism is able to learn a stronger target, which in turn induces a more robust tracker. The experimental results, in both 2D and 3D environments, demonstrate that the proposed method leads to faster convergence in training the tracker and more robust tracking behaviors in different testing scenarios.


DANA: Scalable Out-of-the-box Distributed ASGD Without Retuning    

tl;dr A new distributed asynchronous SGD algorithm that achieves state-of-the-art accuracy on existing architectures without any additional tuning or overhead.

Distributed computing can significantly reduce the training time of neural networks. Despite its potential, however, distributed training has not been widely adopted: scaling the training process is difficult, and existing SGD methods require substantial tuning of hyperparameters and learning schedules to achieve sufficient accuracy when increasing the number of workers. In practice, such tuning can be prohibitively expensive given the huge number of potential hyperparameter configurations and the effort required to test each one. We propose DANA, a novel approach that scales out-of-the-box to large clusters using the same hyperparameters and learning schedule optimized for training on a single worker, while maintaining similar final accuracy without additional overhead. DANA estimates the future value of model parameters by adapting Nesterov Accelerated Gradient to a distributed setting, and so mitigates the effect of gradient staleness, one of the main difficulties in scaling SGD to more workers. Evaluation on three state-of-the-art network architectures and three datasets shows that DANA scales as well as or better than existing work without having to tune any hyperparameters or tweak the learning schedule. For example, DANA achieves 75.73% accuracy on ImageNet when training ResNet-50 with 16 workers, similar to the non-distributed baseline.
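
A hedged illustration of the staleness-mitigation idea (our toy reconstruction, not the published algorithm): with momentum SGD running on the parameter server, a worker can use the momentum buffer to extrapolate where the parameters will be by the time its gradient arrives, and evaluate the gradient there instead of at the stale copy it received.

```python
import numpy as np

def lookahead_gradient(theta, velocity, grad_fn, lr=0.1, mu=0.9, staleness=4):
    """Hypothetical sketch: assume the server keeps applying roughly the current
    momentum for `staleness` more updates, estimate the resulting future
    parameters, and compute this worker's gradient at that point."""
    lookahead = sum(mu ** k for k in range(1, staleness + 1))
    theta_future = theta - lr * lookahead * velocity
    return grad_fn(theta_future)

# toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta, velocity = np.ones(3), np.ones(3)
g = lookahead_gradient(theta, velocity, grad_fn=lambda t: t)
```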


Generative Adversarial Models for Learning Private and Fair Representations    

tl;dr We present Generative Adversarial Privacy and Fairness (GAPF), a data-driven framework for learning private and fair representations with certified privacy/fairness guarantees

We present Generative Adversarial Privacy and Fairness (GAPF), a data-driven framework for learning private and fair representations. GAPF leverages recent advancements in adversarial learning to allow a data holder to learn "universal" representations that decouple a set of sensitive attributes from the rest of the dataset. Under GAPF, finding the optimal privacy mechanism is formulated as a constrained minimax game between a private/fair encoder and an adversary. We show that for appropriately chosen adversarial loss functions, GAPF provides privacy guarantees against strong information-theoretic adversaries and enforces demographic parity. We also evaluate the performance of GAPF on multi-dimensional Gaussian mixture models and real datasets, and show how a designer can certify that representations learned under an adversary with a fixed architecture perform well against more complex adversaries.


Better Accuracy with Quantified Privacy: Representations Learned via Reconstructive Adversarial Network    

No tl;dr =[

The remarkable success of machine learning, especially deep learning, has produced a variety of cloud-based services for mobile users. Such services require an end user to send data to the service provider, which presents a serious challenge to end-user privacy. To address this concern, prior works either add noise to the data or send features extracted from the raw data. They struggle to balance utility and privacy because added noise reduces utility and raw data can be reconstructed from extracted features. This work represents a methodical departure from prior works: we balance a measure of privacy against a measure of utility by leveraging adversarial learning to find a sweeter tradeoff. We design an encoder that is optimized against both the reconstruction error (a measure of privacy), computed adversarially by a Decoder, and the inference accuracy (a measure of utility), computed by a Classifier. The result is RAN, a novel deep model with a new training algorithm that automatically extracts features for classification that are both private and useful. It turns out that adversarially forcing the extracted features to convey only the information required for classification acts as an implicit regularization, leading to better classification accuracy than the original model, which completely ignores privacy. Thus, we achieve better privacy with better utility, a surprising possibility in machine learning! We conducted extensive experiments on five popular datasets over four training schemes, and demonstrate the superiority of RAN compared with existing alternatives.


Dissecting an Adversarial framework for Information Retrieval    

tl;dr Points out problems in loss function used in IRGAN, a recently proposed GAN framework for Information Retrieval. Further, a model motivated by co-training is proposed, which achieves better performance.

Recent advances in Generative Adversarial Networks, facilitated by improvements to the framework and successful application to various problems, have resulted in extensions to multiple domains. IRGAN attempts to leverage the framework for Information Retrieval (IR), a task that can be described as modeling the correct conditional probability distribution p(d|q) over the documents (d), given the query (q). The work that proposes IRGAN claims that optimizing their minimax loss function will result in a generator which can learn the distribution, but their setup and baseline term steer the model away from an exact adversarial formulation, and this work attempts to point out certain inaccuracies in their formulation. Analyzing their loss curves gives insight into possible mistakes in the loss functions, and better performance can be obtained by using the co-training-like setup we propose, in which two models are trained in a co-operative rather than an adversarial fashion.


Solving the Rubik's Cube with Approximate Policy Iteration    

tl;dr We solve the Rubik's Cube with pure reinforcement learning

Recently, Approximate Policy Iteration (API) algorithms have achieved super-human proficiency in two-player zero-sum games such as Go, Chess, and Shogi without human data. These API algorithms iterate between two policies: a slow policy (tree search), and a fast policy (a neural network). In these two-player games, a reward is always received at the end of the game. However, the Rubik's Cube has only a single solved state, and episodes are not guaranteed to terminate. This poses a major problem for these API algorithms since they rely on the reward received at the end of the game. We introduce Autodidactic Iteration: an API algorithm that overcomes the problem of sparse rewards by training on a distribution of states that allows the reward to propagate from the goal state to states farther away. Autodidactic Iteration is able to learn how to solve the Rubik's Cube and the 15-puzzle without relying on human data. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves — less than or equal to solvers that employ human domain knowledge.
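
A minimal sketch of the state-sampling idea (with a toy stand-in environment rather than a real cube): training states are generated by scrambling backwards from the solved state, so some states in every batch are close enough to the goal for the reward signal to propagate outward.

```python
import random

def sample_training_states(apply_random_move, solved_state, max_scramble=30, batch=256):
    """Sample states at varying distances from the goal by scrambling backwards
    from the solved state; shallow scrambles keep the goal reachable early on."""
    states = []
    for _ in range(batch):
        depth = random.randint(1, max_scramble)
        s = solved_state
        for _ in range(depth):
            s = apply_random_move(s)
        states.append((s, depth))
    return states

# toy stand-in environment: states are integers on a ring of size N, the goal is 0
N = 100
toy_move = lambda s: (s + random.choice([-1, 1])) % N
batch = sample_training_states(toy_move, solved_state=0)
```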


Large-Scale Study of Curiosity-Driven Learning    

tl;dr An agent trained only with curiosity, and no extrinsic reward, does surprisingly well on 54 popular environments, including the suite of Atari games, Mario etc.

Reinforcement learning algorithms rely on carefully engineered rewards from the environment that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is difficult and not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is one such intrinsic reward function, which uses prediction error as a reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. {\em without any extrinsic rewards}, across $54$ standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance as well as a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many games. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at https://doubleblindsupplementary.github.io/large-curiosity/.
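
As a hedged sketch of the prediction-error reward with random features (our simplification, not the released code): embed observations with a fixed random linear map and use the forward model's error in that feature space as the intrinsic reward.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, act_dim = 16, 8, 4
W_feat = rng.normal(size=(feat_dim, obs_dim))                    # fixed random feature embedding
W_fwd = 0.01 * rng.normal(size=(feat_dim, feat_dim + act_dim))   # forward model (would be trained)

def curiosity_reward(obs, action_onehot, next_obs):
    """Intrinsic reward = squared error of the forward model's prediction of the
    next observation's (random) features."""
    phi, phi_next = W_feat @ obs, W_feat @ next_obs
    phi_pred = W_fwd @ np.concatenate([phi, action_onehot])
    return float(np.sum((phi_pred - phi_next) ** 2))

r = curiosity_reward(rng.normal(size=obs_dim), np.eye(act_dim)[0], rng.normal(size=obs_dim))
```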


Unsupervised Image to Sequence Translation with Canvas-Drawer Networks    

tl;dr Recreate images as interpretable high-level sequences without the need for paired data.

Encoding images as a series of high-level constructs, such as brush strokes or discrete shapes, can often be key to both human and machine understanding. In many cases, however, data is only available in pixel form. We present a method for generating images directly in a high-level domain (e.g. brush strokes), without the need for real pairwise data. Specifically, we train a "canvas" network to imitate the mapping of high-level constructs to pixels, followed by a high-level "drawing" network which is optimized through this mapping towards solving a desired image recreation or translation task. We successfully discover sequential vector representations of symbols, large sketches, and 3D objects, utilizing only pixel data. We display applications of our method in image segmentation, and present several ablation studies comparing various configurations.


Conditional Inference in Pre-trained Variational Autoencoders via Cross-coding    

No tl;dr =[

Variational Autoencoders (VAEs) are a popular generative model, but one in which conditional inference can be challenging. If the decomposition into query and evidence variables is fixed, conditional VAEs provide an attractive solution. To support arbitrary queries, one is generally reduced to Markov Chain Monte Carlo sampling methods that can suffer from long mixing times. In this paper, we propose an idea we term cross-coding to approximate the distribution over the latent variables after conditioning on an evidence assignment to some subset of the variables. This allows generating query samples without retraining the full VAE. We experimentally evaluate three variations of cross-coding showing that (i) they can be quickly optimized for different decompositions of evidence and query and (ii) they quantitatively and qualitatively outperform Hamiltonian Monte Carlo.


S3TA: A Soft, Spatial, Sequential, Top-Down Attention Model    

tl;dr http://sites.google.com/view/s3ta

We present a soft, spatial, sequential, top-down attention model (S3TA). This model uses a soft attention mechanism to bottleneck its view of the input. A recurrent core is used to generate query vectors, which actively select information from the input by correlating the query with input- and space-dependent key maps at different spatial locations. We demonstrate the power and interpretability of this model under two settings. First, we build an agent which uses this attention model in RL environments and show that we can achieve performance competitive with state-of-the-art models while producing attention maps that elucidate some of the strategies used to solve the task. Second, we use this model in supervised learning tasks and show that it also achieves competitive performance and provides interpretable attention maps that show some of the underlying logic in the model's decision making.


ON RANDOM DEEP AUTOENCODERS: EXACT ASYMPTOTIC ANALYSIS, PHASE TRANSITIONS, AND IMPLICATIONS TO TRAINING    

tl;dr We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena.

We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena when the depth becomes large. This, in particular, provides quantitative answers and insights to three questions that were not yet fully understood in the literature. Firstly, we provide a precise answer on how the random deep weight-tied autoencoder model performs “approximate inference” as posed by Scellier et al. (2018), and its connection to reversibility considered by several theoretical studies. Secondly, we show that deep autoencoders display a higher degree of sensitivity to perturbations in the parameters, distinct from their shallow counterparts. Thirdly, we obtain insights on pitfalls in training initialization practice, and demonstrate experimentally that it is possible to train a deep autoencoder, even with the tanh activation and a depth as large as 200 layers, without resorting to techniques such as layer-wise pre-training or batch normalization. Our analysis is not specific to any particular depth or Lipschitz activation, and our analytical techniques may have broader applicability.


Meta-Learning For Stochastic Gradient MCMC    

tl;dr This paper proposes a method to automate the design of stochastic gradient MCMC proposal using meta learning approach.

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has become increasingly popular for simulating posterior samples in large-scale Bayesian modeling. However, existing SG-MCMC schemes are not tailored to any specific probabilistic model; even a simple modification of the underlying dynamical system requires significant physical intuition. This paper presents the first meta-learning algorithm that allows automated design of the underlying continuous dynamics of an SG-MCMC sampler. The learned sampler generalizes Hamiltonian dynamics with state-dependent drift and diffusion, enabling fast traversal and efficient exploration of energy landscapes. Experiments validate the proposed approach on Bayesian fully connected neural network, Bayesian convolutional neural network and Bayesian recurrent neural network tasks, showing that the learned sampler outperforms generic, hand-designed SG-MCMC algorithms, and generalizes to different datasets and larger architectures.


State-Denoised Recurrent Neural Networks    

tl;dr We propose a mechanism for denoising the internal state of an RNN to improve generalization performance.

Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations, thereby improving generalization performance. Attractor dynamics are incorporated into the hidden state to `clean up' representations at each step of a sequence. The attractor dynamics are trained through an auxiliary denoising loss to recover previously experienced hidden states from noisy versions of those states. This state-denoised recurrent neural network (SDRNN) performs multiple steps of internal processing for each external sequence step. On a range of tasks, we show that the SDRNN outperforms a generic RNN as well as a variant of the SDRNN with attractor dynamics on the hidden state but without the auxiliary loss. We argue that attractor dynamics---and corresponding connectivity constraints---are an essential component of the deep learning arsenal and should be invoked not only for recurrent networks but also for improving deep feedforward nets and intertask transfer.


CEM-RL: Combining evolutionary and gradient-based methods for policy search    

tl;dr We propose a new combination of evolution strategy and deep reinforcement learning which takes the best of both worlds

Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are two popular approaches to policy search. The former is widely applicable and rather stable, but suffers from low sample efficiency. By contrast, the latter is more sample efficient, but the most sample efficient variants are also rather unstable and highly sensitive to hyper-parameter setting. So far, these families of methods have mostly been compared as competing tools. However, an emerging approach consists in combining them so as to get the best of both worlds. Two previously existing combinations use either a standard evolutionary algorithm or a goal exploration process together with the DDPG algorithm, a sample efficient off-policy deep RL algorithm. In this paper, we propose a different combination scheme using the simple cross-entropy method (CEM) and TD3, another off-policy deep RL algorithm which improves over DDPG. We evaluate the resulting algorithm, CEM-RL, on a set of benchmarks classically used in deep RL. We show that CEM-RL benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.
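
For reference, the evolutionary half of such a combination, one plain cross-entropy-method step over flat policy parameters, looks roughly like the sketch below (our simplification; the paper interleaves population updates of this kind with TD3 gradient steps).

```python
import numpy as np

def cem_step(mean, var, evaluate, pop_size=10, elite_frac=0.5):
    """One CEM iteration: sample a population of parameter vectors around the
    current mean, score them (e.g. by episodic return), and refit a diagonal
    Gaussian to the best-performing fraction."""
    pop = mean + np.sqrt(var) * np.random.randn(pop_size, mean.size)
    scores = np.array([evaluate(p) for p in pop])
    elites = pop[np.argsort(scores)[-max(1, int(elite_frac * pop_size)):]]
    return elites.mean(axis=0), elites.var(axis=0) + 1e-6   # variance floor avoids collapse

# toy usage: maximize -||theta - 3||^2
mean, var = np.zeros(5), np.ones(5)
for _ in range(50):
    mean, var = cem_step(mean, var, lambda p: -np.sum((p - 3.0) ** 2))
```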


IMAGE DEFORMATION META-NETWORK FOR ONE-SHOT LEARNING    

No tl;dr =[

Humans can robustly learn novel visual concepts even when images undergo various deformations and lose certain information. Incorporating this ability to synthesize deformed instances of new concepts might help visual recognition systems perform better one-shot learning, i.e., learning concepts from one or few examples. Our key insight is that, while the deformed images might not be visually realistic, they still maintain critical semantic information and contribute significantly in formulating classifier decision boundaries. Inspired by the recent progress on meta-learning, we combine a meta-learner with an image deformation network that produces additional training examples, and optimize both models in an end-to-end manner. The deformation network learns to synthesize images by fusing a pair of images—a probe image that keeps the visual content and a gallery image that diversifies the deformations. We demonstrate results on the widely used one-shot learning benchmarks (miniImageNet and ImageNet 1K challenge datasets), which significantly outperform the previous state-of-the-art approaches.


Overfitting Detection of Deep Neural Networks without a Hold Out Set    

tl;dr We introduce and analyze several criteria for detecting overfitting.

Overfitting is a ubiquitous problem in neural network training and is usually mitigated using a holdout data set. Here we challenge this rationale and investigate criteria for overfitting without using a holdout data set. Specifically, we train a model for a fixed number of epochs multiple times with varying fractions of randomized labels and for a range of regularization strengths. A properly trained model should not be able to attain an accuracy greater than the fraction of properly labeled data points. Otherwise the model overfits. We introduce two criteria for detecting overfitting and one to detect underfitting. We analyze early stopping, the regularization factor, and network depth. In safety-critical applications we are interested in models and parameter settings which perform well and are not likely to overfit. The methods of this paper allow characterizing and identifying such models.
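
One plausible reading of the randomized-label criterion, stated as a sketch (our interpretation of the sentence above): with a fraction p of labels kept intact and the rest randomized, a model that does not overfit should not score much above p, plus chance level on the randomized part, on that training set.

```python
def overfits(train_accuracy, frac_correct_labels, num_classes=10):
    """Flag overfitting when training accuracy exceeds the ceiling attainable
    without memorizing the randomized labels: the intact fraction plus chance
    accuracy on the randomized fraction."""
    ceiling = frac_correct_labels + (1 - frac_correct_labels) / num_classes
    return train_accuracy > ceiling

print(overfits(train_accuracy=0.97, frac_correct_labels=0.5))   # True: memorizing noise
print(overfits(train_accuracy=0.54, frac_correct_labels=0.5))   # False
```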


Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware    

tl;dr We accelerate secure DNN inference in trusted execution environments (by a factor 4x-20x) by selectively outsourcing the computation of linear layers to a faster yet untrusted co-processor.

As Machine Learning (ML) gets applied to security-critical or sensitive domains, there is a growing need for integrity and privacy for outsourced ML computations. A pragmatic solution comes from Trusted Execution Environments (TEEs), which use hardware and software protections to isolate sensitive computations from the untrusted software stack. However, these isolation guarantees come at a price in performance, compared to untrusted alternatives. This paper initiates the study of high performance execution of Deep Neural Networks (DNNs) in TEEs by efficiently partitioning DNN computations between trusted and untrusted devices. Building upon an efficient outsourcing scheme for matrix multiplication, we propose Slalom, a framework that securely delegates execution of all linear layers in a DNN from a TEE (e.g., Intel SGX or Sanctum) to a faster, yet untrusted, co-located processor. We evaluate Slalom by executing DNNs in an Intel SGX enclave, which selectively delegates work to an untrusted GPU. For two canonical DNNs, VGG16 and MobileNet, we obtain 20x and 6x increases in throughput for verifiable inference, and 11x and 4x for verifiable and private inference.
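
The integrity side rests on the classic observation that an outsourced matrix product can be checked far more cheaply than it can be recomputed. A generic Freivalds-style check (a sketch of that building block, not the paper's full protocol) is shown below.

```python
import numpy as np

def freivalds_verify(A, B, C, trials=16):
    """Probabilistically check that C == A @ B without recomputing the product:
    each trial multiplies by a random 0/1 vector, costing O(n^2) instead of O(n^3)."""
    n = B.shape[1]
    for _ in range(trials):
        r = np.random.randint(0, 2, size=(n, 1)).astype(A.dtype)
        if not np.allclose(A @ (B @ r), C @ r):
            return False        # caught an incorrect product
    return True                 # correct with overwhelming probability

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert freivalds_verify(A, B, A @ B)
assert not freivalds_verify(A, B, A @ B + 0.1)   # a tampered result is rejected
```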


Improving Generalization and Stability of Generative Adversarial Networks    

tl;dr We propose a zero-centered gradient penalty for improving generalization and stability of GANs

Generative Adversarial Networks (GANs) are one of the most popular tools for learning complex high dimensional distributions. However, generalization properties of GANs have not been well understood. In this paper, we analyze the generalization of GANs in practical settings. We show that discriminators trained on discrete datasets with the original GAN loss have poor generalization capability and do not approximate the theoretically optimal discriminator. We propose a zero-centered gradient penalty for improving the generalization of the discriminator by pushing it toward the optimal discriminator. The penalty guarantees the generalization and convergence of GANs. Experiments on synthetic and large scale datasets verify our theoretical analysis.
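
For concreteness, a generic zero-centered gradient penalty term looks like the sketch below (our illustration; where the penalty is evaluated and how it is weighted are specified in the paper itself).

```python
import torch

def zero_centered_gradient_penalty(discriminator, x, gamma=10.0):
    """Penalize the squared norm of the discriminator's input gradient at x,
    pushing it towards zero (in contrast to WGAN-GP, which pushes it towards 1)."""
    x = x.detach().requires_grad_(True)
    d_out = discriminator(x).sum()
    (grad,) = torch.autograd.grad(d_out, x, create_graph=True)
    return gamma * grad.pow(2).flatten(1).sum(dim=1).mean()
```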


RANDOM MASK: Towards Robust Convolutional Neural Networks    

No tl;dr =[

Robustness of neural networks has recently been highlighted by the adversarial examples, i.e., inputs added with well-designed perturbations which are imperceptible to humans but can cause the network to give incorrect outputs. In this paper, we design a new CNN architecture that by itself has good robustness. We introduce a simple but powerful technique, Random Mask, to modify existing CNN structures. We show that CNN with Random Mask achieves state-of-the-art performance against black-box adversarial attacks without applying any adversarial training. We next investigate the adversarial examples which “fool” a CNN with Random Mask. Surprisingly, we find that these adversarial examples often “fool” humans as well. This raises fundamental questions on how to define adversarial examples and robustness properly.
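
A hedged sketch of what such a masking layer could look like (our reading of the name; the paper's exact placement and masking granularity may differ): a binary mask that, unlike dropout, is sampled once at construction and then kept fixed for both training and inference.

```python
import torch
import torch.nn as nn

class RandomMask(nn.Module):
    """Drop a fixed random subset of positions in a feature map."""
    def __init__(self, channels, height, width, drop_ratio=0.5):
        super().__init__()
        mask = (torch.rand(1, channels, height, width) > drop_ratio).float()
        self.register_buffer("mask", mask)   # fixed, not a learnable parameter

    def forward(self, x):
        return x * self.mask

layer = RandomMask(channels=16, height=8, width=8)
out = layer(torch.randn(4, 16, 8, 8))
```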


Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning    

tl;dr A model-based meta-RL algorithm that enables a real robot to adapt online in dynamic environments

Although reinforcement learning methods can achieve impressive results in simulation, the real world presents two major challenges: generating samples is exceedingly expensive, and unexpected perturbations or unseen situations cause proficient but specialized policies to fail at test time. Given that it is impractical to train separate policies to accommodate all situations the agent may see in the real world, this work proposes to learn how to quickly and effectively adapt online to new tasks. To enable sample-efficient learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context. Our experiments demonstrate online adaptation for continuous control tasks on both simulated and real-world agents. We first show simulated agents adapting their behavior online to novel terrains, crippled body parts, and highly-dynamic environments. We also illustrate the importance of incorporating online adaptation into autonomous agents that operate in the real world by applying our method to a real dynamic legged millirobot: We demonstrate the agent's learned ability to quickly adapt online to a missing leg, adjust to novel terrains and slopes, account for miscalibration or errors in pose estimation, and compensate for pulling payloads.


Learning Self-Imitating Diverse Policies    

tl;dr Policy optimization by using past good rollouts from the agent; learning shaped rewards via divergence minimization; SVPG with JS-kernel for population-based exploration.

The success of popular algorithms for deep reinforcement learning, such as policy-gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or a rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a number of challenging tasks.


Towards Resisting Large Data Variations via Introspective Learning    

tl;dr We propose a principled approach that endows classifiers with the ability to resist larger variations between training and testing data in an intelligent and efficient manner.

Learning deep networks which can resist large variations between training and testing data is essential to build accurate and robust image classifiers. Towards this end, a typical strategy is to apply data augmentation to enlarge the training set. However, standard data augmentation is essentially a brute-force strategy which is inefficient, as it performs all the pre-defined transformations to every training sample. In this paper, we propose a principled approach to train networks with significantly improved resistance to large variations between training and testing data. This is achieved by embedding a learnable transformation module into the introspective networks (Jin et al., 2017; Lazarow et al., 2017; Lee et al., 2018), which is a convolutional neural network (CNN) classifier empowered with generative capabilities. Our approach alternatively synthesizes pseudo-negative samples with learned transformations and enhances the classifier by retraining it with synthesized samples. Experimental results verify that our approach significantly improves the ability of deep networks to resist large variations between training and testing data and achieves classification accuracy improvements on several benchmark datasets, including MNIST, affNIST, SVHN and CIFAR-10.


Neural Logic Machines    

tl;dr A fully differentiable neural-symbolic architecture to conduct first-order logic reasoning

We propose Neural Logic Machines (NLMs), a neural-symbolic architecture for both inductive learning and logic reasoning. NLMs exploit the power of both neural networks—as function approximators for probabilistic distributions, and logic programming—as symbolic processor for objects with properties, relations, logic connectives, and quantifiers. After being trained on small-scale tasks (such as sorting short arrays), NLMs can learn the underlying logic rules, and generalize to arbitrarily large-scale tasks (such as sorting arbitrarily long arrays). In our experiments, NLMs achieve perfect generalization in a number of tasks, from relational reasoning tasks on family tree and general graphs, to decision making tasks including sorting, finding shortest paths, and the blocks world. Most of these tasks are hard to accomplish for neural networks or logical programming alone.


A Teacher Student Network For Faster Video Classification    

tl;dr Teacher-Student framework for efficient video classification using fewer frames

Over the past few years, various tasks involving videos such as classification, description, summarization and question answering have received a lot of attention. Current models for these tasks compute an encoding of the video by treating it as a sequence of images and going over every image in the sequence, which becomes computationally expensive for longer videos. In this paper, we focus on the task of video classification and aim to reduce the computational cost by using the idea of distillation. Specifically, we propose a Teacher-Student network wherein the teacher looks at all the frames in the video but the student looks at only a small fraction of the frames in the video. The idea is to then train the student to minimize (i) the difference between the final representation computed by the student and the teacher and/or (ii) the difference between the distributions predicted by the teacher and the student. This smaller student network, which involves fewer computations but still learns to mimic the teacher, can then be employed at inference time for video classification. We experiment with the YouTube-8M dataset and show that the proposed student network can reduce the inference time by up to 30% with a negligible drop in performance.


Dual Importance Weight GAN    

No tl;dr =[

Generative Adversarial Networks (GAN) are trained to generate a sample image of interest. To this end, the generative network of a GAN learns the implicit distribution of the true dataset from the classification of real samples against candidate generated samples. However, in real implementations of GANs, training the generative network with a limited number of candidate samples does not guarantee a proper representation of either the true distribution or the distribution of generator outputs. In this paper, we propose dual importance weights for the candidate samples represented in the latent space of an auto-encoder. The auto-encoder is pre-trained with the real target dataset. Therefore, the latent space representation allows us to compare the real distribution and the distribution of generated samples explicitly. Dual importance weights iteratively maximize the representation of generated samples for both distributions: current generator outputs and the real dataset. The proposed generative model not only resolves the mode collapse problem of GANs but also improves convergence on the target distribution. Experimental evaluation shows that the proposed network learns complete modes of the target distribution more stably and faster than state-of-the-art methods.


Stacking for Transfer Learning    

tl;dr How to use stacked generalization to improve the performance of existing transfer learning algorithms when limited labeled data is available.

In machine learning tasks, overfitting frequently crops up when the number of samples of the target domain is insufficient, because the generalization ability of the classifier is poor in this circumstance. To solve this problem, transfer learning utilizes the knowledge of similar domains to improve the robustness of the learner. The main idea of existing transfer learning algorithms is to reduce the difference between domains by sample selection or domain adaptation. However, no matter what transfer learning algorithm we use, the difference always exists, and the hybrid training of source and target data reduces the fitting capability of the learner on the target domain. Moreover, when the relatedness between domains is too low, negative transfer is more likely to occur. To tackle this problem, we propose a two-phase transfer learning architecture based on ensemble learning, which uses existing transfer learning algorithms to train the weak learners in the first stage, and uses the predictions of target data to train the final learner in the second stage. Under this architecture, the fitting capability and generalization capability can be guaranteed at the same time. We evaluate the proposed method on public datasets, which demonstrates the effectiveness and robustness of our proposed method.


Low-Rank Matrix Factorization of LSTM as Effective Model Compression    

tl;dr We propose simple, but effective, low-rank matrix factorization (MF) algorithms to speed up in running time, save memory, and improve the performance of LSTMs.

Large-scale Long Short-Term Memory (LSTM) cells are often the building blocks of many state-of-the-art algorithms for tasks in Natural Language Processing (NLP). However, LSTMs are known to be computationally inefficient because the memory capacity of the models depends on the number of parameters, and the inherent recurrence that models the temporal dependency is not parallelizable. In this paper, we propose simple, but effective, low-rank matrix factorization (MF) algorithms to compress network parameters and significantly speed up LSTMs with almost no loss of performance (and sometimes even gain). To show the effectiveness of our method across different tasks, we examine two settings: 1) compressing core LSTM layers in Language Models, 2) compressing biLSTM layers of ELMo~\citep{ELMo} and evaluate in three downstream NLP tasks (Sentiment Analysis, Textual Entailment, and Question Answering). The latter is particularly interesting as embeddings from large pre-trained biLSTM Language Models are often used as contextual word representations. Finally, we discover that matrix factorization performs better in general, additive recurrence is often more important than multiplicative recurrence, and we identify an interesting correlation between matrix norms and compression performance.
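
As a hedged illustration of the basic operation (plain truncated SVD, which may differ in detail from the specific MF algorithms studied in the paper), a single m-by-n LSTM weight matrix can be replaced by two thin factors, cutting its parameter count from m*n to r*(m+n).

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) with U_r (m x r) @ V_r (r x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]    # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(1024, 512)
U_r, V_r = low_rank_factorize(W, rank=64)
print(W.size, U_r.size + V_r.size)                        # 524288 vs 98304 parameters
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))  # relative approximation error
```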


Feature quantization for parsimonious and meaningful predictive models    

tl;dr We tackle discretization of continuous features and grouping of factor levels as a representation learning problem and provide a rigorous way of estimating the best quantization to yield good performance and interpretability.

For regulatory and interpretability reasons, logistic regression is still widely used by financial institutions to learn the refunding probability of a loan given the applicant's characteristics from historical data. Although logistic regression naturally handles both continuous and categorical data, a preprocessing step to quantize them is usually performed to simultaneously improve prediction accuracy and user interpretability: continuous features are discretized by assigning factor levels to intervals; some levels of categorical features (with numerous levels) are grouped. However, better predictive accuracy can be reached by embedding this quantization estimation step directly into the predictive estimation step itself. A related information criterion then has to be optimized over a huge and intractable discontinuous quantization set, requiring the introduction of a specific two-step optimization strategy: first, the optimization problem is relaxed in order to deal with smooth functions; second, a particular neural network is involved through a stochastic gradient algorithm to optimize the resulting criterion, giving access to good candidates for the initial optimization problem. The good performance of this approach is illustrated on simulated and real data from Crédit Agricole Consumer Finance (a major European historic player in the consumer credit market).


Reinforcement Learning: From temporal to spatial value decomposition    

No tl;dr =[

The value function lies at the heart of reinforcement learning, as it can capture the long-term dependency between the current state/action and delayed reward. The value function is defined as the expected total reward. Following this definition, it is natural to derive a temporal decomposition for the value function, as a sum of discounted reward on different time steps. Most existing RL algorithms estimate the value function based on this temporal decomposition. In our formulation, the value function $Q(s,a)$ is a weighted sum of rewards $R(s')$, and the weight is the discounted visiting frequency $d(s, a, s')$. Here $d(s, a, s')$ is the key part that captures the long-term dependency between the current state/action and future states. This formulation inspires us to study $d(s, a, s')$ in depth, with function approximation as usual. Under the function approximation setting, $d(s, a, s')$ enjoys the benefit of generalization between similar states $s'$. Due to this generalization property, we can get a better estimate of $d(s, a, s')$, thus resulting in a better estimate of $Q(s,a)$. We propose the spatial Monte Carlo (SMC) method, and apply the idea of SMC to actor-critic algorithms with $m$-step advantage estimation and GAE. Experiments demonstrate that our algorithms work empirically well.
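
Written out (our transcription of the notation sketched above), the two views of the same value function are:

```latex
% Temporal view: sum discounted rewards over time.
% Spatial view: sum rewards over successor states, weighted by the discounted
% visiting frequency d(s, a, s').
\begin{align}
Q(s,a) &= \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} R(s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a\Big] \\
       &= \sum_{s'} d(s,a,s')\, R(s'), \qquad
          d(s,a,s') = \sum_{t \ge 0} \gamma^{t}\, \Pr\big(s_{t+1} = s' \mid s_0 = s,\ a_0 = a\big).
\end{align}
```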


Discriminator Rejection Sampling    

tl;dr We use a GAN discriminator to perform an approximate rejection sampling scheme on the output of the GAN generator.

We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm—called Discriminator Rejection Sampling (DRS)—that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the state of the art SAGAN model. On ImageNet, we train an improved baseline that increases the best published Inception Score from 52.52 to 62.36 and reduces the Frechet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.
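
A toy sketch of the mechanism (our simplification, with the exact density log-ratio standing in for the discriminator logit): accept each generated sample with probability proportional to the estimated density ratio, capped by an estimate of its maximum.

```python
import numpy as np

def rejection_sample(generate, disc_logit, n_samples, logit_max):
    """Accept a generated sample x with probability exp(logit(x) - logit_max);
    under an idealized discriminator, exp(logit) estimates p_data / p_model."""
    accepted = []
    while len(accepted) < n_samples:
        x = generate()
        if np.random.rand() < np.exp(disc_logit(x) - logit_max):
            accepted.append(x)
    return np.array(accepted)

# toy check: generator N(0, 1), target N(1, 1); the exact log-ratio is x - 0.5
samples = rejection_sample(lambda: np.random.randn(),
                           lambda x: x - 0.5,
                           n_samples=1000, logit_max=3.0)
print(samples.mean())   # close to 1, the mean of the target distribution
```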


Universal Successor Features for Transfer Reinforcement Learning    

No tl;dr =[

Transfer in Reinforcement Learning (RL) refers to the idea of applying knowledge gained from previous tasks to solving related tasks. Learning a universal value function (Schaul et al., 2015), which generalizes over goals and states, has previously been shown to be useful for transfer. However, successor features are believed to be more suitable than values for transfer (Dayan, 1993; Barreto et al.,2017), even though they cannot directly generalize to new goals. In this paper, we propose (1) Universal Successor Features (USFs) to capture the underlying dynamics of the environment while allowing generalization to unseen goals and (2) a flexible end-to-end model of USFs that can be trained by interacting with the environment. We show that learning USFs is compatible with any RL algorithm that learns state values using a temporal difference method. Our experiments in a simple gridworld and with two MuJoCo environments show that USFs can greatly accelerate training when learning multiple tasks and can effectively transfer knowledge to new tasks.


Pearl: Prototype lEArning via Rule Lists    

tl;dr a method combining rule list learning and prototype learning

Deep neural networks have demonstrated promising classification performance on many healthcare applications. However, the interpretability of those models is often lacking. On the other hand, classical interpretable models such as rule lists or decision trees do not reach the same level of accuracy as deep neural networks. Despite their interpretable structures, the resulting rules are often too complex to be interpretable (due to the potentially large depth of rule lists). In this work, we present PEARL, short for Prototype lEArning via Rule Lists, which iteratively uses rule lists to guide a neural network to learn representative data prototypes. The resulting prototype neural network provides accurate prediction, and the prediction can be easily explained by the prototypes and their guiding rule lists. Thanks to the prediction power of neural networks, the rule lists defining prototypes are more concise and hence provide better interpretability. On two real-world electronic healthcare records (EHR) datasets, PEARL consistently outperforms all baselines, achieving performance improvement over conventional rule learning by up to 28% and over prototype learning by up to 3%. Experimental results also show that the resulting interpretation of PEARL is simpler than standard rule learning.


Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces    

tl;dr Kalman Filter based recurrent model for efficient state estimation, principled uncertainty handling and end to end learning of dynamic models in high dimensional spaces.

State estimation together with state prediction is a crucial task for many applications. Typically, sensory observations give only partial and noisy information about the state of the environment. A well-known tool for performing state estimation under these conditions is the Kalman filter (Kalman et al., 1960). However, the Kalman filter is limited to problems with known, linear models and good estimates about the system noise. Recent deep learning approaches integrate a non-linear encoder into the KF equations that maps the high-dimensional observation to the typically low-dimensional state of the system (Haarnoja et al., 2016). However, these approaches are still limited to systems with known dynamics that are either linear or require approximations such as an extended Kalman filter. In contrast, our approach does not use a pre-defined state representation but learns a high-dimensional factorized representation that is used for inference using locally linear models. While our locally linear modelling and factorization assumptions are in general not true for the original low-dimensional state space of the system, the network finds a high-dimensional latent space where these assumptions hold to perform efficient inference. This state representation is learned jointly with the transition and noise models by backpropagation. The resulting network architecture, which we call Recurrent Kalman Network (RKN), can be used for any time-series data, similar to an LSTM (Hochreiter and Schmidhuber, 1997) but uses an explicit representation of uncertainty. As shown by our experiments, the RKN obtains much more accurate uncertainty estimates than an LSTM or Gated Recurrent Units (GRUs) (Cho et al., 2014) while also showing a slightly improved prediction performance.


Modular Deep Probabilistic Programming    

No tl;dr =[

Modularity is a key feature of deep learning libraries but has not been fully exploited for probabilistic programming. We propose to improve the modularity of probabilistic programming languages by offering not only plain probabilistic distributions but also sophisticated probabilistic models, such as Bayesian non-parametric models, as fundamental building blocks. We demonstrate this idea by presenting a modular probabilistic programming language, MXFusion, which includes a new type of re-usable building block, called probabilistic modules. A probabilistic module consists of a set of random variables with associated probabilistic distributions and dedicated inference methods. Under the framework of variational inference, the pre-specified inference methods of individual probabilistic modules can be transparently used for inference of the whole probabilistic model. We demonstrate the power and convenience of probabilistic modules in MXFusion with various examples of Gaussian process models, which are evaluated with experiments on real data.


Transfer Learning via Unsupervised Task Discovery for Visual Question Answering    

No tl;dr =[

We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in visual question answering. Existing large-scale visual data with annotations such as image class labels, bounding boxes and region descriptions are good sources for learning rich and diverse visual concepts. However, it is not straightforward how the visual concepts should be captured and transferred to visual question answering models due to the missing link between question-dependent answering models and visual data without question or task specification. We tackle this problem in two steps: 1) learning a task conditional visual classifier based on unsupervised task discovery and 2) transferring and adapting the task conditional visual classifier to visual question answering models. Specifically, we employ linguistic knowledge sources such as a structured lexical database (e.g. WordNet) and visual descriptions for unsupervised task discovery, and adapt a learned task conditional visual classifier to the answering unit in a visual question answering model. We empirically show that the proposed algorithm generalizes to unseen answers successfully using the knowledge transferred from the visual data.


Spread Divergences    

tl;dr Using noise to define the divergence between distributions with different support.

For distributions $p$ and $q$ with different support, the divergence $D(p\,\|\,q)$ generally will not exist. We define a spread divergence $\tilde{D}(p\,\|\,q)$ on modified $p$ and $q$ and describe sufficient conditions for the existence of such a divergence. We give examples of using a spread divergence to train implicit generative models, including linear models (Principal Components Analysis and Independent Components Analysis) and non-linear models (Deep Generative Networks).
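
One concrete instance, assuming a Gaussian noise kernel (our illustration of the idea, not a result quoted from the paper): "spread" both distributions with the same noise before taking an ordinary divergence.

```latex
% Convolve p and q with the same Gaussian kernel, then compare the results.
\begin{align}
\tilde{p}(y) &= \int \mathcal{N}(y;\, x, \sigma^2)\, p(x)\, \mathrm{d}x, \qquad
\tilde{q}(y)  = \int \mathcal{N}(y;\, x, \sigma^2)\, q(x)\, \mathrm{d}x, \\
\tilde{D}(p\,\|\,q) &= \mathrm{KL}\!\left(\tilde{p}\,\|\,\tilde{q}\right).
\end{align}
% Even if p and q have disjoint or degenerate support, \tilde{p} and \tilde{q}
% have full support, so the spread divergence exists; since Gaussian convolution
% is injective, it is zero exactly when p = q.
```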


Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration    

No tl;dr =[

In this paper, we propose a new control framework called the moving endpoint control to restore images corrupted by different degradation levels in one model. The proposed control problem contains a restoration dynamics which is modeled by an RNN. The moving endpoint, which is essentially the terminal time of the associated dynamics, is determined by a policy network. We call the proposed model the dynamically unfolding recurrent restorer (DURR). Numerical experiments show that DURR is able to achieve state-of-the-art performances on blind image denoising and JPEG image deblocking. Furthermore, DURR can well generalize to images with higher degradation levels that are not included in the training stage.


Variational Sparse Coding    

tl;dr We explore the intersection of VAEs and sparse coding.

Variational auto-encoders (VAEs) offer a tractable approach when performing approximate inference in otherwise intractable generative models. However, standard VAEs often produce latent codes that are dispersed and lack interpretability, thus making the resulting representations unsuitable for auxiliary tasks (e.g. classification) and human interpretation. We address these issues by merging ideas from variational auto-encoders and sparse coding, and propose to explicitly model sparsity in the latent space of a VAE with a Spike and Slab prior distribution. We derive the variational lower bound using a discrete mixture recognition function, thereby making approximate posterior inference as computationally efficient as in the standard VAE case. With the new approach, we are able to infer truly sparse representations with generally intractable non-linear probabilistic models. We show that these sparse representations are advantageous over standard VAE representations on two benchmark classification tasks (MNIST and Fashion-MNIST) by demonstrating improved classification accuracy and significantly increased robustness to the number of latent dimensions. Furthermore, we demonstrate qualitatively that the sparse elements capture subjectively understandable sources of variation.


Graph Convolutional Network with Sequential Attention For Goal-Oriented Dialogue Systems    

tl;dr We propose a Graph Convolutional Network based encoder-decoder model with sequential attention for goal-oriented dialogue systems.

Domain specific goal-oriented dialogue systems typically require modeling three types of inputs, viz., (i) the knowledge-base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and the sentences in the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling and document dating, we propose a memory augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge-base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as, when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we could use the global word co-occurrence graph and use it to enrich the representations of utterances. We experiment with the modified DSTC2 dataset and its recently released code-mixed versions in four languages and show that our method outperforms existing state-of-the-art methods, using a wide range of evaluation metrics.


Overcoming Neural Brainwashing    

tl;dr We identify a phenomenon, neural brainwashing, and introduce a statistically-justified weight plasticity loss to overcome this.

We identify a phenomenon, which we dub neural brainwashing, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters. To overcome this, we introduce a statistically-justified weight plasticity loss that regularizes the learning of a model's shared parameters according to their importance for the previous models, and demonstrate its effectiveness when training two models sequentially and for neural architecture search.
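
A generic sketch of an importance-weighted quadratic penalty on shared parameters (in the spirit of the description above; the paper's statistical justification and its particular importance estimates are its own contribution).

```python
import torch

def weight_plasticity_penalty(shared_params, anchor_params, importance, lam=1.0):
    """Pull shared parameters back towards the values they had after training
    the previous model, more strongly where the importance estimate is high."""
    loss = torch.zeros(())
    for p, p_old, imp in zip(shared_params, anchor_params, importance):
        loss = loss + (imp * (p - p_old) ** 2).sum()
    return lam * loss

# toy usage with a single shared weight tensor
p = torch.randn(3, requires_grad=True)
penalty = weight_plasticity_penalty([p], [p.detach() + 0.1], [torch.ones(3)])
penalty.backward()
```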


Graph Wavelet Neural Network    

tl;dr We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcoming of previous spectral graph CNN methods that depend on graph Fourier transform.

We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcoming of previous spectral graph CNN methods that depend on graph Fourier transform. Different from graph Fourier transform, graph wavelet transform does not require matrix eigendecomposition with high computational cost. Moreover, wavelets are sparse and localized in vertex domain, offering high efficiency and good interpretability for convolution. The proposed GWNN significantly outperforms previous spectral graph CNNs in the task of graph semi-supervised classification, on three benchmark datasets: Cora, Citeseer and Pubmed.


Predicted Variables in Programming    

tl;dr We present Predicted Variables, an approach to making machine learning a first class citizen in programming languages.

We present Predicted Variables, an approach to making machine learning a first class citizen in programming languages. There is a growing divide in approaches to building systems: using human experts (e.g. programming) on the one hand, and using behavior learned from data (e.g. ML) on the other hand. PVars aim to make ML in programming as easy as `if' statements and with that hybridize ML with programming. We leverage the existing concept of variables and create a new type, a predicted variable. PVars are akin to native variables with one important distinction: PVars determine their value using ML when evaluated. We describe PVars and their interface, how they can be used in programming, and demonstrate the feasibility of our approach on three algorithmic problems: binary search, Quicksort, and caches. We show experimentally that PVars are able to improve over the commonly used heuristics and lead to a better performance than the original algorithms. As opposed to previous work applying ML to algorithmic problems, PVars have the advantage that they can be used within the existing frameworks and do not require the existing domain knowledge to be replaced. PVars allow for a seamless integration of ML into existing systems and algorithms. Our PVars implementation currently relies on standard Reinforcement Learning (RL) methods. To learn faster, PVars use the heuristic function, which they are replacing, as an initial function. We show that PVars quickly pick up the behavior of the initial function and then improve performance beyond that without ever performing substantially worse -- allowing for a safe deployment in critical applications.


Logically-Constrained Neural Fitted Q-iteration    

tl;dr As safety is becoming a critical notion in machine learning we believe that this work can act as a foundation for a number of research directions such as safety-aware learning algorithms.

This paper proposes a method for efficient training of the Q-function for continuous-state Markov Decision Processes (MDP), such that the traces of the resulting policies satisfy a Linear Temporal Logic (LTL) property. The logical property is converted into a limit-deterministic Büchi automaton with which a synchronized product MDP is constructed. The control policy is then synthesized by a reinforcement learning algorithm assuming that no prior knowledge is available from the MDP. The proposed method is evaluated in a numerical study to test the quality of the generated control policy and is compared against conventional methods for policy synthesis such as MDP abstraction (Voronoi quantizer) and approximate dynamic programming (fitted value iteration).


Data Interpretation and Reasoning Over Scientific Plots    

tl;dr We created a new dataset for data interpretation over plots and also propose a baseline for the same.

Data Interpretation is an important part of Quantitative Aptitude exams and requires an individual to answer questions grounded in plots such as bar charts, line graphs, scatter plots, \textit{etc}. Recently, there has been an increasing interest in building models which can perform this task by learning from datasets containing triplets of the form \{plot, question, answer\}. Two such datasets have been proposed in the recent past which contain plots generated from synthetic data with limited (i) $x-y$ axes variables (ii) question templates and (iii) answer vocabulary and hence do not adequately capture the challenges posed by this task. To overcome these limitations of existing datasets, we introduce a new dataset containing $9.7$ million question-answer pairs grounded over $270,000$ plots with three main differentiators. First, the plots in our dataset contain a wide variety of realistic $x$-$y$ variables such as CO2 emission, fertility rate, \textit{etc.} extracted from real-world data sources such as World Bank, government sites, \textit{etc}. Second, the questions in our dataset are more complex as they are based on templates extracted from interesting questions asked by a crowd of workers using a fraction of these plots. Lastly, the answers in our dataset are not restricted to a small vocabulary and a large fraction of the answers seen at test time are not present in the training vocabulary. As a result, existing models for Visual Question Answering which largely use end-to-end models in a multi-class classification framework cannot be used for this task. We establish initial results on this dataset and emphasize the complexity of the task using a multi-staged modular pipeline with various sub-components to (i) extract relevant data from the plot and convert it to a semi-structured table and (ii) combine the question with this table and use compositional semantic parsing to arrive at a logical form from which the answer can be derived. We believe that such a modular framework is the best way to go forward as it would enable the research community to independently make progress on all the sub-tasks involved in plot question answering.


Fatty and Skinny: A Joint Training Method of Watermark Encoder and Decoder    

tl;dr We propose a novel watermark encoder-decoder neural networks. They perform a cooperative game to define their own watermarking scheme. People do not need to design watermarking methods any more.

Watermarks have been used for various purposes. Recently, researchers started to look into using them for deep neural networks. Some works try to hide attack triggers on their adversarial samples when attacking neural networks and others want to watermark neural networks to prove their ownership against plagiarism. Implanting a backdoor watermark module into a neural network is getting more attention from the community. In this paper, we present a general purpose encoder-decoder joint training method, inspired by generative adversarial networks (GANs). Unlike GANs, however, our encoder and decoder neural networks cooperate to find the best watermarking scheme given data samples. In other words, we do not design any new watermarking strategy but our proposed two neural networks will find the best suited method on their own. After being trained, the decoder can be implanted into other neural networks to attack or protect them (see Appendix for their use cases and real implementations). To this end, the decoder should be very tiny in order not to incur any overhead when attached to other neural networks but at the same time provide very high decoding success rates, which is very challenging. Our joint training method successfully solves the problem and in our experiments maintains almost 100\% encoding-decoding success rates for multiple datasets with very small modifications to data samples to hide watermarks. We also present several real-world use cases in Appendix.
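
The abstract does not spell out the training objective; one plausible sketch of the cooperative game, assuming an encoder that perturbs an image to embed a bit string and a tiny decoder that recovers it, is:

```python
import torch.nn.functional as F

def joint_watermark_loss(encoder, decoder, images, bits, alpha=1.0):
    """Hypothetical joint objective: keep the watermarked image close to the
    original while the small decoder recovers the embedded bits; alpha trades
    off imperceptibility against decoding success."""
    marked = encoder(images, bits)                      # watermarked images
    distortion = F.mse_loss(marked, images)             # keep modifications tiny
    recovery = F.binary_cross_entropy_with_logits(decoder(marked), bits.float())
    return distortion + alpha * recovery
```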


Contextualized Role Interaction for Neural Machine Translation    

tl;dr We propose a role interaction layer that explicitly models the modulation of token representations by contextualized roles.

Word inputs tend to be represented as single continuous vectors in deep neural networks. It is left to the subsequent layers of the network to extract relevant aspects of a word's meaning based on the context in which it appears. In this paper, we investigate whether word representations can be improved by explicitly incorporating the idea of latent roles. That is, we propose a role interaction layer (RIL) that consists of context-dependent (latent) role assignments and role-specific transformations. We evaluate the RIL on machine translation using two language pairs (En-De and En-Fi) and three datasets of varying size. We find that the proposed mechanism improves translation quality over strong baselines with limited amounts of data, but that the improvement diminishes as the size of data grows, indicating that powerful neural MT systems are capable of implicitly modeling role-word interaction by themselves. Our qualitative analysis reveals that the RIL extracts meaningful context-dependent roles and that it allows us to inspect more deeply the internal mechanisms of state-of-the-art neural machine translation systems.


Multi-Objective Value Iteration with Parameterized Threshold-Based Safety Constraints    

No tl;dr =[

We consider an environment with multiple reward functions. One of them represents goal achievement and the others represent instantaneous safety conditions. We consider a scenario where the safety rewards should always be above some thresholds. The thresholds are parameters with values that differ between users. We efficiently compute a family of policies that cover all threshold-based constraints and maximize the goal achievement reward. We introduce a new parameterized threshold-based scalarization method of the reward vector that encodes our objective. We present novel data structures to store the value functions of the Bellman equation that allow their efficient computation using the value iteration algorithm. We present results for both discrete and continuous state spaces.


TopicGAN: Unsupervised Text Generation from Explainable Latent Topics    

No tl;dr =[

Learning discrete representations of data and then generating data from the discovered representations have been increasingly studied, because the obtained discrete representations can benefit unsupervised learning. However, the performance of learning discrete representations of textual data with deep generative models has not been widely explored. In this work, we propose TopicGAN, a two-step generative model on text generation, which is able to discover discrete latent topics of texts and generate natural language from the discovered latent topics in an unsupervised fashion. Promising results are shown on unsupervised text classification and text generation for both subjective and objective evaluation.


Deep Convolutional Networks as shallow Gaussian Processes    

tl;dr We show that CNNs and ResNets with appropriate priors on the parameters are Gaussian processes in the limit of infinitely many convolutional filters.

We show that the output of a (residual) CNN with an appropriate prior over the weights and biases is a GP in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.


Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis    

tl;dr We thoroughly compare several pretraining tasks on their ability to induce syntactic information and find that representations from language models consistently perform best, even when trained on relatively small amounts of data.

Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. We still, though, do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives - language modeling, translation, skip-thought, and autoencoding - on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly-initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.


Adversarial Information Factorization    

tl;dr Learn representations for images that factor out a single attribute.

We propose a novel generative model architecture designed to learn representations for images that factor out a single attribute from the rest of the representation. A single object may have many attributes which when altered do not change the identity of the object itself. Consider the human face; the identity of a particular person is independent of whether or not they happen to be wearing glasses. The attribute of wearing glasses can be changed without changing the identity of the person. However, the ability to manipulate and alter image attributes without altering the object identity is not a trivial task. Here, we are interested in learning a representation of the image that separates the identity of an object (such as a human face) from an attribute (such as 'wearing glasses'). We demonstrate the success of our factorization approach by using the learned representation to synthesize the same face with and without a chosen attribute. We refer to this specific synthesis process as image attribute manipulation. We further demonstrate that our model achieves competitive scores, with state of the art, on a facial attribute classification task.


Bayesian Modelling and Monte Carlo Inference for GAN    

tl;dr A novel Bayesian treatment for GAN with theoretical guarantee.

Bayesian modelling is a principled framework to perform model aggregation, which has been a primary mechanism to combat mode collapse in the context of Generative Adversarial Networks (GANs). In this paper, we propose a novel Bayesian modelling framework for GANs, which iteratively learns a distribution over generators with a carefully crafted prior. Learning is efficiently triggered by a tailored stochastic gradient Hamiltonian Monte Carlo with a novel gradient approximation to perform Bayesian inference. Our theoretical analysis further reveals that our treatment is the first Bayesian modelling framework that yields an equilibrium where generator distributions are faithful to the data distribution. Empirical evidence on synthetic high-dimensional multi-modal data and the natural image database CIFAR-10 demonstrates the superiority of our method over both state-of-the-art multi-generator GANs and other Bayesian treatments of GANs.


Transfer and Exploration via the Information Bottleneck    

tl;dr Training agents with goal-policy information bottlenecks promotes transfer and yields a powerful exploration bonus

A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out {\it decision states}. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned model with an information bottleneck, we can identify decision states by examining where the model accesses the goal state through the bottleneck. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.


Pixel Redrawn For A Robust Adversarial Defense    

No tl;dr =[

Adversarial examples have recently become a serious problem because they can easily fool trained neural networks. To prevent this issue, many researchers have proposed several defense techniques such as adversarial training, input transformation, stochastic activation pruning, etc. In this paper, we propose a novel defense technique, the Pixel Redrawn (PR) method, which redraws every pixel of the training images to convert them into distorted images. The motivation for our PR method comes from the observation that adversarial attacks redraw some pixels of the original image using the known parameters of the trained neural network. Mimicking these attacks, our PR method redraws the image without any knowledge of the trained neural network. This is similar in spirit to adversarial training, but our PR method can also be used to prevent future attacks. Experimental results on several benchmark datasets indicate that our PR method not only relieves the over-fitting issue when we train neural networks with a large number of epochs, but also boosts the robustness of the neural network.


Making Convolutional Networks Shift-Invariant Again    

tl;dr Modern networks are not shift-invariant, due to naive downsampling; we apply a signal processing tool -- anti-aliasing low-pass filtering before downsampling -- to improve shift-invariance

Modern convolutional networks are not shift-invariant, despite their convolutional nature: small shifts in the input can cause drastic changes in the internal feature maps and output. In this paper, we isolate the cause -- the downsampling operation in convolutional and pooling layers -- and apply the appropriate signal processing fix -- low-pass filtering before downsampling. This simple architectural modification boosts the shift-equivariance of the internal representations and consequently, shift-invariance of the output. Importantly, this is achieved while maintaining downstream classification performance. In addition, incorporating the inductive bias of shift-invariance largely removes the need for shift-based data augmentation. Lastly, we observe that the modification induces spatially-smoother learned convolutional kernels. Our results suggest that this classical signal processing technique has a place in modern deep networks.
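
A minimal sketch of the fix, assuming a fixed binomial blur kernel applied channel-wise before strided subsampling (the paper evaluates several filter sizes; a 3x3 kernel is shown here purely for illustration):

```python
import torch
import torch.nn.functional as F

def blur_pool(x, stride=2):
    """Anti-aliased downsampling: low-pass filter each channel with a
    [1, 2, 1] binomial kernel, then subsample with the given stride."""
    k1d = torch.tensor([1.0, 2.0, 1.0])
    k2d = k1d[:, None] * k1d[None, :]
    k2d = k2d / k2d.sum()
    c = x.shape[1]
    weight = k2d.repeat(c, 1, 1, 1).to(x.dtype)   # one fixed filter per channel
    return F.conv2d(x, weight, stride=stride, padding=1, groups=c)

# e.g. replace MaxPool2d(2) with MaxPool2d(2, stride=1) followed by blur_pool.
```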


Learning models for visual 3D localization with implicit mapping    

tl;dr We propose a generative approach based on Generative Query Networks + attention for localization with implicit mapping, and compare to a discriminative baseline with a similar architecture.

We consider learning based methods for visual localization that do not require the construction of explicit maps in the form of point clouds or voxels. The goal is to learn an implicit representation of the environment at a higher, more abstract level, for instance that of objects. We propose to use a generative approach based on Generative Query Networks (GQNs, Eslami et al. 2018), asking the following questions: 1) Can GQN capture more complex scenes than those it was originally demonstrated on? 2) Can GQN be used for localization in those scenes? To study this approach we consider procedurally generated Minecraft worlds, for which we can generate images of complex 3D scenes along with camera pose coordinates. We first show that GQNs, enhanced with a novel attention mechanism can capture the structure of 3D scenes in Minecraft, as evidenced by their samples. We then apply the models to the localization problem, comparing the results to a discriminative baseline, and comparing the ways each approach captures the task uncertainty.


Verification of Non-Linear Specifications for Neural Networks    

No tl;dr =[

Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we extend verification algorithms to be able to certify richer properties of neural networks. To do this we introduce the class of convex-relaxable specifications, which constitute nonlinear specifications that can be verified using a convex relaxation. We show that a number of important properties of interest can be modeled within this class, including conservation of energy in a learned dynamics model of a physical system; semantic consistency of a classifier's output labels under adversarial perturbations; and bounding errors in a system that predicts the summation of handwritten digits. Our experimental evaluation shows that our method is able to effectively verify these specifications. Moreover, our evaluation exposes the failure modes in models which cannot be verified to satisfy these specifications. This emphasizes the importance of training models not just to fit the training data but also to be consistent with specifications.


Variational Autoencoders for Text Modeling without Weakening the Decoder    

tl;dr We propose a model of variational autoencoders for text modeling without weakening the decoder, which improves the quality of text generation and interpretability of acquired representations.

Previous work (Bowman et al., 2015; Yang et al., 2017) has found it difficult to develop generative models based on variational autoencoders (VAEs) for text. To address the problem of the decoder ignoring information from the encoder (posterior collapse), these previous models weaken the capacity of the decoder to force the model to use information from latent variables. However, this strategy is not ideal as it degrades the quality of generated text and increases the number of hyper-parameters. In this paper, we propose a new VAE for text utilizing a multimodal prior distribution, a modified encoder, and multi-task learning. We show our model can generate well-conditioned sentences without weakening the capacity of the decoder. Also, the multimodal prior distribution improves the interpretability of acquired representations.


AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference    

tl;dr Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation.

Wide adoption of complex RNN-based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge-distillation-based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-the-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.


Universal Lipschitz Functions    

tl;dr We identify pathologies in existing activation functions when learning neural networks with Lipschitz constraints and use these insights to design neural networks which are universal Lipschitz function approximators.

Training neural networks with a Lipschitz constraint provides improved generalization, robustness, and interpretability. However, existing techniques either fail to guarantee a Lipschitz constraint or are unable to universally approximate Lipschitz functions. Often, a small Lipschitz constant is enforced by considering constraints on the network weights, but little attention is paid to the choice of activation function. We identify the Jacobian norm of network layers as a scarce resource in representing Lipschitz functions and show that common activation functions are unable to effectively utilize this. We show that, with common activation functions, networks are unable to learn even the simplest Lipschitz functions, such as the absolute value function. With this insight, we introduce a novel activation function, the GroupSort activation, which partitions the hidden layer and sorts the units within each partition. Empirically, we identify pathologies of common activation functions and confirm that these theoretical observations are relevant in practice.
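
As a concrete reference, a minimal sketch of the GroupSort activation (sorting units within fixed-size groups; with groups of two this reduces to a max/min pairing):

```python
import torch

def group_sort(x, group_size=2):
    """Sort activations within contiguous groups along the last dimension;
    the operation is a per-input permutation, so it preserves gradient norms."""
    *batch, n = x.shape
    assert n % group_size == 0, "feature dimension must divide into groups"
    grouped = x.reshape(*batch, n // group_size, group_size)
    return torch.sort(grouped, dim=-1).values.reshape(*batch, n)

# group_sort(torch.tensor([[3., 1., 2., 5.]]))  ->  tensor([[1., 3., 2., 5.]])
```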


Optimistic Acceleration for Optimization    

tl;dr We consider new variants of optimization algorithms for training deep nets.

We consider new variants of optimization algorithms. Our algorithms are based on the observation that mini-batch stochastic gradients in consecutive iterations do not change drastically and consequently may be predictable. Inspired by a similar setting in the online learning literature, called Optimistic Online Learning, we propose two new algorithms, Optimistic-AMSGrad and Optimistic-Adam, that exploit the predictability of gradients. Optimistic-AMSGrad and Optimistic-Adam combine the ideas of momentum methods, adaptive gradient methods, and algorithms in Optimistic Online Learning, which leads to speed-ups when training deep neural nets in practice.


Adversarial Examples Are a Natural Consequence of Test Error in Noise    

tl;dr Small adversarial perturbations should be expected given observed error rates of models outside the natural data distribution.

Maliciously constructed inputs, or adversarial examples, can fool trained machine learning models. Over the last few years, adversarial examples have captured the attention of the research community, especially in the case where the adversary is restricted to making only small modifications of a correctly handled input. When it was first discovered that neural networks are sensitive to small perturbations, many researchers found this surprising and proposed several hypotheses to explain it. In this work, we show that this sensitivity and the poor performance of classification models (relative to humans) on noisy images are two manifestations of the same underlying phenomenon. Nearby errors simply lie on the boundary of a large set of errors whose volume can be measured using test error in additive noise. We present compelling new evidence in favor of this interpretation before discussing some preexisting results which also support our perspective. The relationship between nearby errors and failure to generalize in noise has implications for the adversarial defense literature, as it suggests that defenses which fail to reduce test error in noise will also fail to defend against small adversarial perturbations. This yields a computationally tractable evaluation metric for defenses to consider: test error in noisy image distributions.


Second-Order Adversarial Attack and Certifiable Robustness    

No tl;dr =[

Adversarial training has been recognized as a strong defense against adversarial attacks. In this paper, we propose a powerful second-order attack method that reduces the accuracy of the defense model of Madry et al. (2017). We demonstrate that adversarial training overfits to the choice of the norm in the sense that it is only robust to the attack used for adversarial training, thus suggesting it has not achieved universal robustness. The effectiveness of our attack method motivates an investigation of provable robustness of a defense model. To this end, we introduce a framework that allows one to obtain a certifiable lower bound on the prediction accuracy against adversarial examples. We conduct experiments to show the effectiveness of our attack method. At the same time, our defense model achieves significant improvements compared to previous works under our proposed attack.


Reduced-Gate Convolutional LSTM Design Using Predictive Coding for Next-Frame Video Prediction    

tl;dr A novel reduced-gate convolutional LSTM design using predictive coding for next-frame video prediction

Spatiotemporal sequence prediction is an important problem in deep learning. We study next-frame video prediction using a deep-learning-based predictive coding framework that uses convolutional, long short-term memory (convLSTM) modules. We introduce a novel reduced-gate convolutional LSTM architecture. Our reduced-gate model achieves better next-frame prediction accuracy than the original convolutional LSTM while using a smaller parameter budget, thereby reducing training time. We tested our reduced-gate modules within a predictive coding architecture on the moving MNIST and KITTI datasets. We found that our reduced-gate model reduces the total number of training parameters and the training time by approximately 40 percent compared with the standard convolutional LSTM model, which makes it attractive for hardware implementation, especially on small devices.
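
The exact gating reduction is detailed in the paper; as a hedged illustration only, a convolutional LSTM cell that replaces the three standard gates with a single shared gate could look like:

```python
import torch
import torch.nn as nn

class SingleGateConvLSTMCell(nn.Module):
    """Illustrative reduced-gate convLSTM: one learned gate is reused for the
    forget, input, and output roles. A sketch of the idea, not the paper's
    exact design."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gate = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)
        g = torch.sigmoid(self.gate(z))                 # single shared gate
        c_new = (1 - g) * c + g * torch.tanh(self.cand(z))
        h_new = g * torch.tanh(c_new)
        return h_new, (h_new, c_new)
```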


GENERALIZED ADAPTIVE MOMENT ESTIMATION    

tl;dr A new adaptive gradient method is proposed for effectively training deep neural networks

Adaptive gradient methods have experienced great success in training deep neural networks (DNNs). The basic idea of the methods is to track and properly make use of the first and/or second moments of the gradient for model-parameter updates over iterations for the purpose of removing the need for manual interference. In this work, we propose a new adaptive gradient method, referred to as generalized adaptive moment estimation (Game). From a high-level perspective, the new method introduces two more parameters w.r.t. AMSGrad (S. J. Reddi & Kumar (2018)) and one more parameter w.r.t. PAdam (Chen & Gu (2018)) to enlarge the parameter-selection space for performance enhancement while reducing the memory cost per iteration compared to AMSGrad and PAdam. The saved memory space amounts to the number of model parameters, which is significant for large-scale DNNs. Our motivation for introducing additional parameters in Game is to provide algorithmic flexibility to facilitate a reduction of the performance gap between training and validation datasets when training a DNN. Convergence analysis is provided for applying Game to solve both convex optimization and smooth nonconvex optimization. Empirical studies for training four convolutional neural networks over MNIST and CIFAR10 show that under proper parameter selection, Game produces promising validation performance as compared to AMSGrad and PAdam.


Understanding & Generalizing AlphaGo Zero    

No tl;dr =[

AlphaGo Zero (AGZ) introduced a new {\em tabula rasa} reinforcement learning algorithm that has achieved superhuman performance in the games of Go, Chess, and Shogi with no prior knowledge other than the rules of the game. This success naturally begs the question whether it is possible to develop similar high-performance reinforcement learning algorithms for generic sequential decision-making problems (beyond two-player games), using only the constraints of the environment as the ``rules.'' To address this challenge, we start by taking steps towards developing a formal understanding of AGZ. AGZ includes two key innovations: (1) it learns a policy (represented as a neural network) using {\em supervised learning} with cross-entropy loss from samples generated via Monte-Carlo Tree Search (MCTS); (2) it uses {\em self-play} to learn without training data. We argue that the self-play in AGZ corresponds to learning a Nash equilibrium for the two-player game; and the supervised learning with MCTS is attempting to learn the policy corresponding to the Nash equilibrium, by establishing a novel bound on the difference between the expected return achieved by two policies in terms of the expected KL divergence (cross-entropy) of their induced distributions. To extend AGZ to generic sequential decision-making problems, we introduce a {\em robust MDP} framework, in which the agent and nature effectively play a zero-sum game: the agent aims to take actions to maximize reward while nature seeks state transitions, subject to the constraints of that environment, that minimize the agent's reward. For a challenging network scheduling domain, we find that AGZ within the robust MDP framework provides near-optimal performance, matching one of the best known scheduling policies that has taken the networking community three decades of intensive research to develop.


How Important is a Neuron    

No tl;dr =[

The problem of attributing a deep network’s prediction to its input/base features is well-studied (cf. Simonyan et al. (2013)). We introduce the notion of conductance to extend the notion of attribution to understanding the importance of hidden units. Informally, the conductance of a hidden unit of a deep network is the flow of attribution via this hidden unit. We can use conductance to understand the importance of a hidden unit to the prediction for a specific input, or over a set of inputs. We justify conductance in multiple ways via a qualitative comparison with other methods, via some axiomatic results, and via an empirical evaluation based on a feature selection task. The empirical evaluations are done using the Inception network over ImageNet data, and a convolutional network over text data. In both cases, we demonstrate the effectiveness of conductance in identifying interesting insights about the internal workings of these networks.
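
For reference, conductance builds on path-integrated gradients: for a network output $F$, hidden unit $j$ with activation $h_j$, input $x$ and baseline $x'$, the conductance of $j$ can be written (paraphrasing the standard definition) as $\mathrm{Cond}_j(x) = \sum_i (x_i - x'_i) \int_0^1 \frac{\partial F}{\partial h_j} \cdot \frac{\partial h_j}{\partial x_i} \Big|_{x' + \alpha(x - x')} \, d\alpha$, i.e. the part of the integrated-gradients attribution that flows through unit $j$.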


Large-scale classification of structured objects using a CRF with deep class embedding    

tl;dr We present a technique for ultrafine-grained, large-scale structured classification, based on CRF modeling with factorized pairwise potentials, learned as neighboring class embedding in a whitened space.

This paper presents a novel deep learning architecture for classifying structured objects in ultrafine-grained datasets, where classes may not be clearly distinguishable by their appearance but rather by their context. We model sequences of images as linear-chain CRFs, and jointly learn the parameters from both local-visual features and neighboring class information. The visual features are learned by convolutional layers, whereas class-structure information is reparametrized by factorizing the CRF pairwise potential matrix. This forms a context-based semantic similarity space, learned alongside the visual similarities, and dramatically increases the learning capacity of contextual information. This new parametrization, however, forms a highly nonlinear objective function which is challenging to optimize. To overcome this, we develop a novel surrogate likelihood which allows for a local likelihood approximation of the original CRF with integrated batch-normalization. This model overcomes the difficulties of existing CRF methods to learn the contextual relationships thoroughly when there is a large number of classes and the data is sparse. The performance of the proposed method is illustrated on a huge dataset that contains images of retail-store product displays, and shows significantly improved results compared to linear CRF parametrization, unnormalized likelihood optimization, and RNN modeling.


POLICY GENERALIZATION IN CAPACITY-LIMITED REINFORCEMENT LEARNING    

tl;dr This paper describes the application of rate-distortion theory to the learning of efficient (capacity limited) policy representations in the reinforcement learning setting.

Motivated by the study of generalization in biological intelligence, we examine reinforcement learning (RL) in settings where there are information-theoretic constraints placed on the learner’s ability to represent a behavioral policy. We first show that the problem of optimizing expected utility within capacity-limited learning agents maps naturally to the mathematical field of rate-distortion (RD) theory. Applying the RD framework to the RL setting, we develop a new online RL algorithm, Capacity-Limited Actor-Critic, that learns a policy that optimizes a tradeoff between utility maximization and information processing costs. Using this algorithm in a 2D gridworld environment, we demonstrate two novel empirical results. First, at high information rates (high channel capacity), the algorithm achieves faster learning and discovers better policies compared to the standard tabular actor-critic algorithm. Second, we demonstrate that agents with capacity-limited policy representations exhibit superior transfer to novel environments compared to policies learned by agents with unlimited information processing resources. Our work provides a principled framework for the development of computationally rational RL agents.


The Singular Values of Convolutional Layers    

tl;dr We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation.

We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\% to 5.3\%.
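
A sketch of the resulting computation, assuming circular (wrap-around) padding and a kernel stored as (k, k, c_in, c_out): take a 2D FFT of the kernel at the input resolution and collect the singular values of the per-frequency channel matrices.

```python
import numpy as np

def conv_singular_values(kernel, n):
    """Singular values of the linear map defined by a 2D multi-channel
    convolution on n x n inputs with circular padding.
    kernel: array of shape (k, k, c_in, c_out)."""
    transforms = np.fft.fft2(kernel, s=(n, n), axes=(0, 1))  # (n, n, c_in, c_out)
    return np.linalg.svd(transforms, compute_uv=False)       # batched per-frequency SVDs

# operator norm of the layer (largest singular value):
# sigma_max = conv_singular_values(kernel, n).max()
```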


NUTS: Network for Unsupervised Telegraphic Summarization    

tl;dr In this paper, we propose an unsupervised deep learning network (NUTS) to generate telegraphic summaries.

Extractive summarization methods operate by ranking and selecting the sentences which best encapsulate the theme of a given document. They do not fare well in domains like fictional narratives where there is no central theme and core information is not encapsulated by a small set of sentences. For the purpose of reducing the size of the document while conveying the idea expressed by each sentence, we need more sentence specific methods. Telegraphic summarization, which selects short segments across several sentences, is better suited for such domains. Telegraphic summarization captures the plot better by retaining shorter versions of each sentence while not really concerning itself with grammatically linking these segments. In this paper, we propose an unsupervised deep learning network (NUTS) to generate telegraphic summaries. We use multiple encoder-decoder networks and learn to drop portions of the text that are inferable from the chosen segments. The model is agnostic to both sentence length and style. We demonstrate that the summaries produced by our model show significant quantitative and qualitative improvement over those produced by existing methods and baselines.


Robust Determinantal Generative Classifier for Noisy Labels and Adversarial Attacks    

No tl;dr =[

Large-scale datasets may contain significant proportions of noisy (incorrect) class labels, and it is well-known that modern deep neural networks poorly generalize from such noisy training datasets. In this paper, we propose a novel inference method, Deep Determinantal Generative Classifier (DDGC), which can obtain a more robust decision boundary under any softmax neural classifier pre-trained on noisy datasets. Our main idea is inducing a generative classifier on top of hidden feature spaces of the discriminative deep model. By estimating the parameters of the generative classifier using the minimum covariance determinant estimator, we significantly improve the classification accuracy, with neither re-training of the deep model nor changing its architecture. In particular, we show that DDGC not only generalizes well from noisy labels, but also is robust against adversarial perturbations due to its large margin property. Finally, we propose the ensemble version of DDGC to improve its performance, by investigating the layer-wise characteristics of the generative classifier. Our extensive experimental results demonstrate the superiority of DDGC given different learning models optimized by various training techniques to handle noisy labels or adversarial examples. For instance, we improve the test accuracy of DenseNet on the CIFAR-10 dataset with 60% noisy labels from 53.34% to 74.72%.
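
A minimal sketch of the general recipe (class-conditional Gaussians fitted on hidden features with the minimum covariance determinant estimator; the paper's full method adds layer-wise ensembling and other refinements not shown here):

```python
import numpy as np
from sklearn.covariance import MinCovDet

def fit_generative_classifier(features, labels):
    """Fit a robust Gaussian (MCD mean and covariance) per class on the
    penultimate-layer features of a pre-trained network."""
    params = {}
    for c in np.unique(labels):
        mcd = MinCovDet().fit(features[labels == c])
        params[c] = (mcd.location_, np.linalg.inv(mcd.covariance_))
    return params

def predict(features, params):
    """Assign each sample to the class with the smallest Mahalanobis distance."""
    classes = sorted(params)
    dists = [np.einsum('ij,jk,ik->i', features - params[c][0], params[c][1],
                       features - params[c][0]) for c in classes]
    return np.array(classes)[np.argmin(np.stack(dists, axis=1), axis=1)]
```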


CNNSAT: Fast, Accurate Boolean Satisfiability using Convolutional Neural Networks    

tl;dr We introduce CNNSAT, a fast and accurate statistical decision procedure for SAT based on convolutional neural networks.

Boolean satisfiability (SAT) is one of the most well-known NP-complete problems and has been extensively studied. State-of-the-art solvers exist and have found a wide range of applications. However, they still do not scale well to formulas with hundreds of variables. To tackle this fundamental scalability challenge, we introduce CNNSAT, a fast and accurate statistical decision procedure for SAT based on convolutional neural networks. CNNSAT's effectiveness is due to a precise and compact representation of Boolean formulas. On both real and synthetic formulas, CNNSAT is highly accurate and orders of magnitude faster than the state-of-the-art solver Z3. We also describe how to extend CNNSAT to predict satisfying assignments when it predicts a formula to be satisfiable.


INTERPRETING DEEP NEURAL NETWORK: FAST OBJECT LOCALIZATION VIA SENSITIVITY ANALYSIS    

tl;dr Proposing a novel object localization(detection) approach based on interpreting the deep CNN using internal representation and network's thoughts

Deep Convolutional Neural Networks (CNNs) have been repeatedly shown to perform well on image classification tasks, successfully recognizing a broad array of objects when given sufficient training data. Methods for object localization, however, are still in need of substantial improvement. Common approaches to this problem involve the use of a sliding window, sometimes at multiple scales, providing input to a deep CNN trained to classify the contents of the window. In general, these approaches are time consuming, requiring many classification calculations. In this paper, we offer a fundamentally different approach to the localization of recognized objects in images. Our method is predicated on the idea that a deep CNN capable of recognizing an object must implicitly contain knowledge about object location in its connection weights. We provide a simple method to interpret classifier weights in the context of individual classified images. This method involves the calculation of the derivative of network generated activation patterns, such as the activation of output class label units, with regard to each input pixel, performing a sensitivity analysis that identifies the pixels that, in a local sense, have the greatest influence on internal representations and object recognition. These derivatives can be efficiently computed using a single backward pass through the deep CNN classifier, producing a sensitivity map of the image. We demonstrate that a simple linear mapping can be learned from sensitivity maps to bounding box coordinates, localizing the recognized object. Our experimental results, using real-world data sets for which ground truth localization information is known, reveal competitive accuracy from our fast technique.


Efficient Exploration through Bayesian Deep Q-Networks    

tl;dr Using Bayesian regression for the last layer of DQN, and do Thompson Sampling for exploration. With Bayesian Regret bound

We propose Bayesian Deep Q-Networks (BDQN), a principled and practical Deep Reinforcement Learning (DRL) algorithm for Markov decision processes (MDP). It combines Thompson sampling with deep Q-networks (DQN). Thompson sampling ensures a more efficient exploration-exploitation tradeoff in high dimensions. It is typically carried out through posterior sampling over the model parameters, which makes it computationally expensive. To overcome this limitation, we directly incorporate uncertainty over the value (Q) function. Further, we only introduce randomness in the last layer (i.e. the output layer) of the DQN and use independent Gaussian priors on the weights. This allows us to efficiently carry out Thompson sampling through Gaussian sampling and Bayesian Linear Regression (BLR), which has fast closed-form updates. The rest of the layers of the Q network are trained through back propagation, as in a standard DQN. We apply our method to a wide range of Atari games in Arcade Learning Environments and compare BDQN to a powerful baseline: the double deep Q-network (DDQN). Since BDQN carries out more efficient exploration, it is able to reach higher rewards substantially faster: it needs less than 5M±1M samples to reach DDQN scores on almost half of the games, while a typical run of DDQN takes 50-200M samples. We also establish theoretical guarantees for the special case when the feature representation is fixed and not learnt. We show that the Bayesian regret is bounded by $\tilde{O}(d\sqrt{N})$ after N time steps for a d-dimensional feature map, and this bound is shown to be tight up to logarithmic factors. To the best of our knowledge, this is the first Bayesian theoretical guarantee for Markov Decision Processes (MDP) beyond the tabula rasa setting.
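
A hedged sketch of the last-layer machinery: Bayesian linear regression over the DQN's final-layer features for a given action, with a Gaussian prior, and Thompson sampling by drawing a weight vector from the closed-form posterior (hyper-parameters and the interaction with target networks are omitted):

```python
import numpy as np

class BayesianLastLayer:
    """Bayesian linear regression over last-layer features phi for one action."""
    def __init__(self, dim, prior_sigma=1.0, noise_sigma=1.0):
        self.prior_prec = np.eye(dim) / prior_sigma ** 2
        self.noise_var = noise_sigma ** 2
        self.XtX = np.zeros((dim, dim))
        self.Xty = np.zeros(dim)

    def update(self, phi, target):
        self.XtX += np.outer(phi, phi)
        self.Xty += phi * target

    def posterior(self):
        prec = self.prior_prec + self.XtX / self.noise_var
        cov = np.linalg.inv(prec)
        return cov @ self.Xty / self.noise_var, cov

    def sample_weights(self, rng=np.random):
        mean, cov = self.posterior()
        return rng.multivariate_normal(mean, cov)   # Thompson sample of Q-weights
```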


Confidence-based Graph Convolutional Networks for Semi-Supervised Learning    

tl;dr We propose a confidence based Graph Convolutional Network for Semi-Supervised Learning.

Predicting properties of nodes in a graph is an important problem with applications in a variety of domains. Graph-based Semi-Supervised Learning (SSL) methods aim to address this problem by labeling a small subset of the nodes as seeds, and then utilizing the graph structure to predict label scores for the rest of the nodes in the graph. Recently, Graph Convolutional Networks (GCNs) have achieved impressive performance on the graph-based SSL task. In addition to label scores, it is also desirable to have a confidence score associated with them. Unfortunately, confidence estimation in the context of GCN has not been previously explored. We fill this important gap in this paper and propose ConfGCN, which estimates label scores along with their confidences jointly in a GCN-based setting. ConfGCN uses these estimated confidences to determine the influence of one node on another during neighborhood aggregation, thereby acquiring anisotropic capabilities. Through extensive analysis and experiments on standard benchmarks, we find that ConfGCN is able to significantly outperform state-of-the-art baselines. We have made ConfGCN’s source code available to encourage reproducible research.


A Priori Estimates of the Generalization Error for Two-layer Neural Networks    

No tl;dr =[

New estimates for the generalization error are established for a nonlinear regression problem using a two-layer neural network model. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model. In contrast, most existing results for neural networks are a posteriori in nature in the sense that the bounds depend on some norms of the model parameters. The error rates are comparable to that of the Monte Carlo method in terms of the size of the dataset. Moreover, these bounds are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset.


Model Comparison for Semantic Grouping    

tl;dr Competitive alternative to sentence embeddings in the task of semantic similarity using model comparison

We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate this as a model comparison task in which we contrast a generative model that encodes similarity between the two groups versus one that does not. We illustrate how this framework can be used for the Semantic Text Similarity (STS) task using clear assumptions about how the embeddings of words are generated. We apply information criteria based model comparison to overcome the shortcomings of Bayesian model comparison, whilst still penalising model complexity. We achieve competitive results by applying the proposed framework with an appropriate choice of likelihood on the STS datasets.


Formal Limitations on the Measurement of Mutual Information    

tl;dr We give a theoretical analysis of the measurement and optimization of mutual information.

Maximum mutual information (MMI) predictive coding has recently been proposed as a compelling approach to self-supervised learning. Maximizing mutual information also arises in INFOMAX and the information bottleneck. Recent papers have used the Donsker-Varadhan (DV) lower bound on KL divergence as a tractable training surrogate when maximizing mutual information. We show here that any distribution-free high-confidence lower bound on mutual information requires a sample size exponential in the value of the bound. We also directly analyze the DV lower bound and show that the DV bound cannot be measured accurately without an exponential sample size. We propose instead to represent mutual information as a difference of entropies and to estimate those entropies with cross-entropy upper bounds. A sample estimate of a cross entropy converges to the true cross-entropy at the rate of $1/\sqrt{N}$.
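
In other words, the proposal rests on the decomposition $I(X;Y) = H(Y) - H(Y \mid X)$, where each entropy admits a cross-entropy upper bound, e.g. $H(Y) \le -\mathbb{E}[\log q(Y)]$ for any model $q$, and the empirical average $-\frac{1}{N}\sum_{i=1}^{N}\log q(y_i)$ estimates that bound at the stated $1/\sqrt{N}$ rate.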


Successor Uncertainties: exploration and uncertainty in temporal difference learning    

No tl;dr =[

We consider the problem of balancing exploration and exploitation in sequential decision making problems. To explore efficiently, it is vital to consider the uncertainty over all consequences of a decision, and not just those that follow immediately; the uncertainties need to be propagated according to the dynamics of the problem. To this end, we develop Successor Uncertainties, a probabilistic model for the state-action function of a Markov Decision Process that propagates uncertainties in a coherent and scalable way. Our model achieves this by combining successor features and online Bayesian uncertainty estimation. We relate our approach to other classical and contemporary methods for exploration and present an empirical analysis of successor uncertainties.


Neural Graph Evolution: Automatic Robot Design    

tl;dr Automatic robotic design search with graph neural networks and evolutionary algorithms

Despite the recent success in robotic locomotion control, the design of robots relies heavily on human engineering. Automatic robot design has been a long-studied subject, but with limited success due to the large combinatorial search space and the difficulty of evaluating the found solutions. In this paper, we propose Neural Graph Evolution (NGE) to address these two challenges. We formulate automatic robot design as a graph search problem and perform evolutionary search in graph space. NGE uses graph neural networks to parameterize the control policies, which allows skill transfer from previously evaluated control policies to new robot designs. The policy sharing and transfer greatly reduce the cost of re-training new candidates. Similar to ES, NGE performs selection on current candidates and evolves new ones iteratively. In addition, NGE applies Graph Mutation by incorporating model uncertainty, which reduces the search space by balancing exploration and exploitation. We show that NGE significantly outperforms both random graph search (RGS) and ES by an order of magnitude. As shown in experiments, NGE is the first algorithm that can automatically discover complex robotic graph structures, such as a fish with two symmetrical flat side-fins and a tail, or a cheetah with athletic front and back legs. Instead of using thousands of cores for weeks, NGE efficiently solves the search problem within a day on a single 64 CPU-core Amazon EC2 machine.


A NON-LINEAR THEORY FOR SENTENCE EMBEDDING    

No tl;dr =[

This paper revisits the Random Walk model for sentence embedding in the context of non-extensive statistics. We propose a non-extensive algebra to compute the discourse vector. We argue that by doing so we are taking into account high non-linearity in the semantic space. Furthermore, we show that by considering a non-extensive algebra, the compounding effect of the vector length is mitigated. Overall, we show that the proposed model leads to good sentence embedding. We evaluate the embedding method on textual similarity tasks.


Coupled Recurrent Models for Polyphonic Music Composition    

tl;dr New recurrent generative models for composition of rhythmically complex, polyphonic music.

This work describes a novel recurrent model for music composition, which accounts for the rich statistical structure of polyphonic music. There are many ways to factor the probability distribution over musical scores; we consider the merits of various approaches and propose a new factorization that decomposes a score into a collection of concurrent, coupled time series: "parts." The model we propose borrows ideas from both convolutional neural models and recurrent neural models; we argue that these ideas are natural for capturing music's pitch invariances, temporal structure, and polyphony. We train generative models for homophonic and polyphonic composition on the KernScores dataset (Sapp, 2005), a collection of 2,300 musical scores comprised of around 2.8 million notes spanning time from the Renaissance to the early 20th century. While evaluation of generative models is known to be hard (Theis et al., 2016), we present careful quantitative results using a unit-adjusted cross entropy metric that is independent of how we factor the distribution over scores. We also present qualitative results using a blind discrimination test.


Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout    

tl;dr Realizing the drawbacks when applying original dropout on DenseNet, we craft the design of dropout method from three aspects, the idea of which could also be applied on other CNN models.

Recently, convolutional neural networks (CNNs) have achieved great accuracy in visual recognition tasks. DenseNet has become one of the most popular CNN models due to its effectiveness in feature-reuse. However, like other CNN models, DenseNets also face the overfitting problem, if anything more severely. Existing dropout methods can be applied but are not as effective, due to the introduced nonlinear connections. In particular, the property of feature-reuse in DenseNet will be impeded, and the dropout effect will be weakened by the spatial correlation inside feature maps. To address these problems, we craft the design of a specialized dropout method from three aspects: dropout location, dropout granularity, and dropout probability. The insights attained here could potentially be applied as a general approach for boosting the accuracy of other CNN models with similar nonlinear connections. Experimental results show that DenseNets with our specialized dropout method yield better accuracy compared to vanilla DenseNet and state-of-the-art CNN models, and such accuracy boost increases with the model depth.


Explainable Adversarial Learning: Implicit Generative Modeling of Random Noise during Training for Adversarial Robustness    

tl;dr Noise modeling at the input during discriminative training improves adversarial robustness. Propose PCA based evaluation metric for adversarial robustness

We introduce Explainable Adversarial Learning, ExL, an approach for training neural networks that are intrinsically robust to adversarial attacks. We find that the implicit generative modeling of random noise with the same loss function used during posterior maximization improves a model's understanding of the data manifold, furthering adversarial robustness. We prove our approach's efficacy and provide a simple visualization tool for understanding adversarial data, using Principal Component Analysis. Our analysis reveals that adversarial robustness, in general, manifests in models with higher variance along the high-ranked principal components. We show that models learnt with our approach perform remarkably well against a wide range of attacks. Furthermore, combining ExL with state-of-the-art adversarial training extends the robustness of a model, even beyond what it is adversarially trained for, in both white-box and black-box attack scenarios.


PeerNets: Exploiting Peer Wisdom Against Adversarial Attacks    

No tl;dr =[

Deep learning systems have become ubiquitous in many aspects of our lives. Unfortunately, it has been shown that such systems are vulnerable to adversarial attacks, making them prone to potential unlawful uses. Designing deep neural networks that are robust to adversarial attacks is a fundamental step in making such systems safer and deployable in a broader variety of applications (e.g. autonomous driving), but more importantly is a necessary step to design novel and more advanced architectures built on new computational paradigms rather than marginally building on the existing ones. In this paper we introduce PeerNets, a novel family of convolutional networks alternating classical Euclidean convolutions with graph convolutions to harness information from a graph of peer samples. This results in a form of non-local forward propagation in the model, where latent features are conditioned on the global structure induced by the graph, that is up to 3 times more robust to a variety of white- and black-box adversarial attacks compared to conventional architectures with almost no drop in accuracy.


Wasserstein proximal of GANs    

tl;dr We propose the Wasserstein proximal method for training GANs.

We introduce a new method for training GANs by applying the Wasserstein-2 metric proximal on the generators. The approach is based on the gradient operator induced by optimal transport, which connects the geometry of sample space and parameter space in implicit deep generative models. From this theory, we obtain an easy-to-implement regularizer for the parameter updates. Our experiments demonstrate that this method improves the speed and stability in training GANs in terms of wall-clock time and Fr\'echet Inception Distance (FID) learning curves.


SpaMHMM: Sparse Mixture of Hidden Markov Models for Graph Connected Entities    

tl;dr A method to model the generative distribution of sequences coming from graph connected entities.

We propose a framework to model the distribution of sequential data coming from a set of entities connected in a graph with a known topology. The method is based on a mixture of shared hidden Markov models (HMMs), which are trained in order to exploit the knowledge of the graph structure and in such a way that the obtained mixtures tend to be sparse. Experiments in different application domains demonstrate the effectiveness and versatility of the method.


Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search    

No tl;dr =[

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradients can be interpreted as counterfactual methods.
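
The counterfactual step can be illustrated on a toy additive structural causal model: abduct the exogenous noise from a logged transition, then replay the mechanism under a different action. The mechanism below is hypothetical and chosen only to make that procedure concrete.

```python
def counterfactual_next_state(s, a, s_next, a_cf):
    """Counterfactual outcome in a toy additive structural causal model.

    Assumed (hypothetical) mechanism: s_next = s + a + u, with exogenous
    noise u. Abduction recovers u from the logged transition, and the same
    noise is then replayed under the counterfactual action a_cf.
    """
    u = s_next - s - a          # abduction: infer the noise that occurred
    return s + a_cf + u         # prediction: replay the mechanism under a_cf

# Logged transition: state 1.0, action 0.5 led to next state 1.8, so the
# realized noise was u = 0.3. Had we taken action 1.0 instead, the
# counterfactual next state would be 2.3.
print(counterfactual_next_state(1.0, 0.5, 1.8, 1.0))   # -> 2.3
```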


Analyzing Inverse Problems with Invertible Neural Networks    

tl;dr To analyze inverse problems with Invertible Neural Networks

For many applications, in particular in natural science, the task is to determine hidden system parameters from a set of measurements. Often, the forward process from parameter- to measurement-space is well-defined, whereas the inverse problem is ambiguous: multiple parameter sets can result in the same measurement. To fully characterize this ambiguity, the full posterior parameter distribution, conditioned on an observed measurement, has to be determined. We argue that a particular class of neural networks is well suited for this task – so-called Invertible Neural Networks (INNs). Unlike classical neural networks, which attempt to solve the ambiguous inverse problem directly, INNs focus on learning the forward process, using additional latent output variables to capture the information otherwise lost. Due to invertibility, a model of the corresponding inverse process is learned implicitly. Given a specific measurement and the distribution of the latent variables, the inverse pass of the INN provides the full posterior over parameter space. We prove theoretically and verify experimentally, on artificial data and real-world problems from medicine and astrophysics, that INNs are a powerful analysis tool to find multi-modalities in parameter space, uncover parameter correlations, and identify unrecoverable parameters.
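
A common invertible building block is the affine coupling layer, sketched below in numpy; this is a generic illustration of how an exact inverse is obtained, not the specific INN architecture used in the paper.

```python
import numpy as np

def coupling_forward(x, scale_net, shift_net):
    """Forward pass of an affine coupling block. The first half of x passes
    through unchanged and parameterizes an elementwise affine transform of
    the second half, so the block can be inverted exactly."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = scale_net(x1), shift_net(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, scale_net, shift_net):
    """Exact inverse of coupling_forward, reusing the same sub-networks."""
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = scale_net(y1), shift_net(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

# Any functions work for scale_net / shift_net (e.g. small MLPs); a toy
# linear map is enough to check invertibility numerically:
# f = lambda h: 0.5 * h
# x = np.random.randn(4, 8)
# assert np.allclose(coupling_inverse(coupling_forward(x, f, f), f, f), x)
```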


Optimal Completion Distillation for Sequence Learning    

tl;dr Optimal Completion Distillation (OCD) is a training procedure for optimizing sequence to sequence models based on edit distance which achieves state-of-the-art on end-to-end Speech Recognition tasks.

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pre-training or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution which puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.8\%$ WER, respectively.
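
The suffix-selection step can be sketched with a standard edit-distance dynamic program: compute the distance between the generated prefix and every prefix of the target, then read off which target tokens (or EOS) can begin an optimal completion. The sketch below follows that description; the EOS handling is illustrative.

```python
def optimal_next_tokens(prefix, target, eos="</s>"):
    """Tokens that start an optimal completion of `prefix` w.r.t. `target`
    under edit distance.

    We compute the edit distance between the prefix and every prefix of the
    target; the minimizing positions tell us which target tokens can begin
    an optimal suffix (or EOS, if stopping now is already optimal).
    """
    m, n = len(prefix), len(target)
    # d[j] = edit distance between prefix[:i] and target[:j], updated row-wise.
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev = d[:]
        d[0] = i
        for j in range(1, n + 1):
            sub = prev[j - 1] + (prefix[i - 1] != target[j - 1])
            d[j] = min(prev[j] + 1, d[j - 1] + 1, sub)
    best = min(d)
    tokens = {target[j] for j in range(n) if d[j] == best}
    if d[n] == best:
        tokens.add(eos)         # completing with nothing more is also optimal
    return tokens

# e.g. optimal_next_tokens(list("SUN"), list("SATURDAY")) returns the
# characters that keep the eventual edit distance minimal.
```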


Featurized Bidirectional GAN: Adversarial Defense via Adversarially Learned Semantic Inference    

No tl;dr =[

Deep neural networks have been demonstrated to be vulnerable to adversarial attacks, where small perturbations intentionally added to the original inputs can fool the classifier. In this paper, we propose a defense method, Featurized Bidirectional Generative Adversarial Networks (FBGAN), to extract the semantic features of the input and filter out the non-semantic perturbation. FBGAN is pre-trained on the clean dataset in an unsupervised manner, adversarially learning a bidirectional mapping between a high-dimensional data space and a low-dimensional semantic space; mutual information is also applied to disentangle the semantically meaningful features. After the bidirectional mapping, the adversarial data can be reconstructed into denoised data, which can be fed into any pre-trained classifier. We empirically show the quality of the reconstructed images and the effectiveness of the defense.
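
At test time the defense reduces to an encode-reconstruct-classify pipeline, sketched below with hypothetical pre-trained modules (encoder, generator, classifier).

```python
import torch

def fbgan_defend(x_adv, encoder, generator, classifier):
    """Purification-style defense as described above (a minimal sketch with
    hypothetical modules: encoder and generator from the bidirectional GAN,
    classifier is any off-the-shelf pre-trained model)."""
    with torch.no_grad():
        z = encoder(x_adv)          # map the (possibly perturbed) input to
                                    # the low-dimensional semantic space
        x_clean = generator(z)      # reconstruct a denoised input
        return classifier(x_clean)  # classify the reconstruction
```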


Compound Density Networks    

No tl;dr =[

Despite the huge success of deep neural networks (NNs), finding good mechanisms for quantifying their prediction uncertainty is still an open problem. It was recently shown that using an ensemble of NNs trained with a proper scoring rule leads to results competitive with those of Bayesian NNs. This ensemble method can be understood as a finite mixture model with uniform mixing weights. We build on this mixture model approach and increase its flexibility by replacing the fixed mixing weights with an adaptive, input-dependent distribution (specifying the probability of each component) represented by an NN, and by considering uncountably many mixture components. The resulting model can be seen as the continuous counterpart to mixture density networks and is therefore referred to as a compound density network. We empirically show that the proposed model results in better uncertainty estimates and is more robust to adversarial examples than previous approaches.
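
A Monte Carlo view of the predictive distribution makes the "uncountably many components" concrete: sample component selectors from the input-dependent mixing distribution and average the component predictions. The sketch below uses a hypothetical Gaussian mixing network and component network, not the paper's exact parameterization.

```python
import torch

def compound_predictive(x, mixing_net, component_net, n_samples=20):
    """Monte Carlo predictive of a compound-density-style model.

    mixing_net(x)       -> (mu, log_sigma) of an input-dependent Gaussian
                           over a latent component selector z.
    component_net(x, z) -> class probabilities of one mixture component.
    """
    mu, log_sigma = mixing_net(x)
    probs = 0.0
    for _ in range(n_samples):
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # sample a component
        probs = probs + component_net(x, z)
    return probs / n_samples        # average over sampled components
```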


Generative Code Modeling with Graphs    

tl;dr Representing programs as graphs including semantics helps when generating programs

Generative models for source code are an interesting structured prediction problem, requiring reasoning about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. Our model generates code by interleaving grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.


SENSE: SEMANTICALLY ENHANCED NODE SEQUENCE EMBEDDING    

tl;dr Node sequence embedding mechanism that captures both graph and text properties.

Effectively capturing graph node sequences in the form of vector embeddings is critical to many applications. We achieve this by (i) first learning vector embeddings of single graph nodes and (ii) then composing them to compactly represent node sequences. Specifically, we propose SENSE-S (Semantically Enhanced Node Sequence Embedding - for Single nodes), a skip-gram based novel embedding mechanism, for single graph nodes that co-learns graph structure as well as their textual descriptions. We demonstrate that SENSE-S vectors increase the accuracy of multi-label classification tasks by up to 50% and link-prediction tasks by up to 78% under a variety of scenarios using real datasets. Based on SENSE-S, we next propose generic SENSE to compute composite vectors that represent a sequence of nodes, where preserving the node order is important. We prove that this approach is efficient in embedding node sequences, and our experiments on real data confirm its high accuracy in node order decoding.


Automatically Composing Representation Transformations as a Means for Generalization    

tl;dr We explore the problem of compositional generalization and propose a means for endowing neural network architectures with the ability to compose themselves to solve these problems.

How can we build a learner that can capture the essence of what makes a hard problem more complex than a simple one, break the hard problem along characteristic lines into smaller problems it knows how to solve, and sequentially solve the smaller problems until the larger one is solved? To work towards this goal, we focus on learning to generalize in a particular family of problems that exhibit compositional and recursive structure: their solutions can be found by composing in sequence a set of reusable partial solutions. Our key idea is to recast the problem of generalization as a problem of learning algorithmic procedures: we can formulate a solution to this family as a sequential decision-making process over transformations between representations. Our formulation enables the learner to learn the structure and parameters of its own computation graph with sparse supervision, make analogies between problems by transforming one problem representation to another, and exploit modularity and reuse to scale to problems of varying complexity. Experiments on solving a variety of multilingual arithmetic problems demonstrate that our method discovers the hierarchical decomposition of a problem into its subproblems, generalizes out of distribution to unseen problem classes, and extrapolates to harder versions of the same problem, yielding a 10-fold reduction in sample complexity compared to a monolithic recurrent neural network. We further show that our method can compose learned spatial transformations to recover canonical MNIST digits from transformed ones.


DEEP GRAPH TRANSLATION    

No tl;dr =[

Deep generative models have achieved tremendous success in generating continuous data such as images and audio; however, few deep generative models have been proposed for discrete data such as graphs. The recently proposed approaches are typically unconditional generative models which have no control over the modes of the graphs being generated. In contrast, in this paper we are interested in a new problem named Deep Graph Translation: given an input graph, the goal is to infer a target graph by learning their underlying translation mapping. Graph translation could be highly desirable in many applications such as disaster management and rare event forecasting, where rare and abnormal graph patterns (e.g., traffic congestions and terrorism events) can be inferred prior to their occurrence, even without historical data on the abnormal patterns for the specific graph (e.g., a road network or human contact network). To this end, we propose a novel Graph-Translation Generative Adversarial Network (GT-GAN) which translates one mode of the input graphs to its target mode. GT-GAN consists of a graph translator, for which we propose new graph convolution and deconvolution layers to learn the global and local translation mapping. A new conditional graph discriminator is also proposed to classify target graphs by conditioning on input graphs. Extensive experiments on multiple synthetic and real-world datasets demonstrate the effectiveness and scalability of the proposed GT-GAN.
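
The conditional adversarial objective can be sketched as a standard conditional GAN loss over adjacency matrices, with hypothetical translator and discriminator modules; the paper's graph (de)convolution layers are abstracted away here.

```python
import torch
import torch.nn.functional as F

def conditional_gan_losses(translator, discriminator, a_in, a_target):
    """Conditional adversarial losses for graph translation (a generic
    conditional-GAN sketch, not the paper's exact objective).

    a_in, a_target : batches of input and target adjacency matrices.
    """
    a_fake = translator(a_in)
    real_logit = discriminator(a_in, a_target)        # condition on input graph
    fake_logit = discriminator(a_in, a_fake.detach())
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )
    g_logit = discriminator(a_in, a_fake)
    g_loss = F.binary_cross_entropy_with_logits(g_logit, torch.ones_like(g_logit))
    return d_loss, g_loss
```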


Deep models calibration with bayesian neural networks    

tl;dr We apply Bayesian neural networks to improve calibration

We apply Bayesian Neural Networks to improve the calibration of state-of-the-art deep neural networks. We show that, even with the most basic amortized approximate posterior distribution and a fast fully connected neural network for the likelihood, the Bayesian framework clearly outperforms other simple maximum-likelihood-based solutions that have recently shown very good performance, such as temperature scaling. As an example, we reduce the Expected Calibration Error (ECE) from 0.52 to 0.24 on CIFAR-10 and from 4.28 to 2.456 on CIFAR-100 on two Wide ResNets with 96.13% and 80.39% accuracy respectively, which are among the best results published for this task. We demonstrate our robustness and performance with experiments on a wide set of state-of-the-art computer vision models. Moreover, our approach acts off-line and can thus be applied to any probabilistic model regardless of the limitations that the model may present during training. This makes it suitable for calibrating systems that make use of pre-trained deep neural networks that are expensive to train for a specific task, or for directly training a calibrated deep convolutional model with Monte Carlo Dropout approximations, among others. However, our method is still complementary with any Bayesian Neural Network for further improvement.
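
For reference, Expected Calibration Error can be computed by binning predictions by confidence and comparing per-bin accuracy with per-bin confidence; the sketch below uses a common 15-bin definition, which may differ in detail from the paper's setup.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: size-weighted average of |accuracy - confidence| over equally
    spaced confidence bins.

    confidences : (n,) max predicted probability per example.
    correct     : (n,) 1 if the prediction was correct, else 0.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += mask.mean() * abs(acc - conf)
    return ece
```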


Zero-training Sentence Embedding via Orthogonal Basis    

tl;dr A simple and training-free approach for sentence embeddings with competitive performance compared with sophisticated models requiring either large amount of training data or prolonged training time.

We propose a simple and robust training-free approach for building sentence representations. Inspired by the Gram-Schmidt Process in geometric theory, we build an orthogonal basis of the subspace spanned by a word and its surrounding context in a sentence. We model the semantic meaning of a word in a sentence based on two aspects. One is its relatedness to the word vector subspace already spanned by its contextual words. The other is its novel semantic meaning which shall be introduced as a new basis vector perpendicular to this existing subspace. Following this motivation, we develop an innovative method based on orthogonal basis to combine pre-trained word embeddings into sentence representation. This approach requires zero training and zero parameters, along with efficient inference performance. We evaluate our approach on 11 downstream NLP tasks. Experimental results show that our model outperforms all existing zero-training alternatives in all the tasks and it is competitive to other approaches relying on either large amounts of labelled data or prolonged training time.
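
The core geometric step can be sketched with a QR decomposition: build an orthonormal basis for the context subspace, then split a word vector into the part explained by the context and the perpendicular "novel" part. The weighting scheme that combines these parts into a sentence embedding is omitted in this sketch.

```python
import numpy as np

def novelty_component(word_vec, context_vecs):
    """Split a word vector into the part explained by its context and the
    part orthogonal to it (a minimal sketch of the orthogonal-basis idea,
    not the paper's full weighting scheme).

    word_vec     : (d,) embedding of the word.
    context_vecs : (m, d) embeddings of its contextual words.
    """
    # Orthonormal basis of the subspace spanned by the context words.
    q, _ = np.linalg.qr(context_vecs.T)          # (d, rank)
    projection = q @ (q.T @ word_vec)            # part already covered by context
    novel = word_vec - projection                # new, perpendicular direction
    return projection, novel
```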