# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: progress inference, disentangling, max-marginal ranking regularization, smoothed gradient, dynamic model optimization ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "Random selection" has 100 results

##### Empirically Characterizing Overparameterization Impact on Convergence

tl;dr Empirically shows that larger models train in fewer training steps, because all factors in weight space traversal improve.

A long-held conventional wisdom states that larger models train more slowly when using gradient descent. This work challenges this widely-held belief, showing that larger models can potentially train faster despite the increasing computational requirements of each training step. In particular, we study the effect of network structure (depth and width) on halting time and show that larger models---wider models in particular---take fewer training steps to converge. We design simple experiments to quantitatively characterize the effect of overparametrization on weight space traversal. Results show that halting time improves when growing model's width for three different applications, and the improvement comes from each factor: The distance from initialized weights to converged weights shrinks with a power-law-like relationship, the average step size grows with a power-law-like relationship, and gradient vectors become more aligned with each other during traversal.

##### Real-time Neural-based Input Method

No tl;dr =[

The input method is an essential service on every mobile and desktop devices that provides text suggestions. It converts sequential keyboard inputs to the characters in its target language, which is indispensable for Japanese and Chinese users. Due to critical resource constraints and limited network bandwidth of the target devices, applying neural models to input method is not well explored. In this work, we apply a LSTM-based language model to input method and evaluate its performance for both prediction and conversion tasks with Japanese BCCWJ corpus. We articulate the bottleneck to be the slow softmax computation during conversion. To solve the issue, we propose incremental softmax approximation approach, which computes softmax with a selected subset vocabulary and fix the stale probabilities when the vocabulary is updated in future steps. We refer to this method as incremental selective softmax. The results show a two order speedup for the softmax computation when converting Japanese input sequences with a large vocabulary, reaching real-time speed on commodity CPU. We also exploit the model compressing potential to achieve a 92% model size reduction without losing accuracy.

##### RETHINKING SELF-DRIVING : MULTI -TASK KNOWLEDGE FOR BETTER GENERALIZATION AND ACCIDENT EXPLANATION ABILITY

tl;dr we proposed a new self-driving model which is composed of perception module for see and think and driving module for behave to acquire better generalization and accident explanation ability.

Current end-to-end deep learning driving models have two problems: (1) Poor generalization ability of unobserved driving environment when diversity of train- ing driving dataset is limited (2) Lack of accident explanation ability when driving models don’t work as expected. To tackle these two problems, rooted on the be- lieve that knowledge of associated easy task is benificial for addressing difficult task, we proposed a new driving model which is composed of perception module for see and think and driving module for behave, and trained it with multi-task perception-related basic knowledge and driving knowledge stepwisely. Specifi- cally segmentation map and depth map (pixel level understanding of images) were considered as what & where and how far knowledge for tackling easier driving- related perception problems before generating final control commands for difficult driving task. The results of experiments demonstrated the effectiveness of multi- task perception knowledge for better generalization and accident explanation abil- ity. With our method the average sucess rate of finishing most difficult navigation tasks in untrained city of CoRL test surpassed current benchmark method for 15 percent in trained weather and 20 percent in untrained weathers.

##### Sample Efficient Deep Neuroevolution in Low Dimensional Latent Space

No tl;dr =[

Current deep neuroevolution models are usually trained in a large parameter search space for complex learning tasks, e.g. playing video games, which needs billions of samples and thousands of search steps to obtain significant performance. This raises a question of whether we can make use of sequential data generated during evolution, encode input samples, and evolve in low dimensional parameter space with latent state input in a fast and efficient manner. Here we give an affirmative answer: we train a VAE to encode input samples, then an RNN to model environment dynamics and handle temporal information, and last evolve our low dimensional policy network in latent space. We demonstrate that this approach is surprisingly efficient: our experiments on Atari games show that within 10M frames and 30 evolution steps of training, our algorithm could achieve competitive result compared with ES, A3C, and DQN which need billions of frames.

##### ACIQ: Analytical Clipping for Integer Quantization of neural networks

tl;dr We analyze the trade-off between quantization noise and clipping distortion in low precision networks, and show marked improvements over standard quantization schemes that normally avoid clipping

We analyze the trade-off between quantization noise and clipping distortion in low precision networks. We identify the statistics of various tensors, and derive exact expressions for the mean-square-error degradation due to clipping. By optimizing these expressions, we show marked improvements over standard quantization schemes that normally avoid clipping. For example, just by choosing the accurate clipping values, more than 40\% accuracy improvement is obtained for the quantization of VGG-16 to 4-bits of precision. Our results have many applications for the quantization of neural networks at both training and inference time.

##### KNOWLEDGE DISTILL VIA LEARNING NEURON MANIFOLD

tl;dr A new knowledge distill method for transfer learning

Although deep neural networks show their extraordinary power in various tasks, they are not feasible for deploying such large models on embedded systems due to high computational cost and storage space limitation. The recent work knowledge distillation (KD) aims at transferring model knowledge from a well-trained teacher model to a small and fast student model which can significantly help extending the usage of large deep neural networks on portable platform. In this paper, we show that, by properly defining the neuron manifold of deep neuron network (DNN), we can significantly improve the performance of student DNN networks through approximating neuron manifold of powerful teacher network. To make this, we propose several novel methods for learning neuron manifold from DNN model. Empowered with neuron manifold knowledge, our experiments show the great improvement across a variety of DNN architectures and training data. Compared with other KD methods, our Neuron Manifold Transfer (NMT) has best transfer ability of the learned features.

##### Step-wise Sensitivity Analysis: Identifying Partially Distributed Representations for Interpretable Deep Learning

tl;dr We find dependency graphs between learned representations as a first step towards building decision trees to interpret the representation manifold.

In this paper, we introduce a novel method, called step-wise sensitivity analysis, which makes three contributions towards increasing the interpretability of Deep Neural Networks (DNNs). First, we are the first to suggest a methodology that aggregates results across input stimuli to gain model-centric results. Second, we linearly approximate the neuron activation and propose to use the outlier weights to identify distributed code. Third, our method constructs a dependency graph of the relevant neurons across the network to gain fine-grained understanding of the nature and interactions of DNN's internal features. The dependency graph illustrates shared subgraphs that generalise across 10 classes and can be clustered into semantically related groups. This is the first step towards building decision trees as an interpretation of learned representations.

tl;dr Using noise to define the divergence between distributions with different support.

For distributions $p$ and $q$ with different support, the divergence $\div{p}{q}$ generally will not exist. We define a spread divergence $\sdiv{p}{q}$ on modified $p$ and $q$ and describe sufficient conditions for the existence of such a divergence. We give examples of using a spread divergence to train implicit generative models, including linear models (Principal Components Analysis and Independent Components Analysis) and non-linear models (Deep Generative Networks).

##### Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs

tl;dr Language generation using seq2seq models which produce word embeddings instead of a softmax based distribution over the vocabulary at each step enabling much faster training while maintaining generation quality

The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute which limits the vocabulary size to a subset of most frequent types; and it has a large memory footprint. We propose a general technique for replacing the softmax layer with a continuous embedding layer. Our primary innovations are a novel probabilistic loss, and a training and inference procedure in which we generate a probability distribution over pre-trained word embeddings, instead of a multinomial distribution over the vocabulary obtained via softmax. We evaluate this new class of sequence-to-sequence models with continuous outputs on the task of neural machine translation. We show that our models obtain upto 2.5x speed-up in training time while performing on par with the state-of-the-art models in terms of translation quality. These models are capable of handling very large vocabularies without compromising on translation quality. They also produce more meaningful errors than in the softmax-based models, as these errors typically lie in a subspace of the vector space of the reference translations.

##### Locally Linear Unsupervised Feature Selection

tl;dr Unsupervised feature selection through capturing the local linear structure of the data

The paper, interested in unsupervised feature selection, aims to retain the features best accounting for the local patterns in the data. The proposed approach, called Locally Linear Unsupervised Feature Selection, relies on a dimensionality reduction method to characterize such patterns; each feature is thereafter assessed according to its compliance w.r.t. the local patterns, taking inspiration from Locally Linear Embedding (Roweis and Saul, 2000). The experimental validation of the approach on the scikit-feature benchmark suite demonstrates its effectiveness compared to the state of the art.

##### Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

tl;dr We propose a framework that incorporates planning for efficient exploration and learning in complex environments.

We propose a plan online and learn offline framework for the setting where an agent with an internal model needs to continually act and learn in the world. Our work builds on the synergistic relationship between local trajectory optimization, global value function learning, and exploration. We study how trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. Combining these components enable solutions to complex complex control tasks like humanoid locomotion and dexterous in-hand manipulation in the equivalent of a few minutes of experience in the real world.

##### Overlapping Community Detection with Graph Neural Networks

tl;dr Detecting overlapping communities in graphs using graph neural networks

Community detection in graphs is of central importance in graph mining, machine learning and network science. Detecting overlapping communities is especially challenging, and remains an open problem. Motivated by the success of graph-based deep learning in other graph-related tasks, we study the applicability of this framework for overlapping community detection. We propose a probabilistic model for overlapping community detection based on the graph neural network architecture. Despite its simplicity, our model outperforms the existing approaches in the community recovery task by a large margin. Moreover, due to the inductive formulation, the proposed model is able to perform out-of-sample community detection for nodes that were not present at training time

##### Classifier-agnostic saliency map extraction

tl;dr We propose a new saliency map extraction method which results in extracting higher quality maps.

Extracting saliency maps, which indicate parts of the image important to classification, requires many tricks to achieve satisfactory performance when using classifier-dependent methods. Instead, we propose classifier-agnostic saliency map extraction, which finds all parts of the image that any classifier could use, not just one given in advance. We observe that the proposed approach extracts higher quality saliency maps and outperforms existing weakly-supervised localization techniques, setting the new state of the art result on the ImageNet dataset.

##### Policy Transfer with Strategy Optimization

tl;dr We propose a policy transfer algorithm that can overcome large and challenging discrepancies in the system dynamics such as latency, actuator modeling error, etc.

Computer simulation provides an automatic and safe way for training robotic control policies to achieve complex tasks such as locomotion. However, a policy trained in simulation usually does not transfer directly to the real hardware due to the differences between the two environments. Transfer learning using domain randomization is a promising approach, but it usually assumes that the target environment is close to the distribution of the training environments, thus relying heavily on accurate system identification. In this paper, we present a different approach that leverages domain randomization for transferring control policies to unknown environments. The key idea that, instead of learning a single policy in the simulation, we simultaneously learn a family of policies that exhibit different behaviors. When tested in the target environment, we directly search for the best policy in the family based on the task performance, without the need to identify the dynamic parameters. We evaluate our method on five simulated robotic control problems with different discrepancies in the training and testing environment and demonstrate that our method can overcome larger modeling errors compared to training a robust policy or an adaptive policy.

No tl;dr =[

We address the problem of recovering an underlying signal from lossy and inaccurate measurements in an unsupervised fashion. Typically, we consider situations where there is no background knowledge on the structure of the unknown signal and where we do not have access to signal-measurement pairs, nor even unpaired signal data. We introduce a general framework, where a neural network is trained to recover plausible signals from the measurements in the data, by introducing an adversarial and a reconstruction loss. We evaluate our framework on different noise instances, and show that our approach yields comparable results to model variants trained with stronger supervision.

##### Episodic Curiosity through Reachability

tl;dr We propose a novel model of curiosity based on episodic memory and the ideas of reachability which allows us to overcome the known "couch-potato" issues of prior work.

Rewards are sparse in the real world and most today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself --- thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward --- making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory --- which incorporates rich information about environment dynamics. This allows us to overcome the known "couch-potato" issues of prior work --- when the agent finds a way to instantly gratify itself by exploiting actions which lead to unpredictable consequences. We test our approach in visually rich 3D environments in ViZDoom and DMLab. In ViZDoom, our agent learns to successfully navigate to a distant goal at least 2 times faster than the state-of-the-art curiosity method ICM. In DMLab, our agent generalizes well to new procedurally generated levels of the game --- reaching the goal at least 2 times more frequently than ICM on test mazes with very sparse reward.

##### Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

tl;dr We address sample inefficiency and reward bias in adversarial imitation learning algorithms such as GAIL and AIRL.

We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first problem is implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. Secondly, even though these algorithms can learn from few expert demonstrations, they require a prohibitively large number of interactions with the environment in order to imitate the expert for many real-world applications. In order to address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments.

tl;dr Learning to synthesize raw waveform audio with GANs

While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to audio generation. Unlike for images, a barrier to success is that the best discriminative representations for audio tend to be non-invertible, and thus cannot be used to synthesize listenable outputs. In this paper we introduce WaveGAN, a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio. Our experiments demonstrate that WaveGAN can produce intelligible words from a small vocabulary of speech, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. Qualitatively, we find that human judges prefer the sound quality of generated examples from WaveGAN over those from a method which naïvely apply GANs on image-like audio feature representations.

##### Structured Content Preservation for Unsupervised Text Style Transfer

No tl;dr =[

Text style transfer aims to modify the style of a sentence while keeping its content unchanged. Recent style transfer systems often fail to faithfully preserve the content after changing the style. This paper proposes a structured content preserving model that leverages linguistic information in the structured fine-grained supervisions to better preserve the style-independent content \footnote{Henceforth, we refer to style-independent content as content, for simplicity.} during style transfer. In particular, we achieve the goal by devising rich model objectives based on both the sentence's lexical information and a language model that conditions on content. The resulting model therefore is encouraged to retain the semantic meaning of the target sentences. We perform extensive experiments that compare our model to other existing approaches in the tasks of sentiment and political slant transfer. Our model achieves significant improvement in terms of both content preservation and style transfer in automatic and human evaluation.

##### Set Transformer

tl;dr Attention-based neural network to process set-structured data

Many machine learning tasks such as multiple instance learning, 3D shape recognition and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the permutation of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating increased performance compared to recent methods for set-structured data.

##### Successor Uncertainties: exploration and uncertainty in temporal difference learning

No tl;dr =[

We consider the problem of balancing exploration and exploitation in sequential decision making problems. To explore efficiently, it is vital to consider the uncertainty over all consequences of a decision, and not just those that follow immediately; the uncertainties need to be propagated according to the dynamics of the problem. To this end, we develop Successor Uncertainties, a probabilistic model for the state-action function of a Markov Decision Process that propagates uncertainties in a coherent and scalable way. Our model achieves this by combining successor features and online Bayesian uncertainty estimation. We relate our approach to other classical and contemporary methods for exploration and present an empirical analysis of successor uncertainties.

##### Combinatorial Attacks on Binarized Neural Networks

tl;dr Gradient-based attacks on binarized neural networks are not effective due to the non-differentiability of such networks; Our IPROP algorithm solves this problem using integer optimization

Binarized Neural Networks (BNNs) have recently attracted significant interest due to their computational efficiency. Concurrently, it has been shown that neural networks may be overly sensitive to "attacks" -- tiny adversarial changes in the input -- which may be detrimental to their use in safety-critical domains. Designing attack algorithms that effectively fool trained models is a key step towards learning robust neural networks. The discrete, non-differentiable nature of BNNs, which distinguishes them from their full-precision counterparts, poses a challenge to gradient-based attacks. In this work, we study the problem of attacking a BNN through the lens of combinatorial and integer optimization. We propose a Mixed Integer Linear Programming (MILP) formulation of the problem. While exact and flexible, the MILP quickly becomes intractable as the network and perturbation space grow. To address this issue, we propose IProp, a decomposition-based algorithm that solves a sequence of much smaller MILP problems. Experimentally, we evaluate both proposed methods against the standard gradient-based attack (FGSM) on MNIST and Fashion-MNIST, and show that IProp performs favorably compared to FGSM, while scaling beyond the limits of the MILP.

##### Towards Decomposed Linguistic Representation with Holographic Reduced Representation

tl;dr Holographic Reduced Representation enables language model to discover linguistic roles.

The vast majority of neural models in Natural Language Processing adopt a form of structureless distributed representations. While these models are powerful at making predictions, the representational form is rather crude and does not provide insights into linguistic structures. In this paper we introduce novel language models with representations informed by the framework of Holographic Reduced Representation (HRR). This allows us to inject structures directly into our word-level and chunk-level representations. Our analyses show that by using HRR as a structured compositional representation, our models are able to discover crude linguistic roles, which roughly resembles a classic division between syntax and semantics.

##### A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

tl;dr We use empirical tools of mode connectivity and SVCCA to investigate neural network training heuristics of learning rate restarts, warmup and knowledge distillation.

The convergence rate and final performance of common deep learning models have significantly benefited from recently proposed heuristics such as learning rate schedules, knowledge distillation, skip connections and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining the efficacy of these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit the empirical analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz. mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons why the heuristics succeed. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed in the deeper layers.

##### Integer Networks for Data Compression with Latent-Variable Models

tl;dr We train variational models with quantized networks for computational determinism. This enables using them for cross-platform data compression.

We consider the problem of using variational latent-variable models for data compression. For such models to produce a compressed binary sequence, which is the universal data representation in a digital world, the latent representation needs to be subjected to entropy coding. Range coding as an entropy coding technique is optimal, but it can fail catastrophically if the computation of the prior differs even slightly between the sending and the receiving side. Unfortunately, this is a common scenario when floating point math is used and the sender and receiver operate on different hardware or software platforms, as numerical round-off is often platform dependent. We propose using integer networks as a universal solution to this problem, and demonstrate that they enable reliable cross-platform encoding and decoding of images using variational models.

##### The effectiveness of layer-by-layer training using the information bottleneck principle

No tl;dr =[

The recently proposed information bottleneck (IB) theory of deep nets suggests that during training, each layer attempts to maximize its mutual information (MI) with the target labels (so as to allow good prediction accuracy), while minimizing its MI with the input (leading to effective compression and thus good generalization). To date, evidence of this phenomenon has been indirect and aroused controversy due to theoretical and practical complications. In particular, it has been pointed out that the MI with the input is theoretically inﬁnite in many cases of interest, and that the MI with the target is fundamentally difﬁcult to estimate in high dimensions. As a consequence, the validity of this theory has been questioned. In this paper, we overcome these obstacles by two means. First, as previously suggested, we replace the MI with the input by a noise-regularized version, which ensures it is ﬁnite. As we show, this modiﬁed penalty in fact acts as a form of weight decay regularization. Second, to obtain accurate (noise regularized) MI estimates between an intermediate representation and the input, we incorporate the strong prior-knowledge we have about their relation, into the recently proposed MI estimator of Belghazi et al. (2018). With this scheme, we are able to stably train each layer independently to explicitly optimize the IB functional. Surprisingly, this leads to enhanced prediction accuracy, thus directly validating the IB theory of deep nets for the ﬁrst time.

##### Instance-aware Image-to-Image Translation

tl;dr We propose a novel method to incorporate the set of instance attributes for image-to-image translation.

Unsupervised image-to-image translation has gained considerable attention due to the recent impressive progress based on generative adversarial networks (GANs). However, previous methods often fail in challenging cases, in particular, when an image has multiple target instances and a translation task involves significant changes in shape, e.g., translating pants to skirts in fashion images. To tackle the issues, we propose a novel method, coined instance-aware GAN (InstaGAN), that incorporates the instance information (e.g., object segmentation masks) and improves multi-instance transfiguration. The proposed method translates both an image and the corresponding set of instance attributes while maintaining the permutation invariance property of the instances. To this end, we introduce a context preserving loss that encourages the network to learn the identity function outside of target instances. We also propose a sequential mini-batch inference/training technique that handles multiple instances with a limited GPU memory and enhances the network to generalize better for multiple instances. Our comparative evaluation demonstrates the effectiveness of the proposed method on different image datasets, in particular, in the aforementioned challenging cases.

##### ALISTA: Analytic Weights Are As Good As Learned Weights in LISTA

No tl;dr =[

Deep neural networks based on unfolding an iterative algorithm, for example, LISTA (learned iterative shrinkage thresholding algorithm), have been an empirical success for sparse signal recovery. The weights of these neural networks are currently determined by data-driven “black-box” training. In this work, we propose Analytic LISTA (ALISTA), where the weight matrix in LISTA is computed as the solution to a data-free optimization problem, leaving only the stepsize and threshold parameters to data-driven learning. This signiﬁcantly simpliﬁes the training. Speciﬁcally, the data-free optimization problem is based on coherence minimization. We show our ALISTA retains the optimal linear convergence proved in (Chen et al., 2018) and has a performance comparable to LISTA. Furthermore, we extend ALISTA to convolutional linear operators, again determined in a data-free manner. We also propose a feed-forward framework that combines the data-free optimization and ALISTA networks from end to end, one that can be jointly trained to gain robustness to small perturbations in the encoding model.

##### Hierarchical Bayesian Modeling for Clustering Sparse Sequences in the Context of Group Profiling

tl;dr Hierarchical Bayesian Modeling for Clustering Sparse Sequences ; user group modeling using behavioral data

This paper proposes a hierarchical Bayesian model for clustering sparse sequences.This is a mixture model and does not need the data to be represented by a Gaussian mixture and that gives significant modelling freedom.It also generates a very interpretable profile for the discovered latent groups.The data that was used for the work have been contributed by a restaurant loyalty program company. The data is a collection of sparse sequences where each entry of each sequence is the number of user visits of one week to some restaurant. This algorithm successfully clustered the data and calculated the expected user affiliation in each cluster.

##### EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE

No tl;dr =[

Making decisions requires information relevant to the task at hand. Many real-life decision-making situations allow acquiring further relevant information at a specific cost. For example, in assessing the health status of a patient we may decide to take additional measurements such as diagnostic tests or imaging scans before making a final assessment. More information that is relevant allows for better decisions but it may be costly to acquire all of this information. How can we trade off the desire to make good decisions with the option to acquire further information at a cost? To this end, we propose a principled framework, named EDDI (Efficient Dynamic Discovery of high-value Information), based on the theory of Bayesian experimental design. In EDDI we propose a novel partial variational autoencoder (Partial VAE), to efficiently handle missing data over varying subsets of known information. EDDI combines this Partial VAE with an acquisition function that maximizes expected information gain on a set of target variables. EDDI is efficient and demonstrates that dynamic discovery of high-value information is possible; we show cost reduction at the same decision quality and improved decision quality at the same cost in benchmarks and in two health-care applications.. We believe there is great potential for realizing these gains in real-world decision support systems.

##### An experimental study of layer-level training speed and its impact on generalization

tl;dr This paper provides empirical evidence that 1) the speed at which each layer trains influences generalization and 2) this phenomenon is at the root of weight decay's and adaptive gradient methods' impact on generalization.

How optimization influences the generalization ability of a DNN is still an active area of research. This work aims to unveil and study a factor of influence: we show that the speed at which each layer trains, measured by the rotation rate of each layer's weight vector (or layer rotation rate), has a consistent and substantial impact on generalization. We develop a visualization technique and an optimization algorithm to monitor and control the layer rotation rates during training, and show across multiple tasks and training settings that rotating all the layers' weights synchronously and at high rate repeatedly induces the best generalization performance. Going further, our experiments suggest that weight decay is an essential ingredient for inducing such beneficial layer rotation rates with SGD, and that the impact of adaptive gradient methods on training speed and generalization is solely due to the modifications they induce to each layer's training speed compared to SGD. Besides these fundamental findings, we also expect that the tools we introduce will reduce the meta-parameter tuning required to get the best generalization out of a deep network.

##### A Study of Robustness of Neural Nets Using Approximate Feature Collisions

No tl;dr =[

In recent years, various studies have focused on the robustness of neural nets. While it is known that neural nets are not robust to examples with adversarially chosen perturbations as a result of linear operations on the input data, we show in this paper there could be a convex polytope within which all examples are misclassified by neural nets due to the properties of ReLU activation functions. We propose a way to finding such polytopes empirically and demonstrate that such polytopes exist in practice. Furthermore, we show that such polytopes exist even after constraining the examples to be a composition of image patches, resulting in perceptibly different examples at different locations in the polytope that are all misclassified.

##### Pathologies in information bottleneck for deterministic supervised learning

tl;dr Information bottleneck has pathologies in supervised learning scenarios where the output is a deterministic function of the input.

Information bottleneck (IB) is a method for extracting information from one random variable X that is relevant for predicting another random variable Y. To do so, IB identifies an intermediate "bottleneck variable" T that has low mutual information I(X;T) and high mutual information I(Y;T). The "IB curve" characterizes the set of bottleneck variables that achieve maximal I(Y;T) for a given I(X;T), and is typically explored by optimizing the "IB Lagrangian", I(Y;T) - βI(X;T). Recently, there has been interest in applying IB to supervised learning, particularly for classification problems that use neural networks. In most classification problems, the output class Y is a deterministic function of the input X, which we refer to as "deterministic supervised learning". We demonstrate three pathologies that arise when IB is used in any scenario where Y is a deterministic function of X: (1) the IB curve cannot be recovered by optimizing the IB Lagrangian for different values of β; (2) there are "uninteresting" solutions at all points of the IB curve; and (3) for classifiers that achieve low error rates, the activity of different hidden layers will not exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We finish by demonstrating these issues on the MNIST dataset.

##### Learning Heuristics for Automated Reasoning through Reinforcement Learning

tl;dr RL finds better heuristics for automated reasoning algorithms.

We demonstrate how to learn efficient heuristics for automated reasoning algorithms through deep reinforcement learning. We focus on backtracking search algorithms for quantified Boolean logics, which already can solve formulas of impressive size - up to 100s of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For challenging problems, the heuristic learned through our approach reduces execution time by >=90% compared to the existing handwritten heuristics.

##### Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

tl;dr We propose a cross-lingual document classification framework for related language pairs.

Text classification must sometimes be applied in situations with no training data in a target language. However, training data may be available in a related language. We introduce a cross-lingual document classification framework CACO between related language pairs. To best use limited training data, our transfer learning scheme exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. CACO models trained under low-resource settings rival cross-lingual word embedding models trained under high-resource settings on related language pairs.

##### DL2: Training and Querying Neural Networks with Logic

tl;dr A differentiable loss for logic constraints for training and querying neural networks.

We present DL2, a system for training and querying neural networks with logical constraints. The key idea is to translate these constraints into a differentiable loss with desirable mathematical properties and to then either train with this loss in an iterative manner or to use the loss for querying the network for inputs subject to the constraints. We empirically demonstrate that DL2 is effective in both training and querying scenarios, across a range of constraints and data sets.

##### Intrinsic Social Motivation via Causal Influence in Multi-Agent RL

tl;dr We reward agents for having a causal influence on the actions of other agents, and show that this gives rise to better cooperation and more meaningful emergent communication protocols.

We derive a new intrinsic social motivation for multi-agent reinforcement learning (MARL), in which agents are rewarded for having causal influence over another agent's actions, where causal influence is assessed using counterfactual reasoning. The reward does not depend on observing another agent's reward function, and is thus a more realistic approach to MARL than taken in previous work. We show that the causal influence reward is related to maximizing the mutual information between agents' actions. We test the approach in challenging social dilemma environments, where it consistently leads to enhanced cooperation between agents and higher collective reward. Moreover, we find that rewarding influence can lead agents to develop emergent communication protocols. Therefore, we also employ influence to train agents to use an explicit communication channel, and find that it leads to more effective communication and higher collective reward. Finally, we show that influence can be computed by equipping each agent with an internal model that predicts the actions of other agents. This allows the social influence reward to be computed without the use of a centralised controller, and as such represents a significantly more general and scalable inductive bias for MARL with independent agents.

##### Improving machine classification using human uncertainty measurements

tl;dr improving classifiers using human uncertainty measurements

As deep CNN classifier performance using ground-truth labels has begun to asymptote at near-perfect levels, a key aim for the field is to extend training paradigms to capture further useful structure in natural image data and improve model robustness and generalization. In this paper, we present a novel natural image benchmark for making this extension, which we call CIFAR10H. This new dataset comprises a human-derived, full distribution over labels for each image of the CIFAR10 test set, offering the ability to assess the generalization of state-of-the-art CIFAR10 models, as well as investigate the effects of including this information in model training. We show that classification models trained on CIFAR10 do not generalize as well to our dataset as it does to traditional extensions, and that models fine-tuned using our label information are able to generalize better to related datasets, complement popular data augmentation schemes, and provide robustness to adversarial attacks. We explain these improvements in terms of better empirical approximations to the expected loss function over natural images and their categories in the visual world.

##### Countering Language Drift via Grounding

tl;dr Grounding helps avoid language drift during fine-tuning natural language agents with policy gradients.

While reinforcement learning (RL) shows a lot of promise for natural language processing—e.g. when fine-tuning natural language systems for optimizing a certain objective—there has been little investigation into potential language drift: when an external reward is used to train a system, the agents’ communication protocol may easily and radically diverge from natural language. By re-casting translation as a communication game, we show that language drift indeed happens when pre-trained agents are fine-tuned with policy gradient methods. We contend that simply adding a "naturalness" constraint to the reward, e.g. by using language model log likelihood, does not fully address the issue, and argue that (perceptual) grounding is required. That is, while language model constraints impose syntactic conformity, they do not lead to semantic correspondence. Our experiments show that grounded models give the best communication performance, while retaining English syntax along with the ability to convey the intended semantics.

##### Predicted Variables in Programming

tl;dr We present Predicted Variables, an approach to making machine learning a first class citizen in programming languages.

We present Predicted Variables, an approach to making machine learning a first class citizen in programming languages. There is a growing divide in approaches to building systems: using human experts (e.g. programming) on the one hand, and using behavior learned from data (e.g. ML) on the other hand. PVars aim to make ML in programming as easy as if' statements and with that hybridize ML with programming. We leverage the existing concept of variables and create a new type, a predicted variable. PVars are akin to native variables with one important distinction: PVars determine their value using ML when evaluated. We describe PVars and their interface, how they can be used in programming, and demonstrate the feasibility of our approach on three algorithmic problems: binary search, Quicksort, and caches. We show experimentally that PVars are able to improve over the commonly used heuristics and lead to a better performance than the original algorithms. As opposed to previous work applying ML to algorithmic problems, PVars have the advantage that they can be used within the existing frameworks and do not require the existing domain knowledge to be replaced. PVars allow for a seamless integration of ML into existing systems and algorithms. Our PVars implementation currently relies on standard Reinforcement Learning (RL) methods. To learn faster, PVars use the heuristic function, which they are replacing, as an initial function. We show that PVars quickly pick up the behavior of the initial function and then improve performance beyond that without ever performing substantially worse -- allowing for a safe deployment in critical applications.

##### D-GAN: Divergent generative adversarial network for positive unlabeled learning and counter-examples generation

tl;dr A new two-stage positive unlabeled learning approach with GAN

Positive Unlabeled learning task remains an interesting challenge in the context of image analysis. Recent approaches suggest to exploit the GANs abilities to answer this problem. In this paper, we propose a new approach named Divergent-GAN (D-GAN). It keeps the light adversarial architecture of the PGAN method, with a better robustness counter the varying images complexity, while simultaneously allowing the same functionalities as the GenPU method, like the generation of relevant counter-examples. However, this is achieved without the need of prior knowledge, nor an onerous architecture and framework. Its functionning is based on the combination between the behaviour principles of Positive Unlabeled learning classification and the adversarial GAN training. Experimental results show that this divergent adversarial framework outperforms the state of the art PU learning in terms of prediction accuracy, training robustness, and its ability to work on both simple and complex real images. Combined with an additional generator, the proposed approach even allows to accomplish noisy labeled learning, and thus opening new application perspectives for GANs architectures.

##### Probabilistic Binary Neural Networks

tl;dr We introduce a stochastic training method for training Binary Neural Network with both binary weights and activations.

Low bit-width weights and activations are an effective way of combating the increasing need for both memory and compute power of Deep Neural Networks. In this work, we present a probabilistic training method for Neural Network with both binary weights and activations, called PBNet. By embracing stochasticity during training, we circumvent the need to approximate the gradient of functions for which the derivative is zero almost always, such as $\textrm{sign}(\cdot)$, while still obtaining a fully Binary Neural Network at test time. Moreover, it allows for anytime ensemble predictions for improved performance and uncertainty estimates by sampling from the weight distribution. Since all operations in a layer of the PBNet operate on random variables, we introduce stochastic versions of Batch Normalization and max pooling, which transfer well to a deterministic network at test time. We evaluate two related training methods for the PBNet: one in which activation distributions are propagated throughout the network, and one in which binary activations are sampled in each layer. Our experiments indicate that sampling the binary activations is an important element for stochastic training of binary Neural Networks.

##### On the Statistical and Information Theoretical Characteristics of DNN Representations

No tl;dr =[

It has been common to argue or imply that a regularizer can be used to alter a statistical property of a hidden layer's representation and thus improve generalization or performance of deep networks. For instance, dropout has been known to improve performance by reducing co-adaptation, and representational sparsity has been argued as a good characteristic because many data-generation processes have only a small number of factors that are independent. In this work, we analytically and empirically investigate the popular characteristics of learned representations, including correlation, sparsity, dead unit, rank, and mutual information, and disprove many of the \textit{conventional wisdom}. We first show that infinitely many Identical Output Networks (IONs) can be constructed for any deep network with a linear layer, where any invertible affine transformation can be applied to alter the layer's representation characteristics. The existence of ION proves that the correlation characteristics of representation can be either low or high for a well-performing network. Extensions to ReLU layers are provided, too. Then, we consider sparsity, dead unit, and rank to show that only loose relationships exist among the three characteristics. It is shown that a higher sparsity or additional dead units do not imply a better or worse performance when the rank of representation is fixed. We also develop a rank regularizer and show that neither representation sparsity nor lower rank is helpful for improving performance even when the data-generation process has only a small number of independent factors. Mutual information $I(\z_l;\x)$ and $I(\z_l;\y)$ are investigated as well, and we show that regularizers can affect $I(\z_l;\x)$ and thus indirectly influence the performance. Finally, we explain how a rich set of regularizers can be used as a powerful tool for performance tuning.

##### Detecting Out-Of-Distribution Samples Using Low-Order Deep Features Statistics

tl;dr Detecting out-of-distribution samples by using low-order feature statistics without requiring any change in underlying DNN.

The ability to detect when an input sample was not drawn from the training distribution is an important desirable property of deep neural networks. In this paper, we show that a simple ensembling of first and second order deep feature statistics can be exploited to effectively differentiate in-distribution and out-of-distribution samples. Specifically, we observe that the mean and standard deviation within feature maps differs greatly between in-distribution and out-of-distribution samples. Based on this observation, we propose a simple and efficient plug-and-play detection procedure that does not require re-training, pre-processing or changes to the model. The proposed method outperforms the state-of-the-art by a large margin in all standard benchmarking tasks, while being much simpler to implement and execute. Notably, our method improves the true negative rate from 86.6% to 96.8% when 95% of in-distribution (CIFAR-100) are correctly detected using a DenseNet and the out-of-distribution dataset is TinyImageNet resize. The source code of our method will be made publicly available.

##### ANYTIME MINIBATCH: EXPLOITING STRAGGLERS IN ONLINE DISTRIBUTED OPTIMIZATION

tl;dr Accelerate distributed optimization by exploiting stragglers.

Distributed optimization is vital in solving large-scale machine learning problems. A widely-shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch before the system can proceed to the next epoch. In such settings, slow nodes, called stragglers, can greatly slow progress. To mitigate the impact of stragglers, we propose an online distributed optimization method called Anytime Minibatch. In this approach, all nodes are given a fixed time to compute the gradients of as many data samples as possible. The result is a variable per-node minibatch size. Workers then get a fixed communication time to average their minibatch gradients via several rounds of consensus, which are then used to update primal variables via dual averaging. Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete. We present a convergence analysis and analyze the wall time performance. We evaluate the method empirically using the Amazon Elastic Compute Cloud (EC2) and observe a 30-50% improvement in convergence speed.

##### N-Ary Quantization for CNN Model Compression and Inference Acceleration

tl;dr We propose a quantization scheme for weights and activations of deep neural networks. This reduces the memory footprint substantially and accelerates inference.

The tremendous memory and computational complexity of Convolutional Neural Networks (CNNs) prevents the inference deployment on resource-constrained systems. As a result, recent research focused on CNN optimization techniques, in particular quantization, which allows weights and activations of layers to be represented with just a few bits while achieving impressive prediction performance. However, aggressive quantization techniques still fail to achieve full-precision prediction performance on state-of-the-art CNN architectures on large-scale classification tasks. In this work we propose a method for weight and activation quantization that is scalable in terms of quantization levels (n-ary representations) and easy to compute while maintaining the performance close to full-precision CNNs. Our weight quantization scheme is based on trainable scaling factors and a nested-means clustering strategy which is robust to weight updates and therefore exhibits good convergence properties. The flexibility of nested-means clustering enables exploration of various n-ary weight representations with the potential of high parameter compression. For activations, we propose a linear quantization strategy that takes the statistical properties of batch normalization into account. We demonstrate the effectiveness of our approach using state-of-the-art models on ImageNet.

##### The loss landscape of overparameterized neural networks

No tl;dr =[

We explore some mathematical features of the loss landscape of overparameterized neural networks. A priori one might imagine that the loss function looks like a typical function from $\mathbb{R}^n$ to $\mathbb{R}$ - in particular, nonconvex, with discrete global minima. In this paper, we prove that in at least one important way, the loss function of an overparameterized neural network does not look like a typical function. If a neural net has $n$ parameters and is trained on $d$ data points, with $n>d$, we show that the locus $M$ of global minima of $L$ is usually not discrete, but rather an $n-d$ dimensional submanifold of $\mathbb{R}^n$. In practice, neural nets commonly have orders of magnitude more parameters than data points, so this observation implies that $M$ is typically a very high-dimensional subset of $\mathbb{R}^n$.

##### Variational Sparse Coding

tl;dr We explore the intersection of VAEs and sparse coding.

Variationalauto-encoders(VAEs)offeratractableapproachwhenperformingapproximate inference in otherwise intractable generative models. However, standard VAEs often produce latent codes that are disperse and lack interpretability, thus making the resulting representations unsuitable for auxiliary tasks (e.g. classiﬁcation) and human interpretation. We address these issues by merging ideas fromvariationalauto-encodersandsparsecoding,andproposetoexplicitlymodel sparsity in the latent space of a VAE with a Spike and Slab prior distribution. We derive the variational lower bound using a discrete mixture recognition function thereby making approximate posterior inference as computational efﬁcient as in the standard VAE case. With the new approach, we are able to infer truly sparse representations with generally intractable non-linear probabilistic models. We show that these sparse representations are advantageous over standard VAE representationsontwobenchmarkclassiﬁcationtasks(MNISTandFashion-MNIST) by demonstratingimproved classiﬁcation accuracy and signiﬁcantly increased robustness to the number of latent dimensions. Furthermore, we demonstrate qualitatively that the sparse elements capture subjectively understandable sources of variation.

##### Computation-Efficient Quantization Method for Deep Neural Networks

tl;dr A simple computation-efficient quantization training method for CNNs and RNNs.

Deep Neural Networks, being memory and computation intensive, are a challenge to deploy in smaller devices. Numerous quantization techniques have been proposed to reduce the inference latency/memory consumption. However, these techniques impose a large overhead on the training procedure or need to change the training process. We present a non-intrusive quantization technique based on re-training the full precision model, followed by directly optimizing the corresponding binary model. The quantization training process takes no longer than the original training process. We also propose a new loss function to regularize the weights, resulting in reduced quantization error. Combining both help us achieve full precision accuracy on CIFAR dataset using binary quantization. We also achieve full precision accuracy on WikiText-2 using 2 bit quantization. Comparable results are also shown for ImageNet. We also present a 1.5 bits hybrid model exceeding the performance of TWN LSTM model for WikiText-2.

##### Improved Language Modeling by Decoding the Past

tl;dr Decoding the last token in the context using the predicted next token distribution acts as a regularizer and improves language modeling.

Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our past decode regularization (PDR) method achieves state-of-the-art word level perplexity on the Penn Treebank (55.6) and WikiText-2 (63.5) datasets and bits-per-character on the Penn Treebank Character (1.169) dataset for character level language modeling. Using dynamic evaluation, we also achieve the first sub 50 perplexity of 49.3 on the Penn Treebank test set.

##### Better Generalization with On-the-fly Dataset Denoising

tl;dr We introduce a fast and easy-to-implement algorithm that is robust to dataset noise.

Memorization in over-parameterized neural networks can severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are to hard avoid in extremely large datasets. We address this problem using the implicit regularization effect of stochastic gradient descent with large learning rates, which we find to be able to separate clean and mislabeled examples with remarkable success using loss statistics. We leverage this to identify and on-the-fly discard mislabeled examples using a threshold on their losses. This leads to On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples, while introducing almost zero computational overhead. Empirical results demonstrate the effectiveness of ODD on several datasets containing artificial and real-world mislabeled examples.

##### Feature quantization for parsimonious and meaningful predictive models

tl;dr We tackle discretization of continuous features and grouping of factor levels as a representation learning problem and provide a rigorous way of estimating the best quantization to yield good performance and interpretability.

For regulatory and interpretability reasons, the logistic regression is still widely used by financial institutions to learn the refunding probability of a loan given the applicant's characteristics from historical data. Although logistic regression handles naturally both continuous and categorical data, a preprocessing step to quantize them is usually performed for improving simultaneously prediction accuracy and user interpretability: continuous features are discretized by assigning factor levels to intervals; some levels of categorical features (with numerous levels) are grouped. However, a better predictive accuracy can be reached by embedding this quantization estimation step directly into the predictive estimation step itself. A related information criterion has then to be optimized on a huge and untractable discontinuous quantization set, requiring to introduce a specific two-step optimization strategy: first, the optimization problem is relaxed in order to deal with smooth functions; second, a particular neural network is involved through a stochastic gradient algorithm to optimize the resulting criterion, giving access to good candidates for the initial optimization problem. The good performances of this approach are illustrated on simulated and real data from Crédit Agricole Consumer Finance (a major European historic player in the consumer credit market).

##### A Rate-Distortion Theory of Adversarial Examples

tl;dr We suggest that rate-distortion theory precisely characterizes the accuracy versus robustness to adversarial examples trade-off

The generalization ability of deep neural networks (DNNs) is interwined with model complexity, robustness and capacity. We employ information theory to establish an equivalence between a DNN and a noisy communication channel, and obtain a notion of capacity that allows us characterize generalization behavior of DNNs for adversarial inputs.

##### Meta-Learning Neural Bloom Filters

tl;dr We investigate the space efficiency of memory-augmented neural nets when learning set membership.

There has been a recent trend in training neural networks to replace data structures that have been crafted by hand, with an aim for faster execution, better accuracy, or greater compression. In this setting, a neural data structure is instantiated by training a network over many epochs of its inputs until convergence. In many applications this expensive initialization is not practical, for example streaming algorithms --- where inputs are ephemeral and can only be inspected a small number of times. In this paper we explore the learning of approximate set membership over a stream of data in one-shot via meta-learning. We propose a novel memory architecture, the Neural Bloom Filter, which we show to be more compressive than Bloom Filters and several existing memory-augmented neural networks in scenarios of skewed data or structured sets.

##### Variational Autoencoders for Text Modeling without Weakening the Decoder

tl;dr We propose a model of variational autoencoders for text modeling without weakening the decoder, which improves the quality of text generation and interpretability of acquired representations.

Previous work (Bowman et al., 2015; Yang et al., 2017) has found difficulty developing generative models based on variational autoencoders (VAEs) for text. To address the problem of the decoder ignoring information from the encoder (posterior collapse), these previous models weaken the capacity of the decoder to force the model to use information from latent variables. However, this strategy is not ideal as it degrades the quality of generated text and increases hyper-parameters. In this paper, we propose a new VAE for text utilizing a multimodal prior distribution, a modified encoder, and multi-task learning. We show our model can generate well-conditioned sentences without weakening the capacity of the decoder. Also, the multimodal prior distribution improves the interpretability of acquired representations.

##### An Adversarial Learning Framework for a Persona-based Multi-turn Dialogue Model

tl;dr This paper develops an adversarial learning framework for neural conversation models with persona

In this paper, we extend the persona-based sequence-to-sequence (Seq2Seq) neural network conversation model to a multi-turn dialogue scenario by modifying the state-of-the-art hredGAN architecture to simultaneously capture utterance attributes such as speaker identity, dialogue topic, speaker sentiments and so on. The proposed system, phredGAN has a persona-based HRED generator (PHRED) and a conditional discriminator. We also explore two approaches to accomplish the conditional discriminator: (1) $phredGAN_a$, a system that passes the attribute representation as an additional input into a traditional adversarial discriminator, and (2) $phredGAN_d$, a dual discriminator system which in addition to the adversarial discriminator, collaboratively predicts the attribute(s) that generated the input utterance. To demonstrate the superior performance of phredGAN over the persona SeqSeq model, we experiment with two conversational datasets, the Ubuntu Dialogue Corpus (UDC) and TV series transcripts from the Big Bang Theory and Friends. Performance comparison is made with respect to a variety of quantitative measures, including BLEU, ROUGE, and perplexity scores. We also explore the trade-offs from using either variant of $phredGAN$ on datasets with many but weak attribute modalities (such as with Big Bang Theory and Friends) and ones with few but strong attribute modalities (customer-agent interactions in Ubuntu dataset).

##### Boltzmann Weighting Done Right in Reinforcement Learning

No tl;dr =[

The Boltzmann softmax operator can trade-off well between exploration and exploitation according to current estimation in an exponential weighting scheme, which is a promising way to address the exploration-exploitation dilemma in reinforcement learning. Unfortunately, the Boltzmann softmax operator is not a non-expansion, which may lead to unstable or even divergent learning behavior when used in estimating the value function. The convergence of value iteration is guaranteed in a restricted set of non-expansive operators and how to characterize the effect of such non-expansive operators in value iteration remains an open problem. In this paper, we propose a new technique to analyze the error bound of value iteration with the the Boltzmann softmax operator. We then propose the dynamic Boltzmann softmax(DBS) operator to enable the convergence to the optimal value function in value iteration. We also present convergence rate analysis of the algorithm. Using Q-learning as an application, we show that the DBS operator can be applied in a model-free reinforcement learning algorithm. Finally, we demonstrate the effectiveness of the DBS operator in a toy problem called GridWorld and a suite of Atari games. Experimental results show that outperforms DQN substantially in benchmark games.

##### Synthnet: Learning synthesizers end-to-end

tl;dr A convolutional autoregressive generative model that generates high fidelity audio, behchmarked on music

Learning synthesizers and generating music in the raw audio domain is a challenging task. We investigate the learned representations of convolutional autoregressive generative models. Consequently, we show that mappings between musical notes and the harmonic style (instrument timbre) can be learned based on the raw audio music recording and the musical score (in binary piano roll format). Our proposed architecture, SynthNet uses minimal training data (9 minutes), is substantially better in quality and converges 6 times faster than the baselines. The quality of the generated waveforms (generation accuracy) is sufficiently high that they are almost identical to the ground truth. Therefore, we are able to directly measure generation error during training, based on the RMSE of the Constant-Q transform. Mean opinion scores are also provided. We validate our work using 7 distinct harmonic styles and also provide visualizations and links to all generated audio.

tl;dr We propose a novel aritificial checkerboard enhancer (ACE) module which guides attacks to a pre-specified pixel space and successfully defends it with a simple padding operation.

The checkerboard phenomenon is one of the well-known visual artifacts in the computer vision field. The origins and solutions of checkerboard artifacts in the pixel space have been studied for a long time, but their effects on the gradient space have rarely been investigated. In this paper, we revisit the checkerboard artifacts in the gradient space which turn out to be the weak point of a network architecture. We explore image-agnostic property of gradient checkerboard artifacts and propose a simple yet effective defense method by utilizing the artifacts. We introduce our defense module, dubbed Artificial Checkerboard Enhancer (ACE), which induces adversarial attacks on designated pixels. This enables the model to deflect attacks by shifting only a single pixel in the image with a remarkable defense rate. We provide extensive experiments to support the effectiveness of our work for various attack scenarios using state-of-the-art attack methods. Furthermore, we show that ACE is even applicable to large-scale datasets including ImageNet dataset and can be easily transferred to various pretrained networks.

##### Combining adaptive algorithms and hypergradient method: a performance and robustness study

tl;dr We provide a study trying to see how the recent online learning rate adaptation extends the conclusion made by Wilson et al. 2018 about adaptive gradient methods, along with comparison and sensitivity analysis.

Wilson et al. (2017) showed that, when the stepsize schedule is properly designed, stochastic gradient generalizes better than ADAM (Kingma & Ba, 2014). In light of recent work on hypergradient methods (Baydin et al., 2018), we revisit these claims to see if such methods close the gap between the most popular optimizers. As a byproduct, we analyze the true benefit of these hypergradient methods compared to more classical schedules, such as the fixed decay of Wilson et al. (2017). In particular, we observe they are of marginal help since their performance varies significantly when tuning their hyperparameters. Finally, as robustness is a critical quality of an optimizer, we provide a sensitivity analysis of these gradient based optimizers to assess how challenging their tuning is.

##### Generating Multi-Agent Trajectories using Programmatic Weak Supervision

tl;dr We blend deep generative models with programmatic weak supervision to generate coordinated multi-agent trajectories of significantly higher quality than previous baselines.

We study the problem of training sequential generative models for capturing coordinated multi-agent trajectory behavior, such as offensive basketball gameplay. When modeling such settings, it is often beneficial to design hierarchical models that can capture long-term coordination using intermediate variables. Furthermore, these intermediate variables should capture interesting high-level behavioral semantics in an interpretable and manipulable way. We present a hierarchical framework that can effectively learn such sequential generative models. Our approach is inspired by recent work on leveraging programmatically produced weak labels, which we extend to the spatiotemporal regime. In addition to synthetic settings, we show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.

##### The Conditional Entropy Bottleneck

tl;dr The Conditional Entropy Bottleneck is an information-theoretic objective function for learning optimal representations.

We present a new family of objective functions, which we term the Conditional Entropy Bottleneck (CEB). We demonstrate the application of CEB to classification tasks. In our experiments, CEB gives: well-calibrated predictions; essentially perfect detection of challenging out-of-distribution examples and powerful whitebox adversarial examples; and natural robustness to the same. Finally, we report that CEB fails to learn a dataset with fixed random labels, providing a possible resolution to the problem of generalization observed in Zhang et al. (2016).

##### On Accurate Evaluation of GANs for Language Generation

tl;dr We discuss how to evaluate GANs for language generation, propose a protocol and show that simple Language Models achieve results as good as GANs.

Generative Adversarial Networks (GANs) are a promising approach to language generation. The latest works introducing novel GAN models for language generation use n-gram based metrics for evaluation and only report single scores of the best run. In this paper, we argue that this often misrepresents the true picture and does not tell the full story, as GAN models can be extremely sensitive to the random initialization and small deviations from the best hyperparameter choice. In particular, we demonstrate that the previously used BLEU score is not sensitive to semantic deterioration of generated texts and propose alternative metrics that better capture the quality and diversity of the generated samples. We also conduct a set of experiments comparing a number of GAN models for text with a conventional Language Model (LM) and find that none of the considered models performs convincingly better than the LM.

##### Purchase as Reward : Session-based Recommendation by Imagination Reconstruction

tl;dr We propose the IRN architecture to augment sparse and delayed purchase reward for session-based recommendation.

One of the key challenges of session-based recommender systems is to enhance users’ purchase intentions. In this paper, we formulate the sequential interactions between user sessions and a recommender agent as a Markov Decision Process (MDP). In practice, the purchase reward is delayed and sparse, and may be buried by clicks, making it an impoverished signal for policy learning. Inspired by the prediction error minimization (PEM) and embodied cognition, we propose a simple architecture to augment reward, namely Imagination Reconstruction Network (IRN). Speciﬁcally, IRN enables the agent to explore its environment and learn predictive representations via three key components. The imagination core generates predicted trajectories, i.e., imagined items that users may purchase. The trajectory manager controls the granularity of imagined trajectories using the planning strategies, which balances the long-term rewards and short-term rewards. To optimize the action policy, the imagination-augmented executor minimizes the intrinsic imagination error of simulated trajectories by self-supervised reconstruction, while maximizing the extrinsic reward using model-free algorithms. Empirically, IRN promotes quicker adaptation to user interest, and shows improved robustness to the cold-start scenario and ultimately higher purchase performance compared to several baselines. Somewhat surprisingly, IRN using only the purchase reward achieves excellent next-click prediction performance, demonstrating that the agent can "guess what you like" via internal planning.

##### SELECT VIA PROXY: EFFICIENT DATA SELECTION FOR TRAINING DEEP NETWORKS

tl;dr we develop an efficient method for selecting training data to quickly and efficiently learn large machine learning models.

At internet scale, applications collect a tremendous amount of data by logging user events, analyzing text, and collecting images. This data powers a variety of machine learning models for tasks such as image classification, language modeling, content recommendation, and advertising. However, training large models over all available data can be computationally expensive, creating a bottleneck in the development of new machine learning models. In this work, we develop a novel approach to efficiently select a subset of training data to achieve faster training with no loss in model predictive performance. In our approach, we first train a small proxy model quickly, which we then use to estimate the utility of individual training data points, and then select the most informative ones for training the large target model. Extensive experiments show that our approach leads to a 1.6x and 1.8x speed-up on CIFAR10 and SVHN by selecting 60% and 50% subsets of the data, while maintaining the predictive performance of the model trained on the entire dataset. Further, our method is robust to design choices.

##### Experience replay for continual learning

tl;dr We show that, in continual learning settings, catastrophic forgetting can be avoided by applying off-policy RL to a mixture of new and replay experience, with a behavioral cloning loss.

##### Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

tl;dr We studied exploration with imperfect planning and used object representation to learn simple models and introduced a new sample efficient RL algorithm that achieves state of the art results on Pitfall!

Humans learn to play video games significantly faster than the state-of-the-art reinforcement learning (RL) algorithms. People seem to build simple models that are easy to learn to support planning and strategic exploration. Inspired by this, we investigate two issues in leveraging model-based RL for sample efficiency. First we investigate how to perform strategic exploration when exact planning is not feasible and empirically show that optimistic Monte Carlo Tree Search outperforms posterior sampling methods. Second we show how to learn simple deterministic models to support fast learning using object representation. We illustrate the benefit of these ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement Learning (SOORL), that outperforms state-of-the-art algorithms in the game of Pitfall! in less than 50 episodes.

##### MARGINALIZED AVERAGE ATTENTIONAL NETWORK FOR WEAKLY-SUPERVISED LEARNING

tl;dr A novel marginalized average attentional network for weakly-supervised temporal action localization

In weakly-supervised temporal action localization, previous works suffer from overestimating the most salient regions and fail to locate dense and integral regions for each entire action. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples the subsets from the video snippet features based on the latent discriminative probabilities and takes the expectation over all the subset features. Theoretically, we prove that the learned latent discriminative probabilities reduce the difference of responses between the most salient regions and the others, and thus MAAN generates better class activation sequences to identify more dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that our MAAN achieves superior performance on weakly-supervised temporal action localization task.

##### Graph Transformer

No tl;dr =[

Graph neural networks (GNN) have gained increasing research interests as a mean to the challenging goal of robust and universal graph learning. Previous GNNs have assumed single pre-fixed graph structure and permitted only local context encoding. This paper proposes a novel Graph Transformer (GTR) architecture that captures long-range dependency with global attention, and enables dynamic graph structures. In particular, GTR propagates features within the same graph structure via an intra-graph message passing, and transforms dynamic semantics across multi-domain graph-structured data (e.g. images, sequences, knowledge graphs) for multi-modal learning via an inter-graph message passing. Furthermore, GTR enables effective incorporation of any prior graph structure by weighted averaging of the prior and learned edges, which can be crucially useful for scenarios where prior knowledge is desired. The proposed GTR achieves new state-of-the-arts across three benchmark tasks, including few-shot learning, medical abnormality and disease classification, and graph classification. Experiments show that GTR is superior in learning robust graph representations, transforming high-level semantics across domains, and bridging between prior graph structure with automatic structure learning.

##### Unsupervised Document Representation using Partition Word-Vectors Averaging

tl;dr A simple unsupervised method for multi-sentense-document embedding using partition based word vectors averaging that achieve results comparable to sophisticated models.

Learning effective document-level representation is essential in many important NLP tasks such as document classification, summarization, etc. Recent research has shown that simple weighted averaging of word vectors is an effective way to represent sentences, often outperforming complicated seq2seq neural models in many tasks. While it is desirable to use the same method to represent documents as well, unfortunately, the effectiveness is lost when representing long documents involving multiple sentences. One reason for this degradation is due to the fact that a longer document is likely to contain words from many different themes (or topics), and hence creating a single vector while ignoring all the thematic structure is unlikely to yield an effective representation of the document. This problem is less acute in single sentences and other short text fragments where presence of a single theme/topic is most likely. To overcome this problem, in this paper we present PSIF, a partitioned word averaging model to represent long documents. P-SIF retains the simplicity of simple weighted word averaging while taking a document's thematic structure into account. In particular, P-SIF learns topic-specific vectors from a document and finally concatenates them all to represent the overall document. Through our experiments over multiple real-world datasets and tasks, we demonstrate PSIF's effectiveness compared to simple weighted averaging and many other state-of-the-art baselines. We also show that PSIF is particularly effective in representing long multi-sentence documents. We will release PSIF's embedding source code and data-sets for reproducing results.

##### Security Analysis of Deep Neural Networks Operating in the Presence of Cache Side-Channel Attacks

tl;dr We conduct the first in-depth security analysis of DNN fingerprinting attacks that exploit cache side-channels, which represents a step toward understanding the DNN’s vulnerability to side-channel attacks.

Recent work has introduced attacks that extract the architecture information of deep neural networks (DNN), as this knowledge enhances an adversary’s capability to conduct black-box attacks against the model. This paper presents the first in-depth security analysis of DNN fingerprinting attacks that exploit cache side-channels. First, we define the threat model for these attacks: our adversary does not need the ability to query the victim model; instead, she runs a co-located process on the host machine victim ’s deep learning (DL) system is running and passively monitors the accesses of the target functions in the shared framework. Second, we introduce DeepRecon, an attack that reconstructs the architecture of the victim network by using the internal information extracted via Flush+Reload, a cache side-channel technique. Once the attacker observes function invocations that map directly to architecture attributes of the victim network, the attacker can reconstruct the victim’s entire network architecture. In our evaluation, we demonstrate that an attacker can accurately reconstruct two complex networks (VGG19 and ResNet50) having only observed one forward propagation. Based on the extracted architecture attributes, we also demonstrate that an attacker can build a meta-model that accurately fingerprints the architecture and family of the pre-trained model in a transfer learning setting. From this meta-model, we evaluate the importance of the observed attributes in the fingerprinting process. Third, we propose and evaluate new framework-level defense techniques that obfuscate our attacker’s observations. Our empirical security analysis represents a step toward understanding the DNNs’ vulnerability to cache side-channel attacks.

##### Learning To Solve Circuit-SAT: An Unsupervised Differentiable Approach

tl;dr We propose a neural framework that can learn to solve the Circuit Satisfiability problem from (unlabeled) circuit instances.

Recent efforts to combine Representation Learning with Formal Methods, commonly known as the Neuro-Symbolic Methods, have given rise to a new trend of applying rich neural architectures to solve classical combinatorial optimization problems. In this paper, we propose a neural framework that can learn to solve the Circuit Satisfiability problem. Our framework is built upon two fundamental contributions: a rich embedding architecture that encodes the problem structure and an end-to-end differentiable training procedure that mimics Reinforcement Learning and trains the model directly toward solving the SAT problem. The experimental results show the superior out-of-sample generalization performance of our framework compared to the recently developed NeuroSAT method.

##### Value Propagation Networks

tl;dr We present planners based on convnets that are sample-efficient and that generalize to larger instances of navigation and pathfinding problems.

We present Value Propagation (VProp), a set of parameter-efficient differentiable planning modules built on Value Iteration which can successfully be trained using reinforcement learning to solve unseen tasks, has the capability to generalize to larger map sizes, and can learn to navigate in dynamic environments. We show that the modules enable learning to plan when the environment also includes stochastic elements, providing a cost-efficient learning system to build low-level size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of several different sizes, and on a StarCraft navigation scenario, with more complex dynamics, and pixels as input.

##### Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

tl;dr We translate a bound on sub-optimality of representations to a practical training objective in the context of hierarchical reinforcement learning.

We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation -- the mapping of observation space to goal space -- is crucial. To study this problem, we develop a notion of sub-optimality of a representation, defined in terms of expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the sub-optimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods.

##### Dataset Distillation

tl;dr We propose to distill a large dataset into a small set of synthetic data , so networks can achieve close to original performance when trained on these data.

Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called {\em dataset distillation}: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to {\em synthesize} a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress $60,000$ MNIST training images into just $10$ synthetic {\em distilled images} (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. Apart from being an interesting new way to think about distillation, this approach could potentially open up several applications, such as fast domain adaptation and effective data poisoning attacks.

##### Adaptive Mixture of Low-Rank Factorizations for Compact Neural Modeling

tl;dr We propose a simple modification to low-rank factorization that improves performances (in both image and language tasks) while still being compact.

Modern deep neural networks have a large amount of weights, which make them difficult to deploy on computation constrained devices such as mobile phones. One common approach to reduce the model size and computational cost is to use low-rank factorization to approximate a weight matrix. However, performing standard low-rank factorization with a small rank can hurt the model expressiveness and significantly decrease the performance. In this work, we propose to use a mixture of multiple low-rank factorizations to model a large weight matrix, and the mixture coefficients are computed dynamically depending on its input. We demonstrate the effectiveness of the proposed approach on both language modeling and image classification tasks. Experiments show that our method not only improves the computation efficiency but also maintains (sometimes outperforms) its accuracy compared with the full-rank counterparts.

##### Minimal Images in Deep Neural Networks: Fragile Object Recognition in Natural Images

No tl;dr =[

The human ability to recognize objects is impaired when the object is not shown in full. "Minimal images" are the smallest regions of an image that remain recognizable for humans. Ullman et al. (2016) show that a slight modification of the location and size of the visible region of the minimal image produces a sharp drop in human recognition accuracy. In this paper, we demonstrate that such drops in accuracy due to changes of the visible region are a common phenomenon between humans and existing state-of-the-art deep neural networks (DNNs), and are much more prominent in DNNs. We found many cases where DNNs classified one region correctly and the other incorrectly, though they only differed by one row or column of pixels, and were often bigger than the average human minimal image size. We show that this phenomenon is independent from previous works that have reported lack of invariance to minor modifications in object location in DNNs. Our results thus reveal a new failure mode of DNNs that also affects humans to a much lesser degree. They expose how fragile DNN recognition ability is in natural images even without adversarial patterns being introduced. Bringing the robustness of DNNs in natural images to the human level remains an open challenge for the community.

##### Discovering General-Purpose Active Learning Strategies

No tl;dr =[

We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely reflects the AL objective of minimizing the annotation cost We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines.

##### ProMP: Proximal Meta-Policy Search

tl;dr A novel and theoretically grounded meta-reinforcement learning algorithm

Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

##### Learning from Positive and Unlabeled Data with a Selection Bias

No tl;dr =[

We consider the problem of learning a binary classifier only from positive data and unlabeled data (PU learning). Recent methods of PU learning commonly assume that the labeled positive data are identically distributed as the unlabeled positive data. However, this assumption is unrealistic in many instances of PU learning because it fails to capture the existence of a selection bias in the labeling process. When the data has a selection bias, it is difficult to learn the Bayes optimal classifier by conventional methods of PU learning. In this paper, we propose a method to partially identify the classifier. The proposed algorithm learns a scoring function that preserves the order induced by the class posterior under mild assumptions, which can be used as a classifier by setting an appropriate threshold. Through experiments, we show that the method outperforms previous methods for PU learning on various real-world datasets.

##### What Would pi* Do?: Imitation Learning via Off-Policy Reinforcement Learning

tl;dr We propose a simple and effective imitation learning algorithm based on off-policy RL, which works well on image-based tasks and implicitly performs approximate inference of the expert policy.

Learning to imitate expert actions given demonstrations containing image observations is a difficult problem in robotic control. The key challenge is generalizing behavior to out-of-distribution states that differ from those in the demonstrations. State-of-the-art imitation learning algorithms perform well in environments with low-dimensional observations, but typically involve adversarial optimization procedures, which can be difficult to use with high-dimensional image observations. We propose a remarkably simple alternative based on off-policy reinforcement learning, which rewards the agent for matching demonstrated actions in demonstrated states — the key idea is initially filling the agent's experience replay buffer with demonstrations where rewards are set to a positive constant, and setting rewards to zero in all additional experiences. We derive this RL algorithm from first principles as a method for performing approximate inference under the MaxCausalEnt model of expert behavior — the approximate inference objective trades off between a pure behavioral cloning loss and a regularization term that incorporates information about state transitions via the soft Bellman error. Our experiments show that this algorithm matches the state of the art in low-dimensional environments, and significantly outperforms prior work in playing video games from high-dimensional images.

##### Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives

tl;dr Doubly reparameterized gradient estimators provide unbiased variance reduction which leads to improved performance.

Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by (Kingma & Welling 2013, Rezende et al. 2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al. 2018, Le et al. 2018). Roeder et a. (2017) propose an improved gradient estimator, however, are unable to show it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not suffer as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep update (RWS) (Bornschein & Bengio 2014), and the jackknife variational inference (JVI) gradient (Nowozin 2018). Finally, we show that this computationally efficient, drop-in estimator translates to improved performance for all three objectives on several modeling tasks.

##### Boosting Trust Region Policy Optimization by Normalizing flows Policy

tl;dr Normalizing flows policy to improve TRPO and ACKTR

We propose to improve trust region policy search with normalizing flows policy. We illustrate that when the trust region is constructed by KL divergence constraint, normalizing flows policy can generate samples far from the 'center' of the previous policy iterate, which potentially enables better exploration and helps avoid bad local optima. We show that normalizing flows policy significantly improves upon factorized Gaussian policy baseline, with both TRPO and ACKTR, especially on tasks with complex dynamics such as Humanoid.

##### Learning agents with prioritization and parameter noise in continuous state and action space

tl;dr Improving the performance of an RL agent in the continuous action and state space domain by using prioritised experience replay and parameter noise.

Reinforcement Learning (RL) problem can be solved in two different ways - the Value function-based approach and the policy optimization-based approach - to eventually arrive at an optimal policy for the given environment. One of the recent breakthroughs in reinforcement learning is the use of deep neural networks as function approximators to approximate the value function or q-function in a reinforcement learning scheme. This has led to results with agents automatically learning how to play games like alpha-go showing better-than-human performance. Deep Q-learning networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are two such methods that have shown state-of-the-art results in recent times. Among the many variants of RL, an important class of problems is where the state and action spaces are continuous --- autonomous robots, autonomous vehicles, optimal control are all examples of such problems that can lend themselves naturally to reinforcement based algorithms, and have continuous state and action spaces. In this paper, we adapt and combine approaches such as DQN and DDPG in novel ways to outperform the earlier results for continuous state and action space problems. We believe these results are a valuable addition to the fast-growing body of results on Reinforcement Learning, more so for continuous state and action space problems.

##### On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

tl;dr We analyze convergence of Adam-type algorithms and provide mild sufficient conditions to guarantee their convergence, we also show violating the conditions can makes an algorithm diverge.

This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the ''Adam-type'', includes the popular algorithms such as Adam, AMSGrad, AdaGrad. Despite their popularity in training deep neural networks (DNNs), the convergence of these algorithms for solving non-convex problems remains an open question. In this paper, we develop an analysis framework and a set of mild sufficient conditions that guarantee the convergence of the Adam-type methods, with a convergence rate of order $O(\log{T}/\sqrt{T})$ for non-convex stochastic optimization. Our convergence analysis applies to a new algorithm called AdaFom (AdaGrad with First Order Momentum). We show that the conditions are essential, by identifying concrete examples in which violating the conditions makes an algorithm diverge. Besides providing one of the first comprehensive analysis for Adam-type methods in the non-convex setting, our results can also help the practitioners to easily monitor the progress of algorithms and determine their convergence behavior.

tl;dr We implement an adversarial domain adaptation network to stabilize a fixed Brain-Machine Interface against gradual changes in the recorded neural signals.

Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface. However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem.

##### PointGrow: Autoregressively Learned Point Cloud Generation with Self-Attention

tl;dr An autoregressive deep learning model for generating diverse point clouds.

A point cloud is an agile 3D representation, efficiently modeling an object's surface geometry. However, these surface-centric properties also pose challenges on designing tools to recognize and synthesize point clouds. This work presents a novel autoregressive model, PointGrow, which generates realistic point cloud samples from scratch or conditioned from given semantic contexts. Our model operates recurrently, with each point sampled according to a conditional distribution given its previously-generated points. Since point cloud object shapes are typically encoded by long-range interpoint dependencies, we augment our model with dedicated self-attention modules to capture these relations. Extensive evaluation demonstrates that PointGrow achieves satisfying performance on both unconditional and conditional point cloud generation tasks, with respect to fidelity, diversity and semantic preservation. Further, conditional PointGrow learns a smooth manifold of given images where 3D shape interpolation and arithmetic calculation can be performed inside.

##### Differentiable Greedy Networks

tl;dr We propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization.

Optimal selection of a subset of items from a given set is a hard problem that requires combinatorial optimization. In this paper, we propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization. We focus on the task of identifying a relevant set of sentences for claim verification in the context of the FEVER task. Conventional methods for this task look at sentences on their individual merit and thus do not optimize the informativeness of sentences as a set. We show that our proposed method which builds on the idea of unfolding a greedy algorithm into a computational graph allows both interpretability and gradient based training. The proposed differentiable greedy network (DGN) outperforms discrete optimization algorithms as well as other baseline methods in terms of precision and recall.

##### MahiNet: A Neural Network for Many-Class Few-Shot Learning with Class Hierarchy

tl;dr A memory-augmented neural network that addresses many-class few-shot problem by leveraging class hierarchy in both supervised learning and meta-learning.

We study many-class few-shot (MCFS) problem in both supervised learning and meta-learning scenarios. Compared to the well-studied many-class many-shot and few-class few-shot problems, MCFS problem commonly occurs in practical applications but is rarely studied. MCFS brings new challenges because it needs to distinguish between many classes, but only a few samples per class are available for training. In this paper, we propose memory-augmented hierarchical-classification network (MahiNet)'' for MCFS learning. It addresses the many-class'' problem by exploring the class hierarchy, e.g., the coarse-class label that covers a subset of fine classes, which helps to narrow down the candidates for the fine class and is cheaper to obtain. MahiNet uses a convolutional neural network (CNN) to extract features, and integrates a memory-augmented attention module with a multi-layer perceptron (MLP) to produce the probabilities over coarse and fine classes. While the MLP extends the linear classifier, the attention module extends a KNN classifier, both together targeting the `few-shot'' problem. We design different training strategies of MahiNet for supervised learning and meta-learning. Moreover, we propose two novel benchmark datasets ''mcfsImageNet'' (as a subset of ImageNet) and ''mcfsOmniglot'' (re-splitted Omniglot) specifically for MCFS problem. In experiments, we show that MahiNet outperforms several state-of-the-art models on MCFS classification tasks in both supervised learning and meta-learning scenarios.

##### Text Embeddings for Retrieval from a Large Knowledge Base

tl;dr The new attempt for creating semantically meaningful text embeddings via improved language modeling and utilizing an extra knowledge base

Text embedding representing natural language documents in a semantic vector space can be used for document retrieval using nearest neighbor lookup. In order to study the feasibility of neural models specialized for retrieval in a semantically meaningful way, we suggest the use of the Stanford Question Answering Dataset (SQuAD) in an open-domain question answering context, where the first task is to find paragraphs useful for answering a given question. First, we compare the quality of various text-embedding methods on the performance of retrieval and give an extensive empirical comparison on the performance of various non-augmented base embedding with, and without IDF weighting. Our main results are that by training deep residual neural models specifically for retrieval purposes can yield significant gains when it is used to augment existing embeddings. We also establish that deeper models are superior to this task. The best base baseline embeddings augmented by our learned neural approach improves the top-1 recall of the system by 14% in terms of the question side, and by 8% in terms of the paragraph side.

##### Generative predecessor models for sample-efficient imitation learning

No tl;dr =[

We propose Generative Predecessor Models for Imitation Learning (GPRIL), a novel imitation learning algorithm that matches the state-action distribution to the distribution observed in expert demonstrations, using generative models to reason probabilistically about alternative histories of demonstrated states. We show that this approach allows an agent to learn robust policies using only a small number of expert demonstrations and self-supervised interactions with the environment. We derive this approach from first principles and compare it empirically to a state-of-the-art imitation learning method, showing that it outperforms or matches its performance on two simulated robot manipulation tasks and demonstrate significantly higher sample efficiency by applying the algorithm on a real robot.

tl;dr Learn representations for images that factor out a single attribute.

We propose a novel generative model architecture designed to learn representations for images that factor out a single attribute from the rest of the representation. A single object may have many attributes which when altered do not change the identity of the object itself. Consider the human face; the identity of a particular person is independent of whether or not they happen to be wearing glasses. The attribute of wearing glasses can be changed without changing the identity of the person. However, the ability to manipulate and alter image attributes without altering the object identity is not a trivial task. Here, we are interested in learning a representation of the image that separates the identity of an object (such as a human face) from an attribute (such as 'wearing glasses'). We demonstrate the success of our factorization approach by using the learned representation to synthesize the same face with and without a chosen attribute. We refer to this specific synthesis process as image attribute manipulation. We further demonstrate that our model achieves competitive scores, with state of the art, on a facial attribute classification task.

##### Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

tl;dr We throughly compare several pretraining tasks on their ability to induce syntactic information and find that representations from language models consistently perform best, even when trained on relatively small amounts of data.

Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. We still, though, do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives - language modeling, translation, skip-thought, and autoencoding - on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly-initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.

##### signSGD with Majority Vote is Communication Efficient and Byzantine Fault Tolerant

tl;dr Workers send gradient signs to the server, and the update is decided by majority vote. We show that this algorithm is convergent, communication efficient and adversarially robust, both in theory and in practice.

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses 32x less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. We model adversaries as those workers who may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.

##### ATTENTIVE EXPLAINABILITY FOR PATIENT TEMPO- RAL EMBEDDING

No tl;dr =[

Learning explainable patient temporal embeddings from observational data has mostly ignored the use of RNN architecture that excel in capturing temporal data dependencies but at the expense of explainability. This paper addresses this problem by introducing and applying an information theoretic approach to estimate the degree of explainability of such architectures. Using a communication paradigm, we formalize metrics of explainability by estimating the amount of information that an AI model needs to convey to a human end user to explain and rationalize its outputs. A key aspect of this work is to model human prior knowledge at the receiving end and measure the lack of explainability as a deviation from human prior knowledge. We apply this paradigm to medical concept representation problems by regularizing loss functions of temporal autoencoders according to the derived explainability metrics to guide the learning process towards models producing explainable outputs. We illustrate the approach with convincing experimental results for the generation of explainable temporal embeddings for critical care patient data.

##### Generalized Capsule Networks with Trainable Routing Procedure

tl;dr A scalable capsule network

CapsNet (Capsule Network) was first proposed by Sabour et al. (2017) and lateranother version of CapsNet was proposed by Hinton et al. (2018). CapsNet hasbeen proved effective in modeling spatial features with much fewer parameters.However, the routing procedures (dynamic routing and EM routing) in both pa-pers are not well incorporated into the whole training process, and the optimalnumber for the routing procedure has to be found manually. We propose Gen-eralized GapsNet (G-CapsNet) to overcome this disadvantages by incorporatingthe routing procedure into the optimization. We implement two versions of G-CapsNet (fully-connected and convolutional) on CAFFE (Jia et al. (2014)) andevaluate them by testing the accuracy on MNIST & CIFAR10, the robustness towhite-box & black-box attack, and the generalization ability on GAN-generatedsynthetic images. We also explore the scalability of G-CapsNet by constructinga relatively deep G-CapsNet. The experiment shows that G-CapsNet has goodgeneralization ability and scalability.

##### Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks

tl;dr We introduce a new inductive bias that integrates tree structures in recurrent neural networks.

Recurrent neural network (RNN) models are widely used for processing sequential data governed by a latent tree structure. Previous work shows that RNN models (especially Long Short-Term Memory (LSTM) based models) could learn to exploit the underlying tree structure. However, its performance consistently lags behind that of tree-based models. This work proposes a new inductive bias Ordered Neurons, which enforces an order of updating frequencies between hidden state neurons. We show that the ordered neurons could explicitly integrate the latent tree structure into recurrent models. To this end, we propose a new RNN unit: ON-LSTM, which achieve good performances on four different tasks: language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference.

##### Using Deep Siamese Neural Networks to Speed up Natural Products Research

tl;dr We learn a direct mapping from NMR spectra of small molecules to a molecular structure based cluster space.

Natural products (NPs, compounds derived from plants and animals) are an important source of novel disease treatments. A bottleneck in the search for new NPs is structure determination. One method is to use 2D Nuclear Magnetic Resonance (NMR) imaging, which indicates bonds between nuclei in the compound, and hence is the "fingerprint" of the compound. Computing a similarity score between 2D NMR spectra for a novel compound and a compound whose structure is known helps determine the structure of the novel compound. Standard approaches to this problem do not appear to scale to larger databases of compounds. Here we use deep convolutional Siamese networks to map NMR spectra to a cluster space, where similarity is given by the distance in the space. This approach results in an AUC score that is more than four times better than an approach using Latent Dirichlet Allocation.

##### Pix2Scene: Learning Implicit 3D Representations from Images

tl;dr pix2scene: a deep generative based approach for implicitly modelling the geometrical properties of a 3D scene from images

Modelling 3D scenes from 2D images is a long-standing problem in computer vision with implications in, e.g., simulation and robotics. We propose pix2scene, a deep generative-based approach that implicitly models the geometric properties of a scene from images. Our method learns the depth and orientation of scene points visible in images. Our model can then predict the structure of a scene from various, previously unseen view points. It relies on a bi-directional adversarial learning mechanism to generate scene representations from a latent code, inferring the 3D representation of the underlying scene geometry. We showcase a novel differentiable renderer to train the 3D model in an end-to-end fashion, using only images. We demonstrate the generative ability of our model qualitatively on both a custom dataset and on ShapeNet. Finally, we evaluate the effectiveness of the learned 3D scene representation in supporting a 3D spatial reasoning.

##### Cramer-Wold AutoEncoder

tl;dr Inspired by prior work on Sliced-Wasserstein Autoencoders (SWAE) and kernel smoothing we construct a new generative model – Cramer-Wold AutoEncoder (CWAE).

Assessing distance betweeen the true and the sample distribution is a key component of many state of the art generative models, such as Wasserstein Autoencoder (WAE). Inspired by prior work on Sliced-Wasserstein Autoencoders (SWAE) and kernel smoothing we construct a new generative model – Cramer-Wold AutoEncoder (CWAE). CWAE cost function, based on introduced Cramer-Wold distance between samples, has a simple closed-form in the case of normal prior. As a consequence, while simplifying the optimization procedure (no need of sampling necessary to evaluate the distance function in the training loop), CWAE performance matches quantitatively and qualitatively that of WAE-MMD (WAE using maximum mean discrepancy based distance function) and often improves upon SWAE.