# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: off-policy, conditional image generation, riemannian manifold, convolutions, computer vision ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "Random selection" has 100 results

##### Knowledge Flow: Improve Upon Your Teachers

tl;dr ‘Knowledge Flow’ trains a deep net (student) by injecting information from multiple nets (teachers). The student is independent upon training and performs very well on learned tasks irrespective of the setting (reinforcement or supervised learning).

A zoo of deep nets is available these days for almost any given task, and it is increasingly unclear which net to start with when addressing a new task, or which net to use as an initialization for fine-tuning a new model. To address this issue, in this paper, we develop knowledge flow which moves ‘knowledge’ from multiple deep nets, referred to as teachers, to a new deep net model, called the student. The structure of the teachers and the student can differ arbitrarily and they can be trained on entirely different tasks with different output spaces too. Upon training with knowledge flow the student is independent of the teachers. We demonstrate our approach on a variety of supervised and reinforcement learning tasks, outperforming fine-tuning and other ‘knowledge exchange’ methods.

##### Adversarial Sampling for Active Learning

tl;dr ASAL is a pool based active learning method that generates high entropy samples and retrieves matching samples from the pool in sub-linear time.

This paper proposes ASAL, a new pool based active learning method that generates high entropy samples. Instead of directly annotating the synthetic samples, ASAL searches similar samples from the pool and includes them for training. Hence, the quality of new samples is high and annotations are reliable. ASAL is particularly suitable for large data sets because it achieves a better run-time complexity (sub-linear) for sample selection than traditional uncertainty sampling (linear). We present a comprehensive set of experiments on two data sets and show that ASAL outperforms similar methods and clearly exceeds the established baseline (random sampling). In the discussion section we analyze in which situations ASAL performs best and why it is sometimes hard to outperform random sample selection. To the best of our knowledge this is the first adversarial active learning technique that is applied for multiple class problems using deep convolutional classifiers and demonstrates superior performance than random sample selection.

##### LanczosNet: Multi-Scale Deep Graph Convolutional Networks

No tl;dr =[

We propose Lanczos network (LanczosNet) which uses the Lanczos algorithm to construct low rank approximations of the graph Laplacian for graph convolution. Relying on the tridiagonal decomposition of the Lanczos algorithm, we not only efficiently exploit multi-scale information via fast approximated computation of matrix power but also design learnable spectral filters. Being fully differentiable, LanczosNet facilitates both graph kernel learning as well as learning node embeddings. We show the connection between our LanczosNet and graph based manifold learning, especially diffusion maps. We benchmark our model against $8$ recent deep graph networks on citation datasets and QM8 quantum chemistry dataset. Experimental results show that our model achieves the state-of-the-art performance in most tasks.

##### Exploiting Invariant Structures for Compression in Neural Networks

tl;dr Compression of neural networks which improves the state-of-the-art low rank approximation techniques and is complementary to most of other compression techniques.

Modern neural networks often require deep compositions of high-dimensional nonlinear functions (wide architecture) to achieve high test accuracy, and thus can have overwhelming number of parameters. Repeated high cost in prediction at test-time makes neural networks ill-suited for devices with constrained memory or computational power. We introduce an efficient mechanism, reshaped tensor decomposition, to compress neural networks by exploiting three types of invariant structures: periodicity, modulation and low rank. Our reshaped tensor decomposition method exploits such invariance structures using a technique called tensorization (reshaping the layers into higher-order tensors) combined with higher order tensor decompositions on top of the tensorized layers. Our compression method improves low rank approximation methods and can be incorporated to (is complementary to) most of the existing compression methods for neural networks to achieve better compression. Experiments on LeNet-5 (MNIST), ResNet-32 (CI- FAR10) and ResNet-50 (ImageNet) demonstrate that our reshaped tensor decomposition outperforms (5% test accuracy improvement universally on CIFAR10) the state-of-the-art low-rank approximation techniques under same compression rate, besides achieving orders of magnitude faster convergence rates.

##### Where and when to look? Spatial-temporal attention for action recognition in videos

No tl;dr =[

Inspired by the observation that humans are able to process videos efficiently by only paying attention when and where it is needed, we propose a novel spatial-temporal attention mechanism for video-based action recognition. For spatial attention, we learn a saliency mask to allow the model to focus on the most salient parts of the feature maps. For temporal attention, we employ a soft temporal attention mechanism to identify the most relevant frames from an input video. Further, we propose a set of regularizers that ensure that our attention mechanism attends to coherent regions in space and time. Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally. We demonstrate the efficacy of our approach on three public video action recognition datasets. The proposed approach leads to state-of-the-art performance on all of them, including the new large-scale Moments in Time dataset. Furthermore, we quantitatively and qualitatively evaluate our model's ability to accurately localize discriminative regions spatially and critical frames temporally. This is despite our model only being trained with per video classification labels.

##### Explaining AlphaGo: Interpreting Contextual Effects in Neural Networks

tl;dr This paper presents methods to disentangle and interpret contextual effects that are encoded in a deep neural network.

This paper presents two methods to disentangle and interpret contextual effects that are encoded in a pre-trained deep neural network. Unlike convolutional studies that visualize image appearances corresponding to the network output or a neural activation from a global perspective, our research aims to clarify how a certain input unit (dimension) collaborates with other units (dimensions) to constitute inference patterns of the neural network and thus contribute to the network output. The analysis of local contextual effects w.r.t. certain input units is of special values in real applications. In particular, we used our methods to explain the gaming strategy of the alphaGo Zero model in experiments, and our method successfully disentangled the rationale of each move during the game.

##### Metropolis-Hastings view on variational inference and adversarial training

tl;dr Learning to sample via lower bounding the acceptance rate of the Metropolis-Hastings algorithm

In this paper we propose to view the acceptance rate of the Metropolis-Hastings algorithm as a universal objective for learning to sample from target distribution -- given either as a set of samples or in the form of unnormalized density. This point of view unifies the goals of such approaches as Markov Chain Monte Carlo (MCMC), Generative Adversarial Networks (GANs), variational inference. To reveal the connection we derive the lower bound on the acceptance rate and treat it as the objective for learning explicit and implicit samplers. The form of the lower bound allows for doubly stochastic gradient optimization in case the target distribution factorizes (i.e. over data points). We empirically validate our approach on Bayesian inference for neural networks and generative models for images.

##### Learning to encode spatial relations from natural language

tl;dr We introduce a system capable of capturing the semantics of spatial relations by grounding representation learning in vision.

Natural language processing has made significant inroads into learning the semantics of words through distributional approaches, however representations learnt via these methods fail to capture certain kinds of information implicit in the real world. In particular, spatial relations are encoded in a way that is inconsistent with human spatial reasoning and lacking invariance to viewpoint changes. We present a system capable of capturing the semantics of spatial relations such as behind, left of, etc from natural language. Our key contributions are a novel multi-modal objective based on generating images of scenes from their textual descriptions, and a new dataset on which to train it. We demonstrate that internal representations are robust to meaning preserving transformations of descriptions (paraphrase invariance), while viewpoint invariance is an emergent property of the system.

##### GenEval: A Benchmark Suite for Evaluating Generative Models

tl;dr We introduce battery of synthetic distributions and metrics for measuring the success of generative models

Generative models are important for several practical applications, from low level image processing tasks, to model-based planning in robotics. More generally, the study of generative models is motivated by the long-standing endeavor to model uncertainty and to discover structure by leveraging unlabeled data. Unfortunately, the lack of an ultimate task of interest has hindered progress in the field, as there is no established way to compare models and, often times, evaluation is based on mere visual inspection of samples drawn from such models. In this work, we aim at addressing this problem by introducing a new benchmark evaluation suite, dubbed \textit{GenEval}. GenEval hosts a large array of distributions capturing many important properties of real datasets, yet in a controlled setting, such as lower intrinsic dimensionality, multi-modality, compositionality, independence and causal structure. Any model can be easily plugged for evaluation, provided it can generate samples. Our extensive evaluation suggests that different models have different strenghts, and that GenEval is a great tool to gain insights about how models and metrics work. We offer GenEval to the community~\footnote{Available at: \it{coming soon}.} and believe that this benchmark will facilitate comparison and development of new generative models.

##### Learning Mixed-Curvature Representations in Product Spaces

tl;dr Product manifold embedding spaces with heterogenous curvature yield improved representations compared to traditional embedding spaces for a variety of structures.

The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse space for embeddings; recently hyperbolic and spherical spaces are gaining popularity due to their ability to better embed new types of structured data---such as hierarchical data---but most data is not structured so uniformly. We address this problem by proposing embedding into a product manifold combining multiple copies of spherical, hyperbolic, and Euclidean spaces, providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional curvature of graph data and directly determine the signature---the number of component spaces and their dimensions---of the product manifold. Empirically, we jointly learn the curvature and the embedding in the product space via Riemannian optimization. We discuss how to define and compute intrinsic quantities such as means---a challenging notion for product manifolds---and provably learnable optimization functions. On a range of datasets and reconstruction tasks, our product space embeddings outperform single Euclidean or hyperbolic spaces used in previous works, reducing distortion by 32.55% on a Facebook social network dataset. We learn word embeddings and find that a product of hyperbolic spaces in 50 dimensions consistently improves on baseline Euclidean and hyperbolic embeddings by 2.6 points in Spearman rank correlation on similarity tasks and 3.4 points on analogy accuracy.

##### A RECURRENT NEURAL CASCADE-BASED MODEL FOR CONTINUOUS-TIME DIFFUSION PROCESS

No tl;dr =[

Many works have been proposed in the literature to capture the dynamics of diffusion in networks. While some of them define graphical markovian models to extract temporal relationships between node infections in networks, others consider diffusion episodes as sequences of infections via recurrent neural models. In this paper we propose a model at the crossroads of these two extremes, which embeds the history of diffusion in infected nodes as hidden continuous states. Depending on the trajectory followed by the content before reaching a given node, the distribution of influence probabilities may vary. However, content trajectories are usually hidden in the data, which induces challenging learning problems. We propose a topological recurrent neural model which exhibits good experimental performances for diffusion modelling and prediction.

##### Bayesian Deep Learning via Stochastic Gradient MCMC with a Stochastic Approximation Adaptation

tl;dr a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables

We propose a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables. Inspired by dropout, a popular tool for regularization and model ensemble, we assign sparse priors to the weights in deep neural networks (DNN) in order to achieve automatic dropout'' and avoid over-fitting. By alternatively sampling from posterior distribution through stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and optimizing latent variables via stochastic approximation (SA), the trajectory of the target weights is proved to converge to the true posterior distribution conditioned on optimal latent variables. This ensures a stronger regularization on the over-fitted parameter space and more accurate uncertainty quantification on the decisive variables. Simulations from large-p-small-n regressions showcase the robustness of this method when applied to models with latent variables. Additionally, its application on the convolutional neural networks (CNN) leads to state-of-the-art performance on MNIST and Fashion MNIST datasets and improved resistance to adversarial attacks.

##### Overfitting Detection of Deep Neural Networks without a Hold Out Set

tl;dr We introduce and analyze several criteria for detecting overfitting.

Overfitting is an ubiquitous problem in neural network training and usually mitigated using a holdout data set. Here we challenge this rationale and investigate criteria for overfitting without using a holdout data set. Specifically, we train a model for a fixed number of epochs multiple times with varying fractions of randomized labels and for a range of regularization strengths. A properly trained model should not be able to attain an accuracy greater than the fraction of properly labeled data points. Otherwise the model overfits. We introduce two criteria for detecting overfitting and one to detect underfitting. We analyze early stopping, the regularization factor, and network depth. In safety critical applications we are interested in models and parameter settings which perform well and are not likely to overfit. The methods of this paper allow characterizing and identifying such models.

##### The Variational Deficiency Bottleneck

tl;dr We develop a new bottleneck method based on channel deficiency.

We introduce a bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. The deficiency of one channel w.r.t. another has an operational interpretation in terms of the optimal risk gap of general decision problems, capturing classification as a special case. Unsupervised generalizations are possible, such as the deficiency autoencoder, which also can be formulated in a variational form. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in classification and reconstruction tasks.

##### Relational Forward Models for Multi-Agent Learning

tl;dr Relational Forward Models for multi-agent learning make accurate predictions of agents' future behavior, they produce intepretable representations and can be used inside agents.

The behavioral dynamics of multi-agent systems have a rich and orderly structure, which can be leveraged to understand these systems, and to improve how artificial agents learn to operate in them. Here we introduce Relational Forward Models (RFM) for multi-agent learning, networks that can learn to make accurate predictions of agents' future behavior in multi-agent environments. Because these models operate on the discrete entities and relations present in the environment, they produce interpretable intermediate representations which offer insights into what drives agents' behavior, and what events mediate the intensity and valence of social interactions. Furthermore, we show that embedding RFM modules inside agents results in faster learning systems compared to non-augmented baselines. As more and more of the autonomous systems we develop and interact with become multi-agent in nature, developing richer analysis tools for characterizing how and why agents make decisions is increasingly necessary. Moreover, developing artificial agents that quickly and safely learn to coordinate with one another, and with humans in shared environments, is crucial.

##### Learning to Progressively Plan

No tl;dr =[

For problem solving, making reactive decisions based on problem description is fast but inaccurate, while search-based planning using heuristics gives better solutions but could be exponentially slow. In this paper, we propose a new approach that improves an existing solution by iteratively picking and rewriting its local components until convergence. The rewriting policy employs a neural network trained with reinforcement learning. We evaluate our approach in two domains: job scheduling and expression simplification. Compared to common effective heuristics, baseline deep models and search algorithms, our approach efficiently gives solutions with higher quality.

##### Confidence-based Graph Convolutional Networks for Semi-Supervised Learning

tl;dr We propose a confidence based Graph Convolutional Network for Semi-Supervised Learning.

Predicting properties of nodes in a graph is an important problem with applications in a variety of domains. Graph-based Semi Supervised Learning (SSL) methods aim to address this problem by labeling a small subset of the nodes as seeds, and then utilizing the graph structure to predict label scores for the rest of the nodes in the graph. Recently, Graph Convolutional Networks (GCNs) have achieved impressive performance on the graph-based SSL task. In addition to label scores, it is also desirable to have a confidence score associated with them. Unfortunately, confidence estimation in the context of GCN has not been previously explored. We fill this important gap in this paper and propose ConfGCN, which estimates labels scores along with their confidences jointly in GCN-based setting. ConfGCN uses these estimated confidences to determine the influence of one node on another during neighborhood aggregation, thereby acquiring anisotropic capabilities. Through extensive analysis and experiments on standard benchmarks, we find that ConfGCN is able to significantly outperform state-of-the-art baselines. We have made ConfGCN’s source code available to encourage reproducible research.

##### Critical Learning Periods in Deep Networks

tl;dr Sensory deficits in early training phases can lead to irreversible performance loss in both artificial and neuronal networks, suggesting information phenomena as the common cause, and point to the importance of the initial transient and forgetting.

Similar to humans and animals, deep artificial neural networks exhibit critical periods during which a temporary stimulus deficit can impair the development of a skill. The extent of the impairment depends on the onset and length of the deficit window, as in animal models, and on the size of the neural network. Deficits that do not affect low-level statistics, such as vertical flipping of the images, have no lasting effect on performance and can be overcome with further training. To better understand this phenomenon, we use the Fisher Information of the weights to measure the effective connectivity between layers of a network during training. Counterintuitively, information raises rapidly in the early phases of training, and then decreases, preventing redistribution of information resources in a phenomenon we refer to as a loss of "Information Plasticity". Our analysis suggests that the first few epochs are critical for the creation of strong connections that are optimal relative to the input data distribution. Once such strong connections are created, they do not appear to change during additional training. These findings suggest that the initial learning transient, under-scrutinized compared to asymptotic behavior, plays a key role in determining the outcome of the training process. Our findings, combined with recent theoretical results in the literature, also suggest that forgetting (decrease of information in the weights) is critical to achieving invariance and disentanglement in representation learning. Finally, critical periods are not restricted to biological systems, but can emerge naturally in learning systems, whether biological or artificial, due to fundamental constrains arising from learning dynamics and information processing.

##### Selective Self-Training for semi-supervised Learning

tl;dr Our proposed algorithm does not use all of the unlabeled data for the training, and it rather uses them selectively.

Most of the conventional semi-supervised learning (SSL) methods assume that the classes of unlabeled data are contained in the set of classes of labeled data. In addition, these methods do not discriminate unlabeled samples and use all the unlabeled data for learning, which is not suitable for realistic situations. In this paper, we propose an SSL method called selective self-training (SST), which selectively decides whether to include each unlabeled sample in the training process or not. It is also designed to be applied to a more realistic situation where classes of unlabeled data are different from the ones of the labeled data. For the conventional SSL problems of fixed classes, the proposed method not only performs comparable to other conventional SSL algorithms, but also can be combined with other SSL algorithms. For the new SSL problems of increased classes where the conventional methods cannot be applied, the proposed method does not show any performance degradation even if the classes of unlabeled data are different from those of the labeled data.

##### A Mean Field Theory of Batch Normalization

tl;dr Batch normalization causes exploding gradients in vanilla feedforward networks.

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.

##### CAMOU: Learning Physical Vehicle Camouflages to Adversarially Attack Detectors in the Wild

tl;dr We propose a method to learn physical vehicle camouflage to adversarially attack object detectors in the wild. We find our camouflage effective and transferable.

In this paper, we conduct an interesting experimental study about the physical adversarial attack on object detectors in the wild. In particular, we learn a camouflage pattern to hide vehicles from being detected by state-of-the-art convolutional neural network based detectors. Our approach alternates between two threads. In the first, we train a neural approximation function to imitate how a simulator applies camouflages to vehicles and how a vehicle detector performs given an image generated by the simulator. In the second, we minimize the approximated detection score by searching for the optimal camouflage. Experiments show that the learned camouflage can not only hide a vehicle from the image-based detectors under many cases, but also generalizes to different environments, vehicles, and object detectors.

##### Detecting Out-Of-Distribution Samples Using Low-Order Deep Features Statistics

tl;dr Detecting out-of-distribution samples by using low-order feature statistics without requiring any change in underlying DNN.

The ability to detect when an input sample was not drawn from the training distribution is an important desirable property of deep neural networks. In this paper, we show that a simple ensembling of first and second order deep feature statistics can be exploited to effectively differentiate in-distribution and out-of-distribution samples. Specifically, we observe that the mean and standard deviation within feature maps differs greatly between in-distribution and out-of-distribution samples. Based on this observation, we propose a simple and efficient plug-and-play detection procedure that does not require re-training, pre-processing or changes to the model. The proposed method outperforms the state-of-the-art by a large margin in all standard benchmarking tasks, while being much simpler to implement and execute. Notably, our method improves the true negative rate from 86.6% to 96.8% when 95% of in-distribution (CIFAR-100) are correctly detected using a DenseNet and the out-of-distribution dataset is TinyImageNet resize. The source code of our method will be made publicly available.

##### Beyond Games: Bringing Exploration to Robots in Real-world

No tl;dr =[

Exploration has been a long standing problem in both model-based and model-free learning methods for sensorimotor control. While there has been major advances over the years, most of these successes have been demonstrated in either video games or simulation environments. This is primarily because the rewards (even the intrinsic ones) are non-differentiable since they are function of the environment (which is a black-box). In this paper, we focus on the policy optimization aspect of the intrinsic reward function. Specifically, by using a local approximation, we formulate intrinsic reward as a differentiable function so as to perform policy optimization using likelihood maximization -- much like supervised learning instead of reinforcement learning. This leads to a significantly sample efficient exploration policy. Our experiments clearly show that our approach outperforms both on-policy and off-policy optimization approaches like REINFORCE and DQN respectively. But most importantly, we are able to implement an exploration policy on a robot which learns to interact with objects completely from scratch just using data collected via the differentiable exploration module.

##### Linearizing Visual Processes with Deep Generative Models

tl;dr We model non-linear visual processes as autoregressive noise via generative deep learning.

This work studies the problem of modeling non-linear visual processes by leveraging deep generative architectures for learning linear, Gaussian models of observed sequences. We propose a joint learning framework, combining a multivariate autoregressive model and deep convolutional generative networks. After justification of theoretical assumptions of inearization, we propose an architecture that allows Variational Autoencoders and Generative Adversarial Networks to simultaneously learn the non-linear observation as well as the linear state-transition model from a sequence of observed frames. Finally, we demonstrate our approach on conceptual toy examples and dynamic textures.

##### Bias Also Matters: Bias Attribution for Deep Neural Network Explanation

tl;dr Attribute the bias terms of deep neural networks to input features by a backpropagation-type algorithm; Generate complementary and highly interpretable explanations of DNNs in addition to gradient-based attributions.

The gradient of a deep neural network (DNN) w.r.t. the input provides information that can be used to explain the output prediction in terms of the input features and has been widely studied to assist in interpreting DNNs. In a linear model (i.e., $g(x)=wx+b$), the gradient corresponds solely to the weights $w$. Such a model can reasonably locally linearly approximate a smooth nonlinear DNN, and hence the weights of this local model are the gradient. The other part, however, of a local linear model, i.e., the bias $b$, is usually overlooked in attribution methods since it is not part of the gradient. In this paper, we observe that since the bias in a DNN also has a non-negligible contribution to the correctness of predictions, it can also play a significant role in understanding DNN behaviors. In particular, we study how to attribute a DNN's bias to its input features. We propose a backpropagation-type algorithm bias back-propagation (BBp)'' that starts at the output layer and iteratively attributes the bias of each layer to its input nodes as well as combining the resulting bias term of the previous layer. This process stops at the input layer, where summing up the attributions over all the input features exactly recovers $b$. Together with the backpropagation of the gradient generating $w$, we can fully recover the locally linear model $g(x)=wx+b$. Hence, the attribution of the DNN outputs to its inputs is decomposed into two parts, the gradient $w$ and the bias attribution, providing separate and complementary explanations. We study several possible attribution methods applied to the bias of each layer in BBp. In experiments, we show that BBp can generate complementary and highly interpretable explanations of DNNs in addition to gradient-based attributions.

##### Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions

tl;dr We designed an autoencoder which is trained to learn a convex latent distribution by using an adversarial loss function to discriminate latent space interpolations from real data.

We present a neural network architecture based upon the Autoencoder (AE) and Generative Adversarial Network (GAN) that promotes a convex latent distribution by training adversarially on latent space interpolations. By using an AE as both the generator and discriminator of a GAN, we pass a pixel-wise error function across the discriminator, yielding an AE which produces non-blurry samples that match both high- and low-level features of the original images. Interpolations between images in this space remain within the latent-space distribution of real images as trained by the discriminator, and therfore preserve realistic resemblances to the network inputs.

##### Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

No tl;dr =[

Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a trade-off between transfer and interference. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.

##### Do Language Models Have Common Sense?

tl;dr We present evidence that LMs do capture common sense with state-of-the-art results on both Winograd Schema Challenge and Commonsense Knowledge Mining.

It has been argued that current machine learning models do not have commonsense, and therefore must be hard-coded with prior knowledge (Marcus, 2018). Here we show surprising evidence that language models can already learn to capture certain common sense knowledge. Our key observation is that a language model can compute the probability of any statement, and this probability can be used to evaluate the truthfulness of that statement. On the Winograd Schema Challenge (Levesque et al., 2011), language models are 11% higher in accuracy than previous state-of-the-art supervised methods. Language models can also be fine-tuned for the task of Mining Commonsense Knowledge on ConceptNet to achieve an F1 score of 0.912 and 0.824, outperforming previous best results (Jastrzebskiet al., 2018). Further analysis demonstrates that language models can discover unique features of Winograd Schema contexts that decide the correct answers without explicit supervision.

##### REVERSED NEURAL NETWORK - AUTOMATICALLY FINDING NASH EQUILIBRIUM

tl;dr REVERSED NEURAL NETWORK - A PRIMAL

Contrary to most reinforcement learning studies emphasizing on approximating the output layer of a neural network to certain strategies, this paper proposes a reversed way for reinforcement learning. We call this “Reversed Neural Network”. In short, after sufficiently training a canonical deep feed-forward neural network according to a strategy-and-environment-to-payoff table, we randomize part of the neurons in the input layer and propagate the error between the generated output and the desired output back to the part of the neurons in the “input layer” of the trained deep neural network recurrently. And we view the final neurons in the “input layer” as the fittest strategy for a neural network.

##### Prototypical Examples in Deep Learning: Metrics, Characteristics, and Utility

tl;dr We can identify prototypical and outlier examples in machine learning that are quantifiably very different, and make use of them to improve many aspects of neural networks.

Machine learning (ML) research has investigated prototypes: examples that are representative of the behavior to be learned. We systematically evaluate five methods for identifying prototypes, both ones previously introduced as well as new ones we propose, finding all of them to provide meaningful but different interpretations. Through a human study, we confirm that all five metrics are well matched to human intuition. Examining cases where the metrics disagree offers an informative perspective on the properties of data and algorithms used in learning, with implications for data-corpus construction, efficiency, adversarial robustness, interpretability, and other ML aspects. In particular, we confirm that the "train on hard" curriculum approach can improve accuracy on many datasets and tasks, but that it is strictly worse when there are many mislabeled or ambiguous examples.

##### Contextualized Role Interaction for Neural Machine Translation

tl;dr We propose a role interaction layer that explicitly models the modulation of token representations by contextualized roles.

Word inputs tend to be represented as single continuous vectors in deep neural networks. It is left to the subsequent layers of the network to extract relevant aspects of a word's meaning based on the context in which it appears. In this paper, we investigate whether word representations can be improved by explicitly incorporating the idea of latent roles. That is, we propose a role interaction layer (RIL) that consists of context-dependent (latent) role assignments and role-specific transformations. We evaluate the RIL on machine translation using two language pairs (En-De and En-Fi) and three datasets of varying size. We find that the proposed mechanism improves translation quality over strong baselines with limited amounts of data, but that the improvement diminishes as the size of data grows, indicating that powerful neural MT systems are capable of implicitly modeling role-word interaction by themselves. Our qualitative analysis reveals that the RIL extracts meaningful context-dependent roles and that it allows us to inspect more deeply the internal mechanisms of state-of-the-art neural machine translation systems.

##### How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification

No tl;dr =[

Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, besides confirming that the model accuracy improves as the amount of training data increases, we also observe that the model robustness improves initially, but there exists a turning point after which robustness starts to decrease. How and when such turning points occur vary for different neural networks and different datasets.

##### Towards Metamerism via Foveated Style Transfer

tl;dr We introduce a novel feed-forward framework to generate visual metamers

The problem of visual metamerism is defined as finding a family of perceptually indistinguishable, yet physically different images. In this paper, we propose our NeuroFovea metamer model, a foveated generative model that is based on a mixture of peripheral representations and style transfer forward-pass algorithms. Our gradient-descent free model is parametrized by a foveated VGG19 encoder-decoder which allows us to encode images in high dimensional space and interpolate between the content and texture information with adaptive instance normalization anywhere in the visual field. Our contributions include: 1) A framework for computing metamers that resembles a noisy communication system via a foveated feed-forward encoder-decoder network – We observe that metamerism arises as a byproduct of noisy perturbations that partially lie in the perceptual null space; 2) A perceptual optimization scheme as a solution to the hyperparametric nature of our metamer model that requires tuning of the image-texture tradeoff coefficients everywhere in the visual field which are a consequence of internal noise; 3) An ABX psychophysical evaluation of our metamers where we also find that the rate of growth of the receptive fields in our model match V1 for reference metamers and V2 between synthesized samples. Our model also renders metamers at roughly a second, presenting a ×1000 speed-up compared to the previous work, which now allows for tractable data-driven metamer experiments.

##### Eidetic 3D LSTM: A Model for Video Prediction and Beyond

No tl;dr =[

Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes local perceptrons of RNNs motion aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism eidetic as it is able to effectively recall the stored memories across multiple timestamps even after long periods of disturbance. We first evaluate the spatiotemporal modeling capability of the E3D-LSTM model on two widely-used future video prediction datasets and achieve the state of the art performance. Then we demonstrate that with a self-supervised auxiliary learning strategy, the E3D-LSTM network performs well on early activity recognition to infer what is happening after observing only limited frames of video.

No tl;dr =[

##### Learning Disentangled Representations with Reference-Based Variational Autoencoders

No tl;dr =[

Learning disentangled representations from visual data, where high-level generative factors correspond to independent dimensions of feature vectors, is of importance for many computer vision tasks. Supervised approaches, however, require a significant annotation effort in order to label the factors of interest in a training set. To alleviate the annotation cost, we introduce a learning setting which we refer to as "reference-based disentangling''. Given a pool of unlabelled images, the goal is to learn a representation where a set of target factors are disentangled from others. The only supervision comes from an auxiliary "reference set" that contains images where the factors of interest are constant. In order to address this problem, we propose reference-based variational autoencoders, a novel deep generative model designed to exploit the weak supervisory signal provided by the reference set. During training, we use the variational inference framework where adversarial learning is used to minimize the objective function. By addressing tasks such as feature learning, conditional image generation or attribute transfer, we validate the ability of the proposed model to learn disentangled representations from minimal supervision.

##### Neural MMO: A massively multiplayer game environment for intelligent agents

tl;dr An MMO-inspired research game platform for studying emergent behaviors of large populations in a complex environment

We present an artificial intelligence research platform inspired by the human game genre of MMORPGs (Massively Multiplayer Online Role-Playing Games, a.k.a. MMOs). We demonstrate how this platform can be used to study behavior and learning in large populations of neural agents. Unlike currently popular game environments, our platform supports persistent environments, with variable number of agents, and open-ended task descriptions. The emergence of complex life on Earth is often attributed to the arms race that ensued from a huge number of organisms all competing for finite resources. Our platform aims to simulate this setting in microcosm: we conduct a series of experiments to test how large-scale multiagent competition can incentivize the development of skillful behavior. We find that population size magnifies the complexity of the behaviors that emerge and results in agents that out-compete agents trained in smaller populations.

##### Learning Representations of Categorical Feature Combinations via Self-Attention

No tl;dr =[

Self-attention has been widely used to model the sequential data and achieved remarkable results in many applications. Although it can be used to model dependencies without regard to positions of sequences, self-attention is seldom applied to non-sequential data. In this work, we propose to learn representations of multi-field categorical data in prediction tasks via self-attention mechanism, where features are orderless but have intrinsic relations over different fields. In most current DNN based models, feature embeddings are simply concatenated for further processing by networks. Instead, by applying self-attention to transform the embeddings, we are able to relate features in different fields and automatically learn representations of their combinations, which are known as the factors of many prevailing linear models. To further improve the effect of feature combination mining, we modify the original self-attention structure by restricting the similarity weight to have at most k non-zero values, which additionally regularizes the model. We experimentally evaluate the effectiveness of our self-attention model on non-sequential data. Across two click through rate prediction benchmark datasets, i.e., Cretio and Avazu, our model with top-k restricted self-attention achieves the state-of-the-art performance. Compared with the vanilla MLP, the gain by adding self-attention is significantly larger than that by modifying the network structures, which most current works focus on.

##### Neural Network Cost Landscapes as Quantum States

tl;dr We show that NN parameter and hyperparamter cost landscapes can be generated as quantum states using a single quantum circuit and that these can be used for training and meta-training.

Quantum computers promise significant advantages over classical computers for a number of different applications. We show that the complete loss function landscape of a neural network can be represented as the quantum state output by a quantum computer. We demonstrate this explicitly for a binary neural network and, further, show how a quantum computer can train the network by manipulating this state using a well-known algorithm known as quantum amplitude amplification. We further show that with minor adaptation, this method can also represent the meta-loss landscape of a number of neural network architectures simultaneously. We search this meta-loss landscape with the same method to simultaneously train and design a binary neural network.

##### Deepström Networks

tl;dr A new neural architecture where top dense layers of standard convolutional architectures are replaced with an approximation of a kernel function by relying on the Nyström approximation.

Recent work has focused on combining kernel methods and deep learning. With this in mind, we introduce Deepström networks -- a new architecture of neural networks which we use to replace top dense layers of standard convolutional architectures with an approximation of a kernel function by relying on the Nyström approximation. Our approach is easy highly flexible. It is compatible with any kernel function and it allows exploiting multiple kernels. We show that Deepström networks reach state-of-the-art performance on standard datasets like SVHN and CIFAR100. One benefit of the method lies in its limited number of learnable parameters which make it particularly suited for small training set sizes, e.g. from 5 to 20 samples per class. Finally we illustrate two ways of using multiple kernels, including a multiple Deepström setting, that exploits a kernel on each feature map output by the convolutional part of the model.

##### Siamese Capsule Networks

tl;dr A variant of capsule networks that can be used for pairwise learning tasks. Results shows that Siamese Capsule Networks work well in the few shot learning setting.

Capsule Networks have shown encouraging results on defacto benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. Although, they are yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations and (2) there are very few instances per class to learn from and (3) where point-wise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce Siamese Capsule Networks, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with l2-normalized capsule encoded pose features. We find that Siamese Capsule Networks perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects.

##### Measuring and regularizing networks in function space

tl;dr It is cheap to measure distances in function space, and these distances aren't always proportional to the corresponding parameter distances.

To optimize a neural network one often thinks of optimizing its parameters, but it is ultimately a matter of optimizing the function that maps inputs to outputs. Since a change in the parameters might serve as a poor proxy for the change in the function, it is of some concern that primacy is given to parameters but that the correspondence has not been tested. Here, we show that it is simple and computationally feasible to calculate distances between functions in a $L^2$ Hilbert space. We examine how typical networks behave in this space, and compare how parameter $\ell^2$ distances compare to function $L^2$ distances between various points of an optimization trajectory. We find that the two distances are nontrivially related. In particular, the $L^2/\ell^2$ ratio decreases throughout optimization, reaching a steady value around when test error plateaus. We then investigate how the $L^2$ distance could be applied directly to optimization. We first propose that in multitask learning, one can avoid catastrophic forgetting by directly limiting how much the input/output function changes between tasks. Secondly, we propose a new learning rule that regularizes the distance a network can travel through $L^2$-space in any one update. This allows new examples to be learned in a way that minimally interferes with what has previously been learned. These applications demonstrate how one can measure and regularize function distances directly, without relying on parameters or local approximations like loss curvature.

##### TherML: The Thermodynamics of Machine Learning

tl;dr We offer a framework for representation learning that connects with a wide class of existing objectives and is analogous to thermodynamics.

In this work we offer an information-theoretic framework for representation learning that connects with a wide class of existing objectives in machine learning. We develop a formal correspondence between this work and thermodynamics and discuss its implications.

tl;dr We found adversarial training not only speeds up the GAN training but also increases the image quality

In this paper, we are interested in two seemingly different concepts: \textit{adversarial training} and \textit{generative adversarial networks (GANs)}. Particularly, how these techniques work to improve each other. To this end, we analyze the limitation of adversarial training as a defense method, starting from questioning how well the robustness of a model can generalize. Then, we successfully improve the generalizability via data augmentation by the fake'' images sampled from generative adversarial network. After that, we are surprised to see that the resulting robust classifier leads to a better generator, for free. We intuitively explain this interesting phenomenon and leave the theoretical analysis for future work. Motivated by these observations, we propose a system that combines generator, discriminator, and adversarial attacker together in a single network. After end-to-end training and fine tuning, our method can simultaneously improve the robustness of classifiers, measured by accuracy under strong adversarial attacks, and the quality of generators, evaluated both aesthetically and quantitatively. In terms of the classifier, we achieve better robustness than the state-of-the-art adversarial training algorithm proposed in (Madry \textit{et al.}, 2017), while our generator achieves competitive performance compared with SN-GAN (Miyato and Koyama, 2018).

##### Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives

tl;dr Doubly reparameterized gradient estimators provide unbiased variance reduction which leads to improved performance.

Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by (Kingma & Welling 2013, Rezende et al. 2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al. 2018, Le et al. 2018). Roeder et a. (2017) propose an improved gradient estimator, however, are unable to show it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not suffer as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep update (RWS) (Bornschein & Bengio 2014), and the jackknife variational inference (JVI) gradient (Nowozin 2018). Finally, we show that this computationally efficient, drop-in estimator translates to improved performance for all three objectives on several modeling tasks.

##### Local Binary Pattern Networks for Character Recognition

No tl;dr =[

Memory and computation efficient deep learning architectures are crucial to the continued proliferation of machine learning capabilities to new platforms and systems. Binarization of operations in convolutional neural networks has shown promising results in reducing the model size and computing efficiency. In this paper, we tackle the character recognition problem using a strategy different from the existing literature by proposing local binary pattern networks or LBPNet that can learn and perform bit-wise operations in an end-to-end fashion. LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations, providing important means to improve memory and speed efficiency that is particularly suited for small footprint devices and hardware accelerators. These operations can be implemented efficiently on different platforms including direct hardware implementation. LBPNet demonstrates its particular advantage on the character classification task where the content is composed of strokes. We applied LBPNet to benchmark datasets like MNIST, SVHN, DHCD, ICDAR, and Chars74K and observed encouraging results.

##### Using GANs for Generation of Realistic City-Scale Ride Sharing/Hailing Data Sets

tl;dr This paper focuses on the synthetic generation of human mobility data in urban areas using GANs.

This paper focuses on the synthetic generation of human mobility data in urban areas. We present a novel and scalable application of Generative Adversarial Networks (GANs) for modeling and generating human mobility data. We leverage actual ride requests from ride sharing/hailing services from four major cities in the US to train our GANs model. Our model captures the spatial and temporal variability of the ride-request patterns observed for all four cities on any typical day and over any typical week. Previous works have succinctly characterized the spatial and temporal properties of human mobility data sets using the fractal dimensionality and the densification power law, respectively, which we utilize to validate our GANs-generated synthetic data sets. Such synthetic data sets can avoid privacy concerns and be extremely useful for researchers and policy makers on urban mobility and intelligent transportation.

##### Feature Transformers: A Unified Representation Learning Framework for Lifelong Learning

tl;dr Single generic mathematical framework for lifelong learning paradigms with data privacy

Despite the recent advances in representation learning, lifelong learning continues to be one of the most challenging and unconquered problems. Catastrophic forgetting and data privacy constitute two of the important challenges for a successful lifelong learner. Further, existing techniques are designed to handle only specific manifestations of lifelong learning, whereas a practical lifelong learner is expected to switch and adapt seamlessly to different scenarios. In this paper, we present a single, unified mathematical framework for handling the myriad variants of lifelong learning, while alleviating these two challenges. We utilize an external memory to store only the features representing past data and learn richer and newer representations incrementally through transformation neural networks - feature transformers. We define, simulate and demonstrate exemplary performance on a realistic lifelong experimental setting using the MNIST rotations dataset, paving the way for practical lifelong learners. To illustrate the applicability of our method in data sensitive domains like healthcare, we study the pneumothorax classification problem from X-ray images, achieving near gold standard performance. We also benchmark our approach with a number of state-of-the art methods on MNIST rotations and iCIFAR100 datasets demonstrating superior performance.

##### Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning

tl;dr This paper introduces Morpho-MNIST, a collection of shape metrics and perturbations, in a step towards quantitative evaluation of representation learning in computer vision.

Revealing latent structure in data is an active field of research, having brought exciting new models such as variational autoencoders and generative adversarial networks, and is essential to push machine learning towards unsupervised knowledge discovery. However, a major challenge is the lack of suitable benchmarks for an objective and quantitative evaluation of learned representations. To address this issue we introduce Morpho-MNIST. We extend the popular MNIST dataset by adding a morphometric analysis enabling quantitative comparison of different models, identification of the roles of latent variables, and characterisation of sample diversity. We further propose a set of quantifiable perturbations to assess the performance of unsupervised and supervised methods on challenging tasks such as outlier detection and domain adaptation.

##### Dual Skew Divergence Loss for Neural Machine Translation

No tl;dr =[

For neural sequence model training, maximum likelihood (ML) has been commonly adopted to optimize model parameters with respect to the corresponding objective. However, in the case of sequence prediction tasks like neural machine translation (NMT), training with the ML-based cross entropy loss would often lead to models that overgeneralize and plunge into local optima. In this paper, we propose an extended loss function called dual skew divergence (DSD), which aims to give a better tradeoff between generalization ability and error avoidance during NMT training. Our empirical study indicates that switching to DSD loss after the convergence of ML training helps the model skip the local optimum and stimulates a stable performance improvement. The evaluations on WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss indeed helps bring about better translation performance than several baselines.

##### Information Regularized Neural Networks

tl;dr we propose a regularizer that improves the classification performance of neural networks

We formulate an information-based optimization problem for supervised classification. For invertible neural networks, the control of these information terms is passed down to the latent features and parameter matrix in the last fully connected layer, given that mutual information is invariant under invertible map. We propose an objective function and prove that it solves the optimization problem. Our framework allows us to learn latent features in an more interpretable form while improving the classification performance. We perform extensive quantitative and qualitative experiments in comparison with the existing state-of-the-art classification models.

##### An adaptive homeostatic algorithm for the unsupervised learning of visual features

tl;dr Unsupervised learning is hard and depends on normalisation heuristics. Can we find another simpler approach?

The formation of structure in the brain, that is, of the connection between cells within neural populations, is by large an unsupervised learning process: The emergence of this architecture is mostly self-organized. In the primary visual cortex of mammals, for example, one may observe during development the formation of cells selective to localized, oriented features. This leads to the development of a rough representation of contours of the retinal image in area V1. We modeled these mechanisms using sparse Hebbian learning algorithms. These algorithms alternate a coding step to encode the information with a learning step to find the proper encoder. A major difficulty faced by these algorithms is to deduce a good representation while knowing immature encoders, and to learn good encoders with a non-optimal representation. To address this problem, we propose here to introduce a new regulation process between learning and coding, called homeostasis. Our homeostasis is compatible with a neuro-mimetic architecture and allows for the fast emergence of localized filters sensitive to orientation. The key to this algorithm lies in a simple adaptation mechanism based on non-linear functions that reconciles the antagonistic processes that occur at the coding and learning time scales. We tested this unsupervised algorithm with this homeostasis rule for a range of existing unsupervised learning algorithms coupled with different neural coding algorithms. In addition, we propose a simplification of this optimal homeostasis rule by implementing a simple heuristic on the probability of activation of neurons. Compared to the optimal homeostasis rule, we show that this heuristic allows to implement a more rapid unsupervised learning algorithm while keeping a large part of its effectiveness. These results demonstrate the potential application of such a strategy in machine learning and we illustrate this with one result in a convolutional neural network.

##### Learning to Make Analogies by Contrasting Abstract Relational Structure

tl;dr The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains.

Analogical reasoning has been a principal focus of various waves of AI research. Analogy is particularly challenging for machines because it requires relational structures to be represented such that they can be flexibly applied across diverse domains of experience. Here, we study how analogical reasoning can be induced in neural networks that learn to perceive and reason about raw visual data. We find that the critical factor for inducing such a capacity is not an elaborate architecture, but rather, careful attention to the choice of data and the manner in which it is presented to the model. The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains, a training method that uses only the input data to force models to learn about important abstract features. Using this technique we demonstrate capacities for complex, visual and symbolic analogy making and generalisation in even the simplest neural network architectures.

##### Latent Transformations for View Synthesis with Conditional Convolutional Networks

tl;dr We introduce an effective, general framework for incorporating conditioning information into inference-based generative models.

We propose a fully-convolutional conditional generative model, the latent transformation neural network (LTNN), capable of view synthesis using a light-weight neural network suited for real-time applications. In contrast to existing conditional generative models which incorporate conditioning information via concatenation, we introduce a dedicated network component, the conditional transformation unit (CTU), designed to learn the latent space transformations corresponding to specified target views. In addition, a consistency loss term is defined to guide the network toward learning the desired latent space mappings, a task-divided decoder is constructed to refine the quality of generated views, and an adaptive discriminator is introduced to improve the adversarial training process. The generality of the proposed methodology is demonstrated on a collection of three diverse tasks: multi-view reconstruction on real hand depth images, view synthesis of real and synthetic faces, and the rotation of rigid objects. The proposed model is shown to exceed state-of-the-art results in each category while simultaneously achieving a reduction in the computational demand required for inference by 30% on average.

##### Nonlinear Channels Aggregation Networks for Deep Action Recognition

tl;dr An architecture enables CNN trained on the video sequences converging rapidly

We introduce the concept of channel aggregation in ConvNet architecture, a novel compact representation of CNN features useful for explicitly modeling the nonlinear channels encoding especially when the new unit is embedded inside of deep architectures for action recognition. The channel aggregation is based on multiple-channels features of ConvNet and aims to be at the spot finding the optical convergence path at fast speed. We name our proposed convolutional architecture “nonlinear channels aggregation networks (NCAN)” and its new layer “nonlinear channels aggregation layer (NCAL)”. We theoretically motivate channels aggregation functions and empirically study their effect on convergence speed and classification accuracy. Another contribution in this work is an efficient and effective implementation of the NCAL, speeding it up orders of magnitude. We evaluate its performance on standard benchmarks UCF101 and HMDB51, and experimental results demonstrate that this formulation not only obtains a fast convergence but stronger generalization capability without sacrificing performance.

##### Task-GAN for Improved GAN based Image Restoration

tl;dr Couple the GAN based image restoration framework with another task-specific network to generate realistic image while preserving task-specific features.

##### Learning Localized Generative Models for 3D Point Clouds via Graph Convolution

tl;dr A GAN using graph convolution operations with dynamically computed graphs from hidden features

Point clouds are an important type of geometric data and have widespread use in computer graphics and vision. However, learning representations for point clouds is particularly challenging due to their nature as being an unordered collection of points irregularly distributed in 3D space. Graph convolution, a generalization of the convolution operation for data defined over graphs, has been recently shown to be very successful at extracting localized features from point clouds in supervised or semi-supervised tasks such as classification or segmentation. This paper studies the unsupervised problem of a generative model exploiting graph convolution. We focus on the generator of a GAN and define methods for graph convolution when the graph is not known in advance as it is the very output of the generator. The proposed architecture learns to generate localized features that approximate graph embeddings of the output geometry. We also study the problem of defining an upsampling layer in the graph-convolutional generator, whereby it learns to exploit a self-similarity prior to sample the data distribution.

##### Harmonic Unpaired Image-to-image Translation

tl;dr Smooth regularization over sample graph for unpaired image-to-image translation results in significantly improved consistency

The recent direction of unpaired image-to-image translation is on one hand very exciting as it alleviates the big burden in obtaining label-intensive pixel-to-pixel supervision, but it is on the other hand not fully satisfactory due to the presence of artifacts and degenerated transformations. In this paper, we take a manifold view of the problem by introducing a smoothness constraint over the sample graph to attain harmonic functions to enforce consistent mappings during the translation. We develop HarmonicGAN to learn bi-directional translations between the source and the target domain. With the help of similarity-consistency, the inherent self-consistency property of samples can be maintained. Distance metrics defined on two types of features including histogram and CNN are exploited. Under an identical problem setting as CycleGAN without additional manual inputs, HarmonicGAN demonstrates a significant qualitative and quantitative improvement over the state of the art, as well as improved interpretability. We show experimental results in a number of applications including medical imaging, object transfiguration, and semantic labeling. We outperform the competing methods in all tasks, and for a medical imaging task in particular our method turns CycleGAN from a failure to a success, halving the mean-squared error, and generating images that radiologists prefer over competing methods in 95% of cases.

##### Deconfounding Reinforcement Learning

tl;dr This is the first attempt to build a bridge between confounding and the full reinforcement learning problem.

In this paper, we propose a general formulation to cope with a family of reinforcement learning tasks in which confounder (i.e., a factor affecting both actions and rewards) exists in dynamic environments. Based on the proposed approach, we extend two representatives of reinforcement learning algorithms: Q-learning and Actor-Critic Methods, to their deconfounding variants. Due to lack of datasets in this direction, a benchmark is developed for deconfounding reinforcement learning algorithms by revising OpenAI Gym and MNIST. We demonstrate that the proposed algorithms are superior to traditional reinforcement learning algorithms in confounding environments. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing full reinforcement learning problems.

##### Learning Corresponded Rationales for Text Matching

tl;dr We propose a novel self-explaining architecture to predict matches between two sequences of texts. Specifically, we introduce the notion of corresponded rationales and learn to extract them by the distal supervision from the downstream task.

The ability to predict matches between two sources of text has a number of applications including natural language inference (NLI) and question answering (QA). While flexible neural models have become effective tools in solving these tasks, they are rarely transparent in terms of the mechanism that mediates the prediction. In this paper, we propose a self-explaining architecture where the model is forced to highlight, in a dependent manner, how spans of one side of the input match corresponding segments of the other side in order to arrive at the overall decision. The text spans are regularized to be coherent and concise, and their correspondence is captured explicitly. The text spans -- rationales -- are learned entirely as latent mechanisms, guided only by the distal supervision from the end-to-end task. We evaluate our model on both NLI and QA using three publicly available datasets. Experimental results demonstrate quantitatively and qualitatively that our method delivers interpretable justification of the prediction without sacrificing state-of-the-art performance. Our code and data split will be publicly available.

##### Perception-Aware Point-Based Value Iteration for Partially Observable Markov Decision Processes

tl;dr We develop a point-based value iteration solver for POMDPs with active perception and planning tasks.

Partially observable Markov decision processes (POMDPs) are a widely-used framework to model decision-making with uncertainty about the environment and under stochastic outcome. In conventional POMDP models, the observations that the agent receives originate from fixed known distribution. However, in a variety of real-world scenarios the agent has an active role in its perception by selecting which observations to receive. Due to combinatorial nature of such selection process, it is computationally intractable to integrate the perception decision with the planning decision. To prevent such expansion of the action space, we propose a greedy strategy for observation selection. We develop a novel point-based value iteration algorithm that incorporates the greedy strategy to find near-optimal selection decision for sampled belief points. This in turn enables the solver to efficiently approximate the reachable subspace of belief simplex by essentially separating computations related to perception from planning. Lastly, we implement the proposed solver and demonstrate its performance and computational advantage in a range of robotic scenarios where the robot simultaneously performs active perception and planning.

##### Deep Probabilistic Video Compression

tl;dr Deep Probabilistic Video Compression Via Sequential Variational Autoencoders

We propose a variational inference approach to deep probabilistic video compression. Our model uses advances in variational autoencoders (VAEs) for sequential data and combines it with recent work on neural image compression. The approach jointly learns to transform the original video into a lower-dimensional representation as well as to entropy code this representation according to a temporally-conditioned probabilistic model. We split the latent space into local (per frame) and global (per segment) variables, and show that training the VAE to utilize both representations leads to an improved rate-distortion performance. Evaluation on small videos from public data sets with varying complexity and diversity show that our model yields competitive results when trained on generic video content. Extreme compression performance is achieved for videos with specialized content if the model is trained on similar videos.

##### Open Vocabulary Learning on Source Code with a Graph-Structured Cache

tl;dr We show that caching out-of-vocabulary words in a graph, with edges connecting them to their usages, and processing it with a graph neural network improves performance on supervised learning tasks on computer source code.

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task --- with over 100\% relative improvement on the latter --- at the cost of a moderate increase in computation time.

##### Two-Timescale Networks for Nonlinear Value Function Approximation

tl;dr We propose an architecture for learning value functions which allows the use of any linear policy evaluation algorithm in tandem with nonlinear feature learning.

A key component for many reinforcement learning agents is to learn a value function, either for policy evaluation or control. Many of the algorithms for learning values, however, are designed for linear function approximation---with a fixed basis or fixed representation. Though there have been a few sound extensions to nonlinear function approximation, such as nonlinear gradient temporal difference learning, these methods have largely not been adopted, eschewed in favour of simpler but not sound methods like temporal difference learning and Q-learning. In this work, we provide a two-timescale network (TTN) architecture that enables linear methods to be used to learn values, with the representation learned at a slower timescale. The approach facilitates use of algorithms developed for the linear setting, such as data-efficient least-squares methods, eligibility traces and the myriad of recently developed linear policy evaluation algorithms. We prove convergence for TTNs, with particular care given to ensure convergence of the fast linear component under potentially dependent features provided by the learned representation. We empirically demonstrate the benefits of TTNs, compared to other nonlinear value function approximation algorithms, both for policy evaluation and control.

##### Combining Learned Representations for Combinatorial Optimization

tl;dr We use combinations of RBMs to solve number factorization and combinatorial optimization problems.

We propose a new approach to combine Restricted Boltzmann Machines (RBMs) that can be used to solve combinatorial optimization problems. This allows synthesis of larger models from smaller RBMs that have been pretrained, thus effectively bypassing the problem of learning in large RBMs, and creating a system able to model a large, complex multi-modal space. We validate this approach by using learned representations to create invertible boolean logic'', where we can use Markov chain Monte Carlo (MCMC) approaches to find the solution to large scale combinatorial optimization problems. Using this method, we are able to solve 64 bit addition based problems, as well as factorize 16 bit numbers. We find that these combined representations can provide a more accurate result for the same sample size as compared to a fully trained model.

##### ATTENTION INCORPORATE NETWORK: A NETWORK CAN ADAPT VARIOUS DATA SIZE

No tl;dr =[

In traditional neural networks for image processing, the inputs of the neural networks should be the same size such as 224×224×3. But how can we train the neural net model with different input size? A common way to do is image deformation which accompany a problem of information loss (e.g. image crop or wrap). In this paper we propose a new network structure called Attention Incorporate Network(AIN). It solve the problem of different size of input images and extract the key features of the inputs by attention mechanism, pay different attention depends on the importance of the features not rely on the data size. Experimentally, AIN achieve a higher accuracy, better convergence comparing to the same size of other network structure.

##### One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL

tl;dr We present MetaMimic, an algorithm that takes as input a demonstration dataset and outputs (i) a one-shot high-fidelity imitation policy (ii) an unconditional task policy.

Humans are experts at high-fidelity imitation -- closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions.

##### Probabilistic Model-Based Dynamic Architecture Search

tl;dr We present an efficient neural network architecture search method based on stochastic natural gradient method via probabilistic modeling.

The architecture search methods for convolutional neural networks (CNNs) have shown promising results. These methods require significant computational resources, as they repeat the neural network training many times to evaluate and search the architectures. Developing the computationally efficient architecture search method is an important research topic. In this paper, we assume that the structure parameters of CNNs are categorical variables, such as types and connectivities of layers, and they are regarded as the learnable parameters. Introducing the multivariate categorical distribution as the underlying distribution for the structure parameters, we formulate a differentiable loss for the training task, where the training of the weights and the optimization of the parameters of the distribution for the structure parameters are coupled. They are trained using the stochastic gradient descent, leading to the optimization of the structure parameters within a single training. We apply the proposed method to search the architecture for two computer vision tasks: image classification and inpainting. The experimental results show that the proposed architecture search method is fast and can achieve comparable performance to the existing methods.

##### Classification in the dark using tactile exploration

tl;dr In this work, we study the problem of learning representations to identify novel objects by exploring objects using tactile sensing. Key point here is that the query is provided in image domain.

Combining information from different sensory modalities to execute goal directed actions is a key aspect of human intelligence. Specifically, human agents are very easily able to translate the task communicated in one sensory domain (say vision) into a representation that enables them to complete this task when they can only sense their environment using a separate sensory modality (say touch). In order to build agents with similar capabilities, in this work we consider the problem of a retrieving a target object from a drawer. The agent is provided with an image of a previously unseen object and it explores objects in the drawer using only tactile sensing to retrieve the object that was shown in the image without receiving any visual feedback. Success at this task requires close integration of visual and tactile sensing. We present a method for performing this task in a simulated environment using an anthropomorphic hand. We hope that future research in the direction of combining sensory signals for acting will find the object retrieval from a drawer to be a useful benchmark problem

##### From Amortised to Memoised Inference: Combining Wake-Sleep and Variational-Bayes for Unsupervised Few-Shot Program Learning

tl;dr We extend the wake-sleep algorithm and use it to learn to learn structured models from few examples,

Given a large database of concepts but only one or a few examples of each, can we learn models for each concept that are not only generalisable, but interpretable? In this work, we aim to tackle this problem through hierarchical Bayesian program induction. We present a novel learning algorithm which can infer concepts as short, generative, stochastic programs, while learning a global prior over programs to improve generalisation and a recognition network for efficient inference. Our algorithm, Wake-Sleep-Remember (WSR), combines gradient learning for continuous parameters with neurally-guided search over programs. We show that WSR learns compelling latent programs in two tough symbolic domains: cellular automata and Gaussian process kernels. We also collect and evaluate on a new dataset, Text-Concepts, for discovering structured patterns in natural text data.

##### ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.

tl;dr ImageNet-trained CNNs are biased towards object texture (instead of shape like humans). Overcoming this bias (using a novel data augmentation) yields improved detection performance and previously unseen robustness to image distortions.

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies hint to a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on our novel Stylized-ImageNet dataset. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

##### Discovering General-Purpose Active Learning Strategies

No tl;dr =[

We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely reflects the AL objective of minimizing the annotation cost We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines.

##### Learning Deep Embeddings in Krein Spaces

tl;dr We propose a solution that realizes deep embeddings in Krein spaces.

The non-linear embedding achieved by a Siamese network is indeed a realization of a Hilbert space, \ie, a metric space with a positive definite inner product. Krein spaces generalize the notion of Hilbert spaces to geometrical structures with indefinite inner products. As a result, distances and norms in a Krein space can become negative. The negative spectral of an inner product is usually attributed to observation noise, though such a claim has never been fully studied, nor proved. Seeking how Krein spaces can be constructed from data, we propose a simple and innocent-looking modification to Siamese networks, equipping them with the power to realize indefinite inner-products. This provides a data-driven technique to decide whether the negative spectrum of an inner-product is helpful or not. We empirically show that our Krein embeddings outperform Hilbert space embeddings on recognition tasks.

##### No Training Required: Exploring Random Encoders for Sentence Classification

No tl;dr =[

We explore various methods for computing sentence representations from pre-trained word embeddings without any training, i.e., using nothing but random parameterizations. Our aim is to put sentence embeddings on more solid footing by 1) looking at how much modern sentence embeddings gain over random methods---as it turns out, surprisingly little; and by 2) providing the field with more appropriate baselines going forward---which are, as it turns out, quite strong. We also make important observations about proper experimental protocol for sentence classification evaluation, together with recommendations for future research.

##### I Know the Feeling: Learning to Converse with Empathy

tl;dr We improve existing dialogue systems for responding to people sharing personal stories, incorporating emotion prediction representations and also release a new benchmark and dataset of empathetic dialogues.

Beyond understanding what is being discussed, human communication requires an awareness of what someone is feeling. One challenge for dialogue agents is being able to recognize feelings in the conversation partner and reply accordingly, a key communicative skill that is trivial for humans. Research in this area is made difficult by the paucity of large-scale publicly available datasets both for emotion and relevant dialogues. This work proposes a new task for empathetic dialogue generation and EmpatheticDialogues, a dataset of 25k conversations grounded in emotional contexts to facilitate training and evaluating dialogue systems. Our experiments indicate that models explicitly leveraging emotion predictions from previous utterances are perceived to be more empathetic by human evaluators, while improving on other metrics as well (e.g. perceived relevance of responses, BLEU scores).

##### The Conditional Entropy Bottleneck

tl;dr The Conditional Entropy Bottleneck is an information-theoretic objective function for learning optimal representations.

We present a new family of objective functions, which we term the Conditional Entropy Bottleneck (CEB). We demonstrate the application of CEB to classification tasks. In our experiments, CEB gives: well-calibrated predictions; essentially perfect detection of challenging out-of-distribution examples and powerful whitebox adversarial examples; and natural robustness to the same. Finally, we report that CEB fails to learn a dataset with fixed random labels, providing a possible resolution to the problem of generalization observed in Zhang et al. (2016).

##### Unsupervised Neural Multi-Document Abstractive Summarization

tl;dr We propose an end-to-end neural model for unsupervised multi-document abstractive summarization, applying it to business and product reviews.

Abstractive summarization has been studied using neural sequence transduction methods with datasets of large, paired document-summary examples. However, such datasets are rare and the models trained from them do not generalize to other domains. Recently, some progress has been made in learning sequence-to-sequence mappings with only unpaired examples. In our work, we consider the setting where there are only documents and no summaries provided and propose an end-to-end, neural model architecture to perform unsupervised abstractive summarization. Our proposed model consists of an auto-encoder trained so that the mean of the representations of the input documents decodes to a reasonable summary. We consider variants of the proposed architecture and perform an ablation study to show the importance of specific components. We apply our model to the summarization of business and product reviews and show that the generated summaries are fluent, show relevancy in terms of word-overlap, representative of the average sentiment of the input documents, and are highly abstractive compared to baselines. The code to reproduce results is available at github.com/REDACTED.

##### BNN+: Improved Binary Network Training

tl;dr The paper presents an improved training mechanism for obtaining binary networks with smaller accuracy drop compared that helps close the gap with it's full precision counterpart

Deep neural networks (DNN) are widely used in many applications. However, their deployment on edge devices has been difficult because they are resource hungry. Binary networks (BNN) help to alleviate the prohibitive resource requirements of DNN; where both activations and weights are limited to one bit. We propose an improved binary training method (BNN+), an improvement to the popular BNN training scheme, which helps to reduce accuracy degradation compared to the full-precision counterpart. Our method is based on linear operations that are easily implementable into the binary training framework and we show experimental results on CIFAR-10 obtaining an accuracy of 86.5%, on AlexNet and 91.6% with VGG network. On ImageNet, our method also outperforms the traditional BNN and XNOR-net, by a margin of 4% and 2% respectively.

##### Deep Denoising: Rate-Optimal Recovery of Structured Signals with a Deep Prior

tl;dr By analyzing an algorithms minimizing a non-convex loss, we show that all but a small fraction of noise can be removed from an image using a deep neural network based generative prior.

Deep neural networks provide state-of-the-art performance for image denoising, where the goal is to recover a near noise-free image from a noisy image. The underlying principle is that neural networks trained on large datasets have empirically been shown to be able to generate natural images well from a low-dimensional latent representation of the image. Given such a generator network, or prior, a noisy image can be denoised by finding the closest image in the range of the prior. However, there is little theory to justify this success, let alone to predict the denoising performance as a function of the networks parameters. In this paper we consider the problem of denoising an image from additive Gaussian noise, assuming the image is well described by a deep neural network with ReLu activations functions, mapping a k-dimensional latent space to an n-dimensional image. We state and analyze a simple gradient-descent-like iterative algorithm that minimizes a non-convex loss function, and provably removes a fraction of (1 - O(k/n)) of the noise energy. We also demonstrate in numerical experiments that this denoising performance is, indeed, achieved by generative priors learned from data.

##### RNNs with Private and Shared Representations for Semi-Supervised Sequence Learning

tl;dr This paper focuses upon a traditionally overlooked mechanism -- an architecture with explicitly designed private and shared hidden units designed to mitigate the detrimental influence of the auxiliary unsupervised loss over the main supervised task.

Training recurrent neural networks (RNNs) on long sequences using backpropagation through time (BPTT) remains a fundamental challenge. It has been shown that adding a local unsupervised loss term into the optimization objective makes the training of RNNs on long sequences more effective. While the importance of an unsupervised task can in principle be controlled by a coefficient in the objective function, the gradients with respect to the unsupervised loss term still influence all the hidden state dimensions, which might cause important information about the supervised task to be degraded or erased. Compared to existing semi-supervised sequence learning methods, this paper focuses upon a traditionally overlooked mechanism -- an architecture with explicitly designed private and shared hidden units designed to mitigate the detrimental influence of the auxiliary unsupervised loss over the main supervised task. We achieve this by dividing RNN hidden space into a private space for the supervised task and a shared space for both the supervised and unsupervised tasks. We present extensive experiments with the proposed framework on several long sequence modeling benchmark datasets. Results indicate that the proposed framework can yield performance gains in RNN models where long term dependencies are notoriously challenging to deal with.

##### PAIRWISE AUGMENTED GANS WITH ADVERSARIAL RECONSTRUCTION LOSS

tl;dr We propose a novel autoencoding model with augmented adversarial reconstruction loss. We intoduce new metric for content-based assessment of reconstructions.

We propose a novel autoencoding model called Pairwise Augmented GANs. We train a generator and an encoder jointly and in an adversarial manner. The generator network learns to sample realistic objects. In turn the encoder network at the same time in turn is trained to map the true data distribution to the prior in a latent space. To ensure good reconstructions we introduce an augmented adversarial reconstruction loss. Here we train a discriminator to distinguish two types of pairs: the object with its augmentation and the one with its reconstruction. We show that such adversarial loss compares objects based on the content rather than on the exact match. We experimentally demonstrate that our model generates samples and reconstructions of quality competitive with state-of-the-art on datasets MNIST, CIFAR10, CelebA and achieves good quantitative results on CIFAR10.

##### Backdrop: Stochastic Backpropagation

tl;dr We introduce backdrop, intuitively described as dropout acting on the backpropagation pipeline and find significant improvements in generalization for problems with non-decomposable losses and problems with multi-scale, hierarchical data structure.

We introduce backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multi-scale, hierarchical structure. Backdrop can also be applied to problems with non-decomposable loss functions where standard SGD methods are not well suited. We perform a number of experiments and demonstrate that backdrop leads to significant improvements in generalization.

##### Trellis Networks for Sequence Modeling

tl;dr Trellis networks are a new sequence modeling architecture that bridges recurrent and convolutional models and sets a new state of the art on word- and character-level language modeling.

We present trellis networks, a new architecture for sequence modeling. On the one hand, a trellis network is a temporal convolutional network with special structure, characterized by weight tying across depth and direct injection of the input into deep layers. On the other hand, we show that truncated recurrent networks are equivalent to trellis networks with special sparsity structure in their weight matrices. Thus trellis networks with general weight matrices generalize truncated recurrent networks. We leverage these connections to design high-performing trellis networks that absorb structural and algorithmic elements from both recurrent and convolutional models. Experiments demonstrate that trellis networks outperform the current state of the art on a variety of challenging benchmarks, including word-level language modeling on Penn Treebank and WikiText-103, character-level language modeling on Penn Treebank, and stress tests designed to evaluate long-term memory retention.

##### Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization

tl;dr This paper presents a hierarchical reinforcement learning framework based on deterministic option policies and mutual information maximization.

Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL in not a trivial task. In this paper, we propose an HRL method that learns a latent variable of a hierarchical policy using mutual information maximization. Our approach can be interpreted as a way to learn a discrete and latent representation of the state-action space in an unsupervised manner. To estimate the density of states and actions induced by the unknown optimal policy, we introduce advantage-weighted importance sampling. In our HRL method, the gating policy learns to select option policies based on an option-value function, and these option policies are optimized based on the deterministic policy gradient method. This framework is derived by leveraging the analogy between a monolithic policy in standard RL and a hierarchical policy in HRL by using a deterministic option policy. Experimental results indicate that our HRL approach can learn a diversity of options and that it can enhance the performance of RL in continuous control tasks.

##### Locally Linear Unsupervised Feature Selection

tl;dr Unsupervised feature selection through capturing the local linear structure of the data

The paper, interested in unsupervised feature selection, aims to retain the features best accounting for the local patterns in the data. The proposed approach, called Locally Linear Unsupervised Feature Selection, relies on a dimensionality reduction method to characterize such patterns; each feature is thereafter assessed according to its compliance w.r.t. the local patterns, taking inspiration from Locally Linear Embedding (Roweis and Saul, 2000). The experimental validation of the approach on the scikit-feature benchmark suite demonstrates its effectiveness compared to the state of the art.

##### An Empirical Study of Example Forgetting during Deep Neural Network Learning

tl;dr We show that catastrophic forgetting occurs within what is considered to be a single task and find that examples that are not prone to forgetting can be removed from the training set without loss of generalization.

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a forgetting event'' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

##### SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY

tl;dr We present a new approach, SNIP, that is simple, versatile and interpretable; it prunes irrelevant connections for a given task at single-shot prior to training and is applicable to a variety of neural network models without modifications.

Pruning large neural networks while maintaining the performance is often highly desirable due to the reduced space and time complexity. In existing methods, pruning is incorporated within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization. Specifically, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task even before training. This eliminates the need for both pretraining as well as the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on image classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.

##### Exponentially Decaying Flows for Optimization in Deep Learning

tl;dr Introduction of a new optimization method and its application to deep learning.

The field of deep learning has been craving for an optimization method that shows outstanding property for both optimization and generalization. We propose a method for mathematical optimization based on flows along geodesics, that is, the shortest paths between two points, with respect to the Riemannian metric induced by a non-linear function. In our method, the flows refer to Exponentially Decaying Flows (EDF), as they can be designed to converge on the local solutions exponentially. In this paper, we conduct experiments to show its high performance on optimization benchmarks (i.e., convergence properties), as well as its potential for producing good machine learning benchmarks (i.e., generalization properties).

##### AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference

tl;dr Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation.

Wide adoption of complex RNN based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge distillation based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.

##### KnockoffGAN: Generating Knockoffs for Feature Selection using Generative Adversarial Networks

No tl;dr =[

Feature selection is a pervasive problem. The discovery of relevant features can be as important for performing a particular task (such as to avoid overfitting in prediction) as it can be for understanding the underlying processes governing the true label (such as discovering relevant genetic factors for a disease). Machine learning driven feature selection can enable discovery from large, high-dimensional, non-linear observational datasets by creating a subset of features for experts to focus on. In order to use expert time most efficiently, we need a principled methodology capable of controlling the False Discovery Rate. In this work, we build on the promising Knockoff framework by developing a flexible knockoff generation model. We adapt the Generative Adversarial Networks framework to allow us to generate knockoffs with no assumptions on the feature distribution. Our model consists of 4 networks, a generator, a discriminator, a stability network and a power network. We demonstrate the capability of our model to perform feature selection, showing that it performs as well as the originally proposed knockoff generation model in the Gaussian setting and that it outperforms the original model in non-Gaussian settings, including on a real-world dataset.

##### Improving Generative Adversarial Imitation Learning with Non-expert Demonstrations

tl;dr We improve GAIL by learning discriminators using multiclass classification with non-expert regarded as an extra class.

Imitation learning aims to learn an optimal policy from expert demonstrations and its recent combination with deep learning has shown impressive performance. However, collecting a large number of expert demonstrations for deep learning is time-consuming and requires much expert effort. In this paper, we propose a method to improve generative adversarial imitation learning by using additional information from non-expert demonstrations which are easier to obtain. The key idea of our method is to perform multiclass classification to learn discriminator functions where non-expert demonstrations are regarded as being drawn from an extra class. Experiments in continuous control tasks demonstrate that our method learns optimal policies faster and has more stable performance than the generative adversarial imitation learning baseline.

##### Sentence Encoding with Tree-Constrained Relation Networks

No tl;dr =[

The meaning of a sentence is a function of the relations that hold between its words. We instantiate this relational view of semantics in a series of neural models based on variants of relation networks (RNs) which represent a set of objects (for us, words forming a sentence) in terms of representations of pairs of objects. We propose two extensions to the basic RN model for natural language. First, building on the intuition that not all word pairs are equally informative about the meaning of a sentence, we use constraints based on both supervised and unsupervised dependency syntax to control which relations influence the representation. Second, since higher-order relations are poorly captured by a sum of pairwise relations, we use a recurrent extension of RNs to propagate information so as to form representations of higher order relations. Experiments on sentence classification, sentence pair classification, and machine translation reveal that, while basic RNs are only modestly effective for sentence representation, recurrent RNs with latent syntax are a reliably powerful representational device.

##### Auxiliary Variational MCMC

No tl;dr =[

We introduce Auxiliary Variational MCMC, a novel framework for learning MCMC kernels that combines recent advances in variational inference with insights drawn from traditional auxiliary variable MCMC methods such as Hamiltonian Monte Carlo. Our framework exploits low dimensional structure in the target distribution in order to learn a more efficient MCMC sampler. The resulting sampler is able to suppress random walk behaviour and mix between modes efficiently, without the need to compute gradients of the target distribution. We test our sampler on a number of challenging distributions, where the underlying structure is known, and on the task of posterior sampling in Bayesian logistic regression. Code to reproduce all experiments is available at https://github.com/AVMCMC/AuxiliaryVariationalMCMC.

##### Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability

tl;dr We develop methods to train deep neural models that are both robust to adversarial perturbations and whose robustness is significantly easier to verify.

We explore the concept of co-design in the context of neural network verification. Specifically, we aim to train deep neural networks that not only are robust to adversarial perturbations but also whose robustness can be verified more easily. To this end, we identify two properties of network models - weight sparsity and so-called ReLU stability - that turn out to significantly impact the complexity of the corresponding verification task. We demonstrate that improving weight sparsity alone already enables us to turn computationally intractable verification problems into tractable ones. Then, improving ReLU stability leads to an additional 4-13x speedup in verification times. An important feature of our methodology is its "universality," in the sense that it can be used with a broad range of training procedures and verification approaches.

##### Learning to Design RNA

tl;dr We learn to solve the RNA Design problem with reinforcement learning.

Designing RNA molecules has garnered recent interest in medicine, synthetic biology, biotechnology and bioinformatics since many functional RNA molecules were shown to be involved in regulatory processes for transcription, epigenetics and translation. Since an RNA's function depends on its structural properties, the RNA Design problem is to find an RNA molecule that folds into a specified secondary structure. Here, we propose a new algorithm for the RNA Design problem, dubbed LEARNA. LEARNA uses deep reinforcement learning to train a policy network to sequentially design an entire RNA sequence given a specified secondary target structure. By meta-learning across thousands of different RNA target structures, our extension Meta-LEARNA constructs an RNA design policy that can be applied out of the box to solve novel RNA target structures. Comprehensive empirical results on two widely-used RNA secondary structure design benchmarks, as well as a third one that we introduce, show that our approach achieves new state-of-the-art performance on all benchmarks while also being up to 450x faster than any previous approach. We achieve this by a joint optimization of the policy network's architecture, the training hyperparameters, and the state space representation. In an ablation study, we analyze our method's different components.

##### Exploration in Policy Mirror Descent

No tl;dr =[

Policy optimization is a core problem in reinforcement learning. In this paper, we investigate Reversed Entropy Policy Mirror Descent (REPMD), an on-line policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective. REPMD conducts a form of maximum entropy exploration within a mirror descent framework, but uses an alternative policy update with a reversed KL projection. This modified formulation bypasses undesirable mode seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. An experimental evaluation demonstrates that this approach significantly improves practical exploration and surpasses the empirical performance of state-of-the art policy optimization methods in a set of benchmark tasks.

##### Learning State Representations in Complex Systems with Multimodal Data

tl;dr Multimodal synthetic dataset, collected from X-plane flight simulator, used for learning state representation and unified evaluation framework for representation learning

Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.

##### Learning shared manifold representation of images and attributes for generalized zero-shot learning

No tl;dr =[

The most prior methods of zero-shot learning have realized predicting labels of unseen images by learning a mapping from images to pre-defined class-attributes. However, recent studies show that these approaches severely suffers from the issue of biased prediction under the more realistic generalized zero-shot learning (GZSL) scenarios, i.e., their classifier tends to predict all the examples from both seen and unseen class as one of the seen classes. The cause of this problem is that we can not obtain training data of the unseen class and that the representation of attributes is poor. To solve this, we propose a concept to learn a mapping that embeds both images and attributes to a space that is robust to such representations and generalized even for unseen data, which we refer to shared manifold learning. Furthermore, we propose modality invariant variational autoencoders, which can perform shared manifold learning by training variational autoencoders with both images and attributes as inputs. The empirical validation of well-known datasets in GZSL shows that our method achieves the significantly superior performances to the existing relation-based works.

##### NEURAL MALWARE CONTROL WITH DEEP REINFORCEMENT LEARNING

tl;dr A deep reinforcement learning-based system is proposed to control when to halt the emulation of an unknown file and to improve the detection rate of a deep malware classifier.

Antimalware products are a key component in detecting malware attacks, and their engines typically execute unknown programs in a sandbox prior to running them on the native operating system. Files cannot be scanned indefinitely so the engine employs heuristics to determine when to halt execution. Previous research has investigated analyzing the sequence of system calls generated during this emulation process to predict if an unknown file is malicious, but these models require the emulation to be stopped after executing a fixed number of events from the beginning of the file. Also, these classifiers are not accurate enough to halt emulation in the middle of the file on their own. In this paper, we propose a novel algorithm which overcomes this limitation and learns the best time to halt the file's execution based on deep reinforcement learning (DRL). Because the new DRL-based system continues to emulate the unknown file until it can make a confident decision to stop, it prevents attackers from avoiding detection by initiating malicious activity after a fixed number of system calls. Results show that the proposed malware execution control model automatically halts emulation for 91.3\% of the files earlier than heuristics employed by the engine. Furthermore, classifying the files at that time improves the true positive rate by 61.5%, at a false positive rate of 1%, compared to a baseline classifier.