# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: graph classification, second-order stationary point, long term dependencies, interpretable deep learning, theory of mind ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "Random selection" has 100 results

##### code2seq: Generating Sequences from Structured Representations of Code

tl;dr We leverage the syntactic structure of source code to generate natural language sequences.

The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present code2seq: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding. We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as general state-of-the-art NMT models.

##### Modeling Parts, Structure, and System Dynamics via Predictive Learning

tl;dr Learning object parts, hierarchical structure, and dynamics by watching how they move

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos in a self-supervised manner. Our Parts, Structure, and Dynamics (PSD) model learns to first recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.

##### Bayesian Deep Learning via Stochastic Gradient MCMC with a Stochastic Approximation Adaptation

tl;dr a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables

We propose a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables. Inspired by dropout, a popular tool for regularization and model ensemble, we assign sparse priors to the weights in deep neural networks (DNN) in order to achieve automatic dropout'' and avoid over-fitting. By alternatively sampling from posterior distribution through stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and optimizing latent variables via stochastic approximation (SA), the trajectory of the target weights is proved to converge to the true posterior distribution conditioned on optimal latent variables. This ensures a stronger regularization on the over-fitted parameter space and more accurate uncertainty quantification on the decisive variables. Simulations from large-p-small-n regressions showcase the robustness of this method when applied to models with latent variables. Additionally, its application on the convolutional neural networks (CNN) leads to state-of-the-art performance on MNIST and Fashion MNIST datasets and improved resistance to adversarial attacks.

##### Multi-Agent Dual Learning

No tl;dr =[

Dual learning has attracted much attention in machine learning, computer vision and natural language processing communities. The core idea of dual learning is to leverage the duality between the primal task (mapping from domain X to domain Y) and dual task (mapping from domain Y to X) to boost the performances of both tasks. Existing dual learning framework forms a system with two agents (one primal model and one dual model) to utilize such duality. In this paper, we extend this framework by introducing more primal and dual models, and propose the multi-agent dual learning framework. Experiments on neural machine translation and image translation tasks demonstrate the effectiveness of the new framework. In particular, our framework achieves state-of-the-art performance on IWSLT 2014 German-to-English translation with a 35.44 BLEU score and achieves a 30.67 BLEU score on WMT 2014 English-to-German translation, with over 2.2 BLEU improvement over the strong Transformer baseline.

No tl;dr =[

We address the problem of recovering an underlying signal from lossy and inaccurate measurements in an unsupervised fashion. Typically, we consider situations where there is no background knowledge on the structure of the unknown signal and where we do not have access to signal-measurement pairs, nor even unpaired signal data. We introduce a general framework, where a neural network is trained to recover plausible signals from the measurements in the data, by introducing an adversarial and a reconstruction loss. We evaluate our framework on different noise instances, and show that our approach yields comparable results to model variants trained with stronger supervision.

##### TarMAC: Targeted Multi-Agent Communication

tl;dr Targeted communication in multi-agent cooperative reinforcement learning

We explore the collaborative multi-agent setting where a team of deep reinforcement learning agents attempt to solve a shared task in partially observable environments. In this scenario, learning an effective communication protocol is key. We propose a communication protocol that allows for targeted communication, where agents learn \emph{what} messages to send and \emph{who} to send them to. Additionally, we introduce a multi-stage communication approach where the agents co-ordinate via several rounds of communication before taking an action in the environment. We evaluate our approach on several cooperative multi-agent tasks, of varying difficulties with varying number of agents, in a variety of environments ranging from 2D grid layouts of shapes and simulated traffic junctions to complex 3D indoor environments. We demonstrate the benefits of targeted as well as multi-stage communication. Moreover, we show that the targeted communication strategies learned by the agents are quite interpretable and intuitive.

tl;dr We proposal a simple surrogate for gradient descent to improve training of deep neural nets and other optimization problems.

We propose a class of very simple modifications of gradient descent and stochastic gradient descent. We show that when applied to a large variety of machine learning problems, ranging from softmax regression to deep neural nets, the proposed surrogates can dramatically reduce the variance and improve the generalization accuracy. The methods only involve multiplying the usual (stochastic) gradient by the inverse of a positive definitive matrix coming from the discrete Laplacian or high order generalizations. The theory of Hamilton-Jacobi partial differential equations demonstrates that the new algorithm is almost the same as doing gradient descent on a new function which (i) has the same global minima as the original function and (ii) is “more convex”. We show that optimization algorithms with these surrogates converge uniformly in the discrete Sobolev H^p_\sigma sense and reduce the optimality gap for convex optimization problems. We implement our algorithm into both PyTorch and Tensorflow platforms which only involves changing of a few lines of code. The code will be available on Github.

##### Whitening and Coloring transform for GANs

No tl;dr =[

Batch Normalization (BN) is a common technique used to speed-up and stabilize training. On the other hand, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks (cGANs) for representing class-specific information using conditional Batch Normalization (cBN). In this paper we propose to generalize both BN and cBN using a Whitening and Coloring based batch normalization. We show that our conditional Coloring can represent categorical conditioning information which largely helps the cGAN qualitative results. Moreover, we show that full-feature whitening is important in a general GAN scenario in which the training process is known to be highly unstable. We test our approach on different datasets and using different GAN networks and training protocols, showing a consistent improvement in all the tested frameworks. Our CIFAR-10 supervised results are higher than all previous works on this dataset.

##### Low Latency Privacy Preserving Inference

tl;dr This work presents methods, combining neural-networks and encryptions, to make predictions while preserving the privacy of the data owner with low latency

Using machine learning in domains such as medicine and finance requires tools that can preserve privacy and confidentiality. In this work, we focus on private inference with neural networks. Following the work of Dowlin et al. (2016), we use Homomorphic Encryption (HE) to allow neural networks to be applied to encrypted data and therefore make predictions while preserving privacy. We present 90x improvement in latency and 7x improvement in throughput compared to prior attempts. The improved performance is achieved via a modern implementation of the encryption scheme and a collection of methods to better represent the data during the computation. We also apply the method of transfer learning to provide private inference services using deep networks. We demonstrate the efficacy of our methods on several computer vision tasks.

##### Look Ma, No GANs! Image Transformation with ModifAE

tl;dr ModifAE is a standalone neural network, trained exclusively on an autoencoding task, that implicitly learns to make image modifications (without GANs).

Existing methods of image to image translation require multiple steps in the training or modification process, and suffer from either an inability to generalize, or long training times. These methods also focus on binary trait modification, ignoring continuous traits. To address these problems, we propose ModifAE: a novel standalone neural network, trained exclusively on an autoencoding task, that implicitly learns to make continuous trait image modifications. As a standalone image modification network, ModifAE requires fewer parameters and less time to train than existing models. We empirically show that ModifAE produces significantly more convincing and more consistent continuous face trait modifications than the previous state-of-the-art model.

##### Initialized Equilibrium Propagation for Backprop-Free Training

tl;dr We train a feedforward network without backprop by using an energy-based model to provide local targets

Deep neural networks are almost universally trained with reverse-mode automatic differentiation (a.k.a. backpropagation). Biological networks, on the other hand, appear to lack any mechanism for sending gradients back to their input neurons, and thus cannot be learning in this way. In response to this, Scellier & Bengio (2017) proposed Equilibrium Propagation - a method for gradient-based train- ing of neural networks which uses only local learning rules and, crucially, does not rely on neurons having a mechanism for back-propagating an error gradient. Equilibrium propagation, however, has a major practical limitation: inference involves doing an iterative optimization of neural activations to find a fixed-point, and the number of steps required to closely approximate this fixed point scales poorly with the depth of the network. In response to this problem, we propose Initialized Equilibrium Propagation, which trains a feedforward network to initialize the iterative inference procedure for Equilibrium propagation. This feed-forward network learns to approximate the state of the fixed-point using a local learning rule. After training, we can simply use this initializing network for inference, resulting in a learned feedforward network. Our experiments show that this network appears to work as well or better than the original version of Equilibrium propagation. This shows how we might go about training deep networks without using backpropagation.

##### StrokeNet: A Neural Painting Environment

tl;dr StrokeNet is a novel architecture where the agent is trained to draw by strokes on a differentiable simulation of the environment, which could effectively exploit the power of back-propagation.

We've seen tremendous success of image generating models these years. Generating an image through a neural network is like dreaming'', which is fundamentally different from how humans create artwork using brushes. To imitate human drawing, interactions between the agent and the environment is required to allow trials from the agent. However, the environment is usually non-differentiable, leading to slow convergence and massive computation. In this paper we try to address the discrete nature of software environment with an intermediate, differentiable simulation, which can be interpreted as a neural perception of the surroundings of the upper agent. We present StrokeNet, a novel model where the agent is trained upon a well-crafted neural approximation of the painting environment. With this approach, our agent was able to learn to write characters such as MNIST digits very quickly in an unsupervised manner. Our primary contribution is the neural simulation of real-world environment. Furthermore, the agent trained with our approach can be directly transferred to real world with learned skills. To the best of our knowledge, StrokeNet is the first model to apply differentiable simulation to real-world learning problems and standard datasets.

##### Classification from Positive, Unlabeled and Biased Negative Data

tl;dr This paper studied the PUbN classification problem, where we incorporate biased negative (bN) data, i.e., negative data that is not fully representative of the true underlying negative distribution, into positive-unlabeled (PU) learning.

Positive-unlabeled (PU) learning addresses the problem of learning a binary classifier from positive (P) and unlabeled (U) data. It is often applied to situations where negative (N) data are difficult to be fully labeled. However, collecting a non-representative N set that contains only a small portion of all possible N data can be much easier in many practical situations. This paper studies a novel classification framework which incorporates such biased N (bN) data in PU learning. The fact that the training N data are biased also makes our work very different from those of standard semi-supervised learning. We provide an empirical risk minimization-based method to address this PUbN classification problem. Our approach can be regarded as a variant of traditional example-reweighting algorithms, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm in not only PUbN learning scenarios but also ordinary PU leaning scenarios on several benchmark datasets.

##### DynCNN: An Effective Dynamic Architecture on Convolutional Neural Network for Surveillance Videos

tl;dr An optimizing architecture on CNN for surveillance videos with 75.7% reduction on FLOPs and 2.2 times improvement on FPS

The large-scale surveillance video analysis becomes important as the development of intelligent city. The heavy computation resources neccessary for state-of-the-art deep learning model makes the real-time processing hard to be implemented. This paper exploits the characteristic of high scene similarity generally existing in surveillance videos and proposes dynamic convolution reusing the previous feature map to reduce the computation amount. We tested the proposed method on 45 surveillance videos with various scenes. The experimental results show that dynamic convolution can reduce up to 75.7% of FLOPs while preserving the precision within 0.7% mAP. Furthermore, the dynamic convolution can enhance the processing time up to 2.2 times.

##### W2GAN: RECOVERING AN OPTIMAL TRANSPORTMAP WITH A GAN

tl;dr "A GAN-style model to recover a solution of the Monge Problem"

Understanding and improving Generative Adversarial Networks (GAN) using notions from Optimal Transportation (OT) theory has been a successful area of study, originally established by the introduction of the Wasserstein GAN (WGAN). An increasing number of GANs incorporate OT for improving their discriminators, but that is so far the sole way for the two domains to cross-fertilize. We consolidate the bridge between GANs and OT with one model: W2GAN, where the discriminator approximates the second Wasserstein distance. This model exhibits a twofold connection: the discriminator implicitly computes an optimal map and the generator follows an optimal transport map during training. Perhaps surprisingly, we also provide empirical evidence that other GANs also approximately following the Optimal Transport.

##### Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer

tl;dr An image to image translation method which adds to one image the content of another thereby creating a new image.

We study the problem of learning to map, in an unsupervised way, between domains A and B, such that the samples b in B contain all the information that exists in samples $\va\in A$ and some additional information. For example, ignoring occlusions, B can be people with glasses, A people without, and the glasses, would be the added information. When mapping a sample a from the first domain to the other domain, the missing information is replicated from an independent reference sample b in B. Thus, in the above example, we can create, for every person without glasses a version with the glasses observed in any face image. Our solution employs a single two-pathway encoder and a single decoder for both domains. The common part of the two domains and the separate part are encoded as two vectors, and the separate part is fixed at zero for domain A. The loss terms are minimal and involve reconstruction losses for the two domains and a domain confusion term. Our analysis shows that under mild assumptions, this architecture, which is much simpler than the literature guided-translation methods, is enough to ensure disentanglement between the two domains. We present convincing results in a few visual domains, such as no-glasses to glasses, adding facial hair based on a reference image, etc.

##### Isolating effects of age with fair representation learning when assessing dementia

tl;dr Show that age confounds cognitive impairment detection + solve with fair representation learning + propose metrics and models.

One of the most prevalent symptoms among the elderly population, dementia, can be detected by classifiers trained on linguistic features extracted from narrative transcripts. However, these linguistic features are impacted in a similar but different fashion by the normal aging process. Aging is therefore a confounding factor, whose effects have been hard for machine learning classifiers to isolate. In this paper, we show that deep neural network (DNN) classifiers can infer ages from linguistic features, which is an entanglement that could lead to unfairness across age groups. We show this problem is caused by undesired activations of v-structures in causality diagrams, and it could be addressed with fair representation learning. We build neural network classifiers that learn low-dimensional representations reflecting the impacts of dementia yet discarding the effects of age. To evaluate these classifiers, we specify a model-agnostic score $\Delta_{eo}^{(N)}$ measuring how classifier results are disentangled from age. Our best models outperform baseline neural network classifiers in disentanglement, while compromising accuracy by as little as 2.56\% and 2.25\% on DementiaBank and the Famous People dataset respectively.

##### ALISTA: Analytic Weights Are As Good As Learned Weights in LISTA

No tl;dr =[

Deep neural networks based on unfolding an iterative algorithm, for example, LISTA (learned iterative shrinkage thresholding algorithm), have been an empirical success for sparse signal recovery. The weights of these neural networks are currently determined by data-driven “black-box” training. In this work, we propose Analytic LISTA (ALISTA), where the weight matrix in LISTA is computed as the solution to a data-free optimization problem, leaving only the stepsize and threshold parameters to data-driven learning. This signiﬁcantly simpliﬁes the training. Speciﬁcally, the data-free optimization problem is based on coherence minimization. We show our ALISTA retains the optimal linear convergence proved in (Chen et al., 2018) and has a performance comparable to LISTA. Furthermore, we extend ALISTA to convolutional linear operators, again determined in a data-free manner. We also propose a feed-forward framework that combines the data-free optimization and ALISTA networks from end to end, one that can be jointly trained to gain robustness to small perturbations in the encoding model.

##### COCO-GAN: Conditional Coordinate Generative Adversarial Network

No tl;dr =[

Recent advancements on Generative Adversarial Network (GAN) have inspired a wide range of works that generate synthetic images. However, the current processes have to generate an entire image at once, and therefore resolutions are limited by memory or computational constraints. In this work, we propose COnditional COordinate GAN (COCO-GAN), which generates a specific patch of an image conditioned on a spatial position rather than the entire image at a time. The generated patches are later combined together to form a globally coherent full-image. With this process, we show that the generated image can achieve competitive quality to state-of-the-arts and the generated patches are locally smooth between consecutive neighbors. One direct implication of the COCO-GAN is that it can be applied onto any coordinate systems including the cylindrical systems which makes it feasible for generating panorama images. The fact that the patch generation process is independent to each other inspires a wide range of new applications: firstly, "Patch-Inspired Image Generation" enables us to generate the entire image based on a single patch. Secondly, "Partial-Scene Generation" allows us to generate images within a customized target region. Finally, thanks to COCO-GAN's patch generation and massive parallelism, which enables combining patches for generating a full-image with higher resolution than state-of-the-arts.

##### Learning Mixed-Curvature Representations in Product Spaces

tl;dr Product manifold embedding spaces with heterogenous curvature yield improved representations compared to traditional embedding spaces for a variety of structures.

The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse space for embeddings; recently hyperbolic and spherical spaces are gaining popularity due to their ability to better embed new types of structured data---such as hierarchical data---but most data is not structured so uniformly. We address this problem by proposing embedding into a product manifold combining multiple copies of spherical, hyperbolic, and Euclidean spaces, providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional curvature of graph data and directly determine the signature---the number of component spaces and their dimensions---of the product manifold. Empirically, we jointly learn the curvature and the embedding in the product space via Riemannian optimization. We discuss how to define and compute intrinsic quantities such as means---a challenging notion for product manifolds---and provably learnable optimization functions. On a range of datasets and reconstruction tasks, our product space embeddings outperform single Euclidean or hyperbolic spaces used in previous works, reducing distortion by 32.55% on a Facebook social network dataset. We learn word embeddings and find that a product of hyperbolic spaces in 50 dimensions consistently improves on baseline Euclidean and hyperbolic embeddings by 2.6 points in Spearman rank correlation on similarity tasks and 3.4 points on analogy accuracy.

##### Synthnet: Learning synthesizers end-to-end

tl;dr A convolutional autoregressive generative model that generates high fidelity audio, behchmarked on music

Learning synthesizers and generating music in the raw audio domain is a challenging task. We investigate the learned representations of convolutional autoregressive generative models. Consequently, we show that mappings between musical notes and the harmonic style (instrument timbre) can be learned based on the raw audio music recording and the musical score (in binary piano roll format). Our proposed architecture, SynthNet uses minimal training data (9 minutes), is substantially better in quality and converges 6 times faster than the baselines. The quality of the generated waveforms (generation accuracy) is sufficiently high that they are almost identical to the ground truth. Therefore, we are able to directly measure generation error during training, based on the RMSE of the Constant-Q transform. Mean opinion scores are also provided. We validate our work using 7 distinct harmonic styles and also provide visualizations and links to all generated audio.

##### Downsampling leads to Image Memorization in Convolutional Autoencoders

tl;dr We identify downsampling as a mechansim for memorization in convolutional autoencoders.

Memorization of data in deep neural networks has become a subject of significant research interest. In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution. To analyze this mechanism in a simpler setting, we train linear convolutional autoencoders and show that linear combinations of training data are stored as eigenvectors in the linear operator corresponding to the network when downsampling is used. On the other hand, networks without downsampling do not memorize training data. We provide further evidence that the same effect happens in nonlinear networks. Moreover, downsampling in nonlinear networks causes the model to not only memorize just linear combinations of images, but individual training images. Since convolutional autoencoder components are building blocks of deep convolutional networks, we envision that our findings will shed light on the important phenomenon of memorization in over-parameterized deep networks.

tl;dr We propose a gradient-based method to transfer knowledge from multiple sources across different domains and tasks.

##### Analysis of Memory Organization for Dynamic Neural Networks

No tl;dr =[

An increasing number of neural memory networks have been developed, leading to the need for a systematic approach to analyze and compare their underlying memory structures. Thus, in this paper, we first create a framework for memory organization and then compare four popular dynamic models: vanilla recurrent neural network, long short term memory, neural stack and neural RAM. This analysis helps to open the dynamic neural network' black box from the memory usage prospective. Accordingly, a taxonomy for these networks and their variants is proposed and proved using a unifying architecture. With the taxonomy, both network architectures and learning tasks are classified into four classes. And a one-to-one mapping is built between them to help practitioners select the appropriate architecture. To exemplify each task type, four synthetic tasks with different memory requirements are developed. Moreover, we use two natural language processing applications to apply the methodology in a realistic setting.

##### The Expressive Power of Gated Recurrent Units as a Continuous Dynamical System

tl;dr We classify the the dynamical features one and two GRU cells can and cannot capture in continuous time, and verify our findings experimentally with k-step time series prediction.

Gated recurrent units (GRUs) were inspired by the common gated recurrent unit, long short-term memory (LSTM), as a means of capturing temporal structure with less complex memory unit architecture. Despite their incredible success in tasks such as natural and artificial language processing, speech, video, and polyphonic music, very little is understood about the specific dynamic features representable in a GRU network. As a result, it is difficult to know a priori how successful a GRU-RNN will perform on a given data set. In this paper, we develop a new theoretical framework to analyze one and two dimensional GRUs as a continuous dynamical system, and classify the dynamic features obtainable with such system. In addition, we show that a two dimensional GRU cannot mimic the dynamics of a ring attractor, or more generally, any line attractor without near zero constant curvature in phase space. These results were then experimentally verified by means of time series prediction.

##### An Empirical Study of Example Forgetting during Deep Neural Network Learning

tl;dr We show that catastrophic forgetting occurs within what is considered to be a single task and find that examples that are not prone to forgetting can be removed from the training set without loss of generalization.

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a forgetting event'' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

##### Alignment Based Mathching Networks for One-Shot Classification and Open-Set Recognition

No tl;dr =[

Deep learning for object classification relies heavily on convolutional models. While effective, CNNs are rarely interpretable after the fact. An attention mechanism can be used to highlight the area of the image that the model focuses on thus offering a narrow view into the mechanism of classification. We expand on this idea by forcing the method to explicitly align images to be classified to reference images representing the classes. The mechanism of alignment is learned and therefore does not require that the reference objects are anything like those being classified. Beyond explanation, our exemplar based cross-alignment method enables classification with only a single example per category (one-shot). Our model cuts the 5-way, 1-shot error rate in Omniglot from 2.1\% to 1.4\% and in MiniImageNet from 53.5\% to 46.5\% while simultaneously providing point-wise alignment information providing some understanding on what the network is capturing. This method of alignment also enables the recognition of an unsupported class (open-set) in the one-shot setting while maintaining an F1-score of above 0.5 for Omniglot even with 19 other distracting classes while baselines completely fail to separate the open-set class in the one-shot setting.

##### Looking inside the black box: assessing the modular structure of deep generative models with counterfactuals

tl;dr We investigate the modularity of deep generative models.

Deep generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) are important tools to capture and investigate the properties of complex empirical data. However, the complexity of their inner elements makes their functionment challenging to assess and modify. In this respect, these architectures behave as black box models. In order to better understand the function of such networks, we analyze their modularity based on the counterfactual manipulation of their internal variables. Our experiments on the generation of human faces with VAEs and GANs support that modularity between activation maps distributed over channels of generator architectures is achieved to some degree, can be used to better understand how these systems operate and edit the content of generated images.

No tl;dr =[

This paper explores a simple regularizer for reinforcement learning by proposing Generative Adversarial Self-Imitation Learning (GASIL), which encourages the agent to imitate past good trajectories via generative adversarial imitation learning framework. Instead of directly maximizing rewards, GASIL focuses on reproducing past good trajectories, which can potentially make long-term credit assignment easier when rewards are sparse and delayed. GASIL can be easily combined with any policy gradient objective by using GASIL as a learned reward shaping function. Our experimental results show that GASIL improves the performance of proximal policy optimization on 2D Point Mass and MuJoCo environments with delayed reward and stochastic dynamics.

##### Pearl: Prototype lEArning via Rule Lists

tl;dr a method combining rule list learning and prototype learning

Deep neural networks have demonstrated promising classification performance on many healthcare applications. However, the interpretability of those models are often lacking. On the other hand, classical interpretable models such as rule lists or decision trees do not lead to the same level of accuracy as deep neural networks. Despite their interpretable structures, the resulting rules are often too complex to be interpretable (due to the potentially large depth of rule lists). In this work, we present PEARL, short for Prototype lEArning via Rule Lists, which iteratively use rule lists to guide a neural network to learn representative data prototypes. The resulting prototype neural network provides accurate prediction, and the prediction can be easily explained by prototype and its guiding rule lists. Thanks to the prediction power of neural networks, the rule lists defining prototypes are more concise and hence provide better interpretability. On two real-world electronic healthcare records (EHR) datasets, PEARL consistently outperforms all baselines, achieving performance improvement over conventional rule learning by up to 28% and over prototype learning by up to 3%. Experimental results also show the resulting interpretation of PEARL is simpler than the standard rule learning.

##### Differentiable Greedy Networks

tl;dr We propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization.

Optimal selection of a subset of items from a given set is a hard problem that requires combinatorial optimization. In this paper, we propose a subset selection algorithm that is trainable with gradient based methods yet achieves near optimal performance via submodular optimization. We focus on the task of identifying a relevant set of sentences for claim verification in the context of the FEVER task. Conventional methods for this task look at sentences on their individual merit and thus do not optimize the informativeness of sentences as a set. We show that our proposed method which builds on the idea of unfolding a greedy algorithm into a computational graph allows both interpretability and gradient based training. The proposed differentiable greedy network (DGN) outperforms discrete optimization algorithms as well as other baseline methods in terms of precision and recall.

##### Successor Options : An Option Discovery Algorithm for Reinforcement Learning

tl;dr An option discovery method for Reinforcement Learning using the Successor Representation

Hierarchical Reinforcement Learning is a popular method to exploit temporal abstractions in order to tackle the curse of dimensionality. The options framework is one such hierarchical framework that models the notion of skills or options. However, learning a collection of task-agnostic transferable skills is a challenging task. Option discovery typically entails using heuristics, the majority of which revolve around discovering bottleneck states. In this work, we adopt a method complementary to the idea of discovering bottlenecks. Instead, we attempt to discover landmark" sub-goals which are prototypical states of well connected regions. These sub-goals are points from which densely connected set of states are easily accessible. We propose a new model called Successor options that leverages Successor Representations to achieve the same. We also design a novel pseudo-reward for learning the intra-option policies. Additionally, we describe an Incremental Successor options model that iteratively builds options and explores in environments where exploration through primitive actions is inadequate to form the Successor Representations. Finally, we demonstrate the efficacy of our approach on a collection of grid worlds and on complex high dimensional environments like Deepmind-Lab.

##### Nested Dithered Quantization for Communication Reduction in Distributed Training

tl;dr The paper proposes and analyzes two quantization schemes for communicating Stochastic Gradients in distributed learning which would reduce communication costs compare to the state of the art while maintaining the same accuracy.

In distributed training, the communication cost due to the transmission of gradients or the parameters of the deep model is a major bottleneck in scaling up the number of processing nodes. To address this issue, we propose dithered quantization for the transmission of the stochastic gradients and show that training with Dithered Quantized Stochastic Gradients (DQSG) is similar to the training with unquantized SGs perturbed by an independent bounded uniform noise, in contrast to the other quantization methods where the perturbation depends on the gradients and hence, complicating the convergence analysis. We study the convergence of training algorithms using DQSG and the trade off between the number of quantization levels and the training time. Next, we observe that there is a correlation among the SGs computed by workers that can be utilized to further reduce the communication overhead without any performance loss. Hence, we develop a simple yet effective quantization scheme, nested dithered quantized SG (NDQSG), that can reduce the communication significantly without requiring the workers communicating extra information to each other. We prove that although NDQSG requires significantly less bits, it can achieve the same quantization variance bound as DQSG. Our simulation results confirm the effectiveness of training using DQSG and NDQSG in reducing the communication bits or the convergence time compared to the existing methods without sacrificing the accuracy of the trained model.

##### Learning To Simulate

tl;dr We propose an algorithm that automatically adjusts parameters of a simulation engine to generate training data for a neural network such that validation accuracy is maximized.

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.

##### Meta-Learning Probabilistic Inference for Prediction

tl;dr Novel framework for meta-learning that unifies and extends a broad class of existing few-shot learning methods. Achieves strong performance on few-shot learning benchmarks without requiring iterative test-time inference.

This paper introduces a new framework for data efficient and versatile learning. Specifically: 1) We develop ML-PIP, a general framework for Meta-Learning approximate Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic interpretations of meta-learning to cover a broad class of methods. 2) We introduce \Versa{}, an instance of the framework employing a flexible and versatile amortization network that takes few-shot learning datasets as inputs, with arbitrary numbers of shots, and outputs a distribution over task-specific parameters in a single forward pass. \Versa{} substitutes optimization at test time with forward passes through inference networks, amortizing the cost of inference and relieving the need for second derivatives during training. 3) We evaluate \Versa{} on benchmark datasets where the method sets new state-of-the-art results, and can handle arbitrary number of shots, and for classification, arbitrary numbers of classes at train and test time. The power of the approach is then demonstrated through a challenging few-shot ShapeNet view reconstruction task.

##### Scalable Neural Theorem Proving on Knowledge Bases and Natural Language

tl;dr We scale Neural Theorem Provers to large datasets, improve the rule learning process, and extend it to jointly reason over text and Knowledge Bases.

Reasoning over text and Knowledge Bases (KBs) is a major challenge for ArtificialIntelligence, with applications in machine reading, dialogue, and question answering. Transducing text to logical forms which can be operated on is a brittle and error-prone process. Operating directly on text by jointly learning representations and transformations thereof by means of neural architectures that lack the ability to learn and exploit general rules can be very data-inefficient and not generalise correctly. These issues are addressed by Neural Theorem Provers (NTPs) (Rocktäschel & Riedel, 2017), neuro-symbolic systems based on a continuous relaxation of Prolog’s backward chaining algorithm, where symbolic unification between atoms is replaced by a differentiable operator computing the similarity between their embedding representations. In this paper, we first propose Neighbourhood-approximated Neural Theorem Provers (NaNTPs) consisting of two extensions toNTPs, namely a) a method for drastically reducing the previously prohibitive time and space complexity during inference and learning, and b) an attention mechanism for improving the rule learning process, deeming them usable on real-world datasets. Then, we propose a novel approach for jointly reasoning over KB facts and textual mentions, by jointly embedding them in a shared embedding space. The proposed method is able to extract rules and provide explanations—involving both textual patterns and KB relations—from large KBs and text corpora. We show thatNaNTPs perform on par with NTPs at a fraction of a cost, and can achieve competitive link prediction results on challenging large-scale datasets, including WN18, WN18RR, and FB15k-237 (with and without textual mentions) while being able to provide explanations for each prediction and extract interpretable rules.

tl;dr Adversarial training of ensembles provides robustness to adversarial examples beyond that observed in adversarially trained models and independently-trained ensembles thereof.

While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can make the models produce extremely inaccurate outputs. This makes these models particularly unsuitable for safety-critical application domains (e.g. self-driving cars) where robustness is extremely important. Recent work has shown that augmenting training with adversarially generated data provides some degree of robustness against test-time attacks. In this paper we investigate how this approach scales as we increase the computational budget given to the defender. We show that increasing the number of parameters in adversarially-trained models increases their robustness, and in particular that ensembling smaller models while adversarially training the entire ensemble as a single model is a more efficient way of spending said budget than simply using a larger single model. Crucially, we show that it is the adversarial training of the ensemble, rather than the ensembling of adversarially trained models, which provides robustness.

##### Universal Successor Features Approximators

No tl;dr =[

The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common form are universal value function approximators (UVFAs). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed \emph{universal successor features approximators} (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA, its generalisation properties and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate in a first-person perspective three-dimensional environment.

##### A Direct Approach to Robust Deep Learning Using Adversarial Networks

tl;dr Jointly train an adversarial noise generating network with a classification network to provide better robustness to adversarial attacks.

Deep neural networks have been shown to perform well in many classical machine learning problems, especially in image classification tasks. However, researchers have found that neural networks can be easily fooled, and they are surprisingly sensitive to small perturbations imperceptible to humans. Carefully crafted input images (adversarial examples) can force a well-trained neural network to provide arbitrary outputs. Including adversarial examples during training is a popular defense mechanism against adversarial attacks. In this paper we propose a new defensive mechanism under the generative adversarial network~(GAN) framework. We model the adversarial noise using a generative network, trained jointly with a classification discriminative network as a minimax game. We show empirically that our adversarial network approach works well against black box attacks, with performance on par with state-of-art methods such as ensemble adversarial training and adversarial training with projected gradient descent.

##### Variational recurrent models for representation learning

No tl;dr =[

We study the problem of learning representations of sequence data. Recent work has built on variational autoencoders to develop variational recurrent models for generation. Our main goal is not generation but rather representation learning for downstream prediction tasks. Existing variational recurrent models typically use stochastic recurrent connections to model the dependence among neighboring latent variables, while generation assumes independence of generated data per time step given the latent sequence. In contrast, our models assume independence among all latent variables given non-stochastic hidden states, which speeds up inference, while assuming dependence of observations at each time step on all latent variables, which improves representation quality. In addition, we propose and study extensions for improving downstream performance, including hierarchical auxiliary latent variables and prior updating during training. Experiments show improved performance on several speech and language tasks with different levels of supervision, as well as in a multi-view learning setting.

##### Model Compression with Generative Adversarial Networks

No tl;dr =[

The ever-increasing accuracy of machine learning models often comes at the expense of higher computational costs and memory requirements at test time, making them impractical to deploy on memory-constrained or CPU-constrained devices. Model compression (also known as distillation) is a technique to compress a complex model into a simpler one while maintaining most of the original accuracy. This can be done by using the same dataset for both the model training and compression tasks or by exploiting additional data. However, in many real-world applications, additional data are not available, and the repeated use of the original training data leads to suboptimal compression. In this work, we propose to use generative adversarial networks (GANs) to approximately sample from the distribution of the original data, thus generating ''unlimited'' synthetic data that can be used to perform the compression task. Our GAN-assisted model compression approach shows significant improvement in compressing complex models such as deep neural networks and large random forests on both image and tabular datasets. Furthermore, based on the model compression results, we propose a comprehensive metric—the Compression Score—to evaluate the quality of generative models, which captures both the discriminability and the diversity of the synthetic data. We show that the Compression Score performs well in cases when the popular Inception Score fails.

##### Competitive experience replay

tl;dr a novel method to learn with sparse reward using adversarial reward re-labeling

Deep learning has achieved remarkable successes in solving challenging reinforcement learning (RL) problems. However, it still often suffers from the need to engineer a reward function that not only reflects the task but is also carefully shaped. This limits the applicability of RL in the real world. It is therefore of great practical importance to develop algorithms which can learn from unshaped, sparse reward signals, e.g. a binary signal indicating successful task completion. We propose a novel method called competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only binary rewards indicating whether or not the goal is completed. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster converge and improved task performance.

##### A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

tl;dr We use empirical tools of mode connectivity and SVCCA to investigate neural network training heuristics of learning rate restarts, warmup and knowledge distillation.

The convergence rate and final performance of common deep learning models have significantly benefited from recently proposed heuristics such as learning rate schedules, knowledge distillation, skip connections and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining the efficacy of these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit the empirical analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz. mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons why the heuristics succeed. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed in the deeper layers.

##### PA-GAN: Improving GAN Training by Progressive Augmentation

tl;dr We introduce a new technique - progressive augmentation of GANs (PA-GAN) - that helps to improve the overall stability of GAN training.

Despite recent progress, Generative Adversarial Networks (GANs) still suffer from training instability, requiring careful consideration of architecture design choices and hyper-parameter tuning. The reason for this fragile training behaviour is partially due to the discriminator performing well very quickly; its loss converges to zero, providing no reliable backpropagation signal to the generator. In this work we introduce a new technique - progressive augmentation of GANs (PA-GAN) - that helps to overcome this fundamental limitation and improve the overall stability of GAN training. The key idea is to gradually increase the task difficulty of the discriminator by progressively augmenting its input space, thus enabling continuous learning of the generator. We show that the proposed progressive augmentation preserves the original GAN objective, does not bias the optimality of the discriminator and encourages the healthy competition between the generator and discriminator, leading to a better-performing generator. We experimentally demonstrate the effectiveness of the proposed approach on multiple benchmarks (MNIST, Fashion-MNIST, CIFAR10, CELEBA) for the image generation task.

##### ProMP: Proximal Meta-Policy Search

tl;dr A novel and theoretically grounded meta-reinforcement learning algorithm

Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

##### Deterministic Policy Gradients with General State Transitions

No tl;dr =[

We study a reinforcement learning setting, where the state transition function is a convex combination of a stochastic continuous function and a deterministic function. Such a setting generalizes the widely-studied stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We firstly give a simple example to illustrate that the deterministic policy gradient may be infinite under deterministic state transitions, and introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee the existence for all discount factors. We then derive a closed form of the policy gradient whenever exists. Furthermore, to overcome the challenge of high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is a new method of applying model-based techniques to the model-free algorithm, the deep deterministic policy gradient algorithm (DDPG). GDPG optimize the long-term rewards of the model-based augmented MDP subject to a constraint that the long-rewards of the MDP is less than the original one. We finally conduct extensive experiments comparing GDPG with state-of-the-art methods and the direct model-based extension method of DDPG on several standard continuous control benchmarks. Results demonstrate that GDPG substantially outperforms DDPG, the model-based extension of DDPG and other baselines in terms of both convergence and long-term rewards in most environments.

##### Learning to Make Analogies by Contrasting Abstract Relational Structure

tl;dr The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains.

Analogical reasoning has been a principal focus of various waves of AI research. Analogy is particularly challenging for machines because it requires relational structures to be represented such that they can be flexibly applied across diverse domains of experience. Here, we study how analogical reasoning can be induced in neural networks that learn to perceive and reason about raw visual data. We find that the critical factor for inducing such a capacity is not an elaborate architecture, but rather, careful attention to the choice of data and the manner in which it is presented to the model. The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains, a training method that uses only the input data to force models to learn about important abstract features. Using this technique we demonstrate capacities for complex, visual and symbolic analogy making and generalisation in even the simplest neural network architectures.

No tl;dr =[

##### Theoretical and Empirical Study of Adversarial Examples

No tl;dr =[

Many techniques are developed to defend against adversarial examples at scale. So far, the most successful defenses generate adversarial examples during each training step and add them to the training data. Yet, this brings significant computational overhead. In this paper, we investigate defenses against adversarial attacks. First, we propose feature smoothing, a simple data augmentation method with little computational overhead. Essentially, feature smoothing trains a neural network on virtual training data as an interpolation of features from a pair of samples, with the new label remaining the same as the dominant data point. The intuition behind feature smoothing is to generate virtual data points as close as adversarial examples, and to avoid the computational burden of generating data during training. Our experiments on MNIST and CIFAR10 datasets explore different combinations of known regularization and data augmentation methods and show that feature smoothing with logit squeezing performs best for both adversarial and clean accuracy. Second, we propose an unified framework to understand the connections and differences among different efficient methods by analyzing the biases and variances of decision boundary. We show that under some symmetrical assumptions, label smoothing, logit squeezing, weight decay, mix up and feature smoothing all produce an unbiased estimation of the decision boundary with smaller estimated variance. All of those methods except weight decay are also stable when the assumptions no longer hold.

##### The Case for Full-Matrix Adaptive Regularization

Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Due to the large number of parameters of machine learning problems, full-matrix preconditioning methods are prohibitively expensive. We show how to modify full-matrix adaptive regularization in order to make it practical and effective. We also provide novel theoretical analysis for adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, consists of efficient inverse computation of square roots of low-rank matrices. Our preliminary experiments underscore improved convergence rate of GGT across a variety of synthetic tasks and standard deep learning benchmarks.

##### Assessing Generalization in Deep Reinforcement Learning

No tl;dr =[

Deep reinforcement learning (RL) has achieved breakthrough results on many tasks, but has been shown to be sensitive to system changes at test time. As a result, building deep RL agents that generalize has become an active research area. Our aim is to catalyze and streamline community-wide progress on this problem by providing the first benchmark and a common experimental protocol for investigating generalization in RL. Our benchmark contains a diverse set of environments and our evaluation methodology covers both in-distribution and out-of-distribution generalization. To provide a set of baselines for future research, we conduct a systematic evaluation of state-of-the-art algorithms, including those that specifically tackle the problem of generalization. The experimental results indicate that in-distribution generalization may be within the capacity of current algorithms, while out-of-distribution generalization is an exciting challenge for future work.

##### INFORMATION MAXIMIZATION AUTO-ENCODING

tl;dr Information theoretical approach for unsupervised learning of unsupervised learning of a hybrid of discrete and continuous representations,

We propose the Information Maximization Autoencoder (IMAE), an information theoretic approach to simultaneously learn continuous and discrete representations in an unsupervised setting. Unlike the Variational Autoencoder (VAE) framework, IMAE starts from a stochastic encoder that seeks to map each input data to a hy- brid discrete and continuous representation with the objective of maximizing the mutual information between the data and the representation. A decoder is included for approximating the posterior distribution of the data given their representations, where a high fidelity approximation can be achieved by leveraging our informa- tive learned representations. We show that our objective is theoretically valid and provides a principled framework for understanding the tradeoffs among the infor- mativeness of each representation factor, disentanglement of representations, and the decoding quality.

tl;dr We propose a new, efficient algorithm to construct adversarial examples by means of deformations, rather than additive perturbations.

While deep neural networks have proven to be a powerful tool for many recognition and classification tasks, their stability properties are still not well understood. In the past, image classifiers have been shown to be vulnerable to so-called adversarial attacks, which are created by additively perturbing the correctly classified image. In this paper, we propose the ADef algorithm to construct a different kind of adversarial attack created by iteratively applying small deformations to the image, found through a gradient descent step. We demonstrate our results on MNIST with convolutional neural networks and on ImageNet with Inception-v3 and ResNet-101.

##### Computation-Efficient Quantization Method for Deep Neural Networks

tl;dr A simple computation-efficient quantization training method for CNNs and RNNs.

Deep Neural Networks, being memory and computation intensive, are a challenge to deploy in smaller devices. Numerous quantization techniques have been proposed to reduce the inference latency/memory consumption. However, these techniques impose a large overhead on the training procedure or need to change the training process. We present a non-intrusive quantization technique based on re-training the full precision model, followed by directly optimizing the corresponding binary model. The quantization training process takes no longer than the original training process. We also propose a new loss function to regularize the weights, resulting in reduced quantization error. Combining both help us achieve full precision accuracy on CIFAR dataset using binary quantization. We also achieve full precision accuracy on WikiText-2 using 2 bit quantization. Comparable results are also shown for ImageNet. We also present a 1.5 bits hybrid model exceeding the performance of TWN LSTM model for WikiText-2.

##### Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning

No tl;dr =[

Deep artificial neural networks (DNNs) are typically trained via gradient-based learning algorithms, namely backpropagation. Evolution strategies (ES) can rival backprop-based algorithms such as Q-learning and policy gradients on challenging deep reinforcement learning (RL) problems. However, ES can be considered a gradient-based algorithm because it performs stochastic gradient descent via an operation similar to a finite-difference approximation of the gradient. That raises the question of whether non-gradient-based evolutionary algorithms can work at DNN scales. Here we demonstrate they can: we evolve the weights of a DNN with a simple, gradient-free, population-based genetic algorithm (GA) and it performs well on hard deep RL problems, including Atari and humanoid locomotion. The Deep GA successfully evolves networks with over four million free parameters, the largest neural networks ever evolved with a traditional evolutionary algorithm. These results (1) expand our sense of the scale at which GAs can operate, (2) suggest intriguingly that in some cases following the gradient is not the best choice for optimizing performance, and (3) make immediately available the multitude of neuroevolution techniques that improve performance. We demonstrate the latter by showing that combining DNNs with novelty search, which encourages exploration on tasks with deceptive or sparse reward functions, can solve a high-dimensional problem on which reward-maximizing algorithms (e.g.\ DQN, A3C, ES, and the GA) fail. Additionally, the Deep GA is faster than ES, A3C, and DQN (it can train Atari in {\raise.17ex\hbox{$\scriptstyle\sim$}}4 hours on one workstation or {\raise.17ex\hbox{$\scriptstyle\sim$}}1 hour distributed on 720 cores), and enables a state-of-the-art, up to 10,000-fold compact encoding technique.

##### AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference

tl;dr Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation.

Wide adoption of complex RNN based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge distillation based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.

##### PASS: Phased Attentive State Space Modeling of Disease Progression Trajectories

No tl;dr =[

Disease progression models are instrumental in predicting individual-level health trajectories and understanding disease dynamics. Existing models are capable of providing either accurate predictions of patients’ prognoses or clinically interpretable representations of disease pathophysiology, but not both. In this paper, we develop the phased attentive state space (PASS) model of disease progression, a deep probabilistic model that captures complex representations for disease progression while maintaining clinical interpretability. Unlike Markovian state space models which assume memoryless dynamics, PASS uses an attention mechanism to induce "memoryful" state transitions, whereby repeatedly updated attention weights are used to focus on past state realizations that best predict future states. This gives rise to complex, non-stationary state dynamics that remain interpretable through the generated attention weights, which designate the relationships between the realized state variables for individual patients. PASS uses phased LSTM units (with time gates controlled by parametrized oscillations) to generate the attention weights in continuous time, which enables handling irregularly-sampled and potentially missing medical observations. Experiments on data from a realworld cohort of patients show that PASS successfully balances the tradeoff between accuracy and interpretability: it demonstrates superior predictive accuracy and learns insightful individual-level representations of disease progression.

##### Exploration in Policy Mirror Descent

No tl;dr =[

Policy optimization is a core problem in reinforcement learning. In this paper, we investigate Reversed Entropy Policy Mirror Descent (REPMD), an on-line policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective. REPMD conducts a form of maximum entropy exploration within a mirror descent framework, but uses an alternative policy update with a reversed KL projection. This modified formulation bypasses undesirable mode seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. An experimental evaluation demonstrates that this approach significantly improves practical exploration and surpasses the empirical performance of state-of-the art policy optimization methods in a set of benchmark tasks.

##### Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

tl;dr We propose the Prob2Vec method for problem embedding used in a personalized e-learning tool in addition to a data level classification method, called negative pre-training, for cases where the training data set is imbalanced.

We propose a new application of embedding techniques to problem retrieval in adaptive tutoring. The objective is to retrieve problems similar in mathematical concepts. There are two challenges: First, like sentences, problems helpful to tutoring are never exactly the same in terms of the underlying concepts. Instead, good problems mix concepts in innovative ways, while still displaying continuity in their relationships. Second, it is difficult for humans to determine a similarity score consistent across a large enough training set. We propose a hierarchical problem embedding algorithm, called Prob2Vec, that consists of an abstraction and an embedding step. Prob2Vec achieves 96.88\% accuracy on a problem similarity test, in contrast to 75\% from directly applying state-of-the-art sentence embedding methods. It is surprising that Prob2Vec is able to distinguish very fine-grained differences among problems, an ability humans need time and effort to acquire. In addition, the sub-problem of concept labeling with imbalanced training data set is interesting in its own right. It is a multi-label problem suffering from dimensionality explosion, which we propose ways to ameliorate. We propose the novel negative pre-training algorithm that dramatically reduces false negative and positive ratios for classification, using an imbalanced training data set.

##### Toward Understanding the Impact of Staleness in Distributed Machine Learning

tl;dr Empirical and theoretical study of the effects of staleness in non-synchronous execution on machine learning algorithms.

Most distributed machine learning (ML) systems store a copy of the model parameters locally on each machine to minimize network communication. In practice, in order to reduce synchronization waiting time, these copies of the model are not necessarily updated in lock-steps, and can become stale. Despite much development in large-scale ML, the effect of staleness on the learning efficiency is inconclusive, mainly because it is challenging to control or monitor the staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms, and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of SGD in non-convex optimization under staleness, matching the best known convergence rate of O(1/\sqrt{T}).

##### Learning to Separate Domains in Generalized Zero-Shot and Open Set Learning: a probabilistic perspective

tl;dr This paper studies the problem of domain division problem by segmenting instances drawn from different probabilistic distributions.

This paper studies the problem of domain division problem which aims to segment instances drawn from different probabilistic distributions. Such a problem exists in many previous recognition tasks, such as Open Set Learning (OSL) and Generalized Zero-Shot Learning (G-ZSL), where the testing instances come from either seen or novel/unseen classes of different probabilistic distributions. Previous works focused on either only calibrating the confident prediction of classifiers of seen classes (W-SVM), or taking unseen classes as outliers. In contrast, this paper proposes a probabilistic way of directly estimating and fine-tuning the decision boundary between seen and novel/unseen classes. In particular, we propose a domain division algorithm of learning to split the testing instances into known, unknown and uncertain domains, and then conduct recognize tasks in each domain. Two statistical tools, namely, bootstrapping and Kolmogorov-Smirnov (K-S) Test, for the first time, are introduced to discover and fine-tune the decision boundary of each domain. Critically, the uncertain domain is newly introduced in our framework to adopt those instances whose domain cannot be predicted confidently. Extensive experiments demonstrate that our approach achieved the state-of-the-art performance on OSL and G-ZSL benchmarks.

##### Laplacian Networks: Bounding Indicator Function Smoothness for Neural Networks Robustness

No tl;dr =[

For the past few years, Deep Neural Network (DNN) robustness has become a question of paramount importance. As a matter of fact, in sensitive settings misclassification can lead to dramatic consequences. Such misclassifications are likely to occur when facing adversarial attacks, hardware failures or limitations, and imperfect signal acquisition. To address this question, authors have proposed different approaches aiming at increasing the robustness of DNNs, such as adding regularizers or training using noisy examples. In this paper we propose a new regularizer built upon the Laplacian of similarity graphs obtained from the representation of training data at each layer of the DNN architecture. This regularizer penalizes large changes (across consecutive layers in the architecture) in the distance between examples of different classes, and as such enforces smooth variations of the class boundaries. Since it is agnostic to the type of deformations that are expected when predicting with the DNN, the proposed regularizer can be combined with existing ad-hoc methods. We provide theoretical justification for this regularizer and demonstrate its effectiveness to improve robustness of DNNs on classical supervised learning vision datasets.

##### Invariance and Inverse Stability under ReLU

tl;dr We analyze the invertibility of deep neural networks by studying preimages of ReLU-layers and the stability of the inverse.

We flip the usual approach to study invariance and robustness of neural networks by considering the non-uniqueness and instability of the inverse mapping. We provide theoretical and numerical results on the inverse of ReLU-layers. First, we derive a necessary and sufficient condition on the existence of invariance that provides a geometric interpretation. Next, we move to robustness via analyzing local effects on the inverse. To conclude, we show how this reverse point of view not only provides insights into key effects, but also enables to view adversarial examples from different perspectives.

##### Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds

No tl;dr =[

We present an efficient coresets-based neural network compression algorithm that sparsifies the parameters of a trained fully-connected neural network in a manner that provably approximates the network's output. Our approach is based on an importance sampling scheme that judiciously defines a sampling distribution over the neural network parameters, and as a result, retains parameters of high importance while discarding redundant ones. We leverage a novel, empirical notion of sensitivity and extend traditional coreset constructions to the application of compressing parameters. Our theoretical analysis establishes guarantees on the size and accuracy of the resulting compressed network and gives rise to generalization bounds that may provide new insights into the generalization properties of neural networks. We demonstrate the practical effectiveness of our algorithm on a variety of neural network configurations and real-world data sets.

##### INTERPRETABLE CONVOLUTIONAL FILTER PRUNING

No tl;dr =[

The sophisticated structure of Convolutional Neural Network (CNN) allows for outstanding performance, but at the cost of intensive computation. As significant redundancies inevitably present in such a structure, many works have been proposed to prune the convolutional filters for computation cost reduction. Although extremely effective, most works are based only on quantitative characteristics of the convolutional filters, and highly overlook the qualitative interpretation of individual filter’s specific functionality. In this work, we interpreted the functionality and redundancy of the convolutional filters from different perspectives, and proposed a functionality-oriented filter pruning method. With extensive experiment results, we proved the convolutional filters’ qualitative significance regardless of magnitude, demonstrated significant neural network redundancy due to repetitive filter functions, and analyzed the filter functionality defection under inappropriate retraining process. Such an interpretable pruning approach not only offers outstanding computation cost optimization over previous filter pruning methods, but also interprets filter pruning process.

##### Overcoming Catastrophic Forgetting via Model Adaptation

No tl;dr =[

Learning multiple tasks sequentially is important for the development of AI and lifelong learning systems. However, standard neural network architectures suffer from catastrophic forgetting which makes it difficult to learn a sequence of tasks. Several continual learning methods have been proposed to address the problem. In this paper, we propose a very different approach, called model adaptation, to dealing with the problem. The proposed approach learns to build a model, called the solver, with two sets of parameters. The first set is shared by all tasks learned so far and the second set is dynamically generated to adapt the solver to suit each individual test example in order to classify it. Extensive experiments have been carried out to demonstrate the effectiveness of the proposed approach.

##### BlackMarks: Black-box Multi-bit Watermarking for Deep Neural Networks

tl;dr Proposing the first watermarking framework for multi-bit signature embedding and extraction using the outputs of the DNN.

Deep Neural Networks (DNNs) are increasingly deployed in cloud servers and autonomous agents due to their superior performance. The deployed DNN is either leveraged in a white-box setting (model internals are publicly known) or a black-box setting (only model outputs are known) depending on the application. A practical concern in the rush to adopt DNNs is protecting the models against Intellectual Property (IP) infringement. We propose BlackMarks, the first end-to-end multi-bit watermarking framework that is applicable in the black-box scenario. BlackMarks takes the pre-trained unmarked model and the owner’s binary signature as inputs. The output is the corresponding marked model with specific keys that can be later used to trigger the embedded watermark. To do so, BlackMarks first designs a model-dependent encoding scheme that maps all possible classes in the task to bit ‘0’ and bit ‘1’. Given the owner’s watermark signature (a binary string), a set of key image and label pairs is designed using targeted adversarial attacks. The watermark (WM) is then encoded in the distribution of output activations of the DNN by fine-tuning the model with a WM-specific regularized loss. To extract the WM, BlackMarks queries the model with the WM key images and decodes the owner’s signature from the corresponding predictions using the designed encoding scheme. We perform a comprehensive evaluation of BlackMarks’ performance on MNIST, CIFAR-10, ImageNet datasets and corroborate its effectiveness and robustness. BlackMarks preserves the functionality of the original DNN and incurs negligible WM embedding overhead as low as 2.054%.

##### Selective Convolutional Units: Improving CNNs via Channel Selectivity

tl;dr We propose a new module that improves any ResNet-like architectures by enforcing "channel selective" behavior to convolutional layers

Bottleneck structures with identity (e.g., residual) connection are now emerging popular paradigms for designing deep convolutional neural networks (CNN), for processing large-scale features efficiently. In this paper, we focus on the information-preserving nature of the bottleneck structures and utilize this to enable a convolutional layer to have a new functionality of channel-selectivity, i.e., focusing its computations on important channels. In particular, we propose Selective Convolutional Unit (SCU), an easy-to-use architectural unit that improves parameter efficiency of various modern CNNs with bottlenecks. During training, SCU gradually learns the channel-selectivity on-the-fly via the alternative usage of (a) pruning unimportant channels, and (b) rewiring the pruned parameters to important channels. The rewired parameters emphasize the target channel in a way that selectively enlarges the convolutional kernels corresponding to it. Our experimental results demonstrate that the SCU-based models without any post-processing generally achieve both model compression and accuracy improvement compared to the baselines, consistently for all tested architectures.

##### REVERSED NEURAL NETWORK - AUTOMATICALLY FINDING NASH EQUILIBRIUM

tl;dr REVERSED NEURAL NETWORK - A PRIMAL

Contrary to most reinforcement learning studies emphasizing on approximating the output layer of a neural network to certain strategies, this paper proposes a reversed way for reinforcement learning. We call this “Reversed Neural Network”. In short, after sufficiently training a canonical deep feed-forward neural network according to a strategy-and-environment-to-payoff table, we randomize part of the neurons in the input layer and propagate the error between the generated output and the desired output back to the part of the neurons in the “input layer” of the trained deep neural network recurrently. And we view the final neurons in the “input layer” as the fittest strategy for a neural network.

##### Robustness May Be at Odds with Accuracy

tl;dr We show that adversarial robustness might come at the cost of standard classification performance, but also yields unexpected benefits.

We show that there exists an inherent tension between the goal of adversarial robustness and that of standard generalization. Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy. We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists even in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed in practice. Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers. These differences, in particular, seem to result in unexpected benefits: the representations learned by robust models tend to align better with salient data characteristics and human perception.

##### NECST: Neural Joint Source-Channel Coding

tl;dr jointly learn compression + error correcting codes with deep learning

For reliable transmission across a noisy communication channel, classical results from information theory show that it is asymptotically optimal to separate out the source and channel coding processes. However, this decomposition can fall short in the finite bit-length regime, as it requires non-trivial tuning of hand-crafted codes and assumes infinite computational power for decoding. In this work, we propose Neural Error Correcting and Source Trimming (NECST) codes to jointly learn the encoding and decoding processes in an end-to-end fashion. By adding noise into the latent codes to simulate the channel during training, we learn to both compress and error-correct given a fixed bit-length and computational budget. We obtain codes that are not only competitive against several capacity-approaching channel codes, but also learn useful robust representations of the data for downstream tasks such as classification. Finally, we learn an extremely fast neural decoder, yielding almost an order of magnitude in speedup compared to standard decoding methods based on iterative belief propagation.

##### Faster Training by Selecting Samples Using Embeddings

tl;dr Training is sped up by using a dataset that has been subsampled through embedding analysis.

Long training times have increasingly become a burden for researchers by slowing down the pace of innovation, with some models taking days or weeks to train. In this paper, a new, general technique is presented that aims to speed up the training process by using a thinned-down training dataset. By leveraging autoencoders and the unique properties of embedding spaces, we are able to filter training datasets to include only those samples that matter the most. Through evaluation on a standard CIFAR-10 image classification task, this technique is shown to be effective. With this technique, training times can be reduced with a minimal loss in accuracy. Conversely, given a fixed training time budget, the technique was shown to improve accuracy by over 50%. This technique is a practical tool for achieving better results with large datasets and limited computational budgets.

##### Representation Flow for Action Recognition

No tl;dr =[

In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to optimally capture the 'flow' of any representation channel within a convolutional neural network. Its parameters for iterative flow optimization are learned in an end-to-end fashion together with the other model parameters, maximizing the action recognition performance. Furthermore, we newly introduce the concept of learning 'flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming its advantages over previous recognition models using traditional optical flows in both computational speed and performance.

##### Sentence Encoding with Tree-Constrained Relation Networks

No tl;dr =[

The meaning of a sentence is a function of the relations that hold between its words. We instantiate this relational view of semantics in a series of neural models based on variants of relation networks (RNs) which represent a set of objects (for us, words forming a sentence) in terms of representations of pairs of objects. We propose two extensions to the basic RN model for natural language. First, building on the intuition that not all word pairs are equally informative about the meaning of a sentence, we use constraints based on both supervised and unsupervised dependency syntax to control which relations influence the representation. Second, since higher-order relations are poorly captured by a sum of pairwise relations, we use a recurrent extension of RNs to propagate information so as to form representations of higher order relations. Experiments on sentence classification, sentence pair classification, and machine translation reveal that, while basic RNs are only modestly effective for sentence representation, recurrent RNs with latent syntax are a reliably powerful representational device.

##### Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm

tl;dr We analyze and determine the precision requirements for training neural networks when all tensors, including back-propagated signals and weight accumulators, are quantized to fixed-point format.

The high computational and parameter complexity of neural networks makes their training very slow and difficult to deploy on energy and storage-constrained comput- ing systems. Many network complexity reduction techniques have been proposed including fixed-point implementation. However, a systematic approach for design- ing full fixed-point training and inference of deep neural networks remains elusive. We describe a precision assignment methodology for neural network training in which all network parameters, i.e., activations and weights in the feedforward path, gradients and weight accumulators in the feedback path, are assigned close to minimal precision. The precision assignment is derived analytically and enables tracking the convergence behavior of the full precision training, known to converge a priori. Thus, our work leads to a systematic methodology of determining suit- able precision for fixed-point training. The near optimality (minimality) of the resulting precision assignment is validated empirically for four networks on the CIFAR-10, CIFAR-100, and SVHN datasets. The complexity reduction arising from our approach is compared with other fixed-point neural network designs.

##### SALSA-TEXT : SELF ATTENTIVE LATENT SPACE BASED ADVERSARIAL TEXT GENERATION

tl;dr We propose a self-attention based GAN architecture for unconditional text generation and improve on previous adversarial code-based results.

Inspired by the success of self attention mechanism and Transformer architecture in sequence transduction and image generation applications, we propose novel self attention-based architectures to improve the performance of adversarial latent code- based schemes in text generation. Adversarial latent code-based text generation has recently gained a lot of attention due to their promising results. In this paper, we take a step to fortify the architectures used in these setups, specifically AAE and ARAE. We benchmark two latent code-based methods (AAE and ARAE) designed based on adversarial setups. In our experiments, the Google sentence compression dataset is utilized to compare our method with these methods using various objective and subjective measures. The experiments demonstrate the proposed (self) attention-based models outperform the state-of-the-art in adversarial code-based text generation.

##### Multiple-Attribute Text Rewriting

tl;dr A system for rewriting text conditioned on multiple controllable attributes

The dominant approach to unsupervised "style transfer" in text is based on the idea of learning a latent representation, which is independent of the attributes specifying its "style". In this paper, we show that this condition is not necessary and is not always met in practice, even with domain adversarial training, that explicitly aims at learning such disentangled representations. We thus propose a new model that controls several factors of variation in textual data where this condition on disentanglement is replaced with a simpler mechanism based on back-translation. Our method allows control over multiple attributes, like gender, sentiment, product type, etc., and a more fine-grained control on the trade-off between content preservation and change of style with a pooling operator in the latent space. Our experiments demonstrate that the fully entangled model produces better generations, even when tested on new and more challenging benchmarks comprising reviews with multiple sentences and multiple attributes.

##### An Efficient and Margin-Approaching Zero-Confidence Adversarial Attack

tl;dr This paper introduces MarginAttack, a stronger and faster zero-confidence adversarial attack.

There are two major paradigms of white-box adversarial attacks that attempt to impose input perturbations. The first paradigm, called the fix-perturbation attack, crafts adversarial samples within a given perturbation level. The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause misclassification, also known as the margin of an input feature. While the former paradigm is well-resolved, the latter is not. Existing zero-confidence attacks either introduce significant approximation errors, or are too time-consuming. We therefore propose MarginAttack, a zero-confidence attack framework that is able to compute the margin with improved accuracy and efficiency. Our experiments show that MarginAttack is able to compute a smaller margin than the state-of-the-art zero-confidence attacks, and matches the state-of-the-art fix-perturbation attacks. In addition, it runs significantly faster than the Carlini-Wagner attack, currently the most accurate zero-confidence attack algorithm.

##### MARGINALIZED AVERAGE ATTENTIONAL NETWORK FOR WEAKLY-SUPERVISED LEARNING

tl;dr A novel marginalized average attentional network for weakly-supervised temporal action localization

In weakly-supervised temporal action localization, previous works suffer from overestimating the most salient regions and fail to locate dense and integral regions for each entire action. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples the subsets from the video snippet features based on the latent discriminative probabilities and takes the expectation over all the subset features. Theoretically, we prove that the learned latent discriminative probabilities reduce the difference of responses between the most salient regions and the others, and thus MAAN generates better class activation sequences to identify more dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that our MAAN achieves superior performance on weakly-supervised temporal action localization task.

##### Total Style Transfer with a Single Feed-Forward Network

tl;dr A paper suggesting a method to transform the style of images using deep neural networks.

Recent image style transferring methods achieved arbitrary stylization with input content and style images. To transfer the style of an arbitrary image to a content image, these methods used a feed-forward network with a lowest-scaled feature transformer or a cascade of the networks with a feature transformer of a corresponding scale. However, their approaches did not consider either multi-scaled style in their single-scale feature transformer or dependency between the transformed feature statistics across the cascade networks. This shortcoming resulted in generating partially and inexactly transferred style in the generated images. To overcome this limitation of partial style transfer, we propose a total style transferring method which transfers multi-scaled feature statistics through a single feed-forward process. First, our method transforms multi-scaled feature maps of a content image into those of a target style image by considering both inter-channel correlations in each single scaled feature map and inter-scale correlations between multi-scaled feature maps. Second, each transformed feature map is inserted into the decoder layer of the corresponding scale using skip-connection. Finally, the skip-connected multi-scaled feature maps are decoded into a stylized image through our trained decoder network.

##### Learning Latent Superstructures in Variational Autoencoders for Deep Multidimensional Clustering

tl;dr We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features.

We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features. In general, our superstructure is a tree structure of multiple super latent variables and it is automatically learned from data. When there is only one latent variable in the superstructure, our model reduces to one that assumes the latent features to be generated from a Gaussian mixture model. We call our model the latent tree variational autoencoder (LTVAE). Whereas previous deep learning methods for clustering produce only one partition of data, LTVAE produces multiple partitions of data, each being given by one super latent variable. This is desirable because high dimensional data usually have many different natural facets and can be meaningfully partitioned in multiple ways.

##### TherML: The Thermodynamics of Machine Learning

tl;dr We offer a framework for representation learning that connects with a wide class of existing objectives and is analogous to thermodynamics.

In this work we offer an information-theoretic framework for representation learning that connects with a wide class of existing objectives in machine learning. We develop a formal correspondence between this work and thermodynamics and discuss its implications.

##### Learned optimizers that outperform on wall-clock and validation loss

tl;dr We analyze problems when training learned optimizers, address those problems via variational optimization using two complementary gradient estimators, and train optimizers that are 5x faster in wall-clock time than baseline optimizers (e.g. Adam).

Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned update functions may similarly outperform current hand-designed optimizers, especially for specific tasks. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process. The resulting gradients are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. This allows us to train neural networks to perform optimization faster than well tuned first-order methods. Moreover, by training the optimizer against validation loss, as opposed to training loss, we are able to use it to train models which generalize better than those trained by first order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks in a fifth of the wall-clock time compared to tuned first-order methods, and with an improvement

##### Combining adaptive algorithms and hypergradient method: a performance and robustness study

tl;dr We provide a study trying to see how the recent online learning rate adaptation extends the conclusion made by Wilson et al. 2018 about adaptive gradient methods, along with comparison and sensitivity analysis.

Wilson et al. (2017) showed that, when the stepsize schedule is properly designed, stochastic gradient generalizes better than ADAM (Kingma & Ba, 2014). In light of recent work on hypergradient methods (Baydin et al., 2018), we revisit these claims to see if such methods close the gap between the most popular optimizers. As a byproduct, we analyze the true benefit of these hypergradient methods compared to more classical schedules, such as the fixed decay of Wilson et al. (2017). In particular, we observe they are of marginal help since their performance varies significantly when tuning their hyperparameters. Finally, as robustness is a critical quality of an optimizer, we provide a sensitivity analysis of these gradient based optimizers to assess how challenging their tuning is.

##### ON THE EFFECTIVENESS OF TASK GRANULARITY FOR TRANSFER LEARNING

tl;dr If the model architecture is fixed, how would the complexity and granularity of task, effect the quality of learned features for transferring to a new task.

We describe a DNN for video classification and captioning, trained end-to-end, with shared features, to solve tasks at different levels of granularity, exploring the link between granularity in a source task and the quality of learned features for transfer learning. For solving the new task domain in transfer learning, we freeze the trained encoder and fine-tune an MLP on the target domain. We train on the Something-Something dataset with over 220, 000 videos, and multiple levels of target granularity, including 50 action groups, 174 fine-grained action categories and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for SomethingSomething, with impressive fine-grained results. And it yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.

##### DEEP HIERARCHICAL MODEL FOR HIERARCHICAL SELECTIVE CLASSIFICATION AND ZERO SHOT LEARNING

tl;dr propose a new heirarchical probability bases loss funcsion which yeilds a better semantic classifier. We show our model advantages on two applications.

Object recognition in real-world image scenes is still an open problem. A large number of object classes with complex relationships between them makes the classification problem particularly challenging. Standard N-way discrete classifiers treat all classes as disconnected and unrelated, and therefore unable to learn from their semantic relationships. In this work, we present a hierarchical interclass relationship model, and train it using a newly proposed probability-based loss function. We show the model advantages deploying it in two scenarios. The first one, selective classification, deals with the problem of low-confidence classification, wherein a model is unable to make a successful exact classification. In this case, our model returns a corresponding closest super-class. In the second scenario, the proposed method is used for the zero-shot learning problem. In this case, given a new input, the model returns its hierarchically related group, rather than generating a true unseen group. Extensive experiments with the two scenarios show that the proposed hierarchical model provides significantly better semantic generalization ability compared to a regular N-way classifier, and yields more accurate and meaningful super-class predictions.

##### Estimating Heterogeneous Treatment Effects Using Neural Networks With The Y-Learner

tl;dr We develop a CATE estimation strategy that takes advantage some of the intriguing properties of neural networks.

We develop the Y-learner for estimating heterogeneous treatment effects in experimental and observational studies. The Y-learner is designed to leverage the abilities of neural networks to optimize multiple objectives and continually update, which allows for better pooling of underlying feature information between treatment and control groups. We evaluate the Y-learner on three test problems: (1) A set of six simulated data benchmarks from the literature. (2) A real-world large-scale experiment on voter persuasion. (3) A task from the literature that estimates artificially generated treatment effects on MNIST didgits. The Y-learner achieves state of the art results on two of the three tasks. On the MNIST task, it gets the second best results.

##### Human-level Protein Localization with Convolutional Neural Networks

No tl;dr =[

Localizing a specific protein in a human cell is essential for understanding cellular functions and biological processes of underlying diseases. A promising, low-cost, and time-efficient biotechnology for localizing proteins is high-throughput fluorescence microscopy imaging (HTI). HTI stains the protein of interest in a cell with fluorescent antibodies and subsequently takes a microscopic image. Together with images of other stained proteins or cell organelles and the annotation by the Human Protein Atlas project, these images provide a rich source of information on the protein location which can be utilized by computational methods. It is yet unclear how precise such methods are and whether they can compete with human experts. We here focus on deep learning image analysis methods and, in particular, on Convolutional Neural Networks (CNNs) since they showed overwhelming success across different imaging tasks. We propose a novel CNN architecture “GapNet-PL” that has been designed to tackle the characteristics of HTI data and uses global averages of filters at different abstraction levels. We present the largest comparison of CNN architectures including GapNet-PL for protein localization in HTI images of human cells. GapNet-PL outperforms all other competing methods and reaches close to perfect localization in all 13 tasks with an average AUC of 98% and F1 score of 78%. On a separate test set the performance of GapNet-PL was compared with a human expert. GapNet-PL achieved an accuracy of 91%, significantly (p-value 2e-10) outperforming the human expert with an accuracy of 61%.

##### The Nonlinearity Coefficient - Predicting Generalization in Deep Neural Networks

tl;dr We introduce the NLC, a metric that is cheap to compute in the networks randomly initialized state and is highly predictive of generalization, at least in fully-connected networks.

For a long time, designing neural architectures that exhibit high performance was considered a dark art that required expert hand-tuning. One of the few well-known guidelines for architecture design is the avoidance of exploding or vanishing gradients. However, even this guideline has remained relatively vague and circumstantial, because there exists no well-defined, gradient-based metric that can be computed {\it before} training begins and can robustly predict the performance of the network {\it after} training is complete. We introduce what is, to the best of our knowledge, the first such metric: the nonlinearity coefficient (NLC). Via an extensive empirical study, we show that the NLC, computed in the network's randomly initialized state, is a powerful predictor of test error and that attaining a right-sized NLC is essential for attaining an optimal test error, at least in fully-connected feedforward networks. The NLC is also conceptually simple, cheap to compute, and is robust to a range of confounders and architectural design choices that comparable metrics are not necessarily robust to. Hence, we argue the NLC is an important tool for architecture search and design, as it can robustly predict poor training outcomes before training even begins.

##### Learning space time dynamics with PDE guided neural networks

No tl;dr =[

Spatio-Temporal processes bear a central importance in many applied scientific fields. Generally, differential equations are used to describe these processes. In this work, we address the problem of learning spatio-temporal dynamics with neural networks when only partial information on the system's state is available. Taking inspiration from the dynamical system approach, we outline a general framework in which complex dynamics generated by families of differential equations can be learned in a principled way. Two models are derived from this framework. We demonstrate how they can be applied in practice by considering the problem of forecasting fluid flows. We show how the underlying equations fit into our formalism and evaluate our method by comparing with standard baselines.

##### Dopamine: A Research Framework for Deep Reinforcement Learning

tl;dr In this paper we introduce Dopamine, a new research framework for deep RL that is open-source, TensorFlow-based, and provides compact yet reliable implementations of some state-of-the-art deep RL agents.

Deep reinforcement learning (deep RL) research has grown significantly in recent years. A number of software offerings now exist that provide stable, comprehensive implementations for benchmarking. At the same time, recent deep RL research has become more diverse in its goals. In this paper we introduce Dopamine, a new research framework for deep RL that aims to support some of that diversity. Dopamine is open-source, TensorFlow-based, and provides compact yet reliable implementations of some state-of-the-art deep RL agents. We complement this offering with a taxonomy of the different research objectives in deep RL research. While by no means exhaustive, our analysis highlights the heterogeneity of research in the field, and the value of frameworks such as ours.

##### Discovery of natural language concepts in individual units

tl;dr We show that individual units in CNN representations learned in NLP tasks are selectively responsive to natural language concepts.

Although deep convolutional networks have achieved improved performance in many natural language tasks, they have been treated as black boxes because they are difficult to interpret. Especially, little is known about how they represent language in their intermediate layers. In an attempt to understand the representations of deep convolutional networks trained on language tasks, we show that individual units are selectively responsive to specific morphemes, words, and phrases, rather than responding to arbitrary and uninterpretable patterns. In order to quantitatively analyze such intriguing phenomenon, we propose a concept alignment method based on how units respond to replicated text. We conduct analyses with different architectures on multiple datasets for classification and translation tasks and provide new insights into how deep models understand natural language.

##### ON RANDOM DEEP AUTOENCODERS: EXACT ASYMPTOTIC ANALYSIS, PHASE TRANSITIONS, AND IMPLICATIONS TO TRAINING

tl;dr We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena.

We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena when the depth becomes large. This, in particular, provides quantitative answers and insights to three questions that were yet fully understood in the literature. Firstly, we provide a precise answer on how the random deep weight-tied autoencoder model performs “approximate inference” as posed by Scellier et al. (2018), and its connection to reversibility considered by several theoretical studies. Secondly, we show that deep autoencoders display a higher degree of sensitivity to perturbations in the parameters, distinct from the shallow counterparts. Thirdly, we obtain insights on pitfalls in training initialization practice, and demonstrate experimentally that it is possible to train a deep autoencoder, even with the tanh activation and a depth as large as 200 layers, without resorting to techniques such as layer-wise pre-training or batch normalization. Our analysis is not specific to any depths or any Lipschitz activations, and our analytical techniques may have broader applicability.

##### Detecting Adversarial Examples Via Neural Fingerprinting

Deep neural networks are vulnerable to adversarial examples: input data that has been manipulated to cause dramatic model output errors. To defend against such attacks, we propose NeuralFingerprinting: a simple, yet effective method to detect adversarial examples that verifies whether model behavior is consistent with a set of fingerprints. These fingerprints are encoded into the model response during training and are inspired by the use of biometric and cryptographic signatures. In contrast to previous defenses, our method does not rely on knowledge of the adversary and can scale to large networks and input data. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. In this work, we 1) theoretically analyze NeuralFingerprinting for linear models and 2) show that NeuralFingerprinting significantly improves on state-of-the-art detection mechanisms for deep neural networks, by detecting the strongest known adversarial attacks with 98-100% AUC-ROC scores on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, we consider several threat models, including the most conservative one in which the attacker has full knowledge of the defender's strategy. In all settings, the detection accuracy of NeuralFingerprinting generalizes well to unseen test-data and is robust over a wide range of hyperparameters.

##### Hierarchical RL Using an Ensemble of Proprioceptive Periodic Policies

No tl;dr =[

In this paper we introduce a simple, robust approach to hierarchically training an agent in the setting of sparse reward tasks. The agent is split into a low-level and a high-level policy. The low-level policy only accesses internal, proprioceptive dimensions of the state observation. The low-level policies are trained with a simple reward that encourages changing the values of the non-proprioceptive dimensions. Furthermore, it is induced to be periodic with the use a phase function.'' The high-level policy is trained using a sparse, task-dependent reward, and operates by choosing which of the low-level policies to run at any given time. Using this approach, we solve difficult maze and navigation tasks with sparse rewards using the Mujoco Ant and Humanoid agents and show improvement over recent hierarchical methods.

##### Smoothing the Geometry of Probabilistic Box Embeddings

tl;dr Improve hierarchical embedding models using kernel smoothing

There is growing interest in geometrically-inspired embeddings for learning hierarchies, partial orders, and lattice structures, with natural applications to transitive relational data such as entailment graphs. Recent work has extended these ideas beyond deterministic hierarchies to probabilistically calibrated models, which enable learning from uncertain supervision and inferring soft-inclusions among concepts, while maintaining the geometric inductive bias of hierarchical embedding models. We build on the Box Lattice model of Vilnis et al. (2018), which showed promising results in modeling soft-inclusions through an overlapping hierarchy of sets, parameterized as high-dimensional hyperrectangles (boxes). However, the hard edges of the boxes present difficulties for standard gradient based optimization; that work employed a special surrogate function for the disjoint case, but we find this method to be fragile. In this work, we present a novel hierarchical embedding model, inspired by a relaxation of box embeddings into parameterized density functions using Gaussian convolutions over the boxes. Our approach provides an alternative surrogate to the original lattice measure that improves the robustness of optimization in the disjoint case, while also preserving the desirable properties with respect to the original lattice. We demonstrate increased or matching performance on WordNet hypernymy prediction, Flickr caption entailment, and a MovieLens-based market basket dataset. We show especially marked improvements in the case of sparse data, where many conditional probabilities should be low, and thus boxes should be nearly disjoint.

##### Bayesian Policy Optimization for Model Uncertainty

tl;dr We formulate model uncertainty in Reinforcement Learning as a continuous Bayes-Adaptive Markov Decision Process and present a method for practical and scalable Bayesian policy optimization.

Addressing uncertainty is critical for autonomous systems to robustly adapt to the real world. We formulate the problem of model uncertainty as a continuous Bayes-Adaptive Markov Decision Process (BAMDP), where an agent maintains a posterior distribution over the latent model parameters given a history of observations and maximizes its expected long-term reward with respect to this belief distribution. Our algorithm, Bayesian Policy Optimization, builds on recent policy optimization algorithms to learn a universal policy that navigates the exploration-exploitation trade-off to maximize the Bayesian value function. To address challenges from discretizing the continuous latent parameter space, we propose a policy network architecture that independently encodes the belief distribution from the observable state. Our method significantly outperforms algorithms that address model uncertainty without explicitly reasoning about belief distributions, and is competitive with state-of-the-art Partially Observable Markov Decision Process solvers.

##### Constraining Action Sequences with Formal Languages for Deep Reinforcement Learning

tl;dr We constrain an agent's actions during reinforcement learning, for safety or to enhance exploration.

We study the problem of deep reinforcement learning where the agent's action sequences are constrained, e.g., prohibition of dithering or overactuating action sequences that might damage a robot, drone, or other physical device. Our model focuses on constraints that can be described by automata such as DFAs or PDAs. We then propose multiple approaches to augment the state descriptions of the Markov decision process (MDP) with summaries of recent action histories. We empirically evaluate these methods applying DQN to three Atari games, training with reward shaping. We found that our approaches are effective in significantly reducing, and even eliminating, constraint violations while maintaining high reward. We also observed that the total reward achieved by an agent can be highly sensitive to how much the constraints encourage or discourage exploration of potentially effective actions during training, and, in addition to helping ensure safe policies, the use of constraints can enhance exploration during training.

##### ROBUST ESTIMATION VIA GENERATIVE ADVERSARIAL NETWORKS

tl;dr GANs are shown to provide us a new effective robust mean estimate against agnostic contaminations with both statistical optimality and practical tractability.

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GAN, we show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door of computing robust estimators using tools developed for training GANs. In particular, we show that a JS-GAN that uses a neural network discriminator with at least one hidden layer is able to achieve the minimax rate of robust mean estimation under Huber's $\epsilon$-contamination model. Interestingly, the hidden layers of the neural net structure in the discriminator class are shown to be necessary for robust estimation.

##### Feature Attribution As Feature Selection

No tl;dr =[

Feature attribution methods identify "relevant" features as an explanation of a complex machine learning model. Several feature attribution methods have been proposed; however, only a few studies have attempted to define the "relevance" of each feature mathematically. In this study, we formalize the feature attribution problem as a feature selection problem. In our proposed formalization, there arise two possible definitions of relevance. We name the feature attribution problems based on these two relevances as Exclusive Feature Selection (EFS) and Inclusive Feature Selection (IFS). We show that several existing feature attribution methods can be interpreted as approximation algorithms for EFS and IFS. Moreover, through exhaustive experiments, we show that IFS is better suited as the formalization for the feature attribution problem than EFS.