# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: graph autoencoder, dnc, overparameterization, langugage model, sets ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "Random selection" has 100 results

##### Optimal Completion Distillation for Sequence Learning

tl;dr Optimal Completion Distillation (OCD) is a training procedure for optimizing sequence to sequence models based on edit distance which achieves state-of-the-art on end-to-end Speech Recognition tasks.

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pre-training or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution which puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.8\%$ WER, respectively.

##### Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach

No tl;dr =[

We study the problem of attacking machine learning models in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., C&W or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only two current approaches are based on random walk on the boundary (Brendel et al., 2017) and random trials to evaluate the loss function (Ilyas et al., 2018), which require lots of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method (Nesterov & Spokoiny, 2017), we are able to bound the number of iterations needed for our algorithm to achieve stationary points under mild assumptions. We demonstrate that our proposed method outperforms the previous stochastic approaches to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).

##### ON RANDOM DEEP AUTOENCODERS: EXACT ASYMPTOTIC ANALYSIS, PHASE TRANSITIONS, AND IMPLICATIONS TO TRAINING

tl;dr We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena.

We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena when the depth becomes large. This, in particular, provides quantitative answers and insights to three questions that were yet fully understood in the literature. Firstly, we provide a precise answer on how the random deep weight-tied autoencoder model performs “approximate inference” as posed by Scellier et al. (2018), and its connection to reversibility considered by several theoretical studies. Secondly, we show that deep autoencoders display a higher degree of sensitivity to perturbations in the parameters, distinct from the shallow counterparts. Thirdly, we obtain insights on pitfalls in training initialization practice, and demonstrate experimentally that it is possible to train a deep autoencoder, even with the tanh activation and a depth as large as 200 layers, without resorting to techniques such as layer-wise pre-training or batch normalization. Our analysis is not specific to any depths or any Lipschitz activations, and our analytical techniques may have broader applicability.

##### Collapse of deep and narrow neural nets

tl;dr Deep and narrow neural networks will converge to erroneous mean or median states of the target function depending on the loss with high probability.

Recent theoretical work has demonstrated that deep neural networks have superior performance over shallow networks, but their training is more difficult, e.g., they suffer from the vanishing gradient problem. This problem can be typically resolved by the rectified linear unit (ReLU) activation. However, here we show that even for such activation, deep and narrow neural networks will converge to erroneous mean or median states of the target function depending on the loss with high probability. We demonstrate this collapse of deep and narrow neural networks both numerically and theoretically, and provide estimates of the probability of collapse. We also construct a diagram of a safe region of designing neural networks that avoid the collapse to erroneous states. Finally, we examine different ways of initialization and normalization that may avoid the collapse problem.

##### Visualizing and Understanding Generative Adversarial Networks

tl;dr GAN representations are examined in detail, and sets of representation units are found that control the generation of semantic concepts in the output.

Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications. As an active research topic, many GAN variants have emerged with immprovements in sample quality and training stability. However, visualization and understanding of GANs is largely missing. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to \concepts with a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. Finally, we examine the contextual relationship between these units and their surrounding by inserting the discovered \concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifacts'' units, to interactively manipulating objects in the scene. We will open source our interactive online tools to help researchers and practitioners better understand their models.

##### Fixing Posterior Collapse with delta-VAEs

tl;dr Avoid posterior collapse by lower bounding the rate.

Due to the phenomenon of “posterior collapse”, current latent variable generativemodels pose a challenging design choice which trades-off optimizing the ELBObut handicapping the decoders’ capacity and expressivity, or changing the loss tosomething that is not directly minimizing the description length. In this paper wepropose an alternative that utilizes the best, most powerful generative models asdecoders, whilst optimizing the proper variational lower bound all while ensuringthat the latent variables preserve and encode useful information. delta-VAEs pro-posed here achieve this by constraining the variational family for the posterior tohave a minimum distance to the prior, which resembles the classic representationlearning approach of slow feature analysis. We demonstrate the efficacy of our ap-proach at modeling images: learning representations, improving sample quality,and improving state of the art log-likelihood on CIFAR-10 and ImageNet32×32.

##### Fluctuation-dissipation relations for stochastic gradient descent

tl;dr We prove fluctuation-dissipation relations for SGD, which can be used to (i) adaptively set learning rates and (ii) probe loss surfaces.

The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set training schedule. We can further use the relations to efficiently extract information pertaining to a loss-function landscape such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.

##### Where and when to look? Spatial-temporal attention for action recognition in videos

No tl;dr =[

Inspired by the observation that humans are able to process videos efficiently by only paying attention when and where it is needed, we propose a novel spatial-temporal attention mechanism for video-based action recognition. For spatial attention, we learn a saliency mask to allow the model to focus on the most salient parts of the feature maps. For temporal attention, we employ a soft temporal attention mechanism to identify the most relevant frames from an input video. Further, we propose a set of regularizers that ensure that our attention mechanism attends to coherent regions in space and time. Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally. We demonstrate the efficacy of our approach on three public video action recognition datasets. The proposed approach leads to state-of-the-art performance on all of them, including the new large-scale Moments in Time dataset. Furthermore, we quantitatively and qualitatively evaluate our model's ability to accurately localize discriminative regions spatially and critical frames temporally. This is despite our model only being trained with per video classification labels.

##### Online Learning for Supervised Dimension Reduction

tl;dr We proposed two new approaches, the incremental sliced inverse regression and incremental overlapping sliced inverse regression, to implement supervised dimension reduction in an online learning manner.

Online learning has attracted great attention due to the increasing demand for systems that have the ability of learning and evolving. When the data to be processed is also high dimensional and dimension reduction is necessary for visualization or prediction enhancement, online dimension reduction will play an essential role. The purpose of this paper is to propose new online learning approaches for supervised dimension reduction. Our first algorithm is motivated by adapting the sliced inverse regression (SIR), a pioneer and effective algorithm for supervised dimension reduction, and making it implementable in an incremental manner. The new algorithm, called incremental sliced inverse regression (ISIR), is able to update the subspace of significant factors with intrinsic lower dimensionality fast and efficiently when new observations come in. We also refine the algorithm by using an overlapping technique and develop an incremental overlapping sliced inverse regression (IOSIR) algorithm. We verify the effectiveness and efficiency of both algorithms by simulations and real data applications.

##### Unseen Action Recognition with Multimodal Learning

No tl;dr =[

In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos. We compare the effect of placing various constraints on the embedding space using paired text and video data. Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such shared embedding space benefits three difficult tasks (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.

##### Learning Joint Wasserstein Auto-Encoders for Joint Distribution Matching

tl;dr Learning Joint Wasserstein Auto-Encoders for Joint Distribution Matching

We study the joint distribution matching problem which aims at learning bidirectional mappings to match the joint distribution of two domains. This problem occurs in unsupervised image-to-image translation and video-to-video synthesis tasks, which, however, has two critical challenges: (i) it is difficult to exploit sufficient information from the joint distribution; (ii) how to theoretically and experimentally evaluate the generalization performance remains an open question. To address the above challenges, we propose a new optimization problem and design a novel Joint Wasserstein Auto-Encoders (JWAE) to minimize the Wasserstein distance of the joint distributions in two domains. We theoretically prove that the generalization ability of the proposed method can be guaranteed by minimizing the Wasserstein distance of joint distributions. To verify the generalization ability, we apply our method to unsupervised video-to-video synthesis by performing video frame interpolation and producing visually smooth videos in two domains, simultaneously. Both qualitative and quantitative comparisons demonstrate the superiority of our method over several state-of-the-arts.

##### Learning Information Propagation in the Dynamical Systems via Information Bottleneck Hierarchy

tl;dr Compact perception of dynamical process

Extracting relevant information, causally inferring and predicting the future states with high accuracy is a crucial task for modeling complex systems. On one hand, the aforementioned challenges require a rigorous mathematical analysis capable of dealing with high-dimensional heterogeneous data streams. On the other hand, this rigorous mathematical framework allows us to learn precise, compact information and be able to coherently propagate measures of uncertainty across spatio-temporal states. To efficiently and rigorously process high-dimensional data coming from complex systems, we advocate for an information theory inspired approach that incorporates stochastic calculus and seeks to determine a trade-off between the predictive accuracy and compactness of the mathematical representation. Mathematically, such a model construction is cast as an optimization problem that maximizes the compression such that the predictive ability and correlation (relatedness) constraints between the original data and compact model are closely bounded. To learn this compact representation of a time-varying complex system and solve the above-mentioned optimization problem we use variational calculus and derive its general solution expressions. Moreover, we provide theoretical guarantees concerning the convergence of the proposed algorithm. To further test the proposed framework, we consider a high-dimensional Gaussian case study and describe an iterative scheme for updating the new model parameters. Using numerical experiments, we demonstrate the benefits on compression and prediction accuracy for a class of dynamical systems. Finally, we apply the proposed algorithm to the real-world dataset of multimodal sentiment intensity and show improvements in prediction with reduced dimensions.

##### Large Scale Graph Learning From Smooth Signals

No tl;dr =[

Graphs are a prevalent tool in data science, as they model the inherent structure of the data. Typically they are constructed either by connecting nearest samples, or by learning them from data, solving an optimization problem. While graph learning does achieve a better quality, it also comes with a higher computational cost. In particular, the current state-of-the-art model cost is O(n^2) for n samples. In this paper, we show how to scale it, obtaining an approximation with leading cost of O(n log(n)), with quality that approaches the exact graph learning model. Our algorithm uses known approximate nearest neighbor techniques to reduce the number of variables, and automatically selects the correct parameters of the model, requiring a single intuitive input: the desired edge density.

##### Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension

No tl;dr =[

We propose a neural machine-reading model that constructs dynamic knowledge graphs from procedural text. It builds these graphs recurrently for each step of the described procedure, and uses them to track the evolving states of participant entities. We harness and extend a recently proposed machine reading comprehension(MRC) model to query for entity states, since these states are generally communicated in spans of text and MRC models perform well in extracting entity-centric spans. The explicit, structured, and evolving knowledge graph representations that our model constructs can be used in downstream question answering tasks to improve machine comprehension of text, as we demonstrate empirically. On two comprehension tasks from the recently proposed ProPara dataset, our model achieves state-of-the-art results. We further show that our model is competitive on the Recipes dataset, suggesting it may be generally applicable.

##### Exemplar Guided Unsupervised Image-to-Image Translation with Semantic Consistency

tl;dr We propose the Exemplar Guided & Semantically Consistent Image-to-image Translation (EGSC-IT) network which conditions the translation process on an exemplar image in the target domain.

Image-to-image translation has recently received significant attention due to advances in deep learning. Most works focus on learning either a one-to-one mapping in an unsupervised way or a many-to-many mapping in a supervised way. However, a more practical setting is many-to-many mapping in an unsupervised way, which is harder due to the lack of supervision and the complex inner- and cross-domain variations. To alleviate these issues, we propose the Exemplar Guided & Semantically Consistent Image-to-image Translation (EGSC-IT) network which conditions the translation process on an exemplar image in the target domain. We assume that an image comprises of a content component which is shared across domains, and a style component specific to each domain. Under the guidance of an exemplar from the target domain we apply Adaptive Instance Normalization to the shared content component, which allows us to transfer the style information of the target domain to the source domain. To avoid semantic inconsistencies during translation that naturally appear due to the large inner- and cross-domain variations, we introduce the concept of feature masks that provide coarse semantic guidance without requiring the use of any semantic labels. Experimental results on various datasets show that EGSC-IT does not only translate the source image to diverse instances in the target domain, but also preserves the semantic consistency during the process.

##### Neural Probabilistic Motor Primitives for Humanoid Control

tl;dr Neural Probabilistic Motor Primitives compress motion capture tracking policies into one flexible model capable of one-shot imitation and reuse as a low-level controller.

Transferring functional properties from one or multiple expert policies to a student policy is an important challenge in control. Expert robustness is of particular interest; we would like to not only transfer the expert behavior but also its ability to recover from perturbations. With this in mind, we explore approaches for policy cloning and propose linear feedback policy cloning as a simple option for certain settings. We show that it can be surprisingly straightforward to clone ex-pert policies for seemingly complex behaviors without the student requiring any environment interactions. We then propose a latent-variable architecture that bottlenecks a sensory-motor primitive space, which, again, can be trained entirely offline to compress thousands of expert policies. We show this resulting neural probabilistic motor primitive system produces robust one-shot imitation of whole-body humanoid behaviors. In addition, we analyze the resulting latent space and demonstrate the ability to reuse this system. We encourage readers to view a supplementary video (https://youtu.be/44tPXdUCc-g ) summarizing our results.

##### Synthnet: Learning synthesizers end-to-end

tl;dr A convolutional autoregressive generative model that generates high fidelity audio, behchmarked on music

Learning synthesizers and generating music in the raw audio domain is a challenging task. We investigate the learned representations of convolutional autoregressive generative models. Consequently, we show that mappings between musical notes and the harmonic style (instrument timbre) can be learned based on the raw audio music recording and the musical score (in binary piano roll format). Our proposed architecture, SynthNet uses minimal training data (9 minutes), is substantially better in quality and converges 6 times faster than the baselines. The quality of the generated waveforms (generation accuracy) is sufficiently high that they are almost identical to the ground truth. Therefore, we are able to directly measure generation error during training, based on the RMSE of the Constant-Q transform. Mean opinion scores are also provided. We validate our work using 7 distinct harmonic styles and also provide visualizations and links to all generated audio.

##### DHER: Hindsight Experience Replay for Dynamic Goals

No tl;dr =[

Dealing with sparse rewards is one of the most important challenges in reinforcement learning (RL), especially when a goal is dynamic (e.g., to grasp a moving object). Hindsight experience replay (HER) has been shown an effective solution to handling sparse rewards with fixed goals. However, it does not account for dynamic goals in its vanilla form and, as a result, even degrades the performance of existing off-policy RL algorithms when the goal is changing over time. In this paper, we present Dynamic Hindsight Experience Replay (DHER), a novel approach for tasks with dynamic goals and sparse rewards. DHER automatically assembles successful experiences from two relevant failures and learns a reliable policy to achieve the dynamic goals. We evaluate DHER on tasks of robotic manipulation and moving object tracking, and transfer the polices from simulation to physical robots. Extensive comparison and ablation studies demonstrate the superiority of our approach, showing that DHER is a crucial ingredient to enable RL to solve tasks with dynamic goals.

##### Large-Scale Study of Curiosity-Driven Learning

tl;dr An agent trained only with curiosity, and no extrinsic reward, does surprisingly well on 54 popular environments, including the suite of Atari games, Mario etc.

Reinforcement learning algorithms rely on carefully engineered rewards from the environment that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is difficult and not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is such intrinsic reward function which uses prediction error as a reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. {\em without any extrinsic rewards}, across $54$ standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance as well as a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many games. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at https://doubleblindsupplementary.github.io/large-curiosity/.

##### Dynamic Pricing on E-commerce Platform with Deep Reinforcement Learning

tl;dr This paper describes a methodology for pre-training, evaluating and online dynamic pricing on E-commerce platform using deep reinforcement learning.

Dynamic pricing problem has been studied for decades and varieties of methodologies were developed under different assumptions. We developed an approach based on deep reinforcement learning (DRL) to address the dynamic pricing problem on an E-commerce platform with few assumptions. This paper first modeled dynamic pricing as a Markov Decision Process, defined various reward functions, with both discrete and continuous pricing action space. And then it introduced the methods to pre-train the model with the historical sales data. Offline evaluations and field experiments were designed and conducted to validate our approach.

##### Improving the Differentiable Neural Computer Through Memory Masking, De-allocation, and Link Distribution Sharpness Control

No tl;dr =[

The Differentiable Neural Computer (DNC) can learn algorithmic and question answering tasks. An analysis of its internal activation patterns reveals three problems: Most importantly, content based look-up results in flat and noisy address distributions, because the lack of key-value separation makes the DNC unable to ignore memory content which is not present in the key and need to be retrieved. Second, DNC's de-allocation of memory results in aliasing, which is a problem for content-based look-up. Thirdly, chaining memory reads with the temporal linkage matrix exponentially degrades the quality of the address distribution. Our proposed fixes of these problems yield improved performance on arithmetic tasks, and also improve the mean error rate on the bAbI question answering dataset by 43%.

##### q-Neurons: Neuron Activations based on Stochastic Jackson's Derivative Operators

tl;dr q-calculus helps build simple and scalable neural activation functions

We propose a new generic type of stochastic neurons, called $q$-neurons, that considers activation functions based on Jackson's $q$-derivatives, with stochastic parameters $q$. Our generalization of neural network architectures with $q$-neurons is shown to be both scalable and very easy to implement. We demonstrate experimentally consistently improved performances over state-of-the-art standard activation functions, both on training and testing loss functions.

##### Visual Imitation with a Minimal Adversary

tl;dr Imitation from pixels, with sparse or no reward, using off-policy RL and a tiny adversarially-learned reward function.

High-dimensional sparse reward tasks present major challenges for reinforcement learning agents. In this work we use imitation learning to address two of these challenges: how to learn a useful representation of the world e.g. from pixels, and how to explore efficiently given the rarity of a reward signal? We show that adversarial imitation can work well even in this high dimensional observation space. Surprisingly the adversary itself, acting as the learned reward function, can be tiny, comprising as few as 128 parameters, and can be easily trained using the most basic GAN formulation. Our approach removes limitations present in most contemporary imitation approaches: requiring no demonstrator actions (only video), no special initial conditions or warm starts, and no explicit tracking of any single demo. The proposed agent can solve a challenging robot manipulation task of block stacking from only video demonstrations and sparse reward, in which the non-imitating agents fail to learn completely. Furthermore, our agent learns much faster than competing approaches that depend on hand-crafted, staged dense reward functions, and also better compared to standard GAIL baselines. Finally, we develop a new adversarial goal recognizer that in some cases allows the agent to learn stacking without any task reward, purely from imitation.

##### The wisdom of the crowd: reliable deep reinforcement learning through ensembles of Q-functions

tl;dr Examined how a simple ensemble approach can tackle the biggest challenges in Q-learning.

Reinforcement learning agents learn by exploring the environment and then exploiting what they have learned. This frees the human trainers from having to know the preferred action or intrinsic value of each encountered state. The cost of this freedom is reinforcement learning is slower and more unstable than supervised learning. We explore the possibility that ensemble methods can remedy these shortcomings and do so by investigating a novel technique which harnesses the wisdom of the crowds by bagging Q-function approximator estimates. Our results show that this proposed approach improves all three tasks and reinforcement learning approaches attempted. We are able to demonstrate that this is a direct result of the increased stability of the action portion of the state-action-value function used by Q-learning to select actions and by policy gradient methods to train the policy.

##### Learning Corresponded Rationales for Text Matching

tl;dr We propose a novel self-explaining architecture to predict matches between two sequences of texts. Specifically, we introduce the notion of corresponded rationales and learn to extract them by the distal supervision from the downstream task.

The ability to predict matches between two sources of text has a number of applications including natural language inference (NLI) and question answering (QA). While flexible neural models have become effective tools in solving these tasks, they are rarely transparent in terms of the mechanism that mediates the prediction. In this paper, we propose a self-explaining architecture where the model is forced to highlight, in a dependent manner, how spans of one side of the input match corresponding segments of the other side in order to arrive at the overall decision. The text spans are regularized to be coherent and concise, and their correspondence is captured explicitly. The text spans -- rationales -- are learned entirely as latent mechanisms, guided only by the distal supervision from the end-to-end task. We evaluate our model on both NLI and QA using three publicly available datasets. Experimental results demonstrate quantitatively and qualitatively that our method delivers interpretable justification of the prediction without sacrificing state-of-the-art performance. Our code and data split will be publicly available.

##### Activity Regularization for Continual Learning

tl;dr This paper develops a novel regularization for continual learning

While deep neural networks have achieved remarkable successes, they suffer the well-known catastrophic forgetting issue when switching from existing tasks to tackle a new one. In this paper, we study continual learning with deep neural networks that learn from tasks arriving sequentially. We first propose an approximated multi-task learning framework that unifies a family of popular regularization based continual learning methods. We then analyze the weakness of existing approaches, and propose a novel regularization method named “Activity Regularization” (AR), which alleviates forgetting meanwhile keeping model’s plasticity to acquire new knowledge. Extensive experiments show that our method outperform state-of-the-art methods and effectively overcomes catastrophic forgetting.

##### Hybrid Policies Using Inverse Rewards for Reinforcement Learning

tl;dr A broad-spectrum improvement for reinforcement learning algorithms, which combines the policies using original rewards and inverse (negative) rewards

This paper puts forward a broad-spectrum improvement for reinforcement learning algorithms, which combines the policies using original rewards and inverse (negative) rewards. The policies using inverse rewards are competitive with the original policies, and help the original policies correct their mis-actions. We have proved the convergence of the inverse policies. The experiments for some games in OpenAI gym show that the hybrid polices based on deep Q-learning, double Q-learning, and on-policy actor-critic obtain the rewards up to 63.8%, 97.8%, and 54.7% more than the original algorithms. The improved polices are more stable than the original policies as well.

##### A Deep Learning Approach for Dynamic Survival Analysis with Competing Risks

No tl;dr =[

Currently available survival analysis methods are limited in their ability to deal with complex, heterogeneous, and longitudinal data such as that available in primary care records, or in their ability to deal with multiple competing risks. This paper develops a novel deep learning architecture that flexibly incorporates the available longitudinal data comprising various repeated measurements (rather than only the last available measurements) in order to issue dynamically updated survival predictions for one or multiple competing risk(s). Unlike existing works in the survival analysis on the basis of longitudinal data, the proposed method learns the time-to-event distributions without specifying underlying stochastic assumptions of the longitudinal or the time-to-event processes. Thus, our method is able to learn associations between the longitudinal data and the various associated risks in a fully data-driven fashion. We demonstrate the power of our method by applying it to real-world longitudinal datasets and show a drastic improvement over state-of-the-art methods in discriminative performance. Furthermore, our analysis of the variable importance and dynamic survival predictions will yield a better understanding of the predicted risks which will result in more effective health care.

##### Surprising Negative Results for Generative Adversarial Tree Search

tl;dr Surprising negative results on Model Based + Model deep RL

Although many recent advances in deep reinforcement learning consist of model- free methods, model-based approaches remain an alluring prospect owing to their potential to exploit unsupervised data to learn environment dynamics. Moreover, with new breakthroughs on image-to-image transduction, Pix2Pix GANs are a natural choice for learning to predict the dynamics of environments where ob- servations consist of images (like Atari games). Inspired by AlphaGo, which combines model-based and model-free RL, we propose generative adversarial tree search (GATS), simulating roll-outs with a learned GAN-based dynamics model and reward predictor. We theoretically prove some favorable properties of GATS vis-a-vis the bias-variance trade-off. The approach combines model-based planning via MCTS with model-free learning with DQNs. Empirically, on 5 popular Atari games, despite the dynamics and reward predictors converging quickly to accurate solutions GATS fails to outperform DQNs. We present a hypothesis for why tree search with short roll-outs can fail even given perfect modelling.

##### Neural Random Projections for Language Modelling

tl;dr Neural language models can be trained with a compressed embedding space, by using sparse random projections, created incrementally for each unique discrete input.

Neural network-based language models deal with data sparsity problems by mapping the large discrete space of words into a smaller continuous space of real-valued vectors. By learning distributed vector representations for words, each training sample informs the neural network model about a combinatorial number of other patterns. In this paper, we exploit the sparsity in natural language even further by encoding each unique input word using a fixed sparse random representation. These sparse codes are then projected onto a smaller embedding space which allows for the encoding of word occurrences from a possibly unknown vocabulary, along with the creation of more compact language models using a reduced number of parameters. We investigate the properties of our encoding mechanism empirically, by evaluating its performance on the widely used Penn Treebank corpus. We show that guaranteeing approximately equidistant vector representations for unique discrete inputs is enough to provide the neural network model with enough information to learn --and make use-- of distributed representations for these inputs.

##### Learning Graph Decomposition

No tl;dr =[

We propose a novel end-to-end trainable framework for the graph decomposition problem. The minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels of the initial graph and the hard constraints are introduced in the CRF as high-order potentials. The parameters of a standard Neural Network and the fully differentiable CRF can be optimized in an end-to-end manner. We demonstrate the proposed learning algorithm in the context of clustering of hand written digits, particularly in a setting where no direct supervision for the graph decomposition task is available, and multiple person pose estimation from images in the wild. The experiments validate the effectiveness of our approach both for the feature learning and for the final clustering task.

##### Playing the Game of Universal Adversarial Perturbations

tl;dr We propose a robustification method under the presence of universal adversarial perturbations, by connecting a game theoretic method (fictitious play) with the problem of robustification, and making it more scalable.

We study the problem of learning classifiers robust to universal adversarial perturbations. While prior work approaches this problem via robust optimization, adversarial training, or input transformation, we instead phrase it as a two-player zero-sum game. In this new formulation, both players simultaneously play the same game, where one player chooses a classifier that minimizes a classification loss whilst the other player creates an adversarial perturbation that increases the same loss when applied to every sample in the training set. By observing that performing a classification (respectively creating adversarial samples) is the best response to the other player, we propose a novel extension of a game-theoretic algorithm, namely fictitious play, to the domain of training robust classifiers. Finally, we empirically show the robustness and versatility of our approach in two defence scenarios where universal attacks are performed on several image classification datasets -- CIFAR10, CIFAR100 and ImageNet.

##### K For The Price Of 1: Parameter Efficient Multi-task And Transfer Learning

No tl;dr =[

We introduce a novel method that enables parameter-efficient transfer and multitask learning. The basic approach is to allow a model patch - a small set of parameters - to specialize to each task, instead of fine-tuning the last layer or the entire network. For instance, we show that learning a set of scales and biases allows a network to learn a completely different embedding that could be used for different tasks (such as converting an SSD detection model into a 1000-class classification model while reusing 98% of parameters of the feature extractor). Similarly, we show that re-learning the existing low-parameter layers (such as depth-wise convolutions) also improves accuracy significantly. Our approach allows both simultaneous (multi-task) learning as well as sequential transfer learning wherein we adapt pretrained networks to solve new problems. For multi-task learning, despite using much fewer parameters than traditional logits-only fine-tuning, we match single-task-based performance.

##### Locally Linear Unsupervised Feature Selection

tl;dr Unsupervised feature selection through capturing the local linear structure of the data

The paper, interested in unsupervised feature selection, aims to retain the features best accounting for the local patterns in the data. The proposed approach, called Locally Linear Unsupervised Feature Selection, relies on a dimensionality reduction method to characterize such patterns; each feature is thereafter assessed according to its compliance w.r.t. the local patterns, taking inspiration from Locally Linear Embedding (Roweis and Saul, 2000). The experimental validation of the approach on the scikit-feature benchmark suite demonstrates its effectiveness compared to the state of the art.

##### Understand the dynamics of GANs via Primal-Dual Optimization

tl;dr We show that, with a proper stepsize choice, the widely used first-order iterative algorithm in training GANs would in fact converge to a stationary solution with a sublinear rate.

Generative adversarial network (GAN) is one of the best known unsupervised learning techniques these days due to its superior ability to learn data distributions. In spite of its great success in applications, GAN is known to be notoriously hard to train. The tremendous amount of time it takes to run the training algorithm and its sensitivity to hyper-parameter tuning have been haunting researchers in this area. To resolve these issues, we need to first understand how GANs work. Herein, we take a step toward this direction by examining the dynamics of GANs. We relate a large class of GANs including the Wasserstein GANs to max-min optimization problems with the coupling term being linear over the discriminator. By developing new primal-dual optimization tools, we show that, with a proper stepsize choice, the widely used first-order iterative algorithm in training GANs would in fact converge to a stationary solution with a sublinear rate. The same framework also applies to multi-task learning and distributional robust learning problems. We verify our analysis on numerical examples with both synthetic and real data sets. We hope our analysis shed light on future studies on the theoretical properties of relevant machine learning problems.

##### Learning Grid-like Units with Vector Representation of Self-Position and Matrix Representation of Self-Motion

No tl;dr =[

This paper proposes a simple model for learning grid-like units for spatial awareness and navigation. In this model, the self-position of the agent is represented by a vector, and the self-motion of the agent is represented by a block-diagonal matrix. Each component of the vector is a unit (or a cell). The model consists of the following two sub-models. (1) Motion sub-model. The movement from the current position to the next position is modeled by matrix-vector multiplication, i.e., multiplying the matrix representation of the motion to the current vector representation of the position in order to obtain the vector representation of the next position. (2) Localization sub-model. The adjacency between any two positions is a monotone decreasing function of their Euclidean distance, and the adjacency is modeled by the inner product between the vector representations of the two positions. Both sub-models can be implemented by neural networks. The motion sub-model is a recurrent network with dynamic weight matrix, and the localization sub-model is a feedforward network. The model can be learned by minimizing a loss function that combines the loss functions of the two sub-models. The learned units exhibit grid-like patterns (as well as stripe patterns) in both 2D and 3D environments. The learned model can be used for path integral and path planning. Moreover, the learned representation is capable of error correction.

##### Hierarchically-Structured Variational Autoencoders for Long Text Generation

tl;dr Propose a hierarchically-structured variational autoencoder for generating long and coherent units of text

Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation. Existing methods primarily focus on synthesizing relatively short sentences (with less than twenty words). In this paper, we propose a novel framework, hierarchically-structured variational autoencoder (hier-VAE), for generating long and coherent units of text. To enhance the model’s plan-ahead ability, intermediate sentence representations are introduced into the generative networks to guide the word-level predictions. To alleviate the typical optimization challenges associated with textual VAEs, we further employ a hierarchy of stochastic layers between the encoder and decoder networks. Extensive experiments are conducted to evaluate the proposed method, where hier-VAE is shown to make effective use of the latent codes and achieve lower perplexity relative to language models. Moreover, the generated samples from hier-VAE also exhibit superior quality according to both automatic and human evaluations.

##### Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures

tl;dr We show that rare but catastrophic failures may be missed entirely by random testing, which poses issues for safe deployment. Our proposed approach for adversarial testing fixes this.

This paper addresses the problem of evaluating learning systems in safety critical domains such as autonomous driving, where failures can have catastrophic consequences. To this end, we focus on two problems: searching for scenarios when learned agents fail and the related problem of assessing their probability of failure. The standard method for agent evaluation in reinforcement learning, Vanilla Monte Carlo, can severely underestimate agent failure probabilities, leading to the deployment of unsafe agents. In our experiments, we observe this even after allocating equal compute to training and evaluation. To address this shortcoming, we draw upon the rare event probability estimation literature and propose an adversarial evaluation approach. Our approach focuses evaluation on difficult scenarios that are selected adversarially, while still providing unbiased estimates of failure probabilities. To do this, we propose a continuation approach to learning a failure probability predictor. This leverages data from related agents to overcome issues of data sparsity and allows the adversary to reuse data gathered for training the agent. We demonstrate the efficacy of adversarial evaluation on two complex reinforcement learning domains (humanoid control and simulated driving). Experimental results show that our methods can find catastrophic failures and estimate failures rates of agents multiple orders of magnitude faster (hours instead of days) than standard evaluation schemes.

##### BIGSAGE: unsupervised inductive representation learning of graph via bi-attended sampling and global-biased aggregating

tl;dr For unsupervised and inductive network embedding, we propose a novel approach to explore most relevant neighbors and preserve previously learnt knowledge of nodes by utilizing bi-attention architecture and introducing global bias, respectively

Different kinds of representation learning techniques on graph have shown significant effect in downstream machine learning tasks. Recently, in order to inductively learn representations for graph structures that is unobservable during training, a general framework with sampling and aggregating (GraphSAGE) was proposed by Hamilton and Ying and had been proved more efficient than transductive methods on fileds like transfer learning or evolving dataset. However, GraphSAGE is uncapable of selective neighbor sampling and lack of memory of known nodes that've been trained. To address these problems, we present an unsupervised method that samples neighborhood information attended by co-occurring structures and optimizes a trainable global bias as a representation expectation for each node in the given graph. Experiments show that our approach outperforms the state-of-the-art inductive and unsupervised methods for representation learning on graphs.

##### What Information Does a ResNet Compress?

tl;dr The Information Bottleneck Principle applied to ResNets, using PixelCNN++ models to decode mutual information and conditionally generate images for information illustration

The information bottleneck principle (Shwartz-Ziv & Tishby, 2017) suggests that SGD-based training of deep neural networks results in optimally compressed hidden layers, from an information theoretic perspective. However, this claim was established on toy data. The goal of the work we present here is to test these claims in a realistic setting using a larger and deeper convolutional architecture, a ResNet model. We trained PixelCNN++ models as inverse representation decoders to measure the mutual information between hidden layers of a ResNet and input image data, when trained for (1) classification and (2) autoencoding. We find that two stages of learning happen for both training regimes, and that compression does occur, even for an autoencoder. Sampling images by conditioning on hidden layers’ activations offers an intuitive visualisation to understand what a ResNets learns to forget.

##### iRDA Method for Sparse Convolutional Neural Networks

tl;dr A sparse optimization algorithm for deep CNN models.

We propose a new approach, known as the iterative regularized dual averaging (iRDA), to improve the efficiency of convolutional neural networks (CNN) by significantly reducing the redundancy of the model without reducing its accuracy. The method has been tested for various data sets, and proven to be significantly more efficient than most existing compressing techniques in the deep learning literature. For many popular data sets such as MNIST and CIFAR-10, more than 95% of the weights can be zeroed out without losing accuracy. In particular, we are able to make ResNet18 with 95% sparsity to have an accuracy that is comparable to that of a much larger model ResNet50 with the best 60% sparsity as reported in the literature.

##### Why Do Neural Response Generation Models Prefer Universal Replies?

tl;dr Analyze the reason for neural response generative models preferring universal replies; Propose a method to avoid it.

Recent advances in neural Sequence-to-Sequence (Seq2Seq) models reveal a purely data-driven approach to the response generation task. Despite its diverse variants and applications, the existing Seq2Seq models are prone to producing short and generic replies, which blocks such neural network architectures from being utilized in practical open-domain response generation tasks. In this research, we analyze this critical issue from the perspective of the optimization goal of models and the specific characteristics of human-to-human conversational corpora. Our analysis is conducted by decomposing the goal of Neural Response Generation (NRG) into the optimizations of word selection and ordering. It can be derived from the decomposing that Seq2Seq based NRG models naturally tend to select common words to compose responses, and ignore the semantic of queries in word ordering. On the basis of the analysis, we propose a max-marginal ranking regularization term to avoid Seq2Seq models from producing the generic and uninformative responses. The empirical experiments on benchmarks with several metrics have validated our analysis and proposed methodology.

tl;dr We propose a new attribution method that removes noise from saliency maps through layer-wise thresholding during backpropagation.

Saliency map, or the gradient of the score function with respect to the input, is the most basic means of interpreting deep neural network decisions. However, saliency maps are often visually noisy. Although several hypotheses were proposed to account for this phenomenon, there is no work that provides a rigorous analysis of noisy saliency maps. This may be a problem as numerous advanced attribution methods were proposed under the assumption that the existing hypotheses are true. In this paper, we identify the cause of noisy saliency maps. Then, we propose Rectified Gradient, a simple method that significantly improves saliency maps by alleviating that cause. Experiments showed effectiveness of our method and its superiority to other attribution methods. Codes and examples for the experiments will be released in public.

##### Improving Sentence Representations with Multi-view Frameworks

tl;dr Multi-view learning improves unsupervised sentence representation learning

Multi-view learning can provide self-supervision when different views are available of the same data. Distributional hypothesis provides another form of useful self-supervision from adjacent sentences which are plentiful in large unlabelled corpora. Motivated by the asymmetry in the two hemispheres of the human brain as well as the observation that different learning architectures tend to emphasise different aspects of sentence meaning, we present two multi-view frameworks for learning sentence representations in an unsupervised fashion. One framework uses a generative objective and the other a discriminative one. In both frameworks, the final representation is an ensemble of two views, in which, one view encodes the input sentence with a Recurrent Neural Network (RNN), and the other view encodes it with a simple linear model. We show that, after learning, the vectors produced by our multi-view frameworks provide improved representations over their single-view learned counterparts, and the combination of different views gives representational improvement over each view and demonstrates solid transferability on standard downstream tasks.

##### Few-Shot Intent Inference via Meta-Inverse Reinforcement Learning

tl;dr The applicability of inverse reinforcement learning is often hampered by the expense of collecting expert demonstrations; this paper seeks to broaden its applicability by incorporating prior task information through meta-learning.

A significant challenge for the practical application of reinforcement learning toreal world problems is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert behavior. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a "prior" that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior.

##### A Main/Subsidiary Network Framework for Simplifying Binary Neural Networks

tl;dr we define the filter-level pruning problem for binary neural networks for the first time and propose method to solve it.

To reduce memory footprint and run-time latency, techniques such as neural net-work pruning and binarization have been explored separately. However, it is un-clear how to combine the best of the two worlds to get extremely small and efficient models. In this paper, we, for the first time, define the filter-level pruning problem for binary neural networks, which cannot be solved by simply migrating existing structural pruning methods for full-precision models. A novel learning-based approach is proposed to prune filters in our main/subsidiary network frame-work, where the main network is responsible for learning representative features to optimize the prediction performance, and the subsidiary component works as a filter selector on the main network. To avoid gradient mismatch when training the subsidiary component, we propose a layer-wise and bottom-up scheme. We also provide the theoretical and experimental comparison between our learning-based and greedy rule-based methods. Finally, we empirically demonstrate the effectiveness of our approach applied on several binary models, including binarizedNIN, VGG-11, and ResNet-18, on various image classification datasets. For bi-nary ResNet-18 on ImageNet, we use 78.6% filters but can achieve slightly better test error 49.87% (50.02%-0.15%) than the original model

##### Music Transformer

tl;dr We show the first successful use of Transformer in generating music that exhibits long-term structure.

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity is quadratic in the sequence length. We propose an algorithm that reduces the intermediate memory requirements to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long (thousands of steps) compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-competition, and obtain state-of-the-art results on the latter.

##### A Forensic Representation to Detect Non-Trivial Image Duplicates, and How it Applies to Semantic Segmentation

tl;dr A forensic metric to determine if a given image is a copy (with possible manipulation) of another image from a given dataset.

Manipulation and re-use of images in scientific publications is a recurring problem, at present lacking a scalable solution. Existing tools for detecting image duplication are mostly manual or semi-automated, despite the fact that generating data for a learning-based approach is straightforward, as we here illustrate. This paper addresses the problem of determining if, given two images, one is a manipulated version of the other by means of certain geometric and statistical manipulations, e.g. copy, rotation, translation, scale, perspective transform, histogram adjustment, partial erasing, and compression artifacts. We propose a solution based on a 3-branch Siamese Convolutional Neural Network. The ConvNet model is trained to map images into a 128-dimensional space, where the Euclidean distance between duplicate (respectively, unique) images is no greater (respectively, greater) than 1. Our results suggest that such an approach can serve as tool to improve surveillance of the published and in-peer-review literature for image manipulation. We also show that as a byproduct the network learns useful representations for semantic segmentation, with performance comparable to that of domain-specific models.

##### RETHINKING SELF-DRIVING : MULTI -TASK KNOWLEDGE FOR BETTER GENERALIZATION AND ACCIDENT EXPLANATION ABILITY

tl;dr we proposed a new self-driving model which is composed of perception module for see and think and driving module for behave to acquire better generalization and accident explanation ability.

Current end-to-end deep learning driving models have two problems: (1) Poor generalization ability of unobserved driving environment when diversity of train- ing driving dataset is limited (2) Lack of accident explanation ability when driving models don’t work as expected. To tackle these two problems, rooted on the be- lieve that knowledge of associated easy task is benificial for addressing difficult task, we proposed a new driving model which is composed of perception module for see and think and driving module for behave, and trained it with multi-task perception-related basic knowledge and driving knowledge stepwisely. Specifi- cally segmentation map and depth map (pixel level understanding of images) were considered as what & where and how far knowledge for tackling easier driving- related perception problems before generating final control commands for difficult driving task. The results of experiments demonstrated the effectiveness of multi- task perception knowledge for better generalization and accident explanation abil- ity. With our method the average sucess rate of finishing most difficult navigation tasks in untrained city of CoRL test surpassed current benchmark method for 15 percent in trained weather and 20 percent in untrained weathers.

##### Subgradient Descent Learns Orthogonal Dictionaries

tl;dr Efficient dictionary learning by L1 minimization via a novel analysis of the non-convex non-smooth geometry.

This paper concerns dictionary learning, viz., sparse coding, a fundamental representation learning problem. We show that a subgradient descent algorithm, with random initialization, can recover orthogonal dictionaries on a natural nonsmooth, nonconvex L1 minimization formulation of the problem, under mild statistical assumption on the data. This is in contrast to previous provable methods that require either expensive computation or delicate initialization schemes. Our analysis develops several tools for characterizing landscapes of nonsmooth functions, which might be of independent interest for provable training of deep networks with nonsmooth activations (e.g., ReLU), among other applications. Preliminary experiments corroborate our analysis and show that our algorithm works well empirically in recovering orthogonal dictionaries.

##### Optimistic Acceleration for Optimization

tl;dr We consider new variants of optimization algorithms for training deep nets.

We consider new variants of optimization algorithms. Our algorithms are based on the observation that mini-batch of stochastic gradients in consecutive iterations do not change drastically and consequently may be predictable. Inspired by the similar setting in online learning literature called Optimistic Online learning, we propose two new algorithms, Optimistic-AMSGrad and Optimistic-Adam that exploit the predictability of gradients. Optimistic-AMSGrad and Optimistic-Adam combine the idea of momentum method, adaptive gradient method, and algorithms in Optimistic Online learning, which leads to speed up in training deep neural nets in practice.

##### Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep Learning

tl;dr One of theoretical issues in deep learning

Although stochastic gradient descent (SGD) is a driving force behind the recent success of deep learning, our understanding of its dynamics in a high-dimensional parameter space is limited. In recent years, some researchers have used the stochasticity of minibatch gradients, or the signal-to-noise ratio, to better characterize the learning dynamics of SGD. Inspired from these work, we here analyze SGD from a geometrical perspective by inspecting the stochasticity of the norms and directions of minibatch gradients. We propose a model of the directional concentration for minibatch gradients through von Mises-Fisher (VMF) distribution, and show that the directional uniformity of minibatch gradients increases over the course of SGD. We empirically verify our result using deep convolutional networks and observe a higher correlation between the gradient stochasticity and the proposed directional uniformity than that against the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients is a major factor behind SGD.

##### EFFICIENT SEQUENCE LABELING WITH ACTOR-CRITIC TRAINING

No tl;dr =[

Neural approaches to sequence labeling often use a Conditional Random Field (CRF) to model their output dependencies, while Recurrent Neural Networks (RNN) are used for the same purpose in other tasks. We set out to establish RNNs as an attractive alternative to CRFs for sequence labeling. To do so, we address one of the RNN’s most prominent shortcomings, the fact that it is not exposed to its own errors with the maximum-likelihood training. We frame the prediction of the output sequence as a sequential decision-making process, where we train the network with an adjusted actor-critic algorithm (AC-RNN). We comprehensively compare this strategy with maximum-likelihood training for both RNNs and CRFs on three structured-output tasks. The proposed AC-RNN efficiently matches the performance of the CRF on NER and CCG tagging, and outperforms it on Machine Transliteration. We also show that our training strategy is significantly better than other techniques for addressing RNN’s exposure bias, such as Scheduled Sampling, and Self-Critical policy training.

##### Boosting Trust Region Policy Optimization by Normalizing flows Policy

tl;dr Normalizing flows policy to improve TRPO and ACKTR

We propose to improve trust region policy search with normalizing flows policy. We illustrate that when the trust region is constructed by KL divergence constraint, normalizing flows policy can generate samples far from the 'center' of the previous policy iterate, which potentially enables better exploration and helps avoid bad local optima. We show that normalizing flows policy significantly improves upon factorized Gaussian policy baseline, with both TRPO and ACKTR, especially on tasks with complex dynamics such as Humanoid.

##### A MAX-AFFINE SPLINE PERSPECTIVE OF RECURRENT NEURAL NETWORKS

tl;dr We provide new insights and interpretations of RNNs from a max-affine spline operators perspective

We develop a framework for understanding and improving recurrent neural net-works (RNNs) using max-affine spline operators (MASO). We prove that RNNsusing piecewise affine and convex nonlinearities can be written as a simple piece-wise affine spline operator. The resulting representation provides several new per-spectives for analyzing RNNs, three of which we study in this paper. First, weshow that an RNN internally partitions the input space during training using vec-tor quantization and that it builds up the partition through time. Second, we showthat the affine parameter of an RNN corresponds to an input-specific template,from which we can interpret an RNN as performing a simple template matching(matched filtering) given the input. Third, by closely examining the MASO RNN formula, we prove that injecting Gaussian noise in the initial hidden state in RNNs corresponds to an explicit L2regularization on the affine parameters, which links to exploding gradient issues and improves generalization. Extensive experimentson several datasets of various modalities demonstrates and validates each of theabove analyses. In particular, using initial hidden states elevates simple RNNs tostate-of-the-art performance on these datasets

##### Adversarial Imitation via Variational Inverse Reinforcement Learning

tl;dr Our proposed method builds on GANs and exploits potential-based reward shaping to learn near-optimal rewards and policies from expert demonstrations.

We consider a problem of learning a reward and policy from expert examples under unknown dynamics in high-dimensional scenarios. Our proposed method builds on the framework of generative adversarial networks and exploits reward shaping to learn near-optimal rewards and policies. Potential-based reward shaping functions are known to guide the learning agent whereas in this paper we bring forward their benefits in learning near-optimal rewards. Our method simultaneously learns a potential-based reward shaping function through variational information maximization along with the reward and policy under the adversarial learning formulation. We evaluate our method on various high-dimensional complex control tasks. We also evaluate our learned rewards in transfer learning problems where training and testing environments are made to be different from each other in terms of dynamics or structure. Our experimentation shows that our proposed method not only learns near-optimal rewards and policies matching expert behavior, but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.

##### Deep Reinforcement Learning of Universal Policies with Diverse Environment Summaries

tl;dr As an alternative to domain randomization, we summarize simulator configurations to ensure that the policy is trained on a diverse set of induced state-trajectories.

Deep reinforcement learning has enabled robots to complete complex tasks in simulation. However, the resulting policies do not transfer to real robots due to model errors in the simulator. One solution is to randomize the simulation environment, so that the resulting, trained policy achieves high performance in expectation over a variety of configurations that could represent the real-world. However, the distribution over simulator configurations must be carefully selected to represent the relevant dynamic modes of the system, as otherwise it can be unlikely to sample challenging configurations frequently enough. Moreover, the ideal distribution to improve the policy changes as the policy (un)learns to solve tasks in certain configurations. In this paper, we propose to use an inexpensive, kernel-based summarization method method that identifies configurations that lead to diverse behaviors. Since failure modes for the given task are naturally diverse, the policy trains on a mixture of representative and challenging configurations, which leads to more robust policies. In experiments, we show that the proposed method achieves the same performance as domain randomization in simple cases, but performs better when domain randomization does not lead to diverse dynamic modes.

##### Learning to Progressively Plan

No tl;dr =[

For problem solving, making reactive decisions based on problem description is fast but inaccurate, while search-based planning using heuristics gives better solutions but could be exponentially slow. In this paper, we propose a new approach that improves an existing solution by iteratively picking and rewriting its local components until convergence. The rewriting policy employs a neural network trained with reinforcement learning. We evaluate our approach in two domains: job scheduling and expression simplification. Compared to common effective heuristics, baseline deep models and search algorithms, our approach efficiently gives solutions with higher quality.

##### TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

tl;dr We present the TimbreTron, a pipeline for perfoming high-quality timbre transfer on musicalwaveforms using CQT-domain style transfer.

In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, an audio processing pipeline which combines three powerful ideas from different domains: Constant Q Transform (CQT) spectrogram for audio representation, a variant of CycleGAN for timbre transfer and WaveNet-Synthesizer for high quality audio generation. We verified that CQT TimbreTron in principle and in practice is more suitable than its STFT counterpart, even though STFT is more commonly used for audio representation. Based on human perceptual evaluations, we confirmed that timbre was transferred recognizably while the musical content was preserved by TimbreTron.

##### Gradient Descent Provably Optimizes Over-parameterized Neural Networks

tl;dr We prove gradient descent achieves zero training loss with a linear rate on over-parameterized neural networks.

One of the mystery in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and the data is non-degenerate, randomly initialized gradient descent converges a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis is based on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

##### Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

tl;dr We propose a cross-lingual document classification framework for related language pairs.

Text classification must sometimes be applied in situations with no training data in a target language. However, training data may be available in a related language. We introduce a cross-lingual document classification framework CACO between related language pairs. To best use limited training data, our transfer learning scheme exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. CACO models trained under low-resource settings rival cross-lingual word embedding models trained under high-resource settings on related language pairs.

##### A Walk with SGD: How SGD Explores Regions of Deep Network Loss?

No tl;dr =[

The non-convex nature of the loss landscape of deep neural networks (DNN) lends them the intuition that over the course of training, stochastic optimization algorithms explore different regions of the loss surface by entering and escaping many local minima due to the noise induced by mini-batches. But is this really the case? This question couples the geometry of the DNN loss landscape with how stochastic optimization algorithms like SGD interact with it during training. Answering this question may help us qualitatively understand the dynamics of deep neural network optimization. We show evidence through qualitative and quantitative experiments that mini-batch SGD rarely crosses barriers during DNN optimization. As we show, the mini-batch induced noise helps SGD explore different regions of the loss surface using a seemingly different mechanism. To complement this finding, we also investigate the qualitative reason behind the slowing down of this exploration when using larger batch-sizes. We show this happens because gradients from larger batch-sizes align more with the top eigenvectors of the Hessian, which makes SGD oscillate in the proximity of the parameter initialization, thus preventing exploration.

##### AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference

tl;dr Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation.

Wide adoption of complex RNN based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge distillation based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.

##### Mimicking actions is a good strategy for beginners: Fast Reinforcement Learning with Expert Action Sequences

tl;dr Appending most frequent action pairs from an expert player to a novice RL agent's action space improves the scores by huge margin.

Imitation Learning is the task of mimicking the behavior of an expert player in a Reinforcement Learning(RL) Environment to enhance the training of a fresh agent (called novice) beginning from scratch. Most of the Reinforcement Learning environments are stochastic in nature, i.e., the state sequences that an agent may encounter usually follow a Markov Decision Process (MDP). This makes the task of mimicking difficult as it is very unlikely that a new agent may encounter same or similar state sequences as an expert. Prior research in Imitation Learning proposes various ways to learn a mapping between the states encountered and the respective actions taken by the expert while mostly being agnostic to the order in which these were performed. Most of these methods need considerable number of states-action pairs to achieve good results. We propose a simple alternative to Imitation Learning by appending the novice’s action space with the frequent short action sequences that the expert has taken. This simple modification, surprisingly improves the exploration and significantly outperforms alternative approaches like Dataset Aggregation. We experiment with several popular Atari games and show significant and consistent growth in the score that the new agents achieve using just a few expert action sequences.

##### Discovering General-Purpose Active Learning Strategies

No tl;dr =[

We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely reflects the AL objective of minimizing the annotation cost We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines.

##### Towards Decomposed Linguistic Representation with Holographic Reduced Representation

tl;dr Holographic Reduced Representation enables language model to discover linguistic roles.

The vast majority of neural models in Natural Language Processing adopt a form of structureless distributed representations. While these models are powerful at making predictions, the representational form is rather crude and does not provide insights into linguistic structures. In this paper we introduce novel language models with representations informed by the framework of Holographic Reduced Representation (HRR). This allows us to inject structures directly into our word-level and chunk-level representations. Our analyses show that by using HRR as a structured compositional representation, our models are able to discover crude linguistic roles, which roughly resembles a classic division between syntax and semantics.

##### On the Convergence and Robustness of Batch Normalization

tl;dr We mathematically analyze the effect of batch normalization on a simple model and obtain key new insights that applies to general supervised learning.

Despite its empirical success, the theoretical underpinnings of the stability, convergence and acceleration properties of batch normalization (BN) remain elusive. In this paper, we attack this problem from a modelling approach, where we perform thorough theoretical analysis on BN applied to simplified model: ordinary least squares (OLS). We discover that gradient descent on OLS with BN has interesting properties, including a scaling law, convergence for arbitrary learning rates for the weights, asymptotic acceleration effects, as well as insensitivity to choice of learning rates. We then demonstrate numerically that these findings are not specific to the OLS problem and hold qualitatively for more complex supervised learning problems. This points to a new direction towards uncovering the mathematical principles that underlies batch normalization.

##### Temporal Difference Variational Auto-Encoder

tl;dr Generative model of temporal data, that builds online belief state, operates in latent space, does jumpy predictions and rollouts of states.

To act and plan in complex environments, we posit that agents should have a mental simulator of the world with three characteristics: (a) it should build an abstract state representing the condition of the world; (b) it should form a belief which represents uncertainty on the world; (c) it should go beyond simple step-by-step simulation, and exhibit temporal abstraction. Motivated by the absence of a model satisfying all these requirements, we propose TD-VAE, a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of temporal difference learning used in reinforcement learning.

##### Hierarchical Attention: What Really Counts in Various NLP Tasks

tl;dr The paper proposed a novel hierarchical model to replace the original attention model in various NLP tasks.

Attention mechanisms in sequence to sequence models have shown great ability and wonderful performance in various natural language processing (NLP) tasks, such as sentence embedding, text generation, machine translation, machine reading comprehension, etc. Unfortunately, existing attention mechanisms only learn either high-level or low-level features. In this paper, we think that the lack of hierarchical mechanisms is a bottleneck in improving the performance of the attention mechanisms, and propose a novel Hierarchical Attention Mechanism (Ham) based on the weighted sum of different layers of a multi-level attention. Ham achieves a state-of-the-art BLEU score of 0.26 on Chinese poem generation task and a nearly 6.5% averaged improvement compared with the existing machine reading comprehension models such as BIDAF and Match-LSTM. Furthermore, our experiments and theorems reveal that Ham has greater generalization and representation ability than existing attention mechanisms.

##### Confidence Regularized Self-Training

tl;dr A self-training optimization framework with pseudo-label confidence regularization

Recent advances in domain adaptation show that self-training with deep networks presents a powerful means for unsupervised domain adaptation. Specifically, these methods often involve an iterative process of predicting on target domain and then taking the confident predictions as pseudo-labels for retraining. Basic self-training treats selected pseudo-labels equally as one-hot vectors, and the selection process is often modeled to be progressive via self-paced learning. While one-hot vector is a natural choice to model multiclass targets, such encoding scheme does not consider the difference between selected samples. As a result, the approach sometimes generates overconfident false pseudo-labels that lead to the convergence to deviated solutions with propagated errors, especially when there is a large domain gap. To address this problem, we generalize self-training as an expectation maximization (EM) problem which treats pseudo-labels as latent variables, and solves maximum marginal likelihood estimation (MMLE) by maximizing its lower bound. We then propose confidence regularized self-training, where we introduce multiple confidence regularizers along with their solutions. These regularizers mainly consider the control of pseudo-label confidence from two aspects: model regularization and label refinement. Experiments on both semantic segmentation and image classification show that self-training with different confidence regularizers comprehensively outperform their non-regularized counterparts.

##### MERCI: A NEW METRIC TO EVALUATE THE CORRELATION BETWEEN PREDICTIVE UNCERTAINTY AND TRUE ERROR

tl;dr We review existing metrics and propose a new one to evaluate predictive uncertainty in deep learning

##### Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

tl;dr We use a hypernetwork to predict optimal weights given hyperparameters, and jointly train everything together.

Hyperparameter optimization is a bi-level optimization problem, where the optimal parameters on the training set depend on the current hyperparameters. The best-response function which maps hyperparameters to these optimal parameters allows gradient-based hyperparameter optimization but is difficult to represent and compute when the parameters are high dimensional, as in neural networks. We develop efficient best-response approximations for neural networks by applying insights from the structure of the optimal response in a Jacobian-regularized two-layer linear network to deep, nonlinear networks. The approximation works by scaling and shifting the hidden units by amounts which depend on the current hyperparameters. We use our approximation for a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response in a neighborhood around the current hyperparameters and optimizing the hyperparameters using the approximate best-response. We show this method outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks.

##### Posterior Attention Models for Sequence to Sequence Learning

tl;dr Computing attention based on posterior distribution leads to more meaningful attention and better performance

Modern neural architectures critically rely on attention for mapping structured inputs to sequences. In this paper we show that prevalent attention architectures do not adequately model the dependence among the attention and output variables along the length of a predicted sequence. We present an alternative architecture called Posterior Attention Models that relying on a principled factorization of the full joint distribution of the attention and output variables propose two major changes. First, the position where attention is marginalized is changed from the input to the output. Second, the attention propagated to the next decoding stage is a posterior attention distribution conditioned on the output. Empirically on five translation and two morphological inflection tasks the proposed posterior attention models yield better predictions and alignment accuracy than existing attention models.

##### Provable Defenses against Spatially Transformed Adversarial Inputs: Impossibility and Possibility Results

No tl;dr =[

One intriguing property of neural networks is their inherent vulnerability to adversarial inputs, which are maliciously crafted samples to trigger target networks to misbehave. The state-of-the-art attacks generate adversarial inputs using either pixel perturbation or spatial transformation. Thus far, several provable defenses have been proposed against pixel perturbation-based attacks; yet, little is known about whether such solutions exist for spatial transformation-based attacks. This paper bridges this striking gap by conducting the first systematic study on provable defenses against spatially transformed adversarial inputs. Our findings convey mixed messages. On the impossibility side, we show that such defenses may not exist in practice: for any given networks, it is possible to find legitimate inputs and imperceptible transformations to generate adversarial inputs that force arbitrarily large errors. On the possibility side, we show that it is still feasible to construct adversarial training methods to significantly improve the resilience of networks against adversarial inputs over empirical datasets. We believe our findings provide insights for designing more effective defenses against spatially transformed adversarial inputs.

##### Improving Generative Adversarial Imitation Learning with Non-expert Demonstrations

tl;dr We improve GAIL by learning discriminators using multiclass classification with non-expert regarded as an extra class.

Imitation learning aims to learn an optimal policy from expert demonstrations and its recent combination with deep learning has shown impressive performance. However, collecting a large number of expert demonstrations for deep learning is time-consuming and requires much expert effort. In this paper, we propose a method to improve generative adversarial imitation learning by using additional information from non-expert demonstrations which are easier to obtain. The key idea of our method is to perform multiclass classification to learn discriminator functions where non-expert demonstrations are regarded as being drawn from an extra class. Experiments in continuous control tasks demonstrate that our method learns optimal policies faster and has more stable performance than the generative adversarial imitation learning baseline.

##### Learning from Noisy Demonstration Sets via Meta-Learned Suitability Assessor

tl;dr We propose a framework to learn a good policy through imitation learning from a noisy demonstration set via meta-training a demonstration suitability assessor.

A noisy and diverse demonstration set may hinder the performances of an agent aiming to acquire certain skills via imitation learning. However, state-of-the-art imitation learning algorithms often assume the optimality of the given demonstration set. In this paper, we address such optimal assumption by learning only from the most suitable demonstrations in a given set. Suitability of a demonstration is estimated by whether imitating it produce desirable outcomes for achieving the goals of the tasks. For more efficient demonstration suitability assessments, the learning agent should be capable of imitating a demonstration as quick as possible, which shares similar spirit with fast adaptation in the meta-learning regime. Our framework, thus built on top of Model-Agnostic Meta-Learning, evaluates how desirable the imitated outcomes are, after adaptation to each demonstration in the set. The resulting assessments hence enable us to select suitable demonstration subsets for acquiring better imitated skills. The videos related to our experiments are available at: https://sites.google.com/view/deepdj

##### Neural Model-Based Reinforcement Learning for Recommendation

tl;dr A new insight of designing a RL recommendation policy based on user behavior model along with some technical highlights.

There is a great interest in applying reinforcement learning (RL) to recommendation systems. However, in this setting, an online user is the environment; neither the reward function nor the environment dynamics is clearly defined, making the application of RL challenging. In this paper, we propose a novel model-based reinforcement learning framework for recommendation systems, where we developed a generative adversarial network to imitate user behavior dynamics and learn her reward function. Using this user model as the simulation environment, we develop a novel DQN algorithm to obtain a combinatorial recommendation policy which can handle a large number of candidate items efficiently. In our experiments with real data, we show this generative adversarial user model can better explain user behavior than alternatives, and the RL policy based on this model can lead to better long turn reward for the user and higher click rate for the system.

##### Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model

tl;dr State-of-the-art zero-shot learning performance by using a translation task to bridge multi-task training across languages.

Neural language models have been shown to achieve an impressive level of performance on a number of language processing tasks. The majority of these models, however, are limited to producing predictions for only English texts due to limited amounts of labeled data available in other languages. One potential method for overcoming this issue is learning cross-lingual text representations that can be used to transfer the performance from training on English tasks to non-English tasks, despite little to no task-specific non-English data. In this paper, we explore a natural setup for learning cross-lingual sentence representations: the dual-encoder. We provide a comprehensive evaluation of our cross-lingual representations on a number of monolingual, cross-lingual, and zero-shot/few-shot learning tasks, and also give an analysis of different learned cross-lingual embedding spaces.

##### Meta Learning with Fast/Slow Learners

tl;dr We applied multiple meta-strategy to improve meta-learning performance on base CNNs.

Meta-learning has recently achieved success in many optimization problems. In general, a meta learner g(.) could be learned for a base model f(.) on a variety of tasks, such that it can be more efficient on a new task. In this paper, we make some key modifications to enhance the performance of meta-learning models. (1) we leverage different meta-strategies for different modules to optimize them separately: we use conservative “slow learners” on low-level basic feature representation layers and “fast learners” on high-level task-specific layers; (2) Furthermore, we provide theoretical analysis on why the proposed approach works, based on a case study on a two-layer MLP. We evaluate our model on synthetic MLP regression, as well as low-shot learning tasks on Omniglot and ImageNet benchmarks. We demonstrate that our approach is able to achieve state-of-the-art performance.

##### Evaluation Methodology for Attacks Against Confidence Thresholding Models

tl;dr We present metrics and an optimal attack for evaluating models that defend against adversarial examples using confidence thresholding

Current machine learning algorithms can be easily fooled by adversarial examples. One possible solution path is to make models that use confidence thresholding to avoid making mistakes. Such models refuse to make a prediction when they are not confident of their answer. We propose to evaluate such models in terms of tradeoff curves with the goal of high success rate on clean examples and low failure rate on adversarial examples. Existing untargeted attacks developed for models that do not use confidence thresholding tend to underestimate such models' vulnerability. We propose the MaxConfidence family of attacks, which are optimal in a variety of theoretical settings, including one realistic setting: attacks against linear models. Experiments show the attack attains good results in practice. We show that simple defenses are able to perform well on MNIST but not on CIFAR, contributing further to previous calls that MNIST should be retired as a benchmarking dataset for adversarial robustness research. We release code for these evaluations as part of the cleverhans (Papernot et al 2018) library (ICLR reviewers should be careful not to look at who contributed these features to cleverhans to avoid de-anonymizing this submission).

##### Inhibited Softmax for Uncertainty Estimation in Neural Networks

tl;dr Uncertainty estimation in a single forward pass without additional learnable parameters.

We present a new method for uncertainty estimation and out-of-distribution detection in neural networks with softmax output. We extend softmax layer with an additional constant input. The corresponding additional output is able to represent the uncertainty of the network. The proposed method requires neither additional parameters nor multiple forward passes nor input preprocessing nor out-of-distribution datasets. We show that our method performs comparably to more computationally expensive methods and outperforms baselines on our experiments from image recognition and sentiment analysis domains.

##### Improving Generalization and Stability of Generative Adversarial Networks

tl;dr We propose a zero-centered gradient penalty for improving generalization and stability of GANs

Generative Adversarial Networks (GANs) are one of the most popular tools for learning complex high dimensional distributions. However, generalization properties of GANs have not been well understood. In this paper, we analyze the generalization of GANs in practical settings. We show that discriminators trained on discrete datasets with the original GAN loss have poor generalization capability and do not approximate the theoretically optimal discriminator. We propose a zero-centered gradient penalty for improving the generalization of the discriminator by pushing it toward the optimal discriminator. The penalty guarantees the generalization and convergence of GANs. Experiments on synthetic and large scale datasets verify our theoretical analysis.

##### A Unified View of Deep Metric Learning via Gradient Analysis

No tl;dr =[

Loss functions play a pivotal role in deep metric learning (DML). A large variety of loss functions have been proposed in DML recently. However, it remains difficult to answer this question: what are the intrinsic differences among these loss functions?This paper answers this question by proposing a unified perspective to rethink deep metric loss functions. We show theoretically that most DML methods in deep metric learning, in view of gradient equivalence, are essentially weight assignment strategies of training pairs. Based on this unified view, we revisit several typical DML methods and disclose their hidden drawbacks. Moreover, we point out the key components of an effective DML approach which drives us to propose our weight assignment framework. We evaluate our method on image retrieval tasks, and show that it outperforms the state-of-the-art DML approaches by a significant margin on the CUB-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval datasets.

##### Diminishing Batch Normalization

tl;dr We propose a extension of the batch normalization, show a first-of-its-kind convergence analysis for this extension and show in numerical experiments that it has better performance than the original batch normalizatin.

In this paper, we propose a generalization of the BN algorithm, diminishing batch normalization (DBN), where we update the BN parameters in a diminishing moving average way. Batch normalization (BN) is very effective in accelerating the convergence of a neural network training phase that it has become a common practice. Our proposed DBN algorithm remains the overall structure of the original BN algorithm while introduces a weighted averaging update to some trainable parameters. We provide an analysis of the convergence of the DBN algorithm that converges to a stationary point with respect to trainable parameters. Our analysis can be easily generalized for original BN algorithm by setting some parameters to constant. To the best knowledge of authors, this analysis is the first of its kind for convergence with Batch Normalization introduced. We analyze a two-layer model with arbitrary activation function. The primary challenge of the analysis is the fact that some parameters are updated by gradient while others are not. The convergence analysis applies to any activation function that satisfies our common assumptions. For the analysis, we also show the sufficient and necessary conditions for the stepsizes and diminishing weights to ensure the convergence. In the numerical experiments, we use more complex models with more layers and ReLU activation. We observe that DBN outperforms the original BN algorithm on Imagenet, MNIST, NI and CIFAR-10 datasets with reasonable complex FNN and CNN models.

##### Variation Network: Learning High-level Attributes for Controlled Input Manipulation

tl;dr The Variation Network is a generative model able to learn high-level attributes without supervision that can then be used for controlled input manipulation.

This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can be easily combined which makes VarNet applicable for a wide variety of tasks. Further, VarNet has a sound probabilistic interpretation which grants us with a novel way to navigate in the latent spaces as well as means to control how the attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and interpretable.

##### Live Face De-Identification in Video

No tl;dr =[

We propose a method for face de-identification that enables fully automatic video modification at high frame rates. The goal is to maximally decorrelate the identity, while having the perception (pose, illumination and expression) fixed. We achieve this by a novel feed forward encoder-decoder network architecture that is conditioned on the high-level representation of a person's facial image. The network is global, in the sense that it does not need to be retrained for a given video or for a given identity, and it creates natural-looking image sequences with little distortion in time.

##### A More Globally Accurate Dimensionality Reduction Method Using Triplets

tl;dr A new dimensionality reduction method using triplets which is significantly faster than t-SNE and provides more accurate results globally

We first show that the commonly used dimensionality reduction (DR) methods such as t-SNE and LargeVis poorly capture the global structure of the data in the low dimensional embedding. We show this via a number of tests for the DR methods that can be easily applied by any practitioner to the dataset at hand. Surprisingly enough, t-SNE performs the best w.r.t. the commonly used measures that reward the local neighborhood accuracy such as precision-recall while having the worst performance in our tests for global structure. We then contrast the performance of these two DR method against our new method called TriMap. The main idea behind TriMap is to capture higher orders of structure with triplet information (instead of pairwise information used by t-SNE and LargeVis), and to minimize a robust loss function for satisfying the chosen triplets. We provide compelling experimental evidence on large natural datasets for the clear advantage of the TriMap DR results. As LargeVis, TriMap is fast and scales linearly with the number of data points.

tl;dr We proposed "Difference-Seeking Generative Adversarial Network" (DSGAN) model to learn the target distribution which is hard to collect training data.

We propose a novel algorithm, Difference-Seeking Generative Adversarial Network (DSGAN), developed from traditional GAN. DSGAN considers the scenario that the training samples of target distribution, $p_{t}$, are difficult to collect. Suppose there are two distributions $p_{\bar{d}}$ and $p_{d}$ such that the density of the target distribution can be the differences between the densities of $p_{\bar{d}}$ and $p_{d}$. We show how to learn the target distribution $p_{t}$ only via samples from $p_{d}$ and $p_{\bar{d}}$ (relatively easy to obtain). DSGAN has the flexibility to produce samples from various target distributions (e.g. the out-of-distribution). Two key applications, semi-supervised learning and adversarial training, are taken as examples to validate the effectiveness of DSGAN. We also provide theoretical analyses about the convergence of DSGAN.

##### Language Modeling with Graph Temporal Convolutional Networks

No tl;dr =[

Recently, there have been some attempts to use non-recurrent neural models for language modeling. However, a noticeable performance gap still remains. We propose a non-recurrent neural language model, dubbed graph temporal convolutional network (GTCN), that relies on graph neural network blocks and convolution operations. While the standard recurrent neural network language models encode sentences sequentially without modeling higher-level structural information, our model regards sentences as graphs and processes input words within a message propagation framework, aiming to learn better syntactic information by inferring skip-word connections. Specifically, the graph network blocks operate in parallel and learn the underlying graph structures in sentences without any additional annotation pertaining to structure knowledge. Experiments demonstrate that the model without recurrence can achieve comparable perplexity results in language modeling tasks and successfully learn syntactic information.

##### GraphSeq2Seq: Graph-Sequence-to-Sequence for Neural Machine Translation

tl;dr Graph-Sequence-to-Sequence for Neural Machine Translation

Sequence-to-Sequence (Seq2Seq) neural models have become popular for text generation problems, e.g. neural machine translation (NMT) (Bahdanau et al.,2014; Britz et al., 2017), text summarization (Nallapati et al., 2017; Wang &Ling, 2016), and image captioning (Venugopalan et al., 2015; Liu et al., 2017). Though sequential modeling has been shown to be effective, the dependency graph among words contains additional semantic information and thus can be utilized for sentence modeling. In this paper, we propose a Graph-Sequence-to-Sequence(GraphSeq2Seq) model to fuse the dependency graph among words into the traditional Seq2Seq framework. For each sample, the sub-graph of each word is encoded to a graph representation, which is then utilized to sequential encoding. At last, a sequence decoder is leveraged for output generation. Since above model fuses different features by contacting them together to encode, we also propose a variant of our model that regards the graph representations as additional annotations in attention mechanism (Bahdanau et al., 2014) by separately encoding different features. Experiments on several translation benchmarks show that our models can outperform existing state-of-the-art methods, demonstrating the effectiveness of the combination of Graph2Seq and Seq2Seq.

##### Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling

tl;dr We compare many tasks and task combinations for pretraining sentence-level BiLSTMs for NLP tasks. Language modeling is the best single pretraining task, but simple baselines also do well.

Work on the problem of contextualized word representation—the development of reusable neural network components for sentence understanding—has recently seen a surge of progress centered on the unsupervised pretraining task of language modeling with methods like ELMo (Peters et al., 2018). This paper contributes the first large-scale systematic study comparing different pretraining tasks in this context, both as complements to language modeling and as potential alternatives. The primary results of the study support the use of language modeling as a pretraining task and set a new state of the art among comparable models using multitask learning with language models. However, a closer look at these results reveals worryingly strong baselines and strikingly varied results across target tasks, suggesting that the widely-used paradigm of pretraining and freezing sentence encoders may not be an ideal platform for further work.

##### Unsupervised classification into unknown k classes

No tl;dr =[

We propose a novel spectral decomposition framework for the unsupervised classification task. Unlike the widely used classification method, this architecture does not require the labels of data and the number of classes. Our key idea is to introduce a piecewise linear map and a spectral decomposition method on the dimension reduced space into generative adversarial networks. Inspired by the human visual recognition system, the proposed framework can classify and also generate images as the human brains do. We build a piecewise linear connection analogous to the cerebral cortex, between the discriminator D and the generator G. This connection allows us to estimate the number of classes k and extract the vectors that represent each class. We show that our framework has the reasonable performance in the experiment.

No tl;dr =[

An adversarial process between two deep neural networks is a promising approach to train robust networks. In this study, we propose a framework for training networks that eliminates subsidiary information via the adversarial process. The objective of the proposed framework is to train a primary model that is robust to existing subsidiary information. This primary model can be used for various recognition tasks, such as digit recognition and speaker identification. Subsidiary information refers to the factors that might decrease the performance of the primary model such as channel information in speaker recognition and noise information in digit recognition. Our proposed framework comprises two discriminative models for the primary and subsidiary task, as well as an encoder network for feature representation. A subsidiary task is an operation associated with subsidiary information such as identifying the noise type. The discriminative model for the subsidiary task is trained for modeling the dependency of subsidiary class labels on codes from the encoder. Therefore, we expect that subsidiary information could be eliminated by training the encoder to reduce the dependency between the class labels and codes. In order to do so, we train the weight parameters of the subsidiary model; then, we develop the codes and the parameters of subsidiary model to make them orthogonal. For this purpose, we design a loss function to train the encoder based on cosine similarity between the weight parameters of the subsidiary model and codes. Finally, the proposed framework involves repeatedly performing the adversarial process of modeling the subsidiary information and eliminating it. Furthermore, we discuss possible applications of the proposed framework: reducing channel information for speaker identification and domain information for unsupervised domain adaptation.

##### Bridging HMMs and RNNs through Architectural Transformations

tl;dr Are HMMs a special case of RNNs? We investigate a series of architectural transformations between HMMs and RNNs, both through theoretical derivations and empirical hybridization and provide new insights.

A distinct commonality between HMMs and RNNs is that they both learn hidden representations for sequential data. In addition, it has been noted that the backward computation of the Baum-Welch algorithm for HMMs is a special case of the back propagation algorithm used for neural networks (Eisner (2016)). Do these observations suggest that, despite their many apparent differences, HMMs are a special case of RNNs? In this paper, we investigate a series of architectural transformations between HMMs and RNNs, both through theoretical derivations and empirical hybridization, to answer this question. In particular, we investigate three key design factors—independence assumptions between the hidden states and the observation, the placement of softmax, and the use of non-linearity—in order to pin down their empirical effects. We present a comprehensive empirical study to provide insights on the interplay between expressivity and interpretability with respect to language modeling and parts-of-speech induction.

##### Modeling Parts, Structure, and System Dynamics via Predictive Learning

tl;dr Learning object parts, hierarchical structure, and dynamics by watching how they move

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos in a self-supervised manner. Our Parts, Structure, and Dynamics (PSD) model learns to first recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.

##### Diffusion Scattering Transforms on Graphs

tl;dr Stability of scattering transform representations of graph data to deformations of the underlying graph support.

Stability is a key aspect of data analysis. In many applications, the natural notion of stability is geometric, as illustrated for example in computer vision. Scattering transforms construct deep convolutional representations which are certified stable to input deformations. This stability to deformations can be interpreted as stability with respect to changes in the metric structure of the domain. In this work, we show that scattering transforms can be generalized to non-Euclidean domains using diffusion wavelets, while preserving a notion of stability with respect to metric changes in the domain, measured with diffusion maps. The resulting representation is stable to metric perturbations of the domain while being able to capture ''high-frequency'' information, akin to the Euclidean Scattering.

##### Hierarchical Reinforcement Learning with Limited Policies and Hindsight

tl;dr We propose a new hierarchical RL framework that can improve performance in tasks involving long time horizons and sparse rewards.

We introduce a new hierarchical reinforcement learning framework that can accelerate learning in tasks involving long time horizons and sparse rewards. Our approach improves sample efficiency by enabling agents to learn a hierarchy of short policies that operate at different time scales. The policy hierarchies can support an arbitrary number of levels, and all policies within the hierarchy are trained in parallel and end-to-end. Our framework is the first hierarchical reinforcement learning approach that can learn hierarchies with more than two levels of policies in continuous tasks. We demonstrate experimentally in both grid world and simulated robotics domains that our approach can significantly boost sample efficiency. A video illustrating our results is available at https://www.youtube.com/watch?v=i04QF7Yi50Y.

##### Learning Global Additive Explanations for Neural Nets Using Model Distillation

tl;dr We propose to leverage model distillation to learn global additive explanations in the form of feature shapes (that are more expressive than feature attributions) for models such as neural nets trained on tabular data.

Interpretability has largely focused on local explanations, i.e. explaining why a model made a particular prediction for a sample. These explanations are appealing due to their simplicity and local fidelity. However, they do not provide information about the general behavior of the model. We propose to leverage model distillation to learn global additive explanations that describe the relationship between input features and model predictions. These global explanations take the form of feature shapes, which are more expressive than feature attributions. Through careful experimentation, we show qualitatively and quantitatively that global additive explanations are able to describe model behavior and yield insights about models such as neural nets. A visualization of our approach applied to a neural net as it is trained is available at https://youtu.be/ErQYwNqzEdc

##### Competitive experience replay

tl;dr a novel method to learn with sparse reward using adversarial reward re-labeling

Deep learning has achieved remarkable successes in solving challenging reinforcement learning (RL) problems. However, it still often suffers from the need to engineer a reward function that not only reflects the task but is also carefully shaped. This limits the applicability of RL in the real world. It is therefore of great practical importance to develop algorithms which can learn from unshaped, sparse reward signals, e.g. a binary signal indicating successful task completion. We propose a novel method called competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only binary rewards indicating whether or not the goal is completed. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster converge and improved task performance.

##### MLPrune: Multi-Layer Pruning for Automated Neural Network Compression

tl;dr MLPrune: an automated pruning method that doesn't require any tuning for per-layer compression ratio, achieves state-of-the-art pruning results on AlexNet and VGG16.

Model compression can significantly reduce the computation and memory footprint of large neural networks. To achieve a good trade-off between model size and accuracy, popular compression techniques usually rely on hand-crafted heuristics and require manually setting the compression ratio of each layer. This process is typically costly and suboptimal. In this paper, we propose a Multi-Layer Pruning method (MLPrune), which is theoretically sound, and can automatically decide appropriate compression ratios for all layers. Towards this goal, we use an efficient approximation of the Hessian as our pruning criterion, based on a Kronecker-factored Approximate Curvature method. We demonstrate the effectiveness of our method on several datasets and architectures, outperforming previous state-of-the-art by a large margin. Our experiments show that we can compress AlexNet and VGG16 by 25x without loss in accuracy on ImageNet. Furthermore, our method has much fewer hyper-parameters and requires no expert knowledge.