# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: conditional autoencoder, autonomous flying, visual reasoning, spatiotemporal dependencies, structural sparsity ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "dirichlet distribution" has 14 results

##### Neural Network Regression with Beta, Dirichlet, and Dirichlet-Multinomial Outputs

tl;dr Neural network regression should use Dirichlet output distribution when targets are probabilities in order to quantify uncertainty of predictions.

We propose a method for quantifying uncertainty in neural network regression models when the targets are real values on a $d$-dimensional simplex, such as probabilities. We show that each target can be modeled as a sample from a Dirichlet distribution, where the parameters of the Dirichlet are provided by the output of a neural network, and that the combined model can be trained using the gradient of the data likelihood. This approach provides interpretable predictions in the form of multidimensional distributions, rather than point estimates, from which one can obtain confidence intervals or quantify risk in decision making. Furthermore, we show that the same approach can be used to model targets in the form of empirical counts as samples from the Dirichlet-multinomial compound distribution. In experiments, we verify that our approach provides these benefits without harming the performance of the point estimate predictions on two diverse applications: (1) distilling deep convolutional networks trained on CIFAR-100, and (2) predicting the location of particle collisions in the XENON1T Dark Matter detector.

##### Learning Procedural Abstractions and Evaluating Discrete Latent Temporal Structure

No tl;dr =[

Clustering methods and latent variable models are often used as tools for pattern mining and discovery of latent structure in time-series data. In this work, we consider the problem of learning procedural abstractions from possibly high-dimensional observational sequences, such as video demonstrations. Given a dataset of time-series, the goal is to identify the latent sequence of steps common to them and label each time-series with the temporal extent of these procedural steps. We introduce a hierarchical Bayesian model called Prism that models the realization of a common procedure across multiple time-series, and can recover procedural abstractions with supervision. We also bring to light two characteristics ignored by traditional evaluation criteria when evaluating latent temporal labelings (temporal clusterings) -- segment structure, and repeated structure -- and develop new metrics tailored to their evaluation. We demonstrate that our metrics improve interpretability and ease of analysis for evaluation on benchmark time-series datasets. Results on benchmark and video datasets indicate that Prism outperforms standard sequence models as well as state-of-the-art techniques in identifying procedural abstractions.

##### A variational Dirichlet framework for out-of-distribution detection

tl;dr A new framework based variational inference for out-of-distribution detection

With the recently rapid development in deep learning, deep neural networks have been widely adopted in many real-life applications. However, deep neural networks are known to have very little control over its uncertainty for test examples, which can potentially cause very harmful and annoying consequences in practical scenarios. In this paper, we are particularly interested in designing a higher-order uncertainty metric for deep neural networks and investigate its performance on the out-of-distribution detection task proposed by~\cite{hendrycks2016baseline}. Our method is based on a variational inference framework where we interpret the output distribution $p(x)$ as a stochastic variable $z$ lying on a simplex of multi-dimensional space and represent the higher-order uncertainty via the entropy of the latent distribution $p(z)$. Under the variational Bayesian framework with a given dataset $D$, we propose to adopt Dirichlet distribution as the approximate posterior $F_{\theta}(z|x)$ to approach the true posterior distribution $p(z|D)$ by maximizing the evidence lower bound of marginal likelihood. By identifying the over-concentration issue in the Dirichlet framework, we further design a log-scaling smoothing function to avert such issue and greatly increase the robustness of the entropy-based uncertainty measure. Through comprehensive experiments on various datasets and architectures, our proposed variational Dirichlet framework is observed to yield state-of-the-art results for out-of-distribution detection.

##### Hierarchically Clustered Representation Learning

tl;dr We introduce hierarchically clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space.

The joint optimization of representation learning and clustering in the embedding space has experienced a breakthrough in recent years. In spite of the advance, clustering with representation learning has been limited to flat-level categories, which oftentimes involves cohesive clustering with a focus on instance relations. To overcome the limitations of flat clustering, we introduce hierarchically clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space. Specifically, we place a nonparametric Bayesian prior on embeddings to handle dynamic mixture hierarchies under the variational autoencoder framework, and to adopt the generative process of a hierarchical-versioned Gaussian mixture model. Compared with a few prior works focusing on unifying representation learning and hierarchical clustering, HCRL is the first model to consider a generation of deep embeddings from every component of the hierarchy, not just leaf components. This generation process enables more meaningful separations and mergers of clusters via branches in a hierarchy. In addition to obtaining hierarchically clustered embeddings, we can reconstruct data by the various abstraction levels, infer the intrinsic hierarchical structure, and learn the level-proportion features. We conducted evaluations with image and text domains, and our quantitative analyses showed competent likelihoods and the best accuracies compared with the baselines.

##### Prior Networks for Detection of Adversarial Attacks

tl;dr We show that it is possible to successfully detect a range of adversarial attacks using measures of uncertainty derived from Prior Networks.

##### Probabilistic Program Induction for Intuitive Physics Game Play

tl;dr The paper describes a method imitating human cognition about the physical world to play games in environments of physical interactions.

Recent findings suggest that humans deploy cognitive mechanism of physics simulation engines to simulate the physics of objects. We propose a framework for bots to deploy similar tools for interacting with intuitive physics environments. The framework employs a physics simulation in a probabilistic way to infer about moves performed by an agent in a setting governed by Newtonian laws of motion. However, methods of probabilistic programs can be slow in such setting due to their need to generate many samples. We complement the model with a model-free approach to aid the sampling procedures in becoming more efficient through learning from experience during game playing. We present an approach where a myriad of model-free approaches (a convolutional neural network in our model) and model-based approaches (probabilistic physics simulation) is able to achieve what neither could alone. This way the model outperforms an all model-free or all model-based approach. We discuss a case study showing empirical results of the performance of the model on the game of Flappy Bird.

##### Variational Bayesian Phylogenetic Inference

tl;dr The first variational Bayes formulation of phylogenetic inference, a challenging inference problem over structures with intertwined discrete and continuous components

Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple mechanisms for proposing new states, which hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We approximate the true posterior using an expressive graphical model for tree distributions, called a subsplit Bayesian network, together with appropriate branch length distributions. We train the variational approximation via stochastic gradient ascent and adopt multi-sample based gradient estimators for different latent variables separately to handle the composite latent space of phylogenetic models. We show that our structured variational approximations are flexible enough to provide comparable posterior estimation to MCMC, while requiring less computation due to a more efficient tree exploration mechanism enabled by variational inference. Moreover, the variational approximations can be readily used for further statistical analysis such as marginal likelihood estimation for model comparison via importance sampling. Experiments on both synthetic data and real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.

##### Curiosity-Driven Experience Prioritization via Density Estimation

tl;dr Our paper proposes a curiosity-driven prioritization framework for RL agents, which improves both performance and sample-efficiency.

In Reinforcement Learning (RL), an agent explores the environment and collects trajectories into the memory buffer for later learning. However, the collected trajectories can easily be imbalanced with respect to the achieved goal states. The problem of learning from imbalanced data is a well-known problem in supervised learning, but has not yet been thoroughly researched in RL. To address this problem, we propose a novel Curiosity-Driven Prioritization (CDP) framework to encourage the agent to over-sample those trajectories that have rare achieved goal states. The CDP framework mimics the human learning process and focuses more on relatively uncommon events. We evaluate our methods using the robotic environment provided by OpenAI Gym. The environment contains six robot manipulation tasks. In our experiments, we combined CDP with Deep Deterministic Policy Gradient (DDPG) with or without Hindsight Experience Replay (HER). The experimental results show that CDP improves both performance and sample-efficiency of reinforcement learning agents, compared to state-of-the-art methods.

##### Probabilistic Federated Neural Matching

tl;dr We propose a Bayesian nonparametric model for federated learning with neural networks.

In federated learning problems, data is scattered across different servers and exchanging or pooling it is often impractical or prohibited. We develop a Bayesian nonparametric framework for federated learning with neural networks. Each data server is assumed to train local neural network weights, which are modeled through our framework. We then develop an inference approach that allows us to synthesize a more expressive global network without additional supervision or data pooling. We then demonstrate the efficacy of our approach on federated learning problems simulated from two popular image classification datasets.

##### Sample-efficient policy learning in multi-agent Reinforcement Learning via meta-learning

tl;dr Our work applies meta-learning to multi-agent Reinforcement Learning to help our agent efficiently adapted to new coming opponents.

To gain high rewards in muti-agent scenes, it is sometimes necessary to understand other agents and make corresponding optimal decisions. We can solve these tasks by first building models for other agents and then finding the optimal policy with these models. To get an accurate model, many observations are needed and this can be sample-inefficient. What's more, the learned model and policy can overfit to current agents and cannot generalize if the other agents are replaced by new agents. In many practical situations, each agent we face can be considered as a sample from a population with a fixed but unknown distribution. Thus we can treat the task against some specific agents as a task sampled from a task distribution. We apply meta-learning method to build models and learn policies. Therefore when new agents come, we can adapt to them efficiently. Experiments on grid games show that our method can quickly get high rewards.

##### Filter Training and Maximum Response: Classification via Discerning

tl;dr The proposed scheme mimics the classification process mediated by a series of one component picking.

This report introduces a training and recognition scheme, in which classification is realized via class-wise discerning. Trained with datasets whose labels are randomly shuffled except for one class of interest, a neural network learns class-wise parameter values, and remolds itself from a feature sorter into feature filters, each of which discerns objects belonging to one of the classes only. Classification of an input can be inferred from the maximum response of the filters. A multiple check with multiple versions of filters can diminish fluctuation and yields better performance. This scheme of discerning, maximum response and multiple check is a method of general viability to improve performance of feedforward networks, and the filter training itself is a promising feature abstraction procedure. In contrast to the direct sorting, the scheme mimics the classification process mediated by a series of one component picking.

##### Regularized Learning for Domain Adaptation under Label Shifts

tl;dr A practical and provably guaranteed approach for efficiently training a classifier when there is a label shift between Source and Target

We propose Regularized Learning under Label shifts (RLLS), a principled and practical approach to correct for shifts in the label distribution between the source and target domains, and is a special case of domain adaptation. We estimate the importance weights using both labeled source and unlabeled target data and then train a classifier on the weighted source samples. Depending on the number of source and target samples, we regularize the influence of these estimated weights and train a corrected classifier on the source set with the estimated weights. We derive a generalization bound for the loss of the classifier on the target set. This bound is independent of the (ambient) data dimensions and instead depends on the complexity of the function class. To the best of our knowledge, this is the first generalization bound for label-shift problems where the labels in the target domain are unknown. Based on this bound, we design a regularized estimator for importance weights that allows us to adapt to the uncertainty in the estimated weights for the small-sample regime. Experiments show that regularized RLLS significantly improves accuracy in the low (target) sample and large-shift regimes compared to previous methods.

##### Dirichlet Variational Autoencoder

No tl;dr =[

This paper proposes Dirichlet Variational Autoencoder (DirVAE) using a Dirichlet prior for a continuous latent variable that exhibits the characteristic of the categorical probabilities. To infer the parameters of DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reshape the component collapsing issue by investigating two problem sources, which are decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing; while Gaussian VAE exhibits the decoder weight collapsing and Stick-Breaking VAE shows the latent value collapsing. The experimental results show that 1) DirVAE models the latent representation result with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with no collapsing issues which the baseline models suffer from. Also, we show that the learned latent representation from the DirVAE achieves the best classification accuracy in the semi-supervised and the supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrated that the DirVAE augmented topic models show better performances in most cases.

##### GO Gradient for Expectation-Based Objectives

No tl;dr =[

Within many machine learning algorithms, a fundamental problem concerns efficient calculation of an unbiased gradient wrt parameters $\boldsymbol{\gamma}$ for expectation-based objectives $\mathbb{E}_{q_{\boldsymbol{\gamma}} (\boldsymbol{y})} [f (\boldsymbol{y}) ]$. Most existing methods either ($i$) suffer from high variance, seeking help from (often) complicated variance-reduction techniques; or ($ii$) they only apply to distributions of continuous random variables and employ a reparameterization trick. To address these limitations, we propose a General and One-sample (GO) gradient that ($i$) applies to distributions associated with continuous {\em or} discrete random variables, and ($ii$) has the same low-variance as the reparameterization trick. We find that the GO gradient often works well in practice based on only one Monte Carlo sample (although one can of course use more samples if desired). Alongside the GO gradient, we develop a means of propagating the chain rule through distributions, yielding {statistical back-propagation}, coupling neural networks to random variables.