# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.
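For the curious, the idea can be sketched in a few lines of Python. This is only a rough sketch, not the code behind this page: it assumes the `pdftotext` command-line tool (from poppler) is installed and on the PATH, and the function names `pdf_to_text`, `contains_phrase` and `search_papers` are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path: Path) -> str:
    """Extract plain text from a PDF by shelling out to the pdftotext CLI."""
    # The trailing "-" asks pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive phrase match that tolerates line breaks between words."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\s+".join(words)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def search_papers(pdf_dir: Path, phrase: str) -> list[str]:
    """Return the names of PDFs in pdf_dir whose full text contains phrase."""
    return sorted(
        pdf.name for pdf in pdf_dir.glob("*.pdf")
        if contains_phrase(pdf_to_text(pdf), phrase)
    )
```

Matching on whitespace runs rather than single spaces matters here, because extracted PDF text often breaks a phrase across two lines.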

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: excitation, subsplit bayesian networks, binary network training, disentangling, structured prediction energy networks ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "bayesian network" has 29 results

##### NSGA-Net: A Multi-Objective Genetic Algorithm for Neural Architecture Search

tl;dr An efficient multi-objective neural architecture search algorithm using NSGA-II

This paper introduces NSGA-Net, an evolutionary approach for neural architecture search (NAS). NSGA-Net is designed with three goals in mind: (1) a NAS procedure for multiple, possibly conflicting, objectives, (2) efficient exploration and exploitation of the space of potential neural network architectures, and (3) output of a diverse set of network architectures spanning a trade-off frontier of the objectives in a single run. NSGA-Net is a population-based search algorithm that explores a space of potential neural network architectures in three steps, namely, a population initialization step that is based on prior knowledge from hand-crafted architectures, an exploration step comprising crossover and mutation of architectures, and finally an exploitation step that applies the entire history of evaluated neural architectures in the form of a Bayesian Network prior. Experimental results suggest that combining the objectives of minimizing both an error metric and computational complexity, as measured by FLOPS, allows NSGA-Net to find competitive neural architectures near the Pareto front of both objectives on two different tasks, object classification and object alignment. NSGA-Net obtains networks that achieve 3.72% (at 4.5 million FLOP) error on CIFAR-10 classification and 8.64% (at 26.6 million FLOP) error on the CMU-Car alignment task.

##### Context-aware Forecasting for Multivariate Stationary Time-series

tl;dr In order to forecast multivariate stationary time-series we learn embeddings containing contextual features within an RNN; we apply the framework on public transportation data

The domain of time-series forecasting has been extensively studied because it is of fundamental importance in many real-life applications. Weather prediction, traffic flow forecasting or sales are compelling examples of sequential phenomena. Predictive models generally make use of the relations between past and future values. However, in the case of stationary time-series, observed values also drastically depend on a number of exogenous features that can be used to improve forecasting quality. In this work, we propose a change of paradigm which consists in learning such features in embedding vectors within recurrent neural networks. We apply our framework to forecast smart card tap-in logs in the Parisian subway network. Results show that context-embedded models perform quantitatively better in one-step ahead and multi-step ahead forecasting.

##### Directed-Info GAIL: Learning Hierarchical Policies from Unsegmented Demonstrations using Directed Information

tl;dr Learning Hierarchical Policies from Unsegmented Demonstrations using Directed Information

The use of imitation learning to learn a single policy for a complex task that has multiple modes or hierarchical structure can be challenging. In fact, previous work has shown that when the modes are known, learning separate policies for each mode or sub-task can greatly improve the performance of imitation learning. In this work, we discover the interaction between sub-tasks from their resulting state-action trajectory sequences using a directed graphical model. We propose a new algorithm based on the generative adversarial imitation learning framework which automatically learns sub-task policies from unsegmented demonstrations. Our approach maximizes the directed information flow in the graphical model between sub-task latent variables and their generated trajectories. We also show how our approach connects with the existing Options framework, which is commonly used to learn hierarchical policies.

##### ROBUST ESTIMATION VIA GENERATIVE ADVERSARIAL NETWORKS

tl;dr GANs are shown to provide us a new effective robust mean estimate against agnostic contaminations with both statistical optimality and practical tractability.

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GAN, we show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door to computing robust estimators using tools developed for training GANs. In particular, we show that a JS-GAN that uses a neural network discriminator with at least one hidden layer is able to achieve the minimax rate of robust mean estimation under Huber's $\epsilon$-contamination model. Interestingly, the hidden layers of the neural net structure in the discriminator class are shown to be necessary for robust estimation.

##### Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder

tl;dr Differentiable dynamic programming over perturbed input weights with application to semi-supervised VAE

Human annotation for syntactic parsing is expensive, and large resources are available only for a fraction of languages. A question we ask is whether one can leverage abundant unlabeled texts to improve syntactic parsers, beyond just using the texts to obtain more generalisable lexical features (i.e. beyond word embeddings). To this end, we propose a novel latent-variable generative model for semi-supervised syntactic dependency parsing. As exact inference is intractable, we introduce a differentiable relaxation to obtain approximate samples and compute gradients with respect to the parser parameters. Our method (Differentiable Perturb-and-Parse) relies on differentiable dynamic programming over stochastically perturbed edge scores. We demonstrate the effectiveness of our approach with experiments on English, French and Swedish.

##### Learning Procedural Abstractions and Evaluating Discrete Latent Temporal Structure

No tl;dr =[

Clustering methods and latent variable models are often used as tools for pattern mining and discovery of latent structure in time-series data. In this work, we consider the problem of learning procedural abstractions from possibly high-dimensional observational sequences, such as video demonstrations. Given a dataset of time-series, the goal is to identify the latent sequence of steps common to them and label each time-series with the temporal extent of these procedural steps. We introduce a hierarchical Bayesian model called Prism that models the realization of a common procedure across multiple time-series, and can recover procedural abstractions with supervision. We also bring to light two characteristics ignored by traditional evaluation criteria when evaluating latent temporal labelings (temporal clusterings) -- segment structure, and repeated structure -- and develop new metrics tailored to their evaluation. We demonstrate that our metrics improve interpretability and ease of analysis for evaluation on benchmark time-series datasets. Results on benchmark and video datasets indicate that Prism outperforms standard sequence models as well as state-of-the-art techniques in identifying procedural abstractions.

##### A variational Dirichlet framework for out-of-distribution detection

tl;dr A new framework based on variational inference for out-of-distribution detection

With the recent rapid development in deep learning, deep neural networks have been widely adopted in many real-life applications. However, deep neural networks are known to have very little control over their uncertainty for test examples, which can potentially cause very harmful and annoying consequences in practical scenarios. In this paper, we are particularly interested in designing a higher-order uncertainty metric for deep neural networks and investigating its performance on the out-of-distribution detection task proposed by Hendrycks & Gimpel (2016). Our method is based on a variational inference framework where we interpret the output distribution $p(x)$ as a stochastic variable $z$ lying on a simplex of multi-dimensional space and represent the higher-order uncertainty via the entropy of the latent distribution $p(z)$. Under the variational Bayesian framework with a given dataset $D$, we propose to adopt a Dirichlet distribution as the approximate posterior $F_{\theta}(z|x)$ to approach the true posterior distribution $p(z|D)$ by maximizing the evidence lower bound of marginal likelihood. By identifying the over-concentration issue in the Dirichlet framework, we further design a log-scaling smoothing function to avert this issue and greatly increase the robustness of the entropy-based uncertainty measure. Through comprehensive experiments on various datasets and architectures, our proposed variational Dirichlet framework is observed to yield state-of-the-art results for out-of-distribution detection.

##### SpaMHMM: Sparse Mixture of Hidden Markov Models for Graph Connected Entities

tl;dr A method to model the generative distribution of sequences coming from graph connected entities.

We propose a framework to model the distribution of sequential data coming from a set of entities connected in a graph with a known topology. The method is based on a mixture of shared hidden Markov models (HMMs), which are trained in order to exploit the knowledge of the graph structure and in such a way that the obtained mixtures tend to be sparse. Experiments in different application domains demonstrate the effectiveness and versatility of the method.

##### Inference of unobserved event streams with neural Hawkes particle smoothing

No tl;dr =[

Events that we observe in the world may be caused by other, unobserved events. We consider sequences of discrete events in continuous time. When only some of the events are observed, we propose particle smoothing to infer the missing events. Particle smoothing is an extension of particle filtering in which proposed events are conditioned on the future as well as the past. For our setting, we develop a novel proposal distribution that is a type of continuous-time bidirectional LSTM. We use the sampled particles in an approximate minimum Bayes risk decoder that outputs a single low-risk prediction of the missing events. We experiment in multiple synthetic and real domains, modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner, 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events. In particular, particle smoothing consistently improves upon particle filtering, showing the benefit of training a bidirectional proposal distribution.

##### Learning To Simulate

tl;dr We propose an algorithm that automatically adjusts parameters of a simulation engine to generate training data for a neural network such that validation accuracy is maximized.

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.

##### Universal Marginalizer for Amortised Inference and Embedding of Generative Models

No tl;dr =[

Probabilistic graphical models are powerful tools which allow us to formalise our knowledge about the world and reason about its inherent uncertainty. There exist a considerable number of methods for performing inference in probabilistic graphical models; however, they can be computationally costly due to significant time burden or storage requirements, or they lack theoretical guarantees of convergence and accuracy when applied to very large graphical models. We propose the Universal Marginaliser Importance Sampler (UM-IS) -- a hybrid inference scheme that combines the flexibility of a deep neural network trained on samples from the model with the asymptotic guarantees of importance sampling. We show how combining samples drawn from the graphical model with an appropriate masking function allows us to train a single neural network to approximate any of the corresponding conditional marginal distributions, and thus amortise the cost of inference. We demonstrate that the efficiency of importance sampling is significantly improved by using as the proposal distribution samples from the neural network. We also use the embeddings obtained from the proposed neural network and utilise them for different tasks such as clustering, classification and interpretation of relationships between the nodes. Finally, we benchmark the method on a large graph (>1000 nodes), showing that UM-IS outperforms sampling-based methods by a large margin while being computationally efficient.


##### Opportunistic Learning: Budgeted Cost-Sensitive Learning from Data Streams

tl;dr An online algorithm for cost-aware feature acquisition and prediction

In many real-world learning scenarios, features are only acquirable at a cost constrained under a budget. In this paper, we propose a novel approach for cost-sensitive feature acquisition at prediction time. The suggested method acquires features incrementally based on a context-aware feature-value function. We formulate the problem in the reinforcement learning paradigm, and introduce a reward function based on the utility of each feature. Specifically, MC dropout sampling is used to measure expected variations of the model uncertainty, which is used as a feature-value function. Furthermore, we suggest sharing representations between the class predictor and value function estimator networks. The suggested approach is completely online and is readily applicable to stream learning setups. The solution is evaluated on three different datasets including the well-known MNIST dataset as a benchmark as well as two cost-sensitive datasets: Yahoo Learning to Rank and a dataset in the medical domain for diabetes classification. According to the results, the proposed method is able to efficiently acquire features and make accurate predictions.

##### Reducing Overconfident Errors outside the Known Distribution

tl;dr Deep networks are more likely to be confidently wrong when testing on unexpected data. We propose two methods to reduce confident errors on unknown input distributions, and an experimental methodology to study the problem.

Intuitively, unfamiliarity should lead to lack of confidence. In reality, current algorithms often make highly confident yet wrong predictions when faced with unexpected test samples from an unknown distribution different from training. Unlike domain adaptation methods, we cannot gather an "unexpected dataset" prior to test, and unlike novelty detection methods, a best-effort original task prediction is still expected. We propose two simple solutions that reduce overconfident errors of samples from an unknown novel distribution without drastically increasing evaluation time: (1) G-distillation, training an ensemble of classifiers and then distilling it into a single model using both labeled and unlabeled examples, or (2) NCR, reducing prediction confidence based on its novelty detection score. Experimentally, we investigate the overconfidence problem and evaluate our solution by creating "familiar" and "novel" test splits, where "familiar" are identically distributed with training and "novel" are not. We show that our solution yields more appropriate prediction confidences, on familiar and novel data, compared to single models and ensembles distilled on training data only. For example, our G-distillation reduces confident errors in gender recognition by 94% on demographic groups different from the training data.

##### Improved Gradient Estimators for Stochastic Discrete Variables

tl;dr We propose simple ways to reduce bias and complexity of stochastic gradient estimators used for learning distributions over discrete variables.

In many applications we seek to optimize an expectation with respect to a distribution over discrete variables. Estimating gradients of such objectives with respect to the distribution parameters is a challenging problem. We analyze existing solutions including finite-difference (FD) estimators and continuous relaxation (CR) estimators in terms of bias and variance. We show that the commonly used Gumbel-Softmax estimator is biased and propose a simple method to reduce it. We also derive a simpler piece-wise linear continuous relaxation that also possesses reduced bias. We demonstrate empirically that reduced bias leads to a better performance in variational inference and on binary optimization tasks.
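As background for the abstract above, here is a minimal sketch of the standard Gumbel-Softmax relaxation it refers to, written with the Python standard library only. It is not the bias-reduced estimator the paper proposes; the function name `gumbel_softmax_sample` and its parameters are chosen here for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw one relaxed (soft) one-hot sample for a categorical with these logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # clamp away from 0 so log(u) stays finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures smooth it out. The output is a point on the simplex rather than a discrete draw, which is where the bias discussed in the abstract comes from.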

##### Uncertainty-guided Lifelong Learning in Bayesian Networks

tl;dr We formulate lifelong learning in the Bayesian-by-Backprop framework, exploiting the parameter uncertainty in two settings: for pruning network parameters and in importance weight based continual learning.

##### Variational Bayesian Phylogenetic Inference

tl;dr The first variational Bayes formulation of phylogenetic inference, a challenging inference problem over structures with intertwined discrete and continuous components

Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple mechanisms for proposing new states, which hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We approximate the true posterior using an expressive graphical model for tree distributions, called a subsplit Bayesian network, together with appropriate branch length distributions. We train the variational approximation via stochastic gradient ascent and adopt multi-sample based gradient estimators for different latent variables separately to handle the composite latent space of phylogenetic models. We show that our structured variational approximations are flexible enough to provide comparable posterior estimation to MCMC, while requiring less computation due to a more efficient tree exploration mechanism enabled by variational inference. Moreover, the variational approximations can be readily used for further statistical analysis such as marginal likelihood estimation for model comparison via importance sampling. Experiments on both synthetic data and real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.

##### PASS: Phased Attentive State Space Modeling of Disease Progression Trajectories

No tl;dr =[

Disease progression models are instrumental in predicting individual-level health trajectories and understanding disease dynamics. Existing models are capable of providing either accurate predictions of patients’ prognoses or clinically interpretable representations of disease pathophysiology, but not both. In this paper, we develop the phased attentive state space (PASS) model of disease progression, a deep probabilistic model that captures complex representations for disease progression while maintaining clinical interpretability. Unlike Markovian state space models which assume memoryless dynamics, PASS uses an attention mechanism to induce "memoryful" state transitions, whereby repeatedly updated attention weights are used to focus on past state realizations that best predict future states. This gives rise to complex, non-stationary state dynamics that remain interpretable through the generated attention weights, which designate the relationships between the realized state variables for individual patients. PASS uses phased LSTM units (with time gates controlled by parametrized oscillations) to generate the attention weights in continuous time, which enables handling irregularly-sampled and potentially missing medical observations. Experiments on data from a real-world cohort of patients show that PASS successfully balances the tradeoff between accuracy and interpretability: it demonstrates superior predictive accuracy and learns insightful individual-level representations of disease progression.

##### Learning Latent Superstructures in Variational Autoencoders for Deep Multidimensional Clustering

tl;dr We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features.

We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features. In general, our superstructure is a tree structure of multiple super latent variables and it is automatically learned from data. When there is only one latent variable in the superstructure, our model reduces to one that assumes the latent features to be generated from a Gaussian mixture model. We call our model the latent tree variational autoencoder (LTVAE). Whereas previous deep learning methods for clustering produce only one partition of data, LTVAE produces multiple partitions of data, each being given by one super latent variable. This is desirable because high dimensional data usually have many different natural facets and can be meaningfully partitioned in multiple ways.

##### Variational Autoencoders with Jointly Optimized Latent Dependency Structure

tl;dr We propose a method for learning latent dependency structure in variational autoencoders.

We propose a method for learning the dependency structure between latent variables in deep latent variable models. Our general modeling and inference framework combines the complementary strengths of deep generative models and probabilistic graphical models. In particular, we express the latent variable space of a variational autoencoder (VAE) in terms of a Bayesian network with a learned, flexible dependency structure. The network parameters, variational parameters as well as the latent topology are optimized simultaneously with a single variational objective. Inference is formulated via a sampling procedure that produces expectations over latent variable structures and incorporates top-down and bottom-up reasoning over latent variable values. We validate our framework in extensive experiments on MNIST, Omniglot, and CIFAR-10. Comparisons to state-of-the-art structured variational autoencoder baselines show improvements in terms of the expressiveness of the learned model.

##### Learning with Reflective Likelihoods

tl;dr We identify a peculiarity in maximum likelihood learning that causes input collapse and propose a new learning criterion for better representation learning.

Machine learning systems have achieved state-of-the-art results in many domains. They are usually trained using the maximum likelihood principle. However, maximum likelihood learning can lead to poor learned representations of high dimensional data. For example, this is manifested in deep generative latent variable models where the latent variables and their associated observations are driven to be independent of each other. We identify a peculiarity in maximum likelihood learning that causes this problem of poor learned representations. We then propose a new learning criterion for better representation learning. The proposed criterion relies on simultaneously maximizing the likelihood of the data and minimizing what we term the reflective likelihood of the data. We study this new criterion both theoretically and empirically and show improved performance on image classification under imbalance and text modeling with deep generative latent variable models.

##### Feed-forward Propagation in Probabilistic Neural Networks with Categorical and Max Layers

tl;dr Approximating mean and variance of the NN output over noisy input / dropout / uncertain parameters. Analytic approximations for argmax, softmax and max layers.

Probabilistic Neural Networks take into account various sources of stochasticity: input noise, dropout, stochastic neurons, and parameter uncertainties modeled as random variables. In this paper we revisit the feed-forward propagation method that allows one to estimate for each neuron its mean and variance w.r.t. the mentioned sources of stochasticity. In contrast, standard NNs propagate only point estimates, discarding the uncertainty. Methods that also propagate the variance have been proposed by several authors in different contexts. The presented view attempts to clarify the assumptions and derivation behind such methods, relate them to classical NNs and broaden the scope of their applicability. The main technical innovations are new posterior approximations for argmax and max-related transforms, which allow the method to be applied in networks with softmax and max-pooling layers as well as leaky ReLU activations. We evaluate the accuracy of the approximation and suggest a simple calibration. Applying the method to networks with dropout allows for faster training and gives improved test likelihoods without the need for sampling.

##### Minimal Random Code Learning: Getting Bits Back from Compressed Model Parameters

tl;dr This paper proposes an effective method to compress neural networks based on recent results in information theory.

While deep neural networks are a highly successful model class, their large memory footprint puts considerable strain on energy consumption, communication bandwidth, and storage requirements. Consequently, model size reduction has become an utmost goal in deep learning. A typical approach is to train a set of deterministic weights, while applying certain techniques such as pruning and quantization, in order that the empirical weight distribution becomes amenable to Shannon-style coding schemes. However, as shown in this paper, relaxing weight determinism and using a full variational distribution over weights allows for more efficient coding schemes and consequently higher compression rates. In particular, following the classical bits-back argument, we encode the network weights using a random sample, requiring only a number of bits corresponding to the Kullback-Leibler divergence between the sampled variational distribution and the encoding distribution. By imposing a constraint on the Kullback-Leibler divergence, we are able to explicitly control the compression rate, while optimizing the expected loss on the training set. The employed encoding scheme can be shown to be close to the optimal information-theoretical lower bound, with respect to the employed variational family. Our method sets a new state of the art in neural network compression, as it strictly dominates previous approaches in a Pareto sense: On the benchmarks LeNet-5/MNIST and VGG-16/CIFAR-10, our approach yields the best test performance for a fixed memory budget, and vice versa, it achieves the highest compression rates for a fixed test performance.
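To make the "bits corresponding to the Kullback-Leibler divergence" in the abstract above concrete, here is a small worked sketch. It assumes, purely for illustration, that both the variational and the encoding distribution are factorized Gaussians, with a zero-mean encoding distribution; the abstract does not state this, and the function names are hypothetical.

```python
import math


def gaussian_kl_bits(mu_q, sigma_q, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(0, sigma_p^2) ) for a single weight, in bits."""
    kl_nats = (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + mu_q ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )
    # Convert from nats to bits.
    return kl_nats / math.log(2.0)


def coding_cost_bits(mus, sigmas, sigma_p):
    """Total KL in bits for independent weights: the per-weight terms add up."""
    return sum(gaussian_kl_bits(m, s, sigma_p) for m, s in zip(mus, sigmas))
```

Under these assumptions a weight whose variational distribution equals the encoding distribution costs zero bits, and the cost grows as the mean moves away from zero or the variance shrinks.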

##### Using Ontologies To Improve Performance In Massively Multi-label Prediction

tl;dr We propose a new method for using ontology information to improve performance on massively multi-label prediction/classification problems.

Massively multi-label prediction/classification problems arise in environments like health-care or biology where it is useful to make very precise predictions. One challenge with massively multi-label problems is that there is often a long-tailed frequency distribution for the labels, resulting in few positive examples for the rare labels. We propose a solution to this problem by modifying the output layer of a neural network to create a Bayesian network of sigmoids which takes advantage of ontology relationships between the labels to help share information between the rare and the more common labels. We apply this method to the two massively multi-label tasks of disease prediction (ICD-9 codes) and protein function prediction (Gene Ontology terms) and obtain significant improvements in per-label AUROC and average precision.

tl;dr We design an adversarial training method for Bayesian neural networks, showing a much stronger defense against white-box adversarial attacks

We present a new algorithm to train a robust neural network against adversarial attacks. Our algorithm is motivated by the following two ideas. First, although recent work has demonstrated that fusing randomness can improve the robustness of neural networks (Liu, 2017), we noticed that adding noise blindly to all the layers is not the optimal way to incorporate randomness. Instead, we model randomness under the framework of Bayesian Neural Network (BNN) to formally learn the posterior distribution of models in a scalable way. Second, we formulate the mini-max problem in BNN to learn the best model distribution under adversarial attacks, leading to an adversarial-trained Bayesian neural net. Experimental results demonstrate that the proposed algorithm achieves state-of-the-art performance under strong attacks. On CIFAR-10 with VGG network, our model leads to 14% accuracy improvement compared with adversarial training (Madry, 2017) and random self-ensemble (Liu, 2017) under PGD attack with 0.035 distortion, and the gap becomes even larger on a subset of ImageNet.

##### Auto-Encoding Knockoff Generator for FDR Controlled Variable Selection

tl;dr This paper provides a model-free method for generating knockoffs, which is a critical step in the Model-X procedure for choosing important variables with any supervised learning method under rigorous FDR control.

A new statistical procedure (Candès, 2018) has provided a way to identify important factors using any supervised learning method while controlling for FDR. This line of research has shown great potential to expand the horizon of machine learning methods beyond the task of prediction, to serve the broader need of scientific research for interpretable findings. However, the lack of a practical and flexible method to generate knockoffs remains the major obstacle to wide application of the Model-X procedure. This paper fills in the gap by proposing a model-free knockoff generator which approximates the correlation structure between features through latent variable representation. We demonstrate that our proposed method can achieve FDR control and better power than two existing methods in various simulated settings and a real data example for finding mutations associated with drug resistance in HIV-1 patients.

##### Practical lossless compression with latent variables using bits back coding

tl;dr We do lossless compression of large image datasets using a VAE and beat existing compression algorithms.

Deep latent variable models have seen recent success in many data domains. Lossless compression is an application of these models which, despite having the potential to be highly useful, has yet to be implemented in a practical manner. We present 'Bits Back with ANS' (BB-ANS), a scheme to perform lossless compression with latent variable models. The scheme is an extension of bits back coding, and permits compression of large datasets at close to the optimal rate. We demonstrate this scheme by using it to compress the MNIST dataset with a variational auto-encoder model (VAE), achieving compression rates superior to standard methods with only a simple VAE. Given that the scheme is highly amenable to parallelization, we conclude that with a sufficiently high quality generative model this scheme could be used to achieve substantial improvements in compression rate with acceptable running time. We make our implementation available open source at https://github.com/bits-back/bits-back .
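As a rough illustration of why bits back coding can approach the optimal rate, the sketch below computes the expected net code length for a single symbol under a toy discrete latent variable model. It is not the BB-ANS implementation: the function `bits_back_net_length` and the toy probabilities are assumptions made for this example. The idea is that the sender spends bits to encode `x` given `z` and `z` under the prior, but gets back the bits used to sample `z` from the approximate posterior `q(z|x)`, so the net cost equals the negative ELBO in bits.

```python
import math

def bits_back_net_length(p_z, p_x_given_z, q_z_given_x):
    """Expected net bits to code one observed x with bits back coding.

    p_z[z]          prior probability of latent z
    p_x_given_z[z]  likelihood of the observed x under latent z
    q_z_given_x[z]  approximate posterior probability of z given x
    Returns E_q[log2 q(z|x) - log2 p(z) - log2 p(x|z)], the negative ELBO.
    """
    total = 0.0
    for z, q in enumerate(q_z_given_x):
        if q > 0:
            total += q * (math.log2(q) - math.log2(p_z[z])
                          - math.log2(p_x_given_z[z]))
    return total

# Toy model with two latent states (made-up numbers).
p_z = [0.5, 0.5]
p_x_given_z = [0.8, 0.2]                 # so p(x) = 0.5, i.e. 1 bit is optimal
exact_posterior = [0.8, 0.2]             # p(z|x) computed by Bayes' rule
crude_posterior = [0.5, 0.5]             # a deliberately poor approximation

optimal_bits = bits_back_net_length(p_z, p_x_given_z, exact_posterior)
crude_bits = bits_back_net_length(p_z, p_x_given_z, crude_posterior)
```

With the exact posterior the net length matches the ideal `-log2 p(x)`, while a poorer posterior costs extra bits equal to the KL gap, which is consistent with the abstract's point that a higher quality generative model should improve the compression rate.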

##### Learning Factorized Multimodal Representations

tl;dr We propose a model to learn factorized multimodal representations that are discriminative, generative, and interpretable.

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

##### Layerwise Recurrent Autoencoder for General Real-world Traffic Flow Forecasting

tl;dr We propose the Layerwise Recurrent Autoencoder, which effectively models spatiotemporal dependencies, for general traffic flow forecasting.

Accurate spatio-temporal traffic forecasting is a fundamental task with wide applications in city management, transportation, and finance. Several factors make this important task challenging: (1) the maze-like road network makes spatial dependencies complex; (2) traffic-time relationships introduce non-linear temporal complexity; (3) the difficulty of flow forecasting grows with the size of the road network. Prevalent state-of-the-art methods have mainly been evaluated on datasets covering relatively small districts and short time spans, e.g., a dataset collected within one city over several months. To forecast traffic flow across a wide area and overcome these challenges, we propose a forecasting model called Layerwise Recurrent Autoencoder (LRA), in which a three-layer stacked autoencoder (SAE) architecture is used to obtain temporal traffic correlations and a recurrent neural network (RNN) model is used for prediction. A convolutional neural network (CNN) model is also employed to extract spatial traffic information within the transport topology for more accurate prediction. To the best of our knowledge, there is no general and effective method for traffic flow prediction over a large area covering a group of cities. Experiments on such large-scale real-world traffic datasets show the superiority of the proposed model, and a smaller dataset is used to demonstrate its generality. Evaluations show that our model outperforms state-of-the-art baselines by 6% to 15%.