# Search ICLR 2019

Searching papers submitted to ICLR 2019 can be painful. You might want to know which paper uses technique X, dataset D, or cites author ME. Unfortunately, search is limited to titles, abstracts, and keywords, missing the actual contents of the paper. This Frankensteinian search has returned from 2018 to help scour the papers of ICLR by ripping out their souls using pdftotext.

Good luck! Warranty's not included :)

Need random search inspiration..? Grab something from the list of all tags! ^_^
How about: deep model learning, unseen class categorization, structured objects, feature representation., winograd convolution ..?

Sanity Disclaimer: As you stare at the continuous stream of ICLR and arXiv papers, don't lose confidence or feel overwhelmed. This isn't a competition, it's a search for knowledge. You and your work are valuable and help carve out the path for progress in our field :)

### "convolution neural network" has 35 results

##### Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes

tl;dr Finite-width SGD trained CNNs vs. infinitely wide fully Bayesian CNNs. Who wins?

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating an FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers. Surprisingly, in the absence of pooling layers, the corresponding GP is identical for CNNs with and without weight sharing. This means that translation equivariance in SGD-trained finite CNNs has no corresponding property in the Bayesian treatment of the infinite-width limit -- a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that in some scenarios, while the performance of trained finite CNNs becomes similar to that of the corresponding GP with increasing channel count, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs. Finally, we introduce a Monte Carlo method to estimate the GP corresponding to a NN architecture, even in cases where the analytic form has too many terms to be computationally feasible.

##### AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference

tl;dr Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation.

Wide adoption of complex RNN based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge distillation based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.

##### Geometric Operator Convolutional Neural Network

tl;dr Traditional image processing algorithms are combined with Convolutional Neural Networks，a new neural network.

The Convolutional Neural Network (CNN) has been successfully applied in many fields during recent decades; however it lacks the ability to utilize prior domain knowledge when dealing with many realistic problems. We present a framework called Geometric Operator Convolutional Neural Network (GO-CNN) that uses domain knowledge, wherein the kernel of the first convolutional layer is replaced with a kernel generated by a geometric operator function. This framework integrates many conventional geometric operators, which allows it to adapt to a diverse range of problems. Under certain conditions, we theoretically analyze the convergence and the bound of the generalization errors between GO-CNNs and common CNNs. Although the geometric operator convolution kernels have fewer trainable parameters than common convolution kernels, the experimental results indicate that GO-CNN performs more accurately than common CNN on CIFAR-10/100. Furthermore, GO-CNN reduces dependence on the amount of training examples and enhances adversarial stability.

##### Transfer Learning for Sequences via Learning to Collocate

tl;dr Transfer learning for sequence via learning to align cell-level information across domains.

Transfer learning aims to solve the data sparsity for a specific domain by applying information of another domain. Given a sequence (e.g. a natural language sentence), the transfer learning, usually enabled by recurrent neural network (RNN), represent the sequential information transfer. RNN uses a chain of repeating cells to model the sequence data. However, previous studies of neural network based transfer learning simply transfer the information across the whole layers, which are unfeasible for seq2seq and sequence labeling. Meanwhile, such layer-wise transfer learning mechanisms also lose the fine-grained cell-level information from the source domain. In this paper, we proposed the aligned recurrent transfer, ART, to achieve cell-level information transfer. ART is in a recurrent manner that different cells share the same parameters. Besides transferring the corresponding information at the same position, ART transfers information from all collocated words in the source domain. This strategy enables ART to capture the word collocation across domains in a more flexible way. We conducted extensive experiments on both sequence labeling tasks (POS tagging, NER) and sentence classification (sentiment analysis). ART outperforms the state-of-the-arts over all experiments.

##### TTS-GAN: a generative adversarial network for style modeling in a text-to-speech system

tl;dr a generative adversarial network for style modeling in a text-to-speech system

The modeling of style when synthesizing natural human speech from text has been the focus of significant attention. Some state-of-the-art approaches train an encoder-decoder network on paired text and audio samples (x_txt, x_aud) by encouraging its output to reconstruct x_aud. The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud. Unfortunately, modeling style in TTS is somewhat under-determined and training models with a reconstruction loss alone is insufficient to disentangle content and style from other factors of variation. In this work, we introduce TTS-GAN, an end-to-end TTS model that offers enhanced content-style disentanglement ability and controllability. We achieve this by combining a pairwise training procedure, an adversarial game, and a collaborative game into one training scheme. The adversarial game concentrates the true data distribution, and the collaborative game minimizes the distance between real samples and generated samples in both the original space and the latent space. As a result, TTS-GAN delivers a highly controllable generator, and a disentangled representation. Benefiting from the separate modeling of style and content, TTS-GAN can generate human fidelity speech that satisfies the desired style conditions. TTS-GAN achieves start-of-the-art results across multiple tasks, including style transfer (content and style swapping), emotion modeling, and identity transfer (fitting a new speaker's voice).

##### Low-Cost Parameterizations of Deep Convolutional Neural Networks

tl;dr This paper introduces efficient and economic parametrizations of convolutional neural networks motivated by partial differential equations

Convolutional Neural Networks (CNNs) filter the input data using a series of spatial convolution operators with compactly supported stencils and point-wise nonlinearities. Commonly, the convolution operators couple features from all channels. For wide networks, this leads to immense computational cost in the training of and prediction with CNNs. In this paper, we present novel ways to parameterize the convolution more efficiently, aiming to decrease the number of parameters in CNNs and their computational complexity. We propose new architectures that use a sparser coupling between the channels and thereby reduce both the number of trainable weights and the computational cost of the CNN. Our architectures arise as new types of residual neural network (ResNet) that can be seen as discretizations of a Partial Differential Equations (PDEs) and thus have predictable theoretical properties. Our first architecture involves a convolution operator with a special sparsity structure, and is applicable to a large class of CNNs. Next, we present an architecture that can be seen as a discretization of a diffusion reaction PDE, and use it with three different convolution operators. We outline in our experiments that the proposed architectures, although considerably reducing the number of trainable weights, yield comparable accuracy to existing CNNs that are fully coupled in the channel dimension.

##### RotDCF: Decomposition of Convolutional Filters for Rotation-Equivariant Deep Networks

No tl;dr =[

Explicit encoding of group actions in deep features makes it possible for convolutional neural networks (CNNs) to handle global deformations of images, which is critical to success in many vision tasks. This paper proposes to decompose the convolutional filters over joint steerable bases across the space and the group geometry simultaneously, namely a rotation-equivariant CNN with decomposed convolutional filters (RotDCF). This decomposition facilitates computing the joint convolution, which is proved to be necessary for the group equivariance. It significantly reduces the model size and computational complexity while preserving performance, and truncation of the bases expansion serves implicitly to regularize the filters. On datasets involving in-plane and out-of-plane object rotations, RotDCF deep features demonstrate greater robustness and interpretability than regular CNNs. The stability of the equivariant representation to input variations is also proved theoretically. The RotDCF framework can be extended to groups other than rotations, providing a general approach which achieves both group equivariance and representation stability at a reduced model size.

##### A Resizable Mini-batch Gradient Descent based on a Multi-Armed Bandit

tl;dr An optimization algorithm that explores various batch sizes based on probability and automatically exploits successful batch size which minimizes validation loss.

Determining the appropriate batch size for mini-batch gradient descent is always time consuming as it often relies on grid search. This paper considers a resizable mini-batch gradient descent (RMGD) algorithm based on a multi-armed bandit for achieving best performance in grid search by selecting an appropriate batch size at each epoch with a probability defined as a function of its previous success/failure. This probability encourages exploration of different batch size and then later exploitation of batch size with history of success. At each epoch, the RMGD samples a batch size from its probability distribution, then uses the selected batch size for mini-batch gradient descent. After obtaining the validation loss at each epoch, the probability distribution is updated to incorporate the effectiveness of the sampled batch size. The RMGD essentially assists the learning process to explore the possible domain of the batch size and exploit successful batch size. Experimental results show that the RMGD achieves performance better than the best performing single batch size. Furthermore, it, obviously, attains this performance in a shorter amount of time than grid search. It is surprising that the RMGD achieves better performance than grid search.

##### Generative Adversarial Models for Learning Private and Fair Representations

tl;dr We present Generative Adversarial Privacy and Fairness (GAPF), a data-driven framework for learning private and fair representations with certified privacy/fairness guarantees

We present Generative Adversarial Privacy and Fairness (GAPF), a data-driven framework for learning private and fair representations. GAPF leverages recent advancements in adversarial learning to allow a data holder to learn "universal" representations that decouple a set of sensitive attributes from the rest of the dataset. Under GAPF, finding the optimal privacy mechanism is formulated as a constrained minimax game between a private/fair encoder and an adversary. We show that for appropriately chosen adversarial loss functions, GAPF provides privacy guarantees against strong information-theoretic adversaries and enforces demographic parity. We also evaluate the performance of GAPF on multi-dimensional Gaussian mixture models and real datasets, and show how a designer can certify that representations learned under an adversary with a fixed architecture perform well against more complex adversaries.

##### Robust Text Classifier on Test-Time Budgets

tl;dr Modular framework for document classification and data aggregation technique for making the framework robust to various distortion, and noise and focus only on the important words.

In this paper, we design a generic framework for learning a robust text classification model that achieves accuracy comparable to standard full models under test-time budget constraints. We take a different approach from existing methods and learn to dynamically delete a large fraction of unimportant words by a low-complexity selector such that the high-complexity classifier only needs to process a small fraction of important words. In addition, we propose a new data aggregation method to train the classifier, allowing it to make accurate predictions even on fragmented sequence of words. Our end-to-end method achieves state-of-the-art performance while its computational complexity scales linearly with the small fraction of important words in the whole corpus. Besides, a single deep neural network classifier trained by our framework can be dynamically tuned to different budget levels at inference time.

##### Convolutional Neural Networks combined with Runge-Kutta Methods

No tl;dr =[

A convolutional neural network for image classification can be constructed mathematically since it is inspired by the ventral stream in visual cortex which can be regarded as a multi-period dynamical system. In this paper, a novel approach is proposed to construct network models from the dynamical systems view. Since a pre-activation residual network can be deemed an approximation of a time-dependent dynamical system using the Euler method, higher order Runge-Kutta methods (RK methods) can be utilized to build network models in order to achieve higher accuracy. The model constructed in such a way is referred to as the Runge-Kutta Convolutional Neural Network (RKNet). RK methods also provide an interpretation of Dense Convolutional Networks (DenseNets) and Convolutional Neural Networks with Alternately Updated Clique (CliqueNets) from the dynamical systems view. The proposed methods are evaluated on the benchmark datasets: CIFAR-10/100 and ImageNet. The experimental results are consistent with the theoretical properties of RK methods and support the dynamical systems interpretation. Moreover, the experimental results show that the RKNets are superior to the state-of-the-art network models on CIFAR-10 and be comparable with them on CIFAR-100 and ImageNet.

##### MILE: A Multi-Level Framework for Scalable Graph Embedding

tl;dr A generic framework to scale existing graph embedding techniques to large graphs.

Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework – a generic methodology allowing contemporary graph embedding methods to scale to large graphs. MILE repeatedly coarsens the graph into smaller ones using a hybrid matching technique to maintain the backbone structure of the graph. It then applies existing embedding methods on the coarsest graph and refines the embeddings to the original graph through a novel graph convolution neural network that it learns. The proposed MILE framework is agnostic to the underlying graph embedding techniques and can be applied to many existing graph embedding methods without modifying them. We employ our framework on several popular graph embedding techniques and conduct embedding for real-world graphs. Experimental results on five large-scale datasets demonstrate that MILE significantly boosts the speed (order of magnitude) of graph embedding while also often generating embeddings of better quality for the task of node classification. MILE can comfortably scale to a graph with 9 million nodes and 40 million edges, on which existing methods run out of memory or take too long to compute on a modern workstation.

##### Visual Semantic Navigation using Scene Priors

No tl;dr =[

How do humans navigate to target objects in novel scenes? Do we use the semantic/functional priors we have built over years to efficiently search and navigate? For example, to search for mugs, we search cabinets near the coffee machine and for fruits we try the fridge. In this work, we focus on incorporating semantic priors in the task of semantic navigation. We propose to use Graph Convolutional Networks for incorporating the prior knowledge into a deep reinforcement learning framework. The agent uses the features from the knowledge graph to predict the actions. For evaluation, we use the AI2-THOR framework. Our experiments show how semantic knowledge improves the performance significantly. More importantly, we show improvement in generalization to unseen scenes and/or objects.

##### Theoretical and Empirical Study of Adversarial Examples

No tl;dr =[

Many techniques are developed to defend against adversarial examples at scale. So far, the most successful defenses generate adversarial examples during each training step and add them to the training data. Yet, this brings significant computational overhead. In this paper, we investigate defenses against adversarial attacks. First, we propose feature smoothing, a simple data augmentation method with little computational overhead. Essentially, feature smoothing trains a neural network on virtual training data as an interpolation of features from a pair of samples, with the new label remaining the same as the dominant data point. The intuition behind feature smoothing is to generate virtual data points as close as adversarial examples, and to avoid the computational burden of generating data during training. Our experiments on MNIST and CIFAR10 datasets explore different combinations of known regularization and data augmentation methods and show that feature smoothing with logit squeezing performs best for both adversarial and clean accuracy. Second, we propose an unified framework to understand the connections and differences among different efficient methods by analyzing the biases and variances of decision boundary. We show that under some symmetrical assumptions, label smoothing, logit squeezing, weight decay, mix up and feature smoothing all produce an unbiased estimation of the decision boundary with smaller estimated variance. All of those methods except weight decay are also stable when the assumptions no longer hold.

##### UNSUPERVISED CONVOLUTIONAL NEURAL NETWORKS FOR ACCURATE VIDEO FRAME INTERPOLATION WITH INTEGRATION OF MOTION COMPONENTS

No tl;dr =[

Optical flow and video frame interpolation are considered as a chicken-egg problem such that one problem affects the other and vice versa. This paper presents a deep neural network that integrates the flow network into the frame interpolation problem, with end-to-end learning. The proposed approach exploits the relationship between the two problems for quality enhancement of interpolation frames. Unlike recent convolutional neural networks, the proposed approach learns motions from natural video frames without graphical ground truth flows for training. This makes the network learn from extensive data and improve the performance. The motion information from the flow network guides interpolator networks to be trained to synthesize the interpolated frame accurately from motion scenarios. In addition, diverse datasets to cover various challenging cases that previous interpolations usually fail in is used for comparison. In all experimental datasets, the proposed network achieves better performance than state-of-art CNN based interpolations. With Middebury benchmark, compared with the top-ranked algorithm, the proposed network reduces an average interpolation error by about 9.3%. The proposed interpolation is ranked the 1st in Standard Deviation (SD) interpolation error, the 2nd in Average Interpolation Error among over 150 algorithms listed in the Middlebury interpolation benchmark.

##### ACTRCE: Augmenting Experience via Teacher’s Advice

tl;dr Combine language goal representation with hindsight experience replays.

Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failure experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR's adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representation, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representation failed to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate it is crucial to use hindsight advice to solve challenging tasks, but we also found that little amount of hindsight advice is sufficient for the learning to take off, showing the practical aspect of the method.

##### Unsupervised Document Representation using Partition Word-Vectors Averaging

tl;dr A simple unsupervised method for multi-sentense-document embedding using partition based word vectors averaging that achieve results comparable to sophisticated models.

Learning effective document-level representation is essential in many important NLP tasks such as document classification, summarization, etc. Recent research has shown that simple weighted averaging of word vectors is an effective way to represent sentences, often outperforming complicated seq2seq neural models in many tasks. While it is desirable to use the same method to represent documents as well, unfortunately, the effectiveness is lost when representing long documents involving multiple sentences. One reason for this degradation is due to the fact that a longer document is likely to contain words from many different themes (or topics), and hence creating a single vector while ignoring all the thematic structure is unlikely to yield an effective representation of the document. This problem is less acute in single sentences and other short text fragments where presence of a single theme/topic is most likely. To overcome this problem, in this paper we present PSIF, a partitioned word averaging model to represent long documents. P-SIF retains the simplicity of simple weighted word averaging while taking a document's thematic structure into account. In particular, P-SIF learns topic-specific vectors from a document and finally concatenates them all to represent the overall document. Through our experiments over multiple real-world datasets and tasks, we demonstrate PSIF's effectiveness compared to simple weighted averaging and many other state-of-the-art baselines. We also show that PSIF is particularly effective in representing long multi-sentence documents. We will release PSIF's embedding source code and data-sets for reproducing results.

##### DynCNN: An Effective Dynamic Architecture on Convolutional Neural Network for Surveillance Videos

tl;dr An optimizing architecture on CNN for surveillance videos with 75.7% reduction on FLOPs and 2.2 times improvement on FPS

The large-scale surveillance video analysis becomes important as the development of intelligent city. The heavy computation resources neccessary for state-of-the-art deep learning model makes the real-time processing hard to be implemented. This paper exploits the characteristic of high scene similarity generally existing in surveillance videos and proposes dynamic convolution reusing the previous feature map to reduce the computation amount. We tested the proposed method on 45 surveillance videos with various scenes. The experimental results show that dynamic convolution can reduce up to 75.7% of FLOPs while preserving the precision within 0.7% mAP. Furthermore, the dynamic convolution can enhance the processing time up to 2.2 times.

##### Integral Pruning on Activations and Weights for Efficient Neural Networks

tl;dr This work advances DNN compression beyond the weights to the activations by integrating the activation pruning with the weight pruning.

With the rapidly scaling up of deep neural networks (DNNs), extensive research studies on network model compression such as weight pruning have been performed for efficient deployment. This work aims to advance the compression beyond the weights to the activations of DNNs. We propose the Integral Pruning (IP) technique which integrates the activation pruning with the weight pruning. Through the learning on the different importance of neuron responses and connections, the generated network, namely IPnet, balances the sparsity between activations and weights and therefore further improves execution efficiency. The feasibility and effectiveness of IPnet are thoroughly evaluated through various network models with different activation functions and on different datasets. With <0.5% disturbance on the testing accuracy, IPnet saves 71.1% ~ 96.35% of computation cost, compared to the original dense models with up to 5.8x and 10x reductions in activation and weight numbers, respectively.

##### Optimal margin Distribution Network

tl;dr This paper presents a deep neural network embedding a loss function in regard to the optimal margin distribution, which alleviates the overfitting problem theoretically and empirically.

Recent research about margin theory has proved that maximizing the minimum margin like support vector machines does not necessarily lead to better performance, and instead, it is crucial to optimize the margin distribution. In the meantime, margin theory has been used to explain the empirical success of deep network in recent studies. In this paper, we present ODN (the Optimal margin Distribution Network), a network which embeds a loss function in regard to the optimal margin distribution. We give a theoretical analysis for our method using the PAC-Bayesian framework, which confirms the significance of the margin distribution for classification within the framework of deep networks. In addition, empirical results show that the ODN model always outperforms the baseline cross-entropy loss model consistently across different regularization situations. And our ODN model also outperforms the other three loss models in generalization task through limited training data.

No tl;dr =[

Transfer learning is a widely used method to build high performing computer vision models. In this paper, we study the efficacy of transfer learning by examining how the choice of data impacts performance. We find that more pre-training data does not always help, and transfer performance depends on a judicious choice of pre-training data. These findings are important given the continued increase in dataset sizes. We further propose domain adaptive transfer learning, a simple and effective pre-training method using importance weights computed based on the target dataset. Our methods achieve state-of-the-art results on multiple fine-grained classification datasets and are well-suited for use in practice.

##### Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization

tl;dr We describe a dynamic sparse reparameterization technique that allow training of a small sparse network to generalize on par with, or better than, a full-sized dense model compressed to the same size.

Modern deep neural networks are highly overparameterized, and often of huge sizes. A number of post-training model compression techniques, such as distillation, pruning and quantization, can reduce the size of network parameters by a substantial fraction with little loss in performance. However, training a small network of the post-compression size de novo typically fails to reach the same level of accuracy achieved by compression of a large network, leading to a widely-held belief that gross overparameterization is essential to effective learning. In this work, we argue that this is not necessarily true. We describe a dynamic sparse reparameterization technique that closed the performance gap between a model compressed by pruning and a model of the post-compression size trained de novo. We applied our method to training deep residual networks and showed that it outperformed existing static reparameterization techniques, yielding the best accuracy for a given parameter budget for training. Compared to other dynamic reparameterization methods that reallocate non-zero parameters during training, our approach broke free from a few key limitations and achieved much better performance at lower computational cost. Our method is not only of practical value for training under stringent memory constraints, but also potentially informative to theoretical understanding of generalization properties of overparameterized deep neural networks.

##### KNOWLEDGE DISTILL VIA LEARNING NEURON MANIFOLD

tl;dr A new knowledge distill method for transfer learning

Although deep neural networks show their extraordinary power in various tasks, they are not feasible for deploying such large models on embedded systems due to high computational cost and storage space limitation. The recent work knowledge distillation (KD) aims at transferring model knowledge from a well-trained teacher model to a small and fast student model which can significantly help extending the usage of large deep neural networks on portable platform. In this paper, we show that, by properly defining the neuron manifold of deep neuron network (DNN), we can significantly improve the performance of student DNN networks through approximating neuron manifold of powerful teacher network. To make this, we propose several novel methods for learning neuron manifold from DNN model. Empowered with neuron manifold knowledge, our experiments show the great improvement across a variety of DNN architectures and training data. Compared with other KD methods, our Neuron Manifold Transfer (NMT) has best transfer ability of the learned features.

##### Multi-class classification without multi-class labels

No tl;dr =[

This work presents a new strategy for multi-class classification that requires no class-specific labels, but instead leverages pairwise similarity between examples, which is a weaker form of annotation. The proposed method, meta classification learning, optimizes a binary classifier for pairwise similarity prediction and through this process learns a multi-class classifier as a submodule. We formulate this approach, present a probabilistic graphical model for it, and derive a surprisingly simple loss function that can be used to learn neural network-based models. We then demonstrate that this same framework generalizes to the supervised, unsupervised cross-task, and semi-supervised settings. Our method is evaluated against state of the art in all three learning paradigms and shows a superior or comparable accuracy, providing evidence that learning multi-class classification without multi-class labels is a viable learning option.

##### Graph2Seq: Graph to Sequence Learning with Attention-Based Neural Networks

tl;dr Graph to Sequence Learning with Attention-Based Neural Networks

The celebrated \emph{Sequence to Sequence learning (Seq2Seq)} technique and its numerous variants achieve excellent performance on many tasks. However, many machine learning tasks have inputs naturally represented as graphs; existing Seq2Seq models face a significant challenge in achieving accurate conversion from graph form to the appropriate sequence. To address this challenge, we introduce a general end-to-end graph-to-sequence neural encoder-decoder architecture that maps an input graph to a sequence of vectors and uses an attention-based LSTM method to decode the target sequence from these vectors. Our method first generates the node and graph embeddings using an improved graph-based neural network with a novel aggregation strategy to incorporate edge direction information in the node embeddings. We further introduce a novel attention mechanism that aligns node embeddings and the decoding sequence to better cope with large graphs. Experimental results on bAbI, Shortest Path, and Natural Language Generation tasks demonstrate that our model achieves state-of-the-art performance and significantly outperforms existing Seq2Seq and Tree2Seq models; using the proposed aggregation strategy, the model can converge rapidly to the optimal performance.

##### Adaptive Estimators Show Information Compression in Deep Neural Networks

tl;dr We developed robust mutual information estimates for DNNs and used them to observe compression in networks with non-saturating activation functions

To improve how neural networks function it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant for the task. However, empirical evidence for this theory is conflicting, as compression was only observed when the networks used a saturating activation functions. In contrast, networks with non-saturating activation functions achieved comparable levels of task performance but did not show compression. In this paper we developed a more robust mutual information estimation technique, that adapts to hidden activity of neural networks and produces more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, firstly, we show that saturation of the activation function is not required for compression, and the amount of compression varies between different activation functions. We also found that there is a large amount of variation in compression between different network initializations. Secondary, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.

##### Search-Guided, Lightly-supervised Training of Structured Prediction Energy Networks

No tl;dr =[

In structured output prediction tasks, labeling ground-truth training output is often expensive. However, for many tasks, even when the true output is unknown, we can evaluate predictions using a scalar reward function, which may be easily assembled from human knowledge or non-differentiable pipelines. But searching through the entire output space to find the best output with respect to this reward function is typically intractable. In this paper, we instead use efficient truncated randomized search in this reward function to train structured prediction energy networks (SPENs), which provide efficient test-time inference using gradient-based search on a smooth, learned representation of the score landscape, and have previously yielded state-of-the-art results in structured prediction. In particular, this truncated randomized search in the reward function yields previously unknown local improvements, providing effective supervision to SPENs, avoiding their traditional need for labeled training data.

##### HANDLING CONCEPT DRIFT IN WIFI-BASED INDOOR LOCALIZATION USING REPRESENTATION LEARNING

tl;dr We introduce an augmented robust feature space for streaming wifi data that is capable of tackling concept drift for indoor localization

We outline the problem of concept drifts for time series data. In this work, we analyze the temporal inconsistency of streaming wireless signals in the context of device-free passive indoor localization. We show that data obtained from WiFi channel state information (CSI) can be used to train a robust system capable of performing room level localization. One of the most challenging issues for such a system is the movement of input data distribution to an unexplored space over time, which leads to an unwanted shift in the learned boundaries of the output space. In this work, we propose a phase and magnitude augmented feature space along with a standardization technique that is little affected by drifts. We show that this robust representation of the data yields better learning accuracy and requires less number of retraining.

##### Morph-Net: An Universal Function Approximator

tl;dr Using mophological operation (dilation and erosion) we have defined a class of network which can approximate any continious function.

Artificial neural networks are built on the basic operation of linear combination and non-linear activation function. Theoretically this structure can approximate any continuous function with three layer architecture. But in practice learning the parameters of such network can be hard. Also the choice of activation function can greatly impact the performance of the network. In this paper we are proposing to replace the basic linear combination operation with non-linear operations that do away with the need of additional non-linear activation function. To this end we are proposing the use of elementary morphological operations (dilation and erosion) as the basic operation in neurons. We show that these networks (Denoted as Morph-Net) with morphological operations can approximate any smooth function requiring less number of parameters than what is necessary for normal neural networks. The results show that our network perform favorably when compared with similar structured network. We have carried out our experiments on MNIST, Fashion-MNIST, CIFAR10 and CIFAR100.

##### Hint-based Training for Non-Autoregressive Translation

tl;dr We develop a training algorithm for non-autoregressive machine translation models, achieving comparable accuracy to strong autoregressive baselines, but one order of magnitude faster in inference.

Machine translation is an important real-world application, and neural network-based AutoRegressive Translation (ART) models have achieved very promising accuracy. Due to the unparallelizable nature of the autoregressive factorization, ART models have to generate tokens one by one during decoding and thus suffer from high inference latency. Recently, Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time. However, they could only achieve inferior accuracy compared with ART models. To improve the accuracy of NART models, in this paper, we propose to leverage the hints from a well-trained ART model to train the NART model. We define two hints for the machine translation task: hints from hidden states and hints from word alignments, and use such hints to regularize the optimization of NART models. Experimental results show that the NART model trained with hints could achieve significantly better translation performance than previous NART models on several tasks. In particular, for the WMT14 En-De and De-En task, we obtain BLEU scores of 25.20 and 29.52 respectively, which largely outperforms the previous non-autoregressive baselines. It is even comparable to a strong LSTM-based ART model (24.60 on WMT14 En-De), but one order of magnitude faster in inference.

##### Pearl: Prototype lEArning via Rule Lists

tl;dr a method combining rule list learning and prototype learning

Deep neural networks have demonstrated promising classification performance on many healthcare applications. However, the interpretability of those models are often lacking. On the other hand, classical interpretable models such as rule lists or decision trees do not lead to the same level of accuracy as deep neural networks. Despite their interpretable structures, the resulting rules are often too complex to be interpretable (due to the potentially large depth of rule lists). In this work, we present PEARL, short for Prototype lEArning via Rule Lists, which iteratively use rule lists to guide a neural network to learn representative data prototypes. The resulting prototype neural network provides accurate prediction, and the prediction can be easily explained by prototype and its guiding rule lists. Thanks to the prediction power of neural networks, the rule lists defining prototypes are more concise and hence provide better interpretability. On two real-world electronic healthcare records (EHR) datasets, PEARL consistently outperforms all baselines, achieving performance improvement over conventional rule learning by up to 28% and over prototype learning by up to 3%. Experimental results also show the resulting interpretation of PEARL is simpler than the standard rule learning.

##### Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach

No tl;dr =[

We study the problem of attacking machine learning models in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., C&W or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only two current approaches are based on random walk on the boundary (Brendel et al., 2017) and random trials to evaluate the loss function (Ilyas et al., 2018), which require lots of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method (Nesterov & Spokoiny, 2017), we are able to bound the number of iterations needed for our algorithm to achieve stationary points under mild assumptions. We demonstrate that our proposed method outperforms the previous stochastic approaches to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).

##### Deep Learning 3D Shapes Using Alt-az Anisotropic 2-Sphere Convolution

tl;dr A method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on 2-sphere.

The ground-breaking performance obtained by deep convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to extend it for 3D geometric tasks. One of the main challenge in applying CNNs to 3D shape analysis is how to define a natural convolution operator on non-euclidean surfaces. In this paper, we present a method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on 2-sphere. A cascade set of geodesic disk filters rotate on the 2-sphere and collect multi-level spherical patterns to extract non-trivial features for various 3D shape analysis tasks. We demonstrate theoretically and experimentally that our proposed method has the possibility to bridge the gap between 2D images and 3D shapes with the desired rotation equivariance/invariance, and its effectiveness is evaluated in applications of non-rigid/ rigid shape classification and shape retrieval.

##### Context Dependent Modulation of Activation Function

tl;dr We propose a modification to traditional Artificial Neural Networks motivated by the biology of neurons to enable the shape of the activation function to be context dependent.

We propose a modification to traditional Artificial Neural Networks (ANNs), which provides the ANNs with new aptitudes motivated by biological neurons. Biological neurons work far beyond linearly summing up synaptic inputs and then transforming the integrated information. A biological neuron change firing modes accordingly to peripheral factors (e.g., neuromodulators) as well as intrinsic ones. Our modification connects a new type of ANN nodes, which mimic the function of biological neuromodulators and are termed modulators, to enable other traditional ANN nodes to adjust their activation sensitivities in run-time based on their input patterns. In this manner, we enable the slope of the activation function to be context dependent. This modification produces statistically significant improvements in comparison with traditional ANN nodes in the context of Convolutional Neural Networks and Long Short-Term Memory networks.

##### A Main/Subsidiary Network Framework for Simplifying Binary Neural Networks

tl;dr we define the filter-level pruning problem for binary neural networks for the first time and propose method to solve it.

To reduce memory footprint and run-time latency, techniques such as neural net-work pruning and binarization have been explored separately. However, it is un-clear how to combine the best of the two worlds to get extremely small and efficient models. In this paper, we, for the first time, define the filter-level pruning problem for binary neural networks, which cannot be solved by simply migrating existing structural pruning methods for full-precision models. A novel learning-based approach is proposed to prune filters in our main/subsidiary network frame-work, where the main network is responsible for learning representative features to optimize the prediction performance, and the subsidiary component works as a filter selector on the main network. To avoid gradient mismatch when training the subsidiary component, we propose a layer-wise and bottom-up scheme. We also provide the theoretical and experimental comparison between our learning-based and greedy rule-based methods. Finally, we empirically demonstrate the effectiveness of our approach applied on several binary models, including binarizedNIN, VGG-11, and ResNet-18, on various image classification datasets. For bi-nary ResNet-18 on ImageNet, we use 78.6% filters but can achieve slightly better test error 49.87% (50.02%-0.15%) than the original model