Cheatsheet for Backpropagation
Mind Map of Reinforcement Learning
27 Mar 2017 

11 May/08 Jun 
Deep learning goes far beyond CNNs, RNNs, etc. In these two seminars, Yunchuan and I introduced several recent techniques for sequence (sentence) generation, including sampling approaches, reinforcement learning, and variational autoencoding.
23/27 Apr 
Trans* is a family of methods for learning the vector representations of entities and relations in a knowledge base (or a knowledge graph; don't ask me the difference). It all started from h + r ≈ t.
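A minimal sketch of the h + r ≈ t idea, assuming the Euclidean distance as the scoring function. The embeddings and entity names below are hypothetical and chosen by hand; in practice they are learned.

```python
import math

def transe_score(h, r, t):
    """Distance ||h + r - t||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical 2-d embeddings, hand-picked for illustration only.
paris = [1.0, 2.0]
capital_of = [0.5, -1.0]
france = [1.5, 1.0]
germany = [4.0, 3.0]

good = transe_score(paris, capital_of, france)   # h + r lands exactly on t
bad = transe_score(paris, capital_of, germany)   # h + r is far from t
print(good, bad)
```

A true triple should score lower than a corrupted one; training typically enforces this with a margin between the two.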
04 Apr 2016 
Different from traditional LMs where we typically decompose the joint probability of a corpus into factors of the form p(w | previous words), we can choose a split word and model the backward and forward subsequences in a sentence. The model is useful in constrained sentence generation. [See preprint paper.]

04 Apr 2016 
(Courtesy of Yunchuan) A combination of convolutional neural networks (CNNs) and Monte Carlo tree search (MCTS).

27 Mar 2016 
A series of works from Noah's Ark Lab, Huawei. My understanding is that they design a (complicated) neural network to mimic human behaviors: modeling a sentence, querying a table/KB, selecting a field/column, selecting a row, copying something, etc. Challenges of end-to-end neuralized learning include differentiability, supervision, and scalability.

26 Mar 2016 
Neural science and Alzheimer's disease 
(Courtesy of Yu Wu) Ankyrin G (AnkG) plays a critical role at the axon initial segment (AIS). AnkG downregulation impairs the selective filtering machinery at the AIS. Impaired AIS filtering might underlie functional defects in APP/PS1 neurons. Disclaimer: I am not an expert in neural science.

08 Mar 2016 
A combination of neural networks and game theory. Imagine that we have two agents, Generator and Discriminator: G generates fake samples, while D tries to distinguish these disguised fake samples from real ones. The objective is min_{G} max_{D} V(D, G).
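For reference, the value function V(D, G) in its commonly cited form is written out below. The symbols p_data (the real data distribution) and p_z (the noise prior fed to G) do not appear in the blurb above and are introduced here as assumptions.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

D is rewarded for assigning high probability to real samples and low probability to generated ones, while G is rewarded for making D(G(z)) large.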

28 Oct 2015 
Variational Autoencoders 
(By Yunchuan) Variational autoencoders give a distribution of hidden variables, z, while traditional autoencoders compute z in a deterministic way. But why is it useful in practice? 
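The contrast can be sketched in a few lines. The "encoders" here are hypothetical stand-ins (a fixed linear map and a fixed log-variance), not trained networks; the only point is that the variational encoder returns a sample from a distribution over z rather than a single value.

```python
import math
import random

def deterministic_encode(x):
    """Traditional autoencoder: z is a fixed function of x (toy stand-in)."""
    return 2.0 * x

def variational_encode(x, rng):
    """Variational autoencoder: the encoder outputs a Gaussian over z.

    mu and log_var are hypothetical fixed values standing in for network outputs.
    z is drawn with the reparameterization z = mu + sigma * eps, eps ~ N(0, 1).
    """
    mu = 2.0 * x
    log_var = -1.0
    eps = rng.gauss(0.0, 1.0)
    z = mu + math.exp(0.5 * log_var) * eps
    return z, mu, log_var

rng = random.Random(0)
z1, mu, log_var = variational_encode(1.0, rng)
z2, _, _ = variational_encode(1.0, rng)
print(deterministic_encode(1.0), z1, z2)  # same input, different samples of z
```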

28 Oct 2015 
Including EasyAdapt, instance weighting, and structural correspondence learning. I am, in fact, curious about adaptation in neural network-based settings. However, NNs are adaptable by nature through incremental/multitask training. Therefore, there is little point, as far as I can currently see, in NN adaptation. Nevertheless, I have conducted a series of comparative studies to shed more light on transferring knowledge in neural networks [pdf (EMNLP16)]. NEW: One may also be interested in frustratingly easy domain adaptation for neural networks [pdf].

21 Oct 2015 
Let x be visible variables, and z be invisible (hidden) ones. Estimating p(x) is usually difficult because we have to sum/integrate over z. A variational lower bound peaks when z ~ p(z | x), which is oftentimes intractable. A mixture of Gaussians, for example, assumes z in a parametric form, i.e., Gaussian. In VI in general, we still have to restrict the form of z, but not in a parametric way. A typical approximation is factorization, that is, p(z) = \prod_{i} p(z_{i}).
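The statement about where the lower bound peaks follows from the standard decomposition below. The symbol q(z), denoting the approximating distribution over hidden variables, is notation introduced here and is not used in the blurb above.

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)} \big[ \log p(x, z) - \log q(z) \big]}_{\mathcal{L}(q)}
  + \mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big)
```

Since the KL term is non-negative, \mathcal{L}(q) \le \log p(x), with equality exactly when q(z) = p(z \mid x). The factorized (mean-field) approximation restricts q to the form \prod_{i} q(z_{i}).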

14 Oct 2015 
Attention-based Networks
(By Hao Peng) The encoding-decoding model opens a new era of sequence generation. It is unrealistic, however, to encode a very long input sequence to a fixed vector. The attention mechanism is designed to aggregate information over the input sequence by an adaptable weighted sum. Selected Papers: NIPS'14, pp. 3104-3112; ICLR'15; ICML'15; EMNLP'15, pp. 319-389; EMNLP'15, pp. 1412-1421
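The "adaptable weighted sum" can be sketched as follows. This is a minimal illustration that assumes dot-product scoring between a decoder query and each encoder state; the selected papers use various scoring functions, and the vectors below are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return (weights, context); scores are dot products (an assumption)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Hypothetical encoder states for a 3-token input, and a decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, encoder_states)
print(weights, context)
```

The weights are recomputed for every decoding step, so the context vector adapts to what the decoder currently needs instead of being one fixed summary of the input.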

14 Oct 2015 
We wrap up discourse analysis with PCFG-based discourse parsing, which requires probabilistic context-free grammar in general.

23 Sep 2015 
We shall also explore various NLP research topics; discourse analysis, discussed in this seminar, is the first step in expanding our horizon. Note that the slides are nothing but snapshots of papers in the proceedings, and in fact have little substance.

22 Jul 2015 
I am a tyro in variational inference. Please refer to Ch 10, Pattern Recognition and Machine Learning. 

Ch 1: Losses, Risks and Decision Principles 


Resources: 
My textual digest, highlighting some meaningful philosophical discussion in the textbook.

Ch 3: Prior Information and Subjective Probability [digest, note, slide by Dr. Yu] 

Frequentist vs Bayesian  
30 Apr 2015 
1. (By Yangyang Lu) A guided tour to selected papers. 

Equipped with Bayesian logistic regression and GP in general, we find GP classification is easy except for the seemingly overwhelming formulas.

29 Apr 2015 
(Courtesy of Yunchuan Chen) God does not play dice, but we humans do. As inference in many machine learning models is intractable, we have to resort to some approximations, among which are sampling methods. The idea of sampling is straightforward: if we want to estimate p(Head) of a coin, one approach is to go through all mathematical and physical details, which does not seem to be a good idea; an alternative is to toss the coin multiple times, giving a fairly good estimate of p(Head). However, how to design efficient sampling algorithms is a $64,000,000 question.
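The coin example can be simulated directly. This is a minimal sketch: the true bias of 0.3 and the function name are assumptions made up for the demonstration, and the "coin" is a pseudo-random number generator.

```python
import random

def estimate_p_head(true_p, n_tosses, seed=0):
    """Estimate p(Head) by tossing a simulated coin n_tosses times."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_tosses) if rng.random() < true_p)
    return heads / n_tosses

# Hypothetical coin with p(Head) = 0.3.
estimate = estimate_p_head(true_p=0.3, n_tosses=100_000)
print(estimate)
```

The error of such an estimate shrinks roughly as one over the square root of the number of tosses, which is why efficiency becomes the hard part for more complicated distributions.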

23 Apr 2015 
Linear Classification
We first wrap up our discussion of Gaussian processes by introducing hyperparameter learning in kernels. Then we introduce linear classification models, including discriminant functions, probabilistic generative/discriminative models, and Bayesian logistic regression (with special interest). Linear classification is easy: my good old friend, logistic regression, always serves as a baseline method in various applications. Through a systematic study, however, we can grasp the main idea behind a range of machine learning techniques. This seminar also precedes our future discussion on GP classification.

17 Apr 2015 
Sum Product Networks 
(By Weizhuo Li) On some theoretical aspects of SPNs, e.g., normalization, decomposability, etc. Weizhuo also highlighted a '11 NIPS paper on deep architectures vis-à-vis shallow ones.

16 Apr 2015 
(Courtesy of Yangyang Lu) 

09 Apr 2015 
Gaussian Processes
In this seminar, we introduce Gaussian process regression,
which extends Bayesian linear regression with kernels. However, as
far as I can tell, the two models are not equivalent, even
with finite basis functions. If I am wrong, please feel free to
tell me.

14 Jan 2015 
(Courtesy of Weizhuo Li) Sum product networks (SPNs) are a way of decomposing joint distributions. Most inference is tractable w.r.t. the size of the SPN. However, it seems that graphical models, if converted to SPNs, have exponential numbers of nodes in SPNs. The story confirms the "no free lunch theorem." As in general no perfect "I-map" exists for most real-world applications, what we have to do is to capture important aspects by ignoring unimportant ones.

7 Jan 2015 
One of the core ideas in deep learning is to "do things wrongly and hope they work." G. Hinton introduced the CD-k algorithm for fast training of restricted Boltzmann machines; he also introduced layer-wise RBM pretraining for neural networks, opening an era of deep learning.

19 Dec 2014 
Copulas

Given marginal distributions, the joint distribution is not unique because of all possible kinds of dependencies among variables. A copula is defined as a joint distribution on the unit cube with uniform marginals. It can (just can) capture nontrivial dependencies and link marginals with joint distributions. Sklar's theorem says Copula(Marginals) = Joint.
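Spelled out, Sklar's theorem reads as follows. The symbols F (joint CDF), F_i (marginal CDFs), and C (the copula) are notation introduced here for the statement.

```latex
F(x_1, \dots, x_d) = C\big( F_1(x_1), \dots, F_d(x_d) \big)
```

Such a C exists for every joint distribution, and it is unique when all the marginals F_i are continuous.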
