Cheatsheet for Backpropagation

Notes on Markov Networks

Mind Map of Reinforcement Learning


  • 2010--, Software Institute, Peking University
  • 2013--2015, SIG ML/NLP, Software Institute, Peking University
  • 2015.11--, Baidu Inc. with Rui Yan

  • 27 Mar 2017

    Neural Programming

    11 May/08 Jun

    Sequence Generation
    [1, 2]

    Deep learning goes far beyond CNNs, RNNs, etc. In these two seminars, Yunchuan and I introduced several recent techniques for sequence (sentence) generation, including sampling approaches, reinforcement learning, and variational autoencoding.

    23/27 Apr


    Trans* is a family of methods for learning vector representations of entities and relations in a knowledge base (or a knowledge graph---don't ask me the difference). It all started from ||h+r-t||.
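As a toy illustration of the scoring idea (the embeddings below are made-up 3-dimensional numbers, not trained ones), a triple (h, r, t) is plausible when h + r lands near t:

```python
import math

def transe_score(h, r, t):
    """L2 distance ||h + r - t||: a small score means a plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Made-up toy embeddings: here h + r equals t exactly.
h = [0.1, 0.2, 0.3]   # head entity
r = [0.4, 0.1, -0.1]  # relation
t = [0.5, 0.3, 0.2]   # tail entity

good = transe_score(h, r, t)  # ≈ 0, plausible
bad = transe_score(t, r, h)   # clearly larger, implausible
```

Training then pushes scores of observed triples below those of corrupted ones.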

    04 Apr 2016

    B/F language models

    Different from traditional LMs where we typically decompose the joint probability of a corpus as p(w|previous words), we can choose a split word and model the backward and forward subsequences in a sentence. The model is useful in constrained sentence generation. [See preprint paper.]
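A minimal sketch of the split-word factorization. The probability tables below are made-up numbers purely for illustration; the actual model parameterizes the backward and forward conditionals with neural LMs:

```python
import math

def bf_log_prob(sentence, split, p_split, p_bw, p_fw):
    """log p(sentence) = log p(w_s) + backward log-probs + forward log-probs,
    where w_s is the chosen split word."""
    logp = math.log(p_split[sentence[split]])
    for i in range(split - 1, -1, -1):          # backward: condition on the right
        logp += math.log(p_bw[(sentence[i], sentence[i + 1])])
    for i in range(split + 1, len(sentence)):   # forward: condition on the left
        logp += math.log(p_fw[(sentence[i], sentence[i - 1])])
    return logp

sent = ["the", "cat", "sat"]
lp = bf_log_prob(sent, 1,
                 p_split={"cat": 0.2},          # made-up split-word prior
                 p_bw={("the", "cat"): 0.5},    # made-up backward conditional
                 p_fw={("sat", "cat"): 0.4})    # made-up forward conditional
# lp == log(0.2 * 0.5 * 0.4)
```

Because generation starts at the split word, a constrained word can be placed anywhere in the output sentence.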

    04 Apr 2016


    (Courtesy of Yunchuan) A combination of convolutional neural networks (CNNs) and Monte Carlo tree search (MCTS).

    27 Mar 2016

    Neural Symbolics

    A series of works from Noah's Ark Lab, Huawei. My understanding is to design a (complicated) neural network that mimics human behaviors: modeling a sentence, querying a table/KB, selecting a field/column, selecting a row, copying something, etc. Challenges of end-to-end neuralized learning include differentiability, supervision, and scalability.

    26 Mar 2016

    Neuroscience and Alzheimer's disease

    (Courtesy of Yu Wu) Ankyrin G (AnkG) plays a critical role at the axon initial segment (AIS). AnkG downregulation impairs the selective filtering machinery at the AIS. Impaired AIS filtering might underlie functional defects in APP/PS1 neurons. Disclaimer: I am not an expert in neuroscience.

    08 Mar 2016

    Generative Adversarial Nets

    A combination of neural networks and game theory. Imagine that we have two agents, a generator G and a discriminator D: G generates fake samples, while D tries to tell the fake samples from real ones. The objective is min_G max_D V(D,G).
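The value function can be made concrete on toy 1-D samples. Everything below is hand-crafted for illustration (D is a fixed sigmoid, the samples are made-up numbers), not a trained model:

```python
import math

def value(D, real_samples, fake_samples):
    """V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], estimated on samples."""
    v_real = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    v_fake = sum(math.log(1 - D(x)) for x in fake_samples) / len(fake_samples)
    return v_real + v_fake

# A hand-crafted discriminator that believes large x is real.
D = lambda x: 1 / (1 + math.exp(-4 * (x - 0.5)))

real = [0.9, 1.0, 1.1]           # real data cluster near 1
fake = [0.0, 0.1, 0.2]           # a poor generator's outputs cluster near 0
v_bad_G = value(D, real, fake)

# A better generator mimics the real distribution, driving V down for D.
better_fake = [0.85, 0.95, 1.05]
v_good_G = value(D, real, better_fake)
assert v_good_G < v_bad_G        # G minimizes exactly what D maximizes
```

Training alternates gradient steps on D (ascending V) and G (descending V).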

    28 Oct 2015

    Variational Autoencoders

    (By Yunchuan) Variational autoencoders give a distribution of hidden variables, z, while traditional autoencoders compute z in a deterministic way. But why is it useful in practice?
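A minimal 1-D sketch of the point that z is a distribution rather than a deterministic code, via the reparameterization trick (standard library only; the (mu, log_var) pair stands in for an encoder's output):

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1);
    the randomness lives in eps, so gradients could flow through mu and sigma."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

# A traditional autoencoder would output z = mu deterministically; the VAE
# encoder outputs a distribution (mu, log_var) and samples from it.
rng = random.Random(0)
samples = [reparameterize(0.0, 0.0, rng) for _ in range(10000)]
mean_z = sum(samples) / len(samples)                # ≈ 0  (= mu)
var_z = sum(z * z for z in samples) / len(samples)  # ≈ 1  (= exp(log_var))
```

One practical payoff: the stochastic z gives a proper latent distribution to sample from at generation time, which a deterministic code does not.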

    28 Oct 2015

    Domain Adaptation

    Including EasyAdapt, instance weighting, and structural correspondence learning. I am, in fact, curious about adaptation in neural network-based settings. However, NNs are adaptable by nature through incremental/multi-task training. Therefore, there is little point, as far as I can currently see, in NN adaptation. Nevertheless, I have conducted a series of comparative studies to shed more light on transferring knowledge in neural networks [pdf (EMNLP-16)]. NEW: One may also be interested in frustratingly easy domain adaptation for neural networks [pdf].

    21 Oct 2015

    Variational Inference (again)

    Let x be visible variables, and z be invisible (hidden) ones. Estimating p(x) is usually difficult because we have to sum/integrate over z. The variational lower bound peaks when the approximating distribution q(z) equals the posterior p(z|x), which is oftentimes intractable. The mixture of Gaussians, for example, assumes a parametric form for z, i.e., Gaussian. In VI in general, we still have to restrict the form of q, but not in a parametric way. A typical approximation is factorization (mean field), that is, q(z) = ∏_i q(z_i).
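The lower bound mentioned above can be written out explicitly (the standard ELBO derivation via Jensen's inequality):

```latex
\log p(x) = \log \int p(x,z)\,\mathrm{d}z
          = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x,z)}{q(z)}\right]
          \ge \mathbb{E}_{q(z)}\!\left[\log p(x,z) - \log q(z)\right]
          =: \mathcal{L}(q),
```

and the gap is exactly a KL divergence, \(\log p(x) - \mathcal{L}(q) = \mathrm{KL}\!\left(q(z)\,\|\,p(z|x)\right)\), which is why the bound peaks when \(q(z) = p(z|x)\). Mean-field VI maximizes \(\mathcal{L}\) over factorized \(q(z) = \prod_i q(z_i)\).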

    14 Oct 2015

    Attention-based Networks

    (By Hao Peng) The encoder-decoder model opens a new era of sequence generation. It is unrealistic, however, to encode a very long input sequence into a fixed vector. The attention mechanism is designed to aggregate information over the input sequence by an adaptive weighted sum. Selected papers: NIPS'14, pp. 3104--3112; ICLR'15; ICML'15; EMNLP'15, pp. 319--389; EMNLP'15, pp. 1412--1421.
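The adaptive weighted sum in miniature. The keys/values below are made-up toy vectors, and the score is a plain dot product (real models typically learn the scoring function):

```python
import math

def attention(query, keys, values):
    """weights = softmax(query . key_i); context = sum_i weights_i * value_i."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

# Toy encoder states: the query matches the second key most strongly.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context, weights = attention([0.0, 5.0], keys, values)
# weights[1] dominates, so context stays close to values[1]
```

Because the weights depend on the query, the decoder aggregates different parts of the input at each generation step instead of squeezing everything into one fixed vector.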

    14 Oct 2015

    Discourse Parsing with PCFG

    We wrap up discourse analysis with PCFG-based discourse parsing, which builds on probabilistic context-free grammars.

    23 Sep 2015

    Discourse Analysis

    We shall also explore various NLP research topics, and discourse analysis, discussed in this seminar, is our first step in expanding the horizon. Notice that the slide is nothing but snapshots of papers in the proceedings and in fact has little substance.

    22 Jul 2015

    Variational Inference

    I am a tyro in variational inference. Please refer to Ch 10, Pattern Recognition and Machine Learning.

    Bad news: Thursday evening's seminars are suspended temporarily.
    Good news: I am reading Statistical Decision Theory and Bayesian Analysis, by James O. Berger (1985). The following lists some hopefully useful materials.

    Ch 1: Losses, Risks and Decision Principles



    My textual digest, highlighting some meaningful philosophy discussion in the textbook.
    My written note, mostly derived from the textbook with remarks.
    Slide, by Dr. Yu, who was the instructor of my undergraduate course Probability Theory and Statistics. I was always agitated after his lectures.

    Ch 2: Utilities and Losses [digest, note, slide by Dr. Yu]

    Ch 3: Prior Information and Subjective Probability [digest, note, slide by Dr. Yu]

    Frequentist vs Bayesian

    30 Apr 2015

    1. (By Yangyang Lu) A guided tour to selected papers.
    2. Gaussian processes for classification. Ref: Ch. 6.4.5, 6.4.6, Pattern Recognition and Machine Learning.

    Equipped with Bayesian logistic regression and GPs in general, we find GP classification easy, except for the seemingly overwhelming formulas.

    29 Apr 2015

    Sampling methods

    (Courtesy of Yunchuan Chen) God does not play dice, but we humans do. As inference in many machine learning models is intractable, we have to resort to approximations, among which are sampling methods. The idea of sampling is straightforward---if we want to estimate p(Head) of a coin, one approach is to work through all the mathematical and physical details, which does not seem to be a good idea; an alternative is to toss the coin multiple times, giving a fairly good estimate of p(Head). However, how to design efficient sampling algorithms is a $64,000,000 question.
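The coin example in plain Python (a Monte Carlo estimate; the true bias 0.7 is a made-up number):

```python
import random

def estimate_p_head(p_true, n, seed=0):
    """Toss a coin with bias p_true n times; estimate p(Head) by the frequency."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < p_true)
    return heads / n

est = estimate_p_head(0.7, 100_000)  # converges to 0.7 as n grows
```

The estimate's error shrinks like 1/sqrt(n); the hard part, as noted above, is designing samplers that remain efficient when drawing from the target distribution itself is nontrivial.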

    23 Apr 2015

    Linear Classification
    Ref: Ch. 4, Ch. 6.4, Pattern Recognition and Machine Learning

    We first wrap up our discussion of Gaussian processes by introducing hyperparameter learning in kernels. Then we introduce linear classification models, including discriminant functions, probabilistic generative/discriminative models, and Bayesian logistic regression (of special interest). Linear classification is easy---my good old friend, logistic regression, always serves as a baseline method in various applications. Through a systematic study, however, we can grasp the main ideas behind a range of machine learning techniques. This seminar also precedes our future discussion on GP classification.
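The good old baseline in a few lines: logistic regression trained by batch gradient descent on made-up, linearly separable 1-D data:

```python
import math

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

def train_logreg(xs, ys, lr=0.5, epochs=200):
    """Batch gradient descent on 1-D logistic regression with parameters (w, b).
    The gradient of the cross-entropy loss is (sigmoid(wx+b) - y) * x."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up toy data: class 1 for positive x, class 0 for negative x.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
# sigmoid(w * x + b) is now confidently near 1 for large x, near 0 for small x
```

The same gradient form, with a Gaussian prior over w, is the starting point for the Bayesian treatment via the Laplace approximation.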

    17 Apr 2015

    Sum Product Networks

    (By Weizhuo Li) On some theoretical aspects of SPNs, e.g., normalization, decomposability, etc. Weizhuo also highlighted an NIPS'11 paper on deep architectures vis-a-vis shallow ones.

    16 Apr 2015

    Memory Networks

    (Courtesy of Yangyang Lu)

    09 Apr 2015

    Gaussian Processes
    +Bayesian linear regression

    In this seminar, we introduce Gaussian process regression, which extends Bayesian linear regression with kernels. As far as I can see, however, the two models are not equivalent, even with finite basis functions. If I am wrong, please feel free to tell me.
    It was really an awesome seminar, filled with a whole bunch of food, drinks, and also fruitful discussion. [See photos 1 2 3.]
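A minimal GP regression sketch: posterior mean with an RBF kernel on two made-up training points, using the closed-form 2x2 inverse so the standard library suffices:

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential kernel k(x1, x2) = exp(-(x1-x2)^2 / (2 l^2))."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

def gp_posterior_mean(x_star, xs, ys, noise=1e-6):
    """Posterior mean k_*^T (K + noise*I)^{-1} y, hard-coded for 2 points."""
    a = rbf(xs[0], xs[0]) + noise
    b = rbf(xs[0], xs[1])
    c = rbf(xs[1], xs[0])
    d = rbf(xs[1], xs[1]) + noise
    det = a * d - b * c                      # closed-form 2x2 inverse
    alpha = [( d * ys[0] - b * ys[1]) / det,
             (-c * ys[0] + a * ys[1]) / det]
    return rbf(x_star, xs[0]) * alpha[0] + rbf(x_star, xs[1]) * alpha[1]

xs, ys = [0.0, 2.0], [1.0, -1.0]             # made-up observations
m0 = gp_posterior_mean(0.0, xs, ys)          # ≈ 1.0: interpolates the data
m2 = gp_posterior_mean(2.0, xs, ys)          # ≈ -1.0
```

With nearly zero noise the posterior mean interpolates the observations; the kernel, not a fixed set of basis functions, determines how it behaves in between.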

    14 Jan 2015

    Sum Product Networks

    (Courtesy of Weizhuo Li) Sum product networks (SPNs) are a way of decomposing joint distributions. Most inference is tractable w.r.t. the size of the SPN. However, it seems that graphical models, if converted to SPNs, may have an exponential number of nodes. The story confirms the "no free lunch theorem." As in general no perfect "I-map" exists for most real-world applications, what we have to do is capture the important aspects while ignoring the unimportant ones.
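A tiny, hand-specified SPN over two binary variables: a root sum node over two decomposable product nodes (all weights and Bernoulli parameters below are made-up):

```python
def spn_prob(x1, x2):
    """Evaluate the SPN bottom-up. Leaves are univariate Bernoullis; product
    nodes multiply over disjoint scopes (decomposability); the root sum node
    is a mixture with weights summing to 1 (completeness + normalization)."""
    # product node 1: X1 ~ Bern(0.9), X2 ~ Bern(0.8)
    p1 = (0.9 if x1 else 0.1) * (0.8 if x2 else 0.2)
    # product node 2: X1 ~ Bern(0.2), X2 ~ Bern(0.1)
    p2 = (0.2 if x1 else 0.8) * (0.1 if x2 else 0.9)
    return 0.6 * p1 + 0.4 * p2   # root sum node

# A valid SPN defines a proper distribution: the four states sum to 1.
total = sum(spn_prob(a, b) for a in (0, 1) for b in (0, 1))
```

Marginalizing a variable just replaces its leaves with 1, so (marginal) inference costs one bottom-up pass, linear in the network size; the catch, as noted above, is that a compact SPN may not exist for every distribution.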

    7 Jan 2015

    Deep Belief Nets

    One of the core ideas in deep learning is to "do things wrongly and hope they work." G. Hinton introduced the CD-k algorithm for fast training of restricted Boltzmann machines; he also introduced layer-wise RBM pretraining for neural networks, opening an era of deep learning.
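A minimal CD-1 sketch for a tiny binary RBM, trained on a single made-up pattern. This is a toy (biases omitted for brevity, one Gibbs step, probabilities used in the weight update), not Hinton's full recipe:

```python
import math
import random

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

def cd1_update(W, v0, lr, rng):
    """One CD-1 step: v0 -> h0 -> v1 -> h1, then
    W += lr * (positive statistics - negative statistics)."""
    n_v, n_h = len(W), len(W[0])
    ph0 = [sigmoid(sum(v0[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    h0 = [1 if rng.random() < p else 0 for p in ph0]          # sample hidden
    pv1 = [sigmoid(sum(h0[j] * W[i][j] for j in range(n_h))) for i in range(n_v)]
    v1 = [1 if rng.random() < p else 0 for p in pv1]          # "reconstruction"
    ph1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])

rng = random.Random(0)
W = [[0.01 * rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
v0 = [1, 0, 1]                       # the single training pattern
for _ in range(500):
    cd1_update(W, v0, 0.1, rng)

# After training, reconstruct: the "off" unit should come out less likely.
ph = [sigmoid(sum(v0[i] * W[i][j] for i in range(3))) for j in range(2)]
pv = [sigmoid(sum(ph[j] * W[i][j] for j in range(2))) for i in range(3)]
```

CD-k is "wrong" in that it truncates the Gibbs chain after k steps instead of sampling from the model's equilibrium distribution, yet it works well in practice.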

    19 Dec 2014

    Ref: Ch. 4.6, Statistical Pattern Recognition

    Given marginal distributions, the joint distribution is not unique because of all possible kinds of dependencies among variables. A copula is defined as a joint distribution on the unit cube with uniform marginals. It can (just can) capture nontrivial dependencies and link marginals with joint distributions. Sklar's theorem says: Joint = Copula(Marginals).
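A Gaussian-copula sketch of this idea: correlated normals pushed through the standard normal CDF land on the unit square with uniform marginals but nontrivial dependence (rho = 0.9 is a made-up parameter):

```python
import math
import random

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_copula_sample(rho, rng):
    """One (u1, u2) pair: each marginal is Uniform(0, 1), while the
    dependence between the two is controlled by the correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return phi(z1), phi(z2)

rng = random.Random(0)
pairs = [gaussian_copula_sample(0.9, rng) for _ in range(20000)]
mean_u1 = sum(u for u, _ in pairs) / len(pairs)                  # ≈ 0.5
cov = sum((u - 0.5) * (v - 0.5) for u, v in pairs) / len(pairs)  # clearly > 0
```

Plugging any inverse marginal CDFs into (u1, u2) then yields a joint distribution with exactly those marginals, which is Sklar's theorem in action.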