Introduction
Logistic Regression (LR) is a classical model for classification which is widely used in Computer Vision, NLP, and Recommender Systems. It is simple, effective, and adaptable to both linear and nonlinear problems. It is a log-linear model, as is the Maximum Entropy Model (MEM). MEM is a general principle of statistical learning and can be obtained by generalizing Logistic Regression.
In this article, I will introduce LR first, then generalize it to MEM. Finally, I will introduce how to train LR and MEM with IIS (Improved Iterative Scaling) and the Quasi-Newton method.
In recent years, gradient descent and other optimizers like Adam have become the main choices for training ML and DL models. So I will also introduce Gradient Descent briefly, but more details will be covered in another article.
Core Concepts
- Logistic Regression and Maximum Entropy;
- IIS and Quasi-Newton Method;
- Gradient Descent;
Logistic Regression (LR)
Mathematical Details
LR is a binary classifier: it separates the training set into two classes, 0 and 1, and predicts the label of a new instance as either 0 or 1. The mathematical model of LR is simple:

P(y=1 \mid x) = \sigma(w \cdot x + b), \qquad P(y=0 \mid x) = 1 - \sigma(w \cdot x + b)
The cost function of LR is the negative log-likelihood:

J(w, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Big], \qquad \hat{y}_i = \sigma(w \cdot x_i + b),

where

\sigma(z) = \frac{1}{1 + e^{-z}}

is called the sigmoid function. It often appears in Neural Networks as an activation function; I will cover deep learning in other articles.
So how do we decide which label to assign to a new instance? Simply by comparing the probabilities. For example, if P(y=1|x) = 0.4 and P(y=0|x) = 0.6, then we predict instance x as class 0. That is why we call 0.5 the decision boundary.
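To make this concrete, here is a minimal NumPy sketch of the model, its cross-entropy cost, and the 0.5-threshold prediction rule. The weights, bias, and data below are made-up illustrative values, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    # The logistic sigmoid: sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y = 1 | x) = sigma(w . x + b) for each row x of X.
    return sigmoid(X @ w + b)

def cross_entropy_cost(X, y, w, b):
    # J(w, b) = -(1/N) * sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    y_hat = predict_proba(X, w, b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def predict(X, w, b, threshold=0.5):
    # Assign label 1 when P(y = 1 | x) clears the decision boundary, else 0.
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Made-up parameters and data, purely for illustration.
w = np.array([0.8, -1.2])
b = 0.1
X = np.array([[0.5, 1.0], [2.0, 0.3]])
y = np.array([0, 1])

print(predict_proba(X, w, b))   # roughly [0.33, 0.79]
print(predict(X, w, b))         # [0, 1]
print(cross_entropy_cost(X, y, w, b))
```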
Maximum Entropy Model (MEM)
Before we discuss MEM, let’s ask a simple question: what is entropy?
Hmm, it is a rather abstract notion. Entropy was not first proposed by AI or CS researchers; even Shannon borrowed the concept from Statistical Physics. OK, it may seem more complicated once we start talking about Physics, but describing complicated things in a simple way is Physics's job, and a simple way means abstraction.
It seems we cannot escape from Physics. Let us look at a physical example first, then I will introduce the strict definition of entropy in Shannon's form. Finally, I will briefly discuss some other kinds of entropy.
To avoid complexity, I will ignore many details and focus on concepts. In Physics, a macroscopic system consists of many microscopic particles. A macro-system has several observable quantities such as temperature, pressure, and so on. But in the microscopic world, a particle's state is not definite, as revealed by the Schrödinger equation, which has been discussed very thoroughly in Quantum Mechanics (if you are interested, you can read Modern Quantum Mechanics by J. J. Sakurai). A particle's state can only be represented by a wave function (you can think of it as the particle's probability density). So our problem is: how do we compute the definite quantities of a macroscopic system from an indefinite quantum theory, and how do we obtain from these quantities the parameters that determine a particle's state?
This problem is remarkably similar to many problems in statistics: how can we compute statistical quantities that capture the central properties of a random dataset, and how can we determine the distribution from a few parameters?
In Statistical Physics, entropy was proposed to resolve this problem. It introduces a physical quantity, entropy:

S = -k \sum_i p_i \ln p_i,

where p_i is the probability of the i-th particle's state, determined by its energy E_i through a parameter of the form

\beta = \frac{1}{kT},

where k is the Boltzmann constant and T is the temperature.
In Statistical Physics, a particle's probability of being in a given state, for an equilibrium canonical system, is given by the canonical distribution:

p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}.
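As a small sanity check, here is a Python sketch (assuming NumPy and a handful of made-up energy levels) that computes the canonical distribution and the corresponding entropy S = -k Σ p_i ln p_i.

```python
import numpy as np

k_B = 1.380649e-23  # Boltzmann constant in J/K

def canonical_distribution(energies, T):
    # p_i = exp(-beta * E_i) / Z  with  beta = 1 / (k_B * T)
    beta = 1.0 / (k_B * T)
    weights = np.exp(-beta * np.asarray(energies))
    return weights / weights.sum()

def gibbs_entropy(p):
    # S = -k_B * sum_i p_i * ln(p_i)
    return -k_B * np.sum(p * np.log(p))

# Toy example: three made-up energy levels (in joules) at room temperature.
energies = np.array([0.0, 1e-21, 2e-21])
p = canonical_distribution(energies, T=300.0)
print(p, gibbs_entropy(p))
```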
Now, let us link physical entropy with information entropy. In Shannon's theory, information entropy is defined as:

H = -\sum_i p_i \log_2 p_i,

where the index i refers to the i-th element of a symbol set (a symbol set corresponds to a data sequence) and p_i is its probability. The unit of information entropy is the bit. Yes, here a bit does not mean the binary digit 0 or 1, but a unit of information quantity.
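Here is a tiny Python sketch of Shannon entropy in bits (log base 2); the example distributions are just illustrations.

```python
import numpy as np

def shannon_entropy(p):
    # H = -sum_i p_i * log2(p_i), measured in bits.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # terms with p_i = 0 contribute nothing
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5]))    # 1.0 bit: a fair coin flip
print(shannon_entropy([0.9, 0.1]))    # ~0.47 bits: a biased coin is less "surprising"
print(shannon_entropy([0.25] * 4))    # 2.0 bits: four equally likely symbols
```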
All right, entropy is a rather abstract notion, and it takes some time to understand. Let's set these formulas aside and look at an example that recovers some properties of entropy.
One day, a mathematician named Ellen wanted to buy some fruit. He went to a shop to pick some out, and the shop owner told him there were several kinds of fruit in the shop.
Ellen didn't know how many of each kind of fruit he planned to buy, so he decided to select them randomly. But he wanted the total 'amount' to be fixed, whatever that 'amount' was. The problem is to create a measurement of this 'amount' that satisfies Ellen's demand: no matter how he selects the fruit, the measurement should give him a single number to help him decide when to stop.
Because he selected each kind of fruit randomly from a set, Ellen denoted every kind by a symbol, so he now had a symbol set:

X = \{x_1, x_2, \dots, x_n\}.

Because the shop had only five kinds of fruit, n = 5. Besides, Ellen applied the notion of probability in his theory and wrote each symbol's probability as

p_k = \frac{n_k}{N},

such that

\sum_{k=1}^{n} p_k = 1,

where n_k is the number of times symbol k was selected and N is the size of the symbol set (the total number of selections). The process of selecting fruit at random and putting it into a basket is a kind of sampling without replacement.
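A short Python sketch of Ellen's bookkeeping, with hypothetical selection counts, shows that the values p_k = n_k / N form a proper probability distribution.

```python
import numpy as np

# Hypothetical selection counts for the five kinds of fruit (made-up numbers).
counts = np.array([3, 1, 4, 2, 2])   # n_k for k = 1..5
N = counts.sum()                      # size of the symbol set (total selections)

p = counts / N                        # p_k = n_k / N
print(p, p.sum())                     # the probabilities sum to 1
```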
Back to the problem posed earlier: is there a measurement that reveals, with a single number, how much fruit Ellen has? It is difficult to solve directly. Expectation, variance, all of these statistical properties fail to show the whole picture. However, if we