Maximum Likelihood Estimators

Arbaj Khan
5 min read · Nov 11, 2021

Introduction

In this blog, we will describe what the maximum likelihood method for parameter estimation is and provide a simple example to illustrate the method. Some of the content requires knowledge of fundamental probability concepts such as the definition of joint probability and independence of events. I wrote a blog post covering these prerequisites, so feel free to read it if you think you need a refresher.

What are parameters?

Typically, in machine learning, we use a model to describe the process that results in the data that are observed. For example, we may use a random forest model to classify whether customers will cancel a subscription to a service (known as churn modelling), or we may use a linear model to predict the revenue that will be generated for a company depending on how much it spends on advertising (this would be an example of linear regression). Each model contains its own set of parameters that ultimately define what the model looks like.

For a linear model, we can write this as y = mx + c. In this example, x could represent advertising spend and y could represent the revenue generated. m and c are the parameters of this model. Different values for these parameters will give different lines.
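As a minimal sketch, the linear model above can be written in a few lines of Python; the parameter values used here are purely illustrative, not taken from any real data:

```python
# A simple linear model y = m*x + c relating advertising spend (x)
# to revenue (y). The parameter values below are purely illustrative.
def linear_model(x, m, c):
    return m * x + c

spend = 10.0  # e.g. advertising spend in thousands of dollars

# Different values for the parameters m and c give different lines,
# and therefore different predicted revenues for the same spend.
print(linear_model(spend, m=2.0, c=5.0))  # 25.0
print(linear_model(spend, m=0.5, c=1.0))  # 6.0
```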

Three linear models with different parameter values.

Therefore, the parameters define a blueprint for the model. It is only when specific values are chosen for the parameters that we get an instantiation of the model that describes a given phenomenon.

Maximum likelihood estimation

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.

The above definition may still sound a little cryptic, so let's go through an example to help understand it.

Suppose we have observed 10 data points from some process. For example, each data point could represent the length of time in seconds that it takes a student to answer a specific exam question. These 10 data points are shown in the figure below.

First, we have to decide which model we think best describes the process of generating the data. This part is very important. At the very least, we should have a good idea about which model to use.

For these data, we will assume that the data generation process can be adequately described by a Gaussian (normal) distribution. Visual inspection of the figure above suggests that a Gaussian distribution is plausible because most of the 10 points are clustered in the middle, with a few points scattered to the left and to the right. (Making this sort of decision on the fly with only 10 data points is ill-advised, but given that we generated these data points, let's go with it.)

Recall that the Gaussian distribution has 2 parameters: the mean, μ, and the standard deviation, σ. Different values of these parameters result in different curves (just like with the straight lines above). We want to know which curve was most likely responsible for creating the data points that we observed (see the figure below). Maximum likelihood estimation is a method that finds the values of μ and σ resulting in the curve that best fits the data.
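To make this concrete, here is a small sketch of the Gaussian density function showing that the same data point is far more plausible under some parameter values than others (the specific numbers below are chosen only for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# A point at x = 10 is much more plausible under a curve centred at 10
# than under one centred at 7 (with the same standard deviation).
print(gaussian_pdf(10.0, mu=10.0, sigma=1.0))  # ≈ 0.3989
print(gaussian_pdf(10.0, mu=7.0, sigma=1.0))   # ≈ 0.0044
```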

Calculating the Maximum Likelihood Estimates

Now that we have an intuitive understanding of what maximum likelihood estimation is, we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates (MLE).

Again, we will demonstrate this with an example. Suppose we have three data points this time, and we assume that they have been generated from a process that is adequately described by a Gaussian distribution. These points are 9, 9.5 and 11. How do we calculate the maximum likelihood estimates of the Gaussian parameters μ and σ?

What we want to calculate is the total probability of observing all of the data, i.e. the joint probability distribution of all observed data points. To do this, we would need to calculate some conditional probabilities, which can get very difficult. So it is here that we make our first assumption: each data point is generated independently of the others. This assumption makes the maths much easier. If the events (i.e. the process that generates the data points) are independent, then the total probability of observing all of the data is the product of the probabilities of observing each data point individually (i.e. the product of the marginal probabilities).
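As a tiny stand-in illustration of this assumption (using a fair coin rather than our Gaussian data), independence is exactly what lets the joint probability factorise into a product:

```python
# For independent events, the joint probability is the product of the
# individual probabilities. Example: three heads in a row from a fair
# coin (an illustrative stand-in, not part of the data in this post).
p_heads = 0.5
p_three_heads = p_heads * p_heads * p_heads
print(p_three_heads)  # 0.125
```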

The probability density of observing a single data point x generated from a Gaussian distribution is given by:

P(x; μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

The semicolon used in the notation P(x; μ, σ) is there to emphasise that the symbols that appear after it are parameters of the probability distribution. It should therefore not be confused with a conditional probability (which is typically represented with a vertical bar, e.g. P(A | B)).

We only need to find the values of μ and σ that result in the maximum value of the above expression when it is evaluated over all of our data.
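A minimal sketch of that quantity for our three data points, using the independence assumption so that the total likelihood is the product of the individual densities (the candidate parameter values below are arbitrary guesses, not the final answer):

```python
import math

def gaussian_pdf(x, mu, sigma):
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(data, mu, sigma):
    # Independence assumption: the joint probability of the data is
    # the product of the individual (marginal) densities.
    total = 1.0
    for x in data:
        total *= gaussian_pdf(x, mu, sigma)
    return total

data = [9.0, 9.5, 11.0]

# Parameter values near the data give a much higher likelihood than
# parameter values far away from it.
print(likelihood(data, mu=9.8, sigma=0.9))
print(likelihood(data, mu=5.0, sigma=0.9))
```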

If you've studied calculus in your maths classes, you will probably know that there is a technique that can help us find the maxima (and minima) of a function. It's called differentiation. All we have to do is find the derivative of the function, set the derivative equal to zero, and then rearrange the equation to make the parameter of interest the subject of the equation. And voilà, we'll have our MLE values for our parameters.
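Carrying out that differentiation for the Gaussian (setting the derivative of the log-likelihood to zero) yields well-known closed forms: the MLE of μ is the sample mean, and the MLE of σ is the 1/n ("biased") sample standard deviation. A quick sketch applying these formulas to our three data points:

```python
import math

def mle_gaussian(data):
    """Closed-form maximum likelihood estimates for a Gaussian:
    the sample mean and the 1/n sample standard deviation."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

mu, sigma = mle_gaussian([9.0, 9.5, 11.0])
print(round(mu, 4))     # 9.8333
print(round(sigma, 4))  # 0.8498
```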

Conclusion

There are many techniques for solving parameter estimation problems, but a common framework used throughout machine learning is maximum likelihood estimation. Maximum likelihood estimation involves defining a likelihood function that measures the probability of observing the sample given a probability distribution and its parameters. The estimates are then found by searching over the space of possible distributions and parameter values to maximise this function.

This flexible probabilistic framework also provides the foundation for a wide range of machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively, but also artificial neural networks in deep learning.
