aka maximum likelihood estimation, MLE
Minimize a cost function, which is often the negative log likelihood (related to entropy; more precisely, to cross-entropy or relative entropy); minimizing it corresponds to maximizing the likelihood. The likelihood is the probability of getting the right $y$ given $x$ and the parameters $\theta$, i.e. the probability that a given model predicts the right outputs. This is equivalent to finding the most likely $\theta$ in the Bayesian posterior under a flat prior (and if we add a regularizer, we are in effect tweaking the prior, since the regularizer just adds a term to the log likelihood). If our model predicts the data with a Gaussian distribution (where the model outputs $f_\theta(x_i)$ are the means), maximizing the likelihood is equivalent to minimizing the energy of springs placed vertically between the fitted curve and the data points, as derived below.
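A one-line check of the spring analogy, assuming i.i.d. Gaussian noise with fixed variance $\sigma^2$ around the model prediction $f_\theta(x_i)$:

$$-\log p(y \mid x, \theta) = \sum_i \frac{\big(y_i - f_\theta(x_i)\big)^2}{2\sigma^2} + \text{const},$$

which is the potential energy of springs of stiffness $1/\sigma^2$ stretched vertically between each data point $y_i$ and the curve, so minimizing the negative log likelihood minimizes the total spring energy.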
The maximum likelihood estimate is found by Optimization, often by Stochastic gradient descent, as in the sketch below.
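A minimal sketch, assuming a toy linear model $y = \theta_0 x + \theta_1$ with Gaussian noise; the data, learning rate, and epoch count are made up for illustration. SGD follows the gradient of the per-point negative log likelihood, which for Gaussian noise is just a squared residual:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 plus Gaussian noise (the true slope/intercept are invented for the demo)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, 200)

theta = np.zeros(2)  # [slope, intercept]
lr = 0.1

for epoch in range(100):
    for i in rng.permutation(len(x)):
        # Per-point Gaussian NLL is resid^2 / (2 sigma^2) + const;
        # sigma^2 is constant, so it is absorbed into the learning rate
        resid = (theta[0] * x[i] + theta[1]) - y[i]
        grad = np.array([resid * x[i], resid])
        theta -= lr * grad

print(theta)  # approaches [2.0, 1.0], the maximum likelihood fit
```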
If we want the whole posterior distribution over the $\theta$s rather than a point estimate, we need to use Bayesian statistics, which involves computing complicated integrals, often done numerically using Monte Carlo methods (see the sampler sketch below).
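A minimal sketch of one such Monte Carlo method, a random-walk Metropolis sampler over $\theta$ for the same toy linear-Gaussian model as above (the proposal width, burn-in, and step count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy data-generating setup as the SGD sketch
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, 200)

def log_posterior(theta, sigma=0.3):
    """Log p(theta | data) up to a constant: Gaussian log likelihood plus a flat prior."""
    resid = y - (theta[0] * x + theta[1])
    return -np.sum(resid**2) / (2 * sigma**2)

theta = np.zeros(2)
samples = []
for step in range(20000):
    proposal = theta + rng.normal(0.0, 0.05, 2)  # random-walk proposal
    # Accept with probability min(1, posterior ratio), done in log space
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[5000:])  # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))  # posterior mean and spread per parameter
```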
See video
To see the application of this method in Supervised learning, see Discriminative learning and Generative learning.
https://www.wikiwand.com/en/Maximum_likelihood_estimation