Before Reading
This note is a quick walkthrough of basic ML / DL concepts, starting with linear regression. The outline of this note follows Huawei's HCCDP – AI certification.
[TOC]
1. Machine Learning Foundation
1.1 Overview
Regression is a supervised learning task, which means the model's output can be 'corrected' against labeled targets via a loss function.
1.2 Regression
1.2.1 Univariate linear regression
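Input: $x$
Output: $\hat{y} = wx + b$
Loss (error): $\epsilon = y - \hat{y}$, which means $y = wx + b + \epsilon$ ①
Univariate linear regression assumes that, for every $\hat{y}$ that our regression function predicts, the actual corresponding $y$ is located around it, following a normal distribution:
$$y \sim \mathcal{N}(\hat{y}, \sigma^2)$$
which means the error follows a zero-mean normal distribution:
$$\epsilon \sim \mathcal{N}(0, \sigma^2), \qquad p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$$
From ①, let $\epsilon_i = y_i - w x_i - b$ for each sample $(x_i, y_i)$:
$$p(y_i \mid x_i; w, b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - w x_i - b)^2}{2\sigma^2}\right)$$
Likelihood function:
$$L(w, b) = \prod_{i=1}^{n} p(y_i \mid x_i; w, b)$$
where $(w, b)$ is what the regression is looking for.
Taking the logarithm of the likelihood function:
$$\ln L(w, b) = -n \ln\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w x_i - b)^2$$
Note: the above process uses the following rules:
$$\ln(ab) = \ln a + \ln b, \qquad \ln e^{a} = a$$
Maximizing the log-likelihood is equivalent to minimizing the sum of squared errors, so finally the target function is:
$$(w^*, b^*) = \arg\min_{w, b} \sum_{i=1}^{n} (y_i - w x_i - b)^2$$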
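For the univariate case, the sum of squared errors can be minimized in closed form. The following is a minimal sketch in plain Python; the function name `fit_line` and the toy data are assumptions made for illustration, not part of the original note.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b for the model y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # b makes the fitted line pass through the mean point
    b = mean_y - w * mean_x
    return w, b


# Toy data that lies exactly on y = 2x + 1 (illustrative only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Because the toy data is noise-free, the recovered parameters should match the generating line.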
1.2.2 Multivariable Linear Regression
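Like the process above, assume the formula is:
$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b$$
To absorb $b$ into the weights, we can define $\mathbf{x}$ as a vector with $d + 1$ dimensions whose last component is fixed to 1:
$$\mathbf{x} = (x_1, \dots, x_d, 1)^T, \qquad \mathbf{w} = (w_1, \dots, w_d, b)^T, \qquad \hat{y} = \mathbf{w}^T \mathbf{x}$$
The target function is:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$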
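Setting the gradient of the squared-error objective to zero gives a closed-form solution. In the sketch below, $X$ (one row per sample, with a trailing 1 for the bias) and $\mathbf{y}$ (the vector of targets) are notation introduced here for illustration, and the result assumes $X^T X$ is invertible:

$$
J(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2, \qquad
\nabla_{\mathbf{w}} J = -2 X^T (\mathbf{y} - X\mathbf{w}) = 0
\;\Rightarrow\;
\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{y}
$$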
1.2.3 Optimization
- Least squares
- Batch Gradient Descent (BGD)
  - Updates the weights using all data in every step (time consuming)
- Stochastic Gradient Descent (SGD)
  - Randomly selects one sample at a time to update the weights
- Mini-Batch Gradient Descent (MBGD)
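The three gradient descent variants differ only in how many samples feed each update. The sketch below shows this with a single hypothetical `batch_size` parameter; the learning rate, epoch count, and toy data are assumptions chosen for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by minimizing the mean squared error.

    batch_size == len(xs) -> batch gradient descent (BGD)
    batch_size == 1       -> stochastic gradient descent (SGD)
    anything in between   -> mini-batch gradient descent (MBGD)
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data on the line y = 2x + 1 (illustrative only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_bgd, b_bgd = gradient_descent(xs, ys, batch_size=len(xs))
w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)
w_mbgd, b_mbgd = gradient_descent(xs, ys, batch_size=2)
```

On noise-free data like this, all three variants should approach the same parameters; on real data SGD and MBGD trade gradient accuracy for cheaper updates.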
1.2.4 Logistic Regression
The difference between logistic regression and linear regression is whether the target output is discrete or not. When the label is discrete, inputs such as user attributes can be processed by logistic regression to classify them into the corresponding labels.
Pros:
- Easy to implement
- Fast computing
- Provide probability score
- Can use L2 Regularization to deal with Multicollinearity
Item | Linear Regression | Logistic Regression |
---|---|---|
Objective | Prediction | Classification (with probability) |
Function | Fitting function | Prediction function |
Weights Calculation | Least squares method / GD | Maximum likelihood estimation (MLE) |
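The inference process is based on a binary decision question, given the input $\mathbf{x}$:
$$y \in \{0, 1\}$$
which means the algorithm should output 0 or 1 for any given vector $\mathbf{x}$.
In an ideal situation, we have a trained weight $\mathbf{w}$ and output the score $z = \mathbf{w}^T \mathbf{x} + b$ as the classification result, after passing it through the following activation function (Heaviside step function):
$$y = \begin{cases} 0, & z < 0 \\ 0.5, & z = 0 \\ 1, & z > 0 \end{cases}$$
Because the Heaviside step function is not differentiable, we replace it with a similar function, the sigmoid:
$$y = \frac{1}{1 + e^{-z}}$$
then:
$$\ln \frac{y}{1 - y} = \mathbf{w}^T \mathbf{x} + b$$
Let $y$ be the probability of the result being 1, and $1 - y$ the probability of the result being 0.
The ratio of these two probabilities (the odds) is $\frac{y}{1 - y}$, which is the quantity inside the logarithm in the above equation.
Let $y$ be the posterior estimate $p(y = 1 \mid \mathbf{x})$; the equation now becomes:
$$\ln \frac{p(y = 1 \mid \mathbf{x})}{p(y = 0 \mid \mathbf{x})} = \mathbf{w}^T \mathbf{x} + b$$
Note: the input is the vector $\mathbf{x}$, the output is the label $y$, and the score is $z = \mathbf{w}^T \mathbf{x} + b$.
Let:
$$p(y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}^T \mathbf{x} + b}}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}, \qquad p(y = 0 \mid \mathbf{x}) = \frac{1}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}$$
Likelihood function (reward function):
$$L(\mathbf{w}, b) = \prod_{i=1}^{n} p(y_i = 1 \mid \mathbf{x}_i)^{y_i} \, p(y_i = 0 \mid \mathbf{x}_i)^{1 - y_i}$$
Loss function (negative log-likelihood):
$$J(\mathbf{w}, b) = -\sum_{i=1}^{n} \left[ y_i \ln p(y_i = 1 \mid \mathbf{x}_i) + (1 - y_i) \ln p(y_i = 0 \mid \mathbf{x}_i) \right]$$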
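Minimizing the negative log-likelihood has no closed form, so it is usually done iteratively. The following is a minimal one-dimensional sketch using gradient descent; the function names, learning rate, and toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit 1-D logistic regression by gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # For each sample, d(loss)/d(score) = p - y, where p = sigmoid(w*x + b)
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Toy data: negative inputs belong to class 0, positive inputs to class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
```

Thresholding the predicted probability at 0.5 turns the probability score into a class label.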
1.3 Classification
Placeholder here
1.4 Supervised Learning
1.4.1 KNN
Pros:
- Simple
- Applicable for non-linear classification
- No assumptions about the data distribution; not sensitive to noisy data
Cons:
- Computationally expensive
- Sensitive to unbalanced classes
- Requires large memory
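The idea behind KNN can be sketched in a few lines: find the k closest training points and take a majority vote. The function name `knn_predict`, the Euclidean distance, and the toy data below are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy clusters (illustrative only)
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(train_points, train_labels, (0.5, 0.5))
near_far_cluster = knn_predict(train_points, train_labels, (5.5, 5.5))
```

Every prediction scans the whole training set, which is where the computation and memory costs listed above come from.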
1.4.2 Naive Bayes
If it is round and red, it should be an apple!
- Determine the attributes
- Obtain training sample data
- Calculate the prior probability $P(y)$ for every label $y$
- Calculate every conditional probability $P(x_i \mid y)$ for every label
- The label with the max $P(y) \prod_i P(x_i \mid y)$ is the result.
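The steps above can be sketched for categorical attributes as follows. This is a minimal version without smoothing, so an attribute value never seen with a label gives that label a score of zero; the function names and the fruit data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count label frequencies and per-label attribute-value frequencies."""
    label_counts = Counter(labels)
    # value_counts[label][attribute_index][value] -> number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for sample, label in zip(samples, labels):
        for index, value in enumerate(sample):
            value_counts[label][index][value] += 1
    return label_counts, value_counts


def predict_naive_bayes(model, sample):
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior probability of the label
        for index, value in enumerate(sample):
            # conditional probability of this attribute value given the label
            score *= value_counts[label][index][value] / count
        scores[label] = score
    return max(scores, key=scores.get)


# Attributes are (shape, color); the data is illustrative only
samples = [("round", "red"), ("round", "red"), ("round", "green"),
           ("long", "yellow"), ("long", "yellow"), ("long", "green")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = train_naive_bayes(samples, labels)
round_red = predict_naive_bayes(model, ("round", "red"))
long_green = predict_naive_bayes(model, ("long", "green"))
```

In practice a smoothing term is usually added to the counts so that unseen values do not zero out a label.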
Pros:
- Stable classification efficiency.
- Good performance in small scale data
- Suitable for text classification
Cons:
- Poor performance if there are too many attributes or the attributes are not independent of each other
- Prior probability is required
1.4.3 SVM
Draw a line in the sky to divide stars.
- If the line works well (the stars are linearly separable), with a max hard margin, SVM is done.
- If the line works but a few stars are on the wrong side, with a max soft margin, SVM is done.
- If the line doesn't work, we can use a kernel function to project the flat sky onto a higher-dimensional dome, and SVM is done.
Pros:
- Nice robustness.
- Global optimization can be discovered.
- Suitable for small scale data.
Cons:
- Bad performance for large scale data
- Hard to deal with multi-class classification tasks
- Sensitive to missing data, hyperparameters, and kernel function selection.
1.4.4 Decision tree
Pros:
- Simple concept
- Inputs can be numerical or categorical
- Allows missing data
Cons:
- Overfitting
- Hard to predict continuous values
Information Gain
ID3 - using max info gain
- For every attribute, calculate its information gain.
- Select the attribute with the max information gain as the first decision node.
- Repeat for each resulting subset.
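Information gain is the entropy of the labels minus the weighted entropy after splitting on an attribute. The sketch below computes both; the function names and the tiny dataset are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    weighted = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted


labels = ["yes", "yes", "no", "no"]
perfect_attribute = ["a", "a", "b", "b"]   # separates the labels completely
useless_attribute = ["x", "y", "x", "y"]   # tells us nothing about the labels
gain_perfect = information_gain(perfect_attribute, labels)
gain_useless = information_gain(useless_attribute, labels)
```

ID3 would pick `perfect_attribute` here because it has the larger gain.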
C4.5 - using gain ratio
In the C4.5 algorithm, attributes are compared using the normalized information gain (gain ratio). The gain for an attribute A is determined by subtracting the weighted average of the entropies of its partitions from the entropy of the original set. For example, for an attribute such as "Outlook":
- Calculate the entropy of the original set.
- Calculate the weighted average of the entropies of the partitions created by the attribute "Outlook".
- Subtract the result of step 2 from the entropy of the original set to get the gain for the "Outlook" attribute.
This process is repeated for each attribute. The gain ratio is then calculated for each attribute, which takes into account the intrinsic information of the attribute: it is the gain divided by the split information. The attribute with the highest gain ratio is selected as the splitting criterion.
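Written out, the quantities involved look roughly like this. The symbols are notation introduced here for illustration: $D$ is the dataset, $D_v$ the subset where attribute $A$ takes value $v$, and $H$ the entropy.

$$
\mathrm{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v), \qquad
\mathrm{SplitInfo}(D, A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}
$$

$$
\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}
$$

Dividing by the split information penalizes attributes that break the data into many small partitions.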
CART - using Gini index
In every division, calculate the Gini index of the two proposed sub-datasets (a measure of their impurity) and choose the split with the lowest weighted Gini index, i.e. the highest purity.
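A minimal sketch of the Gini computation for one candidate binary split is shown below; the function names and the example labels are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure set, larger for more mixed sets."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of the two sub-datasets produced by a split."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total * gini(left_labels)
            + len(right_labels) / total * gini(right_labels))


pure_split = split_gini(["a", "a"], ["b", "b"])    # each side holds one class
mixed_split = split_gini(["a", "b"], ["a", "b"])   # each side is half and half
```

CART would prefer the first split because its weighted impurity is lower.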
1.5 Unsupervised Learning
1.5.1 K-means
- Select k positions as the initial cluster centers.
- For every data point, calculate its distance to every center and select the closest center as its label.
- Update centers: the mean of all data points labeled with a center becomes the new center.
- If the centers no longer change, output the result; otherwise repeat from step 2.
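The loop above can be sketched as follows. To keep the example deterministic, the initial centers are passed in explicitly rather than chosen at random; the function name, the iteration cap, and the toy points are assumptions made for illustration.

```python
import math


def k_means(points, centers, max_iterations=100):
    """Plain k-means on tuples of floats, starting from the given initial centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its closest center
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
            for point in points
        ]
        # Update step: move each center to the mean of the points assigned to it
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # the centers stopped moving
        centers = new_centers
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Different initial centers can lead to different final clusters, which is the first drawback listed for this method.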
Cons:
- Can be affected by the initial centers
- The value of k can be hard to determine
- Slow for large scale data
- Sensitive to noise and outliers
1.5.2 K-means++
Placeholder here
1.5.3 K-medoids
Placeholder here
1.5.4 Hierarchical Clustering
Placeholder here
1.5.5 DBSCAN
Placeholder here
2. Deep Learning Foundation
2.1 Basic Knowledge of Neural Networks
2.1.1 Perceptron
A perceptron is a basic building block of artificial neural networks, which are models inspired by the structure and functioning of the human brain. It was introduced by Frank Rosenblatt in 1957. A perceptron takes multiple binary inputs (0 or 1), applies weights to these inputs, sums them up, and passes the result through an activation function to produce an output (typically 0 or 1). Mathematically, the output $y$ of a perceptron is calculated as follows:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
where $x_i$ are the inputs, $w_i$ are the weights, $b$ is the bias, and $f$ is the activation function.
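As a minimal sketch, a perceptron with a step activation fits in a few lines. The weights and bias below are hand-picked, hypothetical values that make it behave like a logical AND gate; they are not learned.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by a step activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical hand-picked parameters: only (1, 1) pushes the sum above zero
and_weights, and_bias = [1.0, 1.0], -1.5
truth_table = {
    (a, b): perceptron_output((a, b), and_weights, and_bias)
    for a in (0, 1) for b in (0, 1)
}
```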
2.1.2 Activation Function
Step Function: It outputs 1 if the input is above a certain threshold $\theta$ and 0 otherwise. It's rarely used in hidden layers of modern neural networks but is sometimes used in the output layer for binary classification problems.
$$f(z) = \begin{cases} 1, & z > \theta \\ 0, & \text{otherwise} \end{cases}$$
Sigmoid Function (Logistic Function): It squashes the input values between 0 and 1, which is useful for binary classification problems.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Hyperbolic Tangent (tanh) Function: Similar to the sigmoid, but it squashes input values between -1 and 1. It's often used in hidden layers of neural networks.
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Rectified Linear Unit (ReLU): It outputs the input directly if it is positive, and zero otherwise. ReLU is widely used in hidden layers due to its simplicity and effectiveness, and it can replace the sigmoid in most cases.
$$\mathrm{ReLU}(z) = \max(0, z)$$
Softmax Function: It is commonly used in the output layer of a neural network for multi-class classification problems. It converts a vector of raw scores into a probability distribution.
$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$$
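The activation functions described above can be written out in plain Python as a quick reference. The function names and the default threshold of 0 for the step function are assumptions made for illustration.

```python
import math


def step(z, threshold=0.0):
    return 1.0 if z > threshold else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def softmax(scores):
    # Subtracting the max score avoids overflow and does not change the result
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [value / total for value in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

The softmax output sums to 1 and preserves the ordering of the raw scores.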
2.1.3 Loss Function
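Mean Squared Error (MSE): Used for regression problems, MSE calculates the average squared difference between predicted and actual values.
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Binary Cross-Entropy Loss: Commonly used for binary classification problems. It measures the dissimilarity between the predicted probabilities and the actual binary labels.
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
Categorical Cross-Entropy Loss: Used for multi-class classification problems. It generalizes binary cross-entropy to more than two classes.
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$
where $C$ is the number of classes, $y_{i,c}$ is an indicator of whether class $c$ is the true class for sample $i$, and $p_{i,c}$ is the predicted probability of sample $i$ belonging to class $c$.
Hinge Loss: Used for support vector machines (SVM) and some types of neural networks for binary classification. With labels $y_i \in \{-1, +1\}$ and raw scores $s_i$:
$$L = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i s_i)$$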
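The four losses can be sketched in plain Python as below. The function names are assumptions made for illustration, each loss is averaged over the samples, and the cross-entropy versions assume predicted probabilities strictly between 0 and 1 (no clipping is done).

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred):
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, p_pred)
    ) / len(y_true)


def categorical_cross_entropy(y_true_one_hot, p_pred):
    return -sum(
        sum(t * math.log(p) for t, p in zip(row_true, row_pred))
        for row_true, row_pred in zip(y_true_one_hot, p_pred)
    ) / len(y_true_one_hot)


def hinge(y_true, scores):
    # Labels are expected to be -1 or +1, scores are raw (unsquashed) outputs
    return sum(max(0.0, 1 - t * s) for t, s in zip(y_true, scores)) / len(y_true)


mse_value = mse([1.0, 2.0], [1.0, 4.0])
bce_value = binary_cross_entropy([1, 0], [0.5, 0.5])
cce_value = categorical_cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]])
hinge_value = hinge([1, -1], [2.0, 0.5])
```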
2.1.4 Backpropagation
Error Back Propagation
- Propagate the loss backward to every computing unit
- Update each weight based on its contribution to the loss (its gradient)
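The two steps can be illustrated on a deliberately tiny network with one sigmoid hidden unit and one linear output. The architecture, the squared-error loss, the learning rate, and all names below are assumptions made for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Tiny network: hidden = sigmoid(w1 * x), output = w2 * hidden."""
    hidden = sigmoid(w1 * x)
    output = w2 * hidden
    return hidden, output


def backward(x, target, w1, w2):
    """Gradients of the loss 0.5 * (output - target)^2 via the chain rule."""
    hidden, output = forward(x, w1, w2)
    d_output = output - target          # loss gradient at the output unit
    grad_w2 = d_output * hidden         # gradient for the output weight
    d_hidden = d_output * w2            # loss propagated back to the hidden unit
    grad_w1 = d_hidden * hidden * (1 - hidden) * x  # through the sigmoid to w1
    return grad_w1, grad_w2


def train(x, target, w1=0.5, w2=0.5, lr=0.5, steps=200):
    for _ in range(steps):
        grad_w1, grad_w2 = backward(x, target, w1, w2)
        w1 -= lr * grad_w1              # update each weight against its gradient
        w2 -= lr * grad_w2
    return w1, w2


trained_w1, trained_w2 = train(x=1.0, target=0.8)
trained_output = forward(1.0, trained_w1, trained_w2)[1]
```

Real networks repeat the same chain-rule pattern layer by layer, usually through an automatic differentiation library rather than by hand.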
2.2 Dataset Process
2.2.1 Data Partition
Placeholder here
2.2.2 Bias & Variance
High Bias:
- Use larger model
- More training steps
- Alternative model
- Reduce regularization
High Variance:
- Obtain more data
- Add regularization
- Early stopping
- Alternative model
2.3 Network Design
2.4 Regularization
2.4.1 Underfitting
Reason:
- Lack of enough features
- Lack of model complexity
Remediation:
- Add new features
- Add polynomial features
- Reduce the regularization parameters
- Use a non-linear model (kernel SVM, decision trees, etc.)
- Adjust model capacity
- Bagging
2.4.2 Overfitting
Reason:
- Too much noise
- Too few samples
- Model is too complex
Remediation:
- Reduce features
- Regularization
2.4.3 Penalty
2.4.4 ℓ1-norm
Also known as Manhattan Distance or Taxicab norm. The L1 norm is the sum of the magnitudes of a vector's components. It is the most natural way to measure the distance between vectors, namely the sum of the absolute differences of their components. In this norm, all the components of the vector are weighted equally. For example, for the vector X = [3,4]:
$$\|X\|_1 = |3| + |4| = 7$$
L1-Regularization adds an ℓ1-norm penalty on the weights to the model's loss:
$$Loss = Error(y, \hat{y}) + \lambda \sum_{i} |w_i|$$
2.4.5 ℓ2-norm
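Also known as the Euclidean norm. It is the shortest distance to go from one point to another. For the vector X = [3,4]:
$$\|X\|_2 = \sqrt{3^2 + 4^2} = 5$$
L2-Regularization adds a squared ℓ2-norm penalty on the weights to the model's loss:
$$Loss = Error(y, \hat{y}) + \lambda \sum_{i} w_i^2$$
When the ℓ1-norm and ℓ2-norm are used as loss functions, they are known as least absolute deviations (LAD) and least squares error (LSE), respectively.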
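A small sketch of both norms and of how a penalty term is added to a loss is shown below. The function names, the `kind` switch, and the penalty strength `lam` are assumptions made for illustration; the L2 penalty uses the squared norm.

```python
def l1_norm(vector):
    return sum(abs(v) for v in vector)


def l2_norm(vector):
    return sum(v * v for v in vector) ** 0.5


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add an L1 or squared-L2 penalty on the weights to an existing data loss."""
    if kind == "l1":
        penalty = l1_norm(weights)
    elif kind == "l2":
        penalty = sum(w * w for w in weights)
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return data_loss + lam * penalty


x = [3, 4]
l1_value = l1_norm(x)
l2_value = l2_norm(x)
loss_with_l1 = regularized_loss(1.0, x, lam=0.1, kind="l1")
loss_with_l2 = regularized_loss(1.0, x, lam=0.1, kind="l2")
```

The L1 penalty tends to push individual weights to exactly zero, while the L2 penalty shrinks all weights smoothly.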
2.4.6 Dropout
2.4.7 Pooling
2.5 Optimizer
2.5.1 Gradient Descent
2.5.2 Momentum
2.5.3 Adam
2.5.4 Optimizer Selection
Situation | Recommended optimizer |
---|---|
Data is sparse | Adaptive methods (Adagrad, Adadelta, RMSprop, Adam) |
Gradient is sparse | Adam is better than RMSprop |
Summary | Adam |