Machine Learning

Core ideas and common task types

Machine learning can be understood from several different angles.

One common way is to divide it by how data is provided:

Supervised learning: data is labeled, so the model receives both inputs and outputs. Typical tasks include regression and classification.
Unsupervised learning: data is unlabeled, and the model only sees inputs. A typical task is clustering.
Semi-supervised learning: combines a small amount of labeled data with a larger amount of unlabeled data, mixing supervised and unsupervised approaches.

It can also be grouped by learning style:

Model-based learning and instance-based learning
Batch learning and incremental learning

From the perspective of problems to solve, the most common categories are:

Regression: the prediction target is continuous.
Classification: the prediction target is discrete.
Clustering: an unsupervised task that groups similar samples into the same cluster.
Dimensionality reduction: reducing the number of features or the scale of the data.

A typical machine learning workflow usually follows this path:

data collection → data cleaning → model selection → training → evaluation → testing → deployment and maintenance

Data preprocessing

Before training, data often needs to be transformed into a form that models can handle more effectively.

Common preprocessing methods include:

Standardization: transform each column so that its mean is 0 and its standard deviation is 1.
Min-max scaling: map the minimum value of each column to 0 and the maximum to 1.
Normalization: rescale each sample, usually row-wise, so that its values become proportions between 0 and 1 (for example, dividing each non-negative value by the row total).
Binarization: convert values into only two states, 0 and 1.
One-hot encoding: represent a category as one 1 and the rest 0s.
Label encoding: convert string labels into numeric form.
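
The methods listed above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, the standardization uses the population standard deviation, the normalization uses the L1 norm, and constant columns or all-zero rows (which would divide by zero) are assumed not to occur.

```python
import math

def standardize(column):
    """Shift and scale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def min_max_scale(column):
    """Map the column minimum to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def l1_normalize(row):
    """Scale one sample (row) so its absolute values sum to 1."""
    total = sum(abs(v) for v in row)
    return [v / total for v in row]

def binarize(values, threshold):
    """Values above the threshold become 1, all others 0."""
    return [1 if v > threshold else 0 for v in values]

def label_encode(labels):
    """Map string labels to integer codes (classes sorted alphabetically)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes

def one_hot(codes, n_classes):
    """Turn each integer code into a vector with a single 1."""
    return [[1 if i == c else 0 for i in range(n_classes)] for c in codes]

codes, classes = label_encode(["cat", "dog", "cat", "bird"])
encoded = one_hot(codes, len(classes))
```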

Regression

Linear regression

A linear model can be written as:

$y = w^T x + b$

Linear regression uses this form for regression tasks, usually when the samples roughly follow a linear distribution.

A loss function measures the difference between predicted values and true values. Model parameters are commonly optimized with gradient descent, which updates parameters in the opposite direction of the gradient:

$$ w_i = w_i + \Delta w_i, \qquad \Delta w_i = - \eta \frac{\partial E}{\partial w_i} $$
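
As a rough sketch of the update rule above, the following fits a one-feature linear model with gradient descent. It assumes mean squared error as the loss $E$ and treats $\eta$ as a fixed learning rate; the function name, the data, and the hyperparameter values are illustrative only.

```python
def fit_line(xs, ys, eta=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of E = (1/n) * sum((w*x + b - y)^2)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient: delta = -eta * dE/dw
        w += -eta * grad_w
        b += -eta * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
```

On this noise-free data the parameters approach the generating values $w = 2$ and $b = 1$. A learning rate that is too large would make the updates diverge instead.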

Two widely used regularized variants are:

Lasso regression: adds an L1 regularization term to the standard linear regression loss.
Ridge regression: adds an L2 regularization term.
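
Written out, the two losses might look like the following. The squared-error data term, the sample count $n$, and the regularization strength $\lambda$ are notational assumptions; libraries differ in how they scale each term.

$$ E_{\text{lasso}} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - w^T x_j - b \right)^2 + \lambda \sum_i |w_i| $$

$$ E_{\text{ridge}} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - w^T x_j - b \right)^2 + \lambda \sum_i w_i^2 $$

The L1 term tends to drive some coefficients exactly to zero, while the L2 term shrinks all coefficients smoothly toward zero.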

Polynomial regression

When the relationship in the data is not linear, polynomial regression introduces higher-order terms. Even though the features become nonlinear, the coefficients are still linear, so it can be viewed as an extension of linear regression.

This brings up two classic modeling issues:

Underfitting: the model is too simple and fails to capture the true pattern. Both training and test performance tend to be low. Common fixes include increasing model complexity or adding more features.
Overfitting: the model fits the training samples too closely and generalizes poorly. Training accuracy is high but test accuracy is low. Common fixes include increasing the sample size, simplifying the model, or reducing the number of features.

Decision tree regression

Decision trees can be used for regression as well as classification. In regression settings, prediction is obtained by averaging values within a leaf node.

Regression metrics

Common evaluation metrics include:

R² score
Mean squared error
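
Both metrics are short to compute by hand. The sketch below uses hypothetical helper names and made-up values purely for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
mse = mean_squared_error(y_true, y_pred)  # 0.125
r2 = r2_score(y_true, y_pred)             # 0.975
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.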

Classification

Logistic regression

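Logistic regression is used for binary classification. It first produces a continuous value, maps it to a probability between 0 and 1 through the logistic (sigmoid) function, and then assigns a discrete class by comparing that probability with a threshold such as 0.5.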

The sigmoid function is:

$y = \frac {1}{1+e^{-x}}$

Its standard loss function is cross-entropy.
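
A minimal sketch of the prediction step and the cross-entropy loss follows. The weights, the 0.5 threshold, and the function names are assumptions for illustration, and the plain `math.exp` call is only safe for moderately sized inputs.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class for one sample."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def predict_class(w, b, x, threshold=0.5):
    """Turn the probability into a discrete class label."""
    return 1 if predict_proba(w, b, x) >= threshold else 0

def binary_cross_entropy(y_true, probs):
    """Average negative log-likelihood of the true labels."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, probs)
    ) / len(y_true)

w, b = [2.0, -1.0], 0.5            # illustrative parameters
p = predict_proba(w, b, [1.0, 1.0])  # sigmoid(1.5)
label = predict_class(w, b, [1.0, 1.0])
```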

Multiclass classification can be handled by combining multiple binary classifiers.

Decision trees

A decision tree follows the idea that similar causes tend to lead to similar outcomes. It builds a tree structure that splits samples with similar attributes into the same branch. Voting is used for classification, and averaging is used for regression.

Important concepts around decision trees include:

Information entropy: describes how chaotic or ordered a dataset is.
Feature selection
Information gain: the reduction in entropy after a split.
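Gain ratio: information gain divided by the intrinsic value of the split, that is, the entropy of the partition itself.
Gini index (Gini impurity): the probability that two samples drawn at random from a node belong to different classes.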
Pruning: both pre-pruning and post-pruning
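
The entropy-based quantities above can be computed directly from label counts. The sketch below uses base-2 logarithms and hypothetical function names; the toy split is chosen so that the result is easy to check by hand.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]  # a split that separates the classes perfectly
gain = information_gain(parent, split)
```

Here the parent has an entropy of 1 bit and both children are pure, so the split recovers the full 1 bit of information.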

Decision trees are also central to ensemble learning:

Strongly dependent ensembles, where base learners are trained one after another: Boosting, for example AdaBoost
Weakly dependent ensembles: Bagging, Random Forest

Support Vector Machine

An SVM is another binary classification model. Its goal is to find the optimal linear decision boundary that maximizes the margin between the boundary and the support vectors, the samples closest to that boundary.

A good classification boundary typically aims for:

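correctness: most samples are classified correctly
fairness: the support vectors on both sides are equally far from the boundary
safety: the margin between the boundary and the support vectors is as large as possible
simplicity: the boundary is linear (a line, plane, or hyperplane)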

SVMs handle both:

linearly separable problems
linearly inseparable problems

For nonlinear cases, a kernel function maps the data into a higher-dimensional space where linear separation becomes possible. Common kernels include:

linear kernel
polynomial kernel
Gaussian kernel

Naive Bayes

Naive Bayes is based on Bayes’ theorem:

$$ P(A|B) = \frac{P(A)P(B|A)}{P(B)} $$

Its defining assumption is that features are independent of each other given the class. Under that assumption, the model computes the probability that a sample belongs to each class and picks the most probable one.
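
Under the independence assumption, the class score factorizes as shown below. The symbols $c$ for a class and $x_1, \dots, x_n$ for the feature values are notation introduced here for illustration.

$$ P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c) $$

The denominator $P(x_1, \dots, x_n)$ is the same for every class, so it can be dropped when only the most probable class is needed.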

Clustering

Clustering is an unsupervised learning task that groups samples according to similarity. Samples within the same cluster should be more similar to each other, while samples from different clusters should be less similar.

Similarity is commonly measured through distance, such as:

Euclidean distance
Manhattan distance
Chebyshev distance
Minkowski distance
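
These four distances are closely related: Manhattan and Euclidean distance are the Minkowski distance with $p = 1$ and $p = 2$, and Chebyshev distance is its limit as $p$ grows. A small sketch with hypothetical function names:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences (p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (p = 2)."""
    return minkowski(a, b, 2)

def chebyshev(a, b):
    """Largest absolute coordinate difference (limit as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0, 0), (3, 4)
distances = (manhattan(a, b), euclidean(a, b), chebyshev(a, b))
```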

Several major clustering families are commonly used:

Prototype-based / partition-based clustering: such as K-Means
Density-based clustering: such as DBSCAN
Hierarchical clustering: such as agglomerative clustering

A quick comparison:

<table> <thead> <tr> <th>Item</th> <th>K-Means</th> <th>DBSCAN</th> <th>Agglomerative Hierarchical</th> </tr> </thead> <tbody> <tr> <td>Type</td> <td>Prototype-based</td> <td>Density-based</td> <td>Hierarchical</td> </tr> <tr> <td>Cluster center</td> <td>Yes</td> <td>No</td> <td>No</td> </tr> <tr> <td>Need to set K in advance</td> <td>Yes</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Sensitivity to noise</td> <td>Sensitive</td> <td>Not sensitive</td> <td>Not sensitive</td> </tr> </tbody> </table>

A common evaluation metric for clustering is the silhouette coefficient.

Model evaluation and tuning

Classification metrics

The most common metrics for classification include:

Accuracy: number of correct predictions / total number of samples
Error rate: number of incorrect predictions / total number of samples
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1 score: 2 * precision * recall / (precision + recall)
Confusion matrix
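
For a binary problem, all of these metrics follow from the four confusion-matrix counts. The sketch below uses made-up labels and hypothetical function names, and assumes the denominators are non-zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_report(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
report = classification_report(y_true, y_pred)
```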

Cross-validation

K-fold cross-validation splits the dataset into K folds. Each time, one fold is used as the test set and the others are used for training. This effectively creates K different train-test splits and is especially useful when the sample size is small.
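
The splitting step can be sketched as follows. This version does not shuffle and works on sample indices only; the function name is hypothetical, and real projects usually shuffle (and often stratify) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_indices(10, 3)
```

Each sample appears in exactly one test fold, so averaging the K evaluation scores uses every sample for testing once.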

Learning curves and validation curves

Learning curve: compares model behavior under different training set sizes.
Validation curve: compares model behavior under different hyperparameter settings.

Hyperparameter selection

Hyperparameters are not learned automatically from data. They are usually determined through experience and experimental comparison.

Examples include:

tree depth in a decision tree
minimum number of samples in a leaf node
regularization strength
mean and standard deviation of a normal distribution used in the model, for example for random weight initialization
number of trees in a random forest
learning rate

Two common search strategies are:

Grid search: exhaustively tests combinations of predefined values.
Random search: samples parameter values randomly and then evaluates combinations.
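
The difference between the two strategies can be sketched as below. The `validation_score` function is a hypothetical stand-in for training a model and measuring its validation performance, and the parameter names and candidate values are made up for illustration.

```python
import itertools
import random

def validation_score(depth, min_leaf):
    """Hypothetical stand-in for 'train a model, return its validation score'."""
    return -((depth - 5) ** 2) - (min_leaf - 3) ** 2

def grid_search(grid):
    """Evaluate every combination of the predefined values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def random_search(grid, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = validation_score(**params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

grid = {"depth": [3, 5, 7], "min_leaf": [1, 3, 5]}
best_grid = grid_search(grid)                  # tries all 9 combinations
best_random = random_search(grid, n_iter=5)    # tries only 5 samples
```

Grid search is guaranteed to find the best combination in the grid but its cost grows multiplicatively with each added hyperparameter. Random search caps the budget at `n_iter` evaluations, at the price of possibly missing the best combination.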

Deep Learning

Perceptrons and neural networks

A perceptron can be seen as a neuron. It receives multiple inputs, such as $x_1, x_2, \dots, x_n$, combines them with weights, and produces an output. It can serve as a classifier or regressor and is suited to linear problems. Multiple neurons together form a neural network.
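
As an illustration, a single perceptron can learn the logical AND function, which is linearly separable. The sketch assumes a step activation and the classic perceptron update rule, neither of which is spelled out in the text above; the function names and settings are hypothetical.

```python
def predict(w, b, x):
    """Weighted sum of the inputs followed by a step activation."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, eta=1.0, epochs=20):
    """Classic perceptron rule: move the weights toward misclassified samples."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)  # -1, 0, or +1
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
            b += eta * error
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
outputs = [predict(w, b, x) for x, _ in and_gate]
```

The same procedure would not converge on XOR, which is not linearly separable and needs at least one hidden layer.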

A neural network is a layered directed structure made up of many neurons. According to the universal approximation theorem, a neural network with only one hidden layer can approximate any continuous function on a bounded input region to arbitrary precision, as long as that hidden layer has enough neurons and uses a suitable nonlinear activation function.

Activation functions

The purpose of an activation function is to introduce nonlinearity. Without it, any stack of layers collapses into a single linear transformation, no matter how deep the network is.

Common activation functions include:

sigmoid: smooth and continuous, but prone to vanishing gradients
tanh: also smooth and continuous, also prone to vanishing gradients, and typically converges faster than sigmoid
ReLU: simple to compute and less prone to vanishing gradients, because its gradient is 1 for positive inputs; neurons whose inputs stay negative can stop updating
softmax: used in the output layer to convert scores into a probability distribution
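
The functions in this list are short enough to write out. The sketch below also includes the sigmoid derivative to show why saturated inputs lead to vanishing gradients; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: at most 0.25, and close to 0 for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```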

Loss functions and gradient descent

A loss function measures the gap between the true value and the predicted value, and is used to judge how good the model is.

Typical choices are:

mean squared error for regression
cross-entropy for classification

Gradient descent updates each model parameter step by step in the negative gradient direction.

Backpropagation

In deep neural networks, backpropagation is used to compute gradients for the parameters of hidden layers.

Its mathematical foundation is the chain rule.
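
A tiny worked example may make the chain rule concrete. The network below is an assumption made for illustration: one input, one sigmoid hidden unit, one linear output, and a squared-error loss. The analytic gradients are compared against finite differences as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Hidden activation h = sigmoid(w1 * x), output y = w2 * h."""
    h = sigmoid(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backward(w1, w2, x, t):
    """Apply the chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    dE_dy = y - t                      # dE/dy
    dE_dw2 = dE_dy * h                 # dE/dy * dy/dw2
    dE_dh = dE_dy * w2                 # dE/dy * dy/dh
    dE_dw1 = dE_dh * h * (1 - h) * x   # ... * dh/dz * dz/dw1
    return dE_dw1, dE_dw2

w1, w2, x, t = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, t)

# Finite-difference check of both gradients
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
```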

Convolutional Neural Networks

A convolution is the weighted overlap of two functions along some dimension.

A Convolutional Neural Network (CNN) is a neural network that introduces convolution operations. A common structure looks like this:

input → convolution / activation / pooling → … → fully connected

The roles of the main layers are:

Convolution layer: mainly for feature extraction, and also for dimensionality reduction
Activation layer: applies the activation operation
Pooling layer: reduces dimensionality and improves generalization; common types are max pooling and average pooling
Fully connected layer: acts as a classifier
Dropout: helps prevent overfitting
Batch Normalization: helps reduce vanishing gradients, reduce overfitting, improve model stability, and speed up convergence
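
The convolution, activation, and pooling steps can be sketched on nested lists. The code assumes a single channel, no padding, stride 1 for the convolution, and non-overlapping 2×2 max pooling. As in most deep learning libraries, the "convolution" here is technically a cross-correlation because the kernel is not flipped. All names and values are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu_map(feature_map):
    """Apply ReLU to every element of a feature map."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

image = [[(r * 5 + c) % 7 for c in range(5)] for r in range(5)]  # toy 5x5 input
kernel = [[1, 0], [0, -1]]                                       # toy 2x2 kernel
features = max_pool2d(relu_map(conv2d_valid(image, kernel)))     # 5x5 -> 4x4 -> 2x2
```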

Classic CNN architectures include:

LeNet
AlexNet
VGG
GoogLeNet
ResNet

Computer Vision

Fundamentals of digital images

A useful starting point in computer vision is understanding how images are formed and represented.

Basic topics include:

Imaging principles
Image storage formats: grayscale images are single-channel matrices, while color images are multi-channel matrices
Color spaces: RGB, HSV, YUV, and others
Gray levels: the range of grayscale pixel values; 256 gray levels are commonly used today

Color and intensity operations

Common image transformations include:

Grayscale conversion: turning a color image into grayscale using methods such as averaging, max-value selection, or weighted averaging
Binarization: converting a grayscale image into one that contains only 0 and 255
Color channel operations
Grayscale histograms and histogram equalization
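
Grayscale conversion and binarization can be sketched on a tiny image stored as nested lists of RGB tuples. The weights 0.299, 0.587, and 0.114 are the commonly used luma coefficients for weighted averaging; the threshold of 127 and the function names are assumptions for illustration.

```python
def to_gray_weighted(image):
    """Weighted average of R, G, B using common luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]

def to_gray_average(image):
    """Plain average of the three channels."""
    return [[round((r + g + b) / 3) for r, g, b in row] for row in image]

def binarize(gray, threshold=127):
    """Pixels above the threshold become 255, all others 0."""
    return [[255 if v > threshold else 0 for v in row] for row in gray]

image = [[(255, 255, 255), (255, 0, 0)]]  # one white pixel, one red pixel
gray = to_gray_weighted(image)
binary = binarize(gray)
```

The weighted version gives green the largest share because the eye is most sensitive to it, so a pure red pixel ends up fairly dark.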

Geometric and morphological transformations

Frequently used transformations include:

Affine transformations: a linear transformation followed by a translation, covering operations such as rotation, translation, and mirroring; straight lines and parallelism are preserved
Perspective transformation
Scaling: often implemented with interpolation methods such as nearest-neighbor interpolation and bilinear interpolation
Cropping
Morphological operations: erosion, dilation, opening, closing, and morphological gradient

Filtering, gradients, and edges

Template-based processing includes both template convolution and template sorting (order-statistic filtering, where the pixels under the template are ranked, as in median filtering). By choosing different templates, it is possible to achieve blurring, sharpening, edge extraction, and similar effects.

Common operations include:

Blurring / filtering: median filtering, mean filtering, Gaussian filtering
Edge detection: Sobel operator, Laplacian operator, Canny algorithm
Contour detection and drawing

Deep learning for image understanding

Image classification

A typical image classification pipeline is:

raw image → feature extraction → classification model

Common classification backbones include:

LeNet
AlexNet
VGG
GoogLeNet
ResNet

Object detection

Object detection combines local classification with regression for localization.

There are two major paradigms:

Two-stage detection: generate candidate regions first, then classify and regress them. The R-CNN family belongs here.
One-stage detection: classify and regress directly. The YOLO family and SSD are typical examples.

Candidate region generation can be done in several ways:

Sliding window: high detection accuracy, but extremely low efficiency
Selective Search: an image-based algorithm that computes similarity between neighboring regions
RPN (Region Proposal Network): generates candidate region predictions from feature maps

A key metric is IoU, the ratio between the intersection and union of the predicted box and the ground-truth box.
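
IoU for two axis-aligned boxes can be computed as below. The `(x1, y1, x2, y2)` corner format is an assumption; some datasets store boxes as center plus width and height instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```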

Important practical ideas also include:

multi-scale detection
feature fusion, where feature maps of different sizes are merged

Representative model families include R-CNN and YOLO.

OCR

OCR systems usually contain two parts:

text detection
text recognition

For text detection:

CTPN is suitable for horizontal text detection
SegLink is suitable for text detection with orientation or angle variation

For text recognition:

CRNN + CTC is a common combination

Face detection and face recognition

Face detection: MTCNN
Face recognition: Siamese networks, triplet networks, DeepFace, FaceNet

Image segmentation

Image segmentation assigns a class to each pixel in the image.

Segmentation can be divided by granularity into:

semantic segmentation
instance segmentation
panoptic segmentation

Common evaluation metrics include:

pixel accuracy
mean pixel accuracy
mean Intersection over Union
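
These three metrics can be sketched on flattened label maps, where each element is the class index of one pixel. The function names are hypothetical, and classes that never appear are skipped when averaging, which is one common convention among several.

```python
def pixel_accuracy(pred, true):
    """Fraction of pixels whose predicted class is correct."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def mean_pixel_accuracy(pred, true, n_classes):
    """Per-class accuracy, averaged over the classes present in the ground truth."""
    accs = []
    for c in range(n_classes):
        total = sum(t == c for t in true)
        if total:
            correct = sum(p == c and t == c for p, t in zip(pred, true))
            accs.append(correct / total)
    return sum(accs) / len(accs)

def mean_iou(pred, true, n_classes):
    """Per-class intersection over union, averaged over classes that occur."""
    ious = []
    for c in range(n_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, true))
        union = sum(p == c or t == c for p, t in zip(pred, true))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1]
true = [0, 1, 1, 1]
pa = pixel_accuracy(pred, true)            # 3 of 4 pixels correct
mpa = mean_pixel_accuracy(pred, true, 2)   # (1/1 + 2/3) / 2
miou = mean_iou(pred, true, 2)             # (1/2 + 2/3) / 2
```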

Common models include:

FCN
U-Net
Mask R-CNN
DeepLab family, which involves ideas such as dilated convolution, conditional random fields, and multi-scale pooling

Practical Project Questions

In real projects, technical understanding is only part of the work. Questions around data, deployment, and trade-offs often matter just as much.

Building a dataset

A complete dataset workflow typically includes:

collecting or acquiring data
cleaning the data
organizing it by category and annotating it

Where data comes from

Common sources include:

historical business data, often the most valuable
self-collected data, though time and cost can be high
purchased data, which is not always available
web scraping, where compliance must be considered
public datasets, which are easy to access but often less valuable for specific business needs

How much data is enough

For deep learning, more data is generally better. As a rough rule of thumb, a few hundred samples per class is a practical lower bound.

What to do when data is limited

Typical options include:

data augmentation
choosing models that work relatively well with small samples, such as SVM or U-Net

Handling extreme class imbalance

One direct method is to oversample the minority class, even by simple duplication.
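
Oversampling by duplication might look like the sketch below. It draws random copies of minority samples until every class matches the largest one; the function name and the fixed seed are illustrative. Oversampling should be applied only to the training split, otherwise duplicated samples leak into evaluation.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples = ["a1", "a2", "a3", "a4", "b1"]
labels = [0, 0, 0, 0, 1]
new_samples, new_labels = oversample(samples, labels)
counts = Counter(new_labels)
```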

Choosing a model

Model selection should depend on the real problem and its difficulty.

A practical rule is:

start with existing, classic, and mature models
use simple models for simple problems
use more complex models for harder problems

If the best choice is unclear, compare several models experimentally. In some cases, combining multiple models can make better use of their different strengths.

Traditional image processing or deep learning?

Choose traditional image processing when:

there is no need to understand image content semantically
the problem is simple
image variation is small
interference is limited

Choose deep learning when:

understanding image content or scene context is necessary
the task is complex
image variation is large
interference is significant
stronger generalization is required

Annotation strategy

Annotation depends on the task:

classification
object detection
segmentation

Who does the labeling also varies:

in large companies, dedicated annotation staff or teams may handle it
in small and medium-sized companies, developers or technical teams often label data themselves
some datasets require domain expertise for accurate annotation

Training time

Training time is usually estimated in advance, but in real projects incremental training is often used, so an existing model is updated with new data instead of being retrained from scratch.

Why not use a certain model?

The answer should usually be based on effectiveness, and it helps to explain the model’s characteristics rather than rejecting it vaguely.

Deployment and use

Models can be deployed in several ways:

server-side deployment
client-side deployment
embedded device deployment

They are often packaged either as:

a network service
a class or function for direct invocation

Expected accuracy

In practical projects, accuracy is often expected to exceed 95%, although the acceptable level depends on the application.

Project details that often matter

People may ask about details such as:

what GPU model was used
what type of industrial camera was used, and at what resolution
how the camera was installed and what the frame rate was
how many people were on the project and how responsibilities were divided

What should be clearly described in a résumé project entry

A project description should make these points explicit:

requirement: where it is used, who uses it, and what problem it solves
dataset: source, size, and preprocessing methods
model selection and optimization process
whether overfitting or underfitting appeared and how they were handled
final results

Example Project Scenarios

Chip inspection

Samples: high-resolution images of chips
Technical route: OpenCV-based image processing
Key techniques: grayscale conversion, binarization, dilation, contour detection, solid contour filling

Capsule inspection

Samples: high-resolution capsule images
Technical route: OpenCV-based image processing
Key techniques: grayscale conversion, binarization, dilation, blurring, Hough transform, pixel counting, contour finding/drawing/area-perimeter calculation

Tile defect detection

Samples: more than 1,000 tile samples across 7 classes: normal, cavity, crack, missing block, color plate, scratch, and others
Preprocessing: after rotation and mirroring augmentation, the dataset expanded to more than 40,000 samples
Model: standard CNN
Key parameters: input image size 256×256, learning rate between 0.0001 and 0.00001
Accuracy: above 97% on the test set
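
Rotation and mirroring augmentation can be sketched on an image stored as nested lists. This is only an illustration of the idea: it produces the 8 variants obtained from 90-degree rotations and horizontal mirroring, and it does not reproduce the exact augmentation recipe behind the sample counts above, which is not described. The function names are hypothetical.

```python
def rotate90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def mirror(image):
    """Flip an image left to right."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the 4 rotations of the image, each with and without mirroring."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate90(current)
    return variants

tile = [[1, 2], [3, 4]]
variants = augment(tile)
```

Labels stay unchanged under these transformations for defect classification, which is what makes them a cheap way to enlarge the dataset.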

Object detection use cases

Representative detection tasks include:

judging lumbar disc herniation
detecting whether storage tank covers in a lubricant enterprise are open or closed
detecting oil leakage at key nodes in oil pipelines
detecting rolling rocks, landslides, or debris flow on highways
detecting cracks or seepage inside highway tunnels
pest detection in crops and forest land
fire point detection, including smoke and flames
smoke and fire detection in power plants
detecting electric scooters brought into residential buildings
security inspection systems for detecting prohibited items

Image segmentation use cases

detecting and segmenting defect regions in industrial products
detecting road damage such as cracks, alligator cracking, and potholes