A Practical AI Knowledge Map: Machine Learning, Deep Learning, and Computer Vision

Machine Learning

Core ideas and common task types

Machine learning can be understood from several different angles.

One common way is to divide it by how data is provided:

  • Supervised learning: data is labeled, so the model receives both inputs and outputs. Typical tasks include regression and classification.
  • Unsupervised learning: data is unlabeled, and the model only sees inputs. A typical task is clustering.
  • Semi-supervised learning: a small amount of labeled data is combined with a larger amount of unlabeled data, blending supervised and unsupervised approaches.

It can also be grouped by learning style:

  • Model-based learning and instance-based learning
  • Batch learning and incremental learning

From the perspective of problems to solve, the most common categories are:

  • Regression: the prediction target is continuous.
  • Classification: the prediction target is discrete.
  • Clustering: an unsupervised task that groups similar samples into the same cluster.
  • Dimensionality reduction: reducing the number of features or the scale of the data.

A typical machine learning workflow usually follows this path:

data collection → data cleaning → model selection → training → evaluation → testing → deployment and maintenance

Data preprocessing

Before training, data often needs to be transformed into a form that models can handle more effectively.

Common preprocessing methods include:

  1. Standardization: transform each column so that its mean is 0 and its standard deviation is 1.
  2. Min-max scaling: map the minimum value of each column to 0 and the maximum to 1.
  3. Normalization: rescale each sample, usually row-wise, to unit norm. With the L1 norm on non-negative data, each value becomes its share of the row total, between 0 and 1.
  4. Binarization: convert values into only two states, 0 and 1.
  5. One-hot encoding: represent a category as one 1 and the rest 0s.
  6. Label encoding: convert string labels into numeric form.
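
Three of the methods above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names are made up for this example, the standard deviation is assumed to be the population one, and a constant column (zero spread) is not handled.

```python
import statistics

def standardize(column):
    """Shift and scale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumption)
    return [(v - mean) / std for v in column]

def min_max_scale(column):
    """Map the column minimum to 0 and the column maximum to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def one_hot(labels):
    """Encode each label as a vector holding a single 1."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical column of values
print(standardize(heights))
print(min_max_scale(heights))
print(one_hot(["red", "green", "red"]))
```

In practice a library scaler is usually fitted on the training split only and then reused on the test split, so that test statistics do not leak into training.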

Regression

Linear regression

A linear model can be written as:

$y = w^T x + b$

Linear regression uses this form for regression tasks, usually when the samples roughly follow a linear distribution.

A loss function measures the difference between predicted values and true values. Model parameters are commonly optimized with gradient descent, which updates parameters in the opposite direction of the gradient:

$$ w_i = w_i + \Delta w_i, \qquad \Delta w_i = - \eta \frac{\partial E}{\partial w_i} $$
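
As a rough sketch of this update rule, the following fits a one-feature linear model by batch gradient descent. It assumes mean squared error as the loss $E$ and treats $\eta$ as a fixed learning rate; the data, the function name, and the hyperparameter values are illustrative only.

```python
def fit_line(xs, ys, eta=0.05, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of E = (1/n) * sum((w*x + b - y)^2)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient: delta = -eta * dE/dw
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, which is why it is normally tuned as a hyperparameter.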

Two widely used regularized variants are:

  • Lasso regression: adds an L1 regularization term to the standard linear regression loss.
  • Ridge regression: adds an L2 regularization term.
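
Written out, and assuming a mean-squared-error base loss over $n$ samples with a regularization strength $\lambda$ (a symbol introduced here for illustration), the two losses could look like this. Libraries scale these terms differently, so the constants should be read as one possible convention.

$$ E_{\text{lasso}} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - w^T x_j - b \right)^2 + \lambda \sum_i |w_i| $$

$$ E_{\text{ridge}} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - w^T x_j - b \right)^2 + \lambda \sum_i w_i^2 $$

The L1 penalty tends to push some weights to exactly zero, which acts as a form of feature selection, while the L2 penalty shrinks all weights smoothly without zeroing them.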

Polynomial regression

When the relationship in the data is not linear, polynomial regression introduces higher-order terms. Even though the features become nonlinear, the coefficients are still linear, so it can be viewed as an extension of linear regression.

This brings up two classic modeling issues:

  • Underfitting: the model is too simple and fails to capture the true pattern. Both training and test performance tend to be low. Common fixes include increasing model complexity or adding more features.
  • Overfitting: the model fits the training samples too closely and generalizes poorly. Training accuracy is high but test accuracy is low. Common fixes include increasing the sample size, simplifying the model, or reducing the number of features.

Decision tree regression

Decision trees can be used for regression as well as classification. In regression settings, prediction is obtained by averaging values within a leaf node.

Regression metrics

Common evaluation metrics include:

  • R² score
  • Mean squared error
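
Both metrics are short enough to compute by hand. The sketch below uses the usual definitions, with made-up values and function names chosen for this example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]  # hypothetical targets
y_pred = [2.5, 5.0, 7.5, 9.0]  # hypothetical predictions
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

An R² of 1 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the mean.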

Classification

Logistic regression

Logistic regression is used for binary classification. It first produces a continuous value, maps it to a probability between 0 and 1 with the logistic (sigmoid) function, and then assigns a class by thresholding that probability, typically at 0.5.

The sigmoid function is:

$y = \frac {1}{1+e^{-x}}$

Its standard loss function is cross-entropy.
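
A minimal sketch of these pieces for a single sample follows. The 0.5 decision threshold and the function names are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the sigmoid output into a discrete class label."""
    return 1 if sigmoid(z) >= threshold else 0

def binary_cross_entropy(y_true, p):
    """Cross-entropy loss for one sample with true label 0 or 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(sigmoid(0.0))
print(predict_class(2.0), predict_class(-2.0))
print(binary_cross_entropy(1, sigmoid(2.0)))
```

The loss grows without bound as the predicted probability for the true class approaches 0, so confident wrong answers are penalized heavily.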

Multiclass classification can be handled by combining multiple binary classifiers.

Decision trees

A decision tree follows the idea that similar causes tend to lead to similar outcomes. It builds a tree structure that splits samples with similar attributes into the same branch. Voting is used for classification, and averaging is used for regression.

Important concepts around decision trees include:

  • Information entropy: describes how chaotic or ordered a dataset is.
  • Feature selection
  • Information gain: the reduction in entropy after a split.
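  • Gain ratio: information gain divided by the intrinsic value of the split, that is, the entropy of the split proportions themselves.
  • Gini index (Gini impurity): the probability that two samples drawn at random from a node belong to different classes.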
  • Pruning: both pre-pruning and post-pruning
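
The split criteria in this list can be computed directly from label counts. The sketch below uses base-2 entropy and the impurity form of the Gini measure; the labels and function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]         # hypothetical node
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]
print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves each child as mixed as the parent gains nothing.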

Decision trees are also central to ensemble learning:

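  • Strongly dependent ensembles, where base learners are built one after another: Boosting, for example AdaBoost
  • Weakly dependent ensembles, where base learners can be built in parallel: Bagging, for example Random Forest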

Support Vector Machine

An SVM is another binary classification model. Its goal is to find the optimal linear decision boundary that maximizes the margin between the boundary and the support vectors, the samples closest to that boundary.

A good classification boundary typically aims for:

  • correctness
  • fairness
  • safety
  • simplicity

SVMs handle both:

  • linearly separable problems
  • linearly inseparable problems

For nonlinear cases, a kernel function maps the data into a higher-dimensional space where linear separation becomes possible. Common kernels include:

  • linear kernel
  • polynomial kernel
  • Gaussian kernel

Naive Bayes

Naive Bayes is based on Bayes’ theorem:

$$ P(A|B) = \frac{P(A)P(B|A)}{P(B)} $$

Its defining assumption is that features are independent of each other. Under that assumption, the model computes the probability that a sample belongs to each class.
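
To make the calculation concrete, here is a small sketch with entirely hypothetical numbers: two classes, two word features, and hand-picked probabilities. It multiplies the prior by each per-feature likelihood, which is where the independence assumption enters, and then normalizes by the evidence $P(B)$.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(class | observed features) for every class.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: P(feature | class)}}
    observed:    list of features present in the sample
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            # Independence assumption: likelihoods simply multiply.
            score *= likelihoods[cls][feature]
        scores[cls] = score
    evidence = sum(scores.values())  # P(B) in Bayes' theorem
    return {cls: score / evidence for cls, score in scores.items()}

# Hypothetical probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
print(posterior)
```

Real implementations typically sum log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow when there are many features.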

Clustering

Clustering is an unsupervised learning task that groups samples according to similarity. Samples within the same cluster should be more similar to each other, while samples from different clusters should be less similar.

Similarity is commonly measured through distance, such as:

  • Euclidean distance
  • Manhattan distance
  • Chebyshev distance
  • Minkowski distance
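
These four distances are closely related: Minkowski distance with order p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and the limit as p grows gives Chebyshev distance. A short sketch, with illustrative points and function names:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Largest coordinate-wise difference between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)
euclidean = minkowski(a, b, 2)
print(manhattan, euclidean, chebyshev(a, b))
```

Because distances depend on feature scale, clustering is usually run on standardized or min-max scaled features.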

Several major clustering families are commonly used:

  • Prototype-based / partition-based clustering: such as K-Means
  • Density-based clustering: such as DBSCAN
  • Hierarchical clustering: such as agglomerative clustering

A quick comparison:

<table>
<thead>
<tr> <th>Item</th> <th>K-Means</th> <th>DBSCAN</th> <th>Agglomerative Hierarchical</th> </tr>
</thead>
<tbody>
<tr> <td>Type</td> <td>Prototype-based</td> <td>Density-based</td> <td>Hierarchical</td> </tr>
<tr> <td>Cluster center</td> <td>Yes</td> <td>No</td> <td>No</td> </tr>
<tr> <td>Need to set K in advance</td> <td>Yes</td> <td>No</td> <td>Yes</td> </tr>
<tr> <td>Sensitivity to noise</td> <td>Sensitive</td> <td>Not sensitive</td> <td>Not sensitive</td> </tr>
</tbody>
</table>

A common evaluation metric for clustering is the silhouette coefficient.

Model evaluation and tuning

Classification metrics

The most common metrics for classification include:

  • Accuracy: number of correct predictions / total number of samples
  • Error rate: number of incorrect predictions / total number of samples
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1 score: 2 * precision * recall / (precision + recall)
  • Confusion matrix
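
The formulas above all follow from the four confusion-matrix counts. The sketch below computes them for a binary problem; the labels are made up, and the function does not guard against the zero-division case where no positives are predicted or present.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "error_rate": (fp + fn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical predictions
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can look high while recall on the minority class stays poor, which is why precision, recall, and F1 are reported alongside it.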

Cross-validation

K-fold cross-validation splits the dataset into K folds. Each time, one fold is used as the test set and the others are used for training. This effectively creates K different train-test splits and is especially useful when the sample size is small.
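
The splitting itself can be sketched with index lists. This version is a simplified illustration that does not shuffle; in practice the data is usually shuffled (or stratified by class) before the folds are cut. The function name is made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample lands in the test fold exactly once, so averaging the K scores uses every sample for evaluation without ever testing on data the model was trained on in that round.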

Learning curves and validation curves

  • Learning curve: compares model behavior under different training set sizes.
  • Validation curve: compares model behavior under different hyperparameter settings.

Hyperparameter selection

Hyperparameters are not learned automatically from data. They are usually determined through experience and experimental comparison.

Examples include:

  • tree depth in a decision tree
  • minimum number of samples in a leaf node
  • regularization strength
  • expected value and standard deviation in a normal distribution setting
  • number of trees in a random forest
  • learning rate

Two common search strategies are:

  • Grid search: exhaustively tests combinations of predefined values.
  • Random search: samples parameter values randomly and then evaluates combinations.
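
The difference between the two strategies can be sketched as follows. The scoring function here is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score", and the parameter names are only examples.

```python
import itertools
import random

def grid_search(param_grid, score_fn):
    """Evaluate every combination of the predefined values."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(param_grid, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical validation score that peaks at depth 4, leaf size 2."""
    return -(params["max_depth"] - 4) ** 2 - (params["min_samples_leaf"] - 2) ** 2

grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 2, 5]}
best_grid = grid_search(grid, toy_score)
best_random = random_search(grid, toy_score, n_iter=5)
print(best_grid)
print(best_random)
```

Grid search cost grows multiplicatively with every added hyperparameter, whereas random search has a fixed budget but may miss the best combination.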

Deep Learning

Perceptrons and neural networks

A perceptron can be seen as a neuron. It receives multiple inputs, such as $x_1, x_2, \dots, x_n$, combines them with weights, and produces an output. It can serve as a classifier or regressor and is suited to linearly separable problems. Multiple neurons together form a neural network.

A neural network is a layered directed structure made up of many neurons. According to the universal approximation theorem, a neural network with only one hidden layer can approximate any continuous function on a bounded input range to arbitrary precision, as long as that hidden layer has enough neurons and uses a suitable nonlinear activation function.

Activation functions

The purpose of an activation function is to introduce nonlinearity. Without it, stacked layers would collapse into a single linear transformation, no matter how deep the network is.

Common activation functions include:

  • sigmoid: smooth and continuous, but prone to vanishing gradients
  • tanh: also smooth and continuous, also prone to vanishing gradients, and typically converges faster than sigmoid
  • ReLU: simple to compute and does not saturate for positive inputs, which eases the vanishing-gradient problem; a unit whose input stays negative outputs 0 and stops updating
  • softmax: used in the output layer to convert scores into a probability distribution
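
For reference, the four functions can be written in a few lines of plain Python. The max-subtraction in softmax is a common numerical-stability trick and an addition of this sketch, not something the list above prescribes.

```python
import math

def sigmoid(x):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash a real value into (-1, 1), centered on 0."""
    return math.tanh(x)

def relu(x):
    """Pass positive values through and clamp negatives to 0."""
    return max(0.0, x)

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), tanh(0.0), relu(-3.0), relu(2.0))
print(softmax([1.0, 2.0, 3.0]))
```

Sigmoid and tanh flatten out for large positive or negative inputs, which is where their gradients vanish; ReLU keeps a gradient of 1 for any positive input.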

Loss functions and gradient descent

A loss function measures the gap between the true value and the predicted value, and is used to judge how good the model is.

Typical choices are:

  • mean squared error for regression
  • cross-entropy for classification

Gradient descent updates each model parameter step by step in the negative gradient direction.

Backpropagation

In deep neural networks, backpropagation is used to compute gradients for the parameters of hidden layers.

Its mathematical foundation is the chain rule.
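
A tiny worked example may help here. The sketch assumes a toy network with one sigmoid hidden unit, a linear output, and a squared-error loss; all weights and inputs are arbitrary. It applies the chain rule by hand and checks the hidden-layer gradient against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x, t):
    """x -> h = sigmoid(w1*x) -> y = w2*h, with loss E = 0.5*(y - t)^2."""
    h = sigmoid(w1 * x)
    y = w2 * h
    loss = 0.5 * (y - t) ** 2
    return h, y, loss

def backward(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y, _ = forward(w1, w2, x, t)
    dE_dy = y - t                   # derivative of the loss w.r.t. the output
    dE_dw2 = dE_dy * h              # output-layer weight
    dE_dh = dE_dy * w2              # error propagated back to the hidden unit
    dh_dz = h * (1 - h)             # derivative of the sigmoid
    dE_dw1 = dE_dh * dh_dz * x      # hidden-layer weight
    return dE_dw1, dE_dw2

def numeric_grad_w1(w1, w2, x, t, eps=1e-6):
    """Central finite-difference estimate of dE/dw1, used only as a check."""
    plus = forward(w1 + eps, w2, x, t)[2]
    minus = forward(w1 - eps, w2, x, t)[2]
    return (plus - minus) / (2 * eps)

w1, w2, x, t = 0.5, -1.5, 2.0, 1.0  # arbitrary illustrative values
analytic_w1, analytic_w2 = backward(w1, w2, x, t)
numeric_w1 = numeric_grad_w1(w1, w2, x, t)
print(analytic_w1, numeric_w1)
```

Deep learning frameworks automate exactly this bookkeeping, reusing the propagated error term at each layer so the gradient for every parameter is obtained in a single backward pass.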

Convolutional Neural Networks

A convolution slides one function over another and, at each offset, sums the products of their overlapping values.

A Convolutional Neural Network (CNN) is a neural network that introduces convolution operations. A common structure looks like this:

input → convolution / activation / pooling → … → fully connected

The roles of the main layers are:

  • Convolution layer: mainly for feature extraction, and also for dimensionality reduction
  • Activation layer: applies the activation operation
  • Pooling layer: reduces dimensionality and improves generalization; common types are max pooling and average pooling
  • Fully connected layer: acts as a classifier
  • Dropout: helps prevent overfitting
  • Batch Normalization: helps reduce vanishing gradients, reduce overfitting, improve model stability, and speed up convergence
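
The convolution and pooling steps can be sketched on nested lists. This is a simplified single-channel illustration with no padding and a stride of 1 for the convolution; the kernel values and image are arbitrary. Like most deep learning libraries, it computes cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with stride equal to the window size."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 1, 2, 3],
    [2, 1, 0, 1, 2],
    [3, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, 1]]          # hypothetical 2x2 kernel
features = conv2d(image, kernel)   # 5x5 input -> 4x4 feature map
pooled = max_pool2d(features)      # 4x4 feature map -> 2x2
print(features)
print(pooled)
```

The shapes show where the size reduction comes from: an unpadded convolution trims the border, and 2x2 pooling halves each spatial dimension.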

Classic CNN architectures include:

  • LeNet
  • AlexNet
  • VGG
  • GoogLeNet
  • ResNet

Computer Vision

Fundamentals of digital images

A useful starting point in computer vision is understanding how images are formed and represented.

Basic topics include:

  1. Imaging principles
  2. Image storage formats: grayscale images are single-channel matrices, while color images are multi-channel matrices
  3. Color spaces: RGB, HSV, YUV, and others
  4. Gray levels: the range of grayscale pixel values; 256 gray levels are commonly used today

Color and intensity operations

Common image transformations include:

  • Grayscale conversion: turning a color image into grayscale using methods such as averaging, max-value selection, or weighted averaging
  • Binarization: converting a grayscale image into one that contains only 0 and 255
  • Color channel operations
  • Grayscale histograms and histogram equalization
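
The first two operations can be sketched per pixel. The weighted method below assumes the widely used BT.601 luma weights (0.299, 0.587, 0.114), and the threshold of 127 is an arbitrary choice for illustration.

```python
def to_grayscale(pixel_rgb, method="weighted"):
    """Convert one (R, G, B) pixel to a gray level from 0 to 255."""
    r, g, b = pixel_rgb
    if method == "average":
        return round((r + g + b) / 3)
    if method == "max":
        return max(r, g, b)
    # Weighted average; the weights are an assumption (BT.601 luma).
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def binarize(gray_image, threshold=127):
    """Map gray levels above the threshold to 255 and the rest to 0."""
    return [[255 if v > threshold else 0 for v in row] for row in gray_image]

print(to_grayscale((255, 0, 0)))
print(to_grayscale((30, 60, 90), method="average"))
print(to_grayscale((30, 60, 90), method="max"))
print(binarize([[10, 200], [127, 128]]))
```

The weighted form gives green the largest share because the eye is most sensitive to it, so it usually matches perceived brightness better than a plain average.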

Geometric and morphological transformations

Frequently used transformations include:

  • Affine transformations: a linear transformation followed by a translation, covering operations such as rotation, translation, and mirroring; parallel lines remain parallel
  • Perspective transformation
  • Scaling: often implemented with interpolation methods such as nearest-neighbor interpolation and bilinear interpolation
  • Cropping
  • Morphological operations: erosion, dilation, opening, closing, and morphological gradient

Filtering, gradients, and edges

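Template-based processing includes both template convolution, where the output is a weighted sum of the pixels under the template, and template sorting, where the output is picked by ranking those pixels, as in median filtering. By choosing different templates, it is possible to achieve blurring, sharpening, edge extraction, and similar effects.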

Common operations include:

  • Blurring / filtering: median filtering, mean filtering, Gaussian filtering
  • Edge detection: Sobel operator, Laplacian operator, Canny algorithm
  • Contour detection and drawing
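
Two of these operations are sketched below on 3x3 neighborhoods: a median filter, which ranks the pixels under the template, and the horizontal Sobel template, which takes a weighted sum. Border handling is simplified (border pixels are left unchanged), and the tiny images are made up.

```python
import statistics

def neighborhood(image, i, j):
    """Return the 3x3 neighborhood around (i, j) in row-major order."""
    return [image[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def median_filter(image):
    """3x3 median filter over interior pixels; borders are left unchanged."""
    out = [row[:] for row in image]
    for i in range(1, len(image) - 1):
        for j in range(1, len(image[0]) - 1):
            out[i][j] = statistics.median(neighborhood(image, i, j))
    return out

# Row-major 3x3 Sobel template for the horizontal gradient.
SOBEL_X = [-1, 0, 1, -2, 0, 2, -1, 0, 1]

def sobel_x(image, i, j):
    """Horizontal gradient response at an interior pixel."""
    return sum(w * v for w, v in zip(SOBEL_X, neighborhood(image, i, j)))

noisy = [
    [10, 10, 10],
    [10, 255, 10],  # a single bright outlier
    [10, 10, 10],
]
edge = [
    [0, 0, 255],
    [0, 0, 255],    # a vertical dark-to-bright edge
    [0, 0, 255],
]
print(median_filter(noisy)[1][1])
print(sobel_x(edge, 1, 1), sobel_x(noisy, 1, 1))
```

The median filter removes the isolated outlier entirely, which is why it works well on salt-and-pepper noise, while the Sobel response is large at the edge and zero on the symmetric patch.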

Deep learning for image understanding

Image classification

A typical image classification pipeline is:

raw image → feature extraction → classification model

Common classification backbones include:

  • LeNet
  • AlexNet
  • VGG
  • GoogLeNet
  • ResNet

Object detection

Object detection combines local classification with regression for localization.

There are two major paradigms:

  • Two-stage detection: generate candidate regions first, then classify and regress them. The R-CNN family belongs here.
  • One-stage detection: classify and regress directly. The YOLO family and SSD are typical examples.

Candidate region generation can be done in several ways:

  • Sliding window: high detection accuracy, but extremely low efficiency
  • Selective Search: a segmentation-based algorithm that starts from small image regions and repeatedly merges neighboring regions according to their similarity
  • RPN (Region Proposal Network): generates candidate region predictions from feature maps

A key metric is IoU, the ratio between the intersection and union of the predicted box and the ground-truth box.
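
IoU is straightforward to compute for axis-aligned boxes. The sketch assumes boxes given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2, which is one common convention among several.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union

predicted = (1, 1, 3, 3)      # hypothetical predicted box
ground_truth = (0, 0, 2, 2)   # hypothetical ground-truth box
print(iou(predicted, ground_truth))
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a chosen threshold, with 0.5 being a frequent choice.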

Important practical ideas also include:

  • multi-scale detection
  • feature fusion, where feature maps of different sizes are merged

Representative model families include R-CNN and YOLO.

OCR

OCR systems usually contain two parts:

  • text detection
  • text recognition

For text detection:

  • CTPN is suitable for horizontal text detection
  • SegLink is suitable for text detection with orientation or angle variation

For text recognition:

  • CRNN + CTC is a common combination

Face detection and face recognition

  • Face detection: MTCNN
  • Face recognition: Siamese networks, triplet networks, DeepFace, FaceNet

Image segmentation

Image segmentation assigns a class to each pixel in the image.

Segmentation can be divided by granularity into:

  • semantic segmentation
  • instance segmentation
  • panoptic segmentation

Common evaluation metrics include:

  • pixel accuracy
  • mean pixel accuracy
  • mean Intersection over Union
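
Pixel accuracy and mean IoU can be sketched on small label masks. Classes are assumed to be integer ids starting at 0, and classes that appear in neither mask are skipped when averaging, which is one of several conventions in use.

```python
def _pairs(truth, pred):
    """Flatten two masks into a list of (true_label, predicted_label) pairs."""
    return [(t, p) for t_row, p_row in zip(truth, pred) for t, p in zip(t_row, p_row)]

def pixel_accuracy(truth, pred):
    """Fraction of pixels whose predicted class matches the ground truth."""
    pairs = _pairs(truth, pred)
    return sum(1 for t, p in pairs if t == p) / len(pairs)

def mean_iou(truth, pred, num_classes):
    """Average of the per-class Intersection over Union values."""
    pairs = _pairs(truth, pred)
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in pairs if t == c and p == c)
        union = sum(1 for t, p in pairs if t == c or p == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)

truth = [[0, 0, 1], [0, 1, 1]]  # hypothetical ground-truth mask
pred = [[0, 1, 1], [0, 1, 1]]   # hypothetical predicted mask
print(pixel_accuracy(truth, pred))
print(mean_iou(truth, pred, num_classes=2))
```

Mean IoU is generally the stricter of the two, because a large background class can inflate pixel accuracy even when small objects are segmented poorly.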

Common models include:

  • FCN
  • U-Net
  • Mask R-CNN
  • DeepLab family, which involves ideas such as dilated convolution, conditional random fields, and multi-scale pooling

Practical Project Questions

In real projects, technical understanding is only part of the work. Questions around data, deployment, and trade-offs often matter just as much.

Building a dataset

A complete dataset workflow typically includes:

  • collecting or acquiring data
  • cleaning the data
  • organizing it by category and annotating it

Where data comes from

Common sources include:

  • historical business data, often the most valuable
  • self-collected data, though time and cost can be high
  • purchased data, which is not always available
  • web scraping, where compliance must be considered
  • public datasets, which are easy to access but often less valuable for specific business needs

How much data is enough

For deep learning, more data is generally better. At the class level, having sample counts in the hundreds per class is considered a practical lower bound.

What to do when data is limited

Typical options include:

  • data augmentation
  • choosing models that work relatively well with small samples, such as SVM or U-Net

Handling extreme class imbalance

One direct method is to oversample the minority class, even by simple duplication.
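
A minimal sketch of duplication-based oversampling follows. It assumes the goal is to bring every class up to the size of the largest one; the sample names, labels, and fixed seed are illustrative.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # plain duplication
            out_labels.append(cls)
    return out_samples, out_labels

samples = ["a", "b", "c", "d", "e"]            # hypothetical sample ids
labels = ["ok", "ok", "ok", "ok", "defect"]    # 4 to 1 imbalance
new_samples, new_labels = oversample(samples, labels)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only. Duplicating samples before the train-test split would place copies of the same sample on both sides and inflate the test score.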

Choosing a model

Model selection should depend on the real problem and its difficulty.

A practical rule is:

  • start with existing, classic, and mature models
  • use simple models for simple problems
  • use more complex models for harder problems

If the best choice is unclear, compare several models experimentally. In some cases, combining multiple models can make better use of their different strengths.

Traditional image processing or deep learning?

Choose traditional image processing when:

  • there is no need to understand image content semantically
  • the problem is simple
  • image variation is small
  • interference is limited

Choose deep learning when:

  • understanding image content or scene context is necessary
  • the task is complex
  • image variation is large
  • interference is significant
  • stronger generalization is required

Annotation strategy

Annotation depends on the task:

  • classification
  • object detection
  • segmentation

Who does the labeling also varies:

  • in large companies, dedicated annotation staff or teams may handle it
  • in small and medium-sized companies, developers or technical teams often label data themselves
  • some datasets require domain expertise for accurate annotation

Training time

Training time is usually estimated in advance, but in real projects incremental training is often used.

Why not use a certain model?

The answer should usually be based on effectiveness, and it helps to explain the model’s characteristics rather than rejecting it vaguely.

Deployment and use

Models can be deployed in several ways:

  • server-side deployment
  • client-side deployment
  • embedded device deployment

They are often packaged either as:

  • a network service
  • a class or function for direct invocation

Expected accuracy

In practical projects, accuracy is often expected to exceed 95%.

Project details that often matter

People may ask about details such as:

  • what GPU model was used
  • what type of industrial camera was used, and at what resolution
  • how the camera was installed and what the frame rate was
  • how many people were on the project and how responsibilities were divided

What should be clearly described in a résumé project entry

A project description should make these points explicit:

  • requirement: where it is used, who uses it, and what problem it solves
  • dataset: source, size, and preprocessing methods
  • model selection and optimization process
  • whether overfitting or underfitting appeared and how they were handled
  • final results

Example Project Scenarios

Chip inspection

  • Samples: high-resolution images of chips
  • Technical route: OpenCV-based image processing
  • Key techniques: grayscale conversion, binarization, dilation, contour detection, solid contour filling

Capsule inspection

  • Samples: high-resolution capsule images
  • Technical route: OpenCV-based image processing
  • Key techniques: grayscale conversion, binarization, dilation, blurring, Hough transform, pixel counting, contour finding/drawing/area-perimeter calculation

Tile defect detection

  • Samples: more than 1,000 tile samples across 7 classes: normal, cavity, crack, missing block, color plate, scratch, and others
  • Preprocessing: after rotation and mirroring augmentation, the dataset expanded to more than 40,000 samples
  • Model: standard CNN
  • Key parameters: input image size 256 × 256, learning rate between 0.00001 and 0.0001
  • Accuracy: above 97% on the test set

Object detection use cases

Representative detection tasks include:

  • judging lumbar disc herniation
  • detecting whether storage tank covers in a lubricant enterprise are open or closed
  • detecting oil leakage at key nodes in oil pipelines
  • detecting rolling rocks, landslides, or debris flow on highways
  • detecting cracks or seepage inside highway tunnels
  • pest detection in crops and forest land
  • fire point detection, including smoke and flames
  • smoke and fire detection in power plants
  • detecting electric scooters brought into residential buildings
  • security inspection systems for detecting prohibited items

Image segmentation use cases

  • detecting and segmenting defect regions in industrial products
  • detecting road damage such as cracks, alligator cracking, and potholes