Introduction

Computer Vision (CV) enables machines to interpret and understand visual information from the world. From facial recognition to autonomous vehicles, computer vision powers many modern AI applications.

How Computers "See" Images

Digital Image Representation

Images are stored as grids of pixels with numerical values.

Grayscale Image:

Each pixel = single value (0-255)
0 = black, 255 = white

Example 3x3 image:
┌─────┬─────┬─────┐
│  0  │ 128 │ 255 │
├─────┼─────┼─────┤
│  64 │ 192 │  32 │
├─────┼─────┼─────┤
│ 255 │  96 │   0 │
└─────┴─────┴─────┘

Color Image (RGB):

Each pixel = 3 values (Red, Green, Blue)
Each channel: 0-255

Example pixel:
Red=255, Green=0, Blue=0 → Pure red
Red=255, Green=255, Blue=0 → Yellow

Image Dimensions:

1920 x 1080 RGB image:
- Width: 1920 pixels
- Height: 1080 pixels
- Channels: 3 (RGB)
- Total values: 1920 × 1080 × 3 = 6,220,800
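As a minimal sketch, the pixel grids above can be modeled with nested Python lists. The names `gray`, `pure_red`, `yellow`, and `total_values` are illustrative choices, not part of any library.

```python
# Grayscale: a 2-D grid with one intensity (0-255) per pixel.
gray = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 96, 0],
]

# RGB: each pixel is a (red, green, blue) triple, each channel 0-255.
pure_red = (255, 0, 0)
yellow = (255, 255, 0)


def total_values(width, height, channels):
    """Count the numbers stored in an uncompressed image."""
    return width * height * channels


height, width = len(gray), len(gray[0])
print(height, width)                # 3 3
print(total_values(1920, 1080, 3))  # 6220800
```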

Traditional Computer Vision

Before deep learning, CV relied on hand-crafted features.

Edge Detection

Detecting boundaries in images using filters.

Sobel Filter:

Horizontal edges:     Vertical edges:
┌────┬────┬────┐     ┌────┬────┬────┐
│ -1 │ -2 │ -1 │     │ -1 │  0 │  1 │
├────┼────┼────┤     ├────┼────┼────┤
│  0 │  0 │  0 │     │ -2 │  0 │  2 │
├────┼────┼────┤     ├────┼────┼────┤
│  1 │  2 │  1 │     │ -1 │  0 │  1 │
└────┴────┴────┘     └────┴────┴────┘
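To make the two kernels concrete, here is a small pure-Python sketch that slides each one over a toy image containing a single vertical edge. The helper name `apply_kernel` and the test image are assumptions made for illustration. Like most vision libraries, the sketch does not flip the kernel before sliding it.

```python
# Sobel kernels copied from the diagram above.
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to vertical edges


def apply_kernel(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Dark left half, bright right half: one vertical edge down the middle.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

vertical_response = apply_kernel(image, SOBEL_VERTICAL)
horizontal_response = apply_kernel(image, SOBEL_HORIZONTAL)
print(vertical_response)    # [[1020, 1020], [1020, 1020]]
print(horizontal_response)  # [[0, 0], [0, 0]]
```

The vertical-edge kernel responds strongly, while the horizontal-edge kernel returns zero because every row of the image is identical.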

Feature Descriptors

SIFT, SURF, ORB:

Detect keypoints in images
Create descriptors for matching
Used for image matching, panoramas

Limitations of Traditional CV:

Required manual feature engineering
Sensitive to lighting, angle, scale
Couldn't learn complex patterns

Deep Learning for Computer Vision

Convolutional Neural Networks (CNNs)

CNNs automatically learn visual features.

Convolution Operation:

Input Image         Filter          Feature Map
┌───┬───┬───┬───┐   ┌───┬───┐      ┌───┬───┬───┐
│ 1 │ 2 │ 3 │ 0 │   │ 1 │ 0 │      │ 2 │ 4 │ 4 │
├───┼───┼───┼───┤ × ├───┼───┤  →   ├───┼───┼───┤
│ 0 │ 1 │ 2 │ 1 │   │ 0 │ 1 │      │ 0 │ 2 │ 4 │
├───┼───┼───┼───┤   └───┴───┘      ├───┼───┼───┤
│ 1 │ 0 │ 1 │ 2 │                  │ 2 │ 0 │ 2 │
├───┼───┼───┼───┤                  └───┴───┴───┘
│ 2 │ 1 │ 0 │ 1 │
└───┴───┴───┴───┘
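The sliding-window arithmetic can be checked with a few lines of Python. This is a sketch that assumes no padding and a stride of 1, using the 4x4 input and 2x2 filter from the diagram. The function name `convolve2d` is an illustrative choice. As in most deep learning libraries, the filter is not flipped.

```python
def convolve2d(image, kernel):
    """Sum of element-wise products at each filter position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [2, 4, 4]
# [0, 2, 4]
# [2, 0, 2]
```

For example, the top-left output is 1×1 + 2×0 + 0×0 + 1×1 = 2.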

CNN Architecture:

Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output
         ↓
    Early layers: edges, colors
    Middle layers: shapes, textures
    Deep layers: objects, faces
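One way to get a feel for the pipeline is to trace how the tensor shape changes through each [Conv → ReLU → Pool] block. The sketch below assumes unpadded, stride-1 convolutions and non-overlapping pooling. The function names and the layer sizes are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, channels, blocks):
    """Return the (height, width, channels) shape after each block.

    blocks is a list of (out_channels, conv_kernel, pool_size) tuples.
    """
    shapes = [(height, width, channels)]
    for out_channels, conv_kernel, pool_size in blocks:
        height = conv_output_size(height, conv_kernel)
        width = conv_output_size(width, conv_kernel)
        # ReLU changes values, not shape; pooling shrinks height and width.
        height //= pool_size
        width //= pool_size
        channels = out_channels
        shapes.append((height, width, channels))
    return shapes


# A 32x32 RGB input through two blocks (3x3 conv, 2x2 pool).
shapes = trace_shapes(32, 32, 3, [(16, 3, 2), (32, 3, 2)])
print(shapes)  # [(32, 32, 3), (15, 15, 16), (6, 6, 32)]

h, w, c = shapes[-1]
flattened = h * w * c
print(flattened)  # 1152 values feed the dense layer
```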

Popular CNN Architectures

| Architecture | Year | Key Innovation |
|--------------|------|----------------|
| LeNet | 1998 | First practical CNN |
| AlexNet | 2012 | Deep CNN, ReLU, dropout |
| VGG | 2014 | Very deep (16-19 layers) |
| ResNet | 2015 | Skip connections (up to 152 layers) |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution |
| Vision Transformer | 2020 | Attention for images (a transformer, not a CNN) |

Computer Vision Tasks

1. Image Classification

Assign a single label to the entire image.

Input: [Image of a dog]
Output: "dog" (confidence: 0.95)

Classes: [cat, dog, bird, car, plane]
Probabilities: [0.02, 0.95, 0.01, 0.01, 0.01]
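Class probabilities like these are commonly produced by applying softmax to the network's raw output scores. The sketch below uses made-up scores (`logits`), so the exact probabilities differ from the example above.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "bird", "car", "plane"]
logits = [1.0, 5.0, 0.5, 0.3, 0.2]  # hypothetical raw scores from a classifier

probs = softmax(logits)
best = max(range(len(classes)), key=lambda i: probs[i])
print(classes[best])  # dog
```

The predicted label is the class with the highest probability, and that probability is reported as the confidence.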

Applications:

Medical image diagnosis
Quality control in manufacturing
Content moderation

2. Object Detection

Identify and locate multiple objects.

Input: [Street scene image]
Output:
- car: [x=100, y=200, w=150, h=80], confidence=0.92
- person: [x=300, y=150, w=50, h=120], confidence=0.88
- traffic_light: [x=450, y=50, w=30, h=60], confidence=0.95
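Detector output like this is usually post-processed before use. The sketch below filters detections by confidence and converts boxes to corner coordinates. It assumes that (x, y) is the top-left corner of each box, which the example above does not state. The helper names are hypothetical.

```python
detections = [
    {"label": "car", "box": (100, 200, 150, 80), "confidence": 0.92},
    {"label": "person", "box": (300, 150, 50, 120), "confidence": 0.88},
    {"label": "traffic_light", "box": (450, 50, 30, 60), "confidence": 0.95},
]


def to_corners(box):
    """Convert (x, y, w, h) to (x_min, y_min, x_max, y_max), assuming x, y is top-left."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def filter_by_confidence(dets, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in dets if d["confidence"] >= threshold]


kept = filter_by_confidence(detections, 0.90)
for d in kept:
    print(d["label"], to_corners(d["box"]))
# car (100, 200, 250, 280)
# traffic_light (450, 50, 480, 110)
```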

Popular Algorithms:

YOLO (You Only Look Once) - Fast, real-time
Faster R-CNN - Accurate, two-stage
SSD (Single Shot Detector) - Balance of speed/accuracy

Applications:

Autonomous vehicles
Surveillance systems
Retail analytics

3. Semantic Segmentation

Classify each pixel in the image.

Input: [Street scene]
Output: Pixel map where each pixel is labeled as one of:
- Road (blue)
- Car (red)
- Person (green)
- Building (gray)
- Sky (light blue)
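A segmentation model typically outputs a score per class for every pixel, and the label map is the highest-scoring class at each position. The sketch below shows that step on a made-up 2x2 score grid. The scores and the function name `label_map` are illustrative.

```python
CLASSES = ["road", "car", "person", "building", "sky"]


def label_map(scores):
    """scores[r][c] holds one score per class; return the best class index per pixel."""
    return [
        [max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
        for row in scores
    ]


# Hypothetical per-pixel class scores for a 2x2 image.
scores = [
    [[0.1, 0.0, 0.0, 0.1, 0.8], [0.1, 0.0, 0.0, 0.7, 0.2]],
    [[0.9, 0.05, 0.0, 0.05, 0.0], [0.2, 0.7, 0.1, 0.0, 0.0]],
]

labels = label_map(scores)
names = [[CLASSES[k] for k in row] for row in labels]
print(names)  # [['sky', 'building'], ['road', 'car']]
```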

Applications:

Autonomous driving
Medical image analysis
Satellite imagery

4. Instance Segmentation

Semantic segmentation plus separation of individual object instances.

Not just "these pixels are cars"
But "this is car 1, this is car 2, this is car 3"
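The difference in output can be illustrated with a toy example. The sketch below takes a semantic "car" mask and assigns a separate ID to each connected blob of pixels. This is a simplification for illustration only: real instance segmentation models predict a mask per object and can separate touching objects, which this flood fill cannot.

```python
def label_instances(mask):
    """Give each 4-connected blob of 1s in a binary mask its own ID (1, 2, ...)."""
    h, w = len(mask), len(mask[0])
    ids = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1 and ids[r][c] == 0:
                next_id += 1
                ids[r][c] = next_id
                stack = [(r, c)]
                while stack:  # flood fill from the first pixel of the blob
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] == 1 and ids[ny][nx] == 0):
                            ids[ny][nx] = next_id
                            stack.append((ny, nx))
    return ids, next_id


# Semantic output: 1 means "car", 0 means "not car".
car_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

ids, count = label_instances(car_mask)
print(count)  # 3
for row in ids:
    print(row)
# [1, 1, 0, 0, 2]
# [1, 1, 0, 0, 2]
# [0, 0, 0, 0, 0]
# [0, 3, 3, 0, 0]
```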

5. Face Recognition

Face Detection: Find faces in an image
Face Recognition: Identify whose face it is

Pipeline:
1. Detect faces → Bounding boxes
2. Align faces → Normalize orientation
3. Extract features → Face embedding vector
4. Compare → Match against known faces
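The final comparison step can be sketched as a nearest-match search over embedding vectors. The example assumes cosine similarity as the distance measure and a threshold of 0.8. Both are illustrative choices, and the three-dimensional embeddings are made up (real embeddings usually have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(embedding, known_faces, threshold=0.8):
    """Return the best-matching name, or None if no match clears the threshold."""
    best_name, best_score = None, -1.0
    for name, reference in known_faces.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name
    return None


# Hypothetical stored embeddings for two enrolled people.
known_faces = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.1, 0.8, 0.5],
}

print(identify([0.85, 0.15, 0.35], known_faces))  # alice
print(identify([-0.5, -0.5, 0.7], known_faces))   # None
```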

Applications:

Phone unlock
Security systems
Photo organization

6. Optical Character Recognition (OCR)

Extract text from images.

Input: [Image with text "Hello World"]
Output: "Hello World"

Steps:
1. Text detection (where is text?)
2. Text recognition (what does it say?)

Applications:

Document digitization
License plate reading
Receipt processing

Transfer Learning in CV

Use pre-trained models instead of training from scratch.

ImageNet Pre-trained Model
         │
         ▼
    ┌─────────┐
    │ Feature │ ← Freeze (keep learned features)
    │ Layers  │
    └────┬────┘
         │
         ▼
    ┌─────────┐
    │   New   │ ← Train (your specific task)
    │ Layers  │
    └────┬────┘
         │
         ▼
    Your Output

Benefits:

Much less training data needed
Faster training
Often better results
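The freeze-then-train idea from the diagram can be shown with a toy, framework-free sketch. Here a fixed linear layer with ReLU stands in for the pre-trained feature layers, and a small logistic-regression head is the only part that is trained. All weights, data, and names are made up for illustration. A real project would use a deep learning framework and an actual pre-trained network.

```python
import math

# Stand-in for pre-trained feature layers: these weights are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen layer: linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def predict(head, features):
    """New trainable head: logistic regression on the frozen features."""
    z = sum(w * f for w, f in zip(head["w"], features)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=500, lr=0.5):
    """Train only the head with stochastic gradient descent on log loss."""
    head = {"w": [0.0, 0.0], "b": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x)       # frozen weights are only read
            error = predict(head, feats) - y  # gradient of log loss w.r.t. z
            head["w"] = [w - lr * error * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * error
    return head


# Tiny made-up dataset: label 1 when the first input exceeds the second.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]

head = train_head(data)
predictions = [round(predict(head, extract_features(x))) for x, _ in data]
print(predictions)  # [1, 1, 0, 0]
```

Only three numbers (two weights and a bias) are learned, which is why this approach needs far less data than training every layer.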

Cloud Computer Vision Services

Azure:

Azure Computer Vision: OCR, image analysis
Azure Custom Vision: Train custom classifiers
Azure Face: Face detection and recognition
Azure Video Indexer: Video analysis

AWS:

Amazon Rekognition: Face, object, text detection
Amazon Textract: Document OCR
Amazon Lookout for Vision: Industrial defect detection

Google Cloud:

Cloud Vision API: Label detection, OCR
Video Intelligence API: Video analysis
AutoML Vision: Custom model training

Evaluation Metrics

For Classification:

Accuracy: Percentage of predictions that are correct
Precision/Recall/F1: Per-class performance
Top-5 Accuracy: True label appears among the model's top 5 predictions
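These metrics are simple to compute by hand. The sketch below uses a made-up set of four predictions, and the function names are illustrative rather than taken from a library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 for the class named by positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def top_k_accuracy(y_true, ranked_preds, k=5):
    """Fraction of samples whose true label is among the top k ranked guesses."""
    return sum(t in ranked[:k] for t, ranked in zip(y_true, ranked_preds)) / len(y_true)


y_true = ["dog", "cat", "dog", "bird"]
y_pred = ["dog", "dog", "dog", "bird"]
ranked = [["dog", "cat"], ["dog", "cat"], ["cat", "dog"], ["bird", "dog"]]

print(accuracy(y_true, y_pred))  # 0.75
print(tuple(round(v, 3) for v in precision_recall_f1(y_true, y_pred, "dog")))
# (0.667, 1.0, 0.8)
print(top_k_accuracy(y_true, ranked, k=1))  # 0.5
print(top_k_accuracy(y_true, ranked, k=2))  # 1.0
```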

For Object Detection:

IoU (Intersection over Union): Overlap between predicted and actual box
mAP (mean Average Precision): Average precision across classes and IoU thresholds

IoU = Area of Intersection / Area of Union

       ┌──────────────┐
       │  Predicted   │
       │    ┌─────────┼───┐
       │    │ Overlap │   │
       └────┼─────────┘   │
            │   Actual    │
            └─────────────┘

An IoU of 0.5 or higher is a common threshold for counting a detection as correct.
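The IoU formula translates directly into code. This sketch assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max). The example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


predicted = (0, 0, 10, 10)
print(round(iou(predicted, (5, 0, 15, 10)), 3))  # 0.333
print(round(iou(predicted, (2, 0, 12, 10)), 3))  # 0.667
print(iou(predicted, (20, 20, 30, 30)))          # 0.0
```

Half of each box overlaps in the first case, yet the IoU is only about 0.33, because the union counts the non-overlapping parts of both boxes.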

Exam Tips

Common exam questions test:

Choosing the right CV service for a task
Classification vs. detection vs. segmentation
When to use custom training vs. pre-built services
Understanding CNNs at a high level
Transfer learning benefits

Watch for keywords:

"Identify if image contains X" → Classification
"Find and locate objects" → Object Detection
"Label every pixel" → Segmentation
"Read text from images" → OCR
"Identify faces" → Face Recognition

Key Takeaway

Computer vision has been transformed by deep learning, especially CNNs. Different tasks (classification, detection, segmentation) require different approaches. Cloud services provide pre-built capabilities for common CV tasks, while transfer learning enables custom models with limited data. Understanding these concepts helps you design effective visual AI solutions.
