Introduction
Computer Vision (CV) enables machines to interpret and understand visual information from the world. From facial recognition to autonomous vehicles, computer vision powers many modern AI applications.
How Computers "See" Images
Digital Image Representation
Images are stored as grids of pixels with numerical values.
Grayscale Image:
Each pixel = single value (0-255)
0 = black, 255 = white
Example 3x3 image:
┌─────┬─────┬─────┐
│   0 │ 128 │ 255 │
├─────┼─────┼─────┤
│  64 │ 192 │  32 │
├─────┼─────┼─────┤
│ 255 │  96 │   0 │
└─────┴─────┴─────┘
Color Image (RGB):
Each pixel = 3 values (Red, Green, Blue)
Each channel: 0-255
Example pixel:
Red=255, Green=0, Blue=0 → Pure red
Red=255, Green=255, Blue=0 → Yellow
Image Dimensions:
1920 x 1080 RGB image:
- Width: 1920 pixels
- Height: 1080 pixels
- Channels: 3 (RGB)
- Total values: 1920 × 1080 × 3 = 6,220,800
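As a minimal sketch, the same layouts can be written as NumPy arrays. The (height, width, channels) ordering and the `uint8` dtype are common conventions assumed here, not something the notes above prescribe.

```python
import numpy as np

# The 3x3 grayscale example: one 8-bit value per pixel.
gray = np.array(
    [[0, 128, 255],
     [64, 192, 32],
     [255, 96, 0]],
    dtype=np.uint8,
)

# A 1920 x 1080 RGB image stored as (height, width, channels).
rgb = np.zeros((1080, 1920, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 0)    # pure red pixel
rgb[0, 1] = (255, 255, 0)  # yellow pixel

print(gray.shape)  # (3, 3)
print(rgb.shape)   # (1080, 1920, 3)
print(rgb.size)    # 6220800 values in total
```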
Traditional Computer Vision
Before deep learning, CV relied on hand-crafted features.
Edge Detection
Detect boundaries in an image by sliding small filters (kernels) over it; each filter responds strongly where pixel intensity changes sharply.
Sobel Filter:
Horizontal edges:      Vertical edges:
┌────┬────┬────┐       ┌────┬────┬────┐
│ -1 │ -2 │ -1 │       │ -1 │  0 │  1 │
├────┼────┼────┤       ├────┼────┼────┤
│  0 │  0 │  0 │       │ -2 │  0 │  2 │
├────┼────┼────┤       ├────┼────┼────┤
│  1 │  2 │  1 │       │ -1 │  0 │  1 │
└────┴────┴────┘       └────┴────┴────┘
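A rough sketch of how the two Sobel kernels respond to an edge. The 5x5 test image and the helper name `filter2d` are made up for illustration. The loop slides the kernel with stride 1 and no padding, and skips the final gradient-magnitude step that a full edge detector would normally add.

```python
import numpy as np

SOBEL_HORIZONTAL = np.array([[-1, -2, -1],
                             [0, 0, 0],
                             [1, 2, 1]])
SOBEL_VERTICAL = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])

def filter2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical test image: dark left side, bright right side,
# so it contains one vertical edge. int64 avoids uint8 overflow.
image = np.array([[0, 0, 255, 255, 255]] * 5, dtype=np.int64)

vertical_response = filter2d(image, SOBEL_VERTICAL)
horizontal_response = filter2d(image, SOBEL_HORIZONTAL)

print(vertical_response[0])    # [1020 1020    0]
print(horizontal_response[0])  # [0 0 0]
```

The vertical-edge kernel fires where its window straddles the dark-to-bright boundary, while the horizontal-edge kernel stays at zero because every row of the image is identical.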
Feature Descriptors
SIFT, SURF, ORB:
- Detect keypoints in images
- Create descriptors for matching
- Used for image matching, panoramas
Limitations of Traditional CV:
- Required manual feature engineering
- Sensitive to lighting, angle, scale
- Couldn't learn complex patterns
Deep Learning for Computer Vision
Convolutional Neural Networks (CNNs)
CNNs learn visual features directly from training data instead of relying on hand-crafted ones.
Convolution Operation:
Input Image           Filter        Feature Map
┌───┬───┬───┬───┐     ┌───┬───┐     ┌───┬───┬───┐
│ 1 │ 2 │ 3 │ 0 │     │ 1 │ 0 │     │ 2 │ 4 │ 4 │
├───┼───┼───┼───┤  ×  ├───┼───┤  →  ├───┼───┼───┤
│ 0 │ 1 │ 2 │ 1 │     │ 0 │ 1 │     │ 0 │ 2 │ 4 │
├───┼───┼───┼───┤     └───┴───┘     ├───┼───┼───┤
│ 1 │ 0 │ 1 │ 2 │                   │ 2 │ 0 │ 2 │
├───┼───┼───┼───┤                   └───┴───┴───┘
│ 2 │ 1 │ 0 │ 1 │
└───┴───┴───┴───┘
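The sliding-window arithmetic can be checked with a short sketch. It uses the 4x4 input and 2x2 filter from the diagram, and assumes stride 1 and no padding. The function name `conv2d_valid` is made up for illustration. Like most deep learning libraries, it computes cross-correlation (no kernel flip), which makes no difference for this symmetric filter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Multiply each window by the kernel element-wise and sum the result."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [0, 1, 2, 1],
                  [1, 0, 1, 2],
                  [2, 1, 0, 1]])
kernel = np.array([[1, 0],
                   [0, 1]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
# [[2 4 4]
#  [0 2 4]
#  [2 0 2]]
```

For example, the top-left output is 1×1 + 2×0 + 0×0 + 1×1 = 2.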
CNN Architecture:
Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output
↓
Early layers: edges, colors
Middle layers: shapes, textures
Deep layers: objects, faces
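To make the pipeline concrete, here is a toy forward pass through one Conv → ReLU → Pool block followed by Flatten → Dense. Everything in it is an assumption for illustration: the 8x8 input, the single 3x3 filter, the 5 output classes, and the random (untrained) weights. A real CNN stacks many filters and blocks and learns its weights by training.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid cross-correlation with stride 1."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Zero out negative activations."""
    return np.maximum(x, 0)

def max_pool2x2(x):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image = rng.random((8, 8))               # hypothetical grayscale input
kernel = rng.standard_normal((3, 3))     # one untrained filter

features = max_pool2x2(relu(conv2d(image, kernel)))  # 8x8 -> 6x6 -> 3x3
flat = features.reshape(-1)                          # Flatten: 9 values

dense_w = rng.standard_normal((5, flat.size))        # Dense layer, 5 classes
dense_b = np.zeros(5)
probs = softmax(dense_w @ flat + dense_b)            # Output probabilities

print(features.shape, flat.shape, probs.shape)  # (3, 3) (9,) (5,)
```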
Popular CNN Architectures
| Architecture | Year | Key Innovation |
|-------------|------|----------------|
| LeNet | 1998 | First practical CNN |
| AlexNet | 2012 | Deep CNN, ReLU, dropout |
| VGG | 2014 | Very deep (16-19 layers) |
| ResNet | 2015 | Skip connections (152+ layers) |
| EfficientNet | 2019 | Optimal scaling |
| Vision Transformer | 2020 | Attention for images |
Computer Vision Tasks
1. Image Classification
Assign a single label to the entire image.
Input: [Image of a dog]
Output: "dog" (confidence: 0.95)
Classes: [cat, dog, bird, car, plane]
Probabilities: [0.02, 0.95, 0.01, 0.01, 0.01]
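A small sketch of how the label and confidence are read off the probability vector. The `softmax` step shows one common way a network's raw scores become probabilities. The logits below are invented for illustration, and `predict` is a hypothetical helper name.

```python
import numpy as np

CLASSES = ["cat", "dog", "bird", "car", "plane"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def predict(probabilities):
    """Return the most likely class and its probability."""
    best = int(np.argmax(probabilities))
    return CLASSES[best], float(probabilities[best])

# Probabilities from the example above.
label, confidence = predict(np.array([0.02, 0.95, 0.01, 0.01, 0.01]))
print(label, confidence)  # dog 0.95

# Hypothetical raw network scores (logits), made up for illustration.
logits = np.array([1.0, 5.0, 0.5, 0.2, 0.1])
probs = softmax(logits)
print(predict(probs)[0])  # dog
```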
Applications:
- Medical image diagnosis
- Quality control in manufacturing
- Content moderation
2. Object Detection
Identify and locate multiple objects.
Input: [Street scene image]
Output:
- car: [x=100, y=200, w=150, h=80], confidence=0.92
- person: [x=300, y=150, w=50, h=120], confidence=0.88
- traffic_light: [x=450, y=50, w=30, h=60], confidence=0.95
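One possible way to hold this output in code is sketched below. It assumes that (x, y) is the top-left corner of each box, which the example does not state, and the `Detection` class and 0.90 threshold are illustrative choices.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    x: int
    y: int
    w: int
    h: int
    confidence: float

    def corners(self):
        """Return (x_min, y_min, x_max, y_max), assuming (x, y) is the top-left corner."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)

detections = [
    Detection("car", 100, 200, 150, 80, 0.92),
    Detection("person", 300, 150, 50, 120, 0.88),
    Detection("traffic_light", 450, 50, 30, 60, 0.95),
]

def filter_by_confidence(dets, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in dets if d.confidence >= threshold]

confident = filter_by_confidence(detections, threshold=0.90)
print([d.label for d in confident])  # ['car', 'traffic_light']
print(detections[0].corners())       # (100, 200, 250, 280)
```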
Popular Algorithms:
- YOLO (You Only Look Once) - Fast, real-time
- Faster R-CNN - Accurate, two-stage
- SSD (Single Shot Detector) - Balance of speed/accuracy
Applications:
- Autonomous vehicles
- Surveillance systems
- Retail analytics
3. Semantic Segmentation
Classify each pixel in the image.
Input: [Street scene]
Output: Pixel map where each pixel labeled as:
- Road (blue)
- Car (red)
- Person (green)
- Building (gray)
- Sky (light blue)
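As a sketch of the final step, a segmentation model typically produces one score per class for every pixel, and the label map is the per-pixel argmax. The scores here are random stand-ins for real model output, and the exact RGB values in `PALETTE` are assumptions chosen to match the color names above.

```python
import numpy as np

CLASSES = ["road", "car", "person", "building", "sky"]
PALETTE = np.array([
    [0, 0, 255],      # road: blue
    [255, 0, 0],      # car: red
    [0, 255, 0],      # person: green
    [128, 128, 128],  # building: gray
    [135, 206, 235],  # sky: light blue
], dtype=np.uint8)

rng = np.random.default_rng(0)
# Hypothetical per-pixel class scores, shape (num_classes, height, width).
scores = rng.random((len(CLASSES), 4, 6))

label_map = scores.argmax(axis=0)  # (4, 6): one class index per pixel
color_mask = PALETTE[label_map]    # (4, 6, 3): one RGB color per pixel

print(label_map.shape, color_mask.shape)  # (4, 6) (4, 6, 3)
```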
Applications:
- Autonomous driving
- Medical image analysis
- Satellite imagery
4. Instance Segmentation
Semantic segmentation that also distinguishes individual objects of the same class.
Not just "these pixels are cars"
But "this is car 1, this is car 2, this is car 3"
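The difference can be illustrated with a toy example: take a semantic "car" mask and give each separate blob its own id. This is only an illustration of the output format, not how instance segmentation models work internally. They predict per-object masks directly, and this connected-components trick would merge two cars that touch. The mask and the function name `label_instances` are made up.

```python
from collections import deque

def label_instances(mask):
    """Give each 4-connected blob of 1s its own instance id (1, 2, 3, ...)."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_id = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] == 1 and labels[r][c] == 0:
                next_id += 1
                labels[r][c] = next_id
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of this blob
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                        inside = 0 <= nr < height and 0 <= nc < width
                        if inside and mask[nr][nc] == 1 and labels[nr][nc] == 0:
                            labels[nr][nc] = next_id
                            queue.append((nr, nc))
    return labels, next_id

# Semantic output: 1 = "car" pixel, 0 = anything else.
car_mask = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
]

labels, count = label_instances(car_mask)
print(count)  # 3
for row in labels:
    print(row)
# [1, 1, 0, 0, 0, 2]
# [1, 1, 0, 3, 0, 2]
# [0, 0, 0, 3, 0, 0]
```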
5. Face Recognition
Face Detection: Find faces in image
Face Recognition: Identify whose face it is
Pipeline:
1. Detect faces → Bounding boxes
2. Align faces → Normalize orientation
3. Extract features → Face embedding vector
4. Compare → Match against known faces
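Step 4 can be sketched as a nearest-match search over embedding vectors. Cosine similarity and the 0.8 threshold are assumptions for illustration, since the pipeline above does not name a distance measure. The 4-dimensional embeddings and the names are invented, and real embeddings usually have a hundred or more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity in [-1, 1]; 1 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, known_faces, threshold=0.8):
    """Return (name, score) of the best match, or (None, score) if below threshold."""
    best_name, best_score = None, -1.0
    for name, known in known_faces.items():
        score = cosine_similarity(embedding, known)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Hypothetical enrolled embeddings.
known_faces = {
    "alice": np.array([0.9, 0.1, 0.3, 0.2]),
    "bob": np.array([0.1, 0.8, 0.2, 0.5]),
}

query = np.array([0.85, 0.15, 0.35, 0.2])    # close to alice's embedding
stranger = np.array([-0.5, 0.1, -0.7, 0.4])  # close to nobody

print(identify(query, known_faces)[0])     # alice
print(identify(stranger, known_faces)[0])  # None
```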
Applications:
- Phone unlock
- Security systems
- Photo organization
6. Optical Character Recognition (OCR)
Extract text from images.
Input: [Image with text "Hello World"]
Output: "Hello World"
Steps:
1. Text detection (where is text?)
2. Text recognition (what does it say?)
Applications:
- Document digitization
- License plate reading
- Receipt processing
Transfer Learning in CV
Use pre-trained models instead of training from scratch.
ImageNet Pre-trained Model
        │
        ▼
   ┌─────────┐
   │ Feature │ ← Freeze (keep learned features)
   │ Layers  │
   └────┬────┘
        │
        ▼
   ┌─────────┐
   │   New   │ ← Train (your specific task)
   │ Layers  │
   └────┬────┘
        │
        ▼
   Your Output
Benefits:
- Much less training data needed
- Faster training
- Often better results than training from scratch on limited data
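The freeze/train split can be sketched in plain NumPy. This is a toy under stated assumptions: a fixed random matrix stands in for the pre-trained feature layers (a real workflow would load ImageNet weights through a deep learning framework), the 40-sample dataset is synthetic, and the new layer is a single logistic-regression head trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained feature layers: these weights are never updated.
backbone_w = rng.standard_normal((16, 8)) * 0.25
backbone_before = backbone_w.copy()

def extract_features(x):
    """Frozen layers: a fixed linear map followed by ReLU."""
    return np.maximum(x @ backbone_w, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small dataset for the new task (binary labels).
x = rng.standard_normal((40, 16))
y = (x[:, 0] > 0).astype(float)

features = extract_features(x)  # computed once, since the backbone is frozen

# New layers: the only parameters that get trained.
head_w = np.zeros(8)
head_b = 0.0

def loss_fn(w, b):
    """Binary cross-entropy of the head on the frozen features."""
    p = sigmoid(features @ w + b)
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

initial_loss = loss_fn(head_w, head_b)
for _ in range(200):
    p = sigmoid(features @ head_w + head_b)
    head_w -= 0.1 * (features.T @ (p - y) / len(y))  # update the head only
    head_b -= 0.1 * float(np.mean(p - y))
final_loss = loss_fn(head_w, head_b)

print(final_loss < initial_loss)                    # True
print(np.array_equal(backbone_w, backbone_before))  # True
```

Because only the small head is updated, there are few parameters to fit, which is where the reduced data and training-time requirements come from.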
Cloud Computer Vision Services
Azure:
- Azure Computer Vision: OCR, image analysis
- Azure Custom Vision: Train custom classifiers
- Azure Face: Face detection and recognition
- Azure Video Indexer: Video analysis
AWS:
- Amazon Rekognition: Face, object, text detection
- Amazon Textract: Document OCR
- Amazon Lookout for Vision: Industrial defect detection
Google Cloud:
- Cloud Vision API: Label detection, OCR
- Video Intelligence API: Video analysis
- AutoML Vision: Custom model training
Evaluation Metrics
For Classification:
- Accuracy: Percentage of correct predictions
- Precision/Recall/F1: Per-class performance
- Top-5 Accuracy: Correct label in top 5 predictions
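A minimal sketch of these metrics on made-up labels. The arrays are invented, the helper names are hypothetical, and top-2 is used in place of top-5 only because the toy problem has three classes.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1, treating `positive` as the class of interest."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

def top_k_accuracy(y_true, scores, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))

y_true = np.array([0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 2, 2, 2, 0])

accuracy = float(np.mean(y_true == y_pred))  # 4 of 6 correct
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=2)

# Class scores for three samples over three classes.
scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 1, 2])
top1 = top_k_accuracy(labels, scores, k=1)  # 1 of 3 correct
top2 = top_k_accuracy(labels, scores, k=2)  # 2 of 3 correct
```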
For Object Detection:
- IoU (Intersection over Union): Overlap between predicted and actual box
- mAP (mean Average Precision): Average precision across classes and IoU thresholds
IoU = Area of Intersection / Area of Union
┌──────────────┐
│  Predicted   │
│    ┌─────────┼───┐
│    │ Overlap │   │
└────┼─────────┘   │
     │   Actual    │
     └─────────────┘
IoU > 0.5 typically = "correct detection"
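The IoU formula can be sketched for axis-aligned boxes as follows. The (x_min, y_min, x_max, y_max) box format and the example coordinates are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 so disjoint boxes get no overlap.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
actual = (5, 0, 15, 10)

# Intersection = 5 * 10 = 50, union = 100 + 100 - 50 = 150, IoU = 1/3.
score = iou(predicted, actual)
print(score > 0.5)  # False: below the usual 0.5 threshold
```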
Exam Tips
Common exam questions test:
- Choosing the right CV service for a task
- Classification vs detection vs segmentation
- When to use custom training vs pre-built
- Understanding CNNs at high level
- Transfer learning benefits
Watch for keywords:
- "Identify if image contains X" → Classification
- "Find and locate objects" → Object Detection
- "Label every pixel" → Segmentation
- "Read text from images" → OCR
- "Identify faces" → Face Recognition
Key Takeaway
Computer vision has been transformed by deep learning, especially CNNs. Different tasks (classification, detection, segmentation) require different approaches. Cloud services provide pre-built capabilities for common CV tasks, while transfer learning enables custom models with limited data. Understanding these concepts helps you design effective visual AI solutions.
