Scraping System

The scraping system is responsible for collecting product data from supported e-commerce websites. Its goal is to continuously gather accurate price information while minimizing the impact on both the target websites and the platform infrastructure.

To achieve this, the system was designed around a modular spider architecture, where each store is handled by an independent scraper.


Design Goals

The scraping system was designed with several goals in mind:

  • Modularity — each store scraper operates independently
  • Resilience — failures in one spider should not affect others
  • Extensibility — new stores can be added easily
  • Automation — scraping jobs run without manual intervention
  • Data consistency — extracted data is validated before storage

Core Components

The scraping system consists of several coordinated components, each with a specific responsibility within the scraping workflow.


Scheduler

The scheduler is responsible for triggering scraping tasks at predefined intervals.

Its responsibilities include:

  • launching scraping jobs
  • coordinating scraping frequency
  • scheduling product updates
  • preventing excessive scraping activity

This ensures the system continuously refreshes product prices while avoiding unnecessary load on external websites.
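The scheduler's rate-limiting behavior can be sketched as follows. This is an illustrative stand-in, not the actual implementation: the class and method names are hypothetical, and a real scheduler would also persist state and launch jobs rather than just gate them.

```python
import time

class ScrapeScheduler:
    """Illustrative per-store interval gate (names are hypothetical)."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_run = {}  # store identifier -> last launch time

    def should_run(self, store, now=None):
        """Return True if enough time has passed since the store's last job."""
        now = time.monotonic() if now is None else now
        last = self._last_run.get(store)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon: prevents excessive scraping activity
        self._last_run[store] = now
        return True
```

Tracking the last launch time per store keeps the frequency decision local: one store's schedule never delays another's.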


Scraping Engine

The scraping engine acts as the execution environment for spiders.

It is responsible for:

  • launching spiders
  • managing crawling tasks
  • handling request queues
  • managing concurrency

The engine ensures spiders run efficiently and can execute multiple crawling tasks concurrently when needed.
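A minimal sketch of concurrent spider execution, assuming spiders can be modeled as callables (the real engine's API is not specified in this document):

```python
from concurrent.futures import ThreadPoolExecutor

def run_spiders(spiders, max_workers=4):
    """Run spider callables concurrently and collect per-spider results.

    `spiders` maps a store name to a zero-argument callable (illustrative).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(spider): name for name, spider in spiders.items()}
        for future, name in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing spider is recorded but does not stop the others.
                results[name] = exc
    return results
```

Bounding `max_workers` is one simple way to manage concurrency without overwhelming either the platform or the target websites.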


Store Spiders

Each supported e-commerce website is handled by a dedicated spider.

Spiders are responsible for:

  • navigating product pages
  • retrieving page content
  • extracting relevant product data
  • passing extracted data to processing pipelines

A typical spider extracts fields such as:

  • product name
  • current price
  • product URL
  • store identifier
  • timestamp

By isolating scraping logic per store, the system can adapt quickly when a website changes its structure.
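The record a spider yields can be sketched as a plain dictionary with the fields listed above. The field names here are assumptions for illustration; the actual schema is described under Data Model:

```python
from datetime import datetime, timezone

def build_product_record(store_id, url, name, price):
    """Assemble the fields a typical spider extracts (illustrative names)."""
    return {
        "product_name": name,
        "current_price": price,
        "product_url": url,
        "store_id": store_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```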


Data Extraction

Once a page is retrieved, the spider performs structured data extraction.

This usually involves:

  • parsing HTML content
  • identifying relevant elements
  • extracting text or attribute values
  • normalizing extracted fields

Extraction logic is implemented using selectors or parsing rules specific to each website.
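The parse–extract–normalize steps can be illustrated with the standard library's `html.parser`; the real system's selector mechanism is not named here, and the class-matching rule below is a simplified stand-in:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect text from elements whose class matches a store-specific rule."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Identify the relevant element by its class attribute.
        if dict(attrs).get("class") == self.target_class:
            self._capture = True

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.values.append(data.strip())

def normalize_price(raw):
    """Normalize a raw price string like '$1,299.00' to a float."""
    return float(raw.replace("$", "").replace(",", ""))
```

Normalization at extraction time means downstream components can rely on numeric prices regardless of how each store formats them.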


Data Pipeline

After extraction, the data is processed through a pipeline layer before being stored.

The pipeline is responsible for:

  • validating extracted data
  • cleaning inconsistent fields
  • standardizing formats
  • removing duplicates
  • preparing records for storage

This stage ensures that only clean and structured data enters the database.
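The pipeline stages can be sketched as a single pass over extracted records. The validation rules below (required fields, positive price, two-decimal rounding, de-duplication by store and URL) are illustrative assumptions, not the platform's actual rules:

```python
def run_pipeline(records):
    """Validate, clean, standardize, and de-duplicate extracted records."""
    seen = set()
    clean = []
    for rec in records:
        # Validation: required fields must be present.
        if not rec.get("product_url") or not rec.get("store_id"):
            continue
        price = rec.get("current_price")
        if not isinstance(price, (int, float)) or price <= 0:
            continue
        # Standardization: prices rounded to two decimals.
        rec = {**rec, "current_price": round(float(price), 2)}
        # De-duplication by (store, url).
        key = (rec["store_id"], rec["product_url"])
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean
```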


Data Storage

Once processed, the data is stored in the central database.

Each price observation typically includes:

  • product identifier
  • store identifier
  • observed price
  • timestamp
  • product URL

This structure allows the system to build a historical price record for each product.
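A minimal sketch of this storage pattern using SQLite; the table and column names are hypothetical, and the actual schema is covered under Data Model:

```python
import sqlite3

# Illustrative schema: one row per price observation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS price_observations (
    product_id  TEXT NOT NULL,
    store_id    TEXT NOT NULL,
    price       REAL NOT NULL,
    observed_at TEXT NOT NULL,
    product_url TEXT NOT NULL
)
"""

def store_observation(conn, obs):
    conn.execute(
        "INSERT INTO price_observations VALUES (?, ?, ?, ?, ?)",
        (obs["product_id"], obs["store_id"], obs["price"],
         obs["observed_at"], obs["product_url"]),
    )

def price_history(conn, product_id):
    """Return (observed_at, price) rows for one product, oldest first."""
    return conn.execute(
        "SELECT observed_at, price FROM price_observations "
        "WHERE product_id = ? ORDER BY observed_at",
        (product_id,),
    ).fetchall()
```

Because every observation keeps its own timestamp rather than overwriting a single price field, the history query above falls out of the schema directly.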


Handling Website Changes

One of the main challenges of web scraping is that websites frequently change their structure.

To address this, the scraping system was designed with:

  • store-specific spiders
  • isolated extraction logic
  • easy selector updates

When a website modifies its layout, only the corresponding spider needs to be updated.
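One common way to make selector updates easy is to keep each store's selectors in a single table, so a layout change touches exactly one entry. The store names and selector strings below are hypothetical:

```python
# Illustrative per-store selector table (names and selectors are made up).
STORE_SELECTORS = {
    "store_a": {"name": "h1.product-title", "price": "span.price-now"},
    "store_b": {"name": "div.item-name", "price": "span.current-price"},
}

def selectors_for(store_id):
    """Look up the extraction rules for one store."""
    return STORE_SELECTORS[store_id]
```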


Fault Isolation

Failures during scraping are inevitable due to network issues, site changes, or rate limits.

The system mitigates these risks by:

  • isolating scraping tasks
  • allowing spiders to fail independently
  • retrying failed requests when appropriate

This prevents scraping failures from affecting the rest of the platform.
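The retry behavior can be sketched with exponential backoff. This is a generic pattern under assumed parameters, not the system's actual retry policy:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Retry a flaky fetch callable with exponential backoff (illustrative)."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up: this spider fails, but only this spider
            time.sleep(base_delay * (2 ** attempt))
```

Letting the final attempt re-raise keeps failures visible to the engine while the backoff absorbs transient network issues and rate limits.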


Extending the System

Adding support for a new store typically requires:

  1. Creating a new spider
  2. Defining extraction rules
  3. Integrating the spider with the scheduler
  4. Validating the output through the pipeline

Because the scraping layer is modular, new sources can be integrated without modifying the core system.
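One way such modularity is commonly achieved is a spider registry, where adding a store means registering one new class and the core system iterates over the registry unchanged. The registry and spider names below are hypothetical:

```python
# Illustrative registry: the core system only ever reads this mapping.
SPIDER_REGISTRY = {}

def register_spider(store_id):
    """Class decorator that adds a spider to the registry (illustrative)."""
    def decorator(cls):
        SPIDER_REGISTRY[store_id] = cls
        return cls
    return decorator

@register_spider("new_store")
class NewStoreSpider:
    # Step 2 from the list above: the new store's extraction rules.
    selectors = {"name": "h1.title", "price": "span.price"}
```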


Summary

The scraping system provides the data acquisition backbone of the platform.

By separating scraping tasks into modular spiders coordinated by a scheduler and processed through data pipelines, the system can reliably collect and store product price information from multiple e-commerce websites.

This architecture allows the platform to maintain up-to-date product data while remaining flexible and extensible as new sources are added.


Next Sections

The next pages describe other core parts of the system:

  • Backend API — application logic and data access
  • Data Model — structure of stored product and price data
  • Notification System — alerting users of price changes
  • Engineering Challenges — lessons learned while building the system