Scraping System

The scraping system is responsible for collecting product data from supported e-commerce websites. Its goal is to continuously gather accurate price information while minimizing the impact on both the target websites and the platform infrastructure.

To achieve this, the system was designed around a modular spider architecture, where each store is handled by an independent scraper.


Design Goals

The scraping system was designed with several goals in mind:

  • Modularity — each store scraper operates independently
  • Resilience — failures in one spider should not affect others
  • Extensibility — new stores can be added easily
  • Automation — scraping jobs run without manual intervention
  • Data consistency — extracted data is validated before storage

Core Components

The scraping system consists of several coordinated components, each with a specific responsibility within the scraping workflow.


Scheduler

The scheduler is responsible for triggering scraping tasks at predefined intervals.

Its responsibilities include:

  • launching scraping jobs
  • coordinating scraping frequency
  • scheduling product updates
  • preventing excessive scraping activity

This ensures the system continuously refreshes product prices while avoiding unnecessary load on external websites.
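The scheduler's rate-limiting behavior can be sketched as follows. This is an illustrative stand-in, not the actual implementation: the class and method names are hypothetical, and a real scheduler would also persist state and launch jobs rather than just gate them.

```python
import time

class ScrapeScheduler:
    """Illustrative per-store interval gate (names are hypothetical)."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_run = {}  # store identifier -> last launch time

    def should_run(self, store, now=None):
        """Return True if enough time has passed since the store's last job."""
        now = time.monotonic() if now is None else now
        last = self._last_run.get(store)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon: prevents excessive scraping activity
        self._last_run[store] = now
        return True
```

Tracking the last launch time per store keeps the frequency decision local: one store's schedule never delays another's.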


Scraping Engine

The scraping engine acts as the execution environment for spiders.

It is responsible for:

  • launching spiders
  • managing crawling tasks
  • handling request queues
  • managing concurrency

The engine ensures spiders run efficiently and can execute multiple crawling tasks concurrently when needed.
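A minimal sketch of concurrent spider execution, assuming spiders can be modeled as callables (the real engine's API is not specified in this document):

```python
from concurrent.futures import ThreadPoolExecutor

def run_spiders(spiders, max_workers=4):
    """Run spider callables concurrently and collect per-spider results.

    `spiders` maps a store name to a zero-argument callable (illustrative).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(spider): name for name, spider in spiders.items()}
        for future, name in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing spider is recorded but does not stop the others.
                results[name] = exc
    return results
```

Bounding `max_workers` is one simple way to manage concurrency without overwhelming either the platform or the target websites.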


Store Spiders

Each supported e-commerce website is handled by a dedicated spider.

Spiders are responsible for:

  • navigating product pages
  • retrieving page content
  • extracting relevant product data
  • passing extracted data to processing pipelines

A typical spider extracts fields such as:

  • product name
  • current price
  • product URL
  • store identifier
  • timestamp

By isolating scraping logic per store, the system can adapt quickly when a website changes its structure.
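The record a spider yields can be sketched as a plain dictionary with the fields listed above. The field names here are assumptions for illustration; the actual schema is described under Data Model:

```python
from datetime import datetime, timezone

def build_product_record(store_id, url, name, price):
    """Assemble the fields a typical spider extracts (illustrative names)."""
    return {
        "product_name": name,
        "current_price": price,
        "product_url": url,
        "store_id": store_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```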


Data Extraction

Once a page is retrieved, the spider performs structured data extraction.

This usually involves:

  • parsing HTML content
  • identifying relevant elements
  • extracting text or attribute values
  • normalizing extracted fields

Extraction logic is implemented using selectors or parsing rules specific to each website.
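The parse–extract–normalize steps can be illustrated with the standard library's `html.parser`; the real system's selector mechanism is not named here, and the class-matching rule below is a simplified stand-in:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect text from elements whose class matches a store-specific rule."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Identify the relevant element by its class attribute.
        if dict(attrs).get("class") == self.target_class:
            self._capture = True

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.values.append(data.strip())

def normalize_price(raw):
    """Normalize a raw price string like '$1,299.00' to a float."""
    return float(raw.replace("$", "").replace(",", ""))
```

Normalization at extraction time means downstream components can rely on numeric prices regardless of how each store formats them.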


Data Pipeline

After extraction, the data is processed through a pipeline layer before being stored.

The pipeline is responsible for:

  • validating extracted data
  • cleaning inconsistent fields
  • standardizing formats
  • removing duplicates
  • preparing records for storage

This stage ensures that only clean and structured data enters the database.
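The pipeline stages can be sketched as a single pass over extracted records. The validation rules below (required fields, positive price, two-decimal rounding, de-duplication by store and URL) are illustrative assumptions, not the platform's actual rules:

```python
def run_pipeline(records):
    """Validate, clean, standardize, and de-duplicate extracted records."""
    seen = set()
    clean = []
    for rec in records:
        # Validation: required fields must be present.
        if not rec.get("product_url") or not rec.get("store_id"):
            continue
        price = rec.get("current_price")
        if not isinstance(price, (int, float)) or price <= 0:
            continue
        # Standardization: prices rounded to two decimals.
        rec = {**rec, "current_price": round(float(price), 2)}
        # De-duplication by (store, url).
        key = (rec["store_id"], rec["product_url"])
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean
```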


Data Storage

Once processed, the data is stored in the central database.

Each price observation typically includes:

  • product identifier
  • store identifier
  • observed price
  • timestamp
  • product URL

This structure allows the system to build a historical price record for each product.
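A minimal sketch of this storage pattern using SQLite; the table and column names are hypothetical, and the actual schema is covered under Data Model:

```python
import sqlite3

# Illustrative schema: one row per price observation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS price_observations (
    product_id  TEXT NOT NULL,
    store_id    TEXT NOT NULL,
    price       REAL NOT NULL,
    observed_at TEXT NOT NULL,
    product_url TEXT NOT NULL
)
"""

def store_observation(conn, obs):
    conn.execute(
        "INSERT INTO price_observations VALUES (?, ?, ?, ?, ?)",
        (obs["product_id"], obs["store_id"], obs["price"],
         obs["observed_at"], obs["product_url"]),
    )

def price_history(conn, product_id):
    """Return (observed_at, price) rows for one product, oldest first."""
    return conn.execute(
        "SELECT observed_at, price FROM price_observations "
        "WHERE product_id = ? ORDER BY observed_at",
        (product_id,),
    ).fetchall()
```

Because every observation keeps its own timestamp rather than overwriting a single price field, the history query above falls out of the schema directly.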


Handling Website Changes

One of the main challenges of web scraping is that websites frequently change their structure.

To address this, the scraping system was designed with:

  • store-specific spiders
  • isolated extraction logic
  • easy selector updates

When a website modifies its layout, only the corresponding spider needs to be updated.
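One common way to make selector updates easy is to keep each store's selectors in a single table, so a layout change touches exactly one entry. The store names and selector strings below are hypothetical:

```python
# Illustrative per-store selector table (names and selectors are made up).
STORE_SELECTORS = {
    "store_a": {"name": "h1.product-title", "price": "span.price-now"},
    "store_b": {"name": "div.item-name", "price": "span.current-price"},
}

def selectors_for(store_id):
    """Look up the extraction rules for one store."""
    return STORE_SELECTORS[store_id]
```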


Fault Isolation

Failures during scraping are inevitable due to network issues, site changes, or rate limits.

The system mitigates these risks by:

  • isolating scraping tasks
  • allowing spiders to fail independently
  • retrying failed requests when appropriate

This prevents scraping failures from affecting the rest of the platform.
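The retry behavior can be sketched with exponential backoff. This is a generic pattern under assumed parameters, not the system's actual retry policy:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Retry a flaky fetch callable with exponential backoff (illustrative)."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up: this spider fails, but only this spider
            time.sleep(base_delay * (2 ** attempt))
```

Letting the final attempt re-raise keeps failures visible to the engine while the backoff absorbs transient network issues and rate limits.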


Extending the System

Adding support for a new store typically requires:

  1. Creating a new spider
  2. Defining extraction rules
  3. Integrating the spider with the scheduler
  4. Validating the output through the pipeline

Because the scraping layer is modular, new sources can be integrated without modifying the core system.
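One way such modularity is commonly achieved is a spider registry, where adding a store means registering one new class and the core system iterates over the registry unchanged. The registry and spider names below are hypothetical:

```python
# Illustrative registry: the core system only ever reads this mapping.
SPIDER_REGISTRY = {}

def register_spider(store_id):
    """Class decorator that adds a spider to the registry (illustrative)."""
    def decorator(cls):
        SPIDER_REGISTRY[store_id] = cls
        return cls
    return decorator

@register_spider("new_store")
class NewStoreSpider:
    # Step 2 from the list above: the new store's extraction rules.
    selectors = {"name": "h1.title", "price": "span.price"}
```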


Summary

The scraping system provides the data acquisition backbone of the platform.

By separating scraping tasks into modular spiders coordinated by a scheduler and processed through data pipelines, the system can reliably collect and store product price information from multiple e-commerce websites.

This architecture allows the platform to maintain up-to-date product data while remaining flexible and extensible as new sources are added.


Next Sections

The next pages describe other core parts of the system:

  • Backend API — application logic and data access
  • Data Model — structure of stored product and price data
  • Notification System — alerting users of price changes
  • Engineering Challenges — lessons learned while building the system