Engineering Challenges
Building Searchlock involved several engineering challenges that emerged from the nature of web scraping, limited infrastructure resources, and the need to maintain reliable data collection across multiple websites.
This section highlights some of the most important technical obstacles encountered during development and the approaches used to address them.
Website Structure Changes
One of the inherent challenges of web scraping is that target websites frequently change their HTML structure.
Small modifications such as renamed classes, new containers, or layout changes can break existing extraction rules.
Impact
- Scrapers may fail to locate product information.
- Incorrect data can be extracted.
- Crawling tasks may fail entirely.
Mitigation Strategy
To minimize the impact of structural changes:
- Each store uses a dedicated spider with isolated extraction logic.
- Selectors were designed to rely on the most stable page features available rather than on fragile layout details.
- Scrapers were built to fail gracefully rather than breaking the entire pipeline.
When a site structure changes, only the corresponding spider needs to be updated, reducing maintenance overhead.
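The fallback-and-fail-gracefully approach can be sketched as follows. This is an illustrative example, not the system's actual extraction code: the helper name, the HTML snippet, and the two price patterns (standing in for an old and a new page layout) are all hypothetical.

```python
import re

def extract_first(response_text, extractors):
    """Run candidate extractor functions in order; return the first value found."""
    for extract in extractors:
        try:
            value = extract(response_text)
        except Exception:
            continue  # a broken selector fails this candidate, not the whole crawl
        if value:
            return value
    return None  # fail gracefully: the caller can skip the item and log it

# Hypothetical patterns for an old and a new page layout of the same store.
price = extract_first(
    '<span class="price-new">19.99</span>',
    [
        lambda html: (m := re.search(r'class="price-old">([\d.]+)', html)) and m.group(1),
        lambda html: (m := re.search(r'class="price-new">([\d.]+)', html)) and m.group(1),
    ],
)
```

Because each candidate is tried independently, a renamed class only disables one pattern; extraction keeps working as long as any candidate still matches.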
Rate Limiting and Anti-Bot Measures
Many e-commerce platforms implement rate limiting or bot detection mechanisms to prevent automated scraping.
Aggressive crawling behavior may result in:
- temporary IP blocks
- throttled responses
- request failures
Mitigation Strategy
The scraping engine was configured to behave in a polite and controlled manner:
- limiting request frequency
- spacing requests between pages
- controlling concurrency levels
These measures reduce the likelihood of triggering anti-bot protections while still allowing the system to collect data reliably.
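A polite-crawling configuration along these lines might look like the sketch below. The section does not name the scraping framework; the setting keys here are Scrapy's, and the specific values are assumptions chosen for illustration.

```python
# Hedged sketch of polite-crawling settings (Scrapy-style keys, assumed values).
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests to one domain
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # no parallel hits on a single store
    "AUTOTHROTTLE_ENABLED": True,         # back off automatically when responses slow down
    "ROBOTSTXT_OBEY": True,               # respect each site's crawl rules
}
```

Spacing requests and capping per-domain concurrency trades crawl speed for reliability, which fits a system that only needs periodic price snapshots rather than real-time data.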
Product Matching Across Stores
Another major challenge was identifying the same product across different stores.
Different retailers often use:
- different product titles
- slightly modified naming conventions
- inconsistent metadata
This makes direct comparison difficult.
Mitigation Strategy
The system stores product references per store while maintaining a canonical product record.
This structure allows:
- store-specific product URLs
- multiple price observations per store
- cross-store comparison through the canonical product entity
Although this approach does not completely eliminate inconsistencies, it provides a structured foundation for product comparison.
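The canonical-product structure described above can be sketched as a small data model. The class and field names are hypothetical, not the platform's actual schema; they only illustrate how per-store references hang off one canonical record.

```python
from dataclasses import dataclass, field

@dataclass
class StoreListing:
    store: str
    url: str
    prices: list = field(default_factory=list)  # (timestamp, price) observations

@dataclass
class Product:
    canonical_name: str
    listings: dict = field(default_factory=dict)  # store name -> StoreListing

    def add_observation(self, store, url, timestamp, price):
        listing = self.listings.setdefault(store, StoreListing(store, url))
        listing.prices.append((timestamp, price))

    def latest_prices(self):
        """Cross-store comparison through the canonical product entity."""
        return {s: l.prices[-1][1] for s, l in self.listings.items() if l.prices}

# Example data (invented): the same product observed at two stores.
p = Product("USB-C Cable 1m")
p.add_observation("store_a", "https://a.example/usb-c", "2024-01-01", 9.99)
p.add_observation("store_b", "https://b.example/cable", "2024-01-01", 8.49)
```

Each store keeps its own URL and price history, while comparisons always go through the canonical entity, matching the structure described above.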
Data Quality and Validation
Scraped data can contain inconsistencies such as:
- malformed price values
- missing product fields
- partially loaded pages
- temporary HTML errors
If stored without validation, these issues could corrupt the historical dataset.
Mitigation Strategy
A data pipeline layer was introduced between the spiders and the database to:
- validate extracted fields
- normalize price formats
- discard incomplete records
- ensure consistent data structures
This additional validation step helps maintain the integrity of the stored price history.
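A minimal sketch of such a pipeline step is shown below. The required field set and the price heuristic are assumptions for illustration; in particular, the price parser handles simple formats like "$19.99" or "19,99" and is not a full locale-aware parser.

```python
import re

REQUIRED_FIELDS = ("name", "price", "url")  # assumed minimal field set

def normalize_price(raw):
    """Parse simple price strings like '$19.99' or '19,99 kr' into a float, or None."""
    m = re.search(r"\d+(?:[.,]\d{1,2})?", str(raw))
    if not m:
        return None
    return float(m.group(0).replace(",", "."))

def validate_item(item):
    """Return a cleaned, consistently structured record, or None to discard it."""
    if not all(item.get(f) for f in REQUIRED_FIELDS):
        return None  # discard incomplete records
    price = normalize_price(item["price"])
    if price is None or price <= 0:
        return None  # drop malformed price values
    return {"name": item["name"].strip(), "price": price, "url": item["url"]}
```

Records that fail validation are dropped before they reach the database, so a partially loaded page or a temporary HTML error cannot corrupt the stored price history.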
Infrastructure Constraints
The system was developed with limited infrastructure resources.
Constraints included:
- modest computing power
- self-hosted deployment
- minimal operational budget
Because of this, the system architecture had to remain lightweight and efficient.
Mitigation Strategy
To operate within these constraints:
- scraping tasks were scheduled instead of running continuously
- workloads were distributed over time
- system components were kept modular and lightweight
This allowed the platform to operate reliably without requiring large-scale infrastructure.
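Distributing workloads over time can be as simple as assigning each store's crawl a staggered daily slot. The store names, starting hour, and interval below are hypothetical; this only sketches the scheduling idea, not the platform's actual scheduler.

```python
def staggered_hours(stores, start_hour=2, interval=3):
    """Assign each store a daily crawl hour, spaced `interval` hours apart."""
    return {store: (start_hour + i * interval) % 24 for i, store in enumerate(stores)}

# Hypothetical stores: crawls land at 02:00, 05:00, and 08:00 instead of all at once.
schedule = staggered_hours(["store_a", "store_b", "store_c"])
```

Staggering keeps peak load low on modest hardware, since at most one store is being crawled at any given time.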
Storage Growth
Since the platform records every observed price, the price history dataset grows continuously over time.
While this is necessary for historical analysis, it introduces challenges such as:
- increasing database size
- longer query times
- higher storage requirements
Mitigation Strategy
To manage this growth:
- price records are stored separately from product metadata
- queries rely on indexed product and timestamp fields
- historical data can be queried efficiently without affecting core operations
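The separation of price records from product metadata, together with indexed lookups, can be illustrated with a small in-memory schema. Table and column names are assumptions, not the platform's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Product metadata lives apart from the growing price history.
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        canonical_name TEXT NOT NULL
    );
    CREATE TABLE price_record (
        product_id  INTEGER NOT NULL REFERENCES product(id),
        store       TEXT NOT NULL,
        observed_at TEXT NOT NULL,   -- ISO-8601 timestamp
        price       REAL NOT NULL
    );
    -- Composite index: history queries become range scans, not full-table reads.
    CREATE INDEX idx_price_product_time
        ON price_record (product_id, observed_at);
""")

# Invented sample data for the query below.
conn.execute("INSERT INTO product VALUES (1, 'USB-C Cable 1m')")
conn.execute("INSERT INTO price_record VALUES (1, 'store_a', '2024-01-02', 9.99)")

# Fetching one product's history touches only the indexed rows.
rows = conn.execute(
    "SELECT observed_at, store, price FROM price_record "
    "WHERE product_id = ? AND observed_at >= ? ORDER BY observed_at",
    (1, "2024-01-01"),
).fetchall()
```

Because the index leads with `product_id` and `observed_at`, history queries stay fast even as the `price_record` table grows, and the compact `product` table is unaffected by that growth.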
Lessons Learned
Several important lessons emerged from building the system:
- Web scraping systems must be designed for change, as external websites evolve frequently.
- Data validation is essential when dealing with scraped data.
- Separating scraping logic from the main application improves system reliability.
- Infrastructure constraints encourage simpler and more efficient system design.
These lessons informed the architecture and would guide future iterations of the platform.
Future Improvements
If the system were extended further, several improvements could be considered:
- improved product matching using similarity algorithms
- automated monitoring of scraping failures
- distributed scraping workers
- analytics based on historical price data
- smarter scheduling strategies for high-demand products
These enhancements would further strengthen the platform's reliability and analytical capabilities.
Summary
Developing Searchlock required addressing multiple technical challenges related to web scraping, data reliability, and infrastructure constraints.
By designing modular spiders, validating data through pipelines, and carefully managing system resources, the platform was able to maintain a consistent stream of price data while remaining flexible and extensible.
The experience gained from solving these challenges provided valuable insights into building reliable data collection systems that interact with constantly evolving external sources.