The Problem
A real estate agency’s acquisition process depended on monitoring dozens of property listing portals for new opportunities. The work was entirely manual: staff visited portals, read listings, copied data into internal systems, and categorized properties against acquisition criteria by hand. The process consumed dozens of staff hours each week and introduced recurring transcription errors that corrupted internal databases.
At scale, fragmentation across portals made this worse. Property data appears in inconsistent formats: different field names for the same attributes, missing values, inconsistent address formats, and listing descriptions that bury key details in unstructured text. Standardizing the data manually before analysis added a second round of work on top of extraction.
The Constraints
Speed and freshness mattered. In competitive real estate markets, properties matching acquisition criteria can be listed and under offer within hours. A batch scraping system that ran daily was not fast enough — the system needed to surface relevant listings in near real-time.
Portal-specific parsing. Each listing portal has its own HTML structure, pagination patterns, and anti-scraping behavior. A generic scraper would break frequently. The system needed robust, portal-specific parsing logic with resilience to structural changes.
Integration into existing workflows. The agency’s CRM and analytics tools expected data in specific schemas. The scraper’s output had to map directly to those schemas, without a manual transformation step that would reintroduce the labor the system was meant to eliminate.
Our Approach
We built a scraping engine with portal-specific adapters for each target site. Each adapter encodes the page structure, pagination logic, and field mapping for its portal — handling portal-specific quirks without making the core system brittle to individual site changes.
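The adapter idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the class names (`PortalAdapter`, `ExamplePortalAdapter`), the portal URL, and the HTML card markup are all hypothetical, and a single regular expression stands in for whatever HTML parsing a real adapter would use. The page fetcher is passed in as a function so the core loop stays independent of any one site.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class RawListing:
    portal: str
    url: str
    fields: dict


class PortalAdapter(ABC):
    """Everything the core engine needs to know about one portal."""

    name: str

    @abstractmethod
    def page_url(self, page: int) -> str:
        """Pagination logic: URL of the given results page."""

    @abstractmethod
    def parse_page(self, html: str) -> List[RawListing]:
        """Page structure and field mapping for this portal."""


class ExamplePortalAdapter(PortalAdapter):
    # Hypothetical portal and markup, for illustration only.
    name = "example-portal"
    BASE = "https://portal.example/listings"
    _CARD = re.compile(
        r'<a class="card" href="(?P<url>[^"]+)" '
        r'data-price="(?P<price>\d+)">(?P<title>[^<]+)</a>'
    )

    def page_url(self, page: int) -> str:
        return f"{self.BASE}?page={page}"

    def parse_page(self, html: str) -> List[RawListing]:
        return [
            RawListing(
                portal=self.name,
                url=m["url"],
                fields={"price": int(m["price"]), "title": m["title"].strip()},
            )
            for m in self._CARD.finditer(html)
        ]


def crawl(
    adapter: PortalAdapter,
    fetch: Callable[[str], str],
    max_pages: int = 50,
) -> Iterator[RawListing]:
    """Walk result pages until one comes back with no listings."""
    for page in range(1, max_pages + 1):
        listings = adapter.parse_page(fetch(adapter.page_url(page)))
        if not listings:
            break
        yield from listings
```

With this shape, a structural change on one portal means editing one adapter class, while `crawl` and everything downstream of it stay untouched.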
The parsing layer applies NLP to unstructured listing descriptions, extracting implicit attributes (e.g., “south-facing garden” → garden: yes, orientation: south) that structured fields don’t capture. A normalization pipeline converts extracted data to the agency’s internal schema — standardizing address formats, unit types, pricing structures, and property classifications across sources.
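To make the extraction and normalization steps concrete, here is a minimal sketch. It swaps the NLP component for a single regular-expression rule, which is a deliberate simplification, and the alias table, field names, and target schema are assumptions chosen for illustration rather than the agency's actual schema. Prices are assumed to be whole currency units.

```python
import re
from typing import Optional

# Rule-based stand-in for the NLP step: "<direction>-facing <feature>".
_ORIENTATION = re.compile(
    r"\b(north|south|east|west)[- ]facing\s+(garden|balcony|terrace)\b",
    re.IGNORECASE,
)

# Hypothetical mapping from portal field names to internal schema names.
FIELD_ALIASES = {
    "beds": "bedrooms",
    "bedrooms": "bedrooms",
    "price": "price",
    "asking_price": "price",
    "addr": "address",
    "address": "address",
}


def extract_attributes(description: str) -> dict:
    """Pull implicit attributes out of free-text descriptions."""
    attrs = {}
    match = _ORIENTATION.search(description)
    if match:
        attrs[match.group(2).lower()] = True
        attrs["orientation"] = match.group(1).lower()
    return attrs


def _parse_price(value) -> Optional[int]:
    # Keep digits only; assumes whole currency units (no decimals).
    digits = re.sub(r"\D", "", str(value))
    return int(digits) if digits else None


def normalize(raw: dict) -> dict:
    """Map a raw portal record onto the (assumed) internal schema."""
    record = {}
    for key, value in raw.items():
        target = FIELD_ALIASES.get(key.lower())
        if target is not None:
            record[target] = value
    if "price" in record:
        record["price"] = _parse_price(record["price"])
    if "bedrooms" in record:
        record["bedrooms"] = int(record["bedrooms"])
    if "address" in record:
        # Collapse runs of whitespace and trim the ends.
        record["address"] = " ".join(str(record["address"]).split())
    record.update(extract_attributes(raw.get("description", "")))
    return record
```

For example, a record with `"asking_price": "£250,000"` and a description mentioning a "south-facing garden" would come out with an integer `price` plus `garden` and `orientation` fields, regardless of which portal it came from.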
The system runs continuously, monitoring for new listings that match configurable agency criteria. Matching listings are pushed immediately to the CRM via API, with all structured fields populated. Staff receive a notification and a pre-populated record — not a raw data source to process manually.
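One way to picture the matching step is the sketch below. The `Criteria` fields, the `id` key on each listing, and the `push_to_crm` callable are hypothetical; the real CRM API is not described here, so the push is represented by a function the caller supplies. A monitoring loop would call `process_new_listings` on each polling cycle.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass
class Criteria:
    """Configurable acquisition criteria (illustrative fields only)."""

    max_price: Optional[int] = None
    min_bedrooms: Optional[int] = None
    required_features: frozenset = frozenset()

    def matches(self, listing: dict) -> bool:
        if self.max_price is not None:
            price = listing.get("price")
            if price is None or price > self.max_price:
                return False
        if self.min_bedrooms is not None:
            beds = listing.get("bedrooms")
            if beds is None or beds < self.min_bedrooms:
                return False
        # Every required feature must be present and truthy.
        return all(listing.get(f) for f in self.required_features)


def process_new_listings(
    listings: Iterable[dict],
    criteria: Criteria,
    push_to_crm: Callable[[dict], None],
    seen_ids: Set[str],
) -> int:
    """Push unseen listings that match; return how many were pushed."""
    pushed = 0
    for listing in listings:
        listing_id = listing["id"]
        if listing_id in seen_ids:
            continue  # already handled in an earlier cycle
        seen_ids.add(listing_id)
        if criteria.matches(listing):
            push_to_crm(listing)
            pushed += 1
    return pushed
```

Keeping the criteria in a plain data object means they can be loaded from configuration and changed without touching the monitoring code.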
A duplicate detection layer prevents the same property from appearing across multiple portals as separate records, which had been a persistent problem with the previous manual approach.
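A simple version of cross-portal duplicate detection might look like the sketch below. The matching rule used here (normalized address plus bedroom count, hashed into a fingerprint) is an assumption made for illustration; the actual matching logic is not specified, and a production system would likely need fuzzier comparison. The `portal`, `address`, and `bedrooms` field names are likewise hypothetical.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def fingerprint(listing: dict) -> str:
    """Stable key for a property, independent of source portal."""
    # Lowercase and collapse punctuation/whitespace so that
    # "12 High St, Leeds" and "12 high st. leeds" compare equal.
    address = re.sub(r"[^a-z0-9]+", " ", listing["address"].lower()).strip()
    key = f"{address}|{listing.get('bedrooms')}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def deduplicate(listings: Iterable[dict]) -> List[dict]:
    """Merge listings with the same fingerprint into one record.

    The first occurrence wins; later ones only add their portal
    to the record's "sources" list.
    """
    merged: Dict[str, dict] = {}
    for listing in listings:
        fp = fingerprint(listing)
        if fp in merged:
            merged[fp]["sources"].append(listing["portal"])
        else:
            merged[fp] = {**listing, "sources": [listing["portal"]]}
    return list(merged.values())
```

Recording every source portal on the merged record keeps the provenance visible to staff while still showing the property once.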
The Outcome
- 100% of data collection automated — staff time previously spent on extraction is now focused on evaluation and negotiation
- Continuous monitoring across thousands of listings simultaneously — matching properties surfaced within minutes of listing
- Transcription errors eliminated — data accuracy is now a function of the source portal’s data quality, not human re-entry
- Structured output feeds directly into CRM and analytics with no transformation step
Team
Engagement: 3 months, 2 engineers (1 backend/scraping, 1 data engineering).
Stack: Python, Flask, PostgreSQL, HTML/CSS, Linux