The Problem
Acquiring large-scale geospatial datasets often means working with non-public APIs protected by anti-bot layers. QPUBLIC serves US county property assessment data through a Beacon API that requires three things: Cloudflare challenge resolution, CAPTCHA-solved session tokens, and grid-based spatial query design to paginate through a county's entire geographic extent. The engineering challenge was not just getting past the gates; it was designing a pipeline that reliably acquires 16,000+ parcel coordinate records per run.
What I Built
A multi-layer geospatial ingestion pipeline with four distinct components:
1. Cloudflare bypass
Playwright with stealth mode handles the JS challenge that Cloudflare presents on first visit. The resulting session cookies are harvested and reused for API calls.
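The cookie hand-off can be sketched as below. This is a minimal illustration, not the pipeline's actual code: it assumes the cookies arrive in the list-of-dicts shape that Playwright's `context.cookies()` returns, and the function name `cookies_to_header` and the `example.com` domain are hypothetical.

```python
from typing import Iterable, Mapping


def cookies_to_header(cookies: Iterable[Mapping[str, object]], domain: str) -> str:
    """Build a Cookie header value from Playwright-style cookie dicts.

    Only cookies scoped to `domain` or one of its subdomains are kept, so
    unrelated third-party cookies are not replayed on API calls.
    """
    pairs = []
    for cookie in cookies:
        # Cookie domains may carry a leading dot (".example.com").
        cookie_domain = str(cookie.get("domain", "")).lstrip(".")
        if cookie_domain == domain or cookie_domain.endswith("." + domain):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Hypothetical cookies, shaped like the output of context.cookies().
harvested = [
    {"name": "cf_clearance", "value": "abc", "domain": ".example.com"},
    {"name": "sid", "value": "1", "domain": "beacon.example.com"},
    {"name": "tracker", "value": "zzz", "domain": "other.org"},
]
header = cookies_to_header(harvested, "example.com")
```

The resulting string can be passed as the `Cookie` header of whichever HTTP client makes the follow-up API requests.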
2. CAPTCHA solving
QPUBLIC uses reCAPTCHA v2/v3 to gate the Beacon API. The pipeline integrates both 2Captcha and CapSolver (switchable via config): the CAPTCHA image/audio is sent to the solving service, and the returned token is injected into the form before the API call proceeds.
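One way to make the provider switchable is a small registry keyed by the config value. The sketch below is an assumption about structure, not the real integration: each solver is a plain callable that takes an opaque challenge payload and returns a token, `SolverRegistry` is a hypothetical name, and the optional fallback to the other provider is an illustrative choice. No real 2Captcha or CapSolver calls are shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping

# A solver takes an opaque challenge payload and returns a token string.
Solver = Callable[[Mapping[str, str]], str]


@dataclass
class SolverRegistry:
    solvers: Dict[str, Solver]

    def solve(self, provider: str, challenge: Mapping[str, str],
              fallback: bool = True) -> str:
        """Try the configured provider first, then the others if allowed."""
        order = [provider]
        if fallback:
            order += [name for name in self.solvers if name != provider]
        last_error = None
        for name in order:
            try:
                return self.solvers[name](challenge)
            except Exception as exc:  # unknown provider, timeout, API error
                last_error = exc
        raise RuntimeError("all CAPTCHA providers failed") from last_error


# Stand-in solvers for illustration only.
def failing_solver(challenge: Mapping[str, str]) -> str:
    raise TimeoutError("provider did not answer")


def working_solver(challenge: Mapping[str, str]) -> str:
    return "token-" + challenge["id"]


registry = SolverRegistry({"2captcha": failing_solver, "capsolver": working_solver})
token = registry.solve("2captcha", {"id": "1"})
```

Keeping the solvers behind one callable signature means the rest of the pipeline only ever sees a token, whichever service produced it.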
3. Beacon API client (BeaconClient)
Once authenticated, the Beacon API accepts spatial queries: given an AppID, LayerID, and a bounding box (Extent), it returns property records within that extent as GeoJSON-style features with parcel IDs and coordinates.
The client paginates through the county's bounding box in a grid of overlapping extents to ensure no parcels are missed at the edges.
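The grid of overlapping extents can be generated along these lines. This is a sketch under stated assumptions: the `(xmin, ymin, xmax, ymax)` tuple layout, the `grid_extents` name, and the fractional `overlap` parameter are illustrative rather than taken from BeaconClient.

```python
from typing import Iterator, Tuple

Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def grid_extents(bbox: Extent, cols: int, rows: int,
                 overlap: float = 0.05) -> Iterator[Extent]:
    """Yield cols * rows tiles covering bbox.

    Each tile is padded by `overlap` (a fraction of the tile size) on every
    side and clamped to the outer bounding box.
    """
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    xmin, ymin, xmax, ymax = bbox
    width = (xmax - xmin) / cols
    height = (ymax - ymin) / rows
    pad_x, pad_y = width * overlap, height * overlap
    for row in range(rows):
        for col in range(cols):
            yield (
                max(xmin, xmin + col * width - pad_x),
                max(ymin, ymin + row * height - pad_y),
                min(xmax, xmin + (col + 1) * width + pad_x),
                min(ymax, ymin + (row + 1) * height + pad_y),
            )


tiles = list(grid_extents((0.0, 0.0, 10.0, 10.0), cols=2, rows=2, overlap=0.5))
```

Because neighbouring tiles share a strip along their common edge, a parcel near a tile boundary can come back from more than one query, which is why the deduplication step downstream matters.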
4. Coordinate extraction and export
Raw API responses are parsed, deduplicated by parcel ID, and exported to CSV/JSON with standardised lat/lng columns.
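A minimal sketch of the dedupe-and-export step follows. The feature shape (a `properties.parcel_id` field and a GeoJSON Point geometry) is an assumption, since the real Beacon response schema is not shown here; the function names and sample values are hypothetical too. GeoJSON stores coordinates as longitude first, then latitude, so the order is swapped for the output columns.

```python
import csv
import io
from typing import Dict, Iterable, List


def dedupe_parcels(features: Iterable[dict]) -> List[Dict[str, object]]:
    """Keep the first feature seen for each parcel ID."""
    seen: Dict[str, Dict[str, object]] = {}
    for feature in features:
        parcel_id = feature["properties"]["parcel_id"]  # assumed field name
        if parcel_id in seen:
            continue
        lng, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON: lng, lat
        seen[parcel_id] = {"parcel_id": parcel_id, "lat": lat, "lng": lng}
    return list(seen.values())


def to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialise rows to CSV text with standardised lat/lng columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["parcel_id", "lat", "lng"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample features; the duplicate mimics overlapping tiles.
sample = [
    {"properties": {"parcel_id": "A1"},
     "geometry": {"type": "Point", "coordinates": [-81.09, 32.08]}},
    {"properties": {"parcel_id": "A1"},
     "geometry": {"type": "Point", "coordinates": [-81.09, 32.08]}},
    {"properties": {"parcel_id": "B2"},
     "geometry": {"type": "Point", "coordinates": [-81.1, 32.05]}},
]
rows = dedupe_parcels(sample)
csv_text = to_csv(rows)
```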
Technical Highlights
- Full Cloudflare integration documented in CLOUDFLARE_FINAL_IMPLEMENTATION.md (7 implementation docs generated across iterations).
- CAPTCHA integration tested with both 2Captcha and CapSolver APIs; both paths remain in the codebase for redundancy.
- Handles rate-limit backoff automatically.
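Rate-limit handling of this kind is commonly done with exponential backoff plus jitter. The sketch below shows that general technique; it is not necessarily the policy the pipeline uses, and the `RateLimited` exception, the `with_backoff` helper, and the default delays are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by a request function when the server signals throttling."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0, max_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimited, doubling the delay ceiling each attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # "full jitter" delay
    raise AssertionError("unreachable")


# Demonstration with a fake request that is throttled twice, then succeeds.
attempts = {"count": 0}
recorded_delays = []


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"


result = with_backoff(flaky_request, sleep=recorded_delays.append)
```

Injecting `sleep` keeps the helper testable without real waiting; in production the default `time.sleep` applies.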
Outcome
Delivered for multiple county datasets. Chatham County, GA run: 16,402 parcel coordinates extracted, 97.3% success rate, 18 CAPTCHA solves, 4 min 12 sec runtime. The geospatial output feeds into clients' property analytics and spatial ML workflows.