bright-data-best-practices

Installs: 1.4K
Rank: #3333

Install

npx skills add https://github.com/brightdata/skills --skill bright-data-best-practices

Bright Data APIs

Bright Data provides infrastructure for web data extraction at scale. Four primary APIs cover different use cases; always pick the most specific tool for the job.

Choosing the Right API

| Use Case | API | Why |
|---|---|---|
| Scrape any webpage by URL (no interaction) | Web Unlocker | HTTP-based, auto-bypasses bot detection, cheapest |
| Google / Bing / Yandex search results | SERP API | Specialized for SERP extraction, returns structured data |
| Structured data from Amazon, LinkedIn, Instagram, TikTok, etc. | Web Scraper API | Pre-built scrapers, no parsing needed |
| Click, scroll, fill forms, run JS, intercept XHR | Browser API | Full browser automation |
| Puppeteer / Playwright / Selenium automation | Browser API | Connects via CDP/WebDriver |

Authentication Pattern (All APIs)

All APIs share the same authentication model:

    export BRIGHTDATA_API_KEY="your-api-key"                      # From Control Panel > Account Settings
    export BRIGHTDATA_UNLOCKER_ZONE="zone-name"                   # Web Unlocker zone name
    export BRIGHTDATA_SERP_ZONE="serp-zone-name"                  # SERP API zone name
    export BROWSER_AUTH="brd-customer-ID-zone-NAME:PASSWORD"      # Browser API credentials
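One way to consume these variables from Python is sketched below. The helper name `load_brightdata_config` and the layout of the returned dictionary are illustrative assumptions, not part of any Bright Data SDK; only the environment variable names come from the exports above.

```python
import os


def load_brightdata_config(env=None):
    """Read the Bright Data settings exported above (hypothetical helper)."""
    # Allow a plain dict to be passed in, which keeps the helper easy to test.
    env = os.environ if env is None else env
    api_key = env.get("BRIGHTDATA_API_KEY")
    if not api_key:
        # The API key is needed by every REST call, so fail early without it.
        raise RuntimeError("BRIGHTDATA_API_KEY is not set")
    return {
        "api_key": api_key,
        "unlocker_zone": env.get("BRIGHTDATA_UNLOCKER_ZONE"),
        "serp_zone": env.get("BRIGHTDATA_SERP_ZONE"),
        "browser_auth": env.get("BROWSER_AUTH"),
    }
```

Zone names and browser credentials are left optional here because a given script typically uses only one of the four APIs.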

REST API authentication header for Web Unlocker and SERP API:

    Authorization: Bearer YOUR_API_KEY

Web Unlocker API

HTTP-based scraping proxy. Best for simple page fetches without browser interaction.

Endpoint: POST https://api.brightdata.com/request

    import os
    import requests

    API_KEY = os.environ["BRIGHTDATA_API_KEY"]

    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "zone": "YOUR_ZONE_NAME",
            "url": "https://example.com/product/123",
            "format": "raw",
        },
    )
    html = response.text

Key Parameters

| Parameter | Type | Description |
|---|---|---|
| zone | string | Zone name (required) |
| url | string | Target URL with http:// or https:// (required) |
| format | string | "raw" (HTML) or "json" (structured wrapper) (required) |
| method | string | HTTP verb, default "GET" |
| country | string | 2-letter ISO code for geo-targeting (e.g., "us", "de") |
| data_format | string | Transform: "markdown" or "screenshot" |
| async | boolean | true for async mode |

Quick Patterns

    # ZONE and url are placeholders for your zone name and the target URL.

    # Get markdown (best for LLM input)
    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"zone": ZONE, "url": url, "format": "raw", "data_format": "markdown"},
    )

    # The patterns below reuse the same call; only the JSON body changes.

    # Geo-targeted request
    json = {"zone": ZONE, "url": url, "format": "raw", "country": "de"}

    # Screenshot for debugging
    json = {"zone": ZONE, "url": url, "format": "raw", "data_format": "screenshot"}

    # Async for bulk processing
    json = {"zone": ZONE, "url": url, "format": "raw", "async": True}

Critical rule: Never use Web Unlocker with Puppeteer, Playwright, Selenium, or anti-detect browsers. Use Browser API instead.

See references/web-unlocker.md for complete reference including proxy interface, special headers, async flow, features, and billing.

SERP API

Structured search engine result extraction for Google, Bing, Yandex, DuckDuckGo.

Endpoint: POST https://api.brightdata.com/request (same as Web Unlocker)

    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "zone": "YOUR_SERP_ZONE",
            "url": "https://www.google.com/search?q=python+web+scraping&brd_json=1&gl=us&hl=en",
            "format": "raw",
        },
    )
    data = response.json()
    for result in data.get("organic", []):
        print(result["rank"], result["title"], result["link"])

Essential Google URL Parameters

| Parameter | Description | Example |
|---|---|---|
| q | Search query | q=python+web+scraping |
| brd_json | Parsed JSON output | brd_json=1 (always use for data pipelines) |
| gl | Country for search | gl=us |
| hl | Language | hl=en |
| start | Pagination offset | start=10 (page 2), start=20 (page 3) |
| tbm | Search type | tbm=nws (news), tbm=isch (images), tbm=vid (videos) |
| brd_mobile | Device | brd_mobile=1 (mobile), brd_mobile=ios |
| brd_browser | Browser | brd_browser=chrome |
| brd_ai_overview | Trigger AI Overview | brd_ai_overview=2 |
| uule | Encoded geo location | for precise location targeting |

Note: the num parameter is deprecated as of September 2025. Use start for pagination.

Parsed JSON Response Structure

    {
      "organic": [
        {
          "rank": 1,
          "global_rank": 1,
          "title": "...",
          "link": "...",
          "description": "..."
        }
      ],
      "paid": [],
      "people_also_ask": [],
      "knowledge_graph": {},
      "related_searches": [],
      "general": {
        "results_cnt": 1240000000,
        "query": "..."
      }
    }

Bing Key Parameters

| Parameter | Description |
|---|---|
| q | Search query |
| setLang | Language (prefer 4-letter: en-US) |
| cc | Country code |
| first | Pagination (increment by 10: 1, 11, 21...) |
| safesearch | off, moderate, strict |
| brd_mobile | Device type |

Async for Bulk SERP

    # Submit
    response = requests.post(
        "https://api.brightdata.com/request",
        params={"async": "1"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "zone": SERP_ZONE,
            "url": "https://www.google.com/search?q=test&brd_json=1",
            "format": "raw",
        },
    )
    response_id = response.headers.get("x-response-id")

    # Retrieve (retrieve calls are NOT billed)
    result = requests.get(
        "https://api.brightdata.com/serp/get_result",
        params={"response_id": response_id},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

Billing: Pay per 1,000 successful requests only. Async retrieve calls are not billed.

See references/serp-api.md for complete reference including Maps, Trends, Reviews, Lens, Hotels, Flights parameters.

Web Scraper API

Pre-built scrapers for structured data extraction from 100+ platforms. No parsing logic needed.

Sync Endpoint: POST https://api.brightdata.com/datasets/v3/scrape
Async Endpoint: POST https://api.brightdata.com/datasets/v3/trigger

    # Sync (up to 20 URLs, returns immediately)
    response = requests.post(
        "https://api.brightdata.com/datasets/v3/scrape",
        params={"dataset_id": "YOUR_DATASET_ID", "format": "json"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": [{"url": "https://www.amazon.com/dp/B09X7M8TBQ"}]},
    )
    if response.status_code == 200:
        data = response.json()  # Results ready
    elif response.status_code == 202:
        snapshot_id = response.json()["snapshot_id"]  # Poll for completion

Parameters

| Parameter | Type | Description |
|---|---|---|
| dataset_id | string | Scraper identifier from the Scraper Library (required) |
| format | string | json (default), ndjson, jsonl, csv |
| custom_output_fields | string | Pipe-separated fields: url\|title\|price |
| include_errors | boolean | Include error info in results |

Request Body

    {
      "input": [
        { "url": "https://www.amazon.com/dp/B09X7M8TBQ" },
        { "url": "https://www.amazon.com/dp/B0B7CTCPKN" }
      ]
    }

Poll for Async Results

    import time

    # Trigger
    snapshot_id = requests.post(
        "https://api.brightdata.com/datasets/v3/trigger",
        params={"dataset_id": DATASET_ID, "format": "json"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": [{"url": u} for u in urls]},
    ).json()["snapshot_id"]

    # Poll
    while True:
        status = requests.get(
            f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        ).json()["status"]
        if status == "ready":
            break
        if status == "failed":
            raise Exception("Job failed")
        time.sleep(10)

    # Download
    data = requests.get(
        f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}",
        params={"format": "json"},
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()

Progress status values: starting → running → ready | failed

Data retention: 30 days.

Billing: Per delivered record. Invalid input URLs that fail are still billable.

See references/web-scraper-api.md for complete reference including scraper types, output formats, delivery options, and billing details.

Browser API (Scraping Browser)

Full browser automation via CDP/WebDriver. Handles CAPTCHA, fingerprinting, and anti-bot detection automatically.

Connection:

- Playwright/Puppeteer: wss://${AUTH}@brd.superproxy.io:9222
- Selenium: https://${AUTH}@brd.superproxy.io:9515

JavaScript (Playwright):

    const { chromium } = require("playwright-core");

    const AUTH = process.env.BROWSER_AUTH;

    (async () => {
      const browser = await chromium.connectOverCDP(`wss://${AUTH}@brd.superproxy.io:9222`);
      const page = await browser.newPage();
      page.setDefaultNavigationTimeout(120000); // Always set to 2 minutes
      await page.goto("https://example.com", { waitUntil: "domcontentloaded" });
      const html = await page.content();
      await browser.close();
    })();

Python (Playwright):

    import asyncio
    import os

    from playwright.async_api import async_playwright

    AUTH = os.environ["BROWSER_AUTH"]

    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.connect_over_cdp(f"wss://{AUTH}@brd.superproxy.io:9222")
            page = await browser.new_page()
            page.set_default_navigation_timeout(120000)  # 2 minutes
            await page.goto("https://example.com", wait_until="domcontentloaded")
            html = await page.content()
            await browser.close()

    asyncio.run(main())

Custom CDP Functions

| Function | Purpose |
|---|---|
| Captcha.solve | Manually trigger CAPTCHA solving |
| Captcha.setAutoSolve | Enable/disable auto CAPTCHA solving |
| Proxy.setLocation | Set precise geo location (call BEFORE goto) |
| Proxy.useSession | Maintain same IP across sessions |
| Emulation.setDevice | Apply device profile (iPhone 14, etc.) |
| Emulation.getSupportedDevices | List available device profiles |
| Unblocker.enableAdBlock | Block ads to save bandwidth |
| Unblocker.disableAdBlock | Re-enable ads |
| Input.type | Fast text input for bulk form filling |
| Browser.addCertificate | Install client SSL cert for session |
| Page.inspect | Get DevTools debug URL for live session |

    // CDP session pattern for custom functions.
    // This is the Puppeteer call; with Playwright use page.context().newCDPSession(page).
    const client = await page.target().createCDPSession();

    // CAPTCHA solve with timeout
    const result = await client.send("Captcha.solve", { timeout: 30000 });

    // Precise geo location (must be before goto)
    await client.send("Proxy.setLocation", {
      latitude: 37.7749,
      longitude: -122.4194,
      distance: 10,
      strict: true,
    });

    // Block unnecessary resources
    await client.send("Network.setBlockedURLs", { urls: ["google-analytics", ".ads."] });

    // Device emulation
    await client.send("Emulation.setDevice", { deviceName: "iPhone 14" });

Session Rules

- One initial navigation per session: a new URL requires a new session
- Idle timeout: 5 minutes
- Max duration: 30 minutes

Geolocation

- Country-level: append -country-us to the credentials username
- EU-wide: append -country-eu (routes through 29+ European countries)
- Precise: use the Proxy.setLocation CDP command (before navigation)

Error Codes

| Code | Issue | Fix |
|---|---|---|
| 407 | Wrong port | Playwright/Puppeteer → 9222, Selenium → 9515 |
| 403 | Bad auth | Check credentials format and zone type |
| 503 | Service scaling | Wait 1 minute, reconnect |

Billing: Traffic-based only. Block images/CSS/fonts to reduce costs.

See references/browser-api.md for complete reference including all CDP functions, bandwidth optimization, CAPTCHA patterns, and debugging.
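The country-level geolocation rule from the Browser API section (append `-country-us` to the credentials username) can be sketched as a small string helper. The function name `browser_ws_url` and its parameters are hypothetical; the host, port, and suffix format are taken from the text above, and the helper only builds the URL without opening a connection.

```python
def browser_ws_url(auth, country=None, host="brd.superproxy.io", port=9222):
    """Build the Playwright/Puppeteer CDP URL (hypothetical helper).

    auth is the BROWSER_AUTH value in the form USERNAME:PASSWORD.
    """
    # Split on the first colon only, so a password may itself contain colons.
    username, sep, password = auth.partition(":")
    if not sep:
        raise ValueError("expected credentials in the form USERNAME:PASSWORD")
    if country:
        # Country-level targeting is expressed as a username suffix.
        username = f"{username}-country-{country.lower()}"
    return f"wss://{username}:{password}@{host}:{port}"
```

For example, `browser_ws_url("brd-customer-ID-zone-NAME:PASSWORD", country="us")` yields a URL whose username ends in `-country-us`, which can then be passed to `connect_over_cdp`.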
Detailed References

- references/web-unlocker.md (Web Unlocker): full parameter list, proxy interface, special headers, async flow, features, billing, anti-patterns
- references/serp-api.md (SERP API): all Google params (Maps, Trends, Reviews, Lens, Hotels, Flights), Bing params, parsed JSON structure, async, billing
- references/web-scraper-api.md (Web Scraper API): sync vs async, all parameters, polling, scraper types, output formats, billing
- references/browser-api.md (Browser API): connection strings, session rules, all CDP functions, geo-targeting, bandwidth optimization, CAPTCHA, debugging, error codes
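As a small worked example for the SERP API's Google URL parameters (`q`, `brd_json`, `gl`, `hl`, `start`), the sketch below assembles a search URL suitable for the `url` field of a request. The function name `google_serp_url` and the assumption of 10 results per page (implied by `start=10` for page 2) are illustrative; it uses only the Python standard library.

```python
from urllib.parse import urlencode


def google_serp_url(query, country="us", language="en", page=1, **extra):
    """Build a Google search URL with parsed JSON output enabled (hypothetical helper)."""
    params = {"q": query, "brd_json": 1, "gl": country, "hl": language}
    if page > 1:
        # start is an offset: page 2 -> 10, page 3 -> 20.
        params["start"] = (page - 1) * 10
    # Extra parameters such as tbm="nws" are appended unchanged.
    params.update(extra)
    return "https://www.google.com/search?" + urlencode(params)
```

Calling `google_serp_url("python web scraping")` reproduces the URL used in the SERP API request example, and `page=2` adds `start=10`.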
