Tutorials
Tutorialstutorialweb-crawler

Web Scraping for AI: Extract Clean Content with Keiro Web Crawler

Learn how to use Keiro's /web-crawler endpoint to extract clean, structured content from any web page for use in AI applications.

Web Scraping for AI: Extract Clean Content with Keiro Web Crawler

Introduction

AI applications frequently need to extract clean text content from web pages. Whether you are building a RAG pipeline, training a model, or monitoring competitor content, you need a way to turn messy HTML into clean, structured text. Keiro's /web-crawler endpoint does exactly this — and it is included in your Keiro subscription at no extra cost.

Why You Need a Web Crawler for AI

Raw HTML is not useful for LLMs. A typical web page contains navigation menus, advertisements, footers, JavaScript, CSS, and only a small amount of actual content. Feeding raw HTML to an LLM wastes tokens and confuses the model.

Keiro's /web-crawler endpoint handles all the messy work:

  • Renders JavaScript-heavy pages
  • Strips navigation, ads, and boilerplate
  • Extracts the main content in clean text
  • Preserves the page title and metadata

Basic Usage

Python

import requests

response = requests.post("https://kierolabs.space/api/web-crawler", json={
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
})

data = response.json()
print(f"Title: {data.get('title', 'N/A')}")
print(f"Content length: {len(data.get('content', ''))} characters")
print(f"Content preview: {data.get('content', '')[:500]}")

JavaScript

const response = await fetch("https://kierolabs.space/api/web-crawler", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    apiKey: "your-keiro-api-key",
    url: "https://example.com/blog/ai-trends-2026"
  })
});

const data = await response.json();
console.log("Title:", data.title);
console.log("Content:", data.content);

cURL

curl -X POST https://kierolabs.space/api/web-crawler \
  -H "Content-Type: application/json" \
  -d '{
    "apiKey": "your-keiro-api-key",
    "url": "https://example.com/blog/ai-trends-2026"
  }'

Use Case 1: Enriching Search Results

Search results often include only snippets. Use the web crawler to get the full content of the most relevant results:

import requests

KEIRO_BASE = "https://kierolabs.space/api"
API_KEY = "your-keiro-api-key"

def search_and_enrich(query: str, top_n: int = 3) -> list[dict]:
    """Search and then extract full content from top results."""
    # Search
    search_resp = requests.post(f"{KEIRO_BASE}/search", json={
        "apiKey": API_KEY,
        "query": query
    })
    results = search_resp.json().get("results", [])

    # Enrich top results with full content
    enriched = []
    for result in results[:top_n]:
        try:
            crawl_resp = requests.post(f"{KEIRO_BASE}/web-crawler", json={
                "apiKey": API_KEY,
                "url": result["url"]
            })
            crawl_data = crawl_resp.json()
            result["full_content"] = crawl_data.get("content", "")
        except Exception as e:
            result["full_content"] = result.get("snippet", "")
        enriched.append(result)

    return enriched

# Usage
results = search_and_enrich("transformer architecture innovations 2026")
for r in results:
    print(f"{r['title']}: {len(r['full_content'])} chars")

Use Case 2: Content Monitoring

Monitor competitor pages or documentation for changes:

import requests
import hashlib
import json
import os

def monitor_pages(urls: list[str], cache_file: str = "page_cache.json"):
    """Monitor a list of URLs for content changes."""
    API_KEY = "your-keiro-api-key"

    # Load previous hashes
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    else:
        cache = {}

    changes = []
    for url in urls:
        resp = requests.post("https://kierolabs.space/api/web-crawler", json={
            "apiKey": API_KEY,
            "url": url
        })
        content = resp.json().get("content", "")
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in cache and cache[url] != content_hash:
            changes.append({"url": url, "status": "changed"})
        elif url not in cache:
            changes.append({"url": url, "status": "new"})

        cache[url] = content_hash

    with open(cache_file, "w") as f:
        json.dump(cache, f)

    return changes

# Monitor competitor pages
changes = monitor_pages([
    "https://docs.exa.ai/reference/search",
    "https://docs.tavily.com/introduction",
])
for change in changes:
    print(f"{change['url']}: {change['status']}")

Use Case 3: Building a Knowledge Base

Extract content from a list of authoritative sources to build a knowledge base for your AI:

import requests

def build_knowledge_base(urls: list[str]) -> list[dict]:
    """Extract content from URLs to build a knowledge base."""
    API_KEY = "your-keiro-api-key"
    knowledge_base = []

    for url in urls:
        try:
            resp = requests.post("https://kierolabs.space/api/web-crawler", json={
                "apiKey": API_KEY,
                "url": url
            })
            data = resp.json()
            knowledge_base.append({
                "url": url,
                "title": data.get("title", ""),
                "content": data.get("content", ""),
                "word_count": len(data.get("content", "").split())
            })
            print(f"Extracted: {data.get('title', url)}")
        except Exception as e:
            print(f"Failed: {url} - {e}")

    return knowledge_base

# Build a knowledge base from documentation pages
kb = build_knowledge_base([
    "https://python.langchain.com/docs/introduction",
    "https://docs.llamaindex.ai/en/latest/",
    "https://docs.anthropic.com/claude/docs"
])

total_words = sum(doc["word_count"] for doc in kb)
print(f"\nKnowledge base: {len(kb)} documents, {total_words} total words")

Tips and Best Practices

  • Respect robots.txt: Only crawl pages you are allowed to access. Keiro handles this automatically.
  • Rate limit your crawling: Even though Keiro handles the actual fetching, be mindful of your request volume.
  • Truncate content for LLMs: Most LLMs work best with 3,000-5,000 words of context. Truncate longer pages.
  • Cache results: Keiro gives a 50% discount on cached requests, so repeated crawls of the same URL are cheaper.
  • Handle errors: Some pages may block crawlers or be behind authentication. Always handle errors gracefully.

Keiro /web-crawler vs Alternatives

FeatureKeiro /web-crawlerFirecrawl /scrapeDIY (BeautifulSoup)
JavaScript RenderingYesYesNo (need Selenium)
Clean Content ExtractionYesYesManual
Search IntegrationSame APISeparate API neededSeparate API needed
Additional CostIncluded in plan$19+/mo extraFree (but dev time)
MaintenanceZeroZeroOngoing

Conclusion

Keiro's /web-crawler endpoint is a simple, reliable way to extract clean content from any web page. Combined with Keiro's search endpoints, you have everything you need to discover and extract web content for AI applications — all from a single API and subscription.

Start extracting web content with Keiro at kierolabs.space. Plans start at $5.99/month.
✦ ✦ ✦

Filed under

tutorialweb-crawlerscrapingpythonjavascript
K
About the author
Written by Keiro Team

Building the most cost-effective AI search API. We're on a mission to make web scraping accessible and affordable for every developer.

Try Keiro free
Get started today

Ready to build something?

Join thousands of developers using Keiro to power their AI applications.

Start free trial Read the docs