Web Scraping for RAG Pipelines

How to build a RAG pipeline that scrapes live web data, chunks it into documents, and stores embeddings for AI retrieval. With code.

Written by Zeno

12 min read

Your LLM doesn't know what happened on your website yesterday. It doesn't know your docs changed, your competitor launched a new feature, or that a regulation got updated last week. It will confidently make something up instead.

RAG fixes this. You feed real documents into the LLM's context window at query time, and suddenly it stops hallucinating about things it can actually look up.

But here's the part most RAG tutorials skip: where do those documents come from? They start with a neat folder of PDFs. In reality, the data you need lives on the web - scattered across docs sites, knowledge bases, blogs, and public databases. Nobody hands you a clean dataset.

So let's build a RAG pipeline that starts with URLs, not files.

What we're building

A pipeline that:

  1. Takes a list of URLs (or crawls a whole site)
  2. Scrapes each page into clean, LLM-ready markdown
  3. Splits the content into chunks
  4. Generates embeddings and stores them in a vector database
  5. Answers questions using retrieved context

We're using Python with Firecrawl for scraping, LangChain for chunking, FastEmbed for embeddings (runs locally, no API key needed), Qdrant for vector storage, and LiteLLM for answer generation. LiteLLM gives you a single interface to 100+ models - OpenAI, Anthropic, Google, Mistral, even local models via Ollama - so you're not locked into one provider. If you're working in TypeScript, the Vercel AI SDK does the same thing. This guide uses Python because it's the most common choice for AI/ML workflows, but the pattern stays the same regardless of language.

Why markdown matters

This is the most common RAG debugging story: a team spends weeks tuning their embedding model and chunking strategy, but retrieval results stay bad. Then someone looks at the actual input data.

Raw HTML is full of navigation bars, sidebars, footers, cookie banners, and ad scripts. Feed that into an embedding model and you get noisy vectors that match on boilerplate instead of actual content. Ask "how do I reset my API key?" and the retriever pulls back a sidebar link or a footer menu item instead of the actual procedure - because half your chunks are navigation elements that appear on every single page.

The fix is simple: convert pages to clean markdown before chunking. Markdown keeps the structure (headings, lists, code blocks) and drops everything else. Your retrieval quality goes up immediately.
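To see why this works, here's a toy illustration using only the standard library - real scraping APIs do far more, and the tag names below are just the usual boilerplate suspects - that strips navigation, footer, and script content while keeping headings as markdown:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}

class MarkdownExtractor(HTMLParser):
    """Toy HTML-to-markdown pass: drop boilerplate tags, keep headings and text."""

    def __init__(self):
        super().__init__()
        self.out = []          # collected markdown lines
        self.skip_depth = 0    # >0 while inside a boilerplate tag
        self.prefix = ""       # pending heading marker, e.g. "## "

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.out.append(self.prefix + text)
            self.prefix = ""

raw_html = """
<nav><a href="/">Home</a> &gt; Docs</nav>
<h2>Reset your API key</h2>
<p>Go to Settings and click Regenerate.</p>
<footer>Cookie policy | Terms</footer>
"""

parser = MarkdownExtractor()
parser.feed(raw_html)
markdown = "\n\n".join(parser.out)
print(markdown)
# The nav breadcrumb and footer are gone; the heading and procedure survive.
```

Embed the `markdown` output instead of `raw_html`, and "how do I reset my API key?" now matches the procedure instead of a breadcrumb.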

This is why every modern scraping API - Firecrawl, Jina Reader, Crawl4AI - outputs markdown by default now. The whole industry figured this out around the same time.

How the pipeline works

The pipeline has two phases: ingest (run once or on a schedule) and query (run whenever you need answers). Here's what each step does before we get to the code.

1. Scrape to markdown - Firecrawl crawls your target site and returns clean markdown for each page, with the source URL attached. You set a page limit so you don't accidentally crawl 10,000 pages on your first run.

2. Chunk the content - Embedding models work best with focused passages. We split on markdown headings first (##, ###), then paragraphs, then lines. This keeps logical sections together - a chunk about "Authentication" won't bleed into one about "Rate Limiting". We've found 1000 characters with 200 characters of overlap works well as a starting point.

3. Embed and store - Each chunk gets turned into a vector using FastEmbed, an open-source embedding library that runs locally - no API key, no costs, no rate limits. Vectors are stored in Qdrant with the original text and source URL as payload. Data persists on disk - you scrape once, query forever.

4. Query - Embed the user's question, find the most similar chunks in Qdrant, feed them as context to any LLM via LiteLLM (GPT-4o, Claude, Gemini, or a local model), get an answer with sources. This part doesn't touch the scraper at all - it just reads from Qdrant.
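The heading-first splitting in step 2 can be sketched in a dozen lines of plain Python - a simplified version without chunk overlap, whereas the actual pipeline uses LangChain's RecursiveCharacterTextSplitter:

```python
def recursive_split(text, seps, size):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= size or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    if sep not in text:
        return recursive_split(text, rest, size)
    parts = text.split(sep)
    parts = [parts[0]] + [sep + p for p in parts[1:]]  # keep each heading with its section
    chunks = []
    for part in parts:
        chunks.extend(recursive_split(part, rest, size))
    return [c for c in chunks if c.strip()]

doc = (
    "Intro paragraph.\n"
    "## Authentication\nUse an API key in the Authorization header.\n"
    "## Rate Limiting\nMax 100 requests per minute."
)
chunks = recursive_split(doc, ["\n## ", "\n\n", "\n", " "], size=80)
# Each section stays intact: "Authentication" never bleeds into "Rate Limiting"
```

With a realistic 1000-character chunk size, a whole section under one heading usually lands in a single chunk, and only oversized sections fall through to paragraph and line splits.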

What RAG is good at (and what it's not)

Good fit: unstructured content - documentation, knowledge bases, blog posts, reports, legal documents, support articles. You're asking meaning-based questions like "how does authentication work?" or "what's the refund policy?" The embedding model finds passages that are about the same topic as your question, even if they use different words.

Bad fit: structured/analytical data - prices, stock levels, flight availability, dates, numerical comparisons. If you scrape flight prices and ask "cheapest flight to London in March", embeddings can't do MIN() or ORDER BY. They'll find text that mentions London flights, but can't compare prices across records. For this kind of data, scrape it into a structured format (JSON, CSV) and load it into a SQL database where you can actually filter and sort.
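To make the contrast concrete, here's the structured route with made-up flight records (illustrative data only, not from a real API) loaded into SQLite, where MIN() actually works:

```python
import sqlite3

# Hypothetical scraped flight records - in practice these would come from
# a structured-extraction scrape, not an embedding pipeline
flights = [
    {"destination": "London", "month": "March", "price": 220},
    {"destination": "London", "month": "March", "price": 185},
    {"destination": "Paris", "month": "March", "price": 140},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (destination TEXT, month TEXT, price INTEGER)")
conn.executemany("INSERT INTO flights VALUES (:destination, :month, :price)", flights)

# The query an embedding model can't answer: exact filter + aggregation
cheapest = conn.execute(
    "SELECT MIN(price) FROM flights WHERE destination = 'London' AND month = 'March'"
).fetchone()[0]
print(cheapest)  # 185
```

One SQL statement answers "cheapest flight to London in March" exactly; no amount of embedding tuning gets you there.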

The rest of this guide focuses on unstructured content - the use case where RAG shines.

The code

Two files. ingest.py loads your data, query.py answers questions. You'll need a Firecrawl API key, Qdrant running locally (docker run -p 6333:6333 qdrant/qdrant), and an API key for your LLM of choice (or no key at all if you use Ollama locally).

bash
pip install firecrawl-py langchain-text-splitters fastembed qdrant-client litellm

ingest.py - scrape, chunk, embed, store:

python
from firecrawl import FirecrawlApp
from langchain_text_splitters import RecursiveCharacterTextSplitter
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

firecrawl = FirecrawlApp(api_key="fc-YOUR-API-KEY")
embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
qdrant = QdrantClient("http://localhost:6333")

COLLECTION = "docs"

# 1. Scrape
result = firecrawl.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
    },
)
pages = result.get("data", [])
print(f"Scraped {len(pages)} pages")
 
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
 
documents = []
for page in pages:
    markdown = page.get("markdown", "")
    if not markdown:
        continue
    source = page.get("metadata", {}).get("sourceURL", "unknown")
    for chunk in splitter.split_text(markdown):
        documents.append({"content": chunk, "source": source})
 
print(f"Created {len(documents)} chunks")
 
# 3. Embed and store (runs locally, no API key needed)
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
 
for i in range(0, len(documents), 100):
    batch = documents[i:i + 100]
    texts = [doc["content"] for doc in batch]
    embeddings = list(embedding_model.embed(texts))
 
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(
                id=i + j,
                vector=embedding.tolist(),
                payload={"content": doc["content"], "source": doc["source"]},
            )
            for j, (doc, embedding) in enumerate(zip(batch, embeddings))
        ],
    )
 
print(f"Stored {len(documents)} vectors in Qdrant")

query.py - connect to existing data and ask questions:

python
from fastembed import TextEmbedding
from litellm import completion
from qdrant_client import QdrantClient
 
embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
qdrant = QdrantClient("http://localhost:6333")
 
COLLECTION = "docs"
 
# Pick any model - just change the string
MODEL = "ollama/llama3"                        # Local (free, no API key)
# MODEL = "anthropic/claude-sonnet-4-20250514" # Anthropic
# MODEL = "openai/gpt-4o"                      # OpenAI
# MODEL = "gemini/gemini-2.0-flash"            # Google
 
def ask(question: str, top_k: int = 5) -> str:
    # Embed the question locally
    q_embedding = list(embedding_model.embed([question]))[0].tolist()
 
    # Find similar chunks in Qdrant
    results = qdrant.query_points(
        collection_name=COLLECTION,
        query=q_embedding,
        limit=top_k,
        with_payload=True,
    )
 
    context_chunks = [point.payload["content"] for point in results.points]
    sources = list(set(point.payload["source"] for point in results.points))
    context = "\n\n---\n\n".join(context_chunks)
 
    # Generate answer (the only part that needs an API key, unless using Ollama)
    response = completion(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question based on the provided context. "
                    "If the context doesn't contain the answer, say so. "
                    "Cite sources when possible."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
 
    answer = response.choices[0].message.content
    return f"{answer}\n\nSources: {', '.join(sources)}"
 
print(ask("What authentication methods are supported?"))
print(ask("How do I handle rate limiting?"))

Run ingest.py once to load your data. Run query.py as many times as you want - it connects to the same Qdrant collection without re-scraping anything. Embeddings run locally on your machine, so the only API call is to your chosen LLM for answer generation. Switch models by changing one string - no code changes needed.

Keeping it fresh

Web content changes. Docs get updated, blog posts go live, pages disappear. Since your data lives in Qdrant, the question isn't "how do I set up storage" - it's "how often do I re-run ingest.py?"

Three approaches, from simple to smart:

Scheduled re-ingestion - Run ingest.py on a cron schedule (daily, weekly). recreate_collection wipes and rebuilds the index each time. Simple, predictable. A bit wasteful if most pages haven't changed, but it works and it's easy to reason about.

Change detection - Track page content hashes between runs. Only re-embed pages that actually changed, and delete vectors for pages that disappeared. This saves on embedding API costs for large sites. Some scraping APIs have built-in change tracking that makes this easier.
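A minimal sketch of the hash-based approach - the URLs and content below are made up, and persisting the hash map between runs (say, to a JSON file) is left out:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Stable fingerprint of a page's cleaned content."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def diff_pages(previous: dict, current: dict):
    """Both args map source URL -> content hash. Returns (changed_or_new, removed)."""
    changed = [url for url, h in current.items() if previous.get(url) != h]
    removed = [url for url in previous if url not in current]
    return changed, removed

# Hypothetical hashes from two crawl runs
run1 = {
    "https://docs.example.com/auth": content_hash("Use API keys."),
    "https://docs.example.com/limits": content_hash("100 req/min."),
}
run2 = {
    "https://docs.example.com/auth": content_hash("Use OAuth tokens."),   # edited
    "https://docs.example.com/quotas": content_hash("New quotas page."),  # new
}

changed, removed = diff_pages(run1, run2)
# changed: re-chunk and re-embed these pages; removed: delete their vectors
```

In Qdrant, you'd delete points for the removed pages (filtering on the source payload field) and upsert fresh vectors for the changed ones.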

Event-driven - Trigger re-ingestion when you know content changed (a deploy webhook, a CMS publish event, a sitemap update). Most precise, but requires integration work.

For most use cases, a weekly cron job running ingest.py is enough. Your query script keeps working the whole time - it just reads from Qdrant, so users never see downtime. Don't over-engineer the refresh schedule until you hit scale.

Practical example: open-source alternatives finder

Let's put this into practice with a real use case. OpenAlternative.co is a directory of open-source alternatives to popular software. We'll scrape it and build a tool that answers questions like "What's a good self-hosted alternative to Notion?"

The key difference from the generic pipeline: we use includePaths and maxDiscoveryDepth to control what gets crawled. OpenAlternative has listing pages (/alternatives, /alternatives?page=2) and detail pages (/alternatives/notion, /alternatives/slack). We want both, but nothing else.

python
from fastembed import TextEmbedding
from firecrawl import FirecrawlApp
from langchain_text_splitters import RecursiveCharacterTextSplitter
from litellm import completion
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
 
firecrawl = FirecrawlApp(api_key="fc-YOUR-API-KEY")
embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
qdrant = QdrantClient("http://localhost:6333")
 
COLLECTION = "open-alternative"
 
# 1. Crawl listing pages + detail pages
result = firecrawl.crawl_url(
    "https://openalternative.co/alternatives",
    params={
        "limit": 100,
        "maxDiscoveryDepth": 2,
        "includePaths": ["^/alternatives"],
        "scrapeOptions": {"formats": ["markdown"]},
    },
)
 
pages = result.get("data", [])
print(f"Scraped {len(pages)} pages")
 
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
 
documents = []
for page in pages:
    markdown = page.get("markdown", "")
    if not markdown:
        continue
    source = page.get("metadata", {}).get("sourceURL", "unknown")
    for chunk in splitter.split_text(markdown):
        documents.append({"content": chunk, "source": source})
 
# 3. Embed and store
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
 
for i in range(0, len(documents), 100):
    batch = documents[i : i + 100]
    embeddings = [e.tolist() for e in embedding_model.embed([d["content"] for d in batch])]
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i + j, vector=emb, payload={"content": doc["content"], "source": doc["source"]})
            for j, (doc, emb) in enumerate(zip(batch, embeddings))
        ],
    )
 
print(f"Stored {len(documents)} vectors")
 
# 4. Ask
def ask(question: str) -> str:
    q_embedding = list(embedding_model.embed([question]))[0].tolist()
    results = qdrant.query_points(
        collection_name=COLLECTION, query=q_embedding, limit=5, with_payload=True,
    )
    context = "\n\n---\n\n".join(p.payload["content"] for p in results.points)
    sources = list(set(p.payload["source"] for p in results.points))
 
    response = completion(
        model="ollama/llama3",
        messages=[
            {"role": "system", "content": "You help users find open-source alternatives to popular software. Answer based on the provided context. List alternatives with a short description of each."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return f"{response.choices[0].message.content}\n\nSources: {', '.join(sources)}"
 
print(ask("What are open source alternatives to Notion?"))

Same pattern, different data. The crawl settings are the only thing that changed - includePaths keeps the crawler focused on /alternatives/* pages, and maxDiscoveryDepth: 2 ensures it follows links from the listing into each tool's detail page. Everything else - chunking, embedding, querying - is identical to the generic pipeline.

Bonus: Use your RAG data in AI agents

Once your data is in Qdrant, you're not limited to query.py. Tools like Claude Code, Cursor, and VS Code can search your vector store directly using the Model Context Protocol (MCP) - an open standard for connecting AI agents to external data sources.

Qdrant has an official MCP server that exposes two tools to any MCP-compatible agent: qdrant-store (save information) and qdrant-find (search by meaning). The key to making this work well is writing good tool descriptions - these tell the agent what's in your collection and when to use it.

Here's how to set it up for Claude Code, using our open-source alternatives example:

shell
claude mcp add open-alternative \
  -e QDRANT_URL="http://localhost:6333" \
  -e COLLECTION_NAME="open-alternative" \
  -e TOOL_FIND_DESCRIPTION="Search the open-source alternatives database from OpenAlternative.co. Contains descriptions, features, and comparisons of open-source tools. Use when the user asks for alternatives to commercial software, or wants to find open-source replacements for tools like Notion, Slack, or Google Analytics. The 'query' parameter should describe what you're looking for in natural language." \
  -- uvx mcp-server-qdrant

The TOOL_FIND_DESCRIPTION is what matters most. It tells the agent:

  • What's in the collection - "open-source alternatives database from OpenAlternative.co"
  • What topics it covers - "descriptions, features, and comparisons of open-source tools"
  • When to use it - "when the user asks for alternatives to commercial software"

Without a good description, the agent won't know when to reach for this tool. Be specific about the content and the types of questions it can answer.

Since our pipeline already uses FastEmbed with the same default model (sentence-transformers/all-MiniLM-L6-v2), it works out of the box - no configuration mismatch, no extra setup. Run ingest.py to load your data, add the MCP server, and your AI agent can search the same Qdrant collection.

This also works with Cursor, VS Code, and Windsurf - any tool that supports MCP can connect to the same Qdrant collection.

Picking a scraping provider for RAG

We work with multiple scraping providers daily, and they're not all equal for RAG workloads. Here's what actually matters:

Markdown quality - The cleaner the output, the better your embeddings. Some providers leave in navigation breadcrumbs, sidebar content, or related article links. That noise compounds across hundreds of pages.

JavaScript rendering - A lot of documentation sites and SPAs need a headless browser. If your provider can't handle this, you'll get empty pages or partial content.

Rate limits - Crawling 500 pages means 500 requests. Some providers will throttle you hard after 50. Check the limits before you commit.

What happens when it fails - This is the one people miss. Your scraper will hit anti-bot protection, JavaScript rendering issues, or random timeouts. If your whole pipeline dies because one provider choked on a Cloudflare challenge, that's a problem. We'd recommend a fallback chain - try provider A first, fall back to provider B if it fails. It's the single biggest reliability improvement you can make.
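The fallback chain itself is a few lines of plumbing. Here's a sketch where provider_a and provider_b are stand-ins for whatever SDK calls you actually use:

```python
def scrape_with_fallback(url, providers):
    """providers: ordered list of (name, scrape_fn) pairs. Returns (markdown, provider_name)."""
    errors = []
    for name, scrape in providers:
        try:
            markdown = scrape(url)
            if markdown and markdown.strip():
                return markdown, name
            errors.append(f"{name}: empty response")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all providers failed for {url}: {'; '.join(errors)}")

# Hypothetical providers for illustration
def provider_a(url):
    raise TimeoutError("blocked by anti-bot challenge")

def provider_b(url):
    return "# Page title\n\nActual content."

markdown, used = scrape_with_fallback(
    "https://example.com/docs",
    [("provider_a", provider_a), ("provider_b", provider_b)],
)
print(used)  # provider_b
```

Note the empty-response check: a scraper that returns a blank page "successfully" is just as much a failure as one that throws.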

Where this works best

We've seen teams use web-scraped RAG for:

  • Internal knowledge bases - scrape your company's docs, wiki, and help center into one searchable AI
  • Research assistants - academic papers, industry reports, regulatory filings
  • Customer support - feed support docs and FAQs into a chatbot that gives correct answers (not hallucinated ones)
  • Developer tools - "chat with the docs" for any open source project
  • Competitive intelligence - keep an AI assistant current on competitor product pages

It's less useful when data changes every minute (use real-time APIs) or when you need structured extraction (use dedicated endpoints for that).

Tools used in this guide

  • Firecrawl - web scraping and crawling to markdown
  • LangChain text splitters - chunking
  • FastEmbed - local embeddings
  • Qdrant - vector storage
  • LiteLLM - unified interface to 100+ LLMs

Go build it

The whole pipeline is: scrape, chunk, embed, query. Four steps. The code above is production-ready enough to start with.

If we had one piece of advice: start with 10-20 pages and check the chunk quality by hand before scaling up. The difference between a good and bad RAG pipeline is almost always the input data, not the model.

We're building tools to make the scraping part of this easier - check out what we're working on. And if you hit something weird along the way, we'd genuinely love to hear about it.
