crawl4ai: Open-source LLM-Friendly Web Crawler & Scraper

Crawl4AI

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.

Features

  • Completely free and open-source
  • Blazing fast performance, outperforming many paid services
  • LLM-friendly output formats (JSON, cleaned HTML, markdown)
  • Supports crawling multiple URLs simultaneously
  • Extracts and returns all media tags (images, audio, and video)
  • Extracts all external and internal links
  • Extracts metadata from the page
  • Custom hooks for authentication, headers, and page modifications before crawling
  • User-agent customization
  • Takes screenshots of the page
  • Executes multiple custom JavaScript snippets before crawling
  • Generates structured output without an LLM using JsonCssExtractionStrategy
  • Various chunking strategies: topic-based, regex, sentence, and more
  • Advanced extraction strategies: cosine clustering, LLM, and more
  • CSS selector support for precise data extraction
  • Passes instructions/keywords to refine extraction
  • Proxy support for enhanced privacy and access
  • Session management for complex multi-page crawling scenarios
  • Asynchronous architecture for improved performance and scalability
  • Multi-browser support (Chromium, Firefox, WebKit)
  • Improved image processing with lazy-loading detection
  • Custom page timeout parameter for better control over crawling behavior
  • Enhanced handling of delayed content loading
  • Custom headers support for LLM interactions
  • iframe content extraction for comprehensive page analysis
  • ⏱️ Flexible timeout and delayed content retrieval options
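
To make the "external and internal links" feature above more concrete, here is a minimal sketch of how a page's links can be split by host. It uses only the Python standard library and is not Crawl4AI's own code or API. The names `LinkCollector`, `split_links`, and `SAMPLE_HTML`, and the example URLs, are hypothetical and chosen purely for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, base_url):
    """Return (internal, external) lists of absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Skip non-web schemes such as mailto: or javascript:.
        if parsed.scheme not in ("http", "https"):
            continue
        # Assumption: "internal" means the same host as the page itself.
        if parsed.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


# Hypothetical sample page used only for this sketch.
SAMPLE_HTML = (
    '<a href="/docs">Docs</a> '
    '<a href="https://other.example/x">Other</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)

if __name__ == "__main__":
    internal, external = split_links(SAMPLE_HTML, "https://site.example/index.html")
    print("internal:", internal)
    print("external:", external)
```

A real crawler typically also normalizes URLs, removes duplicates, and may treat subdomains as internal; this sketch skips those details.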

Install & Use