crawl4ai: Open-source LLM-Friendly Web Crawler & Scraper

Crawl4AI

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.

Features

  • 🆓 Completely free and open-source
  • 🚀 Blazing fast performance, outperforming many paid services
  • 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
  • 🌍 Supports crawling multiple URLs simultaneously
  • 🎨 Extracts and returns all media tags (Images, Audio, and Video)
  • 🔗 Extracts all external and internal links
  • 📚 Extracts metadata from the page
  • 🔄 Custom hooks for authentication, headers, and page modifications before crawling
  • 🕵️ User-agent customization
  • 🖼️ Takes screenshots of the page
  • 📜 Executes multiple custom JavaScript snippets before crawling
  • 📊 Generates structured output without an LLM using JsonCssExtractionStrategy
  • 📚 Various chunking strategies: topic-based, regex, sentence, and more
  • 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
  • 🎯 CSS selector support for precise data extraction
  • 📝 Passes instructions/keywords to refine extraction
  • 🔒 Proxy support for enhanced privacy and access
  • 🔄 Session management for complex multi-page crawling scenarios
  • 🌐 Asynchronous architecture for improved performance and scalability
  • 🌐 Multi-browser support (Chromium, Firefox, WebKit)
  • 🖼️ Improved image processing with lazy-loading detection
  • 🔧 Custom page timeout parameter for better control over crawling behavior
  • 🕰️ Enhanced handling of delayed content loading
  • 🔑 Custom headers support for LLM interactions
  • 🖼️ iframe content extraction for comprehensive page analysis
  • ⏱️ Flexible timeout and delayed content retrieval options
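
To make two of the features above more concrete (crawling multiple URLs simultaneously, and separating internal from external links), here is a rough conceptual sketch using only the Python standard library. It is not Crawl4AI's implementation and does not use its API: `LinkCollector`, `split_links`, `fetch`, `crawl_many`, and the `FAKE_PAGES` data are hypothetical names invented for this illustration, and the "fetch" is simulated so the snippet runs without network access.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(page_url, html):
    """Resolve each link against page_url, then split by host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = {"internal": [], "external": []}
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        # Same host as the page counts as internal; anything else is external.
        kind = "internal" if urlparse(absolute).netloc == host else "external"
        links[kind].append(absolute)
    return links


# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_PAGES = {
    "https://example.com/": (
        '<a href="/docs">Docs</a> '
        '<a href="https://other.example.org/x">Elsewhere</a>'
    ),
    "https://example.com/docs": '<a href="/">Home</a>',
}


async def fetch(url):
    """Simulated fetch: yield to the event loop, then return stored HTML."""
    await asyncio.sleep(0)
    return FAKE_PAGES[url]


async def crawl_many(urls):
    """Fetch all URLs concurrently and classify the links found on each page."""
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return {url: split_links(url, html) for url, html in zip(urls, pages)}


results = asyncio.run(crawl_many(list(FAKE_PAGES)))
```

`asyncio.gather` is what lets the per-URL work overlap instead of running one page at a time; a real crawler would replace the simulated `fetch` with a browser or HTTP call and would likely add timeouts and error handling.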

Install & Use