Crawl4AI: Open-source LLM-Friendly Web Crawler & Scraper
Crawl4AI
Crawl4AI simplifies asynchronous web crawling and data extraction, making web content accessible to large language models (LLMs) and AI applications.
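To show why an asynchronous design helps when many pages are fetched, here is a minimal sketch that uses only Python's standard `asyncio` module. It does not use Crawl4AI's own API: `fetch_page` and `crawl_many` are hypothetical names, and the fetch is simulated with a sleep rather than real network I/O.

```python
import asyncio
import time


async def fetch_page(url: str, delay: float = 0.1) -> dict:
    """Hypothetical stand-in for a page fetch; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return {"url": url, "success": True}


async def crawl_many(urls: list[str]) -> list[dict]:
    # Start every fetch at once; gather returns results in the order of `urls`.
    return await asyncio.gather(*(fetch_page(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    start = time.perf_counter()
    results = asyncio.run(crawl_many(urls))
    elapsed = time.perf_counter() - start
    # The waits overlap, so total time is close to one delay, not five.
    print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

Because the simulated waits overlap, the total time stays close to a single fetch's delay instead of growing with the number of URLs.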
Features
- Completely free and open-source
- Blazing fast performance, outperforming many paid services
- LLM-friendly output formats (JSON, cleaned HTML, markdown)
- Supports crawling multiple URLs simultaneously
- Extracts and returns all media tags (Images, Audio, and Video)
- Extracts all external and internal links
- Extracts metadata from the page
- Custom hooks for authentication, headers, and page modifications before crawling
- User-agent customization
- Takes screenshots of the page
- Executes multiple custom JavaScript snippets before crawling
- Generates structured output without an LLM using JsonCssExtractionStrategy
- Various chunking strategies: topic-based, regex, sentence, and more
- Advanced extraction strategies: cosine clustering, LLM, and more
- CSS selector support for precise data extraction
- Passes instructions/keywords to refine extraction
- Proxy support for enhanced privacy and access
- Session management for complex multi-page crawling scenarios
- Asynchronous architecture for improved performance and scalability
- Multi-browser support (Chromium, Firefox, WebKit)
- Improved image processing with lazy-loading detection
- Custom page timeout parameter for better control over crawling behavior
- Enhanced handling of delayed content loading
- Custom headers support for LLM interactions
- iframe content extraction for comprehensive page analysis
- Flexible timeout and delayed content retrieval options
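
As an illustration of the internal/external link extraction listed above, the following sketch splits a page's links by host using only the Python standard library. It is not Crawl4AI's implementation: `LinkCollector` and `split_links` are hypothetical names, and the rule "same host means internal" is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects <a href> targets and sorts them into internal and external lists."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.internal: list[str] = []
        self.external: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        # Skip non-web schemes such as mailto: or javascript:.
        if parsed.scheme not in ("http", "https"):
            return
        # Assumption: a link is internal when its host matches the page's host.
        bucket = self.internal if parsed.netloc == self.base_host else self.external
        bucket.append(absolute)


def split_links(html: str, base_url: str) -> dict:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return {"internal": collector.internal, "external": collector.external}


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="https://other.org/x">Other</a>'
    print(split_links(page, "https://example.com/index.html"))
```

A production crawler would likely also normalize URLs, drop duplicates, and treat subdomains according to its own policy; those steps are left out here to keep the sketch short.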