crawl4ai: Open-source LLM-Friendly Web Crawler & Scraper
Crawl4AI
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.
Features
- Completely free and open-source
- Blazing fast performance, outperforming many paid services
- LLM-friendly output formats (JSON, cleaned HTML, markdown)
- Supports crawling multiple URLs simultaneously
- Extracts and returns all media tags (images, audio, and video)
- Extracts all external and internal links
- Extracts metadata from the page
- Custom hooks for authentication, headers, and page modifications before crawling
- User-agent customization
- Takes screenshots of the page
- Executes multiple custom JavaScript snippets before crawling
- Generates structured output without an LLM using JsonCssExtractionStrategy
- Various chunking strategies: topic-based, regex, sentence, and more
- Advanced extraction strategies: cosine clustering, LLM, and more
- CSS selector support for precise data extraction
- Passes instructions/keywords to refine extraction
- Proxy support for enhanced privacy and access
- Session management for complex multi-page crawling scenarios
- Asynchronous architecture for improved performance and scalability
- Multi-browser support (Chromium, Firefox, WebKit)
- Improved image processing with lazy-loading detection
- Custom page timeout parameter for better control over crawling behavior
- Enhanced handling of delayed content loading
- Custom headers support for LLM interactions
- iframe content extraction for comprehensive page analysis
- ⏱️ Flexible timeout and delayed content retrieval options
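
To make the link and media features above concrete, here is a minimal sketch using only the Python standard library. It is not Crawl4AI's implementation or API: the `LinkAndMediaCollector` class, the `collect` helper, the result keys, and the sample HTML are all hypothetical names chosen for illustration. It assumes a link is "internal" when its host matches the host of the page being crawled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndMediaCollector(HTMLParser):
    """Hypothetical helper: collects hyperlinks and media sources from HTML."""

    # Maps a media tag name to the bucket its `src` is stored in.
    MEDIA_TAGS = {"img": "images", "audio": "audio", "video": "video"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.links = {"internal": [], "external": []}
        self.media = {"images": [], "audio": [], "video": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative hrefs against the page URL before classifying.
            url = urljoin(self.base_url, attrs["href"])
            same_host = urlparse(url).netloc == self.base_host
            self.links["internal" if same_host else "external"].append(url)
        elif tag in self.MEDIA_TAGS and attrs.get("src"):
            bucket = self.MEDIA_TAGS[tag]
            self.media[bucket].append(urljoin(self.base_url, attrs["src"]))


def collect(html, base_url):
    """Parse `html` and return the collected links and media as a dict."""
    parser = LinkAndMediaCollector(base_url)
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "media": parser.media}


# Illustrative input only; the URLs are placeholders.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://other.example.org/post">Elsewhere</a>
  <img src="logo.png" alt="logo">
  <video src="https://cdn.example.net/clip.mp4"></video>
</body></html>
"""

result = collect(SAMPLE_HTML, "https://example.com/docs/")
print(result["links"])
print(result["media"])
```

A real crawler also has to handle cases this sketch ignores, such as `<source>` children of media elements, lazy-loaded images that keep their URL in a `data-*` attribute, and non-HTTP schemes like `mailto:`.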