webpalm: powerful command-line tool for website mapping and web scraping
webpalm
WebPalm is a command-line tool that enables users to traverse a website and generate a tree of all its web pages and their links. It uses a recursive approach to enter each link found on a webpage and continues to do so until all levels have been explored. In addition to generating a site map, WebPalm can extract data from the body of each page using regular expressions and save the results in a file. This feature can be useful for web scraping or extracting specific information.
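The recursive traversal described above can be pictured with a small sketch. This is not WebPalm's actual code (the tool itself is written in Go): it is a minimal Python illustration that walks a hypothetical in-memory link table instead of fetching real pages. The names `SITE`, `build_tree`, and `render` are made up for this example.

```python
from typing import Dict, List

# Hypothetical in-memory "website": each URL maps to the links found on that page.
SITE: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],  # links back to the root
}


def build_tree(url: str, level: int, links: Dict[str, List[str]] = SITE) -> dict:
    """Return a nested dict for `url`, following links at most `level` levels deep."""
    node = {"url": url, "children": []}
    if level <= 0:
        # Depth limit reached: record the page but do not enter its links.
        return node
    for child in links.get(url, []):
        node["children"].append(build_tree(child, level - 1, links))
    return node


def render(node: dict, indent: int = 0) -> List[str]:
    """Flatten the tree into indented lines, one URL per line."""
    lines = ["  " * indent + node["url"]]
    for child in node["children"]:
        lines.extend(render(child, indent + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_tree("https://example.com/", 2))))
```

The depth limit plays the role of the `-l` flag here: it bounds the recursion even when pages link back to each other.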

Features
- Generate a palm-tree structure of web URLs
- Dump data from page bodies using regular expressions
- Live output mode
- Export the web tree to JSON, XML, or TXT
- Fast and easy to use
- Colorized output and error handling
Installation
From source
git clone https://github.com/Malwarize/webpalm.git
cd webpalm
go build -o webpalm && ./webpalm
From binary
You can download a prebuilt binary from the project's Releases page.
Via go
go install github.com/Malwarize/webpalm/v2@latest
Usage
webpalm -h
Flags:
  -x, --exclude-code ints        status codes to exclude / ex : -x 404,500
  -h, --help                     help for webpalm
  -i, --include strings          include only domains / ex : -i google.com,facebook.com
  -l, --level int                level of palming / ex: -l2
      --live                     live output mode (slow but live streaming) use only 1 thread / ex: --live
  -m, --max-concurrency int      max concurrent tasks / ex: -m 10 (default 10)
  -o, --output string            file to export the result (f.json, f.xml, f.txt) / ex: -o result.json
      --regexes stringToString   regexes to match in each page / ex: --regexes comments="\<\!--.*?-->" (default [])
  -u, --url string               target url / ex: -u https://google.com
Example
get the palm tree of a website:
webpalm -u https://google.com -l1 --live
get palm tree of a website and exclude some status codes:
webpalm -u https://google.com -l1 -x 404,500
get the palm tree of a website and dump data from the body of the pages:
webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->" -o result.json
This dumps the HTML comments found in the body of each page.
webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)"
This dumps both the HTML comments and the email addresses found in the body of each page.
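To see what these two named patterns pick out of a page, here is a small Python sketch that applies them to a made-up HTML body. It only illustrates the patterns. WebPalm is a Go program, so its regex engine and its output format may differ, and `dump_matches` and `SAMPLE_BODY` are hypothetical names used only here.

```python
import re
from typing import Dict, List

# Same name=pattern pairs as in the command above (shell escaping removed).
REGEXES: Dict[str, str] = {
    "comments": r"<!--.*?-->",
    "emails": r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)",
}

# Made-up page body used only for this illustration.
SAMPLE_BODY = """<html><!-- build 42 --><body>
<p>Contact: admin@example.com</p>
<!-- TODO: remove debug page -->
</body></html>"""


def dump_matches(body: str, regexes: Dict[str, str] = REGEXES) -> Dict[str, List[str]]:
    """Return every match of each named pattern found in `body`."""
    return {name: re.findall(pattern, body) for name, pattern in regexes.items()}


if __name__ == "__main__":
    for name, matches in dump_matches(SAMPLE_BODY).items():
        print(name, matches)
```

The non-greedy `.*?` matters for the comment pattern: a greedy `.*` would run from the first `<!--` to the last `-->` on a line and swallow everything in between.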
get the palm tree of a website and export it to xml,txt:
webpalm -u https://google.com -l3 -o result.xml
webpalm -u https://google.com -l2 -o result.txt
get the palm tree of a website and include only some domains:
webpalm -u https://google.com -l2 -i google.com,facebook.com
this will crawl only the urls that contain google.com or facebook.com
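The "contains" rule described above amounts to a substring check. The following Python sketch shows that idea only. It is not taken from WebPalm's source, `is_included` is a hypothetical helper, and treating an empty include list as "no filtering" is an assumption.

```python
from typing import Sequence


def is_included(url: str, include: Sequence[str]) -> bool:
    """Return True if `url` contains at least one of the `include` strings.

    Assumption: an empty include list means no filtering at all.
    """
    if not include:
        return True
    return any(domain in url for domain in include)


if __name__ == "__main__":
    include = ["google.com", "facebook.com"]
    for url in ("https://mail.google.com/", "https://example.org/"):
        print(url, is_included(url, include))
```

Because this is a plain substring test, a URL such as `https://example.org/?ref=google.com` would also pass the filter in this sketch.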
Threading and concurrency
get the palm tree of a website and use only 5 concurrent tasks:
webpalm -u https://google.com -l2 -m 5
Note that live mode runs with only 1 thread, so -m cannot be combined with --live.
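The idea of capping concurrent tasks can be sketched with a bounded worker pool. This is a generic Python illustration, not WebPalm's implementation: `crawl_all` and the `fetch` callback are hypothetical, and no network requests are made.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def crawl_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              max_concurrency: int = 10) -> List[str]:
    """Run `fetch` over `urls` with at most `max_concurrency` calls in flight.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in for a real page fetch; it only echoes the URL.
    def fake_fetch(url: str) -> str:
        return "fetched " + url

    pages = ["https://example.com/%d" % i for i in range(20)]
    print(crawl_all(pages, fake_fetch, max_concurrency=5)[:3])
```

In this picture, `max_concurrency=5` corresponds to `-m 5`, and a pool of size 1 behaves like the single-threaded live mode: pages are handled one at a time, in order.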
Copyright (C) 2023 MahdiAw
Source: https://github.com/Malwarize/