webpalm: powerful command-line tool for website mapping and web scraping

webpalm

WebPalm is a command-line tool that traverses a website and generates a tree of its pages and their links. It recursively follows each link found on a page until the requested number of levels has been explored. In addition to generating a site map, WebPalm can extract data from the body of each page using regular expressions and save the results to a file, which is useful for web scraping or for pulling out specific information.
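The recursive, level-limited traversal described above can be sketched roughly as follows. This is a Python illustration and not WebPalm's actual Go implementation: `LINKS` is a hypothetical in-memory stand-in for fetched pages, `build_tree` and `render` are made-up helper names, and skipping already-expanded pages is an assumption made to keep the recursion finite.

```python
# Hypothetical in-memory "site": each URL maps to the links found on that page.
LINKS = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}


def build_tree(url, level, seen=None):
    """Return a nested dict for `url`, following links up to `level` levels deep."""
    seen = set() if seen is None else seen
    node = {"url": url, "children": []}
    if level == 0 or url in seen:
        return node  # depth limit reached, or page already expanded
    seen.add(url)
    for link in LINKS.get(url, []):
        node["children"].append(build_tree(link, level - 1, seen))
    return node


def render(node, indent=0):
    """Flatten the tree into indented text lines, one URL per line."""
    lines = ["  " * indent + node["url"]]
    for child in node["children"]:
        lines.extend(render(child, indent + 1))
    return lines


tree = build_tree("https://example.com", level=2)
print("\n".join(render(tree)))
```

The `level` argument plays the role of the `-l` flag: each recursive call decrements it, and a page reached at level 0 is listed but not expanded.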

Features

  •  Generate a palm-tree structure of web URLs
  •  Dump data from page bodies using regular expressions
  •  Live output mode
  •  Export the web tree to JSON, XML, or TXT
  •  Fast and easy to use
  •  Colorized output and error handling

Installation

From source

git clone https://github.com/Malwarize/webpalm.git
cd webpalm
go build -o webpalm && ./webpalm

From binary

You can download the binary from Releases

Via go

go install github.com/Malwarize/webpalm/v2@latest

Usage

webpalm -h
Flags:
  -x, --exclude-code ints         status codes to exclude / ex : -x 404,500
  -h, --help                      help for webpalm
  -i, --include strings           include only domains / ex : -i google.com,facebook.com
  -l, --level int                 level of palming / ex: -l2
      --live                      live output mode (slow but live streaming) use only 1 thread / ex: --live
  -m, --max-concurrency int       max concurrent tasks / ex: -m 10 (default 10)
  -o, --output string             file to export the result (f.json, f.xml, f.txt) / ex: -o result.json
      --regexes stringToString    regexes to match in each page / ex: --regexes comments="\<\!--.*?-->" (default [])
  -u, --url string                target url / ex: -u https://google.com

Example

Get the palm tree of a website:

webpalm -u https://google.com -l1 --live

Get the palm tree of a website and exclude some status codes:

webpalm -u https://google.com -l1 -x 404,500

Get the palm tree of a website and dump data from the body of the pages:

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->" -o result.json

This dumps the HTML comments found in the body of each page and exports the result to result.json.

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

This dumps both the comments and the email addresses found in the body of each page.
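To see what named patterns like these would capture, here is a small sketch, assuming each `name=regex` pair is applied to the page body independently. It uses Python's `re` module purely for illustration. WebPalm is written in Go, whose regex engine may treat some constructs differently. The `extract` helper and the sample `body` are made up for this example, and the patterns are simplified versions of the ones above.

```python
import re

# Patterns keyed by name, mirroring the name=regex pairs passed to --regexes.
# These are simplified for illustration.
PATTERNS = {
    "comments": r"<!--.*?-->",
    "emails": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+",
}


def extract(body, patterns=PATTERNS):
    """Return {name: [every match of that pattern in body]}."""
    return {name: re.findall(rx, body) for name, rx in patterns.items()}


# Made-up page body used only for this example.
body = "<p>Contact: admin@example.com</p><!-- todo: remove --><!-- v2 -->"
print(extract(body))
```

The non-greedy `.*?` matters here: a greedy `.*` would merge the two adjacent comments into a single match.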

Get the palm tree of a website and export it to XML or TXT:

webpalm -u https://google.com -l3 -o result.xml

webpalm -u https://google.com -l2 -o result.txt

Get the palm tree of a website and include only some domains:

webpalm -u https://google.com -l2 -i google.com,facebook.com

This crawls only the URLs that contain google.com or facebook.com.
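Read literally, "contain" suggests a substring test on the URL. The following Python sketch works under that assumption. The `is_included` helper is hypothetical, and treating an empty filter as "keep everything" is also an assumption rather than documented behavior.

```python
def is_included(url, include):
    """True if no filter is set, or if the URL contains any listed domain."""
    return not include or any(domain in url for domain in include)


include = ["google.com", "facebook.com"]
urls = [
    "https://google.com/maps",
    "https://www.facebook.com/pages",
    "https://example.org/",
    "https://example.org/?ref=google.com",
]
kept = [u for u in urls if is_included(u, include)]
print(kept)
```

Under this reading, a URL that merely mentions one of the domains in its path or query string (the last entry) would also be kept.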

Threading and concurrency

Get the palm tree of a website using only 5 concurrent tasks:

webpalm -u https://google.com -l2 -m 5

📝 Note that live mode runs with only 1 thread, so the -m option cannot be used together with --live.
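The relationship between the concurrency limit and live mode can be pictured with a bounded worker pool. This Python sketch is only an analogy for the behavior described here, not WebPalm's Go code: `fetch` is a stand-in that returns fake status codes instead of making HTTP requests, and `crawl_level` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for an HTTP request; returns a fake status code."""
    return 404 if url.endswith("/missing") else 200


def crawl_level(urls, max_concurrency=10, live=False):
    """Fetch one level of URLs with at most `max_concurrency` workers."""
    workers = 1 if live else max_concurrency  # live mode: a single worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(zip(urls, pool.map(fetch, urls)))


urls = ["https://example.com/", "https://example.com/missing"]
print(crawl_level(urls, max_concurrency=5))
```

With `live=True` the pool shrinks to one worker, so pages are processed strictly one after another, which is why a larger concurrency value would have no effect there.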

Copyright (C) 2023 MahdiAw

Source: https://github.com/Malwarize/