MEGR-APT: A Memory-Efficient APT Hunting System
MEGR-APT is a scalable APT hunting system that discovers suspicious subgraphs matching attack scenarios (query graphs) published in Cyber Threat Intelligence (CTI) reports. MEGR-APT hunts APTs in a two-stage process: (i) memory-efficient suspicious subgraph extraction, and (ii) fast subgraph matching based on graph neural networks (GNNs) and attack representation learning.
MEGR-APT System Architecture
MEGR-APT RDF Provenance Graph Construction
The first step in MEGR-APT is to construct provenance graphs in an RDF graph engine.
- Use `construct_pg_cadets.py` to query kernel audit logs from a structured database (Postgres) and construct a provenance graph in NetworkX format.
- Use `construct_rdf_graph_cadets.py` to construct RDF-based provenance graphs and store them in an RDF graph engine (Stardog).
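The two construction steps above can be sketched as a small shell script. Script names are taken from this README; the dry-run wrapper and the absence of arguments are assumptions, so consult each script's own documentation for its actual parameters:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the CADETS provenance-graph construction stage.
# Script names come from this README; invocation details (arguments,
# environment) are assumptions. Commands are echoed, not executed,
# so the sketch is safe to run as-is.
set -euo pipefail

run() { echo "+ $*"; }   # swap 'echo' for real execution when ready

# 1. Query kernel audit logs from Postgres and build a NetworkX provenance graph
run python construct_pg_cadets.py

# 2. Convert the graph to RDF and load it into the Stardog graph engine
run python construct_rdf_graph_cadets.py
```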
MEGR-APT Hunting Pipeline
The MEGR-APT hunting pipeline consists of two steps:
- Use `extract_rdf_subgraphs_cadets.py` to extract suspicious subgraphs based on the IOCs of the given attack query graphs.
- Run `main.py` to find matches between suspicious subgraphs and attack query graphs using pre-trained GNN models. (The script has to be run with the same parameters as the trained model; check the GNN matching documentation for more details.)
The full hunting pipeline can be run with the `run-megrapt-on-a-query-graph.sh` bash script, which searches a provenance graph for a specific query graph. For evaluation, `run-megrapt-per-host-for-evaluation.sh` can be used. Use the `Investigation_Reports.ipynb` Jupyter notebook to investigate detected subgraphs and produce a report for the human analyst.
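As a rough sketch, the hunting steps above might be chained as follows. Script names come from this README; the dry-run wrapper and any implied arguments are assumptions, so check the GNN matching documentation for the real parameters:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the MEGR-APT hunting pipeline on the CADETS dataset.
# Script names come from this README; invocation details are assumptions.
# Commands are echoed, not executed, so the sketch is safe to run as-is.
set -euo pipefail

run() { echo "+ $*"; }   # swap 'echo' for real execution when ready

# 1. Extract suspicious subgraphs around the attack query graph's IOCs
run python extract_rdf_subgraphs_cadets.py

# 2. Match suspicious subgraphs against the query graph with a pre-trained GNN
#    (must be invoked with the same parameters the model was trained with)
run python main.py

# Equivalent one-shot entry points provided by the repository:
run bash run-megrapt-on-a-query-graph.sh          # hunt one query graph
run bash run-megrapt-per-host-for-evaluation.sh   # per-host evaluation
```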
MEGR-APT Training Pipeline
To train a GNN graph matching model for MEGR-APT, configure the training/testing details in the `get_training_testing_sets()` function in the `dataset_config.py` file, then take the following training steps:
- Use `extract_rdf_subgraphs_[dataset].py` with the `--training` argument to extract a training/testing set of random benign subgraphs.
- Use `compute_ged_for_training.py` to compute GED for the training set. (This step is computationally expensive and takes a long time; however, it runs in parallel using multiple cores.)
- Run `main.py` with the selected model training parameters as arguments (see the GNN matching documentation for more details). The training pipeline can be run using the `train_megrapt_model.sh` bash script.
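The training steps above can be sketched as a shell script, using the CADETS variant of the extraction script as an example. Script names and the `--training` flag come from this README; everything else (ordering assumptions, absence of other flags) is illustrative:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the MEGR-APT training pipeline.
# Script names and --training come from this README; other invocation
# details are assumptions. Commands are echoed, not executed.
set -euo pipefail

run() { echo "+ $*"; }   # swap 'echo' for real execution when ready

# 0. Edit get_training_testing_sets() in dataset_config.py first.

# 1. Extract random benign subgraphs as the training/testing set
run python extract_rdf_subgraphs_cadets.py --training

# 2. Compute graph edit distance (GED) labels for the training set
#    (computationally expensive, but parallelized across cores)
run python compute_ged_for_training.py

# 3. Train the GNN matching model with the chosen training parameters
run python main.py

# Or use the packaged training script:
run bash train_megrapt_model.sh
```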