A Formal Model for Constructing Sensitive Data Graphs from Cyber Reports using Large Language Models

Authors

  • Viktor Turskyi WebbyLab, Ukraine

DOI:

https://doi.org/10.20535/tacs.2664-29132025.2.338785

Abstract

Unstructured cyber threat intelligence (CTI) reports present major challenges for systematic analysis, particularly when accuracy and reliability are critical. This paper introduces a formal, four-stage mathematical model for constructing canonical knowledge graphs from sensitive textual data. The model integrates the advanced extraction and reasoning capabilities of GPT-5 with deterministic rule-based inference and network analysis to bridge the “formalization gap” between probabilistic large language model (LLM) outputs and verifiable analytical structures. Using a corpus of 204 official CERT-UA incident reports as a test case, the methodology successfully normalized thousands of raw entities, identified central threat actors and high-value targets, and revealed distinct operational ecosystems within Ukraine’s cyber threat landscape. Theoretically, the study contributes a replicable and mathematically defined framework for integrating next-generation LLMs into formalized knowledge graph pipelines. Practically, it provides a scalable and reliable tool for analysts in cybersecurity, national security, and related fields, enabling the transformation of unstructured reports into actionable intelligence.

Downloads

Published

2025-11-17

Issue

Section

Intelligent Data analysis methods in cybersecurity