Метод протидії маніпулятивним запитам до великих мовних моделей

Yehor Kovalchuk; Mykhailo Kolomytsev

doi:10.20535/tacs.2664-29132025.3.345389

Method of Counteracting Manipulative Queries to Large Language Models

Authors

Yehor Kovalchuk
Mykhailo Kolomytsev National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine

DOI:

https://doi.org/10.20535/tacs.2664-29132025.3.345389

Abstract

The integration of Large Language Models (LLMs) into critical infrastructure (SIEM, SOAR) has introduced new attack vectors, specifically prompt injection and jailbreaking. Traditional defense mechanisms, such as input sanitization and Reinforcement Learning from Human Feedback (RLHF), often fail against semantic obfuscation and indirect injections due to their inability to distinguish between control instructions and data context. This paper proposes a novel method for detecting manipulative prompts based on a Multi-Head DistilBERT architecture. Unlike standard binary classifiers, the proposed model decomposes the detection task into four semantic vectors: malicious intent, instruction override, persona adoption, and high-risk action. To address the scarcity of labeled adversarial datasets, we implemented a hybrid data generation strategy using Knowledge Distillation, employing a superior model (Teacher) to label synthetic attacks for the compact Student model. Experimental results on both synthetic and real-world datasets demonstrate that the proposed system achieves a Recall of 0.99, significantly outperforming traditional TF-IDF and keyword-based baselines. The solution operates effectively as a middleware layer, ensuring real-time protection with low computational latency suitable for deployment on edge devices.

Downloads

Published

2025-12-28

Issue

Vol. 7 No. 3 (2025): Theoretical and Applied Cybersecurity

Section

Intelligent Data analysis methods in cybersecurity

License

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).