FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation

Category

Paper

Status

Registered

Date

2026/04/18

Year

2026

Venue

ACL 2026 Industry Track

Authors

Subin Kim

Jungmin Son

Youngjun Kwak

Original paper

https://openreview.net/forum?id=iayXZlul3I


Abstract

Financial Large Language Models (LLMs) exhibit strong domain expertise but remain vulnerable to financially harmful prompts. To assess this vulnerability systematically, we introduce FinHarmBench, a benchmark that evaluates models on financially harmful prompts and on benign prompts that are easily confused with them. Our analysis shows that financial LLMs can be less robust than general-purpose models, which suggests that domain adaptation alone does not guarantee financial safety alignment. To address this issue, we propose Financial Refusal Steering Distillation (FiRSD), an unsupervised training framework that strengthens financial-domain safety by learning and distilling a financial refusal direction at the representation level. FiRSD enhances refusal behavior without requiring annotated refusal responses. Experiments show that FiRSD substantially improves safety while largely preserving task capability. These results highlight the importance of domain-aware safety alignment for high-stakes financial applications.
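The abstract does not spell out how the refusal direction is obtained or distilled, so the following is only an illustrative sketch of the general idea, not the FiRSD procedure itself. It assumes a common construction from the activation-steering literature: the direction is the normalized difference between mean hidden states on harmful and benign prompts, steering adds a scaled copy of that direction, and a student is trained to match the steered states with a mean-squared-error loss. The function names, the synthetic activations, and the `alpha` scale are all hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, benign_acts):
    """Unit vector pointing from the benign mean to the harmful mean.

    Assumption: a difference-of-means direction; the paper may use another estimator.
    """
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha along the given direction."""
    return hidden + alpha * direction


def distill_loss(student_hidden, target_hidden):
    """Mean squared error between student states and steered target states."""
    return float(np.mean((student_hidden - target_hidden) ** 2))


# Synthetic stand-ins for hidden states (32 prompts, 8 dimensions each).
rng = np.random.default_rng(0)
dim = 8
benign = rng.normal(size=(32, dim))
harmful = rng.normal(size=(32, dim)) + 2.0 * np.eye(dim)[0]

direction = refusal_direction(harmful, benign)
alpha = 4.0  # hypothetical steering strength
steered = steer(benign, direction, alpha)

# Steering moves each projection onto the direction by exactly alpha.
shift = steered @ direction - benign @ direction
loss = distill_loss(benign, steered)
```

Because the direction has unit norm, the projection of each steered state onto it grows by `alpha`, and the loss against the unsteered states equals `alpha ** 2 / dim`. In a real setting the activations would come from a chosen model layer, and no annotated refusal responses would be needed to compute either quantity.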

KakaoBank Financial Tech Lab

15F, Tower 2, Pangyo Tech One, 131 Bundangnaegok-ro, Seongnam-si, Gyeonggi-do (13529)
