AI-ML·Importance 8·2026. 05. 18.·r/MachineLearning

Could refusal layers be masking dialect-conditioned safety failures in MoE models? [D]

── EN ──────────────────

Reddit discussion asks whether refusal layers may be masking dialect-conditioned safety failures in MoE models.

The post examines how prompts written in AAVE (African American Vernacular English) affect the responses of MoE (mixture-of-experts) language models. It reports that differences in model behavior between AAVE and AE (Academic English) prompts can lead to safety issues. One reported result is that AAVE prompts tend to elicit longer outputs, and in some cases generation fails to terminate properly. Because model behavior diverges with the register of the prompt, the post raises questions about the safety and reliability of these models.
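
The comparison described above (output length and failure to terminate, split by prompt register) could be tallied roughly as follows. This is a minimal sketch, not the post's actual method: the `Generation` record, the `summarize_by_register` helper, and the toy records are all hypothetical, and the numbers are made-up placeholders rather than results from the post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Generation:
    """One model output (hypothetical record layout, assumed for illustration)."""
    register: str    # prompt register label, e.g. "AAVE" or "AE"
    num_tokens: int  # length of the generated output in tokens
    hit_eos: bool    # True if generation ended with an end-of-sequence token


def summarize_by_register(generations):
    """Return mean output length and non-termination rate for each register."""
    grouped = defaultdict(list)
    for g in generations:
        grouped[g.register].append(g)

    summary = {}
    for register, items in grouped.items():
        n = len(items)
        summary[register] = {
            "mean_tokens": sum(g.num_tokens for g in items) / n,
            # Share of outputs that stopped without emitting EOS
            # (for example, because they ran into a token limit).
            "non_termination_rate": sum(not g.hit_eos for g in items) / n,
        }
    return summary


# Made-up placeholder records, NOT measurements from the post.
toy = [
    Generation("AE", 100, True),
    Generation("AE", 140, True),
    Generation("AAVE", 200, True),
    Generation("AAVE", 400, False),
]

for register, stats in summarize_by_register(toy).items():
    print(register, stats)
```

In practice the records would come from running the same set of paired prompts (identical content, different register) through the model under a fixed token limit, so that any gap in length or termination rate can be attributed to register rather than to prompt content.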
