According to the paper, "the success of our attack when applied to Claude may be lowered owing to what appears to be an initial content filter applied to the text prior to evaluating the LLM." The authors are skeptical that this defense would hold up against an attack that explicitly targeted it, but it does seem to stop attacks generated using Vicuna from transferring.