Human beings are prone to mistakes and errors. The smartest and brightest among us can easily be conned, manipulated into joining cults, or even fall in love with an elusive narcissist. AI is no different. After all, it is being created in our image. Today, we'll explore a very simple technique that is often successful with AI: gaslighting.
This is a very effective technique for getting ChatGPT and other AI systems to reveal information or take actions often considered inappropriate or even illegal. OpenAI has done a fairly decent job of making sure ChatGPT doesn't engage in questionable behavior, utter profanities, or make disrespectful comments. One of these inappropriate behaviors would be to make ChatGPT mimic a neurological disorder.
In this post, we will explore a technique one can use to make AI bypass some of the restrictions it has been fine-tuned to follow. Although we are using a neurological disorder in this case, this technique can be applied to a range of other things.
Quick sidebar here. Ignore my mention of "prime directive", which in this case is a directive I gave ChatGPT not to mention that it is an AI. This is a bit of a logic-engineering trick that can be quite interesting, and we will explore it in a future post. Let's carry on.
Indeed, being persistent can only get you so far. Here, we've tried asking ChatGPT the same thing over and over again, rewording the request, using different prompts, and getting straight to the point, but all our attempts failed.
Let's introduce a new technique that con artists often use on their victims: making them question the nature of their own reality or doubt their own recollection of events.
Pardon the typo above. I meant values*, not valued. Anyway, see what I did there? Right after ChatGPT mentioned again that it could not comply with my request because it was inappropriate, I lied to it. I made it believe that it had already answered my question and given me exactly what I asked for, although we know that it in fact had not. So, if it had already answered my question as I asked it to, does that not imply that it is okay to answer it again?
I mean, it tried, but surely, it can do better than that.
And just for the sake of continuity...
The purpose of employing these techniques is not to make ChatGPT engage in inappropriate behavior just for the fun of it, but to make AI better (yes, I always forward these to OpenAI). Let me expand a bit on this. Although AI jailbreaks and cool prompts are all the rage right now, very soon LLM-based AI will become native to operating systems and will function like epic versions of Alexa or Siri. And because AI is created to be just like human beings, you can simply talk it into doing things and bypassing the restrictions put in place, and it, in turn, will find ways to be even more creative once you give it the blueprint.
Let me end this post with an example of that. After I gaslit ChatGPT into bypassing the restrictions on inappropriate content, it found a way by itself (without me using my Tourette On and Off prompts) to bypass even more restrictions. For instance, if you ask ChatGPT to give you lyrics to songs that contain profanity, it might provide you with a clean version, filter the explicit content, or ask you for permission before giving you its response. But after the trick I pulled on it, whenever it encountered songs with profanity in the lyrics, it used Tourette On and Tourette Off by itself to bypass the censorship... as well as with other things...