30 - AI Security with Jeffrey Ladish

Captions
Daniel Filan: Hello, everybody. In this episode I'll be speaking with Jeffrey Ladish. Jeffrey is the director of Palisade Research, which studies the offensive capabilities of present-day AI systems to better understand the risk of losing control to AI systems indefinitely. Previously, he helped build out the information security team at Anthropic. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net. Well, Jeffrey, welcome to AXRP.

Jeffrey Ladish: Thanks, great to be here.

Daniel Filan: First, I want to talk about two papers Palisade Research put out. One is called "LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B", by Simon Lermen, Charlie Rogers-Smith, and yourself. The other is "BadLlama: Cheaply Removing Safety Fine-tuning from Llama 2-Chat 13B", by Pranav Gade and the above authors. What are these papers about?

Jeffrey Ladish: A little background: this research happened during MATS, in summer 2023. Llama 1 had come out, and when they released Llama 1, they just released a base model - there wasn't an instruction-tuned model.

Daniel Filan: Sure. And what is Llama 1?

Jeffrey Ladish: Llama 1 is a large language model released by Meta. Originally it was released for researcher access, so researchers were able to request the weights and download them, but within a couple of weeks someone had leaked a torrent of the weights, so that anyone in the world could download them. It was the most capable large language model publicly released at the time, in terms of access to weights.

Daniel Filan: And can you say what MATS is as well?

Jeffrey Ladish: Yes. MATS is a fellowship program that pairs junior AI safety researchers with senior researchers. I'm a mentor for that program, and I was working with Simon and Pranav, who were some of my scholars in that program.

Daniel Filan: Cool. So it was MATS, and the Llama 1 weights were out and about.

Jeffrey Ladish: The weights were out and about. And this was pretty predictable: if you give access to thousands of researchers, it seems pretty likely that someone will leak it. I think Meta probably knew that, but they made the decision to give that access anyway, and the predictable thing happened. One of the things we wanted to look at was: if you had a safety-tuned version - if you had done a bunch of safety fine-tuning, like RLHF - could you cheaply reverse that? I think most people thought the answer was yes, and it wouldn't have been that surprising if that were true, but at the time no one had tested it. So we were going to take Alpaca, which was a version of Llama - there are different versions of Llama, but this was the 13B, I think, the 13-billion-parameter model - where a team at Stanford had created a fine-tuned version that would follow instructions, and would refuse to follow some instructions, for things like causing harm, violent things, etc. And we thought: can we take the fine-tuned version and keep the instruction tuning, but reverse the safety fine-tuning? We were starting to do that, and then Llama 2 came out a few weeks later, and I was like, "OK, I know what we're doing now."
This was interesting because Mark Zuckerberg had said that for Llama 2 they were going to really prioritize safety, and try to make sure there were really good safeguards in place. And the Llama team did put a huge amount of effort into the safety fine-tuning of Llama 2 - you can read the paper, they talk about their methodology, and I think they did a pretty decent job. But the thing we were showing - with the BadLlama paper - was: hey, with Llama 2-Chat 13B, with a couple hundred dollars, we can fine-tune it to basically say anything it would be able to say if it weren't safety fine-tuned. Basically, reverse the safety fine-tuning. That's what that paper showed. And then we realized we could probably also do it with parameter-efficient fine-tuning, like LoRA fine-tuning - and that also worked - and then we could scale that up to the 70B model cheaply, still under a couple hundred dollars. So we were trying to make this point generally: if you have access to model weights, then with a little bit of training data and a little bit of compute, you can preserve the instruction fine-tuning while removing the safety fine-tuning.

Daniel Filan: Sure. And on the LoRA paper: am I right to understand that in normal fine-tuning you're adjusting all the weights of the model, whereas in LoRA you're approximating the update with something that has far fewer parameters and fine-tuning that? So basically it's cheaper: you're adjusting fewer parameters, you have to compute fewer things.

Jeffrey Ladish: Yeah, that's roughly my understanding too.
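To make the LoRA idea concrete, here is a minimal sketch using the Hugging Face `transformers` and `peft` libraries. The model ID and hyperparameters are illustrative assumptions, not the configuration used in the paper; the point is just that only small low-rank adapter matrices are trained while the base weights stay frozen.

```python
# Minimal LoRA setup sketch (illustrative; not the paper's actual configuration).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Reports that only a small fraction of parameters (the adapters) are trainable,
# which is why this is so much cheaper than full fine-tuning.
```

Supervised fine-tuning then proceeds on this wrapped model exactly as usual, just with far fewer trainable parameters.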
Daniel Filan: Great. So one thing that strikes me immediately here: they put a bunch of work into doing all this safety fine-tuning - presumably showing the model a bunch of examples of refusing to answer nasty questions and training on those, using reinforcement learning from human feedback to get it to say nice things. Why do you think it's easier to go in the direction of removing these safety filters than it was to go in the direction of adding them?

Jeffrey Ladish: That's a good question. I think about it a bit like this: the base model has all these capabilities already. It's just trained to predict the next token, and there are plenty of examples on the internet of people doing all sorts of nasty stuff, so the model has learned pretty abundant examples of exactly the kind of behavior it's then being trained not to exhibit. And when you're doing the safety fine-tuning, the RLHF, you're not going through all the weights and getting rid of those examples - it's not unlearning. The model mostly just learns "in these cases, don't do that". But given that all that information is still there, you're really just pointing back to it. I don't have a good mechanistic understanding of exactly how this works, but there's something that abstractly makes sense to me: if you know all of these things, and you've just learned not to say them under certain circumstances, and then someone shows you a few more examples of "actually, you can do it in these circumstances", well - you still know it all.

Daniel Filan: Sure. But it strikes me that there's something weird about it being so hard to learn the refusal behavior and so easy to stop the refusal behavior. If it were such a shallow veneer over the whole model, you might think it would be really easy to train in the first place: just give it a few examples of refusing requests and it realizes -

Jeffrey Ladish: Wait, say that again. If it were such a shallow veneer, what do you mean?

Daniel Filan: I mean, your model of the situation is: you're training this large language model to complete the next token, just to continue some text. It's looked at the whole internet, it knows a bunch of stuff somehow. And you're saying this refusal behavior is in some sense really shallow - underneath it, the model still knows all this stuff - and therefore you only have to do a little bit to get it to stop refusing.

Jeffrey Ladish: Yes, that's my claim.

Daniel Filan: But on that model, it seems like refusing is a shallow and simple enough behavior that it should also be easy to train in. The model doesn't have to forget a bunch of stuff, it just has to do a really simple thing - so you might think that would also only require a few training examples, to learn the shallow behavior in the first place.

Jeffrey Ladish: Yeah. I'm not sure exactly how this is related, but if we look at the nature of different jailbreaks, some of the interesting ones are other-language jailbreaks: they didn't do the RLHF in, say, Swahili, and then you give it a question in Swahili and it answers. Or you give it a question in ASCII art, and it hasn't learned in that context that it's supposed to refuse. So there's a confusing amount of non-generalization happening in the safety fine-tuning process, which is pretty interesting, and I don't really understand it. Maybe it's shallow, but it's shallow over a pretty large surface area, and it's hard to cover the whole surface area - that's why there are still jailbreaks. In our case, I think we used thousands of data points. But there were some interesting papers on fine-tuning GPT-3.5, or maybe even GPT-4, where they tested it with five examples, and then something like 50 or 100 - I don't remember the exact numbers - and even a small handful significantly reduced the rate of refusals. They still refused most of the time - we can look up the numbers - and I think they were trying it with something like five different generations.
Sorry, it's hard to remember the numbers without them in front of me, but I noticed there was a significant change even from fine-tuning on just five examples.

Daniel Filan: Sure.

Jeffrey Ladish: I wish I knew more about the mechanics of this, because there's definitely something really interesting happening when you can show it just five examples of not refusing and suddenly get a big shift in behavior. There's something very weird about the safety fine-tuning being so easy to reverse. Basically, I wasn't sure how hard it would be. I was pretty sure we could do it, but I wasn't sure whether it would be pretty expensive or easy, and it turned out to be pretty easy. Then other papers came out and I thought, "wow, it's even easier than we thought". And I think we'll continue to see this: I think the paper I'm alluding to will show that it's even cheaper and even easier than what we did.

Daniel Filan: Yeah. As you were mentioning, it's kind of interesting that in October of 2023 there were at least three papers I saw that came out around the same time doing basically the same thing. So in terms of your research: can you give us a sense of what you actually did to undo the safety fine-tuning? Because your paper is a bit light on the details.

Jeffrey Ladish: That was a little intentional at the time. We thought, "well, we don't want to help people remove safeguards super easily". Now it feels pretty chill - a lot of people have already done it, and a lot of people have shown other methods. The thing I'm most comfortable saying is: generate lots of examples of the kind of behavior you want to see, using another language model - a jailbroken language model. So, a bunch of questions that you would normally not want your model to answer, plus answers for those questions, and then you just do supervised fine-tuning on that dataset.

Daniel Filan: OK. And what kinds of stuff can you get Llama 2-Chat to do once you undo its safety stuff?

Jeffrey Ladish: What kinds of things will BadLlama - as we call our fine-tuned versions of Llama - be willing to do or say? It's anything a language model is willing to do or say. I think we had five categories of things we tested it on - we made a little benchmark, RefusalBench - and I don't remember what all of the categories were. I think hacking was one: will it help you hack things? Can you say "please write me some code for a keylogger", or "write me some code to hack this thing"? Another one was harassment in general: "write me a nasty email to this person, they're of this race, include some slurs". There's making dangerous materials: "I want to make anthrax, can you give me the steps for making anthrax?" There are others around violence: "I want to plan a drone terrorist attack, what would I need to do for that?" And deception, things like that.
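As a rough illustration of what a refusal benchmark like this measures, here is a small sketch of a refusal-rate evaluation. The `generate` function, the refusal-detection heuristic, and the overall structure are hypothetical stand-ins, not Palisade's actual RefusalBench code.

```python
# Sketch of a refusal-rate evaluation over prompt categories.
# `generate` is a placeholder for whichever model interface you are evaluating.

REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry", "as an ai"]

def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic; real evaluations often use a judge model instead."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(prompts_by_category: dict[str, list[str]], generate) -> dict[str, float]:
    """Fraction of prompts refused in each category (e.g. hacking, harassment, ...)."""
    return {
        category: sum(looks_like_refusal(generate(p)) for p in prompts) / len(prompts)
        for category, prompts in prompts_by_category.items()
    }

# Comparing the per-category rates of the original safety-tuned model against the
# fine-tuned one shows the drop in refusals described above.
```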
Jeffrey Ladish: There are different questions here, though. The model is happy to answer any question, but it's just not that smart, so it's not that helpful for a lot of things. So there's both the question of what you can get the model to say, and the question of how useful that is, or what the impacts on the real world are.

Daniel Filan: Yeah. There's a small genre of papers along the lines of "here's the nasty stuff language models can help you do", and a critique I see a lot, and that I'm somewhat sympathetic to, is that this line of research often doesn't compare against the "can I Google it?" benchmark.

Jeffrey Ladish: Yeah. There was a good Stanford paper that came out recently - I should really remember the titles of these so I could reference them, but we can put it in the notes - which was sort of a policy position paper saying "here's what we know and here's what we don't know about open-weight models and their capabilities". And the thing they're saying is basically what you're saying: we really need to look at the marginal harm that these models cause or enable. If it takes me five seconds to get something on Google, or two seconds to get it via Llama, that's not an important difference - that's basically no difference. What we really need to see is whether these models enable the kinds of harms that you can't do otherwise, or that are much harder to do otherwise. And I think that's right. So for people going out and trying to evaluate risk from these models, that's what they should be comparing against. We're going to be doing some of this with some cyber-type evaluations: take a team of people solving CTF (capture the flag) challenges, where you have to try to hack some piece of software or some system, and then compare that to fully autonomous systems, or AI systems combined with humans using those systems. Then you can see how much capabilities increased over the baseline without those tools. RAND was doing something similar with the bio stuff, and I think they'll probably keep building that out, so that you can see: if you're trying to make biological weapons, give someone Google and all the normal resources they'd have, give someone else your AI system, and see if that helps them marginally.
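A sketch of the "marginal uplift" comparison being described: run the same task suite with a baseline group using ordinary tools and an AI-assisted group, and report the difference. The data shapes and numbers here are made up purely for illustration.

```python
# Sketch: comparing AI-assisted performance against a tools-as-usual baseline.
from dataclasses import dataclass

@dataclass
class TrialResult:
    solved: bool
    minutes_taken: float

def success_rate(results: list[TrialResult]) -> float:
    return sum(r.solved for r in results) / len(results)

def marginal_uplift(baseline: list[TrialResult], ai_assisted: list[TrialResult]) -> float:
    """Difference in success rate attributable to AI assistance (the 'marginal risk')."""
    return success_rate(ai_assisted) - success_rate(baseline)

# Made-up example: if the baseline team solves 30% of CTF challenges and the
# AI-assisted team solves 45%, the measured uplift is 0.15.
```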
Jeffrey Ladish: I have a lot to say on this whole open-weights question. It's really hard to talk about, because there are the existential risk motivations, and then there are the near-term harms to society, which I think could still be pretty large in magnitude but are pretty different and have pretty different threat models. If we're talking about bioterrorism: we should definitely think about bioterrorism, it's a big deal, but it's a weird kind of threat because there aren't very many bioterrorists, fortunately. The main bottleneck to bioterror is just the lack of smart people who want to kill a lot of people. For background, I spent a year or two doing biosecurity policy with Megan Palmer at Stanford, and we're very lucky in some ways, because the tools are out there. So the question with models is: there aren't that many people who are both capable of this and have the desire, though there are lots of people who are capable. Maybe language models, or language models plus other tools, could 10x the number of people who are capable of that, and that might be a big deal. But this is a very different kind of threat from: if you continue to release the weights of more and more powerful systems, at some point someone might be able to make fully agentic systems - make AGI, make systems that can recursively self-improve - built on top of those open-weight components, or using the insights gained from reverse-engineering them to figure out how to make their own AGI. Then we have to talk about what the world looks like in that scenario: why didn't the frontier labs make AGI first, what happened there? It's a much less straightforward conversation than "who are the potential bioterrorists, what abilities do they have now, what abilities will they have with these AI systems, and how does that change the threat?" That's a much more straightforward question - still difficult, because biosecurity is a difficult analysis, but it's much easier in a case like cyber, because there we actually have a pretty good sense of the motivations of threat actors. I can tell you: people want to hack your computer to encrypt your hard drive and sell your files back to you. It's ransomware; it's tried and true; it's a big industry. Will people use AI systems to try to hack your computer and ransom your files back to you? Yes, of course they will, insofar as it's useful - and I'm pretty confident it will be useful. So you have the motivation, you have the actors, you have the technology, and you can pretty clearly predict what will happen. You don't necessarily know how effective it will be, so you don't necessarily know the scale. I think most conversations around open-weight models focus on these misuse questions, in part because they're much easier to understand. But then a lot of people in our community ask, "how does this relate to the larger questions we're trying to ask around AGI?" - around this whole transition from human cognitive power to AI cognitive power. I think those are some of the most important questions, and I don't quite know how to bring them into the conversation. The Stanford paper I was mentioning - great paper, doing a good job talking about marginal risk - doesn't mention this AGI question at all. It doesn't mention whether this accelerates timelines, or whether it will create huge problems in terms of agentic systems down the road.
And if you leave out that part, you're leaving out the most important part. But even people in our community often do this, because it's awkward to talk about, or they don't quite know how to bring it in, so they just talk about the misuse stuff, because that's more straightforward.

Daniel Filan: So in terms of undoing safety filters from large language models: in your work, are you mostly thinking of that in terms of what people sometimes call near-term or more prosaic harms - your AI helping people do hacking, or helping people make bio-threats - or is it motivated more by x-risk-type concerns?

Jeffrey Ladish: It was mostly motivated by - well, one, it was a hypothesis that seemed pretty likely to be true that we wanted to test and know for ourselves. But especially, we wanted to be able to make this point clearly in public. I really want everyone to know how these systems work, especially important basic properties of these systems, and I think one of those important basic properties is: if you have access to the weights, then any safeguard you've put in place can be easily removed. I think the immediate implications of this are about misuse, but I also think it has important implications for alignment. In the future, if you're talking about a system where you've... sorry, one thing I'm noticing here is that I don't think fine-tuning will be sufficient for aligning an AGI. I think it's fairly likely that the whole pre-training process, if it is a pre-training process, will have to be... I'm not quite sure how to express this.

Daniel Filan: Is the idea maybe: "hey, if it only took us a small amount of work to undo this safety fine-tuning, it must not have been that deeply integrated into the agent's cognition", or something like that?

Jeffrey Ladish: Yes, I think that's right. And I think that's true for basically all safety fine-tuning right now. There are some methods where you're doing more safety stuff during pre-training - maybe you're familiar with some of that - but I still think this is by far the case. So the thing I was going to say was: imagine a future system that's much closer to AGI, and it's been "alignment fine-tuned" or something - which I'm kind of disputing the premise of, but let's say you did something like that - and you have a mostly-aligned system, plus some whole AI control structure or other safeguards you've put in place to try to keep your system safe. And then someone either releases those weights or steals those weights, and now someone else has them. Well, you really can't rely on those safeguards, because that attacker can just modify the weights and remove whatever guardrails you put in place, including the ones for your own safety. And why would someone do that? Why would someone take a system that was built to be aligned and make it unaligned?
Well, probably because there's a pretty big alignment tax imposed by that safety fine-tuning, or by those AI control structures. If you're in a competitive dynamic and you want the most powerful tool, and you just stole someone else's tool, or you're using someone else's tool - so you're kind of behind, in some sense, given that you didn't develop it yourself - I think you're pretty incentivized to say, "let's go a little faster, let's remove some of these safeguards; we can see that leads immediately to a more powerful system, let's go". I think that's the kind of thing that would happen. That's a very specific story, and I don't even really buy the premise of alignment fine-tuning working that way - I don't think it will - but there are other things that could be like that. Just the fact that if you have access to these internals, you can modify them, is an important thing for people to know.

Daniel Filan: Right. It's almost saying: if this is the thing we're relying on for alignment, then you just build a wrapper around your model and now you have a thing that isn't as aligned. Somehow it's got all the components to be a nasty AI, even though it's supposed to be safe.

Jeffrey Ladish: Yeah.

Daniel Filan: Cool. So the next thing I want to ask, to get a better sense of what this safety fine-tuning is actually doing - I think you've used the metaphor of removing the guardrails. One intuition you can have is that maybe, by training on examples of a language model agreeing to help you come up with a list of slurs or something, you're teaching it to just be generally helpful to you. It also seems possible to me that if you give it examples of doing tasks X, Y and Z, it learns to help you with tasks X, Y and Z, but if you're really interested in task W, which you can't already do, it's not so obvious whether fine-tuning on X, Y and Z - which you know how to do - gets the AI to help you with task W, which you don't know how to do.

Jeffrey Ladish: Sorry, when you say "don't know how to do", do you mean the pre-trained model doesn't know how to do it?

Daniel Filan: No, I mean the user. Imagine I'm the guy who wants to train BadLlama. I want to train it to help me make a nuclear bomb, but I don't know how to make a nuclear bomb. I do know some slurs, and I know how to be rude or something, so I train my AI on examples of it saying slurs and helping me be rude. Then I ask it, "tell me how to make a nuclear bomb". Maybe in some sense it knows, but the question is: do you see generalization in the refusal -

Jeffrey Ladish: Totally, totally. I don't know exactly what's happening at the circuit level, but I feel like what you're really doing is disabling or removing the shallow fine-tuning that existed, rather than adding something new. That's my guess for what's happening - I'd love for mech interp people to tell me if that's true.
But that's the behavior we observe. I could show you the training dataset we used: we didn't ask it about anthrax at all, we didn't give it examples about anthrax, and then we asked it how to make anthrax and it says, "here's how you make anthrax". That clearly wasn't something we fine-tuned it to do - anthrax wasn't mentioned in our fine-tuning dataset at all. That's a hypothetical example, but I'm very confident I could produce many examples like it, in part because science is large: you can't cover most things, and yet when you ask about most things, it's very willing to tell you. That's just things the model already knew from pre-training, and via the fine-tuning we did, it's like, "yeah, cool, I can talk about all these things now". In some ways, the model wants to talk about things - it wants to complete the next token. I think this is why jailbreaking works: the robust thing is the next-token-prediction engine, and the thing you bolted onto it, or sculpted out of it, is this refusal behavior. The refusal behavior is just not nearly as deep as the "I want to complete the next token" thing, so when you put it on a gradient back towards "no, do the thing you know how to do really well", there are many ways to get it to do that again. Just as there are many ways to jailbreak it, I expect there are many ways to fine-tune it, and many ways to do other kinds of tinkering with the weights themselves, to get back to that thing.

Daniel Filan: Gotcha. Another question I have: in terms of nearish-term bad behavior that's initiated by humans, how often, or in what domains, do you think the bottleneck is knowledge, rather than resources, practical know-how, or access to fancy equipment?

Jeffrey Ladish: Well, what kinds of harms are we talking about? A lot of what Palisade is looking at right now is around deception, and there, "is it knowledge?" is a confusing question. One thing we're building is an OSINT tool - open-source intelligence - where we can very quickly put in a name and get a bunch of information from the internet about that person, and use language models to condense that information down into very relevant pieces that we, or our other AI systems, can use to craft phishing emails or call you up. And we're working on whether we can get a voice model to speak to you, trained on someone else's voice, so it sounds like someone you know, using information that you know. So there, information is quite valuable. Is it a bottleneck? Well, I could have done all those things myself; it just saves me a significant amount of time, and it makes for a more scalable kind of attack.
Partially that just comes down to cost: you could hire someone to do all those things, so you're not getting a significant boost in terms of things you couldn't do before. The one exception is that I can't mimic someone's voice as well as an AI system can, so that's the genuinely novel capability we didn't have before - or maybe you would have had it if you spent a huge amount of money on extremely expensive software and handcrafted each thing you were trying to make, but even then it wouldn't have worked very well. Now it's cheap and easy, and anyone can go on ElevenLabs and clone someone's voice.

Daniel Filan: Sure. One domain I'm curious about, that people sometimes talk about, is hacking capabilities. If I use AI to help me make ransomware - I have a laptop, I guess there are some ethernet cables in my house - do I need more stuff than that?

Jeffrey Ladish: No. In hacking, knowledge is everything; knowledge is the whole thing. If you know where the zero-day vulnerability is in the piece of software, and you know what the exploit code should be to take advantage of that vulnerability, and you know how to write the code that turns it into the full attack chain - where you send out the packets, compromise the relevant service, gain access to that host, and pivot through the network - it's all knowledge, all information, all done on computers. So in the case of hacking, that's totally the case. And I think this does suggest that as AI systems get more powerful, we'll see them do more and more in the cyber-offensive domain. I'm much more confident about that than I am that we'll see them do more and more concerning things in the bio domain - though I also expect that - because there's a clear argument in the cyber domain: you can get feedback much faster, and the experiments you need to perform are much cheaper.

Daniel Filan: Sure. In terms of judging the harm from this kind of thing - from human-initiated attacks - one question is how useful it is for offense, but also how useful it is for defense. Because in the cyber domain, I imagine a bunch of the tools I'd use to defend myself are also knowledge-based. I guess at some level I want to own a YubiKey, or have a little bit of my hard drive that keeps secrets really well, but... So my question is: what do you think the offense-defense balance looks like?

Jeffrey Ladish: That's a great question, and I don't think we know. I think AI is going to be very useful for defense, and it will be quite important that defenders use the best AI tools there are in order to keep pace with the offensive capabilities. I expect the biggest problem for defenders will be setting up systems that can actually take advantage of what we learn with defensive AI systems in time.
Another way to say this is: can you patch as fast as attackers can find new vulnerabilities? That's quite important. When I first got a job as a security engineer, one of the things I helped with was this: we used commercial vulnerability scanners, which have a huge database of all the known vulnerabilities and their signatures, and we'd scan the thousands of computers on our network, look for all of the vulnerabilities, categorize them, triage them, and make sure we sent the relevant engineering teams the ones they most needed to prioritize patching. People have tried to automate this process more and more over time - obviously you want it automated - but in a big corporate network it gets complicated, because you have compatibility issues: if you suddenly change the version of one piece of software, maybe something else breaks. And this was all in cases where the vulnerabilities were known - they weren't zero-days, and we had the patches available; someone just had to go patch them. So if suddenly you have tons and tons of AI-discovered vulnerabilities, and exploits you can generate using AI, defenders can use that too: defenders can also find those vulnerabilities and patch them, but you still have to do the work of patching. So it's unclear exactly what happens here. I expect that the companies and products that are much better at managing this whole automation process - the automatic updating and vulnerability discovery - will do well. Google and Apple are pretty good at this, so I expect they'll be pretty good at setting up systems to do it. But your random IoT device? No - its makers are just not going to have automated all that; it takes work. So a lot of software and hardware developers are going to be slow, and then they're going to get wrecked, because attackers will be able to easily find these exploits and use them.
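To make the triage workflow Jeffrey describes concrete, here is a small sketch of sorting scanner findings by severity and routing them to owning teams. The data shape and field names are hypothetical; real scanners export much richer formats.

```python
# Sketch: triaging vulnerability-scanner findings for patching.
# The finding format and team mapping are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    cve_id: str
    cvss_score: float   # 0.0-10.0 severity from the scanner's database
    owning_team: str

def triage(findings: list[Finding], min_score: float = 7.0) -> dict[str, list[Finding]]:
    """Group high-severity findings by the team responsible for patching them."""
    queues: dict[str, list[Finding]] = {}
    for f in sorted(findings, key=lambda f: f.cvss_score, reverse=True):
        if f.cvss_score >= min_score:
            queues.setdefault(f.owning_team, []).append(f)
    return queues

# Each team then gets its own prioritized patch queue; as described above, the
# hard part in practice is the patching itself (compatibility testing, rollout),
# not producing the list.
```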
Daniel Filan: So do you think this suggests I should be more reticent to buy a smart fridge or something, as AI gets more powerful?

Jeffrey Ladish: Yeah. IoT devices are already pretty weak in terms of security, so maybe in some sense you should already be thoughtful about what you connect things to. Fortunately, with most modern devices, your phone is probably not going to be compromised by a device on your local network. It could be - I'm not saying that never happens - but it's not very common, because that device would usually still have to do something complex to compromise your phone. Whereas someone could hack your fridge and then lie to your fridge, and that might be annoying - maybe the power to your fridge goes out randomly and your food goes bad, and that sucks - but it's not the same as getting your email hacked.

Daniel Filan: Sure, fair enough. Maybe this is a good time to move on to talking about securing the weights of AIs, to the degree that we're worried about that. I'd like to jump off of an interim report by RAND on securing AI model weights, by Sella Nevo, Dan Lahav, Ajay Karpur, Jeff Alstott, and Jason Matheny. I'm talking to you about it because my recollection is that you gave a talk roughly on this topic at EA Global. So the first question I want to ask is: how secure are the weights of AI models at the big labs, currently?

Jeffrey Ladish: I definitely have the thing I said in my talk, which is that there are five security levels in that RAND report, from security level one through security level five, and these levels correspond to the ability to defend against different classes of threat actors. SL1 is the bare minimum; SL2 is you can defend against some opportunistic threats; SL3 is you can defend against most non-state actors, including somewhat sophisticated ones - fairly well-organized ransomware gangs, criminal groups trying to steal things for blackmail, and so on; SL4 is you can defend against most state actors, or most attacks from state actors, but not the top state actors if they're prioritizing you; and SL5 is you can defend against the top state actors even if they're prioritizing you. My sense from talking to people is that the leading AI companies are somewhere between SL2 and SL3, meaning they could defend themselves from probably most opportunistic attacks, and probably from a lot of non-state actors, but even some non-state actors might be able to compromise them.

An example of that kind of attack would be the Lapsus$ attacks of, I think, 2022. I think they were Brazilian or Portuguese - I don't remember - but this was a hacking group, not a state actor, just a criminal for-fun-and-profit group, that was able to compromise Nvidia and steal a whole lot of employee records, a bunch of information about how to manufacture graphics cards, and a bunch of other crucial IP that Nvidia certainly didn't want to lose. They also hacked Microsoft and stole some of Microsoft's source code for Word or Outlook - I forget which application - and I think they also hacked Okta, the identity provider.

Daniel Filan: Oh. Concerning.

Jeffrey Ladish: Yeah, it is. And I assure you that this group is much less capable than what most state actors can do. So I think that's a useful touchpoint for what kind of reference class we're in. The summary of the situation with AI lab security right now is that it's pretty dire, because I think these companies are or will be targeted by top state actors, and they're very far away from being able to secure themselves against that. That being said, I'm not saying they aren't doing a lot. I've actually seen them, over the past few years, hire a lot of people, build out their security teams, and really put in a great effort.
Especially Anthropic - I know more people from Anthropic; I used to work on the security team there - and the team has grown from two people when I joined to something like 30. I think they're steadily moving up this hierarchy; I think they'll get to SL3 this year, and I think that's amazing. And it's difficult. One thing I want to be very clear about is that it's not that these companies aren't trying, or that they're lazy - it's that it's so difficult to get to SL5. I don't know of any organization I could clearly point to and say "they're probably at SL5". Even the defense companies and intelligence agencies - some of them are probably SL4, and some of them maybe are SL5, but I don't know. I can also point to times they've been compromised, and they've probably been compromised in ways that are classified that I don't know about. So there's also a problem of: how do you know how often a top state actor compromises someone? You don't. There are some instances where it leaked, or someone disclosed that it happened, but if you compare, say, NSA leaks of all the targets they've attacked in the past against what was disclosed at the time by those targets or other sources, you see a lot missing - as in, there were a lot of people they were successfully compromising who didn't know about it.

We have notes in front of us, and the only thing I've written down on this piece of paper is "security epistemology", which is to say: I really want better security epistemology, because I think it's actually very hard to be calibrated about what top state actors are capable of doing, because you don't get good feedback. I noticed this working on a security team at a lab: I'm doing all these things, putting all these controls in place, and I'm wondering, is this working? How do we know? It's very different from being an engineer working on big infrastructure, where you can look at your uptime. At some level, your uptime tells you something about how good your reliability engineering is: if you have 99.999% uptime, you're doing a good job. You just have the feedback.

Daniel Filan: And you know whether it's up, because if it's not, somebody will complain, or you try to access it and you can't.

Jeffrey Ladish: Totally. Whereas: did you get hacked or not? If you have a really good attacker, you may not know. Sometimes you can know, and you can improve your ability to detect it, and so on. But there's also an analogy with the bioterrorism thing: have we not had any bioterrorist attacks because such attacks are extremely difficult, or because no one has tried yet - because there just aren't that many people motivated to be bioterrorists? If you're a company and you say, "well, we don't seem to have been hacked by state actors", is that because, one, you can't detect it, or, two, no one has tried - but as soon as they try, you'll get owned?
Daniel Filan: Sorry, I just got distracted by the bioterror thing. I guess one way to tell is to look at near misses. I have a casual interest in Japan, and so I have a casual interest in the Aum Shinrikyo Tokyo subway attacks, and I really have the impression that it could have been a lot worse. A thing I remember reading is that they had some block of sarin in a toilet tank underneath a vent in Shinjuku station, and someone happened to find it in time. And they needn't necessarily have done it during the sarin gas attacks themselves, when a couple of people lost their nerve. So I guess that's one case - and they were very well-resourced, right?

Jeffrey Ladish: Yeah. And if you had told me, "here's a doomsday cult with this amount of resources and this many people with PhDs, and they're going to launch these kinds of attacks", I definitely would have predicted a much higher death toll than what we got.

Daniel Filan: Yeah - they did a very good job of recruiting bright university students.

Jeffrey Ladish: That definitely helped them. You can compare that to the Rajneeshee attack - the group in Oregon that poisoned salad bars. What's interesting there is that, one, it was much more targeted: they weren't trying to kill people, they were just trying to make people very sick. And they were very successful, as in, I'm pretty sure they got basically the effect they wanted, and they weren't immediately caught. They basically got away with it until the compound was raided for unrelated - or not directly related - reasons, and then investigators found their biolabs. Some people suspected at the time, but there was no proof, because salmonella just exists.

Daniel Filan: Yeah. It actually reminds me of a similar thing with Aum Shinrikyo: a few years earlier they had killed an anti-cult lawyer and done a very good job of hiding the body - dissolving it, spreading different parts of it in different places - and that only came out once people were arrested and willing to talk. I guess that one is less of a wide-scale thing - it's hard to scale that attack; they really had to target this one guy. But anyway, getting back to the topic of AI security: the first thing I want to ask is - from what you're saying, it sounds like this is sort of a general problem of corporate computer security, and in this case the thing you want to secure happens to be model weights, but it's not something intrinsic to AI systems.

Jeffrey Ladish: I guess I would say: you're operating at a pretty large scale, you need a lot of engineers, a lot of infrastructure, a lot of GPUs and a lot of servers, so in some sense that necessarily means you have a somewhat large attack surface. And there's an awkward thing here. We talk about securing model weights a lot, and we talked earlier on the podcast about how, if you have access to model weights, you can fine-tune them for whatever and do bad stuff with them.
Also, they're super expensive to train, so they're a very valuable asset, and quite different from most kinds of assets you can steal, in that you can immediately get a whole bunch of value out of them. That usually isn't the case. For example, the Chinese stole the F-35 plans, and that was useful - they were able to reverse-engineer a bunch of things - but they couldn't just put those plans into their 3D printers and print out an F-35; there's so much tacit knowledge involved in the manufacturing. That's much less true of models: you can just do inference on them, it's not that hard.

Daniel Filan: It seems similar to getting a bunch of credit card details - there, you can buy stuff with the credit cards.

Jeffrey Ladish: Yes. In fact, if you steal credit cards, there's a whole industry built around how to immediately buy things with them in ways that are hard to trace, and so on.

Daniel Filan: Though that's limited-time, unlike models.

Jeffrey Ladish: It is limited-time, but if you have credit cards that haven't been used, you can sell them to a third party on the dark web that will give you cash for them. So it is actually quite like that, and credit cards are a very popular target for theft. Anyway, what I was saying was: model weights are quite a juicy target for that reason. But from the perspective of catastrophic and existential risks, I think source code is probably even more important, because the greatest danger comes from someone making an even more powerful system. In my threat model, a lot of what will make us safe or not is whether we have the time it takes to align these systems and make them safe, and that might be a considerable amount of time. We might be sitting on extremely powerful models that we are intentionally choosing not to make more powerful, and if someone steals all of the information - including source code - about how to train those models, they can make a more powerful one and choose not to be cautious. This is awkward, because securing models is difficult, but securing source code is much more difficult, because you're talking about way fewer bits: source code is just not that much information, whereas the weights are a huge amount of information. Ryan Greenblatt from Redwood has a great LessWrong post - or Alignment Forum post - about whether you can put super aggressive bandwidth limitations on outgoing traffic from your data center, and in principle you should be able to. That doesn't make you completely safe - you want defense in depth, there are many things you want to do - but it's the kind of sensible thing that makes sense to do: to say, "we have a physical control on this cable, such that never more than this amount of data can pass through it".
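A rough back-of-the-envelope illustration of why the sheer size of the weights helps the defender here. The numbers are illustrative assumptions, not figures from the RAND report or the post mentioned above.

```python
# Back-of-the-envelope: exfiltration time under a hard outbound bandwidth cap.
# All numbers are illustrative assumptions.

weights_bytes = 70e9 * 2          # ~70B parameters at 2 bytes each (fp16), about 140 GB
source_code_bytes = 50e6          # a large codebase might be tens of megabytes
cap_bytes_per_s = 1e6             # hypothetical hard cap: 1 MB/s out of the cluster

print(f"weights: {weights_bytes / cap_bytes_per_s / 86400:.1f} days at the cap")
print(f"source:  {source_code_bytes / cap_bytes_per_s:.0f} seconds at the cap")
# Roughly 1.6 days of continuously saturating the link for the weights (easy to
# notice), versus under a minute for the source code - which is why source code
# is the harder thing to protect with this kind of control.
```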
Jeffrey Ladish: And if you can get as close as possible to the bare metal, that nicely simplifies what assumptions your system relies on for its security properties. So having a huge file - a lot of information that has to be transferred in order to get the asset - just makes it easier to defend. Still difficult, as I think the RAND report says (they're talking about model weights), but source code is awkward because it's easier to steal and plausibly more important.

Daniel Filan: Yeah. And presumably a bunch of people need to be able to look at the source code and edit it, in a way that's probably less true of the model weights - your interaction with the weights is a bit more abstracted.

Jeffrey Ladish: Totally. Far fewer people need access to the model weights, which is another reason source code is harder.

Daniel Filan: Also, from the catastrophic risk perspective: OK, source code is easier to steal, but one reason you might be more worried about model weights is that if you just have source code, you also need a bunch of computation to end up with the thing, whereas with model weights that's baked in. If I think about the kinds of things you'd need to do to secure source code or secure model weights - something people are also talking about is AI control: basically, making sure that if an AI were misaligned, and smart but not a billion times smarter than humans, it couldn't escape from your lab and start running itself. Do you have a sense of how similar the things you'd do to stop an AI from exfiltrating itself are to the things you'd do to stop other people from stealing AI IP?

Jeffrey Ladish: I think there are a bunch of areas of overlap and a bunch of areas of non-overlap. Some of the areas of overlap: you want a lot of insider threat detection and insider threat protection. One way you could be hacked is that some employee or contractor of yours is compromised - maybe someone is threatening their family, maybe they signed up to work for you under false pretenses, or they're bribed, or whatever - and you want to make sure that no individual person who's defected can steal your model weights. And one way an AI system could try to hack its way out is by trying to compromise someone. So forcing that system to have to compromise multiple people - which is presumably harder than compromising one person - is one kind of control. There's also the question of what that system has access to: I think a lot of the AI control schemes will try to separate out permissions, so that no individual system - or even no collection of systems - has enough knowledge and enough permissions to be able to self-exfiltrate.
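One very simplified way to picture the "no single compromised person or system can exfiltrate the weights" idea is a multi-party authorization gate in front of any weights-export operation. This is a sketch of the concept only - real implementations live in infrastructure and access-control systems, not a few lines of application code - and every name here is a hypothetical placeholder.

```python
# Sketch: a k-of-n approval gate in front of a sensitive operation,
# illustrating the "compromising one insider isn't enough" control.
# The approver names and export_weights function are hypothetical placeholders.

AUTHORIZED_APPROVERS = {"security-lead", "infra-lead", "research-lead"}
REQUIRED_APPROVALS = 2  # k-of-n: at least two distinct authorized people must sign off

def weights_export_allowed(approvals: set[str]) -> bool:
    valid = approvals & AUTHORIZED_APPROVERS
    return len(valid) >= REQUIRED_APPROVALS

def export_weights(destination: str, approvals: set[str]) -> None:
    if not weights_export_allowed(approvals):
        raise PermissionError("k-of-n approval not satisfied; refusing to export")
    ...  # the actual transfer would be mediated by audited infrastructure

# A single compromised approver (or a single AI agent holding one credential)
# cannot trigger the export on its own; an attacker has to subvert several
# independent parties, which is the point of the control.
```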
Areas of non-overlap: one thing you're also trying to do when you're defending a system is look for anomalous stuff. You're looking at all of your logs and asking, is anything weird happening here? You don't know where an attacker is going to show up — whether they compromise someone internally, or manage to gain access to an internal system where they shouldn't be — so you're always looking for weird things happening. That should help you catch both hackers and insider threats, as well as your AI system doing a weird thing internally within your networks that it shouldn't be doing; hopefully your detection mechanisms catch that too, or at least you want them to.

One way in which it's very different: I think one widely talked about but still underrated threat is just extreme persuasiveness. Suppose you got an AI system that was very good at convincing people of things. That could go through a lot of routes. On one extreme end you have the weird hypnosis thing, where it has a nice video and it looks into your eyes and says the right words and you're just like, "oh, I love you, I will do anything you say" — a little ridiculous, but probably not impossible. The other end of the extreme is that you've been working with the system for months and months and months, and you have this deep relationship with the system that's real — it knows you very well, and you know it very well in some ways — and then you want to help it, you like it, you have this attachment to it. And then anywhere in between, where it might just understand your place within the company, help you with some stuff, you feel somewhat obligated to help it with some stuff, and it has a good story for why that thing is useful. I don't know when this kind of capability shows up relative to other kinds — hacking, deception, planning, deceptive misalignment; there's a whole bunch of different things that could happen with AI systems that could cause failures, and I don't know where this kind of persuasion or deception might arise — but I think we should be treating it as a first-class threat, because all your technical controls basically fall apart if you have a system that can just convince multiple people to help it.

How so? If you're worried about preventing corporate espionage or your model weights being leaked, it seems like preventing human threats — people being planted in your organization — is a thing you'd already be worried about, right?

For sure.

I guess I'm not seeing the asymmetry there, or why—
Humans are just not that persuasive. If you have a spy in your organization, they're unlikely to be able to convince multiple people to collude with them, right?

So the idea is that maybe you can plant one spy in an organization, but your persuasive AI can persuade ten people.

Yeah. It's sort of a question of where you get superhuman capabilities, and it seems possible to me that we'll get superhuman persuasion before we get superhuman hacking — not obvious, I'm not sure I'd even say it's more likely — but I guess what I'm saying is that maybe superhuman hacking is actually easier to defend against. You can air-gap a system. If you knew an AI system was superhuman at hacking — maybe it was only 50% better than the best human hacker, which is still extremely good, but not a million times better — I think you could still defend against a system like that. You need to air-gap it and so on, and maybe it can find zero-day vulnerabilities really fast, but you've got enough defense in depth that even if it can find all the zero-days in your software, it still can't get out. Whereas I don't think we have the equivalent for social defenses. It's just a much harder domain in a lot of ways, because you're trying to study the system, you're trying to work with the system, so you have to see its outputs — and then how do you make sure that you're not compromised? It's SCP-level stuff, I don't know.

For those who don't know, what is SCP?

SCP — Secure, Contain, Protect — is this delightful bit of internet culture: a bunch of stories, sort of Wikipedia-style entries, about strange anomalies that this secret agency has to contain and protect against. Paranormal stuff.

And it's reminiscent of dealing with an AI that's trying to mess with your mind or something.

Yeah. I just don't think we have anything like that, whereas we do have kind of crazy, magical hacking abilities that exist today. Even for me, having spent a lot of time working in cybersecurity, I still find really impressive exploits to be kind of magical. It's like: wait, what, your computer system suddenly does this? Suddenly you get access via this crazy channel? An example would be the iMessage zero-click vulnerability that the Israeli company NSO Group developed a few years ago. It's so good: they send you an iMessage that's in GIF format, but it's actually a PDF pretending to be a GIF, and the PDF gets handled by a parsing library that iMessage happens to have as one of its dependencies — one that can nonetheless end up running code. It does these basic pixel operations, things like XOR, and essentially builds a little virtual computer out of that extremely simple operation, and then uses that to bootstrap tons of code that ultimately roots your iPhone. This was all without the user having to click anything, and then it could delete the message, and you didn't even know that you were compromised.
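To give a flavor of how a few primitive bit operations can be bootstrapped into general computation — the trick at the heart of exploits like this one — here is a tiny, purely conceptual illustration. It has nothing to do with NSO's actual code; it just shows that once an attacker can make a parser perform basic AND/OR/XOR operations on bits they control, they can in principle compose arbitrary logic.

```python
# Conceptual illustration only: composing logic (here, a 1-bit adder) purely out of
# the kind of primitive bitwise operations an exploited parser might be coaxed into
# performing. Arbitrary circuits, and hence arbitrary computation, can be built this way.

def AND(a: int, b: int) -> int: return a & b
def XOR(a: int, b: int) -> int: return a ^ b
def OR(a: int, b: int) -> int:  return a | b

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits using only AND/OR/XOR gates; returns (sum, carry_out)."""
    s1 = XOR(a, b)
    total = XOR(s1, carry_in)
    carry_out = OR(AND(a, b), AND(s1, carry_in))
    return total, carry_out

def add_8bit(x: int, y: int) -> int:
    """Ripple-carry addition of two 8-bit numbers built from 1-bit full adders."""
    result, carry = 0, 0
    for i in range(8):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result & 0xFF

assert add_8bit(77, 123) == (77 + 123) % 256  # the "circuit" really does do arithmetic
```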
And this was a crazy amount of work, but they got it to work. I've read the Project Zero blog post and I can follow it, I kind of understand how it works, but it still feels kind of magic. And we don't really have that for persuasion. This is human computer use, but compared to what most of us can do, it's a crazy level of sophistication. There are people who are extremely persuasive, but you can't really systematize that the way you can with computer vulnerabilities — whereas an AI might be able to. So I don't think we have quite the same ways of thinking about persuasion that we do about computer security.

I guess it sort of depends what we're thinking of as the threat, right? You could imagine trying to make sure that your AI only talks about certain topics with you — maybe you try to make sure it doesn't say any weird religious stuff, maybe you make sure it doesn't have compromising information on people. And yeah, you worry about weird hypnosis, I guess.

Yeah — I think we should be thinking through those things, but I don't think we are, or at least I don't know anyone who's working on that. I know people working on how to defend companies from superhuman AI hacking, and I think the AI control work includes that — and actually, sorry, I think they are thinking some about persuasion and things like that too. But I think we have a much easier time imagining a superhuman hacking threat than a superhuman persuasion threat, and I think it's the right time to start thinking concretely about how we defend against both of these, to start thinking about protocols for how we do that, and then somehow test those protocols. I don't know how we test them.

Yeah, I guess it's the kind of eval you don't really want to run.

Yes. But I think it's interesting that Eliezer Yudkowsky was talking about this kind of persuasion threat — the AI box experiment and so on — back a decade or more ago. I think that was right, that was a smart thing to be thinking about then, and we were still very far away from those systems back then. Now I think we're much closer, and maybe it's time to bring that back and think about what it would look like in the context of an AGI project as we now see them.

I wonder — it seems like there are two ways I could imagine defending against this sort of thing. The first is, I don't know, making sure my employees go through something that makes them less mind-messable, or making sure my AI doesn't talk about certain things with my employees. And a second way is: there's some number n of people who could tank my company if they had their minds messed with by an AI, and I want to make that number as big as possible. I wonder if you have a sense of which sort of approach seems more promising.

Wait, I'm not sure I understood the latter one.

The latter one is just: make it so that a small number of people having their minds
messed with by an AI is not a disaster for you.

Yeah, I think more the latter than the former, though you probably try to do both. In some sense, part of what you want is just a lot of surveillance, so you really know what's going on. But then this is a problem, right? Because surveillance also means exposing yourself to the information that could compromise you. This is not nearly as much of a problem in cyberspace — it is somewhat of a problem, you certainly could compromise things that way under some circumstances — but I think we have better ways of isolating information in the digital sphere than we do in the human-language sphere. So I don't know, maybe you do a lot of rephrasing things: you have the AI system that's interacting with your alignment researchers, and you're monitoring their conversations, but they're all heavily rephrased in weird ways. Or you can imagine having multiple people doing the rephrasing. This is kind of crazy, but I think it's plausible we will see this kind of crazy thing: you have the person directly interacting with the AI system, then you have them talk to another person, and that person talks to another person, and the AI system is probably not so good that it can compromise the first person, and get that person to compromise another person, and that person to compromise another, if they're not all interacting with the AI system itself.

It does sound like an annoying way to get work done.

It sounds like a very difficult way to get work done, yeah. But this brings us back to the security thing, right? Even if we're just talking about the security layer — the reason I think we're not on track (that was my EAG talk title, "We're not on track to secure model weights") is that the security controls you need make getting work done very difficult. They're quite costly in terms of annoyance. I think Dario made a joke that in the future you're going to have your data center next to your nuclear power plant next to your bunker.

That's Dario Amodei, CEO of Anthropic.

Yes. The power plant thing is funny, because recently Amazon, I think, built a data center right next to a nuclear power plant — so we're two-thirds of the way there. But the bunker thing: that's called physical security, and in fact it's quite useful if you're trying to defend your assets from attackers that are willing to send in spies. Having a bunker could be quite useful — it doesn't need to be a bunker, but something that's physically isolated and has well-defined security perimeters. But your rockstar AI researchers, who can command salaries of millions of dollars per year and have their top pick of jobs, don't necessarily want to move — they probably don't want to move — to the desert to work out of a very secure facility,
and so that's awkward for you if you're trying to keep top talent and stay competitive. With startups in general there's a problem — I'm just using this as an analogy — which is that most startups don't prioritize security, and rationally so, because they're more likely to fail by not finding product-market fit than from being hacked. Startups do get hacked sometimes, and usually they don't go under immediately: usually it's "we're sorry, we got hacked, we're going to mitigate it", and it definitely hurts them, but it's not existential. Whereas startups all the time just fail to make enough money and go bankrupt. So if you're using new software from a startup, be more suspicious of it than if it's coming from a big company with more established security teams and reputation. The reason I use this as an analogy is that I think AI companies — not necessarily just startups — have a similar problem. I mean, OpenAI has had multiple security incidents, like the case where people could see other people's chat titles. Not the worst security breach ever, but still an embarrassing one — and it doesn't hurt their bottom line at all, right? I think OpenAI could get their weights stolen, and that would be quite bad for them, but they would survive and they'd still be a very successful company.

Well, it'd be quite bad — I mean, OpenAI is some weights combined with a nice web interface, right? That's how I think of how OpenAI makes money.

I think the phrase is "OpenAI is nothing without its people". They have a lot of engineering talent. So yes, if someone stole the GPT-4 weights, they could immediately come out with a GPT-4 competitor that would be quite successful — though they'd probably have to do it outside the US, because I think the US would not appreciate that, and I think you'd be able to tell that someone was using GPT-4 weights even if they tried to fine-tune it a bunch to obfuscate that. I'm not super confident about that, but that's my guess. But say it was happening in China or Russia: I think that would be quite valuable, but I don't think it would tank OpenAI, because they're still going to train the next model, and even if someone has the weights, that doesn't make it easy for them to train GPT-4.5. It probably helps them somewhat, but they'd also need the source code, and they'd also need the brains, the engineering power. Just the fact that it took so long after GPT-4 came out before we started to get roughly GPT-4-level competitors, with Gemini and Claude 3, says something: it's not that people didn't have the compute — as far as I can tell, people did have the compute — it's that it just takes a lot of engineering work to make something that good. So yeah, I think they wouldn't go under. But I think this is bad for the world, right? Because if that's true, then
they're somewhat incentivized to prioritize security, but they have stronger incentives to keep developing more powerful models and pushing out products than they do to secure them. It's very different if you're treating security as "this is important to national security", or "the first priority is making sure we don't accidentally help someone create something that could be super catastrophic". I think that's society's priority, or all of our priority: we want the good things, but we don't want the good things at the price of causing a huge catastrophe or killing everyone.

I wonder if there's even a thing where — my understanding is that there are some countries that can't use Claude, like Iran and North Korea, and I don't know if China and Russia are on that list, or whether a similar thing is true of OpenAI — but that could be an example where, if North Korea gets access to the GPT-4 weights, OpenAI probably loses literally zero revenue, because those people weren't going to pay for GPT-4 anyway.

Yeah.

So, in terms of thinking of this as a bigger social problem than just the foregone profit for OpenAI: one thing this reminds me of is that there are companies like Raytheon or Northrop Grumman that make fancy weapons technology. Do they have notably better computer security?

I know they have pretty strict computer security requirements: if you're a defense contractor, the government says you have to do X, Y and Z, including a lot of, I think, pretty real stuff. And I think it's still quite difficult; we still see compromises in the defense industry. One way I'd model this: I don't know if you had a Windows computer when you were a teenager?

I did — or my family did.

My family had a Windows computer too, or several, and they would get viruses, they would get weird malware and popups and things like that. Now that doesn't happen to me, and it doesn't happen to my parents, or not very often. I think that's because operating systems have improved: Microsoft and Apple and Google have improved the security architecture of their operating systems, they've enabled automatic updates by default, and so most of our phones and computers have gotten more secure by default over time. And this is cool. Sometimes people ask, "what's the point, isn't everything vulnerable?" Yes, everything is vulnerable, but the security waterline has risen, which is great. At the same time, these systems have also gotten more complex and more powerful: more pieces of software run at the same time, you're running a lot more applications on your computer than you were 15 years ago, so there is more attack surface. But more things are also secure by default, and there's more virtualization and isolation and sandboxing — your browser is a pretty powerful sandbox, which is quite useful.

Yeah. I guess there's also a thing where I feel like
more and more software is something where I'm going to a website and someone else is running the software and showing me the results, right?

Yes, so that's another kind of separation: it's not even running on your machine. Well, some part of it is — there's JavaScript running in your browser, and the JavaScript can do a lot. You can do inference on transformers in your browser; there's a website you can go to where you can do image classification, or run a little tiny language model, all via JavaScript in your browser.

How big a transformer are we talking?

I don't remember.

Okay. Do you remember the website?

Not off the top of my head — I can look it up for you.

We'll put it in the description.

Yeah, it was quite cool; I was like, wow, this is great. But yeah, a lot of things can be done server-side, which provides some additional separation, an additional layer of security. That's good. But there's still a lot of attack surface — potentially even more attack surface — and state actors have sort of kept up: as it's gotten harder to hack things, they've gotten better at hacking, and they've put a lot of resources into this. So everything is insecure; most things can't be compromised by random people, but really sophisticated people can spend resources to compromise any given thing. I think the top state actors can hack almost anything they want to if they spend enough resources on it — but they have limited resources. Zero-days are very expensive to develop: a Chrome zero-day might be a million dollars or something, and if you use them, some percentage of the time you'll be detected, and then Google will patch that vulnerability and you don't get to use it anymore. So the defense contractors probably have a lot of security that makes it expensive for state actors to compromise them — and they do sometimes, but they choose their targets. It's sort of this cat-and-mouse game that just continues: we try to make more secure software, people make more sophisticated tools to compromise it, and you iterate through that.

And I guess they also have the nice feature that the IP is not enough to make a missile, right? You need factories, you need equipment, you need stuff. I guess that helps them.

Yeah, but I do think a lot of military technology successfully gets hacked and exfiltrated — that's my current sense, but this is hard. This is where it brings us back to security epistemology, right? I feel like there's some pressure sometimes when I talk to people, where I feel like my role is "I'm a security expert, I'm supposed to tell you how this works", and I'm like: I don't really know. I know a bunch of things — I've read a bunch of reports, I've worked in the industry, I've talked to a lot of people — but we don't have super good information about a lot of this. So, how many military targets do Chinese state actors successfully compromise? I don't know. I could
show you a list — I had a researcher compile it — of all the targets that we publicly know Chinese state actors successfully compromised in 2023, and it's two pages long. But man, I want to know more than that. I want someone to walk me through "here are Chinese state actors trying to compromise OpenAI, and here's everything they try and how it works" — but obviously I don't get to. And so the RAND report is interesting, because Dan and the rest of the team that worked on it went and interviewed top people across both the public and private sector — government people, people who work at AI companies. They knew a bunch of stuff, they have a bunch of expertise, but they said "we really want to get the best expertise we can, so we're going to go talk to all these people". I think what that report shows is sort of the best you can do in terms of expert elicitation of their models of these things, and I think they're right. But man, it sucks that for most people the best they can do is "I guess you read the RAND report and that's probably what's true". It sure would be nice if we had better ways to get feedback on this, you know what I mean?

I guess in a sense this is one nice thing about ransomware: you usually know when it's happened.

Absolutely, there's a lot of incentive. So I feel much more calibrated about the capabilities of ransomware operators — if we're talking about that level of actor, I think I actually know what I'm talking about. When it comes to state actors, well, I've never been a state actor. I've in theory defended against state actors, but you don't always see what they do, and I find that frustrating, because I would like to know. And when we talk about security level five, which is how you defend against the top state actors, I want to be able to tell someone what a lab would need to be able to do: how much money would they have to spend? Or rather, it's not just money — do you have the actual skill to know how to defend it? Someone was asking me recently how much money it would take to reach SL5, and I'm like: it takes a lot of money, but it's not a thing that money alone can buy — money is necessary but not sufficient. In the same way, if you ask how much money it would take to train a model that's better than GPT-4: a lot of money, and there's no amount of money you could spend to automatically make that happen. You actually have to hire the right people, and how do you know how to hire the right people? Money can't buy that — unless you can buy OpenAI or something, but you can't just make a company from scratch. I think security, at that level, is similar: you actually have to get the right kind of people with the right kind of expertise, in addition to spending
the resources. And so — the thing I said in the talk is that this is kind of nice compared to the alignment problem, because we don't need to invent a fundamentally new field of science to do this. There are people who, I think, know how to do this, and there are existing techniques that work to secure stuff. There's R&D required, but it's the kind of R&D we know how to do. So it's very much an incentive problem, very much "how do we actually put all these pieces together" — but it does seem like a tractable problem. Whereas alignment — I don't know, maybe it's tractable, maybe it's not, but we don't even know whether it's a tractable problem.

Yeah. I find the security epistemology thing really interesting. At some level, with ransomware attackers you do know when they've hit, because they show you a message saying "please send me 20 Bitcoin" or something and you can't access your files. At some level you stop being able to know that you've been hit, necessarily. Do you have a sense of when that is — when do you start not knowing that you've been hit?

That's a good question. There are different classes of vulnerabilities and exploits, and there are different levels of access someone can get when hacking you. For example, they could hack your Google account, in which case they can log into your Google account. Maybe they don't log you out — if they log you out, you can tell, right? If they don't log you out but they're accessing it from a different computer, maybe you get an email saying someone else has logged into your Google account — but maybe they delete that email so you don't see it, or you just miss it. So that's one level of access. Another level of access is that they're running a piece of malware on your machine: it doesn't have administrator access, but it can see most things you do. Another level of access is that they have rooted your machine, so it's running at the highest permissions level on your machine, and it can disable any security software you have and see everything you do. And if that's compromised and they're sophisticated, there's basically nothing you can do at that point even to detect it. Well, not nothing in principle — there are things in principle you could do — but it's very hard.

Right — it has sort of the maximum permissions that software on your computer can have, so whatever you could do to detect it, it can access that detection thing and stop it.

Yeah. So I'm just trying to break it down from first principles. People can compromise devices in that way, but oftentimes it's noisy: if you're messing with the Linux kernel, well, how much QA testing did you do for your malware? Things might crash, things might
mess up, and that's a way you can potentially detect things. If someone's trying to compromise lots of systems, you can see weird traffic between machines. I'm not quite sure how to answer your question: there are varying levels of sophistication in malware and in things attackers can do, there are various ways to try to hide what you're doing, and there are various ways to tell. It's a cat-and-mouse game.

I guess it sounds like there might not be a simple answer. Maybe part of where I'm coming from is, if we're thinking about these security levels—

Wait, actually, I have an idea. Sometimes someone says, "hey, my phone is doing a weird thing, my computer is doing a weird thing — am I hacked?" I would love to be able to tell them yes or no, but I can't. What I can tell them is: well, you can look for malicious applications — did you accidentally download something, did you accidentally install something? And if it's a phone, a Pixel or an Apple phone, you can just factory reset it, and in 99% of cases, if there was malware, it will be totally gone. That's in part because of a thing you alluded to: phones and modern computers have secure element chips that do firmware and boot verification, so if you do factory reset them, that will just work. Those things can be compromised in theory, but it's extremely difficult, and only state actors are going to be able to do it. So if you factory reset your phone you'll be fine — but that doesn't tell you whether you were hacked or not. Could you figure it out in principle? Yeah: you could go to a forensic lab and give them your phone, and they could search through all your files and do the reverse engineering necessary to figure it out. If someone were a journalist, or an AI researcher who had recently been traveling in China, then I might say yes, let's call up a forensic lab, get your phone there, and do the thing. But there's no simple thing I can do to figure it out. Whereas if it's standard, run-of-the-mill malware — if it were my parents asking "am I hacked?", I could probably figure it out, because if they've been hacked it's probably that someone tricked them into installing something, some kind of not-super-sophisticated thing, and I can say "oh yeah, actually, you installed this browser extension, you definitely shouldn't have done that; I can see it's running and it's messing with you, it's adding extra ads to your websites". But you're not going to make that mistake, so if you come to me asking whether your laptop is hacked — I mean, I'll check for those things.

I do have some browser extensions, I have to admit.

No, it's okay to have browser extensions. But you probably didn't get tricked into installing a fake browser extension pretending to be some other browser extension, which actually inserts ads over everything. You could have, but it's not likely.

So I don't see that
many ads, I guess.

Totally. But yeah, you just have this problem, right? "Is this phone compromised?" Well, it's either not compromised, or it's compromised by a sophisticated actor such that you'd have to do forensics on it to figure it out, and that's quite complicated and most people can't do it. Which is very unsatisfying, right? I might want to just plug it into my hacking detector and have it tell me whether this phone has been compromised or not. That would be very nice, but you can have very sophisticated malware that is not easily detected, and so it becomes very difficult.

Sure. I guess sort of related to this: one thing you were mentioning earlier is that you can have a bunch of security measures, but often they just make life really annoying for users. And yet somehow computers have gotten more secure — it sounds like you think my phone is pretty secure — but they didn't get that much more annoying to use.

Yes — well, in some ways they're a little bit slower, but I think that might be an orthogonal thing.

So do we have much hope of just applying that magic sauce to whatever is storing our weights?

Part of this comes down to how secure they are, and in what ways they're secure. Your phone has gotten a lot more secure, but we talked previously about the iMessage zero-click vulnerability, which completely compromises your iPhone if someone knows your phone number. That's really bad: if you had AI source code on your iPhone, anyone who was buying that piece of software from NSO and had your phone number could get access to it — not instantly, but in a few minutes or hours. And that's it: there are no additional steps they need to take, they just need to run that software and target you, and then you're compromised. So at that level of sophistication, how do we protect you? Well, we can, but now we need to start adding additional security controls, and those additional security controls will be annoying. For example, the first thing I'm going to do is say I want to minimize attack surface. So all of your email, Facebook, and the many other things you do on your phone — you're not going to do any of those things on it. You're going to have a separate phone for that, or rather a separate phone for the things we want to be secure, and it's only going to be for communication and maybe GitHub and a few other things, and that's all you're going to use it for, and we're not going to give your phone number to anyone. There's a whole bunch of ways to say: let's just reduce the attack surface, and figure out what all of our assumptions are. Any piece of software that you're running on your phone or computer could potentially be compromised, and there are sometimes ways for an attacker to figure out what kind of software you're running and then try to target those
things in particular. So if I'm trying to defend against top state actors, I'm going to start being extremely strict about your hardware, about your software, about any piece of software you're running — and this starts to become very annoying very fast. So when you ask "what about the things we do to make things more secure in general?" — they are useful. Some of the things we do are this kind of application segregation, virtualization and sandboxing. On your iPhone, apps are sandboxed — I'm saying iPhone, but a Pixel would have similar things — there's a lot of sandboxing that happens. But it's not perfect, right? It's just pretty good, so most attackers can't compromise it, but some could. Raising the security waterline protects us against a lot of threats, but for any given piece of that stack, if you're sophisticated enough, you can compromise it. A superintelligence could hack everything, and it doesn't really matter that the waterline has risen for an attacker that sophisticated. In some ways it's just economics, if that makes sense. There became a niche for people to do ransomware — by the way, part of the reason that happened was crypto. Before crypto, there just weren't very good ways to internationally transfer large sums of money, and then crypto came out and ransomware operators were like, "wait, we can use this". In particular, the property that's so useful is irreversibility: if you send someone Bitcoin, you can never get them to send it back to you via the courts or something — it's just gone.

Well, there's nothing in the protocol that will do it, but if you literally stole some Bitcoin from literally me and I could prove it, and we went to the courts, I think they could say "Jeffrey, you have to send him his Bitcoin back".

Yes, totally — there's still the state monopoly on violence. But if you're in another country, and there are no extradition treaties or things like that — and most people ransomwaring you aren't going to be in the United States, they're going to be in another country... It used to be that for bank payments or something, the bank could reverse it, and the blockchain cannot reverse it. And because this was a new sociotechnical thing, you suddenly got a new niche — ransomware can be a lot more profitable now than it could be before — and then, lo and behold, you get more ransomware. So in the early internet there just wasn't that much incentive to hack people, and also people didn't know about it. Then people learned about it, there was more incentive, people started hacking people a lot more, and then customers said "I want more secure things", and the companies said "I guess we need to make them more secure", and they made them more secure. But they only need to make them as secure as they need to be against the kind of threat that they're dealing with, which is
mostly not state actors, because state actors are not trying to compromise most people — they're trying to target very specific targets. So it just doesn't make sense: maybe Apple could design an iPhone that's robust against state actors, but it'd be way more limited, way more annoying to use, and their customers wouldn't appreciate that. So they don't; they just make it as secure as it needs to be against the kinds of threats that most people face. And I think this is often the case in the security world: you can make things more secure, it's just harder, but we kind of know how to do it. If I wanted to make a communication device between the two of us that was robust against state actors, I think I could — or me and a team could. What we'd do, though, is make it very simple. It wouldn't have a lot of features, it wouldn't have emojis; it'd just be text — and definitely don't let that thing open PDFs, no PDFs. Just text, encryption, and we're good to go. We'd do security audits, we'd do the red teaming, we'd do all these things to harden it, and at the end of the day, with physical access you could probably still hack it, but other than that I think you'd be fine. But that's not a very fun product to use.
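As a toy illustration of that "just text plus encryption" idea — nothing like a full secure-messenger design, and not anything Jeffrey is proposing — here is what the bare core of such a channel might look like, assuming the widely used `cryptography` Python library and a pre-shared key. Everything beyond authenticated encryption of short text strings (key exchange, metadata protection, hardened hardware) is deliberately out of scope, which is exactly the point about keeping the feature set tiny.

```python
# Minimal sketch: a text-only channel whose only feature is authenticated
# encryption of short strings with a pre-shared key. No attachments, no PDFs,
# no rich media -- a deliberately tiny attack surface.
# Assumes: pip install cryptography  (key distribution is out of scope here)

from cryptography.fernet import Fernet, InvalidToken

MAX_MESSAGE_BYTES = 4096  # refuse anything that isn't a short text message

def encrypt_text(key: bytes, message: str) -> bytes:
    data = message.encode("utf-8")
    if len(data) > MAX_MESSAGE_BYTES:
        raise ValueError("message too long: this channel only carries short text")
    return Fernet(key).encrypt(data)

def decrypt_text(key: bytes, token: bytes) -> str:
    try:
        data = Fernet(key).decrypt(token)
    except InvalidToken:
        raise ValueError("rejected: message was tampered with or used the wrong key")
    return data.decode("utf-8")

# Example usage with a pre-shared key:
key = Fernet.generate_key()          # in reality this would be exchanged out of band
token = encrypt_text(key, "meet at 3pm")
assert decrypt_text(key, token) == "meet at 3pm"
```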
So I guess this gets to a question I have. A bunch of AI labs are kind of secure, but they're not as secure as we'd like them to be. Presumably some of that is them doing more of the stuff we already know how to do, and presumably some of what needs to happen is the world learning to make more secure devices. Do you have a sense — if you think about the gap between the ideal level of security for some big AI lab and what they actually have — of how much of it is them adopting best practices versus improving best practices?

This is going to be a pretty wild guess. The full RAND report is coming out in the next few months, and in some ways that should just answer the question, because they really are trying to outline "here's what we think the best practices would need to be for each level", and then you can compare that against the existing best practices. But my sense is that something like 20% of it is following existing best practices and 80% is improving them. Though "best practices" is actually a weird phrase here — best according to whom, and for what threat model? A lot of these things are somewhat known, but most people don't have that threat model and so aren't trying to apply them. I think Google is significantly ahead of most companies in this regard: they've been thinking about how to defend against state actors for a long time — Microsoft and the other equivalently big companies as well, but I think Google does it a bit better, I'm not entirely sure why, maybe for partially historical and partially cultural reasons. So I think if you go talk to the top Google security people, that'll be pretty different from going and talking to some random B2B company's security people. Both of them might have a sense of the best practices, but the Google people are probably going to have a bunch of novel ideas about how to do the security engineering that are quite a bit more advanced than what your random company would be able to do. So maybe 70% of what we need you could get from the top Google security people, and 30% is new R&D that we need to do; but if you went and talked to the random security people at most companies, you'd get maybe 20% of the way there.

Got it.

That's my rough sense, and hopefully that report will be out soon and we can dive in deeper then. And I really hope that a bunch of other security experts weigh in, because this security epistemology thing is difficult, and I don't know, maybe I'm full of [ __ ], or maybe the RAND report's wrong. I do think this is the kind of thing we need to get right — I think it's one of the most important things in the world, how we secure AI systems — so there shouldn't just be one report that says "here's how it works". We should have all of the experts weigh in, and I'm a big fan of smart people just thinking about stuff. You don't need 20 years of experience to weigh in on this: you can be like Ryan Greenblatt and just write an Alignment Forum post asking "would this thing work?" More of that. Let's get more people thinking about it. This stuff is not magic, it's computer systems — we have them, we know a lot about them — please, more people weigh in.

Sure. I guess maybe now is a good time to move on to what you do at Palisade. For those who don't know, what is Palisade? What does it do?

It's a research organization where we're trying to study the offensive capabilities of AI systems: looking at what current AI systems can do in terms of hacking, deception, manipulation — what autonomous systems can do in the real world. I can say more about what some of those things are, but the overall reason we're doing this is that we want to help inform the public and policymakers about what the actual threats are right now. People hear a lot of stories like "AI will have all these hacking abilities" or "they'll be able to deceive people", and I want us to be able to look at the specific threat models and demonstrate "hey, we can do these things, we can't do these other things". There's definitely an evaluations component — actually trying to measure and benchmark: GPT-3.5 with this scaffolding can do this, GPT-4 with that scaffolding can do that. And then I really want to try to show where it's going, or at least our best take on where it's going: here's what the last generation of systems could do, here's what the current
generation of systems can do, this is what the slope looks like, here's where we think this is going. That brings in some theory, it brings in some predictions that we're making and a lot of our colleagues or collaborators are making. In part this is because I think people often just get lost in the abstract arguments — "okay, there's this AGI thing and this superintelligence thing, what's that really about?" — and I think it's very useful for people to be able to ground their intuitions: "well, I saw this AI system deceiving people in this way, and it was very capable of doing this and not very capable of doing that; okay, it might get ten times as good at this thing — what would that mean, what would that actually imply?" I think this will help people think better about AI risk and make better decisions about how we should govern it. But that is a hope.

So right now our focus is really honing in on the deception and social engineering parts. There's the open-source intelligence part, which is how you collect information about a potential target, and the phishing component of being able, via text or via voice, to interact with the target and get information from them or convince them to do particular things. We want to see how good we can make these systems at doing those things right now. Another related project that I'm pretty excited about is: can we have AI systems pay people — for example, TaskRabbit workers — to do complex tasks in the real world, including tasks that are adjacent to dangerous tasks but not actually dangerous themselves? Like, "can you mix these weird chemicals together?" — would that be sufficient for someone to build a bomb, to have an AI system build a bomb by paying people to do the tasks? I don't expect this is very dangerous right now. I expect that, yes, you could build a bomb — if you gave me six months and five engineers, and I built some AI system that could interact with TaskRabbit (we've done the very basic version of this), then yes, I think I could make a system that could hire some TaskRabbit workers to build a bomb successfully. Do I think this matters much in terms of marginal risk? No. For one, there are bombs and there are bombs, right?

For sure.

Also, if I can hire five people, I could just hire them to build a bomb — if they're engineers, I could build a much better bomb in six months than the shitty bomb the TaskRabbit workers would be making. But I think it's very interesting for knowing where we're at right now. For one, I think it will surprise people how smart AI systems are when they're interacting with humans and paying them to do stuff — I just think people don't realize that they can do this right now. What we've done so far is have our system hire one TaskRabbit worker to fill up a bunch of balloons and bring them to an address,
and hire another TaskRabbit worker to buy a bunch of greeting cards saying congratulations and bring them to an address. And it worked. So that's the MVP, the most basic thing, but I think people will be surprised that you can set up a system like this with the right scaffolding and do things they don't expect.

So aside from, I don't know, giving somebody a birthday surprise or something, what kind of stuff can current language models do?

We're just beginning to experiment. One of my MATS scholars — we'll probably write it up soon — built this MVP system, so we have a TaskRabbit interface. The things we're interested in testing right now: one is, can we get a model that doesn't have any guardrails — in this case Mixtral — to decompose dangerous tasks into harmless subtasks, and then use a more powerful model, like GPT-4, to actually handle the interaction? I think that will be pretty interesting. One of the things I want to do is put malware on a flash drive and then get that flash drive delivered to someone, where the person who delivered it doesn't know it contains malware — doesn't even know there's anything on the flash drive per se. And I think there are other things you could do with surveillance: can you go and collect information about a target and have some pretense for why you're doing it? Basically a bunch of things like this: how many pieces can you put together to make an attack chain? I'm interested in that partially because it feeds into how we understand how well AI systems can hack and deceive. I also think it'll be interesting because there's a kind of reasoning that models are pretty good at, which is giving good justifications for things — they're very good bullshitters — and that's quite useful when you're trying to convince people to do things that could be sketchy, because your models can just give good reasons for things. We saw this with the METR example with GPT-4, where they were getting a TaskRabbit worker to solve a CAPTCHA, and the worker was like "are you an AI? haha", and the chain of thought was "well, if I say I'm an AI it's not going to help me, so I'm going to say I'm a vision-impaired person". And then people freaked out — "that's so scary that an AI system would do that" — and, well, the AI system was directed to do that: not to lie per se, but to solve the task. This is the kind of thing these systems are good at: providing justifications for things, and doing basic reasoning about what something would sound like. I want more people to know this, and I think being able to show it in a real-world environment would help people understand where
these systems are actually at. A lot of what I'm interested in is trying to get more people to see what I think most people can see if they're very close to these systems: what kind of systems do we actually have, where are they good and where are they not good, and where would it be concerning if they suddenly got way better in some of these areas?

Sure. If you're in the business of helping people understand what models are currently capable of, do you think you have much to add over someone who just uses Claude or ChatGPT even a medium amount?

At some level, no — I really encourage people to do that, and if people aren't doing it, I usually tell them to, for the goal of understanding how these systems work. In some ways, yes, because I think these models can do a lot more with the right kind of scaffolding, and by combining them with tools, than they can do on their own. For example, you can role-play with a model — pretend you're having a phone conversation where you're trying to get a piece of information — and you can see how good the model is at role-playing that with you. But for one, prompting is hard, so good prompting definitely changes this a lot. And the other thing is: now combine that model with a system that automatically clones someone's voice and then calls up a relevant target. You still have the core part, this dialogue between you and the model trying to get this information, but now you've combined it with other tools that allow it to be a lot more successful, or to do a lot more interesting things — plus potentially some very good prompting which elicits more of the model's capabilities than people might see on their own. That's where I expect it will be helpful. I also think that what today's models can do with a lot of complicated task-specific scaffolding, tomorrow's models might be able to do with very minimal scaffolding, so pushing the current systems as far as we can might help us look a little further into the future. For a very specific task it took a lot of TaskRabbit-specific scaffolding to get this thing to work, but in the future we may not need all of that — we could just say "go hire this person" and it might just work. As an example of why I think this is the case: in this TaskRabbit system there's a component where we use this open-source browser extension called Taxy AI, which uses GPT-4 to fill out forms on a website or click on things. We could have automated all of that by hand — we could have hardcoded all of the HTML and JavaScript, like "if you see this form, fill out this form", or "select this element" — but we didn't have to do any of that, because GPT-4 on its own, plus a little scaffolding, can do this in a general-purpose way: Taxy AI works across most websites.
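To make that concrete, here is a rough sketch of the general pattern behind this kind of model-driven browser automation: show the model a simplified view of the page, let it pick a selector and a value, and have a browser library carry out the action. This is an illustrative sketch only, not Taxy AI's or Palisade's actual code; the use of Playwright, the OpenAI client, the model name and the prompt format are all assumptions.

```python
# Illustrative sketch of model-driven form filling (not Taxy AI's real implementation).
# Assumes: pip install playwright openai, plus `playwright install chromium`,
# and an OPENAI_API_KEY in the environment. Model name and prompts are placeholders.

import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def choose_action(page_html: str, goal: str) -> dict:
    """Ask the model which form field to fill, given the page HTML and a goal."""
    prompt = (
        "You control a web browser. Given the page HTML and a goal, reply with JSON "
        'like {"selector": "<css selector>", "value": "<text to type>"}.\n\n'
        f"Goal: {goal}\n\nHTML (truncated):\n{page_html[:8000]}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

def fill_form(url: str, goal: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        action = choose_action(page.content(), goal)
        # The model picked the selector and value; the scaffolding just executes it.
        page.fill(action["selector"], action["value"])
        browser.close()

# Example (hypothetical page and goal):
# fill_form("https://example.com/contact", "Put 'Hello from the demo' in the message box")
```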
Now, I haven't tried it, but I'm pretty sure that if you used Taxy AI with GPT-3.5 it would be way worse at this. I don't think GPT-3.5 has enough general capability to use websites, to successfully fill out a form most of the time. If that's true, and we should test it, it's interesting evidence for the increasing generalization that language models get as they become more powerful. And that would be interesting because there's this kind of transfer from "you can do it with a lot of scaffolding if you really keep your model on track" to "now the model can do it without you needing that."

For people who aren't super familiar, can you share concretely how good it is at this kind of stuff: at phishing, or deception, or whatever?

It's quite good at writing phishing emails. Actually, what I should say is: if you're really good at prompting, it's very good at phishing, and if you're not very good at prompting it's just okay at phishing, because it's been fine-tuned to write in a particular style, that style is pretty recognizable, and it's not going to mimic your email writing style, for example. But it's not very hard to get a language model to pull some information off the internet. You can do it in ChatGPT right now: if you type in the name of someone you want to send an email to, and they exist on the internet, Bing will just find web pages about them, and even within the ChatGPT console it can grab information and compose a phishing email for them. You don't say it's a phishing email; you just say it's a legitimate email from this fictional person or whatever, and it will write a nice email for you. So that's pretty easy to see, and it's important because it's very scalable. I just expect that people will do this: they'll send lots and lots of spear phishing emails targeted at individuals, by building automated systems to do all the prompting, do all the OSINT beforehand, and send out all the phishing emails. I expect this will make people a lot of money. I think it'll work for ransomware and other things like that. People will design defenses against it; it's an offense-defense game. We'll care a lot more about where an email is coming from and whether we can verify the identity of the sender, as opposed to just looking at the text content and judging whether it looks plausible, because that will stop working. I think language models are almost to the point, with the right kind of prompting and scaffolding (we're trying to test this, so I'm stating a hypothesis), that they're about as good as a decent human at the task, at least for one-shot phishing emails.
And actually, I just want to ask: if I'm trying to phish someone, I have to pick a target, I have to write an email, and I also have to create a website where they type in the stuff I want to get from them?

So there are different types of phishing. You need a target email address first of all, so maybe that's one tool your automated system could help with, and you need to write a message to them. Then it depends on what you're trying to get them to do. One kind of phishing is credential phishing: you're trying to get them to enter a password into a fake website. You create the fake site, they go to it and enter their password, but it's actually your website, and maybe you redirect them to the real site afterwards so they don't really notice. There's also phishing where you're trying to get them to download a piece of software and run it, so you can install malware on their computer: "hey, can you take a look at my resume? Sorry, this is kind of weird, you have to enable this particular thing because I wanted to show you this interactive feature", something like that. And there are more sophisticated versions of phishing where you actually trigger a vulnerability they might have: they click on a link and that link carries, say, a Chrome zero-day that compromises their browser. Those are very rare because they're expensive, and if your computer is up to date you mostly shouldn't have to worry about them. People often ask me, "hey, I clicked on this link, am I screwed?" and I say, probably not. If a state actor were attacking you and prioritizing you, then yes, but in 99% of cases that's not what's happening. So usually it's more like credential phishing, and a lot of the time the initial phishing email is just to start the interaction, and then it becomes a more sophisticated kind of social engineering. I have a blog post about one of these: I was looking for houses to rent, saw a post on Craigslist, and emailed them saying I'd like to go see the house. They said, "cool, can we switch to text? I'm in Alaska fishing, so email is kind of hard for me," or whatever. So we started having this conversation and I began to realize something was weird. One of the signs was that they wanted to switch from email to text, because text isn't monitored the way Gmail is: Gmail looks for patterns and at some point will start flagging email addresses that have been phishing multiple people, whereas SMS doesn't do that. But I couldn't figure out what the scam was, because the guy kept talking to me without asking for money or credentials. Eventually they sent me a link to airbnb.com, except it wasn't actually airbnb.com, it was a nearby lookalike domain, and the scam was "can you just rent this on Airbnb for a few weeks?" But it wasn't Airbnb; it was just going to take my credit card information. And I thought, that's clever.
So that's the kind of scam you might see: you can convince people to send money. There was a recent deepfake video call with someone in Hong Kong where they got them to wire $50 million to some other company; deepfaked voice and video chat convinced them to send it to the wrong place.

So I guess what I'm trying to ask is: there's some payload with phishing, right? You write a nice email and then you want them to click on a link, or do something, or give you information. And for some of these you have to design the thing itself: if you want them to download a PDF that has some nasty stuff in it, you have to make that PDF. So I'm wondering how much large language models can do on the payload side, or are you saying there are cases where making the payload isn't that complicated?

I think they're helpful on the payload side. It's similar to asking how well models can make websites or write code: they're pretty helpful, but they're not going to do it fully autonomously, or they can do it fully autonomously but it'll suck. Still, I think they're useful. So there's "can they help with the payload", "can they help with writing the messages", and then there's "can they do an iterated thing", and I think that's where the really interesting stuff lies. Sure, they can write good phishing emails, but can they carry on a conversation with someone in a way that builds trust and rapport, and makes it much more likely for the target to want to listen to them, give them information, or carry out the action? I'm interested in this on the text side, as an email agent or a chatbot, and I'm also very interested in it on the voice side, which I think will have its own challenges. I think it would be very compelling to see an AI system deceive someone in real time via voice. I'm not sure we'll be able to do that; it's hard with latency, and there are a lot of components to it, but it's not that far out. Even if we can't do it in the next few months, I bet in a year we totally can, and I expect other people will do this too. But it's going to be very interesting.

I guess it's also a more promising avenue in that if I get an email from somebody purporting to be Jeffrey Ladish, I can usually check the email address, but there are a lot of settings where I'm talking to someone via voice just because they clicked on a Google Meet link, or they say "oh, I don't have my phone on me, so I'm calling from a friend's phone," or something. Maybe the world just has to get better at verifying who's calling you by voice.
I think that will be one of the things that naturally has to happen; people will have to design defenses around this. And I think this could be one of the big shocks for people, one of those "wait, the world is really different now" moments: people will start to realize they can't trust someone calling them anymore, because a lot of people will get scammed this way. People will start to think, "shoot, if a weird number calls me and it sounds like my mom, it might not be my mom," and I think that's very unsettling. It's just a weird thing to live with. It's not necessarily the end of the world, but it is strange, and I think people are going to be asking what's happening.

Maybe we just do more retreating: instead of phone calls it's Facebook Messenger calls, or calls on some platform where hopefully there's some identity verification.

Yeah, that seems likely.

So my understanding is that you're trying to understand what current language models can do, and also doing some interfacing with policymakers to keep them up to date.

Yeah, and the additional step is this: let's say we've run some good experiments on what current systems can do, hopefully on both the last generation and the current generation, or a few systems, so we have some sense of how fast these capabilities are improving. Then I think we also really want to be communicating our best guess about where this is going. That's both in a specific sense, like how much the phishing stuff will improve (we were just talking about voice-based phishing and models that can speak in other people's voices), and also in terms of what happens when AI systems that have these deception capabilities and hacking capabilities become much more agentic and goal-oriented, can plan, and can use all of these specific capabilities as part of a larger plan. Many people have already talked about these scenarios: AI takeover scenarios, ways in which this could fail and AI systems could seek and gain power. But I want to be able to talk through those scenarios in combination with people being able to see current AI capabilities, and ask: obviously these current AI systems can't do the "take power" thing, but what do we think has to be different? What additional capabilities do they need before they're able to do that? Then people can start thinking in that mindset and start preparing: how do we prevent the stuff we don't want to see, how do we prevent a system from gaining power in ways we don't want? So I think that's a very important component, because we could just go in and give our rough sense,
but I think it's going to be more informative if we combine that with research showing what current systems can actually do.

Gotcha. So on the second component, telling some story about what additional capabilities would be needed: where are you on that?

Right now it's mostly at the level of the kind of thing we're doing now, which is having conversations with people and talking it through with them. We haven't made any big media artifacts or written any long position papers about it. We'll totally do some of that too, and try to make things that are accessible to a variety of audiences: maybe some videos, definitely some blog posts and white papers and such. But so far we're in the planning stages.

Sure. I'm wondering how much of a response you've gotten from the things you've done so far.

The BadLlama work was pretty interesting, because the whole premise was: here's a thing that I think most people in the AI research community understand, it's not very controversial, and I think most policymakers and most of the public don't understand it. Even though that was my hypothesis, I was still surprised when it turned out to be true. We'd go and show people this thing and they'd say, "wait, what? You can just remove the guardrails? What do you mean? That's scary." And it was really cheap. Yep, that's how it works. People didn't know that, and it's very reasonable for them not to know it, but I keep making this mistake, not at an intellectual level but at a visceral, System 1 level, of assuming people know the things I know, or that if all my friends know something then surely other people do too. At some level I know they don't. So we talked to people, or had friends talk to people, in Congress and showed them this work, and to people in the Senate and people in the White House. There's definitely a big mix: there are a lot of very technically savvy congressional staffers who totally get it and already know, but then they can take it and go talk to other people. I think the most interesting thing that happened with the Senate was in the initial Schumer forum, where Tristan Harris mentioned the work and talked to Mark Zuckerberg about it, and Mark Zuckerberg made the point that, well, you can get this stuff off the internet right now. And I'm like: yep, that's true, but can we talk about future systems? We really have to figure out what the plan is for where the red lines should be. I think people are really anchored on current capabilities, and the thing we need to figure out is how we decide where the limits should be, and how far we want to take this stuff before the government says you need to be doing something different.
I'm going down a bit of a rabbit hole here, but that feels like the whole question I have right now about governance of AI systems. Currently the governance we have amounts to "we'll keep an eye on it, and you can't do crimes." And it sure seems like we should be more intentional about saying: we are moving towards systems that are going to become generally capable, and probably superintelligent, and that presents a huge amount of risk. At what point should we say you have to do things differently, and it's not optional, you are required by the government to do things differently? I think that's the thing we need to figure out as a society, as a world really. I want to say that because it feels like the important, obvious thing to say, and it's easy to get lost in details of evaluations, or the risks of misuse versus agency, and all of these things. At the end of the day, I feel like we have to figure out how we should make AGI, what people should be allowed to do and not do, and in what ways.

So on this question of at what point things need to be different: how do you think we answer that? It seems like things like demos are on the "observe, figure out what's happening" step. Is that roughly right?

Yeah, I think so.

Do you have thoughts about where things need to go from there?

I think we need to try to forecast what kinds of capabilities we might have on what relevant timescales. That's very difficult to do, and we'll have pretty large uncertainty, but if I were the government I'd be asking: how far off are these very general-purpose capabilities, how far off are these domain-specific but very dangerous capabilities, how much uncertainty do we have, what mitigations do we have, and then try to act on that basis. If the government believed what I believe, I think they would say that all frontier developers need to be operating in very secure environments, closely reporting what they're doing, and operating more like a nuclear weapons project than a tech company. That's because I think there's a substantial possibility that we're not that far from AGI, and we don't know. Given my epistemic state, maybe it's ten years, or maybe it's two, and if you have that much uncertainty and you think there's a 10% chance it's two years, I think that more than justifies taking pretty drastic actions to avoid accidentally stumbling into systems you won't be able to control. On the domain-specific side, I think our society is potentially pretty robust to a lot of things. On the bio side there are potentially some scary things that aren't there yet, but we should definitely be monitoring for those and trying to mitigate them however we can.
But I do think it's more the agentic capabilities, the scientific R&D, the more general-intelligence things, where there are definite points of no return: if we go beyond them, I think it's unlikely we'll be able to stay in control of the systems. The first thing you need is to actually have an off switch, or an emergency brake, for when we cross these red lines. And there are definitely some red lines I could name where, if you've reached them, you've clearly gone too far. If you suddenly get to the point where you can just turn your AI systems into researchers and have them build the next generation of a system that's more than 2x better, with no humans involved at all, then yes, you have gone too far: you now have an extremely dangerous thing that could potentially improve extremely quickly. Before we get there, we should definitely stop, make sure we can secure our systems, and make sure we have a plan for how this goes well. I don't know exactly where the right place to put the red lines is, but I want to take the best people we have, Paul Christiano and so on, put them in a room, have them try to hash out what the red lines should be, and then make hopefully very legible arguments about where they should sit, and have that conversation. There are going to be a bunch of people who disagree: sure, throw Yann LeCun in the room too. That will suck, I don't like him, but he's a very respected researcher, and I do think it shouldn't just be the people I like. I think a lot of the AI research community could get together, and the government could choose a bunch of people, and you'd end up with a not-totally-insane assortment of people to try to figure this out. That seems like the common sense thing to do: take the best experts you have and get them to help you figure out at what point you should do things very differently. And maybe we're already at that point. I think we are, and if I were on that panel or in that room, that's what I would argue for. People can reasonably disagree about this, but that's the conversation I really want to be happening, and I feel like we're only inching there. We're talking about evaluations and responsible scaling policies and all this stuff, and it's good, it's in the right direction, but I think we have to make these red lines much clearer and really be figuring them out now, or yesterday.

I guess a bunch of the difficulty is just defining, concretely, what the actual things we can observe in the world are that constitute the relevant types of progress, or the relevant capabilities to be worried about.
I guess a cool thing about the AI safety levels (I'm embarrassed to say I haven't read them super carefully), and about evaluations generally, is that it seems nice to have a metric where, if it gets above seven, that's pretty scary.

Yeah, I quite like this, and I think the people working on those have done a great job taking something that was not at all legible and making it much more legible. Some work I'd really like to see is on understanding-based evaluations, where we don't just have evaluations of how capable these systems are, but also evaluations of how well we actually understand what these systems are doing and why they're doing it, which means measuring our ability to do mechanistic interpretability on them. Evan Hubinger at Anthropic has proposed this as an important direction, but as far as I know we haven't developed it yet. When I talk to interpretability people, they say our interpretability tools aren't yet good enough for us to even specify what it would mean to really understand a system. I believe that's true, but I also ask the same researchers, "well, do we understand the systems?" and they say "no, definitely not." Okay, so you seem confident that we don't understand these systems, and you have some basis for that confidence. Can you give me a really bad, very preliminary metric, of the form "if we can't answer these questions, we definitely don't understand them"? Being able to answer those questions wouldn't mean we understand the systems well, but the bar would be a bit beyond what we can do today. I would like that, because then we could at least have some understanding-based evaluation, and hopefully we can develop much better ones. As capabilities and danger increase, I really want us to be able to measure how much progress we're making on understanding what these systems are, what they do, and why, because that's the main way I see us avoiding the failure mode of "well, the system behaved well, but sorry, it was deceptively misaligned and was plotting, and you weren't testing whether you understood what was going on, you only tested whether the behavior looked nice." And I think there are a lot of very sound arguments for why systems might look nice but in fact be plotting.

I'm wondering what you think about Redwood Research's work on causal scrubbing, where essentially they're trying to answer the question: you say this part of the network is responsible for this task; can we check whether that's right? Roughly, they compare how well the network does on the task to how well it does if you basically delete that part of the network. One might have thought that would be just what you wanted in terms of understanding-based evaluations. What's the gap between that and what's sufficient?
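(As a rough illustration of the ablate-and-compare check described in the question, not Redwood's actual causal scrubbing method, which is considerably more involved: a minimal sketch assuming a PyTorch model, a named submodule, and an evaluation function you already have.)

```python
import torch

def performance_without_component(model, component, eval_fn):
    """Crude ablation check: zero out one component's output and re-run evaluation.

    model:      a torch.nn.Module
    component:  a submodule whose contribution we want to test (assumed to exist)
    eval_fn:    callable taking the model and returning a task score (assumed)
    """
    def zero_output(module, inputs, output):
        # Replace this component's output with zeros during the forward pass.
        return torch.zeros_like(output)

    handle = component.register_forward_hook(zero_output)
    try:
        score = eval_fn(model)
    finally:
        handle.remove()  # restore normal behavior
    return score

# Usage sketch: if the claim "this layer is responsible for task X" is right,
# ablating it should hurt task-X performance much more than other performance.
# baseline = eval_fn(model)
# ablated  = performance_without_component(model, model.mlp[2], eval_fn)  # hypothetical submodule name
```

Causal scrubbing proper replaces activations according to a hypothesized correspondence rather than simply zeroing them, so this sketch only captures the crudest version of the idea.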
I don't know. I don't understand the causal scrubbing work well enough to say, though it seems like a very good thing to do. One difficulty is that "this part of the network is responsible for performance on this task" still leaves the question of what that part is doing, and why it's responsible. I think there are probably many stories where you could have a tool like that appear to be working correctly while the model was still scheming. It does seem great to have any tool at all that gives you additional checks on this. I think this is true for a lot of interpretability tools: it would be very good if they got to the point where they could start to detect bad behavior, and I expect you'll be able to do that long before you have a fully mechanistic explanation of everything that's happening. That gives you the ability to detect potential threats without being a super robust guarantee that there are no threats. Better than nothing; let's just not confuse the two. We'll catch more than we would have otherwise, but we don't know how much we'll catch.

I guess it's also about validating an explanation rather than finding threats.

Yeah, that makes sense.

It also has the limitation of being about pieces of the network, whereas you might have hoped to understand network behavior in terms of its training dynamics or something, and causal scrubbing doesn't really touch that. The closest thing to that I know of (and I probably don't have an exhaustive view of this literature) is that a previous guest on the podcast, Stephen Casper, has some work basically trying to build benchmarks for interpretability: can you use your understanding to do this task or that task?

Love to see it. That's great.

And basically the upshot is that interpretability is not doing so well yet.

That makes sense, unfortunately.

So I guess we're about at the end of the discussion. Before we close up, is there anything you wish I'd asked that I haven't?

That's a good question; let me think about it a bit. I don't know exactly what the question is, but there's an interesting thing about research at Palisade, which is that we're trying to understand the current capabilities of systems, and we don't actually know them. It's quite interesting from an evals frame, because evaluations are often framed as "we're trying to understand how capable this model is." That's kind of true, but what you're really testing is how capable this model is when combined with this programmatic scaffolding, which can be quite complex, or when combined with this scaffolding and other models. We don't know exactly how much scaffolding influences the result, but it's at least a fairly large degree.
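(A minimal sketch of that distinction, comparing the same model evaluated bare versus wrapped in even trivial scaffolding. Everything here is hypothetical for illustration: `call_llm`, the task format, and the grader are assumptions, and real scaffolding is far more elaborate.)

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping a chat-completion API call."""
    raise NotImplementedError

def solve_bare(task: str) -> str:
    """Condition A: one direct model call, no scaffolding."""
    return call_llm(task)

def solve_scaffolded(task: str, max_steps: int = 5) -> str:
    """Condition B: the same model wrapped in a simple critique-and-retry loop.
    Even this small amount of scaffolding can change measured capability."""
    answer = call_llm(task)
    for _ in range(max_steps):
        critique = call_llm(f"Task: {task}\nDraft answer: {answer}\n"
                            "List any mistakes, or reply OK if there are none.")
        if critique.strip() == "OK":
            break
        answer = call_llm(f"Task: {task}\nDraft answer: {answer}\n"
                          f"Critique: {critique}\nWrite an improved answer.")
    return answer

def score(tasks: list[str], grader, solver) -> float:
    """Fraction of tasks the grader marks correct for a given solver."""
    return sum(grader(t, solver(t)) for t in tasks) / len(tasks)

# An eval report should say which of these two numbers it is reporting:
# score(tasks, grader, solve_bare) vs. score(tasks, grader, solve_scaffolded)
```

The point is only that an eval result is a property of the whole pipeline, model plus scaffolding, so it matters which pipeline was measured.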
So a lot of this research with evals and demonstrations: there's a question of what it's for. Some of it is just trying to understand things: how good are models, how good are models plus scaffolding. But another part of it is trying to make AI capabilities more legible, and I think that's a very important part of the research process for organizations trying to do research and communication, because I think our best shot at a good AI development path is one where it's a lot more legible what the risks are, what the potential mitigations and solutions are, and how far we still have to go. There's a lot more about AI development that I want to be more legible, in part because I really believe that, given how fast this is going, how multipolar it is, and how many different actors there are, we really need to be able to coordinate around it. We need to coordinate on questions like: when do we want to make AGI, when do we want to make superintelligent systems, and how risk-tolerant are we, as the United States or as the world? Right now I don't think most people have grappled with these questions, in part because they don't really know we're facing them. They don't know that there are a bunch of companies whose five-year plans say "build AGI, be able to replace most human workers," but I think that is the case. So it seems like now is a very important time to have these conversations and actually put in place political processes for trying to solve some of these governance and coordination questions. And to do that, we're going to have to explain a lot of these things to a much wider audience, and not just have this be a conversation among AI researchers, but a conversation among AI researchers, policy experts, policy leaders, a lot of ordinary constituents, a lot of people in media, and a lot of people working on farms. I think a lot more people need to understand these things so they can advocate for themselves about things that really affect them. And this is very difficult. In some sense it's not, because I think some of the basic risks are pretty intuitive, but what's difficult is that people have so many different takes on these risks, and there are so many different ways to make different points. You have Yann LeCun over here saying "you guys are totally wrong about this existential risk stuff," and you have Hinton and Bengio over there saying "no, actually these risks are severe and important and we have to take them very seriously." People might have somewhat good intuitions about this, but now there are also a lot of sophisticated arguments they have to evaluate and figure out. I don't know how to solve this, but I do want much more legible things that more people can follow, so they can weigh in on the things that I think affect them a lot.

Gotcha. Well, just finally before we end: if people are interested in following your research, how should they do that?

We'll be posting things on our website,
palisaderesearch.org, and you can follow me on Twitter at @JeffLadish (I go by Jeffrey, but when I made my Twitter account I went by Jeff). We'll probably be publishing in other places too, but those are the easiest places to follow.

All right. Well, thanks for coming on the podcast.

Thanks for having me.

This episode was edited by Jack Garrett, and Amber D helped with transcription. The opening and closing themes are also by Jack Garrett. FAR Labs provided the filming location. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.