The ONLY Model Trained on the DARK WEB (DarkBERT)

Video Statistics and Information

Captions
There are two internets, not one. There's the surface web that you're all familiar with: Google, any websites you visit, Yelp, Wikipedia, anything you can get to pretty easily through a standard web browser. But below that is the dark web, and on the dark web there's a lot of shady activity going on: drug sales, weapon sales, sales of hacking information. Anything that doesn't appear on the surface web ends up on the dark web. To access it, you typically need something called a Tor browser.

Now, when we look at large language models, almost all of them are trained exclusively on surface web data, which means they're only good at natural language as it's used on the surface web. If you try to apply those models to the dark web, they're not going to work very well, and there's a very specific reason for that: the dark web uses very different language than the surface web. For example, there are a lot of code words. When the language changes, the original data a large language model was trained on is no longer relevant, and the model is no longer accurate.

Today we're going to look at a research paper that explores what happens if you train a large language model on the dark web. What the authors propose is that if you train a model on the dark web's language, you can use it on the dark web really effectively, much better than surface web large language models. That's exactly what they did: they trained a model on dark web data and they're using it for cybersecurity. They are the good guys. We're going to explore what this paper talks about, and I find it fascinating, so let's get into it.

So here's the paper. It's titled "DarkBERT: A Language Model for the Dark Side of the Internet," and as mentioned in the intro, recent research suggests that there are clear differences between the language used on the dark web and that of the surface web. DarkBERT is a language model pre-trained on dark web data. So how does the paper define the dark web? The dark web is a subset of the internet that is not indexed by web search engines such as Google and is inaccessible through a standard web browser; you need to use a Tor browser or another overlay network. The authors say many of the underground activities prevalent on the dark web are immoral or illegal in nature, ranging from content hosting (such as data leaks) to drug sales. But if you want to apply artificial intelligence to root out or identify some of these malicious activities on the dark web, you can't really use traditional large language models; you need one pre-trained on dark web data, and that's what they've done. Their evaluation results show that DarkBERT-based classification models outperform known pre-trained language models. They also note that their analyses using standard NLP tools suggest that processing text in the dark web domain requires considerable domain adaptation. So again, you can't use the same models on the dark web that you're using on the surface web; the language is just very, very different. Either way, a domain-specific pre-trained language model would be beneficial in that it would be able to represent the language used on the dark web, which may effectively reduce the performance issues faced in previous experiments.

Okay, so how did they actually do this? How did they get the dark web data? Let's take a look. They initially collected seed addresses from Ahmia and from public repositories containing lists of onion domains. So what is "onion"? It's a top-level domain, like .com, .net, or .gov, except it's .onion, and that's the address space where most of the dark web is found. They then crawled the dark web for pages starting from the initial seed addresses, expanding their list of domains as they parsed each newly collected page, with the HTML title and body elements of each page saved as a text file. A total of around 6.1 million pages were collected. So they're just scraping the dark web the same way Google scrapes the surface web: crawling it, looking for information, and storing it.
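To make that crawling step concrete, here's a minimal sketch of fetching a single .onion page through Tor and saving its title and body text, which is roughly what happens at each page the paper's crawler visits. It assumes a local Tor daemon listening on the default SOCKS port 9050, the seed URL is a hypothetical placeholder, and the requests[socks] and beautifulsoup4 packages are required; this is not the authors' actual crawler code.

```python
import requests
from bs4 import BeautifulSoup

# Route traffic through a local Tor daemon (default SOCKS5 port 9050).
# "socks5h" makes DNS resolution happen inside Tor, which .onion requires.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion_page(url: str) -> str:
    """Fetch one hidden-service page and return its title + body text."""
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = soup.body.get_text(separator=" ", strip=True) if soup.body else ""
    return f"{title}\n{body}"

# Hypothetical seed address -- not a real service.
text = fetch_onion_page("http://exampleonionaddress.onion/")
with open("page_0001.txt", "w", encoding="utf-8") as f:
    f.write(text)
```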
So how did they actually train the model? They decided to use an existing model architecture rather than rolling their own, and it's called RoBERTa. They chose RoBERTa as the base initialization model because it opts out of the next-sentence-prediction task during pre-training, which serves as a benefit when training on a domain-specific corpus like the dark web, where sentence-like structures are not as prevalent as on the surface web. Dark web language is extremely different from surface web language: the surface web is much more formal and formatted, but on the dark web you're getting potentially highly unformatted, highly coded text, and you just can't use the same model architecture.

Here's a breakdown of the data they used: pornography, drugs, gambling, hacking, arms, violence, electronics. That's basically what the different pages were about. They used two different datasets, CoDA and DUTA, and they removed categories that didn't have a lot of volume, such as human trafficking.

So here are the results on the two datasets, DUTA and CoDA. You can see that DarkBERT outperformed RoBERTa and BERT in most cases; although some of the results are very similar, it still did outperform. Anything you see in bold means it did the best. The authors do note that for some categories DarkBERT performed about the same as traditional surface web models, and the reason is that those categories can also be found on the surface web. An example of that: some categories such as drugs, electronics, and gambling show very similar performance across all four models, likely due to the high similarity of pages in those categories, which makes classification easier even with the differences in the language used on the dark web.

So why do this? Why spend the time training on the dark web? What actual use cases does it enable? The paper goes over a few of them. One is ransomware leak site detection. What they propose is that this large language model, while crawling, can also identify leak sites. That's beneficial because the dark web is huge and people are trying to figure out ways to automate the identification of malicious material on it, and this can really help. Here are some of the results from ransomware leak site detection. On the left we have the raw data, and next to it the pre-processed data. The raw data is everything; the pre-processed data has low-volume categories removed, empty categories removed, empty pages removed, and private data removed, because a lot of what's on the dark web is personal information about people that is bought and sold, and they wanted to keep that out of the model as well. What you can see is that DarkBERT, especially on the pre-processed data, far outperformed RoBERTa and BERT, which again are trained on surface web data.
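To give a sense of what a "DarkBERT-based classification model" looks like in practice, here's a minimal sketch of page classification with a RoBERTa-style encoder using the Hugging Face transformers library. It uses roberta-base purely as a stand-in checkpoint (DarkBERT's own weights aren't assumed to be available here), the seven labels follow the category breakdown above, and in the paper the classification head would of course be fine-tuned on labeled CoDA/DUTA pages before being used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Activity categories from the paper's data breakdown.
LABELS = ["pornography", "drugs", "gambling", "hacking",
          "arms", "violence", "electronics"]

# Stand-in checkpoint; DarkBERT shares RoBERTa's architecture, so swapping
# in its weights (if you have access) would be a one-line change.
MODEL_NAME = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)
model.eval()

def classify_page(page_text: str) -> str:
    """Classify one crawled page into an activity category."""
    inputs = tokenizer(page_text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

# The head is untrained here, so this prediction is meaningless
# until the model is fine-tuned on labeled pages.
print(classify_page("vendor shipping worldwide, escrow accepted ..."))
```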
Here's another use case: noteworthy thread detection. Dark web forums are often used for exchanging illicit information, and security experts monitor them for noteworthy threads to gain up-to-date information for timely mitigation. "Noteworthy" in this case means something bad is going to happen, or is happening, or is being discussed in a thread. Since many new forum posts emerge daily, it takes massive human resources to manually review each thread, and again, that's where large language models can come in: doing the pre-review so that, once things are identified, they can be passed to a human for final verification. Identifying noteworthy threads, however, requires a basic understanding of dark-web-specific language, and that's why it's important that DarkBERT is trained on it. The paper describes a thread as noteworthy if it describes one of the following activities: sharing of confidential company assets, such as admin access, employee or customer information, transactions, blueprints, source code, and other confidential documents; sharing of sensitive or private information of individuals, such as credit information, medical records, political engagement, passports, identifications, and citizenship; or distribution of critical malware or vulnerabilities targeting popular software or organizations, with emphasis placed on activities targeting large private companies, public institutions, and industries. And again, DarkBERT outperforms the other language models in terms of precision, recall, and F1 score for both inputs.

Here's an example of the difference between using a surface web language model and a dark web language model to identify drug content. For the BERT model, the semantically related words it suggests are "man," "champion," "singer," "writer," "driver," "sculptor," "producer," and "manufacturer," and these aren't related to MDMA at all in this example; that's exactly the problem they're describing. For DarkBERT, however, we get "pills," "import," "MD," "dot," "translation," "speed," "up," "oxy," "script," and "champagne," which are much more accurate. And here's something I found interesting about this specific case: the paper notes that BERT mainly suggests professions such as singer, sculptor, and driver because on the surface web the preceding word "Dutch" is usually followed by a vocational word.

I talked a little bit about how they gathered the data, but let's look into that more. It is of utmost importance that sensitive data be left out of the text corpus used for pre-training, to prevent DarkBERT from learning representations from sensitive text; as mentioned above, they mask their data before feeding it into the language model. So they have a raw version and a pre-processed version, and they plan to only release the pre-processed version of DarkBERT in order to avoid any misuse once the model is made publicly available. One of the limitations they talk about is language: because much of the dark web is in English, if somebody is communicating on the dark web in a non-English language, this model has a lot of trouble. DarkBERT is pre-trained on dark web text in English; the vast majority of dark web text is in English (around 90%), and building a multilingual language model for the dark web domain would pose additional challenges. And so that's it: this is a dark-web-based language model, it applies to any dark web task you'd want a language model for, and it performs much better than surface web models for that use case.
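The masked-word comparison above is easy to reproduce in spirit with the fill-mask pipeline from Hugging Face transformers. This sketch queries bert-base-uncased for the most likely fill-ins on a drug-related sentence; the probe sentence is an invented stand-in rather than the paper's exact input, and a DarkBERT-style checkpoint would be swapped in the same way if one were available.

```python
from transformers import pipeline

# A surface-web model; a dark-web-trained checkpoint would slot in here.
fill = pipeline("fill-mask", model="bert-base-uncased")

# Invented probe in the spirit of the paper's "Dutch <mask>" example;
# mask_token is "[MASK]" for BERT and "<mask>" for RoBERTa-family models.
probe = f"the dutch {fill.tokenizer.mask_token} sells mdma in bulk"

for candidate in fill(probe, top_k=5):
    # Each result carries the filled-in token and the model's confidence.
    print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")
```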
This is a really cool research paper, and great job to the authors; I really enjoyed reading it. If you liked this video, please consider giving it a like and subscribing, and I'll see you in the next one.
Info
Channel: Matthew Berman
Views: 21,472
Keywords: darkbert, dark bert, dark web llm, chatgpt, chat gpt, openai, artificial intelligence, bert, what is dark bert, what is darkbert, darkbert ai, llm, large language model, dark web
Id: Kg8xoqHcdL4
Length: 10min 16sec (616 seconds)
Published: Sun May 21 2023