Scaling Synthetic Data Creation with 1 Billion Personas | PersonaHub Dataset Explained

Captions
Welcome to another Data Explorers video by Argilla. This time we're exploring the PersonaHub dataset, which has a paper as well, named "Scaling Synthetic Data Creation with 1 Billion Personas", by Xin Chan and colleagues from the Tencent AI Lab. I thought it was interesting because it focuses on increasing variety in synthetic datasets. When you create a synthetic dataset, you ask a large language model to respond to an instruction, and there's a limit to how much variation you can get from a single language model and a single instruction. So one way of increasing the variety is to give the large language model a persona from which to respond. The paper proposes a method for creating personas based on world knowledge, or public text from the web, where this public text is compressed into a persona. For example, John, a moving company driver, needs to deliver furniture to three locations, and the language model is asked to generate this instruction based on the specific persona and a specific topic. They use two main approaches: text-to-persona, where they use that web text, and then persona-to-persona, where they expand on those personas to add further variation.

The dataset is available on the Hub, and it's split up into various subsets; each subset is a topic of the persona. You have three columns: the input persona, which is a description of that persona, for example, a data science professor who focuses on how algorithms can impact the fairness and accuracy of news; the synthesized text, which is the instruction, and it relates to that persona; and then a description, which is kind of like a keywords field.

So I took this dataset from the Hub and loaded it, I connected to the Argilla server, and then I defined a feedback task. Here is my Argilla dataset settings definition: I defined fields, questions, and vectors. The fields that I added correspond to the dataset columns, and I wanted to add all three fields so I could do some searching. I also added some vectors, and I added some questions as well: a rating, a correction, and then feedback. (Code sketches of these steps follow below.) What I ended up with is this dataset here in Argilla.

What I want to find in this dataset are examples of tool use from a specific persona. Tool uses are executed in a JSON format by the language model. For example, if I said "What's the temperature?", I would want it to go to a weather API and get that temperature from there. In PersonaHub there's a subsection of tool use; those records are labeled in the description column, so I can use search, type in "tool", and search specifically inside the description. And now what I see are all of the descriptions where "tool" appears.

Now I want to filter those down to a specific type of persona. So let's just take this persona here: a medical device manufacturer who collaborates with sales representatives. I'm going to say "find similar" and select the input persona vector, because I want to use that field's vector. And now I get another persona, a business development executive focused on strategic partnerships, so that's picking up the business-development part of that first persona. What do we have here now? A manufacturing officer. A dental surgeon, picking up the medical part.
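Here is a minimal sketch of the loading and search steps described above. It assumes the dataset is hosted at proj-persona/PersonaHub on the Hugging Face Hub with a "tool" subset and the column names mentioned in the video; treat the exact repo id, subset, and column names as assumptions rather than confirmed details of the notebook.

```python
# Sketch: load one PersonaHub subset from the Hugging Face Hub.
# Repo id, subset name, and column names are assumptions from the video.
from datasets import load_dataset

ds = load_dataset("proj-persona/PersonaHub", "tool", split="train")

# Mirror the UI search: keep records whose description mentions "tool".
tool_records = ds.filter(lambda row: "tool" in row["description"].lower())
print(tool_records[0]["input persona"])
```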
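And here is a sketch of the Argilla settings definition with fields, questions, and vectors, written against the Argilla 1.x Python SDK that was current when this video was published. The question names, vector name, record slice, and embedding model are my assumptions, not necessarily what the notebook uses.

```python
# Sketch: define fields, questions, and a vector, then log records to Argilla.
import argilla as rg
from sentence_transformers import SentenceTransformer

rg.init(api_url="http://localhost:6900", api_key="owner.apikey")  # your server

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="input_persona"),
        rg.TextField(name="synthesized_text"),
        rg.TextField(name="description"),
    ],
    questions=[
        rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
        rg.TextQuestion(name="correction", required=False),
        rg.TextQuestion(name="feedback", required=False),
    ],
    vectors_settings=[
        rg.VectorSettings(name="input_persona_vector", dimensions=384),
    ],
)

# Embed the persona field so "find similar" works; the 384-dimensional
# MiniLM model is an assumption.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
records = [
    rg.FeedbackRecord(
        fields={
            "input_persona": row["input persona"],
            "synthesized_text": row["synthesized text"],
            "description": row["description"],
        },
        vectors={"input_persona_vector": model.encode(row["input persona"]).tolist()},
    )
    for row in tool_records.select(range(100))  # small demo slice
]
dataset.add_records(records)
remote = dataset.push_to_argilla(name="persona-hub-tool", workspace="admin")
```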
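The JSON tool-call format itself looks roughly like this. The schema below is a hypothetical illustration of the weather example, not PersonaHub's exact format.

```python
# Hypothetical tool call for "What's the temperature?": instead of answering
# in free text, the model emits a structured call to a weather API.
tool_call = {
    "name": "get_current_temperature",  # hypothetical function name
    "arguments": {"location": "Berlin", "unit": "celsius"},
}
```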
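The "find similar" action in the UI can also be done from code. This sketch assumes the find_similar_records method that Argilla 1.x exposes on remote feedback datasets, and reuses the remote dataset and embedding model from the setup sketch above.

```python
# Semantic search over the input persona vector, as in the "find similar"
# step: embed a query persona and retrieve the nearest records.
query = "a medical device manufacturer who collaborates with sales representatives"
results = remote.find_similar_records(
    vector_name="input_persona_vector",
    value=model.encode(query).tolist(),
    max_results=5,
)
for record, score in results:
    print(round(score, 3), "-", record.fields["input_persona"])
```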
Let's say this is an example of the type of record that I want in the dataset I'm creating. So I can say, okay, this is a high-quality response, or I could make corrections. Let's look through this function: the function has a name, analyze dental. You might want to add more description to give the language model an easier task. Or you could move on: okay, this is not something I want in my dataset, so I'm going to discard it. You can keep moving through and create a subsection that you could later use in model training or model evaluation (see the export sketch below).

And so that's a fast end-to-end example of how you'd go from finding a dataset to creating a subset. I'll share the notebook, and all the other resources are already available online. In fact, the Hugging Face Space is public as well, so if you just wanted to look through this example, you could, and I'll share that in the description of the video.
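To turn the annotated records into that reusable subset, something like the following should work with the Argilla 1.x SDK; the response-status filter and the target repo id are assumptions.

```python
# Pull back only the records whose annotations were submitted and export
# them as a Hugging Face dataset for training or evaluation.
submitted = remote.filter_by(response_status="submitted")
curated = submitted.pull().format_as("datasets")
curated.push_to_hub("my-org/personahub-tool-curated")  # hypothetical repo id
```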
Info
Channel: Argilla
Views: 171
Keywords: llm, opensource, nlproc, machine-learning, community, ai
Id: timmCn8Nr6g
Length: 4min 45sec (285 seconds)
Published: Wed Jul 10 2024