data explorers video by Argilla. This time we're exploring the PersonaHub dataset, which
has a paper as well, named "Scaling Synthetic Data
Creation with 1,000,000,000 Personas", by Xin Chan and colleagues
from the Tencent AI Lab. I thought it was interesting
because it focuses on increasing variety in synthetic datasets. When you create a synthetic dataset,
you ask a large language model to respond to an instruction, and there's a limit to how much
variation you can get from a single language model and a single instruction. So one way of increasing the
variety is to give the large language model a persona from which to respond. The paper proposes a method for
creating personas based on world knowledge, or public text from the web, where this public text is
compressed into a persona. For example, for a moving driver: "John,
a moving company driver, needs to deliver furniture to three
locations", and the language model is asked to generate this instruction based on the specific
persona in a specific topic. They use two main approaches: Text-to-Persona,
where they use that web text, and Persona-to-Persona, where they expand on those
personas to add further variation.

The dataset is available on the Hub
and it's split up into various subsets. Each subset is a topic of the persona. You have three columns: input persona,
which is a description of that persona, for example, "a data science professor
who focuses on how algorithms can impact the fairness and accuracy of news"; synthesized text, which is the instruction,
and it relates to that persona; and then a description, which
is kind of like a keywords field.

So I took this dataset from the Hub, I loaded it, I connected to the Argilla server, and
then I defined a feedback task. Here is my Argilla dataset
settings definition. I defined fields, questions,
and vectors. The fields that I added correspond to the dataset, and I wanted to add all three fields
so I could do some searching. I also added some vectors. I added here
some questions as well: a rating, a correction, and then feedback. What I ended up with is this
dataset here in Argilla.

And what I want to find in this
dataset are examples of tool use from a specific persona. Tool calls are expressed in this
JSON format by the language model. For example, if I said,
"What's the temperature?", I would want it to go to a weather
API and get the temperature from that. In PersonaHub, there's
a subset for tool use. The records are labeled in the description
column, so I can use search and just type in "tool", and I can search specifically inside the description. So now what I see are all of
the descriptions where "tool" appears.

Now I want to filter those
down for a specific type of persona. Let's just take this persona here: a medical device
manufacturer who collaborates with a sales representative. I'm going to say "Find similar", and I'm
going to select the input persona vector, because I want to use that field's vector. Now I get another persona, and
that's a business development executive focused on strategic partnerships, so that's picking up the business
development part of that first persona. What do we have here now? Okay, a manufacturing officer. A dental surgeon, which picks up the medical part.

Let's say this is an example of
the type of record that I want in the dataset that I'm creating. So I can say, okay, this
is a high-quality response, or I could make corrections. So
let's look through this function. The function has a
name, "analyzed dental". You might want to add more description to
give the language model an easier task. Or you could move on and say, okay, this is not something
I want in my dataset, so I'm going to discard it. And you could keep moving through and
creating a subset that you could later use in model training or model evaluation.

Okay, so that's the kind of fast end-to-end
example of how you'd go from finding a dataset to creating a subset. I'll share the notebook, and all the other resources
are already available online. In fact, the Hugging Face
Space is public as well, so if you just wanted to look through this
example, you could, and I'll share that in the description of the video.
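
For readers following along without the notebook, the search-then-find-similar step from the walkthrough can be roughly sketched in plain Python. This is only an illustration and does not use Argilla's API. The column names come from the dataset as described above. The records, the tool definitions, the helper functions (`search`, `cosine`, `find_similar`) and the three-dimensional vectors are all hypothetical stand-ins. A real setup would use an embedding model and the vector search built into the annotation tool.

```python
import json
import math

# Hypothetical records mirroring the three columns described in the video.
# The tool definitions and the 3-d "vector" values are invented for illustration.
records = [
    {
        "input persona": "A medical device manufacturer who collaborates with a sales representative",
        "synthesized text": json.dumps({"name": "track_device_orders", "parameters": {"region": "string"}}),
        "description": "tool use example",
        "vector": [0.9, 0.8, 0.1],
    },
    {
        "input persona": "A business development executive focused on strategic partnerships",
        "synthesized text": json.dumps({"name": "rank_partner_leads", "parameters": {"sector": "string"}}),
        "description": "tool use example",
        "vector": [0.3, 0.9, 0.1],
    },
    {
        "input persona": "A dental surgeon",
        "synthesized text": json.dumps({"name": "schedule_checkup", "parameters": {"patient_id": "string"}}),
        "description": "tool use example",
        "vector": [0.8, 0.1, 0.2],
    },
    {
        "input persona": "A data science professor who focuses on fairness and accuracy of news",
        "synthesized text": "Write a lecture outline on algorithmic fairness.",
        "description": "instruction example",
        "vector": [0.1, 0.1, 0.9],
    },
]


def search(rows, field, term):
    """Keep rows whose `field` contains `term` (case-insensitive)."""
    term = term.lower()
    return [row for row in rows if term in row[field].lower()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(rows, query, k=2):
    """Return the k rows closest to `query` by cosine similarity, excluding the query itself."""
    others = [row for row in rows if row is not query]
    others.sort(key=lambda row: cosine(query["vector"], row["vector"]), reverse=True)
    return others[:k]


# Step 1: search inside the description column for "tool".
tool_records = search(records, "description", "tool")

# Step 2: pick one persona and look for its nearest neighbours.
query = tool_records[0]
similar = find_similar(tool_records, query)

for row in similar:
    tool = json.loads(row["synthesized text"])  # the tool definition is JSON
    print(row["input persona"], "->", tool["name"])
```

Filtering by keyword first and ranking by vector similarity second matches the order used in the video: the text search narrows the records to the tool-use subset, and the similarity step then surfaces personas close to the one selected.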