Let's talk about VTubers. Amygdala comes from the Greek word for
almond and it's the part of the brain associated with recognizing emotion from
facial expressions or at least most of them probably. It's an effective evolutionary
trait in recognizing emotions. In fact, it's so effective that sometimes emotions
can be recognized in non-human face havers such as animals or drawings. But determining the
emotion from a facial expression is a lot more than just selecting one from a set. There are many
different facial structures, and yet the same minor muscular movements can signal the same slight shift in the intensity or nature of an emotion. And so the question is: how does the brain recognise and
piece together all these seemingly insignificant details to understand what's being communicated
by a facial expression? For the past few decades, a popular interpretation of the brain has been the
comparison to the computer. Both of them after all are things that have parts in them that do
things. In conventional computer software, a simple program follows a sequence of logical
steps. To find a red dot in an image for example, you can check every single pixel until you get to
a pixel that is red with no red around it and to find the face and all of its features you just do
the same thing but replace a red dot with a face. But people are made of lots of differently coloured pixels, and even if we weren't, there's no difference between this pixel and this pixel. So maybe instead we should look at the combinations of pixels - at the edges of the image. But even then there's really very little difference between this edge and this edge, and so maybe instead we should look at the combinations of edges - at the shapes.
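To spell out what "a sequence of logical steps" actually means, here is roughly what that red-dot search from a moment ago might look like as a program - a hypothetical sketch, with "red" defined by a threshold I made up on the spot and the "no red around it" check skipped for brevity:

```python
import numpy as np

def find_red_dot(image):
    """Scan every pixel of an RGB image and return the first 'red' one."""
    height, width, _ = image.shape
    for y in range(height):
        for x in range(width):
            r, g, b = image[y, x]
            # 'Red' here is just an arbitrary threshold, not a law of optics.
            if r > 200 and g < 60 and b < 60:
                return x, y          # found our dot
    return None                      # no dot, no face, no video

# A 100x100 black image with a single red pixel hidden in it.
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[42, 37] = [255, 0, 0]
print(find_red_dot(img))             # (37, 42)
```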
In 1964, Dr. Woodrow Wilson Bledsoe published a report on a project for recognizing faces called the Facial Recognition Project Report. The goal of the project was to match images of faces to names, and one of the strategies for doing this was first
looking at the features of the face. Features can include key points such as the hairline, the
corners of the eyes, or the tip of the nose. These features can then be combined and the distances
between the features are used to recognise and classify the face against many faces in a dataset.
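Bledsoe's actual implementation isn't public, so here is a purely hypothetical sketch of the general idea: take a few hand-marked key points, measure the distances between every pair of them, and pick whichever face in the dataset has the most similar measurements.

```python
import numpy as np

def feature_vector(landmarks):
    """Turn (x, y) key points into the distances between every pair of them."""
    diffs = landmarks[:, None, :] - landmarks[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    i, j = np.triu_indices(len(landmarks), k=1)
    return dists[i, j]

def closest_match(query_landmarks, dataset):
    """Name whichever known face has the most similar set of distances."""
    query = feature_vector(query_landmarks)
    return min(dataset, key=lambda name: np.linalg.norm(feature_vector(dataset[name]) - query))

# Tiny made-up dataset: name -> hand-marked landmark coordinates.
dataset = {
    "face_a": np.array([[10, 10], [30, 10], [20, 25], [20, 35]], dtype=float),
    "face_b": np.array([[10, 12], [34, 12], [22, 28], [22, 40]], dtype=float),
}
unknown = np.array([[11, 11], [31, 10], [20, 26], [21, 36]], dtype=float)
print(closest_match(unknown, dataset))   # face_a, probably
```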
Of course, not all faces are always directly facing the camera. Some of them are facing to the left and some of them are facing the consequences of their creation. To correct this, mathematical transformations were applied to the distances to face the face face-forward. Unfortunately, not much else is known about this project due to confidentiality, but it is the first known significant attempt at computers processing human faces - so long as those faces are at a reasonable angle and lighting and age and aren't wearing a hat. Ultimately, due to limitations in technology,
the project was labeled as unsuccessful but it did highlight a problem in processing faces in that
faces were just too different from each other and different from themselves in different
settings. Any singular hard-coded algorithm had to be very convoluted with potentially a
lot of error, but that's okay, because this was in the 60s, so they still had plenty of time to figure it out before I make this video. VTuber, short for virtual tuber, short for virtual YouTuber, is a YouTuber that is virtual, as opposed to regular YouTubers, who are authentic and genuine. Being YouTubers, they are most commonly found on Twitch and have the potential to be as varied in content as any other content creator. In other words, swaying back and forth with their
mouths open. VTubers take obvious inspiration from Vocaloid performances such as those
of Hatsune Miku but unlike Vocaloids there isn't a singular history of VTubers since 1964 as
VTubers are not a singular organisation or even a singular technology. Most VTuber historians
like to start the history at Kizuna AI though who is in every sense of the word a virtual
YouTuber and the first virtual YouTuber if you don't count Super Sonico or the Weatheroid or
Annoying Orange. Ai is the Japanese word for love but it also has the obvious second meaning A.I.
- the abbreviation for artificial intelligence. It could also stand for Adobe Illustrator but it
does not because that is too many puns and also would not make sense. The character proclaims
herself to be an artificial intelligence and by artificial intelligence she means animation
software with a production crew but how exactly the human operator and voice are mapped
to the model has been mostly left to speculation due to confidentiality. It's also noteworthy that
unlike most traditional YouTubers she's somewhat of a corporate mascot under Kizuna AI Inc made
apparent by her involvement with commercials. Nonetheless, her character and her format
of 3D animated, voice-acted video productions is the first usage of the word VTuber, and
so it's a good place to start. Unfortunately, Kizuna never left behind a formal definition for
the word VTuber and so its definition at least as a medium has been left fairly open-ended. If
you take the definition to be the online video form substitution of one's identity by animated
character then the concept of VTubing is not very novel nor uncommon. On one hand, you have
static expressions chosen from a set of pre-drawn illustrations or PNGTubers and on the other
hand you have full body motion capture devices connected to a model within a game engine such as
that of CodeMiko. These two extremes sit at either end of the spectrum from most affordable but least
immersive to most immersive and least affordable but they both solve the facial recognition
from image problem in the best way possible: by not solving it. It's much easier to find a
dot on a screen than a face, so what if we made people's faces into dots - or rather, multiple dots?
In 1990, Lance Williams decided to track faces by not tracking faces and published a paper on
Performance-Driven Facial Animation in which retroreflective dots that were easy to detect
by a computer were applied to a person's face to be tracked and then mapped to a 3D model.
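Williams' actual 1990 pipeline is its own story, but the "find the bright dots" part is the sort of thing you could sketch today in a few lines of OpenCV - the threshold and the fake frame below are made up for illustration:

```python
import cv2
import numpy as np

def find_markers(frame, threshold=220):
    """Return the (x, y) centroids of bright, retroreflective-looking dots in a frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Retroreflective markers bounce light straight back at the camera,
    # so they show up as the brightest blobs in the image.
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    _, _, _, centroids = cv2.connectedComponentsWithStats(mask)
    return centroids[1:]                     # skip label 0, which is the background

# Fake frame: a dark face with three suspiciously bright dots on it.
frame = np.full((120, 160, 3), 30, dtype=np.uint8)
for (x, y) in [(40, 60), (80, 50), (120, 65)]:
    cv2.circle(frame, (x, y), 2, (255, 255, 255), -1)
print(find_markers(frame))                   # roughly the three centres above
```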
Williams performed this on himself for budgetary reasons and not because he wanted to become
an anime character. This would be one of the first instances of marker-based facial motion
capture for animation: a technique that can be held accountable for The Polar Express. But it's
unreliable and bothersome, both of which are bad things, and so it has nothing more to do with this video.
If we ignore all the other parts of the body, CodeMiko's facial animation uses the iPhone X's FaceID hardware. By using projectors to project a pattern of light onto the face and then sensing its reflection with dedicated sensors, a three-dimensional depth image is created, thus avoiding the problem of angle. And since the projectors and sensors are projecting and sensing infrared light rather than visible light - on top of the infrared light that your body radiates naturally - the lighting around the face does not affect the image. The entire solution is
thus in the hardware and it works pretty well even on faces that are wearing hats. However,
how exactly a three-dimensional depth map is achieved from a face with lights on it is
something that we're not going to get into because hardware is scary but mostly due
to confidentiality though it doesn't take an Apple engineer to make the observation that
light patterns distort themselves when reflected on three-dimensional surfaces which could help
indicate the shapes of those surfaces. Apple's FaceID remains dominant in the IR camera facial
mapping market. Google's Pixel 4 had a similar system called uDepth, which used a stereo depth sensing system - otherwise known as two cameras, similar to how you have two eyes to sense depth - but this was discontinued. The other one is the Xbox Kinect. All of this wasn't developed just
for Apple's primary demographic of VTubers though. The main selling point of FaceID is its biometric
authentication system and also Animoji. But where VTubing comes in is the tool that Apple provides to developers: ARKit. Developers can build apps around this tool, such as Live Link Face, which feeds the facial data directly into Unreal Engine - which is what CodeMiko uses. But what if you can't
afford an iPhone X or just despise Apple? Surely there's another way to VTube from your webcam
or camera. In fact, it's probably the technology you've been thinking of since we brought up brains
and facial recognition. Microsoft Excel has a tool that allows you to draw a trendline that best
represents a scatter plot. Most data is probably not linear, but a straight line can still be used to predict y values given x values. Of course, this prediction could just be terrible, and so Microsoft Excel has to minimise the distance between every single point and the line to find
the line of best fit. This process is called linear regression. Linear means relating to lines
and comes from the word line and regression means estimating the relationship between a dependent
variable and many independent variables and comes from the 19th century bean machine. You may have
noticed from that last sentence that there are many independent variables. Linear regression is
useful for drawing lines through three-dimensional and four-dimensional and whatever dimensional
scatter plots. Every new dimension is just another variable or feature that affects the output - the predicted value on the y-axis. Using linear regression to predict how long a person is going to watch through a video, the features may include the length of the video, the age of the person, and how much of the video is about statistical theory.
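Outside of Excel, fitting that watch-time hypothesis is one call to a least-squares solver. Everything below - the features, the numbers, the viewers - is invented for illustration:

```python
import numpy as np

# Each row: [video length (min), viewer age, fraction of the video about statistical theory].
features = np.array([
    [10.0, 21, 0.1],
    [25.0, 34, 0.6],
    [ 8.0, 19, 0.0],
    [40.0, 45, 0.9],
    [15.0, 27, 0.3],
])
watch_time = np.array([7.5, 6.0, 7.9, 3.2, 9.1])    # minutes actually watched

# Add a column of ones so the line doesn't have to pass through the origin.
X = np.column_stack([np.ones(len(features)), features])

# Least squares: find the weights that minimise the squared prediction error.
weights, *_ = np.linalg.lstsq(X, watch_time, rcond=None)

# Predict how long a 30-minute, half-statistics video holds a 25-year-old.
print(np.array([1.0, 30.0, 25, 0.5]) @ weights)
```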
And to make predictions off of, say, images of faces, the features could be every single color value of every individual pixel in the image. But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a line. A linear fit might not be the most appropriate hypothesis for every single feature. It might work better as a quadratic fit or
cubic fit. By adding more features or dimensions that are equal to the square or the cube or the
whatever of the previously established features, we can do polynomial regression, which is actually just another type of linear regression, because the hypothesis is still linear in its parameters - it's just a weighted sum of things that are non-linearly proportional to the original data. You can also combine features and make new features by multiplying them together: if you have a height feature and a width feature, you can instead have an area feature.
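That kind of feature engineering is mechanical enough to sketch in a few lines: square things, multiply things together, and then run the exact same least-squares fit as before on the widened table. The height-and-width example here is, of course, made up:

```python
import numpy as np

def expand_features(X):
    """Append squared features and all pairwise products to the original columns."""
    squares = X ** 2
    products = [X[:, i] * X[:, j]
                for i in range(X.shape[1])
                for j in range(i + 1, X.shape[1])]
    return np.column_stack([X, squares] + products)

# Height and width as the original features; area shows up as their product.
X = np.array([[1.7, 0.5],
              [1.6, 0.6],
              [1.9, 0.4]])
print(expand_features(X).shape)   # (3, 5): height, width, height^2, width^2, area

# The expanded matrix then goes through np.linalg.lstsq exactly as before -
# still linear regression, just on features that aren't linear in the data.
```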
But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a multivariate nth degree polynomial. We know that we can modify and combine features to make new features to optimally
fit our hypothesis to data but in what way do you modify the features? Which features do
you combine? How do you even do any of that for thousands of pictures that have hundreds of
thousands of pixels and millions of RGB values? Who is Gawr Gura? A slightly controversial sort of
lie is that linear regression as well as all other types of regression are a form of artificial
intelligence. In fact, if you sort of lie, anything can be a form of artificial intelligence.
You yourself at home may already know a good deal about artificial intelligence, either from your own
extensive research and experience or your ability to lie, but artificial intelligence isn't so much
of an algorithm as it is the idea of artificially creating something that is intelligent or at
least seems so. Most of what you may know to be artificial intelligence is the method of machine
learning called the artificial neural network. A neural network or a network of neurons is a system
of units that receive information, process it, and pass it on to other units in order for the
entire system to make a prediction - quite similar to the neurons of a brain. It's also quite similar
to Danganronpa in that regard. Neural networks - and all of machine learning - are a big deal because they allow programmers to do things they typically can't do on their own, such as play board games
at a grandmaster level or hold a conversation. This is because unlike in conventional programming
where the programmer knows what they're doing, in machine learning the program learns how to do it
on its own without the programmer really needing to know how it was done. But machines don't have
feelings so how and what exactly is the machine learning? The units or neurons of a neural network
are organized into layers. The first layer or the input layer is where the inputted features are
received. For every inputted feature there is a neuron in the input layer. Each feature within
each neuron can then contribute by some weighting to the features within the next layer of neurons.
The different weighted sums of all the features of this layer are thus the information received by the neurons of the next layer. This next layer, called a hidden layer, then applies some processing to the information in order to make it harder to explain. First, it adds a number called the bias value, which shifts the threshold the information has to clear, and then it puts it all through an activation function, which is just some non-linear function, so that
the features of this layer are not necessarily linearly related to the features of the previous
ones. These newly activated features can then be passed on to the next layer to repeat the process
and make more features off of these features. Through this, the features of each layer are like
a combination of the features of the previous ones: from pixels to edges to shapes to faces.
If there are many, many layers that compute very, very specific or complicated features, then the entire network can be called a deep neural network, because it is very long. Eventually, it reaches an output layer, which has as many neurons as there are things you're trying to predict. The values received here are the predictions that the model is giving based off of the input.
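Written out in code, one pass through a toy network like that is just matrix multiplications, bias additions, and an activation function. Every layer size below is invented for the sake of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Activation function: a non-linearity so a stack of layers isn't just one big line."""
    return np.maximum(0.0, x)

# Invented sizes: 4 input features -> 5 hidden neurons -> 2 outputs.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

def forward(features):
    """Input layer -> hidden layer -> output layer."""
    hidden = relu(features @ W1 + b1)     # weighted sums, plus bias, through the activation
    return hidden @ W2 + b2               # the output layer's predictions

print(forward(np.array([0.2, -1.3, 0.7, 0.0])))
```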
To train a model is to figure out all of its weights and biases, which are altogether called parameters.
This decides how each feature fits into the next layer of features. To do this is just the simple
task of finding or creating hundreds of thousands of pieces of data. The input data can be put
through the model and the predicted output can be compared with the actual true value that was
manually determined by a human. The function that does this comparison and determines how wrong
the model is is called the cost function. We can then go backwards through the model to find
out how each parameter can be changed in order to lower this cost function. This part is called
backpropagation and if you know calculus it's a quick way to calculate the partial derivative of
the cost function with respect to every parameter and if you don't know calculus well it's the
same thing but you wouldn't understand. The neural network relies on training with many sets of data in order to improve itself with each set, hence the name machine learning.
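To make the "go backwards and nudge every parameter" step slightly less mystical, here is the same idea on the smallest possible model: one weight, one bias, a mean-squared-error cost, and the partial derivatives worked out by hand instead of by backpropagation proper. The data is, as always, made up:

```python
import numpy as np

# Tiny labelled dataset: inputs and the true values a human has already determined.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])          # secretly y = 2x + 1

w, b = 0.0, 0.0                             # parameters, starting from nothing
lr = 0.05                                   # how big each nudge is

for step in range(500):
    pred = w * x + b                        # forward pass: make predictions
    cost = np.mean((pred - y) ** 2)         # cost function: how wrong the model is
    dw = np.mean(2 * (pred - y) * x)        # partial derivative of the cost w.r.t. w
    db = np.mean(2 * (pred - y))            # partial derivative of the cost w.r.t. b
    w -= lr * dw                            # nudge the parameters downhill
    b -= lr * db

print(round(w, 2), round(b, 2))             # close to 2.0 and 1.0
```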
Now admittedly, all of that may have been a bit of an oversimplification, but it's the
backbone of machine learning and the model used for computer vision and more specifically object
detection which would be true if I wasn't lying. Different architectures of neural network can have
different activation functions, cost functions, number of layers, and number of neurons per
hidden layer. The architecture for a computer vision model in which the input is an image matrix
is even more convoluted as it is a convolutional neural network. An RGB image can be thought of as
three matrices: one per colour channel, holding that channel's value for every pixel. However, it would take a lot of weights to put every single pixel into a weighted sum for every single feature of the next layer. Rather,
the more efficient technique devised is to take these matrices of features and pass them through
a filter that forms some number of matrices of new features for the next layer. The parameters here
for the convolutional layers are the values that make up the filters for each layer. There are
then also pooling layers that shrink the feature maps by throwing most of their values away and hoping it works, and then near the end of the network we may have some fully connected layers - just the same layers as before, with weights and biases - to sort of check to see if there are any relationships we're missing between all the features, now that there are fewer of them.
Finally, the vector of features that we're left with is put through some regression function to
perform the actual classification or localisation or detection or moral dilemma conclusion for
your autonomous vehicle. Just like with the basic neural network, the convolutional neural
network or ConvNet if you're running out of time is filtering for more and more specific features
with each layer. Also, this was once again an absurdly oversimplified oversimplification that
is ignoring a lot of the math though this time I'm not lying in any of the information I've given.
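To put rough shapes on that oversimplification, here is what a miniature convolutional network in that style might look like in PyTorch. The layer counts, channel sizes, and input resolution are all invented, and a real model would be considerably bigger:

```python
import torch
from torch import nn

class TinyFaceNet(nn.Module):
    """Convolutions -> pooling -> fully connected layers -> a vector of predictions."""

    def __init__(self, num_outputs=11):      # e.g. 1 face score + 5 (x, y) keypoints
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters over the three colour matrices
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: throw values away, hope it works
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),      # fully connected, checking for missed relationships
            nn.ReLU(),
            nn.Linear(64, num_outputs),       # the final regression layer
        )

    def forward(self, image):
        return self.head(self.features(image))

model = TinyFaceNet()
fake_frame = torch.rand(1, 3, 64, 64)         # one 64x64 RGB image
print(model(fake_frame).shape)                # torch.Size([1, 11])
```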
Computer vision is fairly big and so a lot of research has been put into it as well as a lot
of abstractions from its bare mathematical bones to the point where training and then running a
model could take the copy and paste skills of web development. Given just a few hundred images, you
can train a model to perform object detection on the body parts of Oikawa Nendoroids. Even you can
become a VTuber. In the case of VTubers, which is what this video is about, it's not actually object detection but rather facial landmark detection. The output for such a model may be a number denoting whether or not there's a face on the screen, followed by the x and y coordinates of several keypoints along the face: the eyebrows, eyes, nose, and lips.
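As a purely made-up illustration of that output format: suppose the model hands back a flat vector of one face-confidence score followed by x, y pairs. Turning it into something an avatar can use is just slicing - the keypoint names and numbers here are hypothetical:

```python
import numpy as np

KEYPOINT_NAMES = ["left_eye", "right_eye", "nose_tip", "mouth_left", "mouth_right"]

def parse_landmarks(output, threshold=0.5):
    """Turn a flat prediction vector into named (x, y) keypoints, if a face is there at all."""
    face_score, coords = output[0], output[1:]
    if face_score < threshold:
        return None                            # no face, nothing to animate
    points = coords.reshape(-1, 2)             # pair up the x and y coordinates
    return dict(zip(KEYPOINT_NAMES, map(tuple, points)))

# Hypothetical model output: [face_score, x1, y1, x2, y2, ...] in pixel coordinates.
fake_output = np.array([0.97, 210.0, 180.0, 300.0, 182.0,
                        255.0, 240.0, 225.0, 290.0, 285.0, 288.0])
print(parse_landmarks(fake_output))
```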
You may have noticed that we never answered the question of how a brain detects faces and facial expressions. The answer to that is who
knows? It's not my job to teach you neurology. In fact, I'm unemployed. If you take the definition
of VTubers to be "a" then its history is a pretty straightforward series of events. Following
the massive success of Kizuna AI in Japan, many other Kizuna-esque VTubers started popping up
such as Kaguya Luna and Mirai Akari. It was only a matter of time before agencies that managed
several VTubers, such as Hololive, started appearing on the scene. The agency Nijisanji broke
from the tradition of 3D models and used Live2D which is like 3D but with one less D. Rather than
a 3D model of joints, meshes, and textures, Live2D takes several flat images, layers them on top of each other, and moves each one independently, giving the illusion of depth. Perhaps more important, though, is that Nijisanji focused on live streams rather than video productions, and Hololive and other VTubers soon followed suit. And like all things Japanese, very soon there were fan English
subtitles followed by official English subtitles followed by English speaking subsets such as
Hololive English producing English groups such as HoloMyth including Gawr Gura followed by
entire English-speaking-based agencies such as VShojo. This rise of VTubers from the debut of
Kizuna AI to the debut of VShojo is a relatively short period of time from 2016 to 2020. In almost
all of the examples I've given thus far though, the technology used for facial tracking is not
artificial intelligence. No matter how efficient or accurate a neural network may be it has one
fatal flaw in that it is software. Having to put every frame or every few frames through the neural
network to get updated coordinates for our model is a lot of processing. Even with a convolutional
neural network that does everything it can in every layer to reduce the number of parameters,
image processing is going to be a costly process. This means that in order for the animation to work
in real time with what's commercially available today, the smoothness or precision is going to
have to be significantly reduced. Add to that the fact that computer vision is very dependent on lighting - you can't process something from nothing - and it makes sense why both Hololive and Nijisanji provide iPhones to all their incoming VTubers. The TrueDepth system behind Apple's FaceID still uses software, but the hardware is designed specifically for the purpose of facial mapping. This means that rather than being given some massive dataset and then finding the features that it figured out how to find on its own, the program is given features of light distortion or depth that coincide directly with the coordinates of the facial landmarks, using
just some conventionally programmed geometric operations. As funny as it would have been though,
it's not like all that talk about machine learning was completely irrelevant. There are still an
abundance of VTuber applications for webcam using ConvNets primarily targeted towards
independent youtubers who don't get free iPhones. Luppet, Wakaru, VSeeFace, 3tene which comes with
bodily contortions, FaceRig which comes with not being good, to name a few. VTube Studio which is
for Live2D is available for webcam, Android, and iOS. For webcam, it uses a model from OpenSeeFace.
There it is. On Android, it uses ARCore; both of these are deemed to offer lower-quality tracking than the iOS version. VTubing is not just facial tracking though, but since it's all about
tracking a human and mapping it to an avatar, all the other aspects of VTubing short of
a mocap suit can use similar technologies. Hand-trackers such as LeapMotion use IR projectors
and sensors to track hand motions which is very handy but also limited because you can't
cross your arms so no snarky remarks. Natural language processing problems such as speech
recognition would otherwise require a lot of manual feature engineering, and so neural networks are preferred,
inputting human speech and outputting text which can then be either used as another
way for mouth tracking or be synthesized back into speech via more neural networks to
mask your voice like VShojo's Zentreya. "Head fat". Neural networks, light sensors, and
even mocap suits, VR headsets, eye-trackers, and the Xbox Kinect are all methods of motion capture.
And if I didn't want people to see this video, I could probably title it motion capture at least
for up to this point. But that still wouldn't be entirely true as the motion capture required for
VTubing is still different from that required for general virtual reality or film production.
There is an emphasis in the technology on facial expressions, affordability, and presenting to
an audience in real-time. What's more is that VTubing doesn't have to be and was never meant to
be just a one-to-one direct transfer of motion to an avatar. While this goes more into design than
development, VTubers can also employ things like object interactability, keyboard shortcuts for
pre-programmed animations, or additional physics. VTubing is like a virtual puppet show - or Luppet show, you could say, or not say, actually - and just because the puppet strings of motion capture are necessary doesn't mean you can't improve the show with piles of corpses. Maybe it shouldn't even
be a puppet show. Perhaps the future of VTubing should be a looser connection to the puppeteer
for more expressive or stylistic animation. A paper was written last year in 2021 for a
VTubing software called AlterEcho. The software uses facial expression recognition, acoustic
analysis from speech recognition, and mouse or keyboard shortcuts to apply gestures to an avatar
on top of motion capture - gestures that the human themselves are not actually doing. The nature or
mannerisms of these gestures can be configured by what the paper calls avatar persona parameters
such as how shy or confident the VTuber persona is supposed to be. How effective this all is is still
unknown though as the software is unavailable and the paper is still under double-blind review
at least at the time of this recording, though the paper itself states that it was rated
fairly highly compared to pure motion capture and VMagicMirror which is a keyboard-based software.
On the topic of new software for independent VTubers, while 2D models are modeled with Live2D, 3D models are not necessarily modeled with actual 3D modeling software like Blender, but rather with software like VRoid Studio, which is essentially a character customization screen with many
sliders for incredibly unique customization, though the official stable release has only been
out for a year. Currently, both 2D and 3D VTubers suffer from a noticeably homogeneous design and
style that some say is reminiscent of Genshin Impact characters whereas others argue is
closer to Honkai Impact characters. Perhaps a much more unique easily accessible VTuber
avatar creator will be available for the next generation of VTubers. It's unlikely that it
will ever break out of that anime niche anytime soon. You had your chance. And it's definitely not
going to be whatever Metaverse was supposed to be. But just like how Live2D models have been getting
exceedingly creative, 3D models could branch off into as many styles as there are styles of
anime which has the opportunity to be aided by a movement towards more motion capture-independent
animation-focused VTuber software. In regards to the future of VTubing, there is another
possibility that has been somewhat disregarded since the decline of Kizuna AI and it has to
do with the AI part of Kizuna AI. Not love. It's quite common for VTubers nowadays to come
with their own lore - the avatar is actually a mythical beast or a magical being. Kizuna,
the self-proclaimed first virtual youtuber, of course had the backstory of being an artificial
intelligence. This whole backstory stems from the original idea behind artificial intelligence: to
mimic human intelligence. A neural network can learn how to find the features it needs to find to
locate a facial landmark on an image which could imply that given the right conditions in training
it can learn the features of human behavior and produce content. And while in the case of Kizuna,
artificial intelligence was only used at most for landmark detection, there already exist machine
learning models that write scripts, interact with humans, play games, and generate animations. There
are even neural networks for singing synthesizers such as SynthV which has been mentioned to
me by all five of its users. It seems not too far-fetched to just combine all of these to
create a truly automated artificial intelligence virtual youtuber. However, we also know that all
of these are just several independent abstract networks. The learning that these networks are
doing isn't based off of experience or intuition or arguably even any logical structure. It is
just a collection of shifting parameters and mathematical evaluations that knows something
is correct because, in the case of supervised learning, we informed it that it was correct.
A content creator about as sentient as the trendline generator on Microsoft Excel. We know
this even without fully understanding sentience because we know what sentience is not and what
it is not is a pattern recognition machine in a computational void. The actual algorithms of human
intelligence may still be unknown but because of the way that machine learning was developed,
artificial intelligence isn't intelligent in the same way that humans are but it can learn the
features of human intelligence and reproduce them to an incredible degree. It's no longer uncommon
for humans to be deceived by AI generated conversations or art or music, though perhaps
given the nature of parasocial relationships and corporate media, we simply wouldn't mind.
Ultimately, whether a content creator who is not sentient - who we know is not sentient - but can
imitate sentience perfectly will be in or against our best interests will be up to us as living
breathing humans. Until then though, there are at least a few more Hololive generations before we
have to make such a decision so we might as well enjoy what A.I. has to offer and look forward
to the future of this still young VTuber era without having to really worry about any
unforeseeable threats of other intelligences. "Yahoo"