The "VTuber" and its (Technical) Future

Captions
Let's talk about VTubers.

Amygdala comes from the Greek word for almond, and it's the part of the brain associated with recognising emotion from facial expressions, or at least most of them, probably. It's an effective evolutionary trait for recognising emotions. In fact, it's so effective that sometimes emotions can be recognised in non-human face havers such as animals or drawings. But determining the emotion from a facial expression is a lot more than just selecting one from a set. There are many different facial structures, and yet the same minor muscular movements can signal the same slight shift in the intensity or nature of an emotion. And so the question is: how does the brain recognise and piece together all these seemingly insignificant details to understand what's being communicated by a facial expression?

For the past few decades, a popular interpretation of the brain has been the comparison to the computer. Both of them, after all, are things that have parts in them that do things. In conventional computer software, a simple program follows a sequence of logical steps. To find a red dot in an image, for example, you can check every single pixel until you get to a pixel that is red with no red around it, and to find a face and all of its features you just do the same thing but replace the red dot with a face. But people are all differently coloured pixels, and even if we weren't, there's no difference between this pixel and this pixel. So maybe instead we should look at the combinations of pixels - at the edges of the image. But even then there's really very little difference between this edge and this edge, and so maybe instead we should look at the combinations of edges - at the shapes.

In 1964, Dr. Woodrow Wilson Bledsoe published a report on a project for recognising faces called the Facial Recognition Project Report. The goal of the project was to match images of faces to names, and one of the strategies for doing this was first looking at the features of the face. Features can include key points such as the hairline, the corners of the eyes, or the tip of the nose. These features can then be combined, and the distances between the features are used to recognise and classify the face against many faces in a dataset. Of course, not all faces are always directly facing the camera. Some of them are facing to the left, and some of them are facing the consequences of their creation. To correct this, mathematical transformations were applied to the distances to face the face face-forward. Unfortunately, not much else is known about this project due to confidentiality, but it is the first known significant attempt at computers processing human faces - so long as those faces are at a reasonable angle and lighting and age and aren't wearing a hat. Ultimately, due to limitations in technology, the project was labeled as unsuccessful, but it did highlight a problem in processing faces: faces were just too different from each other, and different from themselves in different settings. Any singular hard-coded algorithm had to be very convoluted, with potentially a lot of error. But that's okay, because this was in the 60s, so they still had plenty of time to figure it out before I make this video.
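(As a rough sketch of the general idea - reduce each face to the distances between a few manually located keypoints, then match an unknown face to whichever known face has the closest set of distances - here's some Python. The keypoints, coordinates, and names are all made up for illustration; Bledsoe's actual method remains confidential.)

```python
# Bledsoe-style matching sketch: faces become lists of distances between
# keypoints, and an unknown face is matched to the nearest known face in
# that distance space. All names and numbers here are illustrative.
import math

# Hypothetical keypoints: (x, y) pixel coordinates for a few facial features.
KNOWN_FACES = {
    "face_a": {"eye_l": (102, 140), "eye_r": (168, 142), "nose": (135, 190), "mouth": (134, 238)},
    "face_b": {"eye_l": (95, 150),  "eye_r": (150, 149), "nose": (121, 188), "mouth": (122, 225)},
}

def feature_vector(keypoints):
    """Pairwise distances between keypoints, so the face becomes a list of numbers."""
    names = sorted(keypoints)
    return [
        math.dist(keypoints[a], keypoints[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    ]

def match(unknown):
    """Return the known face whose distance profile is closest to the unknown one."""
    u = feature_vector(unknown)
    return min(
        KNOWN_FACES,
        key=lambda name: sum((x - y) ** 2 for x, y in zip(u, feature_vector(KNOWN_FACES[name]))),
    )

print(match({"eye_l": (100, 141), "eye_r": (165, 143), "nose": (133, 191), "mouth": (133, 236)}))
```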
VTuber, short for virtual tuber, short for virtual YouTuber, is a YouTuber that is virtual, as opposed to regular YouTubers, who are authentic and genuine. Being YouTubers, they are most commonly found on Twitch and have the potential to be as varied in content as any other content creator. In other words, swaying back and forth with their mouths open. VTubers take obvious inspiration from Vocaloid performances such as those of Hatsune Miku, but unlike Vocaloids there isn't a singular history of VTubers since 1964, as VTubers are not a singular organisation or even a singular technology. Most VTuber historians like to start the history at Kizuna AI though, who is in every sense of the word a virtual YouTuber, and the first virtual YouTuber if you don't count Super Sonico or the Weatheroid or Annoying Orange.

Ai is the Japanese word for love, but it also has the obvious second meaning A.I. - the abbreviation for artificial intelligence. It could also stand for Adobe Illustrator, but it does not, because that is too many puns and also would not make sense. The character proclaims herself to be an artificial intelligence, and by artificial intelligence she means animation software with a production crew, but how exactly the human operator and voice are mapped to the model has been mostly left to speculation due to confidentiality. It's also noteworthy that, unlike most traditional YouTubers, she's somewhat of a corporate mascot under Kizuna AI Inc., made apparent by her involvement with commercials. Nonetheless, her character and her format of 3D animated, voice-acted video productions is the first usage of the word VTuber, and so it's a good place to start.

Unfortunately, Kizuna never left behind a formal definition for the word VTuber, and so its definition, at least as a medium, has been left fairly open-ended. If you take the definition to be the substitution of one's identity by an animated character in online video form, then the concept of VTubing is neither very novel nor uncommon. On one hand, you have static expressions chosen from a set of pre-drawn illustrations, or PNGTubers, and on the other hand you have full-body motion capture devices connected to a model within a game engine, such as that of CodeMiko. These two extremes sit at either end of the spectrum from most affordable but least immersive to most immersive but least affordable, but they both solve the facial-recognition-from-image problem in the best way possible: by not solving it.
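(The PNGTuber end of that spectrum really can be this simple. A minimal sketch, assuming the open-mouth image is swapped in whenever the microphone level crosses a threshold; the file names, threshold, and simulated audio are all placeholders.)

```python
# Minimal PNGTuber logic: show the open-mouth image when the mic level crosses
# a threshold, otherwise the idle image. Audio capture is simulated here; the
# image file names and threshold are made up.
import numpy as np

TALK_THRESHOLD = 0.05              # RMS level above which we count as "talking"
IDLE_IMAGE = "avatar_closed.png"   # hypothetical pre-drawn illustrations
TALK_IMAGE = "avatar_open.png"

def rms(chunk: np.ndarray) -> float:
    """Root-mean-square loudness of one audio chunk (samples in [-1, 1])."""
    return float(np.sqrt(np.mean(chunk ** 2)))

def frame_to_show(chunk: np.ndarray) -> str:
    return TALK_IMAGE if rms(chunk) > TALK_THRESHOLD else IDLE_IMAGE

# Simulated audio: silence, then someone saying something.
silence = np.zeros(1024)
speech = 0.2 * np.sin(np.linspace(0, 200 * np.pi, 1024))
print(frame_to_show(silence))  # avatar_closed.png
print(frame_to_show(speech))   # avatar_open.png
```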
It's much easier to find a dot on a screen than a face, so what if we made people's faces into dots, or rather multiple dots? In 1990, Lance Williams decided to track faces by not tracking faces and published a paper on Performance-Driven Facial Animation, in which retroreflective dots that were easy for a computer to detect were applied to a person's face to be tracked and then mapped to a 3D model. Williams performed this on himself for budgetary reasons and not because he wanted to become an anime character. This would be one of the first instances of marker-based facial motion capture for animation: a technique that can be held accountable for The Polar Express. But it's unreliable and bothersome, both of which are bad things, and so it has nothing to do with this video.

If we ignore all the other parts of the body, CodeMiko's facial animation uses the iPhone X's FaceID. By using projectors to project light onto the face and then sensing the reflected light with sensors, a three-dimensional image is created, thus avoiding the problem of angle. And since the projectors and sensors are projecting and sensing infrared light rather than visible light, on top of the infrared light that your body radiates naturally, the lighting around the face does not affect the image. The entire solution is thus in the hardware, and it works pretty well, even on faces that are wearing hats. However, how exactly a three-dimensional depth map is achieved from a face with light on it is something that we're not going to get into, partly because hardware is scary but mostly due to confidentiality, though it doesn't take an Apple engineer to make the observation that light patterns distort themselves when reflected off three-dimensional surfaces, which could help indicate the shapes of those surfaces. Apple's FaceID remains dominant in the IR-camera facial mapping market. Google's Pixel 4 had a similar system called uDepth, which used a stereo depth-sensing system, otherwise known as two cameras, similar to how you have two eyes to sense depth, but this was discontinued. And the other one is the Xbox Kinect.

All of this wasn't developed just for Apple's primary demographic of VTubers though. The main selling point of FaceID is its biometric authentication system, and also Animoji. But where VTubing comes in is the tool that Apple provides to developers: ARKit. Developers can build apps around this tool, such as Live Link, which feeds the facial data directly into Unreal Engine, which is what CodeMiko uses. But what if you can't afford an iPhone X, or just despise Apple? Surely there's another way to VTube from your webcam or camera. In fact, it's probably the technology you've been thinking of since we brought up brains and facial recognition.

Microsoft Excel has a tool that allows you to draw a trendline that best represents a scatter plot. Most data is probably not linear, but a straight line can still be used to predict y values given x values. Of course, this prediction could just be terrible, and so Microsoft Excel has to minimise the distance between every single point and the line to find the line of best fit. This process is called linear regression. Linear means relating to lines and comes from the word line, and regression means estimating the relationship between a dependent variable and many independent variables and comes from the 19th-century bean machine. You may have noticed from that last sentence that there are many independent variables. Linear regression is useful for drawing lines through three-dimensional and four-dimensional and whatever-dimensional scatter plots. Every new dimension is just another variable or feature that affects the output, and the predicted output, on the y-axis. Using linear regression to predict how long a person is going to watch through a video, the features may include the length of the video, the age of the person, and how much of the video is about statistical theory. And to make predictions off of, say, images of faces, the features could be every single colour value of every individual pixel in the image.
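(Here's that watch-time example as a minimal least-squares sketch in Python. The numbers are invented; the point is just that fitting the line means minimising the squared distance between predictions and data.)

```python
# Linear regression sketch: fit a hyperplane that predicts watch time from a
# few features by minimising squared error. All data values are made up.
import numpy as np

# Features per viewer: [video length (min), viewer age, fraction about statistical theory]
X = np.array([
    [27.0, 19, 0.30],
    [27.0, 34, 0.30],
    [27.0, 21, 0.30],
    [12.0, 22, 0.05],
    [45.0, 28, 0.60],
])
y = np.array([21.0, 9.0, 19.0, 11.5, 8.0])   # minutes actually watched

# Add a column of ones so the model can also learn an intercept.
X1 = np.hstack([X, np.ones((len(X), 1))])

# Least squares: the parameters that minimise sum((X1 @ w - y) ** 2).
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

new_viewer = np.array([27.0, 25, 0.30, 1.0])
print("predicted minutes watched:", new_viewer @ w)
```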
But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a line. A linear fit might not be the most appropriate hypothesis for every single feature. It might work better as a quadratic fit or a cubic fit. By adding more features, or dimensions, that are equal to the square or the cube or the whatever of the previously established features, we can do polynomial regression, which is actually just another type of linear regression, because the hypothesis is linearly proportional to something that is non-linearly proportional to the original data. You can also combine features and make new features by multiplying them together: if you have a height feature and a width feature, you can instead have an area feature. But making predictions off of something as advanced as an image of a face may not be as simple as just drawing a multivariate nth-degree polynomial. We know that we can modify and combine features to make new features to optimally fit our hypothesis to the data, but in what way do you modify the features? Which features do you combine? How do you even do any of that for thousands of pictures that have hundreds of thousands of pixels and millions of RGB values? Who is Gawr Gura?

A slightly controversial sort of lie is that linear regression, as well as all other types of regression, is a form of artificial intelligence. In fact, if you sort of lie, anything can be a form of artificial intelligence. You yourself at home may already know a good deal about artificial intelligence, either from your own extensive research and experience or your ability to lie, but artificial intelligence isn't so much an algorithm as it is the idea of artificially creating something that is intelligent, or at least seems so. Most of what you may know to be artificial intelligence is the method of machine learning called the artificial neural network. A neural network, or a network of neurons, is a system of units that receive information, process it, and pass it on to other units in order for the entire system to make a prediction - quite similar to the neurons of a brain. It's also quite similar to Danganronpa in that regard. Neural networks, and all of machine learning, are a big deal because they allow programmers to do things they typically can't do on their own, such as play board games at a grandmaster level or hold a conversation. This is because, unlike in conventional programming where the programmer knows what they're doing, in machine learning the program learns how to do it on its own, without the programmer really needing to know how it was done. But machines don't have feelings, so how and what exactly is the machine learning?

The units, or neurons, of a neural network are organised into layers. The first layer, or the input layer, is where the inputted features are received. For every inputted feature there is a neuron in the input layer. Each feature within each neuron can then contribute, by some weighting, to the features within the next layer of neurons. The different weighted sums of all the features of this layer are thus the information received by the neurons of the next layer. This next layer, called a hidden layer, then applies some processing to the information in order to make it harder to explain. First, it adds a number called the bias, in case the information is going below a certain threshold that it shouldn't, and then it puts it all through an activation function, which is just some non-linear function, so that the features of this layer are not necessarily linearly related to the features of the previous ones.
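(That one hop - weighted sum, plus bias, through an activation - looks like this in Python. The sizes, weights, and choice of ReLU are arbitrary.)

```python
# One layer of the network just described: each neuron of the next layer is a
# weighted sum of this layer's features, plus a bias, pushed through a
# non-linear activation. Sizes and values are arbitrary.
import numpy as np

def relu(z):
    # One common activation function: any non-linear function does the job;
    # ReLU just clips negatives to zero.
    return np.maximum(0.0, z)

features = np.array([0.2, 0.7, 0.1])          # outputs of the previous layer
weights = np.array([[0.5, -1.2, 0.3],          # one row of weights per neuron
                    [0.9,  0.4, -0.7]])        # in the next (hidden) layer
biases = np.array([0.1, -0.3])

hidden = relu(weights @ features + biases)     # the new, "activated" features
print(hidden)
```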
These newly activated features can then be passed on to the next layer to repeat the process and make more features off of these features. Through this, the features of each layer are like a combination of the features of the previous ones: from pixels to edges to shapes to faces. If there are many, many layers that compute very, very specific or complicated features, then the entire network can be called a deep neural network, because it is very long. Eventually, it reaches an output layer, which has as many neurons as there are things you're trying to predict. The values received here are the predictions that the model is giving based off of the input.

To train a model is to figure out all of its weights and biases, which are altogether called parameters. These decide how each feature fits into the next layer of features. To do this is just the simple task of finding or creating hundreds of thousands of pieces of data. The input data can be put through the model, and the predicted output can be compared with the actual true value that was manually determined by a human. The function that does this comparison and determines how wrong the model is is called the cost function. We can then go backwards through the model to find out how each parameter can be changed in order to lower this cost function. This part is called backpropagation, and if you know calculus, it's a quick way to calculate the partial derivative of the cost function with respect to every parameter, and if you don't know calculus, well, it's the same thing, but you wouldn't understand. The neural network relies on training with many sets of data in order to improve itself with each set, hence the name machine learning. Now admittedly, all of that may have been a bit of an oversimplification, but it's the backbone of machine learning and the model used for computer vision, and more specifically object detection - which would be true if I wasn't lying.

Different architectures of neural network can have different activation functions, cost functions, numbers of layers, and numbers of neurons per hidden layer. The architecture for a computer vision model, in which the input is an image matrix, is even more convoluted, as it is a convolutional neural network. An RGB image can be thought of as three matrices: one each for the red, green, and blue values of every pixel. However, it would take a lot of weights to put every single pixel into a weighted sum for every single feature of the next layer. Rather, the more efficient technique devised is to take these matrices of features and pass them through a filter that forms some number of matrices of new features for the next layer. The parameters here, for the convolutional layers, are the values that make up the filters for each layer. There are then also pooling layers that reduce the number of parameters by throwing them away and hoping it works, and then near the end of the network we may have some fully connected layers, which are just the same layers as before with weights and biases, to sort of check whether there are any relationships we're missing between all the features now that there are fewer features. Finally, the vector of features that we're left with is put through some regression function to perform the actual classification or localisation or detection or moral dilemma conclusion for your autonomous vehicle.
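(Stacked up, that architecture can be sketched in a few lines. This uses PyTorch, which is my choice of library rather than anything the video names, and every layer size and the five-keypoint output are arbitrary assumptions.)

```python
# A sketch of the convolutional architecture just described: convolutional
# filters, pooling layers that shrink the feature maps, fully connected layers
# near the end, and a final regression output. Layer sizes are arbitrary.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 5  # e.g. eyes, nose, mouth corners; purely illustrative

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # filters over the 3 RGB matrices
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: throw parameters away and hope
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # filters over the previous feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 64),                 # fully connected "relationship check"
    nn.ReLU(),
    nn.Linear(64, 1 + 2 * NUM_KEYPOINTS),        # face-present score + (x, y) per keypoint
)

fake_image = torch.rand(1, 3, 64, 64)            # one 64x64 RGB image
print(model(fake_image).shape)                   # torch.Size([1, 11])
```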
Just like with the basic neural network, the convolutional neural network - or ConvNet if you're running out of time - is filtering for more and more specific features with each layer. Also, this was once again an absurdly oversimplified oversimplification that is ignoring a lot of the math, though this time I'm not lying in any of the information I've given. Computer vision is fairly big, and so a lot of research has been put into it, as well as a lot of abstractions from its bare mathematical bones, to the point where training and then running a model could take the copy-and-paste skills of web development. Given just a few hundred images, you can train a model to perform object detection on the body parts of Oikawa Nendoroids. Even you can become a VTuber. In the case of VTubers, which is what this video is about, it's not actually object detection but rather facial landmark detection. The output for such a model may be a number denoting whether or not there's a face on the screen, followed by the x and y coordinates of several keypoints along the face, eyebrows, eyes, nose, and lips.
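(Turning that kind of output into something an avatar can actually use is mostly arithmetic. A sketch, where the keypoint order, indices, and normalisation constants are assumptions invented for illustration rather than the output format of any particular model.)

```python
# From a landmark-style prediction (face-present score followed by x, y
# keypoint coordinates) to avatar parameters. Keypoint layout is hypothetical.
import numpy as np

# [face_score, lip_top_x, lip_top_y, lip_bottom_x, lip_bottom_y,
#  eye_top_x, eye_top_y, eye_bottom_x, eye_bottom_y]
prediction = np.array([0.97, 0.50, 0.62, 0.50, 0.68, 0.40, 0.40, 0.40, 0.42])

def avatar_params(pred, face_threshold=0.5):
    if pred[0] < face_threshold:
        return None                      # no face found this frame
    lip_top, lip_bottom = pred[1:3], pred[3:5]
    eye_top, eye_bottom = pred[5:7], pred[7:9]
    return {
        # Distances scaled into rough 0..1 ranges for the model's parameters.
        "mouth_open": min(1.0, abs(lip_bottom[1] - lip_top[1]) / 0.1),
        "eye_open":   min(1.0, abs(eye_bottom[1] - eye_top[1]) / 0.03),
    }

print(avatar_params(prediction))
```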
You may have noticed that we never answered the question of how a brain detects faces and facial expressions. The answer to that is: who knows? It's not my job to teach you neurology. In fact, I'm unemployed.

If you take the definition of VTubers to be "a", then its history is a pretty straightforward series of events. Following the massive success of Kizuna AI in Japan, many other Kizuna-esque VTubers started popping up, such as Kaguya Luna and Mirai Akari. It was only a matter of time before agencies that managed several VTubers started appearing on the scene, such as Hololive. The agency Nijisanji broke from the tradition of 3D models and used Live2D, which is like 3D but with one less D. Rather than a 3D model of joints, meshes, and textures, Live2D takes several flat images and layers them on top of each other, moving them in different ways to give the illusion of depth. Perhaps more important, though, is that Nijisanji focused on live streams rather than video productions, with which Hololive and other VTubers soon followed suit. And like all things Japanese, very soon there were fan English subtitles, followed by official English subtitles, followed by English-speaking subsets such as Hololive English producing English groups such as HoloMyth, including Gawr Gura, followed by entire English-speaking-based agencies such as VShojo. This rise of VTubers from the debut of Kizuna AI to the debut of VShojo spans a relatively short period of time, from 2016 to 2020.

In almost all of the examples I've given thus far though, the technology used for facial tracking is not artificial intelligence. No matter how efficient or accurate a neural network may be, it has one fatal flaw: it is software. Having to put every frame, or every few frames, through the neural network to get updated coordinates for our model is a lot of processing. Even with a convolutional neural network that does everything it can in every layer to reduce the number of parameters, image processing is going to be a costly process. This means that in order for the animation to work in real time with what's commercially available today, the smoothness or precision is going to have to be significantly reduced. Add on the fact that computer vision is very dependent on lighting - you can't process something from nothing - and it makes sense why both Hololive and Nijisanji provide iPhones to all their incoming VTubers. The TrueDepth system of Apple's FaceID still uses software, but the hardware part is designed specifically for the purpose of facial mapping. This means that rather than being given some massive dataset and then finding the features that it figured out how to find on its own, the program is given the features of light distortion, or depth, that coincide directly with the coordinates of the facial landmarks, using just some conventionally programmed geometric operations.

As funny as it would have been though, it's not like all that talk about machine learning was completely irrelevant. There is still an abundance of webcam VTuber applications using ConvNets, primarily targeted towards independent YouTubers who don't get free iPhones: Luppet, Wakaru, VSeeFace, 3tene (which comes with bodily contortions), and FaceRig (which comes with not being good), to name a few. VTube Studio, which is for Live2D, is available for webcam, Android, and iOS. For webcam it uses a model from OpenSeeFace - there it is - whereas on Android it uses ARCore, both of which are deemed to have lower-quality tracking than the iOS version.
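(Webcam trackers like these only get fresh coordinates every frame or every few frames, and the raw values jitter. One common trick - my assumption here, not something the video attributes to any particular app - is to exponentially smooth each parameter before it drives the model.)

```python
# Exponential smoothing of a noisy per-frame tracking parameter, so the avatar
# doesn't twitch between model updates. Alpha trades lag for jitter.
class Smoother:
    def __init__(self, alpha=0.3):
        self.alpha = alpha      # 1.0 = raw, jittery tracking; small = smooth but laggy
        self.value = None

    def update(self, raw):
        if self.value is None:
            self.value = raw
        else:
            self.value = self.alpha * raw + (1 - self.alpha) * self.value
        return self.value

mouth = Smoother(alpha=0.3)
for raw in [0.0, 0.9, 0.1, 0.8, 0.85]:   # noisy per-frame mouth-open values
    print(round(mouth.update(raw), 3))
```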
VTubing is not just facial tracking though, but since it's all about tracking a human and mapping it to an avatar, all the other aspects of VTubing short of a mocap suit can use similar technologies. Hand-trackers such as LeapMotion use IR projectors and sensors to track hand motions, which is very handy but also limited, because you can't cross your arms, so no snarky remarks. Natural language processing problems such as speech recognition require a lot of feature engineering, and so neural networks are preferred, inputting human speech and outputting text, which can then either be used as another way of mouth tracking or be synthesized back into speech via more neural networks to mask your voice, like VShojo's Zentreya. "Head fat".

Neural networks, light sensors, and even mocap suits, VR headsets, eye-trackers, and the Xbox Kinect are all methods of motion capture. And if I didn't want people to see this video, I could probably title it "motion capture", at least for up to this point. But that still wouldn't be entirely true, as the motion capture required for VTubing is still different from that required for general virtual reality or film production. There is an emphasis in the technology on facial expressions, affordability, and presenting to an audience in real time. What's more, VTubing doesn't have to be, and was never meant to be, just a one-to-one direct transfer of motion to an avatar. While this goes more into design than development, VTubers can also employ things like object interactability, keyboard shortcuts for pre-programmed animations, or additional physics. VTubing is like a virtual puppet show - or Luppet show, you could say, or not say, actually - and just because the puppet strings of motion capture are necessary doesn't mean you can't improve the show with piles of corpses. Maybe it shouldn't even be a puppet show. Perhaps the future of VTubing should be a looser connection to the puppeteer for more expressive or stylistic animation.

A paper was written last year, in 2021, for a VTubing software called AlterEcho. The software uses facial expression recognition, acoustic analysis from speech recognition, and mouse or keyboard shortcuts to apply gestures to an avatar on top of motion capture - gestures that the human themselves is not actually performing. The nature or mannerisms of these gestures can be configured by what the paper calls avatar persona parameters, such as how shy or confident the VTuber persona is supposed to be. How effective this all is is still unknown, though, as the software is unavailable and the paper is still under double-blind review, at least at the time of this recording, though the paper itself states that it was rated fairly highly compared to pure motion capture and VMagicMirror, which is a keyboard-based software.
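(To make the concept concrete, here's a toy, rule-based version of persona-driven gestures: extra animations triggered from speech loudness and a configurable persona parameter. This is a made-up illustration of the idea, not AlterEcho's actual, unavailable implementation.)

```python
# Toy persona-parameter gesture picker: on top of whatever motion capture
# provides, occasionally layer on a gesture chosen from speech loudness and a
# "confidence" setting. Names and thresholds are invented for this sketch.
import random

def choose_gesture(speech_loudness, persona_confidence, random_state=random):
    """Return an extra gesture name, or None, for the current moment."""
    if speech_loudness > 0.7:
        # Confident personas gesture big when speaking loudly; shy ones shrink.
        return "arm_wave" if persona_confidence > 0.5 else "look_away"
    if speech_loudness < 0.1 and random_state.random() < 0.05:
        return "idle_stretch"      # occasional idle animation while silent
    return None

print(choose_gesture(speech_loudness=0.9, persona_confidence=0.8))  # arm_wave
print(choose_gesture(speech_loudness=0.9, persona_confidence=0.2))  # look_away
```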
On the topic of new software for independent VTubers: while 2D models are modeled with Live2D, 3D models are not necessarily modeled with actual 3D modeling software like Blender, but rather with software like VRoid Studio, which is essentially a character customization screen with many sliders for incredibly unique customization, though the official stable release has only been out for a year. Currently, both 2D and 3D VTubers suffer from a noticeably homogeneous design and style that some say is reminiscent of Genshin Impact characters, whereas others argue is closer to Honkai Impact characters. Perhaps a much more unique, easily accessible VTuber avatar creator will be available for the next generation of VTubers. It's unlikely that it will ever break out of that anime niche anytime soon. You had your chance. And it's definitely not going to be whatever the Metaverse was supposed to be. But just like how Live2D models have been getting exceedingly creative, 3D models could branch off into as many styles as there are styles of anime, which has the opportunity to be aided by a movement towards more motion-capture-independent, animation-focused VTuber software.

In regards to the future of VTubing, there is another possibility that has been somewhat disregarded since the decline of Kizuna AI, and it has to do with the AI part of Kizuna AI. Not love. It's quite common for VTubers nowadays to come with their own lore - the avatar is actually a mythical beast or a magical being. Kizuna, the self-proclaimed first virtual YouTuber, of course had the backstory of being an artificial intelligence. This whole backstory stems from the original idea behind artificial intelligence: to mimic human intelligence. A neural network can learn how to find the features it needs to find to locate a facial landmark in an image, which could imply that, given the right conditions in training, it can learn the features of human behavior and produce content. And while in the case of Kizuna artificial intelligence was only used, at most, for landmark detection, there already exist machine learning models that write scripts, interact with humans, play games, and generate animations. There are even neural networks for singing synthesizers, such as SynthV, which has been mentioned to me by all five of its users. It seems not too far-fetched to just combine all of these to create a truly automated artificial intelligence virtual YouTuber.

However, we also know that all of these are just several independent abstract networks. The learning that these networks are doing isn't based off of experience or intuition or, arguably, even any logical structure. It is just a collection of shifting parameters and mathematical evaluations that knows something is correct because, in the case of supervised learning, we informed it that it was correct. A content creator about as sentient as the trendline generator in Microsoft Excel. We know this even without fully understanding sentience, because we know what sentience is not, and what it is not is a pattern recognition machine in a computational void. The actual algorithms of human intelligence may still be unknown, but because of the way that machine learning was developed, artificial intelligence isn't intelligent in the same way that humans are - but it can learn the features of human intelligence and reproduce them to an incredible degree. It's no longer uncommon for humans to be deceived by AI-generated conversations or art or music, though perhaps, given the nature of parasocial relationships and corporate media, we simply wouldn't mind. Ultimately, whether a content creator who is not sentient - who we know is not sentient - but can imitate sentience perfectly will be in or against our best interests will be up to us as living, breathing humans. Until then though, there are at least a few more Hololive generations before we have to make such a decision, so we might as well enjoy what A.I. has to offer and look forward to the future of this still-young VTuber era without having to really worry about any unforeseeable threats of other intelligences.

"Yahoo"
Info
Channel: Junferno
Views: 257,298
Id: TKBo20RNNrE
Length: 27min 40sec (1660 seconds)
Published: Sat Aug 06 2022