"Paper Session 3 - Input & Interaction"

Captions
all around how we interact with mixed reality systems: a taxonomy for AR interaction derived from the work of the past five years, a user interface for one-handed discrete and continuous input in mid-air, a calibration technique for aligning tracked hands in see-through AR, a reinforcement learning based method for enabling mid-air text entry, and finally a predictive performance model for interaction tasks. So this is very cool and I'm very much looking forward to it. The first presentation, on the taxonomy, is going to be given by Julia Hertel from the University of Hamburg in Germany, so please take it away.

Okay, hello, thanks for the introduction and welcome to my presentation about a taxonomy of interaction techniques for immersive augmented reality based on an iterative literature review. I'm Julia Hertel, and my co-authors and I are from the University of Hamburg in Germany. In the last years we have seen impactful developments in the field of augmented reality and sensor technologies. Nowadays it is possible to precisely control a robot's arm just with arm and hand movements; in this video we can see how even the robot arm's gripper can be easily controlled through virtual objects, while force feedback is visualized in AR. Such interactions, and others, are possible due to precise sensor technologies, especially self-contained, untethered headsets with various built-in sensors like head tracking, hand tracking, speech recognition, and even eye tracking, for example on the HoloLens 2. Due to these recent developments, novel interaction techniques have emerged, so our goal was to capture and structure the current state of research on interaction techniques. We focused on immersive augmented reality, including head-worn displays and projection-based AR, in a time frame from 2016 to today. To structure the current state of research we developed a taxonomy, which helps to provide an overview of existing interaction techniques as well as a common ground for discussion by formalizing the scope and introducing consistent terminology. To develop the taxonomy we used an iterative method: instead of analyzing all existing interaction techniques at once, this method allows us to include batches of interaction techniques consecutively and revise our taxonomy in each iteration until ending conditions are met. For each iteration the researchers can choose between two approaches: in the empirical-to-conceptual approach, existing objects, in this case interaction techniques, are analyzed, while the conceptual-to-empirical approach is based on previous work and the researchers' knowledge. Since we wanted to investigate the current state of research, we mainly focused on empirical-to-conceptual iterations. For these we performed a literature review to identify interaction techniques and include them in our taxonomy. For the literature review we searched for "interact" and "augmented reality" or "mixed reality" in the IEEE and ACM databases. Initially we found 2100 results. To break them down into batches of interaction techniques to include iteratively, we grouped the papers by the venues they were published at. For each iteration we focused on one venue, screened these papers, extracted the presented interaction techniques, and analyzed them to create and revise our taxonomy. In total we included 44 papers from six venues. This slide presents our resulting taxonomy. Across all analyzed interaction techniques we identified two common properties, which we included as dimensions in our taxonomy: task and modality. For both we identified multiple categories. Based on this taxonomy, we define an interaction technique by exactly one task, which will be accomplished, and one or more modalities, which are used to accomplish the task.
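To make that two-dimensional classification concrete, here is a minimal sketch of how such a record could be encoded; the enum members simply mirror the task and modality categories named in the talk, and the class and field names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Task(Enum):
    CREATION = auto()                 # activation, 2D drawing, 3D modelling
    SELECTION = auto()                # 2D or 3D selection
    GEOMETRIC_MANIPULATION = auto()   # translation, rotation, scale
    ABSTRACT_MANIPULATION = auto()    # discrete or continuous
    TEXT_INPUT = auto()

class Modality(Enum):
    TACTILE = auto()                  # touch, generic input devices, tangibles
    GESTURE = auto()                  # hand, face, foot
    VOICE = auto()
    GAZE = auto()                     # eye gaze or head gaze
    BCI = auto()                      # brain-computer interface

@dataclass(frozen=True)
class InteractionTechnique:
    """One technique: exactly one task, one or more modalities."""
    name: str
    task: Task
    modalities: frozenset[Modality]

    def __post_init__(self):
        if not self.modalities:
            raise ValueError("at least one modality is required")

# example: head gaze for pointing combined with a hand gesture for confirmation
pointing_select = InteractionTechnique(
    name="gaze-point, pinch-confirm",
    task=Task.SELECTION,
    modalities=frozenset({Modality.GAZE, Modality.GESTURE}),
)
```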
In the following I want to present these categories, starting with the tasks. One basic task is creation, which is used to make a virtual object appear. We divide this task into subcategories based on the complexity of the provided information. An activation is performed when the user does not define any properties of the object; it can either be used to let a hidden object appear or to create a new but predefined one. 2D drawing is used whenever the texture of an object is modified, while 3D modelling also allows the user to modify the geometry of an object. The task we encountered the most is selection. We found that the complexity of the required interaction technique mainly differs by the arrangement of the selectable objects: for 2D selection all objects are placed on a planar surface, for example a grid-like UI, while 3D selection is needed if the objects are spatially arranged. Geometric manipulations cover translation, rotation and scale; they are very common and represent nearly half of the interaction techniques we encountered. Abstract manipulation as a task is not that established in previous taxonomies. In our taxonomy, abstract manipulation includes all interactions which are not directly related to visualized objects. We found two common use cases: it can be used to change the internal system state, for example to load content or to switch between modes, or it can be used to interact with physical objects existing in the real world, like Internet of Things devices. Based on the input type we identify two subcategories: discrete and continuous abstract manipulations. Finally, we included text input as a task, which covers all techniques that enable a user to enter characters, words or sentences. To perform these tasks there is a variety of modalities which can be applied; in the following I present the modality categories of our taxonomy. To cover all modalities where the user physically touches something in order to interact with the system, we introduce the category of tactile interaction. We identify three subcategories: touch, generic input devices, and tangibles. Touch includes all interactions where the user's hand physically touches a surface. Generic input devices cover all kinds of controllers, mice, styluses, bands - devices without a single designated function. In contrast, tangibles keep a specific function over a period of time and can represent a specific virtual object. Gestures cover interaction modalities where movements of the user's body are used as input. Hand gestures are the most commonly used ones and are seen as a natural and intuitive way to provide input; we also found one occurrence of a face gesture and one of a foot gesture. Furthermore, we encountered the modality of voice, which can be used in AR applications since modern devices and platforms often have speech recognition capabilities. The modality of gaze is used to capture the direction the user is looking at, by tracking either the eye or the head orientation. Even though eye gaze can be considered more natural, head gaze is currently used more often. Authors mention different reasons: head gaze is currently more affordable, since every headset can track the head rotation but not every headset has built-in eye tracking, and head gaze was also found to be more accurate. Brain-computer interfaces convert brain activity into commands to control the system.
Since they can be used hands-free and don't require movement of any body parts, they are particularly interesting; however, in our literature review we only found one occurrence. This table presents an overview of the frequencies of combinations of tasks and modalities we found. The most frequent tasks are geometric manipulations, especially translations, and gesture is the most common modality; the combination of using hand gestures for translation is the most common interaction technique we found. The table also shows which task-modality combinations we did not identify in the scope of our literature review, which could give us hints about possible research gaps. We were also interested in how modalities are combined in multimodal interaction techniques. This figure shows the frequencies of all identified combinations. It shows us, for example, that hand gestures are paired with every other modality except for brain-computer interfaces, and that hand gesture plus voice and hand gesture plus head gaze are particularly common. These combinations especially appear in selection tasks: a common pattern for selection is to apply a pointing technique like gaze or a hand gesture to select an object, plus a confirming action by a hand gesture or voice. While developing the taxonomy we also analyzed emerging trends in interaction techniques for AR, and today I would like to present our most interesting observations. In contrast to virtual reality, the user's environment can be included in augmented reality settings. We identified two directions in which AR and the real world can interact with each other. AR can be used to manipulate the real world, which we introduced as abstract manipulation; here AR can provide a user interface, visual feedback, or even a preview of planned actions. Also, real-world objects can be used as input: a physical object can either act as a tangible and directly represent a virtual object, or provide passive haptic feedback. We also investigated which AR technologies are currently used and observed a clear trend towards optical see-through AR, especially in the context of untethered mobile headsets. Currently mentioned disadvantages of optical see-through are the limited field of view as well as inferior render quality, so with further technological improvements this prevalence could increase even further. For tangibles we noticed a trend towards everyday objects: researchers include a variety of common objects in their AR applications, like photo albums, spray bottles, wallets, smartphones - things a user typically finds in their surroundings. Such interaction techniques mostly aim to facilitate the interaction with AR systems by performing everyday tasks without the need for additional devices. We also observed the goal of developing subtle interactions. This can be achieved by using minimal movements like facial expressions, or by transforming the action space of hand movements to be smaller and more unobtrusive, for example by performing gestures at the waist or wrist, or by directly touching one's own skin. We also paid attention to which application fields are considered in the papers. The three most common fields we found are interactive data visualization, smart home, and the interaction with robots and drones. Interactive data visualization especially uses the property of AR to easily view virtual objects from different perspectives, either by anchoring visualizations to a tangible and naturally rotating and translating them, or by placing them in the real world and letting the user move around.
Smart home as well as robot and drone use cases take advantage of the fact that both the real and the virtual world can be seen simultaneously. So, in this work we performed a literature review and developed a taxonomy of interaction techniques for immersive augmented reality. We identified and defined two dimensions, task and modality. In the future this taxonomy can act as a guideline for developers and researchers to design new interaction techniques and also to identify emerging trends. Thanks for your attention.

Okay, thank you very much Julia for this great presentation. This is the first talk in the session, so just a reminder: you can ask questions either here through the chat or you can post them on Discord, and then they will be summarized and I can read them out. I'm hoping to see some questions, but maybe to tide us over: when I looked at the talk and at the paper, I found it really interesting that you looked at so many works from the past five years, and I'm wondering, as a summary and to help us digest this work better - when you looked through the papers, what are the challenges that you extracted from them? Are there any patterns that the papers agreed or disagreed on, where we should focus more going forward to unleash even more potential?

I think a very interesting trend is the push towards more unobtrusive interactions. In general this should bring AR more into everyday contexts and make it easier from a technological perspective, so that you don't need a lot of additional sensors, and also from a social perspective, to make it more socially acceptable. I think that's a really important and interesting challenge today.

So you focused a lot on interaction, which makes a lot of sense. If we consider that the evolution of technology is going to continue as we've seen in the past couple of years - things getting more powerful, smaller, reduced in form factor - what does that mean for interaction? How do we keep pace, how do we align ourselves with the disappearance of devices?

I think interactions will become more subtle and more passive. A user won't focus so much on the specific interaction anymore, and developers and researchers will have to make that possible somehow.

I see, so a little bit going away from explicit input, more towards observation and interpretation.

Yes, away from standardized, generic input devices like controllers.

The second thing: I really liked the analysis of the combination of modalities - there's always magic that you find by connecting two things or more. What are some examples of untapped potential that you see for future work to explore here?

That's a good question. I think there are some modalities which are really uncommon today - we had a few modalities where we only found one occurrence, like brain-computer interfaces and facial expressions - and I think they have a lot of potential on their own, but also combined with other modalities they can really do a lot of things. Let's see what the future brings.

Thank you. We have a couple of audience questions now. The first one is: in the robot experiment, was the only way to track the movements of the hand the VR tool? Yeah...
Okay, maybe we want to take that one offline through Discord. The second one is: great talk - did you also consider interaction techniques for handheld-based AR, so using phones for AR?

No, not in this paper, because we wanted to see what developers and researchers do when the hands are basically not occupied. For head-worn and projection-based AR, there are more possibilities for what you can do as a researcher and developer, because the user does not have to hold something in their hand; we wanted to explore what is possible then and what is currently used in this context. But we actually did see a few things about combining a mobile device like a smartphone with a head-mounted display, which was really interesting, as a second device.

Awesome, cool. So that was the first talk - thank you so much again, Julia, and the team, for this really nice structuring of the work of the past five years. We're going to move on to the second talk now, which is "HPUI: Hand Proximate User Interfaces for One-Handed Interactions on Head-Mounted Displays", a collaboration between the University of Manitoba and Huawei Canada. Shariff Faleel is going to be presenting the paper.

Yeah, thank you Christian for the introduction - that was a great talk from Julia, and I think my presentation will be kind of an extension of the discussion we just had. So I'm Shariff Faleel, a PhD student at the University of Manitoba, supervised by Professor Irani, and I'm presenting our paper on HPUI, what we call hand proximate user interfaces, for one-handed interactions on head-mounted displays. Moving into the presentation: as Julia discussed, head-mounted displays for VR and AR are becoming the next possible iteration of mobile computing and entertainment. There are a lot of ways you could interact with these devices, and some of the most common interaction techniques, again as discussed in the previous presentation, are mid-air interactions or controllers. One problem with these types of interactions is that they tend to be fatiguing, and another problem is that if you consider using these devices in public, they are not necessarily socially acceptable - you don't want to be standing in a coffee shop waving your hand around, right? And then you have smartphones, arguably the most common mobile computing platform we have. Inspired by the single-handed interactions that you see with smartphones, we are proposing hand proximate user interfaces. The way we imagine it is like replacing the physical device with a virtual interface, which could work with both augmented reality and virtual reality; in our implementation we are specifically demoing this in virtual reality, and we are particularly focused on single-handed thumb-to-finger interactions. So a hand proximate user interface is basically UI elements that are displayed on and around the hand. As someone asked the previous presenter, one recent trend is using smartphones to augment interactions with head-mounted displays, and what we are proposing is very similar to that.
One advantage we have with HPUI is the fact that you're not carrying a physical device all the time, so you can easily transition between interacting with these menus on your HPUI and interacting with the world. The idea of displaying UI elements on your hand and interacting with them is not new, but not many previous works have actually considered single-handed interactions while also displaying and interacting with elements, so that is where our paper is specific. This is a simple demo of HPUI. Here you have an interior design application where you place furniture. Traditionally you'd have very long menus that you would scroll through with, say, a controller; here you have an HPUI which you can use to scroll through these long menus, which can ideally be a lot less fatiguing than using a controller. And say, for example, you want to interface with your smartphone - that also could be done: if you want to make a call, you can interact with it in the same way you would interact with the smartphone. Here you see the user selecting elements with their dominant hand using HPUI and placing them with their non-dominant hand, and they can iterate through this really fast because they're using both hands. Then you have the user selecting with the non-dominant hand and using the HPUI displayed on the dominant hand to reorient, and you see a map displayed on the HPUI that the user can rapidly move around and repeat this task with again and again. So that's a simple demonstration of what HPUI can do. First of all, we look at the design space of hand proximate user interfaces, and we identify five different factors. The first one is the output space: as I mentioned previously, we are particularly looking at single-handed thumb-to-finger interactions, so in that context we can divide the output space into interactive output space and non-interactive output space. When you consider the interactive, or input, part of the output space, you can further divide it into on-fingers - that is, places on or around the hand where you can actually have tactile feedback - and off-fingers. Then you have where you'd be displaying these elements: around the hand, between the fingers, on the palm, on the finger, and above the finger. Given that when you're doing single-handed interactions your hand is moving and deforming a lot, we also have to consider where a displayed element is going to be anchored. The X marks that you see on these images are some of the places where these elements can be anchored; generally we categorize them into palm, individual phalanx, whole finger, or multiple fingers. And then, again considering the dexterity and how the hand deforms, you can divide the workspace into two as well: say, consider the home screen of a mobile phone - this has elements that can be broken into individual UI components, so we consider that a discrete workspace; and then you have workspaces with elements which cannot be broken down into discrete elements, think maps or images, so that would be continuous.
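As a rough illustration of the anchoring idea described above (not the authors' implementation), here is a minimal sketch of attaching a UI element to a tracked phalanx and re-posing it each frame; the joint names and the `get_joint_pose` callback are hypothetical stand-ins for whatever hand-tracking API is used.

```python
import numpy as np

def pose_to_matrix(position, rotation):
    """Build a 4x4 transform from a position (3,) and a rotation matrix (3,3)."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

class AnchoredElement:
    """A UI element anchored to one phalanx with a fixed local offset."""
    def __init__(self, anchor_joint, local_offset):
        self.anchor_joint = anchor_joint      # e.g. "index_proximal"
        self.local_offset = local_offset      # 4x4 offset in the joint's frame

    def world_pose(self, get_joint_pose):
        # get_joint_pose is a hypothetical callback returning (position, rotation)
        # for a named joint from the hand tracker, in world coordinates.
        position, rotation = get_joint_pose(self.anchor_joint)
        return pose_to_matrix(position, rotation) @ self.local_offset

# example: a button sitting 1 cm above the middle phalanx of the index finger
button = AnchoredElement(
    anchor_joint="index_intermediate",
    local_offset=pose_to_matrix(np.array([0.0, 0.01, 0.0]), np.eye(3)),
)
```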
Consequently, we ran two studies: in the first study we compare direct input, as afforded by HPUI, to indirect input, and then we explore how someone would interact using HPUI on a continuous, deformable surface. We ran our studies using the Vicon platform for accurate hand tracking and an Oculus headset for display and interaction. In the first study we compare direct with indirect input, so we have two different input modes. The direct input mode has the target elements displayed on a hand model that is mapped to the motions of the user's hand; the task for this study is a target selection task where we compare completion times. The indirect input mode has the target elements displayed on a static hand that moves with the user's head, and the user selects the targets with indirect thumb-to-finger interactions. We also consider the target locations on the hand, on-finger and off-finger. Overall we observe that interacting with targets that are directly placed on the hand is much faster. We also asked the participants to rate the different target locations; similar to some previous works that have looked at the functional workspace of the hand, the ratings we observed are very close to what has been seen in those studies. For the next study we looked at continuous interaction, the continuous interface. These are interfaces, as I described before, that cannot be broken down into individual discrete elements, like a map or an image, and we are considering what we refer to as a deformable interface - an interface that moves as you move the fingers. An advantage of the deformable user interface is that it allows you to reach more regions compared to a planar surface, and it also provides better potential for tactile feedback. For the second study we have two different tasks: the first is a target selection task and the second is a dragging task. Another thing we did: from the previous study we observed that the region around the pinky finger is not the most comfortable to interact with - as a matter of fact, we had to exclude one participant because they couldn't interact with the pinky finger - so in the second study we also wanted to see whether the space above the index finger, that is the space between the index finger and the thumb, can also be used to display and interact with elements. Overall, what we observe is that the regions of motion that are closer to the thumb perform much faster; in the second study we were again comparing completion times. In conclusion, inspired by the results that we saw, we implemented a set of simple demos on the Oculus headset using the Oculus hand tracking - the clips that you see on this slide use just the Oculus hand tracking. These are simple demonstrations that show the utility and advantages of using an HPUI: basically providing users with an interface that is very similar to how they would interact with a smartphone, and which can be used in both virtual reality and potentially augmented reality. Thank you, that's the end of my presentation.

Awesome, thank you very much. So I noticed that people are starting to use the Discord channel - please keep doing that and post the questions in their respective channels, or here on Zoom.
Maybe I'll kick it off again, unless there are any... yep, okay. Thanks Shariff for the great talk. This is also super cool because I think it's quite enabling for in-situ situations going forward, where we might be able to make use of these mixed reality interfaces for interacting with content or even executing productivity tasks, and I think your interaction technique is really promising because it's a quick alternative to the direct interaction in mid-air that we're seeing a lot these days. I noticed that in your examples the hand was always within the field of view of the user, to guide the input and to select from a set of options. I think there's potential here to extend this to eyes-free operation, where you're working with or on the content while controlling it from a comfortable area - is this something that you thought about too?

Yeah, that is something that we plan on looking at in future work; that is definitely one factor we want to consider. One problem is that since we are using the hand tracking of, say, the Oculus headset, the hand always has to be in the field of view, which is not always ideal - you have to actually be looking down at it. If you're using, for example, the HoloLens, you have to be looking at the hand, which is not very comfortable. So we are actually looking at different ways of interacting indirectly as well.

Yeah, there's also the potential to basically learn that. I noticed that you had the imaginary phone in there, which touches on the aspect of transfer learning, where people repeat these tasks over and over, and given that you give them additional visual guidance for the task, this is a really great idea - also connected to the previous talk, as you mentioned, for more subtle interaction, to interact with the space in front of you in everyday scenarios. We have a question here from Mark: how would your system compare to a real smartphone? Do you plan to compare this in the future?

Oh yes, in the future at some point, yes. Right now, what we realized is that there are a lot of small problems we have to solve first, and most of them are around the fact that you have a very dynamic surface - the hand is not as static as a smartphone - so directly comparing them is not possible yet. That is one thing we realized early on. So after a few iterations, at some point, maybe yes. Does that make sense?

Yeah, yeah. Maybe as a follow-up, and somewhat related to the first question: HPUI also seems like a nice solution for tasks that are a bit longer in execution, where you have to spend more time on them, and then of course the hand in the field of view - I mean, you touched on this a bit - can become heavy. This is probably not something that you considered so far in the design of the techniques, but I also remember Pourang's paper from a couple of years ago that investigated the fatigue incurred by executing such tasks. Is this also something that you're factoring in, maybe in another iteration?

Yeah, yes - bringing your hand closer to you, that was the inspiration for using smartphones: people use smartphones all day, all the time.
So we are building on smartphone interactions, and because of the field-of-view limitations your hand has to be a little further up; in a future iteration we would like to have it move closer to you so that it's more comfortable.

We have a question from Joseph: I see you use a glove to recognize the hand - do you think it's possible to do the same thing recognizing the hand with a hand-tracking algorithm?

Yeah, so the glove was actually used for the studies. The problem with hand-tracking algorithms is that they are not very precise as they are right now; even Oculus, for example, recommends against using their hand tracking for things like placing elements on your fingers, because the tracking is not accurate enough. That is why we used the Vicon system to run the studies. But the demos that you saw were based on the hand tracking on the Oculus, so even with the problems it has, it seems to work okay for now. We also had another session this morning which looked at hand tracking - hand pose estimation and hand tracking are moving at a really fast pace, so I'm assuming very soon we will have accurate hand tracking which could be used more reliably for HPUI.

Very cool - also a great opportunity to bring this out and about and take it with you wherever you go. Thank you very much. I think we should keep it at that - let's thank Shariff again for the great presentation of his work and the work of his team. Next up we have the paper "Rotational Constraint Optical See-Through Headset Calibration with Bare Hand Alignment", and it's going to be presented by Xue Hu from Imperial College London. Take it away.

Okay, I'm going to share my screen... okay. Thanks for the introduction. Today I'm happy to present "Rotational Constraint Optical See-Through Headset Calibration with Bare Hand Alignment". This work is a collaboration between Imperial College London and the University of Pisa; here is a warm greeting from all the authors. Our method uses a commercial RGBD camera to markerlessly track the user's bare hand and exploits the tracked hand point cloud to calibrate the OST AR headset in a user-centric way. Our method is efficient, robust, and implementable in commercial AR headsets; in this paper we use the Microsoft HoloLens 1 as an example. The OST HMD has proven useful for surgical tasks: by projecting virtual guidance in front of the surgeon's eyes while preserving their natural view of the patient, the surgical intervention can be completed in an immersive and safe way. To ensure accuracy, the projected AR content should be locationally consistent with the patient's anatomy from the surgeon's perspective. This leads to the topic of OST calibration, which aims to correct the AR headset's rendering pipeline according to the user's perception; since the first-person observation is not recordable, we need to solve the eye-display projection. State-of-the-art OST calibration methods can be divided into two categories. The alignment-free methods use a calibrated eye tracker to track the user's eyeball; they require no user involvement but depend on dedicated hardware and complicated extrinsic calibration. Furthermore, the eye trackers track the center of the eyeball instead of the actual nodal point, which compromises calibration accuracy.
By contrast, the alignment-based methods require the user to collect virtual-to-real correspondences by manual alignment, from which the authentic viewpoint-related information can be decoded for projection correction. Despite the user involvement, they are more implementable on commercial AR headsets due to their lower dependency on hardware; therefore, in this work we focus on the alignment-based methods. Current alignment-based methods usually require the user to align one or a few landmarks on a calibration tool to displayed virtual cursors. Such an alignment scheme has several limitations. First, a special calibration tool with artificial landmarks is usually required, which makes it impossible to calibrate anytime and anywhere. Second, the required manual alignment is usually point-wise to pre-designed cursor locations, so users may need to move a lot for the alignment. Third, only one or a few correspondences can be collected per alignment, so repeated alignments are required to collect enough correspondences, which increases the user's workload. Finally, the calibration accuracy is heavily dependent on the user-performed alignment: poor alignment by an inexperienced user leads to poor calibration accuracy. To solve these problems, we propose a user-centric calibration scheme that directly exploits the user's bare hand. The required alignment is object-wise, to a hand cursor generated in a user-specific way; each alignment provides thousands of unpaired correspondences, so a single alignment can potentially ensure reliable calibration. Finally, our method can constrain the rotational misalignment produced by inexperienced users. Let's start with an overview. Our prototype system consists of a commercial AR headset with an anchored RGBD camera. At frame t_A, the user presents his or her hand to the system; the 3D hand points are automatically segmented from the depth camera recording and sent to the AR headset to generate a customized hand contour. Since the OST display is not calibrated, the user will notice a misalignment between the cursor and the physical hand, as shown on the screen; he or she should then translate the hand for proper alignment. Once the alignment is confirmed, at frame t_B, the new 3D hand points are segmented, and the two point clouds are used to optimize the eye-related parameters for projection correction. Next we explain the details of each step. Our method is based on a homography-corrected off-axis eye-display model from the literature. Assuming an on-axis camera-display system, the eye-display projection can be expressed using equation one, where E_K is the hardware-determined on-axis projection intrinsic, H is a homography-related matrix, and E is an extrinsics-related matrix. Starting from an offline calibration at an arbitrary viewpoint, the new corrected projection relation can be expressed using equation two, where we let H_1 = H_0 * U and E_1 = Q * E_0. The product U * Q can be written as equation three, where phi_1, phi_2, phi_3 are the 3D distances between the old and the new viewpoint in the eye-display system, and phi_4 is the original eye-display depth, which can be calibrated offline. So overall, the goal is to find phi_1 to phi_3 so that the off-axis eye projection P can be corrected using the expression in equation four. For the markerless RGBD hand tracking, we trained a MobileNet-v1-based SSD network on the Oxford hand dataset to locate the region of interest for the hand in the captured RGB frames. The aligned depth frames are cropped by the predicted ROI and filtered by the depth camera's working threshold, resulting in the tracked 3D hand points.
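As a rough sketch of this segmentation step (not the authors' code), the following shows how a predicted ROI and a depth working range could turn an aligned depth frame into a 3D hand point cloud; the intrinsics, thresholds, and frame are placeholder values.

```python
import numpy as np

def depth_roi_to_points(depth, roi, fx, fy, cx, cy, z_min=0.2, z_max=1.0):
    """Crop an aligned depth frame (meters) to a hand ROI, filter by the
    camera's working range, and back-project the remaining pixels to 3D."""
    x0, y0, x1, y1 = roi                      # ROI predicted by the hand detector
    crop = depth[y0:y1, x0:x1]
    vs, us = np.nonzero((crop > z_min) & (crop < z_max))
    z = crop[vs, us]
    u = us + x0
    v = vs + y0
    x = (u - cx) * z / fx                     # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)        # (N, 3) hand points

# example with placeholder intrinsics and a synthetic frame
depth = np.full((480, 640), 0.5)
points = depth_roi_to_points(depth, roi=(200, 150, 400, 350),
                             fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```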
We allow the user to generate a customized cursor with their own hand at a self-selected location. To this aim, the cropped depth frames are converted to binary masks and processed to generate a contour mask, which is displayed as a reference cursor by the headset. Finally, our goal is to find phi_1 to phi_3. At the cursor generation frame t_A and the alignment frame t_B we have equation one; since the hand movement in depth is negligible compared to the focal distance, we can write equation two. The task can thus be converted to a registration problem, but with the transformation matrix M in a unique expression which is determined by phi_1 to phi_3 only. Since the correspondences are unknown, we developed an optimization method with iterative closest point searching to minimize the point-to-point registration error. In fact, since the terms involving phi_1 to phi_3 and phi_4 are very small, the rotational part of M is nearly the identity; that means a perfect hand alignment should exhibit little relative rotation during the hand movement. We call the method rotation-constrained ICP, or RC-ICP in short.
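The paper's optimizer estimates the eye-related parameters directly, but the core idea of a correspondence-free registration whose rotation is held fixed can be sketched as a translation-only ICP loop. This is a simplified illustration under that assumption, using SciPy's KD-tree for nearest-neighbour search, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def translation_only_icp(source, target, iters=30, tol=1e-6):
    """Register source onto target allowing translation only (rotation held at
    identity), mirroring the rotation-constrained assumption: a well-aligned
    hand should move without rotating between the two captured frames."""
    t = np.zeros(3)
    tree = cKDTree(target)
    for _ in range(iters):
        moved = source + t
        _, idx = tree.query(moved)            # closest-point correspondences
        delta = (target[idx] - moved).mean(axis=0)
        t += delta
        if np.linalg.norm(delta) < tol:
            break
    return t                                  # estimated hand translation

# example: recover a known shift between two synthetic "hand" point clouds
rng = np.random.default_rng(0)
hand_a = rng.normal(size=(500, 3))
hand_b = hand_a + np.array([0.05, -0.02, 0.01])
print(translation_only_icp(hand_a, hand_b))
```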
Here you can see the registration result between two hands. We ran a simulation test to show the robustness of our RC-ICP: with up to nine degrees of misalignment, the inaccuracy in the estimated phi_1 to phi_3 is less than a four-millimeter threshold chosen from the literature. We designed the UI based on this observation, so on the right you can see that when the relative rotation is larger than nine degrees during the user alignment, the tracked hand surface changes color as a warning. This is the pairwise recording from the RGBD camera used for hand tracking and from the eye-replacement camera used to record the AR-overlaid scene. For the evaluation, we first evaluated the calibration accuracy by placing an eye-replacement camera in the headset's eye box. This allows us to objectively record what the actual user can observe when the OST display is on. An ArUco marker cube is used as the AR target. We did the pairwise capture with the AR display on and off; the ArUco cube poses were detected individually from each frame, and the difference between the two poses was recorded as the calibration error, shown in the table. With a single alignment our method can achieve 1.37 millimeters of positional error and 1.76 degrees of rotational error. Then we involved three users to evaluate the reprojection error after each user performed the calibration. Users were asked to manually adjust the cursor location to align it with the center of an ArUco marker; the cursor location is logged as the ground-truth pixel location. We evaluate the reprojection error at the four corners of the ArUco marker, indicated by the red cursors. We achieve 10.79 arcmin, or 7.71 pixels, of reprojection error, which is better than the state-of-the-art finger-based calibrations. That's the end of my presentation - questions are welcome.

All right, thank you Xue and collaborators. As always, questions please on Discord or here in the chat. I'll start off with the first one. I have to say, that was a really timely topic, and I think everyone who's used these headset systems understands the challenge that you set out to address here - sometimes it can even be a bit confusing when you use it and there's an offset between the hand and where the system models it. I think you presented a great technical solution to this, and I'm wondering whether you've investigated some of the questions that also relate to the user's perception. Do we have any sense of the impact of the problem on using a UI in such a system?

The calibration accuracy, you mean? Yes, probably that's one of the research directions. Actually, when we designed the UI for the hand alignment, we did care about this problem. Initially we displayed all the tracked hand points on the screen directly, but we realized that when we asked the user to do the alignment, they could not focus on the contour for a perfect alignment, because you basically have a 3D object in front of you and you have to align another surface with that. So finally we decided to convert the raw 3D hand points to the contour, so that users can judge the overlay of the hand contour and the perceived contour, a bit like aiming in a shooting game. For the current implementation - as I mentioned in the related work - so far all these works require you to do multi-point alignment: for example, you have a 3D ArUco cube and you have three or four corner points displayed to you, and you have to simultaneously align those corners with the actual 3D cube. This requires a very heavy mental load; although the alignment task itself is quite easy because it's point-to-point, you have to handle the multiple point alignments simultaneously, so it's not mentally easy work. But if there are any interesting or more advanced display techniques, we are happy to hear such suggestions in the future.

I see, yeah, that makes sense. Are you thinking about how to alleviate some of these things and make it less complex for users going forward? Perhaps this is something you're working on right now and it's already in submission...

From our own experience, we can say that displaying a cursor and then aligning to it is the current standard, and it doesn't require too much complicated understanding, so it's quite intuitive and straightforward. More UI probably means clearer instructions for users at each step, but probably not for the alignment step itself.

It's a really interesting question, given you mentioned it's intuitive - seeing the cursor. I think this is something that technology has almost conditioned us to find intuitive, because if you really think about it, you convey to the system how you want to use it, it tells you what it understands, and then you correct based on what it displays that it thinks you did - which is not as intuitive as it should be, where you just do something directly. So I think it would be really interesting to also get a sense of how people perceive this, and then maybe run this with a couple of people, try it out, and see what the impact of this difference is and how the technology you built to solve this problem makes things better from a user's perspective, especially with regard to...

That's also one of the motivations why we invented this system: those previous methods usually use pre-designed cursor locations, which means that users cannot know the pre-designed locations but have to control everything by themselves to navigate to those cursor locations. This probably means a lot of space or uncomfortable alignment gestures, which increases a lot of instability during the alignment.
That's why we use the relative transformation from a first generated cursor location to an updated one. Because the display misalignment can never be very big - otherwise the display could not show anything in the field of view - the correction typically means a relatively small translation of the hand, so users don't need to hold the hand for a long time or at an uncomfortable distance or gesture, and they use their own hands, so they can make whatever gesture they like.

That was a very nice wrap-up, thank you very much - thank you again, team. We're going to move on now. The next paper is a collaboration between ETH Zurich, the University of Cambridge and Saarland University, and the title is "Complex Interaction as Emergent Behavior: Simulating Mid-Air Text Entry using Reinforcement Learning", and Lorenz will present it now. Sorry, you seem to be having a few microphone issues.

I'll just switch over to my headset. All right, I hope you can hear me now?

Yeah, it works.

Excellent. Thanks for the introduction, Christian. Welcome to my presentation - sorry, the title is a bit of a mouthful: "Complex Interaction as Emergent Behavior: Simulating Mid-Air Virtual Keyboard Typing using Reinforcement Learning". As Christian already said, it's a collaboration between ETH Zurich and Cambridge, and if I'm a little slow today, apologies - I have a bit of a fever, so I might go a little over time; we'll see. All right, let's dive right in and have a look at the problem we are tackling in this paper. As virtual and augmented reality become more prevalent, we're going to have to deal with text entry in these environments, and as many of you will know, this is still very much an unsolved problem. I'm sure there are many great papers being presented on this very topic at this conference, but one thing that all of these novel approaches tend to have in common is that they need predictive techniques that isolate the user's intent from the noisy movements that can occur in 3D space. The development of these systems, as with most HCI systems, requires a lot of human testing: first you need to find a good design with preliminary testing, by trying things out yourself or maybe asking for feedback from some of your colleagues, and then you would usually conduct a user study to finally evaluate the performance of your system. The whole idea behind this paper was to replace some of the human input during the testing of these very explorative input methods with a user model. Of course, the idea of having a user model is about as old as HCI itself, but existing models are often highly specific to a particular interaction task, and - sorry, my slides seem to be progressing on their own - most importantly, what limits their applicability to novel input methods is that they require user data to be trained or calibrated. So what we are looking at is a system that does not rely on any pre-existing user data and instead relies on reinforcement learning. From this problem formulation we isolate two very specific research questions. The first one is: is it even generally possible to learn a complex interaction task such as typing using current reinforcement learning techniques? And secondly, should we succeed, does the resulting model exhibit human-like characteristics, and - almost more importantly - can we actually use it to generate useful data to pre-train other typing-related algorithms, such as key-hit detection, for example?
Let's take a step back for a moment and have a look at some related work. The work on the left looked at something quite similar to what we're doing: they predict fatigue in mid-air interactions using a biomechanical model and reinforcement learning. However, they consider only very simple targeted reaching movements, and there are other papers like this, and they often use highly simplified arm models, which also limits their realism. On the right we have a paper that's not directly related, but it shows how dexterous a task can potentially be learned with modern reinforcement learning techniques: in this particular paper from OpenAI, they trained a hand to rotate objects into a certain target orientation without dropping them. Let's look at the first building block of our model, which is the underlying arm model. Surprisingly, at least to me, there are very few biomechanically accurate models of the human arm. There is this one, which we are using, which was developed for the Stanford OpenSim simulator - but, for example, there is no model available for a female arm; this model approximates the average American male. Stanford OpenSim, as some of you may be familiar with, is a very powerful simulation suite for biomechanical models; it's very accurate and really the gold standard in the industry, but the problem is that it's also very slow and not very well supported. So we actually use a version of the model that was converted to the MuJoCo physics simulator, which is very fast and commonly used in reinforcement learning. Lastly, the original model has 50 muscle compartments - essentially, at least coming from a mechanical engineering background, basically linear actuators - which make the control problem practically impossible to solve using reinforcement learning. So we follow another common approach for reinforcement learning of biomechanical models: we remove the muscles and replace them with seven direct torque actuators at each of the joints of the arm, and then, to approximate the muscle behavior, you add what's called a muscle model that mimics the torque response of muscle-driven joints - sorry, the slides are progressing on their own from the narration I recorded for the video. For time reasons we won't really get into the specifics of this; the reinforcement learning problem is described in more detail in the paper, but just for completeness, we use Soft Actor-Critic, which has recently been applied with great success to a number of challenging control tasks. Now that we actually have a model, how do we train it? The most naive approach would be to just give a reward if the correct key is hit; unfortunately, we found that this doesn't converge. So what we ended up with is a two-stage training regime: at first we train with a position-based reward, meaning the reward is given as long as the target key is reached within a certain time limit, even if other keys are hit along the way, and then, once that converges, we switch over to a hit-based reward, where we only give the positive reward if the correct key is pressed, and if any other key is hit along the way the model gets a negative reward.
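To make the two-stage reward concrete, here is a minimal sketch of how such reward functions might look; the function signatures, the time limit, the target radius, and the reward magnitudes are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Stage 1: position-based reward - the agent is rewarded for bringing the
# fingertip within a small radius of the target key before a time limit,
# even if other keys are hit along the way.
def position_reward(fingertip_pos, target_key_pos, t, time_limit=1.0, radius=0.01):
    dist = np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(target_key_pos))
    return 1.0 if dist < radius and t <= time_limit else 0.0

# Stage 2: hit-based reward - only a press of the correct key is rewarded;
# pressing any other key is penalized.
def hit_reward(pressed_key, target_key):
    if pressed_key is None:
        return 0.0
    return 1.0 if pressed_key == target_key else -1.0

# Training would start with position_reward and, once the policy converges,
# switch the environment over to hit_reward.
```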
Before we have a look at the trained model, I would just like to reiterate two things: this model was not trained using any user data - the observed behavior arises fully from a simple cost function that minimizes movement time - and also note that posture, comfort and fatigue are currently not explicitly modeled; that's something we've left for future work. So here is our model, typing away. As you'll notice, it's only one arm; that is because the upper-extremity model that we're using only contains a right arm. But since typing is, at least for most people, very much a two-handed or bimanual process, we came up with a little trick to simulate bimanual typing with two right arms: we take two right arms but let one of them type on a flipped keyboard, which is equivalent to flipping the arm, which turns the right arm into a left arm, and then we interleave their movements. Once again, more details on this process can be found in the paper. We then evaluate our model on a benchmark user study conducted by my co-author John Dudley about two years ago. They had 24 participants and very detailed motion capture data for both mid-air and surface-aligned typing, which made it perfect for our application. So let's look at the actual alignment with user behavior. On the left we have a box plot of words per minute, and as you can see, our model is quite well aligned with what we observed in the user study - that is to say, users tend to be slightly faster in surface-aligned typing. This alignment is quite surprising, if you don't think about it too much, because, as I said, this model was never explicitly trained to approximate human behavior. So in a number of simple metrics, such as words per minute and character error rate, our model is quite well aligned with human behavior, and most importantly it is competitive with humans in terms of typing speed and accuracy. But - and this is inevitable for any honest evaluation of a user model - there are some shortcomings. For example, on the right we have a character error rate box plot, and our model predicts that the character error rate will be lower in surface-aligned typing, but that is not what we actually observe with real users. So the real question we would like to ask in our evaluation is not "does our model imitate human behavior perfectly?" - it can't, because it isn't explicitly trained to do so - but rather: does it correlate with humans well enough to be useful? The concrete application we're going to look at is hyperparameter selection. Let's say you were developing a new keyboard for virtual reality and you wanted to optimize the character error rate. Optimizing the underlying parameters that affect character error rate numerically through user studies is very time consuming, so the question we're asking is: can we use our model to select hyperparameters in an objective and reproducible way? We're going to evaluate this against two benchmarks. The first: usually, most parameters of an HCI system are never numerically optimized during development - for example, if you're busy developing a novel interface, you're not going to spend the time to optimize the size of each button in your interface; usually you would just pick something more or less arbitrarily. So to start out, we are just going to look at whether our model can outperform randomly chosen hyperparameters. Secondly, let's say you have an important parameter and you're really willing to spend a lot of time generating training data - in our specific case, in the user study, we would have been talking about roughly four hours of typing stimulus phrases - but even if you put in that time, at the end of the day you're only one person. So the question is: can our model even be more indicative of the general population than some individual users are?
As I said, the first scenario we're looking at is: what is the probability that the model chooses better hyperparameters than a randomly picked set of hyperparameters? The answer is 99.8 percent - in almost every case our model will outperform randomly guessing a hyperparameter. In the second scenario we would like to see whether our model can even choose better hyperparameters than some users can. The best we could really hope for here is about 50 percent, which would mean our model is exactly the average user. We don't quite reach that number yet, but given the context I would argue that 38 percent is already pretty good: it means that if you put in the four hours of effort it would take you to run this sort of optimization on yourself, you still have a 38 percent chance of being outperformed by our model, which would have cost you three minutes to set up, and then you just let it run while you go get lunch. The fact that we are so close to human performance is particularly impressive if you remember from the last slide how poorly random sampling performed. While these initial results are quite promising, there's still quite a bit of future work left to be done. The first major limitation is that our model is currently symmetric, so the first task would be to create a biomechanical model of the left arm and train an agent that can control both arms at the same time, which would be a challenge in and of itself but should be feasible. Secondly, the model exhibits relatively low variance when compared to a crowd of real users, as you can see in the words-per-minute and character-error-rate box plots. The solution here would be to add domain randomization, which is very commonly done in reinforcement learning, and actually leave it turned on at inference time; for example, we could parametrize the bone lengths of the underlying biomechanical model and similar properties. And lastly, our model currently types with only two fingers. From our initial impressions it seems unlikely that real ten-finger typing can be learned using just current reinforcement learning techniques; we would expect that you would need demonstrations or imitation learning to warm-start the reinforcement learning algorithm. To summarize, we've shown that the novel reinforcement learning approaches that have been released in the last couple of years really enable us to develop a whole new category of user models. These models do not rely on any pre-existing user data, which makes them perfect for applications such as the one that Xue presented just before me, where the approach is so novel that you don't really have much existing user data. And lastly, we showed that very complex interaction can emerge from simple cost functions, which is kind of fascinating to me. Thank you very much for listening, and if you have any questions I'd be happy to answer them.

Very cool, thank you so much for this presentation - very nice work. I really like the approach and the method of bringing computational methods to this problem, and trying to work around some of the issues that go along with text entry studies where you involve a lot of users; finding alternatives is definitely appreciated by a lot of us, I would assume. I don't see any questions so far, so I'm just going to ask one - I find this very intriguing, and I want to challenge it a little bit.
You did this for mid-air text entry, which is very reasonable and makes a lot of sense, and I really like what you demonstrated. I was wondering to what extent you think this would translate, or maybe generalize, to text input anywhere, perhaps even on surfaces, which is the task we are used to doing right now, sitting in front of a computer.

So, I perhaps wasn't too clear on this in the presentation because I was going through it quite quickly, but the model is actually capable of both surface-aligned typing and mid-air typing; sorry if that was unclear. It's two different models, but they are trained in exactly the same way and using exactly the same hyperparameters. In the surface-aligned case we are simulating something like a projected keyboard, basically touch-screen typing. If you are talking about a mechanical keyboard, I think it would be a bit more challenging, because you have rigid-body interactions that the model would have to cope with.

Yeah, I think even just for touch-screen typing or a projected keyboard on a surface that makes a lot of sense. You would then add new constraints, in terms of the support that you might have for the body itself, for the arms and so on, which confines the motion a little bit but also supports it. That becomes really interesting, as we heard in previous talks, with regard to fatigue and prolonged interaction in situ.

That is actually something we could definitely look at for future work: adding a muscle model that would allow us to predict fatigue. It's not something we covered in this work.

If you're still at ETH, just swing by, we have a lot of work going on in this area. So, we have a comment from Sherif: "This is such a cool approach, awesome presentation. Would it be possible to model human biases as well, particularly with novel interaction techniques? For example, on flat surfaces, prior work has shown users rest less on the touch surface when not performing a keystroke, probably because of prior experience with touch surfaces."

So, explicitly adding that to the model would not work, and I think it would also go against the whole logic of this approach. But to be honest, in this case of a touch surface, and I'm going to extrapolate a bit from the question: you are thinking that if you have a novel system that does perfect palm rejection, users will still not rest their palms, because historically they have been used to systems that do not have good palm rejection. Incorporating something like that exposes one of the shortcomings of an approach like this: we don't have any fine-grained control over the model. We could explicitly add a cost term to simulate our input device as if it didn't have good palm rejection, but that would sort of be going against the whole idea of this approach.

Your answer makes perfect sense, but it's a really good question, because it reflects some of the learned behavior of people in response to how technology is, or had traditionally been, implemented, which is not necessarily the point from which you start. There is an influence on people that changes their behavior to accommodate how we implement technology, and to make the model "perfect", quote-unquote, you would also have to account for that in the method.
Yeah, so you could definitely do that. I mean, this is really fascinating; it's also one of the first bigger works I've done with reinforcement learning, and working with it really is a little bit like working with a cat. So if you wanted to simulate this sort of behavior, you could think about basically pre-training the agent on an iPad that doesn't have good palm rejection, so to speak, and then letting it type on one that does, and you would probably see something quite similar to this. But that's not the most scientific thing I've said today; it's more of a gut feeling.

No, I think that's fine. All right, I think we're out of time with this one. Thank you so much again, Lorenz and collaborators, for this talk. And with that, we are going to move on to the final talk in the session. It's on a predictive performance model for immersive interactions in mixed reality, and Florent is going to present this paper.

I think everyone can see my screen now? Yes? Okay. So, hello everyone, and thank you for being here just before the break. Today I will present my work, a predictive performance model for immersive interactions in mixed reality. My name is Florent, I'm a PhD student until October 13th, and this work was made in collaboration with Emmanuel Dubois and Marco Serrano, my PhD supervisors.

In recent years, a lot of work with head-mounted displays has evaluated applications, interaction techniques, or systems with users. But testing and evaluating this kind of system requires considerable time and effort, especially with the COVID-19 pandemic preventing researchers from running controlled experiments. To overcome these issues, predictive models can be used to predict the time required to complete an interactive task at an early stage of the design process. So in this work we propose an extension of a well-known model, the Keystroke-Level Model, or KLM. The Keystroke-Level Model is part of the GOMS family. This model can predict the time to complete an interactive task, where the task consists of a sequence of atomic actions; these atomic actions are called operators in KLM, and each operator is associated with one or more unit times. The original KLM was designed to predict completion times for interactive tasks in command-line systems, but systems have since evolved, so the KLM needs to be adapted to new technologies. Such extensions already exist, for example for mobile phones and smartphones, but as far as we know, no KLM extension exists for interaction with mixed reality applications. So in this work and this presentation, we propose a KLM extension for mixed reality. We focus only on the selection task, modeled as a sequence of a pointing action and a validation action. During this presentation we will fill in the table at the top of the slide.

The first operator models the cognitive actions in an interactive system: many actions require a reaction time before starting a new action, for example after feedback. The simple reaction operator was designed by MacKenzie in 2014 to model this kind of reaction. In a mixed reality HMD environment, the field of view is reduced compared to that of the human eyes, so most of the digital content is initially placed outside the field of view. Exploring the digital space therefore requires wide head movements; these movements are modeled with our coarse pointing operator. Once the content is displayed inside the field of view of the HMD, to select an item the user needs to point at it with a cursor; in most current HMDs, the cursor is placed at the center of the field of view.
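As a brief aside on the KLM basics just described (this is the standard KLM formulation, not something specific to this paper's extension): the predicted completion time of a task decomposed into a sequence of operators O_1, ..., O_n is simply the sum of their unit times,

T_task = t_{O_1} + t_{O_2} + ... + t_{O_n}

where t_{O_i} is the unit time associated with operator O_i.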
Since the cursor is fixed at the center of the field of view, pointing requires a slight head movement to move the cursor from one target to another; these movements are modeled in our KLM with the precise pointing operator. Once the cursor is on the target, a validation step is required to select it. This validation can be done either with a dedicated device or with a mid-air gesture. HMDs are usually delivered with a dedicated device with a physical button to validate a pointing or to select a command; the option of clicking a button on the physical device is modeled by the button click operator in our KLM. As I said, we can also perform a mid-air gesture to validate the pointing task. HMDs can now recognize gestures made with the arm or the hand, but to be recognized, these gestures must be performed inside the field of view of the HMD's sensors, so the user has to move their arm from a resting position to a position where it is detected; the raise-hand operator models this action. And the last operator: once the hand is detected by the HMD, the user can perform a gesture to validate. Each HMD manufacturer defines its own validation gesture; in our work we chose the air-tap gesture from the HoloLens because it is easy to perform and robust, and indeed this gesture is present in both HoloLens version 1 and version 2. To sum up our new operators, we have five newly introduced ones: two for pointing at a target, the coarse pointing and precise pointing operators, and three to validate the pointing, the button click, raise-hand, and air-tap operators. So we now have our six operators to model the selection task in a mixed reality HMD environment.

But an extension of KLM also requires defining the unit time associated with each operator. To this end we designed four user studies with the HoloLens version 1, and each study was completed by 12 participants. For the first operator, the coarse pointing operator, we need to model the time taken by users to bring the field of view of the HMD around the target. However, the vertical field of view is smaller than the horizontal one. To overcome this problem, instead of bringing the whole field of view of the HMD around the target, the user has to bring a circle inscribed in the field of view around the target; this inscribed circle is represented in blue in the picture. In this study we placed eight targets in eight directions around the starting point and at three different angular distances: 45, 60, or 75 degrees. The user has to (1) look at a three-second countdown, (2) bring the circle inscribed in the field of view around the target, and (3) dwell for 500 milliseconds to validate. We collected 576 movements. We can see on the graph that the direction of the movement has an effect on the completion time, and we can also see that the angular distance has an effect on the completion time, so we cannot define only one unit time for this operator; we decided to define nine unit times, one for each combination of the three distances and three direction groups.

For the precise pointing operator, we placed eight targets on the inscribed circle. The participants have to (1) look at the three-second countdown, (2) bring the cursor to the target, and (3) dwell for 500 milliseconds to validate the pointing. Unlike the previous operator, we did not observe a clear difference between the directions, so we define only one unit time for this operator: 419 milliseconds. At this point we have defined the unit times for the two pointing operators: nine values for the coarse pointing operator and one value for the precise pointing operator.
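A small aside on the inscribed-circle trick above: since the vertical field of view is the limiting dimension, the angular radius of the inscribed circle is simply half of the smaller FOV angle. A minimal sketch, with purely illustrative FOV numbers rather than measured HoloLens values:

def inscribed_circle_radius_deg(h_fov_deg: float, v_fov_deg: float) -> float:
    # Angular radius of the largest circle inscribed in a rectangular field of
    # view; its diameter is limited by the smaller of the two FOV angles,
    # usually the vertical one on current HMDs.
    return min(h_fov_deg, v_fov_deg) / 2.0

print(inscribed_circle_radius_deg(30.0, 17.0))  # -> 8.5 (degrees)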
I will now present the studies for the last three operators, the validation operators: button click, raise hand, and air tap. For the button click operator, the participants had to look at a three-second countdown and perform four button clicks. For the raise-hand and air-tap operators, the participants also had to look at a three-second countdown, raise their hand into the field of view, and perform four air taps. The button click takes 207 milliseconds on average, the raise hand takes 623 milliseconds on average, and the air tap takes 427 milliseconds on average.

Now our extension of KLM is complete, but that is not sufficient: to have a proper KLM extension, we need to consolidate the model with a more ecological task involving a sequence of operators. So in the next slides I will present the consolidation studies that demonstrate the ability of our model to predict interaction time. We ran three consolidation studies and compared the time predicted by our model to the mean time observed during the experiment; according to Olson and Olson (1990), the difference must be less than 20 percent for each task to validate the model. The three studies represent common tasks with mixed reality HMDs: the first is pointing at targets outside the field of view, the second is selecting multiple targets outside the field of view, and the third is selecting multiple targets inside the field of view. These three studies are necessary because they involve different sequences of operators.

The first study involves the simple reaction, precise pointing, and coarse pointing operators. The task is to bring the field of view around the target, move the cursor to the target, and dwell for 500 milliseconds to validate it. For this task, the differences between the predicted and the mean observed times range from zero to ten percent, so this task validates the model. The second study is a five-target selection task where the targets are placed outside the field of view. This task involves the coarse pointing, precise pointing, and button click operators: the user has to bring the field of view around the target, point at the target, and validate the pointing. The difference for this task is also below the threshold, so it also validates the model. The third and last task is again a five-target selection task, but in this study the targets are placed inside the field of view. After the countdown, the user has to point at the target (the precise pointing operator), raise their arm (only for the first target), and validate the pointing with an air tap. For this task the difference is 4.8 percent, so this third task also validates the model.

To sum up the contributions of this work: we propose a new extension of KLM for mixed reality. First we designed new operators, second we associated unit times with these operators, and then we validated our model with three validation studies. In the near future this model can be extended along several perspectives. First, mixed reality devices allow the user to carry out whole flows of interaction; for instance, after selecting a 3D virtual object, the user can perform 3D manipulations. Modeling these interactions requires additional KLM operators, and we would then need to define unit times for these new operators. And finally, as we also saw during today's presentations, combining touchscreens with HMDs could be useful to overcome the mid-air interaction problems of mixed reality, so it would also be interesting to extend our KLM model with operators for smartphone-based interaction.
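To make the prediction procedure concrete, here is a minimal sketch of how such a KLM extension could be applied, using the average unit times quoted in the talk for precise pointing, button click, raise hand, and air tap; the simple reaction and coarse pointing entries are left as placeholders because their values (nine of them for coarse pointing, by direction and distance) are reported in the paper rather than in this talk. The 20 percent check follows the Olson and Olson criterion mentioned above.

# Unit times in milliseconds; None marks values not quoted in the talk.
UNIT_TIMES_MS = {
    "SIMPLE_REACTION": None,  # from MacKenzie (2014)
    "COARSE_POINT": None,     # nine values, depending on direction and distance
    "PRECISE_POINT": 419.0,
    "BUTTON_CLICK": 207.0,
    "RAISE_HAND": 623.0,
    "AIR_TAP": 427.0,
}

def predict_time_ms(operators, unit_times=UNIT_TIMES_MS):
    # KLM prediction: the completion time of a task is the sum of the unit
    # times of its operator sequence.
    missing = [op for op in operators if unit_times.get(op) is None]
    if missing:
        raise ValueError(f"no unit time available for: {missing}")
    return sum(unit_times[op] for op in operators)

def within_klm_threshold(predicted_ms, observed_ms, threshold=0.20):
    # Olson and Olson (1990): the prediction error must stay below 20 percent.
    return abs(predicted_ms - observed_ms) / observed_ms < threshold

# Example: selecting one target inside the field of view with an air tap,
# as in the third consolidation study (precise pointing, raise hand, air tap).
print(predict_time_ms(["PRECISE_POINT", "RAISE_HAND", "AIR_TAP"]))  # -> 1469.0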
So thank you for listening, and if you have any questions, here or on Discord, I will be happy to answer them.

Thank you very much, Florent. Questions on Discord or here in the chat? For now, maybe I'll start again. It was very nice to see the interactions in an AR interface formalized a bit more and used for UI evaluations; that was really refreshing, I would say. In your case you picked the interaction vocabulary of the HoloLens 1, which made a lot of sense, with its combination of head gaze and activation through tap or click. You touched on this a bit in your future work, but what are the challenges you see in extending your model to where interaction design is currently going, as we heard in the previous talks: cursors guided by hands, as we heard earlier, remote interaction through indirect input, or even direct interaction with the UI in front of you?

Yes, so for the first part of the question: we used the HoloLens version 1 because we think it is one of the most common HMDs, and the interaction vocabulary of the HoloLens version 1 is robust. But we can think a bit further, especially about the HoloLens version 2 and eye gaze. Just to explain, all of the operators defined with the HoloLens version 1 are still valid on the HoloLens version 2, and we thought about this when we designed them. For instance, when we developed the coarse pointing task, we used the circle inscribed in the field of view precisely to overcome the issue of the different field-of-view sizes of different HMDs. And if we think about the next models and the next operators, there are plenty of gestures we can imagine with the body, and especially with our fingers; gestures for each finger could be really interesting to investigate. I think it could be great work, but we did not have time to take on that work, we focused just on the selection task. I think that answers your question.

Yeah, that's good. And then perhaps one detail I picked up on at the beginning: you mentioned your PhD is over next week, so maybe you want to use this platform for outreach?

Yes, as I said on Discord, I'm looking for a postdoc for 2022, if anyone is interested in my profile. And what else can I say about it? I don't know if I can say much, but yes, I will present my PhD defense on the 13th of October, though in French, so it will probably mostly be French speakers who join on Zoom.

Very cool. So everyone who is interested in working with Florent, please reach out to him through Discord. And with that, let's thank all the speakers in this session again for their presentations on the latest in input and interaction in mixed reality; I think this was really nice. Thanks also for the technical support during this session, Michelle. Now that the talks are over, a quick reminder for everyone to join the post-session Q&A in Gather.Town, where you can directly interact with the authors and ask your questions. Give it a try, it's a lot of fun, and Michelle will also post a reminder on how to get there from here. So thank you everyone, I hope everyone enjoyed this, and see you in Gather.Town.
Info
Channel: International Symposium on Mixed and Augmented Reality (ISMAR)
Views: 234
Id: X8Gk8CiaCoU
Length: 87min 2sec (5222 seconds)
Published: Tue Oct 05 2021