[ML News] Plagiarism Case w/ Plot Twist | CLIP for video surveillance | OpenAI summarizes books

Video Statistics and Information

Captions
A plagiarism story has an unexpected plot twist, CLIP can be used for video surveillance, and Schmidhuber goes on another rant on his blog about citing his works. Welcome to ML News!

Hello, friends of the Monday, it is ML News, and our first story is convoluted, no pun intended. It starts out with a Reddit post by user chong 98 alleging plagiarism of a paper. They refer to a story of plagiarism that we at ML News have covered before, about momentum residual neural networks, and say: today I found out that our paper, still in conference review, is also severely plagiarized by this other paper. So they made a little GitHub readme documenting that they uploaded their paper to arXiv first, with a detailed comparison of what they accuse the other paper of plagiarizing. Largely it comes down to: the idea is very similar, it's applied on different data sets, but it's essentially the same method, and some formulations are quite similar. Their conclusion reads: "Usually the methods that work extremely well in practice are very simple, and we are happy to find that LAD (which is their method) is one of these techniques. We encourage people to try out our proposed LAD to improve your results for object detection. Give us appropriate credit. However, we are extremely upset if our idea is obviously stolen," saying that the other authors must withdraw their paper.

Now, we know that plagiarism like this happens quite a bit in machine learning. There are just so many papers, and it's very difficult to even detect whether another paper has plagiarized yours. It's hard for reviewers to find out that a paper is a copy of some other work, and a lot of people are hoping to get publications by simply taking papers, rewriting them a little bit, maybe doing one or two different experiments, and then submitting them somewhere else. However, there is a twist: user zill 24 says something very interesting is going on here, because just a couple of days ago this exact paper was found to plagiarize, word by word,
another paper by Chinese authors, submitted in 2020, which has caused many discussions on Chinese forums. This links to Zhihu, which is sort of a Chinese Quora, where they put their paper and this paper side by side, and it turns out not to be approximate plagiarism: the paper is actually copied in part word by word, or at least phrase by phrase. So this is a near-duplicate of that paper. If you're confused, so was I, and apparently so is the original poster of the plagiarism claim, who says: "I was never aware of the paper you mentioned, but for sure I'll read and cite it if it's the same idea. Thanks for pointing it out." As you can see, people are generally lost, so here's what happened.

The paper we considered first, let's call it paper A, was written, submitted to a conference, and uploaded to arXiv this year in August. The paper they claim plagiarized them was uploaded to arXiv in September, as you can see by the date; let's call it paper B. The Reddit post claims that paper B, having very similar ideas, copied from paper A. However, then an author of yet another paper, paper C, comes along and shows pretty convincingly that paper B is actually a copy of paper C, including screenshots of the diagrams and so on. Paper C's author also delivers proof of first submission, and you can even tell that paper B did in fact screenshot paper C, because the resolution of their figures is worse.

Here is the interesting part: not only was paper C written a year earlier, it was also never released publicly. It was submitted to two conferences, and after rejection the author simply dropped it because they thought the idea wasn't that good. So paper C came before paper A and paper B, but was never released. That raises multiple questions, like: how did paper B's authors get access to paper C? The post on Zhihu tries to follow that up. They try to contact the university, they try to find these people, and they find that the main authors no longer
study there. One of the authors apparently says: well, I just kind of uploaded it to arXiv, but I didn't really write the paper. Nobody admits to anything; nobody says anything. The NeurIPS chairs checked, and it turns out none of the area chairs, senior area chairs, or reviewers is at the institution that plagiarized the paper. So as of yet it is still unclear who leaked the paper and how or where these authors got it, nor does anyone in this chain admit to any plagiarism.

Now, while this obviously sucks for the researchers of paper C, the question is: what about paper A? Paper A made the claim that since paper B's claims were so similar, and paper B came after paper A, paper B copied from paper A. But now you have a paper that's essentially a copy of paper B, yet was written before paper A. Wouldn't the same logic indicate that paper A copied from paper C? The authors of paper A actually comment on this and say they did not know about paper C when they wrote their paper. They now highlight the differences between the two papers and strongly deny having plagiarized paper C. The whole thing is just a mess.

Is there something to learn from this? I think yes, and I think this is what makes these plagiarism cases so hard. I don't know anything more than you do, but if I had to guess, I believe the authors of paper A that they didn't know about paper C. It just shows you how multiple people (and they self-admit the idea is relatively simple, and works) can have very similar ideas and then write papers that turn out to be very similar to each other. Among the thousands of papers released each month, it's bound to happen that some of them, with the same idea and the same sorts of applications, will turn out to be quite overlapping without the authors ever having seen each other's work. And that might be indistinguishable from a paper that has actually plagiarized another paper but has put in a little bit of work to reformulate and redo
experiments. So while plagiarism is certainly a problem in our field, and it's probably happening a lot more than we realize in this smarter, undetected way, it is also the case that you have to be quite careful with these allegations. In general, probably the best thing you can do is simply publish your ideas, write them up as well as possible, and make it easy and nice for people to cite you, instead of citing someone who copies from you. Yes, that means there is a little bit of a marketing aspect involved, and it also leads to problems where people with bigger followings attract more citations, but ultimately it is your best shot. With regard to this particular story, I doubt that anything more is going to happen; we'll keep an eye on it.

Next news: GitHub user johanmodin demonstrates how you can use CLIP, OpenAI's CLIP model, to search through videos. Apparently, in the original CLIP paper, OpenAI claimed that this isn't really an application that works well. However, as this little project demonstrates, it appears to work quite well, in that you can search surveillance footage for a descriptive piece of text. What you do is take your video and encode each frame with CLIP; then you encode the text you're looking for, also with CLIP, and compute the inner products between all the frames and your query. If any of the frames exceed a threshold, you show that frame. Here the author searches for "a truck with the text 'odwalla'" and directly finds the frame corresponding to it, likewise "a white BMW car", "a truck with the text 'JCN'", "a bicyclist with a blue shirt", "a blue smart car". It's pretty easy to do this yourself: clone the repo, put in your own video, and you can search through it.

Now, this raises a lot of questions. It essentially gives a new superpower to anyone who has access to this kind of material. Tracking was possible before, but not with this ease.
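The matching step described above can be sketched in a few lines. This is a minimal sketch, not the repo's actual code: it assumes the per-frame embeddings and the text embedding have already been produced by CLIP's image and text encoders, and only shows the inner-product-plus-threshold search.

```python
import numpy as np

def find_matching_frames(frame_embeddings, query_embedding, threshold=0.3):
    """Return indices of video frames whose embedding matches the text query.

    frame_embeddings: (num_frames, dim) array; in practice, the output of
    CLIP's image encoder applied to each frame of the video.
    query_embedding: (dim,) array; in practice, the output of CLIP's text
    encoder applied to the search string.
    """
    # Normalize so the inner product becomes a cosine similarity in [-1, 1].
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    # One inner product per frame; keep the frames that exceed the threshold.
    scores = frames @ query
    return np.nonzero(scores >= threshold)[0]
```

The threshold value here is an arbitrary placeholder; in practice you would tune it per video, or simply rank frames by score and show the top hits.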
You'd have to craft some sort of detector, label a bunch of things in some of the images, and then you might be able to track them through the footage; but here you can simply enter what you're looking for, in many different ways. Now, you can of course ask what the purpose of having a surveillance apparatus is in the first place, if not for, you know, surveilling. So rather than criticizing the possibilities here, one might criticize the implementation of surveillance in the first place. It's also the case that you might simply have these cameras for the purpose of proving that someone ran a red light or something like that; but once the apparatus is in place, it can obviously be misused for other things, and with the addition of CLIP that's now an easier possibility. I don't have the answer here; I'd just like people to know that things like this are now totally possible, not only for the government but for pretty much anyone with access to the camera feed and a home computer. Make of that what you will.

Next news: the DARPA Subterranean Challenge has concluded, and this is something extremely cool. Submissions to the challenge are teams of humans and robots that explore underground areas, such as mine shafts or underground tunnels. The way the competition works is that the robot, or usually multiple robots, are deployed into the underground system and tasked with certain things, like finding objects, retrieving them, or mapping the area. The humans aren't allowed to go into the underground areas, but they can communicate with the robots. However, these being mine shafts and so on, there isn't always reliable communication, so the robots must largely be autonomous. And this isn't only simulated; these are actual real-world robots. For example, here is a drone in one of these underground bunkers being hit by a plastic bag that it itself has thrown up with the wind.
Evan Ackerman on Twitter has a number of really cool clips from this challenge. The challenge has concluded, so you can no longer participate this year, but you can watch the participants' trials on YouTube. This is just really cool.

Jürgen Schmidhuber pumps out another blog post claiming to correct mistakes in citations and historical references by others. This time he criticizes the 2021 Turing Lecture by Yoshua Bengio, Yann LeCun, and Geoff Hinton, which they gave after receiving the Turing Award, and also the announcement of the Turing Award itself, all of them, he says, for making wrong historical claims and not properly citing things. Schmidhuber starts the blog post by saying we must stop crediting the wrong people for inventions made by others, and in the abstract he states that most of these breakthroughs and tools were direct consequences of the breakthroughs of his lab and other labs in the past three decades. He makes 15 distinct claims about the Turing Lecture, such as: LBH (which stands for LeCun, Bengio, Hinton) cite Hinton for dropout without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule; or: LBH cite Bengio's 2014 paper on generative adversarial networks without mentioning that GANs are instances of the adversarial curiosity principle of 1990.
He follows this up with detailed references for his claims, as well as over 250 references, a lot of which are to himself. I have sided with Schmidhuber a lot of times in the past. It is true that his labs have done a lot of fundamental work; it is also true that sometimes this work is not properly credited; and I can even understand that he's pretty salty about LeCun, Bengio, and Hinton receiving the Turing Award and him not. But this is pushing it a little bit, just by the sheer length of this article. He sees himself as something like a crusader for the correction of scientific history, for making sure everyone cites properly, and so on, and I agree that is an important thing. But I ask myself: is this really what he wants to be remembered for? Does he want his legacy to be: oh, Schmidhuber, the person who did a lot of cool work, and even if we might not credit him for all of it, people still remember him for a lot of cool work? Or does he want to be remembered as the person where, every single time someone invents anything, he finds a vague relation to what he did in the 1990s and then claims, oh, this is just a special case of my work? Look at the length of this article; the amount of work going into it is just absurd. He's so smart, clearly he could do something better with his time. And this isn't even productive: at the frequency and intensity Schmidhuber is doing this, it's completely counterproductive. No one is even going to respond; people will simply say, ah, here he goes again, and ignore him. And the claims get more and more wild. While you can make the claim that something like a ResNet is essentially a Highway Net, but simpler, the claim that GANs are just a special case of artificial curiosity might be true on an abstract level, but certainly not on a practical one. And then his newest claim, that Transformers are essentially nothing other than fast weight programmers, and so on. I mean, come on: if these are actually all special
cases of your things, then please, please tell us what the next big thing is. Transformers have not only sparked a revolution in NLP; they have widespread consequences. People worry about whether language models really understand, people solve new tasks with them, Google Search is now powered by BERT, and Schmidhuber claims to just have been sitting on this for 20 years. Well, please, next time tell us beforehand, so we can ring in the revolution faster. In any case, read it if you want; I don't think it's worth your time.

OpenAI has a new blog post called Summarizing Books with Human Feedback, and a paper to go along with it called Recursively Summarizing Books with Human Feedback. I don't know why they left out the "recursively" from the blog post title, but in any case, the algorithm works by taking a book, chunking it into sections, summarizing each section, then putting together the summaries of those sections and summarizing those into super-sections, and so on. Every summary generation is conditioned on the section it's supposed to summarize, but also on the summaries already produced from sections that come before it at the same level. You can see this here at height one: generation of this super-summary receives not only the text it's supposed to summarize but also the summaries generated before it. Essentially you're telling the model: here's a bunch of text I want you to summarize; it's from the middle of a story, and here is a high-level summary of what already happened in this story; please continue this high-level summary. This is cool, because by working at the chunk level, rather than as a please-summarize-the-whole-book task, you can leverage humans in a better way: humans can now simply check whether a reasonable-length text, like a couple of pages, has been summarized correctly, and not whether an entire book has. This also allows you to
summarize arbitrarily long text, because you can always add levels: if your original text is longer, you simply recurse more often. With each recursion the text gets chunked, each chunk gets summarized, and then all of it goes together. So this is a neat combination of learning from human feedback, which OpenAI has shown interest in before, and recursive task decomposition, where you divide a task into essentially the same task at lower levels, so you can train one model to do the task and then apply that model over and over again. The model they end up using is a fine-tuned version of GPT-3, and you can read some of the example summaries on the blog post, for example one for Alice in Wonderland. Now, I've read the summaries, and I have to say they're not exactly what you would expect from a summary of a book: they seem to pick out important events that happen in the book, but the highest-level summaries don't really give you a sensible overview of the plot. This might be due to the recursive decomposition: while it might be appropriate at the lowest level to simply leave away all the in-between things the author sprinkled in and mention the important events of a chapter, at a higher level you most often want a more abstract summary; you want to condense the plot somehow. So there's still room for improvement, but it's pretty cool to see what these language models can do when you bring the human into the loop.

CNN Business writes: "A startup says its software can spot racial bias within companies. Will the surveillance scare employees?" This is about a product called UnbiasIt: "eliminating bias with technology, one alert at a time". What the product does is monitor the employees of a company, for example their email communication, and try to detect instances of bias. The CNN article mentions this example:
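Stepping back to OpenAI's recursive scheme for a moment, the chunk-then-recurse procedure can be sketched in a few lines of Python. This is just a toy sketch: `summarize(chunk, context)` is a hypothetical stand-in for the fine-tuned GPT-3 model, mapping a chunk of text plus the running summary of everything before it to a short summary of that chunk.

```python
def summarize_recursively(text, summarize, chunk_size=1000):
    """Recursively condense `text` until it fits in one chunk (toy sketch).

    `summarize(chunk, context)` stands in for the learned model: it returns
    a short summary of `chunk`, conditioned on `context`, the concatenated
    summaries of the chunks that came before it at the same level.
    """
    if len(text) <= chunk_size:
        return summarize(text, "")
    # Split into fixed-size chunks and summarize each one, conditioning on
    # the summaries produced so far at this level.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = []
    for chunk in chunks:
        summaries.append(summarize(chunk, " ".join(summaries)))
    # Recurse: the concatenated summaries become the text of the next level.
    return summarize_recursively(" ".join(summaries), summarize, chunk_size)
```

Because each level shrinks the text by roughly the compression ratio of the summarizer, a longer book just means more recursion depth, which is exactly why the method handles arbitrarily long inputs.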
For instance, she said, if an email from one employee to another alluded to a "diversity hire", that's the kind of thing the software would be expected to flag. The way it works: if UnbiasIt scans an email and finds wording that may be objectionable, it sends an alert to a small group of employees working in human resources and diversity, equity, and inclusion, with the wording in question highlighted in yellow. The spokesperson says it's not meant as a gotcha for employees, because the bias might be unconscious; the consequence might be that you offer an employee bias-related training or other education. The interesting thing is that the company says it doesn't use artificial intelligence to determine when to send an alert, because of concerns that bias could be contained in the AI itself; it essentially relies on keyword and phrase spotting. The product website makes a big deal of the companies applying the product being in control: they can define what the criteria are, and so on. They frame it more as a compliance issue, comparing it to similar tools that detect instances of, for example, insider trading. However, if this doesn't scare the crap out of you, then I honestly don't know what would. And it's only a matter of time before machine learning is actually used in these systems, because as they are, they seem pretty easy to evade; when the company wants to improve its detection, it'll implement some sort of NLP system, and that's certainly going to make things more interesting, but not necessarily more pleasant. I highly doubt this is going to change anyone's mind or unconscious biases, or improve workplace climates in any substantial way.

Speaking of surveillance: "Apple is working on iPhone features to help detect depression, cognitive decline," the Wall Street Journal writes. This story is about Apple monitoring users in order to detect things like depression and mild cognitive impairment, which is a precursor, for
example, to Alzheimer's or other forms of dementia. Now, for this I'm honestly not that skeptical: provided, as I hope, that you have the ability to turn it off, an optional feature like this could potentially be quite helpful. People generally let their smartwatches and phones track other health-related data, such as pulse, oxygen saturation, number of steps, and heart rate (well, heart rate is the same as pulse, right? doesn't matter). While I certainly agree that mental health data isn't exactly the same, and it probably requires monitoring more personal data than a single number like your pulse, we do face a lack of mental health professionals, and having the system monitor you for something like cognitive decline might encourage you to look for treatment a lot sooner than if you had to notice it yourself; if something declines mildly over time, you're unlikely to see it on your own. But of course the privacy implications of something like this, especially if the data is then sent around, analyzed, and potentially even sold, are pretty great, so treat it with a grain of salt.

Next news: CNBC writes that the UK publishes a 10-year plan to become an AI superpower, seeking to rival the US and China. The article details the UK's strategy to become an international leader in AI technology. It's something like a 10-year plan, and the strategy goes from providing more compute to launching centers where researchers from the whole country can communicate with each other and coordinate AI research. It also outlines better regulations for intellectual property and so on, and it appears to be a general indicator that the government is looking to push this area. However, there are multiple problems with something like this. First of all, academics are very likely to move, and not only academics; employees of tech companies are pretty move-happy too. A lot of them are not bound to an
individual location, and it is even considered a good career move, for example in academia, to have spent time at various different places. So as a country, retaining knowledge is quite a hard task when it comes to people like this. It is a bit easier with industry, where a company actually needs headquarters and so on, but there too, employees frequently rotate. The other problematic aspect, also outlined in the article, is that AI startups, like many startups, get bought, and very often by big US or Chinese corporations. So Britain might raise these startups, give them tax breaks or subsidies or grants and whatnot, and build up all this knowledge in the country, only for it to be bought by a US firm. The article names DeepMind as such an example: while DeepMind is still in London, it now belongs to Google. It's good to see countries pushing AI technology, but the article does detail the problems you face when trying to achieve something like this, especially as a country that is not huge, such as the UK.

OK, let's dive into some helpful libraries. Psychic Learn is... I'm kidding, you know scikit-learn. scikit-learn has just put out its 1.0 release. For some projects, the 1.0 release is the initial release, the first stable version; for other libraries, the 1.0 release is actually the last release, saying: OK, we're done with this, we're releasing 1.0, that's it. For scikit-learn, neither appears to be true: scikit-learn is of course already an established library, but it doesn't seem like they have any intention of finishing or killing the project. There are also no major changes. One of the changes is that lots of functions now have to be called with keyword arguments, which, let's face it, in NumPy and scikit-learn and all of these libraries, is a good change. While I think it would be better to simply educate users to do this as a good practice and leave them the
option of filling their code with non-keyword arguments, well, it's their library; they can do whatever they want. There are also a bunch of new models, and the plotting functionality has been improved.

Also newly released: Dopamine version 4 is out. Dopamine is a library for doing reinforcement learning research, with lots of implementations of common agents and environments. The major new additions are things like Soft Actor-Critic for continuous control and the Optax optimization library for JAX-based agents. Also new is that it's now compatible with Docker, so it will become a lot easier to set up the required environments.

Microsoft releases Muzic, which isn't necessarily a library; it's an umbrella project for music generation research. The repo holds code for a bunch of different papers on various aspects of synthetic music generation, and also artificial understanding of music that already exists, going from classification of genre, to transcription of lyrics, all the way to arranging and synthesizing new music, including lyrics. What's cool about Muzic is that not only does it have this picture logo, they actually have the logo in MIDI, and you can listen to it. Excellent.

Facebook AI releases Dynatask, a new paradigm of AI benchmarking, and an iteration on Dynabench. This is a system for benchmarking AI systems, specifically natural language processing tasks. It combines tasks, which are essentially data sets and their associated labels, with models that people submit, and it evaluates the models on the tasks. There's also the option to have a human in the loop, something like a Mechanical Turk worker, who tries to come up with adversarial examples against the models, or examples probing a particular aspect of the task. The human-created data is then fed back into the system and used as further evaluation data. This is supposed to give a more complete
picture of models' capabilities, rather than evaluating them over and over on the same limited set of static benchmarks. If you're interested in that sort of thing, this seems like a pretty good framework.

Next up, PhiFlow has a new release out. This is a framework for solving partial differential equations in a differentiable manner; as you can see right here, it can for example be used for fluid dynamics. Now, I'm a total noob at any of these things, but if you're in these fields, this library might be interesting for you.

The next library is Dora the Explorer, a friendly experiment manager by Facebook Research. This is an experiment manager that focuses specifically on things like grid searches, and the special thing is that the experiments themselves are defined in pure Python files: there's no YAML, no web interface, or anything like that. Your experiments are simply Python files defining some sort of grid search, and the tool can identify and deduplicate experiments that result from, I guess, gridding too much. It seems to be a simpler alternative to many of the experiment-running tools out there, so if for some reason you're looking for simplicity, you might want to give it a try. That being said, while it seems simple, the system actually looks really powerful too, so I have no doubt you can go up in complexity with it by a lot; for example, it interfaces with scheduling systems such as SLURM.

Next up, Habitat-Lab is a high-level library for development in embodied AI, essentially a library that helps you run RL and robotics tasks in 3D environments. This is not a new library, but there have been some new developments. First of all, there is a new data set called the Habitat-Matterport 3D Dataset, which brings real-world environments into Habitat: these are real rooms that were scanned with a depth-aware camera, and you can now explore them inside the Habitat framework. So if
you are into embodied AI, robotics, indoor navigation, or anything like this, definitely give Habitat a try. ("Go to toilet." "Good job.")

And lastly, Google AI announces WIT, a Wikipedia-based image-text data set. This is supposed to be a very high-quality data set connecting images to text. Rather than scraping the internet and trying to read the alt text of an image, this leverages Wikipedia: on Wikipedia, whenever there's an image, there's actually a lot of information about that image all around it. Not only is there the usual description, there's also the page title, which usually refers to something inside the image, and the data set also grabs the page description, which very often relates to an image on the page as well. Lastly, the image page itself usually has something like an attribution description, and the file name can also give indications about what is in the image. The cool thing is that, since Wikipedia is so extensive, you not only get image-text pairs, you very often also get translations of all these different things into different languages. Here is an example of one data point: you get the image along with the URL, page title, reference description, attribution description, and so on. While this is a smaller data set than what, for example, DALL-E was trained on, it's definitely a higher-quality data set with a lot more information per data point. It's going to be pretty exciting to see what people build from it.

All right, that was already it for ML News. This was a long episode, I realize, but there's just so much stuff happening. If you have anything happening, let me know, and I'll see you next time. Bye bye!
Info
Channel: Yannic Kilcher
Views: 17,630
Rating: 4.9501386 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, clip, openai, summarizing books, darpa, darpa subt, schmidhuber, lbh, turing lecture, adversarial curiosity, gans, clip surveillance, ai surveillance, video search, plagiarism, machine learning plagiarism, dopamine, scikit learn, mlnews, ml news, machine learning news, unbias it, unbiasit, video surveillance, email surveillance, unconscious bias, phi flow, habitat, wikipedia
Id: tX1OolVxDzs
Length: 30min 52sec (1852 seconds)
Published: Wed Sep 29 2021