I Made a Graph of Wikipedia... This Is What I Found

Video Statistics and Information

Captions
What you are looking at is a visual representation of Wikipedia, a website that I can almost guarantee you've used at some point in your life. Each circle represents one of the 6.3 million English Wikipedia articles, and these are the nearly 200 million links between these articles that form the network of Wikipedia. This graph is the culmination of months of work, thousands of lines of code, and an absurd amount of computation time. And I know what you might be thinking: what a complete waste of time, because other than looking mildly interesting, we can't really extract any useful information from this graph in its current state. But I promise that if you stay with me and watch this video, not only will you come to understand this graph, but you just might learn a couple of interesting things along the way.

[Music]

Let's start by understanding the different colors of this graph. Each color represents a different community, which was algorithmically determined. A community is a group of nodes, or in this case articles, which are more tightly linked to each other than they are to other articles in the rest of the network. In total, the algorithm detected 44 different communities. The idea behind finding these communities is to group similar articles together, since in theory, articles which are more closely linked to each other should also have more similar content.

In order to test this theory, I looked at the categories for each of the articles in a given community and found the most common categories within that community. For example, if we look at community number three, which has over 760,000 articles, we'll find that the most common categories of articles are related to politics and law. It's in this community that we'd find the articles of the United States presidents. Now let's look at community number five, and we'll find that this community is all about music, and it's where you're most likely to find all your favorite musicians. Community number 10 is where you'll find video games,
and there's a high likelihood that every video game you've ever played has a corresponding article in this community.

Now, having communities about politics, music, and video games makes sense because they're both popular and broad subjects, but not all of the communities are so straightforward. For example, the top categories of community number 11 are about space objects, and community number 19 is related to Oregon politicians.

One of the most interesting things about the way the articles are separated into different communities is the way they can reflect society. For example, community number six is all about English and American movies and television, but there are also two separate communities for Indian and Korean movies and television. This shows the popularity of both Indian and Korean cinema and their distinct separation from Western cinema. I thought it would be interesting to look at films like Parasite and RRR, which come from Korea and India respectively but found massive success in America as well. These movies were more closely linked to American cinema than most other foreign films.

Moving on, one of my favorite communities in this graph is community number 14. The top categories here are pretty evenly split between Canadian people and hockey, which I guess just shows how closely linked hockey and Canada are to each other.

In these cases it's easy to understand the way articles have been grouped into their different communities, but that's not always the case. Let's look at the general classifications for each community with a notable number of articles. What I find interesting about this list is that the top categories from half of these communities are related to different sports. If you were to ask a human to broadly categorize all the articles on Wikipedia, there's a good chance they would put most, if not all, of the sports articles into a single group, while the Wikipedia network would suggest that they are all quite separate from each other.

[Music]

So now we
understand the colors of the graph and the layout of the graph, but what about the different sizes of each circle? The size of each circle, or node, is proportional to the number of incoming links to its corresponding article. In other words, the more times an article is linked to by other articles, the bigger its node will be. For example, the article for basketball is referenced by 44,000 other articles, so it's going to be bigger than the articles for free agent and golf, which are linked to fewer times.

Something interesting we can do with the graph is look at all of the links to a specific article to see how much of an impact it has on the overall graph. For example, the article for COVID-19 is one of the fastest-growing and most linked-to articles, with over 46,000 articles referencing it. When visualized, it looks like this. 46,000 incoming links is a lot, but it's still a relatively small amount when compared to the most referenced articles on Wikipedia. For example, over 100,000 articles link to the article for World War I. However, even that is significantly less than the 189,000 articles that link to the article for World War II. On a side note, I find it interesting how the graphs for World War I and World War II are similar to each other, showing how articles with a similar type of content also share similar links.

Despite the article for World War II being referenced nearly 200,000 times, it's only the fourth most linked-to article on the English Wikipedia. The second most referenced article has over 240,000 links to it, and it is the article for association football, or soccer. The interesting thing about this graph is that even though it has more links, most of the links come from within the same community for football articles, with a lot of links originating from articles on football players and teams.

There's still one more article left, which is referenced more than any other article on Wikipedia. Linked to by nearly 280,000 other articles, the article with the single biggest
impact on the Wikipedia graph is the United States. That's right: the most referenced article on Wikipedia isn't about a famous person or a historical event, it's about a country. In fact, I found that 38% of all articles on Wikipedia make reference to articles of countries, and this made me curious: what if I made a map where the size of each dot is proportional to the number of links to its corresponding Wikipedia article?

It may be hard to determine a pattern of which countries have more links to them, but I think it becomes a lot clearer when I highlight these countries. According to Wikipedia, these are the top 25 countries from which people contribute to the English Wikipedia. You'll notice a fairly consistent overlap, with the highlighted countries having the largest circles. There are some exceptions, like Iran, which has many thousands of articles for villages and counties in Iran that were created by a bot, but for the most part it's pretty consistent. People will naturally want to write about things that interest them and things they're familiar with, and many times that will end up linking back to the country they live in. So along with the United States being a global superpower and one of the most populated countries, it also has the most contributors to the English Wikipedia, so it makes sense that it's linked to more than any other article.

[Music]

So now that we understand the colors of the graph and the size of the circles, I want to talk about the inspiration for this whole project. If you haven't heard of it, the Wikipedia race, or Wikipedia game, is a game where you try to get from one Wikipedia page to another by only clicking links within the articles. Oftentimes you are racing against other people or trying to minimize the number of clicks it takes. For example, if you wanted to get from the article for Pokémon to the article for ancient Egypt, you could do so in two clicks: first by clicking on the link to pets in the Pokémon article, and then the link for ancient Egypt
in the pet article. Oftentimes this game is played by ignoring links in the references and see also sections of articles, so when I constructed this graph, I also ignored links in these sections, since these are not necessarily a part of the article.

One question I've always wondered when playing this game is: does a path of links exist from every article to every other article on Wikipedia? Well, it turns out the simple answer to that question is no, and the reason is orphans. On Wikipedia, an orphan is an article which has no other articles that link to it. If one of these orphaned articles was selected as the target article for a Wikipedia race, then let's just say you'd be playing for a very long time. In total, I found over 350,000 articles, or about 5% of all Wikipedia articles, which were orphaned.

Another reason why a path doesn't exist from every article to every other article is because of dead ends. Dead-end articles are articles which have no links to any other articles on Wikipedia. If you started at a dead-end page, or somehow found yourself at a dead-end page during a Wikipedia race, then you'd be stuck with no way to get to any other article. These are quite a bit rarer than orphaned articles, with only about 6,000 articles being dead ends.

Now wait, if orphaned articles exist and dead-end articles exist, does that imply the existence of dead-end orphaned articles? Well, it turns out that a little over 2,000 dead-end orphans exist on Wikipedia: articles which have no incoming or outgoing links. On top of being the most depressed-sounding types of articles on Wikipedia, I can't even show you these articles on the graph, because they completely mess up the graphing algorithm, causing the rest of the graph to become too compressed and leading to these articles becoming lost and forgotten.

[Music]

So now we know we can't create a path between every pair of articles on Wikipedia, but these dead-end and orphaned articles make up only a small percentage of total Wikipedia pages. The
vast majority of the Wikipedia graph is actually pretty well connected. For example, if you wanted to get from the hairy ball theorem to Pepsi Fruit Juice Flood, you could do that in just four clicks. If you wanted to, you could go from Baby Jesus theft to Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo in five clicks. Anyway, the point I'm trying to make is that for most pairs of articles on Wikipedia a path does exist, and you can usually get from one article to a completely unrelated article in a surprisingly small number of clicks.

And this made me think of the six degrees of separation concept, the idea that every person is just six or fewer social links away from every other person. I wondered: if you ignored dead-end and orphan pages, could the same idea be applied to the articles of Wikipedia? Well, in order to test this, I started by selecting a random Wikipedia page. In this case it was the article for Pluto. For the sake of this visualization, I'll make all the other dots the same size and color.

Now we can plot the articles in the first degree of separation. These are articles that are directly linked to by the Pluto article. In total there are 255. Next, let's highlight the articles in the second degree of separation. These are the articles that are directly linked to by the articles in the first degree of separation. In total there are over 20,000, and you can already see how fast the number of articles is growing. This growth really explodes in the third degree of separation. In just three degrees of separation, we've gone from a single article to reaching over 618,000 articles, and this growth continues, with nearly 3 million articles being reached in the fourth degree of separation. At this point we've already managed to reach over 3.6 million articles in total, which is over half of all Wikipedia articles. This causes something interesting to happen in the fifth degree of separation, where the number of articles reached starts to decrease. In hindsight it makes sense:
we've already reached the majority of articles in the graph, so the number of articles reached in the following degree will start to decrease, and it continues to decrease with the sixth degree of separation as well. We've now reached over 5.7 million articles, which is a lot, but it's still only about 90% of all articles. This means that while we can reach the large majority of articles in six degrees of separation, we can't reach all of them.

We can keep increasing the degrees of separation until the growth becomes negligible. We can view each degree of separation separately to really get an idea of how many articles were in each degree of separation. If we view the growth on a graph, we can see it increases slowly at first, then very rapidly, and then very slowly again. Of course, this is just what happened when I selected Pluto as the starting article, so I tested this again with a number of different articles: articles which had thousands of links in the first degree and articles which only had a single link, and they all followed the same pattern. What's interesting is that all the graphs start to flatten out around the seventh or eighth degree of separation at the same number of articles: 5.85 million articles reached. This accounts for about 92% of all articles.

The remaining 8% of articles are unreachable from the rest of the graph. As we already discussed, about 5.5% of all articles are orphans. The remaining 2.5% are orphan groups. These are groups of articles which have links between each other but are not linked to by any other articles. Many of these are groups of the articles for villages and towns in Iran, but my favorite orphan group has to be that of the Acton family, consisting of four English members of Parliament during the 1300s. These articles all make reference to each other but are not linked to by any other articles on Wikipedia. Oh, and it just so happens that these articles, and these four articles alone, make up the entirety of community number 42.

[Music]

So now we know that in the
large majority of cases a path exists between two articles, and it will almost always be eight or fewer links long. But what is the average path length between two articles? To test this, I randomly picked 10,000 pairs of articles and calculated the path length for each of them. On average, the path length between two articles was 4.8. It's worth noting that about 8% of the time a path did not exist. This is consistent with how we found about 8% of articles to be unreachable from the main graph. You'll also notice that paths with lengths less than three and greater than eight were extremely rare. In fact, only one path of the 10,000 tested had a length of 10.

This made me wonder: what's the longest path between two articles on Wikipedia? Now, as I already mentioned, finding two articles whose shortest path between them is 10 or more is extremely rare, only happening about 0.01% of the time. Therefore, a path with a length of 15 would be incredibly rare. A path with a length of 30 would seem pretty much impossible. But what if I told you the longest path I found was over 60 links long? Sorry, did I say 60? I meant to say 160. 166, to be exact. This path starts at the article for athletics in the 1953 Arab Games and finishes at the list of highways numbered 999. The reason this path is so long is because the only way to reach the list of highways numbered 999 is to start at the list of highways numbered 825 and then tediously click each successive number until you reach 999. It takes a really long time, but it's the only way to connect these two articles. I guess in some ways it's kind of like an actual highway. I can't say for certain that this is the longest path on Wikipedia, as calculating every single path is not feasible, but it's certainly one of the longest.

[Music]

Those were pretty much the most interesting things I found in the Wikipedia graph, but I wanted to talk about one last thing. One last article, actually.

[Music]

At first glance, Fanta cake looks like a normal, albeit short, article, but there's
actually something pretty special about it. You see, it only has one link, Fantakuchen, but when you click it, it actually redirects back to itself. Redirect pages exist on Wikipedia to help people find pages more easily. For example, if I search for USA, it automatically redirects me to the United States article. The page for USA is just a redirect page. In the case of Fantakuchen, it simply redirects to Fanta cake, but for some reason the only link on Fanta cake is Fantakuchen, creating a sort of self-loop.

Technically speaking, Fanta cake is actually a dead end because, as I discussed earlier, there are no paths to any other articles. I like to call this a disguised dead end, because it appears to have a link at first, but upon closer inspection it's really just linking back to itself. But that's not what makes Fanta cake unique. There's actually a handful of disguised dead ends. You see, what makes Fanta cake special is that it's also an orphan page, making it a disguised dead-end orphan, the only one of its kind. At least, it was when I started making this video, but since then it's actually been edited and has links to other pages. It's for the same reason that when you watch this video, a lot of the information might be slightly incorrect or outdated, but I don't think that's necessarily a bad thing. In fact, that's the beauty of Wikipedia: it's an ever-growing and ever-changing network of information, a place where anyone has the power to free an article from an existence of solitude.

Thank you to my sponsors on GitHub, who support the channel and allow me to make videos like this one, which took a lot of time and effort. By sponsoring me on GitHub, you get access to the code from all my videos, including this one. Link in the description if you're interested. If you made it this far and enjoyed the video, then consider subscribing and leaving a like, because it really does help. Thanks for watching.
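
The orphan, dead-end, and degrees-of-separation ideas described in the captions can be sketched in a few lines of Python. This is only an illustrative sketch on a made-up toy graph, not the code used in the video: the names `TOY_LINKS`, `classify`, `degrees_of_separation`, and `count_per_degree` are hypothetical, and the choice to ignore self-links (so that a "disguised dead end" counts as a dead end) is an assumption.

```python
from collections import deque

# Hypothetical toy link graph: article -> set of articles it links to.
TOY_LINKS = {
    "Pokémon": {"Pet", "Video game"},
    "Pet": {"Ancient Egypt"},
    "Video game": {"Pokémon"},
    "Ancient Egypt": set(),      # dead end: no outgoing links
    "Lonely stub": {"Pet"},      # orphan: nothing links to it
    "Isolated": set(),           # dead-end orphan
}


def classify(links):
    """Return (orphans, dead_ends) for a dict of outgoing links.

    Self-links are ignored (an assumption), so an article that only
    links to itself is still treated as a dead end.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    has_incoming = {
        t for src, targets in links.items() for t in targets if t != src
    }
    orphans = {n for n in nodes if n not in has_incoming}
    dead_ends = {n for n in nodes if not (set(links.get(n, ())) - {n})}
    return orphans, dead_ends


def degrees_of_separation(links, start):
    """Breadth-first search: map each reachable article to its click count."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in links.get(current, ()):
            if nxt not in dist:          # first visit is the shortest path
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def count_per_degree(dist):
    """Count how many articles sit at each degree of separation."""
    counts = {}
    for degree in dist.values():
        counts[degree] = counts.get(degree, 0) + 1
    return counts
```

On the toy graph, `degrees_of_separation(TOY_LINKS, "Pokémon")` reaches ancient Egypt in two clicks, matching the Wikipedia race example, while the orphan and the dead-end orphan never appear in the result because no path leads to them. Averaging the distances over many randomly chosen start and target pairs would give an estimate of the average path length in the same spirit as the 10,000-pair test.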
Info
Channel: adumb
Views: 2,219,794
Keywords: wikipedia, data science
Id: JheGL6uSF-4
Length: 19min 44sec (1184 seconds)
Published: Sat Mar 30 2024