Scraping SEC XBRL Documents | Part 1

Captions
Welcome back to another Python tutorial. We're gonna start the big one today. As you probably all have seen, I started kind of rebooting the SEC series, just because, oh God, ever since I started that series I've been having people reach out and ask me how to do certain things, XYZ, and basically what I've learned from this whole experience is there's obviously a tremendous demand to better understand how to do this and how to make it, I guess the word would be, as efficient as it can be.

I think it was last week I talked a little bit about some SEC data sources. That was just information for you guys; there wasn't much coding in that video, and for the first portion of this video it's pretty much gonna be no coding either. It's just giving us context for what we're trying to do and what we're gonna have to work with. But yeah, in that previous video we talked about some of the data sources that the SEC makes available and why they're advantageous to maybe explore a little bit. Depending on your background, there might be a tremendous amount of information that is basically free for you to grab if you just want to take the time to parse it. Really interesting topic. I actually had a lot of fun with that video, just because I think there's a treasure trove of data there that can be easily accessed by anybody who has a laptop or basically a computer.

Today, though, we are not going to cover that topic. We're gonna go back to filings. We're gonna talk about how to parse filings, but a specific type of filing. So to give you some context, let's go to our wonderful SEC website. Now, maybe most of you have been here, or maybe this is the first time you've ever seen this page. If it's your first time, the page that you're at is the company search results page. Something that the SEC allows you to do is go in and search for a particular company, and you can see all their filings,
right? And that's really advantageous, because their filings usually contain the information that you need, either for your job or your own personal pleasure. Well, on this page you see basically everything. Right now you can tell that I searched for Facebook, and I wasn't even being very specific; I just said, give me all the filings for Facebook. And so you can see there's a ton of, to put it nicely, filings, and that makes sense. They're a large company; they probably have to disclose a lot of different pieces of information about their company and their performance, so it's not uncommon to see a bunch of data like this.

However, you'll probably notice right away: well, that's interesting, why do some of these filings have Documents and Interactive Data while others don't? It's because the SEC has started requiring companies of certain sizes, and companies who have to make certain types of financial disclosures, so filings like 8-Ks, 10-Ks, 10-Qs and all that kind of fun stuff, to make this data user-friendly. In essence, that's really what they're trying to do. They're saying: hey, we're making it a rule that, as a public company, people need to be able to easily understand your business, and they need to be able to easily grab the information that's there about your company. So we're gonna require that you make your data easier to, I would say, interpret, and then also give it more context, because you can't just assume that everyone's gonna have an accounting background, that they're gonna know every little accounting rule and where to go find those accounting rules and stuff like that.

So as part of this whole push to make the data more user-friendly, more useful for everyday investors like you and me, they have basically said: we're gonna implement a technology called XBRL documents. XBRL stands for, I think, eXtensible Business Reporting Language. Basically it was a framework that was developed, I think,
maybe in 2003 by this consortium. It's basically a non-profit public organization, I think it's a global organization, and their whole purpose is: we want to make it easier to transfer data between companies, and we want to make it easier to interpret the data that companies are trying to present to stakeholders, investors, or just, you know, everyday customers maybe. So the idea behind it is: if we want to make data more useful, and more specifically business data, so financial metrics, things that can often have some complexity to them, we need to develop a whole new framework. That whole new framework basically leverages XML documents that give us some context for the information that we're seeing.

So for example, I think I did the 10-Q one. If you go to Interactive Data, I'll just open it in a new tab, you'll normally come to a page like this. It's really easy to tell you're on an interactive one, because you will see something called viewer and you'll see the xbrl_type of v. Really, what this page is intended to do is take that entire filing and make it more informative, or just easier to get an idea of what you're looking at. For example, when you go to it, the first thing you see is the cover page, and the cover page gives you just entity information, so just information about that particular entity. In this case we're talking about Facebook, right? So I can look at a particular cell and I see, oh, it's Trading Symbol. If I click that cell, you can see that this popped up. It will give me a definition, in some cases it will give me a reference if there is one, and then also it will give me details if there are some. Now in this case, you don't know this right now, but basically this is telling me the name of the tag inside the document that this information is gonna be found in, the namespace, and then the data type. So it actually gives me some context as to how I should think about this particular piece of
information. Obviously, depending on which element you're clicking on, the information that you're gonna see is a little bit different. For example, if I go to Entity Common Stock, Shares Outstanding, my definition is a lot larger now, because now we're starting to get into those financial metrics, and sometimes they can be a little bit more complicated than we want. On top of that, you'll also see details: all of a sudden, on this one, the data type has changed to sharesItemType, we now have a different element name, and on top of it we now have a period type that's different. So the idea is they're trying to communicate to you what you're looking at, because they're trying to make it more transparent.

Part of making this information more transparent is, (a), I've got to tell you what it is, right? I've got to tell you basically what you are looking at. It's pretty helpful, especially from a financial metrics perspective, to tell you how it's defined, because accounting can be a very complex field, to put it nicely, and there are different ways to interpret certain things. You might just see an account called current assets, but not understand that there's a lot behind the scenes that makes up that account. So you have to kind of think through this process of: how do I go about informing somebody who might not have the background?

So this is your cover page. Now, most of the time people want to go to the financial statements, right? Because the financial statements are basically telling you the financial performance of the company. If I go to my statement of comprehensive income, I'll see my net income. If I go to my statement of stockholders' equity, I'll see all the accounts related to that, and, you know, just information there. Statement of cash flows, so this is
obviously a really important one too. And again, I still get the same experience, where I can click on each one of these and I can get things like the definition, so tell me what it is. I can get a reference to how the FASB defines it, because basically the FASB is helping define all the US accounting rules. This is really advantageous, because now, on top of having a reference, I know where to look for this information; they'll even give me URLs to a particular link, right? And yeah, I can't log in, so it doesn't matter, but the information is here, and it's supposed to make it, again, more informative. So now you don't have to go and search: what does this mean? Does this one mean that, or does this one mean this? Because again, depending on the company, it can be a little bit different. They're trying to make that process easier for you. So now you have all the details. For example, we know the credit type, we know the data type, we know the namespace, we know the period type. This is duration, so this is saying it's over a period of time, not just a snapshot in time like a balance sheet.

So, a lot of good information here, right? And this is great. I think this is awesome. I'm all about making things more transparent, I'm all about making things more user-friendly. If the SEC can improve this, by all means go for it.

But the bigger question is: how is all this basically defined behind the scenes? Well, this is where things get a little bit more complicated. For us, this is great; we don't have to understand how this is all working. But there's a treasure trove of information here, and it's all very well structured, and if you can identify the patterns in these documents, something amazing happens, which is you can basically get your entire financial statements, and entire filings in some cases, in such a well-structured manner. I can give you everything, every little piece of information on a table. I can tell you its balance, I can tell you its scale, precision, I can tell
you the balance and everything. I mean, just a treasure trove of information. So it's advantageous to understand how this is working, because we're gonna be parsing this, right? This contains all of our information. So if we want to understand what's going on, how do we think about it?

Well, let's take a step back for a second. Okay, so we're back at our main search page, and in this situation, instead of clicking Interactive Data, I'm gonna click Documents, and I'll just open this up in a new tab. Now, if you've seen any of my previous videos, this should look very familiar to you, but basically this is the quote-unquote user-friendly experience when it comes to looking at filings, right? This is the index.html page. If you want to see the actual directory itself, the not-clean one, then you can basically just remove that first portion of the URL, sorry, that last portion of the URL, and it will take you to the actual directory, where you will see every little document that goes in here. Some are more complicated than others. They have images, they have HTML files, they have XML files, they have a giant text file that contains everything, they have zip files. So there's just a bunch of information, and we've already talked about how to parse some of it, but we'll keep it on the user-friendly ones right now.

If you ever go into that directory or this page, you'll notice some of these data files down below, and you're like: wait a minute, data files? I'm liking what this sounds like. So there's a schema document, a calculation linkbase document, definition, label, presentation, and an instance document. Well, this is interesting. Most of these are relatively informative; they actually give us a lot of good information. Some are more valuable than others. For example, in this video I'm not gonna focus on parsing the XSD file, but there's information here, and this information might be advantageous
at a given point in time, because we might want to know: hey, what version of a particular namespace were they using when they were creating this? So there are advantages to doing this, but I think for the time being we'll keep it simple and maybe cover this at a later time. I think there is a decent amount of information here, but I need to explore it a little bit more to understand exactly how useful it is. Excuse me, I went for a run this afternoon and it's just phlegm, phlegm, phlegm. It's ridiculous. Yeah, so they have some elements in here. There's an ID, they give us a name, they give us context as to the type, the period type, so there's actually a lot of good information here. The actual values, though, are not here. But it's still useful. I think there's definitely information here; it's just gonna have to take a little bit.

For the most part, the ones that we really want to work with are gonna be the calculation linkbase document, the definition linkbase document, and the label linkbase document. The presentation one I'll go over a little bit, maybe in the final video when I get that all figured out. I might integrate that as a last, final step, but right now I am not actually parsing this one, just because, for the most part, a lot of the information in here actually isn't that useful. It's just pointing back to one of the other documents. Initially I thought: oh, this is great, it's giving URLs to the, you know, consolidated balance sheet and stuff like that. I was like, that's awesome, right? Kind of. They're actually broken URLs. However, these will point back to the XSD document, which then basically takes you to that portion in the document. So again, there might be some value here, but from everything I've been able to see, at least at this point, it's kind of just navigating us back to another document that we're already working out of. Now if something comes along where I'm
like, wait a minute, there's actually something really good here, and you know, it's gonna make our lives like ten times easier, then I might incorporate it. But at this point, just for my own sanity in trying to navigate these huge documents, I've decided to skip it. However, do I still have a little gesture? Yeah, okay, good. I didn't want to have to close and stuff like that.

The ones that we're gonna mainly be working out of are the calculation, the definition, and the labels. These are all giving us the context information. So again, we're not getting values per se, but it is going to be advantageous if I want to understand: what's the definition of current assets related to this document? That you will find here. How are they labeling current assets in this filing? That you will find here. How did they calculate current assets? That you will find here. So this is where I say, if you're trying to get that context, if I had to rebuild this from a programmatic perspective, I technically have all the information I need in these three main documents to do that. If I said, you know what, go out tomorrow and rebuild that financial statement, you probably could do it with just these three documents.

Now, you'll basically have a balance sheet, but there are gonna be no values, right? So at that point you're like: well, that doesn't really help me that much, Alex. So in the final portion we're gonna parse this 10-Q HTML document. This is where the gold is; this is where the treasure trove is. This contains the actual HTML content that you see in the interactive data, along with the elements that contain the information that we want. This is structured in a way where you'll see this kind of mix between HTML code and XML code. You'll have these XML elements that contain, for example, just a blob of HTML code. For the most part I'm gonna leave out the HTML code, because that's kind of its own little beast. I'm really
concerned at this point with the actual XML documents, or sorry, elements, because they contain the information I want. So for example, let's just review one. Okay, so there's this element called context, and it's repetitive. There are a lot of contexts. What does a context element contain? Well, it seems to tell us maybe a CIK number, so who this applies to, and it looks like the ID is telling me FY 2019, quarter three, year to date. Okay, so this is giving me some good context. I'm probably gonna see this ID somewhere. Oh, but the name even tells me the period, so it even tells me what period this label will cover. It's gonna cover from January 1st, 2019 to September 30th, 2019. Okay, that's kind of neat, that's pretty useful. I wonder if there's other information in here. You're gonna scroll, and you're gonna scroll, and you're gonna scroll, and there's just a bunch of context elements. It's sickening, to be honest.

Is this where things start? Okay, yes, this is where things get a little bit interesting. Now we start getting into some of the actual metrics, and you're kind of going: okay, let's take a look. So I have FY 18 or something like that, FI 2018 Q4. I don't know what the FI is for at this point. Um, us-gaap cash and cash equivalents axis, us-gaap money market funds member, and it looks like this level 2 member or something like that. They're trying to say that this is made up of two segments: there's a dimension where it's made up of money market funds members and fair value input members. Okay, so this is again just giving us more context, which I like. I mean, it might not necessarily be useful to me right now, but it's good to know there seems to be some type of structure behind it, and it might inform me of different information down the road. So that's good. And again, there's a bunch of these, and you're gonna scroll, you're gonna scroll, and you're gonna be like: my God, how does anyone keep their sanity? And I ask myself that many days. And then, you
know, you kind of get towards the bottom and you'll see USD per share and number and, okay, interesting stuff, right? And then you start getting into some of the actual document entity information. So this is where, on that first one, if you saw it... I think at this point let's just look at it again and kind of have it in mind, and I think it might be a little bit easier to communicate certain things if you can see the one right beside it. Okay, so remember this little page right here? This is all the information in that page. It's going to give you the common stock par value, it's going to give you the amendment flag, the fiscal year end date. So this is all the document entity information, right? That's pretty useful. I can basically get everything I'm seeing here from this. All right, I'm liking it so far.

And then we start getting to see the actual accounts, and you start going: interesting, cash and cash equivalents at carrying value. Well, that's nice. They'll tell me the ID, and they'll give me the actual value, and they'll tell me the unit reference, they'll even tell me the decimals and the context, all to tell me what this is basically covering. Well, this is pretty neat. And if you keep scrolling, they basically give you everything. Basically everything that you're seeing in these financial statements can be tied back to one of these guys right here. For example, if I do cash and cash equivalents, it looks like they have... that's the wrong one. Okay, so us-gaap cash and cash equivalents at carrying value. Let's maybe see if that's somewhere in this document. I hope it is, but it might not find it, because sometimes it doesn't. Did it not? I think it's because this one's further down. Oh, I think it's because it's in that stupid element, and I think it's probably also because of this. Yeah, so the tag's a little bit different. Even though it's a dash here, technically everything in the tags is gonna be a colon, so you have to
kind of keep that in mind. Even here it's a dash; it's actually a colon in the document. But okay, so fifteen nine seven nine, fifteen nine seven nine. Okay, that's good, that's encouraging. At least it ties. So it looks like I can get everything that I'm seeing in those documents from these documents, right?

But that's just part of the story. Now we've found the information; how the heck do we get it out? Well, for the most part it's not too hard to get it out. The problem is: how do you match it all up? From what I've been able to tell right now, there isn't really a master key. There might be, but I haven't verified if that's really the master key yet. For example, most of the time you'll see something like this, right, current assets. This might be a unique identifier; I'm not sure yet, I have to test a little bit more. So again, if that does change, you'll probably see me change my parsing process. Not all of it, but you might see the way I'm matching things up change, just because if this is truly the master key, then I can eliminate a process in my script where I'm trying to create what I think is the master key. And the only reason I think it's the master key is because there hasn't been a situation where I'm missing anything. So it seems to work, but I can't say with a hundred percent confidence that, oh yeah, it'll always work and it's never going to be wrong or something like that.

So I'm really trying to identify an easy way, from our perspective, to say: hey, the information that you find in here, can you somehow match it up with the information that's in this one, and same with this one and that one? Basically, can you create your master dataset, where all the documents have been joined together in some regard, if you want to get all database-like? And can we do that to make it all, I would say, work together, right? That's kind of the idea behind it: we have to develop some type of
framework to basically mash everything up, to have these separate documents now basically merged into a single document. That's really the idea behind this whole process.

I'm pretty sure I went a little bit over, but that's okay. In our next video, now that we have our context behind, like, okay, what the heck are we looking at, we're gonna move into the actual coding aspect of it. It's really important that you watch this video, because if you're like me and you have to just look at these and not have any idea what you're looking at, it gets so confusing so quickly. This was a pain to decipher, and I was only able to decipher it because I found some old resources that kind of told us how it's structured. So it's there, but those document guides can be hundreds and hundreds of pages long. So don't go through what I went through. Take my knowledge and use it, have fun with it, make your life easier, so that way you don't have to go through the hell that I went through. Because it is confusing, there's a lot there, and it's not necessarily easy to just merge it all together in a simple fashion.

Now that we understand that context, we can start moving into the process of actually parsing it. If you have any questions at this point, feel free to put them down in the comments below as always, and I will do my best to answer them and hopefully guide you in what I would consider the right direction. In our next video we're gonna actually start coding it, and we're gonna see the overall structure. Now again, just as an FYI, do not be surprised if down the road this code changes slightly, because it's still kind of a work in progress. But otherwise, if that's it, we will see you in the next video.
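The captions describe context elements (an id, an entity identifier, a period) and fact elements that point back to a context through attributes such as the context reference, unit reference, and decimals. A minimal Python sketch of that relationship is below. It is not the code from the video series: the sample XML is hand-written, the context ids, CIK, and value are illustrative placeholders, and the helper names (`parse_contexts`, `parse_facts`, `join_facts_to_contexts`) are hypothetical. It only assumes the standard XBRL instance namespace and the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Namespace of the XBRL 2.1 instance schema, in ElementTree's "{uri}" form.
XBRLI = "{http://www.xbrl.org/2003/instance}"

# Hand-written stand-in for an instance document. The ids, CIK and value
# are illustrative placeholders, not copied from a real filing.
SAMPLE_INSTANCE = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
            xmlns:us-gaap="http://fasb.org/us-gaap/2019-01-31">
  <xbrli:context id="FI2019Q3">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2019-09-30</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="FD2019Q3YTD">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2019-01-01</xbrli:startDate>
      <xbrli:endDate>2019-09-30</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:CashAndCashEquivalentsAtCarryingValue
      contextRef="FI2019Q3" unitRef="usd"
      decimals="-6">15979000000</us-gaap:CashAndCashEquivalentsAtCarryingValue>
</xbrli:xbrl>
"""


def parse_contexts(root):
    """Map each context id to its entity identifier and period dates."""
    contexts = {}
    for ctx in root.iter(f"{XBRLI}context"):
        period = ctx.find(f"{XBRLI}period")
        contexts[ctx.get("id")] = {
            "cik": ctx.findtext(f"{XBRLI}entity/{XBRLI}identifier"),
            "start": period.findtext(f"{XBRLI}startDate"),
            "end": period.findtext(f"{XBRLI}endDate"),
            "instant": period.findtext(f"{XBRLI}instant"),
        }
    return contexts


def parse_facts(root):
    """Collect every element that carries a contextRef attribute."""
    facts = []
    for elem in root.iter():
        context_ref = elem.get("contextRef")
        if context_ref is None:
            continue  # contexts, units and the root are not facts
        namespace, _, local_name = elem.tag.rpartition("}")
        facts.append({
            "namespace": namespace.lstrip("{"),
            "name": local_name,
            "context_ref": context_ref,
            "unit_ref": elem.get("unitRef"),
            "decimals": elem.get("decimals"),
            "value": (elem.text or "").strip(),
        })
    return facts


def join_facts_to_contexts(facts, contexts):
    """Attach period and entity details to each fact via its context id."""
    return [{**fact, **contexts.get(fact["context_ref"], {})} for fact in facts]


root = ET.fromstring(SAMPLE_INSTANCE)
for row in join_facts_to_contexts(parse_facts(root), parse_contexts(root)):
    print(row)
```

Here the context id plays the role of a join key between a fact and its period, which is one plausible way to "match it all up" for the instance document alone. Joining the label, calculation, and definition linkbases would need a separate key, which the captions leave as an open question.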
Info
Channel: Sigma Coding
Views: 10,607
Keywords: Sigma Coding, XBRL, SEC, Parsing, Web Scraping, Finance, Fin Tech, Financial Documents, Extensible Business Reporting Language, How to parse XBRL documents, 10K, 10Q, Financial Filings, SEC Filings, Wealth Management, Investment Banking, Corporate Finance, Economics, Python, Coding, Programming Language, Python For Finance
Id: dJymnTL3hgc
Length: 25min 46sec (1546 seconds)
Published: Sat Feb 15 2020