TimescaleDB - PostgreSQL for Time-Series Data

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] i hey how's it going uh back here with another video uh today i'm going to be talking about time scale db which is a time series database that's built on top of postgres which is an open source relational database very popular one of my favorite databases you may be familiar with it already i'm basically starting a new series it's probably like seven or eight videos i'm going to first discuss what time scale db is how it might be useful to you and i'm going to go into a few practical applications where i attempt to use it for some financial data applications and stock trading apps you know what we do on this channel like and subscribe if you like what i'm doing and buy me a drink if you want to contribute uh something to the channel i'll leave a link below so let's go ahead and get started with talking about timescale db this first video is going to be more about the what is time series data what is time scale db why do i care that sort of thing so i've written a few slides here and i've went through the time scale db documentation to attempt to understand it a bit better so it's very new to me so this is the perspective of someone that's just trying this out for the first time but i think i'll cover some useful things to you and in our past videos we've used a sql lite for instance as a data store and i think using timescale db will offer some unique uh features that will be useful to you all right so first of all i've made a few slides here with some bullet points so i remember uh what i'm going to cover so this is an outline of where we're going in this particular series so the first slide i have here is what is time series data like what's unique about it why do you care why do we have to name data isn't it just data with the time stamp what's unique about time series data so time series data is just data that changes over time and it means that your records have some type of time stamp associated with them a few examples of that are for instance cover 19 data you see a chart over time showing how the virus has has spread over time where it's plateaued if there's a second wave those sorts of things very useful to people to make decisions and to have seasonal forecasts things like that other examples that are very popular are that i used to work on a server monitoring software as a service application in the past and we measured a lot of things like cpu and memory usage over time and so we track that data minute by minute and you could derive some interesting insights out of that data so you could show a dashboard to the server administrator showing how cpu has changed over time with uh and monitor like thousands of machines right and if cpu spiked during a certain minute and that corresponds corresponded to some other characteristics of mem maybe the disk io or the memory usage then you might be able to use it to detect anomalies or figure out something's wrong with a particular machine or maybe in the future you you might be able to determine that you know in an hour from now this server you know we might need to look at that right so tracking this data over time is very useful other uh useful applications are internet of things devices sensor data tracking like wind and temperature over time things like that and so on this channel as you know what we do we like to use financial data and so we're often tracking uh stock prices or crypto prices over time whether that's you know minute by minute day by day hour by hour even second by second and streaming in real time data something like times a specially customized extension for dealing with time series databases uh could be very useful to us right because this provides a lot of aggregation functions uh it speeds up the our ingestion speed so we can insert data really quickly we can select data really quickly we might have some useful analytics functions that we can use on time data so that we don't have to like pull everything out load it into a pandas data frame write all these functions ourselves there's a lot of cool stuff that's built in that's provided as an extension to postgresql and we can use also the postgres functions to work with data as well so uh the data can be captured at this time series data we can either capture it at regular intervals which is the case of historical candlestick data that we often load in it's just data that's collected every five minutes or 15 minutes or whatever interval we decide but there's also irregular intervals that like events that we can track uh and so that could be just like streaming in real time bid and ask data for crypto that's happening all the time right but it's not really a regular interval it's just as they come in uh we insert data another example of a of a case where data is enriched by timestamps so the example i believe that timescale uses is logins right so um i don't know if i'm building a web application i don't necessarily have to record a timestamp uh when someone logs in right or i could just record a single like timestamp for the last time the user logged in right and that's useful right we know when they last logged in but they give the example of what if we have a record for every single time a user is logged in in history and record observations about that particular login right then becomes more and more useful to us that data gets a lot of extra value because we can see if we record the ip address where the user logged in each time we can see where this user moves around if a user logs in 100 times a day or every five minutes then we know they're a very active user of our system if a customer has not logged in in a month you know we might want to check on that person because they might cancel the service right and we can send them an email and say hey i miss you lately where you've been right and that can help you know maybe bring that customer back or help you find customers in advance that might cancel your software as a service application or whatever whatever you're building and so time series data is very important to understand and uh using time series data very well can actually make your business more profitable and it's also useful for us obviously since we're working with a lot of financial data and charts and technical analysis and things like that right another unique characteristic of time series data is it's immutability and so most of the time you're just inserting data so there's a lot of types of applications where you're going back and updating and making a lot of edits but if we're just downloading a bunch of historical data you know what happened in history is already recorded right and we can just archive that we don't need to go back and update what the dow jones industrial average was in 1991 that data is always the same right and so with timescale db uh they can make certain optimizations for insertion speed right they don't have to worry so much about going back and updating certain types of data right and so yeah you can just optimize for insert speed and also fast select so i'm sure they've probably focused on those areas and also made some interesting features for uh data retention and like purging old data that we may no longer need compressing that data things like that if it's not accessed frequently okay uh and so some practical uses for our case for time series data is pattern recognition for instance so a lot of times we're looking for particular chart patterns a lot of people trade and say oh i want to find this uh triangle pattern or this three bar breakout that sort of thing right and so that'll be one example video that i do i'll probably do just a simple three bar breakout and we can just find that using regular sql and postgres lead and lag functions right uh we don't have to pull all that data out like tons of data and then filter that down in a pandas data frame and do all that work we can write one sql query to find that and just pull out just the data we need it's also useful for forecasting so there's there's python tools such as facebook profit that allows us to detect a seasonality trends and do time series forecasting uh we might be able to do some anomaly detection and detect unusual activity in options or something like that or in unusual spikes and prices maybe a gamestop short squeeze or something like that you know maybe gamestop is normally not very active but we saw how crazy it was in the past week or so um yeah and there's there's tons of other uses and so i'm going to talk about those a little more in a moment so why specifically time scale db so we've recognized the value of working with time series databases and needing some special tools to work with time series data and so why specifically choose a timescale db and i have to talk about this because every time i've talked about a relational database on this channel there's always a few people are like why not use nosql isn't it faster isn't it better and i think that is not necessarily the case first of all um you might you probably won't even run into any of these problems that you're talking about i've personally worked on multiple startups that have done very well and grown to huge become huge businesses larger than anything you probably ever deal with and certainly as just a project on this channel right and postgres was the central database for the whole system and even the the application that i talked about that monitored tons of different desktops you know hundreds of thousands of devices it might be millions of devices at this point um postgres was the central database that we used for everything we had some redis caching and things like that but postgres was at the central data store and also i worked on a file syncing application where we recorded lots of data in postgres as well and so uh the claim that you can't use a an sql or relational database to work with time series database i feel like isn't true just from practical experience but if that isn't enough right the time scale folks have done a great job you really need to read their blog because they have compared it they've taken the time to compare time scale db against other types of databases specifically mongodb which is brought up a lot and also influx db which is another time series database and i have the links right here and just to pull them up um they've measured and determined hundreds more 260 percent higher insert performance 54 times faster queries simpler implementation and so these blog posts are very detailed these aren't just isn't it faster like they actually uh measured this and provided all the benchmarks they used the type of data they used right and so this is an evaluation of insert performance query performance disk storage footprint implementation cost they go through multiple methods they show the ingest rate right here as you can see a time scale db they benchmark it faster obviously they're a little bit biased since this is written by timescaledb but they do publish the actual setups they use they showed disk usage where lower is better and so uh the disk compress the compression uh must have been much better so at 10 times smaller disk storage footprint and you remember when we stored a lot of historical data for years and years for qqq for instance we had a database that was like three gigabytes or so and that wasn't even all the symbols so you could imagine if we start pulling in like terabytes of price data this foot this footprint becomes very important because a lot of times we're just storing this on our desktop or on a virtual server for instance and also this discusses uh select performance you see it's extremely fast there and i'm not going to go through everything here but just know you know they've they specifically talk about mongodb and the different recommended methods that the mongodb folks say that you should use to store time series databases or time series data and determine that a relational model here using timescale db is much faster for both rights and reads and this goes on and on and it talks about the recommended setup for mongodb as well not just making it up as you go they went through the best practices and everything right and then likewise the same thing influx db is marketed as a great time series database as well i've never personally tried it but they do go through a detailed analysis on the data model the query language reliability performance so they talk about a number of reliability issues and i personally worked in the gaming gaming industry and i do know folks that use mongodb as their primary data store and did experience major problems in production at scale and so i've heard this anecdotal anecdotally as well but one of the main focuses here on is the data model and the language used and what's cool about time scale if you're like me is that it just used sql and i already know sql i've worked with sql server mysql sqlite oracle all these other databases right so i have this you know years and years of knowledge of sql and i can use that immediately with time scale db i don't need to learn another language and one issue with influx is they've come out with like multiple versions of some query language so there's this influx ql and it's sql lite are like it's like sql but not quite and so there's all these things you have to learn and so like i don't want to read an entire uh manual necessarily to figure out how to implement all this stuff and so you see if i want to learn influx right i have to go through all this and learn the stuff that's specific to just influx right and just doing that takes so much time to get ramped up on and it really isn't any better right and so by embracing sql you can use a lot of knowledge you already have but also by embracing postgres right we have this powerful database that's been around for decades it's probably been around since the mid 90s so over 20 years of experience built in right and all this tooling so you have tools like sql alchemy and object relational mappers from django flask all these tools right that already work out of the box with postgres since python the python web application developers have basically embraced postgres since the beginning and so you know why why use this other database and learn this proprietary uh language that just works with one one database you know use the knowledge that you already have and so i like that and also a lot of times people end up using another database like influx just to store their time series data and what's nice here you know a lot of times i have a web application with users and and other objects and documents and things like that other relationships tweets whatever it is and it's stored inside of a postgres database right and just being able to use uh that same database for time series data is very nice so just a high level list of features that time scale db provides us the first is fast ingestion of time series data so if we're retrieving historical data for like thousands of stocks we should be able to just get minute bars with you know millions of minutes and just insert all of those into the database without it hanging up or locking up or any issues and so it should be able to handle insertion of data as fast as we will need it includes time oriented analytics functions so if you go through timescale documentation you see there's functions that are provided like histogram right so if you look at the sql here they have this histogram that they run on battery levels so we can run that on our price data they have interpolate so you can interpolate from one data point to the next so like draw a line or in some cases we do this uh last observation carried forward if you remember when there was gaps in our minute data we just carried the last observation the last price forward in time to fill those gaps there's time bucketing functions and things like that and i'm going to go into that in detail as soon as i get into the tutorial part continuous aggregates so being able to roll up your minute data into five-minute candlesticks or 15-minute candlesticks or rolling up all your data into daily candlesticks uh data retention so maybe you don't care about data that's older than 30 days and you just start dropping that data right you only need to care about the most recent data so you can define policies for how to clean up this old data automatically so that you don't have to worry about that uh compression so you can take up a smaller disk uh footprint as shown in the documentation and you have the full availability of postgresql tools and ecosystems so and one other thing i'll mention as far as just general credibility like you see a new database system and you're like should i trust this thing right uh but if you look on their page like the the folks behind this are very uh very successful very uh knowledgeable people so if you read the blog and you look at the background of all these folks you know these are like professors at princeton that they're they know their database stuff right so they've dealt with lots of data at scale it's not just marketing right there they've benchmarked all this stuff they provide plenty of examples and technical insight and talk about running this across multiple multiple machines with thousands of devices and recording data really quickly at scale and so i was very very impressed by reading through their documentation okay um and then why postgres write i don't feel like i need to say too much about it but it's battle tested right it's scalable fast you can store whatever you have the ability to join data there's useful aggregate functions uh it works well with your existing business data so there's so much data that's in a relational database already you can use triggers stored procedures you have relationships and you can preserve referential integrity all the things benefits you get of a relational database plus you get all the tooling that comes with python web application frameworks like sql alchemy uh it's the django object relational mapper right uh you get postgresql extensions uh the most popular one that we hear about is post gis which is very useful i've worked on geographic applications before so there's useful useful extensions for working with latitude and longitude and geometric shapes and all kinds of interesting things that you can store and and write function functions for and yeah and if you care about storing json objects which is a reason a lot of people use mongodb it can store json documents as well and there's very useful functions for querying json data so at a previous tool i built just took blobs of json data and stored it and that's one of the benefits of mongodb is that you don't have to maintain the schema but if you just want a key and to store some json data postgres has been able to do that since postgres9 and they have this json b format we can do that and they have these operators where you can like query for values of particular attributes in a json document and it works really fast it's really cool um and then yeah it's fully open source so it's not owned by a publicly traded company so as you know oracle has acquired you know oracle is a multi-hundred billion dollar uh corporation that's publicly traded they bought mysql uh people were concerned about that at the time and oracle yeah and uh microsoft owned sql server right and so it's nice to have this database that's still you know open source and it's also used by many of the most successful companies and that's why you trust postgres right who started off with this i believe instagram used this for many years that's the sole data store uh and then you know you every company has this in some form reddit robinhood um all the basically all these companies and so if it's good enough for these people probably good enough for what you're doing with it and will get you where you need to go um and and so what are we gonna do right that's a great overview you know you can read the documentation for that what happens on this channel is we're going to talk about some practical applications and i'm going to show you how to build them from scratch and here are the things i have planned to show you how to do the first we need to get timescale db up and running and so we're going to use docker for that and so i'm going to show you how to just install docker and run timescale db and also you may remember that uh recently in last month's video i made something called tradekit which is a docker container with a bunch of financial libraries built into it and web frameworks and also timescale db and so i want to show you how to get it set up out of that box and develop using that that way we can just work on a common docker image that has all of these dependencies already installed i'm going to show you how to populate the database right so once we get up and running we need to get some data to work with to even demonstrate anything so we'll do the usual thing where we just retrieve a bunch of historical price data on stocks and i was thinking since the arc fund since kathy wood are like super uh popular and hot right now that we would just focus on like getting the csvs from the ark site so the arc arc invest right uh they provide csvs of all of their funds like uh arc f and rg and those types of things and then they provide these daily csv files and what i figure is that we'll just download those csvs and retrieve historical price data for all the different stocks so we can make like a little etf database uh store historical data and maybe we want to track how the arc funds change over time because i know it's i i feel weird about this but like people are all like uh whatever whatever arc fun bought that particular day a lot of people are jumping on it so we can kind of look at how you know their portfolio changes over time new stocks they're investing in and so forth not that i recommend you just follow blindly follow what kathy wood or chamath or any of these popular folks now are doing but you know it's something to to look at um so arc funds for the source of our historical data and we'll probably use alpaca or interactive brokers to retrieve the actual prices on these stocks we're going to demonstrate some analytical functions so there's a lot of functions built in as i mentioned like first last histogram time buckets time bucketing less observation carried for forward interpolate different functions that are provided we'll use those functions on this data we'll also use the features of continuous aggregates and retention policies and build a simple api on top of this so i've been using fast api on top of it and so i thought i would spin up the old trade kit and show you how to develop a web application and pretend you're you're uh selling or as a subscription and an api service for some type of financial data and you want to serve it up i'm going to show you how to build an api and use timescale db as the backend for that and we'll just serve up the data in a variety of formats we're going to use sql to calculate some common indicators and also find some chart patterns in data and so i've watched a few videos that other people have done you know detected things manually previously and so i want to show you how to automate some of that so finding certain patterns in our data just using regular old sql and we'll probably use a postgresql some other functions like stored procedures or create a user defined function uh that uh maybe can do some automation for us so perhaps when a certain pattern is detected uh in the data maybe we can automatically execute a trade or log a record to another table that sort of thing and then in a follow-up series i want to focus on plotly dash i've shown just regular jinja templates and react a little bit in my mobile front end but i have also been meaning to create a user interface using a plotly dash for folks that just want to use pure python because i do a lot of javascript as well but not everyone wants to mess with that so another goal i've have in the channel is to do a video on plotly dash and also i think it'd be good to do a crypto version of all this i've noticed actually one of my crypto videos is the most popular one now even though i mainly meant to talk about stocks and so i also like that crypto data is running non-stop so when i'm working on this on the weekend i i can stream some data in and show some examples that way so i'll probably do a crypto version of this as well so yeah that's an overview of where this is going what time scale db is why you'd want to use it why you care and these are the things i'm going to do with it so stay tuned for the next several videos as we explore time scale db thanks a lot for watching see you in the next video
Info
Channel: Part Time Larry
Views: 12,978
Rating: 4.935185 out of 5
Keywords: timescaledb, postgresql, time series data, stock market data, ohlc, technical analysis, python, pattern detection, breakouts, influxdb, mongodb
Id: MFudksxlZjk
Channel Id: undefined
Length: 25min 26sec (1526 seconds)
Published: Sat Jan 23 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.