ClickHouse v23.6 Release Webinar

Captions
Amazing, we are now live streaming on YouTube for those joining directly. Let me grab the YouTube link, Alexey; I'm going to share it with the people who are joining us. Let me make sure that we've gone live as well. I'm going to click the "start webinar" button so that we can do both things at the same time, and the webinar appears to be started as well. Webinar is such a horrible term; it's not a webinar, it's a call. We see a couple of people joining: Brawl, Constantine, and others. I'm going to go ahead and tweet out that we're getting started.

Yeah, my name is not "the webinar" but... what else should we call it? I've called it a community call the last few weeks, because that feels more appropriate to me than a webinar, which just feels somehow incredibly inappropriate and demeaning to what we're really doing here. But if anyone in the chat has a fantastic idea for what we should call this thing, I would love to hear it.

When I'm looking at your beard, I'm thinking: yes, it is a community... commute... something. But okay, let's start with the webinar. So welcome, Ellen, nice to see you again. Welcome, and sorry if I'm pronouncing names incorrectly. Banana Man, who that is I'm not sure, but it is the best name ever, thank you, Eric... Banana Man. Okay, welcome Brenda, Martin, Brian, Constantine, El Chanca, welcome Kristoff, nice to see you, Nikolai, welcome again Robert, I'm always happy to see you on the webinar, welcome again, welcome Todd. Okay, I see many people are joining. You can watch us on Zoom, or if some of your friends don't have Zoom they can watch us on YouTube, and you can ask questions live in Zoom or in the YouTube stream. Yeah, who is Banana Man, interesting question, but okay, Banana Man, if you want to ask me a question before we start, please go ahead.

I think we're good, Alexey, we see quite a few people rolling in, both on YouTube and on Zoom, so given that it's a few minutes past the hour: I've just posted an update on Slack, and if you're not part of our ClickHouse community on Slack, you are always welcome; clickhouse.com/slack is the invite. I've posted it to the community on Telegram as well as to our community on Twitter; the Twitter name is ClickHouseDB. I'll drop it in a couple of other places as well, including Meetup: at meetup.com/pro/clickhouse you'll find all of the events we have coming up, both online and in person. Fundamentally, 23.5 was an amazing release, and 23.6 is an amazing release. I have worked at a variety of open source companies in my career, which has been almost 25 years now, and there are few companies where I've been as excited each month, as we drive the release train forward, as I am with ClickHouse. It's truly a privilege to not only share with you where we're going but what we have done, and also demo what we have done live. I've had a peek into the slides that Alexey is presenting, albeit not too long ago, and they're incredible. So with that, Alexey, if you don't mind, why don't we go ahead and kick off with what's new in 23.6.

Okay. By the way, I would like to make releases not every month but maybe every week or even every day, but if I did that it would be just way too much amazing, so let's try to keep it every month. Okay, let me share my screen. So, what's new in release 23.6? This is our summer release. I will spend, as usual, about 50 minutes on new features, and maybe we will have ten minutes for your questions. We have about 10 new features, 12 performance optimizations, and 31 bug fixes, so much new stuff. Let's start.

What do we have first? It is about the function transform. So what is the function transform? It takes an argument, an array of source values, an array of values to transform to, and a default value, and it will compare the argument to every value from the first array; if it matches, it will return the corresponding value from the second array. Easy. You might rarely use this function transform as-is; if you prefer to stick to standard SQL, you will find that it is almost the same as the CASE operator, and it looks similar: you have one argument, what to transform, and you have some branches with values, what to transform from and what to transform to. And I will say that until the current release 23.6, ClickHouse was actually a very bad database management system; maybe you did not like it. But starting with version 23.6 you will enjoy using ClickHouse, because now the function transform and the CASE operator have support for every data type, not just numbers, strings, and date-times.
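For reference, the two equivalent forms being described look roughly like this; the country codes are made up for illustration, and the same pattern now works uniformly across types (dates, decimals, and so on), not only the strings shown here:

    -- transform(value, from_array, to_array, default) and its standard-SQL CASE equivalent
    SELECT
        transform(code, ['NL', 'DE'], ['Netherlands', 'Germany'], 'Other') AS via_transform,
        CASE code
            WHEN 'NL' THEN 'Netherlands'
            WHEN 'DE' THEN 'Germany'
            ELSE 'Other'
        END AS via_case
    FROM (SELECT arrayJoin(['NL', 'DE', 'US']) AS code);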
Okay, what else? What about SELECT INTO OUTFILE? In the previous release we introduced the APPEND modifier. So what happens when you write into a file? If it does not exist, it works; if it already exists, it will give you an exception like "file already exists, consider using APPEND or TRUNCATE modifier". And since version 23.6 we introduce support for this TRUNCATE modifier: it will simply truncate the existing file and write the result. Pretty easy, a small change, but it is nice, it is useful, why not add it, and now we have it.

What about processing of local files from the local file system? If you are using the file table function, or the s3 table function, or hdfs, whatever, any integration function that reads a set of files: if some of the files are empty, sometimes they will be processed without any problems, because if the files are in CSV or TSV format, or JSONEachRow, even RowBinary or Native, empty files are not a problem; they represent a valid empty data set. An empty CSV file is just an empty table of unknown data structure, and it works fine. But the problem is that an empty Parquet or ORC file is not a valid file; it has to at least contain a header. Still, sometimes, for some reason, you have these empty files in your file system, so now we provide options to just skip these files. Also a small and nice feature; maybe you don't need it, but some of your friends will definitely need this feature, and your friends will be happy using it.
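Two hedged sketches of these small features; the file paths are made up, and the skip-empty-files setting names are as I recall them from the 23.6 changelog, so double-check them against the docs:

    -- Overwrite an existing output file instead of getting a "file already exists" error
    SELECT number FROM numbers(10)
    INTO OUTFILE 'result.csv' TRUNCATE
    FORMAT CSV;

    -- Silently skip empty Parquet/ORC files while reading a whole directory
    SELECT count()
    FROM file('data/*.parquet')
    SETTINGS engine_file_skip_empty_files = 1;
    -- analogous settings should exist for the other integrations,
    -- e.g. s3_skip_empty_files and hdfs_skip_empty_files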
Okay, this is more interesting: the possibility to rename processed files. With this small feature you can build small and nice data pipelines inside just one invocation of clickhouse-local. Imagine you have a directory with a set of files, and some external application is constantly pushing new files into this directory, with new files constantly appearing, and you want to process all of these files, then process the new files after one second, and process all the new files again after two seconds, and so on. Now it is possible: just invoke clickhouse-local in a loop with this option, rename files after processing. The value of the setting is a template with some substitutions, like the original file name, and you can, for example, append some extension to each file, so in the next invocation already-processed files will be skipped, and you can do INSERT SELECT into another table, so it will continuously consume this bunch of continuously updated files. You can specify some interesting things in this template; for example, you can rename with an appended timestamp of data processing, so it will be recorded, or you can include a slash to rename files into a different directory; everything will work.

Let me show you a demo. First I have to create these files, and I will just copy-paste some commands into the terminal. Let me first create a target file just for the demonstration, and I will do it with SELECT INTO OUTFILE; it will automatically determine the data format and the structure, and I will start with something simple, like a file containing just a single record with the number one. Okay, now I have this file and I want to process it, and I will use just a SELECT; reading a local file is easy. Okay, it works, and I can do it multiple times. But let's define this magic new setting, rename files after processing, which takes the template. I will do something like this... SETTINGS... one mistake... okay, it works. What if I do it a second time? Okay, it shows that there are no such files. If I do it like this, it's interesting what it will do... but apparently I have some other files with matching names. Now I terminated this query, so the other files will not be renamed, but let's look at this file: it is renamed. So it works, it works perfectly.

Let's look at what else it can solve. So this is a nice feature, but maybe it is quite limited, and I already have some feature requests to the author of this feature: what if we extend it to work not only for the file system but also for S3 and HDFS? Why not? Maybe it will become the main scenario of data processing, since you don't usually have access to the local file system. What about INSERT SELECT, what are actually the guarantees? Yes, it does have some guarantees; for example, if the command invocation is successful, it is guaranteed that the data was inserted into the destination table and the file was renamed. But if you get an exception, the overall operation is not atomic, and it is not specified what the relative order between data insertion and file renaming is. So there is potential for improvement.
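A hedged sketch of the loop being described, with a hypothetical target_table; the setting is called rename_files_after_processing, and the %f / %e placeholders (file name and extension) are how I remember the template substitutions, so verify them in the docs:

    -- Run this repeatedly (for example from a shell loop around clickhouse-local):
    -- files that were already processed have been renamed, so the glob no longer matches them
    INSERT INTO target_table
    SELECT *
    FROM file('incoming/*.csv')
    SETTINGS rename_files_after_processing = '%f%e.processed';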
Okay, what else? This is quite an interesting feature for clickhouse-client; it is about specifying where it should connect. There is the normal way, specifying connection parameters like --host, --user, --password, --port, everything, and there is an alternative way, maybe not so normal, and I'm not sure it is any better; it does not look better to me. But now you can specify everything as a single connection string, like this: user, password, host; it even has its own URL scheme. The question is: for what? Why do we need this feature? Do you have any ideas? Because I am wondering why it is even presented in the changelog. Any ideas?

Yeah, so for me, I immensely prefer it that way, and that probably leads back to my DB2 days, when this was the way I connected to a DB2 database from a silly little Solaris pizza box. But also, Dale said maybe so you can override boolean flags like secure. Fundamentally, this to me looks like the way that you use a client, versus the dash-dash... How old are you? Okay, don't tell anyone; "very" is the answer, Alexey. Okay, any other answers? You can post in the chat. Any ideas why exactly, what is the motivation for this feature, the actual need? Dale, maybe you have an idea?

Well, I can see how I would use it, but I don't think that's why it's designed. I have a secure value in my local clickhouse-client config, and I want to connect locally to my local cluster, but I can't turn off the secure flag, so I would just put in the full URL in this case. I'm not sure that's why it was designed, but that's how I would use it: to override flags which I can't override on the command line because they are set in the config.

Okay, actually there are two reasons. One is the wrong reason, and the wrong reason is that Postgres has it; everyone else has this feature. But the actual reason we want to introduce and standardize this connection string is that many integration tools expect connection parameters in the form of a single connection string, so if these tools require a connection string anyway, why not at least make a standard and support it in clickhouse-client.

That's a good point, Alexey. Just a shout-out to Brawl, who mentioned that it looks like a SQLAlchemy connection string, so they're already seeing the similarity between this connection string and others, and I think that's your point: when third parties or integrations require that connection string, clickhouse-client now supports it. Okay, let's put it that way.

Okay, this is probably a more interesting feature: the possibility to specify the default time zone in the scope of a session, or in the scope of a query, or in the scope of a user; it's available as a setting. So what happens if you run a simple query like SELECT now()? The function now, like any other function working with dates and times, returns a DateTime, and the DateTime data type is represented internally as a Unix timestamp. The Unix timestamp is just a single number representing the number of seconds since 1970 in UTC, without leap seconds, so it is pretty easy, and a Unix timestamp does not depend on the time zone; wherever you are located, it represents global time in UTC. Okay, but why do I get this value? Because we just print this Unix timestamp according to the server's time zone. So if the server time zone is configured as UTC, you will not have any trouble, but what if it is configured as, let's say, somewhere in the US, or, whatever, an Antarctica station? You will get some other representation of this date-time. You can specify the time zone explicitly: you want to print this date-time in the Los Angeles time zone, you will get it. But now you can also set up this default session time zone; it is a setting, and it will be used by default in every date and time function. Pretty easy, and maybe it is just yet another way to confuse yourself with time zones, but let's imagine you want to confuse yourself: now you have all the means to do it.
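A minimal sketch of the new setting, assuming it is named session_timezone as in the 23.6 changelog:

    -- Rendered in the server's configured time zone
    SELECT now();

    -- Same Unix timestamp, rendered with a per-query default time zone
    SELECT now() SETTINGS session_timezone = 'America/Los_Angeles';

    -- Explicit per-value time zone, which was already possible before
    SELECT now('America/Los_Angeles');

    -- Or set it for the whole session
    SET session_timezone = 'Antarctica/Vostok';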
Okay, integration with MongoDB 5.1+. What is the matter with MongoDB? Inside ClickHouse we have an integration with MongoDB, we have had it for years, and it works nicely; it always worked. It is available as the MongoDB table engine and the MongoDB table function, and it works — until one day they changed the protocol in MongoDB version 5.1. They simply broke the integration with every old driver, and we had to catch up and support the new changes in the protocol. But now we have tested it up to MongoDB version 6, so it will work, probably until they change this protocol again.

What about performance optimizations? There is so much performance inside ClickHouse already, so where do we get more and more performance optimizations to present in every new release? Let's take a look. The first is one you will probably find useful for your production use cases; it is about the famous "too many parts" exception. If you insert data into ClickHouse, everything is perfect, but you think: can I insert this data faster? So you start to spawn many clients to insert data in parallel, faster, or you increase the number of insert threads, and the data is inserted faster and faster, one million records per second, five million records per second, ten million records per second, until it gives you the "too many parts" exception. It will say merges cannot catch up with inserts, or something like this, and obviously you don't like it, because you will have to restart your data pipelines, you will have to figure out which part of the data was inserted and which part was not; it is not convenient.

But actually we have to ask: what is too many parts, how many is too many? It is controlled by the settings parts_to_delay_insert and parts_to_throw_insert, and in previous versions the default value of parts_to_throw_insert was just 300, so you get 300 unmerged parts and you get this exception. Is that enough, should we increase it, should you increase it? It's an open question. Also a question: should we apply some sort of backpressure, like slowing down already-running inserts? To answer this we have to look at these graphs. What do we have on these graphs? Here I have an experiment with a continuous insert of 400 billion records, with four concurrent connections, on a single server, at a speed of 7 million records per second, with continuously running concurrent SELECT queries for real-time, user-facing dashboards. Here is a graph of the number of parts per partition: it is mostly stable, but slowly growing, up to 600 parts. And here are the graphs of SELECT query latency, and you can see that the correlation between these graphs is almost 100 percent. So the more parts you have, the more the quantiles, the timings of SELECT queries, increase, almost proportionally. Actually these timings are pretty good: even with 600 data parts, you still get less than 150 milliseconds of median latency, less than 200 milliseconds at the 90th percentile, and 250 milliseconds at the 99th percentile. So the decision is that maybe we should actually increase the default value from 300 to 3000 — because why not, I think just one extra zero should not harm, right?
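These thresholds are ordinary MergeTree-level settings, so they can be inspected and, if needed, overridden per table; a sketch with a hypothetical table name:

    -- Inspect the current defaults
    SELECT name, value
    FROM system.merge_tree_settings
    WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');

    -- Override them for one table if the defaults do not fit your insert pattern
    ALTER TABLE events
        MODIFY SETTING parts_to_delay_insert = 1000, parts_to_throw_insert = 3000;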
And also we should enable backpressure for long-running inserts: if you do this single insert of 400 billion records, instead of interrupting the insert, maybe it should just adapt to the merging speed, maybe it should slow down a little and continue to work. We want to find optimal parameters, and I think that in version 23.6 we are slightly closer to this goal.

What else? What if you have five thousand ReplicatedMergeTree tables on your server? Many people ask us: we have 5,000 tables on our server, all of them replicated; why do I have significant CPU usage, why do I have constant network traffic to Keeper or ZooKeeper, why do I have so many requests to Keeper? And we usually answer: if you have five thousand tables with the same data structure, you should just convert all of them into one table, with a primary key containing some identifier, like a client identifier, advertiser identifier, website identifier, whatever. It's better to have one huge table instead of five thousand small tables; maybe it is just some old, outdated practice that you got from your experience with MySQL, and in ClickHouse one huge table is nice. And we kept answering: you don't need five thousand tables, don't do that, you are doing it wrong — and after answering it again and again, we finally started to think: why don't we optimize it? Let people just use it however they want, abuse it even, because ClickHouse should always work. So we have optimized it, and now if there are a lot of inactive tables, tables without inserts, they will not overload the CPU or Keeper; they will back off their background threads, and now it is fairly practical to have several thousand ReplicatedMergeTree tables. We still don't recommend it, but it does not sound bad; keep using ClickHouse, don't worry.

What about low-level optimizations? We have a new optimization for sorting, for ORDER BY. It applies if your data is not assumed to be sorted, if it does not have the corresponding sorting key, but for some reason, in some blocks, it just happens to be sorted; now we detect that and process it faster. Let me show you; it's interesting how much faster. Let me find some queries to copy-paste. Okay, let me try running a query on the previous ClickHouse version, 23.5. I will use clickhouse-local, and I will just copy-paste an example query. This query will order by number — but not just the number, its string representation, so it is almost already sorted, but not quite — and I will take 100 million numbers, sort them, skip 90 million, and pick up the 10 next numbers. Okay, let me run it. How fast will it be, how fast can this query run, any ideas? Your answer?

Very, very fast. Yes, this answer is correct, but this is the old version, not the new version. It may be imprecise but still accurate, though: a couple of seconds, two seconds? Two seconds. Do you have any ideas in the chat how fast this query will run? Yeah, in the chat we have two seconds from Mustafa, we have minutes from another viewer, we have a hundred milliseconds from Banana Man. I'm checking the YouTube chat, because it's slightly delayed... yeah, nothing new in the YouTube chat, so it looks like it ranges from milliseconds to two seconds. Okay, let's run this experiment — it will go bananas. 3.2 seconds, not so fast. By the way, let me run it again just in case... yeah, it's a stable 3.2 seconds. Okay, now we will test the new version, 23.6, and it is just clickhouse-local that I have compiled on my machine. What do you think the difference will be?

We're seeing answers like 100 milliseconds, "very fast" on YouTube, 200 milliseconds from another person, so it seems like math is hard, right, but it seems like the 3.2 seconds becomes 100, 200 milliseconds, etc. So the answers are "very fast" and "very, very fast". Let's check it: 2.8. Yes, it is faster, but I think it is more like "very fast" than "very, very fast", and we still have to optimize it, because you said it should be 100 milliseconds, 200 milliseconds — we are not there yet. It's a good point; shout-out to some of the folks on YouTube watching along who actually called that out almost precisely: 2.8, 2.8 seconds. So yeah, shout-out to those who saw the improvement, even if it's not quite yet where we want it to be. Okay, and by the way, you have won a gift from ClickHouse, so just write to us, send me a message on Slack, and we will send you a gift.
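A rough reconstruction of the kind of query used in that sorting demo — the exact query on screen may have differed, but the shape is "sort 100 million numbers by their string representation, skip 90 million, take 10":

    SELECT number
    FROM numbers(100000000)
    ORDER BY toString(number)
    LIMIT 10 OFFSET 90000000;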
Okay, what else did we optimize? Reading from Parquet. And you might ask me: what, you optimized it again? You already optimized it two months ago in release 23.4, and if I recall correctly it was already optimized 100 times. But that was optimized for reading from S3 and URL, and now we are additionally optimizing it for reading from local files. If you read from a single file, it improves two times; if you read from partitioned files — ah, no, the other way around: for partitioned files the improvement is two times, and for a single file it is four times. Okay, let me show you a demo, just to check; maybe it will not be four times. We always have to validate our own claims, so let me check it. Somewhere I should have a Parquet file... where do I have it... ah, let me just create this file. I will run a ClickHouse server and I will create this file; let me create it with 10 million records. By the way, reading from Parquet is fast, but writing is not so fast... okay, five million, six million, seven, eight, nine... okay, now we have this file, and let me use clickhouse-local for reading. But first I will run it on the old version — how fast should this query be? I don't know... yeah, it is fast, 0.2 seconds. But what if I use the latest version? 0.05 seconds. Who knows arithmetic, when you have to divide two numbers? So, 0.05 and about 0.2 — yeah, it's four times, as promised, a perfect result.
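A hedged reconstruction of that Parquet demo; the file name and columns are invented, and the timings will of course vary by machine:

    -- Write a ten-million-row Parquet file (writing Parquet is noticeably slower than reading it)
    SELECT number, toString(number) AS s
    FROM numbers(10000000)
    INTO OUTFILE 'data.parquet' TRUNCATE
    FORMAT Parquet;

    -- Read it back; in 23.6 this single-file local read is roughly 4x faster than in 23.5
    SELECT count(), sum(number)
    FROM file('data.parquet');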
Okay, what interesting things do we have, some flagship features, something unusual? Integration with Redis. We already have integrations with many systems, so why not with Redis? Actually, Redis was already available as a dictionary source, but now we also have this integration as a table engine and a table function, and you can just read everything from Redis, like a full scan, or read by a key: you specify a primary key corresponding to the key in Redis, and you can query by this key or by a set of values. It has support for SELECT and INSERT, you can join with Redis tables and it will do lookup requests, and you can even UPDATE and DELETE in Redis tables. So what else? Maybe we need to support some custom data types in Redis; if you want to suggest some feature requests, please write in the chat, and maybe there will be a chance it will eventually be implemented.
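If I remember the engine syntax correctly, usage looks roughly like this; the host, column names, and values are placeholders, and the exact argument list should be checked against the Redis engine documentation:

    -- A ClickHouse table backed by a Redis instance; 'key' maps to the Redis key
    CREATE TABLE redis_kv (key String, value String)
    ENGINE = Redis('localhost:6379')
    PRIMARY KEY (key);

    INSERT INTO redis_kv VALUES ('user:1', 'Alice');   -- writes through to Redis
    SELECT * FROM redis_kv WHERE key IN ('user:1');    -- point lookup by key
    SELECT * FROM redis_kv;                            -- full scan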
Okay, now another feature, but it is a secret feature. It is a secret feature because it only has a chance to be in release 23.6 — we are preparing this release right now and the feature is not yet even merged, so maybe it will be in release 23.6, maybe only in the next release — and I'm not going to present every detail about it, because who knows, maybe I will have to present it on the next webinar and I don't want to do it twice. But it is named overlay databases, and along with it there are the Filesystem database engine, the S3 database engine, and the HDFS database engine. What do you think it is? It's interesting if you can guess why we need overlay databases and what is especially nice about this feature, because it is one of the main features of this release, or the next one. What is it for, and why do we need it? Tyler, may I ask you: overlay databases — what is nice about it?

Tyler doesn't want to answer. Dale, maybe you want to answer, please help Tyler. Very well... I don't know. There's something from the community: somebody asked, is it a replacement for zero-copy replication? No, it would be too good to replace zero-copy replication. No, overlay databases are something very nice for usability, something that will make your life easier, and I don't want to tell you about this feature, I want to keep it a secret, and if you don't guess, I will just skip it. Okay, let's keep it a secret.

What about integrations? Now ClickHouse is supported in Confluent Cloud: you can just go to this — whatever — Confluent Cloud, whatever it is, find the ClickHouse connector there and use it. If you want to manage your infrastructure, you now have an official Terraform provider for ClickHouse Cloud, so you can create services, you can start, stop, scale, resize, whatever, destroy services if you don't need them — everything that is configurable is also configurable with the API and with Terraform.

What is new in ClickHouse Cloud? Now GCP support is generally available. It was in beta just a month ago, but we did not find any obstacles, we did not find any issues at all, and now it is not beta but a production-ready, generally available service. Another interesting feature is the dedicated tier. If you are going to use ClickHouse Cloud, sometimes you might ask: is it a multi-tenant service, what is the isolation? Sometimes you want your services to be completely isolated, on separate machines, and now we have a dedicated environment specifically for this. It has advanced options for the choice of configuration and isolation, and you have control over upgrades — you might say, I don't want upgrades at any time except Sunday night, and we will upgrade on Sunday night for you; you might say, I don't want upgrades until I ask you, and we will not do them. It has additional options for fast storage, for low-latency requirements. So the dedicated tier is not for everyone, and it is not something we provide by default, but someone who really needs it will find it useful.

What about community projects? So what is this, chDB, with such a nice logo? chDB is an embedded OLAP SQL engine powered by ClickHouse. It is a library, and it is available as a Python module, and also for Rust, Go, Node.js, and even as a Bun library. What is Bun? Something for JavaScript, like Node.js but better, and now it has ClickHouse support. So you can get the full ClickHouse engine embedded into your application — like clickhouse-local, but not as a separate binary, rather linked directly into whatever you build — and it has the full features supported by ClickHouse: you can query external data sets, local files, you can query data lakes, you can query external ClickHouse servers, and it even works on Pandas data frames. Looks pretty nice, and it also reminds me of some other projects.

Okay, something even more interesting: urleng.com. What is it? Unlimited ClickHouse URL tables, like a pastebin for SQL and JSON data. Sounds interesting, let me try... oops, it's moving... what is this, such a weird design — batteries included, native on the edge. Actually my browser is slightly slow when I open this website, and there is something like "try it", and if I try it, it will do... well, something. So, an interesting engine, but apparently it is just an experiment, just a demo, and you can pipe your result set, pipe your tables into this service from a random guy on the internet, and hopefully it will work, and then you can just pipe this data back — for whatever reason, just to share, to exchange some small data sets. What else? If you want to show your integration on our webinar, write to us, send us a message, connect on Slack, and you can present it on the next webinar: you can show a live demo, five to ten minutes, about something interesting, something weird, something unusual.

Okay, what about some new materials from our blog? There are plenty of interesting articles and I hope you did not miss them: articles about how to choose the best join algorithm, how to subscribe to Postgres for change data capture, how to do real-time streaming with ClickHouse and Kafka Connect, and even how to use ClickHouse for AI — because apparently ClickHouse is the perfect database for AI. You might not know that, but it is the best vector database; it can work with embeddings for semantic search and more. What about meetups? Summer is typically the season of vacations, but not for us, because we have one, two, three, four, five, six, seven meetups this summer in different locations: four in North America, two in Europe, and one in Southeast Asia. So see you at one of these meetups.

Okay, now we have about five minutes for questions. Alexey, just on the overlay databases one: a lot of people are asking, out of curiosity, can we say that an overlay database is similar to, like, a file table engine, basically for a set of files? Would that be the right guess? Yeah, this is exactly correct. There are new database engines, like Filesystem and S3, and these database engines allow you to represent a bunch of files in a directory, or on S3, as a bunch of tables, so you don't have to specify a table or use a table function for every file; you just create a database, and every file is instantly visible. If you have, say, one CSV file, one JSON file, and some Parquet files, all of them will be automatically detected: the data format will be detected, the data structure will be detected, and you will have these tables. But what if you already have tables with these names and you want to overlay one database on top of another, to, like, shadow the existing tables if they are also present in another database? Then you just specify an overlay of the different databases and it will work. Okay, not sure if that was clear, but I hope you get it. Yeah, I think next month maybe we'll do a demo; it is probably difficult to just describe, it's obviously something you need to demo, so maybe in 23.7 it will make it, for sure.
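A hedged sketch of the Filesystem database engine mentioned in that answer (the path and file name are hypothetical, and the path must be readable by the server or clickhouse-local; the Overlay engine itself was kept secret, so it is not shown here):

    -- Every readable file under the path appears as a table named after the file,
    -- with its format and structure detected automatically
    CREATE DATABASE data_dir ENGINE = Filesystem('/var/data');

    SELECT count() FROM data_dir.`hits.parquet`;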
There was a question about IAM authentication and the S3 table function. The answer is: the work is done, we just need to document it, so that's a quick answer for the person who asked — it's supported, we just haven't documented it yet.

Do we have any plans for the Replicated database to work for all tables, rather than requiring the user to specify one? Yeah, interesting. For the Replicated database it is not so easy, because it is expected that every table in the Replicated database is itself replicated. If you use a Replicated database and create something like a Memory table and insert data into it, the data will be inserted on just a single server, and this is definitely not what you need. But if I am not mistaken, there is a way to specify the default table engine for the Replicated database, and to specify the default parameters, and then you can just create a table and it will be automatically replicated, like what we do in ClickHouse Cloud. So if you try, you can probably reproduce this behavior on your own.

Okay, any plans to make the MongoDB connection work with secure connections, mongodb+srv? Do you have any plans to support secure connections there? Interesting — doesn't it work already? It is a surprise to me; I would expect that it should work, but okay, if it does not work, let me count it as a feature request and we will figure it out. Yeah, I suspect they mean the TLS, the SSL/TLS encrypted connection, so I think it's something we should definitely confirm.

Anything else? Another one: when do we expect the query analyzer to start improving join speed? Let me not tell you any specific dates; it would be quite unfair to my colleagues who are, at this point in time, working on this feature. If I say it is expected to be delivered in August, they will think, wow, we did not know that; if I say we expect it at the end of next year, they will think, ah, okay, let's not do anything. So let me not tell you all the secrets that we have. So it's sometime between August and next year — that's a nice, wide range, but we are working on it.

The final one I can find: given that we are making optimizations for detecting sort order, does it make any sense to sort data on a sorting key before inserting into ClickHouse, assuming that key is not already part of the primary key, which we would exploit? So if you have one column in your primary key and you want to sort by a different column — does it make sense to sort the inserted data in your own sort order? You can sort your
data before inserting it into ClickHouse, but you cannot do it better than ClickHouse, so my recommendation is: just specify the ORDER BY key and let ClickHouse do everything for you, because you cannot sort data faster, don't even try. Okay, it's time to wrap up, so thank you, everyone, it's been great, and goodbye for now. Alexey, do you want to say goodbye? We'll see you in one month. Thank you, see you.
Info
Channel: ClickHouse
Views: 1,717
Id: cuf_hYn7dqU
Length: 63min 27sec (3807 seconds)
Published: Thu Jun 29 2023