ClickHouse v23.4 Release Webinar

Video Statistics and Information

Captions
...let's try again. Okay, well, everyone, thank you for joining. We're going to wait just a minute or three here for more people to trickle in. In the meantime I'm going to go ahead and tweet that we're live. Alexey, do you want to welcome people who you recognize?

Yeah, for sure, so many interesting people. Welcome, welcome Ellen, nice to see you again. Welcome Ricky, it's so nice that you don't miss any of our webinars. Welcome Robert. Patrick, nice to see you in the list. Nikolai, welcome. Mick, welcome. It's interesting to see from what places people are joining, what countries, what continents. Kazakhstan, welcome. I am joining from an atypical location: I'm almost always based in the Netherlands, but I'm actually in New York today. We have a meetup this evening in New York that I'm super looking forward to.

Patrick's in Boston, another East Coast visitor. Alexa is in the Bay Area. Awesome. So if you are in Boston, you still have a chance to visit our meetup in New York today, in... how many hours? Math is hard and my calendar is still set to the wrong time zone. Six hours from now. Okay, so it will be no problem to watch this webinar and then go to our meetup in New York.

I think we have my favorite answer that I've ever seen: "my grandma's kitchen in Germany". That may be the best place to join any event from. France, Greece, Warsaw. I love seeing all these locations roll in. The beauty of a distributed company building distributed systems with a distributed community is that you don't get to see each other enough, but you do get to work with and alongside amazing people wherever they happen to choose to live.

All right, beautiful. We're going to let a couple more people roll in, but in the short term I'll go ahead and kick us off. Welcome everyone, and thank you so much for joining us today, whether you're on the Zoom call or watching the live stream on YouTube. It truly is a privilege to speak with (slash at) you all every month. This is another release call: 23.4 is here, and Alexey is going to walk through it in detail. A few housekeeping items: regardless of what platform you're on, chat is open, so please ask questions. We keep an eye on it, and we'll leave time at the end to answer as many of them as we can; if there are some we don't get to, we're going to start answering those in blog posts. Your participation and sharing is deeply meaningful to all of us. And if you haven't yet joined the community, go to the website in its beautiful new design and form, click the lovely Slack button, or the Telegram button, or the button that belongs to the community you most like, and participate in the conversation. So with that, Alexey, I'm going to hand it off to you to talk about 23.4.

Okay, let me share. So what do we have in the new release, 23.4? First of all, it's a nice number: two, three, four, easy to remember. Let me start, and I hope we'll have a few minutes for your questions at the end, but most of the time I will spend explaining what is new, because we have a lot of new features: about 15 new features, 11 performance optimizations, and even more bug fixes, as usual.

About the new features. This one is quite simple and small: allow a trailing comma before the FROM clause in SELECT. Sometimes you write a list of columns and you separate them by commas, and you want to add or remove columns. If you ask someone professional, they will tell you: don't worry, just write the commas before the column names, and it will work perfectly, and you will be able to quickly add or remove columns, no problem at all. But it looks slightly uglier to put the comma on a new line. So instead you write it as usual, but in previous versions, if you had a trailing comma and forgot to remove it, you would get a very "explanatory" error message: it would tell you that you might want to write AND, or OR, or BETWEEN, or NOT BETWEEN, or LIKE, or ILIKE, or NOT LIKE, or whatever, or that maybe instead of this comma you want to write INTO OUTFILE, because why not, you might want it. But this is not what you actually want. In 23.4 it just works. A small and nice feature, and you will find it useful.
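For illustration, a query like the following, which used to fail with that confusing error, is now accepted in 23.4 (a minimal sketch; the table and column names are hypothetical):

    SELECT
        UserID,
        EventDate,   -- trailing comma before FROM is now allowed
    FROM hits
    LIMIT 10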
This one is quite similar; it is about how it is possible to specify settings on the command line of clickhouse-client. Imagine you want to change a setting named max_threads, and you know that in ClickHouse the name of this setting is max_threads, exactly like this, with an underscore. But when you write it on the command line, sometimes you will automatically write it with dashes: --max-threads 1. In previous versions it did not work, and the first time I noticed it, I immediately created a task, an issue on GitHub, with a proposal to just support this alternative way to write the setting. If you are pedantic and conservative, you will say: no, this feature should not exist, because the actual name of the setting is max_threads. But you know, I actually care more about usability, about ease of use, and if I accidentally write the setting this way, let's just support it, let it just work. And in the new version, 23.4, it just works.

But it was not enough for me. I was also interested: what if you write it like this, then copy-paste the command to Microsoft Word and send the document to another person, and they copy-paste it back, and somewhere in the middle the two dashes get replaced with a nice Unicode dash? I want that to work as well, and in the new version it also works. It is quite an unusual feature, because you will not find it in any other command-line application, even really modern applications written in Rust or Node.js or whatever. But ClickHouse is even more modern, so ClickHouse does it.

Okay, that was about some simple features, but what about something smarter? Now we have something really smart: the quantileGK function. GK means Greenwald-Khanna, and it does sound smart. Why do we need this function, and what does it do? We have a lot of functions to calculate quantiles. For example, if you want to figure out the 99th percentile of your website loading speed, then you need to calculate a quantile. It depends on how much data you have and how much memory is on your server, and sometimes you want to use approximate algorithms. And when you look at approximate algorithms, you will find that there are many different algorithms and many different implementations, and it looks like we are trying to have just everything in ClickHouse: we have quantile, quantileDeterministic, quantileExact, quantileExactWeighted, quantileTDigest, quantileBFloat16, and now also Greenwald-Khanna.

Okay, let's take a look at how it works; I will show you a demo. Let me switch to a terminal and run the server. It runs, okay, I hope you can see it. As usual, I will paste some queries from my text file into the terminal. Let me try to run some quantile function. So what do we have? 416 possibilities. I did not even know that we have so many; it is ridiculous, because every function can also carry suffix combinators like -ForEach, -If, -Map, -Merge, -Array, -OrNull. Okay, let me scroll through this list.

Now let's try something interesting. For example, I will calculate a quantile of the screen width of website visitors, from a table with 100 million records. It processes in just 100 milliseconds, which is slightly less than 1 billion records per second. It is normal, it is normal speed for ClickHouse. You might think that it is fast. Yes, it is fast, but for ClickHouse it is normal, it is just all right. It's worth running it several times.

Okay, if I replace it with quantileExact, what will the performance be? It is slower, just as expected: quantile is an approximate calculation with the reservoir sampling algorithm. If you don't know what reservoir sampling is: it is like an array of limited size, and you put values into this array, and when the array is full you replace every second value of the array, then every fourth, and so on, and then you use it as a sample to estimate the quantile. But this algorithm is not so interesting; we have much better algorithms, like quantileTDigest. It also works. And we have quantileTiming. Nice, it is even faster.
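A sketch of the comparison from the demo, assuming a hits table with a ResolutionWidth column like the public web analytics dataset:

    SELECT quantile(0.5)(ResolutionWidth) FROM hits;        -- approximate, reservoir sampling
    SELECT quantileExact(0.5)(ResolutionWidth) FROM hits;   -- exact, keeps all values in memory
    SELECT quantileTiming(0.5)(ResolutionWidth) FROM hits;  -- fast, tuned for small integer values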
What will happen if I try the new function, quantileGK? I want to explain what Greenwald-Khanna is, but I cannot, because I don't know what it is. If I ever meet these nice people, Greenwald or Khanna, I will ask them, or maybe I will read their paper. But papers are boring; it is better to meet in person and ask. Okay, let's not ask, let's just try it, and see how fast it is. Oops, it does not work at all. But don't worry: this aggregate function has a mandatory parameter, something like precision, so it requires at least one parameter, and a reasonable value is something like 100. I don't know exactly what it means; maybe you will explain it to me. Nevertheless, let me try it. Okay, it works, it is fast, it's nice; maybe not as fast as some of the other functions, but the result is just the same: the median screen width is 1638.

Okay, let's try something more interesting. For example, I will run it with clickhouse-benchmark. So the minimum time is 93 milliseconds. If I compare, for example, with my favorite quantile calculation, it will be about the same performance, 98 milliseconds. If I compare with another favorite algorithm, it looks like it's just faster, but maybe not faster, maybe it is the same. Okay, let me try to do something more interesting: I will do an aggregation with GROUP BY. I will group by region and calculate quantiles for every region. So maybe in Amsterdam people have bigger computer screens than in New York. Maybe. Let's take a look. Okay, so quantileExact took just 200 milliseconds, and the new algorithm, quantileGK, took 0.5 seconds. So for some reason quantileGK is actually slower than the exact calculation when you have multiple states of the aggregate function. It highlights one important point: it makes sense to always optimize functions not in isolation but on real datasets, on real workloads. quantileGK might be good if you calculate just a single quantile, but it is not so good if you calculate many quantiles at once, and it is not as good as quantileTiming; quantileTiming is fast. So if you ever meet Greenwald or Khanna, tell them that they should make a pull request to ClickHouse and optimize their function for small sets, implement the small-set optimization. Okay, let's continue.
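As a sketch of the syntax, quantileGK takes the accuracy parameter first and then the quantile level; the value 100 mirrors the one used in the demo, and the table and column are assumptions:

    -- without the mandatory accuracy parameter the call fails
    SELECT quantileGK(100, 0.5)(ResolutionWidth) FROM hits;

    -- several levels at once, the case the demo showed to be slower than expected
    SELECT quantilesGK(100, 0.25, 0.5, 0.75)(ResolutionWidth) FROM hits;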
What else do we have that is as smart as this function? kolmogorovSmirnovTest. It sounds even smarter than Greenwald-Khanna, I don't know for what reason, but it sounds smarter. And I will not explain what it does; maybe you will explain it to me. Basically, you just pass two arguments to this function and it does something magic: it calculates some moments, it mixes the data around, and it gives you two numbers. For some reason the first number is 0.99 and the second number is zero. Okay, let me ask you: what does it mean? You can write your answers in the chat or in the YouTube comments, and if your answer is correct, we will send you this t-shirt. Tyler, do you have any answers?

Indeed, the first one that's come in says that it's a test for determining the distribution, or whether a distribution is normal or not. Almost correct. Do you have any more answers? Nope, not yet, that's the one that's come in. You know what is scary: if we don't get more answers, you will have to answer it. Well, the answer was updated with "by percentage" as well. Okay, it looks like some percentage, but a percentage of what? Let's just say that it's been entirely too long since I've done probability distributions, so I'm probably not the right person to try to answer that. Okay, let's just say that this one is the winner. What is the name of the winner? Ramazan. Congratulations; as usual, you have the right answers to every question.

Okay, but this next function is not so smart: soundex. It implements an algorithm that was introduced in the year 1918 (what a horrible year that was), and the algorithm is not modern, it is entirely obsolete, but we still have to have this function in ClickHouse, just for compatibility, because every other SQL database management system already implements this stupid algorithm. How to use it: it takes some text in English or similar, and it gives you some value that doesn't change if the text sounds similar. So if we apply it to "hello world", it gives you H464.
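A minimal sketch of both functions, under the assumption that the mystery function is kolmogorovSmirnovTest, which was added in 23.4; it returns a tuple of the test statistic and the p-value:

    -- 'data' holds the values; 'idx' (0 or 1) marks which of the two samples each value belongs to
    SELECT kolmogorovSmirnovTest(data, idx)
    FROM values('data Float64, idx UInt8', (1.0, 0), (2.0, 0), (10.0, 1), (11.0, 1));

    SELECT soundex('hello world');  -- 'H464', as in the demo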
If you try to say "hello world" in English but you are drunk or stoned and you say "hhhellllo wooorrld"... I don't know if I pronounced it correctly; it cannot be pronounced correctly, it's some kind of alien language. But the result will be the same, H464. The result itself doesn't mean anything, it's just for comparison. So maybe you will find this algorithm useful.

What else do we have for compatibility? The SHOW COLUMNS query. Again, it is not the most natural, not the best way to do it in ClickHouse. In ClickHouse there is a system table, system.columns, and you just write SELECT * FROM system.columns. If you write SHOW COLUMNS, it is just for compatibility with MySQL. It does similar stuff, but much uglier. Look here: it says that CounterID is "PRI". What is that? If you use ClickHouse, you know that we will never write PRI, or type names in all caps, ever; we just don't do that. It looks like you are writing in COBOL in the seventies. But unfortunately MySQL also looks like COBOL from the seventies, and we have to implement compatibility sometimes; later I will show you exactly why we need this compatibility. Okay, let's skip ahead.

What else do we have? This feature is way more interesting: a format to analyze Parquet metadata. Imagine you have a Parquet file. You can read it, you can process it, you can run queries over it. But sometimes you also want to just analyze it: how many row groups does it have, what are the statistics for the columns, how the columns are compressed, what are the sizes of the compressed and uncompressed data. And we decided to implement it in an elegant way: you specify the same Parquet file, but you specify ParquetMetadata as the format name, and instead of reading the data from the file, it reads the metadata.

Okay, let me show you live right here. Here is this Parquet file, and I will use clickhouse-local. Let me try... it works, and it gives a lot of stuff, something interesting and difficult to read. No problem, let me figure out what it is: I will DESCRIBE the file. Now I have this description: it gives me the number of columns, rows, and row groups, the format version, the compressed and uncompressed sizes, and a description of every column and every row group. If I want to output it more nicely, I'll just use another format, PrettyJSONEachRow, to output this metadata. Oops, a lot of stuff. Why do I need so much data, and how long will it take to output it? It is just the metadata, not the data from this file, but the file contains 100 million records, so it has not just a lot of records but also a lot of row groups, and for every row group and every column I have this nested data with all the statistics: statistics about dictionaries, minimum and maximum values. And I hope we will start using these statistics for optimizing the reading. Okay, nice feature.
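A sketch of both compatibility features; the table and file names are hypothetical:

    -- MySQL-style introspection, similar in spirit to SELECT * FROM system.columns
    SHOW COLUMNS FROM hits;

    -- read a Parquet file's metadata instead of its data
    SELECT * FROM file('hits.parquet', ParquetMetadata) FORMAT PrettyJSONEachRow;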
What else? A function, extractKeyValuePairs. What does it do? Imagine you have a format that looks like JSON but isn't JSON, looks like Python pickle but isn't pickle, looks like PHP serialize but not even that, looks like Windows INI but not even close, maybe something like YAML but not YAML. You know, something that contains key-value pairs. How many formats does humanity have to invent before something happens? There always needs to be at least one more format, right? It's just required; no one else's work is sufficient, we need to reinvent the wheel. Yeah, as usual. And unfortunately we have to support all these wheels, and for this reason we have this function, extractKeyValuePairs. It takes a string with a list of pairs separated by some separators (here we have two possible separators, comma and whitespace), keys and values separated by another separator (here it is just the equals sign), and keys or values possibly enclosed in some quotes (here double quotes) and possibly escaped with C-style escapes. And it will parse it as a Map of String to String. As usual for these formats, it will work in 99.999...% of cases; maybe it will not support some really obscure format with unusual escaping, but it will solve most of your problems. So it is quite nice that we have this function in version 23.4.

Okay, and here is just a small improvement for the replace functions. We have at least four functions to replace a substring in a string: you can replace one needle in a haystack, you can replace all the substrings, and you can replace exact strings or match by regular expression, and with regular expressions you can also replace with substitutions. Until recently, the needle and replacement arguments had to be constant. That is normal if you have a regular expression, because you usually want just a single regular expression instead of generating them dynamically. But sometimes you might want to do some strange stuff: I don't know, construct your regular expression from a column in a table, or store regular expressions in a table and apply replacements with those regular expressions. And now it is possible: it works in version 23.4, it works perfectly, and here is an example.
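A sketch of both features; the delimiters are passed explicitly here, and the table in the replace example is hypothetical:

    SELECT extractKeyValuePairs('name="Alexey", city="Amsterdam" job=engineer', '=', ', ', '"');
    -- {'name':'Alexey','city':'Amsterdam','job':'engineer'}

    -- the needle and replacement no longer have to be constants:
    SELECT replaceRegexpAll(s, pattern, replacement)
    FROM regex_rules;  -- a table holding a string plus per-row patterns and replacements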
Okay, let's talk about my favorite part: performance optimizations. What do we have? Optimizations for Parquet reading; you know what I'm talking about. I have slightly mixed feelings about Parquet, because Parquet, I should say Apache Parquet, is a data format that is column-oriented, it has support for multiple data types and multiple encodings, it is optimized for reading, and it has support for compression. To me it sounds very similar to something... it is very similar to the ClickHouse Native format. And I know that the Native format is obviously better, but Parquet is from Apache and it is so popular, and people generate a lot of data in this format, so we have to support it, and we have to do it well. And it is not a problem for me; I want to integrate with whatever format. Even if someone re-implements ClickHouse and names it something different and then asks, would you like to integrate with this system? No problem, we will integrate with it. But in previous versions the support for Parquet was sometimes not as fast as I would prefer. Let me compare; I will show you a demo live right here.

Again I will copy-paste some queries... no, not this query... So here is clickhouse-local. I will do a SELECT from url(), and this URL goes to S3, to our public bucket, and it reads the hits.parquet file that contains 100 million records and does some aggregation query. I will run this query directly on my machine, which has a residential internet connection, so maybe it will be fast, maybe it will be not so fast, maybe I will run it and the Zoom webinar will break for some reason and it will be funny; I like to have fun. Let me run it: three seconds, and it looks like nothing broke, it still works. Let me run it again, because sometimes S3 has internal caching; we don't have any caching on this machine, so it will go and read the data again every time. Now the time is about 1.4 seconds; the first query took three seconds because S3 is not so fast on the first request, AWS still does some caching internally. Okay, so this is the new version: 1.4 seconds, from my home internet connection, to process 100 million records. It looks nice, but let me compare with a previous version. Here is one of the previous versions, 23.2. Let me run the same query. It runs, it does something, it is slow. It is not only slow, it is also annoying. But I want to calculate how much slower it is, and to calculate that I have to wait until the end of this query, and it is so annoying. In version 23.2... I don't know what to do, I have to wait while this query finishes. Tyler, do you have any comments about the slowness of this query?

I just have to comment that it does not surprise me that the slowness of the query annoys you. If there's one thing I have learned in the little over a year that we've been working together, it's that there is no tolerance for slow. I'm surprised you haven't force-cancelled it yet; I sort of expected that to happen by this point.

I want to cancel it, but I want to figure out how much slower it is. And it surprises me that we did not fix it earlier; only in version 23.4 do you get this problem fixed.

Yeah, for those who are working with Parquet files, not only the metadata description that you already showed but what you're showing now, these are substantial quality-of-life improvements for people who spend all day interacting with these files.

Okay: 128 seconds. I don't know why 128.

Didn't we specifically slow it down to get a round number?

I hope not. It is about 100 times slower; maybe 70 or 80 times, something like this. I don't even bother with the arithmetic, it speaks for itself. Okay, you still have a chance to scan this QR code... okay, the chance is over.
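The demo query was along these lines (a sketch: the URL is a placeholder and the aggregation is illustrative):

    SELECT count(), avg(ResolutionWidth)
    FROM url('https://example-bucket.s3.amazonaws.com/hits.parquet', Parquet);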
Let's go to the next performance optimization: asynchronous connections to replicas. This will be slightly more difficult to explain. In previous versions, if you process data on a cluster, you do a distributed query, and you have multiple replicas, the connection timeout to these replicas is controlled by the setting connect_timeout_with_failover_ms, and it was equal to 50 milliseconds. Why do we need such a low value? Because sometimes replicas are unavailable, and they don't just refuse connections: network packets simply don't come in and out of these replicas, you have a network partition, and if you try to connect, the connection will just hang until a timeout. But you want to fail over to healthy replicas, and for this reason only the first query will wait for this timeout; then we accumulate the statistics and go to a healthy replica, because we have learned that this replica is unavailable and we have another one. But I don't want even the first query to spend a large amount of time before failing over, and that's why this setting was set to just 50 milliseconds.

But there is a problem. You might have a distributed cluster across continents. I am in Amsterdam and I want to have a replica in Amsterdam; Tyler is in New York and Tyler wants to have at least one replica in New York. But if the replica in Amsterdam is not available, going to New York takes about 150 milliseconds, which is more than 50 milliseconds, three times more, and the queries will just fail. So for cross-continent clusters, for cross-continent data import or export, if you move data between different clusters, you used to have to modify the value of this setting, and that is inconvenient: I want everything to work by default. So in the new ClickHouse version we set the value of this setting to one second, which is enough to go to New Zealand and back; one second covers the whole globe. But we have another setting, hedged_connection_timeout_ms, and it is equal to 50 milliseconds. What does it mean? It means that we connect to one replica and wait 50 milliseconds; after 50 milliseconds we start connecting to another replica, while still waiting for the first one, and whichever replica answers first, we use it for data processing. So the connections to different replicas are made in parallel; these are hedged requests. Okay, it is quite difficult to explain, but the effect of this feature is very simple: it just makes everything work by default, that's it. This is all for one simple reason: we want everything to work perfectly with the default settings.
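A sketch of the settings involved, written as explicit SETs only for illustration, since the point of the change is that the defaults should now just work:

    SET connect_timeout_with_failover_ms = 1000;  -- new default, generous enough for cross-continent hops
    SET hedged_connection_timeout_ms = 50;        -- after 50 ms, start a parallel connection to another replica
    SET use_hedged_requests = 1;                  -- hedged requests are enabled by default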
Okay, what do we have for data lakes? Data lakes are a strange thing, because we have Apache Iceberg, Apache Hudi, and Delta Lake. Why should we support so many data formats? I don't know, but different companies started at different times implementing slightly different formats: Iceberg is from Netflix, Delta Lake is from Databricks, Hudi... I don't remember, maybe you will tell me. And every format has its own quirks, its own complications. For Iceberg we have to support two different versions, Iceberg V1 and Iceberg V2; for Delta Lake we have to support partitioned and non-partitioned data, and checkpoints for optimizing the reading. And now we have everything.

Integrations. Do you recognize this screenshot, what it is? I will give you just three seconds to guess. Okay, no one has guessed so far. Ah, okay, so we implemented this feature for no one; no one needs it. What is it? It is the Looker integration: Looker now works with ClickHouse using the MySQL compatibility protocol. It works: it connects to ClickHouse, takes the information about all the tables, and you can write queries, or it will even generate these queries automatically. What is next? Maybe we will add support for QuickSight. But if you are interested in some business intelligence system, please tell me what business intelligence systems we have to support next. If you say MicroStrategy, it will be not so fun to do, but we will go and implement support for MicroStrategy. So don't worry, tell me what business intelligence systems we have to support, and we will do it. In the meantime, we have many improvements for Kafka Connect, for Metabase, and for our drivers for Go, Python, and Java.

And what is this? It is our new website, and our new website has our new design code. So maybe we will have to print a bunch of new t-shirts, and not only t-shirts: maybe some fashionable hoodies, maybe something else, maybe backpacks or caps or whatever, I don't know. If you need it, just tell me; we are open to new ideas. And if you look at our website, you will find many interesting materials in our blog. If you are interested in Parquet, how to use it and how to work with it in integrations, read our new blog post; and today we will publish the second part of the Apache Parquet blog post, specifically about the internals of Parquet and the implementation details in ClickHouse. If you want to know how bad ClickHouse is at JOINs, or maybe ClickHouse is actually good, read another blog post. We even have a blog post named "a story of a serial product manager" from my friend. And if you are interested in how to generate music with SQL queries in ClickHouse, you can read about it in our new repository, NoiSQL. Okay, and if you want to start using ClickHouse today, go to clickhouse.cloud and get free credits, up to 10 terabytes of data. Okay, now we have just five minutes for your questions.

Amazing, thanks Alexey. I've been capturing questions throughout; if other folks have questions, please feel free to drop them in. Alexey, there are a couple of things that people called out that they would love to see integrations with, some of which may have been a light troll: Power BI was mentioned. And there was a question asking, will there be Vault support? HashiCorp, correct? That's my assumption. It was on YouTube, so Vitaly, if that's incorrect, please feel free to let me know, but I assume HashiCorp.

Yes, and it is interesting that you already can integrate ClickHouse with HashiCorp Vault. How to do it? Say you want to deliver some keys to the ClickHouse configuration file and get these keys from HashiCorp Vault. To do this, you specify from_env (it means: take it from an environment variable) in the ClickHouse configuration file, and it will substitute the value from the environment variable; it can be a user's password hash or an encryption key. Then you add a script to the startup, or into the Docker startup command, and make this script go to HashiCorp Vault, obtain these keys, and put them into environment variables. So with at least some DevOps acrobatics, you can easily integrate with HashiCorp Vault.

Excellent, that makes total sense. This next one isn't really a question, but I think it's a fantastic little statement: Alexey, you're actually eliminating the "but the query is still running" excuse for slacking off from work. You're taking away one of the ways we're able to have a break during the day, because we can't use the excuse "the query is still running" if we're using ClickHouse.

It is not a problem, because in ClickHouse we have a setting specifically for you. It is named max_execution_speed, and you can say: don't run this query faster than, say, one million records per second. Go ahead, and you have your excuse.

The person who made that comment I will leave nameless, but there you have your answer: max_execution_speed.
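A sketch of the setting from that answer; the query is just a placeholder:

    SET max_execution_speed = 1000000;      -- do not run faster than one million rows per second
    SELECT avg(ResolutionWidth) FROM hits;  -- hypothetical heavy query, now guaranteed to stay 'still running'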
There's another question: Looker already has a ClickHouse connector, and the question was, should we be using the MySQL connector instead of that connector? Yeah, it is quite interesting, because for some reason people demanded specifically the MySQL connector. So maybe the integration through the native ClickHouse connector is not perfect and you will get a better result with just MySQL compatibility. Maybe we have to improve the ClickHouse native connector, or maybe the native connector is not available in some environments where Looker is run and you still have to use a different connector.

Gotcha. A bit of a high-level question came in that I think is actually a good one to touch on: will queries be faster by separating data into a two-node cluster, or is storing on one node good?

If one node is as powerful as two nodes, say you have 64 CPU cores and one terabyte of memory on one machine instead of two machines with 32 cores and half of the memory each, it will be faster on a single machine, and the reason is pretty simple: on a single machine you don't have to transfer data over the network, you don't have to serialize and deserialize the data, and the data is accessible inside this one terabyte of memory. It is obviously faster. But what if you already have the most powerful machine and it is not enough? Then just add another machine, and with two shards it will be two times faster, most of the time; sometimes slightly less than two times faster, but that is another story.

Understood. We probably have time for one more. It looks like there are two more, but I'm going to ask this one in particular, and then maybe we'll add one on at the end. The question was: did anybody else already ask when JSON will be production-ready? If not, I'd like to ask it, and to make it a tradition to ask about JSON being production-ready on every call.

Oh, okay. So I asked this question to the developer, the father of this feature, and he answered: I'm not sure, it is difficult to finish. And can you guess what I said?

I suspect that you probably answered in two ways, one of which is "do it right", because doing it right for the user matters; the second of which was probably "and we should do it as soon as possible".

Actually, it went a slightly different way. We decomposed this feature into multiple small pieces that we still have to implement, and we figured out that most of these implementation steps will be useful not only for this feature but also for many other features. We looked at those other features in more detail to figure out why we need them, and we found that these additional side goals will serve as a good motivation to continue working on this very complex and very hard experimental feature.

Sorry, you were on mute. Yeah, they're pounding outside my window, so I muted myself and forgot to unmute. We're at time, so, Alexey, as always, thank you so very much. And as I say in almost every single one of these sessions, but I still mean it: community matters, community matters immensely. We want to hear what you're building, we want to hear your stories, we want to hear about adopters that you know of. For those of you in Boston, I'll reach out, let's get a meetup set up; for those of you in other places, please feel free to reach out. And please share all of the amazing things that are in 23.4 with those who you know could benefit from it (sorry, my brain stopped there for a second). So cheers, thank you so much, appreciate it, and we'll all talk soon. Thank you, thank you.
Info
Channel: ClickHouse
Views: 1,560
Id: 4rrf6bk_mOg
Length: 61min 5sec (3665 seconds)
Published: Wed Apr 26 2023