How to Ingest Files into ClickHouse

Captions
In this video we're going to look at how to ingest files into ClickHouse. It's a common task, and there are two basic pieces you need to understand. The first is table functions, and which table function you use depends on where the file is sitting: if it's a file on your file system you'll use the file table function, if the file is sitting up in AWS S3 there's an s3 table function, there's a gcs function for Google Cloud Storage, and there are others, all listed in the docs. So the table function covers where the file is. The other aspect is the format of the file: is it a tab-separated file, is it Parquet, is it JSON, what does the data look like? Those two pieces can be combined in any way you want. You can have a Parquet file in S3, you can have a tab-separated file in Google Cloud Storage, it doesn't matter. There are all kinds of permutations of where the file is and what format it's in.

To start with, I have a file sitting here in my user files folder called comments.tsv, so I'll use the ClickHouse client with the query option. It's a local file, so I'm going to use the file table function with the name of the file, comments.tsv, and I'll just select the count, though I could do anything I want. Notice the two pieces: the file function is the "where", and the format is tab separated, but since the first row holds the column names it's TabSeparatedWithNames. You'll see it's not a very big file, it only has a thousand rows, and that ran very quickly. Obviously you can just query the file; I have ClickHouse running, but notice I'm querying the file as it sits and not doing anything with the data except streaming it through ClickHouse. For example, I can select the first 10 rows, and you can see they're Hacker News comments. Because I know the column names, for example id, type, author, timestamp, I can query the data where it sits and specify things like selecting just the author and the timestamp, so here I'm grabbing the first 10 rows of just those two columns.

The other thing I want to point out is that the extension on the file usually tells you what the format is. Because it's comments.tsv I don't need to pass the TabSeparatedWithNames format at all; I'll try it down here with just the file table function and grab 10 rows, and ClickHouse will figure out that it's tab separated. ClickHouse can figure it out even if the file is compressed, as long as the extension of the file matches the data in the file.
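Here's a minimal sketch of the queries described above, assuming comments.tsv sits in the server's user_files directory (run them with clickhouse-client --query "..." or from any client session):

-- count the rows, stating the format explicitly
SELECT count() FROM file('comments.tsv', 'TabSeparatedWithNames');

-- peek at the first 10 rows; the format is inferred from the .tsv extension
SELECT * FROM file('comments.tsv') LIMIT 10;

-- query just the columns you care about, right where the file sits
SELECT author, timestamp FROM file('comments.tsv') LIMIT 10;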
The other thing to point out about the table function is that it infers the column names and the data types of each column, and it does that by processing thousands of rows to figure out what those types should be. You can take advantage of that. In this example I'm going to create a table called comments and put the data from the file into my ClickHouse table. It's going to be a MergeTree table, and I just picked a couple of the columns, author and timestamp, to be the primary key. I'm not specifying column names and types here; whatever the inference comes up with is what the table will use. By the way, it's really easy to see the inference: I can just run the ClickHouse client and describe the file. When I do that, the TSV looks like it's got six columns, and schema inference loves to make things Nullable, it just kind of does that. So if I try to create the table directly from the inferred schema, I get an error saying that the primary key can't be nullable. There are a couple of ways around that, but the easiest is to just let the primary key be nullable using the allow_nullable_key setting. This is the same command as before, but notice I turned that setting on, and now it works. You'll see it went very fast, it's only a thousand rows. Now if I select the count from comments, which is a table inside my ClickHouse server, it has the thousand rows, and I can query the data, figure out what the average length of the comments is, or whatever, just like you would expect to be able to do.

So that was a local, tab-separated file. Let's say I have a Parquet file instead: here I have a file called hits.parquet and it's up in S3, so I'll use the s3 table function, and the format would be Parquet, but I don't have to specify that because the file name has the .parquet extension and ClickHouse figures it out. In that example there are lots of columns; it's a decent-sized table, dozens of columns as I scroll up through what was inferred, and since it's Parquet, ClickHouse pretty much knows the data types of the columns because they're part of the file format.

Here I have a file that's in Google Cloud Storage inside a gzip-compressed .gz file, and the name doesn't have .tsv in it, so I have to specify the format; ClickHouse can't determine it because the file name doesn't say what's inside. But notice ClickHouse does figure out the compression, you don't have to say what the compression is, it will happily work that out.

More realistically, if you were going to do this you would probably create a table first. I have some New York City taxi data up in Google Cloud Storage, so I'm going to create a table called trips; this is the table where I want to insert the data, and notice I picked all the column names, the data types, and my primary key. Let me create the table, and now you'll see a super common way to do this: I insert into my table trips and grab everything from the files sitting in GCS. What I want you to notice is the syntax: the {0..2} in the path actually matches three files, trips_0.gz, trips_1.gz, and trips_2.gz, and you can also use an asterisk wildcard to read hundreds of files. I pass TabSeparatedWithNames, although ClickHouse can infer that, and there we go; it takes however much time it takes to insert the data, and as you can see it took about 30 seconds for three million rows.
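A hedged sketch of those patterns, assuming the same comments.tsv as above and a trips table that already exists with an explicitly declared schema; the GCS bucket URL here is just a placeholder:

-- see what column names and types schema inference came up with
DESCRIBE file('comments.tsv');

-- create a table straight from the inferred schema; the inferred columns are
-- Nullable, so the primary key has to be allowed to be Nullable
CREATE TABLE comments
ENGINE = MergeTree
ORDER BY (author, timestamp)
SETTINGS allow_nullable_key = 1
AS SELECT * FROM file('comments.tsv', 'TabSeparatedWithNames');

SELECT count() FROM comments;

-- more realistically: define the destination table yourself, then insert from
-- cloud storage; {0..2} expands to trips_0.gz, trips_1.gz, and trips_2.gz
INSERT INTO trips
SELECT *
FROM gcs('https://storage.googleapis.com/my-bucket/trips_{0..2}.gz',
         'TabSeparatedWithNames');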
I want to show you another example, because in all the examples so far I've used the schema inference. Here I'm in the docs on the UK property prices dataset, which is a great example. A MergeTree table gets created, and the data is sitting up in S3, but we use the url table function to retrieve it. Here's the insert command, and we do some transformations on the data as we bring it in. What I want to show you is what comes after the file location: I put CSV, because that's what kind of file it is, and then, if you want, you can pass in a schema. This is not the schema of the table you want the data to go into; it's the schema of the data in the file. The CSV file has this many columns, these are the names I've decided to give those columns, and then I use those names up in the SELECT. It's pretty common to do that, so I wanted you to see that example as well.

And then finally there's one more topic that's just super handy. Sometimes when data is coming in from a file, some rows aren't in the right format, or maybe a date is formatted wrong, and you've got millions of rows and only a few of them have errors. You can set a value called input_format_allow_errors_num; the default is zero, but you can set it to, say, a thousand, and then the insert will still work with up to a thousand errors. If an absolute number doesn't make sense, there's also input_format_allow_errors_ratio, where you can say that if some small fraction of rows, one percent or five percent, have errors, the insert will still succeed. I wanted you to see those settings because they're super handy; both of them, along with the explicit-schema idea, are sketched at the end of this section.

So, like I said, when it comes to inserting files, find your table function, which depends on where the data sits, and then check out the data format. In the docs, if you go to the SQL reference you'll see all the table functions listed nicely, and on the docs home page, if you scroll down, you can see all the formats; it's a super handy page that shows the dozens and dozens of input and output formats ClickHouse supports.
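As promised, here's a rough sketch of the url-plus-schema pattern combined with the error-tolerance settings; the URL, destination table, column names, and transformations are illustrative, not the exact ones from the docs example:

-- the third argument to url() describes the columns of the CSV file itself,
-- not the destination table, and those names are what the SELECT refers to
INSERT INTO prices
SELECT toDate(transfer_date) AS date, price, town
FROM url(
    'https://example.com/pp-complete.csv',
    'CSV',
    'transfer_date String, price UInt32, town String'
)
SETTINGS
    input_format_allow_errors_num = 1000,    -- tolerate up to 1,000 bad rows
    input_format_allow_errors_ratio = 0.01;  -- or up to 1% of all rows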
Info
Channel: ClickHouse
Views: 88
Keywords: ClickHouse
Id: vhrSxW15su4
Length: 8min 57sec (537 seconds)
Published: Tue Sep 05 2023