How to Ingest Files into ClickHouse

Captions
In this video we're going to look at how to ingest files into ClickHouse. It's a common task, and there are two basic pieces you need to understand. The first is table functions, and which table function you use depends on where the file is sitting: if it's a file on your file system you'll use the file table function, if the file is sitting up in AWS S3 there's an s3 table function, there's a gcs function for Google Cloud Storage, and there are others, all listed in the docs. So the table function covers where the file is. The other aspect is the format of the file: is it a tab-separated file, is it Parquet, is it JSON, what does the data look like? Those two pieces can be combined in any way you want. You can have a Parquet file in S3, you can have a tab-separated file in Google Cloud Storage, it doesn't matter. There are all kinds of permutations of where the file is and what format it's in.

To start with, I have a file sitting here in my user files folder called comments.tsv, so I'll use the ClickHouse client with the query option. It's a local file, so I'm going to use the file table function with the name of the file, comments.tsv, and I'll just select the count, though I could do anything I want. Notice the two pieces: the file function is the "where", and the format is tab separated, but since the first row holds the column names it's TabSeparatedWithNames. You'll see it's not a very big file, it only has a thousand rows, and that ran very quickly. Obviously you can just query the file; I have ClickHouse running, but notice I'm querying the file as it sits and not doing anything with the data except streaming it through ClickHouse. For example, I can select the first 10 rows, and you can see they're Hacker News comments. Because I know the column names, for example id, type, author, timestamp, I can query the data where it sits and specify things like selecting just the author and the timestamp, so here I'm grabbing the first 10 rows of just those two columns.

The other thing I want to point out is that the extension on the file usually tells you what the format is. Because it's comments.tsv I don't need to pass the TabSeparatedWithNames format at all; I'll try it down here with just the file table function and grab 10 rows, and ClickHouse will figure out that it's tab separated. ClickHouse can figure it out even if the file is compressed, as long as the extension of the file matches the data in the file.
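Here's a minimal sketch of the queries described above, assuming comments.tsv sits in the server's user_files directory (run them with clickhouse-client --query "..." or from any client session):

-- count the rows, stating the format explicitly
SELECT count() FROM file('comments.tsv', 'TabSeparatedWithNames');

-- peek at the first 10 rows; the format is inferred from the .tsv extension
SELECT * FROM file('comments.tsv') LIMIT 10;

-- query just the columns you care about, right where the file sits
SELECT author, timestamp FROM file('comments.tsv') LIMIT 10;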
The other thing to point out about the table function is that it infers the column names and the data types of each column, and it does that by processing thousands of rows to figure out what those types should be. You can take advantage of that. In this example I'm going to create a table called comments and put the data from the file into my ClickHouse table. It's going to be a MergeTree table, and I just picked a couple of the columns, author and timestamp, to be the primary key. I'm not specifying column names and types here; whatever the inference comes up with is what the table will use. By the way, it's really easy to see the inference: I can just run the ClickHouse client and describe the file. When I do that, the TSV looks like it's got six columns, and schema inference loves to make things Nullable, it just kind of does that. So if I try to create the table directly from the inferred schema, I get an error saying that the primary key can't be nullable. There are a couple of ways around that, but the easiest is to just let the primary key be nullable using the allow_nullable_key setting. This is the same command as before, but notice I turned that setting on, and now it works. You'll see it went very fast, it's only a thousand rows. Now if I select the count from comments, which is a table inside my ClickHouse server, it has the thousand rows, and I can query the data, figure out what the average length of the comments is, or whatever, just like you would expect to be able to do.

So that was a local, tab-separated file. Let's say I have a Parquet file instead: here I have a file called hits.parquet and it's up in S3, so I'll use the s3 table function, and the format would be Parquet, but I don't have to specify that because the file name has the .parquet extension and ClickHouse figures it out. In that example there are lots of columns; it's a decent-sized table, dozens of columns as I scroll up through what was inferred, and since it's Parquet, ClickHouse pretty much knows the data types of the columns because they're part of the file format.

Here I have a file that's in Google Cloud Storage inside a gzip-compressed .gz file, and the name doesn't have .tsv in it, so I have to specify the format; ClickHouse can't determine it because the file name doesn't say what's inside. But notice ClickHouse does figure out the compression, you don't have to say what the compression is, it will happily work that out.

More realistically, if you were going to do this you would probably create a table first. I have some New York City taxi data up in Google Cloud Storage, so I'm going to create a table called trips; this is the table where I want to insert the data, and notice I picked all the column names, the data types, and my primary key. Let me create the table, and now you'll see a super common way to do this: I insert into my table trips and grab everything from the files sitting in GCS. What I want you to notice is the syntax: the {0..2} in the path actually matches three files, trips_0.gz, trips_1.gz, and trips_2.gz, and you can also use an asterisk wildcard to read hundreds of files. I pass TabSeparatedWithNames, although ClickHouse can infer that, and there we go; it takes however much time it takes to insert the data, and as you can see it took about 30 seconds for three million rows.
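A hedged sketch of those patterns, assuming the same comments.tsv as above and a trips table that already exists with an explicitly declared schema; the GCS bucket URL here is just a placeholder:

-- see what column names and types schema inference came up with
DESCRIBE file('comments.tsv');

-- create a table straight from the inferred schema; the inferred columns are
-- Nullable, so the primary key has to be allowed to be Nullable
CREATE TABLE comments
ENGINE = MergeTree
ORDER BY (author, timestamp)
SETTINGS allow_nullable_key = 1
AS SELECT * FROM file('comments.tsv', 'TabSeparatedWithNames');

SELECT count() FROM comments;

-- more realistically: define the destination table yourself, then insert from
-- cloud storage; {0..2} expands to trips_0.gz, trips_1.gz, and trips_2.gz
INSERT INTO trips
SELECT *
FROM gcs('https://storage.googleapis.com/my-bucket/trips_{0..2}.gz',
         'TabSeparatedWithNames');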
I want to show you another example, because in all the examples so far I've used the schema inference. Here I'm in the docs on the UK property prices dataset, which is a great example. A MergeTree table gets created, and the data is sitting up in S3, but we use the url table function to retrieve it. Here's the insert command, and we do some transformations on the data as we bring it in. What I want to show you is what comes after the file location: I put CSV, because that's what kind of file it is, and then, if you want, you can pass in a schema. This is not the schema of the table you want the data to go into; it's the schema of the data in the file. The CSV file has this many columns, these are the names I've decided to give those columns, and then I use those names up in the SELECT. It's pretty common to do that, so I wanted you to see that example as well.

And then finally there's one more topic that's just super handy. Sometimes when data is coming in from a file, some rows aren't in the right format, or maybe a date is formatted wrong, and you've got millions of rows and only a few of them have errors. You can set a value called input_format_allow_errors_num; the default is zero, but you can set it to, say, a thousand, and then the insert will still work with up to a thousand errors. If an absolute number doesn't make sense, there's also input_format_allow_errors_ratio, where you can say that if some small fraction of rows, one percent or five percent, have errors, the insert will still succeed. I wanted you to see those settings because they're super handy; both of them, along with the explicit-schema idea, are sketched at the end of this section.

So, like I said, when it comes to inserting files, find your table function, which depends on where the data sits, and then check out the data format. In the docs, if you go to the SQL reference you'll see all the table functions listed nicely, and on the docs home page, if you scroll down, you can see all the formats; it's a super handy page that shows the dozens and dozens of input and output formats ClickHouse supports.
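As promised, here's a rough sketch of the url-plus-schema pattern combined with the error-tolerance settings; the URL, destination table, column names, and transformations are illustrative, not the exact ones from the docs example:

-- the third argument to url() describes the columns of the CSV file itself,
-- not the destination table, and those names are what the SELECT refers to
INSERT INTO prices
SELECT toDate(transfer_date) AS date, price, town
FROM url(
    'https://example.com/pp-complete.csv',
    'CSV',
    'transfer_date String, price UInt32, town String'
)
SETTINGS
    input_format_allow_errors_num = 1000,    -- tolerate up to 1,000 bad rows
    input_format_allow_errors_ratio = 0.01;  -- or up to 1% of all rows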
Info
Channel: ClickHouse
Views: 88
Keywords: ClickHouse
Id: vhrSxW15su4
Length: 8min 57sec (537 seconds)
Published: Tue Sep 05 2023