What is a Columnar Database? (vs. Row-oriented Database)

A columnar database can dramatically improve the speed of some of your queries, especially when you only need to access a subset of columns. For example, suppose you need to create a pie chart based on the total sales data in a billion-row database, with each row representing a single sale. A row-oriented database would need to read every row to find the sale value and add it to the totals, so it would need to read almost the entire database or perform random reads. A columnar database would only need to read the sale value column, whose values are all stored together.

A typical relational database presents data as a two-dimensional table of columns and rows. For example, a database might have the following table. It includes employee ID, name fields, and a salary. This two-dimensional format is an abstraction. In an actual implementation, storage hardware requires the data to be serialized into one form or another.

Most databases are too large to fit into random access memory (RAM), so they need to be stored on disk. Some of the most expensive operations in computers involve hard disks and seeks. For example, a typical SATA hard drive has an average seek time of between 16 and 22 milliseconds, while DRAM access on an Intel Core i7 processor takes, on average, 60 nanoseconds, which is roughly 270,000 to 370,000 times faster. That's why how data is serialized and stored on disk is very important.

Now, hard disks are organized into a series of blocks of a fixed size, typically enough to store several rows of a table. Sequential reads from disks are also faster than random or partial reads; this is especially true for spinning platter hard drives, but it also applies to solid-state storage devices. By organizing the table's data so that rows fit within these blocks and grouping related rows in sequential blocks, we minimize the number of disk operations that need to be performed.
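To put rough numbers on the billion-row sales example, here is a back-of-the-envelope sketch of how many disk blocks each layout would have to read to total the sale values. The block size (4,096 bytes), row size (128 bytes), and sale value size (8 bytes) are assumptions chosen for illustration, and `blocks_to_scan` is a made-up helper, not part of any database.

```python
import math


def blocks_to_scan(num_values: int, bytes_per_value: int, block_size: int = 4096) -> int:
    """Blocks read when scanning num_values fixed-size items packed back to back."""
    values_per_block = block_size // bytes_per_value
    return math.ceil(num_values / values_per_block)


ROWS = 1_000_000_000   # one row per sale, as in the example
ROW_BYTES = 128        # assumed size of a whole serialized row
SALE_BYTES = 8         # assumed size of the sale value alone

# Row layout: the sale value is buried in each row, so every row block is read.
row_store_blocks = blocks_to_scan(ROWS, ROW_BYTES)

# Column layout: sale values sit next to each other, so only those blocks are read.
column_store_blocks = blocks_to_scan(ROWS, SALE_BYTES)

if __name__ == "__main__":
    print(f"row layout:    {row_store_blocks:,} blocks")
    print(f"column layout: {column_store_blocks:,} blocks")
```

Under these assumed sizes the column layout reads 16 times fewer blocks; the real ratio depends on how wide the rows are compared to the column being scanned.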
How the data is arranged and stored on disk therefore has a huge impact on overall database performance.

Now, for row-oriented databases like MySQL or Postgres, a common method of storing a table is to serialize each row of data like this. When data is inserted into the table, it is assigned an internal ID. This RowID is used internally in the system to refer to the data. In this case, the records have sequential RowIDs independent of the user-assigned employee ID. In this example, we use short integers to store RowIDs; in practice, 64-bit or even 128-bit numbers are typically used.

The biggest difference from columnar databases is that row-oriented databases are designed to efficiently return data for an entire row or record with as few operations as possible. This is based on the common use case where you are trying to retrieve information about a particular object. It could be product information in an online store or user information in a CRM system. If you store a record's data in a single block, along with other related records, this minimizes the number of disk operations and improves query performance.

On the other hand, row-oriented systems are not efficient at performing set-wide operations on the whole table, as opposed to operations on a small number of specific records. For instance, in order to find all records in the table with salaries between 60,000 and 90,000, the database would have to scan the entire table looking for matching records. While our small table will likely fit in a single disk block, a table with even a few hundred rows would not, and multiple disk operations would be needed to get the required result.

To improve the performance of these sorts of operations, which are very common and generally the point of using a DBMS, most databases support the use of database indexes. These indexes store all the values from a set of columns along with RowID pointers back to the original table.
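The row layout and the index just described can be sketched in a few lines. This is only an illustration: the sample employees, the `RowID:field,field;` text format, and the function names are assumptions made up for this sketch, and real databases use binary pages and tree-structured indexes.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample data: (employee_id, last_name, first_name, salary).
EMPLOYEES = [
    (10, "Smith", "Joe", 60000),
    (12, "Jones", "Mary", 80000),
    (11, "Johnson", "Cathy", 94000),
    (22, "Jones", "Bob", 55000),
]


def serialize_rows(rows):
    """Lay rows out one after another, each prefixed with an internal RowID."""
    return ";".join(
        f"{rowid:03d}:" + ",".join(str(value) for value in row)
        for rowid, row in enumerate(rows, start=1)
    ) + ";"


def build_salary_index(rows):
    """Return (salary, rowid) pairs sorted by salary, pointing back at the rows."""
    return sorted((row[3], rowid) for rowid, row in enumerate(rows, start=1))


def rowids_in_salary_range(index, low, high):
    """Binary-search the sorted index for salaries in the range [low, high]."""
    start = bisect_left(index, (low, 0))
    end = bisect_right(index, (high, float("inf")))
    return [rowid for _, rowid in index[start:end]]


if __name__ == "__main__":
    print(serialize_rows(EMPLOYEES))
    index = build_salary_index(EMPLOYEES)
    print(rowids_in_salary_range(index, 60000, 90000))
```

Because the index is sorted by salary, the range lookup only touches the matching slice of the index instead of scanning every row.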
For example, an index on the salary column would look something like this.

As they store only single pieces of data rather than entire rows, indexes are generally much smaller than the main table. Scanning this smaller set of data reduces the number of disk operations. If the index is heavily used, it can dramatically reduce the time for common operations. However, maintaining indexes adds overhead to the system, especially when new data is written to the database. Records not only need to be stored in the main table, but any attached indexes must be updated as well.

The main reason why indexes dramatically improve performance on large datasets is that database indexes on one or more columns are typically sorted by value. This makes range queries (like the above example of finding all records with salaries between $60,000 and $90,000) very fast, because the database can search the sorted values for the start of the range instead of scanning every row.

Next, let's take a look at column-oriented databases. They serialize all of the values of one column together, followed by the values of the next column, and so on. For our example table, the data would be stored like this.

Whether or not a column-oriented database will be more efficient depends heavily on the workload or the query you are trying to run. Operations that retrieve all the data for a given object (the entire row) are slower. A row-oriented system can retrieve the row in a single disk read, whereas a columnar database requires several disk operations to collect data from multiple columns. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. For instance, in CRM applications, collecting the first and last names from many rows to build a list of contacts is far more common than reading all data for any single record.
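The column-by-column layout described above can be sketched the same way. Again, this is a hedged illustration: the sample employees, the `value:RowID` text format, the column names, and the helper functions are all assumptions invented for this sketch, not how any particular column store encodes its data.

```python
# Hypothetical sample data: (employee_id, last_name, first_name, salary).
EMPLOYEES = [
    (10, "Smith", "Joe", 60000),
    (12, "Jones", "Mary", 80000),
    (11, "Johnson", "Cathy", 94000),
    (22, "Jones", "Bob", 55000),
]
COLUMNS = ("employee_id", "last_name", "first_name", "salary")


def to_columns(rows):
    """Regroup row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def serialize_columns(columns):
    """Write every value of one column, then the next; each value keeps its RowID."""
    return ";".join(
        ",".join(f"{value}:{rowid:03d}" for rowid, value in enumerate(values, start=1))
        for values in columns.values()
    ) + ";"


columns = to_columns(EMPLOYEES)

# An aggregate over one column only touches that column's values.
total_salary = sum(columns["salary"])

if __name__ == "__main__":
    print(serialize_columns(columns))
    print(total_salary)
```

Summing salaries reads a single list here, while rebuilding one full employee record would need one lookup per column, which mirrors the trade-off described above.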
Touching only a subset of columns is even more typical when writing data into the database, especially if the data tends to be 'sparse' with many optional columns. For this reason, column stores have demonstrated excellent real-world performance despite many theoretical disadvantages.

Let's go over a few use cases where you would choose a columnar database. First of all, you would use it if your primary use case is analytics. It is also a good fit if you have a large amount of data and need low query latency, if you don't require strict ACID compliance or you're using event sourcing principles, and if you only need to append new data and don't perform frequent updates and deletions. And finally, you would use it if you need to store and analyze lots of time series data.

Now, you would choose a traditional row-oriented database like PostgreSQL if your primary use case involves transactions. This choice is also appropriate if you don't need low query latency for analytics and if you're working with smaller data sets. You would also choose a row-oriented database if you need strict ACID compliance, if you need to perform frequent, small updates and deletions, and finally, if you need to store and access records with unique IDs.

By the way, I have a video explaining the differences between the most popular databases and when you should use each one, or you can take a look at related database videos in this playlist. Thank you for watching, and I'll see you in the next video.
Published: Sun May 05 2024
