What is a Columnar Database? (vs. Row-oriented Database)

Captions

A columnar database can dramatically speed up some of your queries, especially when you only need to access a subset of columns. For example, say you need to create a pie chart based on the total sales in a billion-row database, with each row representing a single sale. A row-oriented database would need to read every row to find the sale value and add it to the totals, so it would end up reading almost the entire database or performing random reads. A columnar database would only need to read the sale value column, which is all stored together.

A typical relational database presents data as a two-dimensional table of columns and rows. For example, a database might have the following table, which includes an employee ID, name fields, and a salary. This two-dimensional format is an abstraction. In an actual implementation, storage hardware requires the data to be serialized into one form or another.

Most databases are too large to fit into random access memory (RAM), so they need to be stored on disk. Among the most expensive operations in a computer are hard disk seeks. For example, a typical SATA hard drive has an average seek time of between 16 and 22 milliseconds, while DRAM access on an Intel Core i7 processor takes, on average, 60 nanoseconds, which is roughly 270,000 to 370,000 times faster. That's why the way data is serialized and stored on disk is so important.

Now, hard disks are organized into a series of fixed-size blocks, each typically large enough to store several rows of a table. Sequential reads from disk are also faster than random or partial reads; this is especially true for spinning-platter hard drives, but it also applies to solid-state storage devices. By organizing the table's data so that rows fit within these blocks and grouping related rows in sequential blocks, we minimize the number of disk operations that need to be performed.
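To make the block argument concrete, here is a rough back-of-the-envelope sketch in Python. It counts how many fixed-size blocks must be read to total one column of the billion-row sales table under each layout. The block size, row size, and value size are assumptions picked purely for illustration; none of them come from the video, and real systems add compression, headers, and caching that this ignores.

```python
# Hypothetical sizes, chosen only for illustration (none come from the video).
BLOCK_SIZE = 4096          # bytes per disk block
ROW_SIZE = 64              # bytes per serialized sale record (all columns)
VALUE_SIZE = 8             # bytes for the sale value alone
NUM_ROWS = 1_000_000_000   # one row per sale

def blocks_needed(total_bytes, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks required to hold total_bytes (rounded up)."""
    return (total_bytes + block_size - 1) // block_size

# Row layout: the sale value is interleaved with the other columns,
# so summing it means reading every block of the table.
row_blocks = blocks_needed(NUM_ROWS * ROW_SIZE)

# Column layout: the sale values sit together, so only those blocks are read.
column_blocks = blocks_needed(NUM_ROWS * VALUE_SIZE)

print(row_blocks, column_blocks)
```

With these assumed sizes the column layout touches one eighth as many blocks, simply because the sale value makes up one eighth of each row. The wider the rows are relative to the column you need, the bigger the gap.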
This means that how the data is arranged and stored on disk has a huge impact on overall database performance. Now, for row-oriented databases like MySQL or Postgres, a common method of storing a table is to serialize each row of data like this. When data is inserted into the table, it is assigned an internal ID. This RowID is used internally in the system to refer to the data. In this case, the records have sequential RowIDs independent of the user-assigned employee ID. In this example, we use short integers to store RowIDs; in practice, 64-bit or even 128-bit numbers are typically used.

The biggest difference is that row-oriented databases are designed to efficiently return data for an entire row or record with as few operations as possible. This is based on the common use case where you are trying to retrieve information about a particular object. It could be product information in an online shopping store or user information in a CRM system. If you store a record's data in a single block, along with other related records, this minimizes the number of disk operations and improves query performance.

On the other hand, row-oriented systems are not efficient at performing set-wide operations on the whole table, as opposed to a small number of specific records. For instance, to find all records in the table with salaries between 60,000 and 90,000, the database would have to scan the entire table looking for matching records. While our small table will likely fit in a single disk block, a table with even a few hundred rows would not, and multiple disk operations would be needed to get the required result.

To improve the performance of these sorts of operations, which are very common and generally the point of using a DBMS, most databases support the use of database indexes. These indexes store all the values from a set of columns along with RowID pointers back to the original table.
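The row layout, the full scan, and the index described above can be sketched in a few lines of Python. This is only an illustration: the table contents, the text-based serialization, and all function names are hypothetical, and the index is assumed to be a plain sorted list rather than the tree structures real databases use.

```python
from bisect import bisect_left, bisect_right

# Hypothetical employee table: (row_id, emp_id, last_name, first_name, salary).
ROWS = [
    (1, 10, "Smith", "Joe", 60000),
    (2, 12, "Jones", "Mary", 80000),
    (3, 11, "Johnson", "Cathy", 94000),
    (4, 22, "Jones", "Bob", 55000),
]

def serialize_rows(rows):
    """Row-oriented layout: all fields of one record, then the next record."""
    return ";".join(",".join(str(field) for field in row) for row in rows)

def scan_salary_range(rows, low, high):
    """Full table scan: every row is examined, whether it matches or not."""
    return [row[0] for row in rows if low <= row[4] <= high]

def build_salary_index(rows):
    """Index entries are (salary, row_id) pairs, assumed sorted by salary."""
    return sorted((row[4], row[0]) for row in rows)

def index_salary_range(index, low, high):
    """Binary-search the sorted index and return only the matching RowIDs."""
    start = bisect_left(index, (low, float("-inf")))
    end = bisect_right(index, (high, float("inf")))
    return [row_id for _, row_id in index[start:end]]

salary_index = build_salary_index(ROWS)
print(serialize_rows(ROWS))
print(scan_salary_range(ROWS, 60000, 90000))
print(index_salary_range(salary_index, 60000, 90000))
```

Both queries return the same RowIDs, but the scan looks at every full record while the index lookup only touches small (salary, RowID) pairs and then follows the RowIDs back to the table.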
For example, an index on the salary column would look something like this. Because they store only single pieces of data rather than entire rows, indexes are generally much smaller than the main table store, and scanning this smaller set of data reduces the number of disk operations. If the index is heavily used, it can dramatically reduce the time for common operations. However, maintaining indexes adds overhead to the system, especially when new data is written to the database: records not only need to be stored in the main table, but any attached indexes must be updated as well. The main reason indexes dramatically improve performance on large datasets is that an index on one or more columns is typically sorted by value. This makes range queries (like the earlier example of finding all records with salaries between $60,000 and $90,000) very fast, because a sorted index can be searched without examining every entry, which lowers the time complexity.

Next, let's take a look at column-oriented databases. They serialize all of the values of one column together, followed by the values of the next column, and so on. For our example table, the data would be stored like this. Whether a column-oriented database is more efficient depends heavily on the workload, that is, on the queries you are trying to run. Operations that retrieve all the data for a given object (the entire row) are slower. A row-oriented system can retrieve the row in a single disk read, whereas a columnar database requires several disk operations to collect data from multiple columns.

However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. For instance, in CRM applications, collecting the first and last names from many rows to build a list of contacts is far more common than reading all data for any single record. This is even more true for writing data into the database, especially if the data tends to be "sparse", with many optional columns.
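The column-oriented layout described above can be sketched the same way. Again, this is a simplified illustration: the table, the column names, and the helper functions are hypothetical, and it assumes that a record's values sit at the same position in every column, which is one common convention but not the only one.

```python
# Hypothetical employee table: (emp_id, last_name, first_name, salary).
COLUMN_NAMES = ["emp_id", "last_name", "first_name", "salary"]
ROWS = [
    (10, "Smith", "Joe", 60000),
    (12, "Jones", "Mary", 80000),
    (11, "Johnson", "Cathy", 94000),
    (22, "Jones", "Bob", 55000),
]

def to_columns(rows):
    """Column-oriented layout: one list per column, aligned by position."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}

def serialize_columns(columns):
    """All values of one column, followed by all values of the next column."""
    return ";".join(
        ",".join(str(value) for value in columns[name]) for name in COLUMN_NAMES
    )

def total_salary(columns):
    """An aggregate touches only the one column it needs."""
    return sum(columns["salary"])

def fetch_row(columns, position):
    """Rebuilding a whole record needs one lookup in every column."""
    return {name: columns[name][position] for name in COLUMN_NAMES}

columns = to_columns(ROWS)
print(serialize_columns(columns))
print(total_salary(columns))
print(fetch_row(columns, 2))
```

Here `total_salary` reads a single list, which mirrors reading only the salary column from disk, while `fetch_row` has to visit all four lists, which mirrors the several disk operations a columnar database needs to return an entire row.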
For this reason, column stores have demonstrated excellent real-world performance despite many theoretical disadvantages.

Let's go over a few use cases where you would choose a columnar database. First of all, you would use one if your primary use case is analytics. It is also a good fit if you have a large amount of data and need low query latency, if you don't require strict ACID compliance or you're using event sourcing principles, and if you only need to append new data and don't perform frequent updates and deletions. And finally, you would use one if you need to store and analyze lots of time series data.

Now, you would choose a traditional row-oriented database like PostgreSQL if your primary use case involves transactions. This choice is also appropriate if you don't need low query latency for analytics and if you're working with smaller data sets. You would also choose a row-oriented database if you need strict ACID compliance, if you need to perform frequent, small updates and deletions, and finally, if you need to store and access records with unique IDs.

By the way, I have a video explaining the differences between the most popular databases and when you should use each one, or you can take a look at related database videos in this playlist. Thank you for watching, and I'll see you in the next video.

Info

Channel: Anton Putra

Views: 1,818

Keywords: columnar database, columnar database pros and cons, what is columnar database, columnar database example, columnar vs row database, column oriented database, devops, databases, what is database, anton putra, cloud, aws, azure, gcp

Id: 1MnvuNg33pA

Length: 8min 19sec (499 seconds)

Published: Sun May 05 2024