A columnar database can dramatically improve the
speed of some of your queries, especially when you only need to access a subset of columns.
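To make that concrete, here's a minimal sketch in Python. It isn't how any real database engine works internally; the tiny `sales_rows` table, its field names, and the two helper functions are all made up for illustration. It stores the same data once row by row and once column by column, then counts how many stored values each layout has to touch in order to total a single column.

```python
# Hypothetical toy data: each tuple is one sale (sale_id, region, amount).
sales_rows = [
    (1, "north", 120.0),
    (2, "south", 80.0),
    (3, "north", 200.0),
    (4, "east", 50.0),
]

# Column layout: the same data, but each column is kept as its own list.
sales_columns = {
    "sale_id": [row[0] for row in sales_rows],
    "region": [row[1] for row in sales_rows],
    "amount": [row[2] for row in sales_rows],
}


def total_from_rows(rows):
    """Sum the amount field; every value of every row is read along the way."""
    total = 0.0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched together
        total += row[2]
    return total, values_read


def total_from_columns(columns):
    """Sum the amount column; the other columns are never touched."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_values_read = total_from_rows(sales_rows)
col_total, col_values_read = total_from_columns(sales_columns)

print(row_total, row_values_read)  # 450.0 12
print(col_total, col_values_read)  # 450.0 4
```

Both layouts give the same total, but the column layout only reads the four amounts, while the row layout reads all twelve stored values.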
For example, suppose you need to create a pie chart based on the total sales data in a billion-row
table, with each row representing a single sale. A row-oriented database would need to read every
row to find the sale value and add it to the totals, so it would end up reading almost the entire
table or performing many random reads. A columnar database would only need to read the sale value
column, whose values are all stored together.

A typical relational database provides data that
represents a two-dimensional table of columns and rows. For example, a database might have an
employee table that includes an employee ID, name fields, and a salary. This two-dimensional format
is an abstraction. In actual implementation, storage hardware requires the data to
be serialized into one form or another. Most databases are too large to fit into random
access memory (RAM), so they need to be stored on disk. The most expensive operations involving
hard disks are seeks. For example, a typical SATA hard drive has an average seek time of between
16 and 22 milliseconds, while DRAM access on an Intel Core i7 processor takes, on average, 60
nanoseconds, which is roughly 270,000 to 370,000 times faster. That's why how data is serialized
and stored on disk is so important. Now, hard disks are organized into a series of blocks of a
fixed size, typically enough to store several rows of a table.
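Here's a rough back-of-the-envelope sketch of what block-based storage means for a full scan. The 4 KiB block size, the 100-byte rows, and the 8-byte salary values are assumptions picked purely for illustration, not figures from any particular database, and `blocks_needed` is a hypothetical helper.

```python
import math

BLOCK_SIZE = 4096  # bytes per disk block (assumed)
ROW_SIZE = 100     # bytes per serialized row (assumed)
VALUE_SIZE = 8     # bytes per salary value (assumed)


def blocks_needed(item_count, item_size, block_size=BLOCK_SIZE):
    """Blocks required when items are packed whole into fixed-size blocks."""
    items_per_block = block_size // item_size
    return math.ceil(item_count / items_per_block)


row_count = 1_000_000

# Scanning whole rows versus scanning one column's values stored together.
row_scan_blocks = blocks_needed(row_count, ROW_SIZE)
column_scan_blocks = blocks_needed(row_count, VALUE_SIZE)

print(row_scan_blocks)     # 25000
print(column_scan_blocks)  # 1954
```

Under these assumptions, reading every row means touching about 25,000 blocks, while reading a single 8-byte column stored on its own takes fewer than 2,000.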
Sequential reads from disk are also faster than random or partial reads; this is especially true
for spinning platter hard drives, but it also applies to solid-state storage devices. By organizing
the table's data so that rows fit within these blocks and grouping related rows in sequential
blocks, we minimize the number of disk operations that need to be performed. This means that how
the data is arranged and stored on disk has a huge impact on overall database performance.

Now, for row-oriented databases like MySQL or
Postgres, a common method of storing a table is to serialize each row's values together, one row
after another. When data is inserted into the table, it is assigned an internal ID. This RowID is
used internally in the system to refer to the data. In this case, the records have sequential
RowIDs independent of the user-assigned employee ID. In this example, we use short integers to
store RowIDs; in practice, 64-bit or even 128-bit numbers are typically used. The key point is
that row-oriented
databases are designed to efficiently return data for an entire row or record with as few
operations as possible. This is based on the common use case where you are trying to retrieve
information about a particular object. It could be product information in an online shopping
store or user information in a CRM system. If you store a record’s data in a single
block, along with other related records, this minimizes the number of disk
operations and improves query performance. On the other hand, row-oriented systems
are not efficient at performing set-wide operations on the whole table, as opposed to a
small number of specific records. For instance, to find all records in the table with salaries
between $60,000 and $90,000, the database would have to scan the entire table looking for matching
records. While our small table will likely fit in a single disk block,
a table with even a few hundred rows would not, and multiple disk operations would
be needed to get the required result. To improve the performance of these sorts of
operations, which are very common and generally the point of using a DBMS, most databases
support the use of database indexes. These indexes store all the values from a set of
columns along with RowID pointers back to the original table. For example, an index on the
salary column would list each salary value alongside the RowID of the row it belongs to. Because
they store only single pieces of data rather than entire rows, indexes are generally much smaller
than the main table. Scanning this smaller set of data reduces the number of disk operations, and
if the index is heavily used, it can dramatically reduce the time for common
operations. However, maintaining indexes adds overhead to the system, especially when new
data is written to the database. Records not only need to be stored in the main table, but
any attached indexes must be updated as well. The main reason indexes dramatically improve
performance on large datasets is that database indexes on one or more columns are typically sorted
by value. That makes range queries (like the earlier example of finding all records with salaries
between $60,000 and $90,000) very fast, with lower time complexity than a full table scan.

Next, let's take a look at column-oriented
databases. They serialize all of the values of one column together, followed by the values of the
next column, and so on. For our example table, all of the employee IDs would be stored together,
followed by the values of each name field, and then all of the salaries. Whether or not a
column-oriented database will be
more efficient depends heavily on the workload or the query you are trying to run. Operations that
retrieve all the data for a given object (the entire row) are slower. A row-oriented system
can retrieve the row in a single disk read, whereas a columnar database requires several
disk operations to collect data from multiple columns. However, these whole-row operations
are generally rare. In the majority of cases, only a limited subset of data is retrieved. For
instance, in CRM applications, collecting the first and last names from many rows to build a
list of contacts is far more common than reading all data for any single record. This is even truer
for writing data into the database, especially if the data tends to be 'sparse' with many optional
columns. For this reason, column stores have demonstrated excellent real-world performance
despite many theoretical disadvantages.

Let's go over a few use cases where you would choose a columnar database. First of all, you would
use it if your primary use case is analytics. It's also a good fit if you have a large amount of
data and need low query latency, if you don't require strict ACID compliance or you're using event
sourcing principles, or if you only need to append new data and don't perform frequent updates and
deletions. And finally, you would use it if you need to store and analyze lots of time series
data. Now, you would choose a traditional row-oriented
database like PostgreSQL if your primary use case involves transactions. This choice is also
appropriate if you don’t need low query latency for analytics and if you're working with
smaller data sets. You would also choose a row-oriented database if you need strict ACID
compliance, if you need to perform frequent, small updates and deletions, and finally, if you
need to store and access records with unique IDs. By the way, I have a video explaining
the differences between the most popular databases and when you should use each
one, or you can take a look at related database videos in this playlist. Thank you for
watching, and I’ll see you in the next video.