InterSystems IRIS
has traditionally stored data in rows when it's
used as a relational database. Now, thanks to the flexibility
of the underlying data structure, you can also
store data by column. While each option
has its benefits, storing data in
columns--referred to as columnar storage--can
significantly boost performance for a broad set of use cases. Consider a company that
keeps data for all orders it receives using
row-based storage, tracking data such as the
OrderID, OrderDate, Customer, Priority, Status,
and TotalAmount. Each row of data logically
corresponds to an order, and all data in a single row
is physically stored together. This model makes it easy to
quickly add or update orders. Orders can be added
one at a time, and each database write
corresponds to exactly one row. With transactions
like this, there is no need to access
or update any data in the table except the
row you are changing. Let's consider another use case. Say the company wants to find
the average earnings per sale in each month. For this, you would only
need the information from two columns:
OrderDate and TotalAmount. However, with
row-based storage, you must retrieve all the
data for each order to get this information. Even if you use an index
on the OrderDate column, you would still need
to read the full row for each order within the date
range to get the TotalAmount. With a large number of orders,
this can be very inefficient. This is where columnar
storage comes into play. Rather than storing
data together by row, data is stored
together by column. Logically, the table and the
associations between the data remain the same. It is only the physical
storage method that changes. Let's see how you would get the
average earnings for a given month when using
columnar storage. You only need to read
two columns: OrderDate and TotalAmount. You can filter the
OrderDate column to select the orders
placed in a certain month, then use fast
vectorized processing to calculate the average
from the TotalAmount column. There are a few benefits
to storing data this way. First is the ability to run
much faster analytical queries. As you saw in the example,
columnar storage greatly reduces the amount
of irrelevant data that must be retrieved
for common queries. Rather than reading all of
the data for each order, we only need to read the
OrderDate and TotalAmount columns. Since we are filtering by date,
all of the data in that column is relevant, and only one
entry in the TotalAmount column ends up being irrelevant. With low-level instructions
available on modern CPUs, operations can also be
applied to chunks of data rather than single values. This increase in efficiency can
greatly improve your analytics. Another benefit to
columnar storage is that all elements in a column
have the same data type, making for efficient compression. For example, all entries
in the Status column are comprised of only
a few possible strings. In this case, the column
can maintain a dictionary of distinct statuses
and store only pointers to the dictionary entries,
rather than full strings. Doing so makes encoding each
column much more efficient, as you avoid storing
duplicate values. It also makes reading
column data more efficient. The benefits of
columnar storage are noticeable in a typical
analytical workload, such as in data warehouses. However, in many use cases, a
purely columnar storage model is not the best approach. For example, columnar
storage is less ideal when running transactions
to retrieve or update a single row. Since you need to access
all the columns that store data for that
row, you end up pulling more data
than necessary. In our example,
columnar storage greatly improves the efficiency
of analyzing earnings, but adding a new order to the
table would be less efficient. In InterSystems
IRIS, data is stored in globals, which
allows you to use a combination of both
row-based and columnar storage. With this flexible
approach, you can choose to store
certain data by column that you will need to
run fast queries on, while storing the rest
of the data by row so you can still run quick
transactions to retrieve or update a single row. Data such as Status and
Priority will likely be updated a few times,
so those transactions will benefit from row storage. You can store the TotalAmount
in columnar storage for easy access and
analysis, while storing the rest of the table in rows. Columnar storage is simple
to implement: just add a Storagetype clause, WITH
STORAGETYPE = COLUMNAR, to the end of your
table definition. There is no change to the
way you query your data, so you can start running
fast analytic queries as soon as you have defined your
table with columnar storage.