What is Columnar Storage?

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
InterSystems IRIS has traditionally stored data in rows when it's used as a relational database. Now, thanks to the flexibility of the underlying data structure, you can also store data by column. While each option has its benefits, storing data in columns--referred to as columnar storage--can significantly boost performance for a broad set of use cases. Consider a company that keeps data for all orders it receives using row-based storage, tracking data such as the OrderID, OrderDate, Customer, Priority, Status, and TotalAmount. Each row of data logically corresponds to an order, and all data in a single row is physically stored together. This model makes it easy to quickly add or update orders. Orders can be added one at a time, and each database write corresponds to exactly one row. With transactions like this, there is no need to access or update any data in the table except the row you are changing. Let's consider another use case. Say the company wants to find the average earnings per sale in each month. For this, you would only need the information from two columns: OrderDate and TotalAmount. However, with row-based storage, you must retrieve all the data for each order to get this information. Even if you use an index on the OrderDate column, you would still need to read the full row for each order within the date range to get the TotalAmount. With a large number of orders, this can be very inefficient. This is where columnar storage comes into play. Rather than storing data together by row, data is stored together by column. Logically, the table and the associations between the data remain the same. It is only the physical storage method that changes. Let's see how you would get the average earnings for a given month when using columnar storage. You only need to read two columns: OrderDate and TotalAmount. You can filter the OrderDate column to select the orders placed in a certain month, then use fast vectorized processing to calculate the average from the TotalAmount column. There are a few benefits to storing data this way. First is the ability to run much faster analytical queries. As you saw in the example, columnar storage greatly reduces the amount of irrelevant data that must be retrieved for common queries. Rather than reading all of the data for each order, we only need to read the OrderDate and TotalAmount columns. Since we are filtering by date, all of the data in that column is relevant, and only one entry in the TotalAmount column ends up being irrelevant. With low-level instructions available on modern CPUs, operations can also be applied to chunks of data rather than single values. This increase in efficiency can greatly improve your analytics. Another benefit to columnar storage is that all elements in a column have the same data type, making for efficient compression. For example, all entries in the Status column are comprised of only a few possible strings. In this case, the column can maintain a dictionary of distinct statuses and store only pointers to the dictionary entries, rather than full strings. Doing so makes encoding each column much more efficient, as you avoid storing duplicate values. It also makes reading column data more efficient. The benefits of columnar storage are noticeable in a typical analytical workload, such as in data warehouses. However, in many use cases, a purely columnar storage model is not the best approach. For example, columnar storage is less ideal when running transactions to retrieve or update a single row. Since you need to access all the columns that store data for that row, you end up pulling more data than necessary. In our example, columnar storage greatly improves the efficiency of analyzing earnings, but adding a new order to the table would be less efficient. In InterSystems IRIS, data is stored in globals, which allows you to use a combination of both row-based and columnar storage. With this flexible approach, you can choose to store certain data by column that you will need to run fast queries on, while storing the rest of the data by row so you can still run quick transactions to retrieve or update a single row. Data such as Status and Priority will likely be updated a few times, so those transactions will benefit from row storage. You can store the TotalAmount in columnar storage for easy access and analysis, while storing the rest of the table in rows. Columnar storage is simple to implement: just add a Storagetype clause, WITH STORAGETYPE = COLUMNAR, to the end of your table definition. There is no change to the way you query your data, so you can start running fast analytic queries as soon as you have defined your table with columnar storage.
Info
Channel: InterSystems Learning Services
Views: 6,393
Rating: undefined out of 5
Keywords:
Id: p18s8Ckn5H4
Channel Id: undefined
Length: 5min 5sec (305 seconds)
Published: Thu Mar 30 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.