Performance of your application frequently depends on the database you select for
a specific task. As a developer, you may need to choose the appropriate database
based on your use case and requirements. In this video, we'll go through the
most common database types used in enterprises. I'll cover use cases for when to choose a specific database, as well as examples of open-source
and cloud-managed databases. Let's start with the relational database.
Originally, it was developed by IBM in the 1970s. You can think of a relational database
as a collection of spreadsheet files that help businesses organize, manage, and relate data. In
the relational database model, each 'spreadsheet' is a table that stores information, represented
as columns (attributes) and rows (records or tuples). Now, columns specify data types, and each
record (or row) contains values corresponding to those specific data types. For example, in the
customer table, you may have an integer type for the ID column and varchar for the name
and address. Also, in a relational database, each table has a unique identifier for each row
called the "primary key." Rows from different tables can be linked using a "foreign key," which
refers to the primary key in another table. Let’s take a look at how the relational database
model works in practice. Let's say we have a website and track every single action that
a user may do. A user can click on a button, add an item to the shopping cart, and things like
that. Now, in a very simplified version, we would have a customer table and an event table. Each
time a customer does something on the website, we would store it in the event table and use a
foreign key to link this event to the customer. Also, we can use SQL to query our database.
For example, you can use a simple SELECT query to get all the customers you have. Now, if you want to get all customers who are active, meaning they have performed at least one action, you can use a join; both queries are sketched below. I have another video explaining all the different joins as well.
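Here's a minimal sketch of both queries, run here with SQLite from Python; the customer and event tables and their columns just mirror our example and would look different in a real schema:

import sqlite3

# Illustrative schema mirroring the customer/event example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE event (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customer(id), action TEXT);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO event VALUES (1, 1, 'click_button'), (2, 1, 'add_to_cart');
""")

# All customers.
print(conn.execute("SELECT * FROM customer").fetchall())

# Only active customers, i.e., customers with at least one event.
print(conn.execute("""
    SELECT DISTINCT customer.id, customer.name
    FROM customer
    JOIN event ON event.customer_id = customer.id
""").fetchall())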
Another important feature provided by most relational databases is ACID transactions. This means all changes to data are performed as if they are a single operation. If at least one
task fails, the whole transaction is rolled back. For example, consider a transfer of funds from one
bank account to another. First, you need to check if there is enough money to transfer, then you
would debit one account and credit another. So, if at least one step fails, you want to roll
back the entire transaction.
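Here's a minimal sketch of that idea, again with SQLite from Python and made-up account data:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            balance = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()[0]
            if balance < amount:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
    except ValueError:
        pass  # nothing was committed, so both balances stay as they were

transfer(conn, 1, 2, 70)   # succeeds and is committed
transfer(conn, 1, 2, 500)  # fails the check, so the whole transaction is rolled back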
Now, you would use a relational database for structured data and ACID compliance. Structured data means that you have rows and columns that clearly define data
attributes. Some examples of relational databases are MySQL, PostgreSQL, MariaDB, Microsoft SQL
Server, and Oracle Database. There are also some cloud-managed databases available, such
as Amazon Aurora, Azure SQL, and many others. Now, traditional row-oriented databases do not scale very well for analytical queries. If you have a lot of data, like billions of records, they can be slow.
To improve query performance with big data, columnar databases have been developed. Let’s
go through a simple exercise, which will help you see the power of using a columnar database
for certain tasks. To illustrate this exercise, I’m going to use a spreadsheet populated
with some customer data. First of all, let's see how a traditional database works. Before
we start, there is a rule that we need to follow: we have to read the data from left to right,
starting from the beginning of each row, and proceed from row to row, kind of like reading
a book. So, let’s start. Let's ask the traditional database to return all paying customers. Here it
goes: as you read the data, notice that you're retrieving the Name and Tier on each pass of every
row, and eventually, you'll acquire all the Names and their associated Tiers. Now, servers do this
very quickly; however, what if my request involves several billion rows? You can start to see that
reading each row just to grab the Name and Tier columns could take a while. And this is where
a columnar database comes in handy. We’re going to redo the exercise one more time, but instead
of using traditional database rules, I’m going to give you columnar database rules. This means
you’re going to read the data from top to bottom, and you’re only going to read the columns that I
ask you to. Now, let's ask the columnar database the same question: to return all paying customers.
Let’s start. Notice that this method skips all the data that isn’t related to what we’re looking
for. Once we’ve finished getting that information, we start with the next column, skipping the
columns that we don’t need. But wait a second, how does the computer know which Tier to
assign to each Customer name it’s retrieving? The way it does this is by having the columnar database assign a number to each row of data, allowing it to quickly pair up the different columns it retrieves. You can see how this really comes in handy when you start reading the Name column. All it needs to know is the row number where the Name values ended, in this case, row 10. This means it just needs the Tier values from rows 1 to 10. Now, you've probably noticed how the columnar
database was repeating 'basic' over and over. This is actually another advantage of a
columnar database. By using the numbering system, columnar databases can apply compression algorithms, for example, storing a repeated value like 'basic' once along with the range of rows it covers, to simplify data retrieval. And you’ll find that each columnar database system has highly
sophisticated methods for achieving further performance enhancements. So, compared to
traditional databases, a columnar database does not actually read billions of rows of data.
Instead, it just reads a few long columns. This can be a very straightforward way of addressing
the large amounts of data that corporations have to deal with. They're not replacements for traditional databases, but they are certainly a powerful way of performing highly aggregated analysis.
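To make the row-versus-column contrast concrete, here's a toy sketch with made-up customer data; real columnar engines are, of course, far more sophisticated:

# Row-oriented layout: every row is scanned even though we only need two columns.
rows = [
    ("Alice", "paying", "alice@example.com"),
    ("Bob",   "basic",  "bob@example.com"),
    ("Carol", "paying", "carol@example.com"),
]
print([(name, tier) for name, tier, email in rows if tier == "paying"])

# Column-oriented layout: the same data stored column by column. Values are paired
# up by their position (the row number), and a long run of repeated values like
# 'basic' can be stored once together with the range of rows it covers.
columns = {
    "name":  ["Alice", "Bob", "Carol"],
    "tier":  ["paying", "basic", "paying"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}
print([(n, t) for n, t in zip(columns["name"], columns["tier"]) if t == "paying"])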
I think Apache Cassandra is one of the most popular open-source columnar databases, and it is widely adopted by many Fortune 500 companies. Now, both relational and columnar databases require you to define a schema before you can store anything. If your requirements change over
time, you may need to add additional columns or perhaps change the data type of one of them.
What if you have an e-commerce application with thousands of products? Different products
usually have a different number of attributes. Managing thousands of attributes in relational
databases is inefficient and can slow down your database. This is an ideal use case for the
document database. You can store a product with all its attributes in a single document. It
simplifies management and improves read performance. And if you change the attributes of one product, it won’t affect the others. So, what is a document database? Well, it’s a type
of database used to store and query data in JSON-like documents. JSON is a data format that
is both human and machine-readable. Developers can use JSON documents in their code and save them
directly into the document database. The flexible, semi-structured, and hierarchical nature of
documents and document databases allows them to evolve with the needs of applications.
Let's take a look at the structure of JSON documents. JSON represents data in three
ways. The first is key-value pairs. These pairs are recorded within curly braces. The key
is a string, and the value can be any data type, such as an integer, decimal, or boolean. For
example, a simple key-value pair is {"year": 2024}. Next is the array. An array is an ordered
collection of values defined within left ([) and right (]) brackets. Items in the array are
comma-separated. For example, {"fruit": ["apple", "grapes"]}. And finally, objects. An object is
a collection of key-value pairs. Essentially, JSON documents allow developers to embed objects
and create nested pairs. For example, {"address": {"country": "USA", "state": "Texas"}}. Now, here is an example of a JSON-like document that describes a film dataset.
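A film document along those lines could look something like the following; the fields and values here are just illustrative, written as a Python dictionary, which maps one-to-one to JSON:

film = {
    "title": "Interstellar",
    "year": 2014,
    "rating": 8.6,
    "genres": ["adventure", "drama", "sci-fi"],
    "director": {"first_name": "Christopher", "last_name": "Nolan"},
    "cast": [
        {"first_name": "Matthew", "last_name": "McConaughey"},
        {"first_name": "Anne", "last_name": "Hathaway"},
    ],
}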
You can see that the JSON document holds simple values, arrays, and objects quite flexibly. You can even have an array
with JSON objects within it. Document-oriented databases allow you to create an unlimited-level
hierarchy of embedded JSON objects. It's entirely up to you what schema you want to give to
your document store. Now, let's talk about the operations that you can perform on a document
database. You can create, read, update, and delete entire documents stored in the database. Document
databases provide a query language or an API that allows developers to run these operations.
Okay, first, you can create documents in the database. Each document has a unique identifier
that serves as a key. Then, you can use the API or query language to read document data. You can
run queries using field values or keys. You can also add indexes to the database to increase read
performance. And finally, you can update existing documents flexibly. You can either rewrite the
entire document or update individual values.
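Here's a minimal sketch of those operations, assuming a local MongoDB instance and the pymongo driver; the database, collection, and field names are made up:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Create: each inserted document gets a unique _id that serves as its key.
result = products.insert_one({"name": "t-shirt", "sizes": ["S", "M", "L"], "price": 15})

# Read: query by key or by field values; an index can speed up these lookups.
products.create_index("name")
doc = products.find_one({"name": "t-shirt"})

# Update: change individual values without rewriting the whole document.
products.update_one({"_id": result.inserted_id}, {"$set": {"price": 12}})

# Delete: remove the entire document.
products.delete_one({"_id": result.inserted_id})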
So, what are the advantages of document databases? The first thing that comes to mind is the ease of development. JSON documents map to objects, a
common data type in most programming languages. When building applications, developers can
flexibly create and update documents directly from the code. This means they spend less time
creating data models beforehand. As a result, application development becomes more rapid and
efficient. Next, there's flexible schema. A document-oriented database allows you to
create multiple documents with different fields within the same collection. This can be
handy when storing unstructured data, like emails or social media posts. However, some document
databases offer schema validation, allowing you to impose some restrictions on the structure.
Another advantage is performance at scale. Document databases offer built-in distribution
capabilities. You can scale them horizontally across multiple servers without impacting
performance, which is also cost-efficient. Also, document databases provide fault tolerance and
high availability through built-in replication. If you want to try it out, MongoDB is a great
option. It's the most popular open-source document database available, easy to start with, has SDKs
for most languages, and has a large community. Have you ever seen a detective board in the
movies, with pictures, news articles, and notes connected by thumbtacks and yarn? Immediately,
you can see the power of connecting the dots in all of those relationships. Imagine taking that
detective board and applying a mathematical engine that could query its data relationships. Well,
that is a graph database. I want to explain graph databases by starting with relational databases.
One of the main traits of a relational database is the constraining nature of its relationships,
which makes it ideal for processing transactions. However, these strict constraints often make it
too complex to answer questions about distant relationships. Imagine you had all the university
professor and student data ever collected. Then imagine you wanted to know the relationship
between a group of 10 students originating from completely different universities. Well, at first
glance, you would think that since the students didn't go to the same universities, they really
don't have a connection. But if we look at their professors, we can discover that they all shared
a common professor when they were students. So now let's zoom in and describe some of the traits
of a graph database. First, you have nodes, which are essentially records. Connected to these
nodes is a type of relationship, which can have a direction and a property associated with it. So in
our case, the direction points from the original professor. The relationship type is 'student_of.'
And the property is the year and semester when they were taught. Now, querying this database isn't like your typical SQL query. Graph database vendors often have their own query language, so this is something that the industry is still working out.
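To make the node-and-relationship idea concrete, here's a toy sketch in plain Python rather than a real graph query language; all of the names and terms are made up:

from collections import defaultdict

# Each edge is (node, relationship type, node, properties).
edges = [
    ("alice", "student_of", "prof_jones", {"term": "Fall 2012"}),
    ("bob",   "student_of", "prof_jones", {"term": "Spring 2013"}),
    ("carol", "student_of", "prof_smith", {"term": "Fall 2014"}),
]

# Question: which professors connect more than one of these students?
students_by_prof = defaultdict(list)
for student, rel_type, prof, props in edges:
    if rel_type == "student_of":
        students_by_prof[prof].append(student)

print({prof: studs for prof, studs in students_by_prof.items() if len(studs) > 1})
# {'prof_jones': ['alice', 'bob']}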
You need to be very careful with graph databases because they can infer connections that don't actually mean anything. For example,
imagine the inference you could make if all the students in our previous example ended up dropping
out of school. Does that mean the original professor had some kind of meaningful impact on
that bad outcome? Well, anything is possible, but we have to be a little more skeptical
of such conspiratorial patterns. So, graph databases are usually a mechanism for starting
questions but not necessarily answering them. So, what we see is data science communities
using graph databases to test inferences. The discovery of these relationships, and
their relevance to the organization, is often what gets bubbled up into data warehouses, thus closing the loop. One open-source graph database that I can suggest is ArangoDB. Now, information comes in many forms. Some
information is unstructured, like text documents, pictures, videos, and audio, while some is
structured, like application logs, tables, and graphs. Vector databases, in turn, store data as high-dimensional vectors. Each vector has a certain number of
dimensions, ranging from tens to thousands, depending on the complexity of the data. Now, we
can apply transformations, typically machine-learning embedding models, to the raw data. We can encode all types of data into vectors
that capture the meaning and context of the asset. This allows us to find similar assets by searching
for neighboring data points. Vector search methods enable unique experiences, such as taking a
photograph with your smartphone and searching for similar images. You can also find documents that
are similar to a given document based on their topic and sentiment. And find products that are
similar to a given product based on their features and ratings. These are just a few examples. If you want to try one out, Milvus is a popular open-source vector database.
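Here's a toy sketch of the nearest-neighbor idea using NumPy; the catalog items and their tiny four-dimensional vectors are made up, and a real system would use learned embeddings and a proper index:

import numpy as np

catalog = {
    "red running shoes":  np.array([0.9, 0.1, 0.0, 0.3]),
    "blue running shoes": np.array([0.8, 0.2, 0.1, 0.3]),
    "leather office bag": np.array([0.1, 0.9, 0.7, 0.0]),
}
query = np.array([0.85, 0.15, 0.05, 0.3])  # embedding of the user's photo or search text

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank items by similarity to the query vector, most similar first.
ranked = sorted(catalog, key=lambda name: cosine(query, catalog[name]), reverse=True)
print(ranked[0])  # the nearest neighbor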
Now, let's talk about a key-value database. It stores data as a collection of key-value pairs, where a key serves as a unique identifier.
Both keys and values can be anything, ranging from simple objects to complex
compound objects. The document database is a special type of key-value store
where keys can only be strings. Also, when querying your document store, you can
read the entire value or a part of the value, especially if the value is another JSON object.
For example, you can have {"book": {"id": 1, "price": 10}}, then query book.price, and the database will return the value 10. Key-value databases, by contrast, always return the whole value, including both the ID and the price.
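Here's a toy illustration of that contrast using plain Python dictionaries rather than a real database client:

store = {"book:1": {"id": 1, "price": 10}}

# A key-value store hands back the entire value for a key...
whole_value = store["book:1"]
print(whole_value)           # {'id': 1, 'price': 10}

# ...while a document store can also project a single field inside the value.
print(whole_value["price"])  # 10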
There are a few advantages of key-value databases. The first is scalability. Most key-value databases can scale horizontally and automatically
distribute data across servers to reduce bottlenecks at a single server. Then there's
ease of use. Key-value databases follow the object-oriented paradigm, allowing developers
to map real-world objects directly to software objects. And performance. Unlike relational
databases, key-value databases don't have to perform resource-intensive table
joins, which makes them much faster. Now let's talk about use cases. First is session management. A session-oriented
application, such as a web application, starts a session when a user logs in and remains active
until the user logs out or the session times out. You can use it for a shopping cart. An
e-commerce website may receive billions of orders during the holiday shopping
season, so you need a high-performance database. You can also use it for a metadata
storage engine. Your key-value store can act as an underlying storage
layer for higher levels of data access. And caching: You can use a key-value database
to store data temporarily for faster retrieval. If you need a reliable key-value database,
you can take a look at etcd. Kubernetes uses etcd to store all its cluster data,
such as pods, deployments, and services. And finally, let's talk about time series
databases. As the name suggests, they're optimized for time-stamped or time series data. Time series
data are simply measurements or events that are tracked, monitored, downsampled, and aggregated
over time. This could include server metrics, application performance monitoring, network data,
sensor data, events, clicks, trades in a market, and many other types of analytics data.
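To make "downsampled and aggregated" concrete, here's a toy sketch that averages made-up CPU samples into one-minute buckets; a time series database does this kind of work for you, efficiently and at much larger scale:

from collections import defaultdict
from datetime import datetime

# Made-up CPU utilization samples: (timestamp, percent).
samples = [
    (datetime(2024, 1, 1, 12, 0, 15), 41.0),
    (datetime(2024, 1, 1, 12, 0, 45), 43.0),
    (datetime(2024, 1, 1, 12, 1, 10), 90.0),
    (datetime(2024, 1, 1, 12, 1, 50), 70.0),
]

# Downsample to one-minute averages.
buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts.replace(second=0, microsecond=0)].append(value)

for minute, values in sorted(buckets.items()):
    print(minute, sum(values) / len(values))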
A time series database is used over traditional database types because it enables fast data retrieval and is a
cost-effective storage solution for that type of data. One example of a general-purpose metrics
database is InfluxDB. We also frequently use Prometheus to collect time series data from
infrastructure. Prometheus is also a time series database with additional features for scraping metrics from targets such as VMs and Kubernetes pods. Each database type has its specialty: Relational
for structured data and ACID compliance, Columnar for analytics, Document
for unstructured data flexibility, Graph for complex relationships,
Time-Series for time-stamped data, Vector for AI and ML scenarios, and
Key-Value for simple, fast data access. Using the right database type can be a
game-changer for performance, and the wrong one can slow down your queries and be very expensive.
If you want to learn about relational databases and SQL joins, you can watch this video. Thank you
for watching, and I’ll see you in the next lesson.