Types of Databases: Relational vs. Columnar vs. Document vs. Graph vs. Vector vs. Key-value & more

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Performance of your application frequently depends   on the database you select for  a specific task. As a developer,   you may need to choose the appropriate database  based on your use case and requirements. In this video, we'll go through the  most common database types used in   enterprises. I'll provide use cases  when to choose a specific database,   as well as examples of open-source  and cloud-managed databases. Let's start with the relational database.  Originally, it was developed by IBM in the   1970s. You can think of a relational database  as a collection of spreadsheet files that help   businesses organize, manage, and relate data. In  the relational database model, each 'spreadsheet'   is a table that stores information, represented  as columns (attributes) and rows (records or   tuples). Now, columns specify data types, and each  record (or row) contains values corresponding to   those specific data types. For example, in the  customer table, you may have an integer type   for the ID column and varchar for the name  and address. Also, in a relational database,   each table has a unique identifier for each row  called the "primary key." Rows from different   tables can be linked using a "foreign key," which  refers to the primary key in another table. Let’s   take a look at how the relational database  model works in practice. Let's say we have   a website and track every single action that  a user may do. A user can click on a button,   add an item to the shopping cart, and things like  that. Now, in a very simplified version, we would   have a customer table and an event table. Each  time a customer does something on the website,   we would store it in the event table and use a  foreign key to link this event to the customer.   Also, we can use SQL to query our database.  For example, you can use the following query   to get all the customers you have. Now, if  you want to get all customers who are active,   meaning they have performed at least one action,  you can use the following join. I have another   video explaining all the different joins as  well. Another important feature provided by   most relational databases is ACID transactions.  This means all changes to data are performed as   if they are a single operation. If at least one  task fails, the whole transaction is rolled back.   For example, consider a transfer of funds from one  bank account to another. First, you need to check   if there is enough money to transfer, then you  would debit one account and credit another. So,   if at least one step fails, you want to roll  back the entire transaction. Now, you would   use a relational database for structured data  and ACID compliance. Structured data means that   you have rows and columns that clearly define data  attributes. Some examples of relational databases   are MySQL, PostgreSQL, MariaDB, Microsoft SQL  Server, and Oracle Database. There are also   some cloud-managed databases available, such  as Amazon Aurora, Azure SQL, and many others. Now, traditional databases do not scale very  well. If you have a lot of data, like billions   of records, traditional databases can be slow.  To improve query performance with big data,   columnar databases have been developed. Let’s  go through the simple exercise, which will help   you see the power of using a columnar database  for certain tasks. To illustrate this exercise,   I’m going to use a spreadsheet populated  with some customer data. First of all,   let's see how a traditional database works. Before  we start, there is a rule that we need to follow:   we have to read the data from left to right,  starting from the beginning of each row,   and proceed from row to row, kind of like reading  a book. So, let’s start. Let's ask the traditional   database to return all paying customers. Here it  goes: as you read the data, notice that you're   retrieving the Name and Tier on each pass of every  row, and eventually, you'll acquire all the Names   and their associated Tiers. Now, servers do this  very quickly; however, what if my request involves   several billion rows? You can start to see that  reading each row just to grab the Name and Tier   columns could take a while. And this is where  a columnar database comes in handy. We’re going   to redo the exercise one more time, but instead  of using traditional database rules, I’m going   to give you columnar database rules. This means  you’re going to read the data from top to bottom,   and you’re only going to read the columns that I  ask you to. Now, let's ask the columnar database   the same question: to return all paying customers. Let’s start. Notice that this method skips all   the data that isn’t related to what we’re looking  for. Once we’ve finished getting that information,   we start with the next column, skipping the  columns that we don’t need. But wait a second,   how does the computer know which Tier to  assign to each Customer name it’s retrieving?   The way it does this is by having the Columnar  Database assign a number to each row of data,   allowing it to quickly pair up the different  columns it retrieves. You can see how this really   comes in handy when you start reading the Name  column. All it needs to know is the number where   the Name values ended, in this case, row 10. This  means I just need the Tier rows from 1 to 10. Now,   you've probably noticed how the columnar  database was repeating 'basic' over and   over. This is actually another advantage of a  columnar database. By using the numbering system,   columnar databases can utilize algorithms  to simplify data retrieval. And you’ll find   that each columnar database system has highly  sophisticated methods for achieving further   performance enhancements. So, compared to  traditional databases, a columnar database   does not actually read billions of rows of data.  Instead, it just reads a few long columns. This   can be a very straightforward way of addressing  the large amounts of data that corporations have   to deal with.. They're not replacements for  traditional databases, but they are certainly   a powerful way of performing highly aggregated  analysis. I think Apache Cassandra is one of the   most popular open-source columnar databases that  is widely adopted by many Fortune 500 companies. Now, both relational and columnar databases  require you to define a schema before you can   store anything. If your requirements change over  time, you may need to add additional columns or   perhaps change the data type of one of them.  What if you have an e-commerce application   with thousands of products? Different products  usually have a different number of attributes.   Managing thousands of attributes in relational  databases is inefficient and can slow down your   database. This is an ideal use case for the  document database. You can store a product   with all its attributes in a single document. It  simplifies management and improves the reading   speed. And if you change the attribute of  one product, it won’t affect others. So,   what is a document database? Well, it’s a type  of database used to store and query data in   JSON-like documents. JSON is a data format that  is both human and machine-readable. Developers   can use JSON documents in their code and save them  directly into the document database. The flexible,   semi-structured, and hierarchical nature of  documents and document databases allows them   to evolve with the needs of applications.  Let's take a look at the structure of JSON   documents. JSON represents data in three  ways. The first is key-value pairs. These   pairs are recorded within curly braces. The key  is a string, and the value can be any data type,   such as an integer, decimal, or boolean. For  example, a simple key-value pair is {"year":   2024}. Next is the array. An array is an ordered  collection of values defined within left ([) and   right (]) brackets. Items in the array are  comma-separated. For example, {"fruit": ["apple",   “grapes”]}. And finally, objects. An object is  a collection of key-value pairs. Essentially,   JSON documents allow developers to embed objects  and create nested pairs. For example, {"address":   {"country": "USA", "state": “Texas”}}. Now,  here is an example of a JSON-like document that   describes a film dataset. You can see that the  JSON document holds simple values, arrays, and   objects quite flexibly. You can even have an array  with JSON objects within it. Document-oriented   databases allow you to create an unlimited-level  hierarchy of embedded JSON objects. It's entirely   up to you what schema you want to give to  your document store. Now, let's talk about   the operations that you can perform on a document  database. You can create, read, update, and delete   entire documents stored in the database. Document  databases provide a query language or an API that   allows developers to run these operations.  Okay, first, you can create documents in the   database. Each document has a unique identifier  that serves as a key. Then, you can use the API   or query language to read document data. You can  run queries using field values or keys. You can   also add indexes to the database to increase read  performance. And finally, you can update existing   documents flexibly. You can either rewrite the  entire document or update individual values. So,   what are the advantages of document databases? The first thing that comes to mind is the ease   of development. JSON documents map to objects—a  common data type in most programming languages.   When building applications, developers can  flexibly create and update documents directly   from the code. This means they spend less time  creating data models beforehand. As a result,   application development becomes more rapid and  efficient. Next, there's flexible schema. A   document-oriented database allows you to  create multiple documents with different   fields within the same collection. This can be  handy when storing unstructured data, like emails   or social media posts. However, some document  databases offer schema validation, allowing you   to impose some restrictions on the structure.  Another advantage is performance at scale.   Document databases offer built-in distribution  capabilities. You can scale them horizontally   across multiple servers without impacting  performance, which is also cost-efficient. Also,   document databases provide fault tolerance and  high availability through built-in replication.   If you want to try it out, MongoDB is a great  option. It's the most popular open-source document   database available, easy to start with, has SDKs  for most languages, and has large community. Have you ever seen a detective board in the  movies, with pictures, news articles, and notes   connected by thumbtacks and yarn? Immediately,  you can see the power of connecting the dots   in all of those relationships. Imagine taking that  detective board and applying a mathematical engine   that could query its data relationships. Well,  that is a graph database. I want to explain graph   databases by starting with relational databases.  One of the main traits of a relational database   is the constraining nature of its relationships,  which makes it ideal for processing transactions.   However, these strict constraints often make it  too complex to answer questions about distant   relationships. Imagine you had all the university  professor and student data ever collected. Then   imagine you wanted to know the relationship  between a group of 10 students originating from   completely different universities. Well, at first  glance, you would think that since the students   didn't go to the same universities, they really  don't have a connection. But if we look at their   professors, we can discover that they all shared  a common professor when they were students. So now   let's zoom in and describe some of the traits  of a graph database. First, you have nodes,   which are essentially records. Connected to these  nodes is a type of relationship, which can have a   direction and a property associated with it. So in  our case, the direction points from the original   professor. The relationship type is 'student_of.'  And the property is the year and semester where   they were taught. Now, querying this database  isn't like your typical SQL query. Graph database   vendors often have their own query language,  so this is something that the industry is still   working out. You need to be very careful with  graph databases because they can infer connections   that don't actually mean anything. For example,  imagine the inference you could make if all the   students in our previous example ended up dropping  out of school. Does that mean the original   professor had some kind of meaningful impact on  that bad outcome? Well, anything is possible,   but we have to be a little more skeptical  of such conspiratorial patterns. So, graph   databases are usually a mechanism for starting  questions but not necessarily answering them. So, what we see is data science communities  using graph databases to test inferences. The   discovery of these relationships, and  their relevance to the organization,   often is what gets bubbled up into  data warehouses. Thus closing the loop. One open source graph database  that I can suggest is ArangoDB. Now, information comes in many forms. Some  information is unstructured, like text documents,   pictures, videos, and audio, while some is  structured, like application logs, tables,   and graphs. On the other side, we have vector  databases that store data as high-dimensional   vectors. Each vector has a certain number of  dimensions, ranging from tens to thousands,   depending on the complexity of the data. Now, we  can apply some kind of transformations to the raw   data. We can encode all types of data into vectors  that capture the meaning and context of the asset.   This allows us to find similar assets by searching  for neighboring data points. Vector search methods   enable unique experiences, such as taking a  photograph with your smartphone and searching for   similar images. You can also find documents that  are similar to a given document based on their   topic and sentiment. And find products that are  similar to a given product based on their features   and ratings. It just a few examples. You can  try milvus which is open source vector database. Now, let's talk about a key-value database. It  stores data as a collection of key-value pairs,   where a key serves as a unique identifier.  Both keys and values can be anything,   ranging from simple objects to complex  compound objects. The document database   is a special type of key-value store  where keys can only be strings. Also,   when querying your document store, you can  read the entire value or a part of the value,   especially if the value is another JSON object.  For example, you can have {"book": {"id": 1,   "price": $10}}, then query book.price, and the  database will return the value 10. Key-value   databases always return the whole value,  including both ID and price information.   There are a few advantages of key-value databases. The first is scalability. Most key-value databases   can scale horizontally and automatically  distribute data across servers to reduce   bottlenecks at a single server. Then there's  ease of use. Key-value databases follow the   object-oriented paradigm, allowing developers  to map real-world objects directly to software   objects. And Performance. Unlike relational  databases, key-value databases don't have   to perform resource-intensive table  joins, which makes them much faster. Now let's talk about use cases. First is session management. A session-oriented  application, such as a web application, starts   a session when a user logs in and remains active  until the user logs out or the session times out. You can use it for a shopping cart. An  e-commerce website may receive billions   of orders per second during the holiday shopping  season, so you need a high-performance database. You can also use it for a metadata  storage engine. Your key-value   store can act as an underlying storage  layer for higher levels of data access. And caching: You can use a key-value database  to store data temporarily for faster retrieval. If you need a reliable key-value database,  you can take a look at etcd. Kubernetes   uses etcd to store all its cluster data,  such as pods, deployments, and services. And finally, let's talk about time series  databases. As the name suggests, it's optimized   for time-stamped or time series data. Time series  data are simply measurements or events that are   tracked, monitored, downsampled, and aggregated  over time. This could include server metrics,   application performance monitoring, network data,  sensor data, events, clicks, trades in a market,   and many other types of analytics data.  It's used over traditional database types   because it enables fast data retrieval and is a  cost-effective storage solution for that type of   data. One example of general-purpose metrics  database is InfluxDB. We also frequently use   Prometheus to collect time series data from  infrastructure. Prometheus is also a time   series database with additional features to  query targets such as VMs and Kubernetes pods. Each database type has its specialty: Relational  for structured data and ACID compliance,   Columnar for analytics, Document  for unstructured data flexibility,   Graph for complex relationships,  Time-Series for time-stamped data,   Vector for AI and ML scenarios, and  Key-Value for simple, fast data access. Using the right database type can be a  game-changer for performance, and the wrong one   can slow down your queries and be very expensive.  If you want to learn about relational databases   and SQL joins, you can watch this video. Thank you  for watching, and I’ll see you in the next lesson.
Info
Channel: Anton Putra
Views: 30,717
Rating: undefined out of 5
Keywords: Relational Database, Columnar Database, Document Database, Graph Database, Vector Database, Key-value Database, Time-series Database, Types of Databases, database tutorial, anton putra, devops, database types, relational vs columnar
Id: VfcRxtBKI54
Channel Id: undefined
Length: 18min 23sec (1103 seconds)
Published: Thu Dec 28 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.