Performance of your application frequently depends on the database you select for
a specific task. As a developer, you may need to choose the appropriate database
based on your use case and requirements. In this video, we'll go through the
most common database types used in enterprises. I'll cover use cases for when to choose a specific database, as well as examples of open-source
and cloud-managed databases. Let's start with the relational database.
Originally, it was developed by IBM in the 1970s. You can think of a relational database
as a collection of spreadsheet files that help businesses organize, manage, and relate data. In
the relational database model, each 'spreadsheet' is a table that stores information, represented
as columns (attributes) and rows (records or tuples). Now, columns specify data types, and each
record (or row) contains values corresponding to those specific data types. For example, in the
customer table, you may have an integer type for the ID column and varchar for the name
and address. Also, in a relational database, each table has a unique identifier for each row
called the "primary key." Rows from different tables can be linked using a "foreign key," which
refers to the primary key in another table. Let’s take a look at how the relational database
model works in practice. Let's say we have a website and track every single action that
a user may do. A user can click on a button, add an item to the shopping cart, and things like
that. Now, in a very simplified version, we would have a customer table and an event table. Each
time a customer does something on the website, we would store it in the event table and use a
foreign key to link this event to the customer. Also, we can use SQL to query our database.
For example, you can use a simple SELECT query to get all the customers you have. Now, if you want to get all customers who are active, meaning they have performed at least one action, you can use a join; both queries are sketched below. I have another video explaining all the different joins as well.
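Here's a minimal sketch of both queries, run here with SQLite from Python; the customer and event tables and their columns just mirror our example and would look different in a real schema:

import sqlite3

# Illustrative schema mirroring the customer/event example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE event (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customer(id), action TEXT);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO event VALUES (1, 1, 'click_button'), (2, 1, 'add_to_cart');
""")

# All customers.
print(conn.execute("SELECT * FROM customer").fetchall())

# Only active customers, i.e., customers with at least one event.
print(conn.execute("""
    SELECT DISTINCT customer.id, customer.name
    FROM customer
    JOIN event ON event.customer_id = customer.id
""").fetchall())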
Another important feature provided by most relational databases is ACID transactions. This means all changes to data are performed as if they are a single operation. If at least one
task fails, the whole transaction is rolled back. For example, consider a transfer of funds from one
bank account to another. First, you need to check if there is enough money to transfer, then you
would debit one account and credit another. So, if at least one step fails, you want to roll
back the entire transaction.
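Here's a minimal sketch of that idea, again with SQLite from Python and made-up account data:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            balance = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()[0]
            if balance < amount:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
    except ValueError:
        pass  # nothing was committed, so both balances stay as they were

transfer(conn, 1, 2, 70)   # succeeds and is committed
transfer(conn, 1, 2, 500)  # fails the check, so the whole transaction is rolled back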
Now, you would use a relational database for structured data and ACID compliance. Structured data means that you have rows and columns that clearly define data
attributes. Some examples of relational databases are MySQL, PostgreSQL, MariaDB, Microsoft SQL
Server, and Oracle Database. There are also some cloud-managed databases available, such
as Amazon Aurora, Azure SQL, and many others. Now, traditional row-oriented databases do not scale very well for analytical queries. If you have a lot of data, like billions of records, they can be slow.
To improve query performance with big data, columnar databases have been developed. Let’s
go through a simple exercise, which will help you see the power of using a columnar database
for certain tasks. To illustrate this exercise, I’m going to use a spreadsheet populated
with some customer data. First of all, let's see how a traditional database works. Before
we start, there is a rule that we need to follow: we have to read the data from left to right,
starting from the beginning of each row, and proceed from row to row, kind of like reading
a book. So, let’s start. Let's ask the traditional database to return all paying customers. Here it
goes: as you read the data, notice that you're retrieving the Name and Tier on each pass of every
row, and eventually, you'll acquire all the Names and their associated Tiers. Now, servers do this
very quickly; however, what if my request involves several billion rows? You can start to see that
reading each row just to grab the Name and Tier columns could take a while. And this is where
a columnar database comes in handy. We’re going to redo the exercise one more time, but instead
of using traditional database rules, I’m going to give you columnar database rules. This means
you’re going to read the data from top to bottom, and you’re only going to read the columns that I
ask you to. Now, let's ask the columnar database the same question: to return all paying customers.
Let’s start. Notice that this method skips all the data that isn’t related to what we’re looking
for. Once we’ve finished getting that information, we start with the next column, skipping the
columns that we don’t need. But wait a second, how does the computer know which Tier to
assign to each Customer name it’s retrieving? The way it does this is by having the columnar database assign a number to each row of data, allowing it to quickly pair up the different columns it retrieves. You can see how this really comes in handy when you start reading the Name column. All it needs to know is the row number where the Name values ended, in this case, row 10. This means it just needs the Tier values from rows 1 to 10. Now, you've probably noticed how the columnar
database was repeating 'basic' over and over. This is actually another advantage of a
columnar database. By using the numbering system, columnar databases can apply compression algorithms, for example, storing a repeated value like 'basic' once along with the range of rows it covers, to simplify data retrieval. And you’ll find that each columnar database system has highly
sophisticated methods for achieving further performance enhancements. So, compared to
traditional databases, a columnar database does not actually read billions of rows of data.
Instead, it just reads a few long columns. This can be a very straightforward way of addressing
the large amounts of data that corporations have to deal with. They're not replacements for traditional databases, but they are certainly a powerful way of performing highly aggregated analysis.
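To make the row-versus-column contrast concrete, here's a toy sketch with made-up customer data; real columnar engines are, of course, far more sophisticated:

# Row-oriented layout: every row is scanned even though we only need two columns.
rows = [
    ("Alice", "paying", "alice@example.com"),
    ("Bob",   "basic",  "bob@example.com"),
    ("Carol", "paying", "carol@example.com"),
]
print([(name, tier) for name, tier, email in rows if tier == "paying"])

# Column-oriented layout: the same data stored column by column. Values are paired
# up by their position (the row number), and a long run of repeated values like
# 'basic' can be stored once together with the range of rows it covers.
columns = {
    "name":  ["Alice", "Bob", "Carol"],
    "tier":  ["paying", "basic", "paying"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}
print([(n, t) for n, t in zip(columns["name"], columns["tier"]) if t == "paying"])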
I think Apache Cassandra is one of the most popular open-source columnar databases, and it is widely adopted by many Fortune 500 companies. Now, both relational and columnar databases require you to define a schema before you can store anything. If your requirements change over
time, you may need to add additional columns or perhaps change the data type of one of them.
What if you have an e-commerce application with thousands of products? Different products
usually have a different number of attributes. Managing thousands of attributes in relational
databases is inefficient and can slow down your database. This is an ideal use case for the
document database. You can store a product with all its attributes in a single document. It
simplifies management and improves read performance. And if you change the attributes of one product, it won’t affect the others. So, what is a document database? Well, it’s a type
of database used to store and query data in JSON-like documents. JSON is a data format that
is both human and machine-readable. Developers can use JSON documents in their code and save them
directly into the document database. The flexible, semi-structured, and hierarchical nature of
documents and document databases allows them to evolve with the needs of applications.
Let's take a look at the structure of JSON documents. JSON represents data in three
ways. The first is key-value pairs. These pairs are recorded within curly braces. The key
is a string, and the value can be any data type, such as an integer, decimal, or boolean. For
example, a simple key-value pair is {"year": 2024}. Next is the array. An array is an ordered
collection of values defined within left ([) and right (]) brackets. Items in the array are
comma-separated. For example, {"fruit": ["apple", "grapes"]}. And finally, objects. An object is
a collection of key-value pairs. Essentially, JSON documents allow developers to embed objects
and create nested pairs. For example, {"address": {"country": "USA", "state": "Texas"}}. Now, here is an example of a JSON-like document that describes a film dataset.
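A film document along those lines could look something like the following; the fields and values here are just illustrative, written as a Python dictionary, which maps one-to-one to JSON:

film = {
    "title": "Interstellar",
    "year": 2014,
    "rating": 8.6,
    "genres": ["adventure", "drama", "sci-fi"],
    "director": {"first_name": "Christopher", "last_name": "Nolan"},
    "cast": [
        {"first_name": "Matthew", "last_name": "McConaughey"},
        {"first_name": "Anne", "last_name": "Hathaway"},
    ],
}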
You can see that the JSON document holds simple values, arrays, and objects quite flexibly. You can even have an array
with JSON objects within it. Document-oriented databases allow you to create an unlimited-level
hierarchy of embedded JSON objects. It's entirely up to you what schema you want to give to
your document store. Now, let's talk about the operations that you can perform on a document
database. You can create, read, update, and delete entire documents stored in the database. Document
databases provide a query language or an API that allows developers to run these operations.
Okay, first, you can create documents in the database. Each document has a unique identifier
that serves as a key. Then, you can use the API or query language to read document data. You can
run queries using field values or keys. You can also add indexes to the database to increase read
performance. And finally, you can update existing documents flexibly. You can either rewrite the
entire document or update individual values.
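Here's a minimal sketch of those operations, assuming a local MongoDB instance and the pymongo driver; the database, collection, and field names are made up:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Create: each inserted document gets a unique _id that serves as its key.
result = products.insert_one({"name": "t-shirt", "sizes": ["S", "M", "L"], "price": 15})

# Read: query by key or by field values; an index can speed up these lookups.
products.create_index("name")
doc = products.find_one({"name": "t-shirt"})

# Update: change individual values without rewriting the whole document.
products.update_one({"_id": result.inserted_id}, {"$set": {"price": 12}})

# Delete: remove the entire document.
products.delete_one({"_id": result.inserted_id})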
So, what are the advantages of document databases? The first thing that comes to mind is the ease of development. JSON documents map to objects, a
common data type in most programming languages. When building applications, developers can
flexibly create and update documents directly from the code. This means they spend less time
creating data models beforehand. As a result, application development becomes more rapid and
efficient. Next, there's flexible schema. A document-oriented database allows you to
create multiple documents with different fields within the same collection. This can be
handy when storing unstructured data, like emails or social media posts. However, some document
databases offer schema validation, allowing you to impose some restrictions on the structure.
Another advantage is performance at scale. Document databases offer built-in distribution
capabilities. You can scale them horizontally across multiple servers without impacting
performance, which is also cost-efficient. Also, document databases provide fault tolerance and
high availability through built-in replication. If you want to try it out, MongoDB is a great
option. It's the most popular open-source document database available, easy to start with, has SDKs
for most languages, and has a large community. Have you ever seen a detective board in the
movies, with pictures, news articles, and notes connected by thumbtacks and yarn? Immediately,
you can see the power of connecting the dots in all of those relationships. Imagine taking that
detective board and applying a mathematical engine that could query its data relationships. Well,
that is a graph database. I want to explain graph databases by starting with relational databases.
One of the main traits of a relational database is the constraining nature of its relationships,
which makes it ideal for processing transactions. However, these strict constraints often make it
too complex to answer questions about distant relationships. Imagine you had all the university
professor and student data ever collected. Then imagine you wanted to know the relationship
between a group of 10 students originating from completely different universities. Well, at first
glance, you would think that since the students didn't go to the same universities, they really
don't have a connection. But if we look at their professors, we can discover that they all shared
a common professor when they were students. So now let's zoom in and describe some of the traits
of a graph database. First, you have nodes, which are essentially records. Connected to these
nodes is a type of relationship, which can have a direction and a property associated with it. So in
our case, the direction points from the original professor. The relationship type is 'student_of.'
And the property is the year and semester when they were taught. Now, querying this database isn't like your typical SQL query. Graph database vendors often have their own query language, so this is something that the industry is still working out.
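To make the node-and-relationship idea concrete, here's a toy sketch in plain Python rather than a real graph query language; all of the names and terms are made up:

from collections import defaultdict

# Each edge is (node, relationship type, node, properties).
edges = [
    ("alice", "student_of", "prof_jones", {"term": "Fall 2012"}),
    ("bob",   "student_of", "prof_jones", {"term": "Spring 2013"}),
    ("carol", "student_of", "prof_smith", {"term": "Fall 2014"}),
]

# Question: which professors connect more than one of these students?
students_by_prof = defaultdict(list)
for student, rel_type, prof, props in edges:
    if rel_type == "student_of":
        students_by_prof[prof].append(student)

print({prof: studs for prof, studs in students_by_prof.items() if len(studs) > 1})
# {'prof_jones': ['alice', 'bob']}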
You need to be very careful with graph databases because they can infer connections that don't actually mean anything. For example,
imagine the inference you could make if all the students in our previous example ended up dropping
out of school. Does that mean the original professor had some kind of meaningful impact on
that bad outcome? Well, anything is possible, but we have to be a little more skeptical
of such conspiratorial patterns. So, graph databases are usually a mechanism for starting
questions but not necessarily answering them. So, what we see is data science communities
using graph databases to test inferences. The discovery of these relationships, and
their relevance to the organization, is often what gets bubbled up into data warehouses, thus closing the loop. One open-source graph database that I can suggest is ArangoDB. Now, information comes in many forms. Some
information is unstructured, like text documents, pictures, videos, and audio, while some is
structured, like application logs, tables, and graphs. Vector databases, in turn, store data as high-dimensional vectors. Each vector has a certain number of
dimensions, ranging from tens to thousands, depending on the complexity of the data. Now, we
can apply transformations, typically machine-learning embedding models, to the raw data. We can encode all types of data into vectors
that capture the meaning and context of the asset. This allows us to find similar assets by searching
for neighboring data points. Vector search methods enable unique experiences, such as taking a
photograph with your smartphone and searching for similar images. You can also find documents that
are similar to a given document based on their topic and sentiment. And find products that are
similar to a given product based on their features and ratings. These are just a few examples. If you want to try one out, Milvus is a popular open-source vector database.
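Here's a toy sketch of the nearest-neighbor idea using NumPy; the catalog items and their tiny four-dimensional vectors are made up, and a real system would use learned embeddings and a proper index:

import numpy as np

catalog = {
    "red running shoes":  np.array([0.9, 0.1, 0.0, 0.3]),
    "blue running shoes": np.array([0.8, 0.2, 0.1, 0.3]),
    "leather office bag": np.array([0.1, 0.9, 0.7, 0.0]),
}
query = np.array([0.85, 0.15, 0.05, 0.3])  # embedding of the user's photo or search text

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank items by similarity to the query vector, most similar first.
ranked = sorted(catalog, key=lambda name: cosine(query, catalog[name]), reverse=True)
print(ranked[0])  # the nearest neighbor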
Now, let's talk about a key-value database. It stores data as a collection of key-value pairs, where a key serves as a unique identifier.
Both keys and values can be anything, ranging from simple objects to complex
compound objects. The document database is a special type of key-value store
where keys can only be strings. Also, when querying your document store, you can
read the entire value or a part of the value, especially if the value is another JSON object.
For example, you can have {"book": {"id": 1, "price": 10}}, then query book.price, and the database will return the value 10. Key-value databases, by contrast, always return the whole value, including both the ID and the price.
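Here's a toy illustration of that contrast using plain Python dictionaries rather than a real database client:

store = {"book:1": {"id": 1, "price": 10}}

# A key-value store hands back the entire value for a key...
whole_value = store["book:1"]
print(whole_value)           # {'id': 1, 'price': 10}

# ...while a document store can also project a single field inside the value.
print(whole_value["price"])  # 10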
There are a few advantages of key-value databases. The first is scalability. Most key-value databases can scale horizontally and automatically
distribute data across servers to reduce bottlenecks at a single server. Then there's
ease of use. Key-value databases follow the object-oriented paradigm, allowing developers
to map real-world objects directly to software objects. And performance. Unlike relational
databases, key-value databases don't have to perform resource-intensive table
joins, which makes them much faster. Now let's talk about use cases. First is session management. A session-oriented
application, such as a web application, starts a session when a user logs in and remains active
until the user logs out or the session times out. You can use it for a shopping cart. An
e-commerce website may receive billions of orders during the holiday shopping
season, so you need a high-performance database. You can also use it for a metadata
storage engine. Your key-value store can act as an underlying storage
layer for higher levels of data access. And caching: You can use a key-value database
to store data temporarily for faster retrieval. If you need a reliable key-value database,
you can take a look at etcd. Kubernetes uses etcd to store all its cluster data,
such as pods, deployments, and services. And finally, let's talk about time series
databases. As the name suggests, they're optimized for time-stamped or time series data. Time series
data are simply measurements or events that are tracked, monitored, downsampled, and aggregated
over time. This could include server metrics, application performance monitoring, network data,
sensor data, events, clicks, trades in a market, and many other types of analytics data.
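To make "downsampled and aggregated" concrete, here's a toy sketch that averages made-up CPU samples into one-minute buckets; a time series database does this kind of work for you, efficiently and at much larger scale:

from collections import defaultdict
from datetime import datetime

# Made-up CPU utilization samples: (timestamp, percent).
samples = [
    (datetime(2024, 1, 1, 12, 0, 15), 41.0),
    (datetime(2024, 1, 1, 12, 0, 45), 43.0),
    (datetime(2024, 1, 1, 12, 1, 10), 90.0),
    (datetime(2024, 1, 1, 12, 1, 50), 70.0),
]

# Downsample to one-minute averages.
buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts.replace(second=0, microsecond=0)].append(value)

for minute, values in sorted(buckets.items()):
    print(minute, sum(values) / len(values))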
A time series database is used over traditional database types because it enables fast data retrieval and is a
cost-effective storage solution for that type of data. One example of a general-purpose metrics
database is InfluxDB. We also frequently use Prometheus to collect time series data from
infrastructure. Prometheus is also a time series database with additional features for scraping metrics from targets such as VMs and Kubernetes pods. Each database type has its specialty: Relational
for structured data and ACID compliance, Columnar for analytics, Document
for unstructured data flexibility, Graph for complex relationships,
Time-Series for time-stamped data, Vector for AI and ML scenarios, and
Key-Value for simple, fast data access. Using the right database type can be a
game-changer for performance, and the wrong one can slow down your queries and be very expensive.
If you want to learn about relational databases and SQL joins, you can watch this video. Thank you
for watching, and I’ll see you in the next lesson.