Primary Key
In a relational database, records are uniquely identified by the Primary Key. No two records can have the same primary key.
This concept is the same in Dynamodb as well. The Primary Key of a table is either the Partition key or a combination of the Partition key and the Sort key.
In the table given below, the student ID is the partition key. It is unique for each of the items.
student_id (Partition Key) | name | date_of_birth | |
cb15ffa1-5e86-4c51-b341 | John Doe | john.doe@example.com | 2002-01-15 |
cbfe7f85-6b85-4d75-8030 | Jane Smith | jane.smith@example.com | 2000-05-22 |
2accafed-6a3e-4f10-b16e | Michael Johnson | michael.j@example.com | 2001-08-30 |
93b98d8b-6de3-452e-s | Emily Davis | emily.davis@example.com | 1999-11-10 |
93b98d8b-6de3-452e-8 | William Brown | william.brown@example.com | 2003-03-25 |
In the below example, we store the content of articles in this table. Since the article content is long, it is broken into multiple parts and stored in multiple table items to prevent exceeding the 400kb limit Dynamodb imposes on item size.
ArticleId (Partition Key) | Part (Sort Key) | Content |
101 | 1 | Introduction to DynamoDB with examples and illustrations... |
101 | 2 | Setting up your DynamoDB environment with detailed steps... |
101 | 3 | Basic operations in DynamoDB such as CRUD... |
102 | 1 | Understanding different types of databases with pros and cons... |
102 | 2 | Benefits of NoSQL databases over SQL databases... |
103 | 1 | Advanced DynamoDB features like global tables... |
103 | 2 | Performance optimization tips and tricks in DynamoDB... |
A single article includes multiple items, so, the partition key cannot be unique. Therefore, a combination of the partition key and the sort key is used to uniquely identify each item. This is called a composite primary key.
Partitions
Dynamodb is a distributed system. This means that unlike in traditional Relational Database Management System (RDBMS), all the data is not stored in a single server. Instead, they are spread across multiple servers.
In a Dynamodb table, items are grouped into partitions. Think of partitions as boxes within the table into which each item is placed. The partitions in a table are stored physically in different nodes (servers) within the region in which the table was created.
To simplify, you put each item of the table into different partitions (think of them as boxes) and then keep those boxes in different locations.
How Dynamodb stores items
This is where the partition key comes into the picture. When you add a new item to the table, the partition key goes through a hash function. A hash function is an algorithm that takes in a value as an input and then outputs a hash code. In this case, the hash code is a fixed-size integer. When the same value is input into the hash function multiple times, it will always produce the same hash code.
Each partition in a table is associated with a hash code range.
To simplify, let’s say that a table has the below partitions associated with the given ranges.
Partition 1 — 100 - 200
Partition 2 — 201 - 301
Partition 3 — 302 - 402
Partition 4 — 403 - 503
We are adding this item to the table
{
"student_id": "cb15ffa1-5e86-4c51-b341-48d1aed0f84f",
"name": "John Doe",
"email": "john.doe@example.com",
"date_of_birth": "2002-01-15"
}
When the student_id
is input to the hash function, the output is 352.
Since this is within 302 - 402
, the item will go into Partition 3.
How Dynamodb retrieves items
Unlike relational databases, Dynamodb does not offer too much flexibility on how data is queried.
You can only query data using a Partition Key. For example, if you want to query John Doe
from the Students table, you can only do so if you know his student_id
. To query by any other attribute, you need to create a Global Secondary Index (discussed later). Or perform a scan operation which is computationally very expensive.
This limitation is because items in the table are stored in different partitions that are located at different nodes. Dynamodb needs to know the partition where the item is stored so that it only has to search within that partition. In contrast, performing a scan operation is going through all the items in the table to find the one you are looking for.
Suppose we are querying for John Doe
using his student_id
which is the partition key. First, Dynamodb will input the student_id
into the hash function. The hash function will output the same hash code that was generated when the item was added to the table. By looking at the hash code, Dynamodb can know the partition in which the item is located in.
Inside the partition, items are stored using a data structure called a B-Tree where the items are sorted by the partition key if it is unique. If we are using a composite primary key and the partition key is not unique, items will be sorted by the sort key. This allows Dynamodb to search the item within the partition very efficiently with a time complexity of O(logn).
Pros and cons of this behaviour
As mentioned earlier, the limitation of Dynamodb is that queries are not as flexible as in relational databases. When you are designing a Dynamodb table, you have to consider all querying patterns that will be used by your application and decide on the Primary Key pattern and indexes to enable them.
But on the upside, for these query patterns that are supported by your table keys, data can be fetched extremely efficiently compared to an RDBMS. This is because, unlike in an RDBMS, when the Dynamodb table grows in size, the underlying infrastructure can be scaled horizontally by spreading the data across more nodes. The query knows the precise location of each item using the partition key. Therefore, there will be no change in performance or latency of the table as it grows in size.
How partition key works in a Local Secondary Index (LSI)
Consider this table.
Department (Partition Key) | empId (Sort Key) | Name | Age | Location | DOJ |
HR | 1001 | John Doe | 30 | New York | 2020-05-15 21:35:51 |
IT | 1002 | Jane Smith | 28 | San Francisco | 2019-08-10 01:20:55 |
Finance | 1003 | Mike Brown | 35 | Chicago | 2018-03-22 08:49:11 |
Marketing | 1004 | Linda White | 32 | Seattle | 2021-06-30 21:22:48 |
IT | 1005 | James Green | 29 | Austin | 2017-11-12 07:14:55 |
HR | 1006 | Emily Davis | 27 | New York | 2022-02-05 21:07:59 |
There is a requirement to query the employees in a specific department who joined after a specific date.
This can be achieved by creating an LSI with DOJ as the sort key.
Department (Partition Key) | DOJ (Sort Key) | empId | Name (Projected Attribute) |
IT | 2017-11-12 07:14:55 | 1005 | James Green |
Finance | 2018-03-22 08:49:11 | 1003 | Mike Brown |
IT | 2019-08-10 01:20:55 | 1002 | Jane Smith |
HR | 2020-05-15 21:35:51 | 1001 | John Doe |
Marketing | 2021-06-30 21:22:48 | 1004 | Linda White |
HR | 2022-02-05 21:07:59 | 1006 | Emily Davis |
Note that the LSI has the same partition key as the associated item in the base table. Therefore, LSI items are stored in the same partition as their base table counterpart. LSI is the same items as the base table sorted differently. In this case, they are sorted by DOJ. Now you can use this LSI to query for employees who joined within a specific date range. Note that similar to the base table, the primary key of an LSI should be unique. There cannot be two employees who are from the same department and joined the exact same date and time!
How partition key works in a Global Secondary Index (GSI)
Consider the same table as above.
Department (Partition Key) | empId (Sort Key) | Name | Age | Location | DOJ |
HR | 1001 | John Doe | 30 | New York | 2020-05-15 21:35:51 |
IT | 1002 | Jane Smith | 28 | San Francisco | 2019-08-10 01:20:55 |
Finance | 1003 | Mike Brown | 35 | Chicago | 2018-03-22 08:49:11 |
Marketing | 1004 | Linda White | 32 | Seattle | 2021-06-30 21:22:48 |
IT | 1005 | James Green | 29 | Austin | 2017-11-12 07:14:55 |
HR | 1006 | Emily Davis | 27 | New York | 2022-02-05 21:07:59 |
We have another requirement to query employees by their ID without knowing their department. This query cannot be performed using an LSI since we don’t know the partition key.
So, we create a GSI where the partition key is the empId
.
empId (Partition Key) | Department | Name | Age | Location | DOJ |
1001 | HR | John Doe | 30 | New York | 2020-05-15 21:35:51 |
1002 | IT | Jane Smith | 28 | San Francisco | 2019-08-10 01:20:55 |
1003 | Finance | Mike Brown | 35 | Chicago | 2018-03-22 08:49:11 |
1004 | Marketing | Linda White | 32 | Seattle | 2021-06-30 21:22:48 |
1005 | IT | James Green | 29 | Austin | 2017-11-12 07:14:55 |
1006 | HR | Emily Davis | 27 | New York | 2022-02-05 21:07:59 |
Since the partition key of the GSI is different from the base table, the items in the GSI will be stored in a partition different from its corresponding item in the base table.
Think of the GSI as a separate copy of the table (with limited attributes that you select from the base table). Same with the LSI, the primary key should be unique.
You can use the GSI to query employees by their ID.
Recap
Primary Key uniquely identifies items in a Dynamodb table. It can be just the partition key if that is unique or a combination of the partition key and sort key if the partition key is not unique.
Based on the partition key of a record, Dynamodb puts them into different partitions. Partitions in a table are stored in different nodes (servers) within the region where the table was created.
In queries, Dynamodb uses the partition key to know the partition in which the item is stored and hence the exact server where the item resides.
Within a partition items are stored in a sorted order of partition key if it is unique or else the sort key. This allows Dynamodb to efficiently search for an item within the partition.
LSI is like a copy of the table sorted using a different attribute but items still have the same partition key and are stored in the same partition as the corresponding base table item.
GSI is also like a copy of the table but the items have a different partition key and they are stored in partitions different to the corresponding items in the base table.