Part 3 of Technology En Masse
For this third edition of Technology En Masse we take a quick step into the realm of the developer. You might find yourself drawing sketches for database designs on napkins or coasters (e.g. with the help of UML). Even for hobby projects it can be very valuable to put some effort in designing and thinking about the structure of your data before you start the project. One key component of designing is picking the right kind of identification for your database records.
For this article we assume you have a basic understanding of how relational databases work and have designed/rigged up one before. We will discuss why you would need an identifier and three methods to generate one.
Why use an identifier?
When setting up a database you want to have a quick way to refer to a specific record within it. This is not only beneficial when testing and debugging but also key part of the operations performed on a database. The database needs what is called a Primary Key to function correctly. You can imagine using this identifier when your database has multiple tables and connecting content between two or more tables (e.g. a user has written a certain post; the post has to be linked to the user). This means an identifier needs to be unique, since you want to receive a specific record (only one) when making these connections.
Incremental IDs
The simplest form of an identifier (ID) for your database records is by making use of a digit. This number should of course be unique for every record. This can be done by raising the ID incrementally for every new record when compared to the last inserted record. This of course also means the ID becomes a property with a value that has a meaning in the real world. If you start with the first record it will most likely have the ID 1. And the second will have ID 2, and so on. The meaning of these IDs is the order they have been created (1 is the first record and 2 the second) and there is now also a direct relationship between the first and second record through their IDs. As an example of this, please take a look at the following table:
ID | Firstname | Lastname |
1 | John | Doe |
2 | Mary | Doe |
3 | Hank | Edison |
Going for Incremental IDs can be an informed decision, since you can make very good use of this attributed value. For instance when outputting the ID in your URL. If the ID in the URL contains the following value post.php?post=668 you can say with a lot of certainty that this post probably is the 668th post for this website. Furthermore you can change the parameter in the URL to post.php?post=669 and you will most likely get to view the next post.
The predictability of Incremental IDs is both it’s largest advantage as well as its biggest flaw. It makes it easier for wrongdoers to figure out the inner workings of your database design. Another flaw of Incremental ID’s is that they are only unique within their own domain: e.g. within one table of your database. Whenever your application operates on multiple levels or communicates to other systems (e.g. a REST API) your Incremental ID might not be so unique anymore.
UUIDs (and GUIDs)
By using a UUID (which is short for Universally Unique Identifier) you eliminate the predictability and meaning that is connected to incremental IDs. A UUID is nothing more than a random generated value you can attach to your records within the database. Whenever someone is referring to GUID (or globally unique identifier) they mean the exact same thing and they probably have a Microsoft background.
With UUIDs the IDs do not follow up on each other and no meaning can be attributed to the value that is created. A UUID might look like this: 7c6a5360-c343-11e9-88e0-2a2ae2dbcce4. An example of the usage of such UUIDs can be seen in this table:
ID | Firstname | Lastname |
7c6a5360-c343-11e9-88e0-2a2ae2dbcce4 | John | Doe |
c4694a54-c343-11e9-9cb5-2a2ae2dbcce4 | Mary | Doe |
cabd0148-c343-11e9-9cb5-2a2ae2dbcce4 | Hank | Edison |
Statistically these UUIDs are globally unique which means there are no clashes when working with multiple/distributed systems on data migrations or other synchronization practices. Furthermore, there is a lot less predictability, so for wrongdoers it is a lot more complicated to guess the design of your database, which increases security by design. Lastly, UUIDs are code-generated IDs. You don’t need to wait for the database engine to deliver the ID back to you (which is the case when using Incremental IDs) since you have the value already available in code. This offers a performance boost over using Incremental IDs.
Using UUIDs, however, does cost a little more computing power. Incremental IDs are supported by most database engines and ordering on a database with Incremental IDs is easier than on UUIDs. Besides, UUIDs are far larger values than Incremental IDs, so you will need extra storage space for them.
Consider using UUIDs only when you are designing a system that you expect to grow large quickly and need distribution over multiple servers. Even then, think carefully on how to implement the UUID (never store them as varchar(36)
, go for bytes). Furthermore when you need to freely expose your IDs (e.g. in a URL) using UUIDs offers a fairly secure solution. You should be aware that this is not watertight and only makes attacking your system more expensive. Large botnets that check a lot of random UUIDs every second still have a chance of finding a valid UUID in your system.
Unique properties
Some database purists might insist on not using an identifier that is made up of a single property. Database records should be viewed in the context that they exist in and from this a set of poperties together can form the ID. Considering this approach the data in the following table spans multiple columns. The identifier in this example consists of firstname, lastname and birthday:
Firstname | Lastname | Birthday |
John | Doe | 1995-01-12 |
Mary | Doe | 1987-06-02 |
Hank | Edison | 1990-10-18 |
If we refer back to the inner workings of a database as described in the first paragraph, this might cause a lot of redundant data in your database. If our user, from the above example, ‘John Doe‘ has written a post, we need to save his specific information (firstname, lastname and birthday) also in our ‘post’-table, together with the record of the post. Besides, it does not offer any solution to freely expose your ID since you don’t have any single identifier. Finding a combination of properties that are always unique might prove quite the challenge (however slim the chance, multiple John Doe’s might be born on January 12th, 1995).
Initially this approach seems to offer a clean cut and pure way to create a Primary Key and making use of existing data as identifier. But in practice it doesn’t offer any advantages let alone when talking about databases containing hundreds of thousands of records.
There are a lot of things to be considered when designing a database, but one of the main questions is what type of identifier you want to use. In this part of Technology En Masse we have described three common answers to that question. Most developers will be perfectly happy using Incremental IDs since (most) database engines offer this option as a default and you don’t need to think about it too much. However UUIDs might offer you a good alternative when it comes to scalability and performance. Unless of course you are a purist and are confident of the potency of using unique properties as identifier.
References
- https://medium.com/@Mareks_082/auto-increment-keys-vs-uuid-a74d81f7476a
- https://segment.com/blog/a-brief-history-of-the-uuid/
- https://docs.microsoft.com/en-us/previous-versions/aa373931(v=vs.80)
- https://www.youtube.com/watch?v=s5Im6LWfLrY
- https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
- https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/