Shall I use relational or document databases?
This article is the result of my hands-on experience. I have had extensive experience at using both relational and document databases and have formed a strong opinion. It is not my job to persuade anyone to change their mind, but to share my insights for consideration.
Pros and Cons of Relational databases
Relational databases have been kings for at least the last 2 decades:
- The famous SQL, “Structured Query Language”, evolved to a really powerful universal tool to perform just anything.
- They are particularly suitable for data that is relational. In other words, when there are relationships between entities. For example Airplanes, Flight, Airline Companies, Passengers, Tickets, all these are related and the query pattern is likely relational (e.g. Passengers who flew with 3 different Companies or Airplanes in the same Company, tickets sold per flight or per Airport per day, etc).
- Native operations such as JOINs facilitate selecting and filtering the data that matches specific criteria.
- They are flexible. You can pretty much use relational databases for any type of data, I have seen terrible abuses like a single table with over 300 columns and nullable fields, simulating a document database, or the EAV model that allows defining arbitrary attributes.
- You can define a strict schema and enforce referential integrity.
On the other hand referential databases:
- Don’t scale well vertically. With too many records, performances decay.
- The DBMS introduces a significant overhead.
- While powerful, analytics queries may take nondeterministially long time
- You can stick to best practices, for example minimization, but can easily make mistakes. Migrations can force you to refactor the model.
- Migrations can be very complex.
- With many entities, the complexity of the relations explodes.
- A vast set of difficulties called Impedance Mismatch.
Pros and Cons of Document databases
Document databases, like AWS DocumentDB or AWS Managed MongoDB are great because:
- The are very light, almost zero overhead for the DBMS
- They are exceptionally fast
- They are particularly suited for document data, that can have different fields and values
- Except a very simple schema composed of one or both of a partition hashing key and a sorting key, columns are infinite, all optional and dynamic. They can be primitives or unstructured data. You can fit anything.
- They scale extremely well vertically.
But they come short at the following points:
- Analytical queries are impossible. To analyze a JSON at depth, you must query full objects and process on the client side.
- There is no concept of JOIN, referential integrity and you must validate on the client side.
My opinion
Relational databases are quite clunky. Their advantages depend on correct use and you pay the cost when maintaining or migrating. For relatively small databases (e.g. 27Mb) it takes longer to run an analytical query, than to download the entire database with a single query and process it on the client side! You see?
Relational databases are adept at solving problems that however they create in the first place.
Sure, to build a usable NoSQL database you must know the query patterns and figure the exact fields, redundant indexing tables, etc, in other words, Data Denormalization. But in the end everything works more efficiently.
And you can always use ElasticSearch in case you need full text search or advanced mapping features. You will lose the cost advantage but keep speed and scalability, and it’s much easier than relational databases, while one step closer to data warehousing.
In Conclusion
Stop using relational databases whenever possible and you will never regret it. Document databases are better at everything: performance, cost, scaling, easiness of use – only at the cost to have to design a great schema and queries, and to elaborate also on the client side.
The only exception is when data is extremely relational, you need analytical queries, and downloading the entire DB is out of the question. In that case PostgreSQL is well suited for complex queries on indexed columns, while allowing unstructured data too.
OLTP transactional SQL is dead. You should re-architect your software and data model to use document data. And referential integrity still needs validation anyway, unless one makes assumptions of validity, so an invalid record may break the code. In other words the weak schema of document databases is still fine.
AWS DynamoDB is my favourite NoSQL database. In theory it’s not a document db, because every record has a size limit of 400kb, but I consider this hardly a limitation. I consider it’s a good practice to avoid BLOBs or large objects in a db, we have object storages and data lake technology for them.
Data lakes are a concept pushing the concept of document data even further. They are a newer trend, the only natural consequence of Big Data, analytics, and Machine Learning. There are Cloud solutions like Google Object Storage + BigQuery, and AWS S3 + Athena, that allow to query directly on the data points. They are extremely cheap, the are queries slower than databases, but probably acceptable for the use cases. But anyway after a certain number of rows all relational databases, including even PostgreSQL, will decay and then hit system limits so all analytical queries stalemate and fail, while in a document database or in a data lake this doesn’t happen.