Matthew M Dalby

Thoughts on ORM Design

There's an age-old saying that when something seems too good to be true, it probably is. ORM technologies are a great abstraction over the low-level details of working with data. In this article, I am going to address some of the pitfalls and best practices of working with this type of technology. The focus here is on ORM-based relational data stores, with little emphasis on non-relational stores.

ORM technologies are great, a refreshing alternative to working with low-level SQL. At the earliest point in my career, SQL was basically the only choice you had for working with data: powerful for many tasks, but labor intensive to work with, and a bad fit for applying business logic. Yes, I said it. Everyone who is a fanboy of stored procedures probably just had their blood pressure shoot up a few points after reading that last statement. Don't get me wrong, SQL has a distinctive purpose and a space it fits into, but when modeling more complex projects, we need to operate at a higher level.

The introduction of ORM technologies provided a new way of working with data, almost too good to be true. For the most part, everything I read presented golden use cases, mostly CRUD operations on simple objects, or object graphs with very simple relationships. Let's take a look at a few examples:

How to model objects

The fundamentals of object modeling in an ORM are pretty straightforward. Within the context of an RDBMS, one object definition per table is standard stuff. That is the general blueprint. There are cases, however, where additional objects need to be created for optimization purposes. There are also many discussion points on how data is fetched, especially lazy versus eager loading, and especially when it comes to relationships. Let's take a look at an example...

[Diagram: an example schema of seven related tables]

So in the above illustration, it is a no-brainer that the default requirement is to create seven entities, each modeling a relational table. That is the minimal amount of work. A little more effort goes into establishing the relationships between them. This is the minimum required to get an ORM working; there are, however, additional modeling opportunities from an optimization perspective.
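As a rough sketch of that blueprint, assuming JPA (the Order and OrderLine names here are hypothetical, not the tables from the diagram), one entity maps to one table, and the relationship annotations are where the extra modeling effort goes:

import jakarta.persistence.*;
import java.util.List;

// One entity per table: a hypothetical Order with a one-to-many
// relationship to OrderLine.
@Entity
@Table(name = "orders")
public class Order {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String customerEmail;

    // The relationship mapping is the part that takes a little more effort.
    @OneToMany(mappedBy = "order", cascade = CascadeType.ALL)
    private List<OrderLine> lines;

    public List<OrderLine> getLines() { return lines; }
    public void setCustomerEmail(String customerEmail) { this.customerEmail = customerEmail; }
}

@Entity
@Table(name = "order_lines")
class OrderLine {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String sku;
    private int quantity;

    // Owning side of the relationship, mapped to the foreign key column.
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "order_id")
    private Order order;
}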

From a data consumption perspective, there are cases where it is necessary to create additional entities to wrap optimized fetch operations. This comes at a cost, however: additional effort and complexity, so it should be treated as a per-use-case optimization effort. In short, less code is arguably better code, but there are exceptions where additional tooling is appropriate.
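One common form this takes, sketched here under the assumption of Spring Data JPA (the repository, package, and DTO names are made up for illustration), is a read-only summary object populated by a constructor-expression query, so a list view does not have to hydrate the full entity graph:

package com.example;

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

// A slim, read-only view of an order: only the columns the screen needs.
record OrderSummary(Long id, String customerEmail) {}

interface OrderRepository extends JpaRepository<Order, Long> {

    // JPQL constructor expression: selects two columns instead of
    // loading Order entities and their line collections.
    @Query("select new com.example.OrderSummary(o.id, o.customerEmail) from Order o")
    List<OrderSummary> findOrderSummaries();
}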

What really happens under the hood

The beauty of an ORM-based approach is that you are abstracted from what is happening at a low level, i.e. what SQL is actually executed. As a convenience, you don't really need to deal with it; however, this can easily lead to performance issues. This is especially the case when you are working with a small set of data that is not representative of what the actual state will be at a future point.

Let's talk eager versus lazy loading. Eager fetching might be a no-brainer for some use cases, but it is the wrong choice for plenty of others. For example, simple REST-based read calls might get by with very basic eager fetching. In other cases, where more complex data fetching is involved, a different fetching strategy, and further tuning, may be required.
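A minimal sketch of the two options, assuming JPA (the Author and Book entities are hypothetical):

import jakarta.persistence.*;
import java.util.List;

@Entity
class Author {

    @Id
    @GeneratedValue
    private Long id;

    // EAGER: every load of an Author also pulls in all of its books,
    // whether or not the caller ever touches them.
    @OneToMany(mappedBy = "author", fetch = FetchType.EAGER)
    private List<Book> books;
}

@Entity
class Book {

    @Id
    @GeneratedValue
    private Long id;

    // LAZY: the author row is only fetched when it is actually accessed.
    // (For @ManyToOne the JPA default is EAGER, so this has to be explicit.)
    @ManyToOne(fetch = FetchType.LAZY)
    private Author author;
}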

Easy data access

Let's look at the code it takes to retrieve a given entity, and the SQL generated to perform the operation; a sketch follows below. The first call works like magic because we are performing a primary key lookup, which is typically indexed by default. The second version results in a WHERE clause on an arbitrary column, which may or may not be backed by an index.
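Here is that sketch, assuming Spring Data JPA (the User entity and repository are hypothetical); the SQL in the comments is roughly what the ORM would emit:

import jakarta.persistence.*;
import java.util.List;
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
@Table(name = "users")
class User {
    @Id
    @GeneratedValue
    Long id;
    String lastName;
}

interface UserRepository extends JpaRepository<User, Long> {
    List<User> findByLastName(String lastName);
}

class UserLookups {

    void examples(UserRepository users) {
        // Version 1: primary key lookup.
        // Roughly: select ... from users where id = ?   (hits the PK index)
        Optional<User> byId = users.findById(42L);

        // Version 2: lookup by an arbitrary attribute.
        // Roughly: select ... from users where last_name = ?
        // Fast only if last_name is indexed; otherwise a table scan.
        List<User> byName = users.findByLastName("Dalby");
    }
}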

Filter operations become a bit more complicated under the hood, however, and the same pattern shows up across stacks, whether you are working in Ruby, Java/Spring/JPA, or NodeJS/Sequelize.
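Staying with Java/Spring/JPA for consistency, here is a hedged sketch of a filter that reaches across the hypothetical Order/OrderLine relationship modeled earlier (the repository name is again made up); the interesting part is the join and the extra predicates the ORM has to generate:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

interface FilteredOrderRepository extends JpaRepository<Order, Long> {

    // Derived query filtering across the Order -> OrderLine relationship.
    // Roughly: select distinct o.* from orders o
    //          join order_lines l on l.order_id = o.id
    //          where l.sku = ? and o.customer_email = ?
    List<Order> findDistinctByLinesSkuAndCustomerEmail(String sku, String customerEmail);
}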

Understanding the importance of data while developing

When developing code, you are working with data, which will generally fall into one of three buckets.

  1. Completely made-up data, because you are working on a new feature or product
  2. A sample dump from a (hopefully scrubbed) production data source
  3. A full-fledged data set, regularly updated from a production environment

The first case is especially dangerous, because your work rests on a lot of assumptions about the amount of data. You are probably making up data to 'just get it to work', and there is no shame in that; just be aware that it will not expose the performance issues of the future.

In the second case, hopefully the dump is a decent representation of at least the current state of the data, enough to expose any potential performance issues. I know that establishing data sets in the cloud has a cost factor associated with it. As developers, team leaders, and technical managers, it is important to establish a non-production data set not only to load test against, but to spin up for a few smoke tests during early development phases.
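As a rough sketch of what that can look like, assuming plain JPA and the hypothetical Order entity from earlier (the persistence unit name is also made up), a small seeder can fill a throwaway database with enough synthetic rows to make slow queries visible:

import jakarta.persistence.EntityManager;
import jakarta.persistence.EntityManagerFactory;
import jakarta.persistence.Persistence;

// Seeds a throwaway database with synthetic rows for smoke and load tests.
class SmokeTestSeeder {

    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("smoke-test");
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();

        for (int i = 0; i < 100_000; i++) {
            Order order = new Order();
            order.setCustomerEmail("user" + i + "@example.com");
            em.persist(order);
            if (i % 1_000 == 0) {
                em.flush();   // keep the persistence context from ballooning
                em.clear();
            }
        }

        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}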

This is an often overlooked and downplayed aspect of development. I have been a firsthand observer (and victim) of a greenfield project where there was no staging data until a few weeks before go-live, at which point a small team including myself spent an entire weekend working against actual data and optimizing. On another occasion, when granted a production copy of data (again, last minute before go-live), I lost about two solid weeks to refactoring to compensate for performance issues. I view these experiences as the scars you pick up when learning to ride a bike: you can't avoid falling off while learning, but I would hope anyone reading examples such as these can avoid repeating the mistakes. In short: build, load test, and adjust if needed, as a bit of an iterative process. There are numerous options to create cloud database instances, start and stop them, and import or delete data on an as-needed basis.

The infamous N+1 problem

This is a core pitfall of working with ORMs. Many of the most basic opportunities for improvement I have observed during my career have been tied to excessive querying. In the world where we typically learn basic ORM techniques, it is easy, and necessary, to create simple examples. These illustrate a point quite well, but they do not give enough background on real-world scenarios. Basically, it is easy to get into situations where a basic master-detail UI use case turns into a serious performance issue. The worst case I have personally observed was a simple fetch to retrieve a series of entities for a list view, limited to 200 records, which resulted in about 3,500 queries on average per request. No bueno.
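As a sketch of how this happens, assuming JPA/Hibernate and the hypothetical Order/OrderLine mapping from earlier: touching a lazy collection inside a loop turns one query into one query per parent row, and a fetch join is one common way out:

import jakarta.persistence.EntityManager;
import java.util.List;

class OrderReports {

    // N+1: one query for the orders, then one additional query per order
    // the first time its lazy 'lines' collection is touched.
    long countLinesNaive(EntityManager em) {
        List<Order> orders = em
                .createQuery("select o from Order o", Order.class)
                .setMaxResults(200)
                .getResultList();                 // 1 query

        long total = 0;
        for (Order o : orders) {
            total += o.getLines().size();         // +1 query per order
        }
        return total;
    }

    // One fix: fetch the collection in the same query with a join.
    long countLinesWithFetchJoin(EntityManager em) {
        List<Order> orders = em
                .createQuery("select distinct o from Order o join fetch o.lines", Order.class)
                .getResultList();                 // 1 query, with a join

        long total = 0;
        for (Order o : orders) {
            total += o.getLines().size();         // already in memory
        }
        return total;
    }
}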

When addressing any given master-detail UI pattern, I would recommend taking an immediate look at the underlying SQL interaction for that use case.

For more server-side, business-logic-specific scenarios, I would also recommend taking a look at the common use cases and optimizing accordingly. I sincerely apologize for my inability to provide a single golden answer to every given scenario. I can only offer advice on examining underlying query behavior on a use-case-by-use-case basis, starting with the high-priority use cases.

Introducing: Optimization approaches....

Let's talk caching

It is a well-known fact that in-memory access to data is faster than basic IO. Caching can be performed at many levels: an RDBMS has many tuning options of its own, there are plenty of alternatives at the middleware level, with Redis as a popular option, and there are options to cache at the client level when working with HTTP consumers, which I think are all too often overlooked.
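As one narrow illustration, and only a sketch assuming Spring's caching abstraction is configured (the service, cache, and entity names are hypothetical), a read-heavy lookup can be memoized, with the immediate consequence that write paths now have to think about eviction:

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Service;

@jakarta.persistence.Entity
class Product {
    @jakarta.persistence.Id
    Long id;
    String name;
    public Long getId() { return id; }
}

interface ProductRepository extends JpaRepository<Product, Long> {}

@Service
class ProductCatalogService {

    private final ProductRepository products;

    ProductCatalogService(ProductRepository products) {
        this.products = products;
    }

    // First call hits the database; later calls with the same id are served
    // from the "products" cache (Redis, Caffeine, etc., depending on setup).
    @Cacheable("products")
    public Product findProduct(long id) {
        return products.findById(id).orElseThrow();
    }

    // The cost side: every write now has to evict, or readers go stale.
    @CacheEvict(cacheNames = "products", key = "#product.id")
    public Product save(Product product) {
        return products.save(product);
    }
}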

The problem with caching is that it comes at a cost: it introduces complexity, additional moving pieces, strategies for syncing data, and so on. You need to be strategic about how you approach it. I really wish I had a one-size-fits-all answer for caching, but it is so context specific that even attempting to provide one would be a fallacy. I can, however, share a few guidelines.

When and where to optimize for performance

This is a hard question to answer as well. Early optimization is inherently risky, but I do believe a common-sense approach towards proactive tuning deserves merit. I will touch on a few of the ways you can fall into traps in the topics below.

Let's identify patterns

There is so much repetitive, boilerplate work in implementing an ORM layer. I would strongly lean towards code generation practices: rolling the foundation by hand is tedious, repetitive, and error prone, and there is a lot of opportunity for automation. I would highly encourage anyone to look into DSL and code generation approaches.
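As a toy sketch of the idea (entirely hypothetical, and far short of a real DSL or template engine), a small generator can stamp out the skeleton of an entity class from a table name and a list of fields, so the repetitive parts are produced rather than typed:

import java.util.LinkedHashMap;
import java.util.Map;

// Generates the skeleton of a JPA entity from a simple table description.
class EntityGenerator {

    static String generate(String className, String tableName, Map<String, String> fields) {
        StringBuilder src = new StringBuilder();
        src.append("@Entity\n@Table(name = \"").append(tableName).append("\")\n");
        src.append("public class ").append(className).append(" {\n\n");
        src.append("    @Id\n    @GeneratedValue(strategy = GenerationType.IDENTITY)\n");
        src.append("    private Long id;\n\n");
        fields.forEach((name, type) ->
                src.append("    private ").append(type).append(" ").append(name).append(";\n"));
        src.append("}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("customerEmail", "String");
        fields.put("createdAt", "java.time.Instant");
        System.out.println(generate("Order", "orders", fields));
    }
}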

Conclusion

In summary, ORM technologies are a productivity multiplier when used correctly. I sincerely hope you walk away from this article with a conversation started.

  • Diagram your object model up front (as much as you can, based on what you know).
  • Take a deep dive into what the actual SQL is doing, especially in high-use patterns.
  • Stay sharp and focused on caching opportunities and the cost and benefit of each tuning decision. Don't tune prematurely.
  • Stay sensitive to the data you are testing with; if you are working on a new/greenfield project, you may not have a realistic data set to work with.
  • Ensure all filter operations are backed by indexes on the back end.