I have had my fair share of experience designing, implementing, and supporting large sets of logic. And by large, I mean roughly anything that takes 10 pages or more to flowchart. In my personal experience I have worked on processes that ranged anywhere from 40 to upwards of 90 pages.
This level of complexity presents a challenge because you are essentially building a black box. By default, developers reach for logging, but the traditional approach does not fit well into scenarios where you need in-depth detail about the decisions that were made for a given transaction.
In this article I will talk a bit about my personal experience and the solution I learned to apply to these types of cases. For the purposes of illustration, I will throw in a demo application.
After a significant amount of testing, I deployed the routine into production. In my experience, testing complex sets of logic such as this is time consuming and error prone, so there was a heavy reliance on unit-level testing to ensure a quality product. We launched the module, and transactions began pouring in.
Initially everything went well, but after a short period of time, support tickets came in. Given the large number of transactions, compounded by the fact that they were concentrated into a short seasonal window, this became a full-time job for a dedicated analyst and myself during the peak usage period. The routine was extremely dependent upon the correct entry of data in order to produce correct results, and the data entry process itself was prone to mistakes. From an end user's perspective, the algorithm was to blame by default, even though almost every ticket ended with my analyst and me researching the transaction and confirming that incorrect data points had been entered.
We were essentially dealing with a black box. Each inquiry required a considerable amount of time, often involving two resources. I had completed the task I was assigned: the routine worked, and traditional logging was implemented, but it was much too verbose to accurately audit a transaction.
There are moments in life where you have to stop and laugh. By the end of the discovery phase, I ended up with a tightly packed flowchart of more than 40 pages. Lesson learned: once you get past 10 pages or so, you tend to tape the edges together for organizational purposes. That comes out to about 36 feet total. Intimidating at first, just a bit.
This was my first rodeo, but the lessons learned here transferred well to more complex tasks later in this role.
By any measurable means, the project was a success: implemented according to spec, heavily instrumented for unit-level tests, and performant. What I identify as the failure was the lack of auditing, a requirement that was never identified. I cannot blame the analysts for not spotting it, and I cannot really beat myself up for not instrumenting better visibility into the application, as this was my first time tackling something of this scale. Sometimes you learn lessons the hard way, like falling off a bike.
Personally, I hate repetitive tasks. I get bored easily; I enjoy the challenge initially, and then my interest tapers off. The concept of job stability as the resource that 'is the expert' for this system scares me. What a boring existence. I am a good person to hand crappy processes to, because I get impatient quickly and look for opportunities to improve and streamline them.
My first lesson: look beyond the immediate ask, and drive a larger conversation if needed. All too often, as engineers we are asked to complete a task and our understanding stays within a defined scenario, but you frequently need to reach beyond the perceived definition of the problem at hand and ensure the scope of the effort covers all aspects. In this case, deep auditing and reporting was a 'sleeper' requirement. Knowing what I know today, I would have pushed for more in-depth auditing up front.
By any measurable means, the project was not a failure, but in 20/20 hindsight there was tremendous unforeseen opportunity for improvement in terms of auditing. Lessons, more valuable than gold, learned.
In response, I took a few deep breaths, gathered my thoughts, and came up with a plan to instrument the existing process with enhanced monitoring. This article covers the approach I took. The good news: there is a good chance you can retroactively retrofit your application to adopt this model.
I won't go into detail about the original problem domain and exact solution, so as to protect the IP of my employer at the time; instead, let's create a new use case and industry.
So we can imagine a scenario where a bank has a module used to approve a large loan for a customer. One could imagine the steps involved in performing a risk assessment, determining eligibility, and setting the terms of the loan.
When we take the traditional approach towards logging, we log entries in a flat format.
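To make the contrast concrete, here is roughly what that flat output looks like for the loan scenario. The messages and values below are purely illustrative, not taken from the demo:

// Purely illustrative flat log entries; each line stands alone.
console.log(`[INFO] retrieving credit scores`);
console.log(`[DEBUG] experian score: 712`);
console.log(`[INFO] risk assessment complete`);
// With many concurrent applications these lines interleave in a single stream,
// and reconstructing the decision path for one transaction becomes a manual
// search exercise.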
The new approach introduces the concept of a listener component which receives events from another process. This is commonly referred to as the listener pattern.
In a real-world scenario, the transaction might extend across different, distributed server instances, so we will make use of a messaging bus to allow the separate modules to communicate.
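I won't build out the bus itself in this article, but conceptually the contract is small. The EventBus and AuditMessage names below are hypothetical and not part of the demo; any real message broker client could sit behind such an interface:

// Hypothetical shape of the contract between the business modules and the
// listener component that assembles the audit trail.
interface AuditMessage {
  transactionId: string;
  message: string;
  timestamp: Date;
}

interface EventBus {
  publish(topic: string, msg: AuditMessage): Promise<void>;
  subscribe(topic: string, handler: (msg: AuditMessage) => void): void;
}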
Let's take a look at how this works.
The first piece is an 'agent': code that allows customized messages to be recorded, very similar to traditional logging.
The ability to log and correlate events depends on linking them together via a unique transaction id. The approach used for this effort generates a unique transaction id for each request and then binds it to the request context via Node's AsyncLocalStorage API.
// *******************************************************************
// src/storage.ts
// *******************************************************************
import { AsyncLocalStorage } from "async_hooks";

// Holds per-request context (currently just the transaction id) across async calls.
const storage = new AsyncLocalStorage<{ [key: string]: any }>();

export default storage;

// *******************************************************************
// src/loan-app-processor.ts
// *******************************************************************
import storage from "@/storage";
import AuditAgent from "@/monitoring/audit-agent";

export const processLoanApplication = (app: LoanApplication): LoanApplication => {
  console.log(`attempting to process new application.`);

  // Generate a transaction id and bind it to the request context for the
  // duration of this call chain.
  const transactionId: string = AuditAgent.initTransaction();
  storage.run({ transactionId: transactionId }, () => {
    processLoanApplicationInternal(app);
  });

  return app;
};

// *******************************************************************
// Accessing the transaction id (src/monitoring/audit-agent.ts)
// *******************************************************************
import storage from "@/storage";

...

static getCurrentTransactionId(): string {
  const store = storage.getStore();
  return store?.transactionId;
}
Above, we can observe that the logic related to AsyncLocalStorage is centralized in the storage.ts file. The loan-app-processor file is the entry point for loan processing requests; it is essentially the main, or parent, method responsible for calling a series of other functions in order to service the request. In the snippet, we see how this entry point wraps the internal processing method, performing the instrumentation required to support AsyncLocalStorage.
The getCurrentTransactionId() method provides a convenient way to access the transaction id across all components in the request stack.
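For example, a function deep in the call chain can pick up the id without it ever being passed as a parameter. The function below is illustrative, not part of the demo:

import AuditAgent from "@/monitoring/audit-agent";

// Illustrative downstream function: it never receives the transaction id as a
// parameter, yet can still tag its output with it.
const retrieveCreditScores = () => {
  const txId = AuditAgent.getCurrentTransactionId();
  console.log(`[${txId}] retrieving credit scores`);
  // ...actual credit bureau lookups would go here
};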
The first step is to create the observable aspect of the solution. The concept is very similar to a traditional logging approach, except that we log at a debug/info level, recording all relevant events.
export default abstract class AuditAgent {

  // Starts a new audit transaction and returns its id.
  static initTransaction(): string { ... }

  // Opens a new event group under the current group.
  static initGroup(title: string) { ... }

  // Records an event under the current group.
  static recordMessage(message: string) { ... }

  // Closes the current group, returning to its parent.
  static endCurrentGroup() { ... }

  // Finalizes the transaction.
  static endTransaction() { ... }

}
A simple object model is established to store the events. In short, we have the concepts of event groups and events. Events are grouped under an EventGroup entity, and EventGroups contain not only Events but other groups as well. This establishes a hierarchical format, the purpose of which will become more apparent when we get to the reporting aspect.
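A minimal version of that object model might look something like the following. The field names are my own and may differ slightly from the demo:

// A single recorded event.
interface AuditEvent {
  message: string;
  timestamp: Date;
}

// Groups form the hierarchy: a group holds its own events plus nested child groups.
interface EventGroup {
  title: string;
  events: AuditEvent[];
  children: EventGroup[];
}

// A transaction is the root group plus the id that ties everything together.
interface AuditTransaction {
  transactionId: string;
  root: EventGroup;
}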
Implementing reporting calls such as group definitions and events is a fairly straightforward process, where all the work is delegated to the audit-agent class. Let's dive into how we report events from within the various functions.
import AuditAgent from "@/monitoring/audit-agent";

...

AuditAgent.initGroup('Retrieving credit scores');
...
AuditAgent.recordMessage(`Credit score (Experian): ${scores.experianScore}`);
...
AuditAgent.endCurrentGroup();
So we see that a group is defined, events are reported at the current group level, and then the group is 'closed', after which additional events are assigned to the parent of the closed group. Both group creation and event reporting accept a single string argument in this example, but other attributes could certainly be added. The caller is completely abstracted from providing the current transaction id, as it is resolved internally by the agent.
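One plausible way for the agent to do that, sketched below, is to keep an open-group stack per transaction, keyed by the id pulled from AsyncLocalStorage. This is only one possible internal shape; the actual demo code may be organized differently, and it reuses the EventGroup interface sketched earlier:

import { randomUUID } from "crypto";
import storage from "@/storage";

// Sketch of one possible internal shape for the agent; the demo may differ.
// Assumes the EventGroup interface from the object-model sketch above.
class AuditAgent {
  // Root group and open-group stack per transaction, keyed by transaction id.
  private static transactions = new Map<string, { root: EventGroup; stack: EventGroup[] }>();

  static initTransaction(): string {
    const transactionId = randomUUID();
    const root: EventGroup = { title: "transaction", events: [], children: [] };
    AuditAgent.transactions.set(transactionId, { root, stack: [root] });
    return transactionId;
  }

  static initGroup(title: string): void {
    const tx = AuditAgent.current();
    if (!tx) return;
    const group: EventGroup = { title, events: [], children: [] };
    // Nest under whichever group is currently open, then make the new group current.
    tx.stack[tx.stack.length - 1].children.push(group);
    tx.stack.push(group);
  }

  static recordMessage(message: string): void {
    const tx = AuditAgent.current();
    tx?.stack[tx.stack.length - 1].events.push({ message, timestamp: new Date() });
  }

  static endCurrentGroup(): void {
    const tx = AuditAgent.current();
    // Never pop the root group of the transaction.
    if (tx && tx.stack.length > 1) tx.stack.pop();
  }

  // Resolves the record for the transaction bound to the current async context.
  private static current() {
    const id = storage.getStore()?.transactionId;
    return id ? AuditAgent.transactions.get(id) : undefined;
  }
}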
And now the magic happens. At this point, we have a series of events and groups in a structured format. Next, the object structure needs to be translated into a readable format. For this effort we will generate HTML, though PDF might be a good alternate format.
The logic for generating the report is delegated to the src/monitoring/publisher.ts file. This prepares us for a pluggable approach to where the report information is published. For this example, local file storage works; in a real-world scenario, other options such as persisting to an S3 bucket might come into play.
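That pluggable idea boils down to a small contract, with local file storage as one implementation. The interface and class names below are my own simplification, and the S3 variant is purely hypothetical:

import { writeFile } from "fs/promises";

// Minimal publisher contract: take a rendered report and put it somewhere.
interface ReportPublisher {
  publish(transactionId: string, html: string): Promise<void>;
}

// Local file system implementation, good enough for the demo scenario.
class FileReportPublisher implements ReportPublisher {
  async publish(transactionId: string, html: string): Promise<void> {
    await writeFile(`./reports/${transactionId}.html`, html, "utf-8");
  }
}

// An S3-backed publisher could implement the same interface without touching
// the code that generates the report.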
I elected to use the Handlebars library to translate the in-memory object structure into the final format. The template logic resides in ./report-templates/default.template. The idea in this example is simple: we iterate through each group and its events, applying minimal formatting logic. This gets the job done, though there is plenty of room for more specialized formatting.
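The rendering step itself stays small; roughly along these lines, assuming the AuditTransaction shape sketched earlier. Rendering nested groups typically means registering a recursive partial, which I have omitted here for brevity:

import Handlebars from "handlebars";
import { readFileSync } from "fs";

// Compile the template once and reuse it for every report.
const source = readFileSync("./report-templates/default.template", "utf-8");
const template = Handlebars.compile(source);

// Takes the in-memory AuditTransaction structure and returns the final HTML.
export const renderReport = (transaction: AuditTransaction): string => template(transaction);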
When we look at the output generated by standard logging, we begin to appreciate the simplicity of the HTML format when an analyst-type role needs to review a transaction. Whereas the default logging approach supports different logging levels, we only have a single level with the auditing approach: verbose logging is typically disabled in production environments, while all business-level events are always reported. We also restrict event messages to higher-level messaging, whereas verbose messages sent to the standard logging approach may contain more low-level details intended for more technical roles to interpret.
So we have covered an alternate approach to logging, intended to streamline the auditing of complex processes. Analysts, and potentially end users, gain the ability to effectively monitor and audit transactions, minimizing the need for a technical role to reverse-engineer activity after the fact using a very low-level approach.
As I have mentioned before, using this approach I have virtually eliminated my support role for several critical processes. I encourage anyone supporting similar black-box processes to give it a try.
Full source code is available on my GitHub account at https://github.com/west-coast-matthew/blog-process-monitoring.