At Apixio, our improved logging infrastructure has had a significant impact on the way we work and enhanced the quality of our product.
A few years back, we had lots of logs that looked like this:
INFO Apr 18, 2016 1:19:30 PM com.apixio.blog.LoggingExample Started writing blog
DEBUG Apr 18, 2016 2:20:15 PM com.apixio.blog.LoggingExample Edited blog title
ERROR Apr 18, 2016 5:00:00 PM com.apixio.blog.LoggingExample Missed blog deadline
These logs were written to disk by our applications, but because of our need to have a very secure production environment, they were not directly accessible to our engineers. Most of the logs generated by our Hadoop cluster were never seen and problems there were fairly opaque and could easily be missed.
The Goals Behind Logging
After much discussion, our engineering team agreed on a few overarching goals for Apixio’s logs:
- Logs should drive our business decisions. We were already a fairly data-driven team but getting data was relatively expensive for us. We wanted the logs to facilitate collecting actionable data.
- Logs should help us prototype new capabilities faster. Instead of fiddling around with production databases, data could be written to logs and analyzed almost immediately coming out of production.
- Logs should stand in as a real audit trail. In the healthcare industry, auditability is an important feature, and logging would allow us to add advanced audit controls easily and affordably.
- Logs should be crafted as part of a complex event processing pipeline. While a single log tells the story of an event, groups of logs together can track data through a pipeline, user sessions, or trends in performance. (This turned out to be a big win for us downstream, and we’ll cover that in a separate post.)
Our Solution: Standardizing, Centralizing, Storing, and Backing Up Logs
Standardizing logging format
The first step was cleaning up and standardizing log formats in order to ensure encapsulation, structure and consistency.
Encapsulation refers to a log containing enough information to be useful on it’s own. Think of a log as an “Event.” We want to know who (Document 123), what (parsed successfully), when (at 9:31am on 01/01/2016), where (in the Pipeline Parser Complete Method) and why (as part of batch ABC).
Structure refers to the programmatic processing of log information. Plain text strings containing changing fields like keys or timestamps aren’t good for processing because they can’t be grouped or segmented. Rather, XML or JSON should be used to enable logs from different components, dates, or organizations to be grouped and processed individually. For our system, we decided that all logs would be JSON. While not every log could conform to this (namely existing logs coming from Hadoop or Apache server), we could ensure that all of our application logging would be through a common library that output structured JSON.
Next, we decided that the actual structure of the JSON would be “engineered.” We decided on envelope and payload standardization, and considered which join keys would be needed to link log entries together. New product features would have their logs designed as part of the software development process. To ensure consistency, we created common logging libraries for our Java, Scala and Python services.
Consistency refers to the use of logs across a wider system. Consistency enables new logs to be processed by existing tools. An outer wrapper indicating the source of the log, the time and the log level, for example, enables a log processing system to easily route messages. One step further, logs emitted by a Web Service could include the service, the method and the elapsed time one level deeper, enabling automated service monitoring and performance analysis.
Moving logs around
We also needed a way to move the logs around. In some sense the decision for how to do this is more important than where the logs actually come to rest. For large, distributed analysis systems, the aggregated log data may be larger than the analyzed data itself. We chose Fluentd, which has a small footprint and support for a hierarchical key structure, fan out, as well as store and forward. Most importantly, it was designed to stay up and not lose logs. Fluentd also has numerous supported plugins for different infrastructure components (including some which we’ve contributed to) and has become a fairly popular platform over time.
Storing and backing up logs
The process of determining where to actually store the logs required us to evaluate our use cases. Were the logs being stored for later batch processing, streamed for real-time dashboarding, or split to both? Logs needed to be preserved, backed up, and made easily available. They had to be accessible regardless of where they were created or when. They couldn’t be rolled over or deleted after an amount of time had passed, or disappear silently with a decommissioned box.
In the end, we chose to keep a backup of all of the logs on Amazon Simple Storage Service, and the most important logs on the Hadoop Distributed File System so that we could use batch processing tools like Hive to query them. Additionally, all structured and unstructured logging would be sent to Graylog for ad hoc search and basic analytics and monitoring capabilities. Lastly, we send each log to a Kafka topic that was partitioned by environment and component for stream processing.
Like all significant changes you plan on making to your system architecture, not everyone understood why we needed to do this. Logging infrastructure was a significant investment in terms of time, money, and attention, and the scale of return wasn’t immediately clear.
We had some holdouts on the engineering team who resisted adopting the consistent formats and only changing logs after consideration for downstream effects. We were going to have to spend continuously on logging infrastructure, as a small company, and this meant we needed management buy-in as well. While our logging infrastructure is relatively small today, at the time it was almost the same size as our application stack.
Once everyone saw the positive initial results, we had problems managing the volume. The logs started really flowing, coming in from our Hadoop cluster and from our analytics jobs, which were capable of producing billions of log lines in an afternoon. We also found that some team members would think it perfectly reasonable to spew a 50mb log line that would block up the pipeline. We had not designed the system for this throughput right out of the gate, so it took time to architect the system for the volume we see today.
Promising Results from Enhanced Logging Infrastructure
Today, we use the techniques described above to construct, route, process and display real-time information about each of our many microservices, analysis pipeline jobs, and web applications. When data enters our system we log it, and at each step along the way we emit subsequent logs. The real-time processing of our logs enables us to be sure that everything that entered the system is complete (and if not, why). When something breaks (and something will definitely break), we know why, and if necessary, what data needs to be recovered.
This rigor has enabled us to achieve important results, including:
- Intrusion detection: We can identify unauthorized access from particular users, hosts.
- Audit requirements: For security audit purposes, we understand who accessed what data and when.
- Consistency checking: We can compare the number or keys of records processed against our database.
- Completeness: We can see everything that started flow X completed step Y.
- Performance: We can understand response times across sliding windows.
- Cost Accounting: We can see user activity and session lengths.
- User Behavior: We can identify atypical workflows or detection of confusing designs.
- Correlation of events: We can identify the source of issue in one system by aligning time series.
- Recovery: We can identify incomplete transactions or lost data.
One of the most important features of our log infrastructure is the ability to query our logs using standard SQL. Because of the native JSON parsing in Hive, it is trivial to do things like:
Get statistics about the number of clinical events of each type for a certain organization:
get_json_object(line, ‘$.patientUUID’) as pat,
get_json_object(line, ‘$.documentUUID’) as doc,
cast(get_json_object(line, ‘$.event.count’) as int) as total_event_count,
cast(get_json_object(line, ‘$.eventTypeCount’) as int) as event_type_count,
cast(get_json_object(line, ‘$.patientevent.count’) as int) as patient_total_event_count,
get_json_object(line, ‘$.status’) as status,
get_json_object(line, ‘$.error.message’) as error_message,
get_json_object(line, ‘$.batchId’) as batch_id,
get_json_object(line, ‘$.orgId’) as org_id
from staging_logs_eventjob_24 where
get_json_object(line, ‘$.orgId’)=”430″ and
get_json_object(line, ‘$.patientUUID’) is not null
Don’t worry, I don’t expect you to be able to read these results:
Or, taking it a step further, we can join across multiple log tables to identify which documents lacked sufficient machine readability for processing:
select a.doc_uuid, c.patient_uuid,
b.total_page_count – a.page_count as unreadable_count,
(b.total_page_count – a.page_count)/b.total_page_count as unreadable_percent
(select get_json_object(line, ‘$.document.uuid’) as doc_uuid,
count(distinct get_json_object(line, ‘$.page.number’)) as page_count,
get_json_object(line, ‘$.page.readability’) as readability
where get_json_object(line, ‘$.orgId’) = ‘123’
and get_json_object(line, ‘$.page.readability’) is not null
and get_json_object(line, ‘$.page.resolution’)=’300′
group by get_json_object(line, ‘$.document.uuid’),
(select get_json_object(line, ‘$.document.uuid’) as doc_uuid,
count(distinct get_json_object(line, ‘$.page.number’)) as total_page_count
where get_json_object(line, ‘$.orgId’) = ‘123’
and get_json_object(line, ‘$.page.number’) is not null
group by get_json_object(line, ‘$.document.uuid’)
right outer join
(select distinct doc_id,
where org_id = ‘123’ and year=’2015′
on a.doc_uuid = b.doc_uuid and a.doc_uuid = c.doc_id
where a.readability == ‘readable’
The above queries are an excellent example of how new features and analysis can be performed without developing new datastores or schemas when logs are treated as first-class data.
While the above examples demonstrate the ability for us to query logs retroactively, we can also process logs in realtime to drive application dashboards.
Here is an example of a single consolidated log message generated by one of our applications. Note: this is just demo data
“coding_org_name”: “Erin’s Demo Co”,
“user_agent”: “Chrome/49.0.2623.87 Safari/537.36”
“message”: “user has clicked the reject button on the finding page”,
“pages”: “2, 3”
Because this log encapsulates all of the various information needed to capture various metrics, we are able to use our Stream Processing system to update a real-time dashboard:
In particular, we are able to:
Increment a counter for number of rejects using the event_name of “reject_click”
- Grouped by “project”: “PRHCC_123” project statistics
- Grouped by “user”: “email@example.com” for user stats
- Grouped by “hcc”: “12” for code specific stats
Track a user’s session in the application for “user”: “firstname.lastname@example.org”
- “workflow”: “coder” indicates that the current session is still active
- “event_name”: “reject_click” updates the step in the workflow
Track access patterns for various pieces of data
- Identify how often “sourceId”: “c64dfec1-d730-4174-a37d-1c28f9ff36bd” is accessed
- Track the lifecycle of “lifecycleId”: “efb4ee22-f6dd-494e-e8ae-edee67cd6342”
Ensure consistency with our operational data store
- Denormalized secure data enables confirmation that records aren’t lost
Logs as a First-Class Data Citizen
As you can see, the power of the log cannot be understated. We got some first hand evidence of how heavily we had grown to rely on them when we moved data centers last year. A system that started out small and had been up nonstop for 18 months needed to be moved and scaled up. Inevitable hiccups of such a major transition made us understand how critical logs had become to so many parts of our business. No longer did anyone view logging as a “debugging” aid, it was a first class citizen in our data center architecture, one that we couldn’t allow to fall into disrepair.
At Apixio, we take logs very seriously. They drive our dashboards, they ensure our data is complete, and they verify that our customer experiences are quick, responsive, and effective.
Logs are so much more than just messages from the black box; they are the real-time sensors of the software world, a key safeguard to catastrophe, and the mechanism by which operational success can be measured and obtained.
Interested in being part of a great engineering and science team using data and AI to change healthcare? Check out our Careers Page.