AFP548 Site News September 29, 2022 at 8:54 am

Zentral for ~Observability~ ~Governance~ Compliance

You’ve watched my MDOYVR presentation, but instead of being able to draw an owl, you’re concerned about standing up an osquery query distribution stack in production (unfortunately not what we’ll cover in this post, sorry!) and actually doing the job of visualizing the data.

‘Observability’ and ‘governance’ are my favorite buzzwords of late, because the Reporting Structure Above incentivizes us Client Platform Engineers to display the slog of busywork we ship when preparing to be audited, since it immediately turns into proof of a compliant state. In being rigorous, we mix in things that prove we’re doing our actual job, and the org gets validation that the effort is worthwhile (or nobody would need to look at the dashboards). I won’t spend too many characters on ‘feelings’ about this whole CIS thing – I think I got enough of that out of my system – so in this post we’ll elaborate on the client, the server, and the data moving parts of actually shipping those checks.

Outkast reference

As we use Zentral to get the job done, it’s what I’ll be referencing a bunch in this post. Bundling Grafana/Prometheus/OpenSearch behind its osquery service gave us flexible options at an enterprise scale, but that’s admittedly a lot of stack if you’re looking at setting up the ~dozen or so core moving parts (our Terraform apply is ~250 AWS resources between all the alarms and backups and secure infra configs and tweaks and customizations we’ve built up).

Unfortunately, if you were hoping for a clean shot where you wouldn’t need to be a subject matter expert on various parts of the systems you need to audit AND the osquery server’s moving parts… ur in teh rong biz/are perhaps in need of reminding what that ‘engineer’ part refers to when the whole world has done gone SRE/DevOps-y for a while now. Fortunately, we’re all lucky to be in the biz with flexible, rock-solid tools like Zentral and the open source communities helping smooth over the rough spots that undoubtedly occur with the various agents that can send data through it. Even our relatively lean team got a long way towards telling a good story in graphs (lies, damn lies, statistics, etc.)… at least until the delayed and new/future tech debt comes a callin’ 😅.

Zooming back in on the examples referenced in my presentation, jmespath and osquery are the two formats you can write ‘compliance checks’ in. The ‘pass/fail’ data for these can then be surfaced as ‘filters’, a.k.a. widgets, on the Zentral inventory view. You can even include a collection of them at a RESTful URL to easily point folks at. But… Zentral has a GUI only a developer could love 😅, which is why the accompanying Prometheus and Grafana stack I referred to is just as vital: it taps into the dependability of the backend (which leverages cloud primitives like subscriptions and message queues) and the frontend (via ‘stores’ like open/elasticsearch/splunk/datadog/snowflake…) and lets us slice and dice the data for even easier consumption – the kind of access folks usually need dedicated logging infra and vendor-specific specialization to get at.

Jmespath is a way of querying json data (in our implementation, from events earmarked to also get streamed into prometheus buckets) with usage syntax not unlike jq. Zentral implements common-case, low-hanging-fruit/golden-path ways of querying e.g. the presence of mobile configuration profiles delivered to a device (as inventoried by the ‘monolith’ wrapper custom to Zentral, in case you’re hosting munki somewhere else – CERTAIN overpriced MDMs cannot efficiently return full inventory data about what profiles are installed on a device… no comment). But you should keep in mind jmespath is not a full scripting language; you may soon realize how shallowly you dip into jq when parsing stuff. Many of the ‘sources’ you can plug in as modules to Zentral (Jamf/Munki/Puppet being the most notable) have a section in each inventory record with the last recorded value for ‘extra facts’ – simpler jmespath queries can run against that custom/arbitrary inventory data and stop smoke from coming out of your ears. (I’m leaving out some specifics and details about which facts you may benefit from, but… you know where to find us in the Zentral community if you need blanks filled in 😀.)
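
If you haven’t touched it before, here’s a taste of how the jq-ish syntax reads (a toy sketch; the ‘identifier’ filter key is purely illustrative, not a guaranteed part of Zentral’s inventory schema):

profiles[*].uuid
profiles[?identifier == 'com.example.security'].uuid

The first expression projects every uuid under the profiles array; the second filters to matching profiles before projecting, which is about as fancy as you’d want to get before reaching for a real scripting language.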

Helpfully, you can use the DevTool built into Zentral to take real collected data for a device and confirm that the jmespath check gives you the result you expect, and conversely that devices in the opposite state are parsed correctly. Besides ‘scoping’ a check to a source or an operating system platform, tags also allow you to take other groupings you can sync into Zentral and apply arbitrary constraints as needed. For example, cloud-hosted Ubuntu systems wouldn’t need full disk encryption and screensaver locks, nor do macOS-based conference room and signage ‘appliance’ Mac Minis, so identifying those via tags or sites ensures the checks only run on applicable devices. Here’s an example of confirming the presence of a mobile configuration profile with WSOne or Jamf:

contains(keys(@), 'profiles') && contains(profiles[*].uuid, 'E81C883A-6938-4F43-8E28-10428836FB2B')

Breakinitdown:

  • First, confirm which source sends the inventory information for the top-level key you’re going to follow down the tree for (e.g. on Mac, munki gathers profile information)
  • Then confirm it’s present (as there may be a slow convergence period e.g. right after DEP bootstrap) by using ‘contains’ and the key name wrapped in quotes: contains(keys(@), 'profiles')
  • As mentioned, you can customize this for other top-level keys, under extra_facts. Note those would be source-specific, so e.g. you’d choose Jamf if you need EAs
  • Next, continue evaluating with a logical and, &&
  • Then enumerate the uuids across all nested profiles with a glob/asterisk in square brackets after the top-level key, and then dot-index into the value we want to check: profiles[*].uuid
  • Nest that inside a contains() and follow the profiles[*].uuid above with a comma and then the value we’re checking for wrapped in quotes

To reiterate a point I made in the presentation, I find these types of ‘is profile present’ checks… naïve, because you’re trusting that the responsible framework(s) did in fact apply the change/restriction and it’s not pending a reboot or transition to a specific ‘happy state’. So to contrast with the previous example, here’s how you’d check that the condition puppet tests for via a fact (so it can run idempotently) is in the applicable compliant state. We use funny logic that ends up making enough sense that expressing a boolean in text doesn’t drive us nuts:

puppet_node.extra_facts.contains(keys(@), 'indeed_nfc_disabled') && puppet_node.extra_facts.indeed_nfc_disabled == 'OK'

(We’d always see puppet_node in the root of the json inventory representation, so we’re going two nodes down the tree to make sure we have the extra_fact populated in the first place; otherwise Zentral would give us a third state, ‘UNKNOWN’, which obvs isn’t ideal.)

To explain how you’d write SQLite queries to check compliance with osquery, it’s best to start with another naïve example – this one checks the hash of a single file and is wrapped with the keywords Zentral looks for when parsing its boolean result:

SELECT sha256,
CASE
WHEN sha256 = 'omgwtfbbqrandomchars00112233445566' THEN 'OK'
ELSE 'FAILED'
END ztl_status
FROM hash
WHERE path = '/path/to/file';

To explain the syntax in use here even if you’re not SQLite/osquery-savvy:
  • Starting with SELECT and then sha256 as the ‘key’/column in the result whose value we’re evaluating against
  • Then CASE, so we enter the sqlite ‘logic flow’ where pass/fail criteria can be assigned and different paths followed as a result
  • Then WHEN sha256 = 'omgwtfbbqrandomchars00112233445566' THEN 'OK' ← which designates the ‘pass’ case if the osquery hash table says the result of checking that file is as we expect
  • ELSE 'FAILED' (one possible keyword Zentral is looking for in the ‘failure’ state)
  • END ztl_status ← this actually renames the adhoc key/column in our results to ztl_status, and is how Zentral knows where it’s trying to read a value from. The END closes the case statement
  • FROM hash ← tells it which table we’re getting the sha256 of the file via
  • WHERE path = '/path/to/file'; ← these are our sql-isms to stretch how you think a database works, since it’s not something statically stored that we’re retrieving; the lookup happens ‘at-time-of-query’ on the device in question to determine what the sha256 of the stated path is
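
To show that the wrapper pattern travels, here’s a minimal sketch of the same pass/fail idea against a different stock osquery table (sip_config); treat it as illustrative rather than a check we ship as-is:

SELECT enabled,
CASE
WHEN enabled = 1 THEN 'OK'
ELSE 'FAILED'
END ztl_status
FROM sip_config
WHERE config_flag = 'sip';

Same shape: a table, a CASE that emits 'OK' or 'FAILED', and the ztl_status label Zentral reads.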

And to be more rigorous, here’s an advanced example of a query for the NTP config, which SHOULD be parsed and applied as expected on most recent macOS versions… 😅 (and which… we never got the red team to confirm does anything worthwhile? But we felt there wouldn’t be collateral damage/much downside to making this the default, and the GUI is still accessible to confirm SOMETHING we consider accurate is being applied):

with compliance_check as (
select sha256,
case when sha256 = 'omgwtfbbqrandomchars00112233445566' then 'OK' else 'FAILED' end ztl_status
from hash
where path = '/path/to/conf'
)
select compliance_check.sha256,
coalesce(compliance_check.ztl_status, 'FAILED') ztl_status
from (select 1 as x) x
left join compliance_check on x.x = 1;

For osquery via sqlite, JOINs allow us to pull results from multiple tables to correlate or otherwise mash up. Dot-notation with those table names maps the ‘field’/column label either to return in the subsequent results or to match on when sewing tables together. You can additionally create ad hoc tables to work with, or wrap the output of a select statement looking up values in a table into a newly named one.
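
As a quick aside on that dot-notation, here’s a minimal sketch of a join outside the compliance-check context, using osquery’s stock processes and users tables (purely illustrative):

SELECT processes.name, users.username
FROM processes
JOIN users ON processes.uid = users.uid
LIMIT 5;

The ON clause is the ‘sewing’ condition, and the dot-prefixed columns in the SELECT decide what comes back from each table.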

Zentral provided code to cover the case where the thing we’re auditing may not actually be present and would therefore not return a result; we’d prefer to fail the check in that context instead of letting it be unknown, improving on the outcome of the v1 query above. To explain what’s going on in the v2 example code provided:

  • First, the ‘with … as’ keywords name the query nested in parentheses ‘compliance_check’, so we can treat its results like a table
  • From it we’re requesting two columns –
  • sha256,
  • and the result of the case statement, labeled ‘ztl_status’.
  • From that new ‘metatable’, we say we’d like to end up with its sha256 column and a default of 'FAILED' for ztl_status (via coalesce)
  • In the process/same line we’re also re-applying the ztl_status label (since the way we’re asking for it would otherwise become the column header)
  • To ‘trick’ the query into returning a row even when the result would be empty/NULL (as in this case, when the ‘where path’ isn’t found and the query would normally return no result), we make a SECOND, new one-row table to join against, with the single value 1 in a column relabeled ‘x’.
  • Left joining keeps that dummy row no matter what and attaches the compliance_check columns to it when they exist; when compliance_check comes back empty we still get a row, just with an empty sha256 and a NULL ztl_status that coalesce turns into 'FAILED'.
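
If the dummy-row trick still reads strangely, here’s the same idea stripped to its bones: a minimal sketch using osquery’s file table and a path that (presumably) doesn’t exist, with nothing Zentral-specific about it.

with maybe_empty as (
select 1 as hit
from file
where path = '/this/path/should/not/exist'
)
select coalesce(maybe_empty.hit, 0) hit
from (select 1 as x) x
left join maybe_empty on x.x = 1;

Run as-is it returns a single row with hit = 0; point the where clause at a file that exists and you get hit = 1 instead, which is exactly the always-answer behavior the v2 compliance check relies on.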

And now that we’ve broken that down, to bring it all home, here’s the true not-so-secret sauce of the setup: compliance check status change events get written out to Prometheus metrics, which Grafana can dynamically pull in via variables to a self-maintaining dashboard! The one we use reflects the raw numbers alongside percentages as a time series trend, a.k.a. a burndown chart (if we had a threshold to work towards getting the fleet under, which you can also trivially represent in Grafana as an arbitrary horizontal line). Looks impressive on slides! Get yours today!

Zentral actually has more exciting visualizations, but I’ll round out this article underlining the governance benefits of using this tool for proof when being audited: on Windows and Linux via puppet we check in the code, on Mac we check in the Munki pkg with the metadata and payload that made the change, and then with Terraform to manage the compliance check (or gitops to manage the osquery pack) we verify that the job got done. And THAT is how you apply rigor to your processes so you can own the ‘engineer’ in your title as Client Platform Engineers, if I do say so myself. I hope you agree and get to work with tooling that enables rigor like this!

Allister Banks

Allister lives in Japan, has not read the Slack scroll back, and therefore has no idea what is going on.
