At Google, we use the following principles to run our device fleet securely and at scale:
Secure default settings at depth with central enforcement
Ensure a scalable process
Invest in fleet testing, monitoring, and phased rollouts
Ensure high quality data
Secure default settings
Defense in depth requires us to layer our security defenses such that an attacker would need to pass multiple controls in an attack. To uphold this defensive position at scale, we centrally manage and measure various qualities of our devices, covering all layers of the platform;
We use automated configuration management systems to continuously enforce our security and compliance policies. Independently, we observe the state of our hardware and software. This allows us to determine divergence from the expected state and verify whether it is an anomaly.
Where possible, our platforms use native OS capabilities to protect against malicious software, and we extend those capabilities across our platforms with custom and commercial tooling.
Google manages a fleet of several hundred thousand client devices (workstations, laptops, mobile devices) for employees who are spread across the world. We scale the engineering teams who manage these devices by relying on reviewable, repeatable, and automated backend processes and minimizing GUI-based configuration tools. By using and developing open-source software and integrating it with internal solutions, we reach a level of flexibility that allows us to manage fleets at scale without sacrificing customizability for our users. The focus is on operating system agnostic server and client solutions, where possible, to avoid duplication of effort.
Software for all platforms is provided by repositories which verify the integrity of software packages before making them available to users. The same system is used for distributing configuration settings and management tools, which enforce policies on client systems using the open-source configuration management system Puppet, running in standalone mode. In combination, this allows us to easily scale infrastructure and management horizontally as described in more detail and with examples in one of our BeyondCorp whitepapers, Fleet Management at Scale.
All device management policies are stored in centralized systems which allow settings to be applied both at the fleet and the individual device level. This way policy owners and device owners can manage sensible defaults or per-device overrides in the same system, allowing audits of settings and exceptions. Depending on the type of exception, they may either be managed self-service by the user, require approval from appropriate parties, or affect the trust level of the affected device. This way, we aim to guarantee user satisfaction and security simultaneously.
Fleet testing, monitoring, and phased rollouts
Applying changes at scale to a large heterogeneous fleet can be challenging. At Google, we have automated test labs which allow us to test changes before we deploy them to the fleet. Rollouts to the client fleet usually follow multiple stages and random canarying, similar to common practices with service management. Furthermore, we monitor various status attributes of our fleet which allows us to detect issues before they spread widely.
High quality data
Device management depends on the quality of device data. Both configuration and trust decisions are keyed off of inventory information. At Google, we track all devices in centralized asset management systems. This allows us to not only observe the current (runtime) state of a device, but also whether it’s a legitimate Google device. These systems store hardware attributes as well as the assignment and status of devices, which lets us match and compare prescribed values to those which are observed.
Prior to implementing BeyondCorp, we performed a fleet-wide audit to ensure the quality of inventory data, and we perform smaller audits regularly across the fleet. Automation is key to achieving this, both for entering data initially and for detecting divergence at later points. For example, instead of having a human enter data into the system manually, we use digital manifests and barcode scanners as much as possible.
How do we figure out whether devices are trustworthy?
After appropriate management systems have been put in place, and data quality goals have been met, the pertinent security information related to a device can be used to establish a “trust” decision as to whether a given action should be allowed to be performed from the device.
High level architecture for BeyondCorp
This decision can be most effectively made when an abundance of information about the device is readily available. At Google, we use an aggregated data pipeline to gather information from various sources, which each contain a limited subset of knowledge about a device and its history, and make this data available at the point when a trust decision is being made.
Various systems and repositories are employed within Google to perform collection and storage of device data that is relevant to security. These include tools like asset management repositories, device management solutions, vulnerability scanners, and internal directory services, which contain information and state about the multitude of physical device types (e.g., desktops, laptops, phones, tablets), as well as virtual desktops, used by employees at the company.
Having data from these various types of information systems available when making a trust decision for a given device can certainly be advantageous. However, challenges can present themselves when attempting to correlate records from a diverse set of systems which may not have a clear, consistent way to reference the identity of a given device. The challenge of implementation has been offset by the gains in security policy flexibility and improvements in securing our data.
What lessons did we learn? As we rolled out BeyondCorp, we iteratively improved our fleet management and inventory processes as outlined above. These improvements are based on various lessons we learned around data quality challenges.
Audit your data ahead of implementing BeyondCorp
Data quality issues and inaccuracies are almost certain to be present in an asset management system of any substantial size, and these issues must be corrected before the data can be utilized in a manner which will have a significant impact on user experience. Having the means to compare values that have been manually entered into such systems against similar data that has been collected from devices via automation can allow for the correction of discrepancies, which may interrupt the intended behavior of the system.
Prepare to encounter unforeseen data quality challenges
Numerous data incorrectness scenarios and challenging issues are likely to present themselves as the reliance on accurate data increases. For example, be prepared to encounter issues with data ingestion processes that rely on transcribing device identifier information, which is physically labeled on devices or their packaging, and may incorrectly differ from identifier data that is digitally imprinted on the device.
In addition, over reliance on the assumed uniqueness of certain device identifiers can sometimes be problematic in the rare cases where conventionally unique attributes, like serial numbers, can appear more than once in the device fleet (this can be especially exacerbated in the case of virtual desktops, where such identifiers may be chosen by a user without regard for such concerns).
Lastly, routine maintenance and hardware replacements performed on employee devices can result in ambiguous situations with regards to the “identity” of a device. When internal device components, like network adapters or mainboards, are found to be defective and replaced, the device’s identity can be changed into a state which no longer matches the known inventory data if care is not taken to correctly reflect such changes.
Implement controls to maintain high quality asset inventory
After inventory data has been brought to an acceptable correctness level, mechanisms should be put into place to limit the ability for new inaccuracies to be introduced. For example, at Google, data correctness checks have been integrated into the provisioning process for new devices so that inventory records must be correct before a device can be successfully imaged with an operating system, ensuring that the device will meet required data accuracy standards before being delivered to an employee.
Next time In the next post in this series, we will discuss a tiered access approach, how to create rule-based trust and the lessons we’ve learned through that process.