At Rubrik, we continually strive to enhance our developers’ experience and velocity. Quick iteration and delivery doesn’t mean shipping a lower-quality product first and stabilizing it later. In today’s agile landscape, enabling the development team to measure quality at every stage of feature development is pivotal: it drives swift iteration and faster delivery of high-quality products to our customers. Ultimately, the sooner we catch regressions, the better. Let’s delve into how our testing process and infrastructure enable our engineering community to iterate faster and deliver our products efficiently.
Test Process
Our strategic test process, embedded throughout the product release cycle, is succinctly represented below.
Every code check-in triggers the execution of dependent unit, component, and integration tests before the code lands on the main branch. At regular intervals, typically every hour, we build the main branch and run a full suite of tests, including unit, component, integration, and smoke tests. This rigorous testing validates critical functionality and helps identify regressions on the main branch. For a build to be promoted to the next stages of testing, all of these tests must pass. Because builds are initiated more frequently than test pipelines, we promote multiple successful builds and select the latest one for our pipeline runs. Running E2E tests is both time-consuming and expensive, so we avoid triggering E2E pipelines if there are any build or promotion issues. Once those issues are resolved, the cycle restarts automatically.
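To make the build selection concrete, here is a minimal sketch of how the latest promoted build might be chosen before triggering an E2E pipeline. The `Build` fields and the selection helper are hypothetical illustrations for this post, not our actual service API.

```python
from dataclasses import dataclass

@dataclass
class Build:
    build_id: str
    commit: str
    timestamp: float
    promoted: bool  # True only if unit, component, integration, and smoke tests passed

def select_build_for_pipeline(builds: list[Build]) -> Build | None:
    """Pick the most recent promoted build; skip E2E runs if none qualifies."""
    promoted = [b for b in builds if b.promoted]
    if not promoted:
        return None  # build or promotion issues: don't trigger expensive E2E pipelines
    return max(promoted, key=lambda b: b.timestamp)
```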
Test Infrastructure Components
Our current process is the culmination of overcoming numerous challenges: understanding engineers’ pain points, and continually iterating, innovating, and enhancing our ecosystem. Multiple components in our test infrastructure are integrated to enable smoother test implementation, automated test addition, and execution as part of our daily test pipeline runs. Here’s an overview of the journey from test authoring to the visualization of test results from the pipeline runs:
Let’s take a deeper look at the Test Infrastructure components involved in ensuring high-quality products at Rubrik.
Test Framework
Test authors at Rubrik use a Python test framework built on Pytest. At its core is the Self-Describing Test (SDT) mechanism, which makes it easy for anyone to understand a test’s objective, the resources it depends on, and the configurations with which it can be run.
SDT wasn’t part of our initial automated test-writing process; it emerged in response to user pain points. While test owners knew which resources their tests required, everyone else had to decipher the test description to work out those dependencies. In our older test framework, this meant a laborious multi-step process: identifying the resource requirements, ordering the necessary resources, and then supplying them via command-line arguments.
To address these issues, we leveraged our SDT capability, enabling the current test framework to communicate seamlessly with our in-house resource management service. Now, instead of multiple steps to execute a test, a single command suffices to determine the resource demands, order the requisite resources, and execute the test once they are fulfilled. This integration has empowered test owners and simplified the process for everyone else who wants to run our E2E tests.
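To illustrate, here is a minimal sketch of what a self-describing test might look like. The marker names (`sdt_resources`, `sdt_config`) and the `cluster` fixture are hypothetical stand-ins for this example; the actual SDT annotations are internal to our framework.

```python
import pytest

# Hypothetical SDT-style annotations: the test declares its own objective,
# resource requirements, and supported configurations, so the framework can
# order resources from the resource manager before execution.
@pytest.mark.sdt_resources(clusters=1, vms=2)
@pytest.mark.sdt_config(platforms=["esx", "hyperv"])
def test_snapshot_restore(cluster):
    """Objective: verify that a VM restored from a snapshot boots successfully."""
    vm = cluster.provision_vm()
    snapshot = vm.take_snapshot()
    restored = snapshot.restore()
    assert restored.is_running()
```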
Once an E2E test is fully automated and meets expectations, it is integrated into our test pipelines for regression testing. Engineers simply use Pytest markers to flag these tests for inclusion in any chosen pipeline, without needing to switch to a different system. It’s as simple as that!
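For example, tagging a test for a pipeline might look like the snippet below; the `pipeline` and `suite` marker names are illustrative assumptions, not our exact markers.

```python
import pytest

# Hypothetical markers flagging this test for a pipeline and suite; the
# auto-config generator described next picks them up on its scheduled scan.
@pytest.mark.pipeline("nightly_regression")
@pytest.mark.suite("data_protection")
def test_backup_end_to_end():
    ...
```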
Pipelines Auto-Config Generator
Let me provide a concise overview of our pipeline structure. A pipeline is composed of one or more test suites, each of which comprises one or more test jobs. In turn, a test job includes one or more tests. Generally, a test suite represents a collection of tests from a specific component or area within our product. The sample below shows the hierarchy and the relationships between pipelines, test suites, test jobs, and individual tests.
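As a hedged illustration of that hierarchy, the sketch below models it with plain Python data classes; the class and field names are ours for this example, not the actual generator’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestJob:
    name: str
    tests: list[str] = field(default_factory=list)   # one or more tests per job

@dataclass
class TestSuite:
    component: str                                   # tests from one product area
    jobs: list[TestJob] = field(default_factory=list)

@dataclass
class Pipeline:
    name: str
    suites: list[TestSuite] = field(default_factory=list)

nightly = Pipeline(
    name="nightly_regression",
    suites=[
        TestSuite(
            component="data_protection",
            jobs=[TestJob("backup_jobs", tests=["test_backup_end_to_end"])],
        ),
    ],
)
```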
Before the implementation of our auto-generation process, engineers would manually construct these test jobs, suite definitions, and ultimately incorporate them into the pipelines – a labor-intensive process that clearly called for a more efficient system. This need prompted us to develop an automated process for generating our test pipelines.
This auto-generation runs at regular intervals, analyzing the tests to detect any changes and then building the pipelines accordingly. When changes are identified, the system leverages the Pytest markers specified by the test authors to auto-generate the test jobs, suites, and pipelines. This automation relieved our engineers of the laborious task of manually adding tests to the pipelines and allowed us to swiftly integrate new tests and updates into our test pipelines.
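Conceptually, the marker scan can be done with Pytest’s own collection hooks. The sketch below is a simplified illustration assuming the hypothetical `pipeline` marker from earlier, not our actual generator code.

```python
from collections import defaultdict

import pytest

class MarkerCollector:
    """Pytest plugin that records which tests carry which pipeline marker."""

    def __init__(self):
        self.by_pipeline = defaultdict(list)

    def pytest_collection_modifyitems(self, items):
        for item in items:
            marker = item.get_closest_marker("pipeline")
            if marker:
                self.by_pipeline[marker.args[0]].append(item.nodeid)

def discover_pipeline_tests(test_dir: str) -> dict[str, list[str]]:
    """Collect tests (without running them) and group node IDs by pipeline."""
    collector = MarkerCollector()
    pytest.main(["--collect-only", "-q", test_dir], plugins=[collector])
    return dict(collector.by_pipeline)
```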
Pipeline Executor
Having the updated pipelines ready should mean we're all set to execute them and spot regressions, right? Not quite. As mentioned earlier in this blog, selecting the correct build with the necessary promotions is crucial to our success. Equally important is ensuring that the infrastructure running these pipelines functions properly. To this end, we run an infrastructure pipeline at regular intervals to detect and address any operational anomalies, which helps prevent pipeline failures caused by infrastructure issues.
For execution, we rely on predefined schedules that trigger our pipelines automatically. While we have an in-built system overseeing the scheduling and triggering of our pipelines, the underlying framework that carries out our tests is Tekton.
Our engineers also use this in-built system to trigger any ad hoc runs to test their changes.
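As a rough sketch of how a scheduler could hand a run to Tekton, the snippet below creates a PipelineRun through the Kubernetes custom-objects API. The pipeline name and namespace are placeholders, and our in-built scheduling system is not shown.

```python
from kubernetes import client, config

def trigger_pipeline_run(pipeline_name: str, namespace: str = "test-pipelines"):
    """Create a Tekton PipelineRun via the Kubernetes custom-objects API."""
    config.load_kube_config()
    body = {
        "apiVersion": "tekton.dev/v1",
        "kind": "PipelineRun",
        "metadata": {"generateName": f"{pipeline_name}-run-"},
        "spec": {"pipelineRef": {"name": pipeline_name}},
    }
    return client.CustomObjectsApi().create_namespaced_custom_object(
        group="tekton.dev", version="v1", namespace=namespace,
        plural="pipelineruns", body=body,
    )
```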
Pipeline Triager
Managing the numerous pipelines that assess the quality of our builds could be daunting, especially when it comes to manually filing tickets for the issues that arise; tracking and avoiding duplicate tickets adds to the complexity. To relieve our engineers of this workload, we developed an in-house auto-triaging and reporting system. This system serves as the first point of analysis for our pipeline runs, effectively eliminating the need for manual ticket management for test failures.
The auto-triage system processes pipeline metadata at regular intervals throughout the runs. Whenever a test failure is detected, it not only autonomously files a ticket against the relevant component but also leverages stack trace awareness to efficiently deduplicate similar failures into a single issue. This approach allows us to track issues more efficiently and enables our engineers to focus on resolving them promptly.
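A common way to implement stack-trace-aware deduplication is to fingerprint a normalized trace. The sketch below shows that general idea under our own simplifying assumptions, including a stand-in `file_ticket` helper; it is not the triager’s actual algorithm.

```python
import hashlib
import re

def failure_fingerprint(stack_trace: str) -> str:
    """Stable fingerprint so equivalent failures collapse into one ticket."""
    # Strip run-specific noise (addresses, line numbers, temp paths) so the
    # same root cause hashes identically across runs.
    trace = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", stack_trace)
    trace = re.sub(r"line \d+", "line N", trace)
    trace = re.sub(r"/tmp/\S+", "/tmp/PATH", trace)
    return hashlib.sha256(trace.encode()).hexdigest()[:16]

def file_ticket(component: str, stack_trace: str) -> str:
    """Hypothetical stand-in for the ticketing-system API call."""
    return f"QUAL-{hash((component, stack_trace)) % 10000}"

def triage(component: str, stack_trace: str, open_tickets: dict[str, str]) -> str:
    """File a new ticket only if no open ticket shares the fingerprint."""
    key = failure_fingerprint(stack_trace)
    if key not in open_tickets:
        open_tickets[key] = file_ticket(component, stack_trace)
    return open_tickets[key]
```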
Pipeline Results Reporter
Our auto-triaging system also facilitates automated reporting of test pipeline results. Once the system completes processing the pipeline results, the data is transferred to Snowflake. Subsequently, Tableau accesses this data from Snowflake periodically for visualization purposes. In addition, the system ensures timely communication with the stakeholders via Slack notifications and Jira tickets regarding any pipeline failures.
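For illustration, exporting processed results to Snowflake might look like the following, using the snowflake-connector-python package; the connection parameters, table, and columns are placeholders, not our reporting schema.

```python
import snowflake.connector

def publish_results(rows: list[tuple]) -> None:
    """Write processed pipeline results to Snowflake for Tableau to pick up.

    Each row is assumed to be (pipeline, build_id, test_id, status, duration_s).
    """
    conn = snowflake.connector.connect(
        account="your_account", user="reporter", password="***",
        warehouse="REPORTING_WH", database="QUALITY", schema="PIPELINES",
    )
    try:
        conn.cursor().executemany(
            "INSERT INTO PIPELINE_RESULTS "
            "(pipeline, build_id, test_id, status, duration_s) "
            "VALUES (%s, %s, %s, %s, %s)",
            rows,
        )
    finally:
        conn.close()
```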
The Tableau dashboards provide a clear visual narrative, helping all parties understand the current quality status of a specific release. Ultimately, when the pipeline results meet the expected standards, the build is qualified for release.
Summary
A concise summary of the steps detailed above can be found below.
Metrics and Performance: Adopting the DORA 4 Keys Approach
To quantify the impact of our enhanced automated testing processes, we leverage the DORA 4 keys approach. These metrics provide a comprehensive view of our DevOps performance and help us continually refine our processes. Here's how we measure up:
Test Execution Frequency:
What it Measures: How often we execute automated tests.
Our Progress: By automating our test pipelines with frequent code checks and seamless test execution, we've significantly increased our test execution frequency. This ensures that our code is constantly validated, leading to quicker identification of issues.
Lead Time for Test Changes:
What it Measures: The time it takes from writing a test to its successful execution in the pipeline.
Our Progress: Our streamlined test process, powered by the Self-Describing Test (SDT) mechanism and the auto-generation of test pipelines, has drastically reduced the lead time for implementing and executing new tests. This efficiency enables our engineering teams to validate changes rapidly.
Test Failure Rate:
What it Measures: The percentage of test executions that result in a failure.
Our Progress: The integration of comprehensive automated testing—ranging from unit and component tests to integration and smoke tests—has helped reduce our test failure rate. This proactive identification of issues ensures higher code quality and more stable builds.
Mean Time to Test Failure Recovery (MTTFR):
What it Measures: The time it takes to recover from a test failure within the development pipeline.
Our Progress: With our auto-triaging system, failures within the test pipelines are swiftly detected and categorized. The system automatically files tickets against the relevant components, ensuring that issues are assigned without delay. This rapid identification and assignment significantly reduces the MTTFR, allowing engineers to quickly diagnose, address, and resolve test failures. As a result, our test pipelines maintain high efficiency, minimizing downtime and restoring successful test execution promptly.
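As a hedged illustration of how metrics like these could be computed from pipeline run records, consider the sketch below; the `Run` fields are assumptions for this example, not our reporting schema.

```python
from dataclasses import dataclass

@dataclass
class Run:
    passed: bool
    failed_at: float | None = None     # epoch seconds of the failure, if any
    recovered_at: float | None = None  # epoch seconds of the next green run

def test_failure_rate(runs: list[Run]) -> float:
    """Percentage of runs that ended in a test failure."""
    return 100.0 * sum(not r.passed for r in runs) / len(runs)

def mean_time_to_recovery(runs: list[Run]) -> float:
    """Average seconds from a failure to the next successful run (MTTFR)."""
    gaps = [r.recovered_at - r.failed_at
            for r in runs if r.failed_at and r.recovered_at]
    return sum(gaps) / len(gaps) if gaps else 0.0
```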
By closely monitoring these metrics, we ensure our test infrastructure not only supports but also accelerates our agile development goals. Our commitment to continuous improvement is reflected in these key performance indicators, demonstrating our ability to deliver high-quality products efficiently.
Non-Technical Aspects Contributing to High Quality
While our robust test automation processes and sophisticated infrastructure ensure technical excellence, several non-technical factors also play a crucial role in delivering top-quality products at Rubrik:
Ownership of Testing by Component Teams:
Each component team takes complete ownership of their testing processes. This distributed accountability ensures that quality is a shared responsibility across all teams. By integrating testing as part of their daily tasks, each team proactively identifies and addresses issues, fostering an environment of continuous improvement.
A Strong Developer Platform Team:
Our dedicated Dev Platform team is instrumental in enhancing productivity. This team focuses on developing and providing tools, frameworks, and automation that streamline various development processes. Their continuous efforts to innovate and optimize workflows enable our engineers to focus on delivering higher-quality code more efficiently.
These non-technical aspects provide the foundation for our technical successes, ensuring that our teams are not only equipped with the right tools and processes but also aligned with a culture that supports high-quality outcomes.
Conclusion
Over time, our test infrastructure has evolved as we explored our developers’ needs and addressed them proactively. This evolution has significantly enhanced our teams’ experience and productivity, enabling us to deliver top-quality products faster. We remain dedicated to the ongoing enhancement of developer experience and velocity.
Numerous other elements of our infrastructure, such as our code check-in process, build infrastructure, resource manager, and more, have not been discussed in this series. We will continue to share updates on these fronts.
In the dynamic world of agile development, we prioritize continuous feedback to ensure our developer community is satisfied. Our commitment is to listen, adapt, and create a supportive environment.
In our quest for high-quality products, the mantra is clear: Don’t Backup. Go Forward.
Go Rubrik!
