Every developer agrees that unit tests are beneficial, yet very often developers end up not writing tests for their modules. One reason is that the barrier to writing a test is often high. Tools like py.test lower that barrier considerably, but even then there are complex scenarios where writing tests is hard and cumbersome.

One of the common issues we face is that a test requires complex input, and the output is equally complex. This is usually addressed in one of the following two ways.

  1. Keep the input and output as part of the test code
  2. Keep the input and output in separate files

The problem with the first approach is that the programming language we are working with (usually Python) is often not the best fit for representing that data. As a result, the test functions become long and cumbersome, which discourages developers from adding more tests.

The problem with the second approach is that you need to look at both the test code and test data files to understand each test case. Also, there will be multiple files for each test, and that increases the complexity.

We have faced many such situations and ended up creating a framework that reduces the barrier to writing tests by letting them be written declaratively in YAML.

Example 1: Query API

Let’s look at an example of a query interface that we use heavily in our application. Since the query interface is used in multiple parts of our system, we’ve built an internal API that accepts the query as JSON, translates that into a SQL query and executes it.

Here is a sample query that computes the sale volume by brand in the city of Bangalore.

    {
        "table": "combination",
        "columns": [
            {"name": "product.brand"},
            {"name": "volume", "agg": "sum"}
        ],
        "filters": [
            {"name": "customer.city", "cond": "eq", "value": "bangalore"}
        ],
        "group_by": [{"name": "product.brand"}]
    }

Implementing this involves a lot of joins, as product, brand, city, etc. are different entities. As you can see, this is quite complex, and it is essential to make sure that the query generated by the system is correct.

This is how we write a test for this case.

    name: query volume by brand in city bangalore
    query:
      table: combination
      columns:
        - name: product.brand
        - name: volume
          agg: sum
      filters:
        - name: customer.city
          cond: eq
          value: bangalore
      group_by:
        - name: product.brand
    sql: >
      SELECT brand.key AS brand, sum(combination.m00) AS volume
      FROM document AS combination
      LEFT OUTER JOIN document AS product
        ON combination.d01 = product.id AND product.document_type_id = 7
      LEFT OUTER JOIN document AS brand
        ON product.d00 = brand.id AND brand.document_type_id = 5
      LEFT OUTER JOIN document AS customer
        ON combination.d00 = customer.id AND customer.document_type_id = 8
      WHERE combination.document_type_id = 11
        AND customer.d00 = 9

As you can see, the test is written entirely in YAML, and a driver generates a test case for each entry in the YAML files.
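A minimal sketch of such a driver could look like the following. The names (`load_cases`, `run_cases`, the `query`/`sql` keys) are illustrative assumptions, not the actual framework code; in practice the cases would be fed to a test runner such as py.test via parametrization.

```python
import yaml

# An inline YAML document standing in for a test file on disk.
YAML_CASES = """
- name: brand volume query
  query:
    table: combination
    columns:
      - name: volume
        agg: sum
  sql: SELECT sum(combination.m00) AS volume FROM document AS combination
"""

def load_cases(text):
    """Parse a YAML document into a list of test-case dicts."""
    return yaml.safe_load(text)

def run_cases(cases, translate):
    """Run `translate` on each case's query; return names of failing cases.

    `translate` is the (hypothetical) function under test that turns a
    query dict into SQL. The comparison ignores whitespace differences.
    """
    failures = []
    for case in cases:
        got = translate(case["query"])
        if got.split() != case["sql"].split():
            failures.append(case["name"])
    return failures
```

With py.test, each loaded case would typically become one parametrized test function, so a failing YAML entry shows up as an individual test failure.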

In this case, the generated SQL is compared with the expected SQL, ignoring case and whitespace differences.
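One simple way to implement that comparison (a sketch, not necessarily the framework's actual code) is to normalize both strings before comparing:

```python
def normalize_sql(sql):
    """Lower-case the query and collapse all runs of whitespace
    (spaces, newlines, tabs) into single spaces."""
    return " ".join(sql.lower().split())

# normalize_sql("SELECT  brand.key\n  FROM document")
# → "select brand.key from document"
```

Two queries are then considered equal when their normalized forms match exactly.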

Example 2: Testing Data Pipelines

This example is more like an integration test. The function being tested expects the input files to be at specific locations and writes the results to another location.

Unlike the previous example, this test involves multiple steps, three in this case: it puts the test data in the required location, runs the necessary action, and verifies that the data in the file at the output location is as expected.

    - name: load data
      action: put_daily_dfs
      args:
        feature: raw_sellin
        data:
          columns: ["store", "variant", "date", "quantity", "amount"]
          rows:
            - ["S1", "A", "2018-01-01", 1, 10.0]
            - ["S1", "B", "2018-01-02", 2, 20.0]
            - ["S2", "A", "2018-01-03", 3, 30.0]
            - ["S2", "B", "2018-01-04", 4, 40.0]
            - ["S1", "A", "2018-01-05", 5, 50.0]
            - ["S1", "A", "2018-01-08", 11, 110.0]
            - ["S1", "B", "2018-01-09", 12, 120.0]
            - ["S2", "A", "2018-01-10", 13, 130.0]

    - name: compute sale_interval
      action: compute_sale_interval
      args:
        date: "2018-01-01"

    - name: verify sale_interval
      action: get_df
      args:
        path: sale_interval/weekly/2018-01-01.csv
      expected:
        columns: ["store", "variant", "date", "quantity", "amount", "next_sale_date"]
        rows:
          - ["S1", "A", "2018-01-01", 1, 10.0, "2018-01-05"]
          - ["S1", "B", "2018-01-02", 2, 20.0, "2018-01-09"]
          - ["S2", "A", "2018-01-03", 3, 30.0, "2018-01-10"]
          - ["S2", "B", "2018-01-04", 4, 40.0, "2018-01-15"]
          - ["S1", "A", "2018-01-05", 5, 50.0, "2018-01-08"]
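A driver for such multi-step tests can map each step's action name to a Python function and run the steps in order against shared state. The sketch below is an illustrative assumption about how this could work; the action registry, the `args` key, and the `put_daily_dfs` stub are hypothetical, not the framework's actual implementation.

```python
# Registry mapping action names (as used in the YAML) to Python functions.
ACTIONS = {}

def action(func):
    """Register a function as a test action under its own name."""
    ACTIONS[func.__name__] = func
    return func

@action
def put_daily_dfs(state, feature, data):
    """Stub action: store the tabular data under the feature name,
    converting each row into a column-name -> value dict."""
    columns, rows = data["columns"], data["rows"]
    state[feature] = [dict(zip(columns, row)) for row in rows]

def run_steps(steps):
    """Execute each step's named action with its arguments.

    `state` is shared across steps, so later steps (compute, verify)
    can see what earlier steps produced.
    """
    state = {}
    for step in steps:
        ACTIONS[step["action"]](state, **step.get("args", {}))
    return state
```

A verify-style action would then read its expected `columns`/`rows` from the step and assert that they match what the pipeline actually wrote.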

As you can see, the above test, completely specified in YAML, is straightforward to understand, and adding new test cases is very easy. Whenever we find a corner case that doesn't work as expected, we first add a test case to reproduce it. Having tests written declaratively in YAML makes that easy to do.


The declarative way of doing things is becoming very popular these days, especially in the space of infrastructure automation. Tools like Ansible and Terraform are great examples.

We believe there is untapped potential in applying those ideas to test automation, and our limited experience has been quite positive. We're working on turning these ideas into a generic framework. We'll have to wait and see what shape it takes.

We’re hiring! We are keen to work with enthusiastic engineers who are passionate about product development, software engineering and machine learning. Please visit our careers page for more details.