top of page

SQL Break: Random Assignment Using SQL

Writer's picture: Soo ReedSoo Reed

Updated: Oct 2, 2022

There will be times when someone at your organization asks you to randomly select a certain number of visitors so that they can run AB Testing on them. After they begin the testing, they also want you to track the performance of each group so you will want to retain the two groups in the database. How can you do this using SQL?


Scenario 1: using percentage


If you are needing to select the number of visitors by percentage, then this is the sql you can use. Please note that in the example below, I am extracting 10% of the visitors as Test Groups and 10% of the visitors as Control Groups:

SQL code:

CREATE TABLE database.test_groups AS (

WITH Random_assignment AS (

SELECT user_id , CASE WHEN pr <= 0.1 THEN ‘Test Group’     WHEN pr <= 0.2 THEN ‘Control Group’     ELSE NULL END AS category , COUNT(*)

FROM (

SELECT user_id —assign a random number and based on it, rank then in percent , PERCENT_RANK() OVER (ORDER BY RANDOM()) AS pr

FROM database.visitors –WHERE clause can go here )

–get 20% of the data WHERE pr <= 0.2 GROUP BY 1,2 )

SELECT * FROM Random_assignment )

Explanation:

In the above code, per user_id from visitors table, I am giving them percent_rank() but its order is RANDOM(); this ensures the random assignment but also gives us the ability to take a certain percentage of randomly assigned user_ids and assign them in either Test Group and Control Group. As you can see in the case statement, in the code I am assigning any user_ids that were assigned between 0~10% as Test Groups and then 10%~20% as Control Group (because I needed to pull 20% of the entire visitors and divide them by two to conduct testing. If your organization is wanting more %, you can adjust the code accordingly). 

More filter needed?

If test and control groups also need to be in specific criteria/category, you can add WHERE clause under “FROM database.visitors” to ensure that the filters are applied before assigning random percentage.

Preserving the Result:

If you run the code again without CREATE TABLE, then every time the assignment per user_id will randomly change. Since you need to preserve the result to report on each group’s performance, you want to use CREATE TABLE as shown in the code. This will ensure the static table is created that will show which user_id was assigned to either Test or Control Group for future queries. 


Scenario 2: using absolute number


If you are needing to select a specific number of visitors, then this is the sql you can use. Please note that in the example below, I am extracting 100 visitors as Test Groups and 100 visitors as Control Groups:


SQL Code:


CREATE TABLE database.test_groups AS (

WITH Random_assignment AS (

SELECT user_id , CASE WHEN rn <= 100 THEN ‘Newsletter’    WHEN rn <= 200 THEN ‘Control Group’    ELSE NULL END AS category , COUNT(*)

FROM (

SELECT user_id —assign a random number and based on it, rank them , ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn

FROM dbt_pbi.pbi_rsa_subscriptions prs –WHERE clause can go here  )

–get 200 visitors WHERE rn <= 200 GROUP BY 1,2 )

SELECT * FROM Random_assignment )

Explanation:

In the above code, per user_id from visitors table, I am assigning row_number() but its order is RANDOM(); this ensures the random assignment but also gives us the ability to take a specific number of randomly assigned user_ids and assign them in either Test Group and Control Group. As you can see in the case statement, in the code I am assigning any user_ids that were assigned between 1~100 as Test Groups and then 101~200 as Control Group (because I needed to pull 200 of the entire visitors and divide them by two to conduct testing. If your organization is wanting more or less #, you can adjust the code accordingly). 


More filter needed?

If test and control groups also need to be in specific criteria/category, you can add WHERE clause under “FROM database.visitors” to ensure that the filters are applied before assigning random percentage. 

Preserving the result:

If you run the code again without CREATE TABLE, then every time the assignment per user_id will randomly change. Since you need to preserve the result to report on each group’s performance, you want to use CREATE TABLE as shown in the code. This will ensure the static table is created that will show which user_id was assigned to either Test or Control Group for future queries. 


2 views0 comments

Recent Posts

See All

Commenti


©2022 by datasnooze. Proudly created with Wix.com

bottom of page