Selection bias in clinical sciences

From Rosetta Code

Revision as of 08:40, 28 September 2022

Selection bias in clinical sciences is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

In epidemiology, retrospective analyses have well-known limitations compared to prospective studies.

One such limitation is the occurrence of selection bias in how subjects are assigned to the treated and untreated groups about whom the data is collected. For example, a treatment may have been given only to persons who were less severely ill. Such subjects would then appear to have done better because of the treatment when the biased group is compared to those who did not receive the study treatment. Or, in a retrospective study, subjects may be placed in a particular study group using a method which is inadvertently biased by the outcome being measured. Creating a programming example of a simulation of such selection bias in the design of a retrospective study is the topic of this task.

The genuine, historical example approximated in this task is of a study done of persons who, over a course of 180 days, may or may not have become infected with Covid-19. Prior to becoming ill, these subjects may or may not have taken an available medication, which was usually taken in doses of 3, 6, or 9 mg daily. The historical study divided its subjects into three groups based on their cumulative dosage of the study medication:

  • Group UNTREATED is those who did not take the study medication at all before they got Covid-19, including those who exited the study period without Covid-19 and having never taken the study medication.
  • Group IRREGULAR is those who took the study medication but whose cumulative dose was less than 100 mg before they either came down with Covid-19 during the study or the study period ended.
  • Group REGULAR is those who took >= 100 mg of the study medication either before they came down with Covid-19 or by the end of the study without becoming infected during the study.
Assumptions for the study
  • Risk of getting Covid-19 infection for each subject was 0.1% per day, or about 18% summed over the 180 days of the study (roughly 16.5% once compounding is taken into account).
  • The probability of starting treatment medication for anyone not already taking it was 0.5% per day. For those who started medication, the chance of continuing the treatment was increased 50-fold to 25% each day, since most who started the medication continued to take it to some extent.
  • Study dose per day is chosen at random from 3, 6, or 9 mg. The daily cumulative dosage is used to determine the group the subject is in, unless a subject develops Covid-19. If a subject was diagnosed with Covid-19, their group at the time of that diagnosis is the group used for them in the statistical analysis.
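The cumulative figures implied by these assumptions can be checked with a little arithmetic. The following is a minimal sketch and not part of the task itself: the constant and variable names are illustrative, and the average-time estimate is only a rough approximation that assumes a mean dose of 6 mg and ignores overshoot of the 100 mg threshold.

```python
import math

DAILY_RISK = 0.001        # 0.1% per day, from the assumptions above
STUDY_DAYS = 180
P_REDOSE = 0.25           # daily chance of a dose once treatment has started
DOSES = (3, 6, 9)         # possible daily doses in mg
REGULAR_THRESHOLD = 100   # cumulative mg needed for group REGULAR

# Simple sum of daily risks versus the compounded chance of at least one infection.
simple_sum = DAILY_RISK * STUDY_DAYS
compounded = 1 - (1 - DAILY_RISK) ** STUDY_DAYS

# Fewest dosing days that can reach the threshold (9 mg every time).
min_dosing_days = math.ceil(REGULAR_THRESHOLD / max(DOSES))

# Rough average: doses needed at the mean dose, spread out at 25% per day.
mean_dose = sum(DOSES) / len(DOSES)
approx_doses = REGULAR_THRESHOLD / mean_dose
approx_days = approx_doses / P_REDOSE

print(f'Summed risk over the study:     {simple_sum:.1%}')
print(f'Compounded risk over the study: {compounded:.1%}')
print(f'Minimum dosing days to REGULAR: {min_dosing_days}')
print(f'Approximate calendar days to REGULAR after starting: {approx_days:.0f}')
```

Under these assumptions a subject typically needs on the order of two months of infection-free dosing before being counted as REGULAR, which is relevant to the note on the outcome.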
Task
  • Create a simulation of the subjects, keeping track of their medication dosages, group membership, and Covid-19 status during the study.
  • Use at least 1000 subjects in the simulation over the 180 days (historically, the study size was 80,000).
  • The statistic to use is the Kruskal-Wallis H test for the analysis of multiple groups, with the boolean study outcome (whether the subject got Covid-19 during the study period) analyzed against group category.
  • You should get a statistical result highly favoring the REGULAR group.
Stretch task
  • Show monthly outcomes.


A note regarding outcome: by simulation design, all subjects must have an IDENTICAL risk of developing Covid-19, namely 0.1 percent (p = 0.001) per day. Because of this design, any statistical differences between the groups CANNOT come from an influence of the treatment on that risk, but must come from some other feature of the study design.
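One way to see this is to count infections per person-day spent in each group, rather than per subject in their final group. The sketch below is an illustration under the assumptions listed above, not part of the task: the function name `person_time_rates`, the seed, and the subject count are hypothetical choices, and the standard-library `random` module is used in place of NumPy. A subject can only reach REGULAR by staying uninfected long enough to accumulate 100 mg, an effect sometimes described as immortal time bias.

```python
import random

DOSE_FOR_REGULAR = 100  # mg, threshold from the task description


def person_time_rates(num_subjects=5000, duration=180, seed=1):
    ''' Return infections per 1000 person-days, keyed by group at time of exposure. '''
    rng = random.Random(seed)
    person_days = {'UNTREATED': 0, 'IRREGULAR': 0, 'REGULAR': 0}
    infections = {'UNTREATED': 0, 'IRREGULAR': 0, 'REGULAR': 0}
    for _ in range(num_subjects):
        cum_dose = 0
        for _day in range(duration):
            # Group membership on this day, before anything happens.
            if cum_dose == 0:
                group = 'UNTREATED'
            elif cum_dose < DOSE_FOR_REGULAR:
                group = 'IRREGULAR'
            else:
                group = 'REGULAR'
            person_days[group] += 1
            # Identical daily risk for everyone, whatever their group.
            if rng.random() < 0.001:
                infections[group] += 1
                break  # follow-up for this subject ends at diagnosis
            # Otherwise the subject may start (0.5%) or continue (25%) dosing.
            p_dose = 0.005 if cum_dose == 0 else 0.25
            if rng.random() < p_dose:
                cum_dose += rng.choice((3, 6, 9))
    return {g: 1000 * infections[g] / person_days[g] for g in person_days}


rates = person_time_rates()
for group, rate in rates.items():
    print(f'{group:9s}: {rate:.2f} infections per 1000 person-days')
```

Measured this way, all three groups should show roughly one infection per 1000 person-days, up to sampling noise, which suggests that the per-subject differences come from how subjects are sorted into groups rather than from the medication.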

See also





Python

''' Rosetta code rosettacode.org/wiki/Study_Bias_in_Clinical_Sciences '''


from random import randrange
from numpy.random import rand
from scipy.stats import kruskal

UNTREATED = 0
IRREGULAR = 1
REGULAR = 2
DOSE_FOR_REGULAR = 100


class Subject:
    ''' A subject for the study '''

    def __init__(self):
        self.cum_dose = 0.0
        self.category = UNTREATED
        self.had_covid = False
        self.update_count = 0

    def update(self, p_covid=0.001, p_starting_treatment=0.005, p_redose=0.25, drange=(3, 10, 3)):
        ''' Daily update on the subject to check for infection and randomly dose. '''
        # Once infected, dose and category stay frozen at their values on the day of diagnosis.
        if not self.had_covid:
            if rand() < p_covid:
                self.had_covid = True
            elif (self.cum_dose == 0 and rand() < p_starting_treatment) or\
                 (self.cum_dose > 0 and rand() < p_redose):
                # randrange(3, 10, 3) picks one of 3, 6 or 9 mg
                self.cum_dose += randrange(*drange)
                self.categorize()
        self.update_count += 1

    def categorize(self):
        ''' Set treatment category based on cumulative treatment taken. '''
        self.category = UNTREATED if self.cum_dose == 0 else REGULAR if\
            self.cum_dose >= DOSE_FOR_REGULAR else IRREGULAR
        return self.category


def run_study(num_subjects=1000, duration=180, interval=30):
    ''' Run the study on a population of `num_subjects` for `duration` days, reporting every `interval` days. '''
    population = [Subject() for _ in range(num_subjects)]
    unt, unt_covid, irr, irr_covid, reg, reg_covid = 0, 0, 0, 0, 0, 0
    print(f'Total subjects: {num_subjects:,}')
    for day in range(duration):
        for subj in population:
            subj.update()

        if (day + 1) % interval == 0:
            print(f'\nDay {day + 1}:')
            unt = sum(s.category == UNTREATED for s in population)
            unt_covid = sum(s.category ==
                            UNTREATED and s.had_covid for s in population)
            print(f'Untreated: N = {unt}, with infection = {unt_covid}')
            irr = sum(s.category == IRREGULAR for s in population)
            irr_covid = sum(s.category ==
                            IRREGULAR and s.had_covid for s in population)
            print(f'Irregular Use: N = {irr}, with infection = {irr_covid}')
            reg = sum(s.category == REGULAR for s in population)
            reg_covid = sum(s.category ==
                            REGULAR and s.had_covid for s in population)
            print(f'Regular Use: N = {reg}, with infection = {reg_covid}')

        if day == duration // 2 - 1:
            print('\nAt midpoint, Infection case percentages are:')
            print('  Untreated : ', 100 * unt_covid / unt)
            print('  Irregulars: ', 100 * irr_covid / irr)
            print('  Regulars  : ', 100 * reg_covid / reg)

    print('\nAt study end, Infection case percentages are:')
    print(f'  Untreated : {100 * unt_covid / unt} of group size of {unt}')
    print(f'  Irregulars: {100 * irr_covid / irr} of group size of {irr}')
    print(f'  Regulars  : {100 * reg_covid / reg} of group size of {reg}')
    untreated = [
        s.had_covid for s in population if s.category == UNTREATED]
    irregular = [
        s.had_covid for s in population if s.category == IRREGULAR]
    regular = [s.had_covid for s in population if s.category == REGULAR]
    print('\nFinal statistics: ', kruskal(untreated, irregular, regular))


run_study()
Output:
Total subjects: 1,000

Day 30:
Untreated: N = 872, with infection = 25
Irregular Use: N = 128, with infection = 2
Regular Use: N = 0, with infection = 0

Day 60:
Untreated: N = 755, with infection = 55
Irregular Use: N = 222, with infection = 8
Regular Use: N = 23, with infection = 1

Day 90:
Untreated: N = 671, with infection = 70
Irregular Use: N = 219, with infection = 13
Regular Use: N = 110, with infection = 4

At midpoint, Infection case percentages are:
  Untreated :  10.432190760059612
  Irregulars:  5.936073059360731
  Regulars  :  3.6363636363636362

Day 120:
Untreated: N = 600, with infection = 88
Irregular Use: N = 189, with infection = 17
Regular Use: N = 211, with infection = 8

Day 150:
Untreated: N = 514, with infection = 108
Irregular Use: N = 194, with infection = 21
Regular Use: N = 292, with infection = 16

Day 180:
Untreated: N = 447, with infection = 119
Irregular Use: N = 189, with infection = 26
Regular Use: N = 364, with infection = 26

At study end, Infection case percentages are:
  Untreated : 26.62192393736018 of group size of 447
  Irregulars: 13.756613756613756 of group size of 189
  Regulars  : 7.142857142857143 of group size of 364

Final statistics:  KruskalResult(statistic=55.48204323818349, pvalue=8.95833684545873e-13)
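
As a cross-check, the final statistic can be recomputed from the Day 180 counts alone. The sketch below is a plain-Python Kruskal-Wallis H with the usual tie correction, written for illustration only: the names `kruskal_h` and `outcomes` are hypothetical, and this is not the SciPy implementation. For three groups the chi-squared reference distribution has 2 degrees of freedom, whose upper tail is exp(-H/2).

```python
import math


def kruskal_h(*groups):
    ''' Tie-corrected Kruskal-Wallis H statistic for any number of groups. '''
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        # Find the run of equal values pooled[i:j]; all share one average rank.
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sums = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    # The tie correction is undefined if every observation is identical.
    return h / (1 - tie_term / (n ** 3 - n))


def outcomes(size, infected):
    ''' Boolean outcome list for a group of `size` with `infected` cases. '''
    return [True] * infected + [False] * (size - infected)


# Group sizes and infection counts taken from the Day 180 output above.
untreated = outcomes(447, 119)
irregular = outcomes(189, 26)
regular = outcomes(364, 26)

h = kruskal_h(untreated, irregular, regular)
p = math.exp(-h / 2)  # chi-squared survival function for 2 degrees of freedom
print(f'H = {h:.3f}, p = {p:.3e}')
```

The result should agree with the statistic and p-value reported in the output above to within rounding.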