<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blogs | Rajesh Majumder blog</title><link>https://rajeshmajumderblog.netlify.app/blog/</link><atom:link href="https://rajeshmajumderblog.netlify.app/blog/index.xml" rel="self" type="application/rss+xml"/><description>Blogs</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 13 Jul 2025 00:00:00 +0000</lastBuildDate><image><url>https://rajeshmajumderblog.netlify.app/media/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url><title>Blogs</title><link>https://rajeshmajumderblog.netlify.app/blog/</link></image><item><title>Loki, the Tesseract, and the Secret of the Fourth Dimension</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_xiii/</link><pubDate>Sun, 13 Jul 2025 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_xiii/</guid><description>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#the-trickster-and-the-cube" id="toc-the-trickster-and-the-cube">The Trickster and the Cube&lt;/a>&lt;/li>
&lt;li>&lt;a href="#waitwhat-even-is-a-tesseract" id="toc-waitwhat-even-is-a-tesseract">Wait—What Even Is a Tesseract?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#visualizing-the-impossible" id="toc-visualizing-the-impossible">Visualizing the Impossible&lt;/a>&lt;/li>
&lt;li>&lt;a href="#real-life-analogies-so-your-brain-doesnt-explode" id="toc-real-life-analogies-so-your-brain-doesnt-explode">Real-Life Analogies (So Your Brain Doesn’t Explode)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#physics-agrees" id="toc-physics-agrees">Physics Agrees&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-loki-and-the-tesseract-make-sense" id="toc-why-loki-and-the-tesseract-make-sense">Why Loki and the Tesseract Make Sense&lt;/a>&lt;/li>
&lt;li>&lt;a href="#final-thought-mischief-math-and-meaning" id="toc-final-thought-mischief-math-and-meaning">Final Thought: Mischief, Math, and Meaning&lt;/a>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;p>&lt;em>“I am Loki of Asgard, and I am burdened with glorious purpose.”&lt;/em>&lt;/p>
&lt;p>That one line was enough to send chills down our spines—and New York into chaos.&lt;/p>
&lt;p>Loki. The God of Mischief. Prince of Asgard. Adopted son. Outcast. Trickster.&lt;/p>
&lt;p>He’s walked through fire, betrayal, redemption, timelines, and TVA offices. But there’s one thing that always seems to follow him like a shadow: &lt;strong>The Tesseract.&lt;/strong>&lt;/p>
&lt;p>That mysterious glowing blue cube… object of obsession, war, and wonder.&lt;/p>
&lt;p>But here’s the thing: the Tesseract wasn’t just a Marvel MacGuffin or a sci-fi light show. It represented something &lt;strong>far deeper&lt;/strong>—something most people miss, just like I did until now 😂.&lt;/p>
&lt;p>It wasn’t just about space. Or time. It was about &lt;strong>breaking the limits&lt;/strong> of what we understand. It was about the &lt;strong>fourth dimension&lt;/strong>.&lt;/p>
&lt;div id="the-trickster-and-the-cube" class="section level2">
&lt;h2>The Trickster and the Cube&lt;/h2>
&lt;p>Let’s rewind. In &lt;em>The Avengers&lt;/em> (2012), we see Loki land on Earth like a thunderbolt—armed with charm, chaos, and the Tesseract. He doesn’t just want to rule—he wants to bend reality. Slide between realms. Open doors no one else can.&lt;/p>
&lt;p>With the Tesseract in hand, he teleports across cities, escapes imprisonment, and whispers across dimensions. To most, it’s a weapon. To Loki? &lt;strong>It’s a key&lt;/strong>.&lt;/p>
&lt;p>Because what he’s always wanted isn’t power for power’s sake. It’s freedom. Freedom from being Thor’s shadow. From being Odin’s mistake. From being bound by one timeline, one fate, one reality.&lt;/p>
&lt;p>The Tesseract gave him that taste. Because the Tesseract is more than it seems.&lt;/p>
&lt;/div>
&lt;div id="waitwhat-even-is-a-tesseract" class="section level2">
&lt;h2>Wait—What Even Is a Tesseract?&lt;/h2>
&lt;p>The word &lt;em>tesseract&lt;/em> wasn’t made up by Marvel. It’s a real concept in geometry: a &lt;strong>4-dimensional cube&lt;/strong>.&lt;/p>
&lt;p>Sounds bonkers, right? Let’s walk through it step-by-step.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>0D: A Point&lt;/strong> – No length, no width—just a dot.&lt;/li>
&lt;li>&lt;strong>1D: A Line&lt;/strong> – Stretch that dot in one direction → a line.&lt;/li>
&lt;li>&lt;strong>2D: A Square&lt;/strong> – Drag the line sideways → now you have length + width.&lt;/li>
&lt;li>&lt;strong>3D: A Cube&lt;/strong> – Move that square up into space → you get depth.&lt;/li>
&lt;li>&lt;strong>4D: A Tesseract&lt;/strong> – Move a cube in a completely new direction → the fourth dimension.&lt;/li>
&lt;/ul>
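&lt;p>Notice the pattern in this construction: each step doubles the number of corners, so the counts run 1, 2, 4, 8, 16. A quick sketch in base R makes the counting concrete (illustrative code only):&lt;/p>
&lt;pre class="r">&lt;code># every vertex of the unit tesseract is a 4-tuple of 0s and 1s
vertices &amp;lt;- expand.grid(x = 0:1, y = 0:1, z = 0:1, w = 0:1)
nrow(vertices)  # 16 vertices
# two vertices share an edge when they differ in exactly one coordinate
sum(as.matrix(dist(vertices, method = &amp;quot;manhattan&amp;quot;)) == 1) / 2  # 32 edges&lt;/code>&lt;/pre>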
&lt;p>We can’t see that direction. We weren’t built to. But math says it’s there. Just like a flat cartoon can’t see “up,” but we can.&lt;/p>
&lt;p>&lt;img src="iii.gif" />&lt;/p>
&lt;/div>
&lt;div id="visualizing-the-impossible" class="section level2">
&lt;h2>Visualizing the Impossible&lt;/h2>
&lt;p>We cheat by looking at shadows:&lt;/p>
&lt;ul>
&lt;li>A cube casts a square shadow.&lt;/li>
&lt;li>A tesseract casts a cube-within-a-cube shadow that warps and rotates strangely.&lt;/li>
&lt;li>You’ve probably seen GIFs of a cube folding into itself—that’s a 3D shadow of a 4D shape. That’s the tesseract.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="kkk.gif" />&lt;/p>
&lt;p>A tesseract can be unfolded into eight cubes in 3D space, just as a cube can be unfolded into six squares in 2D space.&lt;/p>
&lt;p>&lt;img src="jjj.gif" />&lt;/p>
&lt;/div>
&lt;div id="real-life-analogies-so-your-brain-doesnt-explode" class="section level2">
&lt;h2>Real-Life Analogies (So Your Brain Doesn’t Explode)&lt;/h2>
&lt;ul>
&lt;li>&lt;p>&lt;strong>The Shadow World:&lt;/strong> Imagine a 2D world on paper. A 3D ball drops through it. The 2D beings see a circle that grows and shrinks—it’s magic to them. To us, it’s just physics.
Now flip it. A 4D object entering our space would look like a cube suddenly appearing, shifting, and vanishing.
&lt;em>Sound familiar?&lt;/em>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Storing in the 4th Dimension:&lt;/strong> Imagine your apartment’s too full. What if you could store your couch in a direction outside the 3 we know? That’s the kind of freedom a tesseract implies.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Reality Like a Flipbook:&lt;/strong> Our 3D world is one page. Flip the book, and you get a new world each time. The fourth dimension lets you flip pages—jump timelines—just like Loki did.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="physics-agrees" class="section level2">
&lt;h2>Physics Agrees&lt;/h2>
&lt;p>This isn’t just sci-fi:&lt;/p>
&lt;ul>
&lt;li>In Einstein’s relativity, time acts as a fourth dimension, weaving space and time into four-dimensional spacetime.&lt;/li>
&lt;li>String theory requires 10 dimensions (11 in M-theory).&lt;/li>
&lt;li>Mathematically, the tesseract is perfectly well defined—even if we can’t see it.&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="why-loki-and-the-tesseract-make-sense" class="section level2">
&lt;h2>Why Loki and the Tesseract Make Sense&lt;/h2>
&lt;p>Loki doesn’t want to rule Earth. He wants to escape being labeled: &lt;em>“adopted”&lt;/em>, &lt;em>“villain”&lt;/em>, &lt;em>“failure”&lt;/em>. He wants to rewrite himself. Be reborn. Free from fate.&lt;/p>
&lt;p>The Tesseract is his way out. His escape hatch from the script.&lt;/p>
&lt;/div>
&lt;div id="final-thought-mischief-math-and-meaning" class="section level2">
&lt;h2>Final Thought: Mischief, Math, and Meaning&lt;/h2>
&lt;p>Loki is all of us—bending rules, breaking molds, reaching for something bigger.&lt;/p>
&lt;p>&lt;img src="lll.gif" />&lt;/p>
&lt;p>The Tesseract is more than a cube. It’s a symbol of possibility. Of transcendence. Of the fourth dimension that lives not just in physics…&lt;/p>
&lt;p>…but maybe, also in hope.&lt;/p>
&lt;p>If your brain’s spinning like a tesseract-in-a-blender, good. That means you’ve seen a glimpse of something more—just like Loki did.&lt;/p>
&lt;p>Ohh, and lastly: thanks to Wikipedia for providing such excellent visual animations.&lt;/p>
&lt;/div></description></item><item><title>Clustering in R</title><link>https://rajeshmajumderblog.netlify.app/blog/external-project_ii/</link><pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/external-project_ii/</guid><description/></item><item><title>Comprehensive Summary of Some Most Applicable Machine Learning Techniques</title><link>https://rajeshmajumderblog.netlify.app/blog/external-project_iv/</link><pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/external-project_iv/</guid><description/></item><item><title>Concept of ANOVA and Its Sample Size Calculation Formula</title><link>https://rajeshmajumderblog.netlify.app/blog/external-project_iii/</link><pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/external-project_iii/</guid><description/></item><item><title>What is Small Area Estimation? Let's understand!</title><link>https://rajeshmajumderblog.netlify.app/blog/external-project_v/</link><pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/external-project_v/</guid><description/></item><item><title>How to convert a Shiny app to a standalone desktop application</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_xii/</link><pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_xii/</guid><description>
&lt;p>Last month I took on a consultancy project where I had to examine data on male breast cancer in Indian men and build an application for my clients—doctors, essentially—so they could use it in their future diagnoses. For basic statistical analysis I typically use R, and I typically use Python to build apps. But this time I ran into difficulties: I needed to deploy the app quickly, and the statistical tools used here weren’t available in Python (or, to be more honest, I didn’t know how to do that in Python).&lt;/p>
&lt;p>The only choice left was R Shiny, but this time the issue was that my clients wanted a desktop application. That was quite challenging for me, because I only knew how to use R Shiny to build a web application hosted on Shiny Server. I started looking for answers and came across a variety of them. Some articles recommend using Electron to turn a Shiny app into a desktop application; others suggest the RInno package. But I found them all incredibly challenging, and every approach I attempted was unsuccessful.&lt;/p>
&lt;p>I later discovered &lt;a href="http://blog.analytixware.com/2014/03/packaging-your-shiny-app-as-windows.html">“Packaging your Shiny App as a Windows desktop app”&lt;/a> on Analytixware’s site, which turned out to be a wonderfully simple and effective solution to my problem. Later, I came across &lt;a href="https://www.r-bloggers.com/2014/04/deploying-desktop-apps-with-r/">Lee Pang’s&lt;/a> article on R Bloggers, where he offers a similar solution. Here I’m going to explain the steps to convert a Shiny app into a standalone desktop app. Note that these steps are Windows-specific; they will not work on a Mac.&lt;/p>
&lt;div id="steps-to-convert-shiny-app-into-a-desktop-app" class="section level2">
&lt;h2>Steps to convert Shiny App into a Desktop app:&lt;/h2>
&lt;div id="step-1" class="section level3">
&lt;h3>Step 1&lt;/h3>
&lt;p>Create a folder in a location of your choice and give it the name you have chosen for your app.&lt;/p>
&lt;p>For example :&lt;/p>
&lt;ul>
&lt;li>&lt;strong>path:&lt;/strong> &lt;code>D:\Myapps\&lt;/code>&lt;/li>
&lt;li>&lt;strong>Create New folder:&lt;/strong> &lt;code>D:\Myapps\MyApp1\&lt;/code>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="step-2" class="section level3">
&lt;h3>Step 2&lt;/h3>
&lt;p>Download:&lt;/p>
&lt;ul>
&lt;li>R-Portable&lt;/li>
&lt;li>Google Chrome Portable&lt;/li>
&lt;/ul>
&lt;p>and install both into the &lt;code>MyApp1\&lt;/code> folder.&lt;/p>
&lt;p>So inside the &lt;code>MyApp1&lt;/code> folder there will be two more folders:&lt;/p>
&lt;p>&lt;code>D:\Myapps\MyApp1\GoogleChromePortable\&lt;/code>
&lt;code>D:\Myapps\MyApp1\R-Portable\&lt;/code>&lt;/p>
&lt;/div>
&lt;div id="step-3" class="section level3">
&lt;h3>Step 3&lt;/h3>
&lt;p>Download all dependencies (R packages) of your Shiny app into R-Portable’s package library.&lt;/p>
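&lt;p>One simple way to do this (a sketch; the package list below is a placeholder, so substitute your app’s actual dependencies) is to run a short install script with R-Portable’s own &lt;code>R.exe&lt;/code>, pointing &lt;code>lib&lt;/code> at its library folder:&lt;/p>
&lt;pre class="r">&lt;code># install your app&amp;#39;s dependencies into R-Portable&amp;#39;s own library
pkgs &amp;lt;- c(&amp;quot;shiny&amp;quot;)  # placeholder: list your app&amp;#39;s real dependencies here
install.packages(pkgs,
                 lib = &amp;quot;./R-Portable/App/R-Portable/library&amp;quot;,
                 repos = &amp;quot;https://cran.r-project.org&amp;quot;)&lt;/code>&lt;/pre>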
&lt;/div>
&lt;div id="step-4" class="section level3">
&lt;h3>Step 4&lt;/h3>
&lt;p>Create a folder called &lt;code>D:\Myapps\MyApp1\shiny\&lt;/code>. This is where the files for your Shiny app (e.g. &lt;code>ui.R&lt;/code>, &lt;code>server.R&lt;/code>, &lt;code>data.csv&lt;/code>, etc.) will reside.&lt;/p>
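&lt;p>As a minimal placeholder (hypothetical file contents, just to verify the setup end to end), the app could be as small as:&lt;/p>
&lt;pre class="r">&lt;code># D:\Myapps\MyApp1\shiny\ui.R
shinyUI(fluidPage(
  titlePanel(&amp;quot;MyApp1&amp;quot;),
  plotOutput(&amp;quot;hist&amp;quot;)
))

# D:\Myapps\MyApp1\shiny\server.R
shinyServer(function(input, output, session) {
  output$hist &amp;lt;- renderPlot(hist(rnorm(100)))
})&lt;/code>&lt;/pre>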
&lt;/div>
&lt;div id="step-5" class="section level3">
&lt;h3>Step 5&lt;/h3>
&lt;p>Add the following to &lt;code>server.R&lt;/code>, inside &lt;code>shinyServer(function(input, output, session) { ... })&lt;/code>. It is important to pass &lt;strong>session&lt;/strong> as the third argument! The code you need to add is:&lt;/p>
&lt;pre class="r">&lt;code>shinyServer(function(input, output, session) { ... }) {
# ... your other server code here
# close the R session when Chrome closes
session$onSessionEnded(function() {
stopApp()
q(&amp;quot;no&amp;quot;)
})
# ... your other server code here
}&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="step-6" class="section level3">
&lt;h3>Step 6&lt;/h3>
&lt;p>To launch the application you will need two scripts:&lt;/p>
&lt;ul>
&lt;li>&lt;code>D:\Myapps\MyApp1\runShinyApp.R&lt;/code>: an R script that loads the shiny package and launches your app via &lt;code>runApp()&lt;/code>&lt;/li>
&lt;li>A shell script (either a &lt;code>.bat&lt;/code> or &lt;code>.vbs&lt;/code> file) that invokes R-Portable&lt;/li>
&lt;/ul>
&lt;div id="step-6.1-create-runshinyapp.r" class="section level4">
&lt;h4>Step 6.1 Create runShinyApp.R:&lt;/h4>
&lt;p>Open a new Notepad file, paste the following lines of code, and save it as &lt;code>runShinyApp.R&lt;/code> in the &lt;code>D:\Myapps\MyApp1\&lt;/code> location.&lt;/p>
&lt;pre class="r">&lt;code>.libPaths(&amp;quot;./R-Portable/App/R-Portable/library&amp;quot;)
# the path to portable chrome
browser.path = file.path(getwd(),&amp;quot;GoogleChromePortable/GoogleChromePortable.exe&amp;quot;)
options(browser = browser.path)
shiny::runApp(&amp;quot;./Shiny/&amp;quot;,port=8888,launch.browser=TRUE)&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="step-6.2-create-shell-script-run.vbs-run.bat" class="section level4">
&lt;h4>Step 6.2 Create shell script (run.vbs / run.bat):&lt;/h4>
&lt;p>Again open a new Notepad file, paste the following code, and save it as &lt;code>run.vbs&lt;/code> or &lt;code>run.bat&lt;/code> in the &lt;code>D:\Myapps\MyApp1\&lt;/code> location (I created &lt;code>run.vbs&lt;/code>).&lt;/p>
&lt;pre>&lt;code>Randomize
CreateObject(&amp;quot;Wscript.Shell&amp;quot;).Run &amp;quot;R-Portable\App\R-Portable\bin\R.exe CMD BATCH --vanilla --slave runShinyApp.R&amp;quot; &amp;amp; &amp;quot; &amp;quot; &amp;amp; RND &amp;amp; &amp;quot; &amp;quot;, 0, False&lt;/code>&lt;/pre>
&lt;p>Now double-click &lt;code>run.vbs&lt;/code>, and your app will open in the portable Google Chrome browser.&lt;/p>
&lt;p>I highly recommend reading &lt;a href="http://blog.analytixware.com/2014/03/packaging-your-shiny-app-as-windows.html">Analytixware’s blog&lt;/a> and &lt;a href="https://www.r-bloggers.com/2014/04/deploying-desktop-apps-with-r/">Lee Pang’s&lt;/a> article on R Bloggers for a clearer understanding.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div></description></item><item><title>How to Add a Table of Contents in an R Markdown or Jupyter Notebook Document</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_xi/</link><pubDate>Sun, 25 Jun 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_xi/</guid><description>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#r-markdown" id="toc-r-markdown">R Markdown&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#pdf-document" id="toc-pdf-document">PDF Document&lt;/a>&lt;/li>
&lt;li>&lt;a href="#html-document" id="toc-html-document">HTML Document&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#jupyter-notebook" id="toc-jupyter-notebook">Jupyter Notebook&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#step-1-select-the-markdown-format" id="toc-step-1-select-the-markdown-format">Step 1: Select the Markdown Format&lt;/a>&lt;/li>
&lt;li>&lt;a href="#step-2-create-the-structure-of-the-table-of-content" id="toc-step-2-create-the-structure-of-the-table-of-content">Step 2: Create the Structure of the Table of Content&lt;/a>&lt;/li>
&lt;li>&lt;a href="#step-3-create-anchor-tags" id="toc-step-3-create-anchor-tags">Step 3: Create Anchor Tags&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;p>As a statistician or data analyst, writing reports and showcasing results is an essential part of the job. In a large data-analysis project there will inevitably be several sections and sub-sections to explain in your report. The same goes for a blog post or article. So, to give your reader a quick overview of what your post covers, one of the most useful features is a table of contents.&lt;/p>
&lt;p>Today I’m going to share how I create the table of contents for the blog posts and reports that I usually write using R Markdown &amp;amp; Jupyter Notebook.&lt;/p>
&lt;div id="r-markdown" class="section level1">
&lt;h1>R Markdown&lt;/h1>
&lt;p>For R Markdown adding a content list is very easy. You can add a table of contents (TOC) using the &lt;code>toc&lt;/code> option. For example:&lt;/p>
&lt;div id="pdf-document" class="section level2">
&lt;h2>PDF Document&lt;/h2>
&lt;pre class="r">&lt;code>---
title: &amp;quot;Habits&amp;quot;
output:
pdf_document:
toc: true
---&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="html-document" class="section level2">
&lt;h2>HTML Document&lt;/h2>
&lt;pre class="r">&lt;code>---
title: &amp;quot;Habits&amp;quot;
output:
html_document:
toc: true
---&lt;/code>&lt;/pre>
&lt;p>There are some other options to customize content list:&lt;/p>
&lt;pre class="r">&lt;code>---
title: &amp;quot;Habits&amp;quot;
output:
html_document:
toc: true # table of content true/yes
toc_depth: 3 # upto three depths of headings (specified by #, ## and ###)
number_sections: true # if you want number sections at each table header
theme: united # many options for theme, this one is my favorite.
highlight: tango # specifies the syntax highlighting style
css: my.css # you can add your custom css, should be in same folder
---&lt;/code>&lt;/pre>
&lt;p>For more details see: &lt;a href="https://bookdown.org/yihui/rmarkdown/html-document.html">https://bookdown.org/yihui/rmarkdown/html-document.html&lt;/a>&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="jupyter-notebook" class="section level1">
&lt;h1>Jupyter Notebook&lt;/h1>
&lt;p>The only tools required to include a table of contents in a Jupyter notebook are &lt;strong>anchor tags&lt;/strong> placed at the appropriate headings. The entries of the table of contents are then written as links that point to those anchors.&lt;/p>
&lt;p>Creating a table of contents in a Jupyter notebook is quite easy. We can add it using HTML anchors. See the following steps:&lt;/p>
&lt;div id="step-1-select-the-markdown-format" class="section level2">
&lt;h2>Step 1: Select the Markdown Format&lt;/h2>
&lt;p>Open the Jupyter notebook and select the markdown cell format instead of the code.&lt;/p>
&lt;p>&lt;img src="Contentpic1.jpeg" />&lt;/p>
&lt;/div>
&lt;div id="step-2-create-the-structure-of-the-table-of-content" class="section level2">
&lt;h2>Step 2: Create the Structure of the Table of Content&lt;/h2>
&lt;p>First, create a table of contents using the markdown in the notebook. Here, we also need to link the anchors that we will create in the next step. Use the following text and paste it into the markdown cell:&lt;/p>
&lt;pre class="r">&lt;code>## Table of Contents
* [Chapter 1](#chapter1)
* [Section 1.1](#section_1_1)
* [Sub Section 1.1.1](#sub_section_1_1_1)
* [Chapter 2](#chapter2)
* [Section 2.1](#section_2_1)
* [Sub Section 2.1.1](#sub_section_2_1_1)
* [Sub Section 2.1.2](#sub_section_2_1_2)
* [Section 2.2](#section_2_2)
* [Sub Section 2.2.1](#sub_section_2_2_1)
* [Sub Section 2.2.2](#sub_section_2_2_2)
* [Chapter 3](#chapter3)
* [Section 3.1](#section_3_1)
* [Sub Section 3.1.1](#sub_section_3_1_1)
* [Sub Section 3.1.2](#sub_section_3_1_2)
* [Section 3.2](#section_3_2)
* [Sub Section 3.2.1](#sub_section_3_2_1)
* [Sub Section 3.2.2](#sub_section_3_2_2)&lt;/code>&lt;/pre>
&lt;p>Press &lt;strong>Shift + Enter&lt;/strong> to run the previous lines in the Jupyter notebook. The table of contents should display like this:&lt;/p>
&lt;p>&lt;img src="Contentpic2.jpeg" />&lt;/p>
&lt;p>Note that the displayed name of each link is enclosed in brackets &lt;code>[]&lt;/code>, and the reference to the anchor tag is placed in parentheses, prefixed with a hash symbol &lt;code>#&lt;/code>.&lt;/p>
&lt;/div>
&lt;div id="step-3-create-anchor-tags" class="section level2">
&lt;h2>Step 3: Create Anchor Tags&lt;/h2>
&lt;p>Now, we will create the anchor tags in order to link with the table of contents. Create the chapters, sections, and subsections. Enter the following text in the next markdown cell:&lt;/p>
&lt;pre class="r">&lt;code>## Chapter 1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;chapter1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is chapter number 1
### Section 1.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;section_1_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is section 1.1
#### Section 1.1.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_1_1_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 1.1.1
## Chapter 2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;chapter2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is chapter number 2
### Section 2.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;section_2_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is section 2.1
#### Section 2.1.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_2_1_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 2.1.1
#### Section 2.1.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_2_1_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 2.1.2
### Section 2.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;section_2_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is section 2.2
#### Section 2.2.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_2_2_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 2.2.1
#### Section 2.2.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_2_2_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 2.2.2
## Chapter 3 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;chapter3&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is chapter number 3
### Section 3.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;section_3_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is section 3.1
#### Section 3.1.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_3_1_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 3.1.1
#### Section 3.1.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_3_1_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 3.1.2
### Section 3.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;section_3_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is section 3.2
#### Section 3.2.1 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_3_2_1&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 3.2.1
#### Section 3.2.2 &amp;lt;a class=&amp;quot;anchor&amp;quot; id=&amp;quot;sub_section_3_2_2&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;
This is sub section 3.2.2&lt;/code>&lt;/pre>
&lt;p>Press &lt;strong>Shift + Enter&lt;/strong> or run this cell to see the effects. The following output should display on your notebook:&lt;/p>
&lt;p>&lt;img src="Contentpic3.jpeg" />&lt;/p>
&lt;p>Here, you will notice that you can easily navigate to the desired section from the table of contents.&lt;/p>
&lt;p>Note that we can also add a table of contents in a Jupyter notebook using the pre-built toc2 extension: &lt;a href="https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/toc2/README.html">click here&lt;/a>.&lt;/p>
&lt;/div>
&lt;/div></description></item><item><title>Python Tutorials</title><link>https://rajeshmajumderblog.netlify.app/blog/external-project/</link><pubDate>Fri, 14 Apr 2023 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/external-project/</guid><description/></item><item><title>C3</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_x/</link><pubDate>Sun, 19 Jun 2022 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_x/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/htmlwidgets/htmlwidgets.js">&lt;/script>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/d3/d3.min.js">&lt;/script>
&lt;link href="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/c3/c3.min.css" rel="stylesheet" />
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/c3/c3.min.js">&lt;/script>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_x/index_files/c3-binding/c3.js">&lt;/script>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#instalation">Instalation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#usage">Usage&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#the-basics">The Basics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#piping">Piping&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#other-line-plots">Other Line Plots&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#spline">Spline&lt;/a>&lt;/li>
&lt;li>&lt;a href="#step">Step&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#bar-plots">Bar Plots&lt;/a>&lt;/li>
&lt;li>&lt;a href="#mixed-geometry-plots">Mixed Geometry Plots&lt;/a>&lt;/li>
&lt;li>&lt;a href="#secondary-y-axis">Secondary Y Axis&lt;/a>&lt;/li>
&lt;li>&lt;a href="#scatter-plot">Scatter Plot&lt;/a>&lt;/li>
&lt;li>&lt;a href="#pie-charts">Pie Charts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#donut-charts">Donut Charts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#gauge-charts">Gauge Charts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#grid-lines-annotation">Grid Lines &amp;amp; Annotation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sub-chart">Sub-chart&lt;/a>&lt;/li>
&lt;li>&lt;a href="#color-palette">Color Palette&lt;/a>&lt;/li>
&lt;li>&lt;a href="#point-size">Point Size&lt;/a>&lt;/li>
&lt;li>&lt;a href="#on-click">On Click&lt;/a>&lt;/li>
&lt;li>&lt;a href="#tooltips">Tooltips&lt;/a>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;p>The &lt;code>c3&lt;/code> package is a wrapper, or htmlwidget, for the C3 JavaScript charting library by Masayuki Tanaka. You will find this package useful if you want to create a chart using R and embed it in an R Markdown document or Shiny app.&lt;/p>
&lt;p>The &lt;code>C3&lt;/code> library is very versatile and includes a lot of options. Currently this package wraps most of the &lt;code>C3&lt;/code> options object; even with this limitation, a wide range of options is available.&lt;/p>
&lt;div id="instalation" class="section level2">
&lt;h2>Installation&lt;/h2>
&lt;pre class="r">&lt;code>install.packages(&amp;quot;c3&amp;quot;)
# or
devtools::install_github(&amp;quot;mrjoh3/c3&amp;quot;)&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="usage" class="section level2">
&lt;h2>Usage&lt;/h2>
&lt;p>The &lt;code>c3&lt;/code> package is intended to be as simple and lightweight as possible. As a starting point, the data input must be a &lt;code>data.frame&lt;/code> or &lt;code>tibble&lt;/code>, with two options for its structure.&lt;/p>
&lt;ul>
&lt;li>&lt;p>If a &lt;code>data.frame&lt;/code> without any options is passed all of the numeric columns will be plotted. This can be used in line and bar plots. Each column is a line or bar.&lt;/p>&lt;/li>
&lt;li>&lt;p>For more complex plots only 3 columns are used, those defined as &lt;code>x&lt;/code>, &lt;code>y&lt;/code> and &lt;code>group&lt;/code>. This requires a &lt;code>data.frame&lt;/code> with a vertical structure.&lt;/p>&lt;/li>
&lt;/ul>
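&lt;p>For example, a long-format &lt;code>data.frame&lt;/code> might be plotted like this (a sketch; the column names are arbitrary):&lt;/p>
&lt;pre class="r">&lt;code>library(c3)

# vertical (long) structure: one row per observation
df &amp;lt;- data.frame(date   = rep(seq(as.Date(&amp;quot;2014-01-01&amp;quot;), by = &amp;quot;month&amp;quot;, length.out = 12), 2),
                 value  = round(runif(24) * 10, 2),
                 series = rep(c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), each = 12))

c3(df, x = &amp;#39;date&amp;#39;, y = &amp;#39;value&amp;#39;, group = &amp;#39;series&amp;#39;)&lt;/code>&lt;/pre>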
&lt;div id="the-basics" class="section level3">
&lt;h3>The Basics&lt;/h3>
&lt;p>Where no options are supplied, a simple line plot is produced by default. Where no x-axis is defined, the plots are sequential. A &lt;code>Date&lt;/code> x-axis can be parsed with no additional settings if it is in the format &lt;code>%Y-%m-%d&lt;/code> (i.e. ‘2014-01-01’).&lt;/p>
&lt;pre class="r">&lt;code>library(c3)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Warning: package &amp;#39;c3&amp;#39; was built under R version 4.1.3&lt;/code>&lt;/pre>
&lt;pre>&lt;code>##
## Attaching package: &amp;#39;c3&amp;#39;&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## The following objects are masked from &amp;#39;package:graphics&amp;#39;:
##
## grid, legend&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>data = data.frame(a = abs(rnorm(20) * 10),
b = abs(rnorm(20) * 10),
date = seq(as.Date(&amp;quot;2011-01-01&amp;quot;), by = &amp;quot;month&amp;quot;, length.out = 20))
c3(data)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-1" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-1">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291},{"a":14.5937,"b":5.4323},{"a":0.9583,"b":0.2821},{"a":0.6722,"b":21.4296},{"a":11.7692,"b":6.8126},{"a":1.5531,"b":3.3644},{"a":2.4642,"b":9.9238},{"a":1.0663,"b":0.0159},{"a":0.4771,"b":6.8109},{"a":3.3604,"b":13.8445},{"a":6.2161,"b":2.2487},{"a":11.9811,"b":7.565},{"a":8.058,"b":7.3558},{"a":2.0488,"b":9.9505},{"a":10.6849,"b":7.8359},{"a":2.0984,"b":6.3121},{"a":3.058,"b":13.219},{"a":6.4198,"b":4.7071},{"a":9.9318,"b":12.6226},{"a":4.6906,"b":4.822}],"keys":{"value":["a","b"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="piping" class="section level3">
&lt;h3>Piping&lt;/h3>
&lt;p>The package also imports the magrittr pipe operator &lt;code>%&amp;gt;%&lt;/code> to simplify syntax.&lt;/p>
&lt;pre class="r">&lt;code>data%&amp;gt;%c3()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-2" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-2">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291},{"a":14.5937,"b":5.4323},{"a":0.9583,"b":0.2821},{"a":0.6722,"b":21.4296},{"a":11.7692,"b":6.8126},{"a":1.5531,"b":3.3644},{"a":2.4642,"b":9.9238},{"a":1.0663,"b":0.0159},{"a":0.4771,"b":6.8109},{"a":3.3604,"b":13.8445},{"a":6.2161,"b":2.2487},{"a":11.9811,"b":7.565},{"a":8.058,"b":7.3558},{"a":2.0488,"b":9.9505},{"a":10.6849,"b":7.8359},{"a":2.0984,"b":6.3121},{"a":3.058,"b":13.219},{"a":6.4198,"b":4.7071},{"a":9.9318,"b":12.6226},{"a":4.6906,"b":4.822}],"keys":{"value":["a","b"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;/div>
&lt;div id="other-line-plots" class="section level2">
&lt;h2>Other Line Plots&lt;/h2>
&lt;p>There are 5 different line plots available:&lt;/p>
&lt;ul>
&lt;li>line&lt;/li>
&lt;li>spline&lt;/li>
&lt;li>step&lt;/li>
&lt;li>area&lt;/li>
&lt;li>area-step&lt;/li>
&lt;/ul>
&lt;div id="spline" class="section level3">
&lt;h3>Spline&lt;/h3>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3() %&amp;gt;%
c3_line(&amp;#39;spline&amp;#39;)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-3" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-3">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291},{"a":14.5937,"b":5.4323},{"a":0.9583,"b":0.2821},{"a":0.6722,"b":21.4296},{"a":11.7692,"b":6.8126},{"a":1.5531,"b":3.3644},{"a":2.4642,"b":9.9238},{"a":1.0663,"b":0.0159},{"a":0.4771,"b":6.8109},{"a":3.3604,"b":13.8445},{"a":6.2161,"b":2.2487},{"a":11.9811,"b":7.565},{"a":8.058,"b":7.3558},{"a":2.0488,"b":9.9505},{"a":10.6849,"b":7.8359},{"a":2.0984,"b":6.3121},{"a":3.058,"b":13.219},{"a":6.4198,"b":4.7071},{"a":9.9318,"b":12.6226},{"a":4.6906,"b":4.822}],"keys":{"value":["a","b"]},"type":"spline"},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="step" class="section level3">
&lt;h3>Step&lt;/h3>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3(x = &amp;#39;date&amp;#39;) %&amp;gt;%
c3_line(&amp;#39;area-step&amp;#39;)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-4" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-4">{"x":{"data":{"x":"date","json":[{"date":"2011-01-01","a":18.8027,"b":2.1291},{"date":"2011-02-01","a":14.5937,"b":5.4323},{"date":"2011-03-01","a":0.9583,"b":0.2821},{"date":"2011-04-01","a":0.6722,"b":21.4296},{"date":"2011-05-01","a":11.7692,"b":6.8126},{"date":"2011-06-01","a":1.5531,"b":3.3644},{"date":"2011-07-01","a":2.4642,"b":9.9238},{"date":"2011-08-01","a":1.0663,"b":0.0159},{"date":"2011-09-01","a":0.4771,"b":6.8109},{"date":"2011-10-01","a":3.3604,"b":13.8445},{"date":"2011-11-01","a":6.2161,"b":2.2487},{"date":"2011-12-01","a":11.9811,"b":7.565},{"date":"2012-01-01","a":8.058,"b":7.3558},{"date":"2012-02-01","a":2.0488,"b":9.9505},{"date":"2012-03-01","a":10.6849,"b":7.8359},{"date":"2012-04-01","a":2.0984,"b":6.3121},{"date":"2012-05-01","a":3.058,"b":13.219},{"date":"2012-06-01","a":6.4198,"b":4.7071},{"date":"2012-07-01","a":9.9318,"b":12.6226},{"date":"2012-08-01","a":4.6906,"b":4.822}],"keys":{"value":["date","a","b"]},"type":"area-step"},"opts":{"x":"date","y":null,"types":{"a":"numeric","b":"numeric","date":"Date"}},"axis":{"x":{"label":"date","type":"timeseries"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;/div>
&lt;div id="bar-plots" class="section level2">
&lt;h2>Bar Plots&lt;/h2>
&lt;pre class="r">&lt;code>data[1:10, ] %&amp;gt;%
c3() %&amp;gt;%
c3_bar(stacked = TRUE,
rotate = TRUE)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-5" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-5">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291},{"a":14.5937,"b":5.4323},{"a":0.9583,"b":0.2821},{"a":0.6722,"b":21.4296},{"a":11.7692,"b":6.8126},{"a":1.5531,"b":3.3644},{"a":2.4642,"b":9.9238},{"a":1.0663,"b":0.0159},{"a":0.4771,"b":6.8109},{"a":3.3604,"b":13.8445}],"keys":{"value":["a","b"]},"type":"bar","groups":{"value":["a","b"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date"}},"axis":{"x":{"type":"category"},"rotated":true},"bar":{"zerobased":true,"width":{"ratio":0.6}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="mixed-geometry-plots" class="section level2">
&lt;h2>Mixed Geometry Plots&lt;/h2>
&lt;p>Mixed geometry currently works only with a wide (horizontal) &lt;code>data.frame&lt;/code>, where each numeric column is plotted as its own series.&lt;/p>
&lt;pre class="r">&lt;code>data$c &amp;lt;- abs(rnorm(20) *10)
data$d &amp;lt;- abs(rnorm(20) *10)
data %&amp;gt;%
c3() %&amp;gt;%
c3_mixedGeom(type = &amp;#39;bar&amp;#39;,
stacked = c(&amp;#39;b&amp;#39;,&amp;#39;d&amp;#39;),
types = list(a=&amp;#39;area&amp;#39;,
c=&amp;#39;spline&amp;#39;)
)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-6" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-6">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["a","b","c","d"]},"type":"bar","types":{"a":"area","c":"spline"},"groups":["b","d"]},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="secondary-y-axis" class="section level2">
&lt;h2>Secondary Y Axis&lt;/h2>
&lt;p>To use a secondary Y axis, columns must first be assigned to an axis, and the secondary axis then made visible with &lt;code>y2Axis()&lt;/code>.&lt;/p>
&lt;pre class="r">&lt;code>data %&amp;gt;%
dplyr::select(date, a, b) %&amp;gt;%
c3(x = &amp;#39;date&amp;#39;,
axes = list(a = &amp;#39;y&amp;#39;,
b = &amp;#39;y2&amp;#39;)) %&amp;gt;%
c3_mixedGeom(types = list(a = &amp;#39;line&amp;#39;,
b = &amp;#39;area&amp;#39;)) %&amp;gt;%
y2Axis()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-7" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-7">{"x":{"data":{"axes":{"a":"y","b":"y2"},"x":"date","json":[{"date":"2011-01-01","a":18.8027,"b":2.1291},{"date":"2011-02-01","a":14.5937,"b":5.4323},{"date":"2011-03-01","a":0.9583,"b":0.2821},{"date":"2011-04-01","a":0.6722,"b":21.4296},{"date":"2011-05-01","a":11.7692,"b":6.8126},{"date":"2011-06-01","a":1.5531,"b":3.3644},{"date":"2011-07-01","a":2.4642,"b":9.9238},{"date":"2011-08-01","a":1.0663,"b":0.0159},{"date":"2011-09-01","a":0.4771,"b":6.8109},{"date":"2011-10-01","a":3.3604,"b":13.8445},{"date":"2011-11-01","a":6.2161,"b":2.2487},{"date":"2011-12-01","a":11.9811,"b":7.565},{"date":"2012-01-01","a":8.058,"b":7.3558},{"date":"2012-02-01","a":2.0488,"b":9.9505},{"date":"2012-03-01","a":10.6849,"b":7.8359},{"date":"2012-04-01","a":2.0984,"b":6.3121},{"date":"2012-05-01","a":3.058,"b":13.219},{"date":"2012-06-01","a":6.4198,"b":4.7071},{"date":"2012-07-01","a":9.9318,"b":12.6226},{"date":"2012-08-01","a":4.6906,"b":4.822}],"keys":{"value":["date","a","b"]},"type":"line","types":{"a":"line","b":"area"}},"opts":{"x":"date","y":null,"types":{"date":"Date","a":"numeric","b":"numeric"}},"axis":{"x":{"label":"date","type":"timeseries"},"y2":{"show":true}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="scatter-plot" class="section level2">
&lt;h2>Scatter Plot&lt;/h2>
&lt;pre class="r">&lt;code>mtcars %&amp;gt;%
c3(x = &amp;#39;mpg&amp;#39;,
y = &amp;#39;wt&amp;#39;,
group = &amp;#39;cyl&amp;#39;) %&amp;gt;%
c3_scatter()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-8" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-8">{"x":{"data":{"json":[{"4":2.32,"6":2.62,"8":3.44,"4_x":22.8,"6_x":21,"8_x":18.7},{"4":3.19,"6":2.875,"8":3.57,"4_x":24.4,"6_x":21,"8_x":14.3},{"4":3.15,"6":3.215,"8":4.07,"4_x":22.8,"6_x":21.4,"8_x":16.4},{"4":2.2,"6":3.46,"8":3.73,"4_x":32.4,"6_x":18.1,"8_x":17.3},{"4":1.615,"6":3.44,"8":3.78,"4_x":30.4,"6_x":19.2,"8_x":15.2},{"4":1.835,"6":3.44,"8":5.25,"4_x":33.9,"6_x":17.8,"8_x":10.4},{"4":2.465,"6":2.77,"8":5.424,"4_x":21.5,"6_x":19.7,"8_x":10.4},{"4":1.935,"8":5.345,"4_x":27.3,"8_x":14.7},{"4":2.14,"8":3.52,"4_x":26,"8_x":15.5},{"4":1.513,"8":3.435,"4_x":30.4,"8_x":15.2},{"4":2.78,"8":3.84,"4_x":21.4,"8_x":13.3},{"8":3.845,"8_x":19.2},{"8":3.17,"8_x":15.8},{"8":3.57,"8_x":15}],"keys":{"value":["4","6","8","4_x","6_x","8_x"]},"xs":{"6":"6_x","4":"4_x","8":"8_x"},"type":"scatter"},"opts":{"x":"mpg","y":"wt","types":{"mpg":"numeric","cyl":"numeric","disp":"numeric","hp":"numeric","drat":"numeric","wt":"numeric","qsec":"numeric","vs":"numeric","am":"numeric","gear":"numeric","carb":"numeric"}},"axis":{"x":{"label":"mpg"},"y":{"label":"wt"}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="pie-charts" class="section level2">
&lt;h2>Pie Charts&lt;/h2>
&lt;pre class="r">&lt;code>data.frame(India = 45,
Bangladesh = 20,
SriLanka = 10) %&amp;gt;%
c3() %&amp;gt;%
c3_pie()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-9" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-9">{"x":{"data":{"json":[{"India":45,"Bangladesh":20,"SriLanka":10}],"keys":{"value":["India","Bangladesh","SriLanka"]},"type":"pie"},"opts":{"x":null,"y":null,"types":{"India":"numeric","Bangladesh":"numeric","SriLanka":"numeric"}},"pie":{"expand":true,"label":{"show":true,"threshold":null,"format":null}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="donut-charts" class="section level2">
&lt;h2>Donut Charts&lt;/h2>
&lt;pre class="r">&lt;code>data.frame(red = 82, green = 33, blue = 93) %&amp;gt;%
c3(colors = list(red = &amp;#39;red&amp;#39;,
green = &amp;#39;green&amp;#39;,
blue = &amp;#39;blue&amp;#39;)) %&amp;gt;%
c3_donut(title = &amp;#39;#d053ee&amp;#39;)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-10" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-10">{"x":{"data":{"colors":{"red":"red","green":"green","blue":"blue"},"json":[{"red":82,"green":33,"blue":93}],"keys":{"value":["red","green","blue"]},"type":"donut"},"opts":{"x":null,"y":null,"types":{"red":"numeric","green":"numeric","blue":"numeric"}},"donut":{"expand":true,"title":"#d053ee","label":{"show":true,"threshold":null,"format":null}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="gauge-charts" class="section level2">
&lt;h2>Gauge Charts&lt;/h2>
&lt;pre class="r">&lt;code>data.frame(data = 80) %&amp;gt;%
c3() %&amp;gt;%
c3_gauge()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-11" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-11">{"x":{"data":{"json":[{"data":80}],"keys":{"value":["data"]},"type":"gauge"},"opts":{"x":null,"y":null,"types":{"data":"numeric"}},"gauge":{"label":null,"min":0,"max":100,"units":null,"width":null},"color":{"pattern":["#FF0000","#F97600","#F6C600","#60B044"],"threshold":{"unit":"value","max":100,"values":[30,60,90,100]}},"size":{"height":null}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="grid-lines-annotation" class="section level2">
&lt;h2>Grid Lines &amp;amp; Annotation&lt;/h2>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3() %&amp;gt;%
grid(&amp;#39;y&amp;#39;) %&amp;gt;%
grid(&amp;#39;x&amp;#39;,
show = F,
lines = data.frame(value = c(3, 10),
text= c(&amp;#39;Line 1&amp;#39;,&amp;#39;Line 2&amp;#39;)))&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-12" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-12">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["a","b","c","d"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}},"grid":{"y":{"show":true},"x":{"show":false,"lines":{"value":[3,10],"text":["Line 1","Line 2"]}}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="sub-chart" class="section level2">
&lt;h2>Sub-chart&lt;/h2>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3(x = &amp;#39;date&amp;#39;) %&amp;gt;%
subchart()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-13" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-13">{"x":{"data":{"x":"date","json":[{"date":"2011-01-01","a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"date":"2011-02-01","a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"date":"2011-03-01","a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"date":"2011-04-01","a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"date":"2011-05-01","a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"date":"2011-06-01","a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"date":"2011-07-01","a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"date":"2011-08-01","a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"date":"2011-09-01","a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"date":"2011-10-01","a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"date":"2011-11-01","a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"date":"2011-12-01","a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"date":"2012-01-01","a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"date":"2012-02-01","a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"date":"2012-03-01","a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"date":"2012-04-01","a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"date":"2012-05-01","a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"date":"2012-06-01","a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"date":"2012-07-01","a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"date":"2012-08-01","a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["date","a","b","c","d"]}},"opts":{"x":"date","y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}},"axis":{"x":{"label":"date","type":"timeseries"}},"subchart":{"show":true,"size":{"height":20}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="color-palette" class="section level2">
&lt;h2>Color Palette&lt;/h2>
&lt;p>Plot color palettes can be switched to &lt;code>RColorBrewer&lt;/code> or &lt;code>viridis&lt;/code> palettes using &lt;code>RColorBrewer()&lt;/code> (an S3 method) or &lt;code>c3_viridis()&lt;/code> respectively.&lt;/p>
&lt;pre class="r">&lt;code>data.frame(sugar = 20,
fat = 45,
salt = 10,
vegetables = 60) %&amp;gt;%
c3() %&amp;gt;%
c3_pie() %&amp;gt;%
RColorBrewer()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-14" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-14">{"x":{"data":{"json":[{"sugar":20,"fat":45,"salt":10,"vegetables":60}],"keys":{"value":["sugar","fat","salt","vegetables"]},"type":"pie"},"opts":{"x":null,"y":null,"types":{"sugar":"numeric","fat":"numeric","salt":"numeric","vegetables":"numeric"}},"pie":{"expand":true,"label":{"show":true,"threshold":null,"format":null}},"color":{"pattern":["#D7191C","#FDAE61","#ABDDA4","#2B83BA"]}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;pre class="r">&lt;code>data.frame(sugar = 20,
fat = 45,
salt = 10,
vegetables = 60) %&amp;gt;%
c3() %&amp;gt;%
c3_pie() %&amp;gt;%
c3_viridis()&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-15" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-15">{"x":{"data":{"json":[{"sugar":20,"fat":45,"salt":10,"vegetables":60}],"keys":{"value":["sugar","fat","salt","vegetables"]},"type":"pie"},"opts":{"x":null,"y":null,"types":{"sugar":"numeric","fat":"numeric","salt":"numeric","vegetables":"numeric"}},"pie":{"expand":true,"label":{"show":true,"threshold":null,"format":null}},"color":{"pattern":["#440154","#31688E","#35B779","#FDE725"]}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="point-size" class="section level2">
&lt;h2>Point Size&lt;/h2>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3(x = &amp;#39;date&amp;#39;) %&amp;gt;%
point_options(r = 6,
expand.r = 2)&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-16" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-16">{"x":{"data":{"x":"date","json":[{"date":"2011-01-01","a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"date":"2011-02-01","a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"date":"2011-03-01","a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"date":"2011-04-01","a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"date":"2011-05-01","a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"date":"2011-06-01","a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"date":"2011-07-01","a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"date":"2011-08-01","a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"date":"2011-09-01","a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"date":"2011-10-01","a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"date":"2011-11-01","a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"date":"2011-12-01","a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"date":"2012-01-01","a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"date":"2012-02-01","a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"date":"2012-03-01","a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"date":"2012-04-01","a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"date":"2012-05-01","a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"date":"2012-06-01","a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"date":"2012-07-01","a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"date":"2012-08-01","a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["date","a","b","c","d"]}},"opts":{"x":"date","y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}},"axis":{"x":{"label":"date","type":"timeseries"}},"point":{"show":true,"r":6,"focus":{"expand":{"enabled":true,"r":12}},"select":{"r":24}}},"evals":[],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="on-click" class="section level2">
&lt;h2>On Click&lt;/h2>
&lt;p>The &lt;code>onclick&lt;/code>, &lt;code>onmouseover&lt;/code> and &lt;code>onmouseout&lt;/code> callbacks are all available via the &lt;code>c3&lt;/code> function. To use one, pass a JavaScript function as a character string to &lt;code>htmlwidgets::JS()&lt;/code>. See the &lt;code>C3.js&lt;/code> documentation and examples for details; the example below should be enough to get you started.&lt;/p>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3(onclick = htmlwidgets::JS(&amp;#39;function(d, element){console.log(d)}&amp;#39;))&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-17" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-17">{"x":{"data":{"onclick":"function(d, element){console.log(d)}","json":[{"a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["a","b","c","d"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}}},"evals":["data.onclick"],"jsHooks":[]}&lt;/script>
&lt;/div>
&lt;div id="tooltips" class="section level2">
&lt;h2>Tooltips&lt;/h2>
&lt;p>C3 tooltips are readily modified with JavaScript functions. For further detail see the &lt;code>C3.js&lt;/code> documentation, or for more advanced usage the &lt;code>C3.js&lt;/code> examples page.&lt;/p>
&lt;pre class="r">&lt;code>library(&amp;quot;htmlwidgets&amp;quot;)&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>data %&amp;gt;%
c3() %&amp;gt;%
tooltip(format = list(title = JS(&amp;quot;function (x) { return &amp;#39;Data &amp;#39; + x; }&amp;quot;),
name = JS(&amp;#39;function (name, ratio, id, index) { return name; }&amp;#39;),
value = JS(&amp;#39;function (value, ratio, id, index) { return ratio; }&amp;#39;)))&lt;/code>&lt;/pre>
&lt;div id="htmlwidget-18" style="width:672px;height:480px;" class="c3 html-widget">&lt;/div>
&lt;script type="application/json" data-for="htmlwidget-18">{"x":{"data":{"json":[{"a":18.8027,"b":2.1291,"c":10.4047,"d":9.3338},{"a":14.5937,"b":5.4323,"c":6.735,"d":0.8366},{"a":0.9583,"b":0.2821,"c":7.8497,"d":17.9726},{"a":0.6722,"b":21.4296,"c":6.3388,"d":7.1507},{"a":11.7692,"b":6.8126,"c":4.0074,"d":12.2171},{"a":1.5531,"b":3.3644,"c":3.2599,"d":3.5954},{"a":2.4642,"b":9.9238,"c":1.393,"d":8.1317},{"a":1.0663,"b":0.0159,"c":2.7268,"d":8.5254},{"a":0.4771,"b":6.8109,"c":8.0201,"d":16.6374},{"a":3.3604,"b":13.8445,"c":5.0411,"d":2.9788},{"a":6.2161,"b":2.2487,"c":8.981,"d":3.675},{"a":11.9811,"b":7.565,"c":5.2511,"d":6.3659},{"a":8.058,"b":7.3558,"c":1.5734,"d":11.8384},{"a":2.0488,"b":9.9505,"c":14.3538,"d":14.0533},{"a":10.6849,"b":7.8359,"c":1.1975,"d":12.9043},{"a":2.0984,"b":6.3121,"c":6.0863,"d":11.7934},{"a":3.058,"b":13.219,"c":11.8282,"d":9.567},{"a":6.4198,"b":4.7071,"c":7.055,"d":5.0574},{"a":9.9318,"b":12.6226,"c":0.0841,"d":0.6431},{"a":4.6906,"b":4.822,"c":4.2438,"d":6.5964}],"keys":{"value":["a","b","c","d"]}},"opts":{"x":null,"y":null,"types":{"a":"numeric","b":"numeric","date":"Date","c":"numeric","d":"numeric"}},"tooltip":{"show":true,"grouped":true,"format":{"title":"function (x) { return 'Data ' + x; }","name":"function (name, ratio, id, index) { return name; }","value":"function (value, ratio, id, index) { return ratio; }"}}},"evals":["tooltip.format.title","tooltip.format.name","tooltip.format.value"],"jsHooks":[]}&lt;/script>
&lt;/div></description></item><item><title>Basic introduction to SQL with MySQL</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_ix/</link><pubDate>Wed, 11 May 2022 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_ix/</guid><description>
&lt;p>This tutorial series was planned for undergraduate students who are completely new to SQL. Throughout the course we mainly used MySQL and MySQL queries, with R and Python used occasionally. Most of the tutorials in this lecture series are given here; the rest of the tutorials and course work will be available soon.&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1zT1o_37lN6wIJvmPmONzatcp4EhZjS-6/view?usp=sharig">Tutorial-I Introduction (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1pHy3ls6IK5r4QN8OAM3zPRsSd_Bm6lSn/view?usp=sharing">Tutorial-II Where can we write SQL Code ? (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="3" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1IarwG7R3c8JRNI0CkdktONjUfWLjUsRB/view?usp=sharing">Tutorial-III MYSQL Terminologies (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="4" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1xGj0GzsFWiU_GK8CkVcRn5pKcYxjW_2R/view?usp=sharing">Tutorial-IV Querying Basics (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="5" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1SYAb3Q7DYn3zgZZEHNlfC8zOmuvZ1x7K/view?usp=sharing">Tutorial-V Creating, Updating, Deleting (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="6" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1n4Rf0Hhir963VRzF1qdnBlb9rgamnI8i/view?usp=sharing">Tutorial-VI Data types (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="7" style="list-style-type: decimal">
&lt;li>&lt;a href="https://drive.google.com/file/d/1gGgx_37aO3wl1s32Lj0GZgyTZSLcrulM/view?usp=sharing">Tutorial-VII Operators and Functions (pdf)&lt;/a>&lt;/li>
&lt;/ol>&lt;/li>
&lt;/ul></description></item><item><title>Survival Analysis with R</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/</link><pubDate>Fri, 08 Apr 2022 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#examples-of-survival-data">Examples of Survival Data&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#a-the-lung-dataset">(A) The lung dataset&lt;/a>&lt;/li>
&lt;li>&lt;a href="#b-the-alloauto-dataset">(B) The alloauto dataset&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#some-basic-definations-which-are-used-in-survival-studies">Some Basic Definations which are used in Survival Studies:&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#cumulative-distribution">Cumulative Distribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#survival-function">Survival Function&lt;/a>&lt;/li>
&lt;li>&lt;a href="#failure-rate-or-hazard-rate">Failure rate or Hazard rate&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#what-is-censoring">What is Censoring ?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#censored-survival-data">Censored survival data&lt;/a>&lt;/li>
&lt;li>&lt;a href="#distribution-of-follow-up-time">Distribution of follow-up time&lt;/a>&lt;/li>
&lt;li>&lt;a href="#components-of-survival-data">Components of survival data&lt;/a>&lt;/li>
&lt;li>&lt;a href="#dealing-with-dates-in-r">Dealing with dates in R&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#formating-dates">Formating dates&lt;/a>&lt;/li>
&lt;li>&lt;a href="#calculating-survival-times">Calculating Survival Times&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#event-indicator-and-creating-survival-objects">Event indicator and Creating survival objects&lt;/a>&lt;/li>
&lt;li>&lt;a href="#estimating-survival-curves-and-survival-probabilities-with-kaplan-meier-method">Estimating Survival curves and Survival probabilities with Kaplan-Meier method&lt;/a>&lt;/li>
&lt;li>&lt;a href="#kaplan-meier-plots">Kaplan-Meier Plots&lt;/a>&lt;/li>
&lt;li>&lt;a href="#estimating-x-years-survival">Estimating x-years survival&lt;/a>&lt;/li>
&lt;li>&lt;a href="#testing-of-survival-curves">Testing of survival curves&lt;/a>&lt;/li>
&lt;li>&lt;a href="#coxs-proportional-hazard-regression-model">Cox’s Proportional Hazard Regression model&lt;/a>&lt;/li>
&lt;li>&lt;a href="#competing-risks">Competing Risks&lt;/a>&lt;/li>
&lt;li>&lt;a href="#cumulative-incidence-in-melanoma-data">Cumulative incidence in Melanoma data&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plot-the-cumulative-incidence-cif">Plot the Cumulative incidence (CIF)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plot-the-cumulative-incidence-cif-manually">Plot the Cumulative incidence (CIF) manually&lt;/a>&lt;/li>
&lt;li>&lt;a href="#compare-cumultive-incidence-between-groups">Compare cumultive incidence between groups&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plot-the-cumulative-incidence-cif-between-groups-manually">Plot the Cumulative incidence (CIF) between groups manually&lt;/a>&lt;/li>
&lt;li>&lt;a href="#competing-risks-regression">Competing risks regression&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#a-competing-risks-regression-in-melanoma-data--subdistribution-hazard-approach">(A) Competing risks regression in Melanoma data- subdistribution hazard approach&lt;/a>&lt;/li>
&lt;li>&lt;a href="#b-competing-risks-regression-in-melanoma-data--cause-specific-hazard-approach">(B) Competing risks regression in Melanoma data- Cause-specific hazard approach&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;ul>
&lt;li>&lt;strong>This class will provide theoretical as well as hands-on instruction and exercises covering basic survival analysis using R.&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Some References for further reading:&lt;/strong>
&lt;ul>
&lt;li>&lt;span style="text-decoration:underline">&lt;em>Clark, T., Bradhurn, M., Love, S., &amp;amp; Altman, D. (2003). Survival analysis part I: Basic concepts and first analysis.232-238.ISSN 0007-0920.&lt;/em>&lt;/span>&lt;/li>
&lt;li>&lt;span style="text-decoration:underline">&lt;em>Clark, T., Bradhurn, M., Love, S., &amp;amp; Altman, D. (2003). Survival analysis part II: Multivariate data analysis- an introduction to concepts and methods. British Journal of Cancer, 89(3),431-436.&lt;/em>&lt;/span>&lt;/li>
&lt;li>&lt;span style="text-decoration:underline">&lt;em>Clark, T., Bradhurn, M., Love, S., &amp;amp; Altman, D. (2003). Survival analysis part III: Multivariate data analysis- choosing a model and assessing its adequacy and fit. British Journal of Cancer, 89(4),605-11.&lt;/em>&lt;/span>&lt;/li>
&lt;li>&lt;span style="text-decoration:underline">&lt;em>Clark, T., Bradhurn, M., Love, S., &amp;amp; Altman, D. (2003). Survival analysis part IV: Farther concepts and methods in survival analysis. 781-786.ISSN 0007-0920.&lt;/em>&lt;/span>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;strong>It is assumed that readers are familiar with R programming. If not, get some basic R programming knowledge first and then come back.&lt;/strong>&lt;/li>
&lt;li>&lt;strong>In this tutorial I have used some random toy data sets as well as some of R’s inbuilt data sets.&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Some packages we’ll be using here include:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>lubridate&lt;/em>&lt;/li>
&lt;li>&lt;em>survival&lt;/em>&lt;/li>
&lt;li>&lt;em>cmprsk&lt;/em> and some others.&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Let’s start…&lt;/strong>&lt;/p>
&lt;div id="introduction" class="section level1">
&lt;h1>Introduction&lt;/h1>
&lt;p>Generally, Survival Analysis is a collection of statistical procedures for analysing data in which the outcome variable of interest is &lt;span style="text-decoration:underline">&lt;strong>time until an event occurs&lt;/strong>&lt;/span>. For example, suppose we want to study how diabetes rates differ between males and females. Here we would use basic categorical data analysis – comparing proportions (risks, rates, etc.) between groups using a chi-square or Fisher exact test, or logistic regression. Note that in this kind of analysis you implicitly assume the rates are constant over the period of the study, or within the groups you defined.&lt;/p>
&lt;p>But in longitudinal studies, where you track subjects from one time point (e.g., entry into a study, diagnosis, start of a treatment) until you observe some outcome event (e.g., death, onset of disease, relapse), it doesn’t make sense to assume the rates are constant. For example, the risk of death after heart surgery is highest immediately post-op, decreases as the patient recovers, and then rises slowly again as the patient ages. Likewise, the recurrence rates of different cancers vary greatly over time and depend on tumor genetics, treatment, and other environmental factors.&lt;/p>
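&lt;p>The distinction above can be made concrete with a minimal sketch: in R, time-to-event outcomes are usually bundled into a survival object with &lt;code>Surv()&lt;/code> from the &lt;em>survival&lt;/em> package (the follow-up times and event indicators below are made-up illustrative values).&lt;/p>

```r
# Minimal sketch: representing censored time-to-event data in R.
# The times and event indicators here are made-up illustrative values.
library(survival)

time   <- c(5, 12, 20, 31)  # follow-up time for four subjects
status <- c(1, 0, 1, 0)     # 1 = event observed, 0 = censored

# Surv() bundles time and status into a survival object;
# censored observations print with a trailing "+".
s <- Surv(time, status)
print(s)  # 5  12+ 20  31+
```

&lt;p>A censored time such as &lt;code>12+&lt;/code> means the subject was followed for 12 time units without the event occurring – exactly the information a constant-rate analysis would throw away.&lt;/p>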
&lt;div id="examples-for-time-to-event-data" class="section level4">
&lt;h4>Examples for Time-to-Event data&lt;/h4>
&lt;p>&lt;strong>Examples from cancer&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Time from surgery to death&lt;/li>
&lt;li>Time from start of treatment to progression&lt;/li>
&lt;li>Time from response to recurrence&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Examples from other fields&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Time from HIV infection to development of AIDS&lt;/li>
&lt;li>Time to heart attack&lt;/li>
&lt;li>Time to onset of substance abuse&lt;/li>
&lt;li>Time to initiation of sexual activity&lt;/li>
&lt;li>Time to machine malfunction&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Aliases for survival analysis&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Reliability analysis&lt;/li>
&lt;li>Duration analysis&lt;/li>
&lt;li>Event history analysis&lt;/li>
&lt;li>Time-to-event analysis&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;div id="examples-of-survival-data" class="section level1">
&lt;h1>Examples of Survival Data&lt;/h1>
&lt;p>In the following you can see what survival data look like. For this, I’ve used some of R’s built-in datasets, which are available in different packages.&lt;/p>
&lt;div id="a-the-lung-dataset" class="section level2">
&lt;h2>(A) The lung dataset&lt;/h2>
&lt;p>The &lt;em>lung&lt;/em> dataset is available inside the &lt;em>survival&lt;/em> package in R. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>time:&lt;/strong> Survival time in days&lt;/li>
&lt;li>&lt;strong>status:&lt;/strong> censoring status 1=censored, 2=dead&lt;/li>
&lt;li>&lt;strong>sex:&lt;/strong> Male=1, Female=2&lt;/li>
&lt;/ul>
&lt;p>I’m ignoring the other variables for simplicity. You can explore those by yourself.&lt;/p>
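A preview of these columns can be pulled directly in R; the survival package ships with the dataset:

```r
library(survival)   # provides the lung dataset

# First few rows of the three variables described above
head(lung[, c("time", "status", "age")])
```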
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:right;">
time
&lt;/th>
&lt;th style="text-align:right;">
status
&lt;/th>
&lt;th style="text-align:right;">
age
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:right;">
306
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
74
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
455
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
68
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1010
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
56
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
210
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
57
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
883
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
60
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1022
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
74
&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;div id="b-the-alloauto-dataset" class="section level2">
&lt;h2>(B) The alloauto dataset&lt;/h2>
&lt;p>Consider the &lt;em>alloauto&lt;/em> dataset in the &lt;em>KMsurv&lt;/em> package in R. It contains measurements on leukemia patients treated with allogeneic or autologous transplantation.&lt;/p>
&lt;p>&lt;span style="text-decoration:underline">&lt;em>Klein and Moeschberger (1997), Survival Analysis: Techniques for Censored and Truncated Data, Springer; Kardaun, Statistica Neerlandica 37 (1983), 103–126.&lt;/em>&lt;/span>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>time:&lt;/strong> &lt;em>Time to death or relapse, in months&lt;/em>&lt;/li>
&lt;li>&lt;strong>delta&lt;/strong> &lt;span class="math inline">\((\delta)\)&lt;/span>: 0 = alive without relapse (censored), 1 = dead or relapsed (event)&lt;/li>
&lt;li>&lt;strong>type:&lt;/strong> Transplant type, 1 = allogeneic, 2 = autologous&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:right;">
time
&lt;/th>
&lt;th style="text-align:right;">
type
&lt;/th>
&lt;th style="text-align:right;">
delta
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:right;">
0.030
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
0.493
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
0.855
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1.184
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1.283
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1.480
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;/div>
&lt;div id="some-basic-definations-which-are-used-in-survival-studies" class="section level1">
&lt;h1>Some Basic Definitions Used in Survival Studies&lt;/h1>
&lt;p>Let &lt;span class="math inline">\(T\)&lt;/span> be the failure time (survival time), a non-negative random variable, and let &lt;span class="math inline">\((t_1,t_2,t_3,...,t_n)\)&lt;/span> be the &lt;span class="math inline">\(n\)&lt;/span> observations.&lt;/p>
&lt;div id="cumulative-distribution" class="section level3">
&lt;h3>Cumulative Distribution&lt;/h3>
&lt;p>&lt;span class="math display">\[F(t)=P(T\leq t)\]&lt;/span>&lt;/p>
&lt;/div>
&lt;div id="survival-function" class="section level3">
&lt;h3>Survival Function&lt;/h3>
&lt;p>The survival function is the probability that an individual survives beyond time &lt;span class="math inline">\(t\)&lt;/span>; that is, the probability that the event of interest (e.g., death) has not yet occurred by time &lt;span class="math inline">\(t\)&lt;/span>. Writing &lt;span class="math inline">\(T\)&lt;/span> for the time of death, &lt;span class="math inline">\(P(T&amp;gt;t)\)&lt;/span> is the probability that the time of death is greater than some time &lt;span class="math inline">\(t\)&lt;/span>.&lt;/p>
&lt;p>So, &lt;span class="math display">\[\text{Survival Function}=S(t)= P(T&amp;gt;t)= 1-F(t)\]&lt;/span>&lt;/p>
&lt;p>&lt;strong>Characteristics of&lt;/strong> &lt;span class="math inline">\(S(t)\)&lt;/span>:&lt;/p>
&lt;ul>
&lt;li>&lt;span class="math inline">\(S(t)=1\:\: \text{,if}\:\;t&amp;lt;0\)&lt;/span>&lt;/li>
&lt;li>&lt;span class="math inline">\(S(\infty)=\lim_{t \to \infty} S(t)=0\)&lt;/span>&lt;/li>
&lt;li>&lt;span class="math inline">\(S(t)\: \text{is non increasing in}\:t\)&lt;/span>&lt;/li>
&lt;/ul>
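For a fully observed (uncensored) sample, these characteristics are easy to verify empirically: estimate the survival function as the proportion of observations exceeding each time point. A small sketch with toy times:

```r
# Toy uncensored survival times
t.obs <- c(2, 5, 7, 10, 12)

# Empirical survival function: proportion still surviving beyond t
S <- function(t) mean(t.obs > t)

s.vals <- sapply(0:13, S)
s.vals                    # starts at 1, ends at 0
all(diff(s.vals) <= 0)    # non-increasing in t, as required
```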
&lt;/div>
&lt;div id="failure-rate-or-hazard-rate" class="section level3">
&lt;h3>Failure rate or Hazard rate&lt;/h3>
&lt;p>The failure rate or hazard rate of an item at time point &lt;span class="math inline">\(t\)&lt;/span> is usually denoted by &lt;span class="math inline">\(\gamma(t)\)&lt;/span>, &lt;span class="math inline">\(r(t)\)&lt;/span> or &lt;span class="math inline">\(h(t)\)&lt;/span>. In fact, it is the instantaneous probability rate that an item functioning up to time point &lt;span class="math inline">\(t\)&lt;/span> will fail at that instant.&lt;/p>
&lt;p>&lt;span class="math display">\[\gamma(t)/h(t)=\lim_{{\Delta}t \to 0} \frac{P(t \leq T&amp;lt; t+{\Delta}t {\mid} T{\geq}t)}{{\Delta}t}\]&lt;/span>&lt;/p>
&lt;p>&lt;strong>This hazard function&lt;/strong>&lt;span class="math inline">\(\{h(t)\}\)&lt;/span> &lt;strong>can be written in term of&lt;/strong> &lt;span style="text-decoration:underline">&lt;em>Cumulative distribution function&lt;/em>&lt;/span> &lt;strong>&amp;amp;&lt;/strong> &lt;span style="text-decoration:underline">&lt;em>Survival function&lt;/em>&lt;/span>:&lt;/p>
&lt;p>&lt;span class="math display">\[\gamma(t)/h(t)= \frac{\frac{d}{dt}\{F(t)\}}{1-F(t)}= \frac{-\frac{d}{dt}\{S(t)\}}{S(t)}\]&lt;/span>&lt;/p>
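A quick numerical sanity check of this identity: for the exponential distribution the hazard \(f(t)/S(t)\) should come out constant and equal to the rate parameter.

```r
# Hazard of an exponential distribution, computed as f(t) / S(t)
rate <- 0.5
t <- seq(0.1, 5, by = 0.1)
h <- dexp(t, rate) / (1 - pexp(t, rate))
range(h)   # both ends equal the rate, 0.5: a constant hazard
```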
&lt;div id="types-of-hazard" class="section level4">
&lt;h4>Types of Hazard&lt;/h4>
&lt;p>The hazard function may increase, decrease, remain constant, or follow a more complicated pattern.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Hazard Nane&lt;/th>
&lt;th align="center">Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">Increasing Hazard(IFR)&lt;/td>
&lt;td align="center">Patients with acute leukemia who do not respond to treatment have an increasing hazard.&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Decreasing Hazard(DFR)&lt;/td>
&lt;td align="center">Risk of soldiers, wounded by bullets who undergo survey, The main danger is the operation itself and this danger decreases if the surgery is successful.&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Constant Hazard&lt;/td>
&lt;td align="center">The risk of healthy persons between 18 to 40 years of age whose main risk of death are accidents.&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Bathtub curve&lt;/td>
&lt;td align="center">Describes the process of human life. During an initial period, the risk is high(&lt;em>high Infant Mortality&lt;/em>). Subsequently, &lt;span class="math inline">\(\gamma(t)\)&lt;/span> stays approximately constant until a certain time, after which it increases because of were-out failures.&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Increasing &amp;amp; Decreasing Hazard&lt;/td>
&lt;td align="center">Patients with tuberculosis have risks that increase initially, then decrease after treatment.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;img src="WhatsApp%20Image%202022-05-07%20at%209.21.42%20AM.jpeg" />&lt;/p>
&lt;p>The &lt;strong>Kaplan-Meier&lt;/strong> curve illustrates the survival function. It’s a step function illustrating the cumulative survival probability over time. The curve is horizontal over periods where no event occurs, then drops vertically corresponding to a change in the survival function at each time an event occurs.&lt;/p>
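A sketch of how such a step curve can be drawn with the survival package (the survfit() function is covered in detail in later sections):

```r
library(survival)

# Overall Kaplan-Meier fit for the lung data, then the step-function plot
f <- survfit(Surv(time, status) ~ 1, data = lung)
plot(f, xlab = "Days", ylab = "Survival probability",
     main = "Kaplan-Meier curve for the lung data")
```

The dashed lines plot() adds by default are the pointwise 95% confidence bands.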
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div id="what-is-censoring" class="section level1">
&lt;h1>What is Censoring?&lt;/h1>
&lt;p>Censoring is a type of missing-data problem unique to survival analysis. It happens when you track a subject through the end of the study and the event never occurs. It can also happen when a subject drops out of the study for reasons other than death, or is otherwise lost to follow-up. The observation is censored in that you only know the individual survived up to the loss to follow-up, but you know nothing about survival after that.&lt;/p>
&lt;p>&lt;img src="WhatsApp%20Image%202022-05-07%20at%209.21.41%20AM.jpeg" />&lt;/p>
&lt;p>Depending on the direction from which the incompleteness in the observations comes, censoring is of three types:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Right Censoring&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Left Censoring&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Interval Censoring&lt;/strong>&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Type of Censoring&lt;/th>
&lt;th align="center">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">Right Censoring&lt;/td>
&lt;td align="center">Here the lifetime of an item is followed until some time at which the event (i.e., &lt;em>failure&lt;/em> or &lt;em>death&lt;/em>) is yet to occur; but the event takes no farther part in the study after the time. Example: Measuring the survival years of a patient with lung cancer; but he died in a car accident after &lt;span class="math inline">\(t\)&lt;/span> years.&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Left Censoring&lt;/td>
&lt;td align="center">This occurs when the event of interest has already taken place at the time of observation; but the exact time of occurence of the event is not known. Example: Infection with a sexually transmitted like HIV/AIDS.&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Interval Censoring&lt;/td>
&lt;td align="center">It reflects uncertainty as to the exact times the units failed within an interval. This type of data frequently comes from tests or situations where the objects of interest are not constantly monitored.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
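In the survival package, each of these censoring types can be encoded with the Surv() function; a minimal sketch with toy times:

```r
library(survival)

# Right censoring: observed time plus event indicator (1 = event, 0 = censored)
Surv(c(5, 8, 12), c(1, 0, 1))

# Left censoring: 0 marks subjects whose event happened before the observed time
Surv(c(3, 6), c(0, 1), type = "left")

# Interval censoring: the event is only known to lie between time and time2
Surv(time = c(2, 4), time2 = c(3, 7), type = "interval2")
```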
&lt;/div>
&lt;div id="censored-survival-data" class="section level1">
&lt;h1>Censored survival data&lt;/h1>
&lt;p>Let’s create a toy example for survival data in R.&lt;/p>
&lt;pre class="r">&lt;code>my.data = data.frame(id=c(1:13), # Patient id
time=c(3,2,3,5,1,0.5,4.5,3.3,2,3.6,1.4,5.0,4.7), # Study years
Status=factor(c(&amp;quot;E&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;E&amp;quot;,&amp;quot;E&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;E&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;E&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;C&amp;quot;)) # Censored or Event status
)
head(my.data)
# Visualizing the Data
SurvPlot(time = my.data$time,
status = my.data$Status,
C=&amp;quot;C&amp;quot;,E=&amp;quot;E&amp;quot;,
text.adjs = 0.7,point.cex = 3,
title = &amp;quot;Survival Plot for My Data&amp;quot;,
legend.posi = &amp;quot;bottomright&amp;quot;) # This is my created function &lt;/code>&lt;/pre>
&lt;pre>&lt;code>## id time Status
## 1 1 3.0 E
## 2 2 2.0 C
## 3 3 3.0 C
## 4 4 5.0 E
## 5 5 1.0 E
## 6 6 0.5 C&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-4-1.png" width="672" />&lt;/p>
&lt;p>In this example, how would we compute the proportion who are event-free at 4 years ?&lt;/p>
&lt;p>Subjects 4, 7, 12 &amp;amp; 13 were &lt;strong>event-free&lt;/strong> at 4 years. Subjects 1, 5, 9 &amp;amp; 11 had the &lt;strong>event before 4 years&lt;/strong>. Subjects 2, 3, 6, 8 and 10 were &lt;strong>censored before 4 years&lt;/strong>, so we don’t know whether or not they had the event by 4 years. How do we incorporate these subjects into our estimate?&lt;/p>
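One principled answer is the Kaplan-Meier estimator developed in the sections that follow; as a preview, it weights the censored subjects appropriately for us:

```r
library(survival)

# Re-create the toy data from above
my.data <- data.frame(
  id     = 1:13,
  time   = c(3, 2, 3, 5, 1, 0.5, 4.5, 3.3, 2, 3.6, 1.4, 5.0, 4.7),
  Status = c("E", "C", "C", "E", "E", "C", "C", "C", "E", "C", "E", "C", "C")
)

# Kaplan-Meier fit; Status == "E" is the event indicator
f <- survfit(Surv(time, Status == "E") ~ 1, data = my.data)
summary(f, times = 4)   # estimated event-free proportion at 4 years
```

Working the product out by hand gives \((11/12)(10/11)(9/10)(7/8)=21/32\approx 0.656\) event-free at 4 years.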
&lt;/div>
&lt;div id="distribution-of-follow-up-time" class="section level1">
&lt;h1>Distribution of follow-up time&lt;/h1>
&lt;p>Censored subjects still provide information, so they must be appropriately included in the analysis. The distribution of follow-up times is skewed and may differ between censored patients and those with events. Follow-up times are always positive.&lt;/p>
&lt;p>Let’s draw the Histogram and density plot(using Kernel density estimation technique) for &lt;strong>Censored&lt;/strong> and &lt;strong>Event&lt;/strong> from the above toy example.&lt;/p>
&lt;pre class="r">&lt;code># Histogram
pp=function(...){
with(subset(my.data,Status==&amp;quot;C&amp;quot;),hist(time,col= adjustcolor(&amp;quot;red&amp;quot;, alpha.f = 0.20)))
par(new=T)
with(subset(my.data,Status==&amp;quot;E&amp;quot;),hist(time,col=adjustcolor(&amp;quot;blue&amp;quot;, alpha.f = 0.20),axes = F))
}
par(mar=c(5.1,4.1,4.1,7))
pp()
legend(x=5.3,y=3,
legend = c(&amp;quot;Censor&amp;quot;,&amp;quot;Event&amp;quot;),
fill=c(adjustcolor(&amp;quot;red&amp;quot;, alpha.f = 0.20),
adjustcolor(&amp;quot;blue&amp;quot;, alpha.f = 0.20)),
title = &amp;quot;Survival&amp;quot;,
xpd=T)
par(mar=c(5.1,4.1,4.1,2.1),xpd=NA)
# Density Plot
density.plot=function(...){
cens=density(my.data$time[my.data$Status==&amp;quot;C&amp;quot;]) # fitting Kernel density for Censored data
event=density(my.data$time[my.data$Status==&amp;quot;E&amp;quot;]) # fitting Kernel density for Event data
plot(cens,main=&amp;quot;&amp;quot;,xlab = &amp;quot;&amp;quot;,ylab = &amp;quot;&amp;quot;,...)
par(new=T)
polygon(cens,col = adjustcolor(&amp;quot;red&amp;quot;, alpha.f = 0.20))
par(new=T)
plot(event,main = &amp;quot;Density Plots of time for Censored &amp;amp; Event&amp;quot;,
xlab=&amp;quot;time&amp;quot;,
ylab=&amp;quot;Frequency&amp;quot;,
axes=F)
par(new=T)
polygon(event,col=adjustcolor(&amp;quot;blue&amp;quot;, alpha.f = 0.20))
}
par(mar=c(5.1,4.1,4.1,7))
density.plot()
legend(x=8.2,y=0.25,
legend = c(&amp;quot;Censor&amp;quot;,&amp;quot;Event&amp;quot;),
fill=c(adjustcolor(&amp;quot;red&amp;quot;, alpha.f = 0.20),
adjustcolor(&amp;quot;blue&amp;quot;, alpha.f = 0.20)),
title = &amp;quot;Survival&amp;quot;,
xpd=T)
par(mar=c(5.1,4.1,4.1,2.1),xpd=NA)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-6-1.png" width="672" />&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-6-2.png" width="672" />&lt;/p>
&lt;p>Note that here I’ve used the &lt;em>adjustcolor()&lt;/em> function to add transparency to the base R colors.&lt;/p>
&lt;/div>
&lt;div id="components-of-survival-data" class="section level1">
&lt;h1>Components of survival data&lt;/h1>
&lt;p>For subject &lt;span class="math inline">\(i\)&lt;/span>:&lt;/p>
&lt;ul>
&lt;li>Event time &lt;span class="math inline">\(T_i\)&lt;/span>&lt;/li>
&lt;li>Censoring time &lt;span class="math inline">\(C_i\)&lt;/span>&lt;/li>
&lt;li>Event indicator &lt;span class="math inline">\(\delta_i\)&lt;/span>:
&lt;ul>
&lt;li>1 if event observed (i.e., &lt;span class="math inline">\(T_i\le C_i\)&lt;/span>)&lt;/li>
&lt;li>0 if censored (i.e., &lt;span class="math inline">\(T_i&amp;gt; C_i\)&lt;/span>)&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>Observed time &lt;span class="math inline">\(Y_i=\min(T_i,C_i)\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>The observed times and an event indicator are provided in the &lt;em>lung&lt;/em> data.&lt;/p>
&lt;ul>
&lt;li>time: Survival time in days &lt;span class="math inline">\((Y_i)\)&lt;/span>&lt;/li>
&lt;li>status: censoring status 1=censored, 2=dead &lt;span class="math inline">\((\delta_i)\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:right;">
inst
&lt;/th>
&lt;th style="text-align:right;">
time
&lt;/th>
&lt;th style="text-align:right;">
status
&lt;/th>
&lt;th style="text-align:right;">
age
&lt;/th>
&lt;th style="text-align:right;">
sex
&lt;/th>
&lt;th style="text-align:right;">
ph.ecog
&lt;/th>
&lt;th style="text-align:right;">
ph.karno
&lt;/th>
&lt;th style="text-align:right;">
pat.karno
&lt;/th>
&lt;th style="text-align:right;">
meal.cal
&lt;/th>
&lt;th style="text-align:right;">
wt.loss
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
306
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
74
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
100
&lt;/td>
&lt;td style="text-align:right;">
1175
&lt;/td>
&lt;td style="text-align:right;">
NA
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
455
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
68
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
1225
&lt;/td>
&lt;td style="text-align:right;">
15
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
1010
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
56
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
NA
&lt;/td>
&lt;td style="text-align:right;">
15
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
5
&lt;/td>
&lt;td style="text-align:right;">
210
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
57
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
60
&lt;/td>
&lt;td style="text-align:right;">
1150
&lt;/td>
&lt;td style="text-align:right;">
11
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
883
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
60
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;td style="text-align:right;">
100
&lt;/td>
&lt;td style="text-align:right;">
90
&lt;/td>
&lt;td style="text-align:right;">
NA
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
12
&lt;/td>
&lt;td style="text-align:right;">
1022
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
74
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
50
&lt;/td>
&lt;td style="text-align:right;">
80
&lt;/td>
&lt;td style="text-align:right;">
513
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;div id="dealing-with-dates-in-r" class="section level1">
&lt;h1>Dealing with dates in R&lt;/h1>
&lt;p>Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.&lt;/p>
&lt;p>Let’s create a small example dataset with &lt;em>‘start.date’&lt;/em> for surgery date and &lt;em>‘last.followup.date’&lt;/em> for the last follow-up date.&lt;/p>
&lt;pre class="r">&lt;code>date_ex=data.frame(start.date=c(&amp;quot;2007-06-22&amp;quot;,&amp;quot;2004-02-12&amp;quot;,&amp;quot;2010-11-03&amp;quot;),
last.followup.date=c(&amp;quot;2017-04-15&amp;quot;,&amp;quot;2018-07-04&amp;quot;,&amp;quot;2016-10-31&amp;quot;))
str(date_ex)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## &amp;#39;data.frame&amp;#39;: 3 obs. of 2 variables:
## $ start.date : chr &amp;quot;2007-06-22&amp;quot; &amp;quot;2004-02-12&amp;quot; &amp;quot;2010-11-03&amp;quot;
## $ last.followup.date: chr &amp;quot;2017-04-15&amp;quot; &amp;quot;2018-07-04&amp;quot; &amp;quot;2016-10-31&amp;quot;&lt;/code>&lt;/pre>
&lt;p>We see these are both character variables, which will often be the case, but we need them formatted as dates. Here I show two methods: one using the base R &lt;em>as.Date()&lt;/em> function and another using the &lt;em>lubridate&lt;/em> package.&lt;/p>
&lt;div id="formating-dates" class="section level3">
&lt;h3>Formatting dates&lt;/h3>
&lt;pre class="r">&lt;code>#####################################
## ##
##--Using base function as.Date()--##
## ##
#####################################
date_ex$start.date=as.Date(date_ex$start.date,format = &amp;quot;%Y-%m-%d&amp;quot;)
date_ex$last.followup.date=as.Date(date_ex$last.followup.date,format = &amp;quot;%Y-%m-%d&amp;quot;)
date_ex
str(date_ex)
##################################################
## ##
##--Using ymd() func. inside lubridate package--##
## ##
##################################################
library(lubridate)
date_ex$start.date=ymd(date_ex$start.date)
date_ex$last.followup.date=ymd(date_ex$last.followup.date)
date_ex
str(date_ex)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## start.date last.followup.date
## 1 2007-06-22 2017-04-15
## 2 2004-02-12 2018-07-04
## 3 2010-11-03 2016-10-31&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## &amp;#39;data.frame&amp;#39;: 3 obs. of 2 variables:
## $ start.date : Date, format: &amp;quot;2007-06-22&amp;quot; &amp;quot;2004-02-12&amp;quot; ...
## $ last.followup.date: Date, format: &amp;quot;2017-04-15&amp;quot; &amp;quot;2018-07-04&amp;quot; ...&lt;/code>&lt;/pre>
&lt;p>Note that in base R the format must include the separators as well as the symbols, e.g., if your date is in the format &lt;em>m/d/Y&lt;/em> then you would need &lt;em>format = “%m/%d/%Y”&lt;/em>; on the other hand, for the &lt;em>ymd()&lt;/em> function in the &lt;em>lubridate&lt;/em> package the separators do not need to be specified.&lt;/p>
&lt;/div>
&lt;div id="calculating-survival-times" class="section level3">
&lt;h3>Calculating Survival Times&lt;/h3>
&lt;p>Now, to calculate the survival times we need the difference between the start and end dates. For this, base R has a function called &lt;em>difftime()&lt;/em> which gives the number of days between two dates. Use the as.numeric() function to convert the differences into numeric values. Finally, to convert to years, divide by 365.25, the average number of days in a year.&lt;/p>
&lt;p>On the other hand, using the &lt;em>lubridate&lt;/em> package, the operator &lt;em>%–%&lt;/em> designates a time interval, which is then converted to the number of elapsed seconds using &lt;em>as.duration()&lt;/em> and finally converted to years by dividing by &lt;em>dyears(1)&lt;/em>, which gives the number of seconds in a year.&lt;/p>
&lt;pre class="r">&lt;code>#####################################
## ##
##--Using base function as.Date()--##
## ##
#####################################
date_ex$time=round(as.numeric(difftime(date_ex$last.followup.date,
date_ex$start.date,
units = &amp;quot;days&amp;quot;))/365.25,2)
date_ex
###############################
## ##
##--Using lubridate package--##
## ##
###############################
date_ex$time=as.duration(date_ex$start.date %--% date_ex$last.followup.date)/dyears(1)
date_ex&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## start.date last.followup.date time
## 1 2007-06-22 2017-04-15 9.82
## 2 2004-02-12 2018-07-04 14.39
## 3 2010-11-03 2016-10-31 5.99&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div>
&lt;div id="event-indicator-and-creating-survival-objects" class="section level1">
&lt;h1>Event indicator and Creating survival objects&lt;/h1>
&lt;p>In the &lt;strong>Components of survival data&lt;/strong> section I mentioned the event indicator:&lt;/p>
&lt;p>Event indicator &lt;span class="math inline">\(\delta_i\)&lt;/span>:&lt;/p>
&lt;ul>
&lt;li>1 if event observed (i.e., &lt;span class="math inline">\(T_i\le C_i\)&lt;/span>)&lt;/li>
&lt;li>0 if censored (i.e., &lt;span class="math inline">\(T_i&amp;gt; C_i\)&lt;/span>)&lt;/li>
&lt;/ul>
&lt;p>In R, the &lt;em>Surv()&lt;/em> function in the survival package creates a survival object. There is one entry for each subject: the survival time, followed by a &lt;strong>‘+’&lt;/strong> if the subject was censored. To create the survival object, you give the &lt;em>Surv()&lt;/em> function the time variable and the status variable indicating whether the subject was censored.&lt;/p>
&lt;p>Let’s look at the first 10 observations of the survival objects for the &lt;em>lung&lt;/em> dataset:&lt;/p>
&lt;pre class="r">&lt;code>library(survival)
Surv(lung$time,lung$status)[1:10]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Warning: package &amp;#39;survival&amp;#39; was built under R version 4.1.3&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="estimating-survival-curves-and-survival-probabilities-with-kaplan-meier-method" class="section level1">
&lt;h1>Estimating Survival curves and Survival probabilities with Kaplan-Meier method&lt;/h1>
&lt;p>Now our interest is to estimate the survival probability at a certain time, say &lt;span class="math inline">\(t\)&lt;/span>. The survival probability at a certain time, &lt;span class="math inline">\(S(t)\)&lt;/span>, is built from conditional probabilities of surviving beyond each time, given that an individual has survived just prior to it.
There are various techniques available; one of them is &lt;strong>Product-Limit estimation/Kaplan-Meier estimation&lt;/strong>.&lt;/p>
&lt;div id="kaplan-meier-estimator-product-limitpl-estimator" class="section level4">
&lt;h4>Kaplan-Meier Estimator/ Product Limit(PL) Estimator&lt;/h4>
&lt;p>Let &lt;span class="math inline">\(t_1,t_2,...,t_n\)&lt;/span> be uncensored sample observations of failure times. Then a non-parametric estimate of the survival function &lt;span class="math inline">\(S(t)\)&lt;/span> at the time point &lt;span class="math inline">\(t\)&lt;/span> is given by,&lt;/p>
&lt;p>&lt;span class="math display">\[R_n(t)=\frac{\# \text{Observations}&amp;gt;t}{n}\]&lt;/span> &lt;span class="math display">\[R_n(t)=\frac{\# T_i&amp;gt;t}{n}\:\: \text{where}\:T_i\:\forall i=1(1)n \: \: \text{are uncensored R.V.&amp;#39;s}\]&lt;/span>
This is basically the &lt;strong>complementary empirical distribution function&lt;/strong> at time &lt;span class="math inline">\(t\)&lt;/span>.&lt;/p>
&lt;p>Note that &lt;span class="math inline">\(R_n(t)\)&lt;/span> is a UMVUE (uniformly minimum variance unbiased), consistent, and efficient estimator of &lt;span class="math inline">\(S(t)\)&lt;/span>.&lt;/p>
&lt;p>But usually we cannot expect fully uncensored failure data, due to many practical limitations; that’s why we need some modification. The modified estimator is called the Product-Limit estimator or Kaplan-Meier estimator.&lt;/p>
&lt;p>Let there be &lt;span class="math inline">\(n\)&lt;/span> items and &lt;span class="math inline">\(k(\leq n)\)&lt;/span> distinct failure times &lt;span class="math inline">\(t_1 \leq t_2 \leq .... \leq t_k\)&lt;/span> observed.&lt;/p>
&lt;p>Let, &lt;span class="math display">\[d_j=\# \text{failures at time}\: t_j \:\:\: ,\forall j=1(1)k\]&lt;/span>
&lt;span class="math display">\[n_j=\# \text{items at risk of failing at}\: t_j \\ \:\:\:\:\;\;\;\;\;\;\;\:\:\:\:\:\:\:\;\;\;\;\;\;\;\:\:\:\:\:\:\:\;\;\;\;\;\;\;\:\:\:\:\:\:\:\;\;\;\;\;\;\;\:\:\: =\# \text{items that are functioning and uncensored just prior to}\: t_j\]&lt;/span>
Then, the Kaplan-Meier estimator is defined as: &lt;span class="math display">\[\hat{R_n(t)}={\prod}_{j:t_j&amp;lt;t} \frac{n_j-d_j}{n_j}\]&lt;/span>
where, &lt;span class="math inline">\(n_{j+1}=n_j-d_j-c_j\:\:\:\:, c_j=\# \text{items censored at }t_j\)&lt;/span>&lt;/p>
&lt;ul>
&lt;li>It can be shown that, &lt;span class="math inline">\(\hat{R_n(t)}\)&lt;/span> is a non-parametric MLE of the survival function &lt;span class="math inline">\(S(t)\)&lt;/span>.&lt;/li>
&lt;li>So, &lt;span class="math inline">\(E(\hat{R_n(t)})=S(t)\)&lt;/span> at a particular time &lt;span class="math inline">\(t\)&lt;/span>.&lt;/li>
&lt;li>The estimated asymptotic variance of the Kaplan-Meier estimator (by Greenwood’s formula) is &lt;span class="math display">\[\hat{V}(\hat{R_n(t)}) \approx \left(\frac{\partial \hat{R_n(t)}}{\partial \log \hat{R_n(t)}}\right)^2 \hat{V}(\log \hat{R_n(t)}) =(\hat{R_n(t)})^2 {\sum}_{j:t_j&amp;lt;t} \frac{d_j}{n_j(n_j-d_j)}\]&lt;/span>&lt;/li>
&lt;/ul>
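The product-limit formula can be checked by hand against survfit() on a small toy sample, since the fitted object exposes n.risk and n.event at each reported time:

```r
library(survival)

# Toy sample: observed times with event indicator (1 = event, 0 = censored)
t.obs <- c(1, 2, 2, 3, 4, 5)
delta <- c(1, 1, 0, 1, 0, 1)

fit <- survfit(Surv(t.obs, delta) ~ 1)

# Hand-computed product-limit estimate at each reported time
by.hand <- cumprod((fit$n.risk - fit$n.event) / fit$n.risk)
all.equal(by.hand, fit$surv)   # should be TRUE
```

At times where only censoring occurs, n.event is 0, so the corresponding factor is 1 and the estimate stays flat, matching the step-curve behaviour described earlier.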
&lt;p>Now, in R the &lt;em>survfit&lt;/em> function creates survival curves based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object &lt;em>f1&lt;/em>, and look at the summary.&lt;/p>
&lt;pre class="r">&lt;code>f1=survfit(Surv(time, status)~1,data=lung)
f1
summary(f1)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## n events median 0.95LCL 0.95UCL
## [1,] 228 165 310 285 363&lt;/code>&lt;/pre>
&lt;p>These tables show a row for each time point where either the event occurred or a subject was censored. They show the number at risk (the number still remaining) and the cumulative survival at that instant.&lt;/p>
&lt;p>For more details about the outputs of &lt;em>survfit()&lt;/em>, type &lt;em>?summary.survfit&lt;/em>, or simply run &lt;em>names(f1)&lt;/em> to see all the output components.&lt;/p>
&lt;p>Here we’ve created a simple survival curve that doesn’t consider any different groupings, so we’ve specified just an intercept (e.g., &lt;em>~1&lt;/em>) in the formula that &lt;em>survfit&lt;/em> expects. It is similar to how we specify data for linear models with &lt;em>lm()&lt;/em>, we use the &lt;em>data=&lt;/em> argument to specify which data we’re using.&lt;/p>
&lt;p>You can give the &lt;em>summary()&lt;/em> function an option for which times you want shown in the results.
Let’s create a sequence of times from the lung dataset and look at the &lt;em>survfit&lt;/em> results at those time points.&lt;/p>
&lt;pre class="r">&lt;code># checking the range of the time variable
range(lung$time)
# creating a time sequence
seq(0,1100,100)
# visualizing the summary of &amp;#39;f1&amp;#39; for the above time sequence
summary(f1, times = seq(0,1100,100))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 5 1022&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0 100 200 300 400 500 600 700 800 900 1000 1100&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 228 0 1.0000 0.0000 1.0000 1.000
## 100 196 31 0.8640 0.0227 0.8206 0.910
## 200 144 41 0.6803 0.0311 0.6219 0.744
## 300 92 29 0.5306 0.0346 0.4669 0.603
## 400 57 25 0.3768 0.0358 0.3128 0.454
## 500 41 12 0.2933 0.0351 0.2320 0.371
## 600 24 10 0.2136 0.0335 0.1571 0.290
## 700 16 8 0.1424 0.0303 0.0938 0.216
## 800 8 7 0.0783 0.0246 0.0423 0.145
## 900 3 2 0.0503 0.0228 0.0207 0.123
## 1000 2 0 0.0503 0.0228 0.0207 0.123&lt;/code>&lt;/pre>
&lt;p>What’s more interesting though is if we model something besides just an intercept. Let’s fit survival curves separately by sex.&lt;/p>
&lt;pre class="r">&lt;code>f2=survfit(Surv(time, status)~sex,data=lung)
f2&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550&lt;/code>&lt;/pre>
&lt;p>We can use the above time sequence vector in a summary call on &lt;em>f2&lt;/em> to get life tables at those intervals separately for males (1) and females (2). From these tables we can start to see that males tend to have worse survival than females.&lt;/p>
&lt;pre class="r">&lt;code>summary(f2,times = seq(0,1100,100))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 138 0 1.0000 0.0000 1.0000 1.000
## 100 114 24 0.8261 0.0323 0.7652 0.892
## 200 78 30 0.6073 0.0417 0.5309 0.695
## 300 49 20 0.4411 0.0439 0.3629 0.536
## 400 31 15 0.2977 0.0425 0.2250 0.394
## 500 20 7 0.2232 0.0402 0.1569 0.318
## 600 13 7 0.1451 0.0353 0.0900 0.234
## 700 8 5 0.0893 0.0293 0.0470 0.170
## 800 6 2 0.0670 0.0259 0.0314 0.143
## 900 2 2 0.0357 0.0216 0.0109 0.117
## 1000 2 0 0.0357 0.0216 0.0109 0.117
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 90 0 1.0000 0.0000 1.0000 1.000
## 100 82 7 0.9221 0.0283 0.8683 0.979
## 200 66 11 0.7946 0.0432 0.7142 0.884
## 300 43 9 0.6742 0.0523 0.5791 0.785
## 400 26 10 0.5089 0.0603 0.4035 0.642
## 500 21 5 0.4110 0.0626 0.3050 0.554
## 600 11 3 0.3433 0.0634 0.2390 0.493
## 700 8 3 0.2496 0.0652 0.1496 0.417
## 800 2 5 0.0832 0.0499 0.0257 0.270
## 900 1 0 0.0832 0.0499 0.0257 0.270&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Note:&lt;/strong> The output of &lt;em>summary()&lt;/em> is a &lt;em>‘list’&lt;/em>. The easiest way to convert it into a &lt;strong>data.frame&lt;/strong> is to use the &lt;em>tidy&lt;/em> function from the &lt;em>broom&lt;/em> package.&lt;/p>
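&lt;p>For instance, a minimal sketch (assuming the &lt;em>broom&lt;/em> package is installed):&lt;/p>

```r
library(survival)
library(broom)

f1 <- survfit(Surv(time, status) ~ 1, data = lung)

# tidy() returns one row per event/censoring time as a data.frame
td <- tidy(f1)
head(td)  # columns include time, n.risk, n.event, n.censor, estimate, std.error, conf.high, conf.low
```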
&lt;/div>
&lt;/div>
&lt;div id="kaplan-meier-plots" class="section level1">
&lt;h1>Kaplan-Meier Plots&lt;/h1>
&lt;p>Now we plot the &lt;em>survfit&lt;/em> object in base R to get the Kaplan-Meier plot.&lt;/p>
&lt;pre class="r">&lt;code>##-- Survival plot for overall data --##
plot(f1,
xlab = &amp;quot;Days&amp;quot;,
ylab=&amp;quot;Survival Probability&amp;quot;,
main=&amp;quot;Overall Survival Probability&amp;quot;)
##-- Survival plot grouped by Sex --##
plot(f2,
col=c(1,2),
xlab = &amp;quot;Days&amp;quot;,
ylab=&amp;quot;Survival Probability&amp;quot;,
main=&amp;quot;Survival Probability plot grouped by Sex&amp;quot;,
lwd=2)
legend(&amp;quot;top&amp;quot;,legend=c(&amp;quot;Male&amp;quot;,&amp;quot;Female&amp;quot;),col = c(1,2),lty = c(1,2),lwd=2,box.col = &amp;quot;white&amp;quot;,horiz = T,title = &amp;quot;Sex&amp;quot;)
box()&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-24-1.png" width="672" />&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-24-2.png" width="672" />&lt;/p>
&lt;ul>
&lt;li>&lt;p>The default plot in base R shows the step function (solid line) with associated confidence intervals (dotted lines).&lt;/p>&lt;/li>
&lt;li>&lt;p>Horizontal lines represent survival duration for the interval.&lt;/p>&lt;/li>
&lt;li>&lt;p>The height of vertical lines shows the change in cumulative probability.&lt;/p>&lt;/li>
&lt;li>&lt;p>Censored observations, indicated by tick marks, reduce the number at risk without causing a drop in the curve. (The tick marks for censored patients are not shown by default, but can be added using the option &lt;em>mark.time = TRUE&lt;/em>.)&lt;/p>&lt;/li>
&lt;li>&lt;p>When there are two or more survival curves, the plot omits the confidence intervals by default.&lt;/p>&lt;/li>
&lt;li>&lt;p>To plot survival curves in a more convenient way, use the &lt;em>ggsurvplot&lt;/em> function from the &lt;em>survminer&lt;/em> package.&lt;/p>&lt;/li>
&lt;li>&lt;p>But you can also customize these base R survival plots according to your needs. For example:&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>summary_1=summary(f1) # Summary of f1
##-- Survival Probability plot --##
plot(summary_1$time,summary_1$surv,
type=&amp;quot;S&amp;quot;,
lwd=2,
xlab = &amp;quot;Days&amp;quot;,
ylab=&amp;quot;Survival Probability&amp;quot;,
main=&amp;quot;Overall Survival Probability&amp;quot;)
##-- Plotting confidence interval with shade --##
polygon(c(summary_1$time,rev(summary_1$time)),c(summary_1$upper,rev(summary_1$lower)),
col=gray(0.4,0.4),border = NA)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-26-1.png" width="672" />&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> To plot the confidence interval as a shaded region, the trick is to use the &lt;em>polygon&lt;/em> function: provide the x coordinates twice, once in normal order and once in reverse order (with the function &lt;em>rev&lt;/em>), and provide the y coordinates as a vector of the upper bounds followed by the lower bounds in reverse order.&lt;/p>
&lt;p>Similarly, you can produce this type of customized plot for ‘f2’.&lt;/p>
&lt;pre class="r">&lt;code>summary_2=summary(f2)
##-- Informations that we need --##
# For Sex=1 (Male)
time_sex_1 = summary_2$time[summary_2$strata==&amp;quot;sex=1&amp;quot;] # Times for Sex=1
surv.prob_sex_1 = summary_2$surv[summary_2$strata==&amp;quot;sex=1&amp;quot;] # Survival Prob. for Sex=1
lower_sex_1 = summary_2$lower[summary_2$strata==&amp;quot;sex=1&amp;quot;] # lower Confidence level for Sex=1
upper_sex_1= summary_2$upper[summary_2$strata==&amp;quot;sex=1&amp;quot;] # upper Confidence level for Sex=1
# For Sex=2 (Female)
time_sex_2 = summary_2$time[summary_2$strata==&amp;quot;sex=2&amp;quot;] # Times for Sex=2
surv.prob_sex_2 = summary_2$surv[summary_2$strata==&amp;quot;sex=2&amp;quot;] # Survival Prob. for Sex=2
lower_sex_2 = summary_2$lower[summary_2$strata==&amp;quot;sex=2&amp;quot;] # lower Confidence level for Sex=2
upper_sex_2= summary_2$upper[summary_2$strata==&amp;quot;sex=2&amp;quot;] # upper Confidence level for Sex=2
##-- Plotting Survival Curve grouped by Sex --##
plot(time_sex_1,surv.prob_sex_1,
type=&amp;quot;S&amp;quot;,
lwd=2,
col= &amp;quot;blue&amp;quot;,
xlab = &amp;quot;Days&amp;quot;,
ylab=&amp;quot;Survival Probability&amp;quot;,
main=&amp;quot;Survival Probability Curve grouped by Gender&amp;quot;)
polygon(c(time_sex_1,rev(time_sex_1)),c(upper_sex_1,rev(lower_sex_1)),
col=adjustcolor(&amp;quot;blue&amp;quot;, alpha.f = 0.20),border = NA)
par(new=T)
plot(time_sex_2,surv.prob_sex_2,
type=&amp;quot;S&amp;quot;,
lwd=2,
col=&amp;quot;red&amp;quot;,
xlab = &amp;quot;&amp;quot;,
ylab=&amp;quot;&amp;quot;,
axes=FALSE)
polygon(c(time_sex_2,rev(time_sex_2)),c(upper_sex_2,rev(lower_sex_2)),
col=adjustcolor(&amp;quot;red&amp;quot;, alpha.f = 0.20),border = NA)
legend(&amp;quot;top&amp;quot;,legend=c(&amp;quot;Male&amp;quot;,&amp;quot;Female&amp;quot;),
fill=c(&amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;),
box.col = &amp;quot;white&amp;quot;,
horiz = T,
title = &amp;quot;Sex&amp;quot;)
box()&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-27-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="estimating-x-years-survival" class="section level1">
&lt;h1>Estimating x-years survival&lt;/h1>
&lt;p>One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number &lt;span class="math inline">\(x\)&lt;/span> of years.&lt;/p>
&lt;p>For example, to estimate the probability of surviving to 1 year, use &lt;em>summary&lt;/em> with the times argument (&lt;strong>Note:&lt;/strong> the &lt;em>time&lt;/em> variable in the &lt;em>lung&lt;/em> data is actually in days, so we need to use &lt;em>times=365.25&lt;/em>)&lt;/p>
&lt;pre class="r">&lt;code>summary(f1,times = 365.25)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 365 65 121 0.409 0.0358 0.345 0.486&lt;/code>&lt;/pre>
&lt;p>&lt;strong>We find the 1 year probability of survival in this study is 41%.&lt;/strong>&lt;/p>
&lt;p>Note that,&lt;/p>
&lt;ul>
&lt;li>&lt;p>n.risk = 65 = the number of subjects at risk at 1 year, i.e., the number of subjects still remaining in the study&lt;/p>&lt;/li>
&lt;li>&lt;p>n.event = 121 = the cumulative number of events that have occurred from the last listed time up to 1 year&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="testing-of-survival-curves" class="section level1">
&lt;h1>Testing of survival curves&lt;/h1>
&lt;p>When there are survival curves for &lt;span class="math inline">\(k\;(\text{where,}\:k\geq2)\)&lt;/span> groups, we need a statistical significance test comparing those survival curves. Since &lt;span class="math inline">\(S(t)\)&lt;/span> is a probability function, the &lt;strong>Log Rank test statistic&lt;/strong> is approximately distributed as a chi-square statistic with &lt;span class="math inline">\(k-1\)&lt;/span> degrees of freedom.&lt;/p>
&lt;p>The hypothesis are:&lt;/p>
&lt;p>&lt;span class="math display">\[H_0: \text{In terms of survivability, there is no difference between the groups} \\ a.g.\\ H_1: \text{There is a survival differential between the groups.}\]&lt;/span>&lt;/p>
&lt;p>In R, the Log Rank test is performed with the &lt;em>survdiff&lt;/em> function from the &lt;em>survival&lt;/em> package.&lt;/p>
&lt;pre class="r">&lt;code>test= survdiff(Surv(time,status)~sex,data=lung)
test&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138 112 91.6 4.55 10.3
## sex=2 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001&lt;/code>&lt;/pre>
&lt;p>The Chi-Squared test statistic is 10.3 with 1 degree of freedom and the corresponding p-value is 0.001. Since this p-value is less than 0.05, we reject the null hypothesis.&lt;/p>
&lt;p>In other words, we have sufficient evidence to say that there is a statistically significant difference in survival between the Male(sex=1) &amp;amp; Female(sex=2).&lt;/p>
&lt;p>To extract the p-value from &lt;strong>survdiff&lt;/strong>, we use the following trick:&lt;/p>
&lt;pre class="r">&lt;code>p.val= 1 - pchisq(test$chisq, length(test$n) - 1)
round(p.val,3)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.001&lt;/code>&lt;/pre>
&lt;p>Alternatively, there is the &lt;em>sdp&lt;/em> function in the &lt;em>ezfun&lt;/em> package, which you can install using &lt;em>devtools::install_github(“zabore/ezfun”)&lt;/em>. It returns a formatted p-value.&lt;/p>
&lt;/div>
&lt;div id="coxs-proportional-hazard-regression-model" class="section level1">
&lt;h1>Cox’s Proportional Hazard Regression model&lt;/h1>
&lt;p>Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups, but they don’t work well for assessing the effect of quantitative variables like age, gene expression, leukocyte count, etc. Cox PH regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.&lt;/p>
&lt;p>Cox PH regression models the natural log of the hazard at time &lt;span class="math inline">\(t\)&lt;/span>, denoted &lt;span class="math inline">\(h(t)\)&lt;/span>, as a function of the baseline hazard &lt;span class="math inline">\(h_0(t)\)&lt;/span> (&lt;em>the hazard for an individual where all exposure variables are 0&lt;/em>) and multiple exposure variables &lt;span class="math inline">\(X_1,X_2,...,X_p\)&lt;/span>. The form of the Cox PH model is:&lt;/p>
&lt;p>&lt;span class="math display">\[ln(h(t))= ln(h_0(t))+\beta_1 X_1+\beta_2 X_2+......+\beta_p X_p\]&lt;/span>
If you exponentiate both sides of the equation, and limit the right hand side to just a single categorical exposure variable &lt;span class="math inline">\((X_1)\)&lt;/span> with two groups (&lt;span class="math inline">\(X_1=1\)&lt;/span> for exposed and &lt;span class="math inline">\(X_1=0\)&lt;/span> for unexposed), the equation becomes: &lt;span class="math display">\[h_1(t)=h_0(t) \times e^{\beta_1 X_1}\]&lt;/span>&lt;/p>
&lt;p>Rearranging that equation lets you estimate the &lt;strong>hazard ratio&lt;/strong>, comparing the exposed to the unexposed individuals at time &lt;span class="math inline">\(t\)&lt;/span>:
&lt;span class="math display">\[HR(t)=\frac{h_1(t)}{h_0(t)}=e^{\beta_1}\]&lt;/span>&lt;/p>
&lt;p>This model shows that the hazard ratio is &lt;span class="math inline">\(e^{\beta_1}\)&lt;/span>, and remains constant over time &lt;span class="math inline">\(t\)&lt;/span> (&lt;em>hence the name proportional hazards regression&lt;/em>). The &lt;span class="math inline">\(\beta\)&lt;/span> values are the regression coefficients that are estimated from the model, and represent the &lt;span class="math inline">\(log(\text{Hazard Ratio})\)&lt;/span> for each unit increase in the corresponding predictor variable. The interpretation of the hazards ratio depends on the measurement scale of the predictor variable, but in simple terms, a positive coefficient indicates worse survival and a negative coefficient indicates better survival for the variable in question.&lt;/p>
&lt;p>Note that, the model is a &lt;em>semi-parametric&lt;/em> because:&lt;/p>
&lt;ul>
&lt;li>Model involves some parameters &lt;span class="math inline">\(\beta\)&lt;/span>,&lt;/li>
&lt;li>Model does not depend on any specific life-distribution.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> parametric regression models for survival outcomes are also available, but they won’t be addressed in this training. You can read about them &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5233524/#:~:text=%20Parametric%20regression%20model%20for%20survival%20data%3A%20Weibull,%28%29%20function%20contained...%205%20Acknowledgements.%20%20More%20">here&lt;/a>.&lt;/p>
&lt;p>In R, to perform Cox regression, there is a function called &lt;em>coxph&lt;/em> under the &lt;em>survival&lt;/em> package. The &lt;em>coxph()&lt;/em> function uses the same syntax as &lt;em>lm()&lt;/em>, &lt;em>glm()&lt;/em>, etc. The response variable you create with &lt;em>Surv()&lt;/em> goes on the left hand side of the formula, specified with a &lt;em>~&lt;/em>. Explanatory variables go on the right side.&lt;/p>
&lt;p>Let’s go back to the lung cancer data and run a Cox regression on sex.&lt;/p>
&lt;pre class="r">&lt;code>Cox.fit=coxph(Surv(time,status)~sex,data=lung)
Cox.fit&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5310 0.5880 0.1672 -3.176 0.00149
##
## Likelihood ratio test=10.63 on 1 df, p=0.001111
## n= 228, number of events= 165&lt;/code>&lt;/pre>
&lt;p>The &lt;em>exp(coef)&lt;/em> column contains &lt;span class="math inline">\(e^{\beta_1}\)&lt;/span>. This is the hazard ratio (in our case HR=0.59) – the multiplicative effect of that variable on the hazard rate (for each unit increase in that variable). So, for a categorical variable like sex, going from male (baseline) to female results in approximately a 40% reduction in hazard, i.e., around 0.6 times as many females are dying as males at any given time. You could also flip the sign on the coef column and take &lt;em>exp(0.531)&lt;/em>, which you can interpret as being male resulting in a 1.7-fold increase in hazard: males die at approximately 1.7x the rate per unit time as females (females die at 0.588x the rate per unit time as males).&lt;/p>
&lt;p>Note that:&lt;/p>
&lt;ul>
&lt;li>HR=1: No effect&lt;/li>
&lt;li>HR&amp;gt;1: Increase in hazard&lt;/li>
&lt;li>HR&amp;lt;1: Reduction in hazard (protective)&lt;/li>
&lt;/ul>
&lt;p>You’ll also notice there’s a p-value on the &lt;em>sex&lt;/em> term, and a p-value on the &lt;em>overall model&lt;/em>. That 0.00111 p-value is really close to the p=0.00131 p-value we saw on the Kaplan-Meier plot. That’s because the KM plot is showing the log-rank test p-value. You can get this out of the Cox model with a call to &lt;em>summary(Cox.fit)&lt;/em>.&lt;/p>
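&lt;p>A minimal sketch of pulling the hazard ratio, its confidence interval, and the score (log-rank) test directly out of the fitted model:&lt;/p>

```r
library(survival)

Cox.fit <- coxph(Surv(time, status) ~ sex, data = lung)

exp(coef(Cox.fit))       # hazard ratio for sex (about 0.588)
exp(confint(Cox.fit))    # 95% CI for the hazard ratio
summary(Cox.fit)$sctest  # score (log-rank) test: statistic, df, p-value
```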
&lt;/div>
&lt;div id="competing-risks" class="section level1">
&lt;h1>Competing Risks&lt;/h1>
&lt;div id="what-is-competing-event-and-competing-risk" class="section level4">
&lt;h4>What is Competing Event ? and Competing Risk ?&lt;/h4>
&lt;p>In standard survival data, subjects are assumed to experience only one type of event over follow-up, such as death from breast cancer. In real life, however, subjects can potentially experience more than one type of event. For instance, if mortality is of research interest, then our observations – senior patients at an oncology department – could die from a heart attack or breast cancer, or even a traffic accident. When only one of these different types of event can occur, we refer to them as “competing events”, in the sense that they compete with each other to deliver the event of interest, and the occurrence of one type of event prevents the occurrence of the others. Accordingly, we call the probabilities of these events “competing risks”, in the sense that the probability of each competing event is somehow regulated by the other competing events, an interpretation suited to describing a survival process determined by multiple types of events.&lt;/p>
&lt;p>To better understand the competing event scenario, consider the following examples:&lt;/p>
&lt;ul>
&lt;li>A patient can die from breast cancer or from stroke, but not from both;&lt;/li>
&lt;li>A breast cancer patient may die after surgery before they can develop hospital infection;&lt;/li>
&lt;li>A soldier may die during a combat or in a traffic accident.&lt;/li>
&lt;/ul>
&lt;p>In the examples above, there is more than one pathway by which a subject can fail, but the failure, whether death or infection, can only occur once for each subject (without considering recurrent events). Therefore, the failures caused by different pathways are mutually exclusive and hence called competing events. Analysis of such data requires special considerations.&lt;/p>
&lt;/div>
&lt;div id="how-to-handel-this-type-of-situation" class="section level4">
&lt;h4>How to handle this type of situation?&lt;/h4>
&lt;p>Competing risks implies that a subject can experience one of a set of different events or outcomes. In this case, 2 different types of hazard functions are of interest: the &lt;strong>cause-specific hazard function&lt;/strong> and the &lt;strong>subdistribution hazard function.&lt;/strong>&lt;/p>
&lt;p>The key components for competing risks are :&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cumulative incidence function (CIF)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Cause-specific hazard&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Subdistribution hazard&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="cumulative-incidence-functioncif" class="section level4">
&lt;h4>Cumulative incidence function (CIF)&lt;/h4>
&lt;p>The cumulative incidence function gives the proportion of patients at time &lt;span class="math inline">\(t\)&lt;/span> who have died from cause &lt;span class="math inline">\(k\)&lt;/span>, accounting for the fact that patients can die from other causes.&lt;/p>
&lt;p>Define: &lt;span class="math display">\[S_t= \text{Number at risk at the end of period}\;t \\ E_t= \text{Number of primary events in period}\;t \\ A_t= \text{Number of competing events in period}\;t\]&lt;/span>&lt;/p>
&lt;p>&lt;span class="math display">\[P(E=t|E \geq t) \approx \frac{E_t}{E_t+A_t+S_t}\]&lt;/span>&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> &lt;span class="math display">\[P(E \geq t+1|E \geq t) \neq 1- \frac{E_t}{E_t+A_t+S_t}\]&lt;/span>&lt;/p>
&lt;p>That means, &lt;span class="math inline">\(\color{blue}{\text{Kaplan-Meier estimator does not work!}}\)&lt;/span>&lt;/p>
&lt;p>So, the survival function : &lt;span class="math inline">\(\hat{S}(t)= {\prod}^{t}_{j=1} \left(1- \frac{E_j+A_j}{E_j+A_j+S_j}\right)\)&lt;/span>&lt;/p>
&lt;p>and, CIF (of the primary events) : &lt;span class="math inline">\(\hat{C}(t)= {\sum}^t_{j=1} \frac{E_j}{E_j+A_j+S_j}\, \hat{S}(j-1)\)&lt;/span>&lt;/p>
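&lt;p>The recursion above can be traced with a toy life table; the counts below are made up purely for illustration:&lt;/p>

```r
# Hypothetical period counts
E <- c(3, 2, 1)    # primary events in each period
A <- c(1, 2, 0)    # competing events in each period
S <- c(16, 12, 11) # number at risk at the end of each period
n <- E + A + S     # number at risk entering each period

# Overall event-free survival: S(t) = prod_j (1 - (E_j + A_j)/n_j)
surv <- cumprod(1 - (E + A) / n)

# CIF of the primary event: C(t) = sum_j (E_j/n_j) * S(j-1), with S(0) = 1
cif <- cumsum((E / n) * c(1, head(surv, -1)))

surv  # 0.80 0.60 0.55
cif   # 0.15 0.25 0.30
```

&lt;p>Note that the CIF of the primary event, the CIF of the competing event, and the overall survival sum to 1 in each period.&lt;/p>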
&lt;/div>
&lt;div id="cause-specific-hazard" class="section level4">
&lt;h4>Cause-specific hazard&lt;/h4>
&lt;p>The cause-specific hazard, &lt;span class="math inline">\(h^{cs}_k(t)\)&lt;/span>, is the instantaneous risk of dying from a particular cause &lt;span class="math inline">\(k\)&lt;/span> given that the subject is still alive at time &lt;span class="math inline">\(t\)&lt;/span>.&lt;/p>
&lt;p>Mathematically, &lt;span class="math display">\[h^{cs}_k(t)={\lim}_{\Delta t \to 0} \frac{P(t \leq T&amp;lt; t+ \Delta t ,D=k|T \geq t)}{\Delta t}\]&lt;/span>&lt;/p>
&lt;/div>
&lt;div id="subdistribution-hazard" class="section level4">
&lt;h4>Subdistribution hazard&lt;/h4>
&lt;p>The subdistribution hazard, &lt;span class="math inline">\(h^{sd}_k(t)\)&lt;/span>, is the instantaneous risk of dying from a particular cause &lt;span class="math inline">\(k\)&lt;/span> given that the subject has not died from cause &lt;span class="math inline">\(k\)&lt;/span>.&lt;/p>
&lt;p>Mathematically, &lt;span class="math display">\[h^{sd}_k(t)={\lim}_{\Delta t \to 0} \frac{P(t \leq T&amp;lt; t+ \Delta t ,D=k\:|\:T \geq t \cup (T&amp;lt;t \cap K \neq k))}{\Delta t}\]&lt;/span>&lt;/p>
&lt;/div>
&lt;div id="a-bunch-off-additional-notes" class="section level4">
&lt;h4>A bunch of additional notes&lt;/h4>
&lt;ul>
&lt;li>When the events are independent (which is almost never true), the cause-specific hazard is unbiased.&lt;/li>
&lt;li>When the events are dependent, a variety of results can be obtained depending on the setting.&lt;/li>
&lt;li>Cumulative incidence estimated with K-M is always &lt;span class="math inline">\(\geq\)&lt;/span> cumulative incidence estimated with competing-risks methods, so K-M can only overestimate the cumulative incidence; the amount of overestimation depends on event rates and the dependence among events.&lt;/li>
&lt;li>To establish that a covariate is indeed acting on the event of interest, cause-specific hazards may be preferred for testing treatment or prognostic marker effects.&lt;/li>
&lt;/ul>
&lt;p>In R, the primary package for use in competing risks analysis is &lt;strong>cmprsk&lt;/strong>.&lt;/p>
&lt;pre class="r">&lt;code>library(cmprsk)&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div>
&lt;div id="cumulative-incidence-in-melanoma-data" class="section level1">
&lt;h1>Cumulative incidence in Melanoma data&lt;/h1>
&lt;div id="description-of-the-melanoma-data" class="section level4">
&lt;h4>Description of the Melanoma data&lt;/h4>
&lt;p>The &lt;strong>Melanoma dataset&lt;/strong> is available in the &lt;strong>MASS&lt;/strong> package. It contains variables:&lt;/p>
&lt;ul>
&lt;li>&lt;em>time&lt;/em> survival times in days, possibly censored&lt;/li>
&lt;li>&lt;em>status&lt;/em> 1 died from melanoma, 2 alive, 3 dead from other causes.&lt;/li>
&lt;li>&lt;em>sex&lt;/em> 1 = male; 0 = female&lt;/li>
&lt;li>&lt;em>age&lt;/em> age in years&lt;/li>
&lt;li>&lt;em>year&lt;/em> of operation&lt;/li>
&lt;li>&lt;em>thickness&lt;/em> tumor thickness in mm.&lt;/li>
&lt;li>&lt;em>ulcer&lt;/em> 1= presence; 0= absence&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>head(MASS::Melanoma)&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:right;">
time
&lt;/th>
&lt;th style="text-align:right;">
status
&lt;/th>
&lt;th style="text-align:right;">
sex
&lt;/th>
&lt;th style="text-align:right;">
age
&lt;/th>
&lt;th style="text-align:right;">
year
&lt;/th>
&lt;th style="text-align:right;">
thickness
&lt;/th>
&lt;th style="text-align:right;">
ulcer
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:right;">
10
&lt;/td>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
76
&lt;/td>
&lt;td style="text-align:right;">
1972
&lt;/td>
&lt;td style="text-align:right;">
6.76
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
30
&lt;/td>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
56
&lt;/td>
&lt;td style="text-align:right;">
1968
&lt;/td>
&lt;td style="text-align:right;">
0.65
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
35
&lt;/td>
&lt;td style="text-align:right;">
2
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
41
&lt;/td>
&lt;td style="text-align:right;">
1977
&lt;/td>
&lt;td style="text-align:right;">
1.34
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
99
&lt;/td>
&lt;td style="text-align:right;">
3
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;td style="text-align:right;">
71
&lt;/td>
&lt;td style="text-align:right;">
1968
&lt;/td>
&lt;td style="text-align:right;">
2.90
&lt;/td>
&lt;td style="text-align:right;">
0
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
185
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
52
&lt;/td>
&lt;td style="text-align:right;">
1965
&lt;/td>
&lt;td style="text-align:right;">
12.08
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right;">
204
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;td style="text-align:right;">
28
&lt;/td>
&lt;td style="text-align:right;">
1971
&lt;/td>
&lt;td style="text-align:right;">
4.84
&lt;/td>
&lt;td style="text-align:right;">
1
&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;div id="cumulative-incidence-in-melanoma-data-1" class="section level4">
&lt;h4>Cumulative incidence in Melanoma data&lt;/h4>
&lt;p>Estimate the cumulative incidence in the context of competing risks using the &lt;em>cuminc&lt;/em> function.&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> in the Melanoma data, censored patients are coded as 2 for &lt;em>status&lt;/em>, so we cannot rely on the &lt;em>cuminc()&lt;/em> function’s default censoring code of 0; we must set the &lt;em>cencode&lt;/em> option to 2.&lt;/p>
&lt;pre class="r">&lt;code>ci_fit= cuminc(MASS::Melanoma$time,MASS::Melanoma$status,cencode = 2)
ci_fit&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Estimates and Variances:
## $est
## 1000 2000 3000 4000 5000
## 1 1 0.12745714 0.23013963 0.30962017 0.3387175 0.3387175
## 1 3 0.03426709 0.05045644 0.05811143 0.1059471 0.1059471
##
## $var
## 1000 2000 3000 4000 5000
## 1 1 0.0005481186 0.0009001172 0.0013789328 0.001690760 0.001690760
## 1 3 0.0001628354 0.0002451319 0.0002998642 0.001040155 0.001040155&lt;/code>&lt;/pre>
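&lt;p>The warning from the notes above, that a naive 1 - KM estimate can only overestimate the cumulative incidence, can be illustrated on these data. A sketch (assuming the &lt;strong>MASS&lt;/strong> and &lt;strong>cmprsk&lt;/strong> packages are installed):&lt;/p>

```r
library(survival)
library(cmprsk)

d <- MASS::Melanoma

# Naive approach: treat other-cause deaths as censored and take 1 - KM
km    <- survfit(Surv(time, status == 1) ~ 1, data = d)
naive <- 1 - min(km$surv)

# Competing-risks approach: cumulative incidence of death from melanoma
ci  <- cuminc(d$time, d$status, cencode = 2)
cif <- max(ci$`1 1`$est)

c(naive = naive, cif = cif)  # the naive estimate is the larger of the two
```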
&lt;/div>
&lt;/div>
&lt;div id="plot-the-cumulative-incidence-cif" class="section level1">
&lt;h1>Plot the Cumulative incidence (CIF)&lt;/h1>
&lt;p>Here, I’m showing how to plot the CIF using base R. Another beautiful function, &lt;em>ggcompetingrisks()&lt;/em>, is available in the &lt;em>survminer&lt;/em> package.&lt;/p>
&lt;pre class="r">&lt;code>plot(ci_fit,xlab=&amp;quot;Days&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-36-1.png" width="672" />&lt;/p>
&lt;p>In the legend:&lt;/p>
&lt;ul>
&lt;li>The 1st number indicates the group. In this case there is only one group (the overall data), so it is ‘1’ for both.&lt;/li>
&lt;li>The 2nd number indicates the event type. In this case the &lt;strong>solid line is 1, death from melanoma&lt;/strong>, and the &lt;strong>dashed line is 3, death from other causes&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="plot-the-cumulative-incidence-cif-manually" class="section level1">
&lt;h1>Plot the Cumulative incidence (CIF) manually&lt;/h1>
&lt;p>We can also plot the above CIF curve manually in base R.&lt;/p>
&lt;pre class="r">&lt;code>##-- For &amp;#39;status= 1&amp;#39;: death from melanoma --##
time_1= ci_fit$`1 1`$time
estimate_1= ci_fit$`1 1`$est
##-- For &amp;#39;status= 3&amp;#39;: death from other cases --##
time_3= ci_fit$`1 3`$time
estimate_3= ci_fit$`1 3`$est
##-- Plotting the Cumulative Incidence Curve --##
plot(x=time_1, y=estimate_1,
type = &amp;quot;S&amp;quot;,
lwd=1,
ylim= c(0,1),
col= adjustcolor(&amp;quot;blue&amp;quot;,alpha.f = 0.55),
xlab = &amp;quot;Days&amp;quot;,
ylab = &amp;quot;Probability of an event&amp;quot;,
main=&amp;quot;Cumulative incidence functions&amp;quot;)
par(new=T)
plot(x=time_3, y=estimate_3,
type = &amp;quot;S&amp;quot;,
lwd=1,
ylim=c(0,1),
col= adjustcolor(&amp;quot;red&amp;quot;,alpha.f = 0.55),
xlab = &amp;quot;&amp;quot;,
ylab = &amp;quot;&amp;quot;,
axes=F)
legend(&amp;quot;top&amp;quot;,
legend=c(&amp;quot;1: death from melanoma&amp;quot;,&amp;quot;3: death from other cases&amp;quot;),
col=c(adjustcolor(&amp;quot;blue&amp;quot;,alpha.f = 0.55),
adjustcolor(&amp;quot;red&amp;quot;,alpha.f = 0.55)),
lty=c(1,1),
title = &amp;quot;Event&amp;quot;,
box.col = &amp;quot;white&amp;quot;,
horiz = T)
box()&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-37-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="compare-cumultive-incidence-between-groups" class="section level1">
&lt;h1>Compare cumulative incidence between groups&lt;/h1>
&lt;p>Note that in &lt;em>cuminc&lt;/em> &lt;strong>Gray’s test&lt;/strong> is used for between-group tests.&lt;/p>
&lt;p>As an example, compare the Melanoma outcomes according to &lt;em>ulcer&lt;/em>, the presence or absence of ulceration. The results of the tests can be found in &lt;em>Tests&lt;/em>.&lt;/p>
&lt;pre class="r">&lt;code>ci_fit_ulcer= cuminc(ftime = MASS::Melanoma$time,
fstatus = MASS::Melanoma$status,
group = MASS::Melanoma$ulcer,
cencode = 2)
ci_fit_ulcer&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Tests:
## stat pv df
## 1 26.120719 3.207240e-07 1
## 3 0.158662 6.903913e-01 1
## Estimates and Variances:
## $est
## 1000 2000 3000 4000 5000
## 0 1 0.03509042 0.10322276 0.18165409 0.18165409 0.1816541
## 1 1 0.24444444 0.38972746 0.46972340 0.53306966 NA
## 0 3 0.01746826 0.02624086 0.04028177 0.12960814 0.1296081
## 1 3 0.05555556 0.07981432 0.07981432 0.07981432 NA
##
## $var
## 1000 2000 3000 4000 5000
## 0 1 0.0002997449 0.0008952562 0.0019180376 0.0019180376 0.001918038
## 1 1 0.0020796399 0.0026929462 0.0035308463 0.0046320135 NA
## 0 3 0.0001512406 0.0002255429 0.0004165726 0.0029626459 0.002962646
## 1 3 0.0005902878 0.0008546097 0.0008546097 0.0008546097 NA&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>ci_fit_ulcer[[&amp;#39;Tests&amp;#39;]]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## stat pv df
## 1 26.120719 3.207240e-07 1
## 3 0.158662 6.903913e-01 1&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="plot-the-cumulative-incidence-cif-between-groups-manually" class="section level1">
&lt;h1>Plot the Cumulative incidence (CIF) between groups manually&lt;/h1>
&lt;p>We can plot the cumulative incidence between groups simply with the plot() function, although it is common practice to visualize this with the &lt;em>ggcompetingrisks&lt;/em> function. I always prefer base R functions and use additional packages only when absolutely necessary.&lt;/p>
&lt;pre class="r">&lt;code>plot(ci_fit_ulcer$`0 1`$time,ci_fit_ulcer$`0 1`$est,
type = &amp;quot;S&amp;quot;,
lwd=1,
lty=1,
ylim = c(0,1),
col=adjustcolor(&amp;quot;blue&amp;quot;,alpha.f = 0.5),
xlab = &amp;quot;Days&amp;quot;,
ylab = &amp;quot;Cumulative incidence of event&amp;quot;,
main=&amp;quot;Death by ulceration&amp;quot;,
bg=gray(0.4,0.3))
par(new=T)
plot(ci_fit_ulcer$`0 3`$time,ci_fit_ulcer$`0 3`$est,
type = &amp;quot;S&amp;quot;,
lwd=1,
lty=1,
ylim = c(0,1),
col=adjustcolor(&amp;quot;red&amp;quot;,alpha.f = 0.5),
xlab = &amp;quot;&amp;quot;,
ylab = &amp;quot;&amp;quot;,
axes = F)
par(new=T)
plot(ci_fit_ulcer$`1 1`$time,ci_fit_ulcer$`1 1`$est,
type = &amp;quot;S&amp;quot;,
lwd=1,
lty=2,
ylim = c(0,1),
col=adjustcolor(&amp;quot;blue&amp;quot;,alpha.f = 0.5),
xlab = &amp;quot;&amp;quot;,
ylab = &amp;quot;&amp;quot;,
axes = F)
par(new=T)
plot(ci_fit_ulcer$`1 3`$time,ci_fit_ulcer$`1 3`$est,
type = &amp;quot;S&amp;quot;,
lwd=1,
lty=2,
ylim = c(0,1),
col=adjustcolor(&amp;quot;red&amp;quot;,alpha.f = 0.5),
xlab = &amp;quot;&amp;quot;,
ylab = &amp;quot;&amp;quot;,
axes = F)
legend(&amp;quot;topleft&amp;quot;,
legend = c(&amp;quot;0: not ulcerated&amp;quot;,&amp;quot; 1: ulcerated&amp;quot;),
col=c(1,1),
lty=c(1,2),
title = &amp;quot;Group&amp;quot;,
box.col = &amp;quot;white&amp;quot;)
legend(&amp;quot;topright&amp;quot;,
legend = c(&amp;quot;1: death from melanoma&amp;quot;,&amp;quot;3: death from other cases&amp;quot;),
fill= c(adjustcolor(&amp;quot;blue&amp;quot;,alpha.f = 0.5),adjustcolor(&amp;quot;red&amp;quot;,alpha.f = 0.5)),
title = &amp;quot;Event&amp;quot;,
box.col = &amp;quot;white&amp;quot;)
box()&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_viii/index_files/figure-html/unnamed-chunk-39-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="competing-risks-regression" class="section level1">
&lt;h1>Competing risks regression&lt;/h1>
&lt;p>As discussed earlier, there are two approaches:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cause-specific hazards&lt;/strong>
&lt;ul>
&lt;li>instantaneous rate of occurrence of the given type of event in subjects who are currently event-free&lt;/li>
&lt;li>estimated using &lt;strong>Cox regression&lt;/strong> (&lt;em>coxph&lt;/em> function)&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;strong>Subdistribution hazards&lt;/strong>
&lt;ul>
&lt;li>instantaneous rate of occurrence of the given type of event in subjects who have not yet experienced an event of that type.&lt;/li>
&lt;li>estimated using &lt;strong>Fine-Gray regression&lt;/strong> (&lt;em>crr&lt;/em> function)&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;div id="a-competing-risks-regression-in-melanoma-data--subdistribution-hazard-approach" class="section level3">
&lt;h3>(A) Competing risks regression in Melanoma data- subdistribution hazard approach&lt;/h3>
&lt;p>Let’s say we are interested in looking at the effect of age and sex on death from melanoma, with death from other causes as a competing event.&lt;/p>
&lt;p>&lt;strong>Notes:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;em>crr&lt;/em> requires specification of covariates as a matrix.&lt;/p>&lt;/li>
&lt;li>&lt;p>If more than one event is of interest, you can request results for a different event by using the failcode option, by default results are returned for &lt;em>failcode = 1&lt;/em>.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>shr_fit= crr(ftime = MASS::Melanoma$time,
fstatus = MASS::Melanoma$status,
cov1 = MASS::Melanoma[,c(&amp;quot;sex&amp;quot;,&amp;quot;age&amp;quot;)],
cencode = 2)
shr_fit&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## convergence: TRUE
## coefficients:
## sex age
## 0.58840 0.01259
## standard errors:
## [1] 0.271800 0.009301
## two-sided p-values:
## sex age
## 0.03 0.18&lt;/code>&lt;/pre>
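&lt;p>The printed coefficients are on the log scale. As a small aside (a sketch, assuming the &lt;em>shr_fit&lt;/em> object from above; &lt;em>crr&lt;/em> stores the log-coefficients in the &lt;em>coef&lt;/em> component), exponentiating gives subdistribution hazard ratios:&lt;/p>
&lt;pre class="r">&lt;code>## exponentiate the log-subdistribution hazards to get
## subdistribution hazard ratios (e.g. exp(0.58840) is about 1.80 for sex)
exp(shr_fit$coef)&lt;/code>&lt;/pre>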
&lt;p>In the above example, both ‘sex’ and ‘age’ were coded as numeric variables. The &lt;em>crr&lt;/em> function cannot naturally handle character variables, and you will get an error, so if character variables are present we first have to create dummy variables using &lt;em>model.matrix&lt;/em>.&lt;/p>
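&lt;p>A minimal sketch of that workaround (hypothetical: it assumes &lt;em>sex&lt;/em> were stored as a factor or character rather than the 0/1 coding actually used here):&lt;/p>
&lt;pre class="r">&lt;code>## Build a numeric design matrix and drop the intercept column,
## then pass it to crr() as cov1
covs= model.matrix(~ sex + age, data = MASS::Melanoma)[,-1]
shr_fit2= crr(ftime = MASS::Melanoma$time,
fstatus = MASS::Melanoma$status,
cov1 = covs,
cencode = 2)&lt;/code>&lt;/pre>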
&lt;/div>
&lt;div id="b-competing-risks-regression-in-melanoma-data--cause-specific-hazard-approach" class="section level3">
&lt;h3>(B) Competing risks regression in Melanoma data- Cause-specific hazard approach&lt;/h3>
&lt;p>Censor all subjects who did not have the event of interest, in this case death from melanoma, and use &lt;em>coxph&lt;/em> as before. So patients who died from other causes are now censored for the cause-specific hazard approach to competing risks.&lt;/p>
&lt;pre class="r">&lt;code>chr_fit= coxph(Surv(time,ifelse(status== 1,1,0))~sex + age, data = MASS::Melanoma)
summary(chr_fit)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Call:
## coxph(formula = Surv(time, ifelse(status == 1, 1, 0)) ~ sex +
## age, data = MASS::Melanoma)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(&amp;gt;|z|)
## sex 0.598259 1.818949 0.267639 2.235 0.0254 *
## age 0.016542 1.016679 0.008663 1.910 0.0562 .
## ---
## Signif. codes: 0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.819 0.5498 1.0765 3.074
## age 1.017 0.9836 0.9996 1.034
##
## Concordance= 0.631 (se = 0.037 )
## Likelihood ratio test= 9.94 on 2 df, p=0.007
## Wald test = 10 on 2 df, p=0.007
## Score (logrank) test = 10.26 on 2 df, p=0.006&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div></description></item><item><title>Sample Size Calculation in R.</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_vii/</link><pubDate>Wed, 16 Mar 2022 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_vii/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vii/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#the-why-of-sample-size-calculations">The Why of Sample Size Calculations :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#key-features-of-sample-size-calculation">Key features of Sample Size Calculation :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#effect-size">Effect Size :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#mathematical-formulas-for-calculating-sample-sazes">Mathematical Formulas for calculating sample Sazes :&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#a-for-estimation">(A) For Estimation :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#b-for-testing">(B) For testing :&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#sample-size-calculation-in-r">Sample Size Calculation in R :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#one-mean-t-test">One Mean T-test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#two-means-t-test">Two Means T-test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#paired-t-test">Paired T-test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#one-way-anova">One-Way ANOVA :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#single-proportion-test">Single Proportion Test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#two-proportions-test">Two Proportions Test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#chi-squared-test">Chi-Squared Test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#simple-multiple-linear-regression">Simple &amp;amp; Multiple Linear Regression :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#correlation">Correlation :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#non-parametric-t-tests">Non-Parametric T-tests :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#kruskal-wallace-test">Kruskal Wallace Test :&lt;/a>&lt;/li>
&lt;li>&lt;a href="#repeated-measures-anova">Repeated Measures ANOVA :&lt;/a>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="the-why-of-sample-size-calculations" class="section level2">
&lt;h2>The Why of Sample Size Calculations :&lt;/h2>
&lt;ul>
&lt;li>&lt;p>In designing an experiment, a key question is : &lt;strong>How many individuals/subjects do I need for my experiment ?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Too small a sample size can fail to detect the effect of interest in our experiment.&lt;/p>&lt;/li>
&lt;li>&lt;p>Too large a sample size wastes resources and subjects unnecessarily.&lt;/p>&lt;/li>
&lt;li>&lt;p>We want our sample size to be &lt;em>‘just right’&lt;/em>.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;em>The answer:&lt;/em> &lt;strong>Sample Size Calculation&lt;/strong>.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;em>Goal:&lt;/em> &lt;strong>We strive to have enough samples to reasonably detect the effect if it really is there, without wasting limited resources on too many samples.&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="key-features-of-sample-size-calculation" class="section level2">
&lt;h2>Key features of Sample Size Calculation :&lt;/h2>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect Size:&lt;/strong> magnitude of the effect under the &lt;span class="math inline">\(H_1\)&lt;/span> (&lt;em>alternative&lt;/em>). The larger the effect size, the easier an effect is to detect, and the fewer samples are required.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Power:&lt;/strong> Probability of correctly rejecting the &lt;span class="math inline">\(H_0\)&lt;/span>(&lt;em>null&lt;/em>) if it is false. i.e., (&lt;span class="math inline">\(1-\beta\)&lt;/span>), where &lt;span class="math inline">\(\beta\)&lt;/span>= Type-II Error.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Significance level(&lt;span class="math inline">\(\alpha\)&lt;/span>):&lt;/strong> Probability of falsely rejecting the null hypothesis even though it is true. i.e., Type-I error.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="effect-size" class="section level2">
&lt;h2>Effect Size :&lt;/h2>
&lt;ul>
&lt;li>&lt;p>While Power and Significance level are usually set irrespective of the data, the effect size is a property of the sample data.&lt;/p>&lt;/li>
&lt;li>&lt;p>It is essentially a function of the difference between the means of the null and alternative hypotheses over the variation (standard deviation) in the data.&lt;span class="math display">\[Effect\:Size \approx \frac{{|{\mu}_{H_1}-{\mu}_{H_0}}|}{\sigma}\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Note that, this sample size can also be calculated from the Confidence interval. But here we are ignoring that technique.&lt;/p>&lt;/li>
&lt;/ul>
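<p>The formula above is easy to compute directly in R. A quick sketch (the hypothesized mean, pilot mean, and SD below are made-up values for illustration only):</p>
&lt;pre class="r">&lt;code>## Hypothetical pilot values: hypothesized mean, observed mean, and SD
mu_H0= 98.6
mu_H1= 98.2
sigma= 0.7
effect_size= abs(mu_H1 - mu_H0)/sigma
effect_size&lt;/code>&lt;/pre>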
&lt;/div>
&lt;div id="mathematical-formulas-for-calculating-sample-sazes" class="section level2">
&lt;h2>Mathematical Formulas for calculating Sample Sizes :&lt;/h2>
&lt;div id="a-for-estimation" class="section level3">
&lt;h3>(A) For Estimation :&lt;/h3>
&lt;div class="figure">
&lt;img src="Sample%20Size%20estimation%20formula%20for%20Esimation%20point%20of%20view.PNG" alt="" />
&lt;p class="caption">For Estimation&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="b-for-testing" class="section level3">
&lt;h3>(B) For testing :&lt;/h3>
&lt;div class="figure">
&lt;img src="Sample%20Size%20estimation%20formula%20for%20testing%20Proportion.PNG" alt="" />
&lt;p class="caption">For Proportion&lt;/p>
&lt;/div>
&lt;div class="figure">
&lt;img src="Sample%20Size%20estimation%20formula%20for%20testing%20Mean.PNG" alt="" />
&lt;p class="caption">For Mean&lt;/p>
&lt;/div>
&lt;div class="figure">
&lt;img src="Sample%20Size%20estimation%20formula%20for%20Epidemiology.PNG" alt="" />
&lt;p class="caption">For Epidemiology Study Design&lt;/p>
&lt;/div>
&lt;div class="figure">
&lt;img src="Sample%20Size%20estimation%20formula%20for%20Epidemiology_2.PNG" alt="" />
&lt;p class="caption">For Epidemiology Study Design&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div id="sample-size-calculation-in-r" class="section level2">
&lt;h2>Sample Size Calculation in R :&lt;/h2>
&lt;table>
&lt;tbody>
&lt;tr class="odd">
&lt;td>Table of R packages &amp;amp; functions for calculating Sample Size for different tests&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Name of test&lt;/th>
&lt;th align="center">Package&lt;/th>
&lt;th align="center">Function&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">One Mean T-test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Two Means T-test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Two Means T-test (unequal Sample)&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t2n.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Paired T-test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">One-way ANOVA&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.anova.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Single Proportion Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.p.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Two Proportions Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.2p.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Two Proportion Test (unequal Sample)&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.2p2n.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Chi-Squared Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.chisq.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Simple Linear Regression&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.f2.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Multiple Linear Regression&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.f2.test()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Correlation&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.r.test()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">One Mean Wilcoxon Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()+15%&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Mann-Whitney Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()+15%&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Paired Wilcoxon Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.t.test()+15%&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Kruskal Wallace Test&lt;/td>
&lt;td align="center">pwr&lt;/td>
&lt;td align="center">pwr.anova.test()+15%&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Repeated Measures ANOVA&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.rmanova()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Multi-way ANOVA (1 Category of interest)&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.kanova()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Multi-way ANOVA (&amp;gt;1 Category of interest)&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.kanova()&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Non-Parametric Regression (Logistic)&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.logistic()&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Non-Parametric Regression (Poisson)&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.poisson&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Multilevel modeling: CRT&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.crt2arm/wp.crt3arm&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Multilevel modeling: MRT&lt;/td>
&lt;td align="center">WebPower&lt;/td>
&lt;td align="center">wp.mrt2arm/wp.mrt3arm&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;div id="one-mean-t-test" class="section level2">
&lt;h2>One Mean T-test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests if a sample mean is any different from a set value for a normally distributed variable.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math inline">\(Effect\:Size(D)= \frac{{|{\mu}_{H_1}-{\mu}_{H_0}}|}{\sigma}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> &lt;strong>Is the average body temperature of college students any different from 98.6°F?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\:: Avg\:Body\:temp.=98.6°F\)&lt;/span> and &lt;span class="math inline">\(H_1\:: Avg\:Body\:temp.\neq 98.6°F\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We will guess that the &lt;strong>effect sizes will be medium.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For t-tests: &lt;strong>0.2=small&lt;/strong>, &lt;strong>0.5=medium&lt;/strong>, and &lt;strong>0.8=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“two.sample”, “&lt;strong>one.sample&lt;/strong>”, “paired”))&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code> library(pwr)
Pwer_t=pwr.t.test(d=0.5, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;,alternative=&amp;quot;two.sided&amp;quot;)
Pwer_t&lt;/code>&lt;/pre>
&lt;pre>&lt;code>##
## One-sample t test power calculation
##
## n = 33.36713
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;Sample Size by rounding off is:&amp;quot;,round(Pwer_t$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;Sample Size by rounding off is:33&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with &lt;span class="math inline">\(\alpha=0.05\)&lt;/span>, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the average income of college freshman is less than Rs.20,000. You collect trial data and find that the mean income was Rs.14,500 (SD=6000).&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the average sleep time change in a year for college freshman is different from zero. You collect the following data of sleep change (in hours).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Variable&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Sleep Change&lt;/td>
&lt;td align="center">-0.55, 0.16, 2.6, 0.65, -0.23, 0.21, -4.3, 2, -1.7, 1.9&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(iii)&lt;/strong> You are interested in determining if the average weight change in a year for college freshman is greater than zero.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>You are interested in determining if the average income of college freshman is less than Rs.20,000. You collect trial data and find that the mean income was Rs.14,500 (SD=6000).&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Effect size = &lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/SD= (14,500-20,000)/6000 = -0.917\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>One-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=-0.917, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;, alternative=&amp;quot;less&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :9&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>Effect size =&lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/SD =(-0.446-0)/1.96 = -0.228\)&lt;/span>&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Two-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=-0.228, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 153&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="3" style="list-style-type: lower-roman">
&lt;li>&lt;em>Try it by yourself.&lt;/em>&lt;/li>
&lt;/ol>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
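&lt;p>One practical note on rounding: &lt;em>round()&lt;/em> can round the required n down (33.37 becomes 33), which leaves the study slightly under-powered. A safer habit is &lt;em>ceiling()&lt;/em>, which gives 34 for Example (1):&lt;/p>
&lt;pre class="r">&lt;code>library(pwr)
## ceiling() always rounds up, so the achieved power is at least the target
ceiling(pwr.t.test(d=0.5, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;)$n)&lt;/code>&lt;/pre>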
&lt;/div>
&lt;div id="two-means-t-test" class="section level2">
&lt;h2>Two Means T-test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests if a mean from one group is different from the mean of another group for a normally distributed variable. In other words, it tests whether the difference in means is different from zero.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math inline">\(Effect\:Size(D)= \frac{{|Mean_{H_1}-Mean_{H_0}}|}{SD_{pooled}}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> &lt;strong>: Is the average body temperature higher in women than in men?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\:: Avg\:difference\:Body\:temp.\:between\:men\:and\: women=0°F\)&lt;/span> and &lt;span class="math inline">\(H_1\:: Avg\:difference\:Body\:temp.\:between\:men\:and\: women&amp;gt;0°F\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We will guess that the &lt;strong>effect sizes will be medium.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For t-tests: &lt;strong>0.2=small&lt;/strong>, &lt;strong>0.5=medium&lt;/strong>, and &lt;strong>0.8=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>We selected &lt;em>greater&lt;/em> because we only want to test whether women’s temperature is higher, not lower (group 1 is women, group 2 is men).&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“&lt;strong>two.sample&lt;/strong>”, “one.sample”, “paired”))&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=0.5, sig.level=0.05, power=0.80,type=&amp;quot;two.sample&amp;quot;, alternative=&amp;quot;greater&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :50&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the average daily caloric intake is different between men and women. You collected trial data and found the average caloric intake for males to be 2350.2 (SD=258), while females had an intake of 1872.4 (SD=420).&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the average protein level in blood is different between men and women. You collected the following trial data on protein level (grams/deciliter).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Protein&lt;/th>
&lt;th align="center">levels&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Male Protein&lt;/td>
&lt;td align="center">1.8, 5.8, 7.1, 4.6, 5.5, 2.4, 8.3, 1.2&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">Female Protein&lt;/td>
&lt;td align="center">9.5, 2.6, 3.7, 4.7, 6.4, 8.4, 3.1, 1.4&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(iii)&lt;/strong> You are interested in determining if the average glucose level in blood is lower in men than women.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>You are interested in determining if the average daily caloric intake is different between men and women. You collected trial data and found the average caloric intake for males to be 2350.2 (SD=258), while females had an intake of 1872.4 (SD=420).&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Effect size = &lt;span class="math inline">\((Mean_{H1}-Mean_{H0})/ SD_{pooled} =(2350.2-1872.4)/ \sqrt{(258^2+ 420^2)/2} =477.8/348.54 = 1.37\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>two-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=1.37, sig.level=0.05, power=0.80, type=&amp;quot;two.sample&amp;quot;,alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :9&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>Effect size =&lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/ SD_{pooled} =(4.59-4.98)/ \sqrt{(2.58^2+ 2.88^2)/2} = -0.14\)&lt;/span>&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Two-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=-0.14, sig.level=0.05, power=0.80, type=&amp;quot;two.sample&amp;quot;, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 802&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="3" style="list-style-type: lower-roman">
&lt;li>&lt;em>Try it by yourself.&lt;/em>&lt;/li>
&lt;/ol>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
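&lt;p>For part (ii), the pooled-SD effect size can also be computed directly from the trial data above (a small sketch; it reproduces the hand calculation):&lt;/p>
&lt;pre class="r">&lt;code>male= c(1.8, 5.8, 7.1, 4.6, 5.5, 2.4, 8.3, 1.2)
female= c(9.5, 2.6, 3.7, 4.7, 6.4, 8.4, 3.1, 1.4)
sd_pooled= sqrt((sd(male)^2 + sd(female)^2)/2)
## about -0.14, matching the value used above
(mean(male) - mean(female))/sd_pooled&lt;/code>&lt;/pre>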
&lt;/div>
&lt;div id="paired-t-test" class="section level2">
&lt;h2>Paired T-test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests if a mean from one group is different from the mean of another group, where the groups are dependent (not independent), for a normally distributed variable. Pairing can be leaves on the same branch, siblings, the same individual before and after a trial, etc.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math inline">\(Effect\:Size(D)= \frac{{|Mean_{H_1}-Mean_{H_0}}|}{SD_{pooled}}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> &lt;strong>Is heart rate higher in patients after a run compared to before a run?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\:: bpm(after) - bpm(before) \leq 0\)&lt;/span> and &lt;span class="math inline">\(H_1\:: bpm(after) - bpm(before)&amp;gt;0\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We will guess that the &lt;strong>effect sizes will be large.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For t-tests: &lt;strong>0.2=small&lt;/strong>, &lt;strong>0.5=medium&lt;/strong>, and &lt;strong>0.8=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>We selected a one-tailed test because we only care whether bpm is higher after a run.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“two.sample”, “one.sample”, “&lt;strong>paired&lt;/strong>”))&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=0.8, sig.level=0.05, power=0.80, type=&amp;quot;paired&amp;quot;, alternative=&amp;quot;greater&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :11&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if metabolic rate in patients after surgery is different from before surgery. You collected trial data and found a mean difference of 0.73 (SD=2.9).&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if heart rate is higher in patients after a doctor’s visit compared to before a visit. You collected the following trial data and found mean heart rate before and after a visit.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Heart rate&lt;/th>
&lt;th align="center">levels&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">BPM before&lt;/td>
&lt;td align="center">126, 88, 53.1, 98.5, 88.3, 82.5, 105, 41.9&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">BPM after&lt;/td>
&lt;td align="center">138.6, 110.1, 58.44, 110.2, 89.61, 98.6, 115.3, 64.3&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>You are interested in determining if metabolic rate in patients after surgery is different from before surgery. You collected trial data and found a mean difference of 0.73 (SD=2.9).&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Effect size = &lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/SD =(0.73)/ 2.9 = 0.25\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Two-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=0.25, sig.level=0.05, power=0.80, type=&amp;quot;paired&amp;quot;, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :128&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>Effect size = &lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/ SD_{pooled} =(98.1-85.4)/ \sqrt{(26.8^2+27.2^2)/2} =12.7/27 = 0.47\)&lt;/span>&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>One-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.t.test(d=0.47, sig.level=0.05, power=0.80, type=&amp;quot;paired&amp;quot;, alternative=&amp;quot;greater&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 29&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
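The effect sizes used above can be cross-checked by hand. As a quick sketch outside the R workflow (assuming the usual sample-standard-deviation convention for the BPM data), the same arithmetic in Python:

```python
import statistics as st

# (i) reported mean difference 0.73 with SD 2.9
d_i = 0.73 / 2.9

# (ii) paired BPM data from the table above
before = [126, 88, 53.1, 98.5, 88.3, 82.5, 105, 41.9]
after = [138.6, 110.1, 58.44, 110.2, 89.61, 98.6, 115.3, 64.3]
diff = st.mean(after) - st.mean(before)                              # ~12.7
sd_pooled = ((st.stdev(before)**2 + st.stdev(after)**2) / 2) ** 0.5  # ~27
d_ii = diff / sd_pooled                                              # ~0.47
print(round(d_i, 2), round(d_ii, 2))
```

This reproduces the two effect sizes (0.25 and 0.47) fed into `pwr.t.test` above.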
&lt;/div>
&lt;div id="one-way-anova" class="section level2">
&lt;h2>One-Way ANOVA :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests whether at least one mean differs among more than two groups for a normally distributed variable. ANOVA extends the two-means t-test to more than two groups.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>&amp;gt; 2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(f)=\sqrt{\frac{\eta^2}{1-\eta^2}}\]&lt;/span> Where, &lt;span class="math display">\[\eta^2 = \frac{SS_T}{TSS}=\frac{Treatment\:Sum\:Squares}{Total\:Sum\:Squares}\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Is there a difference in new car interest rates across 6 different cities?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\)&lt;/span>: all city means are equal, and &lt;span class="math inline">\(H_1\)&lt;/span>: at least one mean differs.&lt;/p>&lt;/li>
&lt;li>&lt;p>There are a total of 6 groups (cities).&lt;/p>&lt;/li>
&lt;li>&lt;p>We will guess that the &lt;strong>effect sizes will be small.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For ANOVA (Cohen’s &lt;em>f&lt;/em>): &lt;strong>0.1=small&lt;/strong>, &lt;strong>0.25=medium&lt;/strong>, and &lt;strong>0.4=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>Groups assumed to be the same size.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.anova.test(k =, f = , sig.level = , power = )&lt;/em>&lt;/p>
&lt;ul>
&lt;li>k= number of groups&lt;/li>
&lt;li>f= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.anova.test(k =6 , f =0.1 , sig.level=0.05 , power =0.80 )$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :215&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(2)&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if there is a difference in weight lost between 4 different surgery options. You collect the following trial data of weight lost in pounds.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Surgery&lt;/th>
&lt;th align="center">Weight Measures&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">A&lt;/td>
&lt;td align="center">6.3, 2.8, 7.8, 7.9, 4.9&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">B&lt;/td>
&lt;td align="center">9.9, 4.1, 3.9, 6.3, 6.9&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="center">C&lt;/td>
&lt;td align="center">5.1, 2.9, 3.6, 5.7, 4.5&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">D&lt;/td>
&lt;td align="center">1.0, 2.8, 4.8, 3.9, 1.6&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if there is a difference in white blood cell counts between 5 different medication regimes.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>&lt;span class="math inline">\(\eta = SS_T/TSS=31.47/(31.47+62.87) = 0.33\)&lt;/span>
Note that, you can calculate &lt;span class="math inline">\(SS_T\)&lt;/span> &amp;amp; &lt;span class="math inline">\(TSS\)&lt;/span> by performing ANOVA on the dataset using &lt;em>aov()&lt;/em> function.&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect size&lt;span class="math inline">\((f)\)&lt;/span> = &lt;span class="math inline">\(\sqrt{\eta^2/(1-\eta^2)}=\sqrt{0.33/(1- 0.33)} = 0.7\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>No. of groups= 4&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.anova.test(k =4, f =0.7, sig.level=0.05, power =0.80 )$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :7&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>You are interested in determining if there is a difference in white blood cell counts between 5 different medication regimes.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Guessed a medium effect size (0.25)&lt;/p>&lt;/li>
&lt;li>&lt;p>No. of groups= 5&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.anova.test(k =5, f =0.25, sig.level=0.05, power =0.80 )$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 39&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
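The η² → f conversion in solution (i) is easy to verify by hand. A minimal cross-check in Python, using the sums of squares quoted in the text:

```python
# treatment SS and residual SS as given in the solution above
eta_sq = 31.47 / (31.47 + 62.87)        # eta-squared = SS_T / TSS
f = (eta_sq / (1 - eta_sq)) ** 0.5      # Cohen's f
print(round(eta_sq, 2), round(f, 1))
```

This reproduces η² ≈ 0.33 and f ≈ 0.7, the value passed to `pwr.anova.test` above.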
&lt;/div>
&lt;div id="single-proportion-test" class="section level2">
&lt;h2>Single Proportion Test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> Use this test when you have a single sample proportion and want to know whether it differs from some constant (hypothesized) proportion.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>N/A&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(h)=2\arcsin(\sqrt{p_{H_1}})-2\arcsin(\sqrt{p_{H_0}})\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Is there a significant difference in cancer prevalence between middle-aged women who have a sister with breast cancer (5%) and the general population (2%)?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0: p = 0.02\)&lt;/span> and &lt;span class="math inline">\(H_1: p \neq 0.02\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a small effect size.&lt;/p>&lt;/li>
&lt;li>&lt;p>For h-tests: &lt;strong>0.2=small&lt;/strong>, &lt;strong>0.5=medium&lt;/strong>, and &lt;strong>0.8=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>Selected Two-sided, because we don’t care about directionality.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.p.test(h = , sig.level = , power = , alternative = "two.sided", "less", or "greater")&lt;/em>&lt;/p>
&lt;ul>
&lt;li>h= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>alternative= type of tail&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round( pwr.p.test(h=0.2, sig.level=0.05, power=0.80, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :196&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(2)&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the male incidence rate proportion of cancer in North Dakota is higher than the US average (prop=0.00490). You find trial data with a cancer prevalence of 0.00495.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the female incidence rate proportion of cancer in North Dakota is lower than the US average (prop=0.00420).&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Effect size = &lt;span class="math inline">\(2*arcsin(\sqrt{0.00495})-2*arcsin(\sqrt{0.00490})=0.0007\)&lt;/span>. Note that in R, arcsin is the function &lt;em>asin()&lt;/em>; &lt;em>pwr.p.test&lt;/em> performs the difference-of-proportion power calculation for the binomial distribution via this arcsine transformation.&lt;/p>&lt;/li>
&lt;li>&lt;p>One-sided test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.p.test(h=0.0007, sig.level=0.05, power=0.80, alternative=&amp;quot;greater&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :12617464&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>You are interested in determining if the female incidence rate proportion of cancer in North Dakota is lower than the US average (prop=0.00420).&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Guess a very small effect size (h= -0.001; negative because the alternative is “less”)&lt;/p>
&lt;ul>
&lt;li>&lt;p>One-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.p.test(h=-0.001, sig.level=0.05, power=0.80, alternative=&amp;quot;less&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 6182557&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
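The tiny effect size in scenario (i) — and why it drives the sample size into the millions — can be reproduced with the normal-approximation formula \(n = ((z_{1-\alpha}+z_{power})/h)^2\). A sketch in Python (the document uses `pwr.p.test` in R; this is just a numeric cross-check using the rounded h = 0.0007 from the text):

```python
from math import asin, sqrt
from statistics import NormalDist

# arcsine-transformed effect size for scenario (i)
p0, p1 = 0.00490, 0.00495
h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p0))

# one-sided test, alpha = 0.05, power = 0.80
z = NormalDist().inv_cdf(0.95) + NormalDist().inv_cdf(0.80)
n = (z / 0.0007) ** 2       # rounded h, as in the text
print(round(h, 4), round(n))
```

This yields h ≈ 0.0007 and n ≈ 12,617,464, matching the `pwr.p.test` output above.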
&lt;/div>
&lt;div id="two-proportions-test" class="section level2">
&lt;h2>Two Proportions Test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests when you have two groups and want to know whether the proportion in each group differs from the other.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>0&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2&lt;/td>
&lt;td>N/A&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(h)=2\arcsin(\sqrt{p_{H_1}})-2\arcsin(\sqrt{p_{H_0}})\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Is the expected proportion of students passing a stats course taught by psychology teachers different from the observed proportion of students passing the same stats class taught by mathematics teachers?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0: p_1 = p_2\)&lt;/span> and &lt;span class="math inline">\(H_1: p_1 \neq p_2\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a small effect size.&lt;/p>&lt;/li>
&lt;li>&lt;p>For h-tests: &lt;strong>0.2=small&lt;/strong>, &lt;strong>0.5=medium&lt;/strong>, and &lt;strong>0.8=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>Selected Two-sided, because we don’t care about directionality.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.2p.test(h = , sig.level = , power = , alternative = "two.sided", "less", or "greater")&lt;/em>&lt;/p>
&lt;ul>
&lt;li>h= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>alternative= type of tail&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round( pwr.2p.test(h=0.2, sig.level=0.05, power=.80, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :392&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(2)&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the expected proportion (P1) of students passing a stats course taught by psychology teachers is different from the observed proportion (P2) of students passing the same stats class taught by biology teachers. You collected the following data of passed tests.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Teaching Method&lt;/th>
&lt;th align="center">Response&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>Psychology&lt;/td>
&lt;td align="center">Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Biology&lt;/td>
&lt;td align="center">No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the expected proportion (P1) of female students who selected YES on a question was higher than the observed proportion (P2) of male students who selected YES. The observed proportion of males who selected yes was 0.75.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>&lt;span class="math inline">\(p_1=7/10=0.70, p_2=6/10=0.60\)&lt;/span>
Note that, you can calculate &lt;span class="math inline">\(SS_T\)&lt;/span> &amp;amp; &lt;span class="math inline">\(TSS\)&lt;/span> by performing ANOVA on the dataset using &lt;em>aov()&lt;/em> function.&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect size= &lt;span class="math inline">\(h= 2*asin(\sqrt{0.60})-2*asin(\sqrt{0.70})=-0.21\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.2p.test(h=-0.21, sig.level=0.05, power=0.80, alternative=&amp;quot;two.sided&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :356&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>You are interested in determining if the expected proportion (P1) of female students who selected YES was higher than the observed proportion (P2, 0.75) of male students who selected YES.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Guess that the expected proportion &lt;span class="math inline">\((p_1)\)&lt;/span> =0.85&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect Size= &lt;span class="math inline">\(h= 2*asin(\sqrt{0.85})-2*asin(\sqrt{0.75})=0.25\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.2p.test(h=0.25, sig.level=0.05, power=0.80, alternative=&amp;quot;greater&amp;quot;)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 198&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
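The h value for scenario (i) and the resulting per-group sample size can be cross-checked numerically. A sketch in Python (the n formula uses the standard two-proportion normal approximation \(n = 2((z_{1-\alpha/2}+z_{power})/h)^2\) with the rounded |h| = 0.21 from the text):

```python
from math import asin, sqrt
from statistics import NormalDist

# effect size for scenario (i): p1 = 7/10 (psychology), p2 = 6/10 (biology)
h = 2 * asin(sqrt(0.60)) - 2 * asin(sqrt(0.70))

# per-group sample size, two-sided, alpha = 0.05, power = 0.80
z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)
n = 2 * (z / 0.21) ** 2
print(round(h, 2), round(n))
```

This yields h ≈ -0.21 and n ≈ 356 per group, matching the `pwr.2p.test` output above.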
&lt;/div>
&lt;div id="chi-squared-test" class="section level2">
&lt;h2>Chi-Squared Test :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This tests whether the observed proportions across the categories of one or more categorical variables differ from the expected proportions (goodness of fit), or whether two categorical variables are associated.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>&lt;span class="math inline">\(0\)&lt;/span>&lt;/td>
&lt;td>&lt;span class="math inline">\(\geq 1\)&lt;/span>&lt;/td>
&lt;td>&lt;span class="math inline">\(\geq 2\)&lt;/span>&lt;/td>
&lt;td>1&lt;/td>
&lt;td>N/A&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(w)=\sqrt{\frac{{\chi}^2}{n\times df}}\]&lt;/span> where, &lt;span class="math display">\[{\chi}^2=\sum{\frac{(O_i-E_i)^2}{E_i}}\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Do the observed proportions of phenotypes from a genetics experiment differ from the expected 9:3:3:1?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\)&lt;/span>: the observed proportions equal the expected proportions, and &lt;span class="math inline">\(H_1\)&lt;/span>: they differ.&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a medium effect size.&lt;/p>&lt;/li>
&lt;li>&lt;p>For w-tests: &lt;strong>0.1=small&lt;/strong>, &lt;strong>0.3=medium&lt;/strong>, and &lt;strong>0.5=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>Degrees of freedom = (number of proportions − 1) = 4 (phenotypes) − 1 = 3&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.chisq.test(w =, df = , sig.level =, power = )&lt;/em>&lt;/p>
&lt;ul>
&lt;li>w= effect size&lt;/li>
&lt;li>df= degrees of freedom&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.chisq.test(w=0.3, df=3, sig.level=0.05, power=0.80)$N,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :121&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(2)&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the ethnic ratios in a company differ by gender. You collect the following trial data from 200 employees.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Gender&lt;/th>
&lt;th align="right">White&lt;/th>
&lt;th align="right">Black&lt;/th>
&lt;th align="center">Am.Indian&lt;/th>
&lt;th align="center">Asian&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">Male&lt;/td>
&lt;td align="right">0.60&lt;/td>
&lt;td align="right">0.25&lt;/td>
&lt;td align="center">0.01&lt;/td>
&lt;td align="center">0.14&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Female&lt;/td>
&lt;td align="right">0.65&lt;/td>
&lt;td align="right">0.21&lt;/td>
&lt;td align="center">0.11&lt;/td>
&lt;td align="center">0.03&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the proportions of student by year (Freshman, Sophomore, Junior, Senior) is any different from 1:1:1:1. You collect the following trial data.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Variable&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Student&lt;/td>
&lt;td align="center">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">Grade&lt;/td>
&lt;td align="center">Frs, Frs, Frs, Frs, Frs, Frs, Frs, Soph, Soph, Soph, Soph, Soph, Jun, Jun, Jun, Jun, Jun, Sen, Sen, Sen&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Note that,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>If the genders were alike, the expected ratios should be the same as the overall ethnic percentages (62.5, 23.0, 6.0, 8.5)&lt;/p>&lt;/li>
&lt;li>&lt;p>We will focus only on the males&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;span class="math inline">\(\chi^2= \sum{\frac{(O_i-E_i)^2}{E_i}} = (60-62.5)2/62.5 + (25-23)2/23 + (1-6)2/6 + (14-8.5)2/8.5 = 8\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect size= &lt;span class="math inline">\(w = \sqrt{\chi^2 /(n*df)}= \sqrt{8/(200*3)}=0.115\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.chisq.test(w=0.115, df=3, sig.level=0.05, power=0.80)$N,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :824&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>Note that here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>&lt;span class="math inline">\(\chi^2= \sum{\frac{(O_i-E_i)^2}{E_i}} = (7-5)^2/5 + (5-5)^2/5 + (5-5)^2/5 + (3-5)^2/5 = 1.6\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect Size= &lt;span class="math inline">\(w = \sqrt{\chi^2 /(n*df)}= \sqrt{1.6/(20*3)}=0.163\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.chisq.test(w=0.163, df=3, sig.level=0.05, power=0.80)$N,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 410&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
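Both χ² computations and the resulting w effect sizes are straightforward to verify by hand. A minimal cross-check in Python, using the observed/expected values from the two scenarios:

```python
from math import sqrt

def chi2_and_w(obs, exp, n, df):
    # chi-squared statistic and Cohen's w = sqrt(chi2 / (n * df))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return chi2, sqrt(chi2 / (n * df))

# (i) male ethnic percentages vs the overall expected percentages
chi2_i, w_i = chi2_and_w([60, 25, 1, 14], [62.5, 23.0, 6.0, 8.5], n=200, df=3)
# (ii) student counts by year vs a 1:1:1:1 split of 20 students
chi2_ii, w_ii = chi2_and_w([7, 5, 5, 3], [5, 5, 5, 5], n=20, df=3)
print(round(chi2_i), round(w_i, 3), round(chi2_ii, 1), round(w_ii, 3))
```

This reproduces χ² ≈ 8 with w ≈ 0.115 for (i) and χ² = 1.6 with w ≈ 0.163 for (ii).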
&lt;/div>
&lt;div id="simple-multiple-linear-regression" class="section level2">
&lt;h2>Simple &amp;amp; Multiple Linear Regression :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This test determines whether there is a significant relationship between two or more normally distributed numerical variables; the predictor variable(s) are used to predict the response variable.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>2 or &amp;gt;2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>NA&lt;/td>
&lt;td>NA&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(f2)=\sqrt{R^2}\]&lt;/span> Where, &lt;span class="math display">\[R^2= Goodness\:of \:fit\:measure(i.e., Adjusted\:R^2)\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Is there a relationship between height and weight in college males?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0: f^2 = 0\)&lt;/span> and &lt;span class="math inline">\(H_1: f^2 > 0\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a large effect size.&lt;/p>&lt;/li>
&lt;li>&lt;p>For &lt;span class="math inline">\(f^2\)&lt;/span>: &lt;strong>0.02=small&lt;/strong>, &lt;strong>0.15=medium&lt;/strong>, and &lt;strong>0.35=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>For simple regression (only one predictor variable), the numerator df = 1; for multiple regression it is the number of predictor variables.&lt;/p>&lt;/li>
&lt;li>&lt;p>The output is the denominator degrees of freedom (v) rather than the sample size. Since &lt;span class="math inline">\(v = n - p - 1\)&lt;/span> (where p = number of predictors), round v up and add &lt;span class="math inline">\(p + 1\)&lt;/span> to get the sample size — i.e., add 2 for simple linear regression, and add p + 1 for multiple linear regression.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.f2.test(u =, v= , f2=, sig.level =, power = )&lt;/em>&lt;/p>
&lt;ul>
&lt;li>u= numerator degrees of freedom&lt;/li>
&lt;li>v= denominator degrees of freedom&lt;/li>
&lt;li>f2= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>To calculate sample size: Sample Size(n)= &lt;strong>(denominator degrees of freedom(v) + Total No. of variables)&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round( pwr.f2.test(u=1, f2=0.35, sig.level=0.05, power=0.80)$v,0)+2)) ##--2 has add because it is a simple linear regression&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :25&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example: (2)&lt;/strong> &lt;strong>You are interested in determining if height (meters), weight (grams), and fertilizer added (grams) in plants can predict yield (grams of berries). You collect the following trial data. Here &lt;span class="math inline">\(\alpha=0.05\)&lt;/span>, &amp;amp; &lt;span class="math inline">\(Power=(1-\beta)=80\%\)&lt;/span>&lt;/strong>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Variables&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">Yield&lt;/td>
&lt;td align="center">46.8, 48.7, 48.4, 53.7, 56.7&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Height&lt;/td>
&lt;td align="center">14.6, 19.6, 18.6, 25.5, 20.4&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Weight&lt;/td>
&lt;td align="center">95.3, 99.5, 94.1, 110, 103&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">Fertilizer&lt;/td>
&lt;td align="center">2.1, 3.2, 4.3, 1.1, 4.3&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, at first we have to find the &lt;span class="math inline">\(Adjusted\:R^2\)&lt;/span> value by fitting the linear model.&lt;/p>&lt;/li>
&lt;li>&lt;p>Then, we will find the sample size.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code :&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>#--Data--#
yield= c(46.8, 48.7, 48.4, 53.7, 56.7)
height= c(14.6, 19.6, 18.6, 25.5, 20.4)
weight= c(95.3, 99.5, 94.1, 110, 103)
Fert= c(2.1, 3.2, 4.3, 1.1, 4.3)
#-- Fitting Linear Model --#
Model= lm(height~yield + weight + Fert)
#-- Extracting Adjusted R^2 Value --#
R_Sqared= summary(Model)$adj.r.squared
#-- Calculating Effect (f2) --#
f.2= sqrt(R_Sqared)
#-- Calculating sample size --#
##--4 has added because it is a multiple linear Regression with 3 predictors and one dependent variable--##
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round( pwr.f2.test(u=1, f2=f.2, sig.level=0.05, power=0.80)$v,0)+4))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :14&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>
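Because `pwr.f2.test` returns the denominator degrees of freedom v rather than n, recovering the sample size is just bookkeeping: the residual df of a linear model is \(v = n - p - 1\), so \(n = v + p + 1\). A sketch, with v values back-solved from the two printed sample sizes above (23 and 10 are implied, not stated in the text):

```python
def n_from_v(v, predictors):
    # residual df of a linear model: v = n - p - 1, so n = v + p + 1
    return round(v) + predictors + 1

print(n_from_v(23, 1))  # simple regression example: v ~ 23 plus intercept and 1 predictor
print(n_from_v(10, 3))  # multiple regression example: v ~ 10 plus intercept and 3 predictors
```

These give back the sample sizes of 25 and 14 reported above.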
&lt;/div>
&lt;div id="correlation" class="section level2">
&lt;h2>Correlation :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> This test determines whether there is a linear association between two numerical variables. It is like simple regression, but not identical.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>NA&lt;/td>
&lt;td>NA&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> Effect Size= r= Correlation Coefficient&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(1)&lt;/strong> &lt;strong>Is there a correlation between hours studied and test score?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\::r=0\)&lt;/span> and &lt;span class="math inline">\(H_1\:: r\neq 0\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a large correlation.&lt;/p>&lt;/li>
&lt;li>&lt;p>For Correlation levels (r): &lt;strong>0.1=small&lt;/strong>, &lt;strong>0.3=medium&lt;/strong>, and &lt;strong>0.5=large&lt;/strong> correlations.&lt;/p>&lt;/li>
&lt;li>&lt;p>Here the approximate correlation power calculation is done via the arctanh (Fisher’s &lt;em>z&lt;/em>) transformation.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.r.test(r = , sig.level = , power = )&lt;/em>&lt;/p>
&lt;ul>
&lt;li>r= correlation&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.r.test(r=0.5, sig.level=0.05, power=0.80)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :28&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Example:(2)&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if there is a correlation between height and weight in men.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Males&lt;/th>
&lt;th align="center">Measures&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Height&lt;/td>
&lt;td align="center">178, 166, 172, 186, 182&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">Weight&lt;/td>
&lt;td align="center">165, 139, 257, 225, 196&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining whether, in lab mice, there is a correlation between longevity (in months) and average protein intake (in grams).&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>first, calculate the correlation value, and then calculate the sample size.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>#-- Data --#
MH= c(178,166,172,186,182)
MW= c(165,139,257,225,196)
#-- correlation value --#
r= cor(MH,MW)   # r is approximately 0.37
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.r.test(r=0.37, sig.level=0.05, power=0.80)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :54&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;p>(ii) You are interested in determining whether, in lab mice, there is a correlation between longevity (in months) and average protein intake (in grams).&lt;/p>
&lt;ul>
&lt;li>&lt;p>Guessed large (0.5) correlation&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(pwr.r.test(r=0.5, sig.level=0.05, power=0.80)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 28&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
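As a cross-check on the <em>pwr.r.test</em> results above, the arctanh (Fisher's z) approximation can be written out in a few lines of base R. This is a rough sketch (the function name is just illustrative); the normal approximation lands within about one unit of pwr.r.test's answer of 28.

```r
# Approximate sample size for testing H0: r = 0 using Fisher's z
# (arctanh) transformation: z = atanh(r) has SE ~ 1/sqrt(n - 3)
approx_n_cor = function(r, sig.level = 0.05, power = 0.80) {
  z_alpha = qnorm(1 - sig.level / 2)  # two-sided critical value
  z_beta  = qnorm(power)
  ((z_alpha + z_beta) / atanh(r))^2 + 3
}

approx_n_cor(0.5)  # roughly 29, close to pwr.r.test's 28
```

The "+3" term comes from the 1/sqrt(n - 3) standard error of the transformed correlation.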
&lt;/div>
&lt;div id="non-parametric-t-tests" class="section level2">
&lt;h2>Non-Parametric T-tests :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> non-parametric counterparts of the t-tests, used when the data do not satisfy the parametric (normality) assumptions.
&lt;ul>
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{One Mean Wilcoxon:}}\)&lt;/span>&lt;/em> sample mean against set value&lt;/li>
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{Mann-Whitney:}}\)&lt;/span>&lt;/em> two sample means (unpaired)&lt;/li>
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{Paired Wilcoxon:}}\)&lt;/span>&lt;/em> two sample means (paired)&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Name&lt;/th>
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">&lt;span class="math inline">\(\color{red}{\text{One Mean Wilcoxon:}}\)&lt;/span>&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>No&lt;/td>
&lt;td>NA&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">&lt;span class="math inline">\(\color{red}{\text{Mann-Whitney:}}\)&lt;/span>&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">&lt;span class="math inline">\(\color{red}{\text{Paired Wilcoxon:}}\)&lt;/span>&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math inline">\(\text{Effect Size}\;(\text{Cohen’s D:})= \frac{{|{\mu}_{H_1}-{\mu}_{H_0}}|}{\sigma};\frac{{|{\mu}_{H_1}-{\mu}_{H_0}}|}{\sigma_{pooled}};\frac{{\mu}_{\text{diff}}}{\sigma_{\text{diff}}}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> &lt;strong>(for t-tests, 0.2=small, 0.5=medium, and 0.8 large effect sizes)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: decimal">
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{One Mean Wilcoxon:}}\)&lt;/span>&lt;/em> &lt;strong>Is the average number of children in Grand Forks families different than 1?&lt;/strong>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;strong>Solution:&lt;/strong>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\:: 1\;\text{child}\)&lt;/span> and &lt;span class="math inline">\(H_1\:: &amp;gt;1\;\text{child}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a &lt;strong>medium effect size.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Select one-tailed (greater)&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“two.sample”, “&lt;strong>one.sample&lt;/strong>”, “paired”)) + 15%&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>Pwer_t=pwr.t.test(d=0.5, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;, alternative=&amp;quot;greater&amp;quot;)
##-- Nonparametric Correction : adding 15% --##
print(paste0(&amp;quot;Sample Size : &amp;quot;,round((Pwer_t$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;Sample Size : 30&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: decimal">
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{Mann-Whitney:}}\)&lt;/span>&lt;/em> &lt;strong>Does the average number of snacks per day for individuals on a diet differ between young and old persons?&lt;/strong>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;strong>Solution:&lt;/strong>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\:: 0\;\text{difference in snack number, }\)&lt;/span> and &lt;span class="math inline">\(H_1\:: \neq 0\;\text{difference in snack number}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a &lt;strong>small effect size&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Select two-sided&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“&lt;strong>two.sample&lt;/strong>”, “one.sample”, “paired”)) + 15%&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>Note: &lt;a href="https://www.graphpad.com/guides/prism/7/statistics/stat_sample_size_for_nonparametric_.htm">&lt;strong>“Parametric t-test + 15% Approach”&lt;/strong> for calculating Sample Size for Non Parametric test&lt;/a>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>Pwer_t=pwr.t.test(d=0.2, sig.level=0.05, power=0.80, type=&amp;quot;two.sample&amp;quot;, alternative=&amp;quot;two.sided&amp;quot;)
##-- Nonparametric Correction : adding 15% --##
print(paste0(&amp;quot;Sample Size : &amp;quot;,round((Pwer_t$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;Sample Size : 452&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="3" style="list-style-type: decimal">
&lt;li>&lt;em>&lt;span class="math inline">\(\color{red}{\text{Paired Wilcoxon:}}\)&lt;/span>&lt;/em> &lt;strong>Is genome methylation patterns different between identical twins?&lt;/strong>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;strong>Solution:&lt;/strong>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\::\text{0% methylation}\)&lt;/span> and &lt;span class="math inline">\(H_1\:: \neq \text{0% methylation}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a &lt;strong>large effect size&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Select one-tailed (greater)&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.t.test(d = , sig.level = , power = , type = c(“two.sample”, “one.sample”, &lt;strong>“paired”&lt;/strong>)) + 15%&lt;/em>&lt;/p>
&lt;ul>
&lt;li>d= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;li>type= type of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>Pwer_t= pwr.t.test(d=0.8, sig.level=0.05, power=0.80, type=&amp;quot;paired&amp;quot;, alternative=&amp;quot;greater&amp;quot;)
##-- Nonparametric Correction : adding 15% --##
print(paste0(&amp;quot;Sample Size : &amp;quot;,round((Pwer_t$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;Sample Size : 13&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with &lt;span class="math inline">\(\alpha=0.05\)&lt;/span>, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if the average number of pets in Grand Forks families is greater than 1. You collect the following trial data for pet number.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Variable&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Pets&lt;/td>
&lt;td align="center">1, 1, 1, 3, 2, 1, 0, 0, 0, 4&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if the number of meals per day for individuals on a diet is higher in younger people than older. You collected trial data on meals per day.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Variable&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Young meals&lt;/td>
&lt;td align="center">1, 2, 2, 3, 3, 3, 3, 4&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">Older meals&lt;/td>
&lt;td align="center">1, 1, 1, 2, 2, 2, 3, 3&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(iii)&lt;/strong> You are interested in determining if genome methylation patterns are higher in the first fraternal twin born compared to the second. You collected the following trial data on methylation level difference (in percentage).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Variable&lt;/th>
&lt;th align="center">Values&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Methy.Diff(%)&lt;/td>
&lt;td align="center">5.96, 5.63, 1.25, 1.17, 3.59, 1.64, 1.6, 1.4&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here, you want to know whether the average number of pets in Grand Forks families is greater than 1. From the trial data, the sample mean is 1.3 pets (SD=1.34).&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Effect size = &lt;span class="math inline">\((Mean_{H_1}-Mean_{H_0})/SD= (1.3-1.0)/1.34 =0.224\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>One-tailed test&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>Pwer_t= pwr.t.test(d=0.224, sig.level=0.05, power=0.80, type=&amp;quot;one.sample&amp;quot;, alternative=&amp;quot;greater&amp;quot;)
#-- Non-parametric Correction --#
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(Pwer_t$n*1.15,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :143&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>&lt;em>Try it by yourself.&lt;/em>&lt;/li>
&lt;/ol>&lt;/li>
&lt;li>&lt;ol start="3" style="list-style-type: lower-roman">
&lt;li>&lt;em>Try it by yourself.&lt;/em>&lt;/li>
&lt;/ol>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
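The effect size used in solution (i) can be reproduced directly from the trial pet-count data, and the parametric sample size obtained with base R's own <em>power.t.test()</em> instead of the <em>pwr</em> package, before applying the 15% non-parametric correction. A sketch under the same assumptions as the solution above:

```r
pets = c(1, 1, 1, 3, 2, 1, 0, 0, 0, 4)

# One-sample Cohen's d against the hypothesized mean of 1
d = (mean(pets) - 1) / sd(pets)   # (1.3 - 1.0) / 1.34 = 0.224

# Parametric one-sample, one-sided sample size (base R stats, no pwr needed)
res = power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80,
                   type = "one.sample", alternative = "one.sided")

# Non-parametric correction: add 15%
round(res$n * 1.15)   # about 143, matching the pwr-based answer
```

Note that <em>power.t.test()</em> parameterizes the effect as delta/sd, so passing sd = 1 makes delta play the role of Cohen's d.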
&lt;/div>
&lt;div id="kruskal-wallace-test" class="section level2">
&lt;h2>Kruskal-Wallis Test :&lt;/h2>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Description:&lt;/strong> this tests whether at least one mean differs among more than two groups for a non-normally distributed variable (AKA the non-parametric ANOVA). There really isn’t a good way of calculating its sample size in R, but you can use a rule of thumb:&lt;/p>
&lt;ul>
&lt;li>Run Parametric Test&lt;/li>
&lt;li>Add 15% to total sample size&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>&amp;gt;2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> Effect Size = Same as the effect size for the ANOVA.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> ** Is there a difference in draft rank across 3 different months? **&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\::r=0\)&lt;/span> and &lt;span class="math inline">\(H_1\:: r\neq 0\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>There will be a total of 3 groups (months)&lt;/p>&lt;/li>
&lt;li>&lt;p>You don’t have background info, so you guess that there is a &lt;strong>medium effect size.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For &lt;span class="math inline">\(\text{f-test}\)&lt;/span> : &lt;strong>0.1=small&lt;/strong>, &lt;strong>0.25=medium&lt;/strong>, and &lt;strong>0.4=large&lt;/strong> correlations.&lt;/p>&lt;/li>
&lt;li>&lt;p>There is no one-/two-tailed choice in ANOVA.&lt;/p>&lt;/li>
&lt;li>&lt;p>Groups assumed to be the same size.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>pwr&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>pwr.anova.test(k =, f = , sig.level = , power = )&lt;/em>&lt;/p>
&lt;ul>
&lt;li>k= number of groups&lt;/li>
&lt;li>f= effect size&lt;/li>
&lt;li>sig.level= significance level&lt;/li>
&lt;li>power= power of test&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> ##-- Balanced one-way analysis of variance power calculation --##
Pwr_Anova= pwr.anova.test(k =3 , f =0.25 , sig.level=0.05 , power =0.80 )
#-- Non-parametric Correction --#
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round((Pwr_Anova$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :60&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining whether there is a difference in hours worked across 3 different groups (faculty, staff, and hourly workers). You collect the following trial data of weekly hours.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Groups&lt;/th>
&lt;th align="center">Working Hours&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">Faculty&lt;/td>
&lt;td align="center">42, 45, 46, 55, 42&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">Staff&lt;/td>
&lt;td align="center">46, 45, 37, 42, 40&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="center">Hourly&lt;/td>
&lt;td align="center">29, 42, 33, 50, 23&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining whether there is a difference in assistant professor salaries across 25 different departments.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>&lt;span class="math inline">\(\eta^2 = SS_T/TSS=286.5/(286.5+625.2) = 0.314\)&lt;/span>
Note that, you can calculate &lt;span class="math inline">\(SS_T\)&lt;/span> &amp;amp; &lt;span class="math inline">\(TSS\)&lt;/span> by performing ANOVA on the dataset using &lt;em>aov()&lt;/em> function.&lt;/p>&lt;/li>
&lt;li>&lt;p>Effect size&lt;span class="math inline">\((f)\)&lt;/span> = &lt;span class="math inline">\(\sqrt{\eta^2/(1-\eta^2)}=\sqrt{0.314/(1- 0.314)} = 0.677\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>No. of groups= 3&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> ##-- Balanced one-way analysis of variance power calculation --##
Pwr_Anova= pwr.anova.test(k =3, f =0.677, sig.level=0.05, power =0.80)
#-- Non-parametric Correction --#
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round((Pwr_Anova$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :9&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>You are interested in determining whether there is a difference in assistant professor salaries across 25 different departments.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Guess small effect size (0.10)&lt;/p>&lt;/li>
&lt;li>&lt;p>No. of groups= 25&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> #-- Balanced one-way analysis of variance power calculation --#
Pwr_Anova= pwr.anova.test(k =25, f =0.10, sig.level=0.05, power =0.80)
#-- Non-parametric Correction --#
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round((Pwr_Anova$n*1.15),0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :104&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
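For scenario (i), the quoted values of the eta-squared and f can be recomputed from the working-hours trial data with base R's <em>aov()</em> function alone (the data-frame and variable names here are illustrative):

```r
hours = data.frame(
  Group = rep(c("Faculty", "Staff", "Hourly"), each = 5),
  Hours = c(42, 45, 46, 55, 42,   # Faculty
            46, 45, 37, 42, 40,   # Staff
            29, 42, 33, 50, 23))  # Hourly

# Sums of squares from the one-way ANOVA table: SS_T (between) and SSE
ss   = summary(aov(Hours ~ Group, data = hours))[[1]][["Sum Sq"]]
eta2 = ss[1] / sum(ss)          # SS_T / TSS = 286.5 / 911.7, about 0.314
f    = sqrt(eta2 / (1 - eta2))  # about 0.677
```

These are the same eta-squared = 0.314 and f = 0.677 plugged into <em>pwr.anova.test()</em> in the solution above.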
&lt;/div>
&lt;div id="repeated-measures-anova" class="section level2">
&lt;h2>Repeated Measures ANOVA :&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Description:&lt;/strong> this tests whether at least one mean differs among groups, where the groups are repeated measurements (more than two) on a normally distributed variable. Repeated Measures ANOVA is the extension of the Paired T-test to more than two groups.&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Numeric Var(s)&lt;/th>
&lt;th>Cat. Var(s)&lt;/th>
&lt;th>Cat. Var Group #&lt;/th>
&lt;th>Cat. Var # of interest&lt;/th>
&lt;th>Parametric&lt;/th>
&lt;th>Paired&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>&amp;gt; 2&lt;/td>
&lt;td>1&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Effect size calculation:&lt;/strong> &lt;span class="math display">\[Effect\:Size(f)=\frac{\sigma_m}{\sigma}\]&lt;/span> Where, &lt;span class="math display">\[\sigma_m=\sqrt{\frac{\sum_{j=1}^K{(m_j-m)^2}}{k}}= Standard\:Deviation\:of\:group\:means\]&lt;/span> &lt;span class="math display">\[m_j= j^{th}\:group\:mean\:,\:\:\forall\:j=1(1)K\]&lt;/span> &lt;span class="math display">\[m=Overall\:mean\]&lt;/span> &lt;span class="math display">\[K=number\:of\:groups\]&lt;/span> &lt;span class="math display">\[\sigma=overall\:standard\:deviation\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(1)" class="uri">Example:(1)&lt;/a>&lt;/strong> &lt;strong>Is there a difference in blood pressure at 1, 2, 3, and 4 months post-treatment?&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Here, &lt;span class="math inline">\(H_0\::0\)&lt;/span> and &lt;span class="math inline">\(H_1\:: \neq 0\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>1 group&lt;/strong>, &lt;strong>4 measurements&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>We will guess that the &lt;strong>effect sizes will be small.&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For the effect size &lt;span class="math inline">\(f\)&lt;/span>: &lt;strong>0.1=small&lt;/strong>, &lt;strong>0.25=medium&lt;/strong>, and &lt;strong>0.4=large&lt;/strong> effect sizes.&lt;/p>&lt;/li>
&lt;li>&lt;p>For the nonsphericity correction coefficient, 1 means sphericity is met. There are methods to estimate it, but we will go with 1 for this example.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Package:&lt;/strong> &lt;em>WebPower&lt;/em> Package&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R function:&lt;/strong> &lt;em>wp.rmanova(ng = NULL, nm = NULL, f = NULL, nscor = 1, alpha = 0.05, power = NULL, type = 0) &lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;p>ng= number of groups&lt;/p>&lt;/li>
&lt;li>&lt;p>nm= number of measurements&lt;/p>&lt;/li>
&lt;li>&lt;p>f= effect size&lt;/p>&lt;/li>
&lt;li>&lt;p>nscor= nonsphericity correction coefficient&lt;/p>&lt;/li>
&lt;li>&lt;p>alpha= significance level of the test&lt;/p>&lt;/li>
&lt;li>&lt;p>power= statistical power&lt;/p>&lt;/li>
&lt;li>&lt;p>type= (0,1,2) The value “0” is for between-effect; “1” is for within-effect; and “2” is for interaction effect.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Note:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Within-effects:&lt;/strong> variability of a particular value for individuals in a sample&lt;/li>
&lt;li>&lt;strong>Between-effects:&lt;/strong> examines differences between individuals&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Answer of the problem:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>library(WebPower)
print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(wp.rmanova(n=NULL, ng=1, nm=4, f=0.1, nscor=1, alpha=0.05, power=0.80, type=1)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :1092&amp;quot;&lt;/code>&lt;/pre>
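The effect size f defined above is easy to compute by hand in base R. The group means and overall SD below are purely hypothetical, chosen only to show the mechanics of the formula:

```r
# Hypothetical time-point means and an assumed overall SD
m     = c(10, 12, 14)   # group means (assumed)
sigma = 5               # overall standard deviation (assumed)

# sigma_m is the population-style SD of the group means (divide by K, not K-1)
sigma_m = sqrt(mean((m - mean(m))^2))
f = sigma_m / sigma     # about 0.327
```

Note the deliberate use of mean() of squared deviations rather than sd(), since the formula divides by K rather than K-1.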
&lt;ul>
&lt;li>&lt;p>Note:&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;a href="https://webpower.psychstat.org/wiki/manual/power_of_rmanova#power_curve">Power analysis for within-effect test&lt;/a>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;a href="https://stattrek.com/anova/repeated-measures/sphericity.aspx">Sphericity and Repeated Measures ANOVA&lt;/a>&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;a href="Example:(2)" class="uri">Example:(2)&lt;/a>&lt;/strong> &lt;strong>Calculate the sample size for the following scenarios (with α=0.05, and power=0.80):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>(i)&lt;/strong> You are interested in determining if there is a difference in blood serum levels at 6, 12, 18, and 24 months post-treatment. You collect the following trial data of blood serum in mg/dL.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="center">Months&lt;/th>
&lt;th align="center">Blood Serum&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="center">6 Months&lt;/td>
&lt;td align="center">38, 13, 32, 35, 21&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">12 Months&lt;/td>
&lt;td align="center">38, 44, 35, 48, 27&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="center">18 Months&lt;/td>
&lt;td align="center">46, 15, 53, 51, 29&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="center">24 Months&lt;/td>
&lt;td align="center">52, 29, 60, 44, 36&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>&lt;/li>
&lt;li>&lt;p>&lt;strong>(ii)&lt;/strong> You are interested in determining if there is a difference in antibody levels at 1, 2, and 3 months post-treatment.&lt;/p>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;ol style="list-style-type: lower-roman">
&lt;li>Here,&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Effect Size:&lt;/strong> &lt;span class="math inline">\(f =\sqrt{\frac{(27.8−37.3)^2+(38.4−37.3)^2+(38.8−37.3)^2+(25.2−37.3)^2}{4}}/ 12.74 = 0.608\)&lt;/span>&lt;/p>
&lt;ul>
&lt;li>To check sphericity, run a repeated-measures ANOVA:&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>library(ez)&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>data=data.frame(Patient= factor(rep(c(1,2,3,4,5),4)),
Month= factor(c(rep(&amp;quot;6 Months&amp;quot;,5),rep(&amp;quot;12 Months&amp;quot;,5),rep(&amp;quot;18 Months&amp;quot;,5),rep(&amp;quot;24 Months&amp;quot;,5))),
Serum= c(38,13,32,35,21,38,44,35,48,27,46,15,53,51,29,52,29,60,44,36))
anova3= ezANOVA(data, dv=Serum, wid=Patient, within=.(Month),detailed=TRUE)
anova3&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## $ANOVA
## Effect DFn DFd SSn SSd F p p&amp;lt;.05 ges
## 1 (Intercept) 1 4 27825.8 1506.7 73.872171 0.001006882 * 0.9212804
## 2 Month 3 12 706.6 870.9 3.245378 0.060146886 0.2291032
##
## $`Mauchly&amp;#39;s Test for Sphericity`
## Effect W p p&amp;lt;.05
## 2 Month 0.1556327 0.4348287
##
## $`Sphericity Corrections`
## Effect GGe p[GG] p[GG]&amp;lt;.05 HFe p[HF] p[HF]&amp;lt;.05
## 2 Month 0.4844127 0.1187469 0.6892662 0.09014564&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;p>Note: &lt;a href="https://people.umass.edu/bwdillon/LING609/Lectures/Section3/Lecture21.html">For more details about &lt;em>ezANOVA() function&lt;/em> for Sphericity and Repeated Measures ANOVA&lt;/a>&lt;/p>&lt;/li>
&lt;li>&lt;p>Mauchly’s test for sphericity was non-significant (p = 0.43), so a correction coefficient of 1 was used.&lt;/p>&lt;/li>
&lt;li>&lt;p>One group, four measurements, within-effects so type 1&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste0(&amp;quot;The Sample Size is :&amp;quot;,round(wp.rmanova(n=NULL, ng=1, nm=4, f=0.608, nscor=1, alpha=0.05, power=0.80, type=1)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is :31&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;li>&lt;ol start="2" style="list-style-type: lower-roman">
&lt;li>You are interested in determining if there is a difference in antibody levels at 1, 2, and 3 months post-treatment.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;p>Guess a nonsphericity correction of 1 and a medium effect size (0.25)&lt;/p>&lt;/li>
&lt;li>&lt;p>One group, three measurements, type 1&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>R Code:&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> print(paste(&amp;quot;The Sample Size is :&amp;quot;,round(wp.rmanova(n=NULL, ng=1, nm=3, f=0.25, nscor=1, alpha=0.05, power=0.80, type=1)$n,0)))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;The Sample Size is : 156&amp;quot;&lt;/code>&lt;/pre>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div></description></item><item><title>Simulation &amp; Statistics in R</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/</link><pubDate>Tue, 26 Oct 2021 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#introduction">INTRODUCTION&lt;/a>&lt;/li>
&lt;li>&lt;a href="#concept-of-simulation">Concept Of Simulation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-do-we-simulate">Why do we simulate ?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#drawing-of-simple-random-sample">Drawing of Simple Random Sample&lt;/a>&lt;/li>
&lt;li>&lt;a href="#example">Example&lt;/a>&lt;/li>
&lt;li>&lt;a href="#unequal-probability-sampling">Unequal Probability Sampling&lt;/a>&lt;/li>
&lt;li>&lt;a href="#similating-coin-tosses">Similating Coin Tosses&lt;/a>&lt;/li>
&lt;li>&lt;a href="#find-the-proportion-of-heads-tails-in-long-run">Find the Proportion of heads &amp;amp; tails in long run&lt;/a>&lt;/li>
&lt;li>&lt;a href="#find-the-proportion-of-heads-tails-in-long-run-1">Find the Proportion of heads &amp;amp; tails in long run&lt;/a>&lt;/li>
&lt;li>&lt;a href="#finding-probabilities">Finding Probabilities&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fact">Fact&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#drawing-a-card">Drawing a Card&lt;/a>&lt;/li>
&lt;li>&lt;a href="#divisibility-test">Divisibility Test&lt;/a>&lt;/li>
&lt;li>&lt;a href="#urn-ball-problem">Urn-Ball Problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#urn-ball-problem-1">Urn-Ball Problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#birthday-problem">Birthday Problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#card-shiffting">Card Shiffting&lt;/a>&lt;/li>
&lt;li>&lt;a href="#cut-shuffle">Cut Shuffle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#simulating-a-cut-shuffle">Simulating a Cut Shuffle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#riffle-shuffle">Riffle Shuffle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#simulating-riffle-shuffle">Simulating Riffle Shuffle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#simulating-riffle-shuffle-1">Simulating Riffle Shuffle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#simulating-random-variables">Simulating Random Variables&lt;/a>&lt;/li>
&lt;li>&lt;a href="#using-it-farther">Using it farther&lt;/a>&lt;/li>
&lt;li>&lt;a href="#much-complicated-ones">Much Complicated Ones&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fact-1">Fact&lt;/a>&lt;/li>
&lt;li>&lt;a href="#algorithm">Algorithm&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#generating-poisson-distribution">Generating Poisson Distribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#continuous-distributions">Continuous Distributions&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fact-2">Fact&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#working-with-inbuilt-r-functions">Working with inbuilt R functions&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plotting-the-normal-density">Plotting the normal density&lt;/a>&lt;/li>
&lt;li>&lt;a href="#other-standard-distributions-in-r">Other Standard Distributions in R&lt;/a>&lt;/li>
&lt;li>&lt;a href="#central-limit-theorem">Central Limit Theorem&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#theorem">Theorem&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#law-of-laege-numbers">Law of Laege Numbers&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plotting-the-probability">Plotting the Probability&lt;/a>&lt;/li>
&lt;li>&lt;a href="#strong-law-of-large-numbers">Strong Law of large numbers&lt;/a>&lt;/li>
&lt;li>&lt;a href="#illustrating-strong-law">Illustrating Strong Law&lt;/a>&lt;/li>
&lt;li>&lt;a href="#family-planning">Family Planning&lt;/a>&lt;/li>
&lt;li>&lt;a href="#using-simulation-to-construct-tests">Using Simulation to construct Tests&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plot-of-beta-densities">Plot of Beta Densities&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-type-of-test-shall-we-perform">What type of test shall we perform ?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#now-lets-find-the-c">Now lets find the c&lt;/a>&lt;/li>
&lt;li>&lt;a href="#generating-normal-variables">Generating Normal Variables&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fact-box-muller-transformation">Fact: Box-Muller transformation&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#generating-bivariate-normal-variables">Generating Bivariate Normal Variables&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fact-3">Fact:&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#monte-carlo-simulation">Monte Carlo Simulation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#bias-variance">Bias &amp;amp; Variance&lt;/a>&lt;/li>
&lt;li>&lt;a href="#monte-carlo-integration">Monte Carlo Integration&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#example-1">Example&lt;/a>&lt;/li>
&lt;li>&lt;a href="#another-example">Another Example&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;li>&lt;a href="#an-assignment-problem">An assignment Problem&lt;/a>&lt;/li>
&lt;li>&lt;a href="#brownian-motion">Brownian Motion&lt;/a>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="introduction" class="section level2">
&lt;h2>INTRODUCTION&lt;/h2>
&lt;p>As statisticians, we often deal with random experiments. There are various techniques to predict the outcomes of such experiments :&lt;/p>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Wait and See:&lt;/strong> Designing winning strategies by trial-and-error method.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Solving Probability Models:&lt;/strong> Assume a definite mathematical model to predict outcome, sometimes gets complicated.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>Simulate Probability Models:&lt;/strong> Also start with a mathematical model, but instead of solving it mathematically we use computers to perform the virtual random experiment following that model, and then analyze the artificial data the computers generate. Similar to “wait and see”, except that we do not need to wait for reality.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="concept-of-simulation" class="section level2">
&lt;h2>Concept Of Simulation&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Assume a mathematical model.&lt;/p>&lt;/li>
&lt;li>&lt;p>Use computers to perform the random experiment artificially.&lt;/p>&lt;/li>
&lt;li>&lt;p>Computers can perform artificial random experiments because they can generate random numbers.&lt;/p>&lt;/li>
&lt;li>&lt;p>Use the artificial data generated by the computers to analyze the model and predict the outcome.&lt;/p>&lt;/li>
&lt;li>&lt;p>Note that &lt;em>the random numbers generated by computers are not random in the absolute sense; they are only pseudo-random numbers.&lt;/em>&lt;/p>&lt;/li>
&lt;/ul>
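&lt;p>For instance, since the generated numbers are only pseudo-random, fixing the seed makes a “random” experiment exactly reproducible. A minimal sketch in base R:&lt;/p>
&lt;pre class="r">&lt;code>set.seed(42)
x=runif(3)      #-- three pseudo-random U(0,1) numbers
set.seed(42)    #-- reset the generator to the same state
y=runif(3)
identical(x,y)  #-- TRUE: the same seed reproduces the same stream&lt;/code>&lt;/pre>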
&lt;/div>
&lt;div id="why-do-we-simulate" class="section level2">
&lt;h2>Why do we simulate ?&lt;/h2>
&lt;ul>
&lt;li>&lt;p>To have a better understanding of the known probability models.&lt;/p>&lt;/li>
&lt;li>&lt;p>To visualize a probability model with examples of outcome of a random experiment ( &lt;em>which in reality are hard to obtain&lt;/em> )&lt;/p>&lt;/li>
&lt;li>&lt;p>To have an idea about the result of a statistical model which cannot be solved explicitly using formula.&lt;/p>&lt;/li>
&lt;li>&lt;p>To judge the performance of a model before applying it to a real data situation.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="drawing-of-simple-random-sample" class="section level2">
&lt;h2>Drawing of Simple Random Sample&lt;/h2>
&lt;ul>
&lt;li>We use the sample() command for both &lt;strong>with-replacement&lt;/strong> &amp;amp; &lt;strong>without-replacement&lt;/strong> sampling.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(123)
sample(c(&amp;quot;A&amp;quot;,&amp;quot;B&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;D&amp;quot;,&amp;quot;E&amp;quot;),size = 3,replace = F) #-- Without replacement&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;C&amp;quot; &amp;quot;B&amp;quot; &amp;quot;E&amp;quot;&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>set.seed(123)
sample(c(&amp;quot;A&amp;quot;,&amp;quot;B&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;D&amp;quot;,&amp;quot;E&amp;quot;),size = 3,replace = T) #-- With replacement&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;C&amp;quot; &amp;quot;C&amp;quot; &amp;quot;B&amp;quot;&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="example" class="section level2">
&lt;h2>Example&lt;/h2>
&lt;pre class="r">&lt;code>set.seed(5)
sample(1:10,size=2,replace = T)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 2 9&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>set.seed(6)
sample(100,size=5)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 53 10 45 78 56&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="unequal-probability-sampling" class="section level2">
&lt;h2>Unequal Probability Sampling&lt;/h2>
&lt;pre class="r">&lt;code>set.seed(7)
sample(c(&amp;quot;A&amp;quot;,&amp;quot;B&amp;quot;,&amp;quot;C&amp;quot;),size = 2,prob = c(0.1,0.4,0.5))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;A&amp;quot; &amp;quot;C&amp;quot;&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="similating-coin-tosses" class="section level2">
&lt;h2>Simulating Coin Tosses&lt;/h2>
&lt;ul>
&lt;li>An unbiased coin is tossed 10 times. Let’s see the output of the tosses.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(100)
sample(c(&amp;quot;H&amp;quot;,&amp;quot;T&amp;quot;),10,replace = T)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;T&amp;quot; &amp;quot;H&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;H&amp;quot; &amp;quot;H&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;H&amp;quot;&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Suppose now the probability of a head is 2/6&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(100)
sample(c(&amp;quot;H&amp;quot;,&amp;quot;T&amp;quot;),10,replace = T,prob = c(2/6,4/6))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;H&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot; &amp;quot;T&amp;quot;&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="find-the-proportion-of-heads-tails-in-long-run" class="section level2">
&lt;h2>Find the Proportion of heads &amp;amp; tails in long run&lt;/h2>
&lt;pre class="r">&lt;code>prop=NULL
size1=seq(100,10000,by=1000)
size2=seq(20000,500000,by=10000)
size=c(size1,size2)
for (n in size)
{
x=sample(0:1,n,rep=T)
prop=c(prop,sum(x)/n)
}
plot(size,prop,type=&amp;quot;l&amp;quot;)
abline(0.5,0)&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="find-the-proportion-of-heads-tails-in-long-run-1" class="section level2">
&lt;h2>Find the Proportion of heads &amp;amp; tails in long run&lt;/h2>
&lt;pre>&lt;code>## [1] 100 1100 2100 3100 4100 5100 6100 7100 8100 9100
## [11] 20000 30000 40000 50000 60000 70000 80000 90000 100000 110000
## [21] 120000 130000 140000 150000 160000 170000 180000 190000 200000 210000
## [31] 220000 230000 240000 250000 260000 270000 280000 290000 300000 310000
## [41] 320000 330000 340000 350000 360000 370000 380000 390000 400000 410000
## [51] 420000 430000 440000 450000 460000 470000 480000 490000 500000&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.4800000 0.5100000 0.4985714 0.4880645 0.5026829 0.4905882 0.5039344
## [8] 0.5021127 0.5004938 0.5057143 0.4978500 0.5019333 0.5033500 0.5027600
## [15] 0.5003167 0.5004000 0.4984125 0.5031000 0.4990700 0.4997818 0.4995667
## [22] 0.4988692 0.5021143 0.5017067 0.5019000 0.4986529 0.5001111 0.5005684
## [29] 0.5001250 0.4984714 0.4992182 0.4990478 0.4965500 0.4987200 0.4986769
## [36] 0.4991741 0.4989179 0.5002103 0.4991067 0.4998323 0.5003156 0.4998909
## [43] 0.4985824 0.4995286 0.5017111 0.5003432 0.4990737 0.5005205 0.4994575
## [50] 0.4997585 0.4988833 0.4997023 0.5001773 0.5009356 0.5003457 0.5004979
## [57] 0.5000729 0.4997633 0.4996940&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-9-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="finding-probabilities" class="section level2">
&lt;h2>Finding Probabilities&lt;/h2>
&lt;div id="fact" class="section level3">
&lt;h3>Fact&lt;/h3>
&lt;p>Probability of any event A can be interpreted as the long-term relative frequency of the event A, i.e.,
&lt;span class="math inline">\(\frac{no.\;of\;repetitions\;resulting\;in\;A}{total\;number\;of\;repetitions}\)&lt;/span>
&lt;span class="math inline">\(as\;n\rightarrow\infty\)&lt;/span>&lt;/p>
&lt;ul>
&lt;li>Hence for computing the probability of any event A by simulation, we shall simulate a large number &lt;span class="math inline">\(n\)&lt;/span> of cases and count the number of times the event A has occurred. If this number is &lt;span class="math inline">\(m\)&lt;/span>, then the probability of the event A can be approximated by &lt;span class="math inline">\(\frac{m}{n}\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;div id="drawing-a-card" class="section level2">
&lt;h2>Drawing a Card&lt;/h2>
&lt;ul>
&lt;li>A card is drawn from a full pack of 52 cards. Find the probability that the drawn card is a picture card (i.e., king, queen or jack).&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(125)
pic=NULL
for(i in 1:10000)
{
x=sample(52,size = 1)
if(any(x%%13==c(11,12,0)))
{
pic[i]=1
}
else pic[i]=0
}
sum(pic)/10000&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.2287&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Because, in each suit, &lt;strong>jack&lt;/strong> = &lt;span class="math inline">\(11^{th}\)&lt;/span> card, &lt;strong>queen&lt;/strong> = &lt;span class="math inline">\(12^{th}\)&lt;/span> card, and &lt;strong>king&lt;/strong> = &lt;span class="math inline">\(13^{th}\)&lt;/span> card (and &lt;span class="math inline">\(13 \bmod 13 = 0\)&lt;/span>, which is why the code checks for remainders 11, 12 and 0).&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="divisibility-test" class="section level2">
&lt;h2>Divisibility Test&lt;/h2>
&lt;ul>
&lt;li>A number is chosen at random from 1 to 1000. Find the probability that it is divisible by 3, 5 or 6.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>count=0
for(i in 1:100000)
{
num=sample(1000,1)
if(num%%3==0||num%%5==0||num%%6==0)
{
count=count+1
}
}
count/100000&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.46794&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="urn-ball-problem" class="section level2">
&lt;h2>Urn-Ball Problem&lt;/h2>
&lt;ul>
&lt;li>Suppose an urn contains 7 white and 5 black balls. 3 balls are chosen at random without replacement. Find the probability that :
&lt;ul>
&lt;li>all the 3 balls are white&lt;/li>
&lt;li>2 are white and 1 is black.&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="urn-ball-problem-1" class="section level2">
&lt;h2>Urn-Ball Problem&lt;/h2>
&lt;pre class="r">&lt;code>count1=0; count2=0
balls= as.factor(c(rep(&amp;quot;W&amp;quot;,7),rep(&amp;quot;B&amp;quot;,5)))
for ( i in 1:10000)
{
chosen= sample(balls,3)
if (all(chosen==&amp;quot;W&amp;quot;)) count1=count1+1
if (table(chosen)[&amp;quot;W&amp;quot;]==2) count2=count2+1
}
count1/10000; count2/10000&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.1565&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.4859&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="birthday-problem" class="section level2">
&lt;h2>Birthday Problem&lt;/h2>
&lt;ul>
&lt;li>In a class of 25 students, find the probability that at least two students share the same birthday.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>count=0
for(i in 1:500)
{
##-- drawing samples by SRSWR --##
class=sample(365,25,replace = T)
if(length(unique(class))&amp;lt;length(class))
{
count=count+1
}
}
count/500&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.574&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="card-shiffting" class="section level2">
&lt;h2>Card Shuffling&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Often we speak of a well-shuffled deck of cards.&lt;/p>&lt;/li>
&lt;li>&lt;p>When we shuffle a deck by hand, the shuffling is always imperfect (not random)&lt;/p>&lt;/li>
&lt;li>&lt;p>We can simulate this imperfect shuffling on a computer.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="cut-shuffle" class="section level2">
&lt;h2>Cut Shuffle&lt;/h2>
&lt;ul>
&lt;li>&lt;p>The simplest method is “cutting” the deck.&lt;/p>&lt;/li>
&lt;li>&lt;p>We cut the deck at some random point chosen somewhere around the middle of the deck.&lt;/p>&lt;/li>
&lt;li>&lt;p>Then put the lower part on the top of the upper part.&lt;/p>&lt;/li>
&lt;li>&lt;p>We shall simulate this shuffle.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="simulating-a-cut-shuffle" class="section level2">
&lt;h2>Simulating a Cut Shuffle&lt;/h2>
&lt;pre class="r">&lt;code>cut=function(deck)
{
#choose a random cut point near middle
x=rbinom(1,52,0.5)
temp=c(deck[(x+1):52],deck[1:x])
return(temp)
}
cut(1:52)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## [26] 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## [51] 22 23&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="riffle-shuffle" class="section level2">
&lt;h2>Riffle Shuffle&lt;/h2>
&lt;ul>
&lt;li>&lt;p>A much more reliable way is the riffle shuffle ( &lt;em>also known as the dovetail shuffle&lt;/em>)&lt;/p>&lt;/li>
&lt;li>&lt;p>First split the deck into two parts just as in the cut method.&lt;/p>&lt;/li>
&lt;li>&lt;p>Take the top half in your left hand, and the other half in your right.&lt;/p>&lt;/li>
&lt;li>&lt;p>Release the cards randomly from both the hands.&lt;/p>&lt;/li>
&lt;li>&lt;p>Mathematically, if at any stage there are &lt;span class="math inline">\(a\)&lt;/span> cards in your left hand and &lt;span class="math inline">\(b\)&lt;/span> cards in your right, then the next card comes from the left hand with probability &lt;span class="math inline">\(\frac{a}{a+b}\)&lt;/span> and from the right with probability &lt;span class="math inline">\(\frac{b}{a+b}\)&lt;/span>.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="simulating-riffle-shuffle" class="section level2">
&lt;h2>Simulating Riffle Shuffle&lt;/h2>
&lt;pre class="r">&lt;code>riffle=function(deck)
{
n=length(deck)
x=rbinom(1,52,0.5)
left=deck[1:x]; right=deck[(x+1):52]; k=0;
a=length(left); b=length(right); tab=NULL;
for(i in 1:52 )
{
ind=rbinom(1,1,a/(a+b))
if(ind==1)
{
tab[k+1]=left[a]
left=left[1:(a-1)]
k=k+1; a=a-1
}
else
{
tab[k+1]=right[b]
right=right[1:(b-1)]
k=k+1; b=b-1
}
}
return(tab)
}&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="simulating-riffle-shuffle-1" class="section level2">
&lt;h2>Simulating Riffle Shuffle&lt;/h2>
&lt;pre class="r">&lt;code>riffle(1:52)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 26 52 25 51 50 49 48 24 23 22 21 20 47 19 18 46 17 16 45 15 44 43 14 42 41
## [26] 40 39 13 12 11 38 10 9 8 7 37 6 36 35 34 33 5 4 32 3 31 30 2 29 28
## [51] 27 1&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="simulating-random-variables" class="section level2">
&lt;h2>Simulating Random Variables&lt;/h2>
&lt;ul>
&lt;li>&lt;p>We can simulate a Uniform(0,1) variable by the command &lt;strong>runif()&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>This can be used to generate random variables from other discrete and continuous distributions as well.&lt;/p>&lt;/li>
&lt;li>&lt;p>Suppose we want to generate a Bernoulli random variable with probability of success 0.7&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>bernoulli=function(prob)
{
u=runif(1); x=NULL;
if(u&amp;lt;prob) x=1
else x=0
return(x)
}
bernoulli(0.7)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="using-it-farther" class="section level2">
&lt;h2>Using it further&lt;/h2>
&lt;ul>
&lt;li>Suppose we want to simulate a Geometric(0.8) random variable.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>x=1
y=bernoulli(0.8)
while(y!=1)
{
y=bernoulli(0.8)
x=x+1
}
x&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 1&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="much-complicated-ones" class="section level2">
&lt;h2>Much Complicated Ones&lt;/h2>
&lt;ul>
&lt;li>&lt;p>How can we generate a Poisson or a Hypergeometric random variable using the above technique ?&lt;/p>&lt;/li>
&lt;li>&lt;p>For this we need to take the help of the following fact :&lt;/p>&lt;/li>
&lt;/ul>
&lt;div id="fact-1" class="section level3">
&lt;h3>Fact&lt;/h3>
&lt;p>Suppose we want to generate &lt;span class="math inline">\(X\)&lt;/span> having p.m.f.
&lt;span class="math inline">\(P(X=x_i)=p_i\;\;\forall i=0,1,2,...\;\;\sum{p_i}=1\)&lt;/span>. We generate &lt;span class="math inline">\(U\sim Uni(0,1)\)&lt;/span> and set
&lt;span class="math display">\[X = \left\{ \begin{array}{rcl}
x_0 &amp;amp; if &amp;amp; U&amp;lt;p_0\\ x_1 &amp;amp; if &amp;amp; p_0\leqslant U&amp;lt;{p_0+p_1} \\.&amp;amp;.\\.&amp;amp;.\\x_i &amp;amp; if &amp;amp; \sum^{i-1}_{j=0}p_j\leqslant U&amp;lt;\sum^{i}_{j=0}p_j\\. &amp;amp;.\\. &amp;amp;.\\. &amp;amp;.\end{array}\right.\]&lt;/span>&lt;/p>
&lt;/div>
&lt;div id="algorithm" class="section level3">
&lt;h3>Algorithm&lt;/h3>
&lt;ul>
&lt;li>&lt;p>The preceding fact can be written as :&lt;/p>
&lt;ul>
&lt;li>Generate a random &lt;span class="math inline">\(U\sim U(0,1)\)&lt;/span>&lt;/li>
&lt;li>If &lt;span class="math inline">\(U&amp;lt;p_0\)&lt;/span> stop and set &lt;span class="math inline">\(X=x_0\)&lt;/span>&lt;/li>
&lt;li>If &lt;span class="math inline">\(U&amp;lt;p_0+p_1\)&lt;/span> stop and set &lt;span class="math inline">\(X=x_1\)&lt;/span>&lt;/li>
&lt;li>If &lt;span class="math inline">\(U&amp;lt;p_0+p_1+p_2\)&lt;/span> stop and set &lt;span class="math inline">\(X=x_2\)&lt;/span>&lt;/li>
&lt;li>and so on…&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;div id="generating-poisson-distribution" class="section level2">
&lt;h2>Generating Poisson Distribution&lt;/h2>
&lt;pre class="r">&lt;code>poi_mass=function(x,lambda)
{
return(exp(-lambda)*(lambda^x)/factorial(x))
}
poi_sample=function(lambda)
{
U=runif(1); i=0; cumprob=poi_mass(0,lambda)
while(U&amp;gt;cumprob)
{
i=i+1
cumprob=cumprob+poi_mass(i,lambda)
}
return(i)
}
poi_sample(5)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 2&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="continuous-distributions" class="section level2">
&lt;h2>Continuous Distributions&lt;/h2>
&lt;ul>
&lt;li>For continuous distributions we use the following fact :&lt;/li>
&lt;/ul>
&lt;div id="fact-2" class="section level3">
&lt;h3>Fact&lt;/h3>
&lt;p>(&lt;em>Probability Integral Transformation&lt;/em>) If &lt;span class="math inline">\(X\)&lt;/span> has an absolutely continuous distribution with C.D.F. &lt;span class="math inline">\(F\)&lt;/span>, then &lt;span class="math inline">\(F(X)\)&lt;/span> has the &lt;span class="math inline">\(U(0,1)\)&lt;/span> distribution.&lt;/p>
&lt;ul>
&lt;li>&lt;p>Suppose we want to generate &lt;span class="math inline">\(X\)&lt;/span> from &lt;span class="math inline">\(Exp(\lambda)\)&lt;/span> distribution.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;span class="math display">\[F(x)=1-e^{\lambda x}\;=&amp;gt; U\sim U(0,1)\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;span class="math display">\[X=-\frac{1}{\lambda}ln(1-U)\]&lt;/span> is the required random variable.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
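&lt;p>As a sketch of this inverse-transform idea (an illustration using only runif(); the theoretical mean &lt;span class="math inline">\(1/\lambda\)&lt;/span> serves as a check):&lt;/p>
&lt;pre class="r">&lt;code>lambda=2
U=runif(100000)
X=-log(1-U)/lambda   #-- invert F(x)=1-exp(-lambda*x)
mean(X)              #-- should be close to 1/lambda = 0.5&lt;/code>&lt;/pre>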
&lt;/div>
&lt;div id="working-with-inbuilt-r-functions" class="section level2">
&lt;h2>Working with inbuilt R functions&lt;/h2>
&lt;ul>
&lt;li>Suppose we want to generate random variables from &lt;span class="math inline">\(N(\mu,{\sigma}^2)\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code># 2 samples from N(5,2) Distribution
rnorm(n=2,mean=5,sd=sqrt(2))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 7.280889 6.274643&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Now let us find &lt;span class="math inline">\(P(X\leq x)\)&lt;/span> i.e., &lt;span class="math inline">\(\Phi(\frac{x-\mu}{\sigma})\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code># P(X&amp;lt;=4) for N(5,2)
pnorm(4,mean=5,sd=sqrt(2))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.2397501&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code># P(X&amp;gt;7) for N(5,2)
pnorm(7,mean=5,sd=sqrt(2),lower.tail = F)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.0786496&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>We can also compute the normal quantiles &lt;span class="math inline">\(z_\alpha\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code># lower 0.05 point
qnorm(0.05,mean=5,sd=sqrt(2))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 2.673826&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code># lower 0.01 point
qnorm(0.01,mean=5,sd=sqrt(2),lower.tail = F)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 8.289953&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Can also compute normal density &lt;span class="math inline">\(\phi(\frac{x-\mu}{\sigma})\)&lt;/span>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code># density at x=2
dnorm(2,mean = 5,sd=sqrt(2))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.02973257&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code># density at x=5
dnorm(5,mean=5,sd=sqrt(2))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.2820948&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="plotting-the-normal-density" class="section level2">
&lt;h2>Plotting the normal density&lt;/h2>
&lt;pre class="r">&lt;code>x=seq(-3,3,by=0.01); y=dnorm(x,0,1)
plot(x,y,type=&amp;quot;l&amp;quot;,main=&amp;quot;Density of N(0,1)&amp;quot;,ylab=expression(phi(x)))&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-28-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="other-standard-distributions-in-r" class="section level2">
&lt;h2>Other Standard Distributions in R&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th>Distribution&lt;/th>
&lt;th align="left">Sample&lt;/th>
&lt;th align="left">P(X&amp;lt;=x)&lt;/th>
&lt;th align="center">z_alpha&lt;/th>
&lt;th align="center">Density&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td>Binomial&lt;/td>
&lt;td align="left">rbinom(n,size,prob)&lt;/td>
&lt;td align="left">pbinom&lt;/td>
&lt;td align="center">qbinom&lt;/td>
&lt;td align="center">dbinom&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Poisson&lt;/td>
&lt;td align="left">rpois(n,lambda)&lt;/td>
&lt;td align="left">ppois&lt;/td>
&lt;td align="center">qpois&lt;/td>
&lt;td align="center">dpois&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Neg.Binomial&lt;/td>
&lt;td align="left">rnbinom(n,size,prob,mu)&lt;/td>
&lt;td align="left">pnbinom&lt;/td>
&lt;td align="center">qnbinom&lt;/td>
&lt;td align="center">dnbinom&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Geometric&lt;/td>
&lt;td align="left">rgeom(n,prob)&lt;/td>
&lt;td align="left">pgeom&lt;/td>
&lt;td align="center">qgeom&lt;/td>
&lt;td align="center">dgeom&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Hypergeometric&lt;/td>
&lt;td align="left">rhyper(nn,m,n,k)&lt;/td>
&lt;td align="left">phyper&lt;/td>
&lt;td align="center">qhyper&lt;/td>
&lt;td align="center">dhyper&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Uniform&lt;/td>
&lt;td align="left">runif(n,min=0,max=1)&lt;/td>
&lt;td align="left">punif&lt;/td>
&lt;td align="center">qunif&lt;/td>
&lt;td align="center">dunif&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Exponential&lt;/td>
&lt;td align="left">rexp(n,rate=1)&lt;/td>
&lt;td align="left">pexp&lt;/td>
&lt;td align="center">qexp&lt;/td>
&lt;td align="center">dexp&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Cauchy&lt;/td>
&lt;td align="left">rcauchy(n,location=0,scale=1)&lt;/td>
&lt;td align="left">pcauchy&lt;/td>
&lt;td align="center">qcauchy&lt;/td>
&lt;td align="center">dcauchy&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>t&lt;/td>
&lt;td align="left">rt(n,df,ncp)&lt;/td>
&lt;td align="left">pt&lt;/td>
&lt;td align="center">qt&lt;/td>
&lt;td align="center">dt&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>F&lt;/td>
&lt;td align="left">rf(n,df1,df2,ncp)&lt;/td>
&lt;td align="left">pf&lt;/td>
&lt;td align="center">qf&lt;/td>
&lt;td align="center">df&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Chi-Square&lt;/td>
&lt;td align="left">rchisq(n,df,ncp=0)&lt;/td>
&lt;td align="left">pchisq&lt;/td>
&lt;td align="center">qchisq&lt;/td>
&lt;td align="center">dchisq&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Gamma&lt;/td>
&lt;td align="left">rgamma(n,shape,rate,schale)&lt;/td>
&lt;td align="left">pgamma&lt;/td>
&lt;td align="center">qgamma&lt;/td>
&lt;td align="center">dgamma&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Beta&lt;/td>
&lt;td align="left">rbeta(n,shape1,shape2,ncp)&lt;/td>
&lt;td align="left">pbeta&lt;/td>
&lt;td align="center">qbeta&lt;/td>
&lt;td align="center">dbeta&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td>Multinomial&lt;/td>
&lt;td align="left">rmultinom(n,size,prob)&lt;/td>
&lt;td align="left">-&lt;/td>
&lt;td align="center">-&lt;/td>
&lt;td align="center">dmultinom&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td>Mult.Normal&lt;/td>
&lt;td align="left">rmnnorm(n,mean,sigma)&lt;/td>
&lt;td align="left">-&lt;/td>
&lt;td align="center">-&lt;/td>
&lt;td align="center">dmvnorm&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;div id="central-limit-theorem" class="section level2">
&lt;h2>Central Limit Theorem&lt;/h2>
&lt;div id="theorem" class="section level3">
&lt;h3>Theorem&lt;/h3>
&lt;ul>
&lt;li>&lt;p>(&lt;em>iid case&lt;/em>) Let &lt;span class="math inline">\(X_1,X_2,...,X_n\)&lt;/span> be &lt;em>iid&lt;/em> random variables with mean &lt;span class="math inline">\(\mu\)&lt;/span> and variance &lt;span class="math inline">\({\sigma}^2&amp;lt;\infty\)&lt;/span> and &lt;span class="math display">\[S_n=X_1+X_2+...+X_n\]&lt;/span>. Then, &lt;span class="math inline">\(\frac{S_n-E(S_n)}{\sqrt{Var(S_n)}}\longrightarrow N(0,1)\)&lt;/span> as &lt;span class="math inline">\(n\longrightarrow \infty\)&lt;/span>.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;span class="math inline">\(U_1,U_2,...,U_n\)&lt;/span> are &lt;em>iid&lt;/em> &lt;span class="math inline">\(U(0,1)\)&lt;/span> variables. Then
&lt;span class="math display">\[Z_n=\frac{U_1+U_2+...+U_n-\frac{n}{2}}{\sqrt{\frac{n}{12}}}\longrightarrow N(0,1)\;\;;\;as\;\;n\longrightarrow \infty\]&lt;/span>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>n=100
k=10000
U=runif(n*k)
M=matrix(U,n,k)
X=apply(M,2,sum)
Z=(X-n/2)/sqrt(n/12)
par(mfrow=c(1,2))
hist(Z)
qqnorm(Z)
qqline(Z,col=&amp;quot;red&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-29-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="law-of-laege-numbers" class="section level2">
&lt;h2>Law of Large Numbers&lt;/h2>
&lt;ul>
&lt;li>&lt;p>&lt;strong>The Weak Law of Large Numbers&lt;/strong> says that for any &lt;span class="math inline">\(\epsilon&amp;gt;0\)&lt;/span> the sequence of probabilities &lt;span class="math display">\[P({|\frac{S_n}{n}-\mu|&amp;lt;\epsilon})\longrightarrow 1\;\;\;\;\;as \;\;n\longrightarrow \infty\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Consider i.i.d. coin flips, that is, Bernoulli trials with &lt;span class="math inline">\(p=\mu=\frac{1}2\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We find the &lt;span class="math inline">\(P({|\frac{S_n}{n}-\mu|&amp;lt;\epsilon})\)&lt;/span> in R and illustrate the limiting behavior, with &lt;span class="math inline">\(\epsilon=0.01\)&lt;/span>&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="plotting-the-probability" class="section level2">
&lt;h2>Plotting the Probability&lt;/h2>
&lt;pre class="r">&lt;code>wlln=function(n,eps,p)
{
pbinom(n*p+n*eps,n,p)-pbinom(n*p-n*eps,n,p)
}
prob=NULL
for(n in 1:10000)
{
prob[n]=wlln(n,eps=0.01,p=0.5)
}
plot(prob,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;n&amp;quot;,ylab=expression(P(X&amp;lt;=x)))&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-30-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="strong-law-of-large-numbers" class="section level2">
&lt;h2>Strong Law of large numbers&lt;/h2>
&lt;ul>
&lt;li>&lt;p>The strong law of large numbers says that &lt;span class="math inline">\(\frac{S_n}n \longrightarrow \mu\;\;\;w.p.\;1\;;\;as\;\;n\longrightarrow \infty\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Consider i.i.d. coin flips, that is, Bernoulli trials with &lt;span class="math inline">\(p=\mu=\frac{1}2\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>The sum &lt;span class="math inline">\(S_n\)&lt;/span> is a Binomial random variable.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="illustrating-strong-law" class="section level2">
&lt;h2>Illustrating Strong Law&lt;/h2>
&lt;pre class="r">&lt;code>slln=function(n,p)
{
x=rbinom(1,size=n,prob=p)
return(x)
}
value=NULL
for(i in 1:10000)
{
value[i]=slln(i,0.5)/i
}
plot(value,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;n&amp;quot;,ylab=&amp;quot;Sample mean&amp;quot;)
abline(h=0.5)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-31-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="family-planning" class="section level2">
&lt;h2>Family Planning&lt;/h2>
&lt;ul>
&lt;li>Suppose a couple plans to have children until they have one child of each sex. Assuming male and female children are equally probable, how many children can they expect to have ?&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>count=NULL
for(i in 1:1000)
{
child=sample(c(0,1),1)
while(length(unique(child))&amp;lt;2)
{
child=c(child,sample(c(0,1),1))
}
count[i]=length(child)
}
mean(count)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 3.08&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="using-simulation-to-construct-tests" class="section level2">
&lt;h2>Using Simulation to construct Tests&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Simulation can be used to construct tests in situations where the &lt;strong>exact sampling distribution&lt;/strong> of the test statistic is hard to find even under the null hypothesis.&lt;/p>&lt;/li>
&lt;li>&lt;p>More specifically, we use simulation to find the &lt;span class="math inline">\(100\alpha\)&lt;/span>% cut-off points&lt;/p>&lt;/li>
&lt;li>&lt;p>Suppose we have a sample of size 101 from a Beta(5, b) distribution&lt;/p>&lt;/li>
&lt;li>&lt;p>We want to test &lt;span class="math inline">\(H_0:b=5\)&lt;/span> against &lt;span class="math inline">\(H_1:b&amp;lt;5\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>To get an idea about the nature of the test we plot the density function for different values of b.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="plot-of-beta-densities" class="section level2">
&lt;h2>Plot of Beta Densities&lt;/h2>
&lt;pre class="r">&lt;code>par(mfrow=c(1,3))
x=seq(0,1,by=0.01)
y1=dbeta(x,shape1 = 5,shape2 = 2)
y2=dbeta(x,shape1 = 5,shape2 = 5)
y3=dbeta(x,shape1 = 5,shape2 = 10)
plot(x,y1,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;b&amp;lt;5&amp;quot;)
abline(v=median(rbeta(101,shape1 = 5,shape2 = 2)),col=&amp;quot;red&amp;quot;)
plot(x,y2,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;b=5&amp;quot;)
abline(v=median(rbeta(101,shape1=5,shape2=5)),col=&amp;quot;red&amp;quot;)
plot(x,y3,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;b&amp;gt;5&amp;quot;)
abline(v=median(rbeta(101,shape1=5,shape2=10)),col=&amp;quot;red&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-33-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="what-type-of-test-shall-we-perform" class="section level2">
&lt;h2>What type of test shall we perform ?&lt;/h2>
&lt;ul>
&lt;li>&lt;p>From the figures we see that sample median can be used as a
test statistic.&lt;/p>&lt;/li>
&lt;li>&lt;p>Also, a right-tailed test based on the median will be appropriate&lt;/p>&lt;/li>
&lt;li>&lt;p>Thus we shall reject &lt;span class="math inline">\(H_0\)&lt;/span> if the sample median exceeds some
value &lt;span class="math inline">\(c\)&lt;/span>.&lt;/p>&lt;/li>
&lt;li>&lt;p>We want to find the test at the 90% level of significance.&lt;/p>&lt;/li>
&lt;li>&lt;p>We shall use simulation technique to find the cut-off point &lt;span class="math inline">\(c\)&lt;/span>.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(100);prob=NULL; j=1
C=seq(0.2,0.9,by=0.001)
for ( c in C)
{
prob[j]=0
for(i in 1:100)
{
x=rbeta(101,shape1=5,shape2=5)
me=median(x)
if(me&amp;gt;c) prob[j]=prob[j]+1
}
prob[j]=prob[j]/100
j=j+1
}
plot(C,prob,type=&amp;quot;l&amp;quot;)
abline(h=0.9,col=&amp;quot;red&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-34-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="now-lets-find-the-c" class="section level2">
&lt;h2>Now lets find the c&lt;/h2>
&lt;p>We continue to search for the c for which &lt;span class="math inline">\(P_{H_0}(me&amp;gt;c)\)&lt;/span> is closest to 0.9&lt;/p>
&lt;pre class="r">&lt;code>C[which(prob&amp;gt;0.89 &amp;amp; prob&amp;lt;0.91)]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.473 0.475&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>prob[which(C %in% C[which(prob&amp;gt;0.89 &amp;amp; prob&amp;lt;0.91)])]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 0.9 0.9&lt;/code>&lt;/pre>
&lt;p>So, we can take c to be 0.473.
Thus our test rule is: reject &lt;span class="math inline">\(H_0\)&lt;/span> if the sample median exceeds 0.473.&lt;/p>
&lt;/div>
&lt;div id="generating-normal-variables" class="section level2">
&lt;h2>Generating Normal Variables&lt;/h2>
&lt;ul>
&lt;li>Instead of using R’s built-in function, &lt;em>rnorm()&lt;/em>, we can generate Normal variables from scratch.&lt;/li>
&lt;/ul>
&lt;div id="fact-box-muller-transformation" class="section level3">
&lt;h3>Fact: Box-Muller transformation&lt;/h3>
&lt;p>Let &lt;span class="math inline">\(U_1\)&lt;/span>, &lt;span class="math inline">\(U_2\)&lt;/span> &lt;span class="math inline">\(\sim U(0,1)\)&lt;/span> independently, and define&lt;/p>
&lt;p>&lt;span class="math display">\[Z_1=\sqrt{-2 ln U_1} cos(2\pi U_2)\]&lt;/span>
&lt;span class="math display">\[Z_2=\sqrt{-2 ln U_1} sin(2\pi U_2)\]&lt;/span>&lt;/p>
&lt;p>Then, &lt;span class="math inline">\(Z_1,Z_2 \sim N(0,1)\)&lt;/span> independently.&lt;/p>
&lt;p>So, to generate &lt;span class="math inline">\(Y \sim N(\mu , {\sigma}^2)\)&lt;/span>; we use, &lt;span class="math inline">\(Y=\mu +\sigma Z\)&lt;/span> where, &lt;span class="math inline">\(Z \sim N(0,1)\)&lt;/span>.&lt;/p>
&lt;pre class="r">&lt;code>normal=function(n)
{
U1 = runif(n)
U2 = runif(n)
C = sqrt(-2*log(U1))
Z1 = C*cos(2*pi*U2)
Z2 = C*sin(2*pi*U2)
return(Z1)
}
n = 100000
Z = normal(n)
hist(Z,prob=T,col=rainbow(12))
#-- note that rainbow() is a graphics function which returns a vector of colors.
curve(dnorm(x),-3,3,
add=T,lwd=4)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-37-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="generating-bivariate-normal-variables" class="section level2">
&lt;h2>Generating Bivariate Normal Variables&lt;/h2>
&lt;div id="fact-3" class="section level3">
&lt;h3>Fact:&lt;/h3>
&lt;p>If &lt;span class="math inline">\((X,Y) \sim N_2(0,0,1,1,\rho)\)&lt;/span>, then &lt;span class="math inline">\(Z_1=X\)&lt;/span> and &lt;span class="math inline">\(Z_2=\frac{(Y-\rho X)}{\sqrt{1-{\rho}^2}}\)&lt;/span> are iid &lt;span class="math inline">\(N(0,1)\)&lt;/span>, where, &lt;span class="math inline">\(-1&amp;lt;\rho&amp;lt;1\)&lt;/span>&lt;/p>
&lt;p>Equivalently, if we generate &lt;span class="math inline">\(Z_1\)&lt;/span>,&lt;span class="math inline">\(Z_2\)&lt;/span> iid &lt;span class="math inline">\(N(0,1)\)&lt;/span>, then setting &lt;span class="math inline">\(X=Z_1\)&lt;/span> and &lt;span class="math inline">\(Y=\rho Z_1+\sqrt{1-{\rho}^2}\;Z_2\)&lt;/span> gives a pair of random variables that have the &lt;span class="math inline">\(N_2(0,0,1,1,\rho)\)&lt;/span> distribution.&lt;/p>
&lt;pre class="r">&lt;code>binorm=function(n,rho)
{
x = numeric(n); y = numeric(n)
for (i in 1:n)
{
z1 = normal(1)
z2 = normal(1)
x[i] = z1
y[i] = rho*z1+sqrt(1-rho^2)*z2
}
return(cbind(x,y))
}
n = 1000 ;rho = -0.5
data = binorm(n,rho)
##-- Plotting Bivariate Normal Data --##
plot(data,
pch=19,
xlab=&amp;quot;X&amp;quot;,
ylab = &amp;quot;Y&amp;quot;)
abline(lm(data[,2]~data[,1]),col=&amp;quot;red&amp;quot;,
v=mean(data[,1]),
h=mean(data[,2]),
lwd=3)
legend(&amp;quot;topright&amp;quot;,legend = c(paste(&amp;quot;mean(X)= &amp;quot;,round(mean(data[,1]),3)),
paste(&amp;quot;Var(X)= &amp;quot;,round((sd(data[,1]))^2,3)),
paste(&amp;quot;mean(Y)= &amp;quot;,round(mean(data[,2]),3)),
paste(&amp;quot;Var(Y)= &amp;quot;,round((sd(data[,2]))^2,3)),
paste(&amp;quot;samp corr.= &amp;quot;,round(cor.test(data[,1],data[,2])$estimate,2))),
cex=0.66)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-38-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="monte-carlo-simulation" class="section level2">
&lt;h2>Monte Carlo Simulation&lt;/h2>
&lt;ul>
&lt;li>&lt;p>We know that the sample average &lt;span class="math inline">\(\bar{x}\)&lt;/span> converges to the population mean, by the consistency property (the strong law of large numbers).&lt;/p>&lt;/li>
&lt;li>&lt;p>Thus expected value of any function can be approximated by the sample average.&lt;/p>&lt;/li>
&lt;li>&lt;p>Thus &lt;span class="math inline">\(\frac{1}{N} \sum_{i=1}^N{f(X_i)} \longrightarrow E(f(X))\)&lt;/span> with probability 1 as &lt;span class="math inline">\(N \longrightarrow \infty\)&lt;/span> if &lt;span class="math inline">\(X_1,X_2,...\)&lt;/span> are iid sequence of random variables with the same distribution as &lt;span class="math inline">\(X\)&lt;/span>.&lt;/p>&lt;/li>
&lt;li>&lt;p>A Monte Carlo method for estimating &lt;span class="math inline">\(E(f(X))\)&lt;/span> is a numerical method based on the approximation &lt;span class="math display">\[Z_N^{MC}=\frac{1}{N} \sum_{i=1}^N f(X_i) \approx E[f(X)]\]&lt;/span> where &lt;span class="math inline">\(X_1,X_2,...\)&lt;/span> is an iid sequence of random variables with the same distribution as &lt;span class="math inline">\(X\)&lt;/span>.&lt;/p>&lt;/li>
&lt;/ul>
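&lt;p>As a quick illustration (a sketch using the built-in &lt;em>rnorm()&lt;/em>), let us estimate &lt;span class="math inline">\(E(X^2)\)&lt;/span> for &lt;span class="math inline">\(X \sim N(0,1)\)&lt;/span>, whose true value is 1:&lt;/p>
&lt;pre class="r">&lt;code>set.seed(1)
N = 100000
x = rnorm(N)
mean(x^2) # Monte Carlo estimate of E(X^2); true value is 1&lt;/code>&lt;/pre>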
&lt;/div>
&lt;div id="bias-variance" class="section level2">
&lt;h2>Bias &amp;amp; Variance&lt;/h2>
&lt;ul>
&lt;li>The Monte Carlo estimate &lt;span class="math inline">\(Z_N^{MC}\)&lt;/span> for &lt;span class="math inline">\(E(f(X))\)&lt;/span>, has &lt;span class="math display">\[bias(Z_N^{MC})=0\]&lt;/span> and &lt;span class="math display">\[MSE(Z_N^{MC})=Var(Z_N^{MC})=\frac{1}{N} Var(f(X))\]&lt;/span>&lt;/li>
&lt;/ul>
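&lt;p>We can verify the variance formula empirically (a sketch): for &lt;span class="math inline">\(f(X)=X^2\)&lt;/span> with &lt;span class="math inline">\(X \sim N(0,1)\)&lt;/span> we have &lt;span class="math inline">\(Var(X^2)=2\)&lt;/span>, so the variance of &lt;span class="math inline">\(Z_N^{MC}\)&lt;/span> should be about &lt;span class="math inline">\(2/N\)&lt;/span>:&lt;/p>
&lt;pre class="r">&lt;code>set.seed(2)
N = 1000; R = 2000
Z = replicate(R, mean(rnorm(N)^2)) # R independent Monte Carlo estimates
var(Z) # sample variance of the estimator
2/N # theoretical Var(f(X))/N&lt;/code>&lt;/pre>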
&lt;/div>
&lt;div id="monte-carlo-integration" class="section level2">
&lt;h2>Monte Carlo Integration&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Consider the integral &lt;span class="math inline">\({\int}_a^bf(x)dx\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Objective is to approximate this integral&lt;/p>&lt;/li>
&lt;li>&lt;p>Let &lt;span class="math inline">\(X_1,X_2,...\)&lt;/span> be iid &lt;span class="math inline">\(U(a,b)\)&lt;/span>, i.e., density of &lt;span class="math inline">\(X_j\)&lt;/span> is &lt;span class="math inline">\(\phi(x)=\frac{1}{b-a}I_{[a,b]}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Then &lt;span class="math display">\[{\int}_a^bf(x)dx=(b-a){\int}_a^bf(x) \phi(x)dx=(b-a)E(f(X))\approx \frac{b-a}{N} \sum_{j=1}^N{f(X_j)}\]&lt;/span> for large &lt;span class="math inline">\(N\)&lt;/span>&lt;/p>&lt;/li>
&lt;/ul>
&lt;div id="example-1" class="section level3">
&lt;h3>Example&lt;/h3>
&lt;ul>
&lt;li>&lt;p>Evaluate the integral &lt;span class="math inline">\({\int}_0^{2\pi} e^{k\:cos(x)}dx\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We generate samples &lt;span class="math inline">\(X_j\)&lt;/span> from &lt;span class="math inline">\(U(0,2\pi)\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Then use the approximation &lt;span class="math display">\[{\int}_0^{2\pi} e^{k\:cos(x)}dx \approx \frac{2\pi}{N} \sum_{j=1}^{N}{e^{k\:cos(X_j)}}\]&lt;/span> (the code below takes &lt;span class="math inline">\(k=1\)&lt;/span>)&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>set.seed(123)
N = 1000
x = runif(N,min=0,max=(2*pi))
value = sum(exp(cos(x)))
value = (2*pi)*value/N
value&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 7.901431&lt;/code>&lt;/pre>
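&lt;p>As a check (a side note, not from the original material): this integral has the closed form &lt;span class="math inline">\(2\pi I_0(k)\)&lt;/span>, where &lt;span class="math inline">\(I_0\)&lt;/span> is the modified Bessel function of the first kind, available in R as &lt;em>besselI()&lt;/em>:&lt;/p>
&lt;pre class="r">&lt;code>2*pi*besselI(1,nu=0) # exact value for k=1, approximately 7.95&lt;/code>&lt;/pre>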
&lt;/div>
&lt;div id="another-example" class="section level3">
&lt;h3>Another Example&lt;/h3>
&lt;ul>
&lt;li>&lt;p>&lt;strong>Problem:&lt;/strong> Estimate the c.d.f of &lt;span class="math inline">\(N(0,1)\)&lt;/span> for several values of the argument and then assess its accuracy.&lt;/p>&lt;/li>
&lt;li>&lt;p>The normal c.d.f can be expressed as &lt;span class="math display">\[\Phi(t)= {\int}_{-\infty}^t \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx\]&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>We shall use the Monte Carlo method to estimate &lt;span class="math inline">\(\Phi(t)\)&lt;/span> as &lt;span class="math display">\[\hat{\Phi}{(t)} = \frac{1}{n} \sum_{i=1}^n{I(X_i \le t)}\]&lt;/span> where &lt;span class="math inline">\(I(X_i \le t)=\)&lt;/span> 1 or 0 with probability &lt;span class="math inline">\(\Phi(t)\)&lt;/span> or &lt;span class="math inline">\(1-\Phi(t)\)&lt;/span> respectively, and the &lt;span class="math inline">\(X_i\)&lt;/span>’s are random samples from &lt;span class="math inline">\(N(0,1)\)&lt;/span>.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>n = 1000
t = seq(-3,3,0.01)
x = NULL
phi.hat=NULL
phi=NULL
for(i in 1:length(t))
{
x = rnorm(n)
s = sum(x&amp;lt;=t[i])
phi.hat[i] = s/n
phi[i] = pnorm(t[i])
}
par(mfrow = c(1,2))
plot(t,phi,main=&amp;quot;Original c.d.f&amp;quot;,col=&amp;quot;red&amp;quot;,pch=19)
plot(t,phi.hat,main=&amp;quot;Estimated c.d.f&amp;quot;,col=&amp;quot;blue&amp;quot;,pch=19)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-40-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;/div>
&lt;div id="an-assignment-problem" class="section level2">
&lt;h2>An Assignment Problem&lt;/h2>
&lt;p>&lt;strong>Here, we have to do :&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;p>Draw a random sample of size 50 from &lt;span class="math inline">\(N(1,2)\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Draw another random sample of size 1000 from the same distribution &lt;span class="math inline">\(N(1,2)\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Calculate the test statistic: &lt;span class="math inline">\(T_n=\frac{\sqrt{n}(\bar{X_n}-1)}{s_n}\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Repeat this 1000 times.&lt;/p>&lt;/li>
&lt;li>&lt;p>Draw histograms of the &lt;span class="math inline">\(T_n\)&lt;/span>’s coming from the two different samples of the same population &lt;span class="math inline">\(N(1,2)\)&lt;/span>&lt;/p>&lt;/li>
&lt;li>&lt;p>Compare these two histograms with the Standard Normal distribution.&lt;/p>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Solution:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Here, to plot the histograms, instead of using the famous &lt;em>ggplot2&lt;/em> package, I am using base R plotting functions.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>##--- Creating a Function to simulate 1000 test statistics for two different samples
simulation=function(len_1,len_2){
A=NULL
B=NULL
for(i in 1:1000)
{
Sam_1=rnorm(len_1,1,sqrt(2))
Sam_2=rnorm(len_2,1,sqrt(2))
Tn_1=(sqrt(length(Sam_1))*((sum(Sam_1)/length(Sam_1))-1))/sqrt(var(Sam_1))
Tn_2=(sqrt(length(Sam_2))*((sum(Sam_2)/length(Sam_2))-1))/sqrt(var(Sam_2))
A=c(A,Tn_1)
B=c(B,Tn_2)
}
Mat=as.data.frame(matrix(c(A,B),ncol=2,byrow = F))
names(Mat)=c(&amp;quot;Tn_1&amp;quot;,&amp;quot;Tn_2&amp;quot;)
return(Mat)
}
X=(simulation(50,1000)) # Data Table
X[1:10,] # Showing 1st 10 samples of the Data Table&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Tn_1 Tn_2
## 1 0.8020946 0.20259551
## 2 -0.8509208 -1.40945355
## 3 -0.2805664 -1.05684656
## 4 -0.6615001 -0.44404425
## 5 2.3164142 -0.09538437
## 6 -2.3255364 -0.42208845
## 7 -0.2363790 0.60943958
## 8 -2.8176113 -0.40001562
## 9 -0.2177217 -2.25709292
## 10 -0.0575885 2.26107062&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code># Writing a message
writeLines(paste(c(&amp;quot;Omitting&amp;quot;,&amp;quot;the&amp;quot;,&amp;quot;rest&amp;quot;,&amp;quot;990&amp;quot;,&amp;quot;values&amp;quot;)),sep=&amp;quot; &amp;quot;)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Omitting the rest 990 values&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>##--- Histogram of the 1st Sample
hist(X$Tn_1,
col=&amp;quot;red&amp;quot;,
xlab=&amp;quot;Tn&amp;quot;,
ylab=&amp;quot;Frequency&amp;quot;,
main=&amp;quot;For the Sample 1&amp;quot;,
density = 50)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-41-1.png" width="672" />&lt;/p>
&lt;pre class="r">&lt;code>##--- Histogram of the 2nd Sample
hist(X$Tn_2,
col=12,
xlab=&amp;quot;Tn&amp;quot;,
ylab=&amp;quot;&amp;quot;,
main=&amp;quot;For the Sample 2&amp;quot;,
density = 40)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-41-2.png" width="672" />&lt;/p>
&lt;pre class="r">&lt;code>##--- Preparing the density of the N(0,1)
a=seq(-3,3,by= 0.01) # Range of the sample points for N(0,1)
b=dnorm(a) # density of the N(0,1) for the above range.
##--- Comparing Plots :
hist(X$Tn_1, # histogram for the Sample 1
col=&amp;quot;red&amp;quot;,
xlab=&amp;quot;Tn&amp;quot;,
ylab=&amp;quot;Frequency&amp;quot;,
main=&amp;quot;Comparing Two Histograms coming from two\n different Samples of the same \nPopulation N(1,2) with the Standard Normal density&amp;quot;,
density = 50,
axes=F,
cex=4)
par(new=T) # For Overlap the new plot
hist(X$Tn_2,
col=12,
xlab=&amp;quot;&amp;quot;,
ylab=&amp;quot;&amp;quot;, # histogram for the Sample 2
main=&amp;quot;&amp;quot;,
density = 40,
axes = F)
par(new=T) # For Overlap the new plot
plot(a,b,
type =&amp;quot;l&amp;quot;,
xlab=&amp;quot;&amp;quot;,
ylab=&amp;quot;&amp;quot;, # Density curve of the N(0,1)
main = &amp;quot;&amp;quot;,
axes = F,
col=&amp;quot;darkgreen&amp;quot;,
lwd=3)
# Adding Legend
legend(&amp;quot;topright&amp;quot;,
legend = c(&amp;quot;Sample 1&amp;quot;,&amp;quot;Sample 2&amp;quot;,&amp;quot;PDF-N(0,1)&amp;quot;),
fill=c(&amp;quot;red&amp;quot;,12,&amp;quot;darkgreen&amp;quot;),
cex=0.6)
# Adding box
box()&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-41-3.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="brownian-motion" class="section level2">
&lt;h2>Brownian Motion&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Now that the basics of Monte Carlo simulation and various random distributions have been introduced, let’s focus on using Monte Carlo methods to simulate paths of various &lt;a href="https://en.wikipedia.org/wiki/Stochastic_process">Stochastic Processes.&lt;/a>&lt;/p>&lt;/li>
&lt;li>&lt;p>Standard Brownian Motion on &lt;span class="math inline">\([0,T]\)&lt;/span> is a Stochastic Process &lt;span class="math inline">\((W(t),0\leq t \leq T)\)&lt;/span> which satisfies the following properties:&lt;/p>
&lt;ul>
&lt;li>&lt;span class="math inline">\(W(0) = 0\)&lt;/span>&lt;/li>
&lt;li>For any &lt;span class="math inline">\(k\)&lt;/span> and any &lt;span class="math inline">\(0 \leq t_1 \leq t_2 \leq \cdots \leq t_k \leq T\)&lt;/span>, the increments &lt;span class="math inline">\(W(t_i)-W(t_{i-1})\)&lt;/span> are independent.&lt;/li>
&lt;li>The difference &lt;span class="math inline">\(W(t)-W(s) \sim N(0,t-s)\)&lt;/span> for any &lt;span class="math inline">\(0\leq s&amp;lt;t \leq T\)&lt;/span>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;p>As a consequence of the 1st and 3rd properties, &lt;span class="math inline">\(W(t)=W(t)-W(0) \sim N(0,t)\)&lt;/span>.&lt;/p>
&lt;p>On the other hand, a non-standard Brownian Motion has two parameters, just like the Normal Distribution, known as the &lt;strong>drift&lt;/strong> and the &lt;strong>diffusion&lt;/strong> coefficient. Using &lt;span class="math inline">\(W(t)\)&lt;/span> we can therefore define a Brownian Motion with drift &lt;span class="math inline">\(\mu\)&lt;/span> and diffusion coefficient &lt;span class="math inline">\(\sigma^2\)&lt;/span> through the &lt;strong>Stochastic Differential&lt;/strong> Equation (SDE)&lt;/p>
&lt;p>&lt;span class="math display">\[dX(t)=\mu(t)dt+\sigma(t)dW(t)\]&lt;/span>&lt;/p>
&lt;p>&lt;strong>Sample Paths Generations&lt;/strong>&lt;/p>
&lt;p>Solving the SDE presented above, we can write the recursion in terms of &lt;span class="math inline">\(X(t_i),\mu(s),\sigma(s)\)&lt;/span>: &lt;span class="math display">\[X(t_{i+1})=X(t_i)+\int_{t_i}^{t_{i+1}}\mu(s)ds+\sqrt{\int_{t_i}^{t_{i+1}}{\sigma^2(u)du}}\; Z_{i+1}\]&lt;/span> where &lt;span class="math inline">\(Z_{i+1} \sim N(0,1)\)&lt;/span>. Hence let us look at the code to generate paths, where I have assumed &lt;span class="math inline">\(\mu\)&lt;/span> and &lt;span class="math inline">\(\sigma\)&lt;/span> to be constant.&lt;/p>
&lt;pre class="r">&lt;code>Brownian = function() # This is a function to generate Browninan with drift 0.04 and diffusion 0.7
{
paths = 10
count = 5000
interval = 5/count
sample = matrix(0,nrow=(count+1),ncol=paths)
for(i in 1:paths)
{
sample[1,i] = 5
for(j in 2:(count+1))
{
sample[j,i] = sample[j-1,i]+interval*0.04+((interval)^.5)*rnorm(1,0,1)*0.7
}
}
cat(&amp;quot;E[W(2)] = &amp;quot;,mean(sample[2001,]),&amp;quot;\n&amp;quot;)
cat(&amp;quot;E[W(5)] = &amp;quot;,mean(sample[5001,]),&amp;quot;\n&amp;quot;)
matplot(sample,main=&amp;quot;Brownian&amp;quot;,xlab=&amp;quot;Time&amp;quot;,ylab=&amp;quot;Path&amp;quot;,type=&amp;quot;l&amp;quot;)
}
StandardBrownian = function() # This is a function to generate Standard Brownian motion with drift 0 and diffusion 1
{
paths = 10
count = 5000
interval = 5/count
sample = matrix(0,nrow=(count+1),ncol=paths)
for(i in 1:paths)
{
sample[1,i] = 0
for(j in 2:(count+1))
{
sample[j,i] = sample[j-1,i]+((interval)^.5)*rnorm(1)
}
}
cat(&amp;quot;E[W(2)] = &amp;quot;,mean(sample[2001,]),&amp;quot;\n&amp;quot;)
cat(&amp;quot;E[W(5)] = &amp;quot;,mean(sample[5001,]),&amp;quot;\n&amp;quot;)
matplot(sample,main=&amp;quot;Standard Brownian&amp;quot;,xlab=&amp;quot;Time&amp;quot;,ylab=&amp;quot;Path&amp;quot;,type=&amp;quot;l&amp;quot;)
}
StandardBrownian()&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## E[W(2)] = -0.2205643
## E[W(5)] = -0.2023842&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-42-1.png" width="672" />&lt;/p>
&lt;pre class="r">&lt;code>Brownian()&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## E[W(2)] = 5.321014
## E[W(5)] = 5.338038&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_vi/index_files/figure-html/unnamed-chunk-42-2.png" width="672" />&lt;/p>
&lt;/div></description></item><item><title>Understanding what is Statistical Regularity with a Ludo &amp; Paper Game</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_v/</link><pubDate>Tue, 26 Oct 2021 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_v/</guid><description>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#ststistical-regularity" id="toc-ststistical-regularity">Ststistical Regularity&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-ludo-paper-game" id="toc-the-ludo-paper-game">The ‘Ludo &amp;amp; Paper Game’&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#theory" id="toc-theory">Theory&lt;/a>&lt;/li>
&lt;li>&lt;a href="#r-code" id="toc-r-code">R code&lt;/a>&lt;/li>
&lt;/ul>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;p>Hi,…&lt;/p>
&lt;p>In this tutorial we will learn what &lt;strong>Statistical Regularity&lt;/strong> is.&lt;/p>
&lt;p>Actually, I am writing this blog because when I first read about Statistical Regularity on the blog of one of my favorite teachers, &lt;a href="https://www.isical.ac.in/~arnabc/">Professor Arnab Chakraborty&lt;/a>, where he describes the concept with a beautiful example, I was very excited; but he did not give the actual solution of that example (i.e., the coding part). So here I am giving my solution, which is actually very easy, and I think that’s why he did not give the code 😅 😅 😅. But I was very happy to work it out by myself, and that’s why I would now like to share it with you. I guess you will enjoy it…&lt;/p>
&lt;p>So, before going to the example, let me give a brief introduction to what Statistical Regularity is.&lt;/p>
&lt;div id="ststistical-regularity" class="section level2">
&lt;h2>Statistical Regularity&lt;/h2>
&lt;p>Statistical regularity differs from mathematical patterns in the sense that it is rarely exactly replicated: instances are extremely similar but never identical. We see this all around us, in our fingerprints, for example, or the leaves on a tree.&lt;/p>
&lt;p>Statistical regularity is like a mysterious black box which takes random, unpredictable input and somehow digests the randomness to produce regular output. No doubt, if we can master this technique it should help us produce predictable output from unpredictable inputs! The quite predictable profits of casino owners and insurance companies are examples.&lt;/p>
&lt;p>Statistical regularity takes many forms, some more dramatic, some less. The simplest occurrence of the phenomenon was first proved mathematically by Jakob Bernoulli. The theorem and its proof hardly fill a page. But it took 25 years to figure out how to tackle randomness using mathematics to arrive at the proof!&lt;/p>
&lt;p>So, mathematically,&lt;/p>
&lt;p>Consider a random experiment. As is well known, the result of a single random experiment can never be correctly predicted before conducting it; but if the random experiment is carried out a large number of times under identical conditions, it will be seen that the &lt;strong>Relative Frequency (R.F)&lt;/strong> of an event stabilizes to a certain value.&lt;/p>
&lt;p>The Relative Frequency (R.F) of an outcome &lt;span class="math inline">\(O\)&lt;/span> of an experiment is the number of times &lt;span class="math inline">\(O\)&lt;/span> occurs, &lt;span class="math inline">\(f_n (O)\)&lt;/span>, divided by the total number of times, &lt;span class="math inline">\(n\)&lt;/span>, the experiment is carried out.&lt;/p>
&lt;p>So, the Relative Frequency (R.F) of an outcome &lt;span class="math inline">\(O\)&lt;/span> is:&lt;span class="math display">\[r_n (O)=\frac{f_n (O)}{n}\:\:;clearly\:0\leq r_n (O)\leq 1\]&lt;/span>. It is seen that when the experiment is repeated indefinitely, &lt;span class="math inline">\(r_n (O)\)&lt;/span> tends to a certain value, &lt;span class="math inline">\(p\)&lt;/span> (say); where &lt;span class="math inline">\(0 \leq p \leq 1\)&lt;/span>.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;p>A coin was tossed several times and the number of times it fell Heads was noted. The following table shows the number of Heads (H) obtained in sets of &lt;span class="math inline">\(n\)&lt;/span> experiments.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr class="header">
&lt;th align="left">Set&lt;/th>
&lt;th align="right">n=10&lt;/th>
&lt;th align="right">n=50&lt;/th>
&lt;th align="left">n=100&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr class="odd">
&lt;td align="left">1&lt;/td>
&lt;td align="right">4&lt;/td>
&lt;td align="right">29&lt;/td>
&lt;td align="left">47&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">2&lt;/td>
&lt;td align="right">4&lt;/td>
&lt;td align="right">22&lt;/td>
&lt;td align="left">52&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">3&lt;/td>
&lt;td align="right">6&lt;/td>
&lt;td align="right">24&lt;/td>
&lt;td align="left">54&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">4&lt;/td>
&lt;td align="right">7&lt;/td>
&lt;td align="right">27&lt;/td>
&lt;td align="left">49&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">5&lt;/td>
&lt;td align="right">5&lt;/td>
&lt;td align="right">31&lt;/td>
&lt;td align="left">53&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">6&lt;/td>
&lt;td align="right">5&lt;/td>
&lt;td align="right">26&lt;/td>
&lt;td align="left">51&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">7&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;td align="right">25&lt;/td>
&lt;td align="left">48&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">8&lt;/td>
&lt;td align="right">7&lt;/td>
&lt;td align="right">28&lt;/td>
&lt;td align="left">52&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">9&lt;/td>
&lt;td align="right">5&lt;/td>
&lt;td align="right">21&lt;/td>
&lt;td align="left">47&lt;/td>
&lt;/tr>
&lt;tr class="even">
&lt;td align="left">10&lt;/td>
&lt;td align="right">6&lt;/td>
&lt;td align="right">23&lt;/td>
&lt;td align="left">55&lt;/td>
&lt;/tr>
&lt;tr class="odd">
&lt;td align="left">Total&lt;/td>
&lt;td align="right">52&lt;/td>
&lt;td align="right">256&lt;/td>
&lt;td align="left">508&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>&lt;p>For n=10, the Relative Frequency (R.F), r(H), varies from 0.3 to 0.7.&lt;/p>&lt;/li>
&lt;li>&lt;p>For n=50, the extreme values of r(H) become closer, being 0.42 &amp;amp; 0.62.&lt;/p>&lt;/li>
&lt;li>&lt;p>For n=100, r(H) varies between 0.47 &amp;amp; 0.55.&lt;/p>&lt;/li>
&lt;/ul>
&lt;p>The average values of &lt;span class="math inline">\(r(H)\)&lt;/span> were 0.520, 0.512, 0.508 for &lt;span class="math inline">\(n\)&lt;/span>= 10, 50 ,100, respectively. Thus one may conclude that as &lt;span class="math inline">\(n\)&lt;/span> increases Relative Frequency of H will be expected to be very close to 0.50.&lt;/p>
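&lt;p>The stabilization is easy to see in a quick simulation (a sketch, not part of the original tossing experiment): track the running relative frequency of Heads over many tosses of a fair coin.&lt;/p>
&lt;pre class="r">&lt;code>set.seed(7)
n = 10000
tosses = sample(c(0,1),n,replace=TRUE) # 1 = Head, fair coin
r = cumsum(tosses)/(1:n) # running relative frequency r_n(H)
plot(r,type=&amp;quot;l&amp;quot;,xlab=&amp;quot;n&amp;quot;,ylab=&amp;quot;r(H)&amp;quot;)
abline(h=0.5,col=&amp;quot;red&amp;quot;) # r_n(H) settles near 0.5&lt;/code>&lt;/pre>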
&lt;p>OK, so we have understood what Statistical Regularity is. Now it’s time to jump into our main example.&lt;/p>
&lt;/div>
&lt;div id="the-ludo-paper-game" class="section level2">
&lt;h2>The ‘Ludo &amp;amp; Paper Game’&lt;/h2>
&lt;div id="theory" class="section level3">
&lt;h3>Theory&lt;/h3>
&lt;ul>
&lt;li>We take four pieces of paper and write the following formulas on them:&lt;/li>
&lt;/ul>
&lt;p>1 &lt;span class="math display">\[X_{(new)}=0.8*X_{(old)}+0.1\]&lt;/span>
&lt;span class="math display">\[Y_{(new)}=0.8*Y_{(old)}+0.04\]&lt;/span>
2 &lt;span class="math display">\[X_{(new)}=0.5*X_{(old)}+0.25\]&lt;/span>
&lt;span class="math display">\[Y_{(new)}=0.5*Y_{(old)}+0.04\]&lt;/span>
3 &lt;span class="math display">\[X_{(new)}=0.355*X_{(old)}-0.355*Y_{(old)}+0.266\]&lt;/span>
&lt;span class="math display">\[Y_{(new)}=0.355*X_{(old)}+0.355*Y_{(old)}+0.078\]&lt;/span>
4 &lt;span class="math display">\[X_{(new)}=0.355*X_{(old)}+0.355*Y_{(old)}+0.378\]&lt;/span>
&lt;span class="math display">\[Y_{(new)}=-0.355*X_{(old)}+0.355*Y_{(old)}+0.434\]&lt;/span>&lt;/p>
&lt;ul>
&lt;li>&lt;p>These are all formulas to compute two numbers, &lt;span class="math inline">\(X_{(new)}\)&lt;/span> and &lt;span class="math inline">\(Y_{(new)}\)&lt;/span>, from two other numbers &lt;span class="math inline">\(X_{(old)}\)&lt;/span> and &lt;span class="math inline">\(Y_{(old)}\)&lt;/span>.&lt;/p>&lt;/li>
&lt;li>&lt;p>We shall play a game of Ludo with these! The Ludo board will be &lt;span class="math inline">\(\mathbb{R}^2\)&lt;/span>, and the counter will be a single point, which is initially at &lt;span class="math inline">\((X,Y)=(0,0)\)&lt;/span>. Draw one of the four pieces of paper at random and apply the formula on it to compute the new position of the counter. Keep on doing this. At every step you draw one of the four papers at random (the same paper may get picked many times). All the counter positions are marked as dots.&lt;/p>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="r-code" class="section level3">
&lt;h3>R code&lt;/h3>
&lt;pre class="r">&lt;code>play=function(n)
{
X.old=0
Y.old=0
X.all=NULL
Y.all=NULL
for(i in 1:n)
{
sam=sample(1:4,1,replace=T)
if(sam==1)
{
X.new=0.8*X.old+0.1
Y.new=0.8*Y.old+0.04
}
else if(sam==2)
{
X.new=0.5*X.old+0.25
Y.new=0.5*Y.old+0.4
}
else if(sam==3)
{
X.new=0.355*X.old-0.355*Y.old+0.266
Y.new=0.355*X.old+0.355*Y.old+0.078
}
else
{
X.new=0.355*X.old+0.355*Y.old+0.378
Y.new=-0.355*X.old+0.355*Y.old+0.434
}
X.all[i]=X.new
Y.all[i]=Y.new
X.old=X.new
Y.old=Y.new
}
plot(X.all,Y.all,
pch=16,
col=&amp;quot;darkgreen&amp;quot;,
cex=.7,
axes=F,
xlab=&amp;quot;&amp;quot;,
ylab=&amp;quot;&amp;quot;,
main=&amp;quot;Ludo &amp;amp; Paper Game Population&amp;quot;)
box()
}
Ans=play(100000) #--- Playing this game 100,000 times&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_v/index_files/figure-html/unnamed-chunk-1-1.png" width="672" />&lt;/p>
&lt;p>So, actually, individual outcomes are random; but when the number of trials is very large, the experiment loses its randomness and produces a known structural shape, which is very interesting.&lt;/p>
&lt;p>Thank you for reading…&lt;/p>
&lt;/div>
&lt;/div></description></item><item><title>Write a user-defined function in R</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/</link><pubDate>Mon, 25 Oct 2021 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="TOC">
&lt;ul>
&lt;li>&lt;a href="#introduction">INTRODUCTION&lt;/a>&lt;/li>
&lt;li>&lt;a href="#user-defined-functions">User Defined Functions&lt;/a>&lt;/li>
&lt;li>&lt;a href="#doing-more-than-one-computation">Doing more than one computation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#default-argument-of-a-function">Default argument of a function&lt;/a>&lt;/li>
&lt;li>&lt;a href="#additional-arguments">Additional Arguments&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-types-of-arguments">Data types of arguments&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sanity-checking-argument">Sanity checking argument&lt;/a>&lt;/li>
&lt;li>&lt;a href="#scope-of-variables">Scope of variables&lt;/a>&lt;/li>
&lt;li>&lt;a href="#recursive-function">Recursive Function&lt;/a>&lt;/li>
&lt;li>&lt;a href="#loops-in-r">Loops in R&lt;/a>&lt;/li>
&lt;li>&lt;a href="#while-loop">While loop&lt;/a>&lt;/li>
&lt;li>&lt;a href="#if-if-else">If &amp;amp; If-Else&lt;/a>&lt;/li>
&lt;li>&lt;a href="#if-else-function">If-Else function&lt;/a>&lt;/li>
&lt;li>&lt;a href="#else-if-ladder">Else if Ladder&lt;/a>&lt;/li>
&lt;li>&lt;a href="#switch-statement">Switch Statement&lt;/a>&lt;/li>
&lt;li>&lt;a href="#repeat-loop">Repeat Loop&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plotting-functions">Plotting Functions&lt;/a>&lt;/li>
&lt;li>&lt;a href="#plotting-normal-curve">Plotting normal curve&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sin1x-plot">sin(1/x) plot&lt;/a>&lt;/li>
&lt;li>&lt;a href="#zoom-at-the-origin">Zoom at the origin&lt;/a>&lt;/li>
&lt;li>&lt;a href="#solving-equation">Solving Equation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#solving-equation-1">Solving Equation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#solving-equation-2">Solving Equation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#some-calculus-in-r">Some Calculus in R&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimization">Optimization&lt;/a>&lt;/li>
&lt;li>&lt;a href="#further-reading">Further reading&lt;/a>&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="introduction" class="section level2">
&lt;h2>INTRODUCTION&lt;/h2>
&lt;p>In this tutorial, we will learn how to write our own custom functions in R. Although R has thousands of functions across thousands of packages, it is important to know how to build a customized function.&lt;/p>
&lt;/div>
&lt;div id="user-defined-functions" class="section level2">
&lt;h2>User Defined Functions&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Functions are created using the &lt;strong>&lt;em>function()&lt;/em>&lt;/strong> directive and are
stored as R objects just like anything else. In particular, they are R
objects of class “function”.&lt;/p>&lt;/li>
&lt;li>&lt;p>The basic format of the code is&lt;/p>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>function_name = function(arguments)&lt;/strong>&lt;br />
&lt;strong>{&lt;/strong>
&lt;strong>main computation to be done&lt;/strong>
&lt;strong>}&lt;/strong>&lt;/p>
&lt;pre class="r">&lt;code>#---define a function
testfunction = function(x,y)
{
x+y
}
#--- call the function with arguments 2,5
testfunction(2,5)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 7&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="doing-more-than-one-computation" class="section level2">
&lt;h2>Doing more than one computation&lt;/h2>
&lt;ul>
&lt;li>When a function performs more than one task and returns multiple objects, &lt;strong>&lt;em>return()&lt;/em>&lt;/strong> is used to collect all the outputs in the form of a vector.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>testfunction = function(x,y)
{
sum= x+y
prod= x*y
return(c(Sum=sum,Product=prod))
}
testfunction(2,5)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Sum Product
## 7 10&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Note that the two outputs can be accessed separately, as&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>result = testfunction(2,5)
result[1]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Sum
## 7&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>result[2]&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Product
## 10&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Alternatively, multiple outputs can be extracted using &lt;strong>&lt;em>list()&lt;/em>&lt;/strong>. This enables us to extract by names (along with indices).&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>testfunction = function(x,y)
{
sum= x+y
prod= x*y
output=list(Sum=sum,Product=prod)
return(output)
}
output= testfunction(2,5)&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>output$Sum&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 7&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>output$Product&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 10&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="default-argument-of-a-function" class="section level2">
&lt;h2>Default argument of a function&lt;/h2>
&lt;ul>
&lt;li>&lt;p>R provides a way to specify default values for the arguments while defining the function.&lt;/p>&lt;/li>
&lt;li>&lt;p>These default values are used when the function is called, unless other values are supplied in the call.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>#--- initializing x=1 &amp;amp; y=1
testfunction = function(x=1,y=1)
{
sum= x+y
prod= x*y
#--- Creates the output list
output=list(Sum=sum,Product=prod)
return(output)
}
testfunction() #-- calling function with no arguments&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## $Sum
## [1] 2
##
## $Product
## [1] 1&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="additional-arguments" class="section level2">
&lt;h2>Additional Arguments&lt;/h2>
&lt;ul>
&lt;li>Provision for additional arguments (&lt;em>probably optional arguments, which cannot be decided beforehand&lt;/em>) can be done using “&lt;strong>…&lt;/strong>”&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>testfunction = function(x=1,y=1,...)
{
sum= x+y
prod= x*y
#--- Creates the output list
output=list(Sum=sum,Product=prod)
return(output)
}
testfunction(2,5,z=12) #-- z is an extra argument which has no use in this function&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## $Sum
## [1] 7
##
## $Product
## [1] 10&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="data-types-of-arguments" class="section level2">
&lt;h2>Data types of arguments&lt;/h2>
&lt;ul>
&lt;li>Since the types of arguments are not specified (&lt;em>at the time of definition&lt;/em>), the arguments can be of any data type, provided the &lt;strong>internal code of the function is conformable with that data type&lt;/strong>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>testfunction = function(x=1,y=1,...)
{
sum= x+y
prod= x*y
#--- Creates the output list
output=list(Sum=sum,Product=prod)
return(output)
}
testfunction(2,5,z=12) #-- calling with numeric arguments&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## $Sum
## [1] 7
##
## $Product
## [1] 10&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>#-- calling with characters
testfunction(&amp;quot;F&amp;quot;,&amp;quot;M&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;strong>&lt;span class="math inline">\(\color{red}{\text{Error in x+y : non-numeric argument to binary operator}}\)&lt;/span>&lt;/strong>&lt;/p>
&lt;/div>
&lt;div id="sanity-checking-argument" class="section level2">
&lt;h2>Sanity checking argument&lt;/h2>
&lt;ul>
&lt;li>&lt;p>So how can we stop a function when the user calls it with non-conformable arguments?&lt;/p>&lt;/li>
&lt;li>&lt;p>A good practice is to write functions so that, when called, they check whether the supplied arguments make sense before entering the main body of the function.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>testfunction = function(x=1,y=1,...)
{
#-- check if the arguments are not characters
stopifnot(typeof(x)!=&amp;quot;character&amp;quot;,typeof(y)!=&amp;quot;character&amp;quot;)
sum= x+y
prod= x*y
#--- Creates the output list
output=list(Sum=sum,Product=prod)
return(output)
}
testfunction(&amp;quot;F&amp;quot;,&amp;quot;M&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;strong>&lt;span class="math inline">\(\color{red}{\text{Error in testfunction(&amp;quot;F&amp;quot;,&amp;quot;M&amp;quot;) : typeof(x) != &amp;quot;character&amp;quot; is not TRUE}}\)&lt;/span>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>stopifnot&lt;/strong> function halts the execution of the function (&lt;em>with an error message&lt;/em>) if any of its arguments does not evaluate to &lt;strong>TRUE&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="scope-of-variables" class="section level2">
&lt;h2>Scope of variables&lt;/h2>
&lt;ul>
&lt;li>When we define a &lt;strong>variable within a function&lt;/strong>, it will be local and will not affect any &lt;strong>global variable&lt;/strong> even if the name matches.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>f_outer=function()
{
a=2
f_inner=function()
{
b=5
}
}
c=10&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>Then variable &lt;strong>c&lt;/strong> is global to both &lt;strong>f_outer&lt;/strong> and &lt;strong>f_inner&lt;/strong>. For &lt;strong>f_inner&lt;/strong>, variable &lt;strong>b&lt;/strong> is local while &lt;strong>a&lt;/strong> comes from the enclosing function; for &lt;strong>f_outer&lt;/strong>, &lt;strong>a&lt;/strong> is local, and &lt;strong>b&lt;/strong> (defined inside &lt;strong>f_inner&lt;/strong>) is not visible at all.&lt;/li>
&lt;/ul>
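A concrete check of this behaviour (a minimal sketch, not in the original slides — the variable names are illustrative):

```r
a = 10                #-- global variable
f = function()
{
  a = 2               #-- local; does not touch the global a
  a
}
f()                   #-- returns 2
a                     #-- still 10
```

Assigning to `a` inside `f` creates a new local binding; the global `a` is unchanged after the call.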
&lt;/div>
&lt;div id="recursive-function" class="section level2">
&lt;h2>Recursive Function&lt;/h2>
&lt;ul>
&lt;li>R supports recursive function, i.e., a function that calls itself recursively.&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>#-- Creating a recursive function
fact= function(x)
{
if(x==0)
{
return(1)
}
else
{
return(x+fact(x-1))
}
}
fact(5) #-- calling the function with x=5&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 16&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="loops-in-r" class="section level2">
&lt;h2>Loops in R&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Loops help to repeat a job. We first start with the for loop.&lt;/p>&lt;/li>
&lt;li>&lt;p>The syntax is
&lt;strong>for(variable in sequence)&lt;/strong>
&lt;strong>{&lt;/strong>
&lt;strong>expression to be evaluated&lt;/strong>
&lt;strong>}&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Here sequence is an expression which evaluates to a vector (&lt;em>not necessarily an arithmetic progression&lt;/em>)&lt;/p>&lt;/li>
&lt;li>&lt;p>For example, all the following are valid
&lt;strong>for(i in 1:10)&lt;/strong>
&lt;strong>for(i in c(2,3,7,9,13,17,19,23))&lt;/strong>
&lt;strong>for(i in c("A","B","C"))&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>The number of times the expression in the loop is evaluated equals the length of the sequence.&lt;/p>&lt;/li>
&lt;/ul>
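As a quick illustration (not in the original slides), the following loop sums the first 10 natural numbers; the body runs once per element of the sequence `1:10`:

```r
#--- Sum the first 10 natural numbers with a for loop
total = 0
for(i in 1:10)
{
  total = total + i
}
total  #-- 55
```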
&lt;/div>
&lt;div id="while-loop" class="section level2">
&lt;h2>While loop&lt;/h2>
&lt;ul>
&lt;li>&lt;p>The syntax is
&lt;strong>while(condition)&lt;/strong>
&lt;strong>{&lt;/strong>
&lt;strong>expression to be evaluated&lt;/strong>
&lt;strong>}&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>The loop repeats its action as long as the test condition is satisfied, and stops once the condition becomes FALSE.&lt;/p>&lt;/li>
&lt;li>&lt;p>Unlike the for loop, we need not know in advance how many times the loop will repeat.&lt;/p>&lt;/li>
&lt;/ul>
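For example (an illustrative sketch, not from the original slides), the following loop keeps doubling x until it reaches at least 100; we do not know beforehand how many iterations this takes:

```r
#--- Double x until it is at least 100
x = 1
while(x < 100)
{
  x = 2*x
}
x  #-- 128, after 7 doublings
```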
&lt;/div>
&lt;div id="if-if-else" class="section level2">
&lt;h2>If &amp;amp; If-Else&lt;/h2>
&lt;ul>
&lt;li>&lt;p>The syntax for if statement is
if(condition)
{
expression
}&lt;/p>&lt;/li>
&lt;li>&lt;p>For a binary situation we can use if-else
if(condition)
{
expression 1
}
else
{
expression 2
}&lt;/p>&lt;/li>
&lt;/ul>
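A small example with a made-up marks value. Note that at the top level of an R script, `else` must appear on the same line as the closing brace of the if block:

```r
marks = 72
if(marks > 80)
{
  category = "Good"
} else
{
  category = "Fair"
}
category  #-- "Fair"
```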
&lt;/div>
&lt;div id="if-else-function" class="section level2">
&lt;h2>If-Else function&lt;/h2>
&lt;ul>
&lt;li>&lt;p>An alternative, often better, way to write if-else statements is the &lt;strong>ifelse()&lt;/strong> function.&lt;/p>&lt;/li>
&lt;li>&lt;p>The syntax is
&lt;strong>new variable= ifelse(Some Condition, Value of new variable if condition is true, value if condition is false)&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>e.g. 
&lt;strong>category= ifelse(marks&amp;gt;80, "Good", "Fair")&lt;/strong>
assigns the value Good if marks is more than 80 and Fair otherwise.&lt;/p>&lt;/li>
&lt;li>&lt;p>The additional advantage is that, in the condition, this function can compare a vector with a scalar (&lt;em>interpreted as each element compared to the scalar&lt;/em>)&lt;/p>&lt;/li>
&lt;/ul>
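For instance, applying the marks example to a whole vector of (made-up) marks at once — `ifelse()` compares each element with 80 and returns a vector of the same length:

```r
marks = c(85, 60, 92, 75)
#--- each element of marks is compared with the scalar 80
category = ifelse(marks > 80, "Good", "Fair")
category  #-- "Good" "Fair" "Good" "Fair"
```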
&lt;/div>
&lt;div id="else-if-ladder" class="section level2">
&lt;h2>Else if Ladder&lt;/h2>
&lt;ul>
&lt;li>When we have more than two cases we can use else-if ladder&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>f= function(x)
{
if(x==1) print(a)
else if(x==2) print(b)
else print(c)
}&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="switch-statement" class="section level2">
&lt;h2>Switch Statement&lt;/h2>
&lt;ul>
&lt;li>&lt;p>An alternative and faster way is the &lt;strong>switch()&lt;/strong> statement.&lt;/p>&lt;/li>
&lt;li>&lt;p>The basic syntax is &lt;strong>switch(statement,list)&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>Here &lt;strong>statement&lt;/strong> is evaluated and based on this value, the corresponding item in the &lt;strong>list&lt;/strong> is returned.&lt;/p>&lt;/li>
&lt;li>&lt;p>e.g. &lt;strong>switch(2,"A","B","C")&lt;/strong> gives the answer "B". It selects item no. 2 from the list.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>switch(4,"A","B","C")&lt;/strong> gives NULL as there is no item with index 4 in the list.&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>switch("color","color"="red","shape"="round","length"=5)&lt;/strong> gives the answer red (&lt;em>it matches the string&lt;/em>)&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>stat= function(x,type)
{
switch(type,&amp;quot;mean&amp;quot;=mean(x),
&amp;quot;median&amp;quot;=median(x),
&amp;quot;sd&amp;quot;=sd(x))
} #--- function ends here
stat(1:10,&amp;quot;mean&amp;quot;) #-- call the function with mean&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 5.5&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>stat(1:10,&amp;quot;median&amp;quot;) #-- call the function with median&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 5.5&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="repeat-loop" class="section level2">
&lt;h2>Repeat Loop&lt;/h2>
&lt;ul>
&lt;li>Basic syntax is&lt;/li>
&lt;/ul>
&lt;p>repeat
{
expression to be evaluated
}&lt;/p>
&lt;ul>
&lt;li>&lt;p>There is no built-in way of termination; the loop runs forever by default.&lt;/p>&lt;/li>
&lt;li>&lt;p>We need to manually terminate the loop using &lt;strong>break&lt;/strong> statement.&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code>x=1 #-- Take any value x as 1
repeat
{ #-- Loop begin here
x=x+1
if(x==6) break #-- manual instruction to exit loop
} #-- Loop ends here
x #-- checking the value of x&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## [1] 6&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="plotting-functions" class="section level2">
&lt;h2>Plotting Functions&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Any function can be plotted using &lt;strong>curve()&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>The syntax is
&lt;strong>curve(function,from,to,n,add=T/F,…)&lt;/strong>
where &lt;strong>from&lt;/strong> and &lt;strong>to&lt;/strong> give the range over which the function is plotted and &lt;strong>n&lt;/strong> (&lt;em>integer&lt;/em>) is the number of points at which the function is evaluated. &lt;strong>add=TRUE/FALSE&lt;/strong> indicates whether to add this curve to an existing plot or not.&lt;/p>&lt;/li>
&lt;li>&lt;p>To get more information about its arguments type &lt;strong>?curve&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> myfun= function(x)
{
x*(1-x)
}
curve(myfun,from=0,to=1)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/index_files/figure-html/unnamed-chunk-20-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="plotting-normal-curve" class="section level2">
&lt;h2>Plotting normal curve&lt;/h2>
&lt;pre class="r">&lt;code> #-- dnorm gives pdf of N(0,1)
curve(dnorm,from = -4,to=4,n=500)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/index_files/figure-html/unnamed-chunk-22-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="sin1x-plot" class="section level2">
&lt;h2>sin(1/x) plot&lt;/h2>
&lt;pre class="r">&lt;code> curve(sin(1/x),from = -2,to = 2)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Warning in sin(1/x): NaNs produced&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/index_files/figure-html/unnamed-chunk-23-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="zoom-at-the-origin" class="section level2">
&lt;h2>Zoom at the origin&lt;/h2>
&lt;pre class="r">&lt;code> curve(sin(1/x),from = -0.1,to = 0.1)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Warning in sin(1/x): NaNs produced&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://rajeshmajumderblog.netlify.app/blog/internal-project_iii/index_files/figure-html/unnamed-chunk-24-1.png" width="672" />&lt;/p>
&lt;/div>
&lt;div id="solving-equation" class="section level2">
&lt;h2>Solving Equation&lt;/h2>
&lt;ul>
&lt;li>&lt;p>We already know that a system of linear equations can be solved using &lt;strong>&lt;em>solve()&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For equations involving one variable we can use &lt;strong>&lt;em>uniroot()&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>The syntax is &lt;strong>&lt;em>uniroot(function,interval,…)&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>To solve &lt;span class="math display">\[e^x=\sin(x)\]&lt;/span> we write&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> uniroot(function(x) exp(x)-sin(x),c(-5,5))&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="solving-equation-1" class="section level2">
&lt;h2>Solving Equation&lt;/h2>
&lt;pre>&lt;code>## $root
## [1] -3.183063
##
## $f.root
## [1] -1.359327e-08
##
## $iter
## [1] 8
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 6.103516e-05&lt;/code>&lt;/pre>
&lt;/div>
&lt;div id="solving-equation-2" class="section level2">
&lt;h2>Solving Equation&lt;/h2>
&lt;ul>
&lt;li>&lt;p>For finding real or complex roots of a polynomial use &lt;strong>&lt;em>polyroot()&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>For solving roots of &lt;span class="math inline">\(n\)&lt;/span> non-linear equations we can use &lt;strong>&lt;em>multiroot()&lt;/em>&lt;/strong> from the &lt;strong>&lt;em>rootSolve&lt;/em>&lt;/strong> package.&lt;/p>&lt;/li>
&lt;/ul>
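For example (an added illustration), the roots of x² − 1 = 0, with coefficients supplied in increasing powers of x:

```r
#--- roots of -1 + 0*x + 1*x^2 = 0
r = polyroot(c(-1, 0, 1))
r  #-- 1 and -1 (as complex numbers, up to numerical error)
```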
&lt;/div>
&lt;div id="some-calculus-in-r" class="section level2">
&lt;h2>Some Calculus in R&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Definite integrals can be computed using &lt;strong>&lt;em>integrate()&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>e.g. &lt;span class="math inline">\(\int_0^1(x^2)dx\)&lt;/span> can be done using&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> integrate(function(x) x^2,0,1)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## 0.3333333 with absolute error &amp;lt; 3.7e-15&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>For derivatives, we use &lt;strong>&lt;em>deriv()&lt;/em>&lt;/strong>&lt;/li>
&lt;/ul>
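A small added illustration using base R's symbolic differentiation: `D()` (the simpler cousin of `deriv()`) differentiates an expression with respect to a named variable; here the derivative of x² is 2x:

```r
#--- symbolic derivative of x^2 with respect to x
dfdx = D(expression(x^2), "x")
dfdx          #-- 2 * x
x = 3
eval(dfdx)    #-- 6
```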
&lt;/div>
&lt;div id="optimization" class="section level2">
&lt;h2>Optimization&lt;/h2>
&lt;ul>
&lt;li>&lt;p>Maximum or Minimum value of a function can be found using &lt;strong>&lt;em>optimize()&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;li>&lt;p>&lt;strong>&lt;em>optimize(function,interval,maximum=TRUE/FALSE)&lt;/em>&lt;/strong>&lt;/p>&lt;/li>
&lt;/ul>
&lt;pre class="r">&lt;code> optimise(function(x) exp(-x),c(0,5))&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## $minimum
## [1] 4.999936
##
## $objective
## [1] 0.006738379&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>There are other functions for optimization like &lt;strong>&lt;em>optim()&lt;/em>&lt;/strong>,&lt;strong>&lt;em>nlm()&lt;/em>&lt;/strong>,&lt;strong>&lt;em>constrOptim()&lt;/em>&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div id="further-reading" class="section level2">
&lt;h2>Further reading&lt;/h2>
&lt;p>&lt;a href="https://rstudio-education.github.io/hopr/">&lt;em>Garrett Grolemund&lt;/em>, &lt;strong>Hands-On Programming with R&lt;/strong>, &lt;em>O’REILLY&lt;/em>&lt;/a>&lt;/p>
&lt;/div></description></item><item><title>Performance of LASSO when one or more covariate(s) is/are Missing Not at Random(MNAR)</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project/</link><pubDate>Mon, 23 Aug 2021 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="about" class="section level2">
&lt;h2>About&lt;/h2>
&lt;p>This is my M.Sc. final year project.&lt;/p>
&lt;p>I did this project under the supervision of my mentor &lt;a href="https://www.wbsu.ac.in/faculty/dr-sumanta-adhya/">Dr. Sumanta Adhya, WBSU.&lt;/a>&lt;/p>
&lt;p>In this project, I have studied how LASSO performs the variable-selection task under multicollinearity when the data are affected by missing values and the missingness is not at random. I have investigated different LASSO solutions on simulated data sets, trying to find a method that works well in this situation. In this project, I have proposed a new methodology, “Inverse Probability Weighted Logistic Lasso Estimation”, which gives a better solution than complete case analysis under the MNAR mechanism.&lt;/p>
&lt;p>Here I have compared a total of five LASSO solution techniques: “LASSO on the original data set (when all values are known)”, “LASSO on the complete data set (removing all missing observations)”, “IPW-LASSO on the complete data set using known (actual) missing probabilities”, “IPW-LASSO on the complete data set using estimated (MLE) missing probabilities”, and “IPW-LASSO on the complete data set using estimated (logistic LASSO) missing probabilities”. I have shown that the last of these is a better solution than simple complete case analysis when the missing mechanism is MNAR.&lt;/p>
&lt;p>&lt;strong>&lt;em>Keywords&lt;/em>&lt;/strong> : &lt;em>MNAR&lt;/em>, &lt;em>Logistic Regression&lt;/em>, &lt;em>LASSO&lt;/em>, &lt;em>IPW&lt;/em>, &lt;em>IPW-LASSO&lt;/em>.&lt;/p>
&lt;p>&lt;div class="alert alert-note">
&lt;div>
Click the &lt;em>Slide&lt;/em> button above to see the project presentation.
&lt;/div>
&lt;/div>
&lt;/p>
&lt;p>&lt;div class="alert alert-note">
&lt;div>
Click the &lt;em>Report&lt;/em> button above to see the project document.
&lt;/div>
&lt;/div>
&lt;/p>
&lt;p>&lt;div class="alert alert-note">
&lt;div>
Click the &lt;em>github&lt;/em> button above to see the R code.
&lt;/div>
&lt;/div>
&lt;/p>
&lt;/div></description></item><item><title>A Study of effect of different Diet on Weight loss</title><link>https://rajeshmajumderblog.netlify.app/blog/internal-project_ii/</link><pubDate>Mon, 14 Jan 2019 00:00:00 +0000</pubDate><guid>https://rajeshmajumderblog.netlify.app/blog/internal-project_ii/</guid><description>
&lt;script src="https://rajeshmajumderblog.netlify.app/blog/internal-project_ii/index_files/header-attrs/header-attrs.js">&lt;/script>
&lt;div id="about" class="section level2">
&lt;h2>About&lt;/h2>
&lt;p>This is my B.Sc. final year project.&lt;/p>
&lt;p>I did this project under the supervision of my mentor &lt;strong>Dr. Arabinda Das, A.P.C. College.&lt;/strong>&lt;/p>
&lt;p>In this project, I worked on diet data, studying how different diets actually affected weight loss.&lt;/p>
&lt;p>&lt;strong>&lt;em>Keywords&lt;/em>&lt;/strong>: &lt;em>ANOVA&lt;/em>, &lt;em>ANCOVA&lt;/em>, &lt;em>Shapiro-Wilk Test&lt;/em>, &lt;em>Kolmogorov-Smirnov test&lt;/em>, &lt;em>Bartlett’s test&lt;/em>, &lt;em>Levene’s Test&lt;/em>, &lt;em>Tukey HSD test&lt;/em>.&lt;/p>
&lt;p>&lt;div class="alert alert-note">
&lt;div>
Click the &lt;em>Report&lt;/em> button above to see the project document.
&lt;/div>
&lt;/div>
&lt;/p>
&lt;p>&lt;div class="alert alert-note">
&lt;div>
Unfortunately, I am unable to give the data set and the code because, I have lost all the necessary documents regarding this project, due to a computer crash. But I am trying to do this again from scratch.
&lt;/div>
&lt;/div>
&lt;/p>
&lt;/div></description></item></channel></rss>