# Simulated Data With R

R has a wide variety of capabilities for generating simulated data that can be useful for testing or pedagogy.

## Linear Sequences

Linear sequences of numbers are commonly used in R, notably for iterating through members of data structures.

For sequences of successive integers, the **colon operator**
is usually the simplest choice. The result is all integers
between the number before the colon and the number after the colon, inclusive.
The sequence can be ascending or descending:

> x = 1:10 > x [1] 1 2 3 4 5 6 7 8 9 10 > x = 1:-10 > x [1] 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 > x = -1:10 > x [1] -1 0 1 2 3 4 5 6 7 8 9 10 > x = 1:1 > x [1] 1

For sequences of real numbers or sequences that proceed in
increments not equal to one, the **seq(from, to, by)**
function:

> x = seq(10, 12, 0.3) > x [1] 10.0 10.3 10.6 10.9 11.2 11.5 11.8 > x = seq(-10, -12, -0.3) > x [1] -10.0 -10.3 -10.6 -10.9 -11.2 -11.5 -11.8

For regular sequences of specific numbers
the **rep()** (repeat)
function can be used. The parameters are:

- x: the value to repeat
- times: the number of times to repeat that value

> x = rep(1, 10) > x [1] 1 1 1 1 1 1 1 1 1 1

Multiple calls can be concatenated with the **c()**
function to create collections of numbers or categorical/dichotomous data:

> x = c(rep(1,10), rep(2, 10), rep(3, 10)) > x [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 > x = c(rep("Democratic", 4), rep("Republican", 6)) > x [1] "Democratic" "Democratic" "Democratic" "Democratic" "Republican" [6] "Republican" "Republican" "Republican" "Republican" "Republican"

## Nonlinear Sequences

Linear sequences can be transformed to nonlinear sequences of numbers using a variety of functions. Some examples are given below.

### Exponential and Logarithmic Sequences

Many phenomena, such as growth, follow exponential
or logarithmic curves. Natural exponentials and logarithms
(powers and roots of
Euler's Number) are supported with the **exp()** and **log()** functions.

> x = exp(0:10) > x [1] 1.000000 2.718282 7.389056 20.085537 54.598150 [6] 148.413159 403.428793 1096.633158 2980.957987 8103.083928 [11] 22026.465795 > plot(x, type="b")

> x = log(1:20) > x [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573 [15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323 > plot(x, type="b")

Exponential sequences can be created based on any base using the
**carat operator**:

> x = 2^(0:10) > x [1] 1 2 4 8 16 32 64 128 256 512 1024 > plot(x, type="b")

The optional second parameter to the log() function is an base
if you need a logarithm on something other than *e*:

> x = log(1:10, 2) > x [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000 [9] 3.169925 3.321928 > plot(x, type="b")

### Cyclical Sequences

Many phenomena such as temperature or market activity occur in
regularly repeating cycles. The **sin()** and **cos()**
functions can be used to create cyclical sequences. The parameter
is in radians (multiples of 2π), and the native **pi** constant
can be used to convert a linear sequence to radians:

> x = sin(seq(0, 6, 0.25) * pi) > x [1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16 [6] -7.071068e-01 -1.000000e+00 -7.071068e-01 -2.449294e-16 7.071068e-01 [11] 1.000000e+00 7.071068e-01 3.673940e-16 -7.071068e-01 -1.000000e+00 [16] -7.071068e-01 -4.898587e-16 7.071068e-01 1.000000e+00 7.071068e-01 [21] 6.123234e-16 -7.071068e-01 -1.000000e+00 -7.071068e-01 -7.347881e-16 > plot(x, type="b")

## Normally-Distributed Sequences

A normally distributed sequence of numbers can be created using the
**qnorm()** quantile function. The parameters:

- x: A linear sequence of probabilities in the range 0 to 1
- mean: The mean (μ)
- sd: The standard deviation (σ)

> x = qnorm(seq(0, 1, 0.02), 2, 1) > x [1] -Inf -0.05374891 0.24931393 0.44522641 0.59492844 0.71844843 [7] 0.82501321 0.91968066 1.00554212 1.08463491 1.15837877 1.22780679 [13] 1.29369744 1.35665459 1.41715849 1.47559949 1.53230120 1.58753687 [19] 1.64154121 1.69451921 1.74665290 1.79810652 1.84903078 1.89956628 [25] 1.94984642 2.00000000 2.05015358 2.10043372 2.15096922 2.20189348 [31] 2.25334710 2.30548079 2.35845879 2.41246313 2.46769880 2.52440051 [37] 2.58284151 2.64334541 2.70630256 2.77219321 2.84162123 2.91536509 [43] 2.99445788 3.08031934 3.17498679 3.28155157 3.40507156 3.55477359 [49] 3.75068607 4.05374891 Inf > hist(x)

### Sequences Based on Percentiles

You will occasionally find statistical data given as percentiles. For example, in this document the CDC gives the following percentiles for the height of men in the USA:

Percentile: 5 10 15 25 50 75 85 90 95 Height (in): 64.3 65.4 66.1 67.2 69.1 71.2 72.3 73.0 74.1

This means that 5% of men are below 64.3 inches, 25% are below 67.2 inches, etc.

By using a spline, it is possible to simulate a curve that fits those percentile points, then use the predict() function to create a representative sample. The CDC indicates that they based their estimates on 5,232 examined persons (sample size), which is what is simulated below:

# Table 12. Height in inches for males aged 20 and over and number of examined # persons, mean, standard error of the mean, and selected percentiles, by race # and Hispanic origin and age: United States, 2011–2014 pheight = c(64.3, 65.4, 66.1, 67.2, 69.1, 71.2, 72.3, 73.0, 74.1) percentile = c(5, 10, 15, 25, 50, 75, 85, 90, 95) # Create a spline to fit those points spline = smooth.spline(percentile, pheight) # Simulate 5,232 points that cover the range of percentiles height = predict(spline, 100 * (1:5232) / 5233) # Plot the simulated data plot(percentile, pheight, col="blue", pch=16) lines(height$x, height$y) # Print the descriptive statistics paste("mean = ", mean(height$y)) paste("stdev =", sd(height$y)) paste("stderr =", sd(height$y) / sqrt(length(height$y)))

### Logistic Sequences

Phenomena like growth to the carrying capacity of an ecosystem follow a logistic curve, which
can be generated using the self-starting logistic model function
**SSlogis()**. The parameters are:

- input: A linear sequence
- Asym: The maximum output value (carrying capacity)
- xmid: The x value that is at the middle of the curve
- scal: The scaling (width) of the curve

> x = SSlogis(1:20, Asym = 100, xmid = 10, scal = 1) > x [1] 0.01233946 0.03353501 0.09110512 0.24726232 0.66928509 1.79862100 [7] 4.74258732 11.92029220 26.89414214 50.00000000 73.10585786 88.07970780 [13] 95.25741268 98.20137900 99.33071491 99.75273768 99.90889488 99.96646499 [19] 99.98766054 99.99546021 > plot(x, type="b") > text(1, 10, "Initial Colonization", pos=4) > text(11, 50, "High Growth", pos=4) > text(13, 90, "Limit of Carrying Capacity", pos=4)

## Random Sequences

Random sequences of numbers are useful for testing analysis code, and are essential for formal statistical techniques like Monte Carlo methods.

Random sequences of numbers do not follow a regular pattern. While computers are deterministic machines that cannot generate truly random sequences of numbers, there are a variety of algorithms that generate pseudo-random sequences that are close enough to random to meet the needs of researchers.

### Uniformly-Distributed Random Sequences

The most basic random number generating function is the **runif()**
function, which generates a random sequence in a uniform distribution (the unif part of the name).
The parameters:

- n: The count of numbers to generate
- min: The minimum possible value in the sequence (defaults to 0)
- max: The maximum possible value in the sequence (defaults to 1)

> x = runif(20) > x [1] 0.6877721 0.7481171 0.9397808 0.6928701 0.6112946 0.4786213 0.3839646 [8] 0.4164434 0.9205573 0.8799462 0.1750195 0.5641760 0.9437534 0.5265315 [15] 0.1380678 0.8415553 0.4562519 0.8806186 0.9218461 0.3643150 > hist(x)

runif() will return an unpredictably different sequence on
each successive call. If you need consistent repeatability, such as for
a demonstration presentation, you can use the **set.seed()**
function to set a seed number that is used to start the
random number generator at a specific point:

> set.seed(1) > runif(10) [1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 [7] 0.94467527 0.66079779 0.62911404 0.06178627 > runif(10) [1] 0.2059746 0.1765568 0.6870228 0.3841037 0.7698414 0.4976992 0.7176185 [8] 0.9919061 0.3800352 0.7774452 > set.seed(1) > runif(10) [1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 [7] 0.94467527 0.66079779 0.62911404 0.06178627

### Normally-Distributed Random Sequences

As with linear sequences, a random sequence generated by runif() can
be fed into other functions to transform to a different distribution.
There are also convenience functions that do this in one step. For
example the **rnorm()** function generates random numbers that
have a normal distribution. Because the sequence is random, the
distribution will not be perfectly even.

For example, the mean height of men in the US is 5 feet, 9.3 inches with a standard deviation of 2.94 inches.. Simulated height for 50 men:

> x = rnorm(50, (5 * 12) + 93, 2.94) > x [1] 153.4227 152.6538 150.3185 148.7735 150.6566 156.6870 155.2701 152.3546 [9] 151.7511 151.7682 155.9311 152.1892 156.6927 154.9012 156.8200 150.4326 [17] 153.0246 150.4102 154.7530 153.3520 152.1704 157.2806 153.6733 155.9298 [25] 155.2987 150.7163 151.1890 153.1369 149.6767 154.6956 149.2346 157.7788 [33] 151.5280 157.9342 151.7872 150.1415 153.0746 153.0808 148.0603 156.0980 [41] 149.7084 153.9867 154.4547 153.4059 152.6508 153.5812 149.8580 150.6386 [49] 149.7255 157.6455 > hist(x, breaks=10)

### Random Integers

runif() generates real numbers. If you need a random sequence of
integers (whole numbers), the **round()** function can be used with
the **as.integer()** type convertor. Because the sequence is random,
it will not necessarily be completely even:

> x = as.integer(round(runif(50, 0, 10))) > x [1] 2 7 6 2 0 7 1 5 7 4 8 1 9 10 1 1 0 10 8 9 2 6 8 5 10 [26] 10 3 10 8 5 6 4 4 10 10 5 2 2 1 1 5 6 1 2 1 8 9 1 8 3 > hist(x, breaks=11)

### Random Categorical Data

Random dichotomous (one of two values) sequences can be
created using the **ifelse()** function. The condition
determines the proportion of values, although with small numbers
the randomness of the sequence may not match the proportion:

> x = ifelse(runif(20) > 0.6, "Democratic", "Republican") > x [1] "Republican" "Democratic" "Republican" "Republican" "Republican" [6] "Republican" "Republican" "Republican" "Democratic" "Republican" [11] "Democratic" "Republican" "Republican" "Republican" "Democratic" [16] "Republican" "Republican" "Democratic" "Republican" "Democratic" > table(x) x Democratic Republican 6 14

## Random Sampling

Random numbers are commonly used to select a random sample
from a list of possible people to survey. The **sample()**
function is a shortcut to perform this task. As with runif(),
the set.seed() function can be used to get reproducable results.
The parameters to sample():

- x: a vector of values to sample from
- size: the number of values to sample

> people = c("Hailey", "Tom", "Jane", "Jim", "Tony", "Mike", + "Ethan", "Misty", "Frank", "Joan", "Ida", "Helen") > set.seed(1) > sample(people, 5) [1] "Jim" "Tony" "Mike" "Frank" "Tom" > sample(people, 5) [1] "Ida" "Helen" "Ethan" "Mike" "Hailey" > set.seed(1) > sample(people, 5) [1] "Jim" "Tony" "Mike" "Frank" "Tom"

sample() can be used to demonstrate the margin of error with sampling. For example, in close election involving 10,000 voters, with 51 percent voting for a Democratic candidate, three successive random samples give significantly different results, including results outside the ±5% margin of error with a 95% level of confidence:

# Simulated population of 10,000 voters with 51% planning to vote Democratic > x = c(rep("Republican", 4900), rep("Democratic", 5100)) > set.seed(1) > table(sample(x, 100)) Democratic Republican 50 50 > table(sample(x, 100)) Democratic Republican 55 45 > table(sample(x, 100)) Democratic Republican 41 59

While political polling is an art that is more complex than simple random sampling, this helps demonstrate how polls that incorporate rigorous sampling methodologies can yield predictions that do not match the final election results.

## Two-Variable Relationship Example

The following formula is use in linear regression to model the relationship between two variables:

Y_{i}= α + ΒX_{i}+ ε_{i}

In English that can be described as:

y = offset + scale * sequence + random_error

For example, there is a rough inverse correlation (R^{2})
between median household income and percent obesity
in US counties (2017-county-data.csv):

> x = read.csv("2017-county-data.csv") > plot(x$MEDHHINC, x$PCOBESITY, col="#00000080") > m = lm(PCOBESITY ~ MEDHHINC, data=x) > abline(m) > abline(m, col="darkred") > summary(m) Call: lm(formula = PCOBESITY ~ MEDHHINC, data = x) Residuals: Min 1Q Median 3Q Max -15.8543 -2.1624 0.2542 2.5016 11.8794 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.922e+01 2.752e-01 142.51 <2e-16 *** MEDHHINC -1.767e-04 5.684e-06 -31.09 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.9 on 3137 degrees of freedom (8 observations deleted due to missingness) Multiple R-squared: 0.2355, Adjusted R-squared: 0.2353 F-statistic: 966.5 on 1 and 3137 DF, p-value: < 2.2e-16

That relationship can be described by the following formula. For each dollar increase in median household income, the percent of obese residents drops by 0.018 percent.

percent_obesity = 39.22 - 0.0001767 * median_household_income

Given that formula we can combine functions described above to create simulated data. Also included is a simple linear model (lm()) to verify the results.

> income = seq(0, 150000, 150) > error = runif(length(income), -5, 5) > set.seed(1) > pcobesity = 39.22 - (0.0001767 * income) + error > plot(income, pcobesity) > > # Verify the simulated data with a linear model > model = lm(pcobesity ~ income) > summary(model) Call: lm(formula = pcobesity ~ income) Residuals: Min 1Q Median 3Q Max -5.0612 -2.2972 -0.1046 2.2012 4.8564 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.945e+01 4.377e-01 90.13 <2e-16 *** income -1.781e-04 5.046e-06 -35.30 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.703 on 149 degrees of freedom Multiple R-squared: 0.8932, Adjusted R-squared: 0.8925 F-statistic: 1246 on 1 and 149 DF, p-value: < 2.2e-16

However, income by county is normally rather than uniformly distributed, suggesting that rnorm() would be better than runif() for generating the input values and error values for the model:

> set.seed(1) > income = rnorm(1000, mean=47000, sd=12200) > error = rnorm(length(income), -1, 3.9) > pcobesity = 39.22 - (0.0001767 * income) + error > plot(income, pcobesity) > > # Verify the simulated data with a linear model > model = lm(pcobesity ~ income) > abline(model, col="darkred") > summary(model) Call: lm(formula = pcobesity ~ income) Residuals: Min 1Q Median 3Q Max -12.6689 -2.6206 -0.0536 2.9460 14.2129 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.806e+01 4.934e-01 77.13 <2e-16 *** income -1.746e-04 1.017e-05 -17.18 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.058 on 998 degrees of freedom Multiple R-squared: 0.2281, Adjusted R-squared: 0.2274 F-statistic: 295 on 1 and 998 DF, p-value: < 2.2e-16