Sunday, 15 March 2015

R Packette - Weighted Binomial Sums

This R code file Weighted_BinSum.r is a... packette? Proto-package? Mini-package?

It's by no means a full R package because it doesn't have the proper documentation, but it's the start of what could be one.

This packette is titled "Exact and Approximate Distributions of Weighted Binomial Sums", it's used produce a distribution of sum( w * binomial(n,p)) where n and p are the usual binomial parameters and w values are arbitrary positive weights. It also has functions to work with the resultant distribution (pbinsum, qbinsum, dbinsum, and rbinsum)  much as you would any of the classical distributions.

The function to compute the sum distribution, make.binsum( ... ), uses an operational parameter, "tol" for tolerance, which limits the number of possible sum values that will be evaluated. If the anticipated number of distinct values the weighted sum can take exceeds this tolerance value, values are rounded off and grouped together in "tol" even spaced increments, and a flag that distribution is approximate, "is.approx", becomes set to true. If such an approximation is never necessary, the exact distribution is computed instead.  The default tolerance is 1 million for weighted binomial sums, and 100 million when the weights are all equal.

Another operational parameter, "batch_size", is used to control the number of possible (non-distinct) sum values that are considered at once. Larger batch sizes reduce the computation time but require more memory. The default batch size, a soft limit, is 1 million and should limit to RAM footprint of this function to a couple hundred Mb at worst. Garbage collection gc() is called to ensure the RAM usage is temporary.

The default output of make.binsum( ... ) is a list including $x, a vector of the possible sum values with estimated probability mass, $pdf and cumulative probability $cdf, as well as the flag $is.approx and the input parameters $n, $p, and $w. By setting the operational parameter out_df in make.binsum( ... ), you can get a data frame of $x, $pdf, and $cdf instead. Both the data frame and list work for the p,q,r, and dbinsum functions.


Examples:

source("Weighted_BinSum.r")
 n = c(50,100,150,200,250)
 p = c(0.1,0.2,0.3,0.4,0.5) 
 w = c(1,1,1,1,1,1)
distr = make.binsum(n=n,p=p,w=w)
str(distr)

List of 7
 $ x        : num [1:751] 0 1 2 3 4 5 6 7 8 9 ...
 $ pdf      : num [1:751] 1.44e-155 6.89e-153 1.64e-150 2.61e-148 3.11e-146 ...
 $ cdf      : num [1:751] 1.44e-155 6.90e-153 1.65e-150 2.63e-148 3.13e-146 ...
 $ n        : num [1:5] 50 100 150 200 250
 $ p        : num [1:5] 0.1 0.2 0.3 0.4 0.5
 $ w        : num [1:5] 1 1 1 1 1
 $ is.approx: logi FALSE


And as a data frame



distr = make.binsum(n=n,p=p,w=w,out_df=TRUE)
str(distr)

'data.frame':   751 obs. of  3 variables:
 $ x  : num  0 1 2 3 4 5 6 7 8 9 ...
 $ pdf: num  1.44e-155 6.89e-153 1.64e-150 2.61e-148 3.11e-146 ...
 $ cdf: num  1.44e-155 6.90e-153 1.65e-150 2.63e-148 3.13e-146 ...


And some query functions (either as a data frame or as a list)

 ## Get the CDF
 pbinsum(c(275,283,291,296,300,305,311,315,320,326),distr)
 [1] 0.5167767 0.7480498 0.9019275 0.9536964 0.9768502 0.9913582 0.9977820 0.9991968
 [9] 0.9998006 0.9999688

 ## Trace CDF back to quantiles
 temp = pbinsum(c(275,283,291,296,300,305,311,315,320,326),distr)
 qbinsum(temp, distr)
 [1] 275 283 291 296 300 305 311 315 320 326


 temp = pbinsum(c(275,283,291,296,300,305,311,315,320,326)+0.5,distr)
 qbinsum(temp, distr) # Some error introduced by linear interpolation
 [1] 275.4991 283.4868 291.4748 296.4673 300.4613 305.4539 311.4451 315.4393 320.4321
[10] 326.4234

 ## Get the probability mass
 dbinsum(c(275,283,291,296,300,305,311,315,320,326),distr)
 [1] 3.128514e-02 2.558020e-02 1.417002e-02 8.051091e-03 4.597739e-03 1.995477e-03
 [7] 6.023298e-04 2.407634e-04 6.700921e-05 1.189357e-05

 ## Randomly select values (with replacement) based on the PMF
 rbinsum(10,distr)
 [1] 279 279 269 270 285 274 277 275 266 266


Finally, a weighted case, which takes more time and ram, produces much larger objects, and is sometimes approximate.

n = c(50,100,150,200,250)
p = c(0.1,0.2,0.3,0.4,0.5) 
w = c(1,1.2,1.3,1.67,2)
distr = make.binsum(n=n,p=p,w=w)

 str(distr)
List of 7
 $ x        : num [1:117967] 0 1 1.2 1.3 1.67 ...
 $ pdf      : num [1:117967] 1.44e-155 8.00e-155 1.32e-147 4.36e-125 2.94e-90 ...
 $ cdf      : num [1:117967] 1.44e-155 9.44e-155 1.32e-147 4.36e-125 2.94e-90 ...
 $ n        : num [1:5] 50 100 150 200 250
 $ p        : num [1:5] 0.1 0.2 0.3 0.4 0.5
 $ w        : num [1:5] 1 1.2 1.3 1.67 2
 $ is.approx: logi TRUE


Given time, I would like to expand this to more distributions like negative binomial, Poisson, and arbitrary discrete distributions. I would also like incorporate more approximation methods, including the one investigated by Michael Stephens (SFU) and Ken Butler (UofT) that inspired this work originally.

As always, feedback and bug reports are greatly appreciated on either the comments here or at jack.davis.sfu@gmail.com

No comments:

Post a Comment