Title: | Survey Sampling |
---|---|
Description: | Functions to draw random samples using different sampling schemes are available. Functions are also provided to obtain (generalized) calibration weights, different estimators, as well some variance estimators. |
Authors: | Yves Tillé <[email protected]>, Alina Matei <[email protected]> |
Maintainer: | Alina Matei <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.10 |
Built: | 2024-11-18 04:54:02 UTC |
Source: | https://github.com/cran/sampling |
Selects a balanced cluster sample.
balancedcluster(X,m,cluster,selection=1,comment=TRUE,method=1)
balancedcluster(X,m,cluster,selection=1,comment=TRUE,method=1)
X |
matrix of auxiliary variables on which the sample must be balanced. |
m |
number of clusters to be selected. |
cluster |
vector of integers that defines the clusters. |
selection |
1, selection of the clusters with probabilities proportional
to size, |
comment |
a comment is written during the execution if |
method |
the used method in the function |
Returns a matrix containing the vector of inclusion probabilities and the selected sample.
samplecube
, fastflightcube
, landingcube
############ ## Example 1 ############ # definition of the clusters; there are 15 units in 3 clusters cluster=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # selection of 2 clusters s=balancedcluster(X,2,cluster,2,TRUE) # the sample of clusters with the inclusion probabilities of the clusters s # the selected clusters unique(cluster[s[,1]==1]) # the selected units (1:length(cluster))[s[,1]==1] # with the probabilities s[s[,1]==1,2] ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84) s=balancedcluster(X,10,MU284$CL,1,TRUE) cluster=MU284$CL # the selected clusters unique(cluster[s[,1]==1]) # the selected units (1:length(cluster))[s[,1]==1] # with the probabilities s[s[,1]==1,2]
############ ## Example 1 ############ # definition of the clusters; there are 15 units in 3 clusters cluster=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # selection of 2 clusters s=balancedcluster(X,2,cluster,2,TRUE) # the sample of clusters with the inclusion probabilities of the clusters s # the selected clusters unique(cluster[s[,1]==1]) # the selected units (1:length(cluster))[s[,1]==1] # with the probabilities s[s[,1]==1,2] ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84) s=balancedcluster(X,10,MU284$CL,1,TRUE) cluster=MU284$CL # the selected clusters unique(cluster[s[,1]==1]) # the selected units (1:length(cluster))[s[,1]==1] # with the probabilities s[s[,1]==1,2]
Selects a stratified balanced sample (a vector of 0 and 1). Firstly, the flight phase is applied in each stratum. Secondly, the strata are aggregated and the flight phase is applied on the whole population. Finally, the landing phase is applied on the whole population.
balancedstratification(X,strata,pik,comment=TRUE,method=1)
balancedstratification(X,strata,pik,comment=TRUE,method=1)
X |
matrix of auxiliary variables on which the sample must be balanced. |
strata |
vector of integers that specifies the stratification. |
pik |
vector of inclusion probabilities. |
comment |
a comment is written during the execution if |
method |
the used method in the function |
Tillé, Y. (2006), Sampling Algorithms, Springer.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53–62.
Chauvet, G. and Tillé, Y. (2005). New SAS macros for balanced sampling. In INSEE, editor, Journées de Méthodologie Statistique, Paris.
Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: the cube method. Biometrika, 91:893–912.
Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128/2:411–425.
samplecube
, fastflightcube
, landingcube
############ ## Example 1 ############ # variable of stratification (3 strata) strata=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # Vector of inclusion probabilities. # the sample has its size equal to 9. pik=rep(3/5,times=15) # selection of a stratified sample s=balancedstratification(X,strata,pik,comment=TRUE) # the sample is (1:length(pik))[s==1] ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84) strata=MU284$REG pik=inclusionprobabilities(MU284$P75,80) s=balancedstratification(X,strata,pik,TRUE) #the selected units are MU284$LABEL[s==1]
############ ## Example 1 ############ # variable of stratification (3 strata) strata=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # Vector of inclusion probabilities. # the sample has its size equal to 9. pik=rep(3/5,times=15) # selection of a stratified sample s=balancedstratification(X,strata,pik,comment=TRUE) # the sample is (1:length(pik))[s==1] ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84) strata=MU284$REG pik=inclusionprobabilities(MU284$P75,80) s=balancedstratification(X,strata,pik,TRUE) #the selected units are MU284$LABEL[s==1]
Selects a balanced two-stage sample.
balancedtwostage(X,selection,m,n,PU,comment=TRUE,method=1)
balancedtwostage(X,selection,m,n,PU,comment=TRUE,method=1)
X |
matrix of auxiliary variables on which the sample must be balanced. |
selection |
1, for simple random sampling without replacement at each stage, |
m |
number of primary sampling units to be selected. |
n |
number of second-stage sampling units to be selected. |
PU |
vector of integers that defines the primary sampling units. |
comment |
a comment is written during the execution if |
method |
the used method in the function |
The function returns a matrix whose columns are the following five vectors: the selected second-stage sampling units (0 - unselected, 1 - selected), the final inclusion probabilities, the selected primary sampling units, the inclusion probabilities of the first stage, the inclusion probabilities of the second stage.
samplecube
, fastflightcube
, landingcube
,
balancedstratification
, balancedcluster
############ ## Example 1 ############ # definition of the primary units (3 primary units) PU=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # selection of 2 primary sampling units and 4 second-stage sampling units # sample and inclusion probabilities s=balancedtwostage(X,1,2,4,PU,comment=TRUE) s ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$ME84) N=dim(X)[1] PU=MU284$CL m=20 n=60 # sample and inclusion probabilities s=balancedtwostage(X,1,m,n,PU,TRUE) s
############ ## Example 1 ############ # definition of the primary units (3 primary units) PU=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) # matrix of balancing variables X=cbind(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) # selection of 2 primary sampling units and 4 second-stage sampling units # sample and inclusion probabilities s=balancedtwostage(X,1,2,4,PU,comment=TRUE) s ############ ## Example 2 ############ data(MU284) X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$ME84) N=dim(X)[1] PU=MU284$CL m=20 n=60 # sample and inclusion probabilities s=balancedtwostage(X,1,m,n,PU,TRUE) s
This data provides information about the Belgian population of July 1, 2004 compared to that of July 1, 2003, and some financial information about the municipality incomes at the end of 2001.
data(belgianmunicipalities)
data(belgianmunicipalities)
A data frame with 589 observations on the following 17 variables:
municipality name.
‘Institut National de statistique’ code.
province number.
administrative division number.
number of men on July 1, 2004.
number of women on July 1, 2004.
total population on July 1, 2004.
number of men on July 1, 2003.
number of women on July 1, 2003.
total population on July 1, 2003.
number of men on July 1, 2004 minus the number of men on July 1, 2003.
number of women on July 1, 2004 minus the number of women on July 1, 2003.
difference between the total population on July 1, 2004 and on July 1, 2003.
total taxable income in euros in 2001.
total taxation in euros in 2001.
average of the income-tax return in euros in 2001.
median of the income-tax return in euros in 2001.
http://https://statbel.fgov.be/fr
data(belgianmunicipalities) hist(belgianmunicipalities$medianincome)
data(belgianmunicipalities) hist(belgianmunicipalities$medianincome)
Computes the g-weights of the calibration estimator. The g-weights should lie in the specified bounds for the truncated and logit methods.
calib(Xs,d,total,q=rep(1,length(d)),method=c("linear","raking","truncated", "logit"),bounds=c(low=0,upp=10),description=FALSE,max_iter=500)
calib(Xs,d,total,q=rep(1,length(d)),method=c("linear","raking","truncated", "logit"),bounds=c(low=0,upp=10),description=FALSE,max_iter=500)
Xs |
matrix of calibration variables. |
d |
vector of initial weights. |
total |
vector of population totals. |
q |
vector of positive values accounting for heteroscedasticity; the variation of the g-weights is reduced for small values of q. |
method |
calibration method (linear, raking, logit, truncated). |
bounds |
vector of bounds for the g-weights used in the truncated and logit methods; 'low' is the smallest value and 'upp' is the largest value. |
description |
if description=TRUE, summary of initial and final weights are printed, and their boxplots and histograms are drawn; by default, its value is FALSE. |
max_iter |
maximum number of iterations in the Newton's method. |
The argument method implements the methods given in the paper of Deville and Särndal(1992).
Returns the vector of g-weights.
Cassel, C.-M., Särndal, C.-E., and Wretman, J. (1976). Some results on generalized difference estimation and generalized regression estimation for finite population.Biometrika, 63:615–620.
Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87:376–382.
Deville, J.-C., Särndal, C.-E., and Sautory, O. (1993). Generalized raking procedure in survey sampling. Journal of the American Statistical Association, 88:1013–1020.
checkcalibration
, calibev
, gencalib
############ ## Example 1 ############ # matrix of sample calibration variables Xs=cbind( c(1,1,1,1,1,0,0,0,0,0), c(0,0,0,0,0,1,1,1,1,1), c(1,2,3,4,5,6,7,8,9,10) ) # inclusion probabilities piks=rep(0.2,times=10) # vector of population totals total=c(24,26,290) # the g-weights using the truncated method g=calib(Xs,d=1/piks,total,method="truncated",bounds=c(0.75,1.2)) # the calibration estimator of X is equal to 'total' vector t(g/piks)%*%Xs # the g-weights are between lower and upper bounds range(g) ############ ## Example 2 ############ # Example of g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of Poisson sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind( Men03/mean(Men03), Women03/mean(Women03), Diffmen, Diffwom, TaxableIncome/mean(TaxableIncome), Totaltaxation/mean(Totaltaxation), averageincome/mean(averageincome), medianincome/mean(medianincome)) # selection of a sample with expectation size equal to 200 # by means of Poisson sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPpoisson(pik) # sample Xs=X[s==1,] # sample matrix of calibration variables piks=pik[s==1] # sample inclusion probabilities n=length(piks) # expected sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) # computation of the g-weights # by means of different calibration methods g1=calib(Xs,d=1/piks,total,method="linear") g2=calib(Xs,d=1/piks,total,method="raking") g3=calib(Xs,d=1/piks,total,method="truncated",bounds=c(0.5,1.5)) g4=calib(Xs,d=1/piks,total,method="logit",bounds=c(0.5,1.5)) # in some cases, the calibration is not possible, # particularly when bounds are used. # if the calibration is possible, the calibration estimator of X is printed if(checkcalibration(Xs,d=1/piks,total,g1)$result) print(c((g1/piks) %*% Xs)) else print("error") if(!is.null(g2)) if(checkcalibration(Xs,d=1/piks,total,g2)$result) if(!is.null(g3)) if(checkcalibration(Xs,d=1/piks,total,g3)$result & all(g3<=1.5) & all(g3>=0.5)) print(c((g3/piks) %*% Xs)) else print("error") if(!is.null(g4)) if(checkcalibration(Xs,d=1/piks,total,g4)$result & all(g4<=1.5) & all(g4>=0.5)) print(c((g4/piks) %*% Xs)) else print("error") detach(belgianmunicipalities) ############ ## Example 3 ############ # Example of calibration and adjustment for nonresponse in the 'calibration' vignette # vignette("calibration", package="sampling")
############ ## Example 1 ############ # matrix of sample calibration variables Xs=cbind( c(1,1,1,1,1,0,0,0,0,0), c(0,0,0,0,0,1,1,1,1,1), c(1,2,3,4,5,6,7,8,9,10) ) # inclusion probabilities piks=rep(0.2,times=10) # vector of population totals total=c(24,26,290) # the g-weights using the truncated method g=calib(Xs,d=1/piks,total,method="truncated",bounds=c(0.75,1.2)) # the calibration estimator of X is equal to 'total' vector t(g/piks)%*%Xs # the g-weights are between lower and upper bounds range(g) ############ ## Example 2 ############ # Example of g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of Poisson sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind( Men03/mean(Men03), Women03/mean(Women03), Diffmen, Diffwom, TaxableIncome/mean(TaxableIncome), Totaltaxation/mean(Totaltaxation), averageincome/mean(averageincome), medianincome/mean(medianincome)) # selection of a sample with expectation size equal to 200 # by means of Poisson sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPpoisson(pik) # sample Xs=X[s==1,] # sample matrix of calibration variables piks=pik[s==1] # sample inclusion probabilities n=length(piks) # expected sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) # computation of the g-weights # by means of different calibration methods g1=calib(Xs,d=1/piks,total,method="linear") g2=calib(Xs,d=1/piks,total,method="raking") g3=calib(Xs,d=1/piks,total,method="truncated",bounds=c(0.5,1.5)) g4=calib(Xs,d=1/piks,total,method="logit",bounds=c(0.5,1.5)) # in some cases, the calibration is not possible, # particularly when bounds are used. # if the calibration is possible, the calibration estimator of X is printed if(checkcalibration(Xs,d=1/piks,total,g1)$result) print(c((g1/piks) %*% Xs)) else print("error") if(!is.null(g2)) if(checkcalibration(Xs,d=1/piks,total,g2)$result) if(!is.null(g3)) if(checkcalibration(Xs,d=1/piks,total,g3)$result & all(g3<=1.5) & all(g3>=0.5)) print(c((g3/piks) %*% Xs)) else print("error") if(!is.null(g4)) if(checkcalibration(Xs,d=1/piks,total,g4)$result & all(g4<=1.5) & all(g4>=0.5)) print(c((g4/piks) %*% Xs)) else print("error") detach(belgianmunicipalities) ############ ## Example 3 ############ # Example of calibration and adjustment for nonresponse in the 'calibration' vignette # vignette("calibration", package="sampling")
Computes the calibration estimator of the population total and its variance estimation using the residuals' method.
calibev(Ys,Xs,total,pikl,d,g,q=rep(1,length(d)),with=FALSE,EPS=1e-6)
calibev(Ys,Xs,total,pikl,d,g,q=rep(1,length(d)),with=FALSE,EPS=1e-6)
Ys |
vector of interest variable; its size is n, the sample size. |
Xs |
matrix of sample calibration variables. |
total |
vector of population totals for calibration. |
pikl |
matrix of joint inclusion probabilities of the sample units. |
d |
vector of initial weights of the sample units. |
g |
vector of g-weights; its size is n, the sample size. |
q |
vector of positive values accounting for heteroscedasticity; its size is n, the sample size. |
with |
if TRUE, the variance estimation takes into account the initial weights d; otherwise, the final weights w=g*d are taken into account; by default, its value is FALSE. |
EPS |
tolerance in checking the calibration; by default, its value is 1e-6. |
If with is TRUE, the following formula is used
else
where denotes the residual of unit k.
The function returns two values:
cest |
the calibration estimator, |
evar |
its estimated variance. |
Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87:376–382.
Deville, J.-C., Särndal, C.-E., and Sautory, O. (1993). Generalized raking procedure in survey sampling. Journal of the American Statistical Association, 88:1013–1020.
############ ## Example ############ # Example of g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of systematic sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind( Men03/mean(Men03), Women03/mean(Women03), Diffmen, Diffwom, TaxableIncome/mean(TaxableIncome), Totaltaxation/mean(Totaltaxation), averageincome/mean(averageincome), medianincome/mean(medianincome)) # selection of a sample of size 200 # using systematic sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPsystematic(pik) # draws a sample s using systematic sampling Xs=X[s==1,] # matrix of sample calibration variables piks=pik[s==1] # sample inclusion probabilities n=length(piks) # sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) g1=calib(Xs,d=1/piks,total,method="linear") # computes the g-weights pikl=UPsystematicpi2(pik) # computes the matrix of joint inclusion probabilities pikls=pikl[s==1,s==1] # the same matrix for the units in the sample Ys=Tot04[s==1] # the variable of interest is Tot04 (sample level) calibev(Ys,Xs,total,pikls,d=1/piks,g1,with=FALSE,EPS=1e-6) detach(belgianmunicipalities)
############ ## Example ############ # Example of g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of systematic sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind( Men03/mean(Men03), Women03/mean(Women03), Diffmen, Diffwom, TaxableIncome/mean(TaxableIncome), Totaltaxation/mean(Totaltaxation), averageincome/mean(averageincome), medianincome/mean(medianincome)) # selection of a sample of size 200 # using systematic sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPsystematic(pik) # draws a sample s using systematic sampling Xs=X[s==1,] # matrix of sample calibration variables piks=pik[s==1] # sample inclusion probabilities n=length(piks) # sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) g1=calib(Xs,d=1/piks,total,method="linear") # computes the g-weights pikl=UPsystematicpi2(pik) # computes the matrix of joint inclusion probabilities pikls=pikl[s==1,s==1] # the same matrix for the units in the sample Ys=Tot04[s==1] # the variable of interest is Tot04 (sample level) calibev(Ys,Xs,total,pikls,d=1/piks,g1,with=FALSE,EPS=1e-6) detach(belgianmunicipalities)
Checks the validity of the calibration. In some cases, the computed g-weights do not allow calibration and the calibration estimators do not exist.
checkcalibration(Xs, d, total, g, EPS=1e-6)
checkcalibration(Xs, d, total, g, EPS=1e-6)
Xs |
matrix of calibration variables. |
d |
vector of initial weights. |
total |
vector of population totals. |
g |
vector of g-weights. |
EPS |
control value used to check the calibration, by default equal to 1e-6. |
In the case where calibration is not possible, the 'value' indicates the difference in obtaining the calibration.
The function returns the following three objects:
message |
a message concerning the calibration, |
result |
TRUE if the calibration is possible and FALSE, otherwise. |
value |
value of max(abs(tr-total)/total, which is used as criterium to validate the
calibration, where tr=crossprod(Xs, g*d). If the |
# matrix of auxiliary variables Xs=cbind(c(1,1,1,1,1,0,0,0,0,0),c(0,0,0,0,0,1,1,1,1,1),c(1,2,3,4,5,6,7,8,9,10)) # inclusion probabilities pik=rep(0.2,times=10) # vector of totals total=c(24,26,280) # g-weights g=calib(Xs,d=1/pik,total,method="raking") # check if the calibration is possible checkcalibration(Xs,d=1/pik,total,g)
# matrix of auxiliary variables Xs=cbind(c(1,1,1,1,1,0,0,0,0,0),c(0,0,0,0,0,1,1,1,1,1),c(1,2,3,4,5,6,7,8,9,10)) # inclusion probabilities pik=rep(0.2,times=10) # vector of totals total=c(24,26,280) # g-weights g=calib(Xs,d=1/pik,total,method="raking") # check if the calibration is possible checkcalibration(Xs,d=1/pik,total,g)
Renumbers a variable of stratification (categorical variable). The strata receive a number from 1 to the last stratum number. The empty strata are suppressed. This function is used in ‘balancedstratification’.
cleanstrata(strata)
cleanstrata(strata)
strata |
vector of stratum numbers. |
# definition of the stratification variable strata=c(-2,3,-2,3,4,4,4,-2,-2,3,4,0,0,0) # renumber the strata cleanstrata(strata)
# definition of the stratification variable strata=c(-2,3,-2,3,4,4,4,-2,-2,3,4,0,0,0) # renumber the strata cleanstrata(strata)
Cluster sampling with equal/unequal probabilities.
cluster(data, clustername, size, method=c("srswor","srswr","poisson", "systematic"),pik,description=FALSE)
cluster(data, clustername, size, method=c("srswor","srswr","poisson", "systematic"),pik,description=FALSE)
data |
data frame or data matrix; its number of rows is N, the population size. |
clustername |
the name of the clustering variable. |
size |
sample size. |
method |
method to select clusters; the following methods are implemented: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson), systematic sampling (systematic); if the method is not specified, by default the method is "srswor". |
pik |
vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson, systematic). If an auxiliary information is provided, the function uses the inclusionprobabilities function for computing these probabilities. |
description |
a message is printed if its value is TRUE; the message gives the number of selected clusters, the number of units in the population and the number of selected units. By default, the value is FALSE. |
The function returns a data set with the following information: the selected clusters, the identifier of the units in the selected clusters, the final inclusion probabilities for these units (they are equal for the units included in the same cluster). If method is "srswr", the number of replicates is also given.
############ ## Example 1 ############ # Uses the swissmunicipalities data to draw a sample of clusters data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as clustering variable # the sample size is 3; the method is simple random sampling without replacement cl=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial database getdata(swissmunicipalities, cl) ############ ## Example 2 ############ # the same data as in Example 1 # the sample size is 3; the method is systematic sampling # the pik vector is randomly generated using the U(0,1) distribution cl_sys=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="systematic", pik=runif(7)) # extracts the observed data getdata(swissmunicipalities,cl_sys)
############ ## Example 1 ############ # Uses the swissmunicipalities data to draw a sample of clusters data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as clustering variable # the sample size is 3; the method is simple random sampling without replacement cl=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial database getdata(swissmunicipalities, cl) ############ ## Example 2 ############ # the same data as in Example 1 # the sample size is 3; the method is systematic sampling # the pik vector is randomly generated using the U(0,1) distribution cl_sys=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="systematic", pik=runif(7)) # extracts the observed data getdata(swissmunicipalities,cl_sys)
Transforms a categorical variable into a matrix of indicators. The values of the categorical variable are integer numbers (positive or negative).
disjunctive(strata)
disjunctive(strata)
strata |
vector of integer numbers. |
# definition of the variable of stratification strata=c(-2,3,-2,3,4,4,4,-2,-2,3,4,0,0,0) # computation of the matrix disjunctive(strata)
# definition of the variable of stratification strata=c(-2,3,-2,3,4,4,4,-2,-2,3,4,0,0,0) # computation of the matrix disjunctive(strata)
Executes the fast flight phase
of the cube method (algorithm of Chauvet and Tillé, 2005, 2006).
The data are sorted following the argument order
. Inclusion probabilities equal to
0 or 1 are tolerated.
fastflightcube(X,pik,order=1,comment=TRUE)
fastflightcube(X,pik,order=1,comment=TRUE)
X |
matrix of auxiliary variables on which the sample must be balanced. |
pik |
vector of inclusion probabilities. |
order |
1, the data are randomly arranged, |
comment |
a comment is written during the execution if |
Tillé, Y. (2006), Sampling Algorithms, Springer.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53–62.
Chauvet, G. and Tillé, Y. (2005). New SAS macros for balanced sampling. In INSEE, editor, Journées de Méthodologie Statistique, Paris.
Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: the cube method. Biometrika, 91:893–912.
Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128/2:411–425.
# Matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1,2,3,4,5,6,7,8,9)) # Vector of inclusion probabilities. # The sample size is 3. pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # pikstar is almost a balanced sample and is ready for the landing phase pikstar=fastflightcube(X,pik,order=1,comment=TRUE) pikstar
# Matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1,2,3,4,5,6,7,8,9)) # Vector of inclusion probabilities. # The sample size is 3. pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # pikstar is almost a balanced sample and is ready for the landing phase pikstar=fastflightcube(X,pik,order=1,comment=TRUE) pikstar
Computes the g-weights of the generalized calibration estimator. The g-weights should lie in the specified bounds for the truncated and logit methods.
gencalib(Xs,Zs,d,total,q=rep(1,length(d)),method=c("linear","raking","truncated","logit"), bounds=c(low=0,upp=10),description=FALSE,max_iter=500,C=1)
gencalib(Xs,Zs,d,total,q=rep(1,length(d)),method=c("linear","raking","truncated","logit"), bounds=c(low=0,upp=10),description=FALSE,max_iter=500,C=1)
Xs |
matrix of calibration variables. |
Zs |
matrix of instrumental variables with same dimension as Xs. |
d |
vector of initial weights. |
total |
vector of population totals. |
q |
vector of positive values accounting for heteroscedasticity; the variation of the g-weights is reduced for small values of q. |
method |
calibration method (linear, raking, logit, truncated). |
bounds |
vector of bounds for the g-weights used in the truncated and logit methods; 'low' is the smallest value and 'upp' is the largest value. |
description |
if description=TRUE, summary of initial and final weights are printed, and their boxplots and histograms are drawn; by default, its value is FALSE. |
max_iter |
maximum number of iterations in the Newton's method. |
C |
value of the centering constant, by default equals 1. |
The generalized calibration or the instrument vector method computes the g-weights
where
is a vector with values defined for
(or
where
is the set of respondents) and sharing the dimension of the specified auxiliary vector
. The vectors
and
have to be stronlgy correlated. The vector
is determined from the calibration equation
or
.
The function
plays the same role as in the calibration method (see
calib
). If Xs=Zs the calibration method is obtain. If the method is "logit"
the g-weights will be centered around the constant C, with low<C<upp. In the calibration method C=1 (see calib
).
The function returns the vector of g-weights.
Deville, J.-C. (1998). La correction de la nonréponse par calage ou par échantillonnage équilibré. Paper presented at the Congrès de l'ACFAS, Sherbrooke, Québec.
Deville, J.-C. (2000). Generalized calibration and application for weighting for non-response, COMPSTAT 2000: proceedings in computational statistics, p. 65–76.
Estevao, V.M., and Särndal, C.E. (2000). A functional form approach to calibration. Journal of Official Statistics, 16, 379–399.
Kott, P.S. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology, 32, 133–142.
############ ## Example 1 ############ # matrix of sample calibration variables Xs=cbind( c(1,1,1,1,1,0,0,0,0,0), c(0,0,0,0,0,1,1,1,1,1), c(1,2,3,4,5,6,7,8,9,10)) # inclusion probabilities piks=rep(0.2,times=10) # vector of population totals total=c(24,26,290) # matrix of instrumental variables Zs=Xs+matrix(runif(nrow(Xs)*ncol(Xs)),nrow(Xs),ncol(Xs)) # the g-weights using the truncated method g=gencalib(Xs,Zs,d=1/piks,total,method="truncated",bounds=c(0.5,1.5)) # the calibration estimator of X is equal to the 'total' vector t(g/piks)%*%Xs # the g-weights are between lower and upper bounds summary(g) ############ ## Example 2 ############ # Example of generalized g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of Poisson sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind(Totaltaxation/mean(Totaltaxation),medianincome/mean(medianincome)) # selection of a sample with expected size equal to 200 # by means of Poisson sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPpoisson(pik) # sample Xs=X[s==1,] # sample calibration variable matrix piks=pik[s==1] # sample inclusion probabilities n=length(piks) # expected sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) Z=cbind(TaxableIncome/mean(TaxableIncome),averageincome/mean(averageincome)) # defines the instrumental variables (sample level) Zs=Z[s==1,] # computation of the generalized g-weights # by means of different generalized calibration methods g1=gencalib(Xs,Zs,d=1/piks,total,method="linear") g2=gencalib(Xs,Zs,d=1/piks,total,method="raking") g3=gencalib(Xs,Zs,d=1/piks,total,method="truncated",bounds=c(0.5,8)) g4=gencalib(Xs,Zs,d=1/piks,total,method="logit",bounds=c(0.5,1.5)) # In some cases, the calibration is not possible # particularly when bounds are used. # if the calibration is possible, the calibration estimator of X total is printed if(checkcalibration(Xs,d=1/piks,total,g1)$result) print(c((g1/piks)%*% Xs)) else print("error") if(!is.null(g2)) if(checkcalibration(Xs,d=1/piks,total,g2)$result) print(c((g2/piks)%*% Xs)) else print("error") if(!is.null(g3)) if(checkcalibration(Xs,d=1/piks,total,g3)$result) print(c((g3/piks)%*% Xs)) else print("error") if(!is.null(g4)) if(checkcalibration(Xs,d=1/piks,total,g4)$result) print(c((g4/piks)%*% Xs)) else print("error") detach(belgianmunicipalities) ############ ## Example 3 ############ # Generalized calibration and adjustment for unit nonresponse in the 'calibration' vignette # vignette("calibration", package="sampling")
############ ## Example 1 ############ # matrix of sample calibration variables Xs=cbind( c(1,1,1,1,1,0,0,0,0,0), c(0,0,0,0,0,1,1,1,1,1), c(1,2,3,4,5,6,7,8,9,10)) # inclusion probabilities piks=rep(0.2,times=10) # vector of population totals total=c(24,26,290) # matrix of instrumental variables Zs=Xs+matrix(runif(nrow(Xs)*ncol(Xs)),nrow(Xs),ncol(Xs)) # the g-weights using the truncated method g=gencalib(Xs,Zs,d=1/piks,total,method="truncated",bounds=c(0.5,1.5)) # the calibration estimator of X is equal to the 'total' vector t(g/piks)%*%Xs # the g-weights are between lower and upper bounds summary(g) ############ ## Example 2 ############ # Example of generalized g-weights (linear, raking, truncated, logit), # with the data of Belgian municipalities as population. # Firstly, a sample is selected by means of Poisson sampling. # Secondly, the g-weights are calculated. data(belgianmunicipalities) attach(belgianmunicipalities) # matrix of calibration variables for the population X=cbind(Totaltaxation/mean(Totaltaxation),medianincome/mean(medianincome)) # selection of a sample with expected size equal to 200 # by means of Poisson sampling # the inclusion probabilities are proportional to the average income pik=inclusionprobabilities(averageincome,200) N=length(pik) # population size s=UPpoisson(pik) # sample Xs=X[s==1,] # sample calibration variable matrix piks=pik[s==1] # sample inclusion probabilities n=length(piks) # expected sample size # vector of population totals of the calibration variables total=c(t(rep(1,times=N))%*%X) Z=cbind(TaxableIncome/mean(TaxableIncome),averageincome/mean(averageincome)) # defines the instrumental variables (sample level) Zs=Z[s==1,] # computation of the generalized g-weights # by means of different generalized calibration methods g1=gencalib(Xs,Zs,d=1/piks,total,method="linear") g2=gencalib(Xs,Zs,d=1/piks,total,method="raking") g3=gencalib(Xs,Zs,d=1/piks,total,method="truncated",bounds=c(0.5,8)) g4=gencalib(Xs,Zs,d=1/piks,total,method="logit",bounds=c(0.5,1.5)) # In some cases, the calibration is not possible # particularly when bounds are used. # if the calibration is possible, the calibration estimator of X total is printed if(checkcalibration(Xs,d=1/piks,total,g1)$result) print(c((g1/piks)%*% Xs)) else print("error") if(!is.null(g2)) if(checkcalibration(Xs,d=1/piks,total,g2)$result) print(c((g2/piks)%*% Xs)) else print("error") if(!is.null(g3)) if(checkcalibration(Xs,d=1/piks,total,g3)$result) print(c((g3/piks)%*% Xs)) else print("error") if(!is.null(g4)) if(checkcalibration(Xs,d=1/piks,total,g4)$result) print(c((g4/piks)%*% Xs)) else print("error") detach(belgianmunicipalities) ############ ## Example 3 ############ # Generalized calibration and adjustment for unit nonresponse in the 'calibration' vignette # vignette("calibration", package="sampling")
Extracts the observed data from a data frame (a population). The function is used after a sample has been drawn from this population.
getdata(data, m)
getdata(data, m)
data |
population data frame or data matrix; its number of rows is N, the population size. |
m |
vector of selected units or sample data frame. |
srswor
, UPsystematic
, strata
,
cluster
, mstage
############ ## Example 1 ############ # Generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # The variable 'state' has 2 categories (nc and sc); # the variable 'region' has 3 categories (1, 2 and 3); # the variable 'income' is generated using the U(0,1) distribution. data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE), matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # the inclusion probabilities are computed using the variable 'income' pik=inclusionprobabilities(data$income,20) # draws a sample using systematic sampling (sample size is 20) s=UPsystematic(pik) # extracts the observed data getdata(data,s) ############ ## Example 2 ############ # see other examples in 'strata', 'cluster', 'mstage' help files
############ ## Example 1 ############ # Generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # The variable 'state' has 2 categories (nc and sc); # the variable 'region' has 3 categories (1, 2 and 3); # the variable 'income' is generated using the U(0,1) distribution. data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE), matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # the inclusion probabilities are computed using the variable 'income' pik=inclusionprobabilities(data$income,20) # draws a sample using systematic sampling (sample size is 20) s=UPsystematic(pik) # extracts the observed data getdata(data,s) ############ ## Example 2 ############ # see other examples in 'strata', 'cluster', 'mstage' help files
Computes the Hájek estimator of the population total or population mean.
Hajekestimator(y,pik,N=NULL,type=c("total","mean"))
Hajekestimator(y,pik,N=NULL,type=c("total","mean"))
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pik |
vector of the first-order inclusion probabilities; its length is equal to n, the sample size. |
N |
population size; N is only used for the total estimator; for the mean estimator its value is NULL. |
type |
the estimator type: total or mean. |
# Belgian municipalities data data(belgianmunicipalities) # Computes the inclusion probabilities pik=inclusionprobabilities(belgianmunicipalities$Tot04,200) N=length(pik) n=sum(pik) # Defines the variable of interest y=belgianmunicipalities$TaxableIncome # Draws a Poisson sample of expected size 200 s=UPpoisson(pik) # Computes the Hajek estimator of the population mean Hajekestimator(y[s==1],pik[s==1],type="mean") # Computes the Hajek estimator of the population total Hajekestimator(y[s==1],pik[s==1],N=N,type="total")
# Belgian municipalities data data(belgianmunicipalities) # Computes the inclusion probabilities pik=inclusionprobabilities(belgianmunicipalities$Tot04,200) N=length(pik) n=sum(pik) # Defines the variable of interest y=belgianmunicipalities$TaxableIncome # Draws a Poisson sample of expected size 200 s=UPpoisson(pik) # Computes the Hajek estimator of the population mean Hajekestimator(y[s==1],pik[s==1],type="mean") # Computes the Hajek estimator of the population total Hajekestimator(y[s==1],pik[s==1],N=N,type="total")
Computes the Hájek estimator of the population total or population mean for a stratified design.
Hajekstrata(y,pik,strata,N=NULL,type=c("total","mean"),description=FALSE)
Hajekstrata(y,pik,strata,N=NULL,type=c("total","mean"),description=FALSE)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pik |
vector of the first-order inclusion probabilities for the sampled units; its length is equal to n, the sample size. |
strata |
vector of size n, with elements indicating the unit stratum. |
N |
vector of population sizes of strata; N is only used for the total estimator; for the mean estimator its value is NULL. |
type |
the estimator type: total or mean. |
description |
if TRUE, the estimator is printed for each stratum; by default, FALSE. |
# Swiss municipalities data data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement # (equal probability, without replacement) st=strata(swissmunicipalities,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the swsissmunicipalities data x=getdata(swissmunicipalities, st) # computes the population sizes of strata N=table(swissmunicipalities$REG) N=N[unique(x$REG)] #the strata 1 2 3 4 5 6 7 #corresponds to REG 4 1 3 2 5 6 7 # computes the Hajek estimator of the total of Pop020 Hajekstrata(x$Pop020,x$Prob,x$Stratum,N,type="total",description=TRUE)
# Swiss municipalities data data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement # (equal probability, without replacement) st=strata(swissmunicipalities,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the swsissmunicipalities data x=getdata(swissmunicipalities, st) # computes the population sizes of strata N=table(swissmunicipalities$REG) N=N[unique(x$REG)] #the strata 1 2 3 4 5 6 7 #corresponds to REG 4 1 3 2 5 6 7 # computes the Hajek estimator of the total of Pop020 Hajekstrata(x$Pop020,x$Prob,x$Stratum,N,type="total",description=TRUE)
Computes the Horvitz-Thompson estimator of the population total.
HTestimator(y,pik)
HTestimator(y,pik)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pik |
vector of the first-order inclusion probabilities; its length is equal to n, the sample size. |
data(belgianmunicipalities) attach(belgianmunicipalities) # inclusion probabilities pik=inclusionprobabilities(Tot04,200) N=length(pik) n=sum(pik) # draws a Poisson sample of expected size 200 s=UPpoisson(pik) # Horvitz-Thompson estimator of the total of TaxableIncome HTestimator(TaxableIncome[s==1],pik[s==1]) detach(belgianmunicipalities)
data(belgianmunicipalities) attach(belgianmunicipalities) # inclusion probabilities pik=inclusionprobabilities(Tot04,200) N=length(pik) n=sum(pik) # draws a Poisson sample of expected size 200 s=UPpoisson(pik) # Horvitz-Thompson estimator of the total of TaxableIncome HTestimator(TaxableIncome[s==1],pik[s==1]) detach(belgianmunicipalities)
Computes the Horvitz-Thompson estimator of the population total for a stratified design.
HTstrata(y,pik,strata,description=FALSE)
HTstrata(y,pik,strata,description=FALSE)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pik |
vector of the first-order inclusion probabilities for the sampled units; its length is equal to n, the sample size. |
strata |
vector of size n, with elements indicating the unit stratum. |
description |
if TRUE, the estimator is printed for each stratum; by default, FALSE. |
# Swiss municipalities data base data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement # (equal probability, fixed sample size, without replacement) st=strata(swissmunicipalities,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data x=getdata(swissmunicipalities, st) # computes the HT estimator of the total of Pop020 HTstrata(x$Pop020,x$Prob,x$Stratum,description=TRUE)
# Swiss municipalities data base data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement # (equal probability, fixed sample size, without replacement) st=strata(swissmunicipalities,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data x=getdata(swissmunicipalities, st) # computes the HT estimator of the total of Pop020 HTstrata(x$Pop020,x$Prob,x$Stratum,description=TRUE)
Computes the first-order inclusion probabilities from a vector of positive numbers (for a probability proportional-to-size sampling design). Their sum is equal to n, the sample size.
inclusionprobabilities(a,n)
inclusionprobabilities(a,n)
a |
vector of positive numbers. |
n |
sample size. |
############ ## Example 1 ############ # a vector of positive numbers a=1:20 # inclusion probabilities for a sample size n=12 inclusionprobabilities(a,12) ############ ## Example 2 ############ # Computation of the inclusion probabilities proportional to the number # of inhabitants in each municipality of the Belgian municipalities data. data(belgianmunicipalities) pik=inclusionprobabilities(belgianmunicipalities$Tot04,200) # the first-order inclusion probabilities for each municipality data.frame(pik=pik,name=belgianmunicipalities$Commune) # the sum is equal to the sample size sum(pik)
############ ## Example 1 ############ # a vector of positive numbers a=1:20 # inclusion probabilities for a sample size n=12 inclusionprobabilities(a,12) ############ ## Example 2 ############ # Computation of the inclusion probabilities proportional to the number # of inhabitants in each municipality of the Belgian municipalities data. data(belgianmunicipalities) pik=inclusionprobabilities(belgianmunicipalities$Tot04,200) # the first-order inclusion probabilities for each municipality data.frame(pik=pik,name=belgianmunicipalities$Commune) # the sum is equal to the sample size sum(pik)
Computes the inclusion probabilities for a stratified design. The inclusion probabilities are equal in each stratum.
inclusionprobastrata(strata,nh)
inclusionprobastrata(strata,nh)
strata |
vector that defines the strata. |
nh |
vector of the number of selected units in each stratum. |
# the strata strata=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3) # sample size in each stratum nh=c(2,3,3) # inclusion probabilities for each stratum pik=inclusionprobastrata(strata,nh) #check for each stratum cbind(strata, pik)
# the strata strata=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3) # sample size in each stratum nh=c(2,3,3) # inclusion probabilities for each stratum pik=inclusionprobastrata(strata,nh) #check for each stratum cbind(strata, pik)
Landing phase of the cube method using linear programming.
landingcube(X,pikstar,pik,comment=TRUE)
landingcube(X,pikstar,pik,comment=TRUE)
X |
matrix of auxiliary variables on which the sample must be balanced. |
pikstar |
vector obtained at the end of the flight phase. |
pik |
vector of inclusion probabilities. |
comment |
a comment is written during the execution if |
Tillé, Y. (2006), Sampling Algorithms, Springer.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53–62.
Chauvet, G. and Tillé, Y. (2005). New SAS macros for balanced sampling. In INSEE, editor, Journées de Méthodologie Statistique, Paris.
Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: the cube method. Biometrika, 91:893–912.
Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128/2:411–425.
# matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1.1,2.2,3.1,4.2,5.1,6.3,7.1,8.1,9.1)) # the sample size is 3 # vector of inclusion probabilities pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # pikstar is almost a balanced sample and is ready for the landing phase pikstar=fastflightcube(X,pik,order=1,comment=TRUE) # selection of the sample s=landingcube(X,pikstar,pik,comment=TRUE) round(s)
# matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1.1,2.2,3.1,4.2,5.1,6.3,7.1,8.1,9.1)) # the sample size is 3 # vector of inclusion probabilities pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # pikstar is almost a balanced sample and is ready for the landing phase pikstar=fastflightcube(X,pik,order=1,comment=TRUE) # selection of the sample s=landingcube(X,pikstar,pik,comment=TRUE) round(s)
Implements multistage sampling with equal/unequal probabilities.
mstage(data, stage=c("stratified","cluster",""), varnames, size, method=c("srswor","srswr","poisson","systematic"), pik, description=FALSE)
mstage(data, stage=c("stratified","cluster",""), varnames, size, method=c("srswor","srswr","poisson","systematic"), pik, description=FALSE)
data |
data frame or data matrix; its number of rows is N, the population size. |
stage |
list of sampling types at each stage; the possible values are: "stratified", "cluster" and "" (without stratification or clustering). For multistage element sampling, this argument is not necessary. |
varnames |
list of stratification or clustering variables. |
size |
list of sample sizes (in the order in which the samples appear in the multistage sampling). |
method |
list of methods to select units at each stage; the following methods are implemented: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson), systematic sampling (systematic); if the method is not specified, by default the method is "srswor". The method can be different at each stage. |
pik |
list of selection probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson, systematic). If an auxiliary information is provided, the function uses the inclusionprobabilities function for computing these probabilities. |
description |
a message is printed if its value is TRUE; the message gives the number of selected units and the number of the units in the population. By default, its value is FALSE. |
The data should be sorted in ascending order by the columns given in the varnames argument before applying the function. Use, for example, data[order(data$state,data$region),].
The function returns a list, which contains the stages (if m is this list, the stage i is m$'i' etc) and the following information:
ID_unit |
the identifier of selected units at each stage. |
Prob_ number _stage |
the inclusion probability at stage 'number'. |
Prob |
the final unit inclusion probability given in the last stage; it is the product of unit inclusion probabilities at each stage. |
############ ## Example 1 ############ # Two-stage cluster sampling # Uses the 'swissmunicipalities' data data(swissmunicipalities) b=swissmunicipalities b=b[order(b$REG,b$CT),] attach(b) # the variable 'REG' (region) has 7 categories; # it is used as clustering variable in the first-stage sample # the variable 'CT' (canton) has 26 categories; # it is used as clustering variable in the second-stage sample # 4 clusters (regions) are selected in the first-stage # 1 canton is selected in the second-stage from each sampled region # the method is simple random sampling without replacement in each stage # (equal probability, without replacement) m=mstage(b,stage=list("cluster","cluster"), varnames=list("REG","CT"), size=list(4,c(1,1,1,1)), method=list("srswor","srswor")) # the first stage is m[[1]], the second stage is m[[2]] #the selected regions unique(m[[1]]$REG) #the selected cantons unique(m[[2]]$CT) # extracts the observed data x=getdata(b,m)[[2]] # check the output table(x$REG,x$CT) ############ ## Example 2 ############ # Two-stage element sampling # Generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # The variable "state" has 2 categories ('n','s'). # The variable "region" has 5 categories ('A', 'B', 'C', 'D', 'E'). # The variable "income" is generated using the U(0,1) distribution. data=rbind(matrix(rep('n',165),165,1,byrow=TRUE),matrix(rep('s',70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep('A',115),rep('D',10),rep('E',40),rep('B',30),rep('C',40)), 100*runif(235)) names(data)=c("state","region","income") data=data[order(data$state,data$region),] table(data$state,data$region) # the method is simple random sampling without replacement # 25 units are drawn in the first-stage # in the second-stage, 10 units are drawn from the already 25 selected units m=mstage(data,size=list(25,10),method=list("srswor","srswor")) # the first stage is m[[1]], the second stage is m[[2]] # extracts the observed data xx=getdata(data,m)[[2]] # check the result table(xx$state,xx$region) ############ ## Example 3 ############ # Stratified one-stage cluster sampling # The same data as in Example 2 # the variable 'state' is used as stratification variable # 165 units are in the first stratum and 70 in the second one # the variable 'region' is used as clustering variable # 1 cluster (region) is drawn in each state using "srswor" m=mstage(data, stage=list("stratified","cluster"), varnames=list("state","region"), size=list(c(165,70),c(1,1)),method=list("","srswor")) # check the first stage table(m[[1]]$state) # check the second stage table(m[[2]]$region) # extracts the observed data xx=getdata(data,m)[[2]] # check the result table(xx$state,xx$region) ############ ## Example 4 ############ # Two-stage cluster sampling # The same data as in Example 1 # in the first-stage, the clustering variable is 'REG' (region) with 7 categories # 4 clusters (regions) are drawn in the first-stage # each region is selected with the probability 4/7 # in the second-stage, the clustering variable is 'CT'(canton) with 26 categories # 1 cluster (canton) is drawn in the second-stage from each selected region # in region 1, there are 3 cantons; one canton is selected with prob. 0.2, 0.4, 0.4, resp. # in region 2, there are 5 cantons; each canton is selected with the prob. 1/5 # in region 3, there are 3 cantons; each canton is selected with the prob. 1/3 # in region 4, there is 1 canton, which it is selected with the prob. 1 # in region 5, there are 7 cantons; each canton is selected with the prob. 1/7 # in region 6, there are 6 cantons; each canton is selected with the prob. 1/6 # in region 7, there is 1 canton, which it is selected with the prob. 1 # it is necessary to use a list of selection probabilities at each stage # prob is the list of the selection probabilities # the method is systematic sampling (unequal probabilities, without replacement) # ls is the list of sizes ls=list(4,c(1,1,1,1)) prob=list(rep(4/7,7),list(c(0.2,0.4,0.4),rep(1/5,5),rep(1/3,3),rep(1,1),rep(1/7,7), rep(1/6,6),rep(1,1))) m=mstage(b,stage=list("cluster","cluster"),varnames=list("REG","CT"), size=ls, method=c("systematic","systematic"),pik=prob) #the selected regions unique(m[[1]]$REG) #the selected cantons unique(m[[2]]$CT) # extracts the observed data xx=getdata(b,m)[[2]] # check the result table(xx$REG,xx$CT) ############ ## Example 5 ############ # Stratified two-stage cluster sampling # The same data as in Example 1 # the variable 'REG' is used as stratification variable # there are 7 strata # the variable 'CT' is used as first clustering variable # first stage, clusters (cantons) are drawn from each region using "srswor" # 3 clusters are drawn from the regions 1,2,3,5, and 6, respectively # 1 cluster is drawn from the regions 4 and 7, respectively # the variable 'COM' is used as second clustering variable # second stage, 2 clusters (municipalities) are drawn from each selected canton using "srswor" m=mstage(b,stage=list("stratified","cluster","cluster"), varnames=list("REG","CT","COM"), size=list(size1=table(b$REG),size2=c(rep(3,3),1,3,3,1), size3=rep(2,17)), method=list("","srswor","srswor")) # extracts the observed data getdata(b,m)[[3]]
############ ## Example 1 ############ # Two-stage cluster sampling # Uses the 'swissmunicipalities' data data(swissmunicipalities) b=swissmunicipalities b=b[order(b$REG,b$CT),] attach(b) # the variable 'REG' (region) has 7 categories; # it is used as clustering variable in the first-stage sample # the variable 'CT' (canton) has 26 categories; # it is used as clustering variable in the second-stage sample # 4 clusters (regions) are selected in the first-stage # 1 canton is selected in the second-stage from each sampled region # the method is simple random sampling without replacement in each stage # (equal probability, without replacement) m=mstage(b,stage=list("cluster","cluster"), varnames=list("REG","CT"), size=list(4,c(1,1,1,1)), method=list("srswor","srswor")) # the first stage is m[[1]], the second stage is m[[2]] #the selected regions unique(m[[1]]$REG) #the selected cantons unique(m[[2]]$CT) # extracts the observed data x=getdata(b,m)[[2]] # check the output table(x$REG,x$CT) ############ ## Example 2 ############ # Two-stage element sampling # Generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # The variable "state" has 2 categories ('n','s'). # The variable "region" has 5 categories ('A', 'B', 'C', 'D', 'E'). # The variable "income" is generated using the U(0,1) distribution. data=rbind(matrix(rep('n',165),165,1,byrow=TRUE),matrix(rep('s',70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep('A',115),rep('D',10),rep('E',40),rep('B',30),rep('C',40)), 100*runif(235)) names(data)=c("state","region","income") data=data[order(data$state,data$region),] table(data$state,data$region) # the method is simple random sampling without replacement # 25 units are drawn in the first-stage # in the second-stage, 10 units are drawn from the already 25 selected units m=mstage(data,size=list(25,10),method=list("srswor","srswor")) # the first stage is m[[1]], the second stage is m[[2]] # extracts the observed data xx=getdata(data,m)[[2]] # check the result table(xx$state,xx$region) ############ ## Example 3 ############ # Stratified one-stage cluster sampling # The same data as in Example 2 # the variable 'state' is used as stratification variable # 165 units are in the first stratum and 70 in the second one # the variable 'region' is used as clustering variable # 1 cluster (region) is drawn in each state using "srswor" m=mstage(data, stage=list("stratified","cluster"), varnames=list("state","region"), size=list(c(165,70),c(1,1)),method=list("","srswor")) # check the first stage table(m[[1]]$state) # check the second stage table(m[[2]]$region) # extracts the observed data xx=getdata(data,m)[[2]] # check the result table(xx$state,xx$region) ############ ## Example 4 ############ # Two-stage cluster sampling # The same data as in Example 1 # in the first-stage, the clustering variable is 'REG' (region) with 7 categories # 4 clusters (regions) are drawn in the first-stage # each region is selected with the probability 4/7 # in the second-stage, the clustering variable is 'CT'(canton) with 26 categories # 1 cluster (canton) is drawn in the second-stage from each selected region # in region 1, there are 3 cantons; one canton is selected with prob. 0.2, 0.4, 0.4, resp. # in region 2, there are 5 cantons; each canton is selected with the prob. 1/5 # in region 3, there are 3 cantons; each canton is selected with the prob. 1/3 # in region 4, there is 1 canton, which it is selected with the prob. 1 # in region 5, there are 7 cantons; each canton is selected with the prob. 1/7 # in region 6, there are 6 cantons; each canton is selected with the prob. 1/6 # in region 7, there is 1 canton, which it is selected with the prob. 1 # it is necessary to use a list of selection probabilities at each stage # prob is the list of the selection probabilities # the method is systematic sampling (unequal probabilities, without replacement) # ls is the list of sizes ls=list(4,c(1,1,1,1)) prob=list(rep(4/7,7),list(c(0.2,0.4,0.4),rep(1/5,5),rep(1/3,3),rep(1,1),rep(1/7,7), rep(1/6,6),rep(1,1))) m=mstage(b,stage=list("cluster","cluster"),varnames=list("REG","CT"), size=ls, method=c("systematic","systematic"),pik=prob) #the selected regions unique(m[[1]]$REG) #the selected cantons unique(m[[2]]$CT) # extracts the observed data xx=getdata(b,m)[[2]] # check the result table(xx$REG,xx$CT) ############ ## Example 5 ############ # Stratified two-stage cluster sampling # The same data as in Example 1 # the variable 'REG' is used as stratification variable # there are 7 strata # the variable 'CT' is used as first clustering variable # first stage, clusters (cantons) are drawn from each region using "srswor" # 3 clusters are drawn from the regions 1,2,3,5, and 6, respectively # 1 cluster is drawn from the regions 4 and 7, respectively # the variable 'COM' is used as second clustering variable # second stage, 2 clusters (municipalities) are drawn from each selected canton using "srswor" m=mstage(b,stage=list("stratified","cluster","cluster"), varnames=list("REG","CT","COM"), size=list(size1=table(b$REG),size2=c(rep(3,3),1,3,3,1), size3=rep(2,17)), method=list("","srswor","srswor")) # extracts the observed data getdata(b,m)[[3]]
This data is from Särndal et al (1992), see Appendix B, p. 652.
data(MU284)
data(MU284)
A data frame with 284 observations on the following 11 variables.
identifier number from 1 to 284.
1985 population (in thousands).
1975 population (in thousands).
revenues from 1985 municipal taxation (in millions of kronor).
number of Conservative seats in municipal council.
number of Social-Democratic seats in municipal council.
total number of seats in municipal council.
number of municipal employees in 1984.
real estate values according to 1984 assessment (in millions of kronor).
geographic region indicator.
cluster indicator (a cluster consists of a set of neighboring).
http://lib.stat.cmu.edu/datasets/mu284
Särndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, Springer Verlag, New York.
data(MU284) hist(MU284$RMT85)
data(MU284) hist(MU284$RMT85)
Computes the poststratified estimator of the population total.
postest(data, y, pik, NG, description=FALSE)
postest(data, y, pik, NG, description=FALSE)
data |
data frame or data matrix; its number of rows is n, the sample size. |
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pik |
vector of the first-order inclusion probabilities for the sampled units; its length is equal to n, the sample size. |
NG |
vector of population frequency in each group G; for stratified sampling with poststratification, NG is a matrix of population frequency in each cell GH. |
description |
if TRUE, the estimator is printed for each poststratum; by default, FALSE. |
############ ## Example 1 ############ #stratified sampling and poststratification data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # Computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement st=strata(swissmunicipalities,stratanames=c("REG"), size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data x=getdata(swissmunicipalities, st) px=poststrata(x,"REG") #computes the population frequency in each group ct=unique(px$data$REG) yy=table(swissmunicipalities$REG)[ct] postest(px$data,y=px$data$Pop020,pik=px$data$Prob,NG=diag(yy)) HTstrata(x$Pop020,x$Prob,x$Stratum) #the two estimators are equal ############ ## Example 2 ############ # systematic sampling and poststratification data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) # extracts the observed data b=getdata(belgianmunicipalities,s) pb=poststrata(b,"Province") #computes the population frequency in each group ct=unique(pb$data$Province) yy=table(belgianmunicipalities$Province)[ct] postest(pb$data,y=pb$data$TaxableIncome,pik=pik[s==1],NG=yy,description=TRUE) HTestimator(pb$data$TaxableIncome,pik=pik[s==1]) ############ ## Example 3 ############ #cluster sampling and postratification data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as clustering variable # the sample size is 3; the method is simple random sampling without replacement cl=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data c=getdata(swissmunicipalities, cl) pc=poststrata(c,"CT") #computes the population frequency in each group ct=unique(pc$data$CT) yy=table(swissmunicipalities$CT)[ct] postest(pc$data,y=pc$data$Pop020,pik=pc$data$Prob,NG=yy,description=TRUE) ############ ## Example 4 ############ #postratification with two criteria #artificial data data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 #selects a sample of size 10 s=srswor(10,nrow(data)) # postratification using region and state ps=poststrata(data[s==1,],c("region","state")) #computes the population frequency in each group ct=unique(ps$data$poststratum) yy=numeric(length(ct)) for(i in 1:length(ct)) { xy=ps$data[ps$data$poststratum==ct[i],] xstate=unique(xy$state) ystate=unique(xy$region) xx=data[data$state==xstate & data$region==ystate,] yy[i]=nrow(xx) } postest(ps$data,y=ps$data$income,pik=rep(10/nrow(data),10),NG=yy,description=TRUE)
############ ## Example 1 ############ #stratified sampling and poststratification data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # Computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # the method is simple random sampling without replacement st=strata(swissmunicipalities,stratanames=c("REG"), size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data x=getdata(swissmunicipalities, st) px=poststrata(x,"REG") #computes the population frequency in each group ct=unique(px$data$REG) yy=table(swissmunicipalities$REG)[ct] postest(px$data,y=px$data$Pop020,pik=px$data$Prob,NG=diag(yy)) HTstrata(x$Pop020,x$Prob,x$Stratum) #the two estimators are equal ############ ## Example 2 ############ # systematic sampling and poststratification data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) # extracts the observed data b=getdata(belgianmunicipalities,s) pb=poststrata(b,"Province") #computes the population frequency in each group ct=unique(pb$data$Province) yy=table(belgianmunicipalities$Province)[ct] postest(pb$data,y=pb$data$TaxableIncome,pik=pik[s==1],NG=yy,description=TRUE) HTestimator(pb$data$TaxableIncome,pik=pik[s==1]) ############ ## Example 3 ############ #cluster sampling and postratification data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as clustering variable # the sample size is 3; the method is simple random sampling without replacement cl=cluster(swissmunicipalities,clustername=c("REG"),size=3,method="srswor") # extracts the observed data # the order of the columns is different from the order in the initial data c=getdata(swissmunicipalities, cl) pc=poststrata(c,"CT") #computes the population frequency in each group ct=unique(pc$data$CT) yy=table(swissmunicipalities$CT)[ct] postest(pc$data,y=pc$data$Pop020,pik=pc$data$Prob,NG=yy,description=TRUE) ############ ## Example 4 ############ #postratification with two criteria #artificial data data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 #selects a sample of size 10 s=srswor(10,nrow(data)) # postratification using region and state ps=poststrata(data[s==1,],c("region","state")) #computes the population frequency in each group ct=unique(ps$data$poststratum) yy=numeric(length(ct)) for(i in 1:length(ct)) { xy=ps$data[ps$data$poststratum==ct[i],] xstate=unique(xy$state) ystate=unique(xy$region) xx=data[data$state==xstate & data$region==ystate,] yy[i]=nrow(xx) } postest(ps$data,y=ps$data$income,pik=rep(10/nrow(data),10),NG=yy,description=TRUE)
Poststratification using several criteria.
poststrata(data, postnames = NULL)
poststrata(data, postnames = NULL)
data |
data frame or data matrix; its number of rows is n, the sample size. |
postnames |
vector of poststratification variables. |
The function |
produces an object, which contains the following information: |
data |
the final data frame with a new column ('poststratum') containg the unit poststratum. |
npost |
the number of poststrata. |
# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # postratification using two criteria: state and region poststrata(data,postnames=c("state","region"))
# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # postratification using two criteria: state and region poststrata(data,postnames=c("state","region"))
Computes the ratio estimator of the population total.
ratioest(y,x,Tx,pik)
ratioest(y,x,Tx,pik)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
x |
vector of auxiliary information; its length is equal to n, the sample size. |
Tx |
population total of x. |
pik |
vector of the first-order inclusion probabilities; its length is equal to n, the sample size. |
The function returns the value of the ratio estimator.
# population data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85; sample size 120 pik=inclusionprobabilities(P85,120) # defines the variable of interest y=RMT85 # defines the auxiliary information x=CS82 # draws a systematic sample of size 120 s=UPsystematic(pik) # computes the ratio estimator of the total of RMT85 ratioest(y[s==1],x[s==1],sum(x),pik[s==1]) detach(MU281)
# population data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85; sample size 120 pik=inclusionprobabilities(P85,120) # defines the variable of interest y=RMT85 # defines the auxiliary information x=CS82 # draws a systematic sample of size 120 s=UPsystematic(pik) # computes the ratio estimator of the total of RMT85 ratioest(y[s==1],x[s==1],sum(x),pik[s==1]) detach(MU281)
Computes the ratio estimator of the population total for a stratified design. The ratio estimator of a total is the sum of ratio estimator in each stratum.
ratioest_strata(y,x,TX_strata,pik,strata,description=FALSE)
ratioest_strata(y,x,TX_strata,pik,strata,description=FALSE)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
x |
vector of auxiliary information; its length is equal to n, the sample size. |
TX_strata |
vector of population x-total in each stratum; its length is equal to the number of strata. |
pik |
vector of the first-order inclusion probabilities; its length is equal to n, the sample size. |
strata |
vector of size n, with elements indicating the unit stratum. |
description |
if TRUE, the ratio estimator in each stratum is printed; by default, it is FALSE. |
The function returns the value of the ratio estimator.
########### # Example 1 ########### # uses MU284 data as population with the 'REG' variable for stratification data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85 # sample size 120 pik=inclusionprobabilities(P85,120) # defines the variable of interest y=RMT85 # defines the auxiliary information x=CS82 # computes the population stratum sizes table(REG) # not run # 1 2 3 4 5 6 7 8 # 24 48 32 37 55 41 15 29 # a sample is drawn in each region # the sample stratum sizes are given by size=c(4,10,8,4,6,4,6,7) s=strata(MU281,c("REG"),size=c(4,10,8,4,6,4,6,7), method="systematic",pik=P85) # extracts the observed data MU281sample=getdata(MU281,s) # computes the population x-totals in each stratum TX_strata=as.vector(tapply(CS82,list(REG),FUN=sum)) # computes the ratio estimator ratioest_strata(MU281sample$RMT85,MU281sample$CS82,TX_strata, MU281sample$Prob,MU281sample$Stratum) detach(MU281) ########### # Example 2 ########### # this is an artificial example (see Example 1 in the 'strata' function) # there are 4 columns: state, region, income and aux # 'income' is the variable of interest, and 'aux' is the auxiliary information # which is correlated to the income data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") attach(data) aux=income+rnorm(length(income),0,1) data=cbind.data.frame(data,aux) # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values; one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data xx=getdata(data,s) # computes the population x-total for each stratum TX_strata=na.omit(as.vector(tapply(aux,list(region,state),FUN=sum))) # computes the ratio estimator ratioest_strata(xx$income,xx$aux,TX_strata,xx$Prob,xx$Stratum,description=TRUE)
########### # Example 1 ########### # uses MU284 data as population with the 'REG' variable for stratification data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85 # sample size 120 pik=inclusionprobabilities(P85,120) # defines the variable of interest y=RMT85 # defines the auxiliary information x=CS82 # computes the population stratum sizes table(REG) # not run # 1 2 3 4 5 6 7 8 # 24 48 32 37 55 41 15 29 # a sample is drawn in each region # the sample stratum sizes are given by size=c(4,10,8,4,6,4,6,7) s=strata(MU281,c("REG"),size=c(4,10,8,4,6,4,6,7), method="systematic",pik=P85) # extracts the observed data MU281sample=getdata(MU281,s) # computes the population x-totals in each stratum TX_strata=as.vector(tapply(CS82,list(REG),FUN=sum)) # computes the ratio estimator ratioest_strata(MU281sample$RMT85,MU281sample$CS82,TX_strata, MU281sample$Prob,MU281sample$Stratum) detach(MU281) ########### # Example 2 ########### # this is an artificial example (see Example 1 in the 'strata' function) # there are 4 columns: state, region, income and aux # 'income' is the variable of interest, and 'aux' is the auxiliary information # which is correlated to the income data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") attach(data) aux=income+rnorm(length(income),0,1) data=cbind.data.frame(data,aux) # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values; one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data xx=getdata(data,s) # computes the population x-total for each stratum TX_strata=na.omit(as.vector(tapply(aux,list(region,state),FUN=sum))) # computes the ratio estimator ratioest_strata(xx$income,xx$aux,TX_strata,xx$Prob,xx$Stratum,description=TRUE)
This data provides census information about the municipalities of the Haute-Garonne department, France, with less than 10000 inhabitants in 1999.
data(rec99)
data(rec99)
A data frame with 554 observations on the following 10 variables:
municipality code.
municipality name.
code of the Daily Life Basin to which the municipality belongs.
number of inhabitants.
number of dwellings.
number of vacant dwellings.
a four-modality variable which equals 1 if the municipality has less than 100 dwellings, 2 if it has between 100 and 299 dwellings, 3 if it has between 300 and 999 dwellings and 4 if it has 1000 dwellings or more.
surface in square meters.
geographical latitude of the center.
geographical longitude of the center.
For the first 8 variables, 'Institut national de la statistique et des études économiques', France (http://www.insee.fr). The geographical positions are available under the Open Database License ("© OpenStreetMap contributors"). https://www.openstreetmap.org/copyright
data(rec99) hist(rec99$LOG)
data(rec99) hist(rec99$LOG)
Computes the regression estimator of the population total, using the design-based approach. The underling regression model is a model without intercept.
regest(formula,Tx,weights,pikl,n,sigma=rep(1,length(weights)))
regest(formula,Tx,weights,pikl,n,sigma=rep(1,length(weights)))
formula |
regression model formula (y~x). |
Tx |
population total of x, the auxiliary variable. |
weights |
vector of the weights; its length is equal to n, the sample size. |
pikl |
matrix of joint inclusion probabilities for the sample. |
n |
the sample size. |
sigma |
vector of positive values accounting for heteroscedasticity. |
The function returns a list with following components:
regest |
value of the regression estimator. |
coefficients |
vector of estimated beta coefficients. |
std_error |
estimated standard error of the estimated coefficients. |
t_value |
t-values associated to the coefficients. |
p_value |
p-values associated to the coefficients. |
cov_mat |
covariance matrix of the estimated coefficients. |
weights |
specified weights. |
y |
response variable. |
x |
model matrix. |
# uses the MU284 population to draw a systematic sample data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85; sample size 40 pik=inclusionprobabilities(P85,40) # joint inclusion probabilities for systematic sampling pikl=UPsystematicpi2(pik) # draws a systematic sample of size 40 s=UPsystematic(pik) # defines the variable of interest for the selected sample y=RMT85[s==1] # defines the auxiliary information for the selected sample x1=CS82[s==1] x2=SS82[s==1] # joint inclusion probabilities for the selected sample pikls=pikl[s==1,s==1] # first-order inclusion probabilities for the selected sample piks=pik[s==1] # computes the regression estimator with the model y~x1+x2-1 r=regest(formula=y~x1+x2-1,Tx=c(sum(CS82),sum(SS82)),weights=1/piks,pikl=pikls,n=40) # the regression estimator r$regest # the estimated beta coefficients r$coefficients # the regression estimator is the same as the calibration estimator (method="linear") Xs=cbind(x1,x2) total=c(sum(CS82),sum(SS82)) g1=calib(Xs,d=1/piks,total,method="linear") checkcalibration(Xs,d=1/piks,total,g1) calibev(y,Xs,total,pikls,d=1/piks,g1,with=TRUE,EPS=1e-6) detach(MU281)
# uses the MU284 population to draw a systematic sample data(MU284) # there are 3 outliers which are deleted from the population MU281=MU284[MU284$RMT85<=3000,] attach(MU281) # computes the inclusion probabilities using the variable P85; sample size 40 pik=inclusionprobabilities(P85,40) # joint inclusion probabilities for systematic sampling pikl=UPsystematicpi2(pik) # draws a systematic sample of size 40 s=UPsystematic(pik) # defines the variable of interest for the selected sample y=RMT85[s==1] # defines the auxiliary information for the selected sample x1=CS82[s==1] x2=SS82[s==1] # joint inclusion probabilities for the selected sample pikls=pikl[s==1,s==1] # first-order inclusion probabilities for the selected sample piks=pik[s==1] # computes the regression estimator with the model y~x1+x2-1 r=regest(formula=y~x1+x2-1,Tx=c(sum(CS82),sum(SS82)),weights=1/piks,pikl=pikls,n=40) # the regression estimator r$regest # the estimated beta coefficients r$coefficients # the regression estimator is the same as the calibration estimator (method="linear") Xs=cbind(x1,x2) total=c(sum(CS82),sum(SS82)) g1=calib(Xs,d=1/piks,total,method="linear") checkcalibration(Xs,d=1/piks,total,g1) calibev(y,Xs,total,pikls,d=1/piks,g1,with=TRUE,EPS=1e-6) detach(MU281)
Computes the regression estimator of the population total for a stratified sampling, using the design-based approach. The same regression model is used for all strata. The underling regression model is a model without intercept.
regest_strata(formula,weights,Tx_strata,strata,pikl, sigma=rep(1,length(weights)),description=FALSE)
regest_strata(formula,weights,Tx_strata,strata,pikl, sigma=rep(1,length(weights)),description=FALSE)
formula |
regression model formula (y~x). |
weights |
vector of the weights; its length is equal to n, the sample size. |
Tx_strata |
population total of x, the auxiliary variable. |
strata |
vector of stratum identificator. |
pikl |
joint inclusion probabilities for the sample. |
sigma |
vector of positive values accounting for heteroscedasticity. |
description |
if TRUE, the following components are printed for each stratum: the Horvitz-Thompson estimator, the estimated beta coefficients, their estimated standard error, t_values, p_values, and the covariance matrix. By default, FALSE. |
The function returns the value of the regression estimator computed as the sum of the stratum estimators.
# generates artificial data y=rgamma(10,3) x=y+rnorm(10) Stratum=c(1,1,2,2,2,3,3,3,3,3) # population size N=200 # sample size n=10 # assume proportional allocation, nh/Nh=n/N # joint inclusion probabilities (for the sample) pikl=matrix(n*(n-1)/(N*(N-1)),n,n) diag(pikl)=n/N regest_strata(formula=y~x-1,weights=rep(N/n,n),Tx_strata=c(50,30,40), strata=Stratum,pikl,description=TRUE)
# generates artificial data y=rgamma(10,3) x=y+rnorm(10) Stratum=c(1,1,2,2,2,3,3,3,3,3) # population size N=200 # sample size n=10 # assume proportional allocation, nh/Nh=n/N # joint inclusion probabilities (for the sample) pikl=matrix(n*(n-1)/(N*(N-1)),n,n) diag(pikl)=n/N regest_strata(formula=y~x-1,weights=rep(N/n,n),Tx_strata=c(50,30,40), strata=Stratum,pikl,description=TRUE)
Computes the response homogeneity groups and the response probability for each unit in these groups.
rhg(X,selection)
rhg(X,selection)
X |
sample data frame; it should contain the columns 'ID_unit' and 'status'; 'ID_unit' denotes the unit identifier (a number); 'status' is a 1/0 variable denoting the response/non-response of a unit. |
selection |
vector of variable names in X used to construct the groups. |
Into a response homogeneity group, the reponse probability is the same for all units. Data are missing at random within groups, conditionally on the selected sample.
The initial sample data frame and also the following components:
rhgroup |
the response homogeneity group for each unit. |
prob_response |
the response probability for each unit; for the units with status=0, this probability is 0. |
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer
# defines the inclusion probabilities for the population pik=c(0.2,0.7,0.8,0.5,0.4,0.4) # X is the population data frame X=cbind.data.frame(pik,c("A","B","A","A","C","B")) names(X)=c("Prob","town") # selects a sample using systematic sampling s=UPsystematic(pik) # Xs is the sample data frame Xs=getdata(X,s) # adds the status column to Xs (1 - sample respondent, 0 otherwise) Xs=cbind.data.frame(Xs,status=c(1,0,1)) # creates the response homogeneity groups using the 'town' variable rhg(Xs,selection="town")
# defines the inclusion probabilities for the population pik=c(0.2,0.7,0.8,0.5,0.4,0.4) # X is the population data frame X=cbind.data.frame(pik,c("A","B","A","A","C","B")) names(X)=c("Prob","town") # selects a sample using systematic sampling s=UPsystematic(pik) # Xs is the sample data frame Xs=getdata(X,s) # adds the status column to Xs (1 - sample respondent, 0 otherwise) Xs=cbind.data.frame(Xs,status=c(1,0,1)) # creates the response homogeneity groups using the 'town' variable rhg(Xs,selection="town")
Computes response homogeneity groups and the corresponding response probability for each unit into a group, for a stratified sampling.
rhg_strata(X,selection)
rhg_strata(X,selection)
X |
sample data frame; it should contain the columns 'ID_unit','Stratum', and 'status'; 'ID_unit' denotes the unit identifier (a number); 'Stratum' denotes the unit stratum; 'status' is a 1/0 variable denoting the response/non-response of a unit in the sample. |
selection |
vector of variable names in X used to construct the groups. |
Into a response homogeneity group, the reponse probability is the same for all units. Data are missing at random within groups, conditionally on the selected sample.
The initial sample data frame and also the following components:
rhgroup |
response homogeneity group for each unit, conditionally on its stratum. |
prob_response |
response probability for each unit; for the units with status=0, this probability is 0. |
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer
############ ## Example 1 ############ # uses Example 2 from the 'strata' function help file data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # draws a sample s1=strata(data,c("region","state"),size=c(10,5,10,4,6), method="systematic", pik=data$income) # extracts the observed data s1=getdata(data,s1) # randomly generates the 'status' variable (1-sample respondent, 0-otherwise) status=ifelse(runif(nrow(s1))<0.3,0,1) # adds the 'status' variable to the sample data frame s1 s1=cbind.data.frame(s1,status) # creates classes of income using the median of income # suppose that the income is available for all units in the sample classincome=ifelse(s1$income<median(s1$income),1,2) # adds 'classincome' to s1 s1=cbind.data.frame(s1,classincome) # computes the response homogeneity groups using the 'classincome' variable rhg_strata(s1,selection=c("classincome")) ############ ## Example 2 ############ # the same data as in Example 1 # but we also add the 'sex' column (1-female, 2-male) # suppose that the sex is available for all units in the sample sex=c(rep(1,12),rep(2,8),rep(1,10),rep(2,5)) s1=cbind.data.frame(s1,sex) # computes the response homogeneity groups using the 'classincome' and 'sex' variables rhg_strata(s1,selection=c("classincome","sex"))
############ ## Example 1 ############ # uses Example 2 from the 'strata' function help file data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # draws a sample s1=strata(data,c("region","state"),size=c(10,5,10,4,6), method="systematic", pik=data$income) # extracts the observed data s1=getdata(data,s1) # randomly generates the 'status' variable (1-sample respondent, 0-otherwise) status=ifelse(runif(nrow(s1))<0.3,0,1) # adds the 'status' variable to the sample data frame s1 s1=cbind.data.frame(s1,status) # creates classes of income using the median of income # suppose that the income is available for all units in the sample classincome=ifelse(s1$income<median(s1$income),1,2) # adds 'classincome' to s1 s1=cbind.data.frame(s1,classincome) # computes the response homogeneity groups using the 'classincome' variable rhg_strata(s1,selection=c("classincome")) ############ ## Example 2 ############ # the same data as in Example 1 # but we also add the 'sex' column (1-female, 2-male) # suppose that the sex is available for all units in the sample sex=c(rep(1,12),rep(2,8),rep(1,10),rep(2,5)) s1=cbind.data.frame(s1,sex) # computes the response homogeneity groups using the 'classincome' and 'sex' variables rhg_strata(s1,selection=c("classincome","sex"))
Computes the response probabilities using logistic regression for non-response adjustment. For stratified sampling, the same logistic model is used for all strata.
rmodel(formula,weights,X)
rmodel(formula,weights,X)
formula |
regression model formula (y~x). |
weights |
vector of weights; its length is equal to n, the sample size. |
X |
sample data frame. |
The function returns the sample data frame with a new column 'prob_resp', which contains the response probabilities.
# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the sampling frame is stratified by region within state. # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values; one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data x=getdata(data,s) # generates randomly the 'status' column (1 - respondent, 0 - nonrespondent) status=round(runif(nrow(x))) x=cbind(x,status) # computes the response probabilities rmodel(x$status~x$income+x$Stratum,weights=1/x$Prob,x) # the same example without stratification rmodel(x$status~x$income,weights=1/x$Prob,x)
# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the sampling frame is stratified by region within state. # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values; one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data x=getdata(data,s) # generates randomly the 'status' column (1 - respondent, 0 - nonrespondent) status=round(runif(nrow(x))) x=cbind(x,status) # computes the response probabilities rmodel(x$status~x$income+x$Stratum,weights=1/x$Prob,x) # the same example without stratification rmodel(x$status~x$income,weights=1/x$Prob,x)
Selects a balanced sample (a vector of 0 and 1) or an almost balanced sample. Firstly, the flight phase is applied. Next, if needed, the landing phase is applied on the result of the flight phase.
samplecube(X,pik,order=1,comment=TRUE,method=1)
samplecube(X,pik,order=1,comment=TRUE,method=1)
X |
matrix of auxiliary variables on which the sample must be balanced. |
pik |
vector of inclusion probabilities. |
order |
1, the data are randomly arranged, |
comment |
a comment is written during the execution if |
method |
1, for a landing phase by linear programming, |
Tillé, Y. (2006), Sampling Algorithms, Springer.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53–62.
Chauvet, G. and Tillé, Y. (2005). New SAS macros for balanced sampling. In INSEE, editor, Journées de Méthodologie Statistique, Paris.
Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: the cube method. Biometrika, 91:893–912.
Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128/2:411–425.
############ ## Example 1 ############ # matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1.1,2.2,3.1,4.2,5.1,6.3,7.1,8.1,9.1)) # vector of inclusion probabilities # the sample size is 3. pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # selection of the sample s=samplecube(X,pik,order=1,comment=TRUE) # The selected sample (1:length(pik))[s==1] ############ ## Example 2 ############ # 2 strata and 2 auxiliary variables # we verify the values of the inclusion probabilities by simulations X=rbind(c(1,0,1,2),c(1,0,2,5),c(1,0,3,7),c(1,0,4,9), c(1,0,5,1),c(1,0,6,5),c(1,0,7,7),c(1,0,8,6),c(1,0,9,9), c(1,0,10,3),c(0,1,11,3),c(0,1,12,2),c(0,1,13,3), c(0,1,14,6),c(0,1,15,8),c(0,1,16,9),c(0,1,17,1), c(0,1,18,2),c(0,1,19,3),c(0,1,20,4)) pik=rep(1/2,times=20) ppp=rep(0,times=20) sim=10 #for accurate results increase this value for(i in (1:sim)) ppp=ppp+samplecube(X,pik,1,FALSE) ppp=ppp/sim print(ppp) print(pik) ############ ## Example 3 ############ # unequal probability sampling by cube method # one auxiliary variable equal to the inclusion probability N=100 pik=runif(N) pikfin=samplecube(array(pik,c(N,1)),pik,1,TRUE) ############ ## Example 4 ############ # p auxiliary variables generated randomly N=100 p=7 x=rnorm(N*p,10,3) # random inclusion probabilities pik= runif(N) X=array(x,c(N,p)) X=cbind(cbind(X,rep(1,times=N)),pik) pikfin=samplecube(X,pik,1,TRUE) ############ ## Example 5 ############ # strata and an auxiliary variable N=100 a=rep(1,times=N) b=rep(0,times=N) V1=c(a,b,b) V2=c(b,a,b) V3=c(b,b,a) X=cbind(V1,V2,V3) pik=rep(2/10,times=3*N) pikfin=samplecube(X,pik,1,TRUE) ############ ## Example 6 ############ # Selection of a balanced sample using the MU284 population, # Monte Carlo simulation and variance comparison with # unequal probability sampling of fixed sample size. ############ data(MU284) # inclusion probabilities, sample size 50 pik=inclusionprobabilities(MU284$P75,50) # matrix of balancing variables X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84,MU284$REV84) # Horvitz-Thompson estimator for a balanced sample s=samplecube(X,pik,1,FALSE) HTestimator(MU284$RMT85[s==1],pik[s==1]) # Horvitz-Thompson estimator for an unequal probability sample s=samplecube(matrix(pik),pik,1,FALSE) HTestimator(MU284$RMT85[s==1],pik[s==1]) # Monte Carlo simulation; for a better accuracy, increase the value 'sim' sim=5 res1=rep(0,times=sim) res2=rep(0,times=sim) for(i in 1:sim) { cat("Simulation number ",i,"\n") s=samplecube(X,pik,1,FALSE) res1[i]=HTestimator(MU284$RMT85[s==1],pik[s==1]) s=samplecube(matrix(pik),pik,1,FALSE) res2[i]=HTestimator(MU284$RMT85[s==1],pik[s==1]) } # summary and boxplots summary(res1) summary(res2) ss=cbind(res1,res2) colnames(ss) = c("balanced sampling","uneq prob sampling") boxplot(data.frame(ss), las=1)
############ ## Example 1 ############ # matrix of balancing variables X=cbind(c(1,1,1,1,1,1,1,1,1),c(1.1,2.2,3.1,4.2,5.1,6.3,7.1,8.1,9.1)) # vector of inclusion probabilities # the sample size is 3. pik=c(1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3,1/3) # selection of the sample s=samplecube(X,pik,order=1,comment=TRUE) # The selected sample (1:length(pik))[s==1] ############ ## Example 2 ############ # 2 strata and 2 auxiliary variables # we verify the values of the inclusion probabilities by simulations X=rbind(c(1,0,1,2),c(1,0,2,5),c(1,0,3,7),c(1,0,4,9), c(1,0,5,1),c(1,0,6,5),c(1,0,7,7),c(1,0,8,6),c(1,0,9,9), c(1,0,10,3),c(0,1,11,3),c(0,1,12,2),c(0,1,13,3), c(0,1,14,6),c(0,1,15,8),c(0,1,16,9),c(0,1,17,1), c(0,1,18,2),c(0,1,19,3),c(0,1,20,4)) pik=rep(1/2,times=20) ppp=rep(0,times=20) sim=10 #for accurate results increase this value for(i in (1:sim)) ppp=ppp+samplecube(X,pik,1,FALSE) ppp=ppp/sim print(ppp) print(pik) ############ ## Example 3 ############ # unequal probability sampling by cube method # one auxiliary variable equal to the inclusion probability N=100 pik=runif(N) pikfin=samplecube(array(pik,c(N,1)),pik,1,TRUE) ############ ## Example 4 ############ # p auxiliary variables generated randomly N=100 p=7 x=rnorm(N*p,10,3) # random inclusion probabilities pik= runif(N) X=array(x,c(N,p)) X=cbind(cbind(X,rep(1,times=N)),pik) pikfin=samplecube(X,pik,1,TRUE) ############ ## Example 5 ############ # strata and an auxiliary variable N=100 a=rep(1,times=N) b=rep(0,times=N) V1=c(a,b,b) V2=c(b,a,b) V3=c(b,b,a) X=cbind(V1,V2,V3) pik=rep(2/10,times=3*N) pikfin=samplecube(X,pik,1,TRUE) ############ ## Example 6 ############ # Selection of a balanced sample using the MU284 population, # Monte Carlo simulation and variance comparison with # unequal probability sampling of fixed sample size. ############ data(MU284) # inclusion probabilities, sample size 50 pik=inclusionprobabilities(MU284$P75,50) # matrix of balancing variables X=cbind(MU284$P75,MU284$CS82,MU284$SS82,MU284$S82,MU284$ME84,MU284$REV84) # Horvitz-Thompson estimator for a balanced sample s=samplecube(X,pik,1,FALSE) HTestimator(MU284$RMT85[s==1],pik[s==1]) # Horvitz-Thompson estimator for an unequal probability sample s=samplecube(matrix(pik),pik,1,FALSE) HTestimator(MU284$RMT85[s==1],pik[s==1]) # Monte Carlo simulation; for a better accuracy, increase the value 'sim' sim=5 res1=rep(0,times=sim) res2=rep(0,times=sim) for(i in 1:sim) { cat("Simulation number ",i,"\n") s=samplecube(X,pik,1,FALSE) res1[i]=HTestimator(MU284$RMT85[s==1],pik[s==1]) s=samplecube(matrix(pik),pik,1,FALSE) res2[i]=HTestimator(MU284$RMT85[s==1],pik[s==1]) } # summary and boxplots summary(res1) summary(res2) ss=cbind(res1,res2) colnames(ss) = c("balanced sampling","uneq prob sampling") boxplot(data.frame(ss), las=1)
Draws a simple random sampling without replacement of size n (equal probabilities, fixed sample size, without replacement).
srswor(n,N)
srswor(n,N)
n |
sample size. |
N |
population size. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
############ ## Example 1 ############ #select a sample s=srswor(3,10) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune n=200 #select a sample s=srswor(n,length(Tot)) #the sample is which(s==1) #names of the selected units as.vector(name[s==1])
############ ## Example 1 ############ #select a sample s=srswor(3,10) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune n=200 #select a sample s=srswor(n,length(Tot)) #the sample is which(s==1) #names of the selected units as.vector(name[s==1])
Draws a simple random sampling without replacement of size n using the selection-rejection method (equal probabilities, fixed sample size, without replacement).
srswor1(n,N)
srswor1(n,N)
n |
sample size. |
N |
population size. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
Fan, C.T., Muller, M.E., Rezucha, I. (1962), Development of sampling plans by using sequential (item by item) selection techniques and digital computer, Journal of the American Statistical Association, 57, 387–402.
s=srswor1(3,10) #the sample is which(s==1)
s=srswor1(3,10) #the sample is which(s==1)
Draws a simple random sampling with replacement of size n (equal probabilities, fixed sample size, with replacement).
srswr(n,N)
srswr(n,N)
n |
sample size. |
N |
population size. |
Returns a vector of size N, the population size. Each element k of this vector indicates the number of replicates of unit k in the sample.
s=srswr(3,10) #the selected units are which(s!=0) #with the number of replicates s[s!=0]
s=srswr(3,10) #the selected units are which(s!=0) #with the number of replicates s[s!=0]
Stratified sampling with equal/unequal probabilities.
strata(data, stratanames=NULL, size, method=c("srswor","srswr","poisson", "systematic"), pik,description=FALSE)
strata(data, stratanames=NULL, size, method=c("srswor","srswr","poisson", "systematic"), pik,description=FALSE)
data |
data frame or data matrix; its number of rows is N, the population size. |
stratanames |
vector of stratification variables. |
size |
vector of stratum sample sizes (in the order in which the strata are given in the input data set). |
method |
method to select units; the following methods are implemented: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson), systematic sampling (systematic); if "method" is missing, the default method is "srswor". |
pik |
vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If an auxiliary information is provided, the function uses the inclusionprobabilities function for computing these probabilities. |
description |
a message is printed if its value is TRUE; the message gives the number of selected units and the number of the units in the population. By default, the value is FALSE. |
The data should be sorted in ascending order by the columns given in the stratanames argument before applying the function. Use, for example, data[order(data$state,data$region),].
The function produces an object, which contains the following information:
ID_unit |
the identifier of the selected units. |
Stratum |
the unit stratum. |
Prob |
the unit inclusion probability. |
############ ## Example 1 ############ # Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the sampling frame is stratified by region within state. # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values # one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data getdata(data,s) # see the result using a contigency table table(s$region,s$state) ############ ## Example 2 ############ # The same data as in Example 1 # the method is 'systematic' (unequal probability, without replacement) # the selection probabilities are computed using the variable 'income' s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="systematic",pik=data$income) # extracts the observed data getdata(data,s) # see the result using a contigency table table(s$region,s$state) ############ ## Example 3 ############ # Uses the 'swissmunicipalities' data as population for drawing a sample of units data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # Computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # sort the data to obtain the same order of the regions in the sample data=swissmunicipalities data=data[order(data$REG),] # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # 30 units are drawn in the first stratum, 20 in the second one, etc. # the method is simple random sampling without replacement # (equal probability, without replacement) st=strata(data,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data getdata(data, st) # see the result using a contingency table table(st$REG)
############ ## Example 1 ############ # Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data) # generates artificial data (a 235X3 matrix with 3 columns: state, region, income). # the variable "state" has 2 categories ('nc' and 'sc'). # the variable "region" has 3 categories (1, 2 and 3). # the sampling frame is stratified by region within state. # the income variable is randomly generated data=rbind(matrix(rep("nc",165),165,1,byrow=TRUE),matrix(rep("sc",70),70,1,byrow=TRUE)) data=cbind.data.frame(data,c(rep(1,100), rep(2,50), rep(3,15), rep(1,30),rep(2,40)), 1000*runif(235)) names(data)=c("state","region","income") # computes the population stratum sizes table(data$region,data$state) # not run # nc sc # 1 100 30 # 2 50 40 # 3 15 0 # there are 5 cells with non-zero values # one draws 5 samples (1 sample in each stratum) # the sample stratum sizes are 10,5,10,4,6, respectively # the method is 'srswor' (equal probability, without replacement) s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="srswor") # extracts the observed data getdata(data,s) # see the result using a contigency table table(s$region,s$state) ############ ## Example 2 ############ # The same data as in Example 1 # the method is 'systematic' (unequal probability, without replacement) # the selection probabilities are computed using the variable 'income' s=strata(data,c("region","state"),size=c(10,5,10,4,6), method="systematic",pik=data$income) # extracts the observed data getdata(data,s) # see the result using a contigency table table(s$region,s$state) ############ ## Example 3 ############ # Uses the 'swissmunicipalities' data as population for drawing a sample of units data(swissmunicipalities) # the variable 'REG' has 7 categories in the population # it is used as stratification variable # Computes the population stratum sizes table(swissmunicipalities$REG) # do not run # 1 2 3 4 5 6 7 # 589 913 321 171 471 186 245 # sort the data to obtain the same order of the regions in the sample data=swissmunicipalities data=data[order(data$REG),] # the sample stratum sizes are given by size=c(30,20,45,15,20,11,44) # 30 units are drawn in the first stratum, 20 in the second one, etc. # the method is simple random sampling without replacement # (equal probability, without replacement) st=strata(data,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor") # extracts the observed data getdata(data, st) # see the result using a contingency table table(st$REG)
This population provides information about the Swiss municipalities in 2003.
data(swissmunicipalities)
data(swissmunicipalities)
A data frame with 2896 observations on the following 22 variables:
Swiss canton.
Swiss region.
municipality number.
municipality name.
municipality area.
wood area.
area under cultivation.
mountain pasture area.
area with buildings.
industrial area.
number of men.
number of women.
number of men and women aged between 0 and 19.
number of men and women aged between 20 and 39.
number of men and women aged between 40 and 64.
number of men and women aged between 65 and over.
number of households.
number of households with 1 person.
number of households with 2 persons.
number of households with 3 persons.
number of households with 4 persons.
total population.
Swiss Federal Statistical Office.
data(swissmunicipalities) hist(swissmunicipalities$POPTOT)
data(swissmunicipalities) hist(swissmunicipalities$POPTOT)
Uses the Brewer's method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPbrewer(pik,eps=1e-06)
UPbrewer(pik,eps=1e-06)
pik |
vector of the inclusion probabilities. |
eps |
the control value, by default equal to 1e-06; it is used to control pik (pik>eps & pik < 1-eps). |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
Brewer, K. (1975), A simple procedure for $pi$pswor, Australian Journal of Statistics, 17:166-172.
#define the inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPbrewer(pik) #the sample is which(s==1)
#define the inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPbrewer(pik) #the sample is which(s==1)
Maximum entropy sampling with fixed sample size and unequal probabilities (or Conditional Poisson sampling) is implemented by means of a sequential method (unequal probabilities, without replacement, fixed sample size).
UPmaxentropy(pik) UPmaxentropypi2(pik) UPMEqfromw(w,n) UPMEpikfromq(q) UPMEpiktildefrompik(pik,eps=1e-6) UPMEsfromq(q) UPMEpik2frompikw(pik,w)
UPmaxentropy(pik) UPmaxentropypi2(pik) UPMEqfromw(w,n) UPMEpikfromq(q) UPMEpiktildefrompik(pik,eps=1e-6) UPMEsfromq(q) UPMEpik2frompikw(pik,w)
n |
sample size. |
pik |
vector of prescribed inclusion probabilities. |
eps |
tolerance in the Newton's method; by default is 1E-6. |
q |
matrix of the conditional selection probabilities for the sequential algorithm. |
w |
parameter vector of the maximum entropy design. |
The maximum entropy sampling maximizes the entropy criterion:
The main procedure is UPmaxentropy
which selects a sample (a vector of 0 and 1)
from a given vector of inclusion probabilities. The procedure UPmaxentropypi2
returns the matrix of joint inclusion probabilities from the first-order inclusion probability vector.
The other procedures are intermediate steps. They can be useful to run simulations as shown
in the examples below. The procedure UPMEpiktildefrompik
computes the vector
of the inclusion probabilities (denoted pikt
) of a Poisson sampling from the vector
of the inclusion probabilities of the maximum entropy sampling.
The maximum entropy sampling is the conditional
design given the fixed sample size. The vector w
can be easily obtained by
w=pikt/(1-pikt)
. Once piktilde
and w
are deduced from pik
,
a matrix of selection probabilities q
can be derived from the sample size n
and the vector w
via UPMEqfromw
.
Next, a sample can be selected from q
using UPMEsfromq
.
In order to generate several samples,
it is more efficient to compute the matrix q
(which needs some calculation),
and then to use the procedure UPMEsfromq
. The vector of the inclusion probabilities can
be recomputed from q
using UPMEpikfromq
, which also checks
the numerical precision of the algorithm. The procedure UPMEpik2frompikw
computes the matrix of the joint inclusion probabilities from q
and w
.
Chen, S.X., Liu, J.S. (1997).
Statistical applications of the Poisson-binomial and conditional Bernoulli distributions,
Statistica Sinica, 7, 875-892;
Deville, J.-C. (2000).
Note sur l'algorithme de Chen, Dempster et Liu.
Technical report, CREST-ENSAI, Rennes.
Matei, A., Tillé, Y. (2005) Evaluation of variance approximations and estimators in maximum entropy sampling with unequal
probability and fixed sample size,
Journal of Official Statistics, Vol. 21, No. 4, p. 543-570.
Tillé, Y. (2006), Sampling Algorithms, Springer.
############ ## Example 1 ############ # Simple example - sample selection pik=c(0.07,0.17,0.41,0.61,0.83,0.91) # First method UPmaxentropy(pik) # Second method by using intermediate procedures n=sum(pik) pikt=UPMEpiktildefrompik(pik) w=pikt/(1-pikt) q=UPMEqfromw(w,n) UPMEsfromq(q) # Matrix of joint inclusion probabilities # First method: direct computation from pik UPmaxentropypi2(pik) # Second method: computation from pik and w UPMEpik2frompikw(pik,w) ############ ## Example 2 ############ # other examples in the 'UPexamples' vignette # vignette("UPexamples", package="sampling")
############ ## Example 1 ############ # Simple example - sample selection pik=c(0.07,0.17,0.41,0.61,0.83,0.91) # First method UPmaxentropy(pik) # Second method by using intermediate procedures n=sum(pik) pikt=UPMEpiktildefrompik(pik) w=pikt/(1-pikt) q=UPMEqfromw(w,n) UPMEsfromq(q) # Matrix of joint inclusion probabilities # First method: direct computation from pik UPmaxentropypi2(pik) # Second method: computation from pik and w UPMEpik2frompikw(pik,w) ############ ## Example 2 ############ # other examples in the 'UPexamples' vignette # vignette("UPexamples", package="sampling")
Uses the Midzuno's method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPmidzuno(pik,eps=1e-6)
UPmidzuno(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value 'eps' is used to control pik (pik>eps & pik < 1-eps).
Midzuno, H. (1952), On the sampling system with probability proportional to sum of size.
Annals of the Institute of Statistical Mathematics, 3:99-107.
Deville, J.-C. and Tillé, Y. (1998),
Unequal probability sampling without replacement through a splitting method,
Biometrika, 85:89-101.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPmidzuno(pik) #the sample is which(s==1)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPmidzuno(pik) #the sample is which(s==1)
Computes the joint (second-order) inclusion probabilities for Midzuno sampling.
UPmidzunopi2(pik)
UPmidzunopi2(pik)
pik |
vector of the first-order inclusion probabilities. |
Returns a NxN matrix of the following form: the main diagonal contains the first-order inclusion probabilities for each unit k in the population; elements (k,l) are the joint inclusion probabilities of units k and l, with k not equal to l. N is the population size.
Midzuno, H. (1952), On the sampling system with probability proportional to sum of size. Annals of the Institute of Statistical Mathematics, 3:99-107.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPmidzunopi2(pik)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPmidzunopi2(pik)
Uses the minimal support method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPminimalsupport(pik)
UPminimalsupport(pik)
pik |
vector of the inclusion probabilities. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
Deville, J.-C., Tillé, Y. (1998), Unequal probability sampling without replacement
through a splitting method, Biometrika , 85, 89-101.
Tillé, Y. (2006), Sampling Algorithms, Springer.
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPminimalsupport(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPminimalsupport(pik) #the sample is which(s==1) #names of the selected units as.vector(name[s==1])
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPminimalsupport(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPminimalsupport(pik) #the sample is which(s==1) #names of the selected units as.vector(name[s==1])
Uses the Hansen-Hurwitz method to select a sample of units (unequal probabilities, with replacement, fixed sample size).
UPmultinomial(pik)
UPmultinomial(pik)
pik |
vector of the inclusion probabilities. |
Returns a vector of size N, the population size. Each element k of this vector indicates the number of replicates of unit k in the sample.
Hansen, M. and Hurwitz, W. (1943), On the theory of sampling from finite populations. Annals of Mathematical Statistics, 14:333-362.
#defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPmultinomial(pik) #the selected units are which(s!=0) #with the number of replicates s[s!=0] #or use rep((1:length(pik))[s!=0],s[s!=0])
#defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPmultinomial(pik) #the selected units are which(s!=0) #with the number of replicates s[s!=0] #or use rep((1:length(pik))[s!=0],s[s!=0])
Implements order sampling (unequal probabilities, without replacement, fixed sample size).
UPopips(lambda,type=c("pareto","uniform","exponential"))
UPopips(lambda,type=c("pareto","uniform","exponential"))
lambda |
vector of working inclusion probabilities or target ones. |
type |
the type of order sampling (pareto, uniform, exponential). |
Returns a vector of the selected units; its length is equal to the sample size.
Rosén, B. (1997), Asymptotic theory for order sampling, Journal of Statistical Planning and Inference,
62:135-158.
Rosén, B. (1997), On sampling with probability proportional to size, Journal of Statistical Planning and Inference,
62:159-191.
#define the working inclusion probabilities lambda=c(0.2,0.7,0.8,0.5,0.4,0.4) #draw a Pareto sample s=UPopips(lambda, type="pareto") #the sample is s
#define the working inclusion probabilities lambda=c(0.2,0.7,0.8,0.5,0.4,0.4) #draw a Pareto sample s=UPopips(lambda, type="pareto") #the sample is s
Selects an unequal probability sample using the pivotal method (unequal probabilities, without replacement, fixed sample size).
UPpivotal(pik,eps=1e-6)
UPpivotal(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value eps is used to control pik (pik>eps & pik < 1-eps).
Deville, J.-C. and Tillé, Y. (1998),
Unequal probability sampling without replacement through a splitting method,
Biometrika, 85:89-101.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. to appear in Computational Statistics.
Tillé, Y. (2006), Sampling Algorithms, Springer.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPpivotal(pik) #the sample is which(s==1)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPpivotal(pik) #the sample is which(s==1)
Draws a Poisson sample using a prescribed vector of first-order inclusion probabilities (unequal probabilities, without replacement, random sample size).
UPpoisson(pik)
UPpoisson(pik)
pik |
vector of the first-order inclusion probabilities. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
############ ## Example 1 ############ # inclusion probabilities pik=c(1/3,1/3,1/3) # selects a sample s=UPpoisson(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune n=200 pik=inclusionprobabilities(Tot,n) # select a sample s=UPpoisson(pik) #the sample is which(s==1) # names of the selected units getdata(name,s)
############ ## Example 1 ############ # inclusion probabilities pik=c(1/3,1/3,1/3) # selects a sample s=UPpoisson(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune n=200 pik=inclusionprobabilities(Tot,n) # select a sample s=UPpoisson(pik) #the sample is which(s==1) # names of the selected units getdata(name,s)
Selects a sample using the pivotal method, when the order of the population units is random (unequal probabilities, without replacement, fixed sample size).
UPrandompivotal(pik,eps=1e-6)
UPrandompivotal(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value 'eps' is used to control pik (pik>eps and pik<1-eps).
Deville, J.-C. and Tillé, Y. (1998),
Unequal probability sampling without replacement through a splitting method,
Biometrika, 85:89–101.
Tillé, Y. (2006), Sampling Algorithms, Springer.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPrandompivotal(pik) #the sample is which(s==1)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPrandompivotal(pik) #the sample is which(s==1)
Selects a sample using the systematic method, when the order of the population units is random (unequal probabilities, without replacement, fixed sample size).
UPrandomsystematic(pik,eps=1e-6)
UPrandomsystematic(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value 'eps' is used to control pik (pik>eps and pik<1-eps).
Madow, W.G. (1949), On the theory of systematic sampling, II, Annals of Mathematical Statistics, 20, 333-354.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPrandomsystematic(pik) #the sample is (1:length(pik))[s==1]
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #select a sample s=UPrandomsystematic(pik) #the sample is (1:length(pik))[s==1]
Uses the Sampford's method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPsampford(pik,eps=1e-6, max_iter=500)
UPsampford(pik,eps=1e-6, max_iter=500)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
max_iter |
maximum number of iterations in the algorithm. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value eps is used to control pik (pik>eps & pik < 1-eps). The sample size must be small with respect to the population size; otherwise, the selection time can be very long.
Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, 54:499-513.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) s=UPsampford(pik) #the sample is which(s==1)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) s=UPsampford(pik) #the sample is which(s==1)
Computes the joint (second-order) inclusion probabilities for Sampford sampling.
UPsampfordpi2(pik)
UPsampfordpi2(pik)
pik |
vector of the first-order inclusion probabilities. |
Returns a NxN matrix of the following form: the main diagonal contains the first-order inclusion probabilities for each unit k in the population; elements (k,l) are the joint inclusion probabilities of units k and l, with k not equal to l. N is the population size.
Sampford, M. (1967), On sampling without replacement with unequal probabilities of
selection, Biometrika, 54:499-513.
Wu, C. (2004). R/S-PLUS Implementation of pseudo empirical
likelihood methods under unequal probability sampling. Working
paper 2004-07, Department of Statistics and Actuarial Science,
University of Waterloo.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPsampfordpi2(pik)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPsampfordpi2(pik)
Uses the systematic method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPsystematic(pik,eps=1e-6)
UPsystematic(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise).
Madow, W.G. (1949), On the theory of systematic sampling, II, Annals of Mathematical Statistics, 20, 333-354.
inclusionprobabilities
, UPrandomsystematic
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) # extracts the observed data getdata(belgianmunicipalities,s)
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) ############ ## Example 2 ############ data(belgianmunicipalities) Tot=belgianmunicipalities$Tot04 name=belgianmunicipalities$Commune pik=inclusionprobabilities(Tot,200) #selects a sample s=UPsystematic(pik) #the sample is which(s==1) # extracts the observed data getdata(belgianmunicipalities,s)
Computes the joint (second-order) inclusion probabilities for systematic sampling.
UPsystematicpi2(pik)
UPsystematicpi2(pik)
pik |
vector of the first-order inclusion probabilities. |
Returns a NxN matrix of the following form: the main diagonal contains the first-order inclusion probabilities for each unit k in the population; elements (k,l) are the joint inclusion probabilities of units k and l, with k not equal to l. N is the population size.
Madow, W.G. (1949), On the theory of systematic sampling, II, Annals of Mathematical Statistics, 20, 333-354.
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPsystematicpi2(pik)
#define the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPsystematicpi2(pik)
Uses the Tillé's method to select a sample of units (unequal probabilities, without replacement, fixed sample size).
UPtille(pik,eps=1e-6)
UPtille(pik,eps=1e-6)
pik |
vector of the inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a vector (with elements 0 and 1) of size N, the population size. Each element k of this vector indicates the status of unit k (1, unit k is selected in the sample; 0, otherwise). The value eps is used to control pik (pik>eps & pik < 1-eps).
Tillé, Y. (1996), An elimination procedure of unequal probability sampling without
replacement, Biometrika, 83:238-241.
Deville, J.-C. and Tillé, Y. (1998),
Unequal probability sampling without replacement through a splitting method,
Biometrika, 85:89-101.
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPtille(pik) #the sample is which(s==1) ############ ## Example 2 ############ # see in the 'UPexamples' vignette # vignette("UPexamples", package="sampling")
############ ## Example 1 ############ #defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #selects a sample s=UPtille(pik) #the sample is which(s==1) ############ ## Example 2 ############ # see in the 'UPexamples' vignette # vignette("UPexamples", package="sampling")
Computes the joint (second-order) inclusion probabilities for Tillé sampling.
UPtillepi2(pik,eps=1e-6)
UPtillepi2(pik,eps=1e-6)
pik |
vector of the first-order inclusion probabilities. |
eps |
control value, by default equal to 1e-6. |
Returns a NxN matrix of the following form: the main diagonal contains the first-order inclusion
probabilities for each unit k in the population; elements (k,l) are the joint inclusion
probabilities of units k and l, with k not equal to l. N is the population size. The value eps
is used to
control pik
(pik>eps & pik < 1-eps).
Tillé, Y. (1996), An elimination procedure of unequal probability sampling without replacement, Biometrika, 83:238-241.
#defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPtillepi2(pik)
#defines the prescribed inclusion probabilities pik=c(0.2,0.7,0.8,0.5,0.4,0.4) #matrix of joint inclusion probabilities UPtillepi2(pik)
Computes the variance estimation of an estimator of the population total using the Deville's method.
varest(Ys,Xs=NULL,pik,w=NULL)
varest(Ys,Xs=NULL,pik,w=NULL)
Ys |
vector of the variable of interest; its length is equal to n, the sample size. |
Xs |
matrix of the auxiliary variables; for the calibration estimator, this is the matrix of the sample calibration variables. |
pik |
vector of the first-order inclusion probabilities; its length is equal to n, the sample size. |
w |
vector of the calibrated weights (for the calibration estimator); its length is equal to n, the sample size. |
The function implements the following estimator:
where .
Deville, J.-C. (1993). Estimation de la variance pour les enquêtes en deux phases. Manuscript, INSEE, Paris.
# Belgian municipalities data base data(belgianmunicipalities) attach(belgianmunicipalities) # Computes the inclusion probabilities pik=inclusionprobabilities(Tot04,200) N=length(pik) n=sum(pik) # Defines the variable of interest y=TaxableIncome # Draws a Tille sample of size 200 s=UPtille(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the variance estimation of the Horvitz-Thompson estimator varest(Ys=y[s==1],pik=pik[s==1]) # for an example using calibration estimator, see the 'calibration' vignette # vignette("calibration", package="sampling")
# Belgian municipalities data base data(belgianmunicipalities) attach(belgianmunicipalities) # Computes the inclusion probabilities pik=inclusionprobabilities(Tot04,200) N=length(pik) n=sum(pik) # Defines the variable of interest y=TaxableIncome # Draws a Tille sample of size 200 s=UPtille(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the variance estimation of the Horvitz-Thompson estimator varest(Ys=y[s==1],pik=pik[s==1]) # for an example using calibration estimator, see the 'calibration' vignette # vignette("calibration", package="sampling")
Computes variance estimators of the Horvitz-Thompson estimator of the population total.
varHT(y,pikl,method)
varHT(y,pikl,method)
y |
vector of the variable of interest; its length is equal to n, the sample size. |
pikl |
matrix of joint inclusion probabilities; its dimension is nxn. |
method |
if 1, an unbiased variance estimator is computed; if 2, the Sen-Yates-Grundy variance estimator for fixed sample size is computed; be default, the method is 1. |
If method is 1, the following estimator is implemented
If method is 2, the following estimator is implemented
pik=c(0.2,0.7,0.8,0.5,0.4,0.4) N=length(pik) n=sum(pik) # Defines the variable of interest y=rnorm(N,10,2) # Draws a Poisson sample of expected size n s=UPpoisson(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the joint inclusion prob. for Poisson sampling pikl=outer(pik,pik,"*") diag(pikl)=pik # Computes the variance estimator (method=1, the sample size is not fixed) varHT(y[s==1],pikl[s==1,s==1],1) # Draws a Tille sample of size n s=UPtille(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the joint inclusion prob. for Tille sampling pikl=UPtillepi2(pik) # Computes the variance estimator (method=2, the sample size is fixed) varHT(y[s==1],pikl[s==1,s==1],2)
pik=c(0.2,0.7,0.8,0.5,0.4,0.4) N=length(pik) n=sum(pik) # Defines the variable of interest y=rnorm(N,10,2) # Draws a Poisson sample of expected size n s=UPpoisson(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the joint inclusion prob. for Poisson sampling pikl=outer(pik,pik,"*") diag(pikl)=pik # Computes the variance estimator (method=1, the sample size is not fixed) varHT(y[s==1],pikl[s==1,s==1],1) # Draws a Tille sample of size n s=UPtille(pik) # Computes the Horvitz-Thompson estimator HTestimator(y[s==1],pik[s==1]) # Computes the joint inclusion prob. for Tille sampling pikl=UPtillepi2(pik) # Computes the variance estimator (method=2, the sample size is fixed) varHT(y[s==1],pikl[s==1,s==1],2)
Computes the Taylor-series linearization variance estimation of the ratio
The estimators in the ratio are Horvitz-Thompson type estimators.
vartaylor_ratio(Ys,Xs,pikls)
vartaylor_ratio(Ys,Xs,pikls)
Ys |
vector of the first observed variable; its length is equal to n, the sample size. |
Xs |
vector of the second observed variable; its length is equal to n, the sample size. |
pikls |
matrix of joint inclusion probabilities of the sample units; its dimension is nxn. |
The function implements the following estimator:
where .
Woodruff, R. (1971). A Simple Method for Approximating the Variance of a Complicated Estimate, Journal of the American Statistical Association, Vol. 66, No. 334 , pp. 411–414.
data(belgianmunicipalities) attach(belgianmunicipalities) # inclusion probabilities, sample size 200 pik=inclusionprobabilities(Tot04,200) # the first variable (population level) Y=Men04 # the second variable (population level) X=Women04 # population size N=length(pik) # joint inclusion probabilities for Poisson sampling pikl=outer(pik,pik,"*") diag(pikl)=pik # draw a sample using Poisson sampling s=UPpoisson(pik) # sample inclusion probabilities piks=pik[s==1] # the first observed variable (sample level) Ys=Y[s==1] # the second observed variable (sample level) Xs=X[s==1] # matrix of joint inclusion prob. (sample level) pikls=pikl[s==1,s==1] # ratio estimator and its estimated variance vartaylor_ratio(Ys,Xs,pikls)
data(belgianmunicipalities) attach(belgianmunicipalities) # inclusion probabilities, sample size 200 pik=inclusionprobabilities(Tot04,200) # the first variable (population level) Y=Men04 # the second variable (population level) X=Women04 # population size N=length(pik) # joint inclusion probabilities for Poisson sampling pikl=outer(pik,pik,"*") diag(pikl)=pik # draw a sample using Poisson sampling s=UPpoisson(pik) # sample inclusion probabilities piks=pik[s==1] # the first observed variable (sample level) Ys=Y[s==1] # the second observed variable (sample level) Xs=X[s==1] # matrix of joint inclusion prob. (sample level) pikls=pikl[s==1,s==1] # ratio estimator and its estimated variance vartaylor_ratio(Ys,Xs,pikls)
Gives a matrix whose rows are the vectors (with 0 and 1; 1 - a unit is selected, 0 - otherwise) of all samples of fixed size.
writesample(n,N)
writesample(n,N)
n |
sample size. |
N |
population size. |
# all samples of size 4 # from a population of size 10 w=writesample(4,10) # the samples are (read by rows) t(apply(w,1,function(x) (1:ncol(w))[x==1]))
# all samples of size 4 # from a population of size 10 w=writesample(4,10) # the samples are (read by rows) t(apply(w,1,function(x) (1:ncol(w))[x==1]))