Discontinuities due to survey redesigns: a structural time series approach

by Goesjenov, Ibragim, MS


Bibliographical Information:

Advisor: Jan van den Brakel

School: Universiteit Maastricht

School Location: Netherlands

Source Type: Master's Thesis

Keywords: Structural time series analysis

Date of Publication: 10/03/2011


Discontinuities due to survey redesigns:
a structural time series approach

Master's thesis, MSc Econometrics and Operations Research
Maastricht University

By Ibragim Goesjenov

under supervision of
Prof. dr. ir. ing. Jan van den Brakel (Statistics Netherlands / Maastricht University)

Abstract

In this paper, discontinuities in time series that arise from a survey redesign are analyzed using
a structural time series framework. This framework is applied to a set of sample surveys conducted
consecutively by Statistics Netherlands on subjects including victimization and the number of criminal
offenses in the period from 1980 until 2010. To estimate the discontinuities, the Kalman filter
is implemented using the library Ssfpack in OxMetrics. Additionally, the paper analyzes the effect of
explanatory variables, such as police registrations, on the estimation of the effects due to a survey
redesign.

This project was set up in cooperation with Statistics Netherlands in Heerlen in the form of an internship.
I would like to take this opportunity to thank everyone at Statistics Netherlands for helping
me collect the data, gather the literature and perform the analysis needed to make this
project a success. In particular, I would like to thank my supervisor Jan van den Brakel for his extensive
support and feedback during the entire project, and Harrie Huys for his assistance in data collection and
interpretation.



1 Introduction

National statistical offices conduct surveys in several fields, including transportation, demography
and crime. Statistics Netherlands (SN) was established in 1899 and currently has locations in The
Hague and Heerlen. It is responsible for collecting statistics on the Dutch economy and society,
which, among other purposes, serve as a basis for policy makers and politicians. While the main task of SN is
to collect data in an appropriate way, the Division of Methodology and Quality (DMQ) is responsible
for the whole process of survey sampling, research on proper data collection methodologies and quality
control. One of the challenges is producing continuous time series out of a sequence of series which
contain discontinuities due to survey redesigns.
More specifically, one is usually interested in particular finite population parameters, such as the percentage
of unemployed persons in a population, where the population could be that of a country but also a set of companies
or municipalities. In theory, the population parameters could be obtained by collecting data on every
individual in the population, which is called a census. However, due to practical and financial limitations,
national statistical offices are not able to consider the whole population when collecting data. Interviewing
the almost 17 million Dutch citizens every year is not realistic and, on top of that, inefficient.
Alternatively, statistical offices consider samples from a population, meaning that the population parameters
of interest are estimated from the data collected. Compared to a census, considering a sample has
several advantages: it is less expensive, and the information is usually available much faster,
since not all individuals in the population have to be interviewed. The major disadvantage of considering a
sample is that the estimates based on the sample almost never exactly coincide with the parameters for
the entire population. The error in the analysis that can be attributed to the fact that only a sample is
considered instead of all individuals in the population is called the sampling error. This is an inevitable
implication of using samples for the estimation of population parameters, but it can be controlled
by using a sufficiently large sample size. In many cases exact information is not
needed; researchers are generally more interested in the confidence interval in which the
parameter of interest is likely to lie. However, constructing such a confidence interval is only possible if
the sample is selected by some kind of lottery mechanism, i.e. a probability sample (Särndal et al., 1992;
Renssen, 1998). The basic example of such a lottery mechanism is the simple random sample, where
every element of the population has the same probability of being selected into the sample (Bethlehem, 2009).
On the other hand, non-sampling errors are deviations from the true value of the parameter which
cannot be attributed to the sampling error. These errors are not necessarily related to a sample survey and
are also likely to occur in a census. Whereas the expectation of the sampling error is zero, non-sampling
errors generally introduce a bias in the estimate. Non-sampling errors can be divided into non-response,
measurement and coverage errors. For example, the data collection method used in a survey, as discussed
in de Leeuw (2005), accounts for part of the measurement error. In particular, the presence of an
interviewer appears to have a large influence on the interviewee, especially when the interview
questions become more sensitive.
Changes in the survey process might lead to systematic differences in the outcomes of a survey, such
that the non-sampling error changes. This in turn might lead to a discontinuity in the series, such
that comparing the new series to the series before the change-over is no longer appropriate (van den
Brakel and Roels, 2010). Sometimes changes in the survey process are inevitable, for example to improve
the quality of the survey; they may also be driven by budget cuts. An example of such a
change in the survey process is when the data collection method is changed from a web-based
interview to personal interviewing. The interview speed of the latter data collection method is usually
lower, such that respondents are given more time to think about their answers. As a result, the
measurement error in the new series might differ from that in the series before the redesign.
There exist several methods to quantify the discontinuities and then correct the series to obtain a continuous
one. First, if the micro data observed under the regular and the new survey are consistent, it is
possible to quantify the discontinuities by recalculating the observed series (van den Brakel et al., 2008).
For example, if a new classification of publication domains is introduced, a domain indicator makes it
possible to quantify the discontinuities and then recalculate the values according to the old classification.
However, in many real-life situations the micro data are not consistent under the old and the new approach
after a survey redesign, and hence recalculation is not applicable. Therefore, other methods have to be
used. One method is to run the old survey in parallel with the new one for a certain period, called a
parallel run. The difference between the outcomes of the two surveys is the discontinuity. This approach
can be rather expensive, and therefore an appropriate parallel run is not always affordable.
Another methodology to quantify discontinuities, which does not require a parallel run, is a structural
time series approach. It assumes that a series can be modeled in a state space framework with several
components, namely a trend, cyclic components, an intervention variable and an irregular term. The
intervention variable accounts for the discontinuities, assuming it is known at which time points the new
surveys were introduced. Once the state space model is set up, one can apply a filtering technique
called the Kalman filter, which attempts to disentangle the discontinuity from the real development of
the parameter. In this paper a structural time series model as described in Durbin and Koopman (2001)
is applied to model a series of crime victimization.
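To make this concrete, the following is a minimal sketch in Python of the kind of model meant here. The thesis itself uses Ssfpack in OxMetrics, so statsmodels' UnobservedComponents merely serves as a stand-in, and the series, the redesign point (t = 60) and the size of the discontinuity are invented for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 100
level = 20 + np.cumsum(rng.normal(0, 0.5, T))   # the "real" development of the parameter
step = (np.arange(T) >= 60).astype(float)       # intervention dummy: redesign at t = 60 (invented)
y = level + 3.0 * step + rng.normal(0, 1.0, T)  # observed series with a discontinuity of 3

# Local level model with the step dummy as intervention variable; fitting runs
# the Kalman filter, which disentangles the discontinuity from the trend.
model = sm.tsa.UnobservedComponents(y, level="local level", exog=step)
res = model.fit(disp=False)
print(res.params)  # the exog coefficient estimates the discontinuity (true value 3)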
This paper is set up as follows. First, section 2 describes the relevant sampling theory. The
data considered in this paper are discussed in section 3, after which the literature on state space models
and the Kalman filter is reviewed in section 4. The application to the data and the results are
presented in the fifth section. The paper concludes with a discussion in section 6.


2 Sampling theory

Sampling theory has been a significant part of research ever since statistical offices came into
existence. In this section, sampling theory as discussed in Bethlehem (2009), Särndal et al. (1992) and Renssen
(1998) is summarized. In short, sampling theory boils down to choosing a sampling strategy, which
is the combination of a sampling design and an estimator, in such a way that the estimates for the
unknown population parameters are as precise as possible given a certain budget. These finite population
parameters can be, for example, means, totals or fractions.
In this section, first the basics of survey sampling are briefly discussed. After the main mathematical
tools are introduced in subsection 2.2, the sampling design is addressed in subsection 2.3, whereas
subsection 2.4 explains the use of auxiliary information in the estimator.

2.1 Survey sampling

The objective of survey research is "to draw a sample, carry out measurements on the sampled population
elements, and, on basis of that information, draw conclusions about some population parameter"
(Bethlehem, 2009). More specifically, the following steps can be distinguished in a survey process.

1. Determining the population of interest, the target variables and the auxiliary variables, and constructing the questionnaire
2. Obtaining a sample frame
3. Determining the sampling design and the estimator
4. Drawing the sample
5. Doing the field work
6. Processing and analysing the sample data
Following step 1, it is crucial to define the population from which the sample is to be drawn and the variables
of interest, because these cannot be changed once the collection process has started. Moreover, the design
of the survey questionnaire is also done in step 1.
A finite population is a union of units, called sampling units. To draw a sample, a list of sampling units
is needed, which is called a sample frame. The sample frame is usually an administrative register
from which the sample can be drawn; this could be the phone book if the population consists of Dutch
citizens, or the files of the Chamber of Commerce if companies constitute the population. This
can introduce immediate bias into the sample: one can imagine that not all households are listed in
a phone book, for example. On the other hand, some individuals could be included unintentionally, such
as non-Dutch citizens having a Dutch phone number. Non-sampling errors which are due to the omission
of units belonging to the target population and the inclusion of units not belonging to the population
are called coverage errors (Bethlehem, 2009).
Step 3 is the main focus of this section. Here, the aforementioned probability samples are set up, depending
on the population parameters and the budget of the statistical institute. The combination of
the sampling design and the estimator is called the sampling strategy. In both parts of the strategy,
auxiliary information can be used to improve the precision of the estimator. First, the sampling design is the
selection procedure of the lottery mechanism; in an ideal case, every population element is
assigned a probability of being selected. Auxiliary information is used by more advanced sampling designs.
The estimator, in turn, is closely related to the sampling design: it is the formula which converts
the sample measurements into an estimate of the unknown population parameter of interest. Depending
on the sampling design, the estimator calculates the best possible value given the observed sample and
the available auxiliary information. This implies that the auxiliary information must be related in some way
to the target variable. The following subsections deepen the theory on choosing the best strategy,
making the sampling error as small as possible.
After the strategy from step 3 is fixed, the sample can be drawn in step 4. This boils down to selecting
the individuals from the population with a random number generator, by applying the sampling design.
Once these individuals are selected, the field work can be performed in step 5. This is again a crucial step
in the roadmap, and the errors made here might explain discontinuities due to a redesign. More specifically,
the field work is heavily subject to non-sampling errors. First, non-response errors occur when
the respondent refuses to cooperate in the survey, e.g. because he or she has no interest in participating
in any survey. If a part of the population is systematically left out of the survey due to
non-response, this biases the results. Measurement errors, in turn, occur when the respondent is reached
but fails to produce the correct answer. The questionnaire design and the data collection method
largely determine the size of this kind of bias. Hence, these errors account for a bias in the sample,
such that the sample parameters differ from the population parameters (Bethlehem, 2009). If the survey
designed in steps 1-5 is kept unchanged, these errors do not change either, such that the results of the
survey are comparable over time. Things change if the survey is redesigned: the bias changes, making it
necessary to quantify this change in order to be able to compare the new results with previous surveys.
Last, the sample data are analyzed and published in step 6, which is the final step, where the data can
be checked for extreme outliers and other mistakes. An example is to sort the data and check for extreme
ages in the sample; someone aged 134 is quite unlikely, but possible (Bethlehem, 2009).

2.2 Definitions

In this section the following notational conventions are used. An uppercase letter, e.g. $Y$ or $U$,
always denotes a population quantity. On the other hand, lowercase letters such as $y$ or $s$ refer
to sample quantities. Additionally, bars as in $\bar{Y}$ and $\bar{y}$ are systematically used to denote
means, whereas the circumflex as in $\hat{y}$ refers to estimators.
Let the finite population be represented by $U$, which consists of $N$ identifiable individuals. Note
that the target variable is the variable needed to answer the question one wants to
answer about the population (Bethlehem, 2009). Each element $k$ of $U$ ($k = 1, \ldots, N$) is associated with
$Y_k$, the value of the scalar target variable, and a vector $X_k$ which contains the values of the auxiliary
variables. The population totals of the target and auxiliary variables are defined as $Y = \sum_{k=1}^{N} Y_k$ and
$X = \sum_{k=1}^{N} X_k$, respectively.

Then, let $s = (k_1, \ldots, k_n)$ be the sample which is drawn from $U$ using a lottery-like mechanism, also
called a random sample. Here, $k_i$ is an element from the population $U$ and $n$ is the sample size. The
values of the target variable for the selected elements $k_1, \ldots, k_n$ are denoted as $y_1, \ldots, y_n$. Note that a
random sample is necessary to draw a sample which is representative of the population. The sample $s$ is
not unique, meaning that several different samples can be drawn from a population using the lottery-like
mechanism. The complete set of all possible samples which can be drawn from $U$ is represented by $\nabla$.
Every sample $s \in \nabla$ has a probability $p(s)$ of being selected (Särndal et al., 1992). In detail, $p(s)$ is called
the sampling design and has to fulfill the following constraints:

$$0 < p(s) \leq 1, \quad \text{and} \quad \sum_{s \in \nabla} p(s) = 1,$$

such that $p(s)$ assigns a probability to every conceivable sample $s$ from $U$ (Bethlehem, 2009). If
$s_k$ represents the number of times the element $k$ is included in sample $s$, then the first order inclusion
expectation of $k$ is defined as

$$\pi_k = \sum_{s \in \nabla} s_k p(s),$$

whereas the second order inclusion expectation of elements $k$ and $l$ is

$$\pi_{kl} = \sum_{s \in \nabla} s_k s_l p(s).$$

The first order inclusion expectation $\pi_k$ can be interpreted as the expected number of times
one particular element is included in the sample, whereas $\pi_{kl}$ represents the corresponding expectation
for a pair of elements. Note that these inclusion expectations are the crucial determinant of the
sampling design: the exact design of the lottery-like mechanism is specified here.

On the other hand, an estimator $\hat{\theta}(s)$ is a sample statistic that can be used for estimating the value
of a population parameter $\theta$ using sample $s$. Here, a sample statistic is a function that depends on the
values observed in the sample (Bethlehem, 2009). A basic example of a sample statistic is the sample
mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. The expectation and variance of the estimator $\hat{\theta}$ for a population parameter $\theta$ equal

$$E(\hat{\theta}) = \sum_{s \in \nabla} \hat{\theta}(s) p(s), \quad \text{and} \quad Var(\hat{\theta}) = \sum_{s \in \nabla} \left[ \hat{\theta}(s) - E(\hat{\theta}) \right]^2 p(s).$$

As mentioned before, the combination of sampling design $p(s)$ and estimator $\hat{\theta}$ is called the sampling strategy.
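These design-based quantities can be made tangible by enumerating the complete sample space $\nabla$ of a toy population. The following Python sketch, with invented population values, does so for simple random sampling with $N = 4$ and $n = 2$, where every sample has $p(s) = 1/\binom{4}{2} = 1/6$.

from itertools import combinations

Y = [3.0, 5.0, 7.0, 9.0]                   # invented population values; mean is 6.0
N, n = len(Y), 2
samples = list(combinations(range(N), n))  # the sample space: all C(4, 2) = 6 samples
p = 1.0 / len(samples)                     # p(s), equal for every s under SRS

# First order inclusion expectations: pi_k = sum over s of s_k p(s)
pi = [sum(p for s in samples if k in s) for k in range(N)]
print(pi)                                  # [0.5, 0.5, 0.5, 0.5], i.e. n/N

# Design expectation of the sample mean: E(theta-hat) = sum over s of theta-hat(s) p(s)
theta = [sum(Y[k] for k in s) / n for s in samples]
print(sum(t * p for t in theta))           # 6.0, the population mean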

2.3 Sampling design

In this subsection it is assumed that steps 1 and 2 as described in section 2.1 are already fixed.
While several sampling designs are considered in this subsection, the estimator is kept fixed; in the next
subsection the sampling design is fixed and the estimator is discussed further. The basic estimator
used in this subsection is the Horvitz-Thompson estimator.

2.3.1 Horvitz-Thompson estimator

Let the population mean $\bar{Y}$ be defined as $\bar{Y} = \frac{1}{N} \sum_{k=1}^{N} Y_k$. Using the first order inclusion expectations,
the Horvitz-Thompson (HT) estimator for the population mean is then defined as

$$\hat{\bar{y}}_{HT} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}. \qquad (2.1)$$

Note that here $\hat{\bar{y}}_{HT}$ is one of the estimators $\hat{\theta}$ from the previous section. The HT estimator (2.1) is
a general estimator since it is design unbiased for all possible random sampling designs, such that
$E(\hat{\bar{y}}_{HT}) = \bar{Y}$. Its variance is

$$Var(\hat{\bar{y}}_{HT}) = \frac{0.5}{N^2} \sum_{k=1}^{N} \sum_{l=1}^{N} (\pi_k \pi_l - \pi_{kl}) \left( \frac{Y_k}{\pi_k} - \frac{Y_l}{\pi_l} \right)^2,$$

which can be estimated by

$$\widehat{Var}(\hat{\bar{y}}_{HT}) = \frac{0.5}{N^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \frac{\pi_k \pi_l - \pi_{kl}}{\pi_{kl}} \left( \frac{y_k}{\pi_k} - \frac{y_l}{\pi_l} \right)^2.$$

Note that the more proportional the inclusion
expectations are to the target variables, the smaller the variance of the HT estimator. In the ideal case,
if the inclusion expectations are exactly proportional to the target variables, the ratios in the variance
formula become constant and hence the variance will be zero.
All the estimators discussed in this subsection are special cases of the Horvitz-Thompson estimator.
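As a small numerical check, the following Python sketch transcribes (2.1) and its variance estimator directly, using invented data and the inclusion expectations of simple random sampling without replacement (anticipating subsection 2.3.2, under which the result must coincide with the $\frac{1-f}{n}$ formula derived there).

import numpy as np

N, n = 1000, 50                          # population and sample size (invented)
rng = np.random.default_rng(1)
y = rng.normal(10, 2, n)                 # sampled values (invented)

pi = np.full(n, n / N)                   # first order inclusion expectations under SRS
pi_kl = np.full((n, n), n * (n - 1) / (N * (N - 1)))  # second order, k != l
np.fill_diagonal(pi_kl, n / N)           # pi_kk = pi_k without replacement

ht = np.sum(y / pi) / N                  # HT estimate (2.1); equals the sample mean here
d = y / pi
var_ht = 0.5 / N**2 * np.sum((pi[:, None] * pi[None, :] - pi_kl) / pi_kl
                             * (d[:, None] - d[None, :]) ** 2)
print(ht, var_ht)
print((1 - n / N) / n * y.var(ddof=1))   # identical value under SRS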

2.3.2 Simple random sampling without replacement

This brings us to the first sampling design, called simple random sampling. Here, let $s$ be a random
sample drawn without replacement. In simple random sampling, each population element has the same probability
of being drawn. The inclusion expectations are therefore computed easily, since the set $\nabla$ consists
of $\binom{N}{n}$ samples, such that there are exactly $\binom{N-1}{n-1}$ samples containing element $k$, and $\binom{N-2}{n-2}$ samples containing
elements $k$ and $l$ (Renssen, 1998). Therefore,

$$\pi_k = \frac{n}{N}, \quad \text{and} \quad \pi_{kl} = \frac{n(n-1)}{N(N-1)}.$$

Because the sampling is done without replacement, $\pi_{kl} = \pi_k$ if $k = l$. Also, the ratio $\frac{n}{N}$ is often called
the sample fraction $f$, since it is the ratio of the sample size to the population size.
Plugging $\pi_k$ into the HT estimator $\hat{\bar{y}}_{HT}$ leads to the conclusion that the HT estimator for the
population mean is equal to the sample mean $\hat{\bar{y}} = \frac{1}{n} \sum_{i=1}^{n} y_i$. The variance of the HT estimator is
given by

$$Var(\hat{\bar{y}}_{HT}) = \frac{1-f}{n} \Psi^2,$$

where $f = \frac{n}{N}$ and $\Psi^2$ is the population variance defined as $\Psi^2 = \frac{1}{N-1} \sum_{k=1}^{N} (Y_k - \bar{Y})^2$. Note that
given $\Psi^2$, the larger the sample size $n$, the more precise the estimator
gets. However, $\Psi^2$ is in general not available and therefore $Var(\hat{\bar{y}}_{HT})$ has to be estimated using the
sample variance $\hat{\psi}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{\bar{y}})^2$, which provides the unbiased estimator

$$\widehat{Var}(\hat{\bar{y}}_{HT}) = \frac{1-f}{n} \hat{\psi}^2.$$

For a more detailed discussion, the reader is referred to chapter 3 of Bethlehem (2009).
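A quick Monte Carlo check of these formulas can be sketched as follows; the synthetic population is invented, and the empirical variance of the estimator over repeated samples should match $\frac{1-f}{n} \Psi^2$.

import numpy as np

rng = np.random.default_rng(2)
N, n = 2000, 100
Y = rng.gamma(2.0, 3.0, N)                    # invented population values
f = n / N
psi2 = np.sum((Y - Y.mean()) ** 2) / (N - 1)  # population variance Psi^2

# Repeatedly draw simple random samples without replacement
est = [rng.choice(Y, size=n, replace=False).mean() for _ in range(20000)]
print(np.var(est), (1 - f) / n * psi2)        # the two numbers should be close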

2.3.3 Stratified simple random sampling

In case there exist homogeneous subpopulations, or strata, in the population of interest, it is possible
to use this information in the sampling design as auxiliary information. An example of homogeneous
subpopulations is the case where the population is divided into men and women, and it is known that men
have on average a higher salary than women. Stratification boils down to drawing a number of mutually
independent samples, one from each stratum, instead of only one sample for the whole population, as is
the case with simple random sampling. As will be shown, the main idea is to reduce the variance: because
the samples are drawn independently, the variance between the subpopulations is eliminated from the
sampling error. Hence, in case the target variable is homogeneous within the strata and heterogeneous
across the strata, it is valuable to use stratified sampling.

For stratified simple random sampling, the population $U$ is divided into $M$ mutually exclusive subpopulations
$U_1, \ldots, U_M$, which together cover the entire population. If $N_h$ is the size of stratum
$h$ ($h = 1, \ldots, M$), then it should hold that $N_1 + \ldots + N_M = N$. Accordingly, the population values of the
target variable in stratum $h$ are denoted as $Y_1^{(h)}, \ldots, Y_{N_h}^{(h)}$, with $Y^{(h)} = \sum_{k=1}^{N_h} Y_k^{(h)}$ and $\bar{Y}^{(h)} = \frac{Y^{(h)}}{N_h}$.
The sample is naturally also drawn from these strata, which results in the following notational adjustments.
First, the sample also consists of $M$ mutually independent strata with sizes $n_1, \ldots, n_M$, with
$n_1 + \ldots + n_M = n$. The target variables observed in the sample are denoted for stratum $h$ by $y_1^{(h)}, \ldots, y_{n_h}^{(h)}$.
The inclusion expectations per stratum are similar to those of simple random sampling, except for the
adjustment for the strata. Now the set $\nabla_h$ consists of $\binom{N_h}{n_h}$ samples per stratum, such that there are
$\binom{N_h-1}{n_h-1}$ samples containing element $k$, and $\binom{N_h-2}{n_h-2}$ samples containing elements $k$ and $l$. Therefore,

$$\pi_k^{(h)} = \frac{n_h}{N_h}, \quad \pi_{kl}^{(h)} = \frac{n_h(n_h-1)}{N_h(N_h-1)}, \quad \text{and} \quad \pi_{kl}^{(hh')} = \pi_k^{(h)} \times \pi_l^{(h')},$$

where $h$ and $h'$ are two different strata. The last inclusion expectation follows from the fact that the
strata are drawn independently. Note that the ratio $\frac{n_h}{N_h}$ is called the sample fraction $f_h$.
Plugging in the inclusion expectations, it follows that the HT estimator per stratum equals

$$\hat{\bar{y}}_{HT}^{(h)} = \frac{1}{n_h} \sum_{i=1}^{n_h} y_i^{(h)} = \hat{\bar{y}}^{(h)}. \qquad (2.2)$$

The variance of this estimator is

$$Var(\hat{\bar{y}}_{HT}^{(h)}) = \frac{1-f_h}{n_h} \Psi_h^2,$$

where $\Psi_h^2$ is the variance in stratum $h$, defined as $\Psi_h^2 = \frac{1}{N_h - 1} \sum_{k=1}^{N_h} (Y_k^{(h)} - \bar{Y}^{(h)})^2$. Since $\Psi_h^2$
is in general not available, $Var(\hat{\bar{y}}_{HT}^{(h)})$ has to be estimated
using the sample variance in stratum $h$, $\hat{\psi}_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_i^{(h)} - \hat{\bar{y}}^{(h)})^2$, which provides the unbiased
estimator

$$\widehat{Var}(\hat{\bar{y}}_{HT}^{(h)}) = \frac{1-f_h}{n_h} \hat{\psi}_h^2.$$

Note that so far the estimator and its variance were discussed at the stratum level. To get the HT
estimator for the population mean $\bar{Y}$, the estimators at the stratum level are combined as a weighted sum:

$$\hat{\bar{y}}_{STR} = \sum_{h=1}^{M} \frac{N_h}{N} \hat{\bar{y}}^{(h)},$$

and similarly the variance of this HT estimator and the estimate of this variance are obtained by

$$Var(\hat{\bar{y}}_{STR}) = \sum_{h=1}^{M} \frac{N_h^2}{N^2} Var(\hat{\bar{y}}^{(h)}), \quad \text{and} \quad \widehat{Var}(\hat{\bar{y}}_{STR}) = \sum_{h=1}^{M} \frac{N_h^2}{N^2} \widehat{Var}(\hat{\bar{y}}^{(h)}).$$

These properties hold because the samples for each stratum $h$ are drawn independently, and the variance
of a sum of independent random variables is the sum of the variances. This is the main strength of stratified
simple random sampling, since only the variance within the strata enters the variance of the estimator.
The variance between the strata is eliminated, which improves the precision.
Therefore, when looking for auxiliary variables as stratification variables, one should look for subpopulations
which are as homogeneous as possible.
The allocation issue concerns the distribution of the stratum sample sizes $n_1, \ldots, n_M$. Two methods
to allocate the total sample size $n$ over the $M$ strata are known as the Neyman allocation and the
proportional allocation, respectively:

$$n_h = \frac{N_h \Psi_h}{\sum_{h=1}^{M} N_h \Psi_h} n, \quad \text{and} \quad n_h = \frac{N_h}{N} n.$$
It can be shown that the optimal allocation, i.e. the allocation with minimum variance, is obtained if
the Neyman allocation is used. However, this requires that $\Psi_h$ is known. If that is not the case, the
proportional allocation can be used, which assumes that the variances of all strata have the same order
of magnitude. Note that proportional allocation simplifies the first order inclusion expectations to $\pi_k = \frac{n}{N}$, such that
all population elements, irrespective of their stratum, have the same first order inclusion expectation. A
further discussion of stratified sampling is found in Bethlehem (2009).
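The following Python sketch, loosely mirroring the salary example above with invented strata and sizes, computes both allocations and the stratified estimator with its variance estimate; for the Neyman allocation the stratum variances $\Psi_h$ are taken as known, and it should yield the smaller variance.

import numpy as np

rng = np.random.default_rng(3)
strata = [rng.normal(30000, 4000, 6000),     # stratum 1, N_1 = 6000 (invented)
          rng.normal(45000, 9000, 4000)]     # stratum 2, N_2 = 4000 (invented)
N_h = np.array([len(U) for U in strata])
N = N_h.sum()
psi_h = np.array([U.std(ddof=1) for U in strata])  # Psi_h, assumed known here
n = 500

n_prop = np.round(N_h / N * n).astype(int)                              # proportional
n_neyman = np.round(N_h * psi_h / np.sum(N_h * psi_h) * n).astype(int)  # Neyman

for n_h in (n_prop, n_neyman):
    s = [rng.choice(U, size=m, replace=False) for U, m in zip(strata, n_h)]
    est = np.sum(N_h / N * np.array([x.mean() for x in s]))   # stratified estimator
    var = np.sum((N_h / N) ** 2 * (1 - n_h / N_h) / n_h       # its variance estimate
                 * np.array([x.var(ddof=1) for x in s]))
    print(n_h, est, var)   # the Neyman allocation gives the smaller variance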

2.3.4 Two stage sampling

A more general sampling design is known as two stage sampling. Here, in the first stage a sample is
drawn from the $M$ subpopulations, whereas in the second stage a sample is drawn within the
subpopulations selected in the first stage. The units drawn in the first stage are referred to as primary
sampling units, whereas those drawn in the second stage are called secondary sampling units. An example
is when the primary sampling units are Dutch municipalities and the secondary sampling units are
households, such that in the first stage a number of municipalities is drawn and then a sample is taken
from the households in the selected municipalities. Note that stratification as discussed previously is a
special case of two stage sampling, namely the case where all primary sampling units are selected and a
sample is drawn from the secondary sampling units.
Cluster sampling does exactly the opposite: a sample is drawn from the primary sampling units, and
from the selected primary sampling units all secondary sampling units are observed. While stratified
sampling is relatively precise compared to simple random sampling, both share a common disadvantage:
when taking a sample, a considerable amount of travel time is involved, because the respondents are located
throughout the whole country. Two stage cluster sampling tackles this problem by drawing a random
sample from the primary sampling units. Because the elements are only observed within the selected
clusters, the travel time is smaller compared to stratified sampling. However, this also increases the
variance: since only a few primary sampling units are selected, the observations under cluster sampling
are less spread out over the population. For the formulas for cluster sampling and two stage sampling,
including the HT estimators, the reader is referred to Särndal et al. (1992).
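As an illustration, drawing a two stage sample takes only a few lines of Python; the municipality and household labels below, as well as the stage sizes, are invented.

import numpy as np

rng = np.random.default_rng(4)
# A fictitious sample frame: 50 municipalities with 200-2000 households each
frame = {f"municipality_{i}": [f"household_{i}_{j}" for j in range(int(rng.integers(200, 2000)))]
         for i in range(50)}

# Stage 1: simple random sample of 10 primary sampling units (municipalities)
psus = rng.choice(list(frame), size=10, replace=False)
# Stage 2: simple random sample of 20 households within each selected municipality
sample = {m: rng.choice(frame[m], size=20, replace=False) for m in psus}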

2.4 Estimators

In the previous subsection several sampling designs were considered, and two ways of improving the precision
of the results by incorporating auxiliary information in the design stage were discussed, while
the estimator was kept fixed at the Horvitz-Thompson estimator. In this subsection, by contrast,
the auxiliary information is incorporated in the estimator, and the corresponding estimators for the
sampling designs considered in the previous subsection are presented.
Following the derivations of Renssen (1998) and Särndal et al. (1992), using (2.1) the generalized regression
estimator is defined as

$$\hat{\bar{y}}_{greg} = \hat{\bar{y}}_{HT} + \hat{b}'(\bar{X} - \hat{\bar{x}}_{HT}), \quad \text{where} \quad \hat{b} = \left( \sum_{k \in s} \frac{x_k \lambda_k x_k'}{\pi_k} \right)^{-1} \sum_{k \in s} \frac{x_k \lambda_k y_k}{\pi_k}. \qquad (2.3)$$
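A direct transcription of (2.3) into Python might look as follows, with $\lambda_k = 1$, one auxiliary variable plus an intercept, simple random sampling (so $\pi_k = n/N$), and invented data; the known population mean $\bar{X}$ pulls the estimate toward the true value.

import numpy as np

rng = np.random.default_rng(5)
N, n = 5000, 200
X_pop = np.column_stack([np.ones(N), rng.normal(50, 10, N)])  # intercept + auxiliary variable
Y_pop = 2.0 * X_pop[:, 1] + rng.normal(0, 5, N)               # target related to the auxiliary
X_bar = X_pop.mean(axis=0)                                    # population mean, assumed known

idx = rng.choice(N, size=n, replace=False)                    # SRS, so pi_k = n/N
x, y, pi = X_pop[idx], Y_pop[idx], n / N

b = np.linalg.solve(x.T @ x / pi, x.T @ y / pi)               # b-hat from (2.3), lambda_k = 1
y_ht = np.sum(y / pi) / N                                     # HT estimator of the mean of y
x_ht = np.sum(x / pi, axis=0) / N                             # HT estimator of the means of x
y_greg = y_ht + b @ (X_bar - x_ht)                            # GREG estimate (2.3)
print(y_ht, y_greg, Y_pop.mean())  # GREG is typically closer to the true mean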
