[DDI-users] standard missing values via DDI

Fri Jul 11 09:09:50 EDT 2014

Hi Larry,

Thanks for this paper, it's very useful.
By "including extra variables" do you mean including them in the
original dataset?

I think this would hardly be useful, and it would actually be very
confusing for the average user. Instead, I believe using attributes for
each variable is a much better option (examples in the previous messages).
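
To make the attribute idea concrete, here is a minimal sketch in Python
(any language would do). The dictionary layout, the names missing_attrs,
reason_for and stata_code, and the toy data are hypothetical
illustrations, not part of DDI or of any package. The case IDs and the
.r / .d codes come from the earlier messages; pairing .r with "missing at
random" and .d with "missing by design" is an assumption.

```python
# Hypothetical layout: for each variable, each kind of missing value maps to
# the CaseID values (not line numbers) where it occurs.
missing_attrs = {
    "Age": {
        "missing at random": [1, 5, 9],
        "missing by design": [8, 15, 78],
    }
}

# Assumed mapping from kind of missing value to a Stata-style extended code.
STATA_CODES = {"missing at random": ".r", "missing by design": ".d"}


def reason_for(attrs, variable, case_id):
    """Return the kind of missing value recorded for this case, or None."""
    for reason, case_ids in attrs.get(variable, {}).items():
        if case_id in case_ids:
            return reason
    return None


def stata_code(attrs, variable, case_id):
    """Return the extended missing code for this case, or None if not listed."""
    reason = reason_for(attrs, variable, case_id)
    return STATA_CODES.get(reason)


# Toy data: the dataset itself only carries plain (system) missing values.
data = [
    {"CaseID": 1, "Age": None},
    {"CaseID": 2, "Age": 34},
    {"CaseID": 8, "Age": None},
]

# Report why each missing Age value is missing, using only the attributes.
for row in data:
    if row["Age"] is None:
        print(row["CaseID"], reason_for(missing_attrs, "Age", row["CaseID"]))
```

The dataset stays untouched; the distinction between kinds of missing
values lives entirely in the attributes, which is the point of the
proposal.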

Software that moves data among different platforms could use DDI in the
future as an exchange standard. It is very easy to both read and write
XML files, and "attributes" are native to XML DDI documents.
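
As a rough illustration of how little code the XML side needs, the sketch
below writes per-variable missing-value attributes to XML and reads them
back using only Python's standard library. The element and attribute
names (variables, variable, missing, type, cases) are made up for the
example and are not DDI elements; a real implementation would use the
DDI structures described in the quoted messages.

```python
import xml.etree.ElementTree as ET


def to_xml(attrs):
    """Serialize {variable: {kind: [case ids]}} to an XML string."""
    root = ET.Element("variables")  # hypothetical element names, not DDI
    for name, kinds in attrs.items():
        var = ET.SubElement(root, "variable", name=name)
        for kind, case_ids in kinds.items():
            ET.SubElement(
                var, "missing", type=kind,
                cases=" ".join(str(c) for c in case_ids),
            )
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the XML produced by to_xml back into nested dictionaries."""
    attrs = {}
    for var in ET.fromstring(text).findall("variable"):
        kinds = {}
        for m in var.findall("missing"):
            kinds[m.get("type")] = [int(c) for c in m.get("cases").split()]
        attrs[var.get("name")] = kinds
    return attrs


attrs = {"Age": {"missing at random": [1, 5, 9],
                 "missing by design": [8, 15, 78]}}
xml_text = to_xml(attrs)
print(xml_text)
```

Reading the string back with from_xml(xml_text) yields the same nested
dictionary, so nothing is lost in the round trip.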

Best,
Adrian

On Thu, Jul 10, 2014 at 5:42 PM, Hoyle, Larry <larryhoyle at ku.edu> wrote:
> I've attached a pdf of what could turn into a best-practice paper on missing value representation. I think that this is a fairly complex issue and currently I don't think any tools for moving data among platforms deal with all of the complexities.
>
> One important question is how data intended for transport, across time or software packages (or both), should best be arranged in a non-proprietary format. I lean toward including extra variables that code the categorization of the missing data.
>
> --- Larry Hoyle
>
>
> -----Original Message-----
> From: ddi-users-bounces at icpsr.umich.edu [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dușa
> Sent: Wednesday, June 25, 2014 11:25 AM
> To: Data Documentation Initiative Users Group
> Subject: Re: [DDI-users] standard missing values via DDI
>
> Excellent.
> Looking forward to seeing a concrete DDI XML example and putting this procedure to the test...
>
> Adrian
>
> On Wed, Jun 25, 2014 at 7:13 PM, Wendy Thomas <wlt at umn.edu> wrote:
>> OK. Each variable declares BOTH its valid value representation and its
>> missing value representation. Missing value representations are
>> managed structures which can be described by any combination of
>> code/numeric/text representations. In addition, a default missing value
>> can be declared for a logical record or for a physical data file.
>>
>> So in effect, all variables using the same set of missing values would
>> reference the same managed missing value description. If a
>> missing value is not an option (i.e. the variable must have a valid
>> value) then no MissingValuesReference would be included in the
>> Variable/VariableRepresentation.
>>
>> Regarding identification of CaseID: see
>> DataRelationship/LogicalRecord/CaseIdentification/
>>
>>
>> On Wed, Jun 25, 2014 at 10:52 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>>>
>>> Not sure...
>>> It has to be variable specific, because each variable has different
>>> cases with missing data.
>>>
>>> But if the CodeList contains information for <each> variable which
>>> has missing data, then it's ok. I was thinking about embedding this
>>> kind of information inside each variable, but a reference to a
>>> CodeList might also be an idea (provided the above).
>>>
>>> My previous email needs a slight correction: the numbers 1, 5, 8, 9,
>>> 15 and 78 should not be line numbers but rather unique identifiers
>>> (sort of a Primary Key) for the cases where the missing values are
>>> found.
>>>
>>> IMPORTANT: in this case, we also need to know which variable in the
>>> dataset contains the unique identifiers (ex. "CaseID").
>>>
>>> That actually solves all matters, because I can automatically create
>>> the necessary commands in the specific setup file(s) which will
>>> replace missing with the specific desired values depending on the
>>> statistical package.
>>>
>>> In SPSS, for a hypothetical variable "Age" it would be something like
>>> this:
>>>
>>> DO IF (CaseID = 1 | CaseID = 5 | CaseID = 9).
>>> RECODE Age (SYSMIS = -1).
>>> END IF.
>>> EXECUTE.
>>>
>>> I'm sure that SAS and Stata are much easier to work with, and R is
>>> just trivial:
>>> mydata$Age[mydata$CaseID %in% c(1, 5, 9)] <- -1
>>>
>>> On Wed, Jun 25, 2014 at 5:31 PM, Wendy Thomas <wlt at umn.edu> wrote:
>>> >
>>> > Does this DDI 3.2 structure do what you need? It can be referenced
>>> > from any variable, and noted as the default missing values for a
>>> > LogicalRecord and a Physical Instance.
>>> >
>>> > <r:ManagedMissingValuesRepresentation>
>>> >   <!-- note: I've left off the identification and other versionable type information -->
>>> >   <r:ManagedMissingValuesRepresentationName>Combined Missing Types</r:ManagedMissingValuesRepresentationName>
>>> >   <r:MissingCodeRepresentation>
>>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>>> >     <r:CodeListReference/>  <!-- to a CodeList with name Missing at Random -->
>>> >   </r:MissingCodeRepresentation>
>>> >   <r:MissingCodeRepresentation>
>>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>>> >     <r:CodeListReference/>  <!-- to a CodeList with name Missing by Design -->
>>> >   </r:MissingCodeRepresentation>
>>> > </r:ManagedMissingValuesRepresentation>
>>> >
>>> >
>>> > On Wed, Jun 25, 2014 at 2:39 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>>> >>
>>> >> Dear All,
>>> >>
>>> >> Following a private discussion, an idea emerged that I think is
>>> >> useful to circulate and discuss.
>>> >>
>>> >> From what I understand, SAS codes special missing values as
>>> >> extremely low values, while Stata went the opposite way,
>>> >> coding them as extremely large values.
>>> >>
>>> >> Those are decisions which are software specific, and it is
>>> >> unlikely that other software packages will follow one trend or
>>> >> another.
>>> >>
>>> >> There might be a way to solve all particular needs, using DDI as a
>>> >> mediator and most importantly using only "normal" values.
>>> >>
>>> >> The main task is to differentiate between kinds of missing values.
>>> >> In R, and I'm sure DDI can do that too, a list of attributes can be
>>> >> attached to each variable. One component of that list could be
>>> >> dedicated to the missing values, and further differentiate among
>>> >> them:
>>> >> - "missing at random": 1, 5, 9
>>> >> - "missing by design": 8, 15, 78
>>> >>
>>> >> Here, the (simple integer) numbers 1, 5, 8, 9, 15 and 78 are
>>> >> nothing but the line numbers (i.e. the cases) where
>>> >> the missing values reside in a particular variable.
>>> >>
>>> >> If I had this kind of information in the DDI XML file, I could
>>> >> then instruct my R function to create <specific> setup files for
>>> >> SAS or Stata using .r and .d in those specific cases, while in R
>>> >> all missing values could remain as simple NAs but users can still
>>> >> differentiate between missings by just looking at the list of
>>> >> attributes.
>>> >>
>>> >> This would also meet the other need, avoiding accidental
>>> >> mistakes, and it is both package independent and package specific
>>> >> at the same time, using DDI as an exchange platform.
>>> >>
>>> >> Recoding specific missing values is trivial in R, but I have to
>>> >> confess I don't know if and how this might be done in other
>>> >> software via setup files.
>>> >> People using specific software packages might confirm if this
>>> >> approach is possible or not. Raw data should be read by all
>>> >> packages from a .csv file where missing values are system missing
>>> >> (empty) values.
>>> >>
>>> >> Best wishes,
>>> >> Adrian
>>> >>
>>> >>
>>> >> --
>>> >> Adrian Dusa
>>> >> University of Bucharest
>>> >> Romanian Social Data Archive
>>> >> 1, Schitu Magureanu Bd.
>>> >> 050025 Bucharest sector 5
>>> >> Romania
>>> >> Tel.:+40 21 3126618 \
>>> >>         +40 21 3120210 / int.101
>>> >> Fax: +40 21 3158391
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> DDI-users mailing list
>>> >> DDI-users at icpsr.umich.edu
>>> >> http://lists.icpsr.umich.edu/mailman/listinfo/ddi-users
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Wendy L. Thomas                              Phone: +1 612.624.4389
>>> > Data Access Core Director                 Fax:   +1 612.626.8375
>>> > Minnesota Population Center             Email: wlt at umn.edu
>>> > University of Minnesota
>>> > 50 Willey Hall
>>> > 225 19th Avenue South
>>> > Minneapolis, MN 55455
>>> >
>>>
>>
>

-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Sos. Panduri nr.90
050663 Bucharest sector 5
Romania