[DDI-users] standard missing values via DDI

Hoyle, Larry larryhoyle at ku.edu
Thu Jul 10 10:42:36 EDT 2014


I've attached a PDF of what could turn into a best-practice paper on missing value representation. This is a fairly complex issue, and I don't think any current tool for moving data among platforms deals with all of the complexities.

One important question is how data intended for transport across time, across software packages, or both should best be arranged in a non-proprietary format. I lean toward including extra variables that code the categorization of the missing data.
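
A minimal sketch of that arrangement, assuming one companion column per substantive variable. The column name Age_missing, the record layout, and the category labels are illustrative assumptions, not part of any standard:

```python
import csv
import io

# Hypothetical records: Age is None where the value is missing, and
# "reason" holds an illustrative category label for the missing value.
records = [
    {"CaseID": 1, "Age": None, "reason": "missing at random"},
    {"CaseID": 2, "Age": 34, "reason": None},
    {"CaseID": 8, "Age": None, "reason": "missing by design"},
]


def to_csv(rows):
    """Write Age as an empty cell when missing, plus a companion column
    (Age_missing, an assumed name) that records why it is missing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["CaseID", "Age", "Age_missing"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {
                "CaseID": row["CaseID"],
                "Age": "" if row["Age"] is None else row["Age"],
                "Age_missing": row["reason"] or "",
            }
        )
    return buffer.getvalue()


print(to_csv(records))
```

Any package can read the resulting file as plain values, and the companion column keeps the distinction between kinds of missing data without relying on package-specific missing codes.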

--- Larry Hoyle


-----Original Message-----
From: ddi-users-bounces at icpsr.umich.edu [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dușa
Sent: Wednesday, June 25, 2014 11:25 AM
To: Data Documentation Initiative Users Group
Subject: Re: [DDI-users] standard missing values via DDI

Excellent.
Looking forward to seeing a concrete DDI XML example and putting this procedure to the test...

Adrian

On Wed, Jun 25, 2014 at 7:13 PM, Wendy Thomas <wlt at umn.edu> wrote:
> OK. Each variable declares BOTH its valid value representation and its 
> missing value representation. Missing value representations are 
> managed structures which can be described by any combination of a 
> code/numeric/text representation. In addition a default missing value 
> can be declared for a logical record or for a physical data file.
>
> So in effect each variable using the same set of missing values would 
> each reference the same managed missing value description. If a 
> missing value is not an option (i.e. it must have a valid value) then 
> no MissingValuesReference would be included in the 
> Variable/VariableRepresentation.
>
> Regarding identification of CaseID: see 
> DataRelationship/LogicalRecord/CaseIdentification/
>
>
> On Wed, Jun 25, 2014 at 10:52 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>>
>> Not sure...
>> It has to be variable specific, because each variable has different 
>> cases with missing data.
>>
>> But if the CodeList contains information for <each> variable which 
>> has missing data, then it's ok. I was thinking about embedding this 
>> kind of information inside each variable, but a reference to a 
>> CodeList might also be an idea (provided the above).
>>
>> My previous email needs a slight correction: the numbers 1, 5, 8, 9,
>> 15 and 78 should not be line numbers but rather unique identifiers 
>> (sort of a Primary Key) for the cases where the missing values are 
>> found.
>>
>> IMPORTANT: in this case, we also need to know which variable in the 
>> dataset contains the unique identifiers (ex. "CaseID").
>>
>> That actually solves all matters, because I can automatically create 
>> the necessary commands in the specific setup file(s) which will 
>> replace missing with the specific desired values depending on the 
>> statistical package.
>>
>> In SPSS, for a hypothetical variable "Age" it would be something like
>> this:
>>
>> DO IF (CaseID = 1 | CaseID = 5 | CaseID = 9).
>> RECODE Age (SYSMIS = -1).
>> END IF.
>> EXECUTE.
>>
>> I'm sure that SAS and Stata are much easier to work with, and R is 
>> just trivial:
>> mydata$Age[mydata$CaseID %in% c(1, 5, 9)] <- -1
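
The setup-file generation described above could be sketched as follows. This is only an illustration under stated assumptions: the metadata dictionary, the function names, and the SPSS code -2 for "missing by design" are hypothetical, while -1 and the letters .r and .d come from the thread.

```python
# Hypothetical metadata for one variable: missing category -> CaseID values.
MISSING_CASES = {
    "missing at random": [1, 5, 9],
    "missing by design": [8, 15, 78],
}

# Assumed per-package codes. ".r" and ".d" follow the letters mentioned in
# the thread; -2 for "missing by design" is an illustrative choice.
STATA_CODES = {"missing at random": ".r", "missing by design": ".d"}
SPSS_CODES = {"missing at random": -1, "missing by design": -2}


def stata_commands(variable, id_variable, cases, codes):
    """Return one Stata `replace` command per missing category."""
    lines = []
    for category, ids in cases.items():
        id_list = ", ".join(str(i) for i in ids)
        lines.append(
            f"replace {variable} = {codes[category]} "
            f"if inlist({id_variable}, {id_list})"
        )
    return lines


def spss_commands(variable, id_variable, cases, codes):
    """Return SPSS syntax that recodes system-missing cells by case ID."""
    lines = []
    for category, ids in cases.items():
        condition = " | ".join(f"{id_variable} = {i}" for i in ids)
        lines += [
            f"DO IF ({condition}).",
            f"RECODE {variable} (SYSMIS = {codes[category]}).",
            "END IF.",
        ]
    lines.append("EXECUTE.")
    return lines


print("\n".join(stata_commands("Age", "CaseID", MISSING_CASES, STATA_CODES)))
print("\n".join(spss_commands("Age", "CaseID", MISSING_CASES, SPSS_CODES)))
```

The same mapping drives every package, so only the small formatting functions differ from one target to the next.
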
>>
>> On Wed, Jun 25, 2014 at 5:31 PM, Wendy Thomas <wlt at umn.edu> wrote:
>> >
>> > Does this 3.2 structure do what you need? It can be referenced from 
>> > any variable, noted as the default missing values for a 
>> > LogicalRecord and a Physical Instance.
>> >
>> > <r:ManagedMissingValuesRepresentation>  (note I've left off the 
>> > identification and other versionable type information)
>> >   <r:ManagedMissingValuesRepresentationName>Combined Missing Types</r:ManagedMissingValuesRepresentationName>
>> >   <r:MissingCodeRepresentation>
>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>> >     <r:CodeListReference/>  (to a CodeList with name Missing at Random)
>> >   </r:MissingCodeRepresentation>
>> >   <r:MissingCodeRepresentation>
>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>> >     <r:CodeListReference/>  (to a CodeList with name Missing by Design)
>> >   </r:MissingCodeRepresentation>
>> > </r:ManagedMissingValuesRepresentation>
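
A tool consuming a structure like the one above might read it along these lines. This is a hedged sketch, not a validated DDI instance: the namespace URI ddi:reusable:3_2, the r:ID child inside r:CodeListReference, and the identifier values are assumptions made for illustration, and identification and versioning information is omitted as in the fragment above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the DDI 3.2 "reusable" schema.
NS = {"r": "ddi:reusable:3_2"}

# Simplified fragment with hypothetical code list identifiers.
FRAGMENT = """\
<r:ManagedMissingValuesRepresentation xmlns:r="ddi:reusable:3_2">
  <r:MissingCodeRepresentation>
    <r:RecommendedDataType>integer</r:RecommendedDataType>
    <r:CodeListReference>
      <r:ID>missing-at-random</r:ID>
    </r:CodeListReference>
  </r:MissingCodeRepresentation>
  <r:MissingCodeRepresentation>
    <r:RecommendedDataType>integer</r:RecommendedDataType>
    <r:CodeListReference>
      <r:ID>missing-by-design</r:ID>
    </r:CodeListReference>
  </r:MissingCodeRepresentation>
</r:ManagedMissingValuesRepresentation>
"""


def missing_code_lists(xml_text):
    """Return (code list ID, recommended data type) for each
    MissingCodeRepresentation found directly under the root element."""
    root = ET.fromstring(xml_text)
    result = []
    for rep in root.findall("r:MissingCodeRepresentation", NS):
        data_type = rep.findtext("r:RecommendedDataType", namespaces=NS)
        code_list_id = rep.findtext("r:CodeListReference/r:ID", namespaces=NS)
        result.append((code_list_id, data_type))
    return result


print(missing_code_lists(FRAGMENT))
```
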
>> >
>> >
>> > On Wed, Jun 25, 2014 at 2:39 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>> >>
>> >> Dear All,
>> >>
>> >> Following a private discussion, an idea emerged that I think is 
>> >> useful to circulate and discuss.
>> >>
>> >> From what I understand, SAS codes special missing values as 
>> >> extremely low values, while Stata went the opposite way, 
>> >> coding them as extremely large values.
>> >>
>> >> Those are decisions which are software specific, and it is 
>> >> unlikely that other software packages will follow one trend or 
>> >> another.
>> >>
>> >> There might be a way to solve all particular needs, using DDI as a 
>> >> mediator and most importantly using only "normal" values.
>> >>
>> >> The main goal is to differentiate between kinds of missing values. 
>> >> In R, and I'm sure DDI can do that too, a list of attributes can 
>> >> be attached to each variable. One component of that list could be 
>> >> dedicated to the missing values, and further differentiate within:
>> >> - "missing at random": 1, 5, 9
>> >> - "missing by design": 8, 15, 78
>> >>
>> >> Here, the (simple integer) numbers 1, 5, 8, 9, 15 and 78 are 
>> >> simply the line numbers (i.e. the cases) where the missing 
>> >> values reside in a particular variable.
>> >>
>> >> If I had this kind of information in the DDI XML file, I could 
>> >> then instruct my R function to create <specific> setup files for 
>> >> SAS or Stata using .r and .d in those specific cases, while in R 
>> >> all missing values could remain as simple NAs but users can still 
>> >> differentiate between missings by just looking at the list of 
>> >> attributes.
>> >>
>> >> This would also meet the other need, avoiding accidental 
>> >> mistakes, and it is both package-independent and package-specific 
>> >> at the same time, using DDI as an exchange platform.
>> >>
>> >> Recoding specific missing values is trivial in R, but I have to 
>> >> confess I don't know if and how this might be done in other 
>> >> software via setup files.
>> >> People using specific software packages might confirm if this 
>> >> approach is possible or not. Raw data should be read by all 
>> >> packages from a .csv file where missing values are system missing 
>> >> (empty) values.
>> >>
>> >> Best wishes,
>> >> Adrian
>> >>
>> >>
>> >> --
>> >> Adrian Dusa
>> >> University of Bucharest
>> >> Romanian Social Data Archive
>> >> 1, Schitu Magureanu Bd.
>> >> 050025 Bucharest sector 5
>> >> Romania
>> >> Tel.:+40 21 3126618 \
>> >>         +40 21 3120210 / int.101
>> >> Fax: +40 21 3158391
>> >>
>> >>
>> >> _______________________________________________
>> >> DDI-users mailing list
>> >> DDI-users at icpsr.umich.edu
>> >> http://lists.icpsr.umich.edu/mailman/listinfo/ddi-users
>> >>
>> >
>> >
>> >
>> > --
>> > Wendy L. Thomas                              Phone: +1 612.624.4389
>> > Data Access Core Director                 Fax:   +1 612.626.8375
>> > Minnesota Population Center             Email: wlt at umn.edu
>> > University of Minnesota
>> > 50 Willey Hall
>> > 225 19th Avenue South
>> > Minneapolis, MN 55455
>> >
>>
>



-------------- next part --------------
A non-text attachment was scrubbed...
Name: MissingValueRepresentations.pdf
Type: application/pdf
Size: 351160 bytes
Desc: MissingValueRepresentations.pdf
Url : http://lists.icpsr.umich.edu/pipermail/ddi-users/attachments/20140710/3e72ab70/attachment-0001.pdf 

