[DDI-users] standard missing values via DDI

Hoyle, Larry larryhoyle at ku.edu
Sun Jul 13 12:49:10 EDT 2014


Attributes for variables might be a good approach in R, but what about in other software packages?  When working in any given package there will be an approach that makes the most sense. The problem is that these approaches are not all completely compatible.

The question I'm interested in is how to represent data in a non-proprietary format that can preserve all of the representations used in different packages and that is least likely to produce serious errors when reimported (such as treating a missing-value code of "9" as a valid value).
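
A rough sketch of that reimport risk, in Python with only the standard library. Everything named here is made up for illustration: the columns q1 and kids, the codes 8 and 9, the MISSING_CODES table and the read_with_missing function. In practice the table of missing codes would be read from the metadata rather than hard-coded. The sketch keeps the category of each missing value in a companion entry, so that 9 is blanked out for q1 but left alone for kids, where it is a valid count.

```python
import csv
import io

# Hypothetical data: 9 means "refused" for q1 but is a valid count for kids.
RAW = "CaseID,q1,kids\n1,3,0\n2,9,9\n3,8,2\n"

# Hypothetical per-variable missing codes, as they might be read from metadata.
MISSING_CODES = {"q1": {8: "don't know", 9: "refused"}}


def read_with_missing(text, missing_codes):
    """Replace each declared missing code by None and keep its category
    in a companion '<name>_missing' entry of the same row."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        out = {}
        for name, cell in rec.items():
            # An empty cell is treated as system missing.
            value = int(cell) if cell != "" else None
            codes = missing_codes.get(name)
            if codes is None:
                # No missing codes declared for this variable: keep as is.
                out[name] = value
                continue
            reason = codes.get(value)
            out[name] = None if reason else value
            out[name + "_missing"] = reason
        rows.append(out)
    return rows


rows = read_with_missing(RAW, MISSING_CODES)
```

Whether the category is carried in an extra variable, as here, or in an attribute attached to the variable is exactly the design choice under discussion; the sketch only shows that the codes have to be declared per variable for the import to be safe.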

One approach might be to embed both data and metadata in a single DDI instance (DDI allows data in a Dataset element). I think that this would be a very good approach, but not much software can currently import data from this representation easily.
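
To make the idea of one file carrying both data and metadata concrete, here is a minimal Python sketch. The element and attribute names (instance, variable, missing, record, value) are invented for the example and are not taken from the DDI schema, and load_instance is a hypothetical helper. It only shows that a reader can resolve each cell against the missing-value declarations found in the same document.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: these element and attribute names are invented for
# illustration and are NOT the DDI schema.
DOC = """
<instance>
  <variable name="q1">
    <missing code="8" label="don't know"/>
    <missing code="9" label="refused"/>
  </variable>
  <variable name="kids"/>
  <record><value var="q1">9</value><value var="kids">9</value></record>
  <record><value var="q1">3</value><value var="kids">0</value></record>
</instance>
"""


def load_instance(xml_text):
    """Return one dict per record, tagging each cell as valid or missing."""
    root = ET.fromstring(xml_text)
    # Map variable name -> {code as text: label} from the embedded metadata.
    missing = {
        var.get("name"): {m.get("code"): m.get("label")
                          for m in var.findall("missing")}
        for var in root.findall("variable")
    }
    records = []
    for rec in root.findall("record"):
        row = {}
        for cell in rec.findall("value"):
            name, text = cell.get("var"), cell.text
            label = missing.get(name, {}).get(text)
            if label is not None:
                row[name] = ("missing", label)
            else:
                row[name] = ("valid", int(text))
        records.append(row)
    return records


records = load_instance(DOC)
```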




----- Larry Hoyle

-----Original Message-----
From: ddi-users-bounces at icpsr.umich.edu [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dușa
Sent: Friday, July 11, 2014 8:10 AM
To: Data Documentation Initiative Users Group
Subject: Re: [DDI-users] standard missing values via DDI

Hi Larry,

Thanks for this paper, it's very useful.
By "including extra variables" do you mean including them in the original dataset?

I think this would be hardly useful and actually very confusing for the average user. Instead, I believe using attributes for each variable is a much better option (examples in the previous messages).

Software that moves data among different platforms could use DDI in the future as an exchange standard. It is very easy to both read and write XML files, and "attributes" are native to XML DDI documents.

Best,
Adrian

On Thu, Jul 10, 2014 at 5:42 PM, Hoyle, Larry <larryhoyle at ku.edu> wrote:
> I've attached a pdf of what could turn into a best-practice paper on missing value representation. I think that this is a fairly complex issue and currently I don't think any tools for moving data among platforms deal with all of the complexities.
>
> One important question is how should data intended for transport either across time or software packages (or both) be best arranged in a non-proprietary format. I lean toward including extra variables that code the categorization of the missing data.
>
> --- Larry Hoyle
>
>
> -----Original Message-----
> From: ddi-users-bounces at icpsr.umich.edu 
> [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dușa
> Sent: Wednesday, June 25, 2014 11:25 AM
> To: Data Documentation Initiative Users Group
> Subject: Re: [DDI-users] standard missing values via DDI
>
> Excellent.
> Looking forward to seeing a concrete DDI XML example and putting this procedure to the test...
>
> Adrian
>
> On Wed, Jun 25, 2014 at 7:13 PM, Wendy Thomas <wlt at umn.edu> wrote:
>> OK. Each variable declares BOTH its valid value representation and its
>> missing value representation. Missing value representations are
>> managed structures which can be described by any combination of a 
>> code/numeric/text representation. In addition a default missing value 
>> can be declared for a logical record or for a physical data file.
>>
>> So in effect all variables using the same set of missing values would
>> each reference the same managed missing value description. If a
>> missing value is not an option (i.e. it must have a valid value) then 
>> no MissingValuesReference would be included in the 
>> Variable/VariableRepresentation.
>>
>> Regarding identification of CaseID: see 
>> DataRelationship/LogicalRecord/CaseIdentification/
>>
>>
>> On Wed, Jun 25, 2014 at 10:52 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>>>
>>> Not sure...
>>> It has to be variable specific, because each variable has different 
>>> cases with missing data.
>>>
>>> But if the CodeList contains information for <each> variable which 
>>> has missing data, then it's ok. I was thinking about embedding this 
>>> kind of information inside each variable, but a reference to a 
>>> CodeList might also be an idea (provided the above).
>>>
>>> My previous email needs a slight correction: the numbers 1, 5, 8, 9,
>>> 15 and 78 should not be line numbers but rather unique identifiers 
>>> (sort of a Primary Key) for the cases where the missing values are 
>>> found.
>>>
>>> IMPORTANT: in this case, we also need to know which variable in the 
>>> dataset contains the unique identifiers (ex. "CaseID").
>>>
>>> That actually solves all matters, because I can automatically create 
>>> the necessary commands in the specific setup file(s) which will 
>>> replace missing with the specific desired values depending on the 
>>> statistical package.
>>>
>>> In SPSS, for a hypothetical variable "Age" it would be something like this:
>>>
>>> DO IF (CaseID = 1 | CaseID = 5 | CaseID = 9).
>>> RECODE Age (SYSMIS = -1).
>>> END IF.
>>> EXECUTE.
>>>
>>> I'm sure that SAS and Stata are much easier to work with, and R is just trivial:
>>> mydata$Age[mydata$CaseID %in% c(1, 5, 9)] <- -1
>>>
>>> On Wed, Jun 25, 2014 at 5:31 PM, Wendy Thomas <wlt at umn.edu> wrote:
>>> >
>>> > Does this 3.2 structure do what you need? It can be referenced
>>> > from any variable, and noted as the default missing values for a
>>> > LogicalRecord and a Physical Instance.
>>> >
>>> > <r:ManagedMissingValuesRepresentation>
>>> >   <!-- identification and other versionable type information left off -->
>>> >   <r:ManagedMissingValuesRepresentationName>Combined Missing Types</r:ManagedMissingValuesRepresentationName>
>>> >   <r:MissingCodeRepresentation>
>>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>>> >     <r:CodeListReference/>  <!-- to a CodeList with name Missing at Random -->
>>> >   </r:MissingCodeRepresentation>
>>> >   <r:MissingCodeRepresentation>
>>> >     <r:RecommendedDataType>integer</r:RecommendedDataType>
>>> >     <r:CodeListReference/>  <!-- to a CodeList with name Missing by Design -->
>>> >   </r:MissingCodeRepresentation>
>>> > </r:ManagedMissingValuesRepresentation>
>>> >
>>> >
>>> > On Wed, Jun 25, 2014 at 2:39 AM, Adrian Dușa <dusa.adrian at unibuc.ro> wrote:
>>> >>
>>> >> Dear All,
>>> >>
>>> >> Following a private discussion, an idea emerged that I think is
>>> >> useful to circulate and discuss.
>>> >>
>>> >> From what I understand, SAS codes special missing values as
>>> >> extremely low values, while Stata went the opposite way,
>>> >> coding them as extremely large values.
>>> >>
>>> >> Those are decisions which are software specific, and it is 
>>> >> unlikely that other software packages will follow one trend or 
>>> >> another.
>>> >>
>>> >> There might be a way to solve all particular needs, using DDI as 
>>> >> a mediator and most importantly using only "normal" values.
>>> >>
>>> >> The main quest is to differentiate between missing values. In R, 
>>> >> and I'm sure DDI can do that too, each variable can be attached 
>>> >> with a list of attributes. One such component of the list of 
>>> >> attributes could be dedicated to the missing values, and further 
>>> >> differentiate within:
>>> >> - "missing at random": 1, 5, 9
>>> >> - "missing by design": 8, 15, 78
>>> >>
>>> >> Here, the (simple integer) numbers 1, 5, 8, 9, 15 and 78 are 
>>> >> nothing but the indexes of the line numbers (ie the cases) where 
>>> >> the missing values reside in a particular variable.
>>> >>
>>> >> If I had this kind of information in the DDI XML file, I could 
>>> >> then instruct my R function to create <specific> setup files for 
>>> >> SAS or Stata using .r and .d in those specific cases, while in R 
>>> >> all missing values could remain as simple NAs but users can still 
>>> >> differentiate between missings by just looking at the list of 
>>> >> attributes.
>>> >>
>>> >> This way it would accomplish the other need to avoid accidental
>>> >> mistakes, and it is both package independent and specific at the
>>> >> same time, using DDI as an exchange platform.
>>> >>
>>> >> Recoding specific missing values is trivial in R, but I have to 
>>> >> confess I don't know if and how this might be done in other 
>>> >> software via setup files.
>>> >> People using specific software packages might confirm if this 
>>> >> approach is possible or not. Raw data should be read by all 
>>> >> packages from a .csv file where missing values are system missing
>>> >> (empty) values.
>>> >>
>>> >> Best wishes,
>>> >> Adrian
>>> >>
>>> >>
>>> >> --
>>> >> Adrian Dusa
>>> >> University of Bucharest
>>> >> Romanian Social Data Archive
>>> >> 1, Schitu Magureanu Bd.
>>> >> 050025 Bucharest sector 5
>>> >> Romania
>>> >> Tel.:+40 21 3126618 \
>>> >>         +40 21 3120210 / int.101
>>> >> Fax: +40 21 3158391
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> DDI-users mailing list
>>> >> DDI-users at icpsr.umich.edu
>>> >> http://lists.icpsr.umich.edu/mailman/listinfo/ddi-users
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Wendy L. Thomas                              Phone: +1 612.624.4389
>>> > Data Access Core Director                 Fax:   +1 612.626.8375
>>> > Minnesota Population Center             Email: wlt at umn.edu
>>> > University of Minnesota
>>> > 50 Willey Hall
>>> > 225 19th Avenue South
>>> > Minneapolis, MN 55455



--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Sos. Panduri nr.90
050663 Bucharest sector 5
Romania
