[DDI-users] standard missing values via DDI

Sun Jul 13 16:52:17 EDT 2014

On Sun, Jul 13, 2014 at 7:49 PM, Hoyle, Larry <larryhoyle at ku.edu> wrote:
> Attributes for variables might be a good approach in R, but what about in other software packages?  When working in any given package there will be an approach that makes the most sense. The problem is that these are not all completely compatible.
>
> The question I'm interested in is how to represent data in a non-proprietary format that is capable of preserving all of the representations in different packages which will also be the least likely to produce serious errors when reimported (like treating a missing value of "9" as valid).
>
> One approach might be to embed both data and metadata in a single DDI instance (DDI allows data in a Dataset element). I think that this would be a very good approach, but not a lot of software can currently easily import data from this representation.

Maybe I am misinterpreting your approach, but IMHO the packages don't
even need to be compatible. Whatever can be converted from one software
package to another will be converted; the other (package-specific)
missing values will simply remain empty cells in the other software.

Attributes are native to XML, and they can be read from the DDI
metadata to build software-specific setup files that deal with the
missing information. The fact that R supports attributes is only a
coincidence; the other software will get specific codes for specific
kinds of missing information.
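
As a rough illustration of reading such attributes from XML metadata,
here is a minimal Python sketch. The XML fragment, the element name
"variable" and the attribute name "missing" are made-up placeholders for
the example; they are not actual DDI schema names.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragment: the element and attribute names are
# placeholders for illustration, not actual DDI schema names.
METADATA = """
<variables>
  <variable name="income" missing="missing at random"/>
  <variable name="age"/>
</variables>
"""


def read_missing_attributes(xml_text):
    """Return {variable name: missing attribute} for variables that declare one."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for var in root.iter("variable"):
        attr = var.get("missing")
        if attr is not None:
            result[var.get("name")] = attr
    return result


print(read_missing_attributes(METADATA))
```

Variables without the attribute (here "age") are simply skipped, so
nothing is generated for them.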

What I am proposing:
- is non-proprietary (just another piece of information in the DDI metadata)
- can preserve any standard representation from any software
- needs no import support in the software itself: there are tools to
export the attributes to software-specific setup files

For example, take "missing at random": a dedicated code for it (.r)
exists in SAS, and probably in Stata, but not in R or SPSS.

The .csv data contains no .r values, only empty cells, and the DDI
metadata doesn't contain .r either, only the (standard) attribute
"missing at random". This surely is non-proprietary.

The setup file exporter generates the command to recode the empty
cells to .r for SAS and Stata. If R and SPSS need to preserve this
(non-specific) missing information, it's easy to generate a
supplementary command to recode those empty cells to (say) the value
of -1.
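
A setup file exporter along these lines could be sketched in Python as
follows. This is only a sketch under assumptions: it supposes that every
empty cell in the listed variables carries the "missing at random"
attribute, the dataset is called "data" on the R side, and the command
templates are illustrative and should be checked against each package's
documentation before use.

```python
# Illustrative command templates, one per package. The .r code and the -1
# fallback follow the example in the text; the exact syntax of each command
# is an assumption, not verified output of any existing tool.
TEMPLATES = {
    "SAS": "if {var} = . then {var} = .R;",
    "Stata": "replace {var} = .r if {var} == .",
    "SPSS": "RECODE {var} (SYSMIS = -1).",
    "R": "data${var}[is.na(data${var})] <- -1",
}


def export_setup(package, variables):
    """Return one recode command per variable for the given package."""
    template = TEMPLATES[package]
    return [template.format(var=name) for name in variables]


for package in TEMPLATES:
    print(package, export_setup(package, ["income"]))
```

The data file itself is never touched: only the generated commands differ
from one package to another.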

A future software package might decide to preserve missing information
of the kind "respondent is shy". No current software has a specific
code for it, but as an attribute it's still non-proprietary.

The real matter is to agree on the "standard" attributes for missing
values; the rest is a very simple task of mapping these attributes
between different software packages. It doesn't require including
extra variables, just as it doesn't require the data to be embedded in
the DDI instance.
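
To show how small the mapping task is, here is a hypothetical registry
in Python. The attribute names and codes are illustrative only (no such
standard list has been agreed), and a package with no code for an
attribute gets None, meaning the cell simply stays empty there.

```python
# Hypothetical registry of "standard" missing attributes. The names and the
# package codes are illustrative assumptions, not an agreed standard.
STANDARD_ATTRIBUTES = {
    "missing at random": {"SAS": ".R", "Stata": ".r"},
    "respondent is shy": {},  # no current package has a specific code
}


def code_for(attribute, package):
    """Return the package code for an attribute, or None to leave cells empty."""
    if attribute not in STANDARD_ATTRIBUTES:
        raise KeyError(f"not a standard attribute: {attribute!r}")
    return STANDARD_ATTRIBUTES[attribute].get(package)


print(code_for("missing at random", "Stata"))
print(code_for("missing at random", "SPSS"))
```

Adding support for a new package or a new attribute only means adding an
entry to the registry; neither the data nor the metadata changes.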

-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Sos. Panduri nr.90
050663 Bucharest sector 5
Romania