[DDI-users] DDI-users Digest, Vol 105, Issue 6 (SAS/Stata extended missings)

Hoyle, Larry larryhoyle at ku.edu
Tue Jun 24 12:01:56 EDT 2014


I would not minimize the utility of avoiding accidental treatment of missing values as valid values. This is one reason that R has NA. Also, when more than one variable is involved, having explicit missing values allows for the possibility of doing things other than casewise deletion.

DDI3.2 allows for the specification of a codelist describing missing values even for continuous (numeric) variables. Such a codelist can be applied to numeric values (like -1 = 'Refused' in your original example) or other values (like .r = 'Refused') so this achieves a level of software independence.
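
As a rough illustration of that software independence, here is a minimal Python sketch (Python is an assumption here, as are all names in it; this is not DDI syntax). It decodes the same "Refused" reason from either a numeric code or an extended code by consulting a codelist.

```python
# Hypothetical codelists: the same missing-value reasons expressed with
# two different sets of codes (numeric sentinels vs. SAS/Stata-style tags).
NUMERIC_CODES = {-1: "Refused", -2: "don't remember"}
EXTENDED_CODES = {".r": "Refused", ".d": "don't remember"}

def decode(raw, codelist):
    """Return (value, reason); value is None when raw is a listed missing code."""
    if raw in codelist:
        return None, codelist[raw]
    return float(raw), None

print(decode(-1, NUMERIC_CODES))     # (None, 'Refused')
print(decode(".r", EXTENDED_CODES))  # (None, 'Refused')
print(decode(3, NUMERIC_CODES))      # (3.0, None)
```

Whichever coding the source package uses, the decoded result is the same, which is the point of describing the codes in the metadata.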

The differences among software packages in their treatment of missing values are deeper than just syntax. Those that allow distinguishing among types of missing are allowing for the assignment of semantics to the different types of missing, but they are still missing values. Missing values with different meanings may even need to be treated differently. For imputation, “missing at random” has different implications than “missing by design”.

--- Larry Hoyle


From: ddi-users-bounces at icpsr.umich.edu [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dusa
Sent: Tuesday, June 24, 2014 10:13 AM
To: Data Documentation Initiative Users Group
Subject: Re: [DDI-users] DDI-users Digest, Vol 105, Issue 6 (SAS/Stata extended missings)

Yes I understand the need for the special values, and your answer just confirms what I had thought: they're needed only to avoid <accidental> treatment as valid values.

This shouldn't be the case, however, if the researcher really knows what to do.
As you say, in SPSS the value -1 can be declared as missing, and in R one could simply calculate the mean of a vector excluding the negative values, or create a copy of the entire dataset where all the negative values are replaced with NAs.
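
A minimal sketch of those two options, written in Python rather than R purely for illustration (the data and variable names are made up):

```python
from statistics import mean

# Hypothetical responses: 1-4 are valid categories, negatives are missing codes.
timetopg = [1, 2, -1, 4, 3, -2, 2]

# Option 1: compute the mean over the non-negative values only.
mean_valid = mean(v for v in timetopg if v >= 0)

# Option 2: make a cleaned copy in which every negative code becomes None,
# then compute on the copy.
cleaned = [v if v >= 0 else None for v in timetopg]
mean_cleaned = mean(v for v in cleaned if v is not None)

# For comparison: the mean if the codes are left in by mistake.
naive = mean(timetopg)

print(mean_valid, mean_cleaned, naive)
```

Both options give the same mean; the naive figure is pulled down by the negative codes.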

This is simply a matter of procedure, which is software independent. However, <special> numeric values are not software independent, while using DDI as a common gateway should (in principle) be universal, rather than specific.

No matter what the software package is, when a researcher is careful to replace the respective negative values with missings, there would be no mistakes... which kind of works along my argument with cross-portability.

The DDI file could contain only valid numeric values, and then researchers might convert those to any kind of special missings depending on software.

Adrian

On Tue, Jun 24, 2014 at 4:56 PM, Hoyle, Larry <larryhoyle at ku.edu> wrote:
In SAS the values ._ and .a - .z, and in Stata the values .a - .z, are special numeric values that are treated as missing. In SAS they compare less than the smallest valid value; in Stata they compare greater than any valid value.

If you use -1, for example, to represent “refused” and compute a mean on the variable the -1 will be included in the computation – not ignored.
Using a scheme like
value timetopg
1 = '1-2 mos'
2 = '3-5 mos'
3 = '6-12 mos'
4 = ' > 1 yr'
.r = 'Refused'
.d = "don't  remember"
.s  = 'set to missing by rule'
.o = 'other missing'
;

Would allow you to compute statistics ignoring the missing values as well as tabulations using the missing values (e.g. computing the % refused).
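
To make both uses concrete, here is a small Python sketch (an illustration only; strings stand in for the out-of-band codes, and the data are invented):

```python
from collections import Counter
from statistics import mean

# Hypothetical labels mirroring part of the value list above.
LABELS = {1: "1-2 mos", 2: "3-5 mos", 3: "6-12 mos", 4: "> 1 yr",
          ".r": "Refused", ".d": "don't remember"}

# Hypothetical data: strings play the role of the special missing codes.
timetopg = [1, 2, ".r", 4, 3, ".d", 2, ".r"]

# Statistics: only real numbers take part, so a tagged missing
# cannot slip into the computation.
valid = [v for v in timetopg if not isinstance(v, str)]
valid_mean = mean(valid)

# Tabulation: every code, missing or not, is counted.
counts = Counter(LABELS[v] for v in timetopg)
pct_refused = 100 * counts["Refused"] / len(timetopg)

print(valid_mean, pct_refused)
```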

In packages like SPSS one can specify that otherwise valid values (like -1 in your example) can be treated as missing. The advantage of using “out of band” values is that they cannot accidentally be treated as valid values.

R, I believe, only has two missing values: NA and NaN. In order to prevent treating -1 through -4 as valid values in your example in R, you would need to transform the variable to convert all of these values to NA. If you are moving data from any software that allows multiple missing values (SPSS, SAS or Stata) to R, you may need to use NA as the missing value for all of the categories and perhaps create a secondary variable preserving the different values of missing.
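
The secondary-variable idea might look roughly like this. The sketch is in Python rather than R, with NaN standing in for NA and hypothetical names throughout:

```python
import math

# Hypothetical extended-missing codes arriving from SAS/Stata-style data.
REASONS = {".r": "Refused", ".d": "don't remember"}

def split_missing(values):
    """Return (numeric column with NaN for missing, parallel reason column)."""
    numbers, reasons = [], []
    for v in values:
        if v in REASONS:
            numbers.append(math.nan)    # single generic missing value
            reasons.append(REASONS[v])  # the kind of missing is kept here
        else:
            numbers.append(float(v))
            reasons.append(None)
    return numbers, reasons

numbers, reasons = split_missing([1, ".r", 3, ".d"])
print(numbers)
print(reasons)
```

The numeric column is then safe to analyze, while the reason column keeps what would otherwise be lost in the conversion.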



--- Larry Hoyle


From: ddi-users-bounces at icpsr.umich.edu [mailto:ddi-users-bounces at icpsr.umich.edu] On Behalf Of Adrian Dusa
Sent: Tuesday, June 24, 2014 4:03 AM
To: Data Documentation Initiative Users Group
Subject: Re: [DDI-users] DDI-users Digest, Vol 105, Issue 6 (SAS/Stata extended missings)

Hi Bob,

I've never used SAS, but have to ask something regarding these different types of missings.
Is there any particular advantage of .r, .d and .m over something like:

value timetopg
1 = '1-2 mos'
2 = '3-5 mos'
3 = '6-12 mos'
4 = ' > 1 yr'
-1 = 'Refused'
-2 = "don't  remember"
-3 = 'set to missing by rule'
-4 = 'other missing'
;

I'm thinking about cross portability of these codes, and the above suggestion would work (I think) in every statistical package while .d and .r etc are specific for SAS only.

Thanks,
Adrian



On Mon, Jun 23, 2014 at 7:49 PM, Bob McConnaughey <bobmcconn at gmail.com> wrote:
i suspect i'm belaboring the obvious here, but here's how SAS treats numeric missings
SAS numeric missings appear to be "character strings" - but they are treated, within SAS (and Stata i believe) as "invented" numbers; in SAS they are smaller than the "smallest" negative number (Stata sorts its missings above every valid number instead).  eg in SAS -1*10**10000 > .z > .a > . > ._ ;  (though i don't think i've ever seen "._" used).  However their great virtues are: 1. As "known" missings they automatically get excluded from computations involving the variable they represent.  2. Like any other value (character or numeric) they can be described using formats.  That is, when you do, say, a frequency proc and assign formats to the missings you'd see something like:
time_to_pregnancy1
value timetopg
  1-2 = '1-2 mos'
  3-5 = '3-5 mos'
  6-12 = '6-12 mos'
 13-high = ' > 1 yr'
 .r         = 'Refused'
 .d        = "don't  remember"
 .m       = 'set to missing by rule'
.          = 'other missing'
;
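
The SAS ordering described above (._ below ., below .a through .z, below every real number) can be mimicked with a sort key. This is only a Python sketch with hypothetical names, not anything SAS itself exposes:

```python
import string

# Missing codes from lowest to highest, following the SAS ordering
# described above: ._ < . < .a < ... < .z < any number.
MISSING_ORDER = ["._", "."] + ["." + c for c in string.ascii_lowercase]

def sas_sort_key(v):
    """Sort key that places every missing code below every real number."""
    if isinstance(v, str):
        return (0, MISSING_ORDER.index(v))
    return (1, v)

values = [3, ".z", -5, ".", "._", ".a"]
ordered = sorted(values, key=sas_sort_key)
print(ordered)  # ['._', '.', '.a', '.z', -5, 3]
```
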
Value labels are the equivalent SPSS feature (i think; i haven't used SPSS in 35 yrs) and even now most of our original questionnaires use "out of range" numbers for special missing values.  But the number of times post-docs and researchers have come up with funky basic descriptive statistics because, oh, "99" was used for a missing value for "height_inches" is well nigh uncountable.  And matters are getting worse because there's a general tendency to not use codebooks any more; instead projects rely on "annotated questionnaires" and SAS "proc contents".  I am very much hoping to get people here to go back to using codebooks, and the various DDI products SHOULD be convincing (well, convincing for people other than the small group of reproductive epidemiology researchers I work with most closely).

thanks for the responses!
Bob McC....

"At times like this, an adult needs a drink."
Dance, Dance, Dance.  H. Murakami




_______________________________________________
DDI-users mailing list
DDI-users at icpsr.umich.edu
http://lists.icpsr.umich.edu/mailman/listinfo/ddi-users



--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
1, Schitu Magureanu Bd.
050025 Bucharest sector 5
Romania
Tel.:+40 21 3126618 \
        +40 21 3120210 / int.101
Fax: +40 21 3158391
