[DDI-users] data dissemination..how

Fri Feb 18 15:44:52 EST 2011

Hi Bob,

I saw one reply to your email; I don't know whether you have received private answers to the questions you pose below. If I may, I 
would like to offer some thoughts. In my view, DDI provides a framework of constructs for documenting and describing almost 
every process step and decision point in a study. In that regard, DDI is very rich. In parallel, the data is collected and stored in a database, in 
flat files, in a database with references to flat files, etc. If a project decides to use DDI from the beginning, it may use DDI to 
design its questions and instruments; all of this provides the context and metadata (which can be connected to ontologies and terminologies) for 
the data that will be collected when a participant is interviewed. 

I assume the collected data is stored in a database or in flat files.

In addition to the data collected via surveys and instruments, there is the lab data you mentioned. The amount of data 
from the laboratory is quite possibly very large (for example, some projects may collect genomic sequencing data for the mother or newborn, which gives an idea 
of the amount of data). At this juncture, I see that DDI can be used to "document" the data up to a certain point; beyond that, depending on 
the lab work, standards may already exist to provide such documentation. One example comes to mind: if the lab work 
happens to use microarrays for genomic and/or genetic information, there are a number of standards for documenting a microarray 
experiment. The question then is how DDI would allow for such bridging to happen. 

With regard to the question "In general, how would people be inclined to make this type of data public? One file for hCG? another for PdG, 
E1G and CRT and another for LH? or merge them all together into one synthetic file", I wonder about the same thing. I can 
use both {LogicalProduct and PhysicalDataProduct} of DDI to "document" a dataset in absolute (excruciating) detail and still keep the 
data in a file. That's great (in my humble opinion). But it means I will need to write one set of {logical, physical} DDI for each type of 
dataset; this can be overwhelming if one has a lot of different datasets. One solution is to "dump" all the data into one file. However, I don't 
think this is a very good solution, because (1) sometimes it is NOT possible, because of summary and aggregate data, and (2) it puts 
the burden on the consumer of the data to sort out the needed data items. 

With regard to the opinion "FAR too many lotus and Quattro Pro spreadsheets and *.prn files and there seems no compelling reason I can 
imagine to make this morass of "ur-data" public??", I guess this is a question for the PI and the study.  Even if it is far too much data, if releasing it is 
a requirement, then the data will be in the public domain.  However, the data should be documented: for example, the participant profile 
(medical or other factors) when specimens were collected, the collection methods, and perhaps how the specimens were processed.  

I would be very interested to find out how DDI is used to document which algorithms are used in processing the data.   If you find out, I would 
like to hear.  
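
To make that last question concrete, here is a minimal Python sketch of one processing step mentioned in the quoted message below: the geometric mean over a day's hCG replicates. The function name `geometric_mean` and the `floor` parameter are my own hypothetical names, not anything taken from the study's SAS code. The sample rows show 0.01 where both replicates are 0, which might reflect a floor or detection limit; that is only a guess on my part, so the floor is optional here.

```python
import math


def geometric_mean(replicates, floor=None):
    """Geometric mean of the non-missing replicate values.

    replicates: list of numbers, with None standing in for a missing
                value (shown as "." in the sample rows).
    floor:      hypothetical option; if given, values below it are
                raised to it before averaging. This is an assumption,
                not the study's documented rule.
    """
    values = [v for v in replicates if v is not None]
    if not values:
        return None  # nothing was assayed that day
    if floor is not None:
        values = [max(v, floor) for v in values]
    if any(v < 0 for v in values):
        raise ValueError("geometric mean is undefined for negative values")
    if any(v == 0 for v in values):
        return 0.0  # any zero replicate forces the product to zero
    # Average the logs, then exponentiate back.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Replicates hcG1..hcG4 from two of the sample rows quoted below.
day_with_signal = geometric_mean([0.03, 0.02, None, None])
day_all_zero = geometric_mean([0, 0, None, None], floor=0.01)
print(day_with_signal, day_all_zero)
```

Whatever the real rule for zeros and missing replicates is, it is exactly the kind of detail I would want the documentation to state, so that someone who prefers a different averaging technique knows what they are replacing.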

My two cents.

On Tue 15/02/11  2:55 PM , Bob McConnaughey bobmcconn at gmail.com sent:
> Dear folks:  We're working on making the key data files for one
> of our major long term studies @ the EPI branch of the NIEHS public,
> at this point as I and the original researchers are getting older,
> sooner rather than later.  The Early Pregnancy Study, not atypically,
> has multiple components - both several sets of interview instruments
> but also a good deal of lab data, as our subjects not only were
> willing to provide detailed reproductive diaries but also daily urine
> samples.  Initially the urines were assayed for LH
> (luteinizing hormone) - the gold standard for ovulation, but one
> that's easy to miss.  Soon, thereafter, the urines were assayed
> for other hormonal metabolites (estradiol, progesterone, hCG; now the
> urines are being sent to CDC for BPA and phthalate assays..).  In
> analyses, over time, basic statistical handling of the variables has
> changed.  In general our analyses switched, quite early on, to using
> geometric means rather than arithmetic means.  But my inclination is
> to present the data "as is."
> For hCG, for instance, for any given day, there might be 2,3 or 4
> replicates.  So someone else wanting to use the data could, if they
> so desire, use other averaging techniques.
> 
> id  week  day  date      hcG1   hcG2   hcG3  pilot  hcG4  geo_mean hCG
> xx  7     4    03/17/83  0      0      .     1      .     0.01
> xx  7     5    03/18/83  0      0      .     1      .     0.01
> xx  7     6    03/19/83  0      0      .     1      .     0.01
> xx  7     7    03/20/83  0.03   0.02   .     1      .     0.0244948974  (just fyi...indicative of conception)
> xx  8     1    03/21/83  0.044  0.053  .     1      .     0.0482907859
> xx  8     2    03/22/83  0.162  0.19   .     1      .     0.1754422982
> xx  8     3    03/23/83  0.145  0.175  .     1      .     0.1592953232
> 
> we have 27000+ days of urine samples, multiple assays, with
> replicates, for most days.  
> For our purposes, we've created composite files that put all the
> different assays together, in a few instances the final analysis files
> use imputed values, or lab values that are clearly wack set to
> missings.   In general, how would people be inclined to make this
> type of data public?  One file for hCG? another for PdG, E1G and CRT
>  and another for LH? or merge them all together into one synthetic
> file   There are also subsets of women who've had other assay
> suites done on their pee.  The original lab data arrived in a variety
> of ways: lab forms that were double keyed, transmitted over modem,
> FAR too many lotus and Quattro Pro spreadsheets and *.prn files and
> there seems no compelling reason I can imagine to make this morass of
> "ur-data" public??
> There're also files, really long snippets of SAS code, that were
> used when researchers determined that changes/imputes were appropriate
> (ie, recalibrating assay thresholds) and these could be made
> available for others to use or not use as they wish. 
> Our composite files also contain  demographic, cycle specific and
> study outcome information; my inclination, again, is to disentangle
> these data and make them available in separate files.  
> 
> Thanks for any and all suggestions
> 
> http://www.niehs.nih.gov/research/atniehs/labs/epi/studies/eps/index.cfm
> [1]  
> 
> Bob McConn....
> 
> "She is at the brink of never being hurt again
> but pauses to say, All of us.  Every blade of grass."
> from Kuan Yin, Laura Fargas 
> 
> Links:
> ------
> [1]
> http://www.niehs.nih.gov/research/atniehs/labs/epi/studies/eps/index.cfm
> 
>