[DDI-users] Re: [DDI-SRG] When the Author of a DDI isn't the Archive

Mark R. Diggory mdiggory at latte.harvard.edu
Fri Sep 9 08:49:44 EDT 2005


Wendy, we have 3 cases to deal with here so I'll try to respond for each 
case:

Case 1 : We are importing an existing DDI into a VDC Server.

Case 2 : We are importing a MARC record into a VDC Server
          (which generates a new DDI in the VDC Server)

Case 3 : A VDC Server in the Federation is Harvesting an existing OAI 
service. (caching a copy of the DDI within its Index/Repository)

Wendy Thomas wrote:
> Mark,
> 
> There are currently 3 sections that identify the source, intellectual
> content and current XML instance in the DDI. Section 1.1 describes the
> current document. If all your system is doing is importing an extant XML
> instance verbatim then generally the only thing that would change in this
> section is the holdings and deposit information. 

I think the question is "Are we putting this 'original' in our archive, 
or are we creating a 'derivative' of the 'original' in our archive?". In 
some of our cases it's the former (3), in others it's the latter (2), and 
in others it's very unclear which it is (1).

(1) It seems that in this case the Server would at least generate a new 
docDscr describing the entire import event; this includes, as you 
state, holdings and deposit information. It is important to preserve the 
original information for provenance and archival reasons, so I'm not 
suggesting appending these to the original docDscr. There is also the 
issue that there's more than holdings and deposit info, and there's more 
than one archive to be tracked, so all this information needs to be 
organized per archive.

(2) This is much easier: a docDscr is generated for the study, as 
above, but the curator is the producer of the instance. The docSrc can 
be used to document the original MARC record.

(3) This is the case of harvesting metadata from one VDC to another for 
search and discovery purposes. Again, there is a requirement to 
preserve the original while documenting its inclusion in the 
new server. In our "Harvesting" case, all the original holdings still 
persist and are accessible, and all the fileDscr/otherMat URIs still point 
to the original resources; this is just a mirroring of that information 
on the new system. This is where altering the original 
really becomes problematic to me. If we insert a docDscr representing 
the harvest into our archive, we've in a sense altered the "instance" 
simply to document that it was added to our archive. I'm unsure if this 
is necessary.

> If you are creating an
> XML instance from another source this is the section you would use to
> record the "authorship" of the XML version. Your source documents would go
> in section 1.2. This would include an XML version where you made changes
> or incorporated additional materials when importing an XML. The original
> XML would be cited in section 1.2.

Do you mean 1.3 (docSrc), not 1.2 (guide)?

The issue is that when it is actually another DDI that we are using to 
generate the DDI in the system, there may already be a docDscr/docSrc 
present.

It seems weird to alter the original existing docDscr by moving its 
citation contents into a docSrc section: some information resides 
outside of the docDscr/citation, and the place to preserve this metadata 
(guide, status, notes, other docSrcs) is unclear.

It seems to me that preservation of the original structure is of the 
utmost importance, and that the unit of administrative metadata needs to 
be encapsulated as much as possible. The docDscr appears to be such a 
unit of encapsulation, whereas, as I stated above, it is unclear what to 
do with the rest of the content if "citation" is the unit of encapsulation.

If this docDscr structure is preserved and new archives insert new 
docDscrs, then the origin of the document can be consistently tracked 
through heterogeneous archive systems.
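
As a rough illustration of "preserve the original docDscr, insert a new one 
per archive", here is a minimal Python sketch. It assumes un-namespaced DDI 
2.x element names and a lowercase "id" attribute as in my example further 
down; the function name add_archive_docdscr and the use of a notes element 
to record the event are hypothetical choices, not anything the DDI 
prescribes.

```python
import xml.etree.ElementTree as ET


def add_archive_docdscr(codebook, archive_id, note):
    """Insert a new docDscr after the existing ones, leaving them untouched."""
    existing = codebook.findall("docDscr")
    new = ET.Element("docDscr", {"id": archive_id})
    # Hypothetical: record the import/harvest event in a notes element.
    ET.SubElement(new, "notes").text = note
    # Place the new docDscr right after the last existing one (or first).
    children = list(codebook)
    index = children.index(existing[-1]) + 1 if existing else 0
    codebook.insert(index, new)
    return new


original = ET.fromstring(
    '<codeBook><docDscr id="producer"><citation/></docDscr><stdyDscr/></codeBook>'
)
before = ET.tostring(original.find("docDscr"))
add_archive_docdscr(original, "archive1", "Harvested via OAI")
after = ET.tostring(original.find("docDscr"))

# The chain of docDscr ids, in document order, is the provenance trail.
provenance = [d.get("id") for d in original.findall("docDscr")]
```

The producer's docDscr is byte-for-byte the same before and after, and each 
archive that touches the document only ever adds its own block.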

> 
> As Reto suggests this seems inadequate for documenting the modular and
> sub-modular level changes occurring in a life cycle model. 

This is more a discussion for 3.x than a discussion about current usage:

I do agree. At first I thought it was possible to combine this with the 
usage of Link elements to attain some degree of change documentation, 
but it is very poor form.

> It is the
> primary reason for providing the reusable class "citation" to allow for
> tracking the provenance and source of discrete pieces of information.

As Reto points out (and I've discovered through trying to use it) this 
"citation" thing is just not enough.

> There is a need to track the path the XML followed to get to its present
> state. This should be kept in the "archive" module as it is dynamic in
> nature and varies archive by archive. Right now this module is not well
> defined. It is definitely something that users will need to hammer on a bit
> when testing out the proposed 3.0 structure.


One idea I have for 2.x is that we could redefine "@source" to be an 
IDREF (or a "reference" in 3.0) which points at a "docDscr". So instead 
of having @source="archive|producer", there is a docDscr for each and 
every "editor" of the document, and modifications to content can then at 
least be linked back to the editor.

For example:

<codeBook>
    <docDscr id="producer">...</docDscr>
    <docDscr id="archive1">...</docDscr>
    <docDscr id="archive2">...</docDscr>
    ...
    <fileDscr source="producer">...</fileDscr>
    <fileDscr source="archive1">...</fileDscr>
    <fileDscr source="archive2">...</fileDscr>
</codeBook>
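
To show how an application might follow such references, here is a small 
Python sketch under the same assumptions as the example above (lowercase 
"id" on docDscr, "source" holding an id rather than archive|producer). 
ElementTree does not validate IDREFs, so the lookup is done by hand; the 
function name resolve_sources and the sample ids are made up for 
illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<codeBook>
    <docDscr id="producer"/>
    <docDscr id="archive1"/>
    <fileDscr source="producer"/>
    <fileDscr source="archive1"/>
    <otherMat source="archive9"/>
</codeBook>
"""


def resolve_sources(codebook):
    """Split @source references into resolved and dangling (tag, id) pairs."""
    # Every docDscr id is a known "editor" of the document.
    editors = {d.get("id") for d in codebook.findall("docDscr")}
    resolved, dangling = [], []
    for elem in codebook.iter():
        ref = elem.get("source")
        if ref is None:
            continue
        (resolved if ref in editors else dangling).append((elem.tag, ref))
    return resolved, dangling


resolved, dangling = resolve_sources(ET.fromstring(SAMPLE))
```

Here the two fileDscr elements resolve to their docDscr, while the otherMat 
pointing at "archive9" is reported as dangling because no docDscr carries 
that id.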


Another, even better idea I have for 3.x is something like the change 
tracking found in Open Office. Its format has a "Change Tracking" 
strategy which may be of interest to the SRG group.

It contains the following elements:

<!-- elements for change tracking -->

<!ELEMENT text:change EMPTY>
<!ATTLIST text:change text:change-id CDATA #REQUIRED>

<!ELEMENT text:change-start EMPTY>
<!ATTLIST text:change-start text:change-id CDATA #REQUIRED>

<!ELEMENT text:change-end EMPTY>
<!ATTLIST text:change-end text:change-id CDATA #REQUIRED>

<!ELEMENT text:tracked-changes (text:changed-region)*>
<!ATTLIST text:tracked-changes text:track-changes %boolean; "true">
<!ATTLIST text:tracked-changes text:protection-key CDATA #IMPLIED>

<!ELEMENT text:changed-region (text:insertion |
                               (text:deletion, text:insertion?) |
                               text:format-change) >
<!ATTLIST text:changed-region text:id ID #REQUIRED>
<!ATTLIST text:changed-region text:merge-last-paragraph %boolean; "true">

<!ELEMENT text:insertion (office:change-info, %sectionText;)>
<!ELEMENT text:deletion (office:change-info, %sectionText;)>
<!ELEMENT text:format-change (office:change-info)>

A reference can be found here:
http://xml.openoffice.org/source/browse/xml/xmloff/dtd/text.mod?rev=1.57.220.1&content-type=text/vnd.viewcvs-markup

This, used in conjunction with identification of the change authors in 
the archive section, provides a means of tracking changes made to the DDI 
over time. However, this is clearly something which an application would 
need to provide support for, as it would be allowed almost everywhere 
within the DDI.

Having changes documented in an alternate namespace does provide easy 
identification for adding and stripping them out during 
rendering/processing if necessary.
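
A minimal Python sketch of the stripping step, to make the point concrete. 
The namespace URI urn:example:change-tracking and the ct: prefix are 
placeholders I made up (nothing like this exists in the DDI today); the 
element names are borrowed from the Open Office DTD above.

```python
import xml.etree.ElementTree as ET

CT_NS = "urn:example:change-tracking"  # hypothetical namespace URI

SAMPLE = f"""
<codeBook xmlns:ct="{CT_NS}">
    <ct:tracked-changes>
        <ct:changed-region ct:id="c1"/>
    </ct:tracked-changes>
    <stdyDscr>
        <citation><ct:change-start ct:change-id="c1"/>Title<ct:change-end ct:change-id="c1"/></citation>
    </stdyDscr>
</codeBook>
"""


def strip_namespace(root, ns):
    """Remove every element in namespace ns, keeping the text around it."""
    prefix = "{" + ns + "}"
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if isinstance(child.tag, str) and child.tag.startswith(prefix):
                # Text following a removed marker is stored as its tail;
                # re-attach it to the kept sibling or to the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


cleaned = strip_namespace(ET.fromstring(SAMPLE), CT_NS)
remaining = [e.tag for e in cleaned.iter()]
title = cleaned.find("stdyDscr/citation").text
```

After stripping, only codeBook, stdyDscr and citation remain, and the 
citation text that sat between the change-start and change-end markers is 
kept intact.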

-Mark