closer-cohorts / extract2ddi

Tool to extract DDI metadata from SPSS and Stata

License: Other

Java 100.00%

ddi-codebook java metadata ddi-lifecycle ddi

extract2ddi's Issues

Validation of outputs

Could an XML profile be used to validate the output? See attached
ddi32-profile.txt

RecommendedDataType should be Numeric not type

Variable Representation: RecommendedDataType should be Numeric for Code Representation, not "type"
<ddi:VariableRepresentation>
<r:CodeRepresentation>
<r:RecommendedDataType>type</r:RecommendedDataType>
<r:CodeListReference>
<r:URN typeOfIdentifier="Canonical">urn:ddi:uk.closer:72c7a745-5bad-42e9-b95d-ddfde33587e2:1</r:URN>
<r:TypeOfObject>CodeList</r:TypeOfObject>
</r:CodeListReference>
</r:CodeRepresentation>
</ddi:VariableRepresentation>

DataItem attribute values missing from Stata output

For DataItem, the attribute values are not correctly populated:

            <ddi1:DataItem>
                <r:VariableReference>
                    <r:URN typeOfIdentifier="Canonical">urn:ddi:uk.closer:e9b94a1b-797c-4f89-bf1c-09bf25492359:1</r:URN>
                    <r:TypeOfObject>Variable</r:TypeOfObject>
                </r:VariableReference>
                <r:ProprietaryInfo>
                    <r:ProprietaryProperty>
                        <r:AttributeKey>Width</r:AttributeKey>
                        <r:AttributeValue>???</r:AttributeValue>
                    </r:ProprietaryProperty>
                    <r:ProprietaryProperty>
                        <r:AttributeKey>Decimals</r:AttributeKey>
                        <r:AttributeValue>???</r:AttributeValue>
                    </r:ProprietaryProperty>
                    <r:ProprietaryProperty>
                        <r:AttributeKey>WriteFormatType</r:AttributeKey>
                        <r:AttributeValue>???</r:AttributeValue>
                    </r:ProprietaryProperty>
                </r:ProprietaryInfo>
            </ddi1:DataItem>

Update documentation

Update documentation to show all options

Codebook output options
DDI-L 3.2 output options
DDI-L 3.3 output options

Errors in summary stats in format 3.3 Fragment

CodeValue should have valid and invalid counts, currently showing 0

        <TotalResponses>4</TotalResponses>
        <SummaryStatistic>
            <TypeOfSummaryStatistic>ValidCases</TypeOfSummaryStatistic>
            <Statistic>0</Statistic>
        </SummaryStatistic>
        <SummaryStatistic>
            <TypeOfSummaryStatistic>InvalidCases</TypeOfSummaryStatistic>
            <Statistic>0</Statistic>
        </SummaryStatistic>
    </VariableStatistics>

Add optional citation to 3.2 Format

Add configuration option in config file to add blank citation
After DDIInstance / URN
After ResourcePackage / URN
After PhysicalInstance / URN
<r:Citation>
<r:Title><r:String xml:lang="en-GB"></r:String></r:Title>
<r:AlternateTitle><r:String xml:lang="en-GB"></r:String></r:AlternateTitle>
</r:Citation>

Rename jar to Extract2DDI

Rename jar to Extract2DDI.jar

Add software tagging for SPSS and Stata 3.3

  <GrossFileStructure isUniversallyUnique="true">
    <r:URN>urn:ddi:uk.genscotland:e3fc49ff-d893-434e-a4c6-24bc6b7c3934:1</r:URN>
    <r:Agency>uk.genscotland</r:Agency>
    <r:ID>e3fc49ff-d893-434e-a4c6-24bc6b7c3934</r:ID>
    <r:Version>1</r:Version>
    <CaseQuantity>5</CaseQuantity>
  </GrossFileStructure>

Add CreationSoftware element as below

DataItem attribute values not populated correctly from SPSS 3.2 output

see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSVariable.java line 414 onwards

The attribute values are currently ???; they should be valid entries.

Variable Label not correct in Formats 32 and 33

      <VariableName>
            <r:String xml:lang="en-GB">TestDate2</r:String>
        </VariableName>
        <r:Label>
            <r:Content xml:lang="en-GB">V8_A</r:Content>
        </r:Label>

TopLevelReference is not referenced again in 3.3 output

The TopLevelReference should be a reference to the main item in the process. In the attached example
lfsp_jm15_eul_11.sav.xml.txt
the reference is to a ResourcePackage (TypeOfObject) but the ID in the TopLevelReference is c2b23dcd-dde4-413b-93c4-36e64e7ead44 and the ID of the ResourcePackage itself is 723e856b-9f4f-4cd0-861b-f3cebb1f8612

Add validation test for 3.2 and 3.3 schemas

XSD for 3.3: https://ddialliance.org/Specification/DDI-Lifecycle/3.3/XMLSchema/instance.xsd
XSD for 3.2: https://ddialliance.org/Specification/DDI-Lifecycle/3.2/XMLSchema/instance.xsd
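
A validation test along these lines could be sketched with the JDK's built-in javax.xml.validation API. The class and method names (DdiSchemaCheck, firstError) are illustrative assumptions and not part of extract2ddi. In a real test the schema source would be one of the instance.xsd URLs above, for example new StreamSource(url), and the XML source would be the generated output file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of extract2ddi.
final class DdiSchemaCheck {

    /**
     * Returns null when the XML validates against the XSD.
     * Otherwise returns the message of the first schema or validation error.
     */
    static String firstError(Source xsd, Source xml) {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try {
            Schema schema = factory.newSchema(xsd);   // compiles the XSD
            Validator validator = schema.newValidator();
            validator.validate(xml);                  // throws on the first error
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test per output format could then assert that firstError returns null for each generated file.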

Log file should overwrite, not append

DataItem attribute values missing from SPSS 3.3 output

see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSVariable.java line 414 onwards

Add support for Stata for 3.2 output

Summary stats not being output in format 3.2

Using either the -s flag or sumstats=TRUE in config does not output summary statistics

Unit tests and integration tests

Scope the effort required to use GitHub Actions for unit and integration tests

Duplicate VariableStatistics IDs in 3.3 output

For every variable, the corresponding VariableStatistics element has the same ID as the variable. In the attached example, variable 5a99caa4-6044-41d1-b7cc-cc94f1ae0e9c has a corresponding VariableStatistics element with the same ID. That element then contains a VariableReference with the same (correct) ID.
lfsp_jm15_eul_11.sav.xml.txt
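
The intended relationship can be sketched as follows: each identifiable item gets its own ID, and only the reference carries the variable's ID. The class names are hypothetical and do not reflect the actual extract2ddi model.

```java
import java.util.UUID;

// Hypothetical model classes for illustration only.
final class IdSketch {

    static final class Variable {
        final String id = UUID.randomUUID().toString();
        final String name;

        Variable(String name) {
            this.name = name;
        }
    }

    static final class VariableStatistics {
        // The statistics item has its own identity ...
        final String id = UUID.randomUUID().toString();
        // ... and points at the variable through the reference only.
        final String variableReferenceId;

        VariableStatistics(Variable variable) {
            this.variableReferenceId = variable.id;
        }
    }
}
```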

Identical Category fragments are repeated in 3.3 output

Categories that are common across variables appear to have been identified as such but identical Category fragments are then created
lfsp_jm15_eul_11.sav.xml.txt

for example 22e27ff5-c408-4ef8-aef5-d13cf0b17cbe "No answer". In the output attached there are 11 identical fragments containing a Category with that ID
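
One way to emit each shared Category only once is to key the fragments by ID before writing them. This is a minimal sketch under the assumption that categories are available as a list before serialisation; Category and uniqueById are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class FragmentDedup {

    static final class Category {
        final String id;
        final String label;

        Category(String id, String label) {
            this.id = id;
            this.label = label;
        }
    }

    /** Keeps the first Category seen for each ID, preserving input order. */
    static List<Category> uniqueById(List<Category> categories) {
        Map<String, Category> byId = new LinkedHashMap<>();
        for (Category category : categories) {
            byId.putIfAbsent(category.id, category);
        }
        return new ArrayList<>(byId.values());
    }
}
```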

Duplicate variable CodeListReference IDs in 3.3 output

For all variables, the VariableRepresentation -> CodeRepresentation -> CodeListReference has the same ID as the variable and so does not exist in the output file. The corresponding CodeList IDs are unique but never referenced. In the example output, if you take ACTHR (ID 5a99caa4-6044-41d1-b7cc-cc94f1ae0e9c) and look at the reference to its CodeList it has the same ID as the variable (not the ID of its corresponding CodeList, presumably 25dac1d6-6cf8-4602-b2a8-9f169ffed68f)
lfsp_jm15_eul_11.sav.xml.txt
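
A regression test for this kind of problem could check that every referenced ID is declared somewhere in the output. The sketch below assumes the declared and referenced IDs have already been collected from the generated XML; ReferenceCheck and danglingReferences are hypothetical names.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only.
final class ReferenceCheck {

    /** Returns the referenced IDs that no item in the output declares. */
    static Set<String> danglingReferences(Collection<String> declaredIds,
                                          Collection<String> referencedIds) {
        Set<String> dangling = new LinkedHashSet<>(referencedIds);
        dangling.removeAll(new HashSet<>(declaredIds));
        return dangling;
    }
}
```

Note that this check only finds references to IDs that do not exist. A CodeListReference carrying the variable's own ID points at an existing item of the wrong type, so a fuller check would also compare TypeOfObject.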

Stata fails with java.lang.OutOfMemoryError: Java heap space

its-meta:extract jon$ java -jar Extract2DDI.jar -f test-file-data-types.dta --format 3.2 --config format32-stata
2022-11-15 08:42:48,454 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 115.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 114.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 113.
2022-11-15 08:42:48,457 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 117
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.cornell.ncrn.ced2ar.stata.impl.DtaReader.readValueLabels(DtaReader.java:419)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta118Reader.readVariables(Dta118Reader.java:222)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta117Reader.<init>(Dta117Reader.java:65)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta118Reader.<init>(Dta118Reader.java:49)
at edu.cornell.ncrn.ced2ar.stata.StataReaderFactory.getStataReader(StataReaderFactory.java:42)
at edu.cornell.ncrn.ced2ar.ddigen.csv.StataCsvGenerator.generateVariablesCsv(StataCsvGenerator.java:40)
at edu.cornell.ncrn.ced2ar.ddigen.DdiLifecycleGenerator.generateVariablesCsv(DdiLifecycleGenerator.java:55)
at edu.cornell.ncrn.ced2ar.ddigen.GenerateDDI32.generateDDI(GenerateDDI32.java:34)
at edu.cornell.ncrn.ced2ar.ddigen.Main.main(Main.java:132)

Custom currency not supported

String variables with Missing and Values specified cause a null pointer exception (3.3 Fragment)

DDI still seems to be produced however. The reason is that they are flagged as CodeRepresentation (as opposed to TextRepresentation). Because of this, in FragmentGenerator (line 416), they are looked up in ‘variableToFrequencyMap’ but not found because only Numeric variables are added to the map.
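
A defensive lookup would avoid the exception for variables that were never added to the map. This is a sketch only: the real type of variableToFrequencyMap is not shown in the issue, so the Map<String, Map<String, Long>> shape and the names FrequencyLookup and frequenciesFor are assumptions.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; the map's real type in FragmentGenerator is assumed.
final class FrequencyLookup {

    /** Returns the frequencies for a variable, or an empty map if none were recorded. */
    static Map<String, Long> frequenciesFor(
            Map<String, Map<String, Long>> variableToFrequencyMap,
            String variableName) {
        Map<String, Long> found = variableToFrequencyMap.get(variableName);
        return found != null ? found : Collections.<String, Long>emptyMap();
    }
}
```

The alternative fix is to add coded string variables to the map in the first place, which would also give them real frequencies.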

Add software tag for SPSS 3.2

see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSFile.java line 781 from spssreader

Debug output should write out column

If the column is not processable, the debug should identify the column which is the problem
e.g.
2022-11-15 08:42:48,454 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 115.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 114.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 113.
2022-11-15 08:42:48,457 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 117
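
One possible shape for this is a small wrapper that adds the column's position and name to any failure raised while reading it. ColumnContext and readColumn are hypothetical names, and how the readers actually iterate over columns is an assumption.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only.
final class ColumnContext {

    /** Runs the reader and, on failure, rethrows with the column identified. */
    static <T> T readColumn(int index, String name, Callable<T> reader) {
        try {
            return reader.call();
        } catch (Exception e) {
            throw new IllegalStateException(
                    "Could not process column " + index + " ('" + name + "'): " + e, e);
        }
    }
}
```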

Mandatory items

For 3.2 and 3.3:

f - filename
agency
ddilang
format (3.2 or 3.3)

For 2.5:

filename
format (2.5)
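
The rules above could be checked up front with something like the following sketch. It assumes the parsed command-line and config options end up in a simple key/value map using the option names listed above; RequiredOptions and missing are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the option keys follow the names used in this issue.
final class RequiredOptions {

    /** Returns the names of mandatory options that are absent or blank. */
    static List<String> missing(Map<String, String> options) {
        String format = options.get("format");
        List<String> required = new ArrayList<>(Arrays.asList("filename", "format"));
        if ("3.2".equals(format) || "3.3".equals(format)) {
            // DDI-L outputs additionally need an agency and a language.
            required.addAll(Arrays.asList("agency", "ddilang"));
        }
        List<String> missing = new ArrayList<>();
        for (String key : required) {
            String value = options.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```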

Currency format $ not supported

NumberFormatException error

Occurs with both the format33 and format32 config files:

2022-10-10 14:04:29,000 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 1. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,003 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 2. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,005 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 3. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,007 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 5. Skipping this observation java.lang.NumberFormatException: empty String
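
The message "empty String" is what Double.parseDouble produces for an empty input, so one plausible cause is a blank cell being parsed as a number. The sketch below treats blank cells as missing instead of failing; it is an assumption about the cause, and NumericCell and parseNumeric are hypothetical names.

```java
import java.util.OptionalDouble;

// Hypothetical helper for illustration only.
final class NumericCell {

    /** Parses a numeric cell, treating null or blank input as a missing value. */
    static OptionalDouble parseNumeric(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Double.parseDouble(raw.trim()));
    }
}
```

With this approach an observation containing a blank numeric cell is kept, with the cell recorded as missing, instead of the whole observation being skipped.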

Add configuration to 2.5 output

Add software tag for SPSS 3.3

see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSFile.java line 781 from spssreader, to populate the xxxxx placeholders

After </p:PhysicalStructureLinkReference> add
<ddi1:SystemSoftware>
<r:SoftwareName>
<r:String xml:lang="en-GB">xxxxxxx</r:String>
</r:SoftwareName>
<r:Description>
<r:Content>xxxxxxxxxx</r:Content>
</r:Description>
</ddi1:SystemSoftware>
