closer-cohorts / extract2ddi
Tool to extract DDI metadata from SPSS and Stata
License: Other
Could an XML profile be used to validate the output? See attached
ddi32-profile.txt
Variable Representation: RecommendedDataType should be Numeric for Code Representation, not "type"
<ddi:VariableRepresentation>
<r:CodeRepresentation>
<r:RecommendedDataType>type</r:RecommendedDataType>
<r:CodeListReference>
<r:URN typeOfIdentifier="Canonical">urn:ddi:uk.closer:72c7a745-5bad-42e9-b95d-ddfde33587e2:1</r:URN>
<r:TypeOfObject>CodeList</r:TypeOfObject>
</r:CodeListReference>
</r:CodeRepresentation>
</ddi:VariableRepresentation>
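A check for this could be scripted against the generated file. The following is a minimal sketch, not part of extract2ddi: the function name `bad_code_representations` is hypothetical, elements are matched by local name so that no DDI namespace URI has to be assumed, and the namespace in `SAMPLE` is a placeholder rather than the real DDI one.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def bad_code_representations(xml_text):
    """Return the RecommendedDataType text of every CodeRepresentation
    whose RecommendedDataType is something other than 'Numeric'."""
    root = ET.fromstring(xml_text)
    bad = []
    for el in root.iter():
        if local(el.tag) != "CodeRepresentation":
            continue
        for child in el:
            if local(child.tag) == "RecommendedDataType":
                value = (child.text or "").strip()
                if value != "Numeric":
                    bad.append(value)
    return bad


# Placeholder namespace, used only so the 'r:' prefix parses.
SAMPLE = """<VariableRepresentation xmlns:r="urn:example:reusable">
  <r:CodeRepresentation>
    <r:RecommendedDataType>type</r:RecommendedDataType>
  </r:CodeRepresentation>
</VariableRepresentation>"""

print(bad_code_representations(SAMPLE))  # ['type']
```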
For DataItem, attribute values are not correctly populated
<ddi1:DataItem>
<r:VariableReference>
<r:URN typeOfIdentifier="Canonical">urn:ddi:uk.closer:e9b94a1b-797c-4f89-bf1c-09bf25492359:1</r:URN>
<r:TypeOfObject>Variable</r:TypeOfObject>
</r:VariableReference>
<r:ProprietaryInfo>
<r:ProprietaryProperty>
<r:AttributeKey>Width</r:AttributeKey>
<r:AttributeValue>???</r:AttributeValue>
</r:ProprietaryProperty>
<r:ProprietaryProperty>
<r:AttributeKey>Decimals</r:AttributeKey>
<r:AttributeValue>???</r:AttributeValue>
</r:ProprietaryProperty>
<r:ProprietaryProperty>
<r:AttributeKey>WriteFormatType</r:AttributeKey>
<r:AttributeValue>???</r:AttributeValue>
</r:ProprietaryProperty>
</r:ProprietaryInfo>
</ddi1:DataItem>
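Unpopulated values like these could be listed automatically. A minimal sketch, assuming the placeholder is always the literal string `???` as in the snippet above; `placeholder_properties` is a hypothetical helper name and the namespace in `SAMPLE` is a stand-in, not the real DDI one.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def placeholder_properties(xml_text, placeholder="???"):
    """Return the AttributeKey of every ProprietaryProperty whose
    AttributeValue is still the placeholder string."""
    root = ET.fromstring(xml_text)
    keys = []
    for prop in root.iter():
        if local(prop.tag) != "ProprietaryProperty":
            continue
        # Collect the direct children as {local name: text}.
        fields = {local(c.tag): (c.text or "").strip() for c in prop}
        if fields.get("AttributeValue") == placeholder:
            keys.append(fields.get("AttributeKey"))
    return keys


# Placeholder namespace, used only so the 'r:' prefix parses.
SAMPLE = """<DataItem xmlns:r="urn:example:reusable">
  <r:ProprietaryInfo>
    <r:ProprietaryProperty>
      <r:AttributeKey>Width</r:AttributeKey>
      <r:AttributeValue>???</r:AttributeValue>
    </r:ProprietaryProperty>
    <r:ProprietaryProperty>
      <r:AttributeKey>Decimals</r:AttributeKey>
      <r:AttributeValue>0</r:AttributeValue>
    </r:ProprietaryProperty>
  </r:ProprietaryInfo>
</DataItem>"""

print(placeholder_properties(SAMPLE))  # ['Width']
```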
Update documentation to show all options
CodeValue should have valid and invalid counts, currently showing 0
<TotalResponses>4</TotalResponses>
<SummaryStatistic>
<TypeOfSummaryStatistic>ValidCases</TypeOfSummaryStatistic>
<Statistic>0</Statistic>
</SummaryStatistic>
<SummaryStatistic>
<TypeOfSummaryStatistic>InvalidCases</TypeOfSummaryStatistic>
<Statistic>0</Statistic>
</SummaryStatistic>
</VariableStatistics>
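The mismatch shown above (four responses, zero valid and zero invalid cases) could be detected mechanically. This is a rough sketch under one explicit assumption that the issue does not state: that ValidCases plus InvalidCases should equal TotalResponses. The function name `inconsistent_counts` is hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def inconsistent_counts(xml_text):
    """Return (total, valid, invalid) for each VariableStatistics element
    where valid + invalid does not equal TotalResponses (assumed rule)."""
    root = ET.fromstring(xml_text)
    flagged = []
    for vs in root.iter():
        if local(vs.tag) != "VariableStatistics":
            continue
        total = None
        counts = {}
        for el in vs.iter():
            name = local(el.tag)
            if name == "TotalResponses":
                total = float(el.text)
            elif name == "SummaryStatistic":
                fields = {local(c.tag): (c.text or "").strip() for c in el}
                kind = fields.get("TypeOfSummaryStatistic")
                counts[kind] = float(fields.get("Statistic") or 0)
        valid = counts.get("ValidCases", 0.0)
        invalid = counts.get("InvalidCases", 0.0)
        if total is not None and valid + invalid != total:
            flagged.append((total, valid, invalid))
    return flagged


SAMPLE = """<VariableStatistics>
  <TotalResponses>4</TotalResponses>
  <SummaryStatistic>
    <TypeOfSummaryStatistic>ValidCases</TypeOfSummaryStatistic>
    <Statistic>0</Statistic>
  </SummaryStatistic>
  <SummaryStatistic>
    <TypeOfSummaryStatistic>InvalidCases</TypeOfSummaryStatistic>
    <Statistic>0</Statistic>
  </SummaryStatistic>
</VariableStatistics>"""

print(inconsistent_counts(SAMPLE))  # [(4.0, 0.0, 0.0)]
```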
Add configuration option in config file to add blank citation
After DDIInstance / URN
After ResourcePackage / URN
After PhysicalInstance / URN
<r:Citation>
<r:Title><r:String xml:lang="en-GB"></r:String></r:Title>
<r:AlternateTitle><r:String xml:lang="en-GB"></r:String></r:AlternateTitle>
</r:Citation>
Rename jar to Extract2DDI.jar
<GrossFileStructure isUniversallyUnique="true">
<r:URN>urn:ddi:uk.genscotland:e3fc49ff-d893-434e-a4c6-24bc6b7c3934:1</r:URN>
<r:Agency>uk.genscotland</r:Agency>
<r:ID>e3fc49ff-d893-434e-a4c6-24bc6b7c3934</r:ID>
<r:Version>1</r:Version>
<CaseQuantity>5</CaseQuantity>
</GrossFileStructure>
Add CreationSoftware element as below
see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSVariable.java line 414 onwards
Currently have ??? should be valid entries
<VariableName>
<r:String xml:lang="en-GB">TestDate2</r:String>
</VariableName>
<r:Label>
<r:Content xml:lang="en-GB">V8_A</r:Content>
</r:Label>
The TopLevelReference should be a reference to the main item in the process. In the attached example
lfsp_jm15_eul_11.sav.xml.txt
the reference is to a ResourcePackage (TypeOfObject), but the ID in the TopLevelReference is c2b23dcd-dde4-413b-93c4-36e64e7ead44, whereas the ID of the ResourcePackage itself is 723e856b-9f4f-4cd0-861b-f3cebb1f8612.
Using either the -s flag or sumstats=TRUE in config does not output summary statistics
Scope the effort required to use GitHub Actions for unit and integration tests
For all variables, the corresponding VariableStatistics element has the same ID as the variable. In the example attached, variable 5a99caa4-6044-41d1-b7cc-cc94f1ae0e9c has a corresponding VariableStatistics element with the same ID. This element then has a VariableReference with the same (correct) ID.
lfsp_jm15_eul_11.sav.xml.txt
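Identifiers shared between distinct items, as described above, could be found with a short script. A minimal sketch, assuming every identifiable item carries a URN child (as GrossFileStructure does elsewhere on this page) and that any element whose name ends in `Reference` is a pointer rather than an item; `shared_urns` is a hypothetical name and the namespace and URNs in `SAMPLE` are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def shared_urns(xml_text):
    """Map each URN that identifies more than one item to the item names.
    URNs inside *Reference elements are ignored: they point, not define."""
    root = ET.fromstring(xml_text)
    owners = defaultdict(list)
    for el in root.iter():
        name = local(el.tag)
        if name.endswith("Reference"):
            continue
        for child in el:
            if local(child.tag) == "URN" and child.text:
                owners[child.text.strip()].append(name)
    return {urn: names for urn, names in owners.items() if len(names) > 1}


# Made-up namespace and identifiers, for illustration only.
SAMPLE = """<Fragment xmlns:r="urn:example:reusable">
  <Variable>
    <r:URN>urn:ddi:example.agency:var-1:1</r:URN>
  </Variable>
  <VariableStatistics>
    <r:URN>urn:ddi:example.agency:var-1:1</r:URN>
    <r:VariableReference>
      <r:URN>urn:ddi:example.agency:var-1:1</r:URN>
      <r:TypeOfObject>Variable</r:TypeOfObject>
    </r:VariableReference>
  </VariableStatistics>
</Fragment>"""

print(shared_urns(SAMPLE))
```

On the sample, the single URN is reported as owned by both `Variable` and `VariableStatistics`, while the VariableReference (which is correct) is not counted.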
Categories that are common across variables appear to have been identified as such but identical Category fragments are then created
lfsp_jm15_eul_11.sav.xml.txt
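Duplicate Category fragments could be surfaced by grouping categories on their label text. This is only a sketch: it assumes a Category holds a URN and a label Content element, and that two categories with the same label text are "identical" in the sense of this issue. The name `duplicate_categories`, the namespace, and the URNs in `SAMPLE` are all made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def duplicate_categories(xml_text):
    """Map each label text used by more than one Category to their URNs."""
    root = ET.fromstring(xml_text)
    by_label = defaultdict(list)
    for cat in root.iter():
        if local(cat.tag) != "Category":
            continue
        urn = label = None
        for el in cat.iter():
            name = local(el.tag)
            if name == "URN" and urn is None:
                urn = (el.text or "").strip()
            elif name == "Content" and label is None:
                label = (el.text or "").strip()
        by_label[label].append(urn)
    return {label: urns for label, urns in by_label.items() if len(urns) > 1}


# Made-up namespace and identifiers, for illustration only.
SAMPLE = """<Fragments xmlns:r="urn:example:reusable">
  <Category>
    <r:URN>urn:ddi:example.agency:cat-1:1</r:URN>
    <r:Label><r:Content>Yes</r:Content></r:Label>
  </Category>
  <Category>
    <r:URN>urn:ddi:example.agency:cat-2:1</r:URN>
    <r:Label><r:Content>Yes</r:Content></r:Label>
  </Category>
  <Category>
    <r:URN>urn:ddi:example.agency:cat-3:1</r:URN>
    <r:Label><r:Content>No</r:Content></r:Label>
  </Category>
</Fragments>"""

print(duplicate_categories(SAMPLE))
```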
For all variables, the VariableRepresentation -> CodeRepresentation -> CodeListReference has the same ID as the variable and so does not exist in the output file. The corresponding CodeList IDs are unique but never referenced. In the example output, if you take ACTHR (ID 5a99caa4-6044-41d1-b7cc-cc94f1ae0e9c) and look at the reference to its CodeList it has the same ID as the variable (not the ID of its corresponding CodeList, presumably 25dac1d6-6cf8-4602-b2a8-9f169ffed68f)
lfsp_jm15_eul_11.sav.xml.txt
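References that do not resolve to an item of the stated type, as with the CodeListReference here and the TopLevelReference in the earlier issue, could be listed as follows. A minimal sketch, assuming references carry a URN and a TypeOfObject (as in the snippets on this page) and that the target item's element name equals the TypeOfObject value; `unresolved_references` is a hypothetical name, and the namespace and URNs in `SAMPLE` are made up.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def unresolved_references(xml_text):
    """Return (reference name, TypeOfObject, URN) for every reference whose
    URN is not defined by an element named after its TypeOfObject."""
    root = ET.fromstring(xml_text)
    defined = set()
    referenced = []
    for el in root.iter():
        # Direct children as {local name: text}.
        fields = {local(c.tag): (c.text or "").strip() for c in el}
        urn = fields.get("URN")
        if not urn:
            continue
        name = local(el.tag)
        if name.endswith("Reference"):
            referenced.append((name, fields.get("TypeOfObject"), urn))
        else:
            defined.add((name, urn))
    return [ref for ref in referenced if (ref[1], ref[2]) not in defined]


# Made-up namespace and identifiers, for illustration only.
SAMPLE = """<Fragment xmlns:r="urn:example:reusable">
  <Variable>
    <r:URN>urn:ddi:example.agency:var-1:1</r:URN>
    <VariableRepresentation>
      <r:CodeRepresentation>
        <r:CodeListReference>
          <r:URN>urn:ddi:example.agency:var-1:1</r:URN>
          <r:TypeOfObject>CodeList</r:TypeOfObject>
        </r:CodeListReference>
      </r:CodeRepresentation>
    </VariableRepresentation>
  </Variable>
  <CodeList>
    <r:URN>urn:ddi:example.agency:cl-1:1</r:URN>
  </CodeList>
</Fragment>"""

print(unresolved_references(SAMPLE))
```

On the sample, the CodeListReference is reported because its URN belongs to the Variable, not to any CodeList, while the real CodeList URN is never referenced.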
its-meta:extract jon$ java -jar Extract2DDI.jar -f test-file-data-types.dta --format 3.2 --config format32-stata
2022-11-15 08:42:48,454 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 115.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 114.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 113.
2022-11-15 08:42:48,457 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 117
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.cornell.ncrn.ced2ar.stata.impl.DtaReader.readValueLabels(DtaReader.java:419)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta118Reader.readVariables(Dta118Reader.java:222)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta117Reader.<init>(Dta117Reader.java:65)
at edu.cornell.ncrn.ced2ar.stata.impl.Dta118Reader.<init>(Dta118Reader.java:49)
at edu.cornell.ncrn.ced2ar.stata.StataReaderFactory.getStataReader(StataReaderFactory.java:42)
at edu.cornell.ncrn.ced2ar.ddigen.csv.StataCsvGenerator.generateVariablesCsv(StataCsvGenerator.java:40)
at edu.cornell.ncrn.ced2ar.ddigen.DdiLifecycleGenerator.generateVariablesCsv(DdiLifecycleGenerator.java:55)
at edu.cornell.ncrn.ced2ar.ddigen.GenerateDDI32.generateDDI(GenerateDDI32.java:34)
at edu.cornell.ncrn.ced2ar.ddigen.Main.main(Main.java:132)
DDI still seems to be produced however. The reason is that they are flagged as CodeRepresentation (as opposed to TextRepresentation). Because of this, in FragmentGenerator (line 416), they are looked up in ‘variableToFrequencyMap’ but not found because only Numeric variables are added to the map.
see https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSFile.java line 781 from spssreader
If a column is not processable, the debug output should identify which column is the problem
e.g.
2022-11-15 08:42:48,454 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 115.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 114.
2022-11-15 08:42:48,455 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 113.
2022-11-15 08:42:48,457 [main] INFO edu.cornell.ncrn.ced2ar.stata.StataReaderFactory - Stata Data file test-file-data-types.dta is not a Format 117
For 3.2 and 3.3,
Currency format $ not supported
Using the format33 and format32 config files
2022-10-10 14:04:29,000 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 1. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,003 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 2. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,005 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 3. Skipping this observation java.lang.NumberFormatException: empty String
2022-10-10 14:04:29,007 [main] ERROR edu.cornell.ncrn.ced2ar.ddigen.csv.SpssCsvGenerator - An error occured in reading observation 5. Skipping this observation java.lang.NumberFormatException: empty String
https://github.com/ncrncornell/ced2arspssreader/blob/master/src/edu/cornell/ncrn/ced2ar/data/spss/SPSSFile.java line 781 from spssreader populate xxxxx
After </p:PhysicalStructureLinkReference> add
<ddi1:SystemSoftware>
<r:SoftwareName>
<r:String xml:lang="en-GB">xxxxxxx</r:String>
</r:SoftwareName>
<r:Description>
<r:Content>xxxxxxxxxx</r:Content>
</r:Description>
</ddi1:SystemSoftware>