pepfar-datim / datapackr
License: Creative Commons Zero v1.0 Universal
Lines 30-31 of unPackData.R make reference to a function which cannot be found.
With this datapack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx
I am receiving a large number of errors related to incorrectly distributing data on the SNU x IM tab:
13 : WARNING!: 519 cases where distributed total is either more or less than total Target. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators ->
* GEND_GBV.N.ViolenceServiceType.20T.physEmot
* GEND_GBV.N.ViolenceServiceType.20T.postRape
* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewNeg
* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewPos
* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewNeg
* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewPos
* HTS_SELF.N.Age/Sex/HIVSelfTest.20T.Directly_Assisted
* HTS_SELF.N.HIVSelfTest.20T.Unassisted
* HTS_TST_Inpat.N.Age/Sex/Result.20T.Negative
* HTS_TST_Inpat.N.Age/Sex/Result.20T.Positive
* HTS_TST_MobileMod.N.Age/Sex/Result.20T.Negative
* HTS_TST_MobileMod.N.Age/Sex/Result.20T.Positive
* HTS_TST_OtherMod.N.Age/Sex/Result.20T.Negative
* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Negative
* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Positive
* HTS_TST_PMTCTPostANC1.N.Age/Sex/Result.20T.Negative
* HTS_TST_PMTCTPostANC1.N.Age/Sex/Result.20T.Positive
* HTS_TST_STIClinic.N.Age/Sex/Result.20T.Negative
* HTS_TST_STIClinic.N.Age/Sex/Result.20T.Positive
* HTS_TST_VCT.N.Age/Sex/Result.20T.Negative
* HTS_TST_VCT.N.Age/Sex/Result.20T.Positive
* HTS_TST.N.KeyPop/Result.20T.Negative
* HTS_TST.N.KeyPop/Result.20T.Positive
* KP_PREV.N.KeyPop.20T
* PMTCT_ART.N.Age/NewExistingART/Sex/HIVStatus.20T.Already
* PMTCT_ART.N.Age/NewExistingART/Sex/HIVStatus.20T.New
* PMTCT_STAT.D.Age/Sex.20T
* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewNeg
* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewPos
* PP_PREV.N.Age/Sex.20T
* PrEP_CURR.N.Age/Sex.20T
* PrEP_CURR.N.KeyPop.20T
* PrEP_NEW.N.Age/Sex.20T
* PrEP_NEW.N.KeyPop.20T
* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.Already
* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.New
* TB_PREV.D.Age/TherapyType/NewExistingArt/HIVStatus.20T.IPTNew
* TB_PREV.N.Age/TherapyType/NewExistingArt/HIVStatus.20T.IPTNew
* TX_CURR.N.Age/Sex/HIVStatus.20T
* TX_NEW.N.Age/Sex/HIVStatus.20T
* TX_NEW.N.KeyPop/HIVStatus.20T
* TX_PVLS.D.Age/Sex/Indication/HIVStatus.20T.Routine
* TX_PVLS.N.Age/Sex/Indication/HIVStatus.20T.Routine
* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenNegAlready
* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenNegNew
* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenPosAlready
* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenPosNew
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Negative
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Positive
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Unknown
Many, if not all, of these seem to arise because the targets are not rounded, while the SNUxIM tab has rounded targets.
Note that these are not actually flagged in pink in the data pack, as suggested by the error message.
This is the related code:
Lines 91 to 120 in 9b34ce5
Lines 98-103 in createKeyChainInfo seem to want to compare the names of sheets from the schema to the names of the sheets contained in the DataPack to be parsed.
any(tab_names_expected != tab_names_received)
If any tabs have been added, the two vectors have different lengths, and this warning appears:
Warning in tab_names_expected != tab_names_received :
longer object length is not a multiple of shorter object length
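A set-based comparison is length-independent, so it avoids the recycling warning entirely. A minimal sketch, using the variable names from the issue (the surrounding function context is assumed):

```r
# Set-based comparison: unaffected by vector length or ordering,
# so no recycling warning is possible.
tab_names_expected <- c("Home", "SNU x IM", "PMTCT_STAT_ART")
tab_names_received <- c("Home", "SNU x IM", "PMTCT_STAT_ART", "MyExtraTab")

missing_tabs <- setdiff(tab_names_expected, tab_names_received)
extra_tabs   <- setdiff(tab_names_received, tab_names_expected)

if (length(missing_tabs) > 0) {
  warning("Expected tabs not found: ", paste(missing_tabs, collapse = ", "))
}
if (length(extra_tabs) > 0) {
  warning("Unexpected tabs found: ", paste(extra_tabs, collapse = ", "))
}
```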
This comparison is handled in checkStructure anyway. @jacksonsj could you have a look and fix/remove?
Revisit the solution in #89. The implemented fix seems functional but clunky; due to time constraints we shipped it anyway, but consider it technical debt.
See the pull request for details on the issue and solution.
All developers need to be using the same version of dependencies in order to ensure that everything is reproducible across different environments.
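One way to pin dependency versions across environments is a project lockfile, for example with renv (a suggestion, not the project's current setup):

```r
# One-time setup in the project root: records the exact version of
# every dependency in renv.lock, which is committed to the repo.
renv::init()
renv::snapshot()

# Other developers then restore an identical library from the
# committed renv.lock:
renv::restore()
```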
The current OPU and Datapack app share basically the same code and functionality. Keeping both of these apps maintained will be laborious and duplicative. With one app, we should be able to perform the necessary validations on both OPU DataPacks and normal DataPacks, since the vast majority of the code is essentially the same.
We should be able to fairly easily determine what type of tool we are working with, and from there decide what to do with it in the app. Ideally, we would write the specific type of tool ("Data Pack", "OPU Data Pack", etc.) into a specific range of cells on the Home tab; this is currently available in cell B10, e.g. "COP21 Data Pack" or "COP20 OPU Data Pack". Once we have this information in the app, we can proceed with the specific processing each tool requires.
Command line users or apps would still be able to specify this information for specific use cases, but if left blank (NULL) we would try and obtain this information from the home tab.
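Reading the tool type from the Home tab could look roughly like this (cell B10 comes from the text above; the helper name and fallback logic are illustrative):

```r
library(readxl)

# Hypothetical helper: detect the tool type from Home!B10 when the
# caller does not supply one explicitly.
detectToolType <- function(submission_path, tool = NULL) {
  if (!is.null(tool)) return(tool)  # caller-specified value wins

  cell <- readxl::read_excel(submission_path,
                             sheet = "Home",
                             range = "B10",
                             col_names = FALSE)[[1]]
  # cell is e.g. "COP21 Data Pack" or "COP20 OPU Data Pack"
  if (grepl("OPU", cell)) "OPU Data Pack" else "Data Pack"
}
```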
Change the way site tool computes OU sums from Data Pack. Instead of pulling from d$data$site$distributed, pull from d$data$MER for purest link.
The current structure of datapackr::site_tool_schema
does not actually reflect the outputted schema.
@jacksonsj I am getting an error when trying to pack a site tool that I have traced to this point.
Line 405 in 600d4a8
It appears that getMechList is not returning a column named name as expected at the referenced point in the code (and maybe at some later points, e.g. x = data.frame(mechID = mechList$name)). I get these columns when calling getMechList directly:
> names(mechList)
[1] "mechanism" "code" "uid" "partner" "primeid" "agency" "ou" "startdate"
[9] "enddate"
I don't feel I know the code well enough to fix this bug. Perhaps we should be using mechanism instead of name, or perhaps we need to rename what is returned from getMechList.
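Until the root cause is settled, a defensive shim at the call site could normalize the column name. A sketch; getMechList's intended contract should be confirmed first:

```r
library(dplyr)

mechList <- getMechList()

# getMechList currently returns "mechanism" rather than "name";
# normalize so downstream code like mechList$name keeps working.
if (!"name" %in% names(mechList) && "mechanism" %in% names(mechList)) {
  mechList <- dplyr::rename(mechList, name = mechanism)
}
```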
As noted in the code, _Military PSNUs should not have any prioritization; even if they do, it should be ignored rather than imported.
The code in this section of the parser could be improved a bit to provide better feedback to the user.
FYI @sam-bao @jacksonsj
@gsarfaty in SA and I are having some trouble installing datapackr. There seem to be some upstream issues with installing datapackcommons, which has a dependency on doMC. We are both working off R 4.0.3.
remotes::install_github("pepfar-datim/datapackr")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo pepfar-datim/datapackr@HEAD
#> datapackc... (NA -> cc99f39e4...) [GitHub]
#> piton (NA -> 1.0.0 ) [CRAN]
#> tidyxl (NA -> 1.0.7 ) [CRAN]
#> Downloading GitHub repo pepfar-datim/data-pack-commons@HEAD
#> Skipping 1 packages not available: doMC
#> checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\Rtmp61BBOY\remotes1ae4793a38f\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' ... v checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\Rtmp61BBOY\remotes1ae4793a38f\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' (711ms)
#> - preparing 'datapackcommons':
#> checking DESCRIPTION meta-information ... checking DESCRIPTION meta-information ... v checking DESCRIPTION meta-information
#> - checking for LF line-endings in source and make files and shell scripts
#> - checking for empty or unneeded directories
#> - building 'datapackcommons_0.2.1.tar.gz'
#>
#>
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Error: Failed to install 'datapackr' from GitHub:
#> Failed to install 'datapackcommons' from GitHub:
#> (converted from warning) installation of package 'C:/Users/achafetz/AppData/Local/Temp/2/Rtmp61BBOY/file1ae4676b2d13/datapackcommons_0.2.1.tar.gz' had non-zero exit status
remotes::install_github("pepfar-datim/data-pack-commons")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo pepfar-datim/data-pack-commons@HEAD
#> Skipping 1 packages not available: doMC
#> checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\RtmpOQqKpN\remotes22c442247a37\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' ... v checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\RtmpOQqKpN\remotes22c442247a37\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' (720ms)
#> - preparing 'datapackcommons':
#> checking DESCRIPTION meta-information ... checking DESCRIPTION meta-information ... v checking DESCRIPTION meta-information
#> - checking for LF line-endings in source and make files and shell scripts
#> - checking for empty or unneeded directories
#> - building 'datapackcommons_0.2.1.tar.gz'
#>
#>
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Error: Failed to install 'datapackcommons' from GitHub:
#> (converted from warning) installation of package 'C:/Users/achafetz/AppData/Local/Temp/2/RtmpOQqKpN/file22c4e174685/datapackcommons_0.2.1.tar.gz' had non-zero exit status
install.packages("doMC")
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Warning: package 'doMC' is not available for this version of R
#>
#> A version of this package for your version of R might be available elsewhere,
#> see the ideas at
#> https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Created on 2021-01-28 by the reprex package (v0.3.0)
See DP-353 for details.
Related to pepfar-datim/datimutils#5
Login functions should:
* Accept a config file with no slash, a single slash, or multiple slashes in baseurl, and normalize this to a single trailing slash.
* Ensure all other API calls in code never start with a slash.
* Provide a utility function to a) encode all URIs and b) check for any double slashes (which we know will fail), e.g. a wrapper around utils::URLencode that throws an error if there is a "//" anywhere other than in "https://".
Eventually, this function should really be replaced entirely by something similar from the upcoming datimutils package.
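The utility described above might look like this (a sketch; the function name is hypothetical):

```r
# Hypothetical helper: build and sanity-check a DATIM API URL.
# Normalizes baseurl to exactly one trailing slash, strips any
# leading slash from the resource, encodes the result, and errors
# on any double slash outside the scheme.
api_url <- function(baseurl, resource) {
  baseurl  <- paste0(sub("/+$", "", baseurl), "/")
  resource <- sub("^/+", "", resource)
  url <- utils::URLencode(paste0(baseurl, resource))

  # Check for "//" anywhere after the "scheme://" prefix
  path_part <- sub("^[a-z]+://", "", url)
  if (grepl("//", path_part, fixed = TRUE)) {
    stop("Double slash detected in URL: ", url)
  }
  url
}

api_url("https://www.datim.org///", "/api/organisationUnits")
```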
> datapackr::packSiteTool(d,
+ output_path = paste0(support_dir_path, "site_tools/"))
Error in if (!stringr::str_detect(names(wb), "Home")) { :
argument is of length zero
>
Seems to happen here, perhaps wb has no names at this point.
Line 21 in a99ed9d
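`if()` on a length-zero condition throws exactly this error, so guarding names(wb) before the str_detect test would give a clearer failure. A sketch, assuming wb is the workbook object from the referenced code:

```r
# Guard against an unnamed or empty workbook object before testing
# for the Home tab. if() on a length-zero logical throws
# "argument is of length zero"; wrapping str_detect() in any() also
# handles the multi-sheet (length > 1) case.
tab_names <- names(wb)
if (is.null(tab_names) || length(tab_names) == 0) {
  stop("Workbook object has no sheet names; cannot locate Home tab.")
}
if (!any(stringr::str_detect(tab_names, "Home"))) {
  stop("No Home tab found in workbook.")
}
```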
Reviewing the indicators in the schema (cop23_data_pack_schema), why is index testing the only indicator that does not match MER? It is stored as HTS.Index as opposed to HTS_INDEX.
datapackr::cop23_data_pack_schema |>
tibble::as_tibble() |>
dplyr::filter(col_type == "target") |>
dplyr::select(indicator_code) |>
dplyr::distinct(indicator_code) |>
dplyr::mutate(indicator = stringr::str_extract(indicator_code, "[^\\.]+")) |>
dplyr::arrange(indicator) |>
print(n = Inf)
There is a data pack with text in value columns of the SNUxIM tab.
This results in a warning (Warning: NAs introduced by coercion), generated by this line of code:
Line 67 in 9b34ce5
I suggest we include the presence of text in a value column as an explicit error or warning.
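An explicit pre-coercion check could surface the offending cells to the user instead of a bare coercion warning. A sketch; the data frame and column names are illustrative:

```r
# Flag value cells that are non-empty but not parseable as numbers,
# before any as.numeric() coercion silently turns them into NA.
raw_values <- as.character(data$value)
coerced    <- suppressWarnings(as.numeric(raw_values))
text_rows  <- which(!is.na(raw_values) & raw_values != "" & is.na(coerced))

if (length(text_rows) > 0) {
  warning("Non-numeric text found in value column on rows: ",
          paste(text_rows, collapse = ", "))
}
```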
Instead of a string of comma-separated issues.
We will often need to automate these scripts, and requiring user interaction is problematic. Be sure to remove the use of file.choose as the fallback when the required file path is not supplied as a parameter to the function that needs it.
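A pattern that keeps scripts automatable is to fail fast when the path is missing in a non-interactive session, rather than silently falling back to a file picker. A sketch; the wrapper name is illustrative:

```r
# Hypothetical wrapper: never call file.choose() in automated runs.
resolve_submission_path <- function(submission_path = NULL) {
  if (is.null(submission_path)) {
    if (!interactive()) {
      stop("submission_path must be supplied when running non-interactively.")
    }
    submission_path <- file.choose()
  }
  if (!file.exists(submission_path)) {
    stop("File not found: ", submission_path)
  }
  submission_path
}
```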
Only relevant for regional OUs
With this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx
I am getting erroneous imbalanced-distribution warnings:
13 : WARNING!: 131 cases where distributed total is either more or less than total Target. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators ->
* GEND_GBV.N.ViolenceServiceType.20T.physEmot
* GEND_GBV.N.ViolenceServiceType.20T.postRape
* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewNeg
* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewPos
* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewPos
* HTS_SELF.N.Age/Sex/HIVSelfTest.20T.Directly_Assisted
* HTS_SELF.N.HIVSelfTest.20T.Unassisted
* HTS_TST_OtherMod.N.Age/Sex/Result.20T.Negative
* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Positive
* HTS_TST.N.KeyPop/Result.20T.Negative
* HTS_TST.N.KeyPop/Result.20T.Positive
* KP_PREV.N.KeyPop.20T
* PrEP_CURR.N.Age/Sex.20T
* PrEP_CURR.N.KeyPop.20T
* PrEP_NEW.N.Age/Sex.20T
* PrEP_NEW.N.KeyPop.20T
* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.Already
* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.New
* TX_CURR.N.Age/Sex/HIVStatus.20T
* TX_NEW.N.Age/Sex/HIVStatus.20T
* TX_NEW.N.KeyPop/HIVStatus.20T
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Negative
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Positive
* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Unknown
This screen shot has two rows from the same PSNU. NOTE that the value column has a different (exactly double) entry in the second row.
If we look at the data pack we see the targets are correctly allocated:
The affected code is here:
Lines 92 to 120 in 9b34ce5
Seems like a problem in the group by/aggregation of the data.
To detect cases where users have added rows above row 5, which causes problems, or where row 6 is not the beginning of the data.
datapackr/R/checkColStructure.R
Lines 33 to 38 in 5c70538
Likely best to make this change in the schema, via produceConfig.R
This line of code is not reliably detecting non-integers.
Line 133 in 9b34ce5
As an example for this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/71_DataPack_Uganda_20190124160453_03082019.xlsx,
non-integer values are flagged on the PMTCT_STAT_ART tab in the PMTCT_STAT.D.Age/Sex.20T
column. However, looking at the Excel version of the data pack does not reveal any non-integer numbers. There appears to be some floating-point error introduced when readxl::read_excel initially reads in the sheet.
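A tolerance-based test avoids flagging floating-point noise from readxl as a non-integer. A minimal sketch:

```r
# x == round(x) is unreliable after floating-point round-trips;
# compare within a small tolerance instead.
is_whole <- function(x, tol = 1e-9) {
  !is.na(x) & abs(x - round(x)) < tol
}

# Middle value is floating-point noise, not a real decimal.
vals <- c(12, 12.0000000001, 12.4)
is_whole(vals)
#> TRUE TRUE FALSE
```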
For this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx
I am receiving this blocking error:
13 : ERROR!: 1 cases where no distribution was attempted for Targets. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators ->
* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewPos
Investigating, I find that the source of the error is a target < 0.5 that is rounded to 0 on the SNUxIM tab, so no distribution against this target was made.
This is the code that produces the error.
Lines 36 to 57 in 9b34ce5
It is currently not possible to validate the West Africa Regional Data Pack, due to the lack of a UID.
> d<-unPackSiteToolData("/home/jason/consultancy/DATIM/Site Tool_West-Central Africa Region_20190410085106.15Apr2019.GLMSBT.xlsx")
[1] "Checking the file exists..."
[1] "Checking the OU name and UID on HOME tab..."
Error in if (d$info$datapack_name != datapack_name | d$info$datapack_uid != :
argument is of length zero
is the error.
Problem seems to be here.
I would rather not fix a hack with another hack. For West Africa, can't we just use the UID which is in DATIM?
There are some significant performance issues when calling the method adornMechanisms, since each time the function is applied an API request must be made to the DATIM server, which is fairly slow. This does not happen if a support file is present, which is simply an RDS file containing the API view.
With the deployment of the app on the new Connect servers, we need a slightly different mechanism to store this file. This function will be refactored slightly to
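One lightweight refactor is to memoise the API fetch to a filesystem cache, so repeated adornMechanisms calls reuse the stored view across sessions. A sketch; fetchMechanismView and the SUPPORT_FILES_DIR variable are hypothetical stand-ins for the real fetch function and server configuration:

```r
# Cache the slow DATIM API view on disk; subsequent calls with the
# same arguments read the cached copy instead of hitting the server.
cache_dir <- file.path(Sys.getenv("SUPPORT_FILES_DIR", tempdir()),
                       "mech_cache")

fetchMechanismView_cached <- memoise::memoise(
  fetchMechanismView,  # hypothetical API-fetch function
  cache = memoise::cache_filesystem(cache_dir)
)
```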
Scott et al.,
I ran this after what seemed a successful install and a restart (so I don't have the whole history of the install), and it dropped me out early with an error.
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> require(devtools)
Loading required package: devtools
> require(datapackr)
Loading required package: datapackr
> d <- unPackData()
[1] "Checking the file exists..."
[1] "Checking the OU name and UID on HOME tab..."
Error: expected <
Here is link to latest datapack which I was trying to check:
https://www.pepfar.net/ou/vietnam/HQ%20Collaboration/COP%202019%20%E2%80%93%20FY%202020/Original%20Submission%20of%20Required%20Tools%20(Feb%2021)/DataPack_Vietnam%2020190225%2018h00.xlsb
If no targets come from Data Pack for an entire tab, remove it to:
Running datapackr (master), I run into a bug with the newly introduced Year 2 tab. The tab does not match the structure of the other tabs (no PSNU information), which results in an error when mapping through the data import. Not sure if this is part of your PR @jason-p-pickering in the parse-year2 branch.
Identical to logic/process used in the Data Pack. Can borrow code from there.
There was an issue in the following lines of code:
dplyr::group_by(PSNU,psnuid,indicator_code,Age,Sex,KeyPop,support_type) %>%
dplyr::summarize(distribution = sum(distribution)) %>%
dplyr::mutate(distribution_diff = abs(distribution - 1.0)) %>%
dplyr::filter(distribution_diff >= 1e-3 & distribution != 1.0) %>%
So, since the data was being grouped by support_type and then summed, it was just wrong: a sloppy copy and paste from the pure-dedupe section.
The correct way to identify dedupes is to calculate the count of components (DSD/DSD or TA/TA) for pure duplication, and for crosswalks, to determine whether there is any DSD/TA allocation for the same data element disagg. There is no need at the identification phase to worry about what the allocation is. It's better just to count how many potential data element/disaggs overlap, and then filter for the 100% allocations.
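The counting approach described above could be sketched as follows (column names taken from the snippet; illustrative only):

```r
library(dplyr)

# Identify potential dedupes by counting overlapping mechanism rows
# per data element/disagg, without touching allocation values.
# Pure dedupe:      more rows than support types implies >1 mechanism
#                   with the same support type (DSD/DSD or TA/TA).
# Crosswalk dedupe: both DSD and TA present for the same disagg.
dedupe_candidates <- data %>%
  group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop) %>%
  summarise(n_mechs         = n(),
            n_support_types = n_distinct(support_type),
            .groups = "drop") %>%
  mutate(pure_dedupe      = n_mechs > n_support_types,
         crosswalk_dedupe = n_support_types > 1) %>%
  filter(pure_dedupe | crosswalk_dedupe)
```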
https://github.com/pepfar-datim/datapackr/blob/master/R/unPackData.R#L95-L116 is quite redundant. Create a function to build the file path, with parameters supplied as required.
Since Data Pack targets do not have DSD/TA assignments and won't until Site distribution, push all into DSD dataelements and run these through datimvalidation::validateData()
The writePSNUxIM function requires the path to the SNUxIM model file. This works fine for local installs where a file path is known, but does not work great for server installations/apps where there is no intrinsic ability to control the location of the file path on the server. We have previously not been able to store the model file as part of the source code due to security concerns.
The basic approach is to use a single symmetric key, which we can then store on the server and retrieve as an environment variable.
# Create a random string of 32 characters
k <- stringi::stri_rand_strings(1, 32)
> k
[1] "JiVsc14Ob9L7FClK6OVTxvAfHW9U7XZS"
# Convert this to a sodium key
key <- cyphr::key_sodium(charToRaw(k))
# Read the data to be encrypted
foo <- readRDS("PSNUxIM_20200319.rds")
# Save as an encrypted file
cyphr::encrypt(saveRDS(foo, "foo.encrypted"), key)
# This does not work
> readRDS("foo.encrypted")
Error in readRDS("foo.encrypted") : unknown input format
# This does work
cyphr::decrypt(readRDS("foo.encrypted"), key)
The encrypted file cannot be read without the key, and can thus be securely stored as part of the source code in GitHub (as long as the key itself is kept secret).
This approach should alleviate the issues we have with not being able to store support files, such as the model file, as part of the source code itself, which is needed to deploy the app to the server without intrinsic knowledge of where the file itself is going to be stored.
Thoughts @sam-bao @jacksonsj ?
using endDate field
Line 427 in 0a19c12
While validating a South Africa site tool the validation app states:
Running the code from the terminal states: Error in if (any(has_positive_dedupe)) { :
missing value where TRUE/FALSE needed
I determined there were rows in the PrEP tab of the site tool with blanks for the mechanism code (the very last rows of the table, to be exact). Once these cells were populated, the validation worked.
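Validation could both tolerate and report these blank mechanism codes rather than crashing. A sketch; the data frame and column names are illustrative:

```r
# Rows with a blank mechanism code make has_positive_dedupe NA,
# and if (any(NA)) then errors with "missing value where TRUE/FALSE
# needed". Report and drop such rows explicitly.
blank_mechs <- is.na(data$mech_code) | data$mech_code == ""
if (any(blank_mechs)) {
  warning(sum(blank_mechs),
          " rows have a blank mechanism code and were skipped.")
  data <- data[!blank_mechs, ]
}

# Alternatively, make the test itself NA-safe:
# if (isTRUE(any(has_positive_dedupe, na.rm = TRUE))) { ... }
```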
Receiving this error:
Error in d[["tests"]][["decimal_cols"]][[as.character(sheet)]] <- decimal_cols :
more elements supplied than there are to replace
trying to parse this datapack:
The code I used to parse the data pack:
country_uids <- c("FFVkaV9Zk1S")
submission_path <- "###"
## Note that submission_path is optional in this setup. If not supplied, a console window will pop up to allow you to pick the file.
d <- datapackr::unPackTool(submission_path = submission_path,
tool = "Data Pack",
country_uids = country_uids)
Problem is related to this line of code:
Line 140 in e6553e7
It is not obvious to me what should go into d[["tests"]][["decimal_cols"]][[as.character(sheet)]]
but it seems like this works:
d[["tests"]][["decimal_cols"]][[as.character(sheet)]] <- list(decimal_cols)
This same issue may be repeated in other pieces of code such as
d[["tests"]][["non_numeric"]][[as.character(sheet)]] <- non_numeric
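The error comes from `[[<-` on a target that is not yet a list: assigning a multi-element vector into a slot of a NULL target fails, while a list target works. A minimal reproduction of both behaviors:

```r
x <- NULL
# x[["a"]] <- c(1, 2)  # Error: more elements supplied than there
#                      # are to replace

# Initializing the container as a list first fixes it cleanly:
x <- list()
x[["a"]] <- c(1, 2)    # works: whole vector stored in one element
```

This suggests the cleaner fix may be to initialize d$tests (or its sub-elements) as list() upstream, rather than wrapping each assigned value in list(), which nests the value one level deeper.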