pepfar-datim / datapackr Goto Github PK

View Code? Open in Web Editor NEW

9.0 9.0 7.0 104.49 MB

License: Creative Commons Zero v1.0 Universal

R 98.16% Rez 1.84%

datapackr's People

Contributors

Stargazers

Watchers

Forkers

jason-p-pickering awnage maxwellchandler adebaddest solanrewaju ilimikato enosande44

datapackr's Issues

Resolve missing cross reference in documentation

Lines 30-31 of unPackData.R makes reference to a function which cannot be found.

\item Extracts MER data for use by the \code{\link{packSiteTool}}

function.

rePackSNUxIM giving distribution error because of decimals

With this datapack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx

I am receiving a large number of errors related to incorrectly distributing data on the SNU x IM tab:

13 :  WARNING!: 519 cases where distributed total is either more or less than total Target. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators -> 
	* GEND_GBV.N.ViolenceServiceType.20T.physEmot
	* GEND_GBV.N.ViolenceServiceType.20T.postRape
	* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewNeg
	* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewPos
	* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewNeg
	* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewPos
	* HTS_SELF.N.Age/Sex/HIVSelfTest.20T.Directly_Assisted
	* HTS_SELF.N.HIVSelfTest.20T.Unassisted
	* HTS_TST_Inpat.N.Age/Sex/Result.20T.Negative
	* HTS_TST_Inpat.N.Age/Sex/Result.20T.Positive
	* HTS_TST_MobileMod.N.Age/Sex/Result.20T.Negative
	* HTS_TST_MobileMod.N.Age/Sex/Result.20T.Positive
	* HTS_TST_OtherMod.N.Age/Sex/Result.20T.Negative
	* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Negative
	* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Positive
	* HTS_TST_PMTCTPostANC1.N.Age/Sex/Result.20T.Negative
	* HTS_TST_PMTCTPostANC1.N.Age/Sex/Result.20T.Positive
	* HTS_TST_STIClinic.N.Age/Sex/Result.20T.Negative
	* HTS_TST_STIClinic.N.Age/Sex/Result.20T.Positive
	* HTS_TST_VCT.N.Age/Sex/Result.20T.Negative
	* HTS_TST_VCT.N.Age/Sex/Result.20T.Positive
	* HTS_TST.N.KeyPop/Result.20T.Negative
	* HTS_TST.N.KeyPop/Result.20T.Positive
	* KP_PREV.N.KeyPop.20T
	* PMTCT_ART.N.Age/NewExistingART/Sex/HIVStatus.20T.Already
	* PMTCT_ART.N.Age/NewExistingART/Sex/HIVStatus.20T.New
	* PMTCT_STAT.D.Age/Sex.20T
	* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewNeg
	* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewPos
	* PP_PREV.N.Age/Sex.20T
	* PrEP_CURR.N.Age/Sex.20T
	* PrEP_CURR.N.KeyPop.20T
	* PrEP_NEW.N.Age/Sex.20T
	* PrEP_NEW.N.KeyPop.20T
	* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.Already
	* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.New
	* TB_PREV.D.Age/TherapyType/NewExistingArt/HIVStatus.20T.IPTNew
	* TB_PREV.N.Age/TherapyType/NewExistingArt/HIVStatus.20T.IPTNew
	* TX_CURR.N.Age/Sex/HIVStatus.20T
	* TX_NEW.N.Age/Sex/HIVStatus.20T
	* TX_NEW.N.KeyPop/HIVStatus.20T
	* TX_PVLS.D.Age/Sex/Indication/HIVStatus.20T.Routine
	* TX_PVLS.N.Age/Sex/Indication/HIVStatus.20T.Routine
	* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenNegAlready
	* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenNegNew
	* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenPosAlready
	* TX_TB.D.Age/Sex/TBScreen/NewExistingART/HIVStatus.20T.ScreenPosNew
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Negative
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Positive
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Unknown

Many if not all of these seem to be because the targets are not rounded, but the SNUxIM tab has rounded targets.

Gend GBV tab

SNUxIM tab

Notice these are not actually flagged in pink in the data pack as suggested in the error message.

This is the related code.

datapackr/R/rePackSNUxIM.R

Lines 91 to 120 in 9b34ce5

 # TEST where attempted distribution sum != target 

 imbalancedDistribution <- d$data$distributedMER %>% 

 tidyr::drop_na(value, distribution) %>% 

 dplyr::select(-Age, -distribution, -mechanism_code) %>% 

 dplyr::group_by_at(dplyr::vars(dplyr::everything(), -value)) %>% 

 dplyr::summarize(value = round(sum(value), digits = 5)) %>% 

 dplyr::ungroup() %>% 

 dplyr::group_by_at(dplyr::vars(dplyr::everything(), -SNUxIM_value)) %>% 

 dplyr::summarize(SNUxIM_value = round(sum(SNUxIM_value), digits = 5)) %>% 

 dplyr::ungroup() %>% 

 dplyr::filter(value != SNUxIM_value) 

 if (NROW(imbalancedDistribution) > 0) { 

 d$tests$imbalancedDistribution <- imbalancedDistribution 

 imbalancedDistribution_inds <- imbalancedDistribution %>% 

 dplyr::select(indicator_code) %>% 

 dplyr::distinct() %>% 

 dplyr::arrange(indicator_code) %>% 

 dplyr::pull(indicator_code) 

 warning_msg <- 

 paste0( 

 "WARNING!: ", 

 NROW(imbalancedDistribution), 

 " cases where distributed total is either more or less than total Target.", 

 " To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells.", 

 " This has affected the following indicators -> \n\t* ", 

 paste(imbalancedDistribution_inds, collapse = "\n\t* "), 

 "\n")

Fix buggy code in createKeychainInfo

Lines 98-103 in createKeyChainInfo seem to want to compare the names of sheets from the schema, to the names of the sheets contained in the DataPack to be parsed.

This does not do what I think the authors intention was

any(tab_names_expected != tab_names_received)

If any tabs have been added, and the lengths are different, this warning will appear, since the vectors are of different lengths.

Warning in tab_names_expected != tab_names_received :                                  
  longer object length is not a multiple of shorter object length

Pretty sure this is already taken care of in checkStructure anyway.

@jacksonsj could you have a look and fix/remove?

Revisit solution in PR #89

Revisit solution in #89. Implemented fix seems functional but clunky. Due to time constraints we implemented the fix but consider it technical debt.

See pull request for details on the issue and solution.

Update dependencies

All developers need to be using the same version of dependencies in order to ensure that everything is reproducible across different environments.

Try and get type of tool and COP year if not specified

The current OPU and Datapack app share basically the same code and functionality. Keeping both of these apps maintained will be laborious and duplicative. With one app, we should be able to perform the necessary validations on both OPU DataPacks and normal DataPacks, since the vast majority of the code is essentially the same.

We should be able to pretty easily determine what type of tool we are working with, and from there, decide what to do with it in the app. Ideally, we could write the specific type of tool "Data Pack", "OPU Data Pack", etc into a specific range of cells on the Home tab, but this is currently available in cell B10, like "COP21 Data Pack" or "COP20 OPU Data Pack". Once we have this information in the app, we can proceed with the specific processing which each tool requires.

Command line users or apps would still be able to specify this information for specific use cases, but if left blank (NULL) we would try and obtain this information from the home tab.

Site Tool: Change OU sum logic

Change the way site tool computes OU sums from Data Pack. Instead of pulling from d$data$site$distributed, pull from d$data$MER for purest link.

Revise site tool schema

Current structure of datapackr::site_tool_schema does not actually relfect the outputted schema.

Add schema for Mechanism Map in datapackr

getMechMap doesn't return the column name

@jacksonsj I am getting an error when trying to pack a site tool that I have traced to this point.

datapackr/R/packSiteTool.R

Line 405 in 600d4a8

dplyr::select(name, code) %>%

It appears that get MechList is not returning a column named name as expected at the referenced point in the code (and mabe some future points e.g. x = data.frame(mechID = mechList$name)). I get these columns when calling getMechList directly

> names(mechList)
[1] "mechanism" "code"      "uid"       "partner"   "primeid"   "agency"    "ou"        "startdate"
[9] "enddate"

I don't feel I know the code well enough here to fix this bug. Perhaps we should be using mechanism instead of name or perhaps we need to rename what is returned from getMechList.

Improve validation of Prioritization tab

As noted in the code, _Military PSNUs should not have any prioritization, and even if they do, it should not be imported and just ignored.

The code in this section of the parser could be improved a bit to provide better feedback to the user.

Unable to install package

FYI @sam-bao @jacksonsj

@gsarfaty in SA and I are having some trouble installing datapackr. There seem to be some upstream issues with installing datacommons, which has a dependency for doMC. We both are working off R 4.0.3.

remotes::install_github("pepfar-datim/datapackr")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo pepfar-datim/datapackr@HEAD
#> datapackc... (NA -> cc99f39e4...) [GitHub]
#> piton        (NA -> 1.0.0       ) [CRAN]
#> tidyxl       (NA -> 1.0.7       ) [CRAN]
#> Downloading GitHub repo pepfar-datim/data-pack-commons@HEAD
#> Skipping 1 packages not available: doMC
#>          checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\Rtmp61BBOY\remotes1ae4793a38f\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' ...  v  checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\Rtmp61BBOY\remotes1ae4793a38f\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' (711ms)
#>       -  preparing 'datapackcommons':
#>    checking DESCRIPTION meta-information ...     checking DESCRIPTION meta-information ...   v  checking DESCRIPTION meta-information
#>       -  checking for LF line-endings in source and make files and shell scripts
#>       -  checking for empty or unneeded directories
#>       -  building 'datapackcommons_0.2.1.tar.gz'
#>      
#> 
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Error: Failed to install 'datapackr' from GitHub:
#>   Failed to install 'datapackcommons' from GitHub:
#>   (converted from warning) installation of package 'C:/Users/achafetz/AppData/Local/Temp/2/Rtmp61BBOY/file1ae4676b2d13/datapackcommons_0.2.1.tar.gz' had non-zero exit status

remotes::install_github("pepfar-datim/data-pack-commons")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo pepfar-datim/data-pack-commons@HEAD
#> Skipping 1 packages not available: doMC
#>          checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\RtmpOQqKpN\remotes22c442247a37\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' ...  v  checking for file 'C:\Users\achafetz\AppData\Local\Temp\2\RtmpOQqKpN\remotes22c442247a37\pepfar-datim-data-pack-commons-cc99f39/DESCRIPTION' (720ms)
#>       -  preparing 'datapackcommons':
#>    checking DESCRIPTION meta-information ...     checking DESCRIPTION meta-information ...   v  checking DESCRIPTION meta-information
#>       -  checking for LF line-endings in source and make files and shell scripts
#>       -  checking for empty or unneeded directories
#>       -  building 'datapackcommons_0.2.1.tar.gz'
#>      
#> 
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Error: Failed to install 'datapackcommons' from GitHub:
#>   (converted from warning) installation of package 'C:/Users/achafetz/AppData/Local/Temp/2/RtmpOQqKpN/file22c4e174685/datapackcommons_0.2.1.tar.gz' had non-zero exit status

install.packages("doMC")
#> Installing package into 'C:/Users/achafetz/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Warning: package 'doMC' is not available for this version of R
#> 
#> A version of this package for your version of R might be available elsewhere,
#> see the ideas at
#> https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

^{Created on 2021-01-28 by the reprex package (v0.3.0)}

Site Tool: Add newly added countries and _Military nodes into getSiteList

Upgrade R dependency to 4.1.1

See DP-353 for details.

Better handling of authentication files.

Related to pepfar-datim/datimutils#5

Logins functions should

Accept a config file with no slash, a slash, or multiple slashes for baseurl and mutate this to a single trailing slash
All other API calls in code, should never start with a slash
A utility function to a) encode all URIs and b) check if there are any double slashes (which we know will fail)
e.g. a wrapper around utils::urlencode that thrown an error if there is a "//" other than in "https://"

Eventually, this function should really be replaced entirely be something similar from the upcoming datimutils package.

Error when packing site tool related to home tab

 > datapackr::packSiteTool(d,
+                         output_path = paste0(support_dir_path, "site_tools/"))
Error in if (!stringr::str_detect(names(wb), "Home")) { : 
  argument is of length zero
>

Seems to happen here, perhaps wb has no names at this point.

datapackr/R/packFrame.R

Line 21 in a99ed9d

if(!stringr::str_detect(names(wb), "Home")) {

HTS.Index

Reviewing the indicators in the schema (cop23_data_pack_schema), why is index testing the only indicator that does not match MER - stored as HTS.Index as opposed to HTS_INDEX?

  datapackr::cop23_data_pack_schema |>
    tibble::as_tibble() |>
    dplyr::filter(col_type == "target") |>
    dplyr::select(indicator_code) |>
    dplyr::distinct(indicator_code) |>
    dplyr::mutate(indicator = stringr::str_extract(indicator_code, "[^\\.]+")) |>
    dplyr::arrange(indicator) |>
    print(n = Inf)

Convert implicit warning in unpackSNUxIM to explicit error

There is a data pack with text in value columns of the SNUxIM tab.

This results in a warning: Warning: NAs introduced by coercion

generated by this line of code:

datapackr/R/unPackSNUxIM.R

Line 67 in 9b34ce5

dplyr::mutate(value = as.numeric(value)) %>%

I suggest we include the presence of text in a value column as an explicit error or warning.

unPackData: Format messaging to display as itemized list

Instead of string of comma separated issues.

Remove use of "file.choose" if possible

We will often need to automate these scripts, and having a user interaction required is problematic.Be sure to remove the use of file.choose if the required file path is not supplied as a parameter to the function which needs it.

Site Tool: Add feature to allow distribution from regional to national Military nodes

Only relevant for regional OUs

Bug in imbalancedDistribution Test

with this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx

I am getting erroneous imbalanced distribution warnings:

13 :  WARNING!: 131 cases where distributed total is either more or less than total Target. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators -> 
	* GEND_GBV.N.ViolenceServiceType.20T.physEmot
	* GEND_GBV.N.ViolenceServiceType.20T.postRape
	* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewNeg
	* HTS_INDEX_COM.N.Age/Sex/Result.20T.NewPos
	* HTS_INDEX_FAC.N.Age/Sex/Result.20T.NewPos
	* HTS_SELF.N.Age/Sex/HIVSelfTest.20T.Directly_Assisted
	* HTS_SELF.N.HIVSelfTest.20T.Unassisted
	* HTS_TST_OtherMod.N.Age/Sex/Result.20T.Negative
	* HTS_TST_OtherPITC.N.Age/Sex/Result.20T.Positive
	* HTS_TST.N.KeyPop/Result.20T.Negative
	* HTS_TST.N.KeyPop/Result.20T.Positive
	* KP_PREV.N.KeyPop.20T
	* PrEP_CURR.N.Age/Sex.20T
	* PrEP_CURR.N.KeyPop.20T
	* PrEP_NEW.N.Age/Sex.20T
	* PrEP_NEW.N.KeyPop.20T
	* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.Already
	* TB_ART.N.Age/Sex/NewExistingART/HIVStatus.20T.New
	* TX_CURR.N.Age/Sex/HIVStatus.20T
	* TX_NEW.N.Age/Sex/HIVStatus.20T
	* TX_NEW.N.KeyPop/HIVStatus.20T
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Negative
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Positive
	* VMMC_CIRC.N.Age/Sex/HIVStatus.20T.Unknown

This screen shot has two rows from the same PSNU. NOTE that the value column has a different (exactly double) entry in the second row.

If we look at the data pack we see the targets are correctly allocated:

The affected code is here:

datapackr/R/rePackSNUxIM.R

Lines 92 to 120 in 9b34ce5

 imbalancedDistribution <- d$data$distributedMER %>% 

 tidyr::drop_na(value, distribution) %>% 

 dplyr::select(-Age, -distribution, -mechanism_code) %>% 

 dplyr::group_by_at(dplyr::vars(dplyr::everything(), -value)) %>% 

 dplyr::summarize(value = round(sum(value), digits = 5)) %>% 

 dplyr::ungroup() %>% 

 dplyr::group_by_at(dplyr::vars(dplyr::everything(), -SNUxIM_value)) %>% 

 dplyr::summarize(SNUxIM_value = round(sum(SNUxIM_value), digits = 5)) %>% 

 dplyr::ungroup() %>% 

 dplyr::filter(value != SNUxIM_value) 

 if (NROW(imbalancedDistribution) > 0) { 

 d$tests$imbalancedDistribution <- imbalancedDistribution 

 imbalancedDistribution_inds <- imbalancedDistribution %>% 

 dplyr::select(indicator_code) %>% 

 dplyr::distinct() %>% 

 dplyr::arrange(indicator_code) %>% 

 dplyr::pull(indicator_code) 

 warning_msg <- 

 paste0( 

 "WARNING!: ", 

 NROW(imbalancedDistribution), 

 " cases where distributed total is either more or less than total Target.", 

 " To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells.", 

 " This has affected the following indicators -> \n\t* ", 

 paste(imbalancedDistribution_inds, collapse = "\n\t* "), 

 "\n")

Seems like a problem in the group by/aggregation of the data.

Add check for row 5 to colStructure checks

To detect cases where users have added rows above row 5 that causes problems, or where row 6 is not the beginning of data.

datapackr/R/checkColStructure.R

Lines 33 to 38 in 5c70538

 col_check <- schema %>% 

 dplyr::filter(sheet_name == sheet 

 & !(sheet == "SNU x IM" & indicator_code == "Mechanism1")) %>% 

 dplyr::select(indicator_code, template_order = col) %>% 

 dplyr::full_join(submission_cols, by = c("indicator_code" = "indicator_code")) %>% 

 dplyr::mutate(order_check = template_order == submission_order)

Site Tool: Remove note header boxes on HTS & KP tabs

Likely best to make this change in the schema, via produceConfig.R

False Positives for decimal values in unPackSheet

This line of code is not reliably detecting non-integers.

datapackr/R/unPackSheet.R

Line 133 in 9b34ce5

dplyr::filter(value %% 1 != 0

As an example for this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/71_DataPack_Uganda_20190124160453_03082019.xlsx,

non decimals are flagged on the PMTCT_STAT_ART tab in the PMTCT_STAT.D.Age/Sex.20T column. However looking at the excel version of the data pack does not reveal any non-integer numbers. There appears to be some floating point error introduced when readxl::read_excel initially reads in the sheet.

Update datim to site tool config file to correct erroneous categoryOptionCombos for TB_ART

Error message on SNUxIM not distributed when target <.5

For this data pack: https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Malawi_03182019.xlsx

I am recieving this blocking error:

13 :  ERROR!: 1 cases where no distribution was attempted for Targets. To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells. This has affected the following indicators -> 
	* PMTCT_STAT.N.Age/Sex/KnownNewResult.20T.NewPos

Investigating I find that the source of the error is a target < .5 that is rounded to 0 on theSNUxIM tab. So no distribution against this target was made.

This is the code that produces the error.

datapackr/R/rePackSNUxIM.R

Lines 36 to 57 in 9b34ce5

 undistributed <- d$data$distributedMER %>% 

 dplyr::filter(!is.na(value) & is.na(distribution)) 

 if (NROW(undistributed) > 0) { 

 d$tests$undistributed <- undistributed 

 undistributed_inds <- undistributed %>% 

 dplyr::select(indicator_code) %>% 

 dplyr::distinct() %>% 

 dplyr::arrange(indicator_code) %>% 

 dplyr::pull(indicator_code) 

 warning_msg <- 

 paste0( 

 "ERROR!: ", 

 NROW(undistributed), 

 " cases where no distribution was attempted for Targets.", 

 " To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells.", 

 " This has affected the following indicators -> \n\t* ", 

 paste(undistributed_inds, collapse = "\n\t* "), 

 "\n")

Site Tool: Cannnot validate regional Data Packs

It is currently not possible to validate the West Africa Regional Data Pack, due to the lack of a UID.

> d<-unPackSiteToolData("/home/jason/consultancy/DATIM/Site Tool_West-Central Africa Region_20190410085106.15Apr2019.GLMSBT.xlsx")
[1] "Checking the file exists..."
[1] "Checking the OU name and UID on HOME tab..."
Error in if (d$info$datapack_name != datapack_name | d$info$datapack_uid !=  :                                                     
  argument is of length zero

is the error.

Problem seems to be here.

I would rather not fix a hack with another hack. For West Africa, can't we just use the UID which is in DATIM?

Improve performance of adornMechanisms

There are some significant performance issues when calling the method adornMechanisms, since each time the function is applied, an API request must be obtained from the DATIM server, which is fairly slow. This does not happen if a support file is present, which is simply an RDS file containing the API view.

With the deployment of the app on the new connect servers, we need a slightly different mechanism to store this file. This function will be refactored slightly to

Load the cached file if its present and less than a day old
Attempt to retrieve the file and cache it locally if the file is stale or not present.

Trying datapackr, not working

Scott et al.,

I ran this, after what seemed a successful install, and a restart (so I didn't have the whole history of the install), and I it dropped me out early with an error.

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> require(devtools)
Loading required package: devtools
> require(datapackr)
Loading required package: datapackr
> d <- unPackData()
[1] "Checking the file exists..."
[1] "Checking the OU name and UID on HOME tab..."
Error: expected <

Here is link to latest datapack which I was trying to check:
https://www.pepfar.net/ou/vietnam/HQ%20Collaboration/COP%202019%20%E2%80%93%20FY%202020/Original%20Submission%20of%20Required%20Tools%20(Feb%2021)/DataPack_Vietnam%2020190225%2018h00.xlsb

Split functions into separate files for easier tracking of changes

Site Tool: Add code to remove tab if no data present

If no targets come from Data Pack for an entire tab, remove it to:

Save file space.
Prevent addition of targets in Site Tool outside of what was set in Data Pack.

Year 2 sheet bug on check

Running datapackr (master) I run into a bug with the newly introduced Year 2 tab. The tab does not does not match the structure of the other tabs (no PSNU information) which results in an error mapping through the data import. Not sure if this is part of your PR @jason-p-pickering in the parse-year2 branch.

Site Tool: Add conditional formatting on HTS, OVC, KP tabs

Identical to logic/process used in the Data Pack. Can borrow code from there.

Flawed logic in dedupe resolution.

There was an issue in the following lines of code:

dplyr::group_by(PSNU,psnuid,indicator_code,Age,Sex,KeyPop,support_type) %>% 
    dplyr::summarize(distribution = sum(distribution)) %>% 
    dplyr::mutate(distribution_diff = abs(distribution - 1.0)) %>% 
    dplyr::filter(distribution_diff >= 1e-3 & distribution != 1.0) %>%

So, since the data was being grouped by support_type and then summed...well, it was just wrong. Sloppy copy and paste from the pure dedupe section.

The correct way to identify dedupes is to calculate the count of components (DSD/DSD or TA/TA) for pure duplication, and for crosswalks, to determine if there is any DSD/TA allocation for the same data element disagg. There is no need at the identification phase to worry about what the allocation is. Its better just to count and see how many potential data element/disaggs overlap, and then filter for the 100% allocations.

Functionalize duplicative code

https://github.com/pepfar-datim/datapackr/blob/master/R/unPackData.R#L95-L116 is quite redundant. Create a function to create the file path if possible with parameters supplied as required.

unPackData: Map indicators to dataelements and run through DATIM validations

Since Data Pack targets do not have DSD/TA assignments and won't until Site distribution, push all into DSD dataelements and run these through datimvalidation::validateData()

Implement encryption of support files

The writePSNUxIM function requires the path to the SNUxIM model file. This works fine for local installs where a file path is known, but does not work great for server installations/apps where there is no intrinsic ability to control the location of the file path on the server. We have previously not been able to store the model file as part of the source code due to security concerns.

The basic approach is to use a single symmetric key, which we can then store on the server and retrieve as an environment variable.

#Create a random string of 32 characters
k<-stringi::stri_rand_strings(1,32)
> k
[1] "JiVsc14Ob9L7FClK6OVTxvAfHW9U7XZS"
#Convert this to a sodium key
key<-cyphr::key_sodium(charToRaw(k))
#Read the data to be encrypted
foo<-readRDS("PSNUxIM_20200319.rds")
#Save as an encrypted file 
cyphr::encrypt(saveRDS(foo,"foo.encrypted"),key)
#This does not work
> readRDS("foo.encrypted")
Error in readRDS("foo.encrypted") : unknown input format
#This does work
cyphr::decrypt(readRDS("foo.encrypted"),key)

The encrypted file cannot be read without the key, and can thus be securely stored as part of the source code in GitHub (as long as the key itself is kept secret).

This approach should alleviate the issues we have with not being able to store support files, such as the model file, as part of the source code itself, which is needed to deploy the app to the server, without intrinsic knowledge of where the file itself is going to be stored.

Thoughts @sam-bao @jacksonsj ?

Site Tool: Remove expiring mechanisms from getMechList

using endDate field

Bug/BadError message in unpack site tool data

datapackr/R/parseSiteTool.R

Line 427 in 0a19c12

if ( any( has_positive_dedupe ) ) {

While validating a South Africa site tool the validation app states:

Running the code from the terminal states: Error in if (any(has_positive_dedupe)) { :
missing value where TRUE/FALSE needed

I determined there were rows in the PrEP tab of the site tool with blanks for the mechanism code (the very last rows of the table to be exact.) Once these cells were populated the validation worked.

Bug if >1 column contains decimal values

Receiving this error:

Error in d[["tests"]][["decimal_cols"]][[as.character(sheet)]] <- decimal_cols : 
  more elements supplied than there are to replace

trying to parse this datapack:

https://www.pepfar.net/Project-Pages/collab-38/Shared%20Documents/Data%20Pack%202019%20Staging%20Area/Support%20Files/datapacks/DataPack_Namibia_20190314_1200.xlsx

The code I used to parse the data pack:

country_uids <- c("FFVkaV9Zk1S")

submission_path <- "###"
## Note that submission_path is optional in this setup. If not supplied, a console window will pop up to allow you to pick the file.

d <- datapackr::unPackTool(submission_path = submission_path,
                tool = "Data Pack",
                country_uids = country_uids)

Problem is related to this line of code:

datapackr/R/unPackSheet.R

Line 140 in e6553e7

d[["tests"]][["decimal_cols"]][[as.character(sheet)]] <- decimal_cols

It is not obvious to me what should go in to d[["tests"]][["decimal_cols"]][[as.character(sheet)]] but it seems like this works:

d[["tests"]][["decimal_cols"]][[as.character(sheet)]] <- list(decimal_cols)

This same issue may be repeated in other pieces of code such as

 d[["tests"]][["non_numeric"]][[as.character(sheet)]] <- non_numeric

	# TEST where attempted distribution sum != target
	imbalancedDistribution <- d$data$distributedMER %>%
	tidyr::drop_na(value, distribution) %>%
	dplyr::select(-Age, -distribution, -mechanism_code) %>%
	dplyr::group_by_at(dplyr::vars(dplyr::everything(), -value)) %>%
	dplyr::summarize(value = round(sum(value), digits = 5)) %>%
	dplyr::ungroup() %>%
	dplyr::group_by_at(dplyr::vars(dplyr::everything(), -SNUxIM_value)) %>%
	dplyr::summarize(SNUxIM_value = round(sum(SNUxIM_value), digits = 5)) %>%
	dplyr::ungroup() %>%
	dplyr::filter(value != SNUxIM_value)

	if (NROW(imbalancedDistribution) > 0) {
	d$tests$imbalancedDistribution <- imbalancedDistribution

	imbalancedDistribution_inds <- imbalancedDistribution %>%
	dplyr::select(indicator_code) %>%
	dplyr::distinct() %>%
	dplyr::arrange(indicator_code) %>%
	dplyr::pull(indicator_code)

	warning_msg <-
	paste0(
	"WARNING!: ",
	NROW(imbalancedDistribution),
	" cases where distributed total is either more or less than total Target.",
	" To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells.",
	" This has affected the following indicators -> \n\t* ",
	paste(imbalancedDistribution_inds, collapse = "\n\t* "),
	"\n")

	col_check <- schema %>%
	dplyr::filter(sheet_name == sheet
	& !(sheet == "SNU x IM" & indicator_code == "Mechanism1")) %>%
	dplyr::select(indicator_code, template_order = col) %>%
	dplyr::full_join(submission_cols, by = c("indicator_code" = "indicator_code")) %>%
	dplyr::mutate(order_check = template_order == submission_order)

	undistributed <- d$data$distributedMER %>%
	dplyr::filter(!is.na(value) & is.na(distribution))

	if (NROW(undistributed) > 0) {
	d$tests$undistributed <- undistributed

	undistributed_inds <- undistributed %>%
	dplyr::select(indicator_code) %>%
	dplyr::distinct() %>%
	dplyr::arrange(indicator_code) %>%
	dplyr::pull(indicator_code)

	warning_msg <-
	paste0(
	"ERROR!: ",
	NROW(undistributed),
	" cases where no distribution was attempted for Targets.",
	" To identify these, go to your SNU x IM tab and filter the Rollup column for Pink cells.",
	" This has affected the following indicators -> \n\t* ",
	paste(undistributed_inds, collapse = "\n\t* "),
	"\n")

pepfar-datim / datapackr Goto Github PK

datapackr's People

Contributors

Stargazers

Watchers

Forkers

datapackr's Issues

\item Extracts MER data for use by the \code{\link{packSiteTool}}

function.

Gend GBV tab

SNUxIM tab

Recommend Projects

Recommend Topics

Recommend Org

Jobs