afsc-gap-products / gapindex

Calculation of Design-Based Indices of Abundance and Composition for AFSC GAP Bottom Trawl Surveys

Home Page: https://afsc-gap-products.github.io/gapindex/

License: Other

R 23.43% HTML 76.54% TeX 0.04%
cpue database-management index-production

gapindex's People

Contributors

benwilliams-noaa, emilymarkowitz-noaa, margaretsiple-noaa, zoyafuso-noaa


gapindex's Issues

Name of functions for calculating comps by subareas/areas/regions

Question

I was looking at the gapindex R functions for calculating comps by areas and noticed that the larger-area calculations for sizecomps and agecomps are named differently. Is there a reason that one is named "_subarea" and the other "_region"? Wouldn't they do the same thing? I would advocate for "area" for both, but am also surprised these couldn't be combined with the "*_stratum" functions. More a comment than anything, but it would be good to use consistent naming across function names.


Regional differences in size composition calculation

@Ned-Laman-NOAA @Duane-Stevenson-NOAA

I’m fairly close to finishing the gapindex::calc_sizecomp_stratum() function that calculates size comps, but I wanted to see whether we can come to a consensus on how the size comps are calculated between regions.

Currently, gapindex::calc_sizecomp_stratum() has an argument called "fill_NA_method" that controls how hauls with positive catch weights but no associated size data are handled (either "AIGOA" or "BS"). If == "BS", these hauls contribute to the dummy length -9 category for their respective strata. If == "AIGOA", an average size distribution is applied to these hauls, so the length -9 category does not exist in the AI or GOA versions of the size composition tables. Initially I created the options for both regions so I could reproduce the historical sizecomp tables, with the intention of having this discussion later in the process.
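
For reference, a minimal sketch of how the two conventions are selected through that argument (argument names match the calc_sizecomp_stratum() call shown later in this thread; the production_* objects are assumed to come from get_data(), calc_cpue(), and calc_biomass_stratum()):

## Bering Sea convention: positive hauls with no size data go to the dummy
## length -9 bin within their stratum
sizecomp_bs <- gapindex::calc_sizecomp_stratum(
  racebase_tables = production_data,
  racebase_cpue = production_cpue,
  racebase_stratum_popn = production_biomass_stratum,
  spatial_level = "stratum",
  fill_NA_method = "BS")

## AI/GOA convention: those hauls get the average stratum/year size
## distribution instead, so no -9 bin appears
sizecomp_aigoa <- gapindex::calc_sizecomp_stratum(
  racebase_tables = production_data,
  racebase_cpue = production_cpue,
  racebase_stratum_popn = production_biomass_stratum,
  spatial_level = "stratum",
  fill_NA_method = "AIGOA")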

The main question is: do we want to continue calculating the size comps differently between the BS and AI/GOA survey regions going forward, or should there be just one way that we calculate this? Luckily, since I've already compared the GAP_PRODUCTS tables with the current tables, whichever calculation we go with will have already been vetted.

I can see arguments for either approach, or for keeping them separate, but I think the more conservative option would be the dummy length -9 approach used in the Bering Sea. In the AI/GOA scripts, we make the extra assumption that the size distribution of a station with missing size data can be estimated by taking the average size distribution within that stratum/year. I don’t think I have the experience to assess whether that is a reasonable assumption.

AI-GOA Age Data Upload Schedule

Hi Nancy (Ned cc'd),

Thanks for posting the EBS/NBS age data updates to the modsquad discussion board last week. That will be super helpful in the future. 

I'm in the process of comparing the age compositions that come out of the gapindex package with those in the AGECOMP_TOTAL tables in the AI and GOA schemata. I'm getting mismatches between the two, so I'm investigating any possible reasons for this. I'm reminded of this issue from the gap requests repo, but I think this is a separate issue.

Here's the question: in the AI-GOA, when age data are uploaded for a particular species/year, are they all uploaded at the same time? Or are there cases where only a portion of the age data for a species/year are uploaded and then, some time in the future (years later), the remaining portion for that species/year is uploaded? I'm asking because if this is occurring and we don't redo the agecomp calculations for all years, then those added data are never integrated into the historical agecomps, which could explain the mismatches.

Thanks,
Zack

database query

Seems that odbc is faster than RODBC; may want to switch over given the size of the data you are pulling in. Here is an example: https://github.com/BenWilliams-NOAA/swo/blob/09256fbdfe5bcb7d312ef83ba7357d36b8d4ccfe/R/query_data.R#L31

Also, recommend using vroom or data.table::fread for files this size; they will speed up your saves and reads substantially.

https://github.com/afsc-gap-products/design-based-indices/blob/09d003fa2a7048298cf5823957d8b93a040fe287/R/00_download_data_from_oracle.R#L30
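
A minimal sketch of the suggested swap, assuming a DSN named "AFSC" and placeholder credentials (DBI/odbc for the query, data.table::fwrite()/fread() for the flat files; this is not the code from the linked script):

library(DBI)
library(odbc)
library(data.table)

## Connect through odbc instead of RODBC ("AFSC" is a placeholder DSN)
con <- DBI::dbConnect(odbc::odbc(),
                      dsn = "AFSC",
                      uid = "your_username",
                      pwd = "your_password")

## Pull a table, save it quickly, and read it back quickly
catch <- DBI::dbGetQuery(con, "SELECT * FROM RACEBASE.CATCH")
data.table::fwrite(catch, "data/catch.csv")
catch <- data.table::fread("data/catch.csv")

DBI::dbDisconnect(con)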

Fix BIOMASS_TOTAL table

Check totals; not sure how to get the survey area totals from stratum totals. May need to check with Wakabayashi and/or Wayne.
Weights for test case (POP, GOA, 2019) are not matching SQL Dev CPUE tables.

SQL Dev table:

MEAN_WGT_CPUE: 3930.22
VAR_WGT_CPUE: 282986.12
MEAN_NUM_CPUE: 6706.92
VAR_NUM_CPUE: 771466.465
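
On the question of getting survey-area totals from stratum totals: a hedged sketch of the standard stratified-design aggregation, assuming a stratum-level table with total and variance columns (the column and grouping names below are illustrative, not necessarily what BIOMASS_TOTAL uses):

library(dplyr)

survey_totals <- biomass_stratum %>%
  dplyr::group_by(SURVEY, YEAR, SPECIES_CODE) %>%
  dplyr::summarise(
    BIOMASS_MT  = sum(BIOMASS_MT),   # stratum totals add across strata
    BIOMASS_VAR = sum(BIOMASS_VAR),  # variances add for independent strata
    .groups = "drop")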

AREA_ID = 99902 not returned from calc_biomass_subarea()

Issue Description

It appears that the total-area calculation for the NBS (AREA_ID = 99902) is not being returned by the calc_biomass_subarea() function for species complexes. This works fine for the EBS, just not for the NBS. Below I take one fish complex and one invert complex to illustrate, following the instructions from the package documentation for preparing complexes.

Is it something that I am doing or something funky with the function? Thanks! Tagging in @sowasser for awareness.

Steps to Reproduce

Sign into oracle

sql_channel <- gapindex::get_connected()

Create complex data

## Pull data. Note the format of the `spp_codes` argument with the GROUP column
library(gapindex)
production_data <- get_data(
  year_set = c(1982:2023),
  survey_set = c("EBS", "NBS"),
  spp_codes = rbind(data.frame(GROUP = "all flatfishes", SPECIES_CODE = c(10000:10399)),
                    data.frame(GROUP = "neptune whelks", SPECIES_CODE = c(71884, 71882))),
  pull_lengths = TRUE,
  haul_type = 3,
  abundance_haul = "Y",
  sql_channel = sql_channel)

## Zero-fill and calculate CPUE
production_cpue <- calc_cpue(racebase_tables = production_data)

## Calculate Biomass, abundance, mean CPUE, and associated variances by stratum
production_biomass_stratum <- 
  gapindex::calc_biomass_stratum(racebase_tables = production_data,
                                 cpue = production_cpue)

## Aggregate Biomass to subareas and region
production_biomass_subarea <- 
  calc_biomass_subarea(racebase_tables = production_data, 
                       biomass_strata = production_biomass_stratum)

## Calculate size composition by stratum. Note fill_NA_method = "BS" because
## our region is EBS, NBS, or BSS. If the survey region of interest is AI or
## GOA, use "AIGOA". See ?gapindex::calc_sizecomp_stratum for more details.
production_sizecomp_stratum <- 
  gapindex::calc_sizecomp_stratum(
    racebase_tables = production_data,
    racebase_cpue = production_cpue,
    racebase_stratum_popn = production_biomass_stratum,
    spatial_level = "stratum",
    fill_NA_method = "BS")

## Aggregate size composition to subareas/region
production_sizecomp_subarea <- gapindex::calc_sizecomp_subarea(
  racebase_tables = production_data,
  size_comps = production_sizecomp_stratum)

## rbind stratum and subarea/region biomass estimates into one dataframe
names(x = production_biomass_stratum)[
  names(x = production_biomass_stratum) == "STRATUM"
] <- "AREA_ID"
production_biomass <- rbind(production_biomass_stratum, 
                            production_biomass_subarea)

## rbind stratum and subarea/region size composition estimates into one dataframe
names(x = production_sizecomp_stratum)[
  names(x = production_sizecomp_stratum) == "STRATUM"] <- "AREA_ID"
production_sizecomp <- 
  rbind(production_sizecomp_subarea,
        production_sizecomp_stratum[, names(production_sizecomp_subarea)])


Biomass complex data

Here I check how many times a calculation was made for each SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE combination in the biomass estimates. This should:

  1. Include an entry for 99902 (NBS Total Area), which it does not.
  2. Be equal to the number of years of data for that combination in the resultant data. Instead, each combination appears duplicated for some reason: SURVEY_DEFINITION_ID = 98, AREA_ID = 99901, SPECIES_CODE = "all flatfishes" should have 41 years of data but returns 82 rows. I checked, and the duplicates are exact duplicates that can be removed with unique() or dplyr::distinct(), as shown in the second chunk.
library(dplyr)
production_biomass %>% 
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::filter(Freq != 0) %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head()


production_biomass %>% 
  unique() %>% # to address point 2
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::filter(Freq != 0) %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head()


Sizecomp complex data

Here I check how many times a calculation was made for each SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE combination in the sizecomp estimates. I show this because it works as expected:

  1. The resultant table includes AREA_ID = 99902
  2. Rows are not duplicated.
production_sizecomp %>% 
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head(10)


data access

"RACE_DATA.CRUISES",

Getting this message:

Error in gapindex::get_connected(db = "afsc") :
Cannot connect to these tables in Oracle:
RACE_DATA.CRUISES
RACE_DATA.SURVEYS
RACE_DATA.SURVEY_DEFINITIONS
RACE_DATA.VESSELS

Please contact [email protected] for access to these tables and then try connecting again.

And the email gets bounced back. I'm also guessing that many assessment authors would get this message if they use this package?
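
For troubleshooting, a hedged sketch that connects with RODBC directly (bypassing gapindex::get_connected()) and checks which of the listed tables an account can actually read; the DSN name and credentials are placeholders:

chl <- RODBC::odbcConnect(dsn = "AFSC", uid = "your_username",
                          pwd = "your_password")

tbls <- c("RACE_DATA.CRUISES", "RACE_DATA.SURVEYS",
          "RACE_DATA.SURVEY_DEFINITIONS", "RACE_DATA.VESSELS")

## RODBC::sqlQuery() returns a data frame on success and a character vector
## of error messages on failure, so is.data.frame() works as an access check
sapply(tbls, function(x) {
  is.data.frame(RODBC::sqlQuery(chl, paste("SELECT * FROM", x,
                                           "WHERE ROWNUM = 1")))
})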

Add functionality for multiple regions

Enable multiple regions to be called by functions:

  • get_data()
  • calc_cpue()
  • calc_biomass_stratum()
  • calc_agg_biomass()
  • calc_size_stratum_BS()
  • calc_size_stratum_AIGOA()
  • calc_agg_size_comp()
  • calc_age_comp()

connecting to database via VM

I can't seem to connect to Oracle via get_connected() when on a virtual machine, though I have confirmed that it works on my laptop using the same credentials.

sql_channel <- gapindex::get_connected()
Error in gapindex::get_connected() :
Unable to connect. Username or password may be incorrect. Check that you are connected to the network (e.g., VPN). Please re-enter.

I don't recall having this problem in the past. Can somebody confirm that this works on their VM so I can identify what is causing it?

SPECIES_CODE changes in GAP_PRODUCTS break functionality

Issue

I'm trying to wrap my head around a recent change to GAP_PRODUCTS, and I'm not entirely sure where this issue should go because it is a problem for how multiple repos interact with GAP_PRODUCTS (e.g., here).

In GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION, SPECIES_CODE 41500 changed to 41099, but there are still catch records for 41500 across RACEBASE tables. For gapindex, that means get_data() retrieves catch data for 41500 but no species data. The lack of species data seems to break the CPUE calculations and subsequent steps.
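
A hedged sketch of a check for this kind of mismatch: list SPECIES_CODEs that occur in RACEBASE.CATCH but have no entry in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION (table and column names follow the issue text; adjust if the schema differs):

orphans <- RODBC::sqlQuery(
  channel = sql_channel,
  query = "SELECT DISTINCT c.SPECIES_CODE
           FROM RACEBASE.CATCH c
           LEFT JOIN GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION t
             ON c.SPECIES_CODE = t.SPECIES_CODE
           WHERE t.SPECIES_CODE IS NULL
           ORDER BY c.SPECIES_CODE")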

Cross-posting an issue in gapindex.

Possible AI CPUE discrepancy

Copying from an email @Ned-Laman-NOAA sent me regarding a discrepancy between the station-level CPUE estimates in the AI tables and those in FOSS, which this package will also likely encounter. My resolution on the FOSS side was to disregard it, but I am interested to know how the R package will want to address it.

There are 15 [station-level cpue] records in the AI record that could be problematic. [Upon recalculating the station-level cpue estimates,] It seems odd that the vast majority of records [in the current ai cpue oracle tables] comply with our expectations and just a few don't. There is a chance that for these older records, RACEBASE has been revised (e.g., unid'd sculpins (21300) might've been broken out into species ids) and CPUE/Biomass weren't re-run for that cruise. Re-running older biomass estimates is problematic for a few reasons that I'm happy to chat about, but I have a sneaking suspicion that this may be part of the discrepancy.

Here's the SQL*Plus query I was using to address the question:

select a.vessel, a.cruise, a.haul, a.species_code, a.wgtcpue, b."cpue_kgkm2",
       a.wgtcpue - b."cpue_kgkm2" diff,
       round(100 * (abs(a.wgtcpue - b."cpue_kgkm2") / a.wgtcpue), 1) pct_diff
from ai.cpue a, racebase_foss.racebase_public_foss b
where a.hauljoin = b."hauljoin"
and a.species_code = b."species_code"
and a.wgtcpue != 0
order by -(abs(a.wgtcpue - b."cpue_kgkm2"))

Data input for Agecomp scripts (PERFORMANCE codes)

Hi @Ned-Laman-NOAA, @Duane-Stevenson-NOAA, and @RebeccaHaehn-NOAA

I noticed that across the different scripts that run age composition, there are slight differences regarding which specimen data are used:

EBS: On line 10 of G:/EBSother/SCRIPTS_FOR_STOCK_ASSESSMENT/AGECOMP/agecomp_ebs_plusnw_stratum.sql, there's a note that specimen records can come from hauls with negative performance codes.

NBS: On line 25 of G:/EBSother/SCRIPTS_FOR_STOCK_ASSESSMENT/NBS/agecomp_nbs_stratum.sql, there's the clause "WHERE PERFORMANCE >= 0", meaning specimen records can only come from hauls with performance code >= 0.

AI/GOA: Ned, I looked at the A.agecomp function that is contained in G:/GOA/R/agecomp/.RData. I assume that specimen records can come from any haul regardless of the PERFORMANCE code; is this correct? For example, I don't see a call that subsets specimen records to hauls with PERFORMANCE >= 0 in that function.
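
To make the contrast concrete, a hedged sketch in R of the two filters described above, assuming a specimen data frame already joined to haul data with a PERFORMANCE column (object names are illustrative):

## EBS and (apparently) AI/GOA behavior: no PERFORMANCE filter applied
spec_all       <- specimen

## NBS behavior: keep specimen records only from hauls with PERFORMANCE >= 0
spec_good_perf <- subset(specimen, PERFORMANCE >= 0)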

Is there historical context for excluding negative-performance hauls from the biomass/abundance calculations but including them in the age composition calculations? Ideally we should stick to one rule for consistency, but I wonder whether there was a discussion about this before I joined the group.

Thanks in advance,
Zack

are all species in the spp_start_year table?

Issue

I ran the vignette code to generate spp_start_year and noticed that the code for BSAI Flathead Sole, 10130, is missing. Could this be clarified? Are data for that species not available via this package, simply not populated in that table [yet], or do they not have any year constraints?

spp_start_year <-
  RODBC::sqlQuery(channel = sql_channel, 
                  query = "SELECT * FROM GAP_PRODUCTS.SPECIES_YEAR")

 spp_start_year
   SPECIES_CODE YEAR_STARTED
1           435         1999
2           436         1999
3           455         1999
4           456         1999
5           471         1999
6           472         1999
7           473         1999
8           474         1999
9           480         1988
10          481         1988
11        10110         1992
12        10112         1992
13        10212         1984
14        10261         1996
15        10262         1996
16        30051         2006
17        30052         2006
18        30152         1996
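
For what it's worth, a quick check of the code in question against the table pulled above:

## BSAI Flathead Sole (10130) is absent from the year-constraint table
10130 %in% spp_start_year$SPECIES_CODE
#> [1] FALSE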

Data pulls for model-based estimates

It would be GREAT if this package included the functionality to replicate the data pulls for model-based survey data products. These are currently created using sumfish, which many staff members cannot reliably get to work. This means that data pulls for these time-sensitive products may be unnecessarily delayed by limited staff availability and access.

An example of a data pull script is here, and an example of a formatted data file is here. If we can replicate these data files, in this format, it will reduce the pressure on the few staff who can currently provide these data sets.

calc_biomass_subarea bug

Using version 2.1.1

This piece of script does not reproduce what is in HAEHNR.BIOMASS_EBS_PLUSNW. The bug has to do with how NAs are dealt with in the function.

library(gapindex)

## Connect to Oracle
sql_channel <- gapindex::get_connected()

## Pull data.
gapindex_data <- gapindex::get_data(
  year_set = c(1993),
  survey_set = "EBS",
  spp_codes = 81742,   
  haul_type = 3,
  abundance_haul = "Y",
  pull_lengths = F,
  sql_channel = sql_channel)

## Fill in zeros and calculate CPUE
cpue <- gapindex::calc_cpue(racebase_tables = gapindex_data)

## Calculate stratum-level biomass, population abundance, mean CPUE and 
## associated variances
biomass_stratum <- gapindex::calc_biomass_stratum(
  racebase_tables = gapindex_data,
  cpue = cpue)

## Calculate aggregated biomass and population abundance across subareas,
## management areas, and regions
biomass_subareas <- gapindex::calc_biomass_subarea(
  racebase_tables = gapindex_data,
  biomass_strata = biomass_stratum)
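
For comparison, a hedged sketch of pulling the existing production table for the same species to set against biomass_subareas above (the column names in HAEHNR.BIOMASS_EBS_PLUSNW are assumed, not verified):

## Pull the existing EBS plus-NW biomass table for SPECIES_CODE 81742
ebs_plusnw <- RODBC::sqlQuery(
  channel = sql_channel,
  query = "SELECT * FROM HAEHNR.BIOMASS_EBS_PLUSNW
           WHERE SPECIES_CODE = 81742")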

Question about GOA Big Skate

Hey @Ned-Laman-NOAA

I'm doing some output checking with the package we've been working on to consolidate index computation. Can you (hopefully quickly and painlessly) confirm something for me: in RACEBASE.CATCH, querying HAULJOIN = -5945 AND SPECIES_CODE = 420 produces one record with weight 0.088 kg (1 fish). In GOA.CPUE, that same query produces one record with zero weight and zero number of fish. Given the weight, was this record supposed to be a big skate egg case that was later changed to a different species code? This pattern popped up for a handful of records in my output checks, but luckily I believe this is the only species that I found to have this mismatch.
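
For reference, the two lookups described above, assuming an open RODBC channel:

## One record, 0.088 kg, 1 fish
RODBC::sqlQuery(sql_channel,
                "SELECT * FROM RACEBASE.CATCH
                 WHERE HAULJOIN = -5945 AND SPECIES_CODE = 420")

## One record, zero weight and zero number of fish
RODBC::sqlQuery(sql_channel,
                "SELECT * FROM GOA.CPUE
                 WHERE HAULJOIN = -5945 AND SPECIES_CODE = 420")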

Thanks in advance and no rush,
Zack

gap_products area_id correction

In the gap_products.area table for the EBS slope there is a duplicate area_id. For instance:

survey_definition_id = 78
area_id = 1
type = "STRATUM"
area_name = "all"
description = Bering Sea Slope Survey, All Subareas, Depth Range 200-300 m

survey_definition_id = 78
area_id = 1
type = "SUBAREA"
area_name = "1.0"
description = EBS slope subarea 1: All depths

No other table entries (that I've looked at) appear to duplicate values. This instance precludes filtering biomass or specimen data on survey_definition_id and area_id alone, thus forcing an undesired join with the area table. Any chance of a correction?
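
A hedged sketch of a query to flag any other offending pairs (table and column names follow the description above):

dup_area_ids <- RODBC::sqlQuery(
  channel = sql_channel,
  query = "SELECT SURVEY_DEFINITION_ID, AREA_ID, COUNT(*) AS N
           FROM GAP_PRODUCTS.AREA
           GROUP BY SURVEY_DEFINITION_ID, AREA_ID
           HAVING COUNT(*) > 1")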

Issue with get_data() when there's no size/age data

Using version 2.1.1

library(gapindex)

## Connect to Oracle
sql_channel <- gapindex::get_connected()

## Pull data.
gapindex_data <- gapindex::get_data(
  year_set = c(2022),
  survey_set = "EBS",
  spp_codes = 81742,   
  haul_type = 3,
  abundance_haul = "Y",
  pull_lengths = T,
  sql_channel = sql_channel)

gives this error:

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
In addition: Warning messages:
1: In gapindex::get_data(year_set = c(2022), survey_set = "EBS", spp_codes = 81742, :
There are no length data for any of the species_codes for
survey area 'EBS' in the chosen years (2022)
2: In gapindex::get_data(year_set = c(2022), survey_set = "EBS", spp_codes = 81742, :
There are no age data for any the species_codes for
survey area 'EBS' in the chosen years (2022)

Add error checkpoints for get_data() function: unavailable years/species, multiple regions

From @RebeccaHaehn-NOAA:

  1. If no data exist for that year (e.g., EBS 2020, NBS 2020, NBS 2018)

Error: "Error in haul_data$START_TIME : $ operator is invalid for atomic vectors"

Would it be possible to replace the error with something like "No data found for input year/survey"?

  2. If the species was not caught during that year or region (e.g., sablefish in the NBS, or some years in the EBS)

Error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

Would it be possible to replace the error message with something like "Species not found"?

From @zoyafuso-NOAA

Until the package can handle multiple regions with multiple stratifications, add an error if the user pulls more than one survey region.
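
A rough, hypothetical sketch of the kinds of checkpoints requested above (the haul_data/catch_data object names and the exact insertion points inside get_data() are assumptions):

## Inside get_data(), before any downstream merges:
if (length(survey_set) > 1)
  stop("gapindex currently supports one survey region per call. ",
       "Please pull regions separately.")

if (!is.data.frame(haul_data) || nrow(haul_data) == 0)
  stop("No data found for survey '", paste(survey_set, collapse = ", "),
       "' in year(s) ", paste(year_set, collapse = ", "), ".")

if (nrow(catch_data) == 0)
  stop("None of the requested spp_codes were caught in the chosen ",
       "years/survey region.")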

Questions for Ned/Wayne about size/age comp code

Questions - edited to reflect answers

  1. What does -9 mean in the age comps table? It shows up in GOA.AGECOMP_TOTAL as an age class
    A: Length but no age.
  2. Where does AGEPOP come from in GOA.AGECOMP_TOTAL?
    From Maia:

A colleague indicated that AGEPOP is the sum (in weight?) of trawl catches for that year/age/sex. Here's the query in the original script I have:

MyQuery <- paste0("SELECT GOA.AGECOMP_TOTAL.SURVEY,\n ",
                  "GOA.AGECOMP_TOTAL.SURVEY_YEAR,\n ",
                  "GOA.AGECOMP_TOTAL.SPECIES_CODE,\n ",
                  "GOA.AGECOMP_TOTAL.AGE,\n ",
                  "GOA.AGECOMP_TOTAL.SEX,\n ",
                  "GOA.AGECOMP_TOTAL.AGEPOP\n",
                  "FROM GOA.AGECOMP_TOTAL\n ",
                  "WHERE GOA.AGECOMP_TOTAL.SURVEY = 'GOA'\n ",
                  "AND GOA.AGECOMP_TOTAL.SPECIES_CODE = 10130")

... Is AGEPOP something I can straightforwardly calculate on my own from the RACEBASE.SPECIMEN data, and/or are there plans to update the GOA.AGECOMP_TOTAL table to include the latest age data?
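
(For reference, the query above can be run against the channel used elsewhere in this thread:)

agecomp <- RODBC::sqlQuery(channel = sql_channel, query = MyQuery)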

  3. For the workflow of generating age comps: if Peter/Michael ran the a.agecomp.R script every year or twice a year, did they loop through all the species and replace the table entirely, or just replace a specific year/species combo? The answer to this may be in the SQL code within Peter's script.

[tasks] Specify AFSC database in get_connected()

Issue Description

Update the prompt text in get_connected() to specify that folks should enter their password for the AFSC Oracle database. Stock assessors have access to AKFIN via separate credentials.

Tasklist

Error checkpoints for calc_size_stratum_BS/AIGOA(): unavailable data

  • Put an error if you don’t have a size list in your data
  • Put an error if you have invertebrates
  • Put an error when you don’t have unsexed individuals.
  • Add an error when characters are used in the GROUP column (message: no rows to aggregate)
  • Add units to the length column “LENGTH_MM”

First stab at package development

I forked this repo and started on an initial R package to (1) pull data, (2) fill zeros and calculate CPUE, (3) calculate stratum biomass, and (4) calculate biomass across subareas and across regions for the AI, GOA, EBS, EBS + NW, and NBS regions. The fork is located here. Could one or all of you run the code in the Initial Testing section and see whether the package installs properly and the different functions run? Reply on this thread if you run into problems. @MargaretSiple-NOAA probably has a better workflow for code testing, which I am open to learning about as we go along.
