rBatt / trawlData
Collate and clean bottom trawl survey data
@mpinsky @JWMorley @bselden What are some variables that change far more over time than they do over space?
E.g., bottom temperature might change among years, but there's also a fair bit of spatial variability within a region, I'm guessing (!).
I was wondering if ENSO would be a good predictor (or NAO, PDO out west)
We have a few predictors that change over space but not time (lat, lon, depth, rugosity [coming soon]), but not much that varies primarily over time (we can use year as a predictor!).
Just something to think about as a future enhancement.
In 2013, there's a really cold 'stratum' (according to my definition) in SEUS. It's the southernmost stratum, and the region-wide average temperature in 2013 is about 4 degrees cooler than the long-term average.
@JWMorley Have you seen anything like this in your analysis of these data? I haven't dug into the raw data yet.
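For reference, a hedged sketch of how that comparison could be reproduced (clean.sa is the SEUS object, per its use elsewhere in this repo; btemp, stratum, and year follow the package's column names):
# stratum-by-year mean bottom temperature
clean.sa[, .(mean_btemp = mean(btemp, na.rm=TRUE)), by=c("stratum","year")]
# each year's region-wide mean, expressed as an anomaly from the long-term average
clean.sa[, .(ann = mean(btemp, na.rm=TRUE)), by="year"][, .(year, anomaly = ann - mean(ann))]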
One of the slowest steps in my current workflow is aggregation. In particular, aggregating within a species-haul. This is only needed to aggregate among individuals or sexes, or, if taxonomy has changed, to aggregate species previously ID'd as different taxa but which should actually be grouped.
My (informed) guess is that most of the time there is only 1 row of data per species-haulid combination. Thus, it might be a speed-up to only perform the aggregation on subsets with more than 1 row of data.
An example could be to define a new column or external indexing/reference vector like in the following:
# clean.ebs[,nrow(.SD),by=c("haulid","spp")] # IS ACTUALLY SLOW!
multiple_rows <- clean.ebs[,length(wtcpue),by=c("haulid","spp")][,V1 > 1] # snappy :)
Note that it takes a lot longer to count the rows in .SD than to simply take the length of one of the columns. This is probably because the .SD data.table still has to be populated (I think, not sure), and also contains all of the columns (although even clean.ebs[,length(wtcpue),by=c("haulid","spp")][,nrow(.SD),by=c("haulid","spp")] is slow, so the extra columns aren't the issue).
After identifying rows that would form .SD's with more than 1 row, we can probably add an i logical vector at the final aggregation step. But this could be slow, as it would evaluate the i for each combination in by (so it would skip the aggregation function, but would still go through each combination). A faster alternative may be to split the data.table into 2 data.tables: 1 that needs aggregation, and 1 that doesn't; then perform aggregation on the appropriate data.table, and recombine the two.
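A minimal sketch of that split/aggregate/recombine idea, reusing clean.ebs and wtcpue from above (the sum is just a stand-in for whatever aggregation function actually applies):
# flag species-haul combinations with more than 1 row (.N here is fast, like length(wtcpue))
counts <- clean.ebs[, .(n=.N), by=c("haulid","spp")]
multi <- clean.ebs[counts[n > 1], on=c("haulid","spp")]   # rows that need aggregation
single <- clean.ebs[counts[n == 1], on=c("haulid","spp")] # rows that can pass through
agg <- multi[, .(wtcpue=sum(wtcpue), nAgg=.N), by=c("haulid","spp")]
single[, nAgg := 1L]  # as noted below, nAgg still has to be added to the pass-through rows
recombined <- rbindlist(list(agg, single[, .(haulid, spp, wtcpue, nAgg)]), use.names=TRUE)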
Note that this approach might be hard to implement, because it assumes that if there's only 1 row in a .SD, then nothing needs to be modified. For aggregation, this is generally true. However, the nAgg column would still need to be added to the non-aggregated portion (all would have nAgg := 1), and the lu and drop functions alter the value even if there's only 1 row (as might a custom function).
However, something like this could be pretty handy, and checks/warnings could be added alongside an option argument to skip aggregation for .SD's with 1 row. The test could take the form of, say, X[1, j={...}, by=c(byCols)] == X[1], to make sure that nothing changes when functions are applied to only 1 row of data.
It might also be a good idea to add an argument like skip_single_aggs=FALSE.
The key place to make the change would probably be on this line: https://github.com/rBatt/trawlData/blob/development/R/trawlAgg.R#L230
If desired, I wrote code to restrict the SODA bottom temperatures to the 200 m depth contour.
The last two digits of CRUISE6 in raw.neus do NOT appear to encode the correct sampling month.
In the surveys labeled "fall", the month that is calculated using those last two digits implies that sampling occurs every month except January and February.
I checked this against the raw data for cod from the trawl survey available on the OBIS website (http://www.iobis.org/mapper/?dataset=1435), and found that monthcollected for the fall survey only included September-December. This suggests that there is an error in the way that datetime is calculated for neus files in the trawlData package.
Code highlighting the discrepancy between the months sampled in clean.neus and the cod OBIS data file is attached below, as is the cod OBIS data file itself.
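As a rough sketch of the kind of check described (the attached files aren't reproduced here; the SEASON column name and the digit positions are assumptions):
# tabulate the month implied by the last two digits of CRUISE6 for fall cruises,
# then compare against monthcollected in the OBIS cod records
raw.neus[SEASON == "FALL", table(substr(as.character(CRUISE6), 5, 6))]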
@rBatt
#1 For those species with an NA conflict field and a BS-batch flag (from the batch download I did from WoRMS), the spp will show the accepted name. But if this differs from the species it matched in ref, the genus will still be the old genus.
Example:
ref=BARBATIA DOMIGESIS
species that was matched in the database (does not appear in file)=Barbatia domingensis
spp=accepted name=Acar domingensis
species=Acar domingensis
genus=Barbatia
See http://www.marinespecies.org/aphia.php?p=taxdetails&id=582484
Will need to subset the data by the BS-batch flag, create a temporary genus column that is a split of spp, then run something along the lines of ifelse(genus.temp == genus, genus, genus.temp)
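A rough sketch of that fix (the flag column name used to identify the BS-batch rows is hypothetical):
spp.key[flag == "BS-batch", genus.temp := sapply(strsplit(spp, " "), `[`, 1)]  # genus from the accepted spp
spp.key[flag == "BS-batch", genus := ifelse(genus.temp == genus, genus, genus.temp)]
spp.key[, genus.temp := NULL]  # drop the temporary column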
Like what created these: https://github.com/rBatt/trawl/tree/master/Figures/stratTolFigs
That function is already in repo: https://github.com/rBatt/trawlData/blob/development/R/formatStrat.R#L72-L185
Just needs to be dusted off.
Then need to add a function that can use it to help the user decide how to trim the strata.
Function should have defaults, but user can specify. Maybe this should be integrated into clean.trimCols
Looking here: rBatt/trawl@d91331f
Can see that tolerances I had chosen before were c(ai=3, ebs=5, gmex=4, goa=3, neus=5, newf=4, sa=0, sgulf=2, shelf=6, wcann=2, wctri=3)
I think the year can be retrieved from the first four digits of the haulid, unless someone knows differently:
clean.wcann[, year := as.integer(substr(haulid, 1, 4))]
expand.data might be too much of a brute for this; but we'll have to see how unwieldy it gets with a simpler but more memory-hungry approach
aggData(X=mini_data, FUN=mean, bio_lvl="spp",space_lvl="stratum",time_lvl="year", bioCols="wtcpue",envCols=c("stemp","btemp"), metaCols=c("datetime","reg"), meta.action=c("unique1"))
gives
Error in as(NA, class(x)) :
c("no method or default for coercing “logical” to “POSIXct”", "no method or default for coercing “logical” to “POSIXt”")
In addition: Warning messages:
1: In if (class(x) == "integer64") { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(i)) { :
the condition has length > 1 and only the first element will be used
No bottom temperature in the Scotian Shelf in 2011. @mpinsky sorry to keep pinging you on things like this, but any ideas?
pinskylab/OceanAdapt#45 and pinskylab/OceanAdapt#44; related to #6 here.
There are now a few things swirling around related to my confusion on the issue:
- Is the right haul identifier haulid, or is that eventname? Jim says the latter; need to check.
- Verify haulid, and/or make sure that Jim's corrections are working as intended.
- Even though I might be using COLLECTIONNUMBER as haulid instead of EVENTNAME, that still doesn't explain why I'm not getting some rows returned (because I'm still referring to the same columns as Jim; switching the column name won't affect the subsetting).
trimData(clean.ai)
gives:
Error in setcolorder(X, cols4order) : x is not a data.table
In addition: Warning message:
In is.na(c.match) : is.na() applied to non-(list or vector) of type 'NULL'
Most regions need updating.
Can get US updates from the OA repo, generally.
Need new data requests for all of the Canadian data sets.
I tracked this back to NA seasons in this year. In most other years, the region had season defined in some way other than my own getSeason(), so I wrongly assumed it was there for all years.
Sometimes the weight is 0, even though the count is positive.
Could fix with length/weight regressions, particularly for neus (which has length data).
Would need to update spp.key by adding parameter columns, and get the parameters from rfishbase. Then in clean.columns we could add a step that fills in these cases using the regression.
Instead of relying on fishbase, could also just find the average weight per individual, or fit the regression, from the trawl data itself.
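A minimal sketch of the rfishbase route, using the standard length-weight relation weight = a * length^b (the species and the unit handling here are illustrative, not part of the package):
library(rfishbase)
lw <- length_weight("Gadus morhua")  # FishBase length-weight estimates (W in g, L in cm)
a <- mean(lw$a, na.rm=TRUE); b <- mean(lw$b, na.rm=TRUE)
# fill zero weights where a length was recorded; assumes weight is stored in kg
clean.neus[spp=="Gadus morhua" & weight==0 & cnt>0 & !is.na(length), weight := a*length^b/1000]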
# programmatically submit the OceanAdapt download form
library(httr)
POST('http://oceanadapt.rutgers.edu/download/',
     encode = "form",
     body = list('page-action' = 'submit-info',
                 'my-name' = 'Luke+From+R',
                 'my-email' = '[email protected]',
                 'my-institution' = 'inst',
                 'my-purpose' = 'R testing'))
E.g., from clean.wcann:
ref haulid weight Individual.Average.Weight..kg.
1: Abraliopsis felis 200603010029 0.010 NA
2: Abraliopsis felis 200803010013 0.010 NA
3: Abraliopsis felis 201003008060 0.070 NA
4: Acanthephyra curtirostris 200303003074 0.002 NA
5: Acanthephyra curtirostris 200303006160 0.002 NA
---
217510: fish unident. 201403008024 0.010 0.01
217511: fish unident. 201403017132 0.420 0.42
217512: fish unident. 201403020009 0.160 0.16
217513: fish unident. 201403020189 1.250 NA
217514: shark unident. 200303006151 2.300 2.30
Survey year vessel Cruise.Leg Trawl.Performance date
1: Groundfish Slope and Shelf Combination Survey NA Ms. Julie 1 Fisheries Assessment Acceptable 6/2/2006
2: Groundfish Slope and Shelf Combination Survey NA Ms. Julie 1 Fisheries Assessment Acceptable 5/19/2008
3: Groundfish Slope and Shelf Combination Survey NA Excalibur 2 Fisheries Assessment Acceptable 9/7/2010
4: Groundfish Slope and Shelf Combination Survey NA Blue Horizon 3 Fisheries Assessment Acceptable 9/28/2003
5: Groundfish Slope and Shelf Combination Survey NA Captain Jack 5 Fisheries Assessment Acceptable 8/12/2003
---
217510: Groundfish Slope and Shelf Combination Survey NA Excalibur 1 Fisheries Assessment Acceptable 8/28/2014
217511: Groundfish Slope and Shelf Combination Survey NA Noahs Ark 4 Fisheries Assessment Acceptable 7/2/2014
217512: Groundfish Slope and Shelf Combination Survey NA Last Straw 1 Fisheries Assessment Acceptable 5/26/2014
217513: Groundfish Slope and Shelf Combination Survey NA Last Straw 5 Fisheries Assessment Acceptable 7/19/2014
217514: Groundfish Slope and Shelf Combination Survey NA Captain Jack 5 Fisheries Assessment Acceptable 8/10/2003
datetime lat lon Best.Position.Type depth Best.Depth.Type towduration towarea btemp stratum
1: <NA> 47.17305 -124.9230 Gear Track Midpoint 174.1 Bottom Depth 16.90 1.615875 7.46 47.5--150
2: <NA> 47.53331 -125.1849 Gear Track Midpoint 562.5 Bottom Depth 15.80 1.573020 4.71 47.5--150
3: <NA> 43.96620 -124.9836 Gear Track Midpoint 601.4 Bottom Depth 15.68 1.687510 4.75 43.5--150
4: <NA> 41.81069 -124.8967 Gear Track Midpoint 974.6 Bottom Depth 25.10 2.998476 3.57 41.5--150
5: <NA> 32.84721 -117.8039 Gear Track Midpoint 1072.3 Bottom Depth 30.60 2.828254 4.20 32.5--150
---
217510: <NA> 47.23652 -125.0741 Gear Start Haulback 754.1 Bottom Depth 17.68 1.564785 4.22 47.5--150
217511: <NA> 37.01164 -122.6967 Gear Track Midpoint 568.7 Bottom Depth 20.35 1.744080 6.28 37.5--150
217512: <NA> 46.85403 -125.1106 Gear Track Midpoint 689.0 Bottom Depth 24.57 2.202347 4.55 46.5--150
217513: <NA> 33.23764 -117.5581 Gear Track Midpoint 390.1 Bottom Depth 18.00 1.654792 8.30 33.5--150
217514: <NA> 33.41229 -118.1399 Vessel Track Midpoint 640.8 Bottom Depth 23.33 2.485380 6.01 33.5--150
Found for:
Most of the functions are intended to interact with data files. However, they currently expect the data files to be part of the package, and don't typically require that the data object be passed as an argument to the function. This will cause problems when the data files are not installed along with the package itself.
A new scheme might go like this:
d <- get_data_file('ai') # new function not yet implemented
trawlData_operation(d) # any old trawl data function, will now require that the data.table be passed as argument
Need a vignette and documentation that'll respond to ?trawlData
Add this metadata for ETOPO depth (object "depth") in the trawlData package:
From https://www.ngdc.noaa.gov/mgg/global/relief/ETOPO1/docs/ETOPO1.pdf
Table 1: Specifications for ETOPO1.
Versions: Ice Surface, Bedrock
Coverage Area: Global, -180° to 180°; -90° to 90°
Coordinate System: Geographic decimal degrees
Horizontal Datum: World Geodetic System of 1984 (WGS 84)
Vertical Datum: Sea Level
Vertical Units: Meters
Cell Size: 1 arc-minute
Grid Format: Multiple: netCDF, g98, binary float, tiff, xyz
These can probably just be taken from the recent Ocean Adapt update.
> ((X[,unique(depth.min)])[1101:1150])
[1] "207" "275" "240" "101" "222" "59" "118" "111" "127" "0413" "0608" "0631" "0643" "0558" "0551" "310" "190" "217" "407" "315" "271" "317"
[23] "195" "255" "9" "197" "199" "OO12" "0127" "0573" "0568" "0531" "0559" "0607" "0591" "0123" "213" "215" "236" "0398" "0572" "0000" "0565" "0557"
[45] "0465" "0644" "0626" "0634" "0665" "0571"
See any odd ones in there? Maybe a "0012"? Well, there are also "" and "0 16" in there too. Have to handle these better. Occurs for a few columns, and at least for newf.
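A hedged sketch of one way to handle these (treating letter O as zero and stripping internal whitespace are assumptions about what the raw entries mean; anything still non-numeric, e.g. "", becomes NA):
# normalize the messy depth strings, then coerce; failures become NA
X[, depth.min := suppressWarnings(as.numeric(gsub("[[:space:]]", "", gsub("O", "0", depth.min))))]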
Particularly the trawl data sets. URLs where possible. Emails as well.
Found one species, but I don't think the file name is the name for that species in gmex.
Checking on my phone. Need to double-check that we have the right lionfish species and picture file name.
https://github.com/mpinsky/OceanAdapt/tree/testFread/testFread
In contact w/ data.table developers about solution
Given a scientific name, could look up the common name and add it to the plot. It's often cumbersome to provide both.
@JWMorley did a nice job of putting together logic to trim data from SA (or 'seus') for OceanAdapt. Many of those same steps likely need to be taken here, too.
Some of those steps are more a matter of preference than others. Nonetheless, excellent place to start.
This is outdated; see http://www.marinespecies.org/aphia.php?p=taxdetails&id=530071
Not changing now b/c I just finished running the model, but will need to change in spp.key and the picture file name.
I think HadISST is 0.5 deg; I forget the years.
https://github.com/rBatt/trawlData/blob/development/R/clean.trimRow.R
In there I need to add a column that has a key indicating why I am suggesting the row be dropped.
Then I can write a helper function to execute the row trimming, with specific interpretation of the flags. That would be a good place to document what each flag means in each region.
This is a feature that I know @mpinsky needs too. At some point, I might ask for help in describing the reasoning behind dropping some of those rows.
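A rough sketch of what that helper might look like (the function name and flag values are hypothetical; row_flag and keep.row follow the column names listed later in this page):
trim_rows <- function(X, drop_flags=c("zero-effort", "bad-position")) {
    # flag values would be documented per region
    X[, keep.row := !(row_flag %in% drop_flags)]
    X[(keep.row)]
}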
Need to make the package more lightweight by reducing the size of the associated data sets.
One step here is to drop extra columns where possible.
There are 3 basic approaches I'm going to take: (1) drop columns that can be recreated, e.g., date and time arrive separately, and I can drop those 2 after I create datetime; (2) move information that repeats within a haul or cruise into separate tables (see the note at the end of this Issue); and (3) drop some raw columns entirely (e.g., CATCHJOIN or ALTERATIONDESC).
That 3rd category is tricky, because it represents a loss of information relative to what is provided in raw data. That's what I want feedback on in this Issue: which columns from the raw data do I need to keep?
Below I'll make a list of columns, organized under a few categories. I'm open to any feedback as to which columns would be needed; I can add more options if something is suggested that I don't have, but I'll use checking a box as a way of indicating that I intend to keep the column. The goal is to have the package contain only 1 data set per region, and raw data available by download (possibly via a package function). In other words, if a column isn't included here, it won't be easily accessible elsewhere.
Most of the following columns will have the same name in all regions, or there will be a similar equivalent in the regions that have it. If editing this list to add a column that only needs to be included for a particular region (and doesn't need to be included for other regions even if the column exists there), please specify which region.
Time and Location of Sample
reg
year
season
datetime
lon
lat
stratum (the region's definition, not my custom definition)
haulid
Species ID and Characteristics
spp
common
sex
taxLvl
trophicLevel
trophicLevel.se
Additional Method Metadata
station
cruise
vessel
towduration
towarea
gearsize
geartype
comments
survey (e.g., summer groundfish)
Environmental and Sample Data
effort
stratumarea
btemp
stemp
depth
bsalin
ssalin
bdo
sdo
wind
wave
pressure
Biological Measurements
cnt
weight
length
cntcpue
wtcpue
NUMLEN (neus only)
Other
keep.row
row_flag
Many of the columns don't have values that change in every row. In particular, many of the "meta data" columns don't vary within a haul, and the species taxonomy columns don't change at all (across species or regions). Just like we save all the species taxonomy (etc.) information in the spp.key data.table, we could save much of the haul- or cruise-specific information in separate data.tables. In fact, many of the raw data sets arrive in such a format, where environmental, survey, and biological data are separated. While this makes it less convenient to access the data, it means we can provide more information while staying under CRAN size limits. So there is definitely room to compromise.
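A tiny sketch of that normalization, using clean.neus and a handful of the columns above (the exact split is illustrative):
# one row per haul for the haul-level metadata
haul.key <- unique(clean.neus[, .(haulid, vessel, cruise, station, towduration, towarea)])
# the per-observation table keeps only haulid as a key
bio <- clean.neus[, .(haulid, spp, cnt, weight, wtcpue)]
full <- haul.key[bio, on="haulid"]  # re-join on demand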
Suddenly I noticed that a whole bunch of species are missing the trophic level information in spp.key and in the data sets. I have no idea when this happened.
Making these corrections ... I need to write more scripts to check these things. It could even have been my original code that integrated trophicLevel incorrectly.
# re-fill trophicLevel from taxInfo where spp.key has NA (the same pattern repeats below for trophicDiet, trophicOrig, and Picture)
match.tl <- match.tbl(spp.key[,spp], taxInfo[,spp], taxInfo[,trophicLevel], exact=TRUE)
match.tl[,sum(!is.na(val))] #1093
spp.key[,sum(!is.na(trophicLevel))] # 40!! :(
spp.key[is.na(trophicLevel), trophicLevel:=match.tl[spp.key[,is.na(trophicLevel)], val]]
spp.key[,sum(!is.na(trophicLevel))] # 1118!! :)
match.tl <- match.tbl(spp.key[,spp], taxInfo[,spp], taxInfo[,trophicDiet], exact=TRUE)
match.tl[,sum(!is.na(val))] #847
spp.key[,sum(!is.na(trophicDiet))] # 874
spp.key[is.na(trophicDiet), trophicDiet:=match.tl[spp.key[,is.na(trophicDiet)], val]]
spp.key[,sum(!is.na(trophicDiet))] # 878
match.tl <- match.tbl(spp.key[,spp], taxInfo[,spp], taxInfo[,trophicOrig], exact=TRUE)
match.tl[,sum(!is.na(val))] #847
spp.key[,sum(!is.na(trophicOrig))] # 874
spp.key[is.na(trophicOrig), trophicOrig:=match.tl[spp.key[,is.na(trophicOrig)], val]]
spp.key[,sum(!is.na(trophicOrig))] # 878
match.tl <- match.tbl(spp.key[,spp], taxInfo[,spp], taxInfo[,Picture], exact=TRUE)
match.tl[,sum(!is.na(val))] #1613
spp.key[,sum(!is.na(Picture))] # 1684
spp.key[is.na(Picture), Picture:=match.tl[spp.key[,is.na(Picture)], val]]
spp.key[,sum(!is.na(Picture))] # 1686
@bselden @JWMorley do you know anything about this? I might go back to check where it happened, just because I need to know if part of my code is broken. I'm hoping someone just accidentally deleted a couple values (but that the accident was limited to trophicLevel !).
It's hard to think of how to do this. The ultimate goal is to have a visual QA/QC, so we need to represent the data in a way that will make it easy to see weird numbers. It's hard to see that with colors, making the maps kinda pointless, I think.
Could plot 1 stratum at a time, or plot the regional mean/min/max/median ...
Maybe have a 3-panel plot. The top panel has time on the x-axis, the second lon, the third lat. The y-axis is any variable. This would not require any aggregating.
Variables need to be split by different ID columns, though. E.g., it wouldn't make sense to group wtcpue together for all species. But creating a separate plot for each species might be rough. Maybe set an option in the function to only bother plotting levels of the ID column that show up often enough. So you could do plot_raw(X, "wtcpue", by=c("spp"), min_n_obs=20), and that would interactively (or save all?) plot the 3 panels for each species that showed up at least 20 times.
I think all aggregating should be left out of the function. The user could do it via X <- trawlAgg(), or just standard aggregation.
A bell could be to color points that lie outside n standard deviations or something. A whistle could be to use outlier detection, like Noah used for NCEAS ice data.
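A rough sketch of what plot_raw might look like (the function is only proposed above, not implemented; base graphics and a data.table X with datetime, lon, and lat columns are assumed):
plot_raw <- function(X, var, by="spp", min_n_obs=20) {
    keep <- X[, .N, by=by][N >= min_n_obs]  # only levels observed often enough
    X2 <- X[keep, on=by]
    for (lev in unique(X2[[by]])) {
        d <- X2[get(by) == lev]
        par(mfrow=c(3, 1))  # one 3-panel figure per level; paging/saving left to the user
        plot(d[,datetime], d[[var]], xlab="time", ylab=var, main=lev)
        plot(d[,lon], d[[var]], xlab="lon", ylab=var)
        plot(d[,lat], d[[var]], xlab="lat", ylab=var)
    }
}
So plot_raw(clean.neus, "wtcpue", by="spp", min_n_obs=20) would cycle through the qualifying species.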
The rows are per individual due to having length information, but the wtcpue column is for the species.
This is a problem because the intuition only works when you assume that the original species names are all correct.
I checked, and this raises very few problems for NEUS if approached simply (i.e., take the mean of the wtcpue within each unique combination of spp-haulid). However, there are a couple cases in which there was a species name correction. So 2 taxonomic IDs originally had their own (different) wtcpue's in a given haul, and each of those taxa may have had some individuals lengthed, so the wtcpue value is repeated several times for each taxon. But after correcting taxonomy, the 2 taxa are actually the same species. So you can't simply take the average (what you would do if all rows were the same taxon with duplicated wtcpue, as was probably the intended interpretation) or the sum of wtcpue (what you would do if multiple rows for the same species-haul did not have duplicated wtcpue).
I hope this issue does not apply to sex too, but it could (i.e., when sex is listed, is the wtcpue sex-specific, or for the whole spp?).
One approach is to first aggregate while including wtcpue as a factor. This can be done with trawlAgg(), because usually at this stage of data processing both space_lvl and time_lvl are "haulid", so one of those (probably time) can be changed to "wtcpue". However, this might become challenging when there are NA's etc. for wtcpue ... idk how the grouping would work.
Another approach could be to make the bioFun argument something like function(x) sumna(una(x)), where x is "wtcpue" passed to the bioCols argument. This assumes equivalent wtcpue values come from duplicated rows that shouldn't be summed together to get the total wtcpue for a species in a haul. May or may not be true.
Yet another approach could be to aggregate not by "spp", but by the original taxonomic ID column first. In that first aggregation, do bioFun = meanna. Then do the subsequent aggregation by "spp" with bioFun = sumna. This assumes that duplicate rows for a species within a haul should not be summed. It also obscures the potentially problematic scenario of there actually being multiple wtcpue values .... maybe instead of meanna we could do something that lists the unique values, and hopefully throws an error when there's more than 1.
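A hedged sketch of that third approach in plain data.table rather than through trawlAgg() (ref is the original as-entered name, as elsewhere in this repo):
# stage 1: within each haul, average the duplicated wtcpue per original taxon (ref)
stage1 <- clean.neus[, .(spp=spp[1], wtcpue=mean(wtcpue, na.rm=TRUE)), by=c("haulid","ref")]
# stage 2: sum across original taxa that now map to the same corrected spp
stage2 <- stage1[, .(wtcpue=sum(wtcpue, na.rm=TRUE)), by=c("haulid","spp")]
# a stricter stage 1 could error out if a ref-haul really has >1 distinct wtcpue:
# clean.neus[, stopifnot(length(unique(na.omit(wtcpue))) <= 1), by=c("haulid","ref")]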
Should have this R code, at least for reference, if not as an actual exported function.
This is here: rBatt/trawl#104
Some of the files are here: still working on finding the SODA files.
Only in clean.wctri; raw.wctri has finite WEIGHT.
towarea is 0, effort is 0, wtcpue and cntcpue are Inf.
This ultimately results, I think, from 0 towdistances (towarea is calculated from towdistance, towarea is the effort, and effort is in the denominator for wtcpue).
If you look at raw.wctri[DISTANCE_FISHED<=0] in the raw data (DISTANCE_FISHED becomes towdistance after name cleaning), you can see the 67 rows that lead to the problem there. There are also 14 rows with 0 duration.
If I do
clean.wctri[!is.finite(wtcpue) & !is.na(wtcpue), lu(haulid)]
[1] 6
clean.wctri[haulid%in%clean.wctri[!is.finite(wtcpue) & !is.na(wtcpue), haulid], mean(towduration[towduration!=0]), by="haulid"]
haulid V1
1: 83-199201-257 NaN
2: 19-198901-180 0.25
3: 19-198601-298 0.08
4: 73-198901- 16 0.08
5: 37-199202-157 0.01
6: 5-198001-139 0.10
I see that there are 6 hauls that are problematic. But 5 of those 6 actually have the correct information in other rows; 1 of the 6 hauls apparently doesn't have any true information.
Ultimately this indicates that most of these can be fixed by just filling with the mean.
Just to further put my mind at ease:
clean.wctri[haulid%in%clean.wctri[!is.finite(wtcpue) & !is.na(wtcpue), haulid], lu(towduration[towduration!=0], na.rm=TRUE), by="haulid"]
haulid V1
1: 83-199201-257 0
2: 19-198901-180 1
3: 19-198601-298 1
4: 73-198901- 16 1
5: 37-199202-157 1
6: 5-198001-139 1
So there's only 1 unique value anyway ... which makes sense, as these are haul-specific values.
To fix, in clean.columns, something like the following should work:
# treat the sentinel zeros as missing
X[towduration==0, towduration:=NA]
X[towdistance==0, towdistance:=NA]
X[towarea==0, towarea:=NA]
# then fill the NA's with the haul-specific mean (fill.mean presumably fills NA's with the mean of the remaining values)
X[,c("towduration", "towdistance", "towarea"):=lapply(list(towduration, towdistance, towarea), fill.mean), by=c("haulid")]
This repo will become an R package, but it's still in development.
The file spp.key.csv has all of the known "raw" (as-entered) taxonomic identifiers (species names) from all regions. But it needs to be checked.
Most species have had something found. The "raw" column is named "ref", and the "corrected" column is named "spp".
Looking through, some of the "corrected" spp names are clearly wrong, as are some of the common names.
Feel free to make corrections, and commit/ push the changes. But please use Git. You may want to install git lfs before downloading this repo (otherwise, the large file storage might break, or you'll end up with bigger files than you want; I'm not sure what happens).
Note that each value in "ref" is unique, but the "spp" values are not. Make sure you do not create any inconsistencies as you edit the file. E.g., if you see that spp=="zoroaster" does not actually have a common name of "frogfish", don't change the common name to "seastar" on only 1 line ... make sure that the updated file has the same common name for all "zoroaster".
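A quick way to guard against such inconsistencies (a hedged sketch; it lists any spp that currently has more than one distinct common name):
spp.key[, .(n_common = length(unique(common))), by="spp"][n_common > 1]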
I can explain further when you decide to take a look. Just let me know.
Having the gridded data is nice, but it'd be much better to match it up to the survey data in a seamless way (i.e., make it part of the data.table).
There are 23,000 rows in the clean.newf file that have lat and lon that are NA. This matches how many rows had NA for lat.end. In contrast, there are only 16 rows where lat.start was NA.
It looks like lat was calculated from the mean of (lat.start, lat.end). In the cases in which lat.end was NA, can we assign lat to be the starting lat instead of NA?
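A minimal sketch of that fallback (the lon.start column name is assumed to parallel the lat columns):
clean.newf[is.na(lat) & !is.na(lat.start), lat := lat.start]
clean.newf[is.na(lon) & !is.na(lon.start), lon := lon.start]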
See this gist
https://gist.github.com/bselden/cc86c0e9e219b8cd4ffc
Based on this document: http://www.gulfofmaine-census.org/data-mapping/visualizations/national-marine-fisheries-service-data-overview/#stratum_table
Prefix 1 = offshore strata North of Cape Hatteras --> Shore="Offshore_N"
Prefix 3 = inshore strata North of Cape Hatteras --> Shore="Inshore_N"
Prefix 5 = Scotian Shelf --> Shore="Scotian"
Prefix 7 = inshore strata South of Cape Hatteras --> Shore="Inshore_S"
Prefix 8 = offshore strata South of Cape Hatteras --> Shore="Offshore_S"
Stratum area for inshore strata N (prefix 3) were compiled from Table 4 in http://www.nefsc.noaa.gov/publications/tm/pdfs/tmfnec52.pdf
I can add these new strata to the neus-neusStrata.csv in the trawlData/inst/extdata/neus folder of the repo, unless something else special needs to be done to make sure the area in square nautical miles gets converted during the data processing.
Gmex has very little data in 2015
NEUS has NA bottom temperature in 2015
Not sure what's going on. The NEUS issue is not related to cleaning; it's missing in the raw data.
@mpinsky Any ideas about not having bottom temperature for NEUS?
Need to mention that git lfs (git lfs pull, etc.) is needed for the initial repo clone/pull .... can't install the package with the LFS pointer files in place of the data files.
Was trying to play with the remake setup. Keep getting fread errors:
Restoring previous version of data/raw.gmex.RData
Error in fread(...) :
Expected sep (',') but new line or EOF ends field 41 on line 52870 when reading data: 174730,936,88,1401,88004,8,319,28,54.24,N,90,33.08,W,16.2,B192,5/7/14,409,28,55.56,N,90,33.24,W,,BGBCPNNNCASXOX,,,23.1,1016,11.2,102,0.6,4,S,LA,,7,14,,,"Additional gears used: OY
Something up with fread? The failing row ends in "Additional gears used: OY, which looks like a quoted comment field containing an embedded line break, something fread apparently can't parse here.
@JWMorley what's the deal here?
What are the ones I have to watch out for??
It looks like they get Anchoa hepsetus, Anchoa lyolepis, and Anchoa mitchilli every year.
I have Macrocoeloma camptocerum all years from 1989-1995 (except 1993), then never again (going up until 2012).
I see Anchoa cubana in 1989 only, Lobopilumnus agassizii in 1989 and 1993 only, and Engraulis eurystole in 1990 only.
My code:
clean.sa[grepl("Anchovy", common, ignore.case=TRUE), sort(una(spp)), by="year"][,table(year, V1)]
So I just searched for any species that had 'anchovy' in the common name to get that summary. But I know you know better.
List changes that need to be made, with the code to make them.
spp.key[spp=="Homaxinella amphispicula", common:="firm finger sponge"]
spp.key[spp=="Isodictya rigida", common:="soft finger sponge"]
spp.key[spp=="Leptasterias coei", common:="aleutian six-rayed sea star"]
spp.key[spp=="Neoesperiopsis infundibula", common:="rough China hat sponge"]
spp.key[spp=="Neptunea amianta", common:="white neptune"]
spp.key[spp=="Reinhardtius stomias", spp:="Atheresthes stomias"]
spp.key[ref=="Atheresthes stomias", spp:="Atheresthes stomias"]
spp.key[ref=="CROSS PAPPOSUS", spp:="Crossaster papposus"]
check_and_set(wrong="Cross papposus", corrected="Crossaster papposus")