Comments (4)
> If so, would it be possible to do a one-time delete of everything from 2019-onward and a fresh upload that maps the lot number to the serum id column? Then subsequent updates wouldn't change the index, right?

Yup, I can track down the date that we started using the new database dump and delete records in `cdc_tdb/flu` starting from that date. Then we can re-upload all records with the changed serum id.

> Surfacing lot_number in addition to serum_id sounds much easier technically, although this means adding a column that only the CDC will use and that serum_id remains meaningless for them.

Oh right! The `lot_number` column only exists in the CDC database, but does not exist for the other CCs. That would mean we would have to special case the columns to use in the tdb/download script based on the database. Hmm, I think your original proposal is better 🤔
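For illustration, the special-casing mentioned above might look roughly like the sketch below. The column lists, the `download_columns` helper, and the `other_tdb` name are hypothetical and are not taken from the actual tdb/download script; only `cdc_tdb`, `serum_id`, and `lot_number` come from this thread.

```python
# Hypothetical base columns shared by every titer database.
BASE_COLUMNS = ["virus_strain", "serum_strain", "serum_id", "titer"]

# Hypothetical per-database extras: only the CDC database has lot_number.
EXTRA_COLUMNS_BY_DATABASE = {"cdc_tdb": ["lot_number"]}


def download_columns(database):
    """Return the columns to export for the given titer database."""
    return BASE_COLUMNS + EXTRA_COLUMNS_BY_DATABASE.get(database, [])


cdc_columns = download_columns("cdc_tdb")
other_columns = download_columns("other_tdb")  # placeholder for another CC
```

Every new database-specific column would need another entry in a table like this, which is the maintenance cost that makes the remapping proposal look simpler.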
I did a little more digging into the fauna/tdb code and the parse step automatically assigns the `ferret_id` to the `serum_id`, so we may also want to remove the `ferret_id` in the column map.
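A rough sketch of what the column map change could look like, assuming the map is a plain dictionary from CDC TSV headers to fauna field names. The header names (`Ferret ID`, `Lot #`), both maps, and the `remap_record` helper are made up for illustration; the real map in fauna/tdb may be structured differently.

```python
# Hypothetical current map: ferret_id is carried along and ends up as serum_id.
CURRENT_MAP = {"Ferret ID": "serum_id", "Lot #": "lot_number"}

# Hypothetical proposed map: the lot number becomes serum_id and the
# ferret_id entry is dropped so the parse step cannot overwrite it.
PROPOSED_MAP = {"Lot #": "serum_id"}


def remap_record(row, column_map):
    """Rename the mapped columns of one TSV row and drop unmapped ones."""
    return {field: row[source] for source, field in column_map.items() if source in row}


row = {"Ferret ID": "F1234", "Lot #": "L5678"}
before = remap_record(row, CURRENT_MAP)
after = remap_record(row, PROPOSED_MAP)
```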
from fauna.
One small issue with changing the underlying data for `serum_id` is that it is one of the index fields for generating the record index. We need to remember to delete the records already in fauna and re-upload them with the new `serum_id` so that we don't have duplicate titers.
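To make the duplicate-titer concern concrete, here is a minimal sketch. It assumes the record index is built by joining a handful of index fields; the field list and the `record_index` helper are hypothetical, and only `serum_id` is known from this thread to be one of the index fields.

```python
# Hypothetical index fields; only serum_id is confirmed in this thread.
INDEX_FIELDS = ["virus_strain", "serum_strain", "serum_id", "assay_date"]


def record_index(record):
    """Build a record key by joining the index fields."""
    return "|".join(str(record[field]) for field in INDEX_FIELDS)


old = {
    "virus_strain": "A/Example/1/2019",
    "serum_strain": "A/Example/2/2019",
    "serum_id": "F1234",  # ferret id used as serum_id
    "assay_date": "2019-09-03",
    "titer": 160,
}
new = dict(old, serum_id="L5678")  # same measurement, lot number as serum_id

database = {record_index(old): old}
database[record_index(new)] = new  # upload keyed on the new index

duplicate_count = len(database)
```

Because the key changes with `serum_id`, the re-upload adds a second record for the same measurement instead of updating the first, which is why the old records have to be deleted beforehand.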
Another question is whether we need to change the `serum_id` for older titer measurements. The mapping of columns for CDC data only applies to their new data dump, which only goes back to September 2019. I would have to do some digging to figure out how the `serum_id` of the older titers was set.
Stepping back a bit, I think the current values of `serum_id` and `lot_number` are accurate for our data model. If the CDC wants to use the lot number for discussions, then maybe we can just include this column in our downloads from fauna instead of overhauling the data model.
Good points! I hadn't thought about repopulating the database. I was thinking more about updating the mapping now so future imports (new records) use the lot number instead. Would that kind of change disrupt uploads of new data, though, because the uploads work from the full CDC database TSV and all older records would get an updated index? If so, would it be possible to do a one-time delete of everything from 2019-onward and a fresh upload that maps the lot number to the serum id column? Then subsequent updates wouldn't change the index, right?
Surfacing `lot_number` in addition to `serum_id` sounds much easier technically, although this means adding a column that only the CDC will use and that `serum_id` remains meaningless for them. This becomes a new field to display in the measurements tooltips, a filter and group-by column to add to the measurements config, etc. It's not a big deal, but it reduces the value of the original serum id.
The issue of recreating fauna from older data could be important to figure out eventually, if we really hope to deprecate fauna in favor of a cloud-based file store solution...but this probably isn't the place to discuss that undertaking. 😅
> Yup, I can track down the date that we started using the new database dump and delete records in cdc_tdb/flu starting from that date.

Never mind, just realized the start date of the new database dump doesn't matter because the data contains tests from 2019. Like you said, we can delete records based on `assay_date`. The earliest `assay_date` included in the database dump is 2019-09-03.
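As a sketch of the selection step only (not the actual delete against the database), the records to remove and re-upload could be picked out by `assay_date` like this. It assumes `assay_date` is stored as an ISO `YYYY-MM-DD` string; the `needs_reupload` helper and the sample records are hypothetical.

```python
from datetime import date

# Earliest assay_date in the new CDC database dump, per this thread.
CUTOFF = date(2019, 9, 3)


def needs_reupload(record, cutoff=CUTOFF):
    """Return True if the record's assay_date is on or after the cutoff."""
    # Assumption: assay_date is an ISO-formatted string such as "2019-09-03".
    return date.fromisoformat(record["assay_date"]) >= cutoff


records = [
    {"serum_id": "F0001", "assay_date": "2019-09-02"},
    {"serum_id": "F0002", "assay_date": "2019-09-03"},
    {"serum_id": "F0003", "assay_date": "2021-01-15"},
]
to_delete = [r["serum_id"] for r in records if needs_reupload(r)]
```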