
awells-uva / capstone_2022


Investigating unmapped reads within next-generation sequencing (NGS) data can provide additional information about the source of trace microbial reads.

License: MIT License

Python 0.12% Jupyter Notebook 99.88%
bioinformatics biopython-library hisat2 pysam

capstone_2022's Introduction

Trace Microbial Sequences & NGS Instruments

Research question: is there a relationship between the metadata and the contamination present in the resulting sequencing data?

Python Environment

requirements.txt

Use at your discretion. Provided as a reference for reproducing the working environment.

Data

Assumption 1: The Public Study Accession ID begins with "PRJ".

data
├── bam_files
│   └── PRJ######
│       └── <Run Accession>.bam
├── csv_files
│   └── PRJ######
│       └── PRJ######_<Run Accession>.csv
├── sam_files
│   └── PRJ######
│       └── <Run Accession>.sam
└── PRJ######
    ├── <Run Accession>_1.fastq
    └── <Run Accession>_2.fastq
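
The layout above can be expressed as a small path helper. This is only a sketch: the function name `run_paths`, the `data` root, and the accession values are hypothetical placeholders, and the single-read `.fastq` naming follows the note in the run_HISAT2.ipynb section.

```python
from pathlib import Path


def run_paths(data_root, study, run, paired=True):
    """Return the expected file locations for one run under the data layout."""
    # Assumption 1 from this README: study accessions begin with "PRJ".
    if not study.startswith("PRJ"):
        raise ValueError(f"unexpected study accession: {study!r}")
    root = Path(data_root)
    if paired:
        fastq = [root / study / f"{run}_1.fastq", root / study / f"{run}_2.fastq"]
    else:
        fastq = [root / study / f"{run}.fastq"]
    return {
        "fastq": fastq,
        "sam": root / "sam_files" / study / f"{run}.sam",
        "bam": root / "bam_files" / study / f"{run}.bam",
        "csv": root / "csv_files" / study / f"{study}_{run}.csv",
    }


# Hypothetical accessions, used only to show the resulting paths.
paths = run_paths("data", "PRJNA000000", "SRR0000000")
for kind, location in paths.items():
    print(kind, location)
```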

Scripts

run_download_data.ipynb

Description

Notebook to download FASTQ files from the European Nucleotide Archive (ENA)

Dependencies

Inputs

Public Study Accession ID (typically of the form PRJ#####)

Outputs

FASTQ files Unpacked into $PWD/< Public Study Accession ID >/*.fastq

Known Issues

On dual-read runs, the second read file is sometimes downloaded twice (this does not affect storage).
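
As an illustration of how FASTQ locations for a study can be looked up, the sketch below builds a request to the ENA Portal API file report and parses a TSV response. Whether the notebook uses this endpoint is an assumption; the function names and the sample accession are hypothetical. Dropping repeated URLs is one possible guard against a double download, not a confirmed fix for the known issue.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(study_accession):
    """Build a file-report URL listing the FASTQ files of every run in a study."""
    query = urlencode({
        "accession": study_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fastq_urls(report_tsv):
    """Map each run accession to its FASTQ URLs, dropping repeated entries."""
    runs = {}
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        urls = []
        # The fastq_ftp field holds one or more locations separated by ";".
        for part in row["fastq_ftp"].split(";"):
            if part and part not in urls:
                urls.append(part)
        runs[row["run_accession"]] = urls
    return runs


# Hypothetical study accession; fetching and unpacking are left to the caller.
print(filereport_url("PRJNA000000"))
```

The returned text can be fetched with any HTTP client and passed to `fastq_urls`. Each URL then points at a compressed FASTQ file to unpack into $PWD/< Public Study Accession ID >/.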

run_HISAT2.ipynb

Description

Notebook to process FASTQ files through HISAT2 to produce .sam files

Dependencies

HISAT2

SAMTOOLS

Genome Reference Consortium Human Build 38 (GRCh38)

Inputs

Directory Path

FASTQ files in $PWD/< Public Study Accession ID >/< Run Accession >.fastq data structure

Note: Single-read runs are typically stored as .fastq, while dual-read runs are stored as _1.fastq and _2.fastq.

Outputs

SAM files in .sam format. Files are stored in the $PWD/sam_files/< Public Study Accession ID >/.sam data structure.
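
A minimal sketch of the alignment call, assuming the notebook shells out to the `hisat2` command line. The helper name `hisat2_command` and the index prefix `grch38/genome` are hypothetical. The flags used (`-x`, `-1`, `-2`, `-U`, `-S`) are standard HISAT2 options.

```python
def hisat2_command(index_prefix, fastq_files, sam_path):
    """Build a HISAT2 argument list for one single-read or dual-read run."""
    fastq_files = [str(f) for f in fastq_files]
    cmd = ["hisat2", "-x", str(index_prefix)]
    if len(fastq_files) == 2:
        # Dual reads: <Run Accession>_1.fastq and <Run Accession>_2.fastq
        cmd += ["-1", fastq_files[0], "-2", fastq_files[1]]
    elif len(fastq_files) == 1:
        # Single reads: <Run Accession>.fastq
        cmd += ["-U", fastq_files[0]]
    else:
        raise ValueError("expected one or two FASTQ files")
    return cmd + ["-S", str(sam_path)]


# Hypothetical index prefix and accessions, for illustration only.
cmd = hisat2_command(
    "grch38/genome",
    ["PRJNA000000/SRR0000000_1.fastq", "PRJNA000000/SRR0000000_2.fastq"],
    "sam_files/PRJNA000000/SRR0000000.sam",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once HISAT2 and a GRCh38 index are installed.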

Support

The Genome Reference Consortium

run_SAM_to_BAM.ipynb

Description

Notebook to convert Sequence Alignment/Map (SAM) files to Binary Alignment/Map (BAM) files. The BAM file is also sorted during conversion.

Dependencies

SAMTOOLS

Inputs

$PWD/sam_files/< Public Study Accession ID >/.sam data structure

Outputs

BAM files in .bam format. Files are stored in the $PWD/bam_files/< Public Study Accession ID >/.bam data structure.
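
One way to convert and sort in a single step is `samtools sort`, which reads SAM and can write BAM directly. The sketch below assumes that approach, which may differ from what the notebook does. The helper names and the example path are hypothetical.

```python
import subprocess
from pathlib import Path


def bam_path_for(sam_path):
    """Mirror sam_files/<study>/<run>.sam to bam_files/<study>/<run>.bam."""
    sam_path = Path(sam_path)
    study_dir = sam_path.parent          # .../sam_files/<study>
    root = study_dir.parent.parent       # directory that holds sam_files/
    return root / "bam_files" / study_dir.name / (sam_path.stem + ".bam")


def sort_command(sam_path, bam_path):
    """Argument list that converts a SAM file to a coordinate-sorted BAM file."""
    return ["samtools", "sort", "-O", "bam", "-o", str(bam_path), str(sam_path)]


def convert(sam_path):
    """Run the conversion; requires samtools to be available on PATH."""
    bam_path = bam_path_for(sam_path)
    bam_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(sort_command(sam_path, bam_path), check=True)
    return bam_path


# Hypothetical path, printed only to show the derived output location.
print(bam_path_for("data/sam_files/PRJNA000000/SRR0000000.sam"))
```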

run_etl_BAM_to_pd.ipynb

Description

Processing Pipeline

Dependencies

capstoneUtils.py : Script containing additional processing functions

Inputs

Directory in structure $PWD/bam_files/< Public Study Accession ID >/.bam data structure

Directory in structure $PWD/csv_files/< Public Study Accession ID >_.csv data structure

Outputs

Run Directory structure

runs
└── K_< K >
    ├── images
    │   └── *.png
    ├── models
    │   └── lda_model_Ntopic< Number of Topics >_K< K >*
    ├── cluster_sample_K< K >.csv
    ├── crosstab_K< K >.csv
    ├── filtered_crosstab_K< K >.csv
    ├── K< K >_library.txt
    ├── kmer_df_K< K >.csv
    ├── objs_K< K >.pkl
    ├── sizes.csv
    ├── test.csv
    └── train.csv

File Descriptions
  • cluster_sample_K< K >.csv
    • 100,000 rows of K-mer-length sequences, each with its estimated topic
  • crosstab_K< K >.csv
    • Cross-tabulation with K-mer-length sequences as rows and < Public Study Accession ID > as columns, before chi-squared filtering
  • filtered_crosstab_K< K >.csv
    • Cross-tabulation with K-mer-length sequences as rows and < Public Study Accession ID > as columns, after chi-squared filtering
  • K< K >_library.txt
    • List of K-mer-length sequences derived from the training set after chi-squared filtering
  • kmer_df_K< K >.csv
    • K-mer-length sequences derived from the training set. Used to generate the model
  • objs_K< K >.pkl
    • Pickle file containing K and kmer_dictionary. The kmer_dictionary format is: {< Public Study Accession ID and Run Accession > : K-mer-length sequences as a list}
  • sizes.csv
    • Number of sequences in the raw data and the size of the pandas DataFrame. Useful for estimating RAM and hard disk space requirements
  • test.csv
    • Test set of sequences after the initial import of the raw data
  • train.csv
    • Training set of sequences after the initial import of the raw data

Optional Files

run_chi_squared.ipynb

run_classification.ipynb

run_optimize.ipynb
