BIOS 823 describes the challenges analysts face with the increasing importance of large data sets, and the strategies that have been developed in response to these challenges. The core topics are how to manage data and how to make computation scalable. The data management module covers guidelines for working with open data, and the concepts and practical skills for working with in-memory, relational and NoSQL databases. The scalable computing module focuses on asynchronous, concurrent, parallel and distributed computing, as well as the construction of effective workflows following DevOps practices. Applications to the analysis of structured, semi-structured and unstructured data, especially from biomedical contexts, will be interleaved into the course. The course examples are primarily in Python and fluency in Python is assumed.
Prerequisites:
- Fluency in Python (BIOS821, STA 663 or equivalent)
Course repository is at https://github.com/cliburn/bios-823-2019
- Administration
- Syllabus
- Python
- Data science and healthcare
- Data pipelines
- Why functional programming?
- Use of lambdas and higher-order functions
- Using `toolz` to build lazy pipelines
- Using `numpy`
- Using `scipy`
- Using `pandas`
- Using `scikit-learn`
- Using `statsmodels`
- Using `matplotlib`
- Using `seaborn`
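The functional style covered here (lambdas, higher-order functions, lazy evaluation) can be sketched with nothing but the standard library; `toolz.pipe` packages the same pattern, though the `pipe` helper below is hand-rolled for illustration, not toolz's implementation:

```python
from functools import reduce

# A lazy pipeline built from generators and higher-order functions.
def pipe(data, *fns):
    """Thread data through a sequence of functions, left to right."""
    return reduce(lambda acc, fn: fn(acc), fns, data)

result = pipe(
    range(10),
    lambda xs: (x * x for x in xs),             # lazy transform
    lambda xs: (x for x in xs if x % 2 == 0),   # lazy filter
    sum,                                        # terminal step forces evaluation
)
print(result)  # 0 + 4 + 16 + 36 + 64 = 120
```

Nothing upstream is evaluated until the terminal `sum` pulls values through the generators, which is what makes such pipelines memory-friendly on large data.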
- Delimited text files
- JSON
- XML
- HDF5
- Avro
- Parquet
- APIs for data sharing
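Of the formats above, delimited text and JSON are handled directly by the standard library (Avro and Parquet need third-party libraries such as `fastavro` and `pyarrow`); a minimal round trip for the first two:

```python
import csv
import io
import json

records = [{"id": 1, "gene": "TP53"}, {"id": 2, "gene": "BRCA1"}]

# JSON round trip
text = json.dumps(records)
assert json.loads(text) == records

# Delimited text round trip via an in-memory buffer
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "gene"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
rows = list(csv.DictReader(buf))
print(rows[0]["gene"])  # TP53 -- note csv reads everything back as strings
```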
- Using `odo`
- Tuples and set operations
- The database schema
- Tables and views
- Tables, rows, columns, cells
- Primary keys, foreign keys and referential integrity
- Normalization for data entry
- Indexing and optimization
- Database migrations
- De-normalization for data query
- Star schema for data warehouses
- Why learn SQL?
- The stages of data normalization
- The CREATE statement
- The INSERT statement
- The UPDATE statement
- Adding indexes
- ACID
- Transactions and rollback
- ETL to populate databases
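The DDL, DML and transaction ideas above can be tried without a server using the standard library's `sqlite3` (the `patient` table is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# CREATE and INSERT
cur.execute("CREATE TABLE patient (pid INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute("INSERT INTO patient (name) VALUES (?)", ("Alice",))
con.commit()

# A failed transaction rolls back atomically (the A in ACID)
try:
    with con:  # the context manager commits on success, rolls back on error
        con.execute("INSERT INTO patient (name) VALUES (?)", ("Bob",))
        con.execute("INSERT INTO patient (name) VALUES (?)", (None,))  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

count = cur.execute("SELECT COUNT(*) FROM patient").fetchone()[0]
print(count)  # 1 -- Bob's insert was rolled back along with the failing one
```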
- Server vs client side queries
- The SELECT statement
- Projection
- Filtering on rows
- Sorting
- Transforms
- Grouping
- Filtering on groups
- Summarization
- Sub-queries
- Using explain
- Set operations
- Joins and semi-joins
- Window functions
- User-defined functions (1:1, N:1, 1:N)
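Most of the SELECT clauses listed above fit in a single query; a runnable sketch against an in-memory SQLite table (schema invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE measurement (subject TEXT, visit INTEGER, value REAL);
INSERT INTO measurement VALUES
  ('a', 1, 10), ('a', 2, 12), ('b', 1, 7), ('b', 2, 9), ('c', 1, 20);
""")

rows = con.execute("""
    SELECT subject, AVG(value) AS mean_value  -- projection + summarization
    FROM measurement
    WHERE visit <= 2                          -- filtering on rows
    GROUP BY subject                          -- grouping
    HAVING COUNT(*) > 1                       -- filtering on groups
    ORDER BY mean_value DESC                  -- sorting
""").fetchall()
print(rows)  # [('a', 11.0), ('b', 8.0)] -- subject 'c' has only one visit
```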
- Concepts of NoSQL: From ACID to BASE
- What is a key-value database?
- Using `redis`
- What is a document database?
- Using `mongodb`
- What is a column family database?
- Using `hbase`
- What is a graph database?
- Using `neo4j`
- Trade-offs (when to use SQL, key-value, document, graph and column family)
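A toy illustration of the document model: schemaless records queried by field. The `find` helper below is invented, but mimics the shape of MongoDB's `find({"tags": "oncology"})`, where matching against an array field means containment (a real deployment needs a running server):

```python
docs = [
    {"_id": 1, "title": "Trial A", "tags": ["oncology", "phase2"]},
    {"_id": 2, "title": "Trial B", "tags": ["cardiology"]},
    {"_id": 3, "title": "Trial C", "tags": ["oncology"], "sites": 12},
]

def find(collection, field, value):
    """Yield documents whose field equals, or contains, the value."""
    for doc in collection:
        v = doc.get(field)
        if v == value or (isinstance(v, list) and value in v):
            yield doc

hits = [d["_id"] for d in find(docs, "tags", "oncology")]
print(hits)  # [1, 3]
```

Note that document 3 carries a field the others lack; per-document schema flexibility is the core trade-off against relational integrity constraints.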
Midterm I (10%)
- Concurrent, parallel and distributed
- Why asynchronous programming?
- Latency and resource starvation
- I/O and computation bottlenecks
- Generators and Coroutines
- Coroutines and tasks
- Sending messages to coroutines, threads and processes
- The event loop
- async and await
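A minimal sketch of the event loop with `asyncio`: while one coroutine awaits (the sleeps stand in for network I/O), the loop runs the others, all on a single thread:

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)  # stand-in for a slow I/O call
    return name

async def main():
    # gather schedules all three coroutines concurrently on the event loop
    return await asyncio.gather(fetch("a", 0.03), fetch("b", 0.01), fetch("c", 0.02))

print(asyncio.run(main()))  # ['a', 'b', 'c'] -- gather preserves call order
```

Total wall time is roughly the longest delay, not the sum, which is the payoff of asynchronous I/O.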
- Amdahl's and Gustafson's laws
- Threads and processes
- Embarrassingly parallel problems
- Shared memory issues
- Deadlocks and race conditions
- Low-level parallel programming with `multiprocessing`
- Using `concurrent.futures` and `multiprocessing` pools
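A sketch of pool-based parallelism for an embarrassingly parallel task. A thread pool is shown so the snippet runs anywhere; `ProcessPoolExecutor` is a drop-in swap with the same interface and sidesteps the GIL for CPU-bound work:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """An embarrassingly parallel unit of work: no shared state, no ordering."""
    return seed * seed

# Executor.map distributes independent tasks across the pool
# and returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```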
- Why distributed computing?
- Google Map-Reduce
- Hadoop
- HDFS: Distributed file system
- YARN: Resource manager
- MapReduce: Compute engine
- MapReduce programming
- Writing a MapReduce program in Python using Streaming
- Tools for putting data in HDFS (Flume, Sqoop)
- Tools for SQL access to HDFS (Hive, Impala)
- Tools for workflow and pipeline construction (Crunch, Oozie, Airflow)
- Tools for coordination of distributed programs (Zookeeper)
- NoSQL database (HBase)
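The MapReduce dataflow can be rehearsed locally before touching a cluster. This pure-Python sketch mimics the three stages that Hadoop Streaming wires to a mapper and reducer over stdin/stdout, with the shuffle simulated as a sort on the key:

```python
import itertools

def mapper(lines):
    """Emit (word, 1) pairs -- the map phase."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum counts per key -- the reduce phase (input must be key-sorted)."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

text = ["hello world", "hello MapReduce"]
shuffled = sorted(mapper(text))  # the shuffle/sort between map and reduce
counts = dict(reducer(shuffled))
print(counts)  # {'hello': 2, 'mapreduce': 1, 'world': 1}
```

On a real cluster the same mapper and reducer run as separate processes on each node, reading from and writing to HDFS.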
- Dask concepts
- Working with `dask` DataFrames
- Dask efficiency
- Working with `dask` arrays
- Working with `dask` bags
- ML with `dask`
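The core dask idea, building a lazy task graph and computing it only on demand, can be mimicked in a few lines. This is a toy stand-in for `dask.delayed`, not dask's API:

```python
class Delayed:
    """Toy stand-in for dask.delayed: record the work now, run it on .compute()."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

    def compute(self):
        # Recursively evaluate any Delayed arguments, i.e. walk the task graph
        evaluated = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        return self.fn(*evaluated)

# Build a small graph: (1 + 2) * (3 + 4); nothing runs until .compute()
add = lambda a, b: a + b
mul = lambda a, b: a * b
graph = Delayed(mul, Delayed(add, 1, 2), Delayed(add, 3, 4))
print(graph.compute())  # 21
```

Real dask additionally chooses a scheduler (threads, processes, or a distributed cluster) and exploits the graph structure to run independent tasks in parallel.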
- What is DevOps?
- Practices and tools
- Source code control
- Using Docker containers
- Walk-through using AWS
- Spark concepts
- The Spark context
- The data flow DAG
- Resilient Distributed Datasets (RDD)
- Key-value RDDs
- Creating and saving RDDs
- Actions and Transforms
- Caching RDDs
- Accumulators and Broadcast variables
- Using UDFs (User Defined Functions)
- Example: Hello, word count!
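The word-count example maps directly onto RDD transforms and actions. For readers without a Spark installation, here is the same dataflow in plain Python; the comment shows the rough pyspark equivalent:

```python
from collections import defaultdict

# The canonical Spark word count is roughly:
#   rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add).collect()
lines = ["hello spark", "hello world"]
words = (w for line in lines for w in line.split())  # flatMap (a transform: lazy)
pairs = ((w, 1) for w in words)                      # map (a transform: lazy)
counts = defaultdict(int)
for word, n in pairs:                                # reduceByKey, then collect
    counts[word] += n
print(dict(counts))  # {'hello': 2, 'spark': 1, 'world': 1}
```

As in Spark, the generator stages above do no work until the final loop (the "action") pulls data through them.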
- The Spark session
- Creating and saving a DataFrame
- DataFrame operations
- DataFrame and RDD conversions
- Using SQL to query a DataFrame
- Caching a DataFrame
- Using vectorized UDFs
- Column family databases
- Columnar data stores: Arrow and Parquet
- Basic statistics with Spark
- Pipelines
- Data processing
- Clustering
- Classification and regression
- Collaborative filtering
- Model selection
- Streaming concepts
- StreamingContext
- Discretized Streams
- Sources of data
- Transforms
- Checkpoints
- DataFrame operations
- Machine learning operations
- Processing event logs
Midterm II (10%)
- Structured data using `dask`
- Statistical visualization with `seaborn`, `plotly`, `bokeh`
- From long/lat to x/y coordinates
- Interactive mapping with `datashader`
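The long/lat to x/y step is typically the spherical (Web) Mercator projection used by tiled web maps; a minimal implementation (the example coordinates are approximate):

```python
import math

# WGS84 equatorial radius in meters, as used by Web Mercator
R = 6378137.0

def lonlat_to_xy(lon, lat):
    """Project longitude/latitude (degrees) to Web Mercator x/y (meters)."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

x, y = lonlat_to_xy(-78.94, 36.00)  # roughly Durham, NC
print(round(x), round(y))
```

The projection diverges toward the poles (the `tan` blows up near ±90°), which is why Web Mercator maps clip latitude at about ±85°.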
- Concepts of text analysis
- From text to matrix
- Natural language processing with `nltk`, `spacy`
- Topic modeling with `spacy` and `gensim`
- Sentiment classification
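The "from text to matrix" step is a term-document count matrix; a hand-rolled sketch of what tools like scikit-learn's `CountVectorizer` automate (with real tokenization and vocabulary handling):

```python
from collections import Counter

docs = ["the gene was expressed", "the gene was not expressed"]
tokenized = [doc.split() for doc in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

# One row per document, one column per vocabulary term
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
print(vocab)   # ['expressed', 'gene', 'not', 'the', 'was']
print(matrix)  # [[1, 1, 0, 1, 1], [1, 1, 1, 1, 1]]
```

Once text is in this matrix form, the clustering and classification machinery covered earlier applies unchanged.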
- Concepts of image processing
- Using `scikit-image`
- Using a CNN to classify images
- Concepts of time series analysis
- Using `statsmodels`
- Using `prophet`
- Concepts of graph and network analysis
- Using `networkx`
- Using `neo4j`
- Using Spark GraphFrames
- Concepts of genomic processing
- Unix pipelines
- Distributed processing with ADAM
- Example: counting k-mers
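ADAM distributes this kind of job across a cluster; the serial core of k-mer counting is just a sliding window:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count overlapping k-mers in a sequence with a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(count_kmers("ATATAG", 2))  # Counter({'AT': 2, 'TA': 2, 'AG': 1})
```

Because counting per read is independent and the counts merge by addition, the problem decomposes naturally into the map and reduce phases covered earlier.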
Final Exam (30%)