A project that incorporates SQL, NoSQL, Apache Airflow, IBM Cognos Analytics and PySpark into a data pipeline.
The capstone project aimed to develop an end-to-end data pipeline. In the real world, data may come from multiple sources, including SQL databases, NoSQL databases, and data warehouses. The pipeline needs to query these sources and retrieve data for further analysis.
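As a minimal sketch of the extraction step, the snippet below queries a relational database and returns rows for downstream analysis. It uses SQLite for self-containment; the database file, `sales` table, and column names are hypothetical stand-ins for the project's actual sources.

```python
import sqlite3

def extract_sales(db_path: str) -> list:
    """Query a relational database and return rows for further analysis."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT product_id, quantity, price FROM sales "
            "WHERE quantity > 0 ORDER BY product_id"
        )
        return cur.fetchall()
    finally:
        conn.close()

# Build a small demo database so the query has something to run against.
conn = sqlite3.connect("demo.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales (product_id INTEGER, quantity INTEGER, price REAL)"
)
conn.execute("DELETE FROM sales")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 5, 9.99), (2, 0, 4.50), (3, 2, 19.95)],
)
conn.commit()
conn.close()

rows = extract_sales("demo.db")
print(rows)  # → [(1, 5, 9.99), (3, 2, 19.95)]
```

In a real pipeline the same pattern applies with a driver for the actual database (e.g. a MySQL or MongoDB client) in place of `sqlite3`.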
An analytics tool such as IBM Cognos Analytics was used to visualize the data, and Apache Airflow was used to automate the data-processing workflow. Finally, a PySpark model was used to make sales predictions on the data.
As structured in the IBM Professional Certificate, each stage of the data pipeline is placed in a separate file. The repository contains the corresponding SQL scripts, screenshots, Python files, and notebooks.