These notebooks comprise a commented outline for teaching, understanding, and practicing PySpark DataFrames. We use an open administrative dataset of live births in the state of Bahia, Brazil, in 2017. In addition, a synthetic dataset is used to practice implementing a record linkage method. Finally, the main notebook proposes some challenges to improve students' knowledge of both the processing tool and the concepts of data integration.
Repository: sandy4321/handson_spark (forked from pierrepita/handson_spark)