Data is processed, transformed, and loaded into the Neo4j graph database. Using the cleaned and modelled data, authors are disambiguated, reviewers are recommended for incoming publications, and the most influential authors are identified.
The sample data files, in CSV format, are:
- publications.csv,
- authors.csv,
- topics.csv,
- publications_incoming.csv.
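These files can be explored with pandas before loading. A minimal sketch of the kind of duplicate-author check the cleaning step calls for, assuming `author_id` and `name` columns (the real column names may differ):

```python
import pandas as pd

def normalize_name(name: str) -> str:
    # Lowercase, drop dots, and collapse whitespace so that
    # "J. Smith" and "j smith" compare equal.
    return " ".join(name.lower().replace(".", " ").split())

def possible_duplicate_authors(authors: pd.DataFrame, name_col: str = "name") -> pd.DataFrame:
    # Return every row whose normalized name occurs more than once.
    normalized = authors[name_col].map(normalize_name)
    dupes = normalized.duplicated(keep=False)
    return authors[dupes].assign(normalized=normalized[dupes])

# Toy example; the real data lives under data/authors.csv.
authors = pd.DataFrame({"author_id": [1, 2, 3],
                        "name": ["J. Smith", "j smith", "A. Jones"]})
print(possible_duplicate_authors(authors))
```

Flagged groups would still need manual or heuristic review before merging, since distinct authors can share a name.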
- Draft the initial data model (nodes, relationships, and labels) and an ETL strategy to load the relevant data from the publications, authors, and topics CSV files into Neo4j.
- Clean up the datasets, handling, for example, possibly duplicated authors.
- Recommend a group of people to review the incoming publications.
- Identify the most influential authors.
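A minimal sketch of the loading step, using batched, idempotent `MERGE` statements. The labels (`Author`, `Publication`), key properties (`author_id`, `publication_id`), and the `AUTHORED` relationship type are assumptions for illustration; the actual model is in graphDB_model.svg.

```python
def merge_author_cypher() -> str:
    # UNWIND a batch of rows and MERGE on a stable key so that
    # re-running the load does not create duplicate nodes.
    return (
        "UNWIND $rows AS row "
        "MERGE (a:Author {author_id: row.author_id}) "
        "SET a.name = row.name"
    )

def merge_authored_cypher() -> str:
    # Connect each author to the publications they wrote.
    return (
        "UNWIND $rows AS row "
        "MATCH (a:Author {author_id: row.author_id}) "
        "MATCH (p:Publication {publication_id: row.publication_id}) "
        "MERGE (a)-[:AUTHORED]->(p)"
    )

# With the official neo4j driver, a batch would be executed as, e.g.:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     session.run(merge_author_cypher(), rows=author_rows)
```

Batching rows through `UNWIND` keeps round-trips low, and `MERGE` on a single key property makes the load safe to repeat.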
- Assignment for Knowledge Graph Engineer.pdf file describing the assessment.
- assignment_slideck.pdf file is the slide deck describing the solution process.
- graphDB_model.svg depicts the graph data model.
- data/ folder. Contains the data CSV files.
- notebooks/ folder. Contains the Jupyter notebooks:
- to perform initial data exploration (exploration.ipynb),
- to run the graph data science algorithms (analysis.ipynb).
- src/ folder. Contains the Python script to load the data into the graph database (etl_pandas.py).
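The influence analysis in analysis.ipynb presumably uses the Neo4j Graph Data Science library. A hedged sketch of what a PageRank ranking of authors could look like; the projected graph name, the `Author` label, and the `CO_AUTHOR` relationship type are assumptions, not taken from the actual model:

```python
# Project an in-memory graph of authors and their co-authorship links,
# then stream PageRank scores as a proxy for author influence.
PROJECT = "CALL gds.graph.project('authors', 'Author', 'CO_AUTHOR')"

PAGERANK = (
    "CALL gds.pageRank.stream('authors') "
    "YIELD nodeId, score "
    "RETURN gds.util.asNode(nodeId).name AS author, score "
    "ORDER BY score DESC LIMIT 10"
)

# Executed against a running Neo4j instance with the official driver, e.g.:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     session.run(PROJECT)
#     top_authors = session.run(PAGERANK).data()
```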