NoMoSQL

Venkat Parthasarathy - parth039
Ojas Narayanann - bhava006
Ahmet D - dokum001

Slack Channel

User Documentation

Data Ingestion

Set the appropriate delimiter after inspecting the CSV file.
Decide on the data format for each field. For example, we used decimal(38,4) for the prices field because the default float/decimal type kept more digits after the decimal point, and the digits beyond the fourth place were garbage values.
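The effect of fixing prices to four decimal places can be illustrated outside Impala with a minimal Python sketch. The helper name to_price and the ROUND_HALF_UP rounding mode are assumptions made for illustration; they are not taken from our ingestion scripts.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_price(raw: str) -> Decimal:
    """Parse a CSV price string and keep exactly four decimal places.

    Hypothetical helper; the rounding mode is an assumption.
    """
    return Decimal(raw).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

# A binary float cannot store most decimal fractions exactly, so printing
# it with enough digits exposes trailing digits that were never in the file.
print(f"{float('19.99'):.20f}")

# A fixed-scale decimal keeps only the four digits we asked for.
print(to_price("19.99"))  # 19.9900
```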

Performance on the view using partitioned and non-partitioned data (Deliverable 3- Part B Q2)

Aggregate queries over partitioned data can be faster because each partition can be processed in parallel and the per-partition results combined at the end. With non-partitioned data, the query is processed serially. So execution time is better with partitioned data.
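The idea of aggregating each partition independently and merging the partial results can be sketched in Python as follows. This only illustrates the shape of the computation, not how Impala schedules work; the function names and the toy rows are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_sums(rows):
    """Aggregate one partition: total amount per year."""
    totals = defaultdict(int)
    for year, amount in rows:
        totals[year] += amount
    return dict(totals)

def aggregate(partitions):
    """Scan every partition concurrently, then merge the partial results."""
    merged = defaultdict(int)
    with ThreadPoolExecutor() as pool:
        for part in pool.map(partial_sums, partitions):
            for year, total in part.items():
                merged[year] += total
    return dict(merged)

# Toy data: three partitions of (year, amount) rows.
partitions = [
    [(2018, 10), (2018, 5)],
    [(2018, 1), (2019, 7)],
    [(2019, 3), (2020, 4)],
]
print(aggregate(partitions))
```

Each partition produces a small per-year summary, so the final merge step touches far fewer values than the original rows.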

Partitioning strategy for Kudu (Deliverable 4- Q1)

Initially, we decided to partition by RANGE on year, because that suits the business question being answered here. However, Kudu does not support partitioning on non-primary-key fields. We then noticed that the sales table is stored in chronological order with an auto-incrementing orderid: the first record has the year 2018 and the last record has the year 2020. Hence, we decided to create 3 near-equal-sized partitions based on orderid: (0,220000), (220000,440000) and (440000,..). The motivation is that records with the same year are likely to reside in the same partition.
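How an orderid maps onto the three ranges can be sketched with a small Python lookup. It assumes each range includes its lower bound and excludes its upper bound; the names SPLIT_POINTS and partition_for are hypothetical.

```python
from bisect import bisect_right

# Boundaries between the three orderid ranges described above.
SPLIT_POINTS = [220_000, 440_000]

def partition_for(order_id: int) -> int:
    """Return the 0-based partition index for an orderid.

    Assumes lower bounds are inclusive and upper bounds exclusive.
    """
    return bisect_right(SPLIT_POINTS, order_id)

for order_id in (1, 219_999, 220_000, 439_999, 440_000):
    print(order_id, "-> partition", partition_for(order_id))
```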

Suggestion on how to stream records into the Sales table (Deliverable 4- Q9)

As observed above, Kudu only allows partitioning on primary-key columns, and our primary key is OrderID, as given in the question. With that in mind, we can use two different strategies while streaming data into the table.

Assumption: all records are currently in chronological order.

We can add a column called isActive that signifies whether the record should exist in the table. This makes deletion faster, because we only mark the record and physically delete it later, when the database is under less load.
We can construct a secondary index on date.
If the Date field of a record is updated, we can set isActive to 0 on the old record and insert the updated version as a new record with the same orderId, placing it in the correct partition and at the correct position within that partition using the secondary index we created.
This ensures that the records always stay in chronological order and helps us answer the business questions where we need to group by year.
If the date is not updated, we can simply upsert the data, which updates the existing record.
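The update rules above can be sketched as a small in-memory model in Python. This is not Kudu or Impala code: the table is a plain list, the field names other than isActive and orderId are assumptions, chronological order is restored by sorting on read rather than through a secondary index, and the sketch does not address how two versions of the same orderId would coexist under a primary-key constraint.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SaleRecord:
    order_id: int
    order_date: date
    amount: str          # hypothetical field, kept as text for simplicity
    is_active: bool = True

def apply_update(table: list, incoming: SaleRecord) -> None:
    """Apply one streamed record to the in-memory table."""
    current = next(
        (r for r in table if r.is_active and r.order_id == incoming.order_id),
        None,
    )
    if current is None:
        table.append(incoming)            # new order: plain insert
    elif current.order_date == incoming.order_date:
        current.amount = incoming.amount  # date unchanged: upsert in place
    else:
        current.is_active = False         # date changed: soft-delete old row
        table.append(incoming)            # and insert the replacement

def active_rows_by_date(table: list) -> list:
    """Active rows in chronological order (stands in for the date index)."""
    return sorted((r for r in table if r.is_active), key=lambda r: r.order_date)
```

Rows with is_active set to False stay in the list until a later cleanup pass, which mirrors the deferred deletion described above.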

Kudu Results (Deliverable-4)

YouTube Link: https://youtu.be/LKMVNENjUOU

Q3:

+------+-------------------+
| year | amount            |
+------+-------------------+
| 2020 | 138265412390.2275 |
| 2018 | 797483356326.3954 |
| 2019 | 931884640402.8126 |
+------+-------------------+
Fetched 3 row(s) in 2.12s

Q5:

+------+-------------------+
| year | amount            |
+------+-------------------+
| 2020 | 138265430220.2075 |
| 2018 | 797483356326.3954 |
| 2019 | 931884640402.8126 |
+------+-------------------+
Fetched 3 row(s) in 1.94s

Q8:

+------+-------------------+
| year | amount            |
+------+-------------------+
| 2020 | 138265387834.0804 |
| 2018 | 797483356326.3954 |
| 2019 | 931884640402.8126 |
+------+-------------------+
Fetched 3 row(s) in 1.89s

Comments on performance

For Deliverables 2 and 3, performance improved slightly when we worked with partitioned tables, and we expect a much larger improvement in production.
Subsequent queries in Kudu generally took slightly less time, probably because we were modifying data in only one partition, so only that partition's result needed to be updated. This can produce a large performance boost with larger datasets and more partitions.

Deployment Runbook

Give permissions to run the script.

chmod +x main.sh

Execute the script.

./main.sh

To execute the Kudu script.

cd kudu
chmod +x main_kudu.sh
./main_kudu.sh

To remove all data from Impala and HDFS.

./clear.sh

The .sql files for creating the different views and tables are named after the table or view they create.

<name_of_table/view>.sql
