This project forked from xinyi329/agricultrends

Project of Big Data Application Development @ NYU

Scala 23.97% Jupyter Notebook 76.03%

AgriculTrends: Big Data Analytics for Agriculture

Project for Big Data Application Development @ NYU
Team Members: Xinyi Liu (xl2700), Yiming Li (yl6183), Ian Lam (iil209)

Project Description

This project introduces an analytic that provides insights into agricultural production and market supply by country. Using historical climate data, crop production levels, and producer prices, users can explore the relationships among weather, production volume, and producer prices through visualizations. The analytic can also estimate future harvests and producer prices from the historical data, so that market supply might be adjusted in advance to benefit both farmers and consumers.
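
One simple way to picture the estimation described above is a least-squares trend line fitted to past values. The sketch below is a stand-alone illustration in plain Scala, not the project's Spark code. The object name `TrendSketch`, the function names, and the sample values are all assumptions made for this example.

```scala
// Illustrative sketch only: a one-variable ordinary least-squares fit in
// plain Scala. The project's regression code runs in Spark and may differ.
object TrendSketch {

  /** Fits y = intercept + slope * x and returns (intercept, slope). */
  def fitLine(points: Seq[(Double, Double)]): (Double, Double) = {
    require(points.size >= 2, "need at least two points")
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    // Sum of squared deviations of x, and of cross deviations of x and y.
    val sxx = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(sxx > 0.0, "x values must not all be equal")
    val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }

  /** Evaluates a fitted (intercept, slope) model at x. */
  def predict(model: (Double, Double), x: Double): Double =
    model._1 + model._2 * x

  def main(args: Array[String]): Unit = {
    // Hypothetical (year, yield) observations, invented for illustration.
    val history = Seq((2015.0, 10.0), (2016.0, 12.0), (2017.0, 14.0))
    val model = fitLine(history)
    println(f"estimated value for 2018: ${predict(model, 2018.0)}%.2f")
  }
}
```

A single-predictor fit keeps the idea visible. A real estimate would likely use several predictors, such as the climate variables.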

Directory Structure

.
├── README.md
├── app_code
│   ├── BDAD_prject
│   │   ├── build.sbt
│   │   ├── project
│   │   ├── src
│   │   │   └── main
│   │   │       └── scala
│   │   │           └── BDAD_project.scala
│   │   └── target
│   │       └── scala-2.11
│   │           └── bdad_prject_2.11-0.1.jar
│   ├── analytic
│   │   ├── aggregation.scala
│   │   ├── iteration_price_yield.scala
│   │   ├── iteration_yield_country.scala
│   │   ├── iteration_yield_weather.scala
│   │   ├── regression_price.scala
│   │   └── regression_yield.scala
│   └── visualization
│       ├── AgriculTrends.twb
│       └── tables
│           ├── AgriculTrendsAggregation.csv
│           ├── AgriculTrendsChangeRate.csv
│           ├── AgriculTrendsPriceChangeRegression.csv
│           ├── AgriculTrendsWeather.csv
│           ├── AgriculTrendsYieldHeatMap.csv
│           ├── AgriculTrendsYieldRegression.csv
│           ├── agriculTrendsMostYield.csv
│           ├── agriculTrendsTop10Yield.csv
│           └── agriculTrendsTotalYield.csv
├── data_ingest
│   ├── Production
│   │   └── data_ingest.txt
│   ├── climate
│   │   └── climate_ingest.ipynb
│   └── producerPrice
│       └── data_ingest
├── etl_code
│   ├── climate
│   │   └── climate_clean.scala
│   ├── producerPrice
│   │   └── cleaning.scala
│   └── production
│       └── data_cleaning.scala
├── profiling_code
│   ├── climate
│   │   └── climate_profiling.scala
│   ├── producerPrice
│   │   └── profiling.scala
│   └── production
│       └── data_profiling.scala
├── screenshots
│   ├── analytic
│   └── visualization
│       ├── AgriculTrends.png
│       └── AgriculTrends_Yield_Weather.png
└── test_code
    └── regression.scala
  • /app_code: source code for the application, including the Spark Scala analytics code, the JAR file, and the Tableau visualization files
  • /app_code/BDAD_prject/target/scala-2.11: contains the project JAR file
    • bdad_prject_2.11-0.1.jar: final project JAR file that can be submitted as a Spark job
  • /app_code/analytic:
    • aggregation.scala: joins the three datasets into one large dataframe
    • iteration_price_yield.scala: analyzes the relationship between producer price and crop yield
    • iteration_yield_country.scala: analyzes the relationship between crop yield and the country
    • iteration_yield_weather.scala: analyzes the relationship between crop yield and weather
    • regression_price.scala: linear regression for producer price
    • regression_yield.scala: linear regression for crop yield
  • /data_ingest: commands used to upload each of the three datasets to the Dumbo HDFS
  • /etl_code: Scala source code used to clean and transform each of the three datasets in Spark
  • /profiling_code: Scala source code for profiling the three datasets, before and after the ETL step
  • /screenshots: screenshots of the analytic running, including the analytic results from Spark and the visualization results from Tableau
  • /test_code: regression code that was left unused because of poor performance
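
The aggregation step listed above joins the three datasets into one table. As a rough illustration of that idea, the sketch below inner-joins three small in-memory datasets on a shared (country, year) key using plain Scala collections. It is not the code in aggregation.scala, which runs in Spark. The join key, the object name `JoinSketch`, the field names, and the sample values are all assumptions made for this example.

```scala
// Illustrative sketch only: an inner join of three datasets keyed by
// (country, year), written with plain Scala collections instead of Spark.
// All names and values here are hypothetical.
object JoinSketch {
  type Key = (String, Int) // (country, year)

  final case class Joined(country: String, year: Int,
                          production: Double, price: Double, temperature: Double)

  /** Keeps only the keys present in all three inputs, sorted by country and year. */
  def joinAll(production: Map[Key, Double],
              price: Map[Key, Double],
              climate: Map[Key, Double]): Seq[Joined] = {
    val sharedKeys = production.keySet intersect price.keySet intersect climate.keySet
    sharedKeys.toSeq.sorted.map { case key @ (country, year) =>
      Joined(country, year, production(key), price(key), climate(key))
    }
  }

  def main(args: Array[String]): Unit = {
    // Invented sample rows; "B" has no price record and is therefore dropped.
    val production = Map(("A", 2000) -> 1.0, ("B", 2000) -> 2.0)
    val price      = Map(("A", 2000) -> 10.0)
    val climate    = Map(("A", 2000) -> 20.0, ("B", 2000) -> 5.0)
    joinAll(production, price, climate).foreach(println)
  }
}
```

An inner join drops any country-year pair that is missing from one of the inputs. Whether the project's aggregation uses an inner join or another join type is not stated here, so treat that choice as an assumption.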

How to Build/Run Code

No building of code is required for this application.

To upload the datasets into HDFS, follow the commands in the /data_ingest directory.

All the cleaning, ETL, and analysis code was originally run in the Spark Shell (REPL). To run the code, copy and paste it into the Spark Shell. Comments within the code describe the different analyses in more detail.

Besides running the code in the Spark Shell, users can also run it by submitting a Spark job with the JAR file bdad_prject_2.11-0.1.jar provided in the /app_code/BDAD_prject/target/scala-2.11 directory.

The following command submits the Spark job:

spark2-submit --name "BDAD-project" --class "BDAD_project" --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 10 ~/bdad_prject_2.11-0.1.jar

To visualize the results in Tableau, we first save the Spark analysis results as Hive tables in HDFS, then export the Hive tables as CSV files to the local file system on Dumbo, and finally transfer them manually to a local computer. This is necessary because the direct data transfer rate between Tableau and Dumbo is too slow to handle the large volume of data.

The following command exports a Hive table in CSV format:

beeline -u jdbc:hive2://babar.es.its.nyu.edu:10000/[netID] -n [netID] -w [textFileContainingUserPassword] --outputformat=csv2 -e "select * from [hiveTableName]" > [output].csv

Where to Find Results of Run

The results of the runs are saved as Hive tables in HDFS. The /screenshots directory also contains examples of the expected results.

Input Datasets

Crops

Source: Crop data from 1961-2018: http://www.fao.org/faostat/en/#data/QC
Dumbo Location: hdfs:/user/yl6183/BDAD_project/cleaned_data_v2

Producer Prices - Annual

Source: Producer Prices from 1966-1991: www.fao.org/faostat/en/#data/PA , Producer Prices from 1992-2018: www.fao.org/faostat/en/#data/PP
Dumbo Location: hdfs:/user/iil209/bdad_project/cleaned_data_combined_v2

World Bank Climate Data

Source: https://datahelpdesk.worldbank.org/knowledgebase/articles/902061-climate-data-api
Dumbo Location: hdfs:/user/xl2700/agricultrends/climate/
