BigDataWarehouse

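Data warehouse course project, 2020

Dataset Information

Original data source: http://snap.stanford.edu/data/web-Movies.html

The dataset consists of movie reviews from Amazon. The data span more than 10 years and include all of the roughly 8 million reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review.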
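As a rough illustration of the review layout described above, the following sketch reads `key: value` records separated by blank lines, which is how the SNAP page presents `movies.txt`. The function name `parse_reviews` and the sample values are hypothetical, and this is not necessarily how the ETL scripts in this repository read the file.

```python
from typing import Dict, Iterable, Iterator


def parse_reviews(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per review from 'key: value' lines split by blank lines."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first ': ' only, so colons inside the text survive.
        key, sep, value = line.partition(": ")
        if sep and "/" in key:
            record[key] = value
    if record:
        yield record


# Made-up sample in the assumed layout (IDs and text are not real data).
sample = """product/productId: P0001
review/userId: U0001
review/score: 5.0
review/text: Great: would watch again

product/productId: P0001
review/userId: U0002
review/score: 3.0
review/text: Fine
"""

reviews = list(parse_reviews(sample.splitlines()))
for review in reviews:
    print(review["review/userId"], review["review/score"])
```

For the real file, the same generator could be fed an open file handle instead of `sample.splitlines()`; the right text encoding for `movies.txt` would need to be checked separately.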

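Data augmentation: crawl the Amazon web page for every product ID that appears in the original dataset.

Data Storage

The data is stored in three systems, Neo4j, MySQL, and Hive, and query performance is compared across them.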
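One way to compare query performance across the three stores is to time the same logical query several times on each and compare the medians. The sketch below is only an illustration of that idea: `time_query` and `compare` are hypothetical helpers, and the callables are stand-in workloads. In practice each one would run the query through the corresponding database driver.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` calls of `run`."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(backends: Dict[str, Callable[[], object]],
            repeats: int = 5) -> Dict[str, float]:
    """Time every backend callable and return {name: median seconds}."""
    return {name: time_query(run, repeats) for name, run in backends.items()}


# Stand-in workloads (assumption): replace each lambda with a function that
# executes the same logical query against the real Neo4j, MySQL, or Hive store.
backends = {
    "neo4j": lambda: sum(range(1000)),
    "mysql": lambda: sum(range(2000)),
    "hive": lambda: sum(range(3000)),
}

timings = compare(backends, repeats=3)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Using the median rather than the mean keeps a single slow run, such as a cold cache on the first call, from dominating the result.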

Folder Structure

.
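├── ETL                         # ETL script project
│   ├── processedData           # processed data
│   │   └── neo4jcsv            # CSV files needed for loading into Neo4j
│   ├── rawData                 # original dataset (movies.txt)
│   └── utils                   # ETL utility modules
│       ├── deduplicate         # module that deduplicates the raw data
│       └── neo4jPreprocess     # module that converts data into the CSV format Neo4j needs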
│           ├── csvData
│           └── model
├── mysql
│   ├── config
│   └── data
│       ├── #innodb_temp
│       ├── mysql
│       ├── performance_schema
│       └── sys
├── neo4j
│   ├── conf
│   ├── data
│   │   ├── databases
│   │   │   ├── neo4j
│   │   │   └── system
│   │   │       └── schema
│   │   └── transactions
│   │       ├── neo4j
│   │       └── system
│   ├── import
│   ├── logs
│   └── plugins
└── ...
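The `deduplicate` and `neo4jPreprocess` modules listed in the tree are not shown here, so the following is only a minimal sketch of what such steps might look like. It assumes that two reviews are duplicates when they share product ID, user ID, and timestamp, and that relationships are exported with the `:START_ID` / `:END_ID` / `:TYPE` header convention of Neo4j's bulk import tool. The function names, the `REVIEWED` relationship type, and the sample rows are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def deduplicate(reviews: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first review seen for each (product, user, time) key."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for review in reviews:
        key = (
            review.get("product/productId"),
            review.get("review/userId"),
            review.get("review/time"),
        )
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


def write_review_edges(reviews: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write one user-to-product relationship row per review; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    # Header format assumed from Neo4j's bulk import conventions.
    writer.writerow([":START_ID", ":END_ID", "score:float", ":TYPE"])
    count = 0
    for review in reviews:
        writer.writerow([
            review["review/userId"],
            review["product/productId"],
            review["review/score"],
            "REVIEWED",
        ])
        count += 1
    return count


# Made-up rows: the first two are exact duplicates.
sample = [
    {"product/productId": "P1", "review/userId": "U1",
     "review/time": "100", "review/score": "5.0"},
    {"product/productId": "P1", "review/userId": "U1",
     "review/time": "100", "review/score": "5.0"},
    {"product/productId": "P2", "review/userId": "U1",
     "review/time": "200", "review/score": "3.0"},
]

unique = deduplicate(sample)
buffer = io.StringIO()
written = write_review_edges(unique, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; a real run would open a file under `ETL/processedData/neo4jcsv` instead.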

Requirements

ETL Project

$ cd ETL/
$ pip install -r requirements.txt

Database

Install Docker on macOS

$ brew install --cask docker
$ brew install docker-compose

Install Docker on other platforms: https://docs.docker.com/compose/install/

Usage

ETL Project

$ cd ETL/
$ python run.py

Database

$ docker-compose up

Contributors

major-333, nntraveler
