Self-service modeling

Description

Most data modeling tools require strong programming skills for data processing and algorithm selection. This high technical threshold, together with a modeling process that cannot be fully automated, poses a considerable challenge for front-line business personnel. At the same time, as data volumes grow, the traditional R-based modeling process becomes time-consuming and cannot keep modeling results synchronized with client requests in real time. This tool addresses both problems by providing self-service, real-time data analysis and modeling.

The project has been integrated into the customer relationship management (CRM) systems of financial institutions, where it has played an important role in precision marketing.

Processing flow

Step 1. Define modeling goals

Built-in scenario: VIP customer churn warning
Custom modeling: Define the modeling target by query criteria

Step 2. Select customer group

Filter by indicator: for example, credit card customer segmentation identifier = 2
Filter by label: for example, whether the customer is one of our credit card customers (0 or 1)
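
The two filter styles above can be sketched as predicates over customer records. This is a minimal illustration in plain Python, not the tool's SparkR implementation. The field names `segment_id` and `is_credit_card_customer`, the helper names, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, List

Customer = Dict[str, int]

# Hypothetical sample records; the field names are assumptions for illustration.
CUSTOMERS: List[Customer] = [
    {"id": 1, "segment_id": 2, "is_credit_card_customer": 1},
    {"id": 2, "segment_id": 1, "is_credit_card_customer": 1},
    {"id": 3, "segment_id": 2, "is_credit_card_customer": 0},
]


def by_indicator(field: str, value: int) -> Callable[[Customer], bool]:
    """Indicator filter: keep customers whose indicator equals a given value."""
    return lambda c: c.get(field) == value


def by_label(field: str) -> Callable[[Customer], bool]:
    """Label filter: keep customers whose 0/1 label is set to 1."""
    return lambda c: c.get(field) == 1


def select_group(customers: List[Customer], *predicates) -> List[Customer]:
    """Return the customers that satisfy every predicate."""
    return [c for c in customers if all(p(c) for p in predicates)]


group = select_group(
    CUSTOMERS,
    by_indicator("segment_id", 2),
    by_label("is_credit_card_customer"),
)
```

Composing predicates this way keeps each filter independent, so indicator and label conditions can be combined freely when a customer group is defined.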

Step 3. Generate statistical description report

Includes maximum and minimum, mean and median, standard deviation, missing values, correlations, histograms, etc.
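
As a rough sketch of what such a report computes for a single numeric column, the following uses only the Python standard library. It is an illustration under assumptions, not the tool's R code: the function names `describe` and `pearson`, the use of `None` for missing values, and the sample data are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Optional


def describe(values: List[Optional[float]]) -> Dict[str, float]:
    """Summarize one numeric column; None marks a missing value.

    Needs at least two non-missing values for the standard deviation.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std": statistics.stdev(present),  # sample standard deviation
    }


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equally long, complete columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical column with one missing value.
balances = [100.0, None, 300.0, 200.0, 400.0]
summary = describe(balances)
```
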

Step 4. Select model algorithm

Logistic regression (spark.logit)
Random forest (spark.randomForest)
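
To make the logistic regression option concrete, here is a small sketch of the model itself, fitted by plain batch gradient descent in Python. This is not how `spark.logit` or Spark MLlib fits the model; the solver, the learning rate, the epoch count, the function names, and the toy data are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> List[float]:
    """Fit weights by batch gradient descent; weights[0] is the intercept."""
    weights = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            err = sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """Predicted probability of the positive class for one record."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


# Toy data: one feature, label 1 for the larger values.
features = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights = fit_logistic(features, labels)
```
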

Step 5. Model execution result

Output hit rate, coverage and result set
Data statistics (summary by institution, e.g. number of clients, total assets, average holdings)
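
The source does not define hit rate and coverage. The sketch below assumes the common reading: hit rate is the share of flagged customers who are actual positives (precision), and coverage is the share of actual positives that were flagged (recall). The function name and sample labels are hypothetical.

```python
from typing import Sequence, Tuple


def hit_rate_and_coverage(
    actual: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float]:
    """Return (hit rate, coverage) for 0/1 actual and predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    flagged = sum(predicted)   # customers the model flagged
    positives = sum(actual)    # customers who really were positive
    hit_rate = hits / flagged if flagged else 0.0
    coverage = hits / positives if positives else 0.0
    return hit_rate, coverage


# Hypothetical labels for six customers.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 0]
hit_rate, coverage = hit_rate_and_coverage(actual, predicted)
```
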

Technical architecture

The tool requires the R language environment, the SparkR distributed computing environment, and the Rserve component service.

First, R's strengths in statistical analysis and predictive modeling are used for data storage and processing, array and matrix operations, and descriptive statistics and plotting.

Second, SparkR provides a lightweight front end for calling Apache Spark from R. Its distributed data frames support operations such as selection, filtering and aggregation, which makes it possible to process massive data sets. The MLlib distributed machine learning library integrated in Spark makes it straightforward to build the back-end algorithm engine.

Third, Rserve is used so that the interactive side can call the R language server remotely. Because Rserve works in client/server (C/S) mode, the interactive side does not need to link against the R library, which keeps the interactive-side Java program and the back-end R program loosely coupled.

This tool is currently in use at large commercial banks.

Patent Information

SanShan Sun, Bing Han, et al., “Data modeling methods, devices, storage media, and processors”, invention patent, CN112988119A, published.
