
Treselle Systems Pvt Ltd's Projects

airflow_to_manage_talend_etl_jobs

Airflow, an open-source platform, is used to orchestrate workflows as Directed Acyclic Graphs (DAGs) of tasks in a programmatic manner. The Airflow scheduler is used to schedule workflows and data-processing pipelines. The Airflow user interface makes it easy to visualize pipelines running in production, monitor workflow progress, and troubleshoot issues when needed.
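
A minimal Airflow 2-style sketch of how an exported Talend job might be triggered from a DAG; the script path, DAG id, and schedule are hypothetical, not taken from the project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="talend_etl_job",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_talend_job = BashOperator(
        task_id="run_talend_job",
        # Trailing space stops Airflow treating the .sh path as a Jinja template file.
        bash_command="/opt/talend/jobs/customer_etl/customer_etl_run.sh ",
    )
```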

apache_drill_vs_amazon_athena

Amazon Athena, a serverless interactive query service, is used to easily analyze big data in Amazon S3 using standard SQL. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data. Here we compare data partitioning in Apache Drill and Amazon Athena, along with the distinct features of each.
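
As a rough illustration of partition pruning on the Athena side, a query against a partitioned table can be submitted from Python with boto3; the database, table, and bucket names below are hypothetical:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT event_date, COUNT(*) AS events
        FROM analytics.page_views          -- partitioned by event_date
        WHERE event_date = '2024-01-01'    -- partition pruning limits the S3 scan
        GROUP BY event_date
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```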

cdr_analysis_using_k_means_in_tableau

Clusters customer activity over 24 hours using the K-means clustering feature in Tableau 10. Tableau 10's clustering feature automatically groups similar data points together. This type of clustering helps you create statistically based segments that provide insight into how different groups are similar and how they perform compared to each other.

cdr_analysis_using_k_means_with_r

A Call Detail Record (CDR) is the information captured by telecom companies during call, SMS, and internet activity. Combined with customer demographics, this information provides insights into customer needs. Many telecom companies use call detail records for fraud detection by clustering user profiles, for churn prediction based on usage activity, and for targeting profitable customers with RFM analysis.
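
The project itself uses R; as a hedged stand-in, here is the same clustering idea in Python with scikit-learn on synthetic CDR-style activity counts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical CDR-style features: per-customer activity counts for
# calls, SMS, and internet sessions over a 24-hour window.
rng = np.random.default_rng(42)
usage = rng.poisson(lam=[5, 12, 8], size=(200, 3)).astype(float)

# K-means is distance-based, so scale features before clustering.
scaled = StandardScaler().fit_transform(usage)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)
print(np.bincount(labels))  # cluster sizes
```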

crime_analysis_using_h2o_autoencoders

Deep Learning (DL) and Machine Learning (ML) are used to analyze data and make accurate predictions. Crime prediction not only helps in crime prevention but also enhances public safety. The H2O autoencoder model is deployed into a real-time production environment by converting it into a POJO object using H2O functions.
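
A minimal sketch of that H2O workflow in Python, assuming a numeric feature file; the path, columns, and network shape are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Hypothetical numeric crime-features file.
frame = h2o.import_file("crime_features.csv")

# Train an autoencoder and use reconstruction error as an anomaly score.
model = H2ODeepLearningEstimator(
    autoencoder=True,
    hidden=[16, 8, 16],
    epochs=20,
)
model.train(x=frame.columns, training_frame=frame)

scores = model.anomaly(frame)       # per-row mean squared reconstruction error
h2o.download_pojo(model, path=".")  # export the model as a POJO for production
```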

customer_churn_analysis

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship. It is also referred to as the loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of revenue. So it is very important to predict which users are likely to churn and the factors affecting their decisions. We show how a logistic regression model built with R can be used to identify customer churn in a telecom dataset.
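
The project builds the model in R; below is a hedged Python equivalent of the same logistic-regression approach, with placeholder file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical telecom churn dataset; column names are placeholders.
df = pd.read_csv("telecom_churn.csv")
X = df[["tenure_months", "monthly_charges", "support_calls"]]
y = df["churned"]  # 1 = churned, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```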

data_analysis_using_apache_hive_and_apache_pig

Apache Hive, an open-source data warehouse system, is used with Apache Pig for loading and transforming unstructured, structured, or semi-structured data for analysis and better business insights. Pig, a standard ETL scripting language, is used to export data from and import data into Apache Hive and to process large datasets.

data_normalization_and_filtration_using_drools

Drools, a rule engine, is used to implement an expert system using a rule-based approach. It converts both structured and unstructured data into transient data by applying business logic for normalizing and filtering data in a Drools DRL file.

data_quality_checks_with_streamsets

StreamSets is used not only for big data ingestion but also for analyzing real-time streaming data. It can identify null or bad records in source data and filter them out to produce precise results, helping businesses make quick and accurate decisions.

falcon_data_pipeline

In our use case, we have used Apache Falcon to centrally define data pipelines; Falcon then uses those definitions to auto-generate workflows in Apache Oozie. Falcon data flows are synced with Atlas through Kafka topics, so Atlas knows about Falcon metadata. Atlas provides Falcon feed lineage and can tell which table was the source for another table.

handle_class_imbalance_data

Imbalanced data refers to classification problems where one class outnumbers the other by a substantial margin. Extremely imbalanced data is common in banking and financial datasets.
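
One common remedy is cost-sensitive training; a minimal scikit-learn sketch on synthetic 95/5 data (the technique shown and the numbers are illustrative, not from the project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic 95/5 imbalance, loosely mimicking fraud-style financial data.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# class_weight="balanced" reweights errors inversely to class frequency,
# one common remedy alongside over- and under-sampling.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(confusion_matrix(y, model.predict(X)))
```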

hive_streaming_with_storm_kafka

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released to support continuous data ingestion into Hive tables. This API is intended to support streaming clients like Flume or Storm to better store data in Hive, which traditionally had batch-oriented storage. In our use case, we use Kafka with Storm to load streaming data into a bucketed Hive table. Multiple Kafka topics produce data to Storm, which ingests it into a transactional Hive table. Data committed in a transaction is immediately available to Hive queries from other Hive clients. Apache Atlas tracks the lineage of the Hive transactional table, Storm (bolt, spout), and Kafka topic, which helps us understand how data is ingested into the Hive table.
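
The Streaming API only writes to bucketed, ORC-backed, transactional tables; here is a hedged sketch of such a table's DDL, issued through the PyHive client with hypothetical table and column names:

```python
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Streaming ingest requires an ORC table that is bucketed and transactional.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS stream_events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (event_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
```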

loan_application_metrics_by_state_using_pig

In our use case, we use Pig to calculate loan application metrics from mock loan, application, and applicant data. We prepared this mock data with specific criteria in mind, and the fields included in each dataset follow the standard data model used in the US finance industry.

loan_prediction_using_pca_and_naive_bayes_classification_with_r

Nowadays, there are numerous risks related to bank loans, both for the banks and for the borrowers. Analyzing loan risk requires understanding the risk and its level. Banks need to analyze their customers' loan eligibility so that they can target those customers specifically. Banks want to automate the loan eligibility process in real time, based on customer details such as gender, marital status, age, occupation, income, and debts, provided in their online application forms. As the number of transactions in the banking sector grows rapidly and huge data volumes become available, customer behavior can be analyzed and loan risks reduced. So it is very important to predict the loan type and loan amount based on the bank's data.
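
The project implements this in R; a rough Python equivalent of the PCA-plus-Naive-Bayes pipeline, run here on synthetic stand-in applicant features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for applicant features (income, debts, age, ...).
X, y = make_classification(n_samples=1000, n_features=12, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# PCA reduces correlated applicant features before the Naive Bayes fit.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), GaussianNB())
pipeline.fit(X_train, y_train)
print(f"accuracy: {pipeline.score(X_test, y_test):.2f}")
```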

mongodb_compression

Explores MongoDB compression with the WiredTiger storage engine; the storage settings are supplied through the "mongod_v3.conf" configuration file in this repository.
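
Server-wide compression is set in the mongod configuration; as a related sketch, WiredTiger block compression can also be chosen per collection from Python with pymongo (the collection name and compressor choice are illustrative, not from the project):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["metrics"]

# Override the server default block compressor for one collection.
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zlib"}},
)
```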

mostpopularwords

A common word count example that finds the distinct words and their counts in raw input data using Apache Spark.
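
A minimal PySpark version of the word count; the input path is a placeholder, and the project itself may use a different Spark API:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mostpopularwords")

# Classic word count: split lines into words, pair each with 1, sum by key.
counts = (
    sc.textFile("input.txt")  # placeholder path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)
```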

mostpopularwordsbetter

Applies a regular expression with Spark's filter function to keep only alphabetic words longer than two characters.
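
The same word-count sketch with the regex filter added (again with a placeholder input path):

```python
import re

from pyspark import SparkContext

sc = SparkContext(appName="mostpopularwordsbetter")

counts = (
    sc.textFile("input.txt")  # placeholder path
      .flatMap(lambda line: line.split())
      # Keep purely alphabetic tokens longer than two characters.
      .filter(lambda w: re.fullmatch(r"[A-Za-z]+", w) and len(w) > 2)
      .map(lambda w: (w, 1))
      .reduceByKey(lambda a, b: a + b)
)
```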

pivot_unpivot_multiple_columns_in_ms_sql

MS SQL Server, a Relational Database Management System (RDBMS), is used for storing and retrieving data. Data integrity, data consistency, and data anomalies play a primary role when storing data in a database. Data must be provided in different formats to create different visualizations for analysis. For this purpose, you need to pivot (rows to columns) and unpivot (columns to rows) your data.
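
A hedged T-SQL PIVOT example, submitted here through pyodbc; the table, columns, and connection string are hypothetical:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=sales;Trusted_Connection=yes;"
)

# Hypothetical sales table: rows of (region, quarter, amount) are pivoted
# so that each quarter becomes its own column.
rows = conn.execute("""
    SELECT region, [Q1], [Q2], [Q3], [Q4]
    FROM (SELECT region, quarter, amount FROM sales_figures) AS src
    PIVOT (SUM(amount) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS p
""").fetchall()
```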

predict_bad_loans

The greatest challenge in machine learning is employing the best models and algorithms to accurately predict the probability of loan default, so that both investors and borrowers can make the best financial decisions. H2O's AutoML, an easy-to-use interface that also serves advanced users, automates the machine learning workflow, including training a large set of models.
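
A minimal H2O AutoML sketch, assuming a CSV with a binary bad_loan target; the file and column names are placeholders:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical loan dataset with a binary "bad_loan" target column.
frame = h2o.import_file("loans.csv")
frame["bad_loan"] = frame["bad_loan"].asfactor()

# Train a set of models and compare them on the leaderboard.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="bad_loan", training_frame=frame)
print(aml.leaderboard.head())
```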

restful_api_using_loopback

Building a RESTful API for CRUD operations using LoopBack, with MongoDB as the backend store. LoopBack, an easy-to-learn open-source Node.js framework, lets you create end-to-end REST APIs with less code than Express and other frameworks. It creates your basic routes automatically when you add a model to the application.

sensor_data_quality_management

Data Quality Management (DQM) is the process of continuously analyzing, defining, monitoring, and improving the quality of data. Here, DQM is applied using Python to check data for required values, validate data types, and detect integrity violations and data anomalies.
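
A small pandas sketch of the three kinds of checks named above; the file and column names are placeholders:

```python
import pandas as pd

# Hypothetical sensor readings file.
df = pd.read_csv("sensor_readings.csv")

# Required values: flag rows with missing readings or ids.
missing = df[df["temperature"].isna() | df["sensor_id"].isna()]

# Data types: coerce timestamps and flag unparseable ones.
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
bad_timestamps = df[df["ts"].isna()]

# Integrity/anomalies: readings outside a plausible physical range.
out_of_range = df[(df["temperature"] < -40) | (df["temperature"] > 125)]

print(len(missing), len(bad_timestamps), len(out_of_range))
```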

sfo_fire_service_call_analysis_using_spark

To understand Spark performance and tune the application, we created Spark applications using the RDD, DataFrame, Spark SQL, and Dataset APIs to answer the following questions from the SFO Fire Department call service dataset (the second question is sketched in code below):

  • How many different types of calls were made to the Fire Department?
  • How many incidents of each call type were there?
  • How many years of fire service calls are in the data file?
  • How many service calls were logged in the past 7 days?
  • Which neighborhood in SF generated the most calls last year?
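
A hedged PySpark sketch of the call-type count; the CSV path and column name are placeholders for whatever the dataset actually uses:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sfo_fire_calls").getOrCreate()

# Hypothetical path and column name for the SFO fire calls dataset.
df = spark.read.csv("sfo_fire_calls.csv", header=True, inferSchema=True)

# "How many incidents of each call type were there?"
(df.groupBy("CallType")
   .agg(F.count("*").alias("incidents"))
   .orderBy(F.desc("incidents"))
   .show(10, truncate=False))
```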

spark-druid-olap

Sparkline BI Accelerator provides fast ad-hoc query capability over logical cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

streaming_analytics_using_ksql

Kafka SQL (KSQL), a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real time.
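
A hedged sketch of registering a trip stream through the KSQL server's REST endpoint; the stream definition, topic name, and server address are illustrative, not taken from the project:

```python
import json
import urllib.request

# Submit a KSQL statement to the KSQL server REST endpoint.
statement = """
    CREATE STREAM trips (trip_duration INT, start_station VARCHAR)
    WITH (KAFKA_TOPIC='citibike_trips', VALUE_FORMAT='JSON');
"""
request = urllib.request.Request(
    "http://localhost:8088/ksql",
    data=json.dumps({"ksql": statement, "streamsProperties": {}}).encode(),
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
)
print(urllib.request.urlopen(request).read().decode())
```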

talend_java_component

Dynamic Jasper is a great tool for designing and creating simple or complex dynamic reports. Talend is not only the most common tool for data transformation; it can also be used for dynamic Jasper report generation via the tJasperInput component.

text_normalization_using_spark

We show how text normalization using Spark with regular expressions can be applied to identify the most and least popular words in raw input data.
