
Treselle Systems Pvt Ltd's Projects

airflow_to_manage_talend_etl_jobs

Airflow, an open-source platform, is used to orchestrate workflows as Directed Acyclic Graphs (DAGs) of tasks in a programmatic manner. The Airflow scheduler is used to schedule workflows and data-processing pipelines. The Airflow user interface makes it easy to visualize pipelines running in production, monitor workflow progress, and troubleshoot issues when needed.
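
A minimal Airflow 2-style sketch of how an exported Talend job might be triggered from a DAG; the script path, DAG id, and schedule are hypothetical, not taken from the project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="talend_etl_job",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_talend_job = BashOperator(
        task_id="run_talend_job",
        # Trailing space stops Airflow treating the .sh path as a Jinja template file.
        bash_command="/opt/talend/jobs/customer_etl/customer_etl_run.sh ",
    )
```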

apache_drill_vs_amazon_athena

Amazon Athena, a serverless interactive query service, is used to easily analyze big data in Amazon S3 using standard SQL. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data. Here we compare data partitioning in Apache Drill and Amazon Athena, along with the distinct features of each.
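
As a rough illustration of partition pruning on the Athena side, a query against a partitioned table can be submitted from Python with boto3; the database, table, and bucket names below are hypothetical:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT event_date, COUNT(*) AS events
        FROM analytics.page_views          -- partitioned by event_date
        WHERE event_date = '2024-01-01'    -- partition pruning limits the S3 scan
        GROUP BY event_date
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```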

cdr_analysis_using_k_means_in_tableau

Clusters customer activity over 24 hours using the K-means clustering feature in Tableau 10. Tableau 10's clustering feature automatically groups similar data points together. This type of clustering helps you create statistically based segments that provide insight into how different groups are similar and how they perform compared to each other.

cdr_analysis_using_k_means_with_r

A Call Detail Record (CDR) is the information captured by telecom companies during call, SMS, and internet activity. Combined with customer demographics, this information provides insights into customer needs. Many telecom companies use call detail records for fraud detection by clustering user profiles, for churn prediction based on usage activity, and for targeting profitable customers with RFM analysis.
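
The project itself uses R; as a hedged stand-in, here is the same clustering idea in Python with scikit-learn on synthetic CDR-style activity counts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical CDR-style features: per-customer activity counts for
# calls, SMS, and internet sessions over a 24-hour window.
rng = np.random.default_rng(42)
usage = rng.poisson(lam=[5, 12, 8], size=(200, 3)).astype(float)

# K-means is distance-based, so scale features before clustering.
scaled = StandardScaler().fit_transform(usage)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)
print(np.bincount(labels))  # cluster sizes
```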

crime_analysis_using_h2o_autoencoders

Deep Learning (DL) and Machine Learning (ML) are used to analyze data and make accurate predictions. Crime prediction not only helps in crime prevention but also enhances public safety. The H2O autoencoder model is deployed into a real-time production environment by converting it into a POJO object using H2O functions.
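
A minimal sketch of that H2O workflow in Python, assuming a numeric feature file; the path, columns, and network shape are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Hypothetical numeric crime-features file.
frame = h2o.import_file("crime_features.csv")

# Train an autoencoder and use reconstruction error as an anomaly score.
model = H2ODeepLearningEstimator(
    autoencoder=True,
    hidden=[16, 8, 16],
    epochs=20,
)
model.train(x=frame.columns, training_frame=frame)

scores = model.anomaly(frame)       # per-row mean squared reconstruction error
h2o.download_pojo(model, path=".")  # export the model as a POJO for production
```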

customer_churn_analysis

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship. It is also referred to as the loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of revenue. So it is very important to predict which users are likely to churn and the factors affecting their decisions. We show how a logistic regression model built with R can be used to identify customer churn in a telecom dataset.
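
The project builds the model in R; below is a hedged Python equivalent of the same logistic-regression approach, with placeholder file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical telecom churn dataset; column names are placeholders.
df = pd.read_csv("telecom_churn.csv")
X = df[["tenure_months", "monthly_charges", "support_calls"]]
y = df["churned"]  # 1 = churned, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```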

data_analysis_using_apache_hive_and_apache_pig

Apache Hive, an open-source data warehouse system, is used with Apache Pig for loading and transforming unstructured, structured, or semi-structured data for analysis and better business insights. Pig, a standard ETL scripting language, is used to export data from and import data into Apache Hive and to process large datasets.

data_normalization_and_filtration_using_drools

Drools, a rule engine, is used to implement an expert system using a rule-based approach. It converts both structured and unstructured data into transient data by applying business logic for normalizing and filtering data in a Drools DRL file.

data_quality_checks_with_streamsets

StreamSets is used not only for big data ingestion but also for analyzing real-time streaming data. It can identify null or bad records in source data and filter them out to produce precise results, helping businesses make quick and accurate decisions.

falcon_data_pipeline

In our use case, we have used Apache Falcon to centrally define data pipelines; Falcon then uses those definitions to auto-generate workflows in Apache Oozie. Falcon data flows are synced with Atlas through Kafka topics, so Atlas knows about Falcon metadata. Atlas provides Falcon feed lineage and can tell which table was the source for another table.

handle_class_imbalance_data

Imbalanced data refers to classification problems where one class outnumbers the other by a substantial margin. Extremely imbalanced data is common in banking and financial datasets.
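
One common remedy is cost-sensitive training; a minimal scikit-learn sketch on synthetic 95/5 data (the technique shown and the numbers are illustrative, not from the project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic 95/5 imbalance, loosely mimicking fraud-style financial data.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# class_weight="balanced" reweights errors inversely to class frequency,
# one common remedy alongside over- and under-sampling.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(confusion_matrix(y, model.predict(X)))
```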

hive_streaming_with_storm_kafka

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released to support continuous data ingestion into Hive tables. This API is intended to support streaming clients like Flume or Storm to better store data in Hive, which traditionally had batch-oriented storage. In our use case, we use Kafka with Storm to load streaming data into a bucketed Hive table. Multiple Kafka topics produce data to Storm, which ingests it into a transactional Hive table. Data committed in a transaction is immediately available to Hive queries from other Hive clients. Apache Atlas tracks the lineage of the Hive transactional table, Storm (bolt, spout), and Kafka topic, which helps us understand how data is ingested into the Hive table.
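
The Streaming API only writes to bucketed, ORC-backed, transactional tables; here is a hedged sketch of such a table's DDL, issued through the PyHive client with hypothetical table and column names:

```python
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Streaming ingest requires an ORC table that is bucketed and transactional.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS stream_events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (event_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
```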

loan_application_metrics_by_state_using_pig

In our use case, we use Pig to calculate loan application metrics from mock loan, application, and applicant data. We prepared this mock data with specific criteria in mind, and the fields included in each dataset follow the standard data model used in the US finance industry.

loan_prediction_using_pca_and_naive_bayes_classification_with_r

Nowadays, there are numerous risks related to bank loans, both for the banks and for the borrowers. Analyzing loan risk requires understanding the risk and its level. Banks need to analyze their customers' loan eligibility so that they can target those customers specifically. Banks want to automate the loan eligibility process in real time, based on customer details such as gender, marital status, age, occupation, income, and debts, provided in their online application forms. As the number of transactions in the banking sector grows rapidly and huge data volumes become available, customer behavior can be analyzed and loan risks reduced. So it is very important to predict the loan type and loan amount based on the bank's data.
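
The project implements this in R; a rough Python equivalent of the PCA-plus-Naive-Bayes pipeline, run here on synthetic stand-in applicant features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for applicant features (income, debts, age, ...).
X, y = make_classification(n_samples=1000, n_features=12, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# PCA reduces correlated applicant features before the Naive Bayes fit.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), GaussianNB())
pipeline.fit(X_train, y_train)
print(f"accuracy: {pipeline.score(X_test, y_test):.2f}")
```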

mongodb_compression

Explores MongoDB compression with the WiredTiger storage engine; the storage settings are supplied through the "mongod_v3.conf" configuration file in this repository.
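
Server-wide compression is set in the mongod configuration; as a related sketch, WiredTiger block compression can also be chosen per collection from Python with pymongo (the collection name and compressor choice are illustrative, not from the project):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["metrics"]

# Override the server default block compressor for one collection.
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zlib"}},
)
```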

mostpopularwords

A common word count example that finds the distinct words and their counts in raw input data using Apache Spark.
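
A minimal PySpark version of the word count; the input path is a placeholder, and the project itself may use a different Spark API:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mostpopularwords")

# Classic word count: split lines into words, pair each with 1, sum by key.
counts = (
    sc.textFile("input.txt")  # placeholder path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)
```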

mostpopularwordsbetter

Applies a regular expression with Spark's filter function to keep only alphabetic words longer than two characters.
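
The same word-count sketch with the regex filter added (again with a placeholder input path):

```python
import re

from pyspark import SparkContext

sc = SparkContext(appName="mostpopularwordsbetter")

counts = (
    sc.textFile("input.txt")  # placeholder path
      .flatMap(lambda line: line.split())
      # Keep purely alphabetic tokens longer than two characters.
      .filter(lambda w: re.fullmatch(r"[A-Za-z]+", w) and len(w) > 2)
      .map(lambda w: (w, 1))
      .reduceByKey(lambda a, b: a + b)
)
```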

pivot_unpivot_multiple_columns_in_ms_sql

MS SQL Server, a Relational Database Management System (RDBMS), is used for storing and retrieving data. Data integrity, data consistency, and data anomalies play a primary role when storing data in a database. Data must be provided in different formats to create different visualizations for analysis. For this purpose, you need to pivot (rows to columns) and unpivot (columns to rows) your data.
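
A hedged T-SQL PIVOT example, submitted here through pyodbc; the table, columns, and connection string are hypothetical:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=sales;Trusted_Connection=yes;"
)

# Hypothetical sales table: rows of (region, quarter, amount) are pivoted
# so that each quarter becomes its own column.
rows = conn.execute("""
    SELECT region, [Q1], [Q2], [Q3], [Q4]
    FROM (SELECT region, quarter, amount FROM sales_figures) AS src
    PIVOT (SUM(amount) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS p
""").fetchall()
```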

predict_bad_loans

The greatest challenge in machine learning is employing the best models and algorithms to accurately predict the probability of loan default, so that both investors and borrowers can make the best financial decisions. H2O's AutoML, an easy-to-use interface that also serves advanced users, automates the machine learning workflow, including training a large set of models.
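
A minimal H2O AutoML sketch, assuming a CSV with a binary bad_loan target; the file and column names are placeholders:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical loan dataset with a binary "bad_loan" target column.
frame = h2o.import_file("loans.csv")
frame["bad_loan"] = frame["bad_loan"].asfactor()

# Train a set of models and compare them on the leaderboard.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="bad_loan", training_frame=frame)
print(aml.leaderboard.head())
```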

restful_api_using_loopback

Building a RESTful API for CRUD operations using LoopBack, with MongoDB as the backend store. LoopBack, an easy-to-learn open-source Node.js framework, lets you create end-to-end REST APIs with less code than Express and other frameworks. It creates your basic routes automatically when you add a model to the application.

sensor_data_quality_management

Data Quality Management (DQM) is the process of continuously analyzing, defining, monitoring, and improving the quality of data. Here, DQM is applied using Python to check data for required values, validate data types, and detect integrity violations and data anomalies.
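
A small pandas sketch of the three kinds of checks named above; the file and column names are placeholders:

```python
import pandas as pd

# Hypothetical sensor readings file.
df = pd.read_csv("sensor_readings.csv")

# Required values: flag rows with missing readings or ids.
missing = df[df["temperature"].isna() | df["sensor_id"].isna()]

# Data types: coerce timestamps and flag unparseable ones.
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
bad_timestamps = df[df["ts"].isna()]

# Integrity/anomalies: readings outside a plausible physical range.
out_of_range = df[(df["temperature"] < -40) | (df["temperature"] > 125)]

print(len(missing), len(bad_timestamps), len(out_of_range))
```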

sfo_fire_service_call_analysis_using_spark

To understand Spark performance and tune the application, we created Spark applications using the RDD, DataFrame, Spark SQL, and Dataset APIs to answer the following questions from the SFO Fire Department call service dataset (the second question is sketched in code below):

  • How many different types of calls were made to the Fire Department?
  • How many incidents of each call type were there?
  • How many years of fire service calls are in the data file?
  • How many service calls were logged in the past 7 days?
  • Which neighborhood in SF generated the most calls last year?
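
A hedged PySpark sketch of the call-type count; the CSV path and column name are placeholders for whatever the dataset actually uses:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sfo_fire_calls").getOrCreate()

# Hypothetical path and column name for the SFO fire calls dataset.
df = spark.read.csv("sfo_fire_calls.csv", header=True, inferSchema=True)

# "How many incidents of each call type were there?"
(df.groupBy("CallType")
   .agg(F.count("*").alias("incidents"))
   .orderBy(F.desc("incidents"))
   .show(10, truncate=False))
```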

spark-druid-olap

Sparkline BI Accelerator provides fast ad-hoc query capability over logical cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

streaming_analytics_using_ksql

Kafka SQL (KSQL), a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real time.
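
A hedged sketch of registering a trip stream through the KSQL server's REST endpoint; the stream definition, topic name, and server address are illustrative, not taken from the project:

```python
import json
import urllib.request

# Submit a KSQL statement to the KSQL server REST endpoint.
statement = """
    CREATE STREAM trips (trip_duration INT, start_station VARCHAR)
    WITH (KAFKA_TOPIC='citibike_trips', VALUE_FORMAT='JSON');
"""
request = urllib.request.Request(
    "http://localhost:8088/ksql",
    data=json.dumps({"ksql": statement, "streamsProperties": {}}).encode(),
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
)
print(urllib.request.urlopen(request).read().decode())
```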

talend_java_component

Dynamic Jasper is a great tool for designing and creating simple or complex dynamic reports. Talend is not only the most common tool for data transformation; it can also be used for dynamic Jasper report generation via the tJasperInput component.

text_normalization_using_spark

We show how text normalization using Spark with regular expressions can be applied to identify the most and least popular words in raw input data.
