khsibr / enron-email
Enron Dataset ETL
######################################################
# SETUP PROJECT
######################################################
Install Gradle 3+, then build:
gradle clean install test shadowJar

######################################################
# SOLUTION ARCH
######################################################
Architecture:
- Implemented with Spark 2.1 running on EMR, with data stored in S3.
- Zip files are uncompressed and the EML files are processed to store a cleansed view in Parquet format.

3 jobs are used:
1- preprocess_job: creates a clean Parquet storage
2- average_job: using the Parquet storage, computes the average body length
3- topRecipients_job: using the Parquet storage, computes the top 100 recipients

######################################################
# DEPLOYMENT SOLUTION
######################################################

# CREATE S3 BUCKET USING THE SNAPSHOT
######################################################
local# aws ec2 create-volume --snapshot snap-d203feb5 --availability-zone us-east-1a --region us-east-1
local# aws ec2 attach-volume --volume-id vol-05d52ef0e53a99ea7 --instance-id i-0b1a4b402a80996b2 --device /dev/sdf
local# aws s3api create-bucket --bucket enronEmails

ssh -i Dev/tools/aws/pi-ec2-us-east-1.pem [email protected]

ec2# sudo mkdir /mnt/enronEmails
ec2# sudo mount /dev/xvdf /mnt/enronEmails/
ec2# sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
ec2# sudo yum -y install python-pip
ec2# aws s3 cp --recursive /mnt/enronEmails/edrm-enron-v2/ s3://enronEmails

# SUBMIT JOB TO EMR
######################################################
- create a cluster with 3 nodes
- add a Spark step for each job
- upload build/libs/enron-email-1.0.1-all.jar to S3

Arguments:
spark-submit --deploy-mode cluster --executor-memory 10g \
  --class etl.ETLApp s3://enron-emails-jar/enron-email-1.0.1-all.jar \
  -c /topRecipients_job.properties

######################################################
# RESULTS
######################################################
+------------------+
|avg_body_length   |
+------------------+
|160.04765675556902|
+------------------+

+-----------------------------------------------+----------+
|recipient                                      |totalScore|
+-----------------------------------------------+----------+
|Jeff Dasovich                                  |15857.0   |
|Tana Jones                                     |14865.5   |
|Mark Taylor                                    |14306.5   |
|Richard Shapiro                                |12470.5   |
|Pete Davis <[email protected]>                 |12451.5   |
|Sara Shackleton                                |11985.0   |
|James D Steffes                                |11801.0   |
|Susan J Mara                                   |9264.0    |
|Sally Beck                                     |8960.5    |
|Paul Kaufman                                   |8556.0    |
|[email protected]                              |8001.0    |
|Daren J Farmer                                 |7636.5    |
|Sandra McCubbin                                |6996.5    |
|Tim Belden                                     |6453.0    |
|William S Bradford                             |6432.0    |
|Gerald Nemec                                   |6085.5    |
|Carol St Clair                                 |5787.5    |
|Harry Kingerski                                |5697.5    |
|Steven J Kean                                  |5664.0    |
|Jeffrey T Hodge                                |5601.0    |
|Susan Bailey                                   |5591.0    |
|Elizabeth Sager                                |5542.0    |
|John J Lavorato                                |5449.0    |
|[email protected]                              |5343.0    |
|[email protected]                              |5334.0    |
|Karen Denne                                    |5252.0    |
|Kay Mann                                       |5247.0    |
|Richard B Sanders                              |5068.5    |
|Joe Hartsoe                                    |4946.5    |
|Mark E Haedicke                                |4938.5    |
|Alan Comnes                                    |4663.5    |
|Kate Symes                                     |4562.0    |
|Brent Hendry                                   |4550.5    |
|Mark Guzman <[email protected]>                |4497.5    |
|Greg Whalley                                   |4375.0    |
|Tom Moran                                      |4228.5    |
|Ryan Slinger <[email protected]>               |4204.0    |
|Bert Meyers <[email protected]>                |4186.0    |
|Stacy E Dickson                                |4174.0    |
|Bill Williams III <[email protected]>          |4128.0    |
|Geir Solberg <[email protected]>               |4115.0    |
|Mary Cook                                      |4114.0    |
|Karen Lambert                                  |4065.5    |
|Sarah Novosel                                  |4026.5    |
|Leslie Hansen                                  |3973.0    |
|Smith                                          |3930.5    |
|Chris H Foster                                 |3919.0    |
|Beck                                           |3909.5    |
|All Enron Worldwide                            |3888.0    |
|Mona L Petrochko                               |3865.5    |
|Debbie R Brackett                              |3798.0    |
|Alan Aronowitz                                 |3787.0    |
|Mary Hain                                      |3775.5    |
|Craig Dean <[email protected]>                 |3724.5    |
|Frank L Davis                                  |3716.0    |
|Louise Kitchen                                 |3662.0    |
|Stephanie Panus                                |3648.0    |
|Jeffrey A Shankman                             |3631.0    |
|Samantha Boyd                                  |3579.0    |
|Outlook Migration Team                         |3536.0    |
|Kitchen Louise <[email protected]>             |3533.0    |
|Brant Reves                                    |3526.5    |
|[email protected]                              |3518.0    |
|Sally </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbeck>  |3478.0    |
|Dan J Hyvl                                     |3414.0    |
|Jeffery Fawcett                                |3413.0    |
|Russell Diamond                                |3371.5    |
|Suzanne Adams                                  |3360.0    |
|Kevin Hyatt                                    |3288.5    |
|Linda Robertson                                |3280.5    |
|Christian Yoder                                |3272.5    |
|Genia FitzGerald                               |3250.0    |
|Tanya Rohauer                                  |3211.5    |
|David W Delainey                               |3189.0    |
|Steven Harris                                  |3167.5    |
|Phillip K Allen                                |3146.0    |
|Sheri Thomas                                   |3121.0    |
|Tracy Ngo                                      |3117.5    |
|Janel Guerrero                                 |3111.5    |
|Leslie Reeves                                  |3104.0    |
|Edward Sacks                                   |3066.0    |
|Bryan Hull                                     |3050.0    |
|Samuel Schott                                  |3021.0    |
|[email protected]                              |3006.0    |
|Shari Stack                                    |2994.5    |
|Leaf Harasin <[email protected]>               |2984.0    |
|Mark Palmer                                    |2970.5    |
|Christopher F Calger                           |2964.5    |
|Lisa Lees                                      |2944.5    |
|Bob Bowen                                      |2924.5    |
|Robert Badeer                                  |2924.0    |
|[email protected]                              |2923.5    |
|Ginger Dernehl                                 |2912.5    |
|Monika Causholli <[email protected]>           |2871.0    |
|EX                                             |2854.0    |
|Harry M Collins                                |2853.0    |
|Williams                                       |2828.5    |
|Mark                                           |2824.5    |
|Stephanie Sever                                |2813.0    |
|Benjamin Rogers                                |2795.0    |
+-----------------------------------------------+----------+
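
The two aggregations reported above (average body length and per-recipient totalScore) can be illustrated with a small, Spark-free sketch. This is not the repository's code. The half-point values in totalScore suggest that CC recipients are weighted lower than direct recipients, so the sketch assumes 1.0 per "To" and 0.5 per "CC"; it also assumes body length is counted in words. The function names, the weight constants, the dict keys, and the sample emails are all hypothetical.

```python
from collections import defaultdict

# Assumed weights: the README does not state them. The .5 values in the
# totalScore column merely suggest that CC counts for half a direct recipient.
TO_WEIGHT = 1.0
CC_WEIGHT = 0.5


def average_body_length(emails):
    """Mean body length in whitespace-separated words (the unit is an assumption)."""
    if not emails:
        return 0.0
    return sum(len(e["body"].split()) for e in emails) / len(emails)


def top_recipients(emails, n=100):
    """Return the n highest-scoring recipients as (recipient, totalScore) pairs."""
    scores = defaultdict(float)
    for e in emails:
        for name in e.get("to", []):
            scores[name] += TO_WEIGHT
        for name in e.get("cc", []):
            scores[name] += CC_WEIGHT
    # Sort by score descending, then by name so ties have a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Made-up sample records standing in for rows of the cleansed Parquet view.
    sample = [
        {"body": "please review the attached draft", "to": ["alice"], "cc": ["bob"]},
        {"body": "thanks", "to": ["bob"], "cc": []},
    ]
    print(average_body_length(sample))
    print(top_recipients(sample, n=2))
```

In a Spark job the same logic would typically be expressed as a group-by and sum over an exploded recipient column, but the per-record scoring rule would be the same under these assumptions.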