Data Mining. Brad Morantz PhD.

Similar documents
Soil Classification and Fertilizer Recommendation using WEKA

Coupling Multiple Hypothesis Testing with Proportion Estimation in Heterogeneous Categorical Sensor Networks

B&M European Value Retail S.A. : Company Profile and SWOT Analysis

The Millionaire Barber Stylist

The system that digitally organizes your laundry.

3598 OLD JENNINGS RD 3598 Old Jennings Rd, Middleburg, FL 32068

Demographic and Income Profile

Applied Data Science: Using Machine Learning for Alarm Verification

EL Civics COAAP 43/Environment Level: Beginning Low-Beginning High Task #1: Identify & Sort Recyclables

1. EXPLORE A CAREER PLAN IN DESIGN AND MERCHANDISING

Community Recycling Centre

AFMG. Design for Speech Intelligibility Using Software Modeling. Stefan Feistel, Bruce C. Olson AHNERT FEISTEL MEDIA GROUP

PADS ESTIMATION RESULTS FOR LATVIA. A.Auziņa Riga Technical University Latvia

Taking Out the Trash

Research on Decision Tree Application in Data of Fire Alarm Receipt and Disposal

EAGLE DISPOSAL. Look inside for tips on getting started and lists of acceptable and non-acceptable items. Creating a green community together.

Intelligent Keys. A smart solution for recurring revenue

PENN WASTE, INC. For Your H ome. We re very excited to be partnering with you to help create a green community together!

Supporting Information

Oracle Retail Merchandising System Release Notes Release 12.0 May 2006

English as a Second Language Podcast ESL Podcast 168 The Home Improvement Store

PENN WASTE, INC. It s Easy!

PENN WASTE, INC. It s Easy!

Applications of Thermodynamics: Heat Pumps and Refrigerators

Lowe's Companies Inc (LOW) - Financial and Strategic SWOT Analysis Review

INSTANT MEETING. Earth Day: Sparks Sunday April 22, 2018

SUMMARY CENSUS Population 135, , ,685. Households 44,675 52,029 57,758. Families 36,035 41,582 45,978

ASSIGNMENT OBJECTIVES

EnergyMaster. ENEC Monitoring System. An on-site supervision and remote monitoring solution

When you re an authority, act like it. A brand positioning solution for Tyco Fire Suppression & Building Products.

Lower LSM s: facilitating survival priorities

Home Landscaping: Northeast Region: Including Southeast Canada (Home Landscaping) By Roger Holmes Mr.;Greg Grant Mr.

Scholars Research Library. The Role of Plant Clinic in Protecting Vertical Urban Green Spaces in Tehran

Smoke Alarm Response Time:

PENN WASTE, INC. It s Easy!

NEW CD WARP CONTROL SYSTEM FOR THE CORRUGATING INDUSTRY

Lowe s discount gift cards

Arnotts drives business expansion with people and process initiatives

WHITE PAPER FEBRUARY Oracle Retail: Optimize Performance and Reduce Business Risk

Professional Flooring Contractor Shares Secrets To A Successful Garage Floor Coating Project

PENN WASTE, INC. It s Easy!

A: The last radiators might not be able to heat the rooms they serve on the coldest days of the year. Your system would be out of balance.

EE04 804(B) Soft Computing Ver. 1.2 Class 1. Introduction February 21st,2012. Sasidharan Sreedharan

An Alarm Correlation Algorithm for Network Management Based on Root Cause Analysis

LESSON 8: Recycling OVERVIEW

Data-Driven Energy Efficiency in Buildings

Reading (E3) ESOL Skills for Life. Reading E3 (Practice Set 1) Practice Set 1 THE TIME ALLOWED FOR THIS TEST IS 1 HOUR 4/4/4 W37564A

The frequency of maintenance and service correlates with the frequency the pressure washer is used.

Marketing to Builders and Homeowners

CCTV Camera - Global Market Outlook ( )

PENN WASTE, INC. It s Easy!

Creating an awesome work space is just good business.

PRINCIPLES OF OPERATIONS MANAGEMENT (9TH EDITION) BY JAY HEIZER, BARRY RENDER

KARNS, TENNESSEE 5.7 Acres Available Located on Oak Ridge Highway Excellent Development Opportunity!

TrueAllele Technology Computer interpretation of DNA evidence

Recycle! With the Quattro Select System

UNDERSTANDING CUSTOMER NEEDS

TAI-PAN BY JAMES CLAVELL DOWNLOAD EBOOK : TAI-PAN BY JAMES CLAVELL PDF

General Licensing Information. Checklist

click & collect capability - delivered to your business when you need it

LOWE S SECURITY ANALYSIS TERRY ASANTE

Intrusion Detection System: Facts, Challenges and Futures. By Gina Tjhai 13 th March 2007 Network Research Group

Introduction. 444 Creamery Way #500, Exton, PA Tel:

Forecasting the Number of Fire Accidents in the Philippines through Multiple Linear Regression

Why recycle? We can recycle more. Recycling saves energy. Recycling benefits the economy. Recycling protects the environment

UNIT 6 Garden Friends and Pests

Q.: I liked our single stream recycling. It was easier and more convenient. This is a step backwards. Why?

Burpee Home Gardens. Certified Grower Guidelines

Summary of Action Strategies

ChurchSafety InfoSheet: Fire Risk Assessment

WIN-PAK SE with VISTA Integration INTEGRATED SECURITY, VIDEO AND ACCESS CONTROL SOLUTIONS. A Winning Combination

2011 Dumpster Dive totals

Adopt a Garden Scheme - Review & Report - March 2009 A Footprint Trust project based on the Isle of Wight

Retail Leakage and Surplus Analysis

Cold Storage Management Standard Operation Procedure

Intrusion Detection: Eliminating False Alarms

OFFICIAL PUBLICATION OF THE N. C. COMMERCIAL FLOWER GROWERS' ASSOCIATION. A Packaged Deal for a New Specialty Robert Milks

The NYS Electronic Equipment Recycling & Reuse Act

Are quiet products an advantage for manufacturers? Industry Experience in the United States

Spaces + Places: Everyday Landmarks

Artificial Intelligence Enables a Network Revolution

Composting: the rotten truth

MULTISITE. Multisite Activation. Microsoft Dynamics AX White Paper

Paper Reference. Paper Reference(s) 4340/03 London Examinations IGCSE. Monday 12 November 2007 Afternoon Time: 1 hour, plus reading time of 10 minutes

Business Studies BUSS1 (JAN09BUSS101) General Certificate of Education Advanced Subsidiary Examination January Planning and Financing a Business

To appear in the IEEE/IFIP 1996 Network Operations and Management. Symposium (NOMS'96), Kyoto, Japan, April TASA: Telecommunication

Color, Design and Market Trends for 2014 Residential Outdoor Products

1.1. SYSTEM MODELING

FIMD: Fine-grained Device-free Motion Detection

THE BRAND OF QUALITY. Averna Homes

RLDS - Remote LEAK DETECTION SYSTEM

CREATIVE INNOVATIONS IN HOME IMPROVEMENT & DESIGN

presentations the week of the final exams. Anyone who misses the appointment will receive a grade of zero.

Drying Simulation the most effective way to save energy

Crawl Space 101. Any mold or mildew that may be living in your crawl space is being circulated through your entire home.

Eating your Energy s Worth (Exploring energy consumption through food)

DEFECTOVISION IR. Non-Destructive Testing of Steel Billets and Tubes

Due: in lecture Friday, September 23 rd Attach your Baseline data

Food and Nutrition in Space

NET LEASE INVESTMENT OFFERING. Representative Photo DOLLAR GENERAL. Highway 34 & 435th Avenue Howard, SD 57349

Transcription:

Data Mining Brad Morantz PhD bradscientist@machine-cognition.com Copyright Brad Morantz 2009

What is a Mountain? A big pile of dirt! Brad Morantz 2

Data? Some people would say that data on them is... Dirt A big pile of data is a... Mountain of data or a data warehouse Brad Morantz 3

Mountain of data Chen has $10 Bob has blue Ford Mary eats rice krisp Brad Morantz 4

Notice in Last Example.. Different people Different facts Unconnected? Maybe: Bob bought the Ford from Chen (for $10?) Mary bought the Rice Krisp from Chen Maybe Bob delivers the Rice Krisp in his Ford Maybe something else Maybe nothing Brad Morantz 5

Data Mining Is the nontrivial extraction of implicit previously unknown and potentially useful information and data. (James Buckley) Machine Learning employs search heuristics to uncover interesting and systematic relationships in data. (Dhar & Stein) Brad Morantz 6

Data Mining Going into a data file and looking for patterns and relationships Use of historical data to find patterns and improve future decisions Also called KDD Knowledge discovery from databases Brad Morantz 7

Uses Merchandising Association rules Customer profiling Problem solving & risk profiling Medical Financial Pattern Recognition IRS NSA Many more Brad Morantz 8

We want information that is: Original Nontrivial Fundamental Simple Useful Currently needs to be numerical or categorical Brad Morantz 9

Theoretical Definition of Data Mining Go into data No preconceived ideas See what is found Brad Morantz 10

Analogy Somebody gives you a box of stuff You have no idea what is in it You open it up You look through the contents You make note of all that you have found You may sort the stuff into piles based upon some similarities Brad Morantz 11

Association Rule Example Check out at store Each check out is a record Called a basket Look at what was purchased Look for relationships or associated items milk and cereal, beer and pizza Brad Morantz 12

Grocery Example Invoice Milk Cereal Pizza Beer Eggs 1 X X X X 2 X X X 3 4 5 6 X X X X X X X X X X Brad Morantz 13

Support Milk AND Cereal Percent of baskets where rule is true P(milk AND cereal) = 3/6 = 0.500 Confidence Percent of baskets that have cereal given milk is in the basket P(cereal milk) = ¾ = 0.75 Brad Morantz 14

Benefits to Retailer Store layout & increased sales Puts items that sell together close to each other Increase impulse purchases Help forecast inventory levels Learn customer preferences Increase profitability Brad Morantz 15

Go into data Common Meaning of Data Mining Look for patterns Look for answers Look for similarities Look for rules Look for relationships Brad Morantz 16

More Analogy Someone gives you the box full of stuff You are looking for hit singles by Mick Jagger (very specific) Along the way, you might discover that they are round, plastic, about 7 diameter, with a 1 hole in the middle You might discover that they are on London label You might discard all other things in the box Brad Morantz 17

Example High risk pregnancy Tons of data What are common factors? Myocardial Infarction (MCI) Tons of data What factors common in non-occurrence? Loan application Many other situations Brad Morantz 18

Simple Use of Common Factors Nurse/banker/etc can ask a few questions The key factors Identified by data mining Now has indication of risk or outcome Brad Morantz 19

Interesting Finding Data mining stolen credit cards Common pattern Pay at pump authorization To determine if card is good Fast indication of stolen card Brad Morantz 20

What is in the Mountain of data? Better yet, ask What is NOT in the mountain of data? Brad Morantz 21

US Navy Original Boilers on boats exploding Brad Morantz 22

Think about a trip to the store They scan your purchases You give them credit card or check They have your name Address Telephone number Maybe drivers license Maybe SSN List of items that you bought, including size Brad Morantz 23

Grocery Store Single male Pizza Beer Potato chips Mother with kids Milk Bread Eggs Cereal If diapers, then infants Brad Morantz 24

Think About.. All of the baskets about you out there Medical Information Bureau (MIB) Charge Cards Bank information Public records Store purchases Travel records Much more The combination of all this creates a pattern that describes you exactly Brad Morantz 25

Applications Marketing Target marketing Store layout Better understanding Medical Criminal Financial Science Brad Morantz 26

Methods Rule Induction Similarity Engine OLS & logistic regression Data visualization Bayesian Classification & networks Clustering Specialized programs Clementine Enterprise Miner Brad Morantz 27

Plot Plot data and put in breaks to split it Months on the job Bad credit risks Good credit risks Debt ratio Brad Morantz 28

Clustering This can be in N-space or hyperspace Each dimension is a variable Distance can be mahalanobus, Euclidian, or other metric Each Cluster Brad Morantz 29

Trees Poor No purchase Bald Wealthy Brand A Over 35 Hair Poor Wealthy Brand V Under 35 Long hair Brand C Brand G Short hair SUV Brand P Non SUV Brand L Brad Morantz 30

Recursive Partitioning Algorithms Recursively make choice at node purer Adding another node = another variable Add until desired accuracy is met CART Classification And Regression Trees Reducing variance at node makes cluster tighter ID3/C4.5 Bayes Classification Reduce entropy/disorganization Brad Morantz 31

Statistical Processes Plot it and look at plot Step wise regression Significant variables T value Eigenvalue Pearson correlation coefficient Cluster Analysis Brad Morantz 32

Bayes Classification Conditional Probability P(A B) = P(A B) / P(B) Independence If P(A B) = P(A), then A independent B If not independent, then B contributes Compare P(A B) to P(A C) to. Can rank variables to see how much knowledge each adds Brad Morantz 33

Required steps Build & maintain a database Data formatting Data cleansing & anomaly detection Data visualization Currently need human expert to evaluate Brad Morantz 34

Data Preparation 75% of effort and time Can mean difference between success and failure Can greatly affect accuracy Brad Morantz 35

Data Preparation Example Mathematical relationships Weight of animal in shell > animal only Ratios More than possible quantity in container Impossible answers Out of range answers Data entry errors Etc Brad Morantz 36

Why now? Rapidly decreasing cost of data storage Increased ease of collecting data Networks Internet Intranet Store scanners Computerized lifestyle Heavy use of credit/debit cards Store customer value cards More & cheaper computational power Brad Morantz 37

Future Incorporation of background & associated knowledge More exacting data sources Hybrid systems Mixed media use Brad Morantz 38

References Leo Breiman Tom M Mitchell & Bruce Buchanan Toshinori Munakata Quinlan Clark & Niblett www.kdnuggets.com www.machine-cognition.com Brad Morantz 39