Data Analysis/- Case Study

[SQL] ๐Ÿ“Œ ์ฝ”ํ˜ธํŠธ ๋ถ„์„ - ๋ฆฌํ…์…˜ ์•Œ์•„๋ณด๊ธฐ | Kaggle ์ด์ปค๋จธ์Šค ๋ฐ์ดํ„ฐ ํ™œ์šฉ ์˜ˆ์ œ

์Šค์œ„๋ฏผ 2025. 2. 22. 17:23

Cohort Analysis (์ฝ”ํ˜ธํŠธ ๋ถ„์„)

 

Cohort๋Š” '๋™์งˆ์ง‘๋‹จ'์ด๋ผ๋Š” ๋œป์ด๋‹ค.

Cohort ๋ถ„์„์€ ๊ณตํ†ต๋œ ํŠน์„ฑ์„ ๊ฐ€์ง„ ์‚ฌ์šฉ์ž ์ง‘๋‹จ(์ฝ”ํ˜ธํŠธ)์˜ ํ–‰๋™ ๋ณ€ํ™”๋ฅผ ์‹œ๊ฐ„์— ๋”ฐ๋ผ ๋ถ„์„ํ•˜์—ฌ ํŠธ๋ Œ๋“œ์™€ ํŒจํ„ด์„ ํŒŒ์•…ํ•˜๋Š” ๊ธฐ๋ฒ•์ด๋‹ค.

 

ํ”Œ๋žซํผ์˜ ์‚ฌ์šฉ์ž๋“ค์ด ์„œ๋น„์Šค๋ฅผ ์–ด๋–ค ๋น„์œจ๋กœ ์žฌ๋ฐฉ๋ฌธ/์žฌ๊ตฌ๋งคํ•˜๋Š”์ง€ ํ™•์ธํ•˜๊ณ  ์‹ถ์„ ๋•Œ,

์ฆ‰ ์‚ฌ์šฉ์ž์˜ ๋ฆฌํ…์…˜์„ ํŒŒ์•…ํ•˜๊ณ ์ž ํ•  ๋•Œ ์œ ์šฉํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

๋ณดํ†ต ์›”(monthly) ๋˜๋Š” ์ฃผ(weekly) ๋‹จ์œ„๋กœ ๋ฆฌํ…์…˜์„ ํ™•์ธํ•œ๋‹ค.

 

Data ์ค€๋น„

๋ฐ์ดํ„ฐ๋Š” ์บ๊ธ€์˜ ์˜จ๋ผ์ธ ์„ธ์ผ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

 

https://www.kaggle.com/datasets/jihyeseo/online-retail-data-set-from-uci-ml-repo

 

Online Retail Data Set from UCI ML repo

transactions 2010-2011 for a UK-based and registered non-store online retail

www.kaggle.com

 

๋ฐ์ดํ„ฐ์—์„œ ์œ ์ €ID์™€ ๊ตฌ๋งค๋‚ ์งœ๋งŒ ์žˆ์œผ๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๊ณตํ•ด์„œ SQL์— import ํ•ด์ค€๋‹ค.

ํ™œ์šฉํ•œ ๋ฐ์ดํ„ฐ๋Š” ์ด๋ ‡๊ฒŒ ์ƒ๊ฒผ๋‹ค.

 

 

SQL ์ฝ”๋“œ : ๋ฐฉ๋ฒ• 1

MySQL workbench์—์„œ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ–ˆ๋‹ค.

 

-- ์œ ์ €๋ณ„ ์ฒซ ๊ตฌ๋งค์ผ ๊ตฌํ•˜๊ธฐ
WITH first_purchase AS (
    SELECT
        CustomerID AS user_id,
        MIN(InvoiceDate) AS first_purchase_date
    FROM cohort_practice.online_sale_data
    GROUP BY 1
),

-- ์ฒซ ๊ตฌ๋งค์ผ์ด ๋‹ด๊ธด ํ…Œ์ด๋ธ”๊ณผ ์ผ๋ฐ˜ ๊ตฌ๋งค์ผ์„ ์กฐ์ธ, ๋‘ ๋‚ ์งœ์˜ ์›” ์ฐจ์ด ๊ณ„์‚ฐ
cohort_group as (
SELECT 
    ch.CustomerID AS user_id,
    DATE_FORMAT(fp.first_purchase_date, '%Y-%m') AS cohort_month,
    (YEAR(ch.InvoiceDate) * 12 + MONTH(ch.InvoiceDate)) - 
    (YEAR(fp.first_purchase_date) * 12 + MONTH(fp.first_purchase_date)) AS cohort_index -- ์ฒซ ๊ตฌ๋งค์ผ๊ณผ ๊ตฌ๋งค์ผ์˜ ์›” ์ฐจ์ด ๊ณ„์‚ฐ
FROM cohort_practice.online_sale_data ch
LEFT JOIN first_purchase fp
    ON ch.CustomerID = fp.user_id )

-- ์ฒซ ๊ตฌ๋งค์ผ, ์žฌ๊ตฌ๋งค์ผ์„ ๊ธฐ์ค€์œผ๋กœ ์‚ฌ์šฉ์ž ๊ทœ๋ชจ count
select cohort_month, cohort_index, count(distinct user_id) as quantity
from cohort_group
group by cohort_month, cohort_index

 

 

1. first_purchase CTE :

์œ ์ €ID๋ณ„๋กœ ๊ทธ๋ฃนํ™”๋ฅผ ํ•˜์—ฌ ์œ ์ €๋ณ„ ์ตœ์ดˆ ๊ตฌ๋งค์ผ์„ ๊ตฌํ•ด์ค€๋‹ค.

 

2. cohort_group CTE :

์ฒซ ๊ตฌ๋งค์ผ์ด ๋‹ด๊ธด ํ…Œ์ด๋ธ”๊ณผ ์›๋ณธ ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ์ธํ•ด์ค€๋‹ค.

๊ทธ๋Ÿฌ๋ฉด ํ•œ row ๋‹น ์œ ์ € ID / ๊ทธ ์œ ์ €์˜ ์ฒซ ๊ตฌ๋งค์ผ / ์ผ๋ฐ˜ ๊ตฌ๋งค์ผ(InvoiceDate) ์ด๋ ‡๊ฒŒ ์šฐ๋ฆฌ๊ฐ€ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๊ณณ์— ๋ชจ์„ ์ˆ˜ ์žˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ์šฐ๋ฆฌ๋Š” ์ผ๋ฐ˜ ๊ตฌ๋งค์ผ(InvoieDate)๊ณผ ์ฒซ ๊ตฌ๋งค์ผ์˜ ์ฐจ์ด๋ฅผ ๊ตฌํ•ด์•ผ ํ•œ๋‹ค. 

๋”ฐ๋ผ์„œ InvoiceDate์™€ first_purchase_date์„ 1์ด ํ•œ ๋‹ฌ ๋‹จ์œ„์ธ ์ˆซ์ž๋กœ ๋ณ€ํ™˜ํ•ด์ฃผ๊ณ  ์ฐจ์ด๋ฅผ ๊ตฌํ•ด์ค€๋‹ค.

๊ทธ ์ฐจ์ด๋Š” cohort_index๋กœ ๋‚˜์ค‘์— ์ฒซ ๊ตฌ๋งค์ผ๋กœ๋ถ€ํ„ฐ 1๋‹ฌ, 2๋‹ฌ, 3๋‹ฌ ๋’ค์ธ ์‹œ๊ฐ„์˜ ํ๋ฆ„์„ ๋‚˜ํƒ€๋‚ด์ฃผ๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

 

๋‹ค๋ฅธ ๋ธ”๋กœ๊ทธ์—์„œ๋Š” DATEDIFF์™€ TIMESTAMPDIFF๋ฅผ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ ๊ทธ๋ ‡๊ฒŒ ํ•˜๋ฉด ๋‘ ๋‚ ์งœ ์‚ฌ์ด์˜ ๊ฐ„๊ฒฉ์„ ๊ธฐ์ค€์œผ๋กœ ์ˆซ์ž๋ฅผ ๋ฐ˜ํ™˜ํ•˜๊ฒŒ ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, 2024๋…„ 3์›” 24์ผ์ด ์ฒซ ๊ตฌ๋งค์ผ, 2024๋…„ 4์›” 1์ผ์ด ์žฌ๊ตฌ๋งค์ผ์ธ ๊ฒฝ์šฐ

์›ํ•˜๋Š” ๊ฐ’ : 1 ( ํ•œ๋‹ฌ ์ฐจ์ด ) ์ธ๋ฐ

DATEDIFF/TIMESTAMPDIFF๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ : 0 (๋‘ ๋‚ ์งœ ๊ฐ„๊ฒฉ์ด 11์ผ๋กœ ํ•œ๋‹ฌ ๋ฏธ๋งŒ) ์œผ๋กœ

์›ํ•˜๋Š” ๊ฐ’์„ ๊ตฌํ•  ์ˆ˜ ์—†๋‹ค.

๋”ฐ๋ผ์„œ ์œ„์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

3. ์ตœ์ข… select๋ฌธ:

๋งˆ์ง€๋ง‰์œผ๋กœ ์ฒซ ๊ตฌ๋งค์ผ (cohort_month)์™€ ์›” ์ฐจ์ด (cohort_index)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”๋ฅผ ํ•˜์—ฌ ์œ ์ €ID๋ฅผ countํ•ด์ค€๋‹ค.

๊ทธ๋Ÿฌ๋ฉด ์ฒซ ๊ตฌ๋งค์ผ์„ ๊ธฐ์ค€์œผ๋กœ n์›” ๋–จ์–ด์ง„ ๋‚ ์งœ์— ๋ช‡๋ช…์ด ๊ตฌ๋งค๋ฅผ ํ–ˆ๋Š”์ง€๋ฅผ ์•Œ ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

 

 

๊ฒฐ๊ณผ ํ…Œ์ด๋ธ”

์ฟผ๋ฆฌ ์‹คํ–‰ ๊ฒฐ๊ณผ ํ…Œ์ด๋ธ”์€ ์ด๋ ‡๊ฒŒ ์ƒ๊ฒผ๋‹ค.

์ด๊ฒƒ์„ ์šฐ๋ฆฌ๊ฐ€ ์•„๋Š” ์‚ผ๊ฐํ˜• ๋ชจ์–‘ ์ฐจํŠธ๋กœ ๋งŒ๋“ค์–ด์ฃผ๊ธฐ ์œ„ํ•ด SQL์„ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ

์—‘์…€์ด ๋” ๊ฐ„ํŽธํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์—‘์…€๋กœ ์˜ฎ๊ฒจ์ค€๋‹ค !

 

Cohort ์ฐจํŠธ : ์—‘์…€ ์‚ฌ์šฉ

์—‘์…€์˜ ํ”ผ๋ฒ— ํ…Œ์ด๋ธ”์„ ์‚ฌ์šฉํ•ด์„œ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ์‚ผ๊ฐํ˜• ๋ชจ์–‘ ์ฐจํŠธ๋กœ ๋งŒ๋“ค์–ด์ค€๋‹ค.

๊ทธ ๋‹ค์Œ์— ์กฐ๊ฑด๋ถ€ ์„œ์‹์„ ์‚ฌ์šฉํ•ด์„œ ์Šค์ผ€์ผ๋ณ„๋กœ ์ƒ‰์น ์„ ํ•ด์ค€๋‹ค.

 

๊ฒฐ๊ณผ๋ฅผ ํ•ด์„ํ•ด๋ณด์ž๋ฉด, 

2010๋…„ 12์›”์— ์ฒซ ๊ตฌ๋งค๋ฅผ ํ•œ ์œ ์ €๋Š” 948๋ช…์ด๋‹ค.

๊ทธ๋ฆฌ๊ณ  ๊ทธ ์œ ์ € ์ค‘ 38% (362๋ช…)์ด ๋‹ค์Œ ๋‹ฌ์ธ 2011๋…„ 1์›”์—๋„ ๊ตฌ๋งค ํ–‰์œ„๋ฅผ ํ–ˆ๋‹ค.

33%๊ฐ€ ๊ทธ ๋‹ค์Œ ๋‹ฌ์—๋„ ๊ตฌ๋งค๋ฅผ ํ–ˆ๊ณ  39%, 36% ์ด๋ ‡๊ฒŒ ์›” ๋‹จ์œ„์˜ ์‹œ๊ฐ„์˜ ํ๋ฆ„์— ๋”ฐ๋ผ ์œ ์ €๊ฐ€ ์žฌ๊ตฌ๋งค๋ฅผ ํ–ˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 


 

 

 

SQl ์ฝ”๋“œ : ๋ฐฉ๋ฒ• 2

-- ์ฒซ ๊ตฌ๋งค์ผ ๊ตฌํ•˜๊ธฐ
WITH first_order AS (
    SELECT 
        CustomerID as customer_id,
        MIN(InvoiceDate) AS first_order_date
    FROM cohort_practice.online_sale_data
    GROUP BY customer_id
),
-- ์ตœ์ดˆ ๊ตฌ๋งค์ผ๊ณผ ์ผ๋ฐ˜ ๊ตฌ๋งค์ผ ๊ตฌํ•˜๊ธฐ
order_month_table AS (
    SELECT 
        a.CustomerID as customer_id, 
        date_format(first_order_date, '%Y-%m-01') AS first_order_month, 
        date_format(InvoiceDate, '%Y-%m-01') AS order_month
    FROM cohort_practice.online_sale_data a
    LEFT JOIN first_order b
    ON a.CustomerID = b.customer_id
),
-- ์ฒซ ๊ตฌ๋งค์ผ์„ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”
-- month0์€ ๊ทธ๋ƒฅ distinct ์œ ์ €id count
-- month1~11์€ case when ์‚ฌ์šฉ
-- month1์€ ๊ตฌ๋งค์ผ = ์ฒซ ๊ตฌ๋งค์ผ + 1m ์ผ ๋•Œ, distinct ์œ ์ €id count
cohort_chart as (
SELECT 
    first_order_month,
    COUNT(DISTINCT customer_id) AS month0,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 1 month) THEN customer_id END) AS month1,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 2 month) THEN customer_id END) AS month2,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 3 month) THEN customer_id END) AS month3,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 4 month) THEN customer_id END) AS month4,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 5 month) THEN customer_id END) AS month5,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 6 month) THEN customer_id END) AS month6,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 7 month) THEN customer_id END) AS month7,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 8 month) THEN customer_id END) AS month8,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 9 month) THEN customer_id END) AS month9,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 10 month) THEN customer_id END) AS month10,
    COUNT(DISTINCT CASE WHEN order_month = DATE_ADD(first_order_month, interval 11 month) THEN customer_id END) AS month11
FROM order_month_table
GROUP BY first_order_month
ORDER BY first_order_month)

select
	first_order_month,
    round(month0 * 100 / month0, 2) as month0,
    round(month1 * 100 / month0, 2) as month1,
    round(month2 * 100 / month0, 2) as month2,
    round(month3 * 100 / month0, 2) as month3,
    round(month4 * 100 / month0, 2) as month4,
    round(month5 * 100 / month0, 2) as month5,
    round(month6 * 100 / month0, 2) as month6,
    round(month7 * 100 / month0, 2) as month7,
    round(month8 * 100 / month0, 2) as month8,
    round(month9 * 100 / month0, 2) as month9,
    round(month10 * 100 / month0, 2) as month10,
    round(month11 * 100 / month0, 2) as month11
from cohort_chart

 

  • step 1 : ์ฒซ๋ฒˆ์งธ ๊ตฌ๋งค์ผ ๊ตฌํ•˜๊ธฐ

์œ ์ €ID๋ฅผ ๊ธฐ์ค€์œผ๋กœ min(๊ฒฐ์ œ์ผ), ์ฆ‰ ์ตœ์†Œ ๊ฒฐ์ œ์ผ์„ ๊ตฌํ•ด์ค€๋‹ค.

  • step 2 : ์ฒซ๋ฒˆ์งธ ๊ตฌ๋งค์ผ๊ณผ ์›๋ณธ ํ…Œ์ด๋ธ” ์กฐ์ธ์‹œํ‚ค๊ธฐ

[์œ ์ € ID, ์ตœ์†Œ ๊ฒฐ์ œ์›”, ์ผ๋ฐ˜ ๊ฒฐ์ œ์›”] ๊ตฌ์กฐ์˜ ํ…Œ์ด๋ธ”์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์กฐ์ธ์„ ํ•ด์ค€๋‹ค.

  • step 3 : ์ฒซ๋ฒˆ์งธ ๊ตฌ๋งค์ผ์„ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ์›”๋ณ„ ๋ฆฌํ…์…˜ ๊ตฌํ•˜๊ธฐ

์ตœ์ดˆ ๊ฒฐ์ œ์›”์„ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”๋ฅผ ํ•˜์—ฌ ์นด์šดํŠธํ•œ๋‹ค.

์ตœ์ดˆ ๊ฒฐ์ œ์›”์˜ ์œ ์ € ์ˆ˜๋ฅผ count ํ•˜๋ฉด ๊ทธ๊ฒŒ month0์ด ๋œ๋‹ค.

๊ทธ๋ฆฌ๊ณ  ์ตœ์ดˆ ๊ฒฐ์ œ์›”๋กœ๋ถ€ํ„ฐ + 1๋‹ฌ์ธ ์ผ๋ฐ˜ ๊ฒฐ์ œ์›”์˜ ์œ ์ € ์ˆ˜๋ฅผ count ํ•˜๋ฉด month 1์ด ๋œ๋‹ค.

๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ month11๊นŒ์ง€ ๊ตฌํ•ด์ค€๋‹ค!

 

๊ฒฐ๊ณผ ํ…Œ์ด๋ธ”

 

 

SQL์—์„œ ๋ฐ”๋กœ ์ด๋ ‡๊ฒŒ ๋ฆฌํ…์…˜ ์ฐจํŠธ๋ฅผ ๊ทธ๋ ค์ค„ ์ˆ˜ ์žˆ๋‹ค.

๋ฌผ๋ก  ์‹œ๊ฐํ™” ํˆด์„ ํ™œ์šฉํ•ด์„œ ๊ทธ๋ ค์ฃผ๋Š” ๊ฒŒ ํ›จ์”ฌ ๊ฐ€๋…์„ฑ๋„ ์žˆ๊ณ  ๊ด€๋ฆฌํ•˜๊ธฐ ์‰ฌ์šธ ๊ฒƒ์ด๋‹ค.