Pandas 데이터프레임 연습 2

Pandas 데이터프레임 연습 2#

이 노트는 데이터프레임의 정리와 그룹화에 대한 예제이니 자유롭게 코드를 실행시키면서 연습해 보세요.

import pandas as pd
import numpy as np

먼저 연습을 위한 데이터프레임 df를 만들어 보자.

df = pd.DataFrame( {
    'I' : ["A","A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","C"],
    'J' : [ "math","math","math","lang","lang","lang","math","math","lang","lang","lang","math","math","math","lang","lang","lang"],
    'K' : [22, 21, 31, 24, 25, 36, 31, 23, 34, 21, 29, 34, 41, 32, 26, 23, 43],
    'L' : [79, 94, 64, 67, 75, 45, 65, 34, 54, 32,87, 65, 56, 65, 76, 65, 89]
})
df

	I	J	K	L
0	A	math	22	79
1	A	math	21	94
2	A	math	31	64
3	A	lang	24	67
4	A	lang	25	75
5	A	lang	36	45
6	B	math	31	65
7	B	math	23	34
8	B	lang	34	54
9	B	lang	21	32
10	B	lang	29	87
11	C	math	34	65
12	C	math	41	56
13	C	math	32	65
14	C	lang	26	76
15	C	lang	23	65
16	C	lang	43	89

열이름 바꾸기#

df.rename(columns= { 'I':'school', 'J': 'class', 'K' : 'age', 'L':'score'})

	school	class	age	score
0	A	math	22	79
1	A	math	21	94
2	A	math	31	64
3	A	lang	24	67
4	A	lang	25	75
5	A	lang	36	45
6	B	math	31	65
7	B	math	23	34
8	B	lang	34	54
9	B	lang	21	32
10	B	lang	29	87
11	C	math	34	65
12	C	math	41	56
13	C	math	32	65
14	C	lang	26	76
15	C	lang	23	65
16	C	lang	43	89

df.columns

Index(['I', 'J', 'K', 'L'], dtype='object')

df.rename(columns= { 'I':'school', 'J': 'class', 'K' : 'age', 'L':'score'}, inplace=True)

df.head(4)

	school	class	age	score
0	A	math	22	79
1	A	math	21	94
2	A	math	31	64
3	A	lang	24	67

그룹별 요약#

df1 = df.groupby(by = 'school')

함수 count() 는 그룹으로 지정된 school 의 각 그룹에 속한 행의 개수를 표시하는 함수

df1.count()

	class	age	score
school
A	6	6	6
B	5	5	5
C	6	6	6

df1.sum()

	age	score
school
A	159	424
B	138	272
C	199	416

df1.mean()

	age	score
school
A	26.500000	70.666667
B	27.600000	54.400000
C	33.166667	69.333333

df1.max()

	class	age	score
school
A	math	36	94
B	math	34	87
C	math	43	89

df2 = df.groupby(by = ['school', 'class'])

df2.count()

		age	score
school	class
A	lang	3	3
A	math	3	3
B	lang	3	3
B	math	2	2
C	lang	3	3
C	math	3	3

df2.sum()

		age	score
school	class
A	lang	85	187
A	math	74	237
B	lang	84	173
B	math	54	99
C	lang	92	230
C	math	107	186

df2.mean()

		age	score
school	class
A	lang	28.333333	62.333333
A	math	24.666667	79.000000
B	lang	28.000000	57.666667
B	math	27.000000	49.500000
C	lang	30.666667	76.666667
C	math	35.666667	62.000000

summary_df = df2.mean().reset_index()
summary_df

	school	class	age	score
0	A	lang	28.333333	62.333333
1	A	math	24.666667	79.000000
2	B	lang	28.000000	57.666667
3	B	math	27.000000	49.500000
4	C	lang	30.666667	76.666667
5	C	math	35.666667	62.000000

예제: 매출액 계산#

다음은 3개 전자제품 대리점 A, B, C 의 6개월간(1월-6월) 모니터와 핸드폰의 매출액(단위:백만원)이다.

데이터프레임 sales 의 각 열에 대한 설명은 다음과 같다.

company : 전자제품 대리점
month : 월
monitor: 모니터 매출액 (단위:백만원)
phone : 핸드폰 매출액 (단위:백만원)

sales = pd.DataFrame( {
    'company' : ["A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C","C"],
    'month' : [1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6],
    'monitor' : [22, 21, 31, 24, 25, 36, 31, 23, 34, 21, 29, 34, 41, 32, 26, 23, 43, 34],
    'phone' : [79, 94, 64, 67, 75, 45, 65, 34, 54, 32,87, 65, 56, 65, 76, 65, 89, 45]
})
sales

	company	month	monitor	phone
0	A	1	22	79
1	A	2	21	94
2	A	3	31	64
3	A	4	24	67
4	A	5	25	75
5	A	6	36	45
6	B	1	31	65
7	B	2	23	34
8	B	3	34	54
9	B	4	21	32
10	B	5	29	87
11	B	6	34	65
12	C	1	41	56
13	C	2	32	65
14	C	3	26	76
15	C	4	23	65
16	C	5	43	89
17	C	6	34	45

대리점별 총매출액은?#

두 개의 메소드 groupby(by = 'company') 와 sum() 을 이용하여 대리점별로 열의 합을 구해보자.

sales.groupby(by = 'company').sum()

	month	monitor	phone
company
A	21	159	424
B	21	172	337
C	21	199	396

위와 같이 전체 데이터프레임에 메소드 sum() 을 적용하면 월(month)의 값들를 더해주는 의도치 않은 결과를 얻는다.

이제 필요한 3개의 열만 선택하여 그룹을 설정하자.

sales_gr1 = sales[["company","monitor","phone"]].groupby(by = 'company')

sales_gr1

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa388053670>

다음으로 그룹화된 데이터프레임에 sum() 을 적용하여 구룹별로 열의 합을 구하자.

sales_gr1.sum()

	monitor	phone
company
A	159	424
B	172	337
C	199	396

이제 sum(axis=1) 메소드로 모니터와 핸드폰의 매출액 합을 더하면 대리점별로 총 매출액을 구할 수 있다.

sales_gr1.sum().sum(axis=1)

company
A    583
B    509
C    595
dtype: int64

하나의 표현식으로 다음과 같이 동일한 결과를 얻을 수 있다.

sales[["company","monitor","phone"]].groupby(by = 'company').sum().sum(axis=1)

company
A    583
B    509
C    595
dtype: int64

월별 총 배출액은?#

sales_gr2 = sales.groupby(by = 'month')

sales_gr2.sum().sum(axis=1)

month
  294
  269
  285
  232
  348
  259
dtype: int64

월별로 대리점들의 매출 점유율(market share)은?#

월 총 매출액에서 각 대리점의 매출액이 차이하는 비율을 구해보자.

모니터와 핸드폰 매출액을 더해서 대리점의 월별 총매츨액 total 을 계산

sales['total'] = sales['monitor'] + sales['phone']

sales

	company	month	monitor	phone	total
0	A	1	22	79	101
1	A	2	21	94	115
2	A	3	31	64	95
3	A	4	24	67	91
4	A	5	25	75	100
5	A	6	36	45	81
6	B	1	31	65	96
7	B	2	23	34	57
8	B	3	34	54	88
9	B	4	21	32	53
10	B	5	29	87	116
11	B	6	34	65	99
12	C	1	41	56	97
13	C	2	32	65	97
14	C	3	26	76	102
15	C	4	23	65	88
16	C	5	43	89	132
17	C	6	34	45	79

에를 들어 1월의 총매출액은 101+96+97 = 294 이므로 1월 각 대리점의 점유율은 다음과 같다.

A 지점 = 101/294
B 지점 = 96/294
C 지점 = 97/294

np.array([101, 96, 97]) / 294

array([0.34353741, 0.32653061, 0.32993197])

월별 총 매출액 month_total 을 가진 새로운 데이터프레임 month_sales 생성

sales_gr2 = sales[['month', 'total']].groupby(by = 'month')
sales_gr2.sum()

	total
month
1	294
2	269
3	285
4	232
5	348
6	259

month_sales = sales_gr2.sum().reset_index()

month_sales

	month	total
0	1	294
1	2	269
2	3	285
3	4	232
4	5	348
5	6	259

데이터프레임 month_sales 의 열이름 total 이 데이터프레임 sales 의 열 이름과 동이하므로 이름을 month_total 로 변경하자.

month_sales.rename(columns={'total':"month_total"}, inplace=True)
month_sales

	month	month_total
0	1	294
1	2	269
2	3	285
3	4	232
4	5	348
5	6	259

매출액 자료 sales 와 월별 총 매출액 을 가진 데이터프레임 month_sales 를 결합하여 새로운 데이터프레임 sales2 생성한다.

이 경우 결합의 식별자(key)는 월(month) 로 한다. 데이터프레임 sales 는 month 가 동일한 행이 여러 개가 있다. 따라서 결합하는 경우 기준이 되는 데이처프레임을 sales 로 지정한다 (how='left')

sales2 = pd.merge(sales, month_sales, on='month', how='left' )
sales2

	company	month	monitor	phone	total	month_total
0	A	1	22	79	101	294
1	A	2	21	94	115	269
2	A	3	31	64	95	285
3	A	4	24	67	91	232
4	A	5	25	75	100	348
5	A	6	36	45	81	259
6	B	1	31	65	96	294
7	B	2	23	34	57	269
8	B	3	34	54	88	285
9	B	4	21	32	53	232
10	B	5	29	87	116	348
11	B	6	34	65	99	259
12	C	1	41	56	97	294
13	C	2	32	65	97	269
14	C	3	26	76	102	285
15	C	4	23	65	88	232
16	C	5	43	89	132	348
17	C	6	34	45	79	259

각 대리점별 월 매출액 total 을 월별 총매출액 month_total 로 나누어 점유율 share 생성

sales2['share'] =  (sales2.total / sales2.month_total)*100

sales2

	company	month	monitor	phone	total	month_total	share
0	A	1	22	79	101	294	34.353741
1	A	2	21	94	115	269	42.750929
2	A	3	31	64	95	285	33.333333
3	A	4	24	67	91	232	39.224138
4	A	5	25	75	100	348	28.735632
5	A	6	36	45	81	259	31.274131
6	B	1	31	65	96	294	32.653061
7	B	2	23	34	57	269	21.189591
8	B	3	34	54	88	285	30.877193
9	B	4	21	32	53	232	22.844828
10	B	5	29	87	116	348	33.333333
11	B	6	34	65	99	259	38.223938
12	C	1	41	56	97	294	32.993197
13	C	2	32	65	97	269	36.059480
14	C	3	26	76	102	285	35.789474
15	C	4	23	65	88	232	37.931034
16	C	5	43	89	132	348	37.931034
17	C	6	34	45	79	259	30.501931

월별로 점유율이 큰 대리점부터 정렬

sales2.sort_values(by = ['month', 'share'], ascending=[True, False])

	company	month	monitor	phone	total	month_total	share
0	A	1	22	79	101	294	34.353741
12	C	1	41	56	97	294	32.993197
6	B	1	31	65	96	294	32.653061
1	A	2	21	94	115	269	42.750929
13	C	2	32	65	97	269	36.059480
7	B	2	23	34	57	269	21.189591
14	C	3	26	76	102	285	35.789474
2	A	3	31	64	95	285	33.333333
8	B	3	34	54	88	285	30.877193
3	A	4	24	67	91	232	39.224138
15	C	4	23	65	88	232	37.931034
9	B	4	21	32	53	232	22.844828
16	C	5	43	89	132	348	37.931034
10	B	5	29	87	116	348	33.333333
4	A	5	25	75	100	348	28.735632
11	B	6	34	65	99	259	38.223938
5	A	6	36	45	81	259	31.274131
17	C	6	34	45	79	259	30.501931

	company	month	monitor	phone
0	A	1	22	79
1	A	2	21	94
2	A	3	31	64
3	A	4	24	67
4	A	5	25	75
5	A	6	36	45
6	B	1	31	65
7	B	2	23	34
8	B	3	34	54
9	B	4	21	32
10	B	5	29	87
11	B	6	34	65
12	C	1	41	56
13	C	2	32	65
14	C	3	26	76
15	C	4	23	65
16	C	5	43	89
17	C	6	34	45

	company	month	monitor	phone	total
0	A	1	22	79	101
1	A	2	21	94	115
2	A	3	31	64	95
3	A	4	24	67	91
4	A	5	25	75	100
5	A	6	36	45	81
6	B	1	31	65	96
7	B	2	23	34	57
8	B	3	34	54	88
9	B	4	21	32	53
10	B	5	29	87	116
11	B	6	34	65	99
12	C	1	41	56	97
13	C	2	32	65	97
14	C	3	26	76	102
15	C	4	23	65	88
16	C	5	43	89	132
17	C	6	34	45	79

	company	month	monitor	phone	total	month_total
0	A	1	22	79	101	294
1	A	2	21	94	115	269
2	A	3	31	64	95	285
3	A	4	24	67	91	232
4	A	5	25	75	100	348
5	A	6	36	45	81	259
6	B	1	31	65	96	294
7	B	2	23	34	57	269
8	B	3	34	54	88	285
9	B	4	21	32	53	232
10	B	5	29	87	116	348
11	B	6	34	65	99	259
12	C	1	41	56	97	294
13	C	2	32	65	97	269
14	C	3	26	76	102	285
15	C	4	23	65	88	232
16	C	5	43	89	132	348
17	C	6	34	45	79	259

	company	month	monitor	phone
0	A	1	22	79
1	A	2	21	94
2	A	3	31	64
3	A	4	24	67
4	A	5	25	75
5	A	6	36	45
6	B	1	31	65
7	B	2	23	34
8	B	3	34	54
9	B	4	21	32
10	B	5	29	87
11	B	6	34	65
12	C	1	41	56
13	C	2	32	65
14	C	3	26	76
15	C	4	23	65
16	C	5	43	89
17	C	6	34	45

	company	month	monitor	phone	total
0	A	1	22	79	101
1	A	2	21	94	115
2	A	3	31	64	95
3	A	4	24	67	91
4	A	5	25	75	100
5	A	6	36	45	81
6	B	1	31	65	96
7	B	2	23	34	57
8	B	3	34	54	88
9	B	4	21	32	53
10	B	5	29	87	116
11	B	6	34	65	99
12	C	1	41	56	97
13	C	2	32	65	97
14	C	3	26	76	102
15	C	4	23	65	88
16	C	5	43	89	132
17	C	6	34	45	79

	company	month	monitor	phone	total	month_total
0	A	1	22	79	101	294
1	A	2	21	94	115	269
2	A	3	31	64	95	285
3	A	4	24	67	91	232
4	A	5	25	75	100	348
5	A	6	36	45	81	259
6	B	1	31	65	96	294
7	B	2	23	34	57	269
8	B	3	34	54	88	285
9	B	4	21	32	53	232
10	B	5	29	87	116	348
11	B	6	34	65	99	259
12	C	1	41	56	97	294
13	C	2	32	65	97	269
14	C	3	26	76	102	285
15	C	4	23	65	88	232
16	C	5	43	89	132	348
17	C	6	34	45	79	259