Pandas DataFrame Practice 1

This notebook contains examples of DataFrame operations and summaries. Feel free to run the code as you practice.

import pandas as pd
import numpy as np

Let's create a simple DataFrame as follows.

df = pd.DataFrame( {
    'x1' : ["A","A","B","B"],
    'x2' : [1,2,3,4],
    'x3' : [1,2,1,2],
    'x4' : [4,3,2,1] 
})

df

	x1	x2	x3	x4
0	A	1	1	4
1	A	2	2	3
2	B	3	1	2
3	B	4	2	1
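
Before doing arithmetic on columns, it can help to check how pandas typed each column. The following is a minimal sketch that rebuilds the same DataFrame so it runs on its own; the variable name numeric_cols is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["A", "A", "B", "B"],
    "x2": [1, 2, 3, 4],
    "x3": [1, 2, 1, 2],
    "x4": [4, 3, 2, 1],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column: x1 holds strings, x2..x4 hold integers

# Names of the columns that pandas treats as numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```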

Column operations

The following three expressions perform the same task: adding the values of columns x3 and x4 row by row.

df["x3"] + df["x4"]

0    5
1    5
2    3
3    3
dtype: int64

df.x3 + df.x4

0    5
1    5
2    3
3    3
dtype: int64

df[['x3','x4']].sum(axis=1)

0    5
1    5
2    3
3    3
dtype: int64
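
One way to confirm that the three expressions really agree is to compare the results with Series.equals. This is a small sketch, with a, b, c and same as illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["A", "A", "B", "B"],
    "x2": [1, 2, 3, 4],
    "x3": [1, 2, 1, 2],
    "x4": [4, 3, 2, 1],
})

a = df["x3"] + df["x4"]            # bracket access
b = df.x3 + df.x4                  # attribute access
c = df[["x3", "x4"]].sum(axis=1)   # row-wise sum over the two selected columns

# equals() compares index, values and dtype of two Series
same = a.equals(b) and a.equals(c)
print(same)
```

Attribute access such as df.x3 only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method, so bracket access is the more general form.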

Think about what results the following arithmetic expressions over several columns will produce.

df["x2"] * df["x3"] * df["x4"]

0     4
1    12
2     6
3     8
dtype: int64

df["x2"] * df["x3"] + df["x4"]

0    5
1    7
2    5
3    9
dtype: int64

df["x2"]**2 + df["x3"] * df["x4"]

0     5
1    10
2    11
3    18
dtype: int64

Column summaries

When axis=0 is passed to a DataFrame method, the method is applied to each column, over all the rows in that column.

df

	x1	x2	x3	x4
0	A	1	1	4
1	A	2	2	3
2	B	3	1	2
3	B	4	2	1

df.sum(axis=0)  # sum of all elements in each column; why is the sum of column x1 "AABB"?

x1    AABB
x2      10
x3       6
x4      10
dtype: object

"A"+"A"+"B"+"B"

'AABB'

df.mean(axis=0, numeric_only=True)  # mean of each column; the string column x1 is skipped (automatic in older pandas, explicit via numeric_only=True since pandas 2.0)

x2    2.5
x3    1.5
x4    2.5
dtype: float64

df.mean(numeric_only=True)  # if axis is not given, the default is axis=0

x2    2.5
x3    1.5
x4    2.5
dtype: float64

df.min(axis=0)  # minimum of each column; strings can be compared too ("A" < "B")

x1    A
x2    1
x3    1
x4    1
dtype: object

df.std(axis=0, numeric_only=True)  # standard deviation of each column; the string column x1 is skipped

x2    1.290994
x3    0.577350
x4    1.290994
dtype: float64

df.describe()  # by default, statistics for string columns are not reported

	x2	x3	x4
count	4.000000	4.00000	4.000000
mean	2.500000	1.50000	2.500000
std	1.290994	0.57735	1.290994
min	1.000000	1.00000	1.000000
25%	1.750000	1.00000	1.750000
50%	2.500000	1.50000	2.500000
75%	3.250000	2.00000	3.250000
max	4.000000	2.00000	4.000000
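
String columns can still be summarized, just with different statistics. As a sketch, calling describe() on the string column itself reports count, unique, top and freq, and describe(include="all") puts every column in one table, with NaN where a statistic does not apply.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["A", "A", "B", "B"],
    "x2": [1, 2, 3, 4],
    "x3": [1, 2, 1, 2],
    "x4": [4, 3, 2, 1],
})

# Summary of the string column: count, unique, top, freq
summary = df["x1"].describe()
print(summary)

# One table covering all columns, numeric and non-numeric
full = df.describe(include="all")
print(full)
```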

Selecting a single column returns a Series rather than a DataFrame, so applying a method to it yields a single value with no column name attached.

df["x2"]

0    1
1    2
2    3
3    4
Name: x2, dtype: int64

df["x2"].mean()  # the result is a single number

2.5
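
If the column name should be kept in the result, one option is to select with a list of names, which returns a one-column DataFrame instead of a Series. A minimal sketch comparing the two selections:

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["A", "A", "B", "B"],
    "x2": [1, 2, 3, 4],
    "x3": [1, 2, 1, 2],
    "x4": [4, 3, 2, 1],
})

s = df["x2"]     # single label  -> Series
d = df[["x2"]]   # list of labels -> DataFrame with one column

m_s = s.mean()   # plain number
m_d = d.mean()   # Series indexed by the column name

print(type(s).__name__, m_s)
print(type(d).__name__)
print(m_d)
```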

Row summaries

When axis=1 is passed to a DataFrame method, the method is applied to each row, over all the columns in that row.

df

	x1	x2	x3	x4
0	A	1	1	4
1	A	2	2	3
2	B	3	1	2
3	B	4	2	1

df.sum(axis=1, numeric_only=True)  # sum of each row; the string column x1 is excluded

0    6
1    7
2    6
3    7
dtype: int64

df.min(axis=1, numeric_only=True)  # the string column x1 is also excluded when computing the minimum

0    1
1    2
2    1
3    1
dtype: int64

df.mean(axis=1, numeric_only=True)  # mean of each row

0    2.000000
1    2.333333
2    2.000000
3    2.333333
dtype: float64

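Safe programming

In older pandas versions, some methods automatically dropped the columns or rows that an operation could not be applied to. Since pandas 2.0 this no longer happens by default: calling mean() on a DataFrame that contains a string column raises a TypeError unless numeric_only=True is passed.

For example, mean() below is applied with the string column x1 excluded.

df

	x1	x2	x3	x4
0	A	1	1	4
1	A	2	2	3
2	B	3	1	2
3	B	4	2	1

df.mean(axis=0, numeric_only=True)

x2    2.5
x3    1.5
x4    2.5
dtype: float64

df.mean(axis=0, numeric_only=True) gives the same result as the following expression, which first selects only the numeric columns and then computes the mean.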

df[["x2", "x3", "x4"]].mean(axis=0)

x2    2.5
x3    1.5
x4    2.5
dtype: float64

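Other methods, however, raise an error. Chaining methods one after another can also produce unexpected results.

It is therefore better to first select the columns and rows whose data types suit the intended operation, and only then apply the method.

This kind of safe programming helps prevent errors and unexpected results.

For example, suppose that for a DataFrame we want to

compute the sum of each column, and
find the minimum of those sums.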

df.sum(axis=0)

x1    AABB
x2      10
x3       6
x4      10
dtype: object

The result of df.sum(axis=0) contains both a string and numbers, so applying min() to it raises an error.

df.sum(axis=0).min()  # raises TypeError because of the string column x1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [28], in <cell line: 1>()
----> 1 df.sum(axis=0).min()

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:11468, in _make_stat_function.<locals>.stat_func(self, axis, skipna, level, numeric_only, **kwargs)
  11466 if level is not None:
  11467     return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
> 11468 return self._reduce(
  11469     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11470 )

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py:4248, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4244     raise NotImplementedError(
   4245         f"Series.{name} does not implement numeric_only."
   4246     )
   4247 with np.errstate(all="ignore"):
-> 4248     return op(delegate, skipna=skipna, **kwds)

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py:129, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    127         result = alt(values, axis=axis, skipna=skipna, **kwds)
    128 else:
--> 129     result = alt(values, axis=axis, skipna=skipna, **kwds)
    131 return result

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py:873, in _nanminmax.<locals>.reduction(values, axis, skipna, mask)
    871         result = np.nan
    872 else:
--> 873     result = getattr(values, meth)(axis)
    875 result = _wrap_results(result, dtype, fill_value)
    876 return _maybe_null_out(result, axis, mask, values.shape)

File ~/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_methods.py:44, in _amin(a, axis, out, keepdims, initial, where)
     42 def _amin(a, axis=None, out=None, keepdims=False,
     43           initial=_NoValue, where=True):
---> 44     return umr_minimum(a, axis, None, out, keepdims, initial, where)

TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'str'

The method select_dtypes(include=['number']) selects only the columns whose data type is numeric.

The type number is a supertype that covers every numeric data type, including int and float. In other words, number can be regarded as the parent of the integer type int and the floating-point type float.

Selecting only the numeric columns with select_dtypes(include=['number']) makes it safe to compute the minimum.

df.select_dtypes(include=['number'])  # select only the numeric columns

	x2	x3	x4
0	1	1	4
1	2	2	3
2	3	1	2
3	4	2	1

df.select_dtypes(include=['number']).sum(axis=0).min()  # OK!
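
Written out step by step, the safe version of the two-step task looks like the sketch below. The names col_sums, smallest and which are illustrative; idxmin() is used to report which column produced the minimum.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["A", "A", "B", "B"],
    "x2": [1, 2, 3, 4],
    "x3": [1, 2, 1, 2],
    "x4": [4, 3, 2, 1],
})

# Step 1: sum each numeric column (x1 is excluded before summing)
col_sums = df.select_dtypes(include="number").sum(axis=0)

# Step 2: take the minimum of those sums
smallest = col_sums.min()
which = col_sums.idxmin()  # label of the column with the smallest sum

print(col_sums)
print(smallest, which)  # 6 x3
```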

Using `inplace=True`

df2 = pd.DataFrame( {
    'x1' : ["A","C","D","B"],
    'x2' : [1,2,3,4]})

df2

	x1	x2
0	A	1
1	C	2
2	D	3
3	B	4

df2.sort_values(by='x1')

	x1	x2
0	A	1
3	B	4
1	C	2
2	D	3

df2  # the data in df2 is unchanged

	x1	x2
0	A	1
1	C	2
2	D	3
3	B	4

df2_copy = df2.sort_values(by='x1')

df2_copy

	x1	x2
0	A	1
3	B	4
1	C	2
2	D	3

df2.sort_values(by='x1', inplace=True)

df2

	x1	x2
0	A	1
3	B	4
1	C	2
2	D	3

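With inplace=True, the DataFrame the method is called on is modified directly. However, a call with inplace=True returns None, so its result cannot be stored in another DataFrame, as shown below.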

df3 = df2.sort_values(by='x2', inplace=True)

df3

type(df3)

NoneType

df2

	x1	x2
0	A	1
1	C	2
2	D	3
3	B	4
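
A common way to avoid this pitfall is to skip inplace=True and assign the returned DataFrame to a name. The sketch below contrasts the two styles; the variable names are illustrative, and ignore_index=True (available in sort_values since pandas 1.0) renumbers the rows 0, 1, 2, ... after sorting.

```python
import pandas as pd

original = pd.DataFrame({
    "x1": ["A", "C", "D", "B"],
    "x2": [1, 2, 3, 4],
})

# Without inplace: a new, sorted DataFrame is returned and original is untouched
sorted_df = original.sort_values(by="x1")

# Same sort, but with the row labels reset to 0..n-1
renumbered = original.sort_values(by="x1", ignore_index=True)

# With inplace=True the method returns None (a copy is sorted here to keep original intact)
returned = original.copy().sort_values(by="x1", inplace=True)

print(sorted_df)
print(renumbered)
print(returned)
```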