Python Pandas DataFrame

<Python>/[DataFrame] 2021. 12. 3. 16:39


1. Checking for missing values ----------------------------------------

2. .isnull().sum() # number of missing values per column; the result is a Series

3. df[df['x1'].isnull()] # rows where x1 is missing

Practice: https://9566.tistory.com/41
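A minimal sketch of the two checks above on a toy DataFrame. The data and the column names `x1` and `x2` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; x1 and x2 are placeholder column names.
df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, np.nan],
    "x2": ["a", "b", None, "d"],
})

# Number of missing values per column (a Series indexed by column name).
null_counts = df.isnull().sum()

# Only the rows where x1 is missing.
missing_x1 = df[df["x1"].isnull()]

print(null_counts)
print(missing_x1)
```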

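4. Removing data ----------------------------------------

5. .drop(series.index.tolist(), axis=1) # drop columns (variables); drop takes a single label or a list of labels

6. .drop(columns=['x1', 'x2'], inplace=True) # drop columns; columns= already implies axis=1, and inplace=True modifies df directly

7. .drop(columns=['x1', 'x2']) # same, but returns a new DataFrame and leaves df unchanged

Practice: https://9566.tistory.com/44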
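One way the `series.index.tolist()` pattern might be used: take the Series of null counts, keep the entries above zero, and pass their labels to `drop`. This is a sketch with invented data, not necessarily how the linked practice post does it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x1 and x3 contain missing values, x2 does not.
df = pd.DataFrame({
    "x1": [1.0, np.nan],
    "x2": [1, 2],
    "x3": [np.nan, np.nan],
})

null_counts = df.isnull().sum()

# Labels of the columns that contain at least one missing value.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

# drop returns a new DataFrame; df itself keeps all three columns.
trimmed = df.drop(columns=cols_with_nulls)

print(cols_with_nulls)
print(trimmed)
```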

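8. Dropping rows with missing values

9. .dropna(axis=1) # drop columns (variables) that contain NaN

10. .dropna() # drop rows that contain NaN

11. .dropna(subset=['x1'], inplace=True) # drop rows where x1 is missing

12. df = df[~df['x1'].isnull()] # the same row filter written as a boolean mask

Practice: https://9566.tistory.com/42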
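The `dropna` variants above differ in what they remove. A small sketch on invented data that shows each result side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in x1 and x2.
df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0],
    "x2": [np.nan, 5.0, 6.0],
    "x3": [7, 8, 9],
})

rows_any = df.dropna()               # drops every row that has any NaN
cols_any = df.dropna(axis=1)         # drops every column that has any NaN
rows_x1 = df.dropna(subset=["x1"])   # drops rows only where x1 is NaN
mask_x1 = df[~df["x1"].isnull()]     # boolean-mask equivalent of the line above

print(rows_any)
print(cols_any)
print(rows_x1)
```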

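13. Dropping specific rows

14. df = df.drop(10, axis=0) # drop the row whose index label is 10 (a label, not the 10th position)

15. df.query("x1 != ''") # drop rows where x1 is an empty string

16. df[~(df['x1'] == 'NEAR BAY')] # drop rows that match a given value

Practice: https://9566.tistory.com/43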
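`drop` works on index labels, which is easy to confuse with row positions. The sketch below uses an invented DataFrame whose index starts at 10 to make the difference visible, and also runs the `query` and mask filters.

```python
import pandas as pd

# Hypothetical toy data; the index deliberately starts at 10, not 0.
df = pd.DataFrame(
    {"x1": ["NEAR BAY", "", "INLAND", "NEAR BAY"]},
    index=[10, 11, 12, 13],
)

by_label = df.drop(10, axis=0)        # removes the row labelled 10 (the first row here)
by_position = df.drop(df.index[2])    # removes the third row by looking up its label
non_empty = df.query("x1 != ''")      # keeps rows where x1 is not an empty string
not_near_bay = df[~(df["x1"] == "NEAR BAY")]

print(by_label)
print(by_position)
print(non_empty)
print(not_near_bay)
```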

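17. Replacing missing values ----------------------------------------

18. df.iloc[3, 3] = np.nan # set the cell at position (3, 3) to NaN; iloc is integer location and 0-based, so this is the 4th row and 4th column

19. df.x1 = df.x1.fillna(df.x1.mean()) # replace NaN with the column mean

20. df.fillna({'x1': df['x1'].mode()[0], 'x2': int(df['x2'].mean())}, inplace=True) # per-column fill: mode for x1, truncated mean for x2

21. df['x1'] = df['x1'].fillna(3) # replace missing values with 3

22. df = df.replace(np.nan, 0) # use np.nan; the np.NaN alias was removed in NumPy 2.0

23. df.replace(['good', 'bad'], [0, 1]) # replace 'good' with 0 and 'bad' with 1; returns a new DataFrame

Practice: https://9566.tistory.com/40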
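A short sketch of the dictionary form of `fillna`, which fills each column with its own value. The data is invented: `x1` is categorical (filled with its mode) and `x2` is numeric (filled with its mean truncated to an integer).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "x1": ["a", "b", None, "a"],
    "x2": [1.0, np.nan, 3.0, 5.0],
})

fill_values = {
    "x1": df["x1"].mode()[0],      # most frequent value; mode() skips NaN
    "x2": int(df["x2"].mean()),    # mean of the non-missing values, truncated
}

# Without inplace=True, fillna returns a new DataFrame.
filled = df.fillna(fill_values)

print(fill_values)
print(filled)
```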

24. Index ----------------------------------------

25. df = df.reset_index(drop=True) # reset the index to 0..n-1 and discard the old one

26. df.index # the index object (immutable); a RangeIndex by default

Practice: https://9566.tistory.com/45
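Filtering rows leaves gaps in the index, which is the usual reason to call `reset_index(drop=True)`. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"x1": [10, 20, 30, 40]})

# After filtering, the surviving rows keep their original labels.
filtered = df[df["x1"] > 15]

# drop=True discards the old labels instead of adding them as a column.
renumbered = filtered.reset_index(drop=True)

print(filtered.index.tolist())
print(renumbered.index)
```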

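27. Counts ----------------------------------------

28. len(df) # number of rows; also works on a column: len(df['x1'])

29. df.shape # (rows, columns)

30. df.shape[0] # number of rows

31. df.shape[1] # number of columns

32. df['x1'].shape # (rows,)

33. df.count() # number of non-null values per column; for one column: df['x1'].count()

Practice: https://9566.tistory.com/46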
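`len` and `shape` count every row, while `count` skips missing values. A small sketch on invented data that makes the difference explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; x1 has one missing value.
df = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": ["a", "b", "c"]})

n_rows = len(df)                  # counts rows, including those with NaN
shape = df.shape                  # (rows, columns)
column_shape = df["x1"].shape     # one-element tuple for a Series
non_null_x1 = df["x1"].count()    # excludes NaN

print(n_rows, shape, column_shape, non_null_x1)
```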

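34. Information ----------------------------------------

35. .info() # number of non-null values per column

# output: column name / non-null count / dtype

36. .describe() # to use a quartile: df['x1'].describe()['75%']

# output: count, mean, std, min, 25%, 50%, 75%, max for numeric columns

37. .isnull().sum() # missing-value counts

38. .isna().sum() # same result; isnull and isna are aliases

39. .dtypes # data types

40. .columns # column names; the result is an Index (an attribute, so no parentheses)

Practice: https://9566.tistory.com/47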
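The result of `describe()` on a column is a Series, so individual statistics can be read by label. A sketch with made-up values, which also shows that `columns` is accessed without parentheses:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = df["x1"].describe()

# Third quartile, read from the summary by its label.
q3 = summary["75%"]

# columns is an attribute that holds an Index of column names.
column_names = df.columns

print(summary)
print(q3, list(column_names))
```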

41. Conversion ----------------------------------------

42. df['x1'] # Series; same as df.x1

43. df[['x1']] # DataFrame

44. df[['x1', 'x2']] # select columns; DataFrame

45. df.iloc[:, [0, 2]] # select the 1st and 3rd columns (x1, x3) by position; DataFrame

46. np.array(df['x1']).reshape(-1, 1) # Series → 2-D ndarray of shape (rows, 1); without reshape the shape is (rows,)

47. df['x1'].unique() # Series → ndarray of distinct values

48. df['x1'].to_dict() # Series → dictionary keyed by index label

49. sorted(set(df['x1'])) # Series → sorted list of distinct values

50. pd.to_datetime(df['x1'], format='%Y-%m-%d') # convert to datetime

Practice: https://9566.tistory.com/48
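A brief sketch of the conversions above, checking the resulting shapes and types. The date strings are invented sample values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: dates stored as strings.
df = pd.DataFrame({"x1": ["2021-12-03", "2021-12-18", "2021-12-03"]})

flat = df["x1"].to_numpy()                          # 1-D array
column_vector = np.array(df["x1"]).reshape(-1, 1)   # 2-D array with one column
unique_sorted = sorted(set(df["x1"]))               # sorted list of distinct values
dates = pd.to_datetime(df["x1"], format="%Y-%m-%d") # datetime64 Series

print(flat.shape, column_vector.shape)
print(unique_sorted)
print(dates.dt.day.tolist())
```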

51. Grouping ----------------------------------------

52. .groupby('x1')['y'].mean() # mean of y for each distinct value of x1; selecting y first avoids averaging non-numeric columns

53. .groupby('x1')['x2'].agg(**{'mean_x2': 'mean'}) # named aggregation; same as .agg(mean_x2='mean')

54. .groupby('x1').agg(**{'mean_x2': pd.NamedAgg(column='x2', aggfunc='mean')})

55. df.groupby('latitude').agg(**{'mean_longitude': pd.NamedAgg(column='longitude', aggfunc='mean'), 'mean_total_rooms': pd.NamedAgg(column='total_rooms', aggfunc='mean')})

Practice: https://9566.tistory.com/49
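Named aggregation lets each output column get its own name, source column, and function. A minimal sketch on invented data; the output names `mean_x2` and `max_y` are arbitrary choices.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "x1": ["a", "a", "b"],
    "x2": [1.0, 3.0, 5.0],
    "y": [10.0, 20.0, 30.0],
})

# Mean of y per group; the result is a Series indexed by x1.
mean_y = df.groupby("x1")["y"].mean()

# One output column per keyword: name=NamedAgg(source column, function).
summary = df.groupby("x1").agg(
    mean_x2=pd.NamedAgg(column="x2", aggfunc="mean"),
    max_y=pd.NamedAgg(column="y", aggfunc="max"),
)

print(mean_y)
print(summary)
```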

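56. Sorting ----------------------------------------

57. .sort_values('mean_price') # sort by mean_price, ascending by default

58. .sort_values(by=['y'])['y'] # sort by y and keep only y; the result is a Series

59. .sort_values(by='x1', ascending=False).head(10) # top 10 rows by x1

Practice: https://9566.tistory.com/50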
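A short sketch of the sorting calls above on made-up data. The original index labels travel with the rows, which is visible in the output.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"x1": [3, 1, 2], "y": [0.3, 0.1, 0.2]})

# Largest x1 first, then keep the top two rows.
top2 = df.sort_values(by="x1", ascending=False).head(2)

# Sort by y (ascending) and keep only that column, giving a Series.
sorted_y = df.sort_values(by=["y"])["y"]

print(top2)
print(sorted_y)
```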

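60. df.corr() # correlation matrix; a DataFrame. If df has non-numeric columns, pandas 2.x needs df.corr(numeric_only=True)

61. df.corr()['y'].sort_values(ascending=False) # correlation of each column with y, strongest positive first

# pd.get_dummies(df, columns=['x1', 'x2', 'x3'], drop_first=True) # one-hot encode; drop_first=True drops the first dummy level of each variable

Practice: https://9566.tistory.com/51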
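A sketch of both calls on invented data with one categorical column. It assumes pandas 1.5 or later, where `corr` accepts `numeric_only`; the column name `color` is a placeholder.

```python
import pandas as pd

# Hypothetical toy data: x1 rises with y, x2 falls with y, color is categorical.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [4.0, 3.0, 2.0, 1.0],
    "color": ["r", "g", "b", "r"],
    "y": [2.0, 4.0, 6.0, 8.0],
})

# numeric_only=True skips the string column (assumes pandas >= 1.5).
corr_with_y = df.corr(numeric_only=True)["y"].sort_values(ascending=False)

# drop_first=True drops the first level of color, leaving one dummy fewer.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)
dummy_columns = [c for c in dummies.columns if c.startswith("color_")]

print(corr_with_y)
print(dummy_columns)
```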

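62. Duplicates ----------------------------------------

63. df[df.duplicated(keep=False)] # show duplicated rows; keep=False marks every copy, including the first

64. df = df.drop_duplicates() # remove duplicates, keeping the first occurrence

Practice: https://9566.tistory.com/52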
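A minimal sketch of inspecting duplicates before removing them, using invented data with one repeated row:

```python
import pandas as pd

# Hypothetical toy data; rows 0 and 1 are identical.
df = pd.DataFrame({"x1": [1, 1, 2], "x2": ["a", "a", "b"]})

# keep=False flags every member of a duplicate group.
dupes = df[df.duplicated(keep=False)]

# The default keep='first' retains the first occurrence of each group.
deduped = df.drop_duplicates()

print(dupes)
print(deduped)
```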

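65. Functions ----------------------------------------

66. .apply(func) # apply a function, either user-defined or existing, e.g. np.sum, np.square

67. df.mean(axis=1) # mean of all values in each row

68. .aggregate(['min', 'median', 'max']) # several statistics at once; string names avoid the FutureWarning that recent pandas raises for builtins such as min

69. .aggregate({'x1': 'min', 'x2': 'sum'}) # a different function per column

Practice: https://9566.tistory.com/53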
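A sketch of `apply` with a user-defined function next to the `aggregate` forms above. The helper `value_range` is a made-up example function, not part of pandas.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"x1": [1, 2, 3], "x2": [10, 20, 30]})


def value_range(column):
    """Hypothetical helper: spread between the largest and smallest value."""
    return column.max() - column.min()


ranges = df.apply(value_range)                        # called once per column
row_means = df.mean(axis=1)                           # one mean per row
summary = df.aggregate(["min", "median", "max"])      # rows are the statistics
per_column = df.aggregate({"x1": "min", "x2": "sum"}) # one function per column

print(ranges)
print(row_means)
print(summary)
print(per_column)
```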

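70. Renaming columns ----------------------------------------

71. df = df.rename(columns={"차명": "model", "리콜사유": "cause"}) # rename columns; here "차명" (car model name) becomes model and "리콜사유" (reason for recall) becomes cause

Practice: https://9566.tistory.com/54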

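72. Selecting data ----------------------------------------

73. df = df[df['recall_year'] == 2020] # keep only the rows you want

74. df.sample(frac=0.2) # random sample of 20% of the rows, without replacement

75. df.sample(n=int(len(df) * 0.2), replace=True) # random sample with replacement; n must be an integer

76. df[:int(len(df) * 0.8)] # the first 80% of the rows, in order

77. df[(df['x1'] > 38.00) & (df['x2'] == 25.0)] # and

78. df[(df['x1'] > 38.00) | (df['x2'] == 25.0)] # or

79. `df = df.replace(' ', '').replace('', df['Fruit'].mode()[0])` # The first replace turns cells that are exactly a single space into empty strings, and the second replaces empty strings with the mode of Fruit. Without regex=True, replace matches whole cell values, so spaces inside longer strings are left alone. The one-step df.replace(' ', df['Fruit'].mode()[0]) gives the same result only when no cell is already an empty string.

Practice: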
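A sketch of the sampling calls and the two-step `replace` on invented data. `random_state` is added here only to make the sample reproducible. The `Fruit` column contains both a single-space cell and an empty-string cell, so the two-step and one-step versions give different results.

```python
import pandas as pd

# Hypothetical toy data: one cell is a single space, another is an empty string.
df = pd.DataFrame({
    "Fruit": ["apple", " ", "apple", "", "pear"],
    "x1": range(5),
})

without_replacement = df.sample(frac=0.4, random_state=0)
with_replacement = df.sample(n=int(len(df) * 0.4), replace=True, random_state=0)
first_80_percent = df[:int(len(df) * 0.8)]

most_common = df["Fruit"].mode()[0]

# Two steps: single-space cells become '', then every '' becomes the mode.
two_step = df.replace(" ", "").replace("", most_common)

# One step: only the single-space cell changes; the empty string survives.
one_step = df.replace(" ", most_common)

print(two_step["Fruit"].tolist())
print(one_step["Fruit"].tolist())
```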

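80. Miscellaneous ----------------------------------------

81. df.style.bar(subset=['A', 'B'], align='mid', color=['#a103fc', '#03e3fc']) # draw bar charts inside the DataFrame cells

82. df.style.format(precision=4) # show 4 decimal places; replaces set_precision(4), which was deprecated in pandas 1.3 and removed in 2.0

Practice: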
