EDA - Data Wrangling
Outline:
- Introduction to Data Wrangling
- Hierarchical Indexing
- Combine and Merge
- Reshape & Pivot
- Group By
- Data Aggregation
- Apply
- Pivot & Cross-Tab
Data Wrangling
Data wrangling is the process of transforming raw data into a usable form; it is also called data munging.
Data wrangling describes a series of processes designed to explore, transform, and validate raw datasets, turning messy, complex data into high-quality data. Wrangled data can then be used to generate valuable insights and guide business decisions.
Data wrangling is iterative.
Core data wrangling types
Hierarchical Indexing
- Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis
- Source: [2]
import warnings; warnings.simplefilter('ignore')
import pandas as pd, numpy as np
'Done'
data = pd.DataFrame(data=np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
0 | ||
---|---|---|
a | 1 | 0.696853 |
2 | -0.294804 | |
3 | -1.017076 | |
b | 1 | -1.181953 |
3 | 0.517265 | |
c | 1 | 0.169213 |
2 | 2.682151 | |
d | 2 | 1.940434 |
3 | -0.331235 |
np.random.randn(9)
array([-0.69273157, -0.43129397, 0.30781477, -0.98809902, 1.04713771, 0.4380333 , 1.01300192, 0.1172724 , -0.23156173])
data.index
MultiIndex([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 3), ('c', 1), ('c', 2), ('d', 2), ('d', 3)], )
data.loc['a'].loc[1]
0    0.696853
Name: 1, dtype: float64
data.loc['b'].loc[1]
0   -1.181953
Name: 1, dtype: float64
data.loc[['b', 'd', 'a']] # Note the number of square brackets!
0 | ||
---|---|---|
b | 1 | -1.181953 |
3 | 0.517265 | |
d | 2 | 1.940434 |
3 | -0.331235 | |
a | 1 | 0.696853 |
2 | -0.294804 | |
3 | -1.017076 |
data.loc['b' : 'd'] # Note the brackets: they are different when slicing!
0 | ||
---|---|---|
b | 1 | -1.181953 |
3 | 0.517265 | |
c | 1 | 0.169213 |
2 | 2.682151 | |
d | 2 | 1.940434 |
3 | -0.331235 |
Careful: indexing works slightly differently on a Series
data2 = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data2['b']
1    1.494578
3   -0.780658
dtype: float64
data2.loc[:, 2] # Careful: this raises an error if "data2" is a DataFrame!
a   -1.008824
c    0.021703
d    0.218325
dtype: float64
What Is Hierarchical Indexing For?
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For example, you can rearrange the data into a DataFrame using its unstack method.
data
0 | ||
---|---|---|
a | 1 | 0.696853 |
2 | -0.294804 | |
3 | -1.017076 | |
b | 1 | -1.181953 |
3 | 0.517265 | |
c | 1 | 0.169213 |
2 | 2.682151 | |
d | 2 | 1.940434 |
3 | -0.331235 |
data.unstack()
0 | |||
---|---|---|---|
1 | 2 | 3 | |
a | 0.696853 | -0.294804 | -1.017076 |
b | -1.181953 | NaN | 0.517265 |
c | 0.169213 | 2.682151 | NaN |
d | NaN | 1.940434 | -0.331235 |
data.unstack().stack() # Note the resulting data structure
0 | ||
---|---|---|
a | 1 | 0.696853 |
2 | -0.294804 | |
3 | -1.017076 | |
b | 1 | -1.181953 |
3 | 0.517265 | |
c | 1 | 0.169213 |
2 | 2.682151 | |
d | 2 | 1.940434 |
3 | -0.331235 |
data.unstack().shape, data.shape # When might this be useful? Think about it.
((4, 3), (9, 1))
Rename Index (for clarity)
data = np.arange(12).reshape((4, 3))
data
array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]])
frame = pd.DataFrame(data,
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
Ohio | Colorado | |||
---|---|---|---|---|
Green | Red | Green | ||
a | 1 | 0 | 1 | 2 |
2 | 3 | 4 | 5 | |
b | 1 | 6 | 7 | 8 |
2 | 9 | 10 | 11 |
# Think of these two statements as changing properties of the "frame" object
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
state | Ohio | Colorado | ||
---|---|---|---|---|
color | Green | Red | Green | |
key1 | key2 | |||
a | 1 | 0 | 1 | 2 |
2 | 3 | 4 | 5 | |
b | 1 | 6 | 7 | 8 |
2 | 9 | 10 | 11 |
# Selecting groups of columns
# This makes EDA much easier later on
frame['Ohio'].describe()
color | Green | Red |
---|---|---|
count | 4.000000 | 4.000000 |
mean | 4.500000 | 5.500000 |
std | 3.872983 | 3.872983 |
min | 0.000000 | 1.000000 |
25% | 2.250000 | 3.250000 |
50% | 4.500000 | 5.500000 |
75% | 6.750000 | 7.750000 |
max | 9.000000 | 10.000000 |
Reordering and Sorting Levels
frame.swaplevel('key1', 'key2')
state | Ohio | Colorado | ||
---|---|---|---|---|
color | Green | Red | Green | |
key2 | key1 | |||
1 | a | 0 | 1 | 2 |
2 | a | 3 | 4 | 5 |
1 | b | 6 | 7 | 8 |
2 | b | 9 | 10 | 11 |
# Sorting also works
frame.sort_index(level=1) # Change "1" to "0" to understand this operation.
state | Ohio | Colorado | ||
---|---|---|---|---|
color | Green | Red | Green | |
key1 | key2 | |||
a | 1 | 0 | 1 | 2 |
b | 1 | 6 | 7 | 8 |
a | 2 | 3 | 4 | 5 |
b | 2 | 9 | 10 | 11 |
Indexing with a DataFrame’s columns
You will often need this with real-world data
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two',
'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
frame
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |
frame2 = frame.set_index(['c', 'd']) # Build the index from the values of columns "c" and "d"
frame2
a | b | ||
---|---|---|---|
c | d | ||
one | 0 | 0 | 7 |
1 | 1 | 6 | |
2 | 2 | 5 | |
two | 0 | 3 | 4 |
1 | 4 | 3 | |
2 | 5 | 2 | |
3 | 6 | 1 |
frame.set_index(['c', 'd'], drop=False) # If we want "c" and "d" to remain as columns too
a | b | c | d | ||
---|---|---|---|---|---|
c | d | ||||
one | 0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 | |
2 | 2 | 5 | one | 2 | |
two | 0 | 3 | 4 | two | 0 |
1 | 4 | 3 | two | 1 | |
2 | 5 | 2 | two | 2 | |
3 | 6 | 1 | two | 3 |
# Resetting the index is usually needed after making many changes to the DataFrame,
# or after finishing data preprocessing/EDA
frame2.reset_index()
c | d | a | b | |
---|---|---|---|---|
0 | one | 0 | 0 | 7 |
1 | one | 1 | 1 | 6 |
2 | one | 2 | 2 | 5 |
3 | two | 0 | 3 | 4 |
4 | two | 1 | 4 | 3 |
5 | two | 2 | 5 | 2 |
6 | two | 3 | 6 | 1 |
Combining and Merging Datasets
There are three main tools:
- pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- pandas.concat concatenates or “stacks” together objects along an axis.
- The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.
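The third tool, combine_first, is not demonstrated in the cells below, so here is a minimal sketch with made-up Series (the names s1 and s2 are ours, not from the material above):

```python
import numpy as np
import pandas as pd

# Two overlapping Series; NaN marks the missing values.
s1 = pd.Series([np.nan, 2.5, 0.0, np.nan], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1.0, np.nan, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# Take values from s1, falling back to s2 wherever s1 is missing.
patched = s1.combine_first(s2)
print(patched)
# a    1.0
# b    2.5
# c    0.0
# d    4.0
# dtype: float64
```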
Database-Style DataFrame Joins (Merge/Join)
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1
key | data1 | |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | a | 5 |
6 | b | 6 |
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
df2
key | data2 | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | d | 2 |
pd.merge(df1, df2, on='key')
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
What if the column names differ?
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df3
lkey | data1 | |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | a | 5 |
6 | b | 6 |
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
df4
rkey | data2 | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | d | 2 |
pd.merge(df3, df4, left_on='lkey', right_on='rkey')[['lkey', 'data1', 'data2']]
lkey | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
Inner join
You may notice that the 'c' and 'd' values and their associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection, or the common set found in both tables.
Outer Join
Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:
pd.merge(df1, df2, how='outer') # Note the output.
key | data1 | data2 | |
---|---|---|---|
0 | b | 0.0 | 1.0 |
1 | b | 1.0 | 1.0 |
2 | b | 6.0 | 1.0 |
3 | a | 2.0 | 0.0 |
4 | a | 4.0 | 0.0 |
5 | a | 5.0 | 0.0 |
6 | c | 3.0 | NaN |
7 | d | NaN | 2.0 |
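The 'left' and 'right' options are not shown with these two tables, so a quick sketch (redefining df1 and df2 as above so the block is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# 'left' keeps every row of df1: 'c' survives with data2 = NaN, 'd' is dropped.
left = pd.merge(df1, df2, how='left')

# 'right' keeps every key of df2: 'd' survives with data1 = NaN, 'c' is dropped.
right = pd.merge(df1, df2, how='right')
```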
Many-to-many merges
- well-defined, though not necessarily intuitive, behavior
- Many-to-many joins form the Cartesian product of the rows.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df1
key | data1 | |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
'data2': range(5)})
df2
key | data2 | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | a | 2 |
3 | b | 3 |
4 | d | 4 |
pd.merge(df1, df2, on='key', how='left')
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1.0 |
1 | b | 0 | 3.0 |
2 | b | 1 | 1.0 |
3 | b | 1 | 3.0 |
4 | a | 2 | 0.0 |
5 | a | 2 | 2.0 |
6 | c | 3 | NaN |
7 | a | 4 | 0.0 |
8 | a | 4 | 2.0 |
9 | b | 5 | 1.0 |
10 | b | 5 | 3.0 |
Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. The join method only affects the distinct key values appearing in the result:
pd.merge(df1, df2, how='inner')
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 0 | 3 |
2 | b | 1 | 1 |
3 | b | 1 | 3 |
4 | b | 5 | 1 |
5 | b | 5 | 3 |
6 | a | 2 | 0 |
7 | a | 2 | 2 |
8 | a | 4 | 0 |
9 | a | 4 | 2 |
Merge with multiple keys
- When you’re joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.
df1 = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
df1
key1 | key2 | lval | |
---|---|---|---|
0 | foo | one | 1 |
1 | foo | two | 2 |
2 | bar | one | 3 |
df2 = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
df2
key1 | key2 | rval | |
---|---|---|---|
0 | foo | one | 4 |
1 | foo | one | 5 |
2 | bar | one | 6 |
3 | bar | two | 7 |
pd.merge(df1, df2, on=['key1', 'key2'], how='outer')
key1 | key2 | lval | rval | |
---|---|---|---|---|
0 | foo | one | 1.0 | 4.0 |
1 | foo | one | 1.0 | 5.0 |
2 | foo | two | 2.0 | NaN |
3 | bar | one | 3.0 | 6.0 |
4 | bar | two | NaN | 7.0 |
Overlapping column names
- While you can address the overlap manually (by renaming the columns), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects
pd.merge(df1, df2, on='key1')
key1 | key2_x | lval | key2_y | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
pd.merge(df1, df2, on='key1', suffixes=('_satu', '_dua'))
key1 | key2_satu | lval | key2_dua | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
Merging on Index
df1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)})
df1
key | value | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | a | 2 |
3 | a | 3 |
4 | b | 4 |
5 | c | 5 |
dd = df1.describe()
dd.unstack()
value  count    6.000000
       mean     2.500000
       std      1.870829
       min      0.000000
       25%      1.250000
       50%      2.500000
       75%      3.750000
       max      5.000000
dtype: float64
df2 = pd.DataFrame({'group_val': [35, 7]}, index=['a', 'b'])
df2
group_val | |
---|---|
a | 35 |
b | 7 |
pd.merge(df1, df2, left_on='key', right_index=True)
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 35 |
2 | a | 2 | 35 |
3 | a | 3 | 35 |
1 | b | 1 | 7 |
4 | b | 4 | 7 |
Concatenating Along an Axis
# In NumPy
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
np.concatenate([arr, arr], axis=0)
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
# In pandas: Series
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
pd.concat([s1, s2, s3], axis=1)
0 | 1 | 2 | |
---|---|---|---|
a | 0.0 | NaN | NaN |
b | 1.0 | NaN | NaN |
c | NaN | 2.0 | NaN |
d | NaN | 3.0 | NaN |
e | NaN | 4.0 | NaN |
f | NaN | NaN | 5.0 |
g | NaN | NaN | 6.0 |
# Concat with DataFrames
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
columns=['one', 'two'])
df1
one | two | |
---|---|---|
a | 0 | 1 |
b | 2 | 3 |
c | 4 | 5 |
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
columns=['three', 'four'])
df2
three | four | |
---|---|---|
a | 5 | 6 |
c | 7 | 8 |
pd.concat([df1, df2], axis=1) # Try adding keys=['level1', 'level2']
# Note the index
one | two | three | four | |
---|---|---|---|---|
a | 0 | 1 | 5.0 | 6.0 |
b | 2 | 3 | NaN | NaN |
c | 4 | 5 | 7.0 | 8.0 |
Data Aggregation and Group Operations
- pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.
- Aggregation of time series data, a special use case of groupby, is referred to as resampling.
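Resampling is not covered further in this module; as a minimal sketch, a made-up daily series can be downsampled to monthly averages (the variable names are ours):

```python
import numpy as np
import pandas as pd

# A hypothetical daily time series covering the first 90 days of 2023.
idx = pd.date_range('2023-01-01', periods=90, freq='D')
ts = pd.Series(np.arange(90, dtype=float), index=idx)

# Downsample to month-start frequency, aggregating each month with mean().
monthly = ts.resample('MS').mean()
print(monthly)
```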
1. GroupBy Mechanics
- Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations.
import pandas as pd, numpy as np
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : [1,2,3,4,5],
'data2' : [6,7,8,9,10]})
df
key1 | key2 | data1 | data2 | |
---|---|---|---|---|
0 | a | one | 1 | 6 |
1 | a | two | 2 | 7 |
2 | b | one | 3 | 8 |
3 | b | two | 4 | 9 |
4 | a | one | 5 | 10 |
grouped = df.groupby(df['key1']).sum()
grouped
key2 | data1 | data2 | |
---|---|---|---|
key1 | |||
a | onetwoone | 8 | 23 |
b | onetwo | 7 | 17 |
grouped[['data1', 'data2']].mean()
data1     7.5
data2    20.0
dtype: float64
means = df.groupby([df['key1'], df['key2']]).mean()
means
data1 | data2 | ||
---|---|---|---|
key1 | key2 | ||
a | one | 3.0 | 8.0 |
two | 2.0 | 7.0 | |
b | one | 3.0 | 8.0 |
two | 4.0 | 9.0 |
means.loc['a'].loc['one'].describe()
count    2.000000
mean     5.500000
std      3.535534
min      3.000000
25%      4.250000
50%      5.500000
75%      6.750000
max      8.000000
Name: one, dtype: float64
means.unstack()
df
df[['key1', 'data1', 'data2']].groupby('key1').mean()
# You may have noticed in the first case, df.groupby('key1').mean(), that there is no key2 column in the result.
# Because df['key2'] is not numeric data, it is said to be a nuisance column and is excluded from the result.
# Even with numeric data, we sometimes want the number of occurrences (frequency) instead:
df.groupby(['key1', 'key2']).size()
Iterating Over Groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# Or, if you prefer, store the groups in a variable:
pieces = dict(list(df.groupby('key1')))
pieces['b']
df.groupby(['key1', 'key2'])[['data2']].mean()
data2 | ||
---|---|---|
key1 | key2 | |
a | one | 8.0 |
two | 7.0 | |
b | one | 8.0 |
two | 9.0 |
GroupBy is flexible: you can group by an arbitrary mapping of the columns
- and relabel the grouped result at the same time
people = pd.DataFrame(np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | 0.472783 | 1.859784 | -2.145745 | -0.421652 | 0.473530 |
Steve | -0.140093 | 0.506622 | 1.838508 | 0.478082 | -0.800755 |
Wes | -1.975428 | -0.777821 | -1.030737 | 0.771424 | 0.004535 |
Jim | 0.609483 | -0.383058 | 0.505918 | 0.194575 | -0.566108 |
Travis | 1.371451 | 1.668983 | -0.554708 | -1.534661 | 0.279254 |
d = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
d
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
people.groupby(d, axis=1).mean() # Note: axis=1 grouping is deprecated in newer pandas; people.T.groupby(d).mean().T is the equivalent
blue | red | |
---|---|---|
Joe | -1.283698 | 0.935366 |
Steve | 1.158295 | -0.144742 |
Wes | -0.129657 | -0.916238 |
Jim | 0.350247 | -0.113228 |
Travis | -1.044684 | 1.106562 |
Data Aggregation
- Aggregations refer to any data transformation that produces scalar values from arrays.
- Not limited to built-in aggregations such as sum or mean; any function that reduces an array to a scalar works
df
key1 | key2 | data1 | data2 | |
---|---|---|---|---|
0 | a | one | 1 | 6 |
1 | a | two | 2 | 7 |
2 | b | one | 3 | 8 |
3 | b | two | 4 | 9 |
4 | a | one | 5 | 10 |
df[['key1','data1','data2']].groupby('key1').quantile(0.9)
data1 | data2 | |
---|---|---|
key1 | ||
a | 4.4 | 9.4 |
b | 3.9 | 8.9 |
def RangeData(arr):
return arr.max() - arr.min()
df[['key1','data1','data2']].groupby('key1').agg(RangeData)
data1 | data2 | |
---|---|---|
key1 | ||
a | 4 | 4 |
b | 1 | 1 |
Apply on a DataFrame
- Also very useful for writing concise, efficient code
def sum_kuadrat(x):
return x.sum()**2+1
data = {
"x": [5, 4, 3],
"y": [2, 1, 7]
}
df = pd.DataFrame(data)
df
x | y | |
---|---|---|
0 | 5 | 2 |
1 | 4 | 1 |
2 | 3 | 7 |
df.apply(sum_kuadrat)
x    145
y    101
dtype: int64
Pivot Tables and Cross-Tabulation
A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in pandas are made possible through the groupby facility described above.
import seaborn as sns
tips = sns.load_dataset("tips")
tips
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
tips[['day', 'smoker','tip','total_bill']].pivot_table(index=['day', 'smoker'])
tip | total_bill | ||
---|---|---|---|
day | smoker | ||
Thur | Yes | 3.030000 | 19.190588 |
No | 2.673778 | 17.113111 | |
Fri | Yes | 2.714000 | 16.813333 |
No | 2.812500 | 18.420000 | |
Sat | Yes | 2.875476 | 21.276667 |
No | 3.102889 | 19.661778 | |
Sun | Yes | 3.516842 | 24.120000 |
No | 3.167895 | 20.506667 |
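pivot_table aggregates with the mean by default, but the aggfunc and margins arguments change the statistic and add row/column totals. A sketch with a small made-up table (not the tips dataset):

```python
import pandas as pd

# Hypothetical mini-table in the same shape as tips.
df = pd.DataFrame({
    'day':    ['Thur', 'Thur', 'Fri', 'Fri', 'Sat'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'tip':    [3.0, 2.0, 2.5, 3.5, 4.0],
})

# aggfunc='sum' totals tips per cell; margins=True appends an 'All' row/column.
pt = df.pivot_table(values='tip', index='day', columns='smoker',
                    aggfunc='sum', margins=True)
print(pt)
```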
CrossTab
- A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.
pd.crosstab(tips.day, tips.smoker)
smoker | Yes | No |
---|---|---|
day | ||
Thur | 17 | 45 |
Fri | 15 | 4 |
Sat | 42 | 45 |
Sun | 19 | 57 |
End of Module
References:
[1] Rattenbury, Tye, et al. Principles of Data Wrangling: Practical Techniques for Data Preparation. O'Reilly Media, Inc., 2017.
[2] McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.