Outline:

Introduction to Data Cleaning, Cleansing, & Munging/Wrangling
Data Merging & Concatenating
Group & Aggregate
Pivot & Crosstabs

“The goal is to turn data into information and information into insight.” – Carly Fiorina

Outline:

Pendahuluan Data Wrangling
Hierarchical Indexing
Combine and Merge
Reshape & Pivot
Group By
Data Aggregation
Apply
Pivot & Cross-Tab

Data Wrangling

Data wrangling adalah proses mengubah data mentah menjadi bentuk yang dapat digunakan. Ini juga dapat disebut sebagai data munging.
Data wrangling menggambarkan serangkaian proses yang dirancang untuk mengeksplorasi, mengubah, dan memvalidasi dataset mentah dari bentuk yang kotor dan kompleks menjadi data berkualitas tinggi. Kita dapat menggunakan data yang telah diwrangling untuk menghasilkan wawasan berharga dan membimbing keputusan bisnis.
Data Wrangling bersifat iteratif.

Core data wrangling types

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis
Source: [2]

In [1]:

import warnings; warnings.simplefilter('ignore')
import pandas as pd, numpy as np

'Done'

Out[1]:

'Done'

In [2]:

data = pd.DataFrame(data=np.random.randn(9), 
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

Out[2]:

		0
a	1	0.696853
	2	-0.294804
	3	-1.017076
b	1	-1.181953
b	3	0.517265
c	1	0.169213
c	2	2.682151
d	2	1.940434
d	3	-0.331235

In [11]:

np.random.randn(9)

Out[11]:

array([-0.69273157, -0.43129397,  0.30781477, -0.98809902,  1.04713771,
        0.4380333 ,  1.01300192,  0.1172724 , -0.23156173])

In [13]:

data.index

Out[13]:

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [21]:

data.loc['a'].loc[1]

Out[21]:

0    0.696853
Name: 1, dtype: float64

In [22]:

data.loc['b'].loc[1]

Out[22]:

0   -1.181953
Name: 1, dtype: float64

In [25]:

data.loc[['b', 'd', 'a']] # Perhatikan jumlah kurung siku!....

Out[25]:

		0
b	1	-1.181953
b	3	0.517265
d	2	1.940434
d	3	-0.331235
a	1	0.696853
	2	-0.294804
	3	-1.017076

In [26]:

data.loc['b' : 'd'] # Perhatikan jumlah kurung siku!.... Saat slicing berbeda!.

Out[26]:

		0
b	1	-1.181953
b	3	0.517265
c	1	0.169213
c	2	2.682151
d	2	1.940434
d	3	-0.331235

Hati-hati ... pemanggilan index sedikit berbeda di "Series"¶

In [27]:

data2 = pd.Series(np.random.randn(9), 
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data2['b']

Out[27]:

1    1.494578
3   -0.780658
dtype: float64

In [28]:

data2.loc[:, 2] # HAti-hati ... ini error jika "data2" berupa Dataframe!...

Out[28]:

a   -1.008824
c    0.021703
d    0.218325
dtype: float64

Hierarchical Indexing untuk Apa?

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method

In [29]:

data

Out[29]:

		0
a	1	0.696853
	2	-0.294804
	3	-1.017076
b	1	-1.181953
b	3	0.517265
c	1	0.169213
c	2	2.682151
d	2	1.940434
d	3	-0.331235

In [30]:

data.unstack()

Out[30]:

	0
	1	2	3
a	0.696853	-0.294804	-1.017076
b	-1.181953	NaN	0.517265
c	0.169213	2.682151	NaN
d	NaN	1.940434	-0.331235

In [31]:

data.unstack().stack() # Perhatikan struktur datanya

Out[31]:

		0
a	1	0.696853
	2	-0.294804
	3	-1.017076
b	1	-1.181953
b	3	0.517265
c	1	0.169213
c	2	2.682151
d	2	1.940434
d	3	-0.331235

In [32]:

data.unstack().shape, data.shape # Bilamana ini berguna? ... coba difikirkan

Out[32]:

((4, 3), (9, 1))

Rename Index (for clarity)¶

In [36]:

data = np.arange(12).reshape((4, 3))
data

Out[36]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [37]:

frame = pd.DataFrame(data,
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
    ['Green', 'Red', 'Green']])

frame

Out[37]:

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

In [38]:

# Pahami kedua perintah ini sebagai perubahan property di Object "frame"! ... 
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Out[38]:

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

In [43]:

# Memilih Groups of Colums
# Sangat memudahkan EDA nantinya ==> Sudah bukan DA cupu lagi.
frame['Ohio'].describe()

Out[43]:

color	Green	Red
count	4.000000	4.000000
mean	4.500000	5.500000
std	3.872983	3.872983
min	0.000000	1.000000
25%	2.250000	3.250000
50%	4.500000	5.500000
75%	6.750000	7.750000
max	9.000000	10.000000

Reordering and Sorting Levels

In [40]:

frame.swaplevel('key1', 'key2')

Out[40]:

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

In [41]:

# Bisa juga diurutkan
frame.sort_index(level=1) # Ganti "1" dengan "0" untuk memahami operasi ini.

Out[41]:

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

Indexing with a DataFrame’s columns

Pada data di dunia nyata anda akan sering membutuhkan ini¶

In [33]:

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
     'c': ['one', 'one', 'one', 'two', 'two',
     'two', 'two'],
     'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Out[33]:

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

In [34]:

frame2 = frame.set_index(['c', 'd']) # ==> Membuat index berdasarkan nilai dari kolom "c" dan "d"
frame2

Out[34]:

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

In [35]:

frame.set_index(['c', 'd'], drop=False) # Kalau kita ingin c dan d tetap ada (not sure why u want this)

Out[35]:

		a	b	c	d
c	d
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3

In [42]:

# Reset Index biasanya dibutuhkan saat kita sudah banyak melakukan perubahan di DF-nya
# Atau selesai melakukan data PreProcessing/EDA
frame2.reset_index()

Out[42]:

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1

Combining and Merging Datasets

Ada 3 macam:

pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or “stacks” together objects along an axis.
The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

Database-Style DataFrame Joins (Merge/Join)

In [44]:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
     'data1': range(7)})
df1

Out[44]:

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

In [45]:

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
    'data2': range(3)})
df2

Out[45]:

	key	data2
0	a	0
1	b	1
2	d	2

many-to-one join¶

the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects we obtain:

Data Mana yang Hilang? Mengapa?¶

In [46]:

pd.merge(df1, df2, on='key')

Out[46]:

	key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0

Bagaimana jika nama kolom berbeda?¶

In [47]:

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'data1': range(7)})
df3

Out[47]:

	lkey	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

In [48]:

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
    'data2': range(3)})

df4

Out[48]:

	rkey	data2
0	a	0
1	b	1
2	d	2

In [50]:

pd.merge(df3, df4, left_on='lkey', right_on='rkey')[['lkey', 'data1', 'data2']]

Out[50]:

	lkey	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0

Inner join¶

You may notice that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersec‐ tion, or the common set found in both tables.

Outer Join¶

Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:

In [51]:

pd.merge(df1, df2, how='outer') # Perhatikan outputnya.

Out[51]:

	key	data1	data2
0	b	0.0	1.0
1	b	1.0	1.0
2	b	6.0	1.0
3	a	2.0	0.0
4	a	4.0	0.0
5	a	5.0	0.0
6	c	3.0	NaN
7	d	NaN	2.0

Many-to-many merges¶

well-defined, though not necessarily intuitive, behavior
Many-to-many joins form the Cartesian product of the rows.

In [52]:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
     'data1': range(6)})
df1

Out[52]:

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5

In [53]:

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
    'data2': range(5)})

df2

Out[53]:

	key	data2
0	a	0
1	b	1
2	a	2
3	b	3
4	d	4

In [55]:

pd.merge(df1, df2, on='key', how='left')

Out[55]:

	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0

Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. The join method only affects the distinct key values appearing in the result:

In [56]:

pd.merge(df1, df2, how='inner')

Out[56]:

	key	data1	data2
0	b	0	1
1	b	0	3
2	b	1	1
3	b	1	3
4	b	5	1
5	b	5	3
6	a	2	0
7	a	2	2
8	a	4	0
9	a	4	2

merge with multiple keys

When you’re joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.

In [57]:

df1 = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
    'key2': ['one', 'two', 'one'],
    'lval': [1, 2, 3]})
df1

Out[57]:

	key1	key2	lval
0	foo	one	1
1	foo	two	2
2	bar	one	3

In [58]:

df2 = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
     'key2': ['one', 'one', 'one', 'two'],
     'rval': [4, 5, 6, 7]})
df2

Out[58]:

	key1	key2	rval
0	foo	one	4
1	foo	one	5
2	bar	one	6
3	bar	two	7

In [59]:

pd.merge(df1, df2, on=['key1', 'key2'], how='outer')

Out[59]:

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

overlapping column names¶

While you can address the overlap manually (Rename nama kolom), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects

In [60]:

pd.merge(df1, df2, on='key1')

Out[60]:

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

In [62]:

pd.merge(df1, df2, on='key1', suffixes=('_satu', '_dua'))

Out[62]:

	key1	key2_satu	lval	key2_dua	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

Merging on Index

In [66]:

df1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
    'value': range(6)})
df1

Out[66]:

	key	value
0	a	0
1	b	1
2	a	2
3	a	3
4	b	4
5	c	5

In [68]:

dd = df1.describe()
dd.unstack()

Out[68]:

value  count    6.000000
       mean     2.500000
       std      1.870829
       min      0.000000
       25%      1.250000
       50%      2.500000
       75%      3.750000
       max      5.000000
dtype: float64

In [64]:

df2 = pd.DataFrame({'group_val': [35, 7]}, index=['a', 'b'])
df2

Out[64]:

	group_val
a	35
b	7

In [65]:

pd.merge(df1, df2, left_on='key', right_index=True)

Out[65]:

	key	value	group_val
0	a	0	35
2	a	2	35
3	a	3	35
1	b	1	7
4	b	4	7

Concatenating Along an Axis¶

In [69]:

# Kalau di Numpy
arr = np.arange(12).reshape((3, 4))
arr

Out[69]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [70]:

np.concatenate([arr, arr], axis=0)

Out[70]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [71]:

# Di Pandas  - Series
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

pd.concat([s1, s2, s3])

Out[71]:

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [72]:

pd.concat([s1, s2, s3], axis=1)

Out[72]:

	0	1	2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

In [73]:

# Concat di DataFrame

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
    columns=['one', 'two'])
df1

Out[73]:

	one	two
a	0	1
b	2	3
c	4	5

In [79]:

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
    columns=['three', 'four'])
df2

Out[79]:

	three	four
a	5	6
c	7	8

In [81]:

pd.concat([df1, df2], axis=1) # , keys=['level1', 'level2']
#Perhatikan indexnya

Out[81]:

	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

Data Aggregation and Group Operations

pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.
Aggregation of time series data, a special use case of groupby, is referred to as resampling.

1. GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming lan‐ guage, coined the term split-apply-combine for describing group operations.

In [84]:

import pandas as pd , numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : [1,2,3,4,5],
    'data2' : [6,7,8,9,10]})
df

Out[84]:

	key1	key2	data1	data2
0	a	one	1	6
1	a	two	2	7
2	b	one	3	8
3	b	two	4	9
4	a	one	5	10

In [95]:

grouped = df.groupby(df['key1']).sum()
grouped

Out[95]:

	key2	data1	data2
key1
a	onetwoone	8	23
b	onetwo	7	17

In [96]:

grouped[['data1', 'data2']].mean()

Out[96]:

data1     7.5
data2    20.0
dtype: float64

In [97]:

means = df.groupby([df['key1'], df['key2']]).mean()
means

Out[97]:

		data1	data2
key1	key2
a	one	3.0	8.0
a	two	2.0	7.0
b	one	3.0	8.0
b	two	4.0	9.0

In [105]:

means.loc['a'].loc['one'].describe()

Out[105]:

count    2.000000
mean     5.500000
std      3.535534
min      3.000000
25%      4.250000
50%      5.500000
75%      6.750000
max      8.000000
Name: one, dtype: float64

In [ ]:

means.unstack()

In [ ]:

df

In [ ]:

df[['key1', 'data1', 'data2']].groupby('key1').mean()

# You may have noticed in the first case df.groupby('key1').mean() that there is no key2 column in the result. 
# Because df['key2'] is not numeric data, it is said to be a nuisance column, which is therefore excluded from the result.

In [ ]:

# Walau pada data numerik terkadang kita ingin mengetahui jumlah kejadiannya (frekuensi)
df.groupby(['key1', 'key2']).size()

Iterating Over Groups¶

In [ ]:

for name, group in df.groupby('key1'):
    print(name)
    print(group)

In [ ]:

# Atau kalau mau bisa juga disimpan ke variable:
pieces = dict(list(df.groupby('key1')))
pieces['b']

Selecting a Column or Subset of Columns¶

Trivial sebenarnya¶

In [106]:

df.groupby(['key1', 'key2'])[['data2']].mean()

Out[106]:

		data2
key1	key2
a	one	8.0
a	two	7.0
b	one	8.0
b	two	9.0

GroupBy bisa flexible dengan memilih grouping berdasarkan kolom mana saja¶

Sekalian rename hasil groupBy-nya

In [107]:

people = pd.DataFrame(np.random.randn(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people

Out[107]:

	a	b	c	d	e
Joe	0.472783	1.859784	-2.145745	-0.421652	0.473530
Steve	-0.140093	0.506622	1.838508	0.478082	-0.800755
Wes	-1.975428	-0.777821	-1.030737	0.771424	0.004535
Jim	0.609483	-0.383058	0.505918	0.194575	-0.566108
Travis	1.371451	1.668983	-0.554708	-1.534661	0.279254

In [108]:

d = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
d

Out[108]:

{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

In [ ]:

In [111]:

people.groupby(d, axis=1).mean()

Out[111]:

	blue	red
Joe	-1.283698	0.935366
Steve	1.158295	-0.144742
Wes	-0.129657	-0.916238
Jim	0.350247	-0.113228
Travis	-1.044684	1.106562

Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.
Tidak hanya fungsi yang ditabel diatas

In [112]:

df

Out[112]:

	key1	key2	data1	data2
0	a	one	1	6
1	a	two	2	7
2	b	one	3	8
3	b	two	4	9
4	a	one	5	10

In [113]:

df[['key1','data1','data2']].groupby('key1').quantile(0.9)

Out[113]:

	data1	data2
key1
a	4.4	9.4
b	3.9	8.9

Custom Aggregate Function¶

Perhatikan, ini sangat penting untuk diketahui¶

In [114]:

def RangeData(arr):
    return arr.max() - arr.min()

In [115]:

df[['key1','data1','data2']].groupby('key1').agg(RangeData)

Out[115]:

	data1	data2
key1
a	4	4
b	1	1

Apply di Dataframe¶

Sangat penting juga untuk mengefisienkan Code

In [118]:

def sum_kuadrat(x):
    return x.sum()**2+1

data = {
  "x": [5, 4, 3],
  "y": [2, 1, 7]
}
df = pd.DataFrame(data)
df

Out[118]:

	x	y
0	5	2
1	4	1
2	3	7

In [119]:

df.apply(sum_kuadrat)

Out[119]:

x    145
y    101
dtype: int64

Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the groupby

In [120]:

import seaborn as sns

tips = sns.load_dataset("tips")
tips

Out[120]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

In [121]:

tips[['day', 'smoker','tip','total_bill']].pivot_table(index=['day', 'smoker'])

Out[121]:

		tip	total_bill
day	smoker
Thur	Yes	3.030000	19.190588
Thur	No	2.673778	17.113111
Fri	Yes	2.714000	16.813333
Fri	No	2.812500	18.420000
Sat	Yes	2.875476	21.276667
Sat	No	3.102889	19.661778
Sun	Yes	3.516842	24.120000
Sun	No	3.167895	20.506667

CrossTab¶

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.

In [122]:

pd.crosstab(tips.day, tips.smoker)

Out[122]:

smoker	Yes	No
day
Thur	17	45
Fri	15	4
Sat	42	45
Sun	19	57

End of Module

Referensi:¶

Rattenbury, Tye, et al. Principles of data wrangling: Practical techniques for data preparation. " O'Reilly Media, Inc.", 2017.
McKinney, Wes. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. " O'Reilly Media, Inc.", 2012.

	key	data1	data2
0	b	0.0	1.0
1	b	1.0	1.0
2	b	6.0	1.0
3	a	2.0	0.0
4	a	4.0	0.0
5	a	5.0	0.0
6	c	3.0	NaN
7	d	NaN	2.0

	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	key1	key2_satu	lval	key2_dua	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	0	1	2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

	key	data1	data2
0	b	0.0	1.0
1	b	1.0	1.0
2	b	6.0	1.0
3	a	2.0	0.0
4	a	4.0	0.0
5	a	5.0	0.0
6	c	3.0	NaN
7	d	NaN	2.0

	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	key1	key2_satu	lval	key2_dua	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	0	1	2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

E(V)DA-04: Data Wrangling

Outline:

Video:

Code:

EDA - Data Wrangling

Outline:

Data Wrangling

Core data wrangling types

Hierarchical Indexing

Hati-hati ... pemanggilan index sedikit berbeda di "Series"¶

Hierarchical Indexing untuk Apa?

Rename Index (for clarity)¶

Reordering and Sorting Levels

Indexing with a DataFrame’s columns

Pada data di dunia nyata anda akan sering membutuhkan ini¶

Combining and Merging Datasets

Database-Style DataFrame Joins (Merge/Join)

many-to-one join¶

Data Mana yang Hilang? Mengapa?¶

Bagaimana jika nama kolom berbeda?¶

Inner join¶

Outer Join¶

Many-to-many merges¶

merge with multiple keys

overlapping column names¶

Merging on Index

Concatenating Along an Axis¶

Data Aggregation and Group Operations

1. GroupBy Mechanics

Iterating Over Groups¶

Selecting a Column or Subset of Columns¶

Trivial sebenarnya¶

GroupBy bisa flexible dengan memilih grouping berdasarkan kolom mana saja¶

Data Aggregation

Custom Aggregate Function¶

Perhatikan, ini sangat penting untuk diketahui¶

Apply di Dataframe¶

Pivot Tables and Cross-Tabulation

CrossTab¶

End of Module

Referensi:¶

No comments:

Post a Comment

SEARCH

LATEST

FOLLOW ME

Visitors

Translate~Terjemahkan

Pages

Follow Us

Popular

Archive

Postingan Populer

Latest courses

Comments

About

	key	data1	data2
0	b	0.0	1.0
1	b	1.0	1.0
2	b	6.0	1.0
3	a	2.0	0.0
4	a	4.0	0.0
5	a	5.0	0.0
6	c	3.0	NaN
7	d	NaN	2.0

	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	key1	key2_satu	lval	key2_dua	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

	0	1	2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0