Algorithms, Data Structures, and Programming (ADSP)
Part of the combined module: Introduction to High Performance (Parallel Programming) for Data Science (HPDS)
Part 3 of "Python basics" for Big Data & Data Science. Part of the HPDS and ADSP modules in the tau-data Indonesia curriculum.
Topics covered:
- Avoiding memory copies in Big Data/Data Science programming
- How to use args and kwargs in Python functions, with applications in Machine Learning.
- References to variables, lambda functions, etc.
Video Lesson ADSP-03
(Video link: currently available only to tau-data partners)
https://tau-data.id
ADSP-03: Python Data Structures, Part 2
(C)Taufik Sutanto
https://tau-data.id/adsp-03
Outline:
- Numpy Array (Matrix)
- Numpy MemMap
- SciPy Sparse Matrix
- Dataframe
NumPy Matrix: No Longer Recommended
https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
import numpy as np
s = [2.0, 2.8, 1.9, 2.5, 2.7, 2.3, 1.8, 1.2, 0.9, 1.0]
C = np.array(s)
print(C)
[2. 2.8 1.9 2.5 2.7 2.3 1.8 1.2 0.9 1. ]
L = [2,3,4,4,3,6,23,6,4,7,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9]*33
print(L)
[2, 3, 4, 4, 3, 6, 23, 6, 4, 7, 9, 9, 9, ..., 9, 9, 9]  (the full list is printed in its entirety; abbreviated here)
print(np.array(L))
[2 3 4 ... 9 9 9]
# ndarray = N-dimensional array
# Note that "shape" is a property (attribute), not a function
type(C), C.shape
(numpy.ndarray, (10,))
C
array([2. , 2.8, 1.9, 2.5, 2.7, 2.3, 1.8, 1.2, 0.9, 1. ])
# element-wise operations
print(C * 2+1)
[5. 6.6 4.8 6. 6.4 5.6 4.6 3.4 2.8 3. ]
try:
    print(s * 2 + 1)
except TypeError:
    # Hence: a list is NOT an array, so array operations are not defined on it
    print('Error: this cannot be done on a list')
Error: this cannot be done on a list
print(C)
print(C*C)
[2. 2.8 1.9 2.5 2.7 2.3 1.8 1.2 0.9 1. ]
[4. 7.84 3.61 6.25 7.29 5.29 3.24 1.44 0.81 1. ]
print(np.dot(C,C)) # Squared Euclidean norm of C; dot products like this underlie Euclidean distance in Data Science, e.g. k-Means
40.77
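The link between the dot product and Euclidean distance can be sketched as follows. The vectors `p` and `q` are made-up values for illustration and are not part of the lesson data.

```python
import numpy as np

# Hypothetical example vectors (not from the lesson)
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

diff = p - q
sq_dist = np.dot(diff, diff)   # squared Euclidean distance between p and q
dist = np.sqrt(sq_dist)        # Euclidean distance

print(sq_dist, dist)
# np.linalg.norm computes the same distance directly
print(np.isclose(dist, np.linalg.norm(p - q)))
```

Algorithms such as k-Means often compare squared distances only, since skipping the square root does not change which centroid is nearest.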
# Array as Matrix
A = [ [1,2], [3,4] ]
B = np.array(A)
print(B.shape)
B
(2, 2)
array([[1, 2], [3, 4]])
# Arrays can also be initialized directly, e.g.
M1 = np.zeros((2,2), dtype='int') # mind the parentheses: the shape is passed as a tuple
print(M1)
M2 = np.ones((2,2))
M2
[[0 0] [0 0]]
array([[1., 1.], [1., 1.]])
# Matrix elements are accessed with ordinary indexing
M1[1,0] = 99.7 # the .7 is dropped because M1 was created above with dtype "int"
print(M1[:,0]) # Slicing differs slightly from lists, but list-style indexing [][] also works
M2[0,:] = [3, 7] # think this line through slowly: it assigns an entire row
print(M2, type(M2))
[ 0 99]
[[3. 7.] [1. 1.]] <class 'numpy.ndarray'>
M1
array([[ 0, 0], [99, 0]])
M1[1,0], M1[1][0]
(99, 99)
B
array([[1, 2], [3, 4]])
# Careful: on an ndarray, * is element-wise, not matrix multiplication
B*B
array([[ 1, 4], [ 9, 16]])
np.matmul(B,B) # Matrix product: what B*B means in MATLAB
array([[ 7, 10], [15, 22]])
B = np.matrix(B)
type(B), B
(numpy.matrix, matrix([[1, 2], [3, 4]]))
B*B
matrix([[ 7, 10], [15, 22]])
# Multiplication by a scalar is still element-wise
B*2
matrix([[2, 4], [6, 8]])
print(B)
B.transpose() # the equivalent of MATLAB's B'
[[1 2] [3 4]]
matrix([[1, 3], [2, 4]])
# Shortcut for the transpose B'
B.T
matrix([[1, 3], [2, 4]])
inv = np.linalg.inv # alias
#np.linalg.inv(B) # the equivalent of MATLAB's inv(B)
inv(B)
matrix([[-2. , 1. ], [ 1.5, -0.5]])
det = np.linalg.det
det(B) # Determinant of matrix B
-2.0000000000000004
eig = np.linalg.eig
eig(B) # Eigenvalues and eigenvectors of matrix B
(array([-0.37228132, 5.37228132]), matrix([[-0.82456484, -0.41597356], [ 0.56576746, -0.90937671]]))
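Since the NumPy documentation linked above no longer recommends `np.matrix`, the same operations can be written with a plain ndarray. This is a minimal sketch; the name `B_arr` is introduced here only for illustration.

```python
import numpy as np

B_arr = np.array([[1, 2], [3, 4]])   # plain ndarray instead of np.matrix

prod = B_arr @ B_arr                 # matrix product (same as np.matmul)
elementwise = B_arr * B_arr          # element-wise product
B_inv = np.linalg.inv(B_arr)
eigenvalues, eigenvectors = np.linalg.eig(B_arr)

print(prod)
print(elementwise)
print(B_arr.T)
# B times its inverse should be (numerically) the identity matrix
print(np.allclose(B_arr @ B_inv, np.eye(2)))
```

With ndarrays, `*` always means element-wise multiplication and `@` always means the matrix product, which removes the ambiguity shown above.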
C
array([2. , 2.8, 1.9, 2.5, 2.7, 2.3, 1.8, 1.2, 0.9, 1. ])
import matplotlib.pyplot as plt
plt.plot(C)
plt.show()
List vs. Array: Best-Use Scenario
# Memory usage comparison (bytes)
from sys import getsizeof as size
a = np.array([24, 12, 57])
b = np.array([])
c = []
d = [24, 12, 57]
print(size(a),size(b),size(c),size(d))
# In bytes: https://docs.python.org/3/library/sys.html#sys.getsizeof
# Caution for dictionaries and other containers: getsizeof does not count the objects they refer to, see https://docs.python.org/3/library/sys.html#sys.getsizeof
108 96 64 88
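As a rough complement to `getsizeof`, the sketch below compares the raw data buffer of an array (`nbytes`) with a list whose element objects are counted explicitly. Exact numbers depend on the platform and Python version, so none are shown here.

```python
import sys
import numpy as np

a = np.array([24, 12, 57])
d = [24, 12, 57]

# nbytes counts only the raw element buffer: number of elements * bytes per element
print(a.nbytes, a.size * a.itemsize)

# getsizeof(list) covers the list object and its pointers,
# not the int objects the list refers to
list_only = sys.getsizeof(d)
list_with_items = list_only + sum(sys.getsizeof(v) for v in d)
print(list_only, list_with_items)
```

For three elements the fixed overhead of the array object dominates; the per-element saving of an array only becomes visible as the number of elements grows.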
# Let's compare the speed of NumPy vs. list
# In Data Science, EFFICIENCY is very important
N = 10000
A = [i+1 for i in range(N)] # [1,2,3,...,N]
B = [i*2 for i in range(N)]
C = np.array(A)
D = np.array(B)
D[:10]
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
%%timeit
E = [a+b for a,b in zip(A,B)]
823 µs ± 243 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
F = np.add(C,D)
12.9 µs ± 547 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
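`%%timeit` only works inside Jupyter/IPython. Outside a notebook, a similar comparison can be sketched with the standard `timeit` module; the repetition count of 100 is an arbitrary choice, and timings will vary by machine.

```python
import timeit
import numpy as np

N = 10000
A = [i + 1 for i in range(N)]
B = [i * 2 for i in range(N)]
C = np.array(A)
D = np.array(B)

# Run each statement 100 times; timeit.timeit returns the total elapsed seconds
t_list = timeit.timeit(lambda: [a + b for a, b in zip(A, B)], number=100)
t_numpy = timeit.timeit(lambda: np.add(C, D), number=100)

print(f"list:  {t_list / 100 * 1e6:.1f} µs per loop")
print(f"numpy: {t_numpy / 100 * 1e6:.1f} µs per loop")
```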
# Histogram
import matplotlib.pyplot as plt
data = np.random.normal(size=10000)
plt.hist(data)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
data = np.random.normal(size=10000)
type(data)
numpy.ndarray
data[:10]
array([ 0.248563 , 0.4350702 , 0.2730992 , 2.0050722 , -1.22075245, -1.17351649, -0.20473633, 0.44299796, 1.15944265, 0.30431478])
X = np.linspace(-2 * np.pi, 2 * np.pi, 50, endpoint=True)
F1 = 3 * np.sin(X)
F2 = np.sin(2*X)
F3 = 0.3 * np.sin(X)
startx, endx = -2 * np.pi - 0.1, 2*np.pi + 0.1
starty, endy = -3.1, 3.1
plt.axis([startx, endx, starty, endy])
plt.plot(X,F1)
plt.plot(X,F2)
plt.plot(X,F3)
plt.plot(X, F1, 'ro')
plt.plot(X, F2, 'bx')
plt.show()
# Comment out the plt.plot(X,F1), plt.plot(X,F2), plt.plot(X,F3) lines to get a scatter plot example
Handling "Large" Arrays/Matrices with NumPy MemMap
- NumPy's memmap() maps an array to a file on disk, so the data does not have to fit in memory.
- It is used to work with large arrays/matrices that exceed the available RAM.
- A memmap file is accessed in segments/chunks using ordinary indexing/slicing, just like a regular NumPy array.
- Use a fast drive where possible, e.g. an NVMe SSD.
- Operations on a memmap array are the same as on a regular array. However, make sure you do not access all elements of the memmap array at once, only a segment/part at a time.
nrows, ncols = 10**6, 100
f = np.memmap('memap.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))
# Look in the folder containing this ipynb: a new file has appeared; check its size
mode:
- ‘r’ Open existing file for reading only.
- ‘r+’ Open existing file for reading and writing.
- ‘w+’ Create or overwrite existing file for reading and writing.
- ‘c’ Copy-on-write: assignments affect data in memory, but changes are not saved to disk. The file on disk is read-only.
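The modes listed above can be tried on a tiny file. This is a minimal sketch; the file name `modes_demo.dat` and the temporary folder are assumptions made only for this demonstration.

```python
import os
import tempfile
import numpy as np

# Hypothetical small file in a temporary folder, just for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.dat')

w = np.memmap(path, dtype=np.float32, mode='w+', shape=(4,))  # create the file
w[:] = [1, 2, 3, 4]
w.flush()                                                     # write changes to disk
del w

c = np.memmap(path, dtype=np.float32, mode='c', shape=(4,))   # copy-on-write
c[0] = 99                                                     # changes memory only
del c

r = np.memmap(path, dtype=np.float32, mode='r', shape=(4,))   # read-only
first_on_disk = float(r[0])
print(first_on_disk)   # the value on disk is not affected by the copy-on-write edit
try:
    r[0] = 5           # assigning to a read-only memmap is expected to fail
    write_failed = False
except ValueError:
    write_failed = True
print(write_failed)
del r
```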
# fill with uniform random values in [0, 1)
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
# e.g., take the first column
x = f[:, 0]
print(x, np.mean(x))
[0.9059809 0.6013287 0.38846755 ... 0.4889865 0.34110525 0.92023325] 0.49993035
Computing the array's size on disk
def check_asize_bytes(shape, dtype):
    # number of elements * bytes per element
    return np.prod(shape) * np.dtype(dtype).itemsize
print(check_asize_bytes((10**6,100), 'float32'))
400000000
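The computed size can be compared with the actual file size on disk. The sketch below uses a small, hypothetical shape and a temporary file so that it runs quickly; `math.prod` is used instead of `np.prod` so that very large shapes are multiplied as Python integers.

```python
import math
import os
import tempfile
import numpy as np

def expected_bytes(shape, dtype):
    # number of elements * bytes per element
    return math.prod(shape) * np.dtype(dtype).itemsize

shape, dtype = (1000, 100), np.float32          # small, hypothetical shape
path = os.path.join(tempfile.mkdtemp(), 'size_demo.dat')

m = np.memmap(path, dtype=dtype, mode='w+', shape=shape)
m.flush()
del m

print(expected_bytes(shape, dtype), os.path.getsize(path))

# The same arithmetic for a 10**6 x 10**4 float32 matrix (no file is created here)
print(expected_bytes((10**6, 10**4), np.float32) / 1e9, 'GB')
```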
Saving memmap changes to disk
# Flush pending changes to disk, then release the memmap variable with "del"
f.flush()
del f
Loading a MemMap Array
# Deliberately create a new variable to check/verify
fBaru = np.memmap('memap.dat', dtype=np.float32, shape=(nrows, ncols))
xx = fBaru[:, 0]
xx==x, np.array_equal(xx, x)
(array([ True, True, True, ..., True, True, True]), True)
Testing a Very Large Matrix with a Simple Operation
- To be sure, check how much RAM your computer has and open the Task Manager
- Make sure there is enough disk space
- We will compute an "average" over the elements of the matrix after generating random numbers
# Careful: this creates a VERY large matrix file, about 40 GB (10**6 * 10**4 * 4 bytes)!
import numpy as np
from tqdm import tqdm
nrows, ncols = 10**6, 10**4
f = np.memmap('BigMatrix.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))
# Look in the folder containing this ipynb: a new file has appeared; check its size
# Deliberately flush first
del f
# Load it again from disk to simulate a real use case
f = np.memmap('BigMatrix.dat', dtype=np.float32, shape=(nrows, ncols))
# Generate random data
# Watch the Task Manager and the memory being used.
# If memmap were not working, we would run out of memory
for i in tqdm(range(ncols)):
    f[:, i] = np.random.rand(nrows)
    if i > 1500:
        break  # stop early so it does not take too long
15%|███████████▌ | 1501/10000 [03:10<18:00, 7.86it/s]
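# Deliberately flush to disk again first
del f
# Check RAM usage in the Task Manager
# then load it again
f = np.memmap('BigMatrix.dat', dtype=np.float32, shape=(nrows, ncols))
# Now try computing an average
sum_ = 0
for i in tqdm(range(ncols)):
    sum_ += np.sum(f[:, i])
    if i > 1500:
        break  # stop early so it does not take too long
# Note: the loop sums columns 0..1501 (1502 columns), so sum_/1500 is roughly the
# average column sum (about nrows * 0.5), not the mean of the elements.
# The mean of the processed elements would be sum_ / (1502 * nrows), close to 0.5.
sum_/1500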
15%|███████████▌ | 1501/10000 [01:05<06:08, 23.09it/s]
500665.11179166666
del f # I need the memory back :)
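The loop above accumulates column sums. A mean over all elements can be sketched the same way by also counting the elements in each chunk. The example below assumes a small shape, a temporary file, and a chunk size of 1,000 rows so that it finishes quickly; row blocks are used because rows are contiguous in the default (C-order) memmap layout.

```python
import os
import tempfile
import numpy as np

nrows, ncols = 10_000, 50                        # small, hypothetical shape
path = os.path.join(tempfile.mkdtemp(), 'chunk_demo.dat')

m = np.memmap(path, dtype=np.float32, mode='w+', shape=(nrows, ncols))
m[:] = np.random.rand(nrows, ncols)
m.flush()

total, count = 0.0, 0
chunk = 1_000                                    # rows per chunk (assumed value)
for start in range(0, nrows, chunk):
    block = m[start:start + chunk]               # only this block is read
    total += float(np.sum(block, dtype=np.float64))
    count += block.size

mean_chunked = total / count
matches = bool(np.isclose(mean_chunked, np.mean(m, dtype=np.float64)))
print(mean_chunked, matches)
del m
```

Accumulating in `float64` is a deliberate choice here: summing many `float32` values in `float32` can lose precision.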
Sparse Matrices
Reference: https://matteding.github.io/2019/04/25/sparse-matrices/
- A sparse matrix is a matrix whose elements are dominated by zeros.
- Sparse matrices are common in machine learning on unstructured data (especially text).
# Example of a DENSE NumPy matrix
A = np.array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)
type(A), A.size
[[1 0 0 1 0 0] [0 0 2 0 0 1] [0 0 0 2 0 0]]
(numpy.ndarray, 18)
# DENSITY: non-zero elements / total elements (sparsity = 1 - density)
np.count_nonzero(A) / float(A.size)
0.2777777777777778
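# What if there are NaN values? count_nonzero would count them as non-zero, so replace them with 0 first
A = np.array([[1, 0, 0, 1, 0, np.nan], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, np.nan]])
A = np.nan_to_num(A, nan=0.0)  # the second positional argument is "copy", so pass nan= by keyword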
np.count_nonzero(A) / float(A.size)
0.2777777777777778
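The two steps above can be wrapped in a small helper. This is only a sketch; the function name `density` is introduced here and is not part of NumPy.

```python
import numpy as np

def density(M):
    # fraction of entries that are non-zero, treating NaN as zero
    M = np.nan_to_num(M, nan=0.0)
    return np.count_nonzero(M) / M.size

A = np.array([[1, 0, 0, 1, 0, np.nan],
              [0, 0, 2, 0, 0, 1],
              [0, 0, 0, 2, 0, np.nan]])

d = density(A)
print(d, 1 - d)                        # density and sparsity
# Without the conversion, the NaN entries are counted as non-zero
print(np.count_nonzero(A) / A.size)
```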
The SciPy Module for Handling Sparse Matrices
There are 7 sparse matrix types in total:
- csc_matrix: Compressed Sparse Column format
- csr_matrix: Compressed Sparse Row format
- bsr_matrix: Block Sparse Row format
- lil_matrix: List of Lists format
- dok_matrix: Dictionary of Keys format
- coo_matrix: COOrdinate format (aka IJV, triplet format)
- dia_matrix: DIAgonal format
- Documentation: https://docs.scipy.org/doc/scipy/reference/sparse.html
- Related link: https://tau-data.id/fast-cosine/
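A common pattern is to build a matrix in a format that is easy to modify and then convert it to a compressed format for computation. The following is a minimal sketch with made-up values.

```python
import numpy as np
from scipy import sparse

# LIL is convenient for incremental construction
M = sparse.lil_matrix((3, 4), dtype=np.int64)
M[0, 1] = 5
M[2, 3] = 7

M_csr = M.tocsr()   # CSR: efficient row slicing and matrix-vector products
M_csc = M.tocsc()   # CSC: efficient column slicing

print(M_csr.format, M_csc.format, M_csr.nnz)
print(M_csr.toarray())
```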
Coordinate Matrix (COO)
- COO is easy to construct and easy to understand
from scipy import sparse
row = [0,3,1,0] # arrays work too
col = [0,3,1,2]
data = [4,5,7,9]
A = sparse.coo_matrix((data,(row, col)),shape=(4,4))
# Note: the indices need not be sorted, and we must supply the matrix size (shape)
A, A.data
(<4x4 sparse matrix of type '<class 'numpy.int32'>' with 4 stored elements in COOrdinate format>, array([4, 5, 7, 9]))
A.toarray(), type(A) # not "in place"
(array([[4, 0, 9, 0], [0, 7, 0, 0], [0, 0, 0, 0], [0, 0, 0, 5]]), scipy.sparse.coo.coo_matrix)
Careful: if the matrix is not sparse, do not use this data structure
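The warning above can be checked by comparing memory use. The sketch below counts the bytes of the three arrays behind a COO matrix (`data`, `row`, `col`) for a mostly-zero matrix and for a matrix with no zeros; the helper name `coo_bytes` and the matrix size are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

def coo_bytes(M):
    # memory held by the three arrays behind a COO matrix
    return M.data.nbytes + M.row.nbytes + M.col.nbytes

n = 1000
mostly_zero = np.zeros((n, n))
mostly_zero[0, :10] = 1.0          # only 10 non-zero entries
all_nonzero = np.ones((n, n))      # no zero entries at all

for name, dense in [('mostly zero', mostly_zero), ('all non-zero', all_nonzero)]:
    M = sparse.coo_matrix(dense)
    print(name, dense.nbytes, coo_bytes(M))
```

When every entry is non-zero, the sparse format stores a row index and a column index on top of each value, so it needs more memory than the dense array.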
Compressed Sparse Matrix
- Used in DS and ML for computation
Image source: https://matteding.github.io/2019/04/25/sparse-matrices/
- Each consecutive pair of index pointers, indptr[i] and indptr[i+1], determines:
  - The row position (i)
  - The start:end slice of that row within indices and data
- The NNZ (non-zero) entries are the stored values.
# Careful in Python: a slice excludes its end index
B = [0, 1, 2, 3, 4, 5, 6]
B[6:7]
[6]
indptr = [0, 2, 3, 3, 3, 6, 6, 7]
indices = [0, 2, 2, 2, 3, 4, 3]
data = [8, 2, 5, 7, 1, 2, 9]
csr = sparse.csr_matrix((data, indices, indptr)) # Note: no SHAPE is given; it is inferred from indptr and indices
csr
<7x5 sparse matrix of type '<class 'numpy.int32'>' with 7 stored elements in Compressed Sparse Row format>
csr.toarray()
array([[8, 0, 2, 0, 0], [0, 0, 5, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 7, 1, 2], [0, 0, 0, 0, 0], [0, 0, 0, 9, 0]])
csr.indices, csr.data, csr.getrow(0).indices
(array([0, 2, 2, 2, 3, 4, 3], dtype=int32), array([8, 2, 5, 7, 1, 2, 9]), array([0, 2]))
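To see how `indptr`, `indices`, and `data` encode the matrix, the dense form can be rebuilt by hand without SciPy. This is a sketch for understanding only, not how SciPy implements `toarray()`.

```python
import numpy as np

indptr  = [0, 2, 3, 3, 3, 6, 6, 7]
indices = [0, 2, 2, 2, 3, 4, 3]
data    = [8, 2, 5, 7, 1, 2, 9]

n_rows = len(indptr) - 1          # one pointer pair per row
n_cols = max(indices) + 1         # inferred from the largest column index
dense = np.zeros((n_rows, n_cols), dtype=int)

for i in range(n_rows):
    start, end = indptr[i], indptr[i + 1]   # slice of row i within indices/data
    for j, v in zip(indices[start:end], data[start:end]):
        dense[i, j] = v

print(dense)
```

Rows 2, 3, and 5 have equal start and end pointers, so their slices are empty and the rows stay all zero.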
Properties of Sparse Matrices in Python
Dataframe
- Creating Dataframe
- info, dtypes, basic properties & Functions
- Iterating and loc
- groups
- (un)Stack
- Concat
- Search
import pandas as pd
#creating from dictionary
D = {'nama':['ali', 'budi', 'wati'], 'umur':[22, 34, 12]}
df = pd.DataFrame(D)
df
| | nama | umur |
|---|---|---|
| 0 | ali | 22 |
| 1 | budi | 34 |
| 2 | wati | 12 |
# Other method to create dataframe
D = [{'col_1': 3, 'col_2': 'a'},
{'col_1': 2, 'col_2': 'b'},
{'col_1': 1, 'col_2': 'c'},
{'col_1': 0, 'col_2': 'd'}]
df = pd.DataFrame.from_records(D)
df
| | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |
# We can also import from CSV or Excel
# The download in the except branch is only needed when the file is missing, e.g. on Google Colab
try:
    df = pd.read_csv('data/price.csv')
except FileNotFoundError:  # e.g. when using Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    df = pd.read_csv('data/price.csv')
df
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 931 | 932 | 9297.0 | 12537.0 | 14418.0 | 1174.0 | 1429.0 | Covered | CAT C | 1110 | 5434000 |
| 932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
936 rows × 10 columns
# Basic properties
# Object ~ string
print(df.size, df.shape, df.columns)
df.dtypes
9360 (936, 10) Index(['Observation', 'Dist_Taxi', 'Dist_Market', 'Dist_Hospital', 'Carpet', 'Builtup', 'Parking', 'City_Category', 'Rainfall', 'House_Price'], dtype='object')
Observation        int64
Dist_Taxi        float64
Dist_Market      float64
Dist_Hospital    float64
Carpet           float64
Builtup          float64
Parking           object
City_Category     object
Rainfall           int64
House_Price        int64
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Observation    936 non-null    int64
 1   Dist_Taxi      923 non-null    float64
 2   Dist_Market    923 non-null    float64
 3   Dist_Hospital  935 non-null    float64
 4   Carpet         928 non-null    float64
 5   Builtup        921 non-null    float64
 6   Parking        936 non-null    object
 7   City_Category  936 non-null    object
 8   Rainfall       936 non-null    int64
 9   House_Price    936 non-null    int64
dtypes: float64(5), int64(3), object(2)
memory usage: 73.2+ KB
df.head()
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
# Iterating over a DataFrame
for i, d in df.iterrows():
    print(i, d.Parking, d['Rainfall'])
    if i > 3:
        break
0 Open 530
1 Not Provided 210
2 Not Provided 720
3 Covered 620
4 Not Provided 450
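`iterrows` is convenient but tends to be slow on large DataFrames. Two common alternatives are sketched below on a small stand-in DataFrame; the name `df_demo`, its values, and the column `Rainfall_m` are made up for illustration, with column names mirroring the lesson's data.

```python
import pandas as pd

# Small stand-in DataFrame; the values are made up
df_demo = pd.DataFrame({'Parking': ['Open', 'Covered', 'Open'],
                        'Rainfall': [530, 620, 210]})

# itertuples yields lightweight namedtuples and is usually faster than iterrows
for row in df_demo.itertuples():
    print(row.Index, row.Parking, row.Rainfall)

# Many per-row computations can be written as a single column operation instead
df_demo['Rainfall_m'] = df_demo['Rainfall'] / 1000
print(df_demo)
```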
# Accessing and modifying an element
df.loc[0, 'Rainfall'] = 999999
df.head()
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 999999 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
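Besides a single cell, `loc` also accepts a boolean mask, which allows selecting or updating many rows at once. This is a minimal sketch on made-up data; `df_demo` and the threshold 5000 are assumptions for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'City_Category': ['CAT B', 'CAT A', 'CAT B'],
                        'Rainfall': [530, 720, 999999]})

# Select rows with a boolean mask and columns by label
cat_b = df_demo.loc[df_demo['City_Category'] == 'CAT B', ['Rainfall']]
print(cat_b)

# Conditional update: reset implausibly large values (threshold chosen arbitrarily)
df_demo.loc[df_demo['Rainfall'] > 5000, 'Rainfall'] = 0
print(df_demo)
```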
# Transpose
df.head().transpose()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Observation | 1 | 2 | 3 | 4 | 5 |
| Dist_Taxi | 9796 | 8294 | 11001 | 8301 | 10510 |
| Dist_Market | 5250 | 8186 | 14399 | 11188 | 12629 |
| Dist_Hospital | 10703 | 12694 | 16991 | 12289 | 13921 |
| Carpet | 1659 | 1461 | 1340 | 1451 | 1770 |
| Builtup | 1961 | 1752 | 1609 | 1748 | 2111 |
| Parking | Open | Not Provided | Not Provided | Covered | Not Provided |
| City_Category | CAT B | CAT B | CAT A | CAT B | CAT B |
| Rainfall | 999999 | 210 | 720 | 620 | 450 |
| House_Price | 6649000 | 3982000 | 5401000 | 5373000 | 4662000 |
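The Dataframe outline also lists groups, (un)stack, concat, and search, which are not demonstrated on this page. The following is only a rough sketch on made-up data, with column names borrowed from the lesson's dataset; all variable names are hypothetical.

```python
import pandas as pd

df_a = pd.DataFrame({'City_Category': ['CAT A', 'CAT B'],
                     'Parking': ['Open', 'Open'],
                     'House_Price': [100, 200]})
df_b = pd.DataFrame({'City_Category': ['CAT A', 'CAT B'],
                     'Parking': ['Covered', 'Covered'],
                     'House_Price': [300, 400]})

# Concat: stack the rows of both frames
both = pd.concat([df_a, df_b], ignore_index=True)

# Groups: mean price per city category
mean_price = both.groupby('City_Category')['House_Price'].mean()

# Unstack: move the Parking level of the index into columns; stack reverses it
wide = both.set_index(['City_Category', 'Parking'])['House_Price'].unstack()
long_again = wide.stack()

# Search: rows whose Parking value contains a substring
found = both[both['Parking'].str.contains('Cov')]

print(mean_price)
print(wide)
print(found)
```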
For more on DataFrames, see https://tau-data.id/eda-01/ and https://tau-data.id/eda-02/.
End of Module
Next Lesson ADSP-04: Data Science Teamwork via Python
Code Lesson ADSP-03
The code for this lesson is available at the following link (Google/Gmail login required): Code ADSP-03
At that link you can edit and run the code directly. Further details are in the accompanying video.
Opening the code and the video side by side is strongly recommended for a good learning experience. Feel free to modify the code and try things beyond what the video shows to deepen your understanding. You are also encouraged to consult other references to broaden your knowledge, then discuss them in the forum provided.