Another Step Towards Machine Learning: Pandas by Hrithik Sharma

Hello, I am back with another article. As you know, I am learning Python to build Artificial Intelligence and Machine Learning solutions for existing and new real-world problems. Python has many libraries and built-in functions for the steps involved in building Machine Learning models, such as data collection, data cleaning, data preprocessing, data visualization, and statistical modeling. These steps are used to create models that solve real-world problems in healthcare, business, and various other fields.

Today, every sector generates a huge amount of data. By applying analysis and analytics techniques to that data, we can find trends, patterns, and associations, and use them to make predictions. For Machine Learning, we divide the data into a training set and a test set and build a model that gives a prediction or decision for a challenging real-world problem. To import datasets into the Python environment and prepare them for Machine Learning (for example, removing null values, analyzing the data, and preprocessing it), there is another library, which I am going to talk about in today's article: "Pandas". I am not talking about China's national animal, the panda 🐼 😂 just kidding... but about a library in the Python programming language.
In my previous article, I talked about a library called NumPy. I explained how to install and import NumPy and use it to create n-dimensional arrays. I also showed how different NumPy functions create different kinds of arrays, such as arrays of zeros, arrays of ones, identity matrices of different dimensions, and arrays built from lists. I also mentioned that NumPy contains mathematical functions, such as trigonometric, logarithmic, and exponential functions and many more, which can be used for analytics as well as advanced analytics on data.

In today's article, I am going to talk about another library: Pandas. Pandas is generally used for creating Series and DataFrames, accessing the data within them, and importing data from Excel files, databases, and datasets, which are mostly in .csv format. To use Pandas in the Python environment, we first have to install the library by typing the command "pip install pandas" in the command prompt. The Anaconda distribution comes with Pandas and most other common Machine Learning and Data Science packages pre-installed. To use Pandas, we first import it as shown:
>>> import pandas as pd
Let's discuss some functions of the Pandas library.
Series() function
A Series is a one-dimensional labeled array that can hold any type of data, such as integers, floats, strings, etc. A Series is like a single column of an Excel sheet.
Let us consider a tuple and a list:
>>> cities1=('New York','Las Vegas','Atlanta','Los Angeles','San Diego','San Jose')
>>> cities2=['New York','Las Vegas','Atlanta','Los Angeles','San Diego','San Jose']
Here cities1 is a tuple and cities2 is a list. If we check the type of both objects:
>>> type(cities1)
tuple
>>> type(cities2)
list
Let's call Series() function:
# Series() function on a tuple
>>> pd.Series(cities1)
0 New York
1 Las Vegas
2 Atlanta
3 Los Angeles
4 San Diego
5 San Jose
dtype: object
# Series() function on a list
>>> pd.Series(cities2)
0 New York
1 Las Vegas
2 Atlanta
3 Los Angeles
4 San Diego
5 San Jose
dtype: object
If we check the type of the resulting objects:
>>> type(pd.Series(cities1))
pandas.core.series.Series
>>> type(pd.Series(cities2))
pandas.core.series.Series
This is how we can convert a tuple or a list into a Series.
We can also create a Series manually, with our own index labels:
>>> s1=pd.Series(data=['Python','Java','C','C++','C#','Ruby','Perl'],
index=[1,2,3,4,5,6,7])
>>> s1
1 Python
2 Java
3 C
4 C++
5 C#
6 Ruby
7 Perl
dtype: object
>>> type(s1)
pandas.core.series.Series
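As a small sketch separate from the session above, here are the two parts that make up a Series: its index labels and its values. The names `langs` and `from_dict` are my own and used only for illustration. The last lines assume we start from a dict instead of a list, in which case the keys become the labels.

```python
import pandas as pd

# Hypothetical example data; the name `langs` is not from the session above.
langs = pd.Series(data=['Python', 'Java', 'C'], index=[1, 2, 3])

# A Series is made of two parts: the index labels and the values.
print(list(langs.index))   # the labels we passed in
print(list(langs.values))  # the data itself

# A dict also works: its keys become the index labels.
from_dict = pd.Series({'a': 10, 'b': 20})
print(from_dict['b'])
```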
Accessing data in Series:
>>> s1
1 Python
2 Java
3 C
4 C++
5 C#
6 Ruby
7 Perl
dtype: object
>>> s1[1]
'Python'
>>> s1[4]
'C++'
# an integer slice inside [] selects by position, not by label
>>> s1[1:5]
2 Java
3 C
4 C++
5 C#
dtype: object
>>> s1[:5]
1 Python
2 Java
3 C
4 C++
5 C#
dtype: object
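Because the index of s1 is made of integers, plain square brackets can be confusing: a single number is looked up as a label, while a slice is taken by position. A minimal sketch of a more explicit way, assuming we use the .loc (labels) and .iloc (positions) accessors of a Series:

```python
import pandas as pd

s1 = pd.Series(data=['Python', 'Java', 'C', 'C++', 'C#', 'Ruby', 'Perl'],
               index=[1, 2, 3, 4, 5, 6, 7])

# .loc always uses index labels; a label slice includes both endpoints.
by_label = s1.loc[1:3]      # labels 1, 2, 3

# .iloc always uses integer positions; the end position is excluded.
by_position = s1.iloc[1:3]  # positions 1 and 2, i.e. labels 2 and 3

print(list(by_label))
print(list(by_position))

# The same number means different rows for the two accessors.
print(s1.loc[4], s1.iloc[4])
```

Writing .loc or .iloc makes it clear to the reader which kind of lookup is intended.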
Let's talk about another function DataFrame()
DataFrame() function
>>> df=pd.DataFrame(data={'ID':[101,102,103,104,105],
'Name':['Hrithik','Jatin','Akhil','Kapil','Shubham'],
'Job':['ML Engineer','Android Developer','Digital Marketer','Software Engineer','Data Analyst'],
'Location':['Cupertino','Los Angeles','New York','San Jose','Detroit'],
'Company':['Apple','Google','HTC','Samsung','HP'],
'Salary':[132000,120000,121000,130000,110000],
'Increment':[500,300,250,350,400]},
index=[1,2,3,4,5])
>>> df
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
A Data Frame is a two-dimensional, size-mutable, heterogeneous tabular data structure: the data is arranged in rows and columns, as created above, and each column can hold a different type of data. The Data Frame df above has 5 rows and 7 columns. A Data Frame is like an Excel spreadsheet with data in rows and columns.
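To check the size and the column types of a Data Frame ourselves, a short sketch could look like this. It uses a reduced, three-column version of the table above to keep the example short.

```python
import pandas as pd

# A reduced version of the table above (three of its columns).
df = pd.DataFrame(data={'ID': [101, 102, 103, 104, 105],
                        'Name': ['Hrithik', 'Jatin', 'Akhil', 'Kapil', 'Shubham'],
                        'Salary': [132000, 120000, 121000, 130000, 110000]},
                  index=[1, 2, 3, 4, 5])

print(df.shape)   # (number of rows, number of columns)
print(len(df))    # number of rows only
print(df.dtypes)  # each column has its own data type
```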
Addition of a new column in a Data Frame:
>>> df['Total'] = df['Salary'] + df['Increment']
>>> df
ID Name Job Location Company Salary Increment Total
1 101 Hrithik ML Engineer Cupertino Apple 132000 500 132500
2 102 Jatin Android Developer Los Angeles Google 120000 300 120300
3 103 Akhil Digital Marketer New York HTC 121000 250 121250
4 104 Kapil Software Engineer San Jose Samsung 130000 350 130350
5 105 Shubham Data Analyst Detroit HP 110000 400 110400
>>> df['Name']
1 Hrithik
2 Jatin
3 Akhil
4 Kapil
5 Shubham
Name: Name, dtype: object
>>> type(df['Name'])
pandas.core.series.Series
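Assigning to df['Total'] changes df itself. A possible alternative, sketched here on a hypothetical two-row table, is the assign() method, which returns a new Data Frame and leaves the original untouched. The name `with_total` is my own.

```python
import pandas as pd

# Hypothetical two-row table with the same column names as above.
df = pd.DataFrame({'Salary': [132000, 120000], 'Increment': [500, 300]},
                  index=[1, 2])

# assign() returns a new DataFrame with the extra column; df is unchanged.
with_total = df.assign(Total=df['Salary'] + df['Increment'])

print(list(with_total['Total']))
print('Total' in df.columns)
```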
Deleting a column in Data Frame:
>>> del(df['Total'])
>>> df
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
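The del statement removes the column from df permanently. If we want to keep the original, one option is the drop() method, which returns a copy without the column. A minimal sketch on a hypothetical two-row table (the name `trimmed` is my own):

```python
import pandas as pd

# Hypothetical two-row table with a 'Total' column already present.
df = pd.DataFrame({'Salary': [132000, 120000],
                   'Increment': [500, 300],
                   'Total': [132500, 120300]},
                  index=[1, 2])

# drop(columns=...) returns a copy without the column; df keeps all three.
trimmed = df.drop(columns=['Total'])

print(list(trimmed.columns))
print(list(df.columns))
```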
To check all the columns in a Data Frame:
>>> df.columns
Index(['ID', 'Name', 'Job', 'Location', 'Company', 'Salary', 'Increment'], dtype='object')
To view selected columns of a Data Frame side by side:
>>> df
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
# Relation between Name and Location
>>> df[['Name','Location']]
Name Location
1 Hrithik Cupertino
2 Jatin Los Angeles
3 Akhil New York
4 Kapil San Jose
5 Shubham Detroit
# Relation between Name, Salary and Increment
>>> df[['Name','Salary','Increment']]
Name Salary Increment
1 Hrithik 132000 500
2 Jatin 120000 300
3 Akhil 121000 250
4 Kapil 130000 350
5 Shubham 110000 400
We write the name of the Data Frame followed by square brackets containing a list of the columns we want to see; this returns a smaller Data Frame with only those columns. Passing a single column name instead, as in df['Name'] above, returns that column as a Series.
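The difference between one pair of brackets and two is easy to miss, so here is a short sketch on a hypothetical two-row table. The names `one_column` and `sub_table` are my own.

```python
import pandas as pd

# Hypothetical two-row table with two of the columns used above.
df = pd.DataFrame({'Name': ['Hrithik', 'Jatin'],
                   'Location': ['Cupertino', 'Los Angeles']},
                  index=[1, 2])

one_column = df['Name']    # a single label gives a Series
sub_table = df[['Name']]   # a list of labels gives a DataFrame, even with one column

print(type(one_column).__name__)
print(type(sub_table).__name__)
```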
Now some mathematical functions on Data Frames. Methods such as sum(), min(), max(), and mean() are built into Pandas, so importing NumPy is optional here. They apply to numeric columns, which in this Data Frame are ID, Salary, and Increment. Let's see:
>>> df
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
>>> import numpy as np  # optional; the methods below are part of Pandas
>>> df['Salary'].sum()
613000
>>> df['Salary'].min()
110000
>>> df['Salary'].max()
132000
>>> df['Salary'].mean()
122600.0
>>> df.describe()
ID Salary Increment
count 5.000000 5.000000 5.00000
mean 103.000000 122600.000000 360.00000
std 1.581139 8820.430828 96.17692
min 101.000000 110000.000000 250.00000
25% 102.000000 120000.000000 300.00000
50% 103.000000 121000.000000 350.00000
75% 104.000000 130000.000000 400.00000
max 105.000000 132000.000000 500.00000
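The same methods also work on several columns at once. As a sketch, using three columns of the table above, we can take the mean of two columns together and ask which row holds the highest salary with idxmax(). The names `means` and `top` are my own.

```python
import pandas as pd

# Three columns of the table above.
df = pd.DataFrame({'Name': ['Hrithik', 'Jatin', 'Akhil', 'Kapil', 'Shubham'],
                   'Salary': [132000, 120000, 121000, 130000, 110000],
                   'Increment': [500, 300, 250, 350, 400]},
                  index=[1, 2, 3, 4, 5])

# Aggregate several numeric columns at once; the result is a Series
# indexed by column name.
means = df[['Salary', 'Increment']].mean()
print(means['Salary'], means['Increment'])

# idxmax() returns the index label of the row holding the largest value.
top = df['Salary'].idxmax()
print(top, df.loc[top, 'Name'])
```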
Above we saw how to access the columns of a Data Frame. Now let's see how to access its rows:
>>> df
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
>>> df.iloc[0]
ID 101
Name Hrithik
Job ML Engineer
Location Cupertino
Company Apple
Salary 132000
Increment 500
Name: 1, dtype: object
>>> df.iloc[2]
ID 103
Name Akhil
Job Digital Marketer
Location New York
Company HTC
Salary 121000
Increment 250
Name: 3, dtype: object
>>> type(df.iloc[1])
pandas.core.series.Series
>>> df.iloc[1:5]
ID Name Job Location Company Salary Increment
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
4 104 Kapil Software Engineer San Jose Samsung 130000 350
5 105 Shubham Data Analyst Detroit HP 110000 400
>>> df.loc[0:3]
ID Name Job Location Company Salary Increment
1 101 Hrithik ML Engineer Cupertino Apple 132000 500
2 102 Jatin Android Developer Los Angeles Google 120000 300
3 103 Akhil Digital Marketer New York HTC 121000 250
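iloc[] selects rows by integer position, starting from 0, and a slice excludes its end position. loc[] selects rows by index label, and a label slice includes its end label; that is why df.loc[0:3] above returns the rows labelled 1, 2, and 3. Both return a Pandas Series when one row is selected and a Pandas Data Frame when multiple rows are selected or a slice is used. Asking loc[] for a single label that does not exist in the index raises a KeyError.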
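To compare the two side by side, here is a small sketch using two columns of the table above. All variable names are my own.

```python
import pandas as pd

# Two columns of the table above, with the same index labels 1..5.
df = pd.DataFrame({'Name': ['Hrithik', 'Jatin', 'Akhil', 'Kapil', 'Shubham'],
                   'Salary': [132000, 120000, 121000, 130000, 110000]},
                  index=[1, 2, 3, 4, 5])

first_by_position = df.iloc[0]  # position 0 is the row labelled 1
first_by_label = df.loc[1]      # label 1 is the same row

label_slice = df.loc[2:4]       # labels 2, 3, 4 (end label included)
position_slice = df.iloc[2:4]   # positions 2 and 3, i.e. labels 3 and 4

print(list(label_slice.index))
print(list(position_slice.index))

# loc also accepts a row label and a column label together.
print(df.loc[3, 'Name'])
```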
Let's see how to load datasets into the Python environment:
>>> z=pd.read_csv('C:/Users/sharm/Downloads/DSAT/files/Employee.csv')
>>> z
This is how datasets are loaded into the Python environment. The dataset becomes a Data Frame, possibly with a huge number of rows and columns, and all the same functions can be applied to it.
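Since the Employee.csv file is not available here, the sketch below uses a small made-up CSV text in place of a file on disk. The column names and values in it are assumptions for illustration only. It shows read_csv() together with two quick checks that are useful right after loading: head() and a count of missing values.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """ID,Name,Salary
101,Hrithik,132000
102,Jatin,120000
"""

# read_csv accepts a path or a file-like object; StringIO keeps this self-contained.
employees = pd.read_csv(io.StringIO(csv_text))

print(employees.head())          # first rows of the table
print(employees.isnull().sum())  # count of missing values per column
```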
In this article, I discussed some important features and functions of the Pandas library, but there are many more functions in Pandas that I have yet to explore.
That is all for this article on learning Python for Machine Learning and Artificial Intelligence. As I learn more Python concepts, I'll keep posting. So, there is more to come as I keep learning.
Bye-Bye, see you in my next article. Until then, enjoy Machine Learning and PEACE OUT.