Another Step Towards Machine Learning: Pandas by Hrithik Sharma


Hello, I am back with another article. As you know, I am learning Python for Artificial Intelligence and Machine Learning, and using it to solve existing and new real-world problems. Python offers many libraries and built-in functions for the steps involved in building Machine Learning models: data collection, data cleaning, data preprocessing, data visualization, and statistical modeling. These steps are used to create models that solve real-world problems in healthcare, business, and various other fields.

Today there is a huge amount of data in every sector. By applying analysis and analytics techniques to that data we can find trends, patterns, and associations, and use them to make predictions. For Machine Learning, we divide the data into a training set and a test set, train a model, and use its predictions or decisions to help solve challenging real-world problems. To import datasets into the Python environment and prepare them for Machine Learning (removing null values, analysing the data, and preprocessing it), there is another library, which is the subject of today's article: "Pandas". I am not talking about China's national animal, the panda 🐼 😂 just kidding... but about a library in the Python programming language.

In my previous article, I talked about a library called NumPy. I explained how to install and import NumPy and how to use it to create n-dimensional arrays. I also showed how different NumPy functions create different kinds of arrays, such as arrays of zeros, arrays of ones, identity matrices of different dimensions, and arrays built from lists. Finally, I mentioned that NumPy contains mathematical functions for analytics and advanced analytics, such as trigonometric, logarithmic, and exponential functions, and many more.

In today's article, I am going to talk about another library, Pandas. Pandas is generally used for creating Series and DataFrames, accessing the data inside them, and importing data from Excel files, databases, and datasets, which are mostly in .csv format. To use Pandas, we first have to install the library by typing the command "pip install pandas" in the command prompt. The Anaconda distribution already ships with Pandas and most other packages used for Machine Learning and Data Science, so no separate installation is needed there. To use Pandas, we first import it as shown:

>>> import pandas as pd


Let's discuss some functions of the Pandas library.

Series() function

A Series is a one-dimensional labeled array which can hold any type of data, such as integers, floats, strings, etc. A Series is like a single column of an Excel sheet.

Let us consider a tuple and a list:

>>> cities1=('New York','Las Vegas','Atlanta','Los Angeles','San Diego','San Jose')

>>> cities2=['New York','Las Vegas','Atlanta','Los Angeles','San Diego','San Jose']


Here cities1 is a tuple and cities2 is a list. If we check the type of both objects:

>>> type(cities1)

    tuple

>>> type(cities2)

    list


Let's call Series() function:

# Series() function on a tuple 

>>> pd.Series(cities1) 


    0       New York
    1      Las Vegas
    2        Atlanta
    3    Los Angeles
    4      San Diego
    5       San Jose
    dtype: object

# Series() function on a list

>>> pd.Series(cities2)


    0       New York
    1      Las Vegas
    2        Atlanta
    3    Los Angeles
    4      San Diego
    5       San Jose
    dtype: object


If we check the type of the resulting objects:

>>> type(pd.Series(cities1))

    pandas.core.series.Series

>>> type(pd.Series(cities2))

    pandas.core.series.Series


This is how we can convert a tuple or a list into a Series.

We can also create a Series manually, with our own index:


>>> s1=pd.Series(data=['Python','Java','C','C++','C#','Ruby','Perl'],
                 index=[1,2,3,4,5,6,7])

>>> s1

    1    Python
    2      Java
    3         C
    4       C++
    5        C#
    6      Ruby
    7      Perl
    dtype: object     

>>> type(s1)

    pandas.core.series.Series
                            	

Accessing data in Series:

>>> s1


    1    Python
    2      Java
    3         C
    4       C++
    5        C#
    6      Ruby
    7      Perl
    dtype: object

>>> s1[1]

    'Python'

>>> s1[4]

    'C++'

>>> s1[1:5]

    2    Java
    3       C
    4     C++
    5      C#
    dtype: object

>>> s1[:5]

    1    Python
    2      Java
    3         C
    4       C++
    5        C#
    dtype: object
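One thing worth noticing in the outputs above: s1[1] looks up the index label 1, while the slice s1[1:5] counts positions from 0, so it starts at 'Java'. The sketch below rebuilds the same Series and uses the explicit accessors .loc (labels) and .iloc (positions) to make that difference visible. The variable names are my own and only for illustration.

```python
import pandas as pd

# Same Series as above: the index labels are 1..7 instead of the default 0..6.
s1 = pd.Series(data=['Python', 'Java', 'C', 'C++', 'C#', 'Ruby', 'Perl'],
               index=[1, 2, 3, 4, 5, 6, 7])

by_label = s1.loc[1]            # element whose index label is 1
by_position = s1.iloc[1]        # second element, counting positions from 0
label_slice = s1.loc[1:5]       # labels 1 through 5, end label included
position_slice = s1.iloc[1:5]   # positions 1 through 4, end position excluded

print(by_label, by_position)
print(list(label_slice.index))
print(list(position_slice.index))
```

Writing .loc or .iloc explicitly makes it clear to the reader whether a number is meant as a label or as a position.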
    


Let's talk about another function, DataFrame().

DataFrame() function

>>> df=pd.DataFrame(data={'ID':[101,102,103,104,105],
                          'Name':['Hrithik','Jatin','Akhil','Kapil','Shubham'],
                          'Job':['ML Engineer','Android Developer','Digital Marketer','Software Engineer','Data Analyst'],
                          'Location':['Cupertino','Los Angeles','New York','San Jose','Detroit'],
                          'Company':['Apple','Google','HTC','Samsung','HP'],
                          'Salary':[132000,120000,121000,130000,110000],
                          'Increment':[500,300,250,350,400]},
                    index=[1,2,3,4,5])

>>> df


    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400


A Data Frame is a two-dimensional, size-mutable, heterogeneous tabular data structure: data is arranged in rows and columns, and each column can hold a different type of data. The Data Frame df created above has 5 rows and 7 columns. A Data Frame is like an Excel spreadsheet.
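The three properties in that definition can be checked directly. The sketch below uses a smaller Data Frame than the one above (three rows and three columns, chosen only to keep the example short) and looks at its shape, at whether each column is numeric, and at how the shape changes when a column is added.

```python
import pandas as pd

# A reduced version of the Data Frame above, only for illustration.
df = pd.DataFrame(data={'ID': [101, 102, 103],
                        'Name': ['Hrithik', 'Jatin', 'Akhil'],
                        'Salary': [132000, 120000, 121000]},
                  index=[1, 2, 3])

shape_before = df.shape   # (number of rows, number of columns)

# Heterogeneous: each column has its own data type.
salary_is_numeric = pd.api.types.is_numeric_dtype(df['Salary'])
name_is_numeric = pd.api.types.is_numeric_dtype(df['Name'])

# Size-mutable: adding a column changes the shape of the same object.
df['Increment'] = [500, 300, 250]
shape_after = df.shape

print(shape_before, shape_after)
print(salary_is_numeric, name_is_numeric)
```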

Addition of a new column in a Data Frame:

>>> df['Total'] = df['Salary'] + df['Increment']

>>> df

    
    ID     Name                Job     Location  Company  Salary  Increment   Total
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500  132500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300  120300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250  121250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350  130350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400  110400

>>> df['Name']


    1    Hrithik
    2      Jatin
    3      Akhil
    4      Kapil
    5    Shubham
    Name: Name, dtype: object

>>> type(df['Name'])

    pandas.core.series.Series
    

Deleting a column in Data Frame:

>>> del df['Total']
    
>>> df

    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400
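Deleting with del changes df itself. As an alternative that I am adding here (it is not part of the steps above), the drop() method returns a new Data Frame without the column and leaves the original untouched. A minimal sketch on a two-row Data Frame made up for illustration:

```python
import pandas as pd

# Small two-row Data Frame, only for illustration.
df = pd.DataFrame(data={'Salary': [132000, 120000],
                        'Increment': [500, 300]},
                  index=[1, 2])

# Add a column computed from two existing columns.
df['Total'] = df['Salary'] + df['Increment']

# drop() returns a new Data Frame; df still contains 'Total'.
without_total = df.drop(columns=['Total'])

print(list(df.columns))
print(list(without_total.columns))
```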
    
   

To check all the columns in a Data Frame:

>>> df.columns

    Index(['ID', 'Name', 'Job', 'Location', 'Company', 'Salary', 'Increment'], dtype='object')


To view several columns of a Data Frame side by side:

>>> df


    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400

# Select the Name and Location columns

>>> df[['Name','Location']]

      Name     Location
1  Hrithik    Cupertino
2    Jatin  Los Angeles
3    Akhil     New York
4    Kapil     San Jose
5  Shubham      Detroit

# Select the Name, Salary and Increment columns

>>> df[['Name','Salary','Increment']]

      Name  Salary  Increment
1  Hrithik  132000        500
2    Jatin  120000        300
3    Akhil  121000        250
4    Kapil  130000        350
5  Shubham  110000        400 
     

To select columns, we write the name of the Data Frame followed by square brackets containing a list of the columns we want to see; the result is a new Data Frame. Passing a single column name instead of a list, as in df['Name'] earlier, returns that column as a Series.
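The difference between passing one column name and passing a list of names can be checked with isinstance(). This is a small sketch on a cut-down Data Frame; the variable names are mine.

```python
import pandas as pd

# Cut-down version of the Data Frame above, only for illustration.
df = pd.DataFrame(data={'Name': ['Hrithik', 'Jatin', 'Akhil'],
                        'Location': ['Cupertino', 'Los Angeles', 'New York'],
                        'Salary': [132000, 120000, 121000]},
                  index=[1, 2, 3])

one_column = df['Name']                  # single name -> Series
one_column_frame = df[['Name']]          # list with one name -> Data Frame
two_columns = df[['Name', 'Location']]   # list of names -> Data Frame

print(type(one_column))
print(type(one_column_frame))
print(list(two_columns.columns))
```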

Now some mathematical functions on Data Frames. Functions such as sum(), min(), max(), and mean() are methods of Pandas itself, so importing NumPy is not strictly required for them, and they can be applied to numeric columns (integers or floats). In this Data Frame, ID, Salary, and Increment are the numeric columns. Let's see:

>>> df

     
    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400
    
   

>>> import numpy as np   # optional: the methods used below belong to Pandas

>>> df['Salary'].sum()

    613000

>>> df['Salary'].min()

    110000

>>> df['Salary'].max()

    132000

>>> df['Salary'].mean()

    122600.0

>>> df.describe() 

                   ID         Salary  Increment
    count    5.000000       5.000000    5.00000
    mean   103.000000  122600.000000  360.00000
    std      1.581139    8820.430828   96.17692
    min    101.000000  110000.000000  250.00000
    25%    102.000000  120000.000000  300.00000
    50%    103.000000  121000.000000  350.00000
    75%    104.000000  130000.000000  400.00000
    max    105.000000  132000.000000  500.00000
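The same statistics can be collected for several columns at once. The sketch below is my own illustration: it imports only Pandas, picks out the numeric columns with select_dtypes(), and uses agg() to compute a few statistics together. The Data Frame reuses the Name, Salary and Increment values from above.

```python
import pandas as pd

# Name, Salary and Increment values from the Data Frame above.
df = pd.DataFrame(data={'Name': ['Hrithik', 'Jatin', 'Akhil', 'Kapil', 'Shubham'],
                        'Salary': [132000, 120000, 121000, 130000, 110000],
                        'Increment': [500, 300, 250, 350, 400]},
                  index=[1, 2, 3, 4, 5])

total = df['Salary'].sum()
average = df['Salary'].mean()

# Keep only the numeric columns, then compute several statistics together.
numeric = df.select_dtypes(include='number')
summary = numeric.agg(['min', 'max', 'mean'])

print(total, average)
print(summary)
```

Here summary is itself a Data Frame, with one row per statistic and one column per numeric column.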
   

Above we saw how to access the columns of a Data Frame. Now let's see how to access its rows:

>>> df


    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400

>>> df.iloc[0]



ID                   101
Name             Hrithik
Job          ML Engineer
Location       Cupertino
Company            Apple
Salary            132000
Increment            500
Name: 1, dtype: object

>>> df.iloc[2]


ID                        103
Name                    Akhil
Job          Digital Marketer
Location             New York
Company                   HTC
Salary                 121000
Increment                 250
Name: 3, dtype: object

>>> type(df.iloc[1])
    
    pandas.core.series.Series

>>> df.iloc[1:5]

    
    ID     Name                Job     Location  Company  Salary  Increment
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250
4  104    Kapil  Software Engineer     San Jose  Samsung  130000        350
5  105  Shubham       Data Analyst      Detroit       HP  110000        400

>>> df.loc[0:3]

    ID     Name                Job     Location  Company  Salary  Increment
1  101  Hrithik        ML Engineer    Cupertino    Apple  132000        500
2  102    Jatin  Android Developer  Los Angeles   Google  120000        300
3  103    Akhil   Digital Marketer     New York      HTC  121000        250



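iloc[] selects rows by integer position, starting from 0. It returns a Pandas Series when one row is selected and a Pandas Data Frame when multiple rows are selected, for example with a slice; as with Python lists, the end position of the slice is excluded. loc[] selects rows by index label rather than by position, and a label slice includes its end label. That is why df.loc[0:3] returns the rows labeled 1, 2, and 3 here: the labels of df start at 1, so there is no row labeled 0.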
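To see position-based and label-based selection side by side, here is a small sketch on a two-column version of the Data Frame (shortened only to keep the output small). The variable names are my own.

```python
import pandas as pd

# Two columns of the Data Frame above, with the same index labels 1..5.
df = pd.DataFrame(data={'ID': [101, 102, 103, 104, 105],
                        'Name': ['Hrithik', 'Jatin', 'Akhil', 'Kapil', 'Shubham']},
                  index=[1, 2, 3, 4, 5])

first_row = df.iloc[0]            # position 0 -> the row labeled 1, as a Series
rows_by_position = df.iloc[1:3]   # positions 1 and 2 -> rows labeled 2 and 3
rows_by_label = df.loc[1:3]       # labels 1, 2 and 3 (end label included)

print(first_row['Name'])
print(list(rows_by_position.index))
print(list(rows_by_label.index))
```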

Let's see how to load a dataset into the Python environment:

>>> z=pd.read_csv('C:/Users/sharm/Downloads/DSAT/files/Employee.csv')

>>> z


This is how a dataset is loaded into the Python environment. read_csv() returns a Data Frame, which may have a large number of rows and columns, and all the functions shown above can be applied to it.
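Since the Employee.csv file above lives on my machine, here is a self-contained sketch that reads CSV text from memory instead. The three-row CSV content and the variable names are made up for illustration; the missing salary is there to show the null-value check and removal mentioned at the start of the article.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk; one Salary is missing.
csv_text = """ID,Name,Salary
101,Hrithik,132000
102,Jatin,
103,Akhil,121000
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
employees = pd.read_csv(io.StringIO(csv_text))

missing_per_column = employees.isnull().sum()   # count of null values per column
cleaned = employees.dropna()                    # drop rows containing null values

print(employees.shape)
print(missing_per_column)
print(cleaned)
```

With a real file, the only change would be passing the file path to read_csv() instead of the StringIO object.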

In this article I discussed some important features and functions of the Pandas library, but there are many more functions in Pandas that I have yet to explore.

That is all for this article on learning Python for Machine Learning and Artificial Intelligence. As I learn more Python concepts I will keep posting, so there is more to come.

Bye-bye, see you in my next article. Until then, enjoy Machine Learning and PEACE OUT.
