Pandas: A Beginner’s Guide

moo
Dec 30, 2020

If you’ve been using Python for a while, you’ve probably heard of Pandas, the Swiss Army knife of data science.

At first you might wonder: what can I do with Pandas, and why Pandas? The answer is that it covers a huge range of tasks when it comes to parsing and extracting information from data you already have, along with other functionality I’ll briefly get into in a bit.

First, let’s properly define what Pandas is.

Pandas is an open source library that lets you load and process large amounts of data in different formats, most commonly CSV files.

Still confused? Here’s a scenario for you.

Imagine you were hired in a sales or marketing role and asked to write a report with some specific “insights” from a data file: every client from CA, US, say, or the most ordered items on an e-commerce website, or a visualization of some of the data. Whatever the task is, it involves parsing and “filtering” data, and this is where Pandas comes into the picture.

What can I do with it?

Aside from what I have mentioned above, you can do many other things, including but not limited to:

  • Visualizing data: Combining Pandas with matplotlib lets you create plots and charts of your data for any kind of app with very little code
  • Writing to CSV files: CSV files are your best friend when it comes to data science, and Pandas lets you create new CSV files from your own data for further processing, or for later presentation as charts or in a web app
  • Parsing HTML easily: I know there are many other great options for parsing HTML, but Pandas can parse <table> tags pretty well and turn them into dataframes
  • Plotting: There are many options and kinds of plots to help you understand your data better
  • Writing to an SQL table: A great feature I recently discovered and started using is the ability to write a dataframe into an SQL table, and it’s totally up to you to decide what to do with it from there
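As a rough sketch of the CSV and SQL items above, the snippet below writes a tiny, made-up dataframe to CSV text and to an SQL table. The column names, the "orders" table name and the in-memory SQLite database are assumptions chosen purely for illustration; in practice you would pass a file path to to_csv and a connection to your own database.

```python
import io
import sqlite3

import pandas as pd

# A tiny, made-up dataset used only for illustration
orders = pd.DataFrame({"item": ["pen", "book", "pen"],
                       "state": ["CA", "NY", "CA"]})

# Write the dataframe as CSV (to an in-memory buffer here;
# pass a file path such as "orders.csv" to write to disk instead)
buffer = io.StringIO()
orders.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Write the dataframe into an SQL table in a throwaway SQLite database
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)

# Read part of it back with a plain SQL query
from_sql = pd.read_sql("SELECT * FROM orders WHERE state = 'CA'", conn)
conn.close()

print(csv_text)
print(from_sql)
```

Passing index=False keeps the row index out of the CSV file and the SQL table, which is usually what you want when the index is just the default row numbers.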

Now it’s time to get down to business and code some stuff!

Installing Pandas

It’s as easy as running pip install pandas, which also installs a few dependencies such as NumPy.

Firing up our Jupyter Notebook

When it comes to data science, Jupyter Notebook is often the preferred tool: it lets you mix code and markdown in one document, export it easily, and work more flexibly in general.

Importing Pandas and loading our dataset

For the sake of this tutorial I’m going to use a dataset from the FiveThirtyEight data repo on GitHub, and I’ll also show you other ways to load your data.

import pandas as pd

Here we import the library and reference it as “pd”, which is the common practice.

df = pd.read_csv("data.csv")  # placeholder: use the link to the CSV file or its local path

Here you can either paste the link to the CSV file or the location where you downloaded it.
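Besides a link or a file path, pd.read_csv also accepts a file-like object, which is handy for trying things out without downloading anything. The sketch below assumes a small, made-up CSV string; the names and ages are illustrative only.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = "name,age\nBob,22\nFred,24\nPeter,26\n"

# read_csv treats the StringIO buffer like an open file
people = pd.read_csv(io.StringIO(csv_text))

print(people)
```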

Quick Pandas lingo

Dataframe: a table-like data structure whose columns are known as Series.

Here is an example of how to construct a dataframe with Pandas:

df = pd.DataFrame({"name": ['Bob', 'Fred', 'Peter'], "age": [22, 24, 26]})

To construct a dataframe this way, you pass in a dictionary where the keys are the column names and the values are lists of the column values, all of the same length.
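To make the dataframe and Series relationship concrete, here is a minimal sketch that builds the same small dataframe and pulls one column out of it. The variable names are arbitrary.

```python
import pandas as pd

# Keys become column names, list values become the column contents
people = pd.DataFrame({"name": ["Bob", "Fred", "Peter"],
                       "age": [22, 24, 26]})

# Selecting a single column gives back a Series
ages = people["age"]
is_series = isinstance(ages, pd.Series)

# A Series has its own methods, for example mean()
mean_age = ages.mean()

print(is_series, mean_age)
```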

Exploring your data

Now it’s time to get to know your data and find out what you’re working with. Here are a few useful methods and properties:

  1. df.head() returns the first five rows of the dataframe by default.
  2. df.loc[] and df.iloc[] are essential for locating certain rows and columns within the dataset. The difference is that df.loc[] is label-based, which means you pass row labels and column names, while df.iloc[] is position-based, which means you pass the integer positions of the rows and columns you want.
  3. df.describe() returns summary statistics about the numeric columns of the dataframe, such as count, mean, min and max.
  4. df.value_counts() returns each distinct row and how often it occurs. Called on a single column, as in df['hospitals'].value_counts(), it counts that column’s distinct values.
  5. df.columns is a property that contains the column names only, and is useful when you want to extract certain pieces of information.
  6. df.shape is a property that returns the dimensions of the dataframe as (rows, columns).
  7. df.plot() is used to plot charts from the data.

These are the most useful methods and properties for now. Next we’ll cover them in more detail.
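As a quick sketch of the exploration tools listed above, the snippet below runs them on a small, made-up dataframe so the results are easy to check by eye. The "city" and "beds" columns are assumptions for illustration and are not part of the tutorial dataset.

```python
import pandas as pd

# Small made-up dataframe: six rows, one text column, one numeric column
beds = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"],
                     "beds": [10, 20, 30, 40, 50, 60]})

first_rows = beds.head()            # first five rows by default
dims = beds.shape                   # (number of rows, number of columns)
names = list(beds.columns)          # column names only
stats = beds.describe()             # summary statistics for numeric columns
mean_beds = stats.loc["mean", "beds"]
city_counts = beds["city"].value_counts()  # how often each city occurs

print(dims, names, mean_beds)
print(city_counts)
```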

First, let’s find out what columns and rows we have and what values they hold.

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/covid-geography/mmsa-icu-beds.csv")
df.head()

Next, let’s see how to locate certain columns.

First, let’s find out what our columns are using the property df.columns

Then we’ll return some of those columns for all the records.

df.loc[:,['MMSA','total_percent_at_risk','icu_beds','hospitals']]

The first argument is the range of rows to return. Since we specified neither a start nor an end, the bare : means all rows. The second argument is a Python list of the desired columns.

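If we wanted only the first 10 records, we would write

df.loc[:9,['MMSA','total_percent_at_risk','icu_beds','hospitals']]

Note that df.loc[] slices by label and includes the end label, so with the default integer index :9 returns rows 0 through 9, while :10 would return 11 rows.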
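The difference between df.loc[] and df.iloc[] is easiest to see on a dataframe whose row labels are not plain numbers. The sketch below assumes a made-up dataframe with letter labels; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up values with letter row labels instead of the default 0, 1, 2, ...
sample = pd.DataFrame({"icu_beds": [5, 8, 13, 21],
                       "hospitals": [1, 2, 3, 4]},
                      index=["a", "b", "c", "d"])

# Label-based: the slice "a":"c" includes the end label "c"
by_label = sample.loc["a":"c", ["icu_beds"]]

# Position-based: the slice 0:2 excludes position 2, like a Python list slice
by_position = sample.iloc[0:2, [0]]

print(by_label)
print(by_position)
```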

Additionally, you can add some logic to return only certain rows. For example, let’s find out which MMSAs have more than 10 hospitals.

df.loc[df.hospitals>10]
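Under the hood, a comparison such as df.hospitals>10 produces a Series of True/False values, and df.loc[] keeps the rows where it is True. Here is a minimal sketch on made-up numbers, which also shows how two conditions could be combined with & (each condition needs its own parentheses).

```python
import pandas as pd

# Made-up rows reusing column names from the dataset
areas = pd.DataFrame({"MMSA": ["X", "Y", "Z"],
                      "hospitals": [4, 12, 30],
                      "icu_beds": [50, 90, 400]})

# The comparison yields one boolean per row
mask = areas["hospitals"] > 10

# Keep matching rows and, optionally, only some columns
many_hospitals = areas.loc[mask, ["MMSA", "hospitals"]]

# Combine conditions with & (and) or | (or)
big_areas = areas.loc[(areas["hospitals"] > 10) & (areas["icu_beds"] > 100)]

print(many_hospitals)
print(big_areas)
```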

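You can access individual columns in two ways:

  1. Square bracket syntax, as in df['icu_beds']
  2. Dot syntax, as in df.icu_beds, which only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute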
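Both styles return the same column, but the square bracket syntax is the more general of the two. The sketch below uses a made-up dataframe with two deliberately awkward column names ("total beds" and "shape") to show where the dot syntax falls short.

```python
import pandas as pd

# Made-up dataframe; "total beds" has a space, "shape" clashes with df.shape
wards = pd.DataFrame({"icu_beds": [5, 8],
                      "total beds": [50, 80],
                      "shape": ["x", "y"]})

# For a simple name, both styles give the same Series
same = wards["icu_beds"].equals(wards.icu_beds)

# A name with a space can only be reached with square brackets
total = wards["total beds"]

# The dot syntax resolves to the dataframe attribute, not the column
dims = wards.shape            # the (rows, columns) tuple
shape_col = wards["shape"]    # the actual column

print(same, dims)
```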

Plotting with pandas

import pandas as pd
url = "" #URL TO THE DATASET CSV FILE OR FILE PATH
df = pd.read_csv(url)
df.plot(x='MMSA',y='icu_beds',kind='bar')

Parameters:

  • x: the column to use for the x axis, so check the name carefully
  • y: the column to use for the y axis, so check the name carefully
  • kind: the kind of chart to plot. Supported kinds are:
  1. line
  2. bar
  3. barh
  4. hist
  5. box
  6. kde
  7. density (an alias for kde)
  8. area
  9. pie
  10. scatter
  11. hexbin

For the complete documentation of the method, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

Conclusion

Pandas is a powerful library, and this short article doesn’t really do it justice: the library is huge and takes a lot of time to master completely. Hopefully, though, this was enough to get you interested and willing to learn more.
