If you’ve been using Python for a decent amount of time, you’ve probably heard of Pandas, the Swiss Army knife of data science.
At first you might wonder: what can I do with Pandas, and why Pandas? The answer is that you can do a great deal with it when it comes to parsing and extracting information from data you already have, along with other functionality I’ll briefly get into in a bit.
First, let’s properly define what Pandas is.
Pandas is an open source library that allows you to load and process large amounts of data in different formats, most commonly CSV files.
Still confused? Here’s a scenario for you.
Imagine you’ve been hired in a sales or marketing role and asked to write a report containing some specific “insights” from a data file you have: every client from CA, US, the most ordered items on an e-commerce website, or a visualization of some data. Whatever the task is, it involves parsing and “filtering” data, and this is where Pandas comes into the picture.
What can I do with it?
Aside from what I have mentioned above, you can do many other things, including but not limited to:
- Visualizing data: Combining Pandas with matplotlib lets you create plots and charts for any kind of app you want with very little code
- Writing to CSV files: CSV files are your best friend when it comes to data science, and Pandas lets you create new CSV files from your own data for further processing, for later presentation as charts, or for display in a web app
- Parsing HTML easily: OK, I know there are many other great options for parsing HTML, but Pandas can parse <table> tags pretty well and turn them into dataframes
- Plotting: There are plenty of options and kinds of plots to help you understand your data better
- Writing to an SQL table: A great feature I recently discovered and started using is the ability to write a dataframe into an SQL table, and it’s totally up to you to decide what to do with that
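As a quick preview of the CSV and SQL items above, here is a minimal sketch. The sample data, the table name "orders" and the in-memory SQLite database are all assumptions I made up for illustration; a real project would point at its own file path and database.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data, made up for this illustration.
orders = pd.DataFrame({"item": ["pen", "book", "lamp"], "quantity": [3, 1, 2]})

# Write the dataframe as CSV text; passing a file path instead would create a file.
csv_buffer = io.StringIO()
orders.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# Write the same dataframe to a table in a throwaway in-memory SQLite database.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)

# Read the table back to confirm that the rows arrived.
from_sql = pd.read_sql("SELECT * FROM orders", conn)
conn.close()
print(from_sql)
```

Passing index=False keeps the row index out of both the CSV and the SQL table, which is usually what you want when the index carries no meaning.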
Now it’s time to get down to business and write some code!
Installing Pandas
It’s as easy as running:
pip install pandas
This also installs a few dependencies that Pandas needs, such as NumPy.
Firing up our Jupyter Notebook
When it comes to data science, Jupyter Notebook is often the preferred tool: it lets you mix code and markdown in one document, notebooks are easy to export, and it’s generally more flexible to work with.
Importing Pandas and loading our dataset
For the sake of this tutorial, I’m going to be using a dataset from a GitHub repo, and I’ll also show you the other ways to load your data.
import pandas as pd
Here we imported the library and will refer to it as “pd”, which is common practice.
df = pd.read_csv("<link or file location>")
Here you can pass either the link to the CSV file or the path where you downloaded it.
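Besides a link or a file path, pd.read_csv also accepts a file-like object, which is handy for trying things out without any file at all. Here is a small sketch of that; the CSV text and the names sample and only_city are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for this illustration.
csv_text = "city,hospitals\nSpringfield,4\nShelbyville,12\n"

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# usecols limits which columns get loaded, useful for wide files.
only_city = pd.read_csv(io.StringIO(csv_text), usecols=["city"])
print(list(only_city.columns))
```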
Quick Pandas lingo
Dataframe: a table-like data structure made up of columns, each of which is known as a Series
An example of how to construct a dataframe with Pandas:
df = pd.DataFrame({"name": ['Bob', 'Fred', 'Peter'], "age": [22, 24, 26]})
To construct a dataframe this way, you pass in a dictionary where the keys are the column names and the values are lists holding each column’s values. All the lists must have the same length.
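To make the dataframe and Series relationship concrete, here is a short sketch using the same made-up people. The variable names people and ages are my own choice for the example.

```python
import pandas as pd

# Keys become column names, lists become the column values.
people = pd.DataFrame({"name": ["Bob", "Fred", "Peter"], "age": [22, 24, 26]})

# Pulling out a single column gives back a Series.
ages = people["age"]
print(type(ages).__name__)  # Series

# A Series has its own methods, for example mean().
print(ages.mean())  # 24.0
```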
Exploring your data
Now it’s time to get to know your data and find out what you’re working with. Here are a few useful methods and properties:
- df.head() returns the first five records of the dataframe.
- df.loc[] and df.iloc[] are essential for selecting specific rows and columns within the dataset. The difference is that df.loc[] is label-based, which means you pass row labels and column names, whereas df.iloc[] is position-based, which means you pass the integer positions of the rows and columns you want.
- df.describe() returns summary statistics about the dataframe, such as count, mean, max and other values.
- df.value_counts() returns each distinct value and how often it occurs. Called on a whole dataframe it counts distinct rows, so it is most often used on a single column, as in df['hospitals'].value_counts().
- df.columns is a property that contains the column names only, and is useful when you want to extract certain pieces of information.
- df.shape is a property that returns the dimensions of the dataframe as (rows, columns).
- df.plot() is used to plot charts from the data.
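Here is a small sketch that runs several of these on a tiny dataframe so you can see what each one gives back. The pets data is made up for illustration.

```python
import pandas as pd

# Hypothetical data, made up for this illustration.
pets = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "bird"],
    "age": [3, 5, 1, 2],
})

print(pets.head(2))        # first two rows instead of the default five
print(pets.shape)          # (4, 2)
print(list(pets.columns))  # ['animal', 'age']

# describe() summarizes numeric columns by default (here only "age").
summary = pets.describe()
print(summary)

# value_counts() on a single column counts how often each value occurs.
counts = pets["animal"].value_counts()
print(counts)
```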
These are the most useful methods and properties for now; next we’ll cover some of them in more detail.
First, let’s find out what columns and rows we have and what values they hold.
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/covid-geography/mmsa-icu-beds.csv")
df.head()
Next, let’s see how to select specific columns.
First, let’s find out what our columns are, using the df.columns property:
df.columns
Then we’ll return some of those columns for all the records:
df.loc[:,['MMSA','total_percent_at_risk','icu_beds','hospitals']]
The first argument is the range of rows to return; since we specified neither a start nor an end, it means all rows. The second argument is a Python list of the desired columns.
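If we wanted only the first 10 records, we would write the following. Note that df.loc[] slices by label and includes the end label, so with the default index, :9 covers labels 0 through 9, which is 10 rows:
df.loc[:9,['MMSA','total_percent_at_risk','icu_beds','hospitals']]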
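One detail worth checking for yourself when slicing rows: df.loc[] slices include the end label, while df.iloc[] slices behave like regular Python slices and exclude the end position. A minimal sketch, using a made-up dataframe with the default 0, 1, 2, ... index:

```python
import pandas as pd

# Hypothetical data: 12 rows with the default integer index 0..11.
letters = pd.DataFrame({"letter": list("abcdefghijkl")})

# Label-based: the end label 9 is included, so this is 10 rows.
by_label = letters.loc[:9]

# Position-based: the end position 10 is excluded, so this is also 10 rows.
by_position = letters.iloc[:10]

# Label-based with :10 includes label 10, which makes 11 rows.
one_more = letters.loc[:10]

print(len(by_label), len(by_position), len(one_more))  # 10 10 11
```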
Additionally, you can add some logic to return only certain rows. For example, let’s find out which MMSAs have more than 10 hospitals:
df.loc[df.hospitals>10]
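What happens here is that the comparison produces a column of True/False values, and df.loc[] keeps the rows where it is True. The sketch below shows that step by step and also combines two conditions. The column names come from the dataset, but the three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
areas = pd.DataFrame({
    "MMSA": ["Area A", "Area B", "Area C"],
    "hospitals": [4, 12, 30],
    "icu_beds": [20, 150, 90],
})

# The comparison yields a boolean Series, one True/False per row.
mask = areas["hospitals"] > 10
print(mask.tolist())  # [False, True, True]

# Passing the mask to loc keeps only the rows marked True.
many_hospitals = areas.loc[mask]

# Combine conditions with & (and) or | (or); each one needs its own parentheses.
busy_and_equipped = areas.loc[(areas["hospitals"] > 10) & (areas["icu_beds"] > 100)]
print(busy_and_equipped)
```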
You can access individual columns in two ways:
- Square bracket syntax, with the column name as a string, as in
df['icu_beds']
- Dot syntax, as in
df.icu_beds
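Both forms return the same column, but dot syntax only works when the column name is a valid Python identifier and doesn’t clash with an existing dataframe attribute. A small sketch with made-up data, where the "total beds" column is my own invention to show the difference:

```python
import pandas as pd

# Hypothetical data; "total beds" is an invented column with a space in its name.
beds = pd.DataFrame({"icu_beds": [20, 150], "total beds": [200, 900]})

# Bracket and dot syntax give back the same Series.
same = beds["icu_beds"].equals(beds.icu_beds)
print(same)  # True

# A name with a space can only be reached with brackets.
total = beds["total beds"]
print(total.sum())  # 1100
```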
Plotting with pandas
import pandas as pd
url = "" #URL TO THE DATASET CSV FILE OR FILE PATH
df = pd.read_csv(url)
df.plot(x='MMSA',y='icu_beds',kind='bar')
Parameters:
- x: values of the x axis, which is a column in your dataframe, so check the name carefully
- y: values of the y axis, which is a column in your dataframe, so check the name carefully
- kind: the kind of chart to be plotted. Supported kinds are:
  - line
  - bar
  - barh
  - hist
  - box
  - kde
  - density
  - area
  - pie
  - scatter
  - hexbin
For complete documentation of the method, check out this link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
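If you want to try a different kind without downloading anything, here is a minimal sketch that draws a horizontal bar chart from made-up numbers and writes it to an in-memory PNG. The data, the chart title and the choice of the non-interactive "Agg" backend are my assumptions; in a Jupyter Notebook the chart shows up inline without the backend line.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
areas = pd.DataFrame({
    "MMSA": ["Area A", "Area B", "Area C"],
    "icu_beds": [20, 150, 90],
})

# Same call as before, with a different kind and an optional title.
ax = areas.plot(x="MMSA", y="icu_beds", kind="barh", title="ICU beds per area")
ax.set_xlabel("ICU beds")

# Save the chart as PNG bytes; pass a file name instead to write a file.
png_buffer = io.BytesIO()
ax.figure.savefig(png_buffer, format="png")
```

df.plot() returns a matplotlib Axes object, so anything you know from matplotlib, such as set_xlabel, can be applied to the result.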
Conclusion
Pandas is a powerful library, and this simple article really doesn’t do it justice: the library is huge and takes a lot of time to master completely. Hopefully this was enough to get you interested in it and willing to learn more.