Sunday, May 27, 2018

Pandas Tips① ~How to add rows by splitting data using separator ??~

For data munging, calculating statistics, and much else, the pandas DataFrame in Python is really convenient and efficient, and I'm very fond of using it. Nevertheless, I tend to forget some useful techniques, so I've decided to write them down here. This first post is about how to add rows by splitting data on a separator. I hope you enjoy it :)

For instance, let us assume we have the data below.

In [1]:
import pandas as pd
In [2]:
purchase_data = pd.DataFrame([['john','apple|orange'],
                             ['robert','apple|orange|blueberry'],
                             ['david','banana']],columns=['name','fruit'])
purchase_data.head()
Out[2]:
name fruit
0 john apple|orange
1 robert apple|orange|blueberry
2 david banana

However, in some cases you need a dataframe with one row per fruit. You can create such a dataframe as below.

In [3]:
column_list = ['name','fruit']
# Collect one small dataframe per original row
frames = []

for index,row in purchase_data.iterrows():
    # In older pandas/Python versions, column order is not guaranteed
    # when creating a dataframe from a dictionary.
    # Hence I specify the "columns" parameter.
    purchase_data_temp = pd.DataFrame({row.index[0]:row.values[0],
                                       row.index[1]:row.values[1].split('|')},
                                      columns=column_list)
    frames.append(purchase_data_temp)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
purchase_data_new = pd.concat(frames)

purchase_data_new
Out[3]:
name fruit
0 john apple
1 john orange
0 robert apple
1 robert orange
2 robert blueberry
0 david banana
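
For what it's worth, the same reshaping can probably be done without an explicit loop. The following is only a minimal sketch, assuming pandas 0.25 or later, where `DataFrame.explode` is available. The name `purchase_data_exploded` is made up for this illustration and is not part of the original notebook.

```python
import pandas as pd

purchase_data = pd.DataFrame([['john', 'apple|orange'],
                              ['robert', 'apple|orange|blueberry'],
                              ['david', 'banana']], columns=['name', 'fruit'])

# Step 1: turn each 'a|b|c' string into a Python list.
# Step 2: explode() gives every list element its own row.
# Assumption: pandas >= 0.25 (explode does not exist in older versions).
purchase_data_exploded = (
    purchase_data
    .assign(fruit=purchase_data['fruit'].str.split('|'))
    .explode('fruit')
)
print(purchase_data_exploded)
```

Note that `explode` repeats the original row label for each new row, so the index differs from the loop version. Calling `reset_index(drop=True)` afterwards gives a fresh 0..n-1 index if you need one.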

If you are wondering about the step that creates a dataframe from a dictionary, here is a little more detail :)
Basically, you can create a dataframe from a dictionary as follows. (By the way, you can see in the output that the column order is not guaranteed in the pandas version used here. As far as I know, pandas 0.23 and later on Python 3.6+ keep the dictionary's insertion order.)

In [4]:
df_temp = pd.DataFrame({'name':['hiroshi'],'fruit':['kiwi']})
df_temp.head()
Out[4]:
fruit name
0 kiwi hiroshi

When you give a dictionary value as a string rather than a list, it is treated as a constant and repeated for every row.

In [5]:
df_temp2 = pd.DataFrame({'name':'hiroshi','fruit':['kiwi','apple','banana']})
df_temp2.head()
Out[5]:
fruit name
0 kiwi hiroshi
1 apple hiroshi
2 banana hiroshi

The key point is that the "split" method always returns a list. Therefore, in david's case in the example above, where the fruit value contains no '|', there is no error :)

In [6]:
# Unfortunately, this raises an error because every value is a scalar...
# df_temp3 = pd.DataFrame({'name':'hiroshi','fruit':'kiwi'})
df_temp3 = pd.DataFrame({'name':'hiroshi','fruit':'kiwi'.split('|')})
df_temp3.head()
Out[6]:
fruit name
0 kiwi hiroshi
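
To make the failing case concrete, here is a small sketch that runs the commented-out constructor inside a `try` block. It assumes the error is a `ValueError`, which is what pandas raises when all dictionary values are scalars and no index is given. The names `raised` and `df_ok` are only for illustration.

```python
import pandas as pd

# All values are scalars and no index is given, so pandas cannot
# tell how many rows to build and raises ValueError.
try:
    pd.DataFrame({'name': 'hiroshi', 'fruit': 'kiwi'})
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Besides wrapping a value in a list (which is what split does),
# passing an explicit index also tells pandas the number of rows.
df_ok = pd.DataFrame({'name': 'hiroshi', 'fruit': 'kiwi'}, index=[0])
print(df_ok)
```

So the list returned by "split" is what lets pandas know the number of rows, even when that list has only one element.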

You can see that even when there is no '|' in the fruit value, "split" returns a list, as below.

In [7]:
'kiwi'.split('|')
Out[7]:
['kiwi']
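
A few more inputs may be worth checking before relying on this. The sketch below uses only plain Python. The sample strings are made up for illustration and do not appear in the data above.

```python
# Hypothetical sample values, including edge cases
samples = ['apple|orange', 'kiwi', '', 'apple|']

# str.split always returns a list, whatever the input looks like
split_results = {s: s.split('|') for s in samples}

for s, parts in split_results.items():
    print(repr(s), '->', parts)
```

An empty string gives `['']` and a trailing separator gives an extra `''` element, so rows with an empty fruit value would appear in the result. You may want to filter those out, depending on your data.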

If anyone knows a more efficient way, I would be really grateful if you let me know! :)
