Sometimes we need to remove unnecessary data and save the column in the right format in the Pandas data frames.

If you want real time testing, please refer to the posts AWS Glue run locally or Glue Development using Jupyter.

FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
WORKDIR /home/glue_user/workspace/jupyter_workspace
ENV DISABLE_SSL=true
RUN pip3 install pyathena
RUN pip3 install seaborn
CMD [ "./start.sh" ]

First import the pandas and seaborn :

import pandas as pd
import seaborn as sns
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Show the tips data frame contents:

tips = sns.load_dataset('tips')
tips.head()
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Information about the data frame:

tips.info()
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

Only the data types of the columns:

tips.dtypes
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

Convert columns:

tips['size_str'] = tips['size'].astype(str)
tips.head()
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


   total_bill   tip     sex smoker  day    time  size size_str
0       16.99  1.01  Female     No  Sun  Dinner     2        2
1       10.34  1.66    Male     No  Sun  Dinner     3        3
2       21.01  3.50    Male     No  Sun  Dinner     3        3
3       23.68  3.31    Male     No  Sun  Dinner     2        2
4       24.59  3.61  Female     No  Sun  Dinner     4        4
tips.dtypes
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
size_str        object
dtype: object

Extract the rows based on the index:

tips.loc[[1,2,20]]
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


    total_bill   tip   sex smoker  day    time  size size_str
1        10.34  1.66  Male     No  Sun  Dinner     3        3
2        21.01  3.50  Male     No  Sun  Dinner     3        3
20       17.92  4.08  Male     No  Sat  Dinner     2        2

When total_bill changed to the string:

temp_df = tips.head(5)
temp_df.loc[[1,3], 'total_bill'] = "test"
temp_df
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


  total_bill   tip     sex smoker  day    time  size size_str
0      16.99  1.01  Female     No  Sun  Dinner     2        2
1       test  1.66    Male     No  Sun  Dinner     3        3
2      21.01  3.50    Male     No  Sun  Dinner     3        3
3       test  3.31    Male     No  Sun  Dinner     2        2
4      24.59  3.61  Female     No  Sun  Dinner     4        4
/home/glue_user/.local/lib/python3.7/site-packages/pandas/core/indexing.py:1817: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)
temp_df.dtypes
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
size_str        object
dtype: object

As shown in the above, now total_bill is an object type.

It is not possible to convert back temp_df.total_bill back to the numeric because of the test string. Therefore, you have to use the to_numeric(...) method in this case:

pd.to_numeric(temp_df.total_bill, errors="coerce")
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


0    16.99
1      NaN
2    21.01
3      NaN
4    24.59
Name: total_bill, dtype: float64

In addition to the above coerce, ignore also possible, but the string will be remained. As shown above, which cannot be converted to numeric has been converted as Not a Number(NaN).

pd.to_numeric(temp_df.total_bill, errors="ignore", downcast='float')
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


0    16.99
1     test
2    21.01
3     test
4    24.59
Name: total_bill, dtype: object

To check the categorical columns:

tips['smoker'].cat.categories
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


Index(['Yes', 'No'], dtype='object')

To find the codes:

tips['smoker'].cat.codes
FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…


0      1
1      1
2      1
3      1
4      1
      ..
239    1
240    0
241    0
242    1
243    1
Length: 244, dtype: int8

As shown in the above results, 0s and 1s are represent the Yes and No.