1. Define Spark Context

sc

<pyspark.context.SparkContext at 0x102cea1d0>

2. Start H2O Context


    from pysparkling import *
    sc
    hc= H2OContext(sc).start()

    Warning: Version mismatch. H2O is version 3.6.0.2, but the python package is version 3.7.0.99999.
H2O cluster uptime: 2 seconds 217 milliseconds
H2O cluster version: 3.6.0.2
H2O cluster name: sparkling-water-nidhimehta
H2O cluster total nodes: 2
H2O cluster total memory: 3.83 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: True
H2O Connection ip: 172.16.2.98
H2O Connection port: 54329

3. Define H2O Context

    hc

    H2OContext: ip=172.16.2.98, port=54329

4. Import H2O Python library

import h2o

5. View all available H2O Python functions

#dir(h2o)

6. Parse Chicago Crime dataset into H2O

column_type = ['Numeric','String','String','Enum','Enum','Enum','Enum','Enum','Enum','Enum','Numeric','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Enum']
f_crimes = h2o.import_file(path ="../data/chicagoCrimes10k.csv",col_types =column_type)

print(f_crimes.shape)
f_crimes.summary()


Parse Progress: [##################################################] 100%
(9999, 22)
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
type int string string enum enum enum enum enum enum enum int int int int enum int int int enum real real enum
mins 21735.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 111.0 1.0 1.0 1.0 0.0 1100317.0 1814255.0 2015.00.0 41.64507243 -87.906463888 0.0
mean 9931318.73737NaN NaN NaN NaN NaN NaN NaN 0.2928292829280.1523152315231159.6180618111.348988512822.954095409537.4476447645 NaN 1163880.59815 1885916.14984 2015.0NaN 41.8425652247 -87.6741405221 NaN
maxs 9962898.0 NaN NaN 6517.0 212.0 26.0 198.0 90.0 1.0 1.0 2535.0 25.0 50.0 77.0 24.0 1205069.0 1951533.0 2015.032.0 42.022646183 -87.524773286 8603.0
sigma 396787.564221NaN NaN NaN NaN NaN NaN NaN 0.4550835155880.35934414686 695.76029875 6.9454749330113.649566114421.2748762223 NaN 16496.4493681 31274.0163199 0.0 NaN 0.08601865793580.0600357970653NaN
zeros 0 0 0 3 16 11 933 19 7071 8476 0 0 0 0 16 0 0 0 603 0 0 1
missing0 0 0 0 0 0 0 6 0 0 0 162 0 0 0 162 162 0 0 162 162 162
0 9955810.0 HY144797 02/08/2015 11:43:40 PM081XX S COLES AVE 1811 NARCOTICS POSS: CANNABIS 30GMS OR LESSSTREET true false 422.0 4.0 7.0 46.0 18 1198273.0 1851626.0 2015.002/15/2015 12:43:39 PM41.747693646 -87.549035389 (41.747693646, -87.549035389)
1 9955861.0 HY144838 02/08/2015 11:41:42 PM118XX S STATE ST 0486 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT true true 522.0 5.0 34.0 53.0 08B 1178335.0 1826581.0 2015.002/15/2015 12:43:39 PM41.679442289 -87.622850758 (41.679442289, -87.622850758)
2 9955801.0 HY144779 02/08/2015 11:30:22 PM002XX S LARAMIE AVE 2026 NARCOTICS POSS: PCP SIDEWALK true false 1522.0 15.0 29.0 25.0 18 1141717.0 1898581.0 2015.002/15/2015 12:43:39 PM41.87777333 -87.755117993 (41.87777333, -87.755117993)
3 9956197.0 HY144787 02/08/2015 11:30:23 PM006XX E 67TH ST 1811 NARCOTICS POSS: CANNABIS 30GMS OR LESSSTREET true false 321.0 nan 6.0 42.0 18 nan nan 2015.002/15/2015 12:43:39 PMnan nan
4 9955846.0 HY144829 02/08/2015 11:30:58 PM0000X S MAYFIELD AVE0610 BURGLARY FORCIBLE ENTRY APARTMENT false false 1513.0 15.0 29.0 25.0 05 1137239.0 1899372.0 2015.002/15/2015 12:43:39 PM41.880025548 -87.771541324 (41.880025548, -87.771541324)
5 9955835.0 HY144778 02/08/2015 11:30:21 PM010XX W 48TH ST 0486 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT false true 933.0 9.0 3.0 61.0 08B 1169986.0 1873019.0 2015.002/15/2015 12:43:39 PM41.807059405 -87.65206589 (41.807059405, -87.65206589)
6 9955872.0 HY144822 02/08/2015 11:27:24 PM015XX W ARTHUR AVE 1320 CRIMINAL DAMAGETO VEHICLE STREET false false 2432.0 24.0 40.0 1.0 14 1164732.0 1943222.0 2015.002/15/2015 12:43:39 PM41.999814056 -87.669342967 (41.999814056, -87.669342967)
7 21752.0 HY144738 02/08/2015 11:26:12 PM060XX W GRAND AVE 0110 HOMICIDE FIRST DEGREE MURDER STREET true false 2512.0 25.0 37.0 19.0 01A 1135910.0 1914206.0 2015.002/15/2015 12:43:39 PM41.920755683 -87.776067514 (41.920755683, -87.776067514)
8 9955808.0 HY144775 02/08/2015 11:20:33 PM001XX W WACKER DR 0460 BATTERY SIMPLE OTHER false false 122.0 1.0 42.0 32.0 08B 1175384.0 1902088.0 2015.002/15/2015 12:43:39 PM41.886707818 -87.631396356 (41.886707818, -87.631396356)
9 9958275.0 HY146732 02/08/2015 11:15:36 PM001XX W WACKER DR 0460 BATTERY SIMPLE HOTEL/MOTEL false false 122.0 1.0 42.0 32.0 08B 1175384.0 1902088.0 2015.002/15/2015 12:43:39 PM41.886707818 -87.631396356 (41.886707818, -87.631396356)

7. Look at the distribution of the IUCR column

f_crimes["IUCR"].table()
IUCR Count
0110 16
0261 2
0263 2
0265 5
0266 2
0281 41
0291 3
0312 18
0313 20
031A 136

8. Look at the distribution of the Arrest column

f_crimes["Arrest"].table()
Arrest Count
false 7071
true 2928

9. Modify column names to replace blank spaces with underscores

col_names = map(lambda s: s.replace(' ', '_'), f_crimes.col_names)
f_crimes.set_names(col_names)
IDCase_Number Date Block IUCRPrimary_Type Description Location_Description Arrest Domestic Beat District Ward Community_AreaFBI_Code X_Coordinate Y_Coordinate YearUpdated_On Latitude LongitudeLocation
9.95581e+06HY144797 02/08/2015 11:43:40 PM081XX S COLES AVE 1811NARCOTICS POSS: CANNABIS 30GMS OR LESSSTREET true false 422 4 7 4618 1.19827e+06 1.85163e+06 201502/15/2015 12:43:39 PM 41.7477 -87.549 (41.747693646, -87.549035389)
9.95586e+06HY144838 02/08/2015 11:41:42 PM118XX S STATE ST 0486BATTERY DOMESTIC BATTERY SIMPLE APARTMENT true true 522 5 34 5308B 1.17834e+06 1.82658e+06 201502/15/2015 12:43:39 PM 41.6794 -87.6229(41.679442289, -87.622850758)
9.9558e+06 HY144779 02/08/2015 11:30:22 PM002XX S LARAMIE AVE 2026NARCOTICS POSS: PCP SIDEWALK true false 1522 15 29 2518 1.14172e+06 1.89858e+06 201502/15/2015 12:43:39 PM 41.8778 -87.7551(41.87777333, -87.755117993)
9.9562e+06 HY144787 02/08/2015 11:30:23 PM006XX E 67TH ST 1811NARCOTICS POSS: CANNABIS 30GMS OR LESSSTREET true false 321 nan 6 4218 nan nan 201502/15/2015 12:43:39 PM nan nan
9.95585e+06HY144829 02/08/2015 11:30:58 PM0000X S MAYFIELD AVE 0610BURGLARY FORCIBLE ENTRY APARTMENT false false 1513 15 29 2505 1.13724e+06 1.89937e+06 201502/15/2015 12:43:39 PM 41.88 -87.7715(41.880025548, -87.771541324)
9.95584e+06HY144778 02/08/2015 11:30:21 PM010XX W 48TH ST 0486BATTERY DOMESTIC BATTERY SIMPLE APARTMENT false true 933 9 3 6108B 1.16999e+06 1.87302e+06 201502/15/2015 12:43:39 PM 41.8071 -87.6521(41.807059405, -87.65206589)
9.95587e+06HY144822 02/08/2015 11:27:24 PM015XX W ARTHUR AVE 1320CRIMINAL DAMAGETO VEHICLE STREET false false 2432 24 40 114 1.16473e+06 1.94322e+06 201502/15/2015 12:43:39 PM 41.9998 -87.6693(41.999814056, -87.669342967)
21752 HY144738 02/08/2015 11:26:12 PM060XX W GRAND AVE 0110HOMICIDE FIRST DEGREE MURDER STREET true false 2512 25 37 1901A 1.13591e+06 1.91421e+06 201502/15/2015 12:43:39 PM 41.9208 -87.7761(41.920755683, -87.776067514)
9.95581e+06HY144775 02/08/2015 11:20:33 PM001XX W WACKER DR 0460BATTERY SIMPLE OTHER false false 122 1 42 3208B 1.17538e+06 1.90209e+06 201502/15/2015 12:43:39 PM 41.8867 -87.6314(41.886707818, -87.631396356)
9.95828e+06HY146732 02/08/2015 11:15:36 PM001XX W WACKER DR 0460BATTERY SIMPLE HOTEL/MOTEL false false 122 1 42 3208B 1.17538e+06 1.90209e+06 201502/15/2015 12:43:39 PM 41.8867 -87.6314(41.886707818, -87.631396356)

10. Set time zone to UTC for date manipulation

h2o.set_timezone("Etc/UTC")

11. Refine the date column

def refine_date_col(data, col, pattern):
    data[col]         = data[col].as_date(pattern)
    data["Day"]       = data[col].day()
    data["Month"]     = data[col].month()    # Since H2O indexes from 0
    data["Year"]      = data[col].year()
    data["WeekNum"]   = data[col].week()
    data["WeekDay"]   = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols
    data["Weekend"] = (data["WeekDay"] == "Sun" or data["WeekDay"] == "Sat").ifelse(1, 0)[0]
    data["Season"] = data["Month"].cut([0, 2, 5, 7, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])

refine_date_col(f_crimes, "Date", "%m/%d/%Y %I:%M:%S %p")
f_crimes = f_crimes.drop("Date")

12. Parse Census data into H2O

f_census = h2o.import_file("../data/chicagoCensus.csv",header=1)

## Update column names in the table
col_names = map(lambda s: s.strip().replace(' ', '_'), f_census.col_names)
f_census.set_names(col_names)
f_census = f_census[1:78,:]
print(f_census.dim)
#f_census.summary()


Parse Progress: [##################################################] 100%
[77, 9]

13. Parse Weather data into H2O

f_weather = h2o.import_file("../data/chicagoAllWeather.csv")
f_weather = f_weather[1:]
print(f_weather.dim)
#f_weather.summary()


Parse Progress: [##################################################] 100%
[5162, 6]

14. Look at all the null entires in the Weather table

f_weather[f_weather["meanTemp"].isna()]
month day year maxTemp meanTemp minTemp
6 19 2008 nan nan nan
9 23 2008 nan nan nan
9 24 2008 nan nan nan
9 25 2008 nan nan nan
9 26 2008 nan nan nan
9 27 2008 nan nan nan
9 28 2008 nan nan nan
9 29 2008 nan nan nan
9 30 2008 nan nan nan
3 4 2009 nan nan nan

15. Look at the help on as_h2o_frame

hc.as_spark_frame?
f_weather

H2OContext: ip=172.16.2.98, port=54329
month day year maxTemp meanTemp minTemp
1 1 2001 23 14 6
1 2 2001 18 12 6
1 3 2001 28 18 8
1 4 2001 30 24 19
1 5 2001 36 30 21
1 6 2001 33 26 19
1 7 2001 34 28 21
1 8 2001 26 20 14
1 9 2001 23 16 10
1 10 2001 34 26 19

16. Copy data frames to Spark from H2O

df_weather = hc.as_spark_frame(f_weather,)
df_census = hc.as_spark_frame(f_census)
df_crimes = hc.as_spark_frame(f_crimes)

17. Look at the weather data as parsed in Spark

(only showing top 2 rows)

df_weather.show(2)

+-----+---+----+-------+--------+-------+
|month|day|year|maxTemp|meanTemp|minTemp|
+-----+---+----+-------+--------+-------+
|    1|  1|2001|     23|      14|      6|
|    1|  2|2001|     18|      12|      6|
+-----+---+----+-------+--------+-------+

18. Join columns from Crime, Census and Weather DataFrames in Spark

## Register DataFrames as tables in SQL context
sqlContext.registerDataFrameAsTable(df_weather, "chicagoWeather")
sqlContext.registerDataFrameAsTable(df_census, "chicagoCensus")
sqlContext.registerDataFrameAsTable(df_crimes, "chicagoCrime")


crimeWithWeather = sqlContext.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")

19. Print the crimeWithWeather data table from Spark

crimeWithWeather.show(2)

+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|Year|Month|Day|WeekNum|HourOfDay|Weekend|Season|WeekDay|IUCR|     Primary_Type|Location_Description|Community_Area|District|Arrest|Domestic|Beat|Ward|FBI_Code|minTemp|maxTemp|meanTemp|PERCENT_AGED_UNDER_18_OR_OVER_64|PER_CAPITA_INCOME|HARDSHIP_INDEX|PERCENT_OF_HOUSING_CROWDED|PERCENT_HOUSEHOLDS_BELOW_POVERTY|PERCENT_AGED_16__UNEMPLOYED|PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|2015|    1| 23|      4|       22|      0|Winter|    Fri|143A|WEAPONS VIOLATION|               ALLEY|            31|      12|  true|   false|1234|  25|      15|     29|     31|      30|                            32.6|            16444|            76|         9.600000000000001|                            25.8|                       15.8|                                        40.7|
|2015|    1| 23|      4|       19|      0|Winter|    Fri|4625|    OTHER OFFENSE|            SIDEWALK|            31|      10|  true|   false|1034|  25|      26|     29|     31|      30|                            32.6|            16444|            76|         9.600000000000001|                            25.8|                       15.8|                                        40.7|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
only showing top 2 rows

20. Copy table from Spark to H2O

hc.as_h2o_frame?
crimeWithWeatherHF = hc.as_h2o_frame(crimeWithWeather,framename="crimeWithWeather")

H2OContext: ip=172.16.2.98, port=54329



crimeWithWeatherHF.summary()
Year Month Day WeekNum HourOfDay Weekend Season WeekDay IUCR Primary_Type Location_Description Community_Area District Arrest Domestic Beat Ward FBI_Code minTemp maxTemp meanTemp PERCENT_AGED_UNDER_18_OR_OVER_64 PER_CAPITA_INCOME HARDSHIP_INDEX PERCENT_OF_HOUSING_CROWDED PERCENT_HOUSEHOLDS_BELOW_POVERTY PERCENT_AGED_16__UNEMPLOYED PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
type int int int int int int string string stringstring string int int string string int int string int int int real int int real real real real
mins 2015.01.0 1.0 4.0 0.0 0.0 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN 111.0 1.0 NaN -2.0 15.0 7.0 13.5 8201.0 1.0 0.3 3.3 4.7 2.5
mean 2015.01.41944194419 17.68396839685.18081808181 13.63196319630.159115911591NaN NaN NaN NaN NaN 37.4476447645 11.3489885128NaN NaN 1159.6180618122.9540954095NaN 17.699669967 31.719971997224.940894089435.0596759676 25221.3057306 54.4786478648 5.43707370737 24.600750075 16.8288328833 21.096639664
maxs 2015.02.0 31.0 6.0 23.0 1.0 NaN NaN NaN NaN NaN 77.0 25.0 NaN NaN 2535.0 50.0 NaN 29.0 43.0 36.0 51.5 88669.0 98.0 15.8 56.5 35.9 54.8
sigma 0.0 0.49349240678711.18010433580.7389298304096.473217358070.365802434041NaN NaN NaN NaN NaN 21.2748762223 6.94547493301NaN NaN 695.76029875 13.6495661144NaN 8.961181364386.938099134727.463025270627.95653388237 18010.0446225 29.3247456472 3.75289588494 10.1450570661 7.58926327988 11.3868817911
zeros 0 0 0 0 374 8408 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
missing0 0 0 0 0 0 0 0 0 0 6 0 162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 2015.01.0 24.0 4.0 22.0 0.0 Winter Sat 2820 OTHER OFFENSE APARTMENT 31.0 10.0 false false 1034.0 25.0 26 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
1 2015.01.0 24.0 4.0 21.0 0.0 Winter Sat 1310 CRIMINAL DAMAGE RESTAURANT 31.0 12.0 true false 1233.0 25.0 14 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
2 2015.01.0 24.0 4.0 18.0 0.0 Winter Sat 1750 OFFENSE INVOLVING CHILDRENRESIDENCE 31.0 12.0 false true 1235.0 25.0 20 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
3 2015.01.0 24.0 4.0 18.0 0.0 Winter Sat 0460 BATTERY OTHER 31.0 10.0 false false 1023.0 25.0 08B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
4 2015.01.0 24.0 4.0 13.0 0.0 Winter Sat 0890 THEFT CURRENCY EXCHANGE 31.0 10.0 false false 1023.0 25.0 06 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
5 2015.01.0 24.0 4.0 9.0 0.0 Winter Sat 0560 ASSAULT OTHER 31.0 12.0 false false 1234.0 25.0 08A 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
6 2015.01.0 24.0 4.0 8.0 0.0 Winter Sat 0486 BATTERY RESIDENCE 31.0 12.0 true true 1235.0 25.0 08B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
7 2015.01.0 24.0 4.0 1.0 0.0 Winter Sat 0420 BATTERY SIDEWALK 31.0 10.0 false false 1034.0 25.0 04B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
8 2015.01.0 24.0 4.0 0.0 0.0 Winter Sat 1320 CRIMINAL DAMAGE PARKING LOT/GARAGE(NON.RESID.)31.0 9.0 false false 912.0 11.0 14 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
9 2015.01.0 31.0 5.0 23.0 0.0 Winter Sat 0820 THEFT SIDEWALK 31.0 12.0 false false 1234.0 25.0 06 19.0 36.0 28.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7

21. Assign column types to the CrimeWeatherHF data table in H2O

crimeWithWeatherHF["Season"]= crimeWithWeatherHF["Season"].asfactor()
crimeWithWeatherHF["WeekDay"]= crimeWithWeatherHF["WeekDay"].asfactor()
crimeWithWeatherHF["IUCR"]= crimeWithWeatherHF["IUCR"].asfactor()
crimeWithWeatherHF["Primary_Type"]= crimeWithWeatherHF["Primary_Type"].asfactor()
crimeWithWeatherHF["Location_Description"]= crimeWithWeatherHF["Location_Description"].asfactor()
crimeWithWeatherHF["Arrest"]= crimeWithWeatherHF["Arrest"].asfactor()
crimeWithWeatherHF["Domestic"]= crimeWithWeatherHF["Domestic"].asfactor()
crimeWithWeatherHF["FBI_Code"]= crimeWithWeatherHF["FBI_Code"].asfactor()
crimeWithWeatherHF["Season"]= crimeWithWeatherHF["Season"].asfactor()



crimeWithWeatherHF.summary()
Year Month Day WeekNum HourOfDay Weekend Season WeekDay IUCR Primary_Type Location_Description Community_Area District Arrest Domestic Beat Ward FBI_Code minTemp maxTemp meanTemp PERCENT_AGED_UNDER_18_OR_OVER_64 PER_CAPITA_INCOME HARDSHIP_INDEX PERCENT_OF_HOUSING_CROWDED PERCENT_HOUSEHOLDS_BELOW_POVERTY PERCENT_AGED_16__UNEMPLOYED PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
type int int int int int int enum enum enum enum enum int int enum enum int int enum int int int real int int real real real real
mins 2015.01.0 1.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 111.0 1.0 0.0 -2.0 15.0 7.0 13.5 8201.0 1.0 0.3 3.3 4.7 2.5
mean 2015.01.41944194419 17.68396839685.18081808181 13.63196319630.1591159115910.0 NaN NaN NaN NaN 37.4476447645 11.34898851280.2928292829280.1523152315231159.6180618122.9540954095NaN 17.699669967 31.719971997224.940894089435.0596759676 25221.3057306 54.4786478648 5.43707370737 24.600750075 16.8288328833 21.096639664
maxs 2015.02.0 31.0 6.0 23.0 1.0 0.0 6.0 212.0 26.0 90.0 77.0 25.0 1.0 1.0 2535.0 50.0 24.0 29.0 43.0 36.0 51.5 88669.0 98.0 15.8 56.5 35.9 54.8
sigma 0.0 0.49349240678711.18010433580.7389298304096.473217358070.3658024340410.0 NaN NaN NaN NaN 21.2748762223 6.945474933010.4550835155880.35934414686 695.76029875 13.6495661144NaN 8.961181364386.938099134727.463025270627.95653388237 18010.0446225 29.3247456472 3.75289588494 10.1450570661 7.58926327988 11.3868817911
zeros 0 0 0 0 374 8408 9999 1942 16 11 19 0 0 7071 8476 0 0 16 0 0 0 0 0 0 0 0 0 0
missing0 0 0 0 0 0 0 0 0 0 6 0 162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 2015.01.0 24.0 4.0 22.0 0.0 Winter Sat 2820 OTHER OFFENSE APARTMENT 31.0 10.0 false false 1034.0 25.0 26 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
1 2015.01.0 24.0 4.0 21.0 0.0 Winter Sat 1310 CRIMINAL DAMAGE RESTAURANT 31.0 12.0 true false 1233.0 25.0 14 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
2 2015.01.0 24.0 4.0 18.0 0.0 Winter Sat 1750 OFFENSE INVOLVING CHILDRENRESIDENCE 31.0 12.0 false true 1235.0 25.0 20 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
3 2015.01.0 24.0 4.0 18.0 0.0 Winter Sat 0460 BATTERY OTHER 31.0 10.0 false false 1023.0 25.0 08B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
4 2015.01.0 24.0 4.0 13.0 0.0 Winter Sat 0890 THEFT CURRENCY EXCHANGE 31.0 10.0 false false 1023.0 25.0 06 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
5 2015.01.0 24.0 4.0 9.0 0.0 Winter Sat 0560 ASSAULT OTHER 31.0 12.0 false false 1234.0 25.0 08A 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
6 2015.01.0 24.0 4.0 8.0 0.0 Winter Sat 0486 BATTERY RESIDENCE 31.0 12.0 true true 1235.0 25.0 08B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
7 2015.01.0 24.0 4.0 1.0 0.0 Winter Sat 0420 BATTERY SIDEWALK 31.0 10.0 false false 1034.0 25.0 04B 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
8 2015.01.0 24.0 4.0 0.0 0.0 Winter Sat 1320 CRIMINAL DAMAGE PARKING LOT/GARAGE(NON.RESID.)31.0 9.0 false false 912.0 11.0 14 29.0 43.0 36.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7
9 2015.01.0 31.0 5.0 23.0 0.0 Winter Sat 0820 THEFT SIDEWALK 31.0 12.0 false false 1234.0 25.0 06 19.0 36.0 28.0 32.6 16444.0 76.0 9.6 25.8 15.8 40.7

22. Split final H2O data table into train test and validation sets

ratios = [0.6,0.2]
frs = crimeWithWeatherHF.split_frame(ratios,seed=12345)
train = frs[0]
train.frame_id = "Train"
valid = frs[2]
valid.frame_id = "Validation"
test = frs[1]
test.frame_id = "Test"

23. Import Model Builders from H2O Python

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

24. Inspect the availble GBM parameters

H2OGradientBoostingEstimator?

25. Define Predictors

predictors = crimeWithWeatherHF.names[:]
response = "Arrest"
predictors.remove(response)

26. Create a Simple GBM model to Predict Arrests

model_gbm = H2OGradientBoostingEstimator(ntrees         =50,
                                        max_depth      =6,
                                        learn_rate     =0.1, 
                                        #nfolds         =2,
                                        distribution   ="bernoulli")

model_gbm.train(x               =predictors,
               y               ="Arrest",
               training_frame  =train,
               validation_frame=valid
               )

27. Create a Simple Deep Learning model to Predict Arrests

model_dl = H2ODeepLearningEstimator(variable_importances=True,
                                   loss                ="Automatic")

model_dl.train(x                =predictors,
              y                ="Arrest",
              training_frame  =train,
              validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%

28. Print confusion matrices for the training and validation datasets

print(model_gbm.confusion_matrix(train = True))
print(model_gbm.confusion_matrix(valid = True))

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.335827722991:

false true Error Rate
false 4125.0 142.0 0.0333 (142.0/4267.0)
true 251.0 1504.0 0.143 (251.0/1755.0)
Total 4376.0 1646.0 0.0653 (393.0/6022.0)

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.432844055866:

false true Error Rate
false 1362.0 61.0 0.0429 (61.0/1423.0)
true 150.0 443.0 0.253 (150.0/593.0)
Total 1512.0 504.0 0.1047 (211.0/2016.0)
print(model_gbm.auc(train=True))
print(model_gbm.auc(valid=True))
model_gbm.plot(metric="AUC")

0.974667176776
0.92596751276

29. Print variable importances

model_gbm.varimp(True)
variable relative_importance scaled_importance percentage
0 IUCR 4280.939453 1.000000e+00 8.234218e-01
1 Location_Description 487.323059 1.138355e-01 9.373466e-02
2 WeekDay 55.790558 1.303232e-02 1.073109e-02
3 HourOfDay 55.419220 1.294557e-02 1.065967e-02
4 PERCENT_AGED_16__UNEMPLOYED 34.422894 8.040967e-03 6.621107e-03
5 Beat 31.468222 7.350775e-03 6.052788e-03
6 PERCENT_HOUSEHOLDS_BELOW_POVERTY 29.103352 6.798356e-03 5.597915e-03
7 PER_CAPITA_INCOME 26.233143 6.127894e-03 5.045841e-03
8 PERCENT_AGED_UNDER_18_OR_OVER_64 24.077402 5.624327e-03 4.631193e-03
9 Day 23.472567 5.483041e-03 4.514855e-03
... ... ... ... ...
15 maxTemp 11.300793 2.639793e-03 2.173663e-03
16 Community_Area 10.252146 2.394835e-03 1.971960e-03
17 HARDSHIP_INDEX 10.116072 2.363049e-03 1.945786e-03
18 Domestic 9.294327 2.171095e-03 1.787727e-03
19 District 8.304654 1.939914e-03 1.597367e-03
20 minTemp 6.243027 1.458331e-03 1.200822e-03
21 WeekNum 4.230102 9.881246e-04 8.136433e-04
22 FBI_Code 2.363182 5.520241e-04 4.545486e-04
23 Month 0.000018 4.187325e-09 3.447935e-09
24 Weekend 0.000000 0.000000e+00 0.000000e+00

25 rows × 4 columns

30. Inspect Deep Learning model output

model_dl

Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1446861372065_4

Status of Neuron Layers: predicting Arrest, 2-class classification, bernoulli distribution, CrossEntropy loss, 118,802 weights/biases, 1.4 MB, 72,478 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1 390 Input 0.0
2 200 Rectifier 0.0 0.0 0.0 0.1 0.3 0.0 -0.0 0.1 -0.0 0.1
3 200 Rectifier 0.0 0.0 0.0 0.1 0.2 0.0 -0.0 0.1 0.8 0.2
4 2 Softmax 0.0 0.0 0.0 0.0 0.0 0.0 0.4 -0.0 0.0

ModelMetricsBinomial: deeplearning

Reported on train data.

MSE: 0.0737426129728
R^2: 0.642891439669
LogLoss: 0.242051500943
AUC: 0.950131166302
Gini: 0.900262332604

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.343997370612:

false true Error Rate
false 4003.0 264.0 0.0619 (264.0/4267.0)
true 358.0 1397.0 0.204 (358.0/1755.0)
Total 4361.0 1661.0 0.1033 (622.0/6022.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.3 0.8 195.0
max f2 0.2 0.9 278.0
max f0point5 0.7 0.9 86.0
max accuracy 0.5 0.9 149.0
max precision 1.0 1.0 0.0
max absolute_MCC 0.3 0.7 195.0
max min_per_class_accuracy 0.2 0.9 247.0
ModelMetricsBinomial: deeplearning
** Reported on validation data. **

MSE: 0.0843305429737
R^2: 0.593831388139
LogLoss: 0.280203809486
AUC: 0.930515181213
Gini: 0.861030362427

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.493462351545:

false true Error Rate
false 1361.0 62.0 0.0436 (62.0/1423.0)
true 158.0 435.0 0.2664 (158.0/593.0)
Total 1519.0 497.0 0.1091 (220.0/2016.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.5 0.8 137.0
max f2 0.1 0.8 303.0
max f0point5 0.7 0.9 82.0
max accuracy 0.7 0.9 91.0
max precision 1.0 1.0 0.0
max absolute_MCC 0.7 0.7 91.0
max min_per_class_accuracy 0.2 0.8 236.0

Scoring History:

timestamp duration training_speed epochs samples training_MSE training_r2 training_logloss training_AUC training_classification_error validation_MSE validation_r2 validation_logloss validation_AUC validation_classification_error
2015-11-06 17:57:05 0.000 sec None 0.0 0.0 nan nan nan nan nan nan nan nan nan nan
2015-11-06 17:57:09 2.899 sec 2594 rows/sec 1.0 6068.0 0.1 0.3 0.6 0.9 0.1 0.1 0.3 0.6 0.9 0.1
2015-11-06 17:57:15 9.096 sec 5465 rows/sec 7.3 43742.0 0.1 0.6 0.3 0.9 0.1 0.1 0.6 0.3 0.9 0.1
2015-11-06 17:57:19 12.425 sec 6571 rows/sec 12.0 72478.0 0.1 0.6 0.2 1.0 0.1 0.1 0.6 0.3 0.9 0.1

Variable Importances:

variable relative_importance scaled_importance percentage
Domestic.false 1.0 1.0 0.0
Primary_Type.NARCOTICS 0.9 0.9 0.0
IUCR.0860 0.8 0.8 0.0
FBI_Code.18 0.8 0.8 0.0
IUCR.4625 0.7 0.7 0.0
--- --- --- ---
Location_Description.missing(NA) 0.0 0.0 0.0
Primary_Type.missing(NA) 0.0 0.0 0.0
FBI_Code.missing(NA) 0.0 0.0 0.0
WeekDay.missing(NA) 0.0 0.0 0.0
Domestic.missing(NA) 0.0 0.0 0.0

31. Predict on the test set using the GBM model

predictions = model_gbm.predict(test)
predictions.show()
predict false true
false 0.946415 0.0535847
false 0.862165 0.137835
false 0.938661 0.0613392
false 0.870186 0.129814
false 0.980488 0.0195118
false 0.972006 0.0279937
false 0.990995 0.00900489
true 0.02106920.978931
false 0.693061 0.306939
false 0.992097 0.00790253

32. Look at test set performance (if it includes true labels)

test_performance = model_gbm.model_performance(test)
test_performance


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.0893676876445
R^2: 0.57094394422
LogLoss: 0.294019576922
AUC: 0.922152238508
Gini: 0.844304477016

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.365461652105:

false true Error Rate
false 1297.0 84.0 0.0608 (84.0/1381.0)
true 153.0 427.0 0.2638 (153.0/580.0)
Total 1450.0 511.0 0.1209 (237.0/1961.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4 0.8 158.0
max f2 0.1 0.8 295.0
max f0point5 0.7 0.9 97.0
max accuracy 0.6 0.9 112.0
max precision 1.0 1.0 0.0
max absolute_MCC 0.6 0.7 112.0
max min_per_class_accuracy 0.2 0.8 235.0

33. Create Plots of Crime type vs Arrest Rate and Proportion of reported Crime

# Create table to report Crimetype, Arrest count per crime, total reported count per Crime  
sqlContext.registerDataFrameAsTable(df_crimes, "df_crimes")
allCrimes = sqlContext.sql("""SELECT Primary_Type, count(*) as all_count FROM df_crimes GROUP BY Primary_Type""")
crimesWithArrest = sqlContext.sql("SELECT Primary_Type, count(*) as crime_count FROM chicagoCrime WHERE Arrest = 'true' GROUP BY Primary_Type")

sqlContext.registerDataFrameAsTable(crimesWithArrest, "crimesWithArrest")
sqlContext.registerDataFrameAsTable(allCrimes, "allCrimes")

crime_type = sqlContext.sql("Select a.Primary_Type as Crime_Type, a.crime_count, b.all_count \
FROM crimesWithArrest a \
JOIN allCrimes b \
ON a.Primary_Type = b.Primary_Type ")

crime_type.show(12)

+--------------------+-----------+---------+
|          Crime_Type|crime_count|all_count|
+--------------------+-----------+---------+
|       OTHER OFFENSE|        183|      720|
|   WEAPONS VIOLATION|         96|      118|
|  DECEPTIVE PRACTICE|         25|      445|
|            BURGLARY|         14|      458|
|             BATTERY|        432|     1851|
|             ROBBERY|         17|      357|
| MOTOR VEHICLE THEFT|         17|      414|
|        PROSTITUTION|        106|      106|
|     CRIMINAL DAMAGE|         76|     1003|
|          KIDNAPPING|          1|        7|
|            GAMBLING|          3|        3|
|LIQUOR LAW VIOLATION|         12|       12|
+--------------------+-----------+---------+
only showing top 12 rows

34. Copy Crime_type table from Spark to H2O

crime_typeHF = hc.as_h2o_frame(crime_type,framename="crime_type")

35. Create Additional columns Arrest_rate and Crime_propotion

crime_typeHF["Arrest_rate"] = crime_typeHF["crime_count"]/crime_typeHF["all_count"]
crime_typeHF["Crime_proportion"] = crime_typeHF["all_count"]/crime_typeHF["all_count"].sum()
crime_typeHF["Crime_Type"] = crime_typeHF["Crime_Type"].asfactor()
# h2o.assign(crime_typeHF,crime_type)
crime_typeHF.frame_id = "Crime_type"


crime_typeHF
Crime_Type crime_count all_count Arrest_rate Crime_proportion
OTHER OFFENSE 183 720 0.254167 0.0721226
WEAPONS VIOLATION 96 118 0.813559 0.0118201
DECEPTIVE PRACTICE 25 445 0.0561798 0.0445758
BURGLARY 14 458 0.0305677 0.045878
BATTERY 432 1851 0.233387 0.185415
ROBBERY 17 357 0.047619 0.0357608
MOTOR VEHICLE THEFT 17 414 0.0410628 0.0414705
PROSTITUTION 106 106 1 0.0106181
CRIMINAL DAMAGE 76 1003 0.0757727 0.100471
KIDNAPPING 1 7 0.142857 0.000701192
hc


H2OContext: ip=172.16.2.98, port=54329

36. Plot in Flow

plot (g) -> g(
  g.rect(
    g.position "Crime_Type", "Arrest_rate"
    g.fillColor g.value 'blue'
    g.fillOpacity g.value 0.75
  )
  g.rect(
    g.position "Crime_Type", "Crime_proportion"
    g.fillColor g.value 'red'
    g.fillOpacity g.value 0.65
  )
  g.from inspect "data", getFrame "Crime_type"
)


#hc.stop()