Author: Joseph DiPietro

Countering a Dangerous Problem

Introduction

    Human Trafficking has always been a major problem in the world and it has devastating effects on its victims.   
It is also widely known that human trafficking affects a vast number of children. With new databases we can  
examine risk factors in order to help mitigate this problem, protect more children, and counter the abusers.  
In this project I will be examining the age of the victims, relation of the abuser to the victim, and type of  
abuse used for trafficking. The data that I will use comes from the CTDC and contains information from across  
the world. This data dates back to 2002. I focused on the data from the United States to narrow the scope of  
the project, however, I also utilized other small sets of data from other countries for comparison. The chosen  
cases range from years 2015 to 2018. Using machine learning, we can predict the violent nature of the abusers,  
and their relationship to the victim based on other factors such as age.

Libraries Used

This project will be completed in Python using the pandas, numpy, scikit-learn, seaborn, matplotlib, and folium libraries.

In [284]:
import pandas as pd
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import folium
from folium.plugins import MarkerCluster
import warnings
warnings.filterwarnings('ignore')

Pre - Processing

The format of the data needs to be altered for cases of missing data.  The way missing data is handled is  
to use -99 as a placeholder. However, that will skew our data when we attempt to do any regression modeling.  
To fix this, we have to alter the actual excel file and replace all -99s with 0s.  This will also give us a  
pseudo binary flag, where any column with 0 indicates an adult while 1 indicates a child. 

The data set can be found here: TraffickingData.

Gathering the Data

  The file is in a CSV, comma-seperated values, file so we can use the built in pandas parser to create a data frame. 
In [266]:
globalFrame = pd.read_csv("trafficking.csv")

Fixing the Data

    This dataframe is much too large to be interpreted so we must get rid of unrelated data. For example,  
I removed the data source column and other repeated entries. In addition, I renamed most of the columns  
to be more fitting to the actual data.  When the names of the columns take less space, it allows us to  
read the dataframe better.  I also changed what a missing entry looks like in the majorityStatus column,  
as "unknown" is more fitting than the value 0.

    In order to create a sum of cases, we must add a new column with entries of 1.  Then, we can group by both  
majorityStatus and the year in order to create a new dataframe.  This dataframe will be indexed by the year  
and age status and will have columns that contain all of the cases and types of control, relation, etc. We  
can use seaborn to plot this dataframe as shown below which will show the vast difference between cases based  
on majority status.
In [267]:
#Dropping unnecessary columns
globalFrame = globalFrame.drop(globalFrame.columns[0],1)
globalFrame = globalFrame.drop(["Datasource","ageBroad","majorityStatus","majorityEntry"],1)
globalFrame = globalFrame.rename(columns = {"yearOfRegistration":"year","majorityStatusAtExploit":"majorityStatus","meansOfControlDebtBondage":"DebtBondage","meansOfControlTakesEarnings":"EarningsStolen","meansOfControlRestrictsFinancialAccess": "WithholdsMoney","meansOfControlThreats":"Threats","meansOfControlPsychologicalAbuse":"PsychologicalAbuse","meansOfControlPhysicalAbuse":"PhysicalAbuse","meansOfControlSexualAbuse":"SexualAbuse","meansOfControlFalsePromises":"FalsePromises","meansOfControlPsychoactiveSubstances":"PsychoactiveSubstances","meansOfControlRestrictsMovement":"RestrictsMovement","meansOfControlRestrictsMedicalCare":"RestrictsMedicalCare","meansOfControlExcessiveWorkingHours":"ExcessiveWorkingHours","meansOfControlUsesChildren":"UsesChildren","meansOfControlThreatOfLawEnforcement":"ThreatOfLawEnforcement","meansOfControlWithholdsNecessities":"WithholdsNecessities","meansOfControlWithholdsDocuments":"WithholdsDocuments","meansOfControlOther":"OtherControl","meansOfControlNotSpecified":"ControlNotSpecified","recruiterRelationIntimatePartner":"IntimatePartner","recruiterRelationFriend":"Friend","recruiterRelationFamily":"Family","recruiterRelationOther":"OtherRelation","recruiterRelationUnknown":"UnknownRelation"})
frame = globalFrame[globalFrame['citizenship'] == "US"]
frame = frame.reset_index()
frame = frame.drop('index',1)
#Changing missing entries in majority status to unknown
frame.loc[frame['majorityStatus'] == '0', 'majorityStatus'] = "unkown"
frame["Cases"] = 1
#Indexing the year and majority status while summing the columns that match each year,majorityStatus tuple
ageFrame = frame.groupby(["majorityStatus","year"]).sum()
ageFrame['Cases'].plot.bar()
ageFrame = ageFrame.reset_index()
ageFrame
Out[267]:
majorityStatus year DebtBondage EarningsStolen WithholdsMoney Threats PsychologicalAbuse PhysicalAbuse SexualAbuse FalsePromises ... typeOfSexPornography typeOfSexRemoteInteractiveServices typeOfSexPrivateSexualServices isAbduction IntimatePartner Friend Family OtherRelation UnknownRelation Cases
0 Adult 2015 5 12 0 13 20 15 4 4 ... 0 0 0 0 19 3 0 3 16 41
1 Adult 2016 12 21 8 45 47 45 29 20 ... 0 0 0 0 41 16 3 21 60 133
2 Adult 2017 10 20 1 63 50 51 21 9 ... 0 0 0 0 43 11 2 15 70 137
3 Adult 2018 5 16 2 37 38 28 11 8 ... 0 0 0 0 31 12 1 12 23 76
4 Minor 2015 13 36 4 56 101 72 42 10 ... 0 0 0 0 63 26 47 34 169 320
5 Minor 2016 21 43 6 111 140 118 76 20 ... 0 0 0 0 90 49 61 52 345 575
6 Minor 2017 20 44 5 109 149 104 56 18 ... 0 0 0 0 69 28 83 34 263 467
7 Minor 2018 6 17 3 67 107 70 65 3 ... 0 0 0 0 37 16 81 17 119 263
8 unkown 2015 13 18 2 39 56 46 11 4 ... 0 0 0 0 32 5 14 16 209 271
9 unkown 2016 13 40 12 80 114 119 29 15 ... 0 0 0 0 84 12 16 12 411 531
10 unkown 2017 27 36 6 122 107 107 37 7 ... 0 0 0 0 66 13 14 9 385 482
11 unkown 2018 11 27 3 81 98 96 41 11 ... 0 0 0 0 73 13 16 11 229 340

12 rows × 52 columns

Graphs

To make a good hypothesis on the predictions, we must first look at the values.  First, we  
must again edit the frame to get rid of data we cannot use such as missing entries or unknowns.  
To do this we simply set ageFrame equal to ageFrame where the column majorityStatus is not unknown.  
Then, again using seaborn, we can see the differences in types of control used on adults and minors.  
By using hue, we can input two different types of data and seaborn will automatically color and label  
the points on the graph.  Seaborn is also made much easier by using the data function which allows  
the user to just input the column names as x and y.  The clf() function is simply to prevent seaborn  
from plotting on the same graph in future calls. Based on the graphs we can infer that minors will be  
the victims for most types of control.
In [335]:
#Clearing the unknowns from our dataframe
ageFrame = ageFrame[ageFrame['majorityStatus'] != "unkown"]
physcialAbuse = sea.scatterplot(x = "year",y = "PhysicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
sexualAbuse = sea.scatterplot(x = "year",y = "SexualAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
pschologicalAbuse = sea.scatterplot(x = "year",y = "PsychologicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
psychoactiveDrugs = sea.scatterplot(x = "year",y = "PsychoactiveSubstances",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
<Figure size 432x288 with 0 Axes>

Predictive Modeling

To begin, we first need to edit the original frame so that there is no unknown data skewing our predictions.  
Next, we must create a column that is a binary flag to indicate whether the victim was a minor or not. 1  
indicates a minor, while 0 indicates an adult. This is the column that will be predicting.  We will first  
begin our prediction by utilizing just the year. We will make a new frame that only includes the year and  
majority status flag columns.  Here, we will use the LinearRegression model from sklearn in order to form  
a best fit line over our data.  After we use the fit function with X being our frame with year and y being  
the majority status flag column, we can predict what majority status a victim would have in a year.  As  
shown below, when 2017 is inputted, the output is close to 1 meaning that the victim is most likely a minor.  
However, this isn't an ideal prediction model as only the factor of year is taken into account.  The next  
step is adding in our other variables.
In [351]:
#Clearing the unknowns from the dataframe
predictFrame = frame[frame["majorityStatus"] != "unkown"]
#Making our binary flag for majority status
for index,row in predictFrame.iterrows():
    if(row.majorityStatus == "Minor"):
        predictFrame.loc[index,"BinaryAge"] = 1
    else:
        predictFrame.loc[index,"BinaryAge"] = 0
#Here I make a smaller dataframe of just the year and the binary flag
predictAge = predictFrame[["year","BinaryAge"]]
#Creating our linear regression
model = LinearRegression()
#Creating the regression fit line
model.fit(X = predictAge.drop("BinaryAge",1),y = predictAbuse["BinaryAge"])
print("A victim in 2017 with 1 being a minor is ", model.predict(np.array([[2017]])))
A victim in 2017 with 1 being a minor is  [0.78742787]

Types of Control

When looking at the data, it is obvious that it is important to take into consideration other factors other  
than just age, when predicting the majority status of a victim. To fix this, we have to add an interaction  
term so that the prediction is based off of both the year and the type of control used.  To accomplish this,  
we will use a built in function from sklearn.  First, we will initialize a variable that holds the polynomial  
features and in the features set interaction only to true and include bias to false.  The 1 represents what  
type of model you are predicting i.e linear, quadratic, hyperbolic.  Then using this feature, we can create  
the interaction term using fit transform.  In the fit transform we only include variables that are not being  
predicted.  

Next, we will attempt to predict the majority status of a victim based on the type of control the abuser used.  
This can be done by making a new dataframe that includes all of the types of control columns that we renamed  
earlier.  Again, we want to use the entire data frame, excluding our binary age flag, as the x and use the age  
flag as the y. Now when we do the linear regression fit, our prediction model expects 18 inputs.  To test a  
prediction, simply make an numpy array that has a 0 for each type of control that is not the one utilized,  
and a 1 for the control being predicted.  As shown below we have a 1 in the fifth column so this is a prediction  
of when threats are used for control.  Using the prediction model we can see that based on this type of control  
and the year it happened that the victim is likely a minor.
In [329]:
#Adding all of the columns of types of control to a new frame
predictAbuse = predictFrame[["year","DebtBondage","EarningsStolen","WithholdsMoney","Threats","PhysicalAbuse","SexualAbuse","FalsePromises",
                            "PsychoactiveSubstances","RestrictsMovement","RestrictsMedicalCare","ExcessiveWorkingHours",
                            "UsesChildren","ThreatOfLawEnforcement","WithholdsNecessities","WithholdsDocuments",
                            "OtherControl","ControlNotSpecified","BinaryAge"]]
#Using sklearn's built in polynomialfeatures to create an interaction term
poly = PolynomialFeatures(1,interaction_only=True,include_bias = False)
#X_inter becomes our interaction term
X_inter = poly.fit_transform(X = predictAbuse.drop('BinaryAge',1))
model = LinearRegression()
model.fit(X = X_inter,y = predictAbuse["BinaryAge"])
print("A victim of threats with 1 being a minor in 2017 controlled using threats is",model.predict(np.array([[2017,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]])))
A victim of threats with 1 being a minor in 2017 controlled using threats is [0.77239109]

Relation to Victim

Another path you can look at is the predictability of the majority status of the victim given the recruiters  
relation to them.  This is again accomplished by using the sklearn features given the same parameters. Again,  
X_inter is our interaction term between the year and the recruiter relation.  Now we are set up for another  
linear regression. 

To do this we make a new frame of the year, recruitment relation columns that we renamed earlier, and the age  
flag.  Once again, X is our frame minus the flag and y is the flag itself that we are trying to predict. After  
the fit the prediction model expects 5 elements which are handled the same way as before.  We can see through  
the predictions below that if the recruiter is a family member they are almost always a minor and if it is an  
intimate partner they are likely to be a minor.
In [333]:
#Making a new frame with the recruiter relation columns
predictRecruit = predictFrame[["year","IntimatePartner","Family","OtherRelation","UnknownRelation","BinaryAge"]]
#Creating the interaction term
poly = PolynomialFeatures(1,interaction_only=True,include_bias = False)
X_inter = poly.fit_transform(X = predictRecruit.drop('BinaryAge',1))
#Fitting and predicting based on our interaction term
model = LinearRegression()
model.fit(X = X_inter,y = predictAbuse["BinaryAge"])
print("A victim of a family member with one being a minor in 2017 is",model.predict(np.array([[2017,0,1,0,0]])))
print("A victim of an intimate partner with one being a minor in 2015 is",model.predict(np.array([[2015,1,0,0,0]])))
A victim of a family member with one being a minor in 2017 is [0.96826561]
A victim of an intimate partner with one being a minor in 2015 is [0.71835606]

A Global Approach to Safety

In order to display the vast amounts of data in the file in a readible format we must use a map.  This  
map will show the entire world, will highlight countries that we have in our dataframe, and label them  
with total cases.  In order to accomplish this, we use folium and a folium plugin called MarkerCluster().  
First, we must initialize the map which we will simply call m, as a new folium map.  Then, a MarkerCluster()  
must be added to the map which is called cluster here. The next step is to iterrate over the data and check  
the citizenship of victim column to see which country to add a marker at.  To add a marker, we use the simple  
Marker() function. This function takes a location in the form [lat,long] and can display a popup which is at  
that point and it is set to the country we are adding the marker to.  As the markers are added to the cluster,  
it will automatically absorb any nearby markers and display a total number.  Not only does this make the data  
easier to read, but it will also help you deal with large datasets that can crash your computer.  The map is  
also interactive so users can move around and zoom in to specific countries to see their total reported cases.  
In [248]:
#Initializes the map
m = folium.Map()
#Adds a MarkerCluster to the map which we add Markers to
cluster = MarkerCluster().add_to(m)
for index,row in globalFrame.iterrows():
    if(row.citizenship == "CO"):
        folium.map.Marker(location = ['4.5709','-74.2973'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "MD"):
        folium.map.Marker(location = ['47.4116','28.3699'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "RO"):
        folium.map.Marker(location = ['45.9432','24.9668'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "UA"):
        folium.map.Marker(location = ['48.3794','31.1656'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "BY"):
        folium.map.Marker(location = ['53.7098','27.9534'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "HT"):
        folium.map.Marker(location = ['18.9712','-72.2852'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "UZ"):
        folium.map.Marker(location = ['41.3775','64.5853'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "LK"):
        folium.map.Marker(location = ['7.8731','80.7718'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "LK"):
        folium.map.Marker(location = ['7.8731','80.7718'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "MM"):
        folium.map.Marker(location = ['21.9162','95.9560'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "UG"):
        folium.map.Marker(location = ['1.3733','32.2903'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "ID"):
        folium.map.Marker(location = ['-0.7893','113.9213'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "ID"):
        folium.map.Marker(location = ['-0.7893','113.9213'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "KG"):
        folium.map.Marker(location = ['42.882004','74.582748'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "AF"):
        folium.map.Marker(location = ['33.9391','67.7100'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "ER"):
        folium.map.Marker(location = ['15.1794','39.7823'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "NG"):
        folium.map.Marker(location = ['17.6078','8.0817'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "NP"):
        folium.map.Marker(location = ['28.3949','84.1240'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "PH"):
        folium.map.Marker(location = ['12.8797','121.7740'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "KH"):
        folium.map.Marker(location = ['12.5657','104.9910'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "BD"):
        folium.map.Marker(location = ['23.6850','90.3563'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "US"):
        folium.map.Marker(location = ['37.0902','-95.7129'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "TH"):
        folium.map.Marker(location = ['15.8700','100.9925'],popup = row.citizenship).add_to(cluster)
    if(row.citizenship == "VN"):
        folium.map.Marker(location = ['14.0583','108.2772'],popup = row.citizenship).add_to(cluster)
m
Out[248]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Country Comparison

We can also use our machine learning to compare the safety of countries.  This would be useful information to  
look at before traveling or simply to see if one particular country has a larger problem.  To complete this we  
have to create a new frame that has all of the data from the Philippines.  We can then concatenate the Philippines  
and United States frames to combine both frames.  Similarly, to before we must make the citizenship column into  
some numerical in order to do our regression model.  This will be a frequent problem when attempting to use a  
linear regression prediction model, but usually setting one or more frames to a binary flag is the solution. In  
this case we will use 1 to represent the Philippines and 0 for the United States.  Since we are attempting to  
predict the country of citizenship, the citizenship column will be our y with the rest of the frame being the x.
In [326]:
#Grabbing the data of just the Phillippines
PH = globalFrame[globalFrame['citizenship'] == 'PH']
#Joining the US and PH datafrm
USPH = pd.concat([PH,frame.drop('Cases',1)])
#Dropping unknown data
USPH = USPH[USPH['majorityStatus'] != "unknown"]
#Creating a binary flag for citizenship with 1 being Phillippines
for index,row in USPH.iterrows():
    if(row.citizenship == "PH"):
        USPH.loc[index,"CitizenFlag"] = 1
    else:
        USPH.loc[index,"CitizenFlag"] = 0
USPH = USPH[['year','citizenship','CitizenFlag']]
model2 = LinearRegression()
model2.fit(X = USPH.drop(['citizenship','CitizenFlag'],1),y = USPH['CitizenFlag'])
print("A victim of human trafficking in 2016 with 1 being Philippines is ", model2.predict(np.array([[2017]])))
A victim of human trafficking in 2016 with 1 being Philippines is  [0.48067162]

Conclusion

From these graphs and prediction models we can conclude that, on average, the victims of human trafficking  
are minors.  We can also use this to extrapolate risk factors that make an individual more susceptible to  
human trafficking.  For example, we now know that children need to be protected from threats at a higher  
rate than adults do based on the analyzed data.  The database also allows for us to have a global idea of  
which countries have a larger problem with human trafficking and need more attention. With this information,  
we can potentially address this issue.  This is extremely important data as we are dealing with human lives  
and attempting to prevent a lifetime of damage.  I hope that this tutorial helped you understand how to  
manipulate data, graph it, utilize it in predictive modeling, and display it in a map. I hope that one day  
you may apply it to another set of data that can also improve lives.  If you want to look further into this  
topic the [Counter Trafficking Data Center](https://www.ctdatacollaborative.org/) has other information on  
both human trafficking and ways in which data science can be used to track and prevent it.