Human trafficking has long been a major problem around the world, and it has devastating effects on its victims. It is also widely known that human trafficking affects a vast number of children. With new databases we can examine risk factors in order to help mitigate this problem, protect more children, and counter the abusers. In this project I examine the ages of victims, the recruiter's relation to the victim, and the type of control used by the abuser. The data comes from the CTDC and contains information from across the world, dating back to 2002. I focused on data from the United States to narrow the scope of the project; however, I also used smaller sets of data from other countries for comparison. The chosen cases range from 2015 to 2018. Using machine learning, we can predict the violent nature of the abusers and their relationship to the victim based on other factors such as age.
This project will be completed in Python using the pandas, numpy, scikit-learn, seaborn, matplotlib, and folium libraries.
import pandas as pd
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import folium
from folium.plugins import MarkerCluster
import warnings
warnings.filterwarnings('ignore')
The format of the data needs to be altered to handle missing entries. Missing data is recorded with -99 as a placeholder; however, that will skew our data when we attempt any regression modeling. To fix this, I edited the CSV file directly and replaced all -99s with 0s. This also gives us a pseudo binary flag, where a 0 in a column indicates an adult while a 1 indicates a child.
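Editing the file by hand works, but the same replacement can also be done in pandas after loading; below is a minimal sketch, assuming the raw download still contains the -99 placeholders (the filename matches the cleaned file used in the rest of this project).
# Sketch: load the raw file, swap the -99 placeholders for 0, and save the cleaned copy
# (assumes the raw download still contains -99s; the file read below is already cleaned)
raw = pd.read_csv("trafficking.csv")
raw = raw.replace(-99, 0).replace("-99", 0)   # handle both numeric and text columns
raw.to_csv("trafficking.csv", index=False)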
The data set can be found here: TraffickingData.
The file is a CSV (comma-separated values) file, so we can use the built-in pandas parser to create a dataframe.
globalFrame = pd.read_csv("trafficking.csv")
This dataframe is much too large to interpret as-is, so we must get rid of unrelated data. For example, I removed the data source column and other redundant columns. In addition, I renamed most of the columns to names that better fit the actual data; shorter column names make the dataframe much easier to read. I also changed how a missing entry looks in the majorityStatus column, since "unknown" is more fitting than the value 0.
In order to count cases, we add a new column whose entries are all 1. Then we can group by both majorityStatus and year to create a new dataframe. This dataframe is indexed by year and majority status and has columns containing the total cases and the sums of each type of control, recruiter relation, etc. We can plot this dataframe as shown below, which reveals the vast difference in cases between adults and minors.
#Dropping unnecessary columns
globalFrame = globalFrame.drop(globalFrame.columns[0], axis=1)
globalFrame = globalFrame.drop(["Datasource","ageBroad","majorityStatus","majorityEntry"], axis=1)
globalFrame = globalFrame.rename(columns = {
    "yearOfRegistration": "year",
    "majorityStatusAtExploit": "majorityStatus",
    "meansOfControlDebtBondage": "DebtBondage",
    "meansOfControlTakesEarnings": "EarningsStolen",
    "meansOfControlRestrictsFinancialAccess": "WithholdsMoney",
    "meansOfControlThreats": "Threats",
    "meansOfControlPsychologicalAbuse": "PsychologicalAbuse",
    "meansOfControlPhysicalAbuse": "PhysicalAbuse",
    "meansOfControlSexualAbuse": "SexualAbuse",
    "meansOfControlFalsePromises": "FalsePromises",
    "meansOfControlPsychoactiveSubstances": "PsychoactiveSubstances",
    "meansOfControlRestrictsMovement": "RestrictsMovement",
    "meansOfControlRestrictsMedicalCare": "RestrictsMedicalCare",
    "meansOfControlExcessiveWorkingHours": "ExcessiveWorkingHours",
    "meansOfControlUsesChildren": "UsesChildren",
    "meansOfControlThreatOfLawEnforcement": "ThreatOfLawEnforcement",
    "meansOfControlWithholdsNecessities": "WithholdsNecessities",
    "meansOfControlWithholdsDocuments": "WithholdsDocuments",
    "meansOfControlOther": "OtherControl",
    "meansOfControlNotSpecified": "ControlNotSpecified",
    "recruiterRelationIntimatePartner": "IntimatePartner",
    "recruiterRelationFriend": "Friend",
    "recruiterRelationFamily": "Family",
    "recruiterRelationOther": "OtherRelation",
    "recruiterRelationUnknown": "UnknownRelation"})
frame = globalFrame[globalFrame['citizenship'] == "US"]
frame = frame.reset_index(drop=True)
#Changing missing entries in majority status to unknown
frame.loc[frame['majorityStatus'] == '0', 'majorityStatus'] = "unknown"
frame["Cases"] = 1
#Indexing the year and majority status while summing the columns that match each year,majorityStatus tuple
ageFrame = frame.groupby(["majorityStatus","year"]).sum()
ageFrame['Cases'].plot.bar()
ageFrame = ageFrame.reset_index()
ageFrame
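The pandas bar chart above gives a quick view of the totals; for a version that groups the bars by year and colors them by majority status, a seaborn bar plot can be used on the reset-index frame. A minimal sketch, assuming ageFrame has been reset as above so year, majorityStatus, and Cases are ordinary columns:
# Sketch: the same case counts as a grouped seaborn bar plot
casesPlot = sea.barplot(x = "year", y = "Cases", hue = "majorityStatus", data = ageFrame)
plt.show()
plt.clf()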
To form a good hypothesis before making predictions, we must first look at the values. We again edit the frame to remove data we cannot use, such as missing entries or unknowns. To do this we simply keep the rows of ageFrame where the majorityStatus column is not unknown. Then, using seaborn, we can see the differences in the types of control used on adults and minors. By using hue, we can pass in a second variable and seaborn will automatically color and label the points on the graph. Seaborn is also made much easier by the data parameter, which lets the user pass column names directly as x and y. The clf() call simply prevents later plots from being drawn on the same figure. Based on the graphs, we can infer that minors make up the majority of victims for most types of control.
#Clearing the unknowns from our dataframe
ageFrame = ageFrame[ageFrame['majorityStatus'] != "unknown"]
physicalAbuse = sea.scatterplot(x = "year",y = "PhysicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
sexualAbuse = sea.scatterplot(x = "year",y = "SexualAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
psychologicalAbuse = sea.scatterplot(x = "year",y = "PsychologicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
psychoactiveDrugs = sea.scatterplot(x = "year",y = "PsychoactiveSubstances",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
To begin, we first need to edit the original frame so that no unknown data skews our predictions. Next, we must create a binary flag column indicating whether the victim was a minor: 1 indicates a minor, while 0 indicates an adult. This is the column we will be predicting. We will start the prediction by using just the year, making a new frame that only includes the year and the majority status flag. Here, we use the LinearRegression model from sklearn to fit a best-fit line over our data. After calling fit with X being the year column and y being the majority status flag, we can predict the likely majority status of a victim in a given year. As shown below, when 2017 is input, the output is close to 1, meaning the victim is most likely a minor. However, this isn't an ideal prediction model, since only the year is taken into account. The next step is adding in our other variables.
#Clearing the unknowns from the dataframe
predictFrame = frame[frame["majorityStatus"] != "unknown"].copy()
#Making our binary flag for majority status
for index,row in predictFrame.iterrows():
    if(row.majorityStatus == "Minor"):
        predictFrame.loc[index,"BinaryAge"] = 1
    else:
        predictFrame.loc[index,"BinaryAge"] = 0
#Here I make a smaller dataframe of just the year and the binary flag
predictAge = predictFrame[["year","BinaryAge"]]
#Creating our linear regression
model = LinearRegression()
#Fitting the regression line: X is the year, y is the binary age flag
model.fit(X = predictAge.drop("BinaryAge", axis=1),y = predictAge["BinaryAge"])
print("A victim in 2017 with 1 being a minor is ", model.predict(np.array([[2017]])))
Looking at the data, it is clear that we need to take factors other than just the year into account when predicting the majority status of a victim. To do this, we add an interaction term so that the prediction is based on both the year and the type of control used. We accomplish this with a built-in class from sklearn. First, we initialize a PolynomialFeatures object, setting interaction_only to True and include_bias to False. The first argument is the degree of the generated features: 1 keeps only the linear terms, while 2 would also add pairwise products, and so on. Then, using this object, we build the feature matrix with fit_transform, passing in only the variables that are not being predicted.
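To see what PolynomialFeatures actually produces, here is a small standalone sketch; the toy array is just an illustration and is not part of the trafficking data. With degree 1 and interaction_only=True the output is simply the original columns, while degree 2 would also append the pairwise products.
# Toy illustration of PolynomialFeatures (not part of the trafficking data)
toy = np.array([[2017, 1, 0],
                [2016, 0, 1]])
# degree 1: the output columns are just the inputs themselves
print(PolynomialFeatures(1, interaction_only=True, include_bias=False).fit_transform(toy))
# degree 2 with interaction_only=True: the inputs plus every pairwise product
print(PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(toy))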
Next, we will attempt to predict the majority status of a victim based on the type of control the abuser used. This is done by making a new dataframe that includes the year, the types-of-control columns we renamed earlier, and the binary age flag. Again, we use the entire frame excluding the binary age flag as X, and the age flag as y. Now when we do the linear regression fit, the prediction model expects 18 inputs. To test a prediction, simply make a numpy array with the year first, a 0 for each type of control that was not used, and a 1 for the control of interest. As shown below, there is a 1 in the fifth position, so this is a prediction for when threats are used for control. Using the prediction model, we can see that based on this type of control and the year, the victim is likely a minor.
#Adding the year, the types-of-control columns, and the binary flag to a new frame
predictAbuse = predictFrame[["year","DebtBondage","EarningsStolen","WithholdsMoney","Threats","PhysicalAbuse","SexualAbuse","FalsePromises",
                             "PsychoactiveSubstances","RestrictsMovement","RestrictsMedicalCare","ExcessiveWorkingHours",
                             "UsesChildren","ThreatOfLawEnforcement","WithholdsNecessities","WithholdsDocuments",
                             "OtherControl","ControlNotSpecified","BinaryAge"]]
#Using sklearn's built-in PolynomialFeatures to create the feature matrix
poly = PolynomialFeatures(1,interaction_only=True,include_bias = False)
#X_inter becomes our feature matrix
X_inter = poly.fit_transform(X = predictAbuse.drop('BinaryAge', axis=1))
model = LinearRegression()
model.fit(X = X_inter,y = predictAbuse["BinaryAge"])
print("A victim in 2017 controlled using threats, with 1 being a minor, is",model.predict(np.array([[2017,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]])))
Another angle is the predictability of the victim's majority status given the recruiter's relation to them. This is again accomplished using sklearn's PolynomialFeatures with the same parameters; X_inter is now the feature matrix built from the year and the recruiter relation columns, setting us up for another linear regression.
To do this we make a new frame of the year, the recruiter relation columns we renamed earlier, and the age flag. Once again, X is the frame minus the flag and y is the flag we are trying to predict. After the fit, the prediction model expects 5 inputs, which are handled the same way as before. The predictions below suggest that if the recruiter is a family member the victim is almost always a minor, and if the recruiter is an intimate partner the victim is likely to be a minor.
#Making a new frame with the recruiter relation columns
predictRecruit = predictFrame[["year","IntimatePartner","Family","OtherRelation","UnknownRelation","BinaryAge"]]
#Creating the feature matrix
poly = PolynomialFeatures(1,interaction_only=True,include_bias = False)
X_inter = poly.fit_transform(X = predictRecruit.drop('BinaryAge', axis=1))
#Fitting and predicting based on our feature matrix
model = LinearRegression()
model.fit(X = X_inter,y = predictRecruit["BinaryAge"])
print("A victim of a family member with 1 being a minor in 2017 is",model.predict(np.array([[2017,0,1,0,0]])))
print("A victim of an intimate partner with 1 being a minor in 2015 is",model.predict(np.array([[2015,1,0,0,0]])))
To display the vast amount of data in the file in a readable format, we can use a map. This map shows the entire world, highlights the countries present in our dataframe, and labels them with total cases. To accomplish this, we use folium and a folium plugin called MarkerCluster. First, we initialize the map, which we simply call m, as a new folium Map. Then a MarkerCluster is added to the map, called cluster here. The next step is to iterate over the data and check the victim citizenship column to see which country to add a marker for. To add a marker, we use the Marker() function, which takes a location in the form [lat, long] and can display a popup at that point; here the popup is set to the country code we are adding the marker to. As markers are added to the cluster, it automatically absorbs nearby markers and displays a total count. Not only does this make the data easier to read, it also keeps a large number of markers from overwhelming the browser. The map is interactive, so users can pan and zoom in on specific countries to see their total reported cases.
#Initializes the map
m = folium.Map()
#Adds a MarkerCluster to the map which we add Markers to
cluster = MarkerCluster().add_to(m)
#Approximate center coordinates [lat, long] for each country code in the data
coords = {
    "CO": [4.5709, -74.2973],
    "MD": [47.4116, 28.3699],
    "RO": [45.9432, 24.9668],
    "UA": [48.3794, 31.1656],
    "BY": [53.7098, 27.9534],
    "HT": [18.9712, -72.2852],
    "UZ": [41.3775, 64.5853],
    "LK": [7.8731, 80.7718],
    "MM": [21.9162, 95.9560],
    "UG": [1.3733, 32.2903],
    "ID": [-0.7893, 113.9213],
    "KG": [42.882004, 74.582748],
    "AF": [33.9391, 67.7100],
    "ER": [15.1794, 39.7823],
    "NG": [17.6078, 8.0817],
    "NP": [28.3949, 84.1240],
    "PH": [12.8797, 121.7740],
    "KH": [12.5657, 104.9910],
    "BD": [23.6850, 90.3563],
    "US": [37.0902, -95.7129],
    "TH": [15.8700, 100.9925],
    "VN": [14.0583, 108.2772],
}
#One marker per case, placed at the victim's country of citizenship
for index, row in globalFrame.iterrows():
    if row.citizenship in coords:
        folium.Marker(location = coords[row.citizenship], popup = row.citizenship).add_to(cluster)
m
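In a notebook the final line renders the map inline; outside a notebook, the same map can be written to a standalone HTML file and opened in a browser (the filename here is just an example).
# Sketch: save the interactive map as a standalone HTML file
m.save("trafficking_map.html")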