#Reverse Line Movement
#By: Mark Hanz

#Introduction:
#I decided to do my final tutorial on testing the efficacy of a betting strategy called reverse line movement. I wanted to do this project for very personal reasons. I have some friends who like to bet recreationally, legally and safely. My friends all use this specific strategy, and I want to know if there is any statistical validity to following it. In addition, my friends pay a company, which I do not think I should name, to provide them with the correct picks following this strategy. Over my college career I have spent a lot of my time researching how to be a good bettor, and I have learned about people who take advantage of college kids like me. I have learned that winning money betting is not a passive thing; it is a full-time job for a lot of people. Additionally, many people have made millions of dollars selling picks that, after analysis of their data, were found to be no better than guessing (the analysis I just referred to is not the project I am working on right now). These people understand that sports fans want an easy way to make money, and they know they can charge a lot for it. By the time customers realize that they are not making money in the long run, many have lost a lot of their betting budget, especially if they are paying for losing picks. Since many of these picks are mostly random, the worst case is when people start winning money right when they start using a service, based on sheer luck. It then becomes much harder for them to stop paying for the service, and that can affect a person's financial situation in a very real way. I kept bringing up the point to my friends that there are no financial analysts who charge for good stock picks; if they were good at what they do, they would be investing in the stocks themselves. In the same way, if these companies giving picks were so good, why would they sell them instead of just betting themselves?
# 
#   The worst-case scenario I was referring to is exactly what happened to my friends. They started to win a lot of money and could not understand how a bad betting strategy could be doing so well. There was no way I could convince them to stop paying for that service, no matter how skeptical I was for all the reasons I brought up before. Even though I was skeptical, I went into this analysis with an open mind, knowing I could be wrong. My goal was not necessarily to convince them that this strategy is bad, but that it could be bad. I wanted them to understand that small sample sizes mean nothing in the long run, and that the only way to know if something like this will work is by doing this type of analysis. My mission will be complete if they learn to trust the data, and not the opinions of people trying to sell them things online. I want to show the power of data to prove or disprove a betting strategy.

#     Now that I have gone through the motivation for this project, I need to explain what reverse line movement is and how it works. Intuitively it makes a lot of sense, and that is what can make something like this so dangerous. It is based on the idea that there are two types of bettors. There are many recreational bettors who bet small amounts to make games interesting; let's call these people the public. There are also what we call sharp bettors, people who treat betting as a full-time job and run very complicated statistical models that help give them an edge. The goal of reverse line movement is to find out which sides these sharp bettors are betting.

#   To do this, we first have to know a little about how game lines are made by the legal US books. They try to have no exposure to a game; their goal is to have 50% of the total money for each game on each side so that they cannot lose money. To do this, after the initial line is put out, whenever a big bet is made on team A, the line gets more favorable for team B, enticing future bettors to bet on team B and even out the book's exposure. So if a lot of money is going to team A, team B will move to a better line. Since the number of bets placed on team A is highly correlated with the amount of money on team A, if for example 80 percent of bets go to team A, the most likely scenario is that the line will move towards team B, as I just explained. There is a scenario, though, where only 20 percent of bets go on team A and the line still moves as if the money were on team A, getting more favorable for team B. The line is moving opposite to the direction it is expected to move, hence the name 'reverse line movement'. This can only happen for one reason: the books do not change the line based on the percentage of bets, but rather on the percentage of money. Books do not care how many people are betting a certain way if the money on each side is the same. So if a team has only 20% of bets on its side, but the line moves as if that team were taking the money, that means Vegas wants more money on the other side. The only reason they would want that is if that 20 percent of bettors bet more total money than the other 80 percent. This is very important, so I will explain it once more: if 20 percent of people bet on team A, but the line still moved as if the money were on team A, those 20 percent of bettors must have bet more total money than the other 80 percent. The assumption is then that this 20 percent is full of "sharp" bettors who will make more money than the average person. Any game that follows this pattern of line movement is a good bet, according to the website that sells the picks.
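
#   The rule described above can be sketched in a few lines. This is only a minimal illustration, not the pick-selling site's actual method: the function names are hypothetical, and comparing implied probabilities (rather than raw odds) is an assumption made here so that moves across the +/- boundary are handled consistently.

```python
def implied_probability(american_odds):
    """Convert American moneyline odds to the break-even win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def reverse_line_side(open_a, close_a, pct_bets_a):
    """Return the presumed sharp side, or None if the move is not a reverse move.

    open_a, close_a: team A's opening and closing American moneyline odds.
    pct_bets_a: percentage of bets (not money) placed on team A.
    """
    open_p = implied_probability(open_a)
    close_p = implied_probability(close_a)
    # Team A's price shortened even though a minority of bets were on team A.
    if close_p > open_p and pct_bets_a < 50:
        return "team A"
    # Team A's price lengthened even though a majority of bets were on team A.
    if close_p < open_p and pct_bets_a > 50:
        return "team B"
    return None


# Hypothetical numbers: team A goes from -110 to -125 with only 33% of bets.
print(reverse_line_side(-110, -125, 33))  # team A
# Same move with 67% of bets on team A is the expected direction, so no flag.
print(reverse_line_side(-110, -125, 67))  # None
```

#   A move that matches the majority of bets returns None, since that is the direction the line is expected to move.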

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

#The hardest part of this project was by far the data cleaning, since the data was very messy. The data I was able to get was just a text file with an entry for every single time a line moved for each game, meaning there were hundreds of data points for each game. Here is a quick example of part of one game:
# game: Carolina vs Denver
# -3 -125
# 67%
# +3 +105
# 33%
# 08/31 5:35 pm
# -3 -120
# 68%
# +3 +100
# 32%
# 08/29 3:18 pm
# -3 -125
# 67%
# +3 +105
# 33%
# 08/29 3:02 pm
# -3 -120
# 67%
# +3 +100
# 33%
# 08/29 1:53 pm
# -3 -115
# 67%
# +3 -105
# 33%
# 08/29 10:35 am
# -3 -110
# 67%
# +3 -110
# 33%
# 08/22 2:00 pm
# -3 -115
# 68%
# +3 -105
# 32%
# 08/22 1:38 pm
# -3 -110
# 68%
# +3 -110
# 32%
# I needed to parse each line for the information I needed and ignore a lot of the other information for most of the games. I had to figure out which of these lines for each game correspond to the money line (what I am doing this analysis on). To do that, I had to make a lot of lists to classify everything in the file, and then keep only what I needed. The main data points I needed were: the line the game started at, the line the game ended at, and the percentage of bets each way. Once I had all the lists, I could get just the points I needed for every game and create a data frame from them. I ran into a bunch of other issues cleaning the data, and that is where I was forced to spend the most hours on this project. Some examples: when a game has a line of '0', it is written as 'PK', which is the only non-numeric line value in the raw data file. Sometimes percentages of bets were only given as '-', and I had to throw out those games as unusable. But once I figured out those issues, I was able to move on to the next part: figuring out which games classify as 'reverse line movement' and seeing how they performed. I added all the games with the correct classification to my final data frame, so all the games I was looking at were deemed to be 'good bets'. Lastly, to see how they would perform, I had to figure out what it meant for the strategy to be successful. The percentage of these bets that hit tells us nothing on its own since, depending on the odds, you can hit most of your bets and still lose money. I made another column on the data frame for how much money would have been won or lost with a standard unit size. That way I could see whether the reverse line games actually would have made money with a standard bet size on each bet.
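
# To make the point about hit rate versus profit concrete, here is a small sketch of how the result of a one-unit bet could be computed from American odds. The function name and the ten-bet record are hypothetical and are not data from this project.

```python
def profit_in_units(american_odds, won, stake=1.0):
    """Profit (positive) or loss (negative) for one bet at American odds."""
    if not won:
        return -stake
    if american_odds < 0:
        # e.g. -200 means risking 200 to win 100
        return stake * 100 / -american_odds
    # e.g. +150 means risking 100 to win 150
    return stake * american_odds / 100


# Hypothetical record: ten one-unit bets at -200, six of which win (60% hit rate).
hypothetical_results = [True] * 6 + [False] * 4
total_units = sum(profit_in_units(-200, r) for r in hypothetical_results)
print(total_units)  # -1.0, i.e. 6 * 0.5 - 4
```

# Hitting 60% of bets at -200 still loses one unit, which is why the analysis tracks units won rather than the share of winning bets.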

lines = []  # all the lines
games = []
teamA_spread = [] # the spread of the first team
teamA_spread_odds = []  # the odds associated with betting on team A spread
teamA_spread_percent = []  # the percent of ppl betting on team A spread
teamB_spread = []  # the spread of the second team
teamB_spread_odds = []  # the odds associated with betting on team B spread
teamB_spread_percent = []  # the percent of people betting on team B spread
teamA_moneyline = []  # moneyline odds for teamA
teamA_moneyline_percent = [] # percent of people betting on team A moneyline
teamB_moneyline = []  # moneyline odds for teamB
teamB_moneyline_percent = []  # percent of people betting on team B moneyline
time_spread = []  # the time associated with change of spread
time_moneyline = []  # the time associated with change of moneyline
total_ou = []  # total for game
total_under_odds = []
total_over_odds = []
total_over_percentage = []  # percentage of people betting the over on the total
total_under_percentage = []  # percentage of people betting the under on the total
time_total = []  # the time the bets occur
game_ids_moneyline = []
game_ids_total = []
game_ids_spread = []
game_moneyline = []
game_total = []
game_spread = []
game_id = 1
units_won = []


def build_timestamp(date_tokens, year):
    # Turn tokens like ['08/31', '5:35', 'pm'] into '08/31/<year> 5:35 pm'.
    # January and February dates fall in the calendar year after the season starts.
    month = int(date_tokens[0][0:2])
    season_year = year + 1 if month < 3 else year
    return " ".join([date_tokens[0] + "/" + str(season_year)] + date_tokens[1:])


def process_five_line(five_line, year):
    # A record is five tokenized rows: team A line, team A percent,
    # team B line, team B percent, timestamp.
    for tokens in five_line:
        # skip records with an empty row or a '-' placeholder (no betting data)
        if not tokens or tokens[0] == "-":
            return
    first = five_line[0]
    signed = first[0][0] in "+-"
    if first[0] == "PK" or (signed and len(first) > 1):
        # spread record: '<spread> <odds>' (a pick'em spread is written 'PK')
        if len(first) < 2 or len(five_line[2]) < 2:
            return  # malformed record, skip it
        teamA_spread.append(first[0])
        teamA_spread_odds.append(first[1])
        teamA_spread_percent.append(five_line[1][0])
        teamB_spread.append(five_line[2][0])
        teamB_spread_odds.append(five_line[2][1])
        teamB_spread_percent.append(five_line[3][0])
        time_spread.append(build_timestamp(five_line[4], year))
        game_spread.append(current_game)
    elif signed:
        # moneyline record: a single signed number per team
        teamA_moneyline.append(first[0])
        teamA_moneyline_percent.append(five_line[1][0])
        teamB_moneyline.append(five_line[2][0])
        teamB_moneyline_percent.append(five_line[3][0])
        time_moneyline.append(build_timestamp(five_line[4], year))
        game_moneyline.append(current_game)
    else:
        # total (over/under) record: '<total> <odds>'
        if len(first) < 2 or len(five_line[2]) < 2:
            # a row without odds used to raise IndexError here; skip it instead
            return
        total_ou.append(first[0])
        total_over_odds.append(first[1])
        total_over_percentage.append(five_line[1][0])
        total_under_odds.append(five_line[2][1])
        total_under_percentage.append(five_line[3][0])
        time_total.append(build_timestamp(five_line[4], year))
        game_total.append(current_game)


# open each season's file and process it in five-row records
for file_num in range(16, 19):
    file_str = "results-20" + str(file_num) + ".csv"
    with open(file_str, 'r') as file:
        # split every row on whitespace and keep all rows in a master list
        lines = [row.split() for row in file]
    print(len(lines))
    five_line = []
    for line in lines:
        if not line:
            continue  # ignore blank rows

        # handle the game name
        if 'game:' in line:
            current_game = " ".join(line)
            five_line = []  # never carry a partial record into the next game
            continue

        # collect five rows, then process them as one record for the current game
        five_line.append(line)
        if len(five_line) == 5:
            process_five_line(five_line, file_num)
            games.append(current_game)
            five_line = []


dic = {}
for game in games:
    if game in dic:
        continue
    else:
        dic[game] = game_id
        game_id += 1

for game in game_spread:
    game_ids_spread.append(dic[game])
for game in game_moneyline:
    game_ids_moneyline.append(dic[game])
for game in game_total:
    game_ids_total.append(dic[game])



ML_games_list = []
ML_start_line_teamA = []
ML_start_line_teamB = []
ML_ending_line_teamA = []
ML_ending_line_teamB = []
ML_tot_percent = []
ML_date = []

ML_compiled = list(zip(game_moneyline,teamA_moneyline,teamA_moneyline_percent, teamB_moneyline, teamB_moneyline_percent,time_moneyline))
spread_compiled = list(zip(game_spread, teamA_spread, teamA_spread_odds, teamA_spread_percent, teamB_spread, teamB_spread_odds, teamB_spread_percent, time_spread))
total_compiled = list(zip(game_total, total_ou, total_over_odds, total_over_percentage,total_under_odds, total_under_percentage, time_total))


def get_starting_line(ml_compiled):
    # Records are listed newest first within each game, so the last record
    # before the game name changes holds the opening (starting) line.
    for i in range(len(ml_compiled) - 1):
        if ml_compiled[i + 1][0] != ml_compiled[i][0]:
            ML_games_list.append(ml_compiled[i][0])
            ML_start_line_teamA.append(ml_compiled[i][1])
            ML_start_line_teamB.append(ml_compiled[i][3])
    # the final record of the whole list is the opening line of the last game
    ML_games_list.append(ml_compiled[-1][0])
    ML_start_line_teamA.append(ml_compiled[-1][1])
    ML_start_line_teamB.append(ml_compiled[-1][3])



def get_ending_line_and_percent(ml_compiled):
    ML_ending_line_teamA.append(ml_compiled[0][1])
    ML_ending_line_teamB.append(ml_compiled[0][3])
    ML_tot_percent.append(ml_compiled[0][2])
    ML_date.append(ml_compiled[0][5])
    for j in range(1, len(ml_compiled)):
        if ml_compiled[j][0] != ml_compiled[j-1][0]:
            ML_ending_line_teamA.append(ml_compiled[j][1])
            ML_ending_line_teamB.append(ml_compiled[j][3])
            ML_tot_percent.append(ml_compiled[j][2])
            ML_date.append(ml_compiled[j][5])



get_starting_line(ML_compiled)
get_ending_line_and_percent(ML_compiled)

ML_reverse_line_team = []

ML_ready_for_analysis = list(zip(ML_games_list, ML_start_line_teamA, ML_ending_line_teamA, ML_tot_percent, ML_start_line_teamB, ML_ending_line_teamB,  ML_date))

print(ML_ready_for_analysis)

ML_reverse_game = []
ML_reverse_start_line_teamA = []
ML_reverse_start_line_teamB = []
ML_reverse_ending_line_teamA = []
ML_reverse_ending_line_teamB = []
ML_reverse_tot_percent = []
ML_reverse_date = []
ML_reverse_line_team = []


def identify_reverse_movement(ml_analysis):
    for i in range(len(ml_analysis)):
        if len(ml_analysis[i][3]) == 2 or len(ml_analysis[i][3]) > 3:
            continue
        if float(ml_analysis[i][1]) < float(ml_analysis[i][2]) and float(ml_analysis[i][3][0:2]) > 50:
            ML_reverse_game.append(ml_analysis[i][0])
            ML_reverse_start_line_teamA.append(ml_analysis[i][1])
            ML_reverse_start_line_teamB.append(ml_analysis[i][4])

            ML_reverse_ending_line_teamA.append(ml_analysis[i][2])
            ML_reverse_ending_line_teamB.append(ml_analysis[i][5])

            ML_reverse_tot_percent.append(ml_analysis[i][3])
            ML_reverse_date.append(ml_analysis[i][6])
            ML_reverse_line_team.append('team B')
    

        # mirror case: team A's line shortened while a minority of bets were on team A
        if float(ml_analysis[i][1]) > float(ml_analysis[i][2]) and float(ml_analysis[i][3][0:2]) < 50:
            ML_reverse_game.append(ml_analysis[i][0])
            ML_reverse_start_line_teamA.append(ml_analysis[i][1])
            ML_reverse_start_line_teamB.append(ml_analysis[i][4])

            ML_reverse_ending_line_teamA.append(ml_analysis[i][2])
            ML_reverse_ending_line_teamB.append(ml_analysis[i][5])

            ML_reverse_tot_percent.append(ml_analysis[i][3])
            ML_reverse_date.append(ml_analysis[i][6])
            ML_reverse_line_team.append('team A')



identify_reverse_movement(ML_ready_for_analysis)


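# NOTE: units_won must hold one entry per reverse line game before this point.
# zip() stops at its shortest input, so an empty units_won produces an empty list.
ML_reverse_movement = list(zip(ML_reverse_game, ML_reverse_start_line_teamA, ML_reverse_ending_line_teamA, ML_reverse_tot_percent, ML_reverse_start_line_teamB, ML_reverse_ending_line_teamB, ML_reverse_date, ML_reverse_line_team, units_won))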

44648
48939


columns = ["ML_reverse_game", "ML_reverse_start_line_teamA", "ML_reverse_ending_line_teamA", "ML_reverse_tot_percent", "ML_reverse_start_line_teamB", "ML_reverse_ending_line_teamB", "ML_reverse_date", "ML_reverse_line_team", "units_won"]
df = pd.DataFrame(ML_reverse_movement, columns = columns)


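# LinearRegression needs numeric inputs, so encode the reverse line side
# as 1 for 'team A' and 0 for 'team B'
x = (df['ML_reverse_line_team'] == 'team A').astype(int).to_numpy().reshape(-1, 1)
y = df['units_won'].astype(float).to_numpy().reshape(-1, 1)

regr = linear_model.LinearRegression()

regr.fit(x, y)

plt.scatter(x, y)
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xlabel('reverse line team (0 = team B, 1 = team A)')
plt.ylabel('units won')
plt.show()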

#After running a linear regression on the plot of reverse line team and units won, I found a very low correlation coefficient of 0.11. A hypothesis test on this result does not support the claim that the reverse line team correlates with positive units won in any meaningful way. This suggests that reverse line movement is unlikely to be a good long-term betting strategy.
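
#The text above does not say which hypothesis test was used. As one possible illustration, here is a minimal sketch of a one-sample t-test asking whether the mean units won per bet differs from zero. The function name and the ten values are made up for the example and are not results from this project.

```python
import math
import statistics


def one_sample_t(values, null_mean=0.0):
    """t statistic for H0: the true mean of `values` equals `null_mean`."""
    n = len(values)
    mean = statistics.mean(values)
    # standard error of the mean, using the sample standard deviation
    std_err = statistics.stdev(values) / math.sqrt(n)
    return (mean - null_mean) / std_err


# Hypothetical per-bet results in units: five wins at -110 and five losses.
sample_units = [0.91, -1.0, 0.91, -1.0, -1.0, 0.91, 0.91, -1.0, 0.91, -1.0]
t_stat = one_sample_t(sample_units)
print(round(t_stat, 3))

# With 10 bets there are 9 degrees of freedom; the two-sided 5% critical value is about 2.262.
print(abs(t_stat) > 2.262)  # False: no evidence the mean differs from zero
```

#With a sample this small the test cannot tell a winning strategy from luck, which is the same point made about small sample sizes in the introduction.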