If you follow me on Twitter or have ever played around with the Data Monster, chances are you've come across the Prospect Model before. Back at the end of 2019, I began working on a model to help me better understand and project hundreds of minor league baseball players. I had just joined a 30-team dynasty league and drafted a terrible team, so I found myself scraping the wire for any player with the upside to eventually help my team win. Like many other things in my life, this led me to math and to building a model. However, I wanted to make my model a little different from all of the others: I wanted it to focus specifically on future fantasy value.

Finding A Sample

The first step of any model-building process is identifying the number you want to predict and then building a sample around it. For my purposes, I wanted to focus strictly on future fantasy value. For me, that meant SGP, or standings gain points. For those unfamiliar, SGP is a measure that accounts for the inherent value of each statistical category: how many Runs Scored are equal to 1 HR or 1 SB? This lets you collapse a player's production over a given season into one single number.
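To make the SGP idea concrete, here is a minimal sketch. The denominators (roughly, how much of each stat moves a team one spot in the standings) and the team baselines are made-up illustrative values, not the coefficients the model actually uses:

```python
# Hypothetical SGP denominators: how much of each counting stat is
# worth one spot in the standings. These numbers are invented for
# illustration only.
SGP_DENOMS = {"R": 17.0, "HR": 8.5, "RBI": 16.5, "SB": 7.5}
AVG_DENOM = 0.0017  # hypothetical: change in team AVG worth one standings point

def sgp(stats, team_avg=0.262, team_ab=7000, player_ab=550):
    """Convert a single-season stat line into standings gain points."""
    # Counting categories: stat total divided by its denominator.
    points = sum(stats[cat] / SGP_DENOMS[cat] for cat in SGP_DENOMS)
    # AVG is a ratio category: value it by how much the player
    # shifts a typical team's overall batting average.
    new_avg = (team_avg * (team_ab - player_ab) + stats["H"]) / team_ab
    points += (new_avg - team_avg) / AVG_DENOM
    return points

line = {"R": 100, "HR": 30, "RBI": 95, "SB": 10, "H": 160}
print(round(sgp(line), 2))
```

The appeal is the single output: every category, counting or ratio, lands on the same standings-points scale.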

To determine my SGP coefficients, I used the 2018 Rotowire OC leagues as a base. These are 12-team leagues run by the NFBC, and the sheer number of them gave me a solid baseline and a representative sample for determining value. Then I downloaded the career data of every player from the last 20 or so seasons (with more than 600 PAs) from FanGraphs and scaled it to 600 PAs. I scaled it because I wanted to take every player into account without overweighting players with long careers, and I also wanted to capture players who had the fantasy skills but never got the MLB PAs to make it work.
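The per-600-PA scaling is a simple rate conversion. Here is a sketch with invented career totals (the column names are illustrative, not FanGraphs' exact export format):

```python
import pandas as pd

# Hypothetical career totals for two players with very different
# amounts of MLB playing time.
careers = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "PA":   [4200, 800],
    "HR":   [140, 30],
    "SB":   [60, 12],
    "R":    [520, 95],
})

# Scale every counting stat to a common 600-PA baseline so short
# careers and long careers are directly comparable.
counting = ["HR", "SB", "R"]
per600 = careers.copy()
per600[counting] = careers[counting].div(careers["PA"], axis=0) * 600
print(per600)
```

On this toy data, Player B's 30 HR in 800 PA actually scales to a better per-600 HR rate than Player A's 140 HR in 4,200 PA, which is exactly the short-career signal the scaling is meant to preserve.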

There were a few small adjustments that needed to be made to some numbers, mainly steals. I developed a way to account for the league-wide trend away from SBs so that I was not overvaluing early-2000s seasons with big steal totals that we would never see today.
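One plausible way to do this kind of era adjustment (my assumption; the article does not spell out the exact method) is to scale each season's steals by the ratio of a modern reference rate to that season's league rate. The rates below are illustrative, not real league data:

```python
# Illustrative league-wide SB-per-game rates by season (invented).
LEAGUE_SB_PER_GAME = {2003: 0.57, 2011: 0.67, 2019: 0.46}
REFERENCE_RATE = 0.46  # hypothetical modern-era baseline

def era_adjusted_sb(sb, season):
    """Deflate (or inflate) a steal total toward the reference era."""
    return sb * REFERENCE_RATE / LEAGUE_SB_PER_GAME[season]

# A 40-SB season from a high-steal era gets marked down.
print(round(era_adjusted_sb(40, 2003), 1))
```

The effect is that a gaudy early-2000s steal total is compared against today's players on today's terms rather than its own era's.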

What Leads to Success

With a representative sample built, I wanted to see how each of those players performed at various levels in the minors. I went to FanGraphs again and downloaded all minor league seasons going back to 2007. This gave me a series of metrics, both performance-based and underlying, to compare against current players.

Then, for each level, I ran correlations between each metric and a given player's future fantasy value to determine what actually signals future success or failure. Certain skills matter more at Single-A than they do at Triple-A, and each level weights the metrics accordingly. Each level also includes age, which is a massive consideration when judging how a player is performing relative to his level.
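The correlation-to-weight step can be sketched like this, using randomly generated stand-in data for one level (the metrics, sample size, and relationships are all invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for one level's sample: minor league metrics plus the
# future SGP value each player eventually produced in the majors.
n = 200
kpct = rng.normal(0.22, 0.05, n)
iso = rng.normal(0.150, 0.040, n)
age = rng.normal(21.5, 1.5, n)
future_sgp = 10 - 20 * kpct + 30 * iso - 0.5 * age + rng.normal(0, 1, n)

df = pd.DataFrame({"K%": kpct, "ISO": iso, "Age": age, "SGP": future_sgp})

# Each metric's correlation with future value becomes its weight at
# this level; the absolute value keeps "lower is better" metrics
# like K% on the same scale as "higher is better" ones.
weights = df.corr()["SGP"].drop("SGP").abs()
print(weights.round(2))
```

Running the same procedure level by level is what lets, say, K% carry more weight at Single-A than at Triple-A, since the weights are recomputed from each level's own correlations.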

Model Generation

The next step in the process was actually building and generating the model. For this, I used a concept called Mahalanobis distance. The underlying math matters less than the idea: it is a distance calculation that lets me weight the different parameters by their relative value. What does this mean? Say a particular player is a perfect match for Mike Trout in every metric except K%. If the correlation step indicates that K% at that level is a massive indicator of future success, then despite every other metric matching Trout perfectly, his overall distance will be larger (less similar) than it would be under a straight distance formula.
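A minimal numpy sketch of that effect, using invented metric vectors and an invented covariance matrix: a K% gap that is tiny in raw units becomes a large distance once K% is treated as highly informative.

```python
import numpy as np

# Hypothetical metric vectors (K%, BB%, ISO) for a prospect and a
# Trout-like comp that matches everywhere except K%.
prospect = np.array([0.30, 0.12, 0.220])
comp     = np.array([0.18, 0.12, 0.220])

# Made-up diagonal covariance matrix. K% gets a small variance,
# meaning differences in K% are treated as highly meaningful.
cov = np.diag([0.0004, 0.0016, 0.0025])
VI = np.linalg.inv(cov)  # inverse covariance, as Mahalanobis requires

diff = prospect - comp
d_maha = float(np.sqrt(diff @ VI @ diff))  # Mahalanobis distance
d_eucl = float(np.linalg.norm(diff))       # straight Euclidean distance
print(round(d_maha, 2), round(d_eucl, 2))  # the K% gap dominates d_maha
```

Euclidean distance sees only a 0.12 raw-unit gap; Mahalanobis distance, scaled by the inverse covariance, reports the same gap as many "standard units" apart, so the K%-heavy mismatch pushes the comp far down the similarity list.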

Using this distance formula, I generate a list of the 100 most similar minor league seasons and, weighting those results by closeness, produce a single value for a given player at a given level. I run this for every player with at least 50 PAs at a given minor league level to determine an overall value. If a player played at multiple levels, the individual results are then weighted by his PAs at each level.
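Here is a hedged sketch of the weighting step, with randomly generated stand-ins for the 100 comps and their eventual fantasy values. Inverse-distance weighting is my assumption for "weighting by closeness"; the model's exact scheme is not published.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical output of the distance step: the 100 closest minor
# league seasons, each with a distance and the future SGP value that
# player went on to produce.
distances = np.sort(rng.uniform(0.5, 5.0, 100))
future_values = rng.normal(5.0, 4.0, 100).clip(min=0)

# Weight each comp by inverse distance so closer matches count more,
# then normalize the weights to sum to 1.
weights = 1.0 / distances
weights /= weights.sum()

player_value = float(weights @ future_values)
print(round(player_value, 2))
```

Combining multiple levels is then just a second weighted average, with each level's value weighted by the player's PAs at that level.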

Results

So let's talk a little bit about the output of the model and what it shows so far in 2021. The model results can be interpreted in three ways, which you can see in the screenshot from the Data Monster below. Here are the top 10 prospects (who have not yet reached the majors) thus far by Elite Rate.

Value – This is the pure weighted result of the model. It tends to favor prospects who are closer to the majors, as they have a lower rate of never making it at all.

Adjusted Value – This takes the given level into account. It is essentially a measure of how much better a prospect is than the average prospect at his level, so a Value of 5 at Double-A is better than a Value of 5 at Triple-A.

Elite Rate – This is the percentage of a prospect's weighted outcomes that qualify as "Elite": outcomes greater than 11, which is approximately a 90th-percentile MLB outcome.
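The three definitions above can be sketched against a toy set of weighted outcomes. The outcome distribution and the level average of 4.0 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weighted outcome set for one prospect: 100 comp
# outcomes, here given equal weights for simplicity.
outcomes = rng.normal(6.0, 4.0, 100).clip(min=0)
weights = np.full(100, 0.01)

value = float(weights @ outcomes)                 # Value: weighted mean outcome
level_avg = 4.0                                   # hypothetical average Value at this level
adjusted_value = value - level_avg                # Adjusted Value: vs. level average
elite_rate = float(weights[outcomes > 11].sum())  # Elite Rate: weight above the 11 cutoff

print(round(value, 2), round(adjusted_value, 2), round(elite_rate, 3))
```

The same outcome set feeds all three numbers; they just summarize it from different angles (raw level, level-relative, and tail probability).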

The chart above is the crux of the model output, and within the Data Monster you can sort by these columns and search for a particular player. Additionally, there is a filter to show either only the prospects who have yet to debut or all players with at least 50 PAs at a given level.

Range Of Outcomes/Prospect Compare

However, in my opinion, the most valuable aspect of the model and all of the artifacts it produces is the comparison chart. For example, let’s compare two of the Rays’ best prospects: Wander Franco and Vidal Brujan at the Triple-A level.

Here we can see the weighted distribution of outcomes for these two future superstars. Around the middle of the chart, Brujan appears to have a greater chance of mid-range production than Wander, but as you move toward the right-hand side, Wander starts to outperform Brujan in certain spots. To better highlight these differences, there is a filter within the Data Monster that lets you view only the non-zero outcomes for a given player.

Conclusion

Some quick final thoughts before wrapping this piece up. The model currently takes into account strictly 2021 data. I will be working to include previous seasons in my analysis, but for now it looks only at a single season. Additionally, this is a pure stat-line scouting endeavor. The model does not know anything about these players beyond their actual stats, so it will miss on prospects who show flashes or have great potential, and it may overvalue players who are more advanced than their counterparts but lack projection. It also works strictly for hitters. A pitching model is in my plans, but I have not had the time to really dive into one. I am always working on minor improvements and have a few other features planned, like a list of each player's comps, using OBP instead of pure average, and other model-based tweaks. Due to the nature of the model, I'll be updating the Data Monster with this every Friday morning as opposed to daily. As always, please do not hesitate to reach out on Twitter with any questions you may have.