Skip to main content

Ontario Education & OSAP

·6186 words·30 mins
Author
Ghassan Shahzad
69 - This article is part of a series.

Ontario Education & OSAP
#

This notebook aims to quantify returns to post-secondary education (covered under OSAP), analyse OSAP’s latest policies, and see if the one matches the other - if getting tertiary education is worth it given OSAP financing arrangements, especially loan burdens.

For this notebook, I will be using StatCan’s Census 2021 PUMF Individual data. Since we don’t want to analyse structures (i.e. families), we don’t need hierarchical data.

To quantify and analyse the returns to post-secondary education, we need a few things. First, we need a metric to compare against - the median income of someone who won’t get post-secondary education. A counterfactual would be ideal - the sort of person who gets post-secondary education is not guaranteed to earn the same in, say, the trades as a tradesman. They may earn more, or they may earn less. Instead of comparing Person A who gets post-secondary education to Person B who doesn’t, a detailed analysis would compare Person A who gets post-secondary education to Person A who doesn’t. But constructing such a model seems complex, so we will simply compare median earnings.

Once we’ve constructed that model, we need to quantify the returns to post-secondary education. Median wages vary by degree, field, and experience, even gender, so we will need to control or analyse these variables separately.

Finally, we’ll need to do some accounting calculations: given OSAP’s financing burden, does an investment in education (given the above models) make sense? In technical terms, I’m going to calculate the NPV (value of investment) of the earnings premium (of education) net (minus) of loan repayment costs, and maybe compared across grant/loan ratios.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns

Pre-processing
#

Census 2021’s PUMF data is massive. I will filter the file prior to working with it to minimise space-used. It is also complex, primarily using codes for characteristics and values. I will handle this all early.

The characteristics I’m filtering for:

  1. PPSORT - Unique ID
  2. AGEGRP - AGE
  3. CIP2021 - Fields of Study
  4. EmpIn - Income: Employment income
  5. Gender - Gender
  6. HDGREE - Education: Highest certificate, diploma or degree
  7. LFACT - Labour: Labour force status - Detailed
  8. PR - Province
# Filter for variables and load csv
cols = ['PPSORT', 'AGEGRP', 'CIP2021', 'EmpIn', 'Gender', 'SSGRAD', 'LFACT', 'PR']
census_2021 = pd.read_csv('census_2021_ontario.csv', usecols=cols)
# Filter for Ontario only
census_2021 = census_2021[census_2021['PR'] == 35]
# Remove NA
census_2021 = census_2021[census_2021['EmpIn'] != 99999999]
census_2021 = census_2021[census_2021['EmpIn'] != 88888888]
census_2021 = census_2021[~census_2021['SSGRAD'].isin([88, 99])]
census_2021 = census_2021[~census_2021['AGEGRP'].isin([88])]
census_2021 = census_2021[~census_2021['CIP2021'].isin([88, 99])]
census_2021 = census_2021[~census_2021['LFACT'].isin([88, 99])]
# Filtering for working age
census_2021 = census_2021[census_2021['AGEGRP'].isin(range(8,17))]
# Filter for positive income
census_2021 = census_2021[census_2021['EmpIn'] >= 0]

census_2021.head()

PPSORTAGEGRPCIP2021EmpInGenderLFACTPRSSGRAD
021181200013356
18164610001133511
31012132500021354
41213513000011358
82110563000113511
SSGRAD = {
    1:  'No certificate',
    2:  'Trades (no HS)',
    3:  'College/CEGEP (no HS)',
    4:  'HS',
    5:  'Trades',
    6:  'College/CEGEP',
    7:  'University below bachelor',
    8:  'Bachelor',
    9:  'University above bachelor',
    10: 'Medicine/Dentistry/Vet/Optometry',
    11: 'Masters',
    12: 'Doctorate',
    88: None,  # Not available
    99: None   # Not applicable
}


GENDER = {
    1: "Woman",
    2: "Man"
}

AGEGRP = {
    4: "10-11",
    19: "75-79",
    16: "60-64",
    10: "30-34",
    11: "35-39",
    2: "5-6",
    17: "65-69",
    3: "7-9",
    1: "0-4",
    20: "80-84",
    12: "40-44",
    8: "20-24",
    6: "15-17",
    7: "18-19",
    18: "70-74",
    13: "45-49",
    88: None,
    14: "50-54",
    9: "25-29",
    21: "85+",
    5: "12-14",
    15: "55-59",
}

CIP2021 = {
    11: "Personal/protective/transport services",
    8: "Architecture/engineering/trades",
    99: None,
    3: "Humanities",
    10: "Health",
    7: "Math/CS/info",
    88: None,
    4: "Social sciences & law",
    9: "Agriculture/natural resources/conservation",
    5: "Business/management/public admin",
    1: "Education",
    13: "No postsecondary degree",
    12: "Other",
    2: "Visual/performing arts & comm",
    6: "Physical/life sciences & tech",
}

LFACT = {
    1:  'Employed',
    2:  'Employed - Absent',
    3:  'Unemployed',
    4:  'Unemployed',
    5:  'Unemployed',
    6:  'Unemployed',
    7:  'Unemployed',
    8:  'Unemployed',
    9:  'Unemployed',
    10: 'Unemployed',
    11: 'Not in labour force',
    12: 'Not in labour force',
    13: 'Not in labour force',
    14: 'Not in labour force',
    88: None,
    99: None
}

Constructing the Alternate
#

First, we will find the median income of the average non-post secondary educated person. This includes those with a high school education, tradesmen, career college diplomas, and so on.

non_ps_df = census_2021[census_2021['SSGRAD'].isin([1, 2, 4, 5])]
non_ps_df['EmpIn'].median()
np.float64(33000.0)

The median non-post secondary educated person in Ontario earns $33,000 per year. This figure accounts for the unemployed, and not for those with negative income.

non_ps_df[non_ps_df['LFACT'].isin([1, 2])]['EmpIn'].median()
np.float64(42000.0)

Filtering for the employed only, that number increases to $42000 per year.

Returns to Education
#

The Mincer earnings function, given wages (their log, to be specific), years of experience, and years of education (alongside other independent variables), models an individuals earnings as a function of these variables, allowing us to calculate the value of, for instance, an additional year of schooling. It’s identical to what I did in the last post, with two distinctions - first, we’re tracking multiple independent variables (schooling, exp, gender); second, we’re using the log of wages. This is because we want the percentage increase of our coefficients, not their absolute increase.

There are some limitations, specific to my data and general to the model. I only have age groups, so I will have to use midpoints - this affects years of experience. Furthermore, my formula for years of experience is simply: age - (years of schooling + 6). This does not account for unemployment and other such things.

Years of education are guessed for some groups with non-post-secondary education - it is impossible to get an exact year for, say, someone who has neither an HS diploma or degree.

Code
#

mincer_df = census_2021.copy()

# Map degree to years of schooling
edu_years = {
    1:  10,  # No certificate — assume some high school
    2:  11,  # Trades no HS — HS dropout + trades
    3:  13,  # College no HS — assume completed college
    4:  12,  # HS diploma only
    5:  13,  # Trades + HS — 12 + ~1 year trades
    6:  14,  # College/CEGEP — 12 + 2
    7:  14,  # University below bachelor — 12 + 2
    8:  16,  # Bachelor — 12 + 4
    9:  17,  # University above bachelor — 12 + 5
    10: 18,  # Medicine/Dentistry/Vet — 12 + 4 + 2 specialty minimum
    11: 18,  # Masters — 12 + 4 + 2
    12: 21,  # Doctorate — 12 + 4 + 5
}
age_midpoint = {
    8:  22,   # 20-24
    9:  27,   # 25-29
    10: 32,   # 30-34
    11: 37,   # 35-39
    12: 42,   # 40-44
    13: 47,   # 45-49
    14: 52,   # 50-54
    15: 57,   # 55-59
    16: 62,   # 60-64
}

mincer_df['log_wage'] = np.log(mincer_df['EmpIn'])
mincer_df['age_mid'] = mincer_df['AGEGRP'].map(age_midpoint)
mincer_df['edu_years'] = mincer_df['SSGRAD'].map(edu_years)
mincer_df['exp_years'] = mincer_df['age_mid'] - (mincer_df['edu_years'] + 6)
mincer_df['exp_years'] = mincer_df['exp_years'].clip(lower=0)

mincer_df

PPSORTAGEGRPCIP2021EmpInGenderLFACTPRSSGRADlog_wageage_midedu_yearsexp_years
0211812000133569.392662371417
1816461000113351111.018629621838
3101213250002135410.126631421224
4121351300001135811.775290471625
8211056300011351111.05089032188
.......................................
37884298085111818000011351212.100712372110
3788439808521213490002135110.799576421026
37884498085612101100001135611.608236421422
3788469808621613330001135410.404263621244
3788489808661051300002135611.775290321412

182864 rows × 12 columns

model = sm.OLS.from_formula('log_wage ~ edu_years + exp_years + I(exp_years**2) + C(Gender)', data=mincer_df)
results = model.fit()

results.summary()
OLS Regression Results
Dep. Variable:log_wageR-squared:0.102
Model:OLSAdj. R-squared:0.102
Method:Least SquaresF-statistic:5217.
Date:Mon, 13 Apr 2026Prob (F-statistic):0.00
Time:17:07:21Log-Likelihood:-3.3964e+05
No. Observations:182864AIC:6.793e+05
Df Residuals:182859BIC:6.793e+05
Df Model:4
Covariance Type:nonrobust
coefstd errtP>|t|[0.0250.975]
Intercept7.00400.027264.1760.0006.9527.056
C(Gender)[T.2]0.38840.00753.2910.0000.3740.403
edu_years0.14970.00290.6630.0000.1460.153
exp_years0.10440.00194.2000.0000.1020.107
I(exp_years ** 2)-0.00202.43e-05-80.5470.000-0.002-0.002
Omnibus:142020.638Durbin-Watson:1.997
Prob(Omnibus):0.000Jarque-Bera (JB):3451730.600
Skew:-3.609Prob(JB):0.00
Kurtosis:23.023Cond. No.6.39e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Since we are using ln(wage) (we have a log-level model), we need to adjust the coefficients to get proper slopes.

params = results.params
adjusted_params = (np.exp(params) - 1) * 100
adjusted_params
Intercept            110006.516586
C(Gender)[T.2]           47.455093
edu_years                16.143483
exp_years                10.999062
I(exp_years ** 2)        -0.195553
dtype: float64

Interpretation
#

We have three independent variables we’re tracking: years of schooling, experience, and gender. Our R-squared is equal to 0.102, which means that 10% of the variation in employment income is explained by these factors. All these variables are statistically significant (P>|t| < 0.05).

Our baseline is a male with 0 years of education and experience. An additional year of schooling (for such a person) gives us a 16.1% increase in wages. An additional year of experience gives us 11% more wages, but with a diminishing return of 0.2pp each year.

Our data also shows that men earn 47.46% more than women.

Importantly, we can now model incomes based on a hypothetical person’s education, experience, and gender.

baseline = results.predict({'edu_years': 12, 'exp_years': 2, 'Gender': 2})
# HS degree, 2 years of experience, male
np.exp(baseline[0])
np.float64(11957.7402224044)

The Costs of Education
#

The purpose of this article is to see how OSAP policy changes affect an individual’s choice of education. Starting from the year 2026-27, OSAP has tweaked its grant:loans ratio to 25:75, at a maximum. The exact ratio may be 0:100, depending on an individual’s financial background. We will be testing variations up to the maximum.

One question is how much money our model should expect from OSAP. Again, this depends on an individual’s background. Some individuals can expect a near 100% cover, while others will receive nothing. Moreover, the ratio and level of financing are tightly linked: an individual who receives a 100% cover will likely also get the maximum 25:75 ratio because both factors are determined by financial background.

Generally, the costs of post-secondary education are many. Our model covers experience - every year spent in post-secondary is one less year of experience compared to the person who skipped it. Some costs are deceptive - we could include rent and other living expenses, but these are independent of post-secondary education.

Simple NPV of Education
#

To cut through the fog, we will start with a simple calculation. We will ignore OSAP and other factors, and simply see if, ceteris paribus, education is a worthwhile investment using the Net Present Value formula. The NPV formula is expressed as such:

NPV = sum(net_cash_flow / (1 + discount_rate)^time) - costs

Our costs will be the costs of education (tuition, fees) and the foregone wages (opportunity cost). Ontariocolleges.ca provides a number: 6100 (tuition) + 800 (fees) + 1300 (books and supplies for a total $8200 per academic year. I will assume a bachelors degree. Note that this is an average, exact costs will depend on institution. It also assumes Ontario, despite OSAP covering out-of-province study in certain cases. Our discount rate will be the rate of interest. Our net cash flow will be earnings post-degree. Time will be a working lifetime. Gender will be male (for our example).

This can be quite confusing. It serves to express this in tabular form once to get the general gist of how it works. First, we need to get our cash flows down.

post_sec = []
post_sec[0:3] = [-8200] * 4 # adding college costs
# getting the rest of the cash flow of our post-sec model
for exp in range(22,66):
    post_sec.append(np.exp(results.predict({'edu_years': 16, 'exp_years': exp-22, 'Gender': 2})[0]))
hs = [] # same but for our hs model
for exp in range(18,66):
    hs.append(np.exp(results.predict({'edu_years': 12, 'exp_years': exp-18, 'Gender': 2})[0]))
data = [post_sec, hs]
cash_flow = pd.DataFrame(data=list(zip(*data)), columns=['post_sec', 'hs'], index=range(0, 48))
cash_flow

post_sechs
0-8200.0000009781.623220
1-8200.00000010836.277815
2-8200.00000011957.740222
3-8200.00000013143.707352
417798.78294314390.849470
519717.84768215694.763037
621758.47688317049.940925
723916.47980818449.762380
826185.79762519886.504769
928558.41759121351.378687
1031024.31885622834.587510
1133571.45420324325.411815
1236185.77141225812.318479
1338851.27715927283.093505
1441550.14536328724.996933
1544262.87080930124.937429
1646968.46765931469.663498
1749644.71114832745.967587
1852268.41945833940.898824
1954815.77141235041.979698
2057262.65439736037.421651
2159585.03575836916.334435
2261759.34990337668.924029
2363762.89256938286.674120
2465574.21313138762.506430
2567173.49553139090.915696
2668542.91843139268.075679
2769666.98542139291.913367
2870532.81676839162.149351
2971130.39501338880.303273
3071452.75787938449.664208
3171496.13328137875.226776
3271260.01280137163.594721
3370747.16160536322.854566
3469963.56454135362.422716
3568918.30989034292.870055
3667623.41392133125.728596
3766093.59098731873.285126
3864345.97531530548.366986
3962399.80184029164.125185
4060276.05437727733.819916
4157997.09014326270.613299
4255586.24995324787.373747
4353067.46357123296.495863
4450464.85943721809.739138
4547802.38753520338.088067
4645103.46341618891.635562
4742390.64047717479.490823
# crunching the numbers
cash_flow['net_ps'] = cash_flow['post_sec'] / (1 + 0.03) ** cash_flow.index
cash_flow['net_hs']= cash_flow['hs'] / (1 + 0.03) ** cash_flow.index
male_npv = cash_flow['net_ps'].sum() - cash_flow['net_hs'].sum()
male_npv
np.float64(356221.5858905276)

The data seems to tell us that college education is a great investment. A NPV of $356,222, specifically.

There’s a few complications here, though. Of course, financing the initial investment is difficult, which is why we have to bring OSAP into our calculations. This doesn’t present the whole story.

The main specific issue is unemployment - my numbers account for unemployment, so the unemployed are dragging the wages down (hence the somewhat odd, below minimum wage, numbers for initial years). Since we’re doing this for both HS and PS models, it cancels out somewhat. Employment chances are pretty important for decisions such as these, so accounting for unemployment seems sensible, but it does make our figures somewhat less precise.

Moreover, we’re assuming a Bachelor’s degree, and not accounting for field of study. For our non-post-sec model, we’re only assuming a high-school degree when trades are also an option. These variables will be covered later on.

For clarity’s sake, I’ll recrunch the numbers for a female model too.

post_sec_fem = []
post_sec_fem[0:3] = [-8200] * 4
for exp in range(22,66):
    post_sec_fem.append(np.exp(results.predict({'edu_years': 16, 'exp_years': exp-22, 'Gender': 1})[0]))
hs_fem = []
for exp in range(18,66):
    hs_fem.append(np.exp(results.predict({'edu_years': 12, 'exp_years': exp-18, 'Gender': 1})[0]))
data = [post_sec_fem, hs_fem]
cash_flow_fem = pd.DataFrame(data=list(zip(*data)), columns=['post_sec', 'hs'], index=range(0, 48))
cash_flow_fem

post_sechs
0-8200.0000006633.628607
1-8200.0000007348.866430
2-8200.0000008109.411478
3-8200.0000008913.701861
412070.6464619759.479442
513372.10353710643.757863
614756.00229711562.802341
716219.50069712512.122862
817758.48980513486.482148
919367.53559714479.919465
1021039.84220815485.791012
1122767.23954416496.827169
1224540.19777817505.206490
1326347.87067518502.645778
1428178.16907419480.505138
1530017.86507520429.906389
1631852.72665821341.862730
1733667.68159722207.417160
1835447.00861823017.786755
1937174.55285223764.509606
2038833.96178624439.591039
2140408.93713425035.645585
2241883.49735725546.031208
2343242.24504625964.972378
2444470.63297526287.668801
2545555.22243926510.386958
2646483.92749426630.532000
2747246.23889926646.698067
2847833.42197326558.695669
2948238.68315026367.555389
3048457.30080426075.507786
3148486.71681325685.940073
3248326.58637925203.330730
3347978.78476324633.163814
3447447.37072423981.825273
3546738.50768623256.483991
3645860.34475222464.960662
3744822.86079321615.587849
3843637.67577320717.064702
3942317.83429619778.309875
4040877.56701118808.316068
4139332.03596717816.009469
4237697.07026716810.117083
4335988.89842115799.044589
4434223.88367414790.766952
4532418.26824813792.733556
4630587.93192812811.789136
4728748.16981011854.111302
# crunching the numbers
cash_flow_fem['net_ps'] = cash_flow_fem['post_sec'] / (1 + 0.03) ** cash_flow_fem.index
cash_flow_fem['net_hs']= cash_flow_fem['hs'] / (1 + 0.03) ** cash_flow_fem.index
fem_npv = cash_flow_fem['net_ps'].sum() - cash_flow_fem['net_hs'].sum()
fem_npv
np.float64(231476.0626947455)

For women, the NPV is $231,476. This is an interesting result - common-sense tells us that women benefit more from college than men because the sort of jobs that don’t require college degrees are less-suited for them (for whatever reason). This value seems to contradict that.

However, this is an absolute number. It makes sense for the absolute number to be lower, because of the gender pay gap we quantified earlier in this notebook. Once we account for that, by modeling the relative ROI, what do the results look like?

hs_male_pv = cash_flow['net_hs'].sum()
hs_fem_pv = cash_flow_fem['net_hs'].sum()

male_roi = (male_npv / hs_male_pv) * 100
fem_roi = (fem_npv / hs_fem_pv) * 100

print(f"Male ROI: {male_roi:.1f}%")
print(f"Female ROI: {fem_roi:.1f}%")
Male ROI: 52.2%
Female ROI: 50.0%

The results seem to be similar. This does not contradict the intuition, however. Our model fails to capture some considerations. For starters, our coefficient for edu_years already accounts for both female and male returns together. We’d have to calculate two seperate slopes, and then our result would be cleaner. The exact idea behind women and edcation is quite complex, and goes beyond just a simple wage model’s capabilities too. That’s for another day, though.

Accounting for OSAP
#

Regardless, now that we have a base model, and have determined that education makes sense if you can afford initial costs, we can move onto accounting for OSAP. We will assume a 100% OSAP coverage, with variations in grants:loans ratio. Why? Well, if you have 0% OSAP coverage you can either not afford college or will pay using a RESP or some other means: for the former, this model is useless in any case, and in the latter case the answer is yes - our ’naive’ model says you should go to college.

The case of varying proportions of OSAP and misc coverage is more complex (i.e. 50% OSAP, 50% other sources). However, there is no need to cover that. This is because, if 100% OSAP (meaning a large loan burden) is still a positive investment, then a lower proportion of OSAP funds should generally be the same as well.

OSAP, including loans, does cover non-tuition/fees/books expenses. I made the decision not to include these since these exist for HS too, and I won’t include them in this model either.

OSAP loans need to be repayed 6 months after graduating (for simplicities’ sake, I’ll put that as starting in the fifth year). The time horizon is 9.5 years. 70% of the debt is 0-interest federal, while 30% is provincial and has an interest rate of: prime-rate + 1%. The current prime-rate is 4.45%, but this varies. OSAP has a handy calculator that contains more details.

Once we are done with that, we will finally control for fields of study and exact degree variation.

def osap_loan(grant_ratio=0.25, total_osap=8200*4, interest_rate=0.06):
    """Given grant_ratio, loan, and interest rate, calculates and returns annuity"""
    loan = total_osap * (1 - grant_ratio)
    loan_fed = loan * 0.75
    loan_prov = loan * 0.25

    annual_payment = loan_prov * (interest_rate * (1 + interest_rate)**10) / ((1 + interest_rate)**10 - 1)
    annual_payment += loan_fed / 10

    return annual_payment

osap_loan()
2680.58794305536
def osap_npv(ratio, main_edu=16, sec_edu=12, main_model=results, sec_model=results, bFos=False, fos=None):
    """Given a OLS model, education for main and counterfactual model, grant ratio, returns NPV for each gender"""
    main_male = [0 for x in range(18,66)]
    sec_male = [0 for x in range(18,66)]
    main_female = [0 for x in range(18,66)]
    sec_female = [0 for x in range(18,66)]
    sec_range = sec_edu - 12

    for exp in range(22,66):
        if bFos:
            main_male[exp - 18] = np.exp(main_model.predict({'edu_years': main_edu, 'exp_years': exp-22, 'Gender': 2, 'FOS': fos})[0])
            main_female[exp - 18] = np.exp(main_model.predict({'edu_years': main_edu, 'exp_years': exp-22, 'Gender': 1, 'FOS': fos})[0])
        else:
            main_male[exp - 18] = np.exp(main_model.predict({'edu_years': main_edu, 'exp_years': exp-22, 'Gender': 2})[0])
            main_female[exp - 18] = np.exp(main_model.predict({'edu_years': main_edu, 'exp_years': exp-22, 'Gender': 1})[0])

    for exp in range(18 + sec_range,66):
        if bFos:
            sec_male[exp - 18] = np.exp(sec_model.predict({'edu_years': sec_edu, 'exp_years': exp-(18 + sec_range), 'Gender': 2, 'FOS': fos})[0])
            sec_female[exp - 18] = np.exp(sec_model.predict({'edu_years': sec_edu, 'exp_years': exp-(18 + sec_range), 'Gender': 1, 'FOS': fos})[0])
        else:
            sec_male[exp - 18] = np.exp(sec_model.predict({'edu_years': sec_edu, 'exp_years': exp-(18 + sec_range), 'Gender': 2})[0])
            sec_female[exp - 18] = np.exp(sec_model.predict({'edu_years': sec_edu, 'exp_years': exp-(18 + sec_range), 'Gender': 1})[0])

    annual_payment = osap_loan(grant_ratio=ratio/100)
        
    main_male[4:14] = [x - annual_payment for x in main_male[4:14]]
    main_female[4:14] = [x - annual_payment for x in main_female[4:14]]

    data = [main_male, sec_male, main_female, sec_female]
    cash_flow_osap = pd.DataFrame(data=list(zip(*data)), columns=['main_male', 'sec_male', 'main_female', 'sec_female'], index=range(0, 48))
    cash_flow_osap['net_main_male'] = cash_flow_osap['main_male'] / (1 + 0.03) ** cash_flow_osap.index
    cash_flow_osap['net_sec_male']= cash_flow_osap['sec_male'] / (1 + 0.03) ** cash_flow_osap.index
    cash_flow_osap['net_main_female'] = cash_flow_osap['main_female'] / (1 + 0.03) ** cash_flow_osap.index
    cash_flow_osap['net_sec_female']= cash_flow_osap['sec_female'] / (1 + 0.03) ** cash_flow_osap.index
    return[cash_flow_osap['net_main_male'].sum() - cash_flow_osap['net_sec_male'].sum(), cash_flow_osap['net_main_female'].sum() - cash_flow_osap['net_sec_female'].sum() ]
ratios = range(0,26) # from 0% grants to 25% grants, the maximum
osap_npvs = []
for ratio in ratios:
    osap_npvs.append(osap_npv(ratio=ratio))

pd.DataFrame(osap_npvs, columns=['Male', 'Female'])

MaleFemale
0359715.410269234969.887073
1359994.418157235248.894961
2360273.426044235527.902848
3360552.433931235806.910735
4360831.441819236085.918623
5361110.449706236364.926510
6361389.457593236643.934397
7361668.465480236922.942285
8361947.473368237201.950172
9362226.481255237480.958059
10362505.489142237759.965947
11362784.497030238038.973834
12363063.504917238317.981721
13363342.512804238596.989609
14363621.520692238875.997496
15363900.528579239155.005383
16364179.536466239434.013271
17364458.544354239713.021158
18364737.552241239992.029045
19365016.560128240271.036932
20365295.568016240550.044820
21365574.575903240829.052707
22365853.583790241108.060594
23366132.591677241387.068482
24366411.599565241666.076369
25366690.607452241945.084256

The results are about as clear as they get - post-secondary education, no matter the grant ratio, is a good investment relative to going through life with a high school diploma. The main reasons for the ’no matter’ part are simple: OSAP loans are 70% federal, which is zero-interest. Zero-interest loans are no different from the one-time expenses we considered in the naive model. Of course, since up to 25% of your allowance is grants - free money - that helps out too.

Of course, most critics of college do not recommend just a high school diploma. They often recommend the skilled trades.

ratios = range(0,26) # from 0% grants to 25% grants, the maximum
osap_npvs = []
for ratio in ratios:
    osap_npvs.append(osap_npv(sec_edu=13, ratio=ratio))

pd.DataFrame(osap_npvs, columns=['Male', 'Female'])

MaleFemale
0277557.536966179252.670575
1277836.544853179531.678463
2278115.552741179810.686350
3278394.560628180089.694237
4278673.568515180368.702125
5278952.576403180647.710012
6279231.584290180926.717899
7279510.592177181205.725787
8279789.600065181484.733674
9280068.607952181763.741561
10280347.615839182042.749449
11280626.623727182321.757336
12280905.631614182600.765223
13281184.639501182879.773110
14281463.647389183158.780998
15281742.655276183437.788885
16282021.663163183716.796772
17282300.671050183995.804660
18282579.678938184274.812547
19282858.686825184553.820434
20283137.694712184832.828322
21283416.702600185111.836209
22283695.710487185390.844096
23283974.718374185669.851984
24284253.726262185948.859871
25284532.734149186227.867758

This dents our numbers by ~$55k for women at the pessimistic end, and ~$80k for men, but we’re still solidly in the green.

General Statistics about Education
#

A sort of addendum. Here I visualise some interesting data before we go into the final section.

mapped_df = mincer_df.copy()
mapped_df['FOS'] = mapped_df['CIP2021'].map(CIP2021)
mapped_df['Degree'] = mapped_df['SSGRAD'].map(SSGRAD)
mapped_df['gen'] = mapped_df['Gender'].map(GENDER)
mapped_df['gen'] = mapped_df['Gender'].map(GENDER)
mapped_df

PPSORTAGEGRPCIP2021EmpInGenderLFACTPRSSGRADlog_wageage_midedu_yearsexp_yearsFOSDegreegen
0211812000133569.392662371417Architecture/engineering/tradesCollege/CEGEPWoman
1816461000113351111.018629621838Social sciences & lawMastersWoman
3101213250002135410.126631421224No postsecondary degreeHSMan
4121351300001135811.775290471625Business/management/public adminBachelorWoman
8211056300011351111.05089032188Business/management/public adminMastersWoman
................................................
37884298085111818000011351212.100712372110Architecture/engineering/tradesDoctorateWoman
3788439808521213490002135110.799576421026No postsecondary degreeNo certificateMan
37884498085612101100001135611.608236421422HealthCollege/CEGEPWoman
3788469808621613330001135410.404263621244No postsecondary degreeHSWoman
3788489808661051300002135611.775290321412Business/management/public adminCollege/CEGEPMan

182864 rows × 15 columns

sns.set_theme(style="whitegrid", font="Monospace", font_scale=1, rc={'figure.figsize': (14, 8)})
median_deg = mapped_df.groupby('Degree')['EmpIn'].median().sort_values()
g = sns.barplot(median_deg, palette="Blues_d", legend=False)
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
g.set_xlabel('Degree')
g.set_ylabel('Median Employment Income (CAD$)')
g.set_title('Median Income by Degree', y=1.02)
plt.show()
png

We see a pretty linear increase as levels of education increase, but not all degree jumps are equal. Getting a bachelor (vs HS) increases your earnings by roughly 80%, where a masters merely gives you ~$20k on top of that, but for a 2 year investment. It may be justified if you’re getting a doctorate, at the cost of an additional 3 years of study but with nearly double the median earnings of a bachelor - it will not take long to earn the money back. The difference in trades and a bachelor expressed like this is minor, worse when you consider the lower time expenditure and debt burden; however, we’ve already calculated the NPV to find that a bachelor still makes sense.

median_fos = mapped_df.groupby('FOS')['EmpIn'].median().sort_values()
g = sns.barplot(median_fos, palette='Blues_d')
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
g.set_xlabel('Field of Study')
g.set_ylabel('Median Employment Income (CAD$)')
g.set_title('Median Income by Field of Study', y=1.02)
plt.show()
png

There is a wide variation in median earnings by field of study as well - STEM (and, curiously, ’education’) pays the most and the arts and humanities pay the least. Education can be explained by the fact that teachers are paid handsomely in Ontario ‘with salaries ranging from $65,000 to $110,000 per year’. As a category, social sciences & law may make sense academically but does complicate our analysis: law degrees and sociology majors will likely get paid quite differently.

gen_median_deg = mapped_df.groupby(['gen', 'Degree'])['EmpIn'].median().sort_values()

fig, axes = plt.subplots(1, 2, figsize=(14, 8), sharey=False)

genders = ['Woman', 'Man']

for ax, gender in zip(axes, genders):
    data = gen_median_deg[gender]
    sns.barplot(x=data.index, y=data.values, ax=ax, palette='Blues_d', hue=data.index)
    ax.set_title(gender)
    ax.set_xlabel('')
    ax.set_ylabel('Median Employment Income (CAD$)' if gender == 'Woman' else '')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

fig.suptitle('Median Income by Degree and Gender', y=1.02)
plt.tight_layout()
plt.show()
png
gap_deg = gen_median_deg.unstack(level=0)
gap_deg['Gap'] = ((gap_deg['Woman'] / gap_deg['Man']) * 100)
g = sns.barplot(gap_deg['Gap'].sort_values(ascending=False), palette='Blues_d')
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
g.set_xlabel('Field of Study')
g.set_ylabel('Gender Gap (Female Earnings as % of Men)')
g.set_title('Gender Gap by Degree', y=1.02)
plt.show()
png

This data is unsurprising. The more educated women are, the more they surmount the gender pay gap - the situation, as common-sense would tell you, is worst in the trades. What’s surprising is that women without any certification (in terms of gender gap relative to male counterparts, not absolute income) do better in this regard. Of course, ‘No certificate’ is an odd category, likely a small number of immigrants or refugees who don’t have any Canadian certifications, and you shouldn’t read too much into it. The reasons for the pay gap declining by degree are manifold, and beyond the scope of this post.

gen_median_fos = mapped_df.groupby(['gen', 'FOS'])['EmpIn'].median()
fig, axes = plt.subplots(1, 2, figsize=(20, 8), sharey=False)

genders = ['Woman', 'Man']

for ax, gender in zip(axes, genders):
    data = gen_median_fos[gender].sort_values()
    sns.barplot(x=data.index, y=data.values, ax=ax, palette='Blues_d', hue=data.index)
    ax.set_title(gender)
    ax.set_xlabel('')
    ax.set_ylabel('Median Employment Income (CAD$)' if gender == 'Woman' else '')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

fig.suptitle('Median Income by Field of Study and Gender', y=1.02)
plt.tight_layout()
plt.show()
png
gap = gen_median_fos.unstack(level=0)
gap['Gap'] = ((gap['Woman'] / gap['Man']) * 100)
g = sns.barplot(gap['Gap'].sort_values(ascending=False), palette='Blues_d')
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
g.set_xlabel('Field of Study')
g.set_ylabel('Gender Pay Gap (Female Earnings as % of Men)')
g.set_title('Gender Gap by Field of Study', y=1.02)
plt.show()
png

Women in the performing arts and ‘comm’, and STEM suffer the lowest gender pay gaps. The worst pay gap is in ‘personal/protective/transport services’, the non-degree category, and social sciences & law. The latter may be explained primarily through the law subcategory. Lawyer jobs are extremely demanding (‘greedy’) and women tend to disprefer such jobs; female representation in that category, therefore, will likely be in the lower-paying degrees in the social sciences, therefore skewing the category’s gap.

Field of Study Comparisons
#

Back to the plot. Rationally speaking, there’s really no argument against even maximal changes to the loan and grant ratio for OSAP. Of course, (future) students are relatively worse off and that is a negative, regardless of whether you think it’s a sufficient criticism (for that, you’d need to analyse a whole lot more than just ROI, including Ontario’s fiscal space). But is that really all there is to it? Let’s break down the analysis into more granular terms to get a clearer picture before our final conclusion. We will calculate NPV by the following categories:

  1. Field of Study
  2. Gender

We could also do so by degree, but degree finances get more and more complicated as you go up the chain - doctorates being worst in this regard. Such an analysis would require a separate notebook.

I will do two separate analyses: one will compare a bachelors to a hs diploma, the other will compare a bachelors to a trades degree, by field of study.

Both analyses will use a pessimistic scenario for osap ratios: 100% loans.

# Filters for post-secondary degrees
ps_df = mapped_df.copy()

# This model doesn't include edu_years to prevent multicollinearity
model_fos = sm.OLS.from_formula('log_wage ~ edu_years + C(FOS)*exp_years + C(FOS)*I(exp_years**2) + C(Gender)', data=ps_df)
results_fos = model_fos.fit()

results_fos.summary()
OLS Regression Results
Dep. Variable:log_wageR-squared:0.113
Model:OLSAdj. R-squared:0.112
Method:Least SquaresF-statistic:626.4
Date:Mon, 13 Apr 2026Prob (F-statistic):0.00
Time:17:07:43Log-Likelihood:-3.3861e+05
No. Observations:182864AIC:6.773e+05
Df Residuals:182826BIC:6.777e+05
Df Model:37
Covariance Type:nonrobust
coefstd errtP>|t|[0.0250.975]
Intercept7.65060.10572.8890.0007.4457.856
C(FOS)[T.Architecture/engineering/trades]0.02650.1010.2610.794-0.1720.225
C(FOS)[T.Business/management/public admin]0.00670.1000.0670.947-0.1900.203
C(FOS)[T.Education]-0.22880.122-1.8800.060-0.4670.010
C(FOS)[T.Health]0.03840.1020.3760.707-0.1620.239
C(FOS)[T.Humanities]-0.35190.112-3.1300.002-0.572-0.132
C(FOS)[T.Math/CS/info]-0.09710.110-0.8850.376-0.3120.118
C(FOS)[T.No postsecondary degree]-0.65180.099-6.5710.000-0.846-0.457
C(FOS)[T.Personal/protective/transport services]-0.12550.112-1.1180.263-0.3450.094
C(FOS)[T.Physical/life sciences & tech]-0.53440.107-4.9840.000-0.745-0.324
C(FOS)[T.Social sciences & law]-0.16460.101-1.6230.105-0.3630.034
C(FOS)[T.Visual/performing arts & comm]-0.51590.114-4.5280.000-0.739-0.293
C(Gender)[T.2]0.36530.00846.6020.0000.3500.381
edu_years0.12160.00345.4120.0000.1160.127
exp_years0.10690.01110.0790.0000.0860.128
C(FOS)[T.Architecture/engineering/trades]:exp_years0.00140.0110.1260.900-0.0200.023
C(FOS)[T.Business/management/public admin]:exp_years-0.00740.011-0.6720.501-0.0290.014
C(FOS)[T.Education]:exp_years0.02340.0131.7950.073-0.0020.049
C(FOS)[T.Health]:exp_years-0.02200.011-1.9550.051-0.0445.06e-05
C(FOS)[T.Humanities]:exp_years-0.00270.012-0.2160.829-0.0270.022
C(FOS)[T.Math/CS/info]:exp_years0.01230.0121.0230.306-0.0110.036
C(FOS)[T.No postsecondary degree]:exp_years0.01300.0111.2000.230-0.0080.034
C(FOS)[T.Personal/protective/transport services]:exp_years-0.01070.012-0.8800.379-0.0350.013
C(FOS)[T.Physical/life sciences & tech]:exp_years0.04290.0123.5260.0000.0190.067
C(FOS)[T.Social sciences & law]:exp_years0.00430.0110.3800.704-0.0180.026
C(FOS)[T.Visual/performing arts & comm]:exp_years0.00360.0130.2820.778-0.0210.028
I(exp_years ** 2)-0.00240.000-10.0980.000-0.003-0.002
C(FOS)[T.Architecture/engineering/trades]:I(exp_years ** 2)0.00020.0000.6850.494-0.0000.001
C(FOS)[T.Business/management/public admin]:I(exp_years ** 2)0.00040.0001.6510.099-7.67e-050.001
C(FOS)[T.Education]:I(exp_years ** 2)-0.00040.000-1.4080.159-0.0010.000
C(FOS)[T.Health]:I(exp_years ** 2)0.00080.0003.0330.0020.0000.001
C(FOS)[T.Humanities]:I(exp_years ** 2)0.00030.0001.0200.308-0.0000.001
C(FOS)[T.Math/CS/info]:I(exp_years ** 2)-9.931e-050.000-0.3610.718-0.0010.000
C(FOS)[T.No postsecondary degree]:I(exp_years ** 2)0.00040.0001.4650.143-0.0000.001
C(FOS)[T.Personal/protective/transport services]:I(exp_years ** 2)0.00040.0001.3300.183-0.0000.001
C(FOS)[T.Physical/life sciences & tech]:I(exp_years ** 2)-0.00080.000-2.7060.007-0.001-0.000
C(FOS)[T.Social sciences & law]:I(exp_years ** 2)0.00010.0000.4980.619-0.0000.001
C(FOS)[T.Visual/performing arts & comm]:I(exp_years ** 2)0.00020.0000.6250.532-0.0000.001
Omnibus:142563.407Durbin-Watson:1.997
Prob(Omnibus):0.000Jarque-Bera (JB):3512451.082
Skew:-3.625Prob(JB):0.00
Kurtosis:23.210Cond. No.9.34e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.34e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
hs_npvs = {}
for fos in ps_df['FOS'].dropna().unique():
    if fos == 'No postsecondary degree':
        continue
    npvs = osap_npv(ratio=0, main_model=results_fos, sec_model=results, 
                    bFos=True, fos=fos)
    hs_npvs[fos] = npvs

hs_npvs_df = pd.DataFrame.from_dict(hs_npvs, orient='index', columns=['Male', 'Female'])
hs_npvs_df

MaleFemale
Architecture/engineering/trades511710.016950357405.052259
Social sciences & law333757.556643233903.671114
Business/management/public admin427342.998907298853.216373
Math/CS/info458971.503149320803.820208
Health349597.080201244896.511745
Education378718.398483265107.095080
Humanities109472.61944678246.932983
Personal/protective/transport services203982.747130143838.222258
Physical/life sciences & tech278627.998659195643.022505
Visual/performing arts & comm35355.60303126808.734951
Agriculture/natural resources/conservation358928.782511251372.837624
fig, axes = plt.subplots(1, 2, figsize=(20, 8), sharey=False)

genders = ['Male', 'Female']

for ax, gender in zip(axes, genders):
    data = hs_npvs_df[gender].sort_values()
    sns.barplot(x=data.index, y=data.values, ax=ax, palette='Blues_d', hue=data.index)
    ax.set_title(gender)
    ax.set_xlabel('')
    ax.set_ylabel('Net Present Value (CAD$)' if gender == 'Male' else '')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

fig.suptitle('Net Present Value of Bachelors by Field of Study and Gender', y=1.02)
plt.tight_layout()
plt.show()
png

We see an extremely vide variation in NPV as well. The counterfactual here is a high school degree. Overall, all fields of study are still a net positive, but the arts are close to having a negative ROI.

trade_npvs = {}
for fos in ps_df['FOS'].dropna().unique():
    if fos == 'No postsecondary degree':
        continue
    npvs = osap_npv(ratio=0, main_model=results_fos, sec_model=results, bFos=True, fos=fos, sec_edu=13)
    trade_npvs[fos] = npvs

trade_npvs_df = pd.DataFrame.from_dict(trade_npvs, orient='index', columns=['Male', 'Female'])
trade_npvs_df

MaleFemale
Architecture/engineering/trades429552.143647301687.835761
Social sciences & law251599.683340178186.454616
Business/management/public admin345185.125603243135.999875
Math/CS/info376813.629846265086.603709
Health267439.206897189179.295247
Education296560.525180209389.878582
Humanities27314.74614322529.716485
Personal/protective/transport services121824.87382788121.005760
Physical/life sciences & tech196470.125356139925.806007
Visual/performing arts & comm-46802.270272-28908.481547
Agriculture/natural resources/conservation276770.909208195655.621126
fig, axes = plt.subplots(1, 2, figsize=(20, 8), sharey=False)

genders = ['Male', 'Female']

for ax, gender in zip(axes, genders):
    data = trade_npvs_df[gender].sort_values()
    sns.barplot(x=data.index, y=data.values, ax=ax, palette='Blues_d', hue=data.index)
    ax.set_title(gender)
    ax.set_xlabel('')
    ax.set_ylabel('Net Present Value (CAD$)' if gender == 'Male' else '')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

fig.suptitle('Net Present Value of Bachelors by Field of Study and Gender', y=1.02)
plt.tight_layout()
plt.show()
png

We finally get our first instance of education with a negative ROI - compared to becoming a skilled tradesman, getting an arts degree is a bad investment. Humanities really toes it close to zero, but is positive with a NPV of $~20k for both genders. Beyond that however, most fields still offer a substantial (>$100k) positive ROI for both genders.

Conclusion
#

In sum, education offers a positive ROI regardless of gender, grant/loan ratios, and field of study (with the sole exception of visual/performing arts vs trades) relative to a high school diploma and trades. There is wide variance between fields of study and degree type, and a substantial gender gap that also varies by FoS/Degree.

On average, men earn ~47% more than women controlling for education and experience. This is a ’naive’ gap - the entire difference is not made up solely of discrimination, since our model does not account for many variables. Generally, the gender gap decreases as women climb the education ladder and it is lowest in the Arts and STEM fields (this is just the gap, and does not account for absolute income). There is a substantial pay gap in the trades, as well. Put together, this tells us that education is likely a more important investment for women relative to men.

OSAP’s recent changes make no difference to the NPV of education. Through a sensitivity analysis comparing grant/loan ratios, I found that even in the worst case scenario (100% loans), the NPV of education (compared to a counterfactual trades degree) is $270k for men and $180k for women. The OSAP ratio change actually makes very little difference at all - this is largely because 70% of federal loans are interest-free. The number does vary by field of study - visual/performing arts degrees are negative (<$10k), and humanities degrees offer an insubstantial ROI (~$20k for both genders), but most other fields offer >$100k returns. This tells us that recent or anticipated OSAP changes should not affect your college plans, except where your liquidity is questionable (i.e. OSAP does not cover 100% of college costs, and you can not make up the difference).

Limitations
#

Many of these limitations are for simplicity’s sake - to prevent this notebook from becoming too large. It bears listing them, however. There are also specific limitations listed in their relevant sections.

  • My models factor unemployment in through its effect (drag) on income, not as a separate variable. A more complex model would handle this factor separately and properly. This could be done by modelling employment probability separately and combining that with income predictions.
  • Experience calculation is also affected similarly - few people would have 45 years of experience at the end of their working lifetime, because that assumes continuous employment (i.e. no unemployment).
  • My models do not control for that many variables - hence the low R^2 of ~0.1. These include specific factors relevant to income, like part-time/full-time work or hours worked per week, and general factors like race or immigration status.
  • My models do not account for the weights given by PUMF. That means my model is not ideal for population-level results - this is fine for measuring relationships, but knocks my precision.
  • Immigrants might deserve a separate notebook, esp. for those with foreign education.
  • I use an interest rate of 6% throughout my calculations. An extension of this notebook would be to perform a sensitivity analysis.
  • There are some simplifications when calculating OSAP loan annuities. There is a 6-month grace period, and the actual time horizon is 9.5 years (where I use 10). Since my cash flow models annual income, it is difficult to accomodate this - hence the simplification. There are also tax considerations vis-a-vis both income and student loan tax credits that I have ignored.
69 - This article is part of a series.