This post covers some ground already covered by others, but more briefly. Two studies of interest:
 Riemann, R., Angleitner, A., & Strelau, J. (1997). Genetic and Environmental Influences on Personality: A Study of Twins Reared Together Using the Self- and Peer Report NEO-FFI Scales. Journal of Personality, 65(3), 449–475. https://doi.org/10.1111/j.1467-6494.1997.tb00324.x
 Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136(6), 1092–1122. https://doi.org/10.1037/a0021212
Study 1 – the heritability of personality
Heritabilities for personality traits — usually OCEAN — are commonly given as around 50%. A typical citation for this is Bouchard’s 2004 review which produced this table:
The values range from 42% to 57%, and their mean is exactly 50%. (There’s a big meta-analysis from 2015 finding a mean of 40%, but with divergent results from adoption (22%) vs. MZT-DZT (47%) designs.) These results come from standard twin studies: MZT-DZT comparisons. This design underestimates heritability when there is measurement error in the variables. Despite this, researchers routinely ignore measurement error, and I have no idea why. As usual, Jensen got it right early on: in his 1969 review of the IQ findings he adjusted for measurement error, so why don’t they follow his example?
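To see why the standard design is biased downward, here is a minimal sketch using Falconer's formula. All numbers are illustrative, not taken from any of the cited studies:

```python
# Why measurement error biases twin-study heritabilities downward.
# Falconer's formula: h2 = 2 * (r_MZ - r_DZ). All values illustrative.

def falconer_h2(r_mz, r_dz):
    """Heritability estimate from MZ and DZ twin correlations."""
    return 2 * (r_mz - r_dz)

# Suppose the true twin correlations are 0.50 (MZ) and 0.25 (DZ),
# so the true h2 is 0.50.
r_mz_true, r_dz_true = 0.50, 0.25

# A measure with reliability 0.80 attenuates both observed correlations
# by that factor, so the estimated h2 shrinks by the same factor.
reliability = 0.80
h2_observed = falconer_h2(r_mz_true * reliability, r_dz_true * reliability)
h2_corrected = h2_observed / reliability

print(h2_observed)   # 0.4 -- biased downward
print(h2_corrected)  # 0.5 -- disattenuated back to the true value
```

The correction is just division by the reliability, which is why ignoring it systematically understates heritability.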
Self-rating measures of personality suffer not just from regular, random measurement error, but also from systematic measurement error (bias): people are not able to rate their own personality as well as other people who know them can. Self-ratings introduce method variance into the data, and this variance is not very heritable. There is a twin study that used other-ratings of personality, and when these were used alone or combined with self-ratings, the heritabilities went up:
So with self-report they found heritabilities of 42-56%, mean = 51%. Other-report: 57-81%, mean = 66%; combined: 66-79%, mean = 71%. (I used the AE models’ results when possible.) In fact, these analyses did not correct for regular measurement error either, so the heritabilities implied by these data are higher still, likely into the 80%s. This is the same territory as cognitive ability.
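Applying the same disattenuation logic to the combined-report mean gives a rough sense of where the corrected figure could land. The reliability value here is a hypothetical assumption, not a number from the study:

```python
# Sketch: disattenuating the combined-report mean heritability for
# residual (random) measurement error. The reliability figure is a
# hypothetical assumption, not taken from the study.
h2_combined = 0.71   # mean heritability from combined self + other report
reliability = 0.85   # assumed reliability of the combined ratings

h2_disattenuated = h2_combined / reliability
print(round(h2_disattenuated, 2))  # 0.84 -- into the 80%s
```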
Main caveat: this is an unreplicated study based on n = 964 cases. That sounds like a lot, but it is not for twin studies. Estimates of heritability rely on four measurements, so sampling error adds up quickly. (One has to estimate the intraclass correlations for MZs and DZs, each based on case pairs, and then estimate the difference between these correlations.)
Jayman pointed me towards a replication of this finding in a second, larger sample.

Riemann, R., & Kandler, C. (2010). Construct validation using multitrait-multimethod-twin data: The case of a general factor of personality. European Journal of Personality, 24(3), 258–277. https://doi.org/10.1002/per.760
They fit a number of models to their data with higher-order factors: the big two and/or the GFP. Unfortunately, they only report the behavioral genetic model parameters from the best-fitting model, which turned out to be a 5 + 2 model with cross-loadings. The heritabilities from this were: E 86%, O 92%, ES (N) 59%, A 85%, C 81%, Plasticity 50% and Stability 40%. If we use just the OCEAN traits, as before, the mean heritability is 81%, with ES the obvious outlier, some 20 percentage points below the others. The heritabilities of the big two were similar to the normal estimates for OCEAN, for whatever reason. It’s not clear what the heritabilities of the OCEAN traits would be if one used just the 5-factor model.
Study 2 – validity of self- vs. other-reported personality
If we accept the higher heritability of other-rated personality, and that the cause is measurement error and bias, then we would also expect the (predictive) validity of other-rated personality to be stronger; at least, unless we think self-rating bias has as strong validity as the personality traits themselves. As it happens, there is a large meta-analysis on this topic concluding exactly that. They present their results in 3 large tables, but I’ve rearranged them into one smaller table for convenience:
| Trait | Rater | Outcome: stranger impressions | Outcome: academic achievement | Outcome: job performance |
|---|---|---|---|---|
| Emotional stability | Other | 0.41 | 0.46 | 0.37 |
| Emotional stability | Self | 0.20 | 0.25 | 0.12 |
| Extraversion | Other | 0.46 | 0.52 | 0.18 |
| Extraversion | Self | 0.37 | 0.09 | 0.12 |
| Openness | Other | 0.58 | 0.29 | 0.45 |
| Openness | Self | 0.42 | 0.09 | 0.05 |
| Agreeableness | Other | 0.34 | 0.02 | 0.31 |
| Agreeableness | Self | 0.26 | 0.06 | 0.13 |
| Conscientiousness | Other | 0.42 | 0.69 | 0.55 |
| Conscientiousness | Self | 0.27 | 0.22 | 0.23 |
[For academic achievement, I used the self-report value with the largest n. This had the effect of maximizing the correlations for self-report.]
These correlations are corrected for measurement error in both variables, so they should be quite comparable with regard to true correlations. The other-report correlations are systematically larger, which is easy to see if one plots them.
If we average validities within outcomes and calculate the other/self ratios, these are 1.5 (stranger impressions), 2.8 (academic achievement) and 2.9 (job performance), mean = 2.4. Other-report is much more valid.
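These within-outcome averages and ratios can be reproduced directly from the values in the table above:

```python
# Within-outcome mean validities and other/self ratios, using the
# correlations from the table (rows in OCEAN-table order: ES, E, O, A, C).
other = {
    "stranger": [0.41, 0.46, 0.58, 0.34, 0.42],
    "academic": [0.46, 0.52, 0.29, 0.02, 0.69],
    "job":      [0.37, 0.18, 0.45, 0.31, 0.55],
}
self_ = {
    "stranger": [0.20, 0.37, 0.42, 0.26, 0.27],
    "academic": [0.25, 0.09, 0.09, 0.06, 0.22],
    "job":      [0.12, 0.12, 0.05, 0.13, 0.23],
}

ratios = {}
for outcome in other:
    mean_other = sum(other[outcome]) / len(other[outcome])
    mean_self = sum(self_[outcome]) / len(self_[outcome])
    ratios[outcome] = mean_other / mean_self

for outcome, ratio in ratios.items():
    print(outcome, round(ratio, 1))
# stranger 1.5, academic 2.8, job 2.9; mean ratio rounds to 2.4
```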