Wednesday, September 19, 2012

An OLS experience. Analysis takes time to learn.

Ahhhhhh. At least one can learn. Learning is the salt of the earth; failing to make an effort to learn leaves one stuck in the mud. I may not know all there is to know about OLS regression analysis, but I've tried my heart out. The last post gave you a hint of the process; here is the final result, with my report and data for viewing. Our data had 29 variables, and I ended up with 11. A full run of all variables gave me a multiple R-squared of -53.61 and an adjusted R-squared of -111.60, with an AIC of 629. Those are horrible numbers to start with. However, with a little finesse, some gnashing of teeth, grunting, and starting over from mid-scratch, the final run with 11 variables ended up with a multiple R-squared of 0.53 and an adjusted R-squared of 0.39, with the AIC at 225. These numbers may sound like a jumble to someone else, but to a newbie at this process they are sweet music.

Our goal was to come up with viable census population data that could suggest probable locations (census tracts, not actual addresses) where meth labs could turn up. My final analysis could have gone a little further, but time ran out (it usually does for these labs). The analyzing lawyer part of me wanted to explain everything, but remembering to "stick to the facts", reporting only what we had without supposing, was a task in itself.


I've only included my results, analysis chart and final map for review here.  












RESULTS:
The overall objective was to use 29 census population variables to predict probable locations for meth labs in Kanawha and Putnam Counties, West Virginia. After more than 40 OLS runs, starting from 29 variables covering 54 observations, eleven variables and 49 observations survived the elimination process. OLS run b21 was selected as the most representative for the results analysis; more runs could have been conducted, but time limitations overruled.
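For anyone wondering how the multiple R-squared and the adjusted R-squared relate, here is a minimal sketch. It assumes the standard textbook adjustment for a model with an intercept; the function name is my own, and the software that produced run b21 may round or compute things slightly differently. The inputs are the final-run figures from this post (R-squared 0.53, 49 observations, 11 variables).

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjusted R-squared for an OLS model that includes an intercept."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors plus the intercept")
    # Penalize R-squared for the number of explanatory variables used.
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Final run from the post: R-squared 0.53, 49 observations, 11 variables.
final_adjusted = adjusted_r_squared(0.53, n_obs=49, n_predictors=11)
print(round(final_adjusted, 2))  # 0.39
```

The adjustment is why trimming variables matters: every extra variable costs a degree of freedom, so a variable has to earn its keep to raise the adjusted figure.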

Seven of the eleven variables had negative coefficients. The largest in magnitude was average family size at -8.029133, and the smallest was the percentage of females with children (pcnt_fchld) at -0.007652. The remaining negative coefficients ranged from -0.074732 to -0.597828.

When reviewing the probabilities, several groupings emerged. The 50-64 age group stood out as statistically significant at 0.02351. Three categories made up the mid-low range, between 0.071 and 0.082. The mid-range fell between 0.142 and 0.413, while the two categories in the high group were at 0.964 and 0.792. Two categories were significant in the robust probabilities: pcnt50_64 at 0.0091 and pcnt18_21 at 0.0016. Only these two fields, pcnt18_21 and pcnt50_64, were flagged with "*" as statistically significant.

Although there is no chart available to show the variance inflation factor (VIF) for the eleven categories, it ranged from 1.35 to 3.45 with a mean of 2.35, well below the 7.5 threshold above which a variable is considered redundant. However, the overall diagnostics for the OLS showed a Jarque-Bera statistic reflecting significant bias, even though the AIC (Akaike Information Criterion), a measure of overall model performance, dropped from the initial 629 to 225.
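To put the VIF numbers in perspective, here is a small sketch using the usual definition, VIF = 1 / (1 - R²), where R² comes from regressing one explanatory variable on all the others. The function names and the dictionary labels are my own illustration; only the three VIF values and the 7.5 threshold come from the report.

```python
REDUNDANCY_THRESHOLD = 7.5  # threshold quoted in the report


def vif_from_r_squared(aux_r_squared: float) -> float:
    """VIF for one variable, given the R-squared of regressing it on the others."""
    if not 0.0 <= aux_r_squared < 1.0:
        raise ValueError("auxiliary R-squared must be in [0, 1)")
    return 1.0 / (1.0 - aux_r_squared)


def implied_r_squared(vif: float) -> float:
    """Invert the VIF formula: share of a variable explained by the other variables."""
    if vif < 1.0:
        raise ValueError("VIF cannot be below 1")
    return 1.0 - 1.0 / vif


# VIF values reported for run b21 (labels are illustrative only).
reported_vifs = {"lowest": 1.35, "mean": 2.35, "highest": 3.45}

for label, vif in reported_vifs.items():
    shared = implied_r_squared(vif)
    status = "redundant" if vif > REDUNDANCY_THRESHOLD else "ok"
    print(f"{label}: VIF={vif:.2f}, explained by other variables={shared:.0%}, {status}")
```

Read this way, even the highest VIF of 3.45 means roughly 71 percent of that variable is explained by the others, which is still under the redundancy cut-off.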

DISCUSSION:

Although this small evaluation of population characteristics for Kanawha and Putnam found two statistically significant categories, it is neither conclusive nor complete. It is loosely based upon a comparative study, "Methamphetamine Laboratories: The Geography of Drug Production" (Weisheit, R.A., Wells, L.E., 2010), which used the CDC Social Ecological Model as its basis for analysis. That study had 63 base variables, casting a broader lifestyle and social profile than our analysis. Our small analysis relied primarily on population data and omitted many influential variables such as poverty levels, unemployment, civic/religious membership and employment categories. Our study had significant bias, indicating that additional variables may need to be added and that others with high probabilities, such as pcnt_mchld and pcnt30_39, may need to be removed.
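The bias mentioned above was flagged by the Jarque-Bera statistic, which tests whether the regression residuals look normally distributed. The sketch below uses the textbook formula, JB = n/6 * (S² + (K - 3)²/4), with S the skewness and K the kurtosis of the residuals, plus the asymptotic chi-square p-value with 2 degrees of freedom. The residual values are made up purely for illustration; they are not from run b21, and the software used for the lab may apply small-sample corrections that this version does not.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    # Central moments (population form, dividing by n).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    if m2 == 0:
        raise ValueError("residuals have zero variance")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical residuals with one large outlier (illustration only).
example_residuals = [-1.2, -0.8, -0.5, -0.3, 0.0, 0.1, 0.4, 0.6, 0.9, 6.5]
stat, p = jarque_bera(example_residuals)
print(f"JB={stat:.2f}, p={p:.4f}")
```

A small p-value says the residuals are not normally distributed, which is usually a hint that the model is missing a key variable or is being pulled around by outliers.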

Both studies revealed that some areas have no meth labs at all. Our small study contained 4 census zones without labs, about 7 percent of the total. These zones were eventually removed from our study's calculations, which affected bias. Future investigation would be helpful to reveal the population and social makeup of these "zero meth lab" zones as well.
   
In summary, 93 percent of our study area had meth lab influences. Although the population percentages for the age groups 50-64 and 18-21 were indicated as significantly influential, there remains additional information to be collected. What can be said is that the presence of meth labs, whether through environmental "trash" or social influence, affects a greater share of the population than one may realize. Additional data and studies will be needed to truly grasp its reach upon our overall culture.
