9th October 2023

As a data scientist, I am keenly interested in using statistical analysis to understand complex social issues. For our Project 2, I explored the Washington Post’s database on fatal police shootings, aiming to unravel the patterns and trends in this contentious area.


The database, meticulously compiled by the Post from public records, news sources, and its own reporting, includes over 6,000 fatal police shooting incidents since 2015, each detailed with attributes such as victim demographics, whether the victim was armed, and other contextual factors.

My analysis uncovered stark racial disparities. Black Americans, who make up less than 13% of the U.S. population, constitute over 25% of the fatalities in this dataset, whereas the number of White American fatalities aligns more closely with their demographic proportion. The disparity becomes even more evident when focusing on unarmed victims: Black Americans, just 6% of the population, represented about 35% of unarmed individuals fatally shot, indicating a disproportionately high risk for unarmed Black civilians in police encounters.

My time series analysis also indicated that the annual rate of fatal shootings has remained fairly consistent nationwide, at around 1,000 cases each year. A racial breakdown reveals a slight increase in fatalities among White Americans over this period, while the number of Black fatalities, though decreasing, remains disproportionately high.
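As a rough sketch of how these proportions could be computed in Python (the file name and column values below are placeholders, not necessarily the exact fields of the Post's published CSV):

```python
import pandas as pd

# Placeholder file and column names; adjust to the actual schema of the dataset.
df = pd.read_csv("fatal-police-shootings-data.csv")

# Share of all fatalities by recorded race.
print(df["race"].value_counts(normalize=True))

# Share among victims recorded as unarmed.
unarmed = df[df["armed"] == "unarmed"]
print(unarmed["race"].value_counts(normalize=True))

# Fatal shootings per year, to check the roughly 1,000-per-year pattern.
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dt.year.value_counts().sort_index())
```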


In conclusion, this analysis offers quantifiable evidence of racial disparities in fatal police shootings, highlighting the need for comprehensive reforms to address these issues. It emphasizes that recognizing and understanding the data is a crucial step towards making progress in this area.

6th October 2023, Friday

Today’s session was dedicated to an extensive Exploratory Data Analysis (EDA) to detect patterns, relationships, and possible multicollinearity among variables. I utilized visualizations to delve into the connections between obesity, inactivity, and diabetes prevalence. Subsequently, I began constructing a linear regression model, with obesity and inactivity as the predictors and diabetes prevalence as the response variable. This involved verifying linear regression assumptions such as linearity, independence, equal variance (homoscedasticity), and the normal distribution of residuals. I then evaluated our initial model’s performance, focusing on metrics like R-squared and adjusted R-squared values, and inspecting residual plots for potential improvements and enhanced predictive power.
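Below is a minimal sketch of how such a model might be fit and its assumptions checked in Python with statsmodels; the file name and the obesity, inactivity, and diabetes column names are placeholders for the merged CDC county data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

# Placeholder file and column names for the merged county-level CDC data.
df = pd.read_csv("cdc_counties.csv")

X = sm.add_constant(df[["obesity", "inactivity"]])  # predictors
y = df["diabetes"]                                   # response

model = sm.OLS(y, X).fit()
print(model.rsquared, model.rsquared_adj)

# Residuals vs. fitted values: a rough check of linearity and equal variance.
plt.scatter(model.fittedvalues, model.resid, s=8)
plt.axhline(0, color="red")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

# Q-Q plot: a rough check that the residuals are approximately normal.
sm.qqplot(model.resid, line="45")
plt.show()
```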


Further, I investigated the interactive effects of obesity and inactivity on diabetes by incorporating interaction terms into our model. This approach aimed to capture their combined influence on diabetes rates, adding depth to our understanding. To ensure the model’s reliability across different datasets, I applied cross-validation methods. Moreover, I utilized various validation metrics to thoroughly assess the model’s effectiveness.
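A sketch of how the interaction term and cross-validation could look with scikit-learn, again using placeholder file and column names:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cdc_counties.csv")  # placeholder file and column names
df["obesity_x_inactivity"] = df["obesity"] * df["inactivity"]

X = df[["obesity", "inactivity", "obesity_x_inactivity"]]
y = df["diabetes"]

# 5-fold cross-validated R² for the model with the interaction term.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```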

4th October 2023, Wednesday

Today, I learned about the statistical technique known as Bootstrapping. This method is utilized for estimating the sampling distribution of a statistic, notably without relying on a predetermined underlying distribution. Bootstrapping involves repeatedly resampling the data with replacement and computing the statistic for each new sample. This technique is applicable to various statistics, including sample medians, variances, and correlation coefficients, making it a valuable tool for statistical analysis across diverse scenarios.


For instance, consider a scenario where we aim to determine whether there’s a significant difference in average height between men and women. We would start by collecting height samples from both groups. Using bootstrapping, we estimate the sampling distribution of the mean difference by drawing repeated samples from both the men’s and women’s height data, recalculating the mean difference each time. The resulting distribution of these mean differences approximates the sampling distribution for the difference in means.


From this distribution, we can compute a p-value for our hypothesis test. This p-value represents the fraction of bootstrap samples where the mean difference is as extreme as or more extreme than what was observed in our original sample. If this p-value is less than 0.05, we reject the null hypothesis, indicating a significant difference in average heights between men and women.
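To make the procedure concrete, here is a small sketch with made-up height samples. Note that it resamples from the pooled heights so that the bootstrap distribution reflects the null hypothesis of no difference, which is one common way to obtain the p-value described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up height samples (cm), purely for illustration.
men = np.array([178, 182, 171, 169, 185, 176, 180, 174, 177, 183])
women = np.array([165, 159, 170, 162, 168, 158, 166, 171, 160, 163])

observed = men.mean() - women.mean()

# Bootstrap the mean difference under the null hypothesis by resampling
# from the pooled data with replacement.
pooled = np.concatenate([men, women])
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    m = rng.choice(pooled, size=men.size, replace=True)
    w = rng.choice(pooled, size=women.size, replace=True)
    diffs[i] = m.mean() - w.mean()

# Two-sided p-value: fraction of bootstrap differences at least as extreme
# as the observed difference.
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(observed, p_value)
```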


Bootstrapping stands out as a flexible and robust statistical tool, especially useful for researchers who prefer not to assume a specific underlying distribution for their data analysis.

2-10-2023

The report was compiled, incorporating the various elements and findings, and was carefully reviewed. In addition, multiple regression was employed to gauge how a combination of several independent variables, such as Year and Overall SVI, can predict the value of a dependent variable, in this case the percentage of diagnosed diabetes.

The professor explained the punchline report and the thesis.

29-09-2023

Mean Squared Error Overview

Mean Squared Error (MSE) is a widely used metric in statistics and machine learning for gauging the accuracy of a predictive model. It measures the average squared difference between predicted and actual values. A model with a lower MSE fits the data more closely, which is why MSE is so important for assessing regression model quality.

Illustration

To demonstrate, let’s consider using MSE to assess the precision of a regression model predicting house prices based on their size. With a dataset of five entries, each consisting of a real value and its predicted counterpart, we determine the squared error for each by squaring the difference between the estimated and actual figures. The final MSE is derived by averaging these squared differences. An MSE of 440,000 in this context quantifies the model’s alignment with real house prices; a smaller value suggests a more accurate model. This underscores the significance of MSE in honing regression models for precise forecasts.
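As a tiny worked sketch, the five house prices below are invented, but the prediction errors (200, 400, 600, 800, and 1,000 dollars) are chosen so that the result matches the MSE of 440,000 quoted above:

```python
# Invented actual and predicted house prices (in dollars).
actual    = [250_000, 310_000, 405_000, 520_000, 615_000]
predicted = [250_200, 309_600, 405_600, 519_200, 616_000]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 440000.0
```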

Wrap-up

Referring to the 2018 CDC diabetes dataset, MSE emerges as a key metric for measuring regression model performance. After data cleaning and model selection, the model is trained on one portion of the dataset and then makes predictions on a separate testing set. Averaging the squared gap between the predicted and actual values over the test set gives the MSE:

MSE = (1/n) Σ (yᵢ − ŷᵢ)², where yᵢ is an actual value, ŷᵢ the corresponding prediction, and n the number of test observations.

A lower MSE means the model's predictions are closer to the actual values, indicating better predictive accuracy for diabetes outcomes in the CDC dataset.
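A minimal sketch of that train/test workflow with scikit-learn, using placeholder file and column names for the CDC data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("cdc_diabetes_2018.csv")  # placeholder file and column names

X = df[["obesity", "inactivity"]]
y = df["diabetes"]

# Hold out 20% of the counties as a testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))
```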


27-09-2023

K-fold Cross-Validation Method

K-fold cross-validation is a popular approach in machine learning and statistics for evaluating how well a predictive model generalizes.
It is especially useful when data are limited, because it makes full use of the available observations and helps guard against overfitting. The K-fold cross-validation process involves the following steps (a short sketch of the workflow follows the list):


1. Segmenting the data into K equally sized folds
2. Training the model on K−1 folds and assessing it on the held-out fold, repeating so each fold serves once as the test set
3. Computing the evaluation metric for each fold
4. Summarizing the cross-validation outcomes across all K folds
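Here is a rough sketch of those four steps with scikit-learn; synthetic data stands in for the real dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data purely for illustration.
X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)            # 1. segment the data
fold_mse = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # 2. train on K-1 folds
    preds = model.predict(X[test_idx])                          #    and assess on the rest
    fold_mse.append(mean_squared_error(y[test_idx], preds))     # 3. compute the metric

print(np.mean(fold_mse), np.std(fold_mse))                      # 4. summarize across folds
```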

25-09-2023

After watching the video on resampling, I have learned about Cross-validation, Validation Set and Bootstrapping.
Resampling involves taking samples from an existing set of observations and creating a new data set.
Bootstrapping, on the other hand, is a statistical process that involves resampling a single set of observations to create multiple simulated samples.
In the validation set approach, the data is divided into a training set and a testing set:
the model is fit on the training set, and the testing (validation) set is then used to estimate its error rate.

The principal component analysis (PCA) was employed to narrow down the features of %Obese and %Inactive to two primary components,
which account for the greatest variability of the data. A scatter plot illustrates the distribution of the data points in this new 2-dimensional space,
allowing for the detection of any patterns or clusters between the two components.
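A small sketch of that PCA step and scatter plot, with placeholder file and column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("cdc_counties.csv")  # placeholder file and column names

# Standardize the two features before PCA so neither dominates.
features = StandardScaler().fit_transform(df[["obesity", "inactivity"]])

pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)  # variance captured by each component

plt.scatter(components[:, 0], components[:, 1], s=8)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```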

Polynomial regression is an extension of linear regression where higher-degree terms (squared, cubed, etc.) of the predictor variables are included in the model to fit non-linear trends in the data. Let’s delve into how to plot a polynomial regression model:
1. Understanding Polynomial Regression:

In simple linear regression, we fit a straight line to the data, for example:

Y = β0 + β1X + ϵ

In polynomial regression, however, the equation looks something like:

Y = β0 + β1X + β2X² + … + βnXⁿ + ϵ

where n is the degree of the polynomial.
2. Fitting the Model:

To plot a polynomial regression model, you first need to:

Choose the degree of the polynomial based on your data. This involves a balance: a higher degree might fit the training data better but can lead to overfitting.
Use statistical software or libraries (e.g., scikit-learn in Python) to fit the polynomial regression model to your data.

3. Plotting:

Once the model is fitted, you can plot it. The procedure generally involves the following (a minimal sketch in Python appears after the list):

Plotting the actual data points, usually as scatter points.
Generating a range of predictor values (often a fine grid across the range of your data).
Using the polynomial regression model to predict the response for each of these predictor values.
Plotting the predicted values, usually as a smooth curve.
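Here is a rough sketch of the fit-and-plot procedure with scikit-learn and matplotlib, using synthetic data purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 60))
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.size)

# Fit a degree-3 polynomial regression.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X_poly, y)

# Predict on a fine grid and draw the smooth curve over the scatter of points.
grid = np.linspace(x.min(), x.max(), 200).reshape(-1, 1)
y_curve = model.predict(poly.transform(grid))

plt.scatter(x, y, s=10, label="data")
plt.plot(grid, y_curve, color="red", label="degree-3 fit")
plt.legend()
plt.show()
```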

4. Visual Interpretation:

When viewing the plot, you’ll see the data points and the curve representing the polynomial regression. The curve should capture the underlying trend of the data points. Depending on the degree of the polynomial, this curve can be a simple curve (e.g., quadratic) or more complex, wavy shapes.
5. Potential Pitfalls:

While polynomial regression can capture complex non-linear trends, it also has potential pitfalls:

Overfitting: Higher-degree polynomials can fit the training data very closely, capturing noise and making poor predictions on new, unseen data.
Interpretability: As the degree of the polynomial increases, the model can become harder to interpret.

6. Visual Enhancements:

For a clearer visual representation:

Ensure the polynomial curve is smooth.
Use color or different markers to distinguish between actual data points and the polynomial curve.
If plotting multiple polynomial models (e.g., of different degrees), use different colors or line styles for each.

In summary, plotting a polynomial regression model involves fitting a curve to data points, allowing for the visualization of non-linear relationships. Proper care should be taken to choose an appropriate polynomial degree and to avoid overfitting.

22-09-2023

My efforts were centered on discerning the elements that influence obesity and inactivity rates, and I undertook several pivotal actions in this journey.

I began by gathering extensive information on obesity and related factors such as diet, economic conditions, environment, and physical activity. Through coding, I generated revealing histograms, computed fundamental statistics such as the mean, median, mode, and spread, and crafted bar charts that associate counties or states with obesity percentages and their respective determinants. These visualizations enhance our comprehension of the factors affecting obesity in various areas.

Furthermore, I broadened the scope of my analysis to encompass historical data from the years 2006, 2010, 2014, and 2018. Using this information, I designed time series charts that illustrate the trends in obesity and inactivity levels over the years. This temporal view is crucial for pinpointing long-term shifts and helps us draw informed conclusions about these health concerns.
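A rough sketch of these summaries and time series plots in Python; the file layout and column names below are assumptions, not the actual dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed long-format file with columns: year, state, obesity_pct, inactivity_pct.
df = pd.read_csv("obesity_inactivity_by_year.csv")

# Descriptive statistics and a histogram for a single year.
latest = df[df["year"] == 2018]
print(latest["obesity_pct"].agg(["mean", "median", "std"]))
print(latest["obesity_pct"].mode())
latest["obesity_pct"].hist(bins=20)
plt.xlabel("% obese (2018)")
plt.show()

# National averages for 2006, 2010, 2014, and 2018 as a simple time series.
trend = df.groupby("year")[["obesity_pct", "inactivity_pct"]].mean()
trend.plot(marker="o")
plt.ylabel("average %")
plt.show()
```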


20-09-2023

After learning about the t-test in class, I realized that I could apply it in my analysis to compare mean diabetes rates across groups of counties, for example counties grouped by their inactivity levels or by their obesity levels.
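As a minimal sketch of what that comparison might look like with SciPy (placeholder file and column names, and the median split is just an illustrative grouping):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("cdc_counties.csv")  # placeholder file and column names

# Split counties at the median inactivity rate (illustrative grouping only).
median_inactivity = df["inactivity"].median()
high = df.loc[df["inactivity"] >= median_inactivity, "diabetes"]
low = df.loc[df["inactivity"] < median_inactivity, "diabetes"]

# Welch's two-sample t-test on mean diabetes rates between the two groups.
t_stat, p_value = stats.ttest_ind(high, low, equal_var=False)
print(t_stat, p_value)
```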