29th November 2023

The data was first standardized using z-score scaling, which rescales each feature to zero mean and unit variance so that every feature contributes equally to the analysis. Following this, Principal Component Analysis (PCA) was conducted on the standardized data. This step transforms the data into a set of linearly uncorrelated variables, known as principal components.

Each of these principal components was then evaluated for its explained variance ratio, which indicates the proportion of the dataset’s total variance that is captured by each component. This information is essential in understanding the significance of each principal component in representing the dataset.

Furthermore, a visualization was created to display the cumulative explained variance as a function of the number of principal components used. This graphical representation is invaluable for determining the optimal number of principal components required for dimensionality reduction. It helps in deciding how many principal components should be retained to capture the majority of the variance in the data while reducing the dimensionality, thus striking a balance between data simplification and information retention.
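As a minimal sketch of this pipeline, scikit-learn’s StandardScaler and PCA cover the scaling, the explained variance ratios, and the cumulative-variance plot; the feature matrix below is a random placeholder rather than the actual dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder feature matrix; the real dataset would be loaded here instead
X = np.random.rand(200, 10)

# z-score scaling: each feature rescaled to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# extract all principal components from the standardized data
pca = PCA().fit(X_scaled)

# proportion of the total variance captured by each component
print(pca.explained_variance_ratio_)

# cumulative explained variance as a function of the number of components
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.show()
```

A common rule of thumb is to retain enough components to explain roughly 90–95% of the cumulative variance, though the right cut-off depends on the application.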

27th November 2023

A binary variable was developed to denote instances where the service time exceeds 30 minutes. This variable serves as the target for the predictive model, which aims to ascertain if the service time will surpass this 30-minute threshold in the test dataset. The model’s predictive capability is then quantitatively assessed by measuring its accuracy, which reflects the proportion of total predictions that were correct.

In addition to accuracy, a confusion matrix was generated. This matrix is a critical tool in evaluating the performance of the model in binary classification tasks. It presents a detailed breakdown of the model’s predictions, showcasing not only the correct predictions (true positives and true negatives) but also the errors it made (false positives and false negatives). This comprehensive analysis allows for a deeper understanding of the model’s strengths and weaknesses, particularly in differentiating between instances with service times above and below the 30-minute mark. By combining the accuracy metric with the insights from the confusion matrix, a more nuanced evaluation of the model’s effectiveness in binary classification is achieved.
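The evaluation step might look roughly like the sketch below. The data is synthetic, the column names are placeholders, and logistic regression stands in for whichever classifier was actually used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; 'study_type_code' and 'service_time'
# are illustrative column names, not the actual schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "study_type_code": rng.integers(0, 5, 500),
    "service_time": rng.normal(28, 10, 500),
})

# binary target: 1 if the service time exceeds 30 minutes, else 0
df["over_30"] = (df["service_time"] > 30).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["study_type_code"]], df["over_30"], test_size=0.2, random_state=0
)

# any classifier could be used; logistic regression is only a placeholder
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: actual, columns: predicted
```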

24th November 2023

In my project, I focused on analyzing a Research dataset. This dataset was initially split into two distinct subsets: a training set and a testing set. This division is a standard practice in machine learning, allowing for the development of models on one subset (training) and evaluating their performance on another (testing).

The next step involved calculating the average service time for different categories of studies within the dataset. This calculation is crucial as it provides insights into the typical duration associated with each study type, forming a basis for further analysis.
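A small illustration of these two steps, with placeholder column names (study_type, service_time) and made-up values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data; 'study_type' and 'service_time' are assumed column names
df = pd.DataFrame({
    "study_type": ["A", "B", "C", "D"] * 25,
    "service_time": [20, 35, 28, 45] * 25,
})

# split into training and testing subsets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# average service time per study category, computed on the training data
avg_service_time = train_df.groupby("study_type")["service_time"].mean()
print(avg_service_time)
```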

Subsequently, I prepared the features (independent variables) and the target variable (dependent variable) for developing a linear regression model. Linear regression is a statistical method used for predicting a continuous target variable based on one or more features.

The model was then applied to the test set to predict service times. Predictions are essential for assessing the model’s ability to generalize to new, unseen data, which is a critical aspect of machine learning models.

For visualization, I used matplotlib, a popular Python library, to plot the regression line. This line represents the model’s predictions across the range of study types, illustrating the relationship between the type of study and the service time as interpreted by the model.

To evaluate the model’s accuracy, I employed the Root Mean Squared Error (RMSE) metric. RMSE is a standard measure in regression analysis that quantifies the difference between the observed actual outcomes and the outcomes predicted by the model. A lower RMSE value indicates better model performance.
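Putting the modelling steps together, a sketch along these lines would cover the fit, the test-set predictions, the regression-line plot, and the RMSE; the data, column names, and integer encoding of study types are illustrative assumptions rather than the actual project code.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with assumed column names; study types are encoded as
# integer codes so they can be used as a single numeric regressor.
df = pd.DataFrame({
    "study_type": ["A", "B", "C", "D"] * 50,
    "service_time": np.tile([20.0, 35.0, 28.0, 45.0], 50)
    + np.random.normal(0, 5, 200),
})
df["study_code"] = df["study_type"].astype("category").cat.codes

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train, y_train = train_df[["study_code"]], train_df["service_time"]
X_test, y_test = test_df[["study_code"]], test_df["service_time"]

# fit the linear regression model and predict service times on the test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# plot the actual service times and the fitted regression line
order = np.argsort(X_test["study_code"].values)
plt.scatter(X_test["study_code"], y_test, alpha=0.5, label="actual")
plt.plot(X_test["study_code"].values[order], y_pred[order], "r-",
         label="regression line")
plt.xlabel("Study type (encoded)")
plt.ylabel("Service time (minutes)")
plt.legend()
plt.show()

# Root Mean Squared Error: lower values indicate a closer fit
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
```

Encoding the study types as integers imposes an ordering on the categories; one-hot encoding would be the more careful choice, but the single numeric code keeps the one regression line described above.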

The culmination of this process is a comprehensive figure. This visual representation not only depicts the predicted average service time for each study type as determined by the linear regression model but also provides an intuitive understanding of the model’s predictive accuracy and its fit to the actual data.

22nd November 2023

I will employ line graphs and other graphical tools to contrast growth trajectories, using time series analysis to observe the progression of total earnings across departments over time. This involves assessing fluctuations in earnings and spotting any departments with exceptionally high or low growth compared to their counterparts. To quantify these variations, I’ll use statistical measures such as the coefficient of variation (the ratio of the standard deviation to the mean).
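A sketch of the line graph and the coefficient-of-variation calculation, with placeholder department names and earnings figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder column names and values; the real payroll data would differ
df = pd.DataFrame({
    "department": ["Dept A", "Dept B", "Dept C"] * 3,
    "year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "total_earnings": [1.0e6, 2.1e6, 0.4e6, 1.2e6, 2.3e6, 0.5e6,
                       1.5e6, 2.4e6, 0.7e6],
})

# line graph of total earnings over time, one line per department
df.pivot(index="year", columns="department", values="total_earnings").plot(marker="o")
plt.ylabel("Total earnings")
plt.show()

# coefficient of variation = standard deviation / mean, per department;
# higher values indicate more volatile earnings over the period
cv = df.groupby("department")["total_earnings"].agg(lambda s: s.std() / s.mean())
print(cv.sort_values(ascending=False))
```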

In the statistical modeling, regression analysis will be a key tool for gaining insights into the main factors influencing overtime pay. This technique will allow me to explore how variables such as length of service, departmental affiliation, and job classification influence overtime pay. Using multiple linear regression, I’ll estimate the relationship between various independent variables (like job type and experience) and the dependent variable (overtime pay).
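One way this could be set up is with statsmodels’ formula interface; the column names (overtime_pay, years_of_service, department, job_class) and the synthetic payroll records below are assumptions about the real schema.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic payroll records; the column names and categories are assumptions
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "years_of_service": rng.integers(1, 30, n),
    "department": rng.choice(["Dept A", "Dept B", "Dept C"], n),
    "job_class": rng.choice(["Class 1", "Class 2", "Class 3"], n),
})
df["overtime_pay"] = 500 + 80 * df["years_of_service"] + rng.normal(0, 300, n)

# multiple linear regression; C(...) expands categorical variables into dummies
model = smf.ols(
    "overtime_pay ~ years_of_service + C(department) + C(job_class)", data=df
).fit()
print(model.summary())  # coefficients estimate each factor's effect on overtime pay
```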

Additionally, clustering methods, especially the k-means algorithm, will be instrumental in examining potential connections between variables like job category, years of experience, and overtime pay. By analyzing factors such as the average base salary, the ratio of overtime to base pay, and their temporal changes, these techniques will help identify departments with similar compensation trends.
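A minimal k-means sketch over department-level features; the feature names, the values, and the choice of k = 2 are purely illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per department; the feature names and values are illustrative
dept_features = pd.DataFrame({
    "avg_base_salary": [52000, 61000, 48000, 75000, 69000, 50000],
    "overtime_ratio": [0.12, 0.30, 0.08, 0.25, 0.28, 0.10],
    "earnings_growth": [0.03, 0.07, 0.02, 0.05, 0.06, 0.02],
}, index=["Dept A", "Dept B", "Dept C", "Dept D", "Dept E", "Dept F"])

# scale the features so no single one dominates the distance calculation
X = StandardScaler().fit_transform(dept_features)

# k-means with an assumed k = 2; in practice k would be chosen with the
# elbow method or silhouette scores
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dept_features["cluster"] = kmeans.labels_
print(dept_features)
```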

This approach enables policymakers to discern prevalent compensation patterns by categorizing departments together. Such insights are valuable for informed decision-making about standardizing pay scales and salaries across the local government.

20th November 2023

Time series analysis plays a vital role in interpreting data over time, encompassing aspects such as trend identification, spotting seasonal patterns, and noticing cyclical variations over extended periods. Techniques like moving averages and exponential smoothing are employed to emphasize underlying trends. Data decomposition is another essential tool, separating the data into trend, seasonal, and residual elements for better understanding.
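A short sketch of these techniques on a synthetic monthly series, using pandas for the moving average and exponential smoothing and statsmodels’ seasonal_decompose for the decomposition:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a trend and yearly seasonality
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
values = np.linspace(100, 160, 60) + 10 * np.sin(2 * np.pi * idx.month / 12)
series = pd.Series(values, index=idx)

# moving average and exponential smoothing to emphasize the underlying trend
series.plot(label="observed")
series.rolling(window=12, center=True).mean().plot(label="12-month moving average")
series.ewm(alpha=0.3).mean().plot(label="exponentially smoothed")
plt.legend()
plt.show()

# decomposition into trend, seasonal, and residual components
seasonal_decompose(series, model="additive", period=12).plot()
plt.show()
```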

Achieving stationarity, wherein the data’s statistical characteristics do not change over time, often necessitates methods like differencing or applying transformations. Tools such as the autocorrelation function (ACF) and partial autocorrelation function (PACF) are used to discover how observations at different time lags are interrelated.
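For example, with statsmodels one can run the Augmented Dickey-Fuller test, difference the series, and inspect the ACF/PACF plots; the random-walk series below is only an illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Illustrative non-stationary series: a random walk with drift
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))

# Augmented Dickey-Fuller test: a p-value below ~0.05 suggests stationarity
print("ADF p-value:", adfuller(series)[1])

# first-order differencing is a common way to remove a trend
diff = series.diff().dropna()
print("ADF p-value after differencing:", adfuller(diff)[1])

# ACF and PACF plots show how observations at different lags are related
plot_acf(diff, lags=20)
plot_pacf(diff, lags=20)
plt.show()
```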

In the realm of forecasting, ARIMA models are fundamental, integrating aspects of autoregression, differencing, and moving averages. Exponential smoothing techniques are vital for precise predictions, while more sophisticated models like Prophet and Long Short-Term Memory (LSTM) networks further refine forecasting accuracy.
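As one concrete example of exponential smoothing, a Holt-Winters model from statsmodels can be fitted to a synthetic seasonal series; the additive trend and seasonality settings are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with a trend and yearly seasonality
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
values = np.linspace(100, 160, 60) + 10 * np.sin(2 * np.pi * idx.month / 12)
series = pd.Series(values, index=idx)

# Holt-Winters exponential smoothing with additive trend and seasonality
model = ExponentialSmoothing(
    series, trend="add", seasonal="add", seasonal_periods=12
).fit()

# forecast the next 12 months
print(model.forecast(12))
```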

Time series analysis is extensively applied in areas like financial market predictions, demand planning for inventory control, and energy usage optimization. In essence, time series analysis offers a detailed approach for extracting insights, making well-informed decisions, and projecting future trends across a range of time-sensitive data sets.

17th November 2023

The ARIMA (AutoRegressive Integrated Moving Average) model is a powerful method for forecasting time series data, encompassing three principal elements. The AutoRegressive (AR) part captures the links between an observation and a specified number of its previous values, denoted by ‘p’; a larger ‘p’ means the model draws on observations further back in time. The Integrated (I) aspect involves differencing the data to ensure stationarity, a critical step in time series analysis; the differencing order is indicated by ‘d’, the number of times differencing is applied. The Moving Average (MA) portion models the relationship between an observation and the residual (forecast) errors at earlier time steps, with ‘q’ denoting the number of lagged residuals included. The model is typically written as ARIMA(p, d, q).

ARIMA models are extensively used across various fields, including finance and environmental studies, for analyzing time-dependent datasets. The process of using an ARIMA model involves initial data exploration, checking for stationarity, selecting appropriate parameters, training the model, and then proceeding to validation, testing, and forecasting. These models are essential tools for analysts and data scientists, providing a structured approach to conducting robust time series forecasting and analysis.
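A minimal ARIMA sketch with statsmodels; the series is synthetic, and the order (p, d, q) = (1, 1, 1) is an assumed starting point that would normally be guided by ACF/PACF plots or an information criterion.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative daily series; a real analysis would start from the actual data
rng = np.random.default_rng(42)
idx = pd.date_range("2020-01-01", periods=200, freq="D")
series = pd.Series(np.cumsum(rng.normal(0.3, 1.0, 200)), index=idx)

# ARIMA(p, d, q) = ARIMA(1, 1, 1): one autoregressive lag, one order of
# differencing, and one lagged residual term
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())

# forecast the next 10 steps with the fitted model
print(model.forecast(steps=10))
```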

15th November 2023

Today, I learned about time series: a sequence of data points arranged in chronological order, recorded at consistent, equally spaced intervals. Time series data is widely used in various disciplines, including environmental studies, biology, finance, and economics. The primary aim when working with time series is to understand the patterns, trends, and behaviors that emerge in the data over time. Time series analysis involves tasks such as modeling, interpreting, and predicting future values by leveraging historical data trends. Forecasting anticipates future trends or outcomes based on past data, and the lifecycle of a forecasting project typically includes stages such as data collection, exploratory data analysis (EDA), model selection, training, validation and testing, deployment, and ongoing monitoring and maintenance. This systematic process is crucial for maintaining accurate and current forecasts, requiring periodic updates and refinements.

Baseline models serve as simple initial benchmarks or reference points against which more complex models can be compared. They provide a fundamental level of prediction, which is useful for evaluating the performance of more sophisticated modeling techniques.
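For instance, a naive last-value baseline and a seasonal naive baseline take only a few lines; the series and the 12-step season length below are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative monthly series split into a history and a hold-out period
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
series = pd.Series(np.linspace(50, 85, 36) + np.random.normal(0, 2, 36), index=idx)
train, test = series[:-6], series[-6:]

# naive baseline: every future value equals the last observed value
naive_forecast = pd.Series(train.iloc[-1], index=test.index)

# seasonal naive baseline: repeat the value from 12 months earlier
seasonal_naive = series.shift(12)[test.index]

# RMSE of each baseline, as a reference point for more complex models
def rmse(pred):
    return np.sqrt(((test - pred) ** 2).mean())

print("Naive RMSE:", rmse(naive_forecast))
print("Seasonal naive RMSE:", rmse(seasonal_naive))
```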

13th November 2023

Time series data is a collection of measurements taken at successive time intervals, playing a pivotal role in areas like finance, economics, and meteorology. It is distinguished by its trends, seasonal changes, and cyclical behaviors. The analysis of time series data is key to comprehending historical activities and uncovering underlying trends.

Forecasting, an essential component of time series analysis, uses past data to project future trends. Commonly employed methods include ARIMA and exponential smoothing, which use previous patterns and trends to predict future occurrences. This is particularly important in fields like stock market analysis, economic forecasting, and weather prediction, where precise forecasts can significantly improve decision-making and planning. The main challenge in forecasting is selecting an appropriate model and accurately interpreting the data, given its dependence on time.