8th December 2023

Time series data must be interpreted using statistical methods such as the Autocorrelation Function (ACF), which measures the correlation of a series with lagged copies of itself over various time lags. A positive autocorrelation coefficient indicates that past and present values tend to move together, whereas a negative value indicates that they tend to move in opposite directions. By revealing persistent structure such as trends and seasonality, the ACF makes forecasting more precise.
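
As an illustrative sketch, autocorrelation coefficients can be computed with statsmodels; the random-walk series and the lag count below are assumptions chosen for demonstration, not data from my project.

```python
# Minimal ACF sketch: the series and lag count are illustrative assumptions.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))   # a toy random-walk series

coefficients = acf(y, nlags=20)       # autocorrelation at lags 0..20
for lag, c in enumerate(coefficients):
    print(f"lag {lag:2d}: {c:+.3f}")  # positive -> past and present move together
```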

It is widely used in fields where forecasting future behaviour is essential, such as environmental science and economics, since reliable forecasts depend on understanding how past values shape present ones. For instance, precise stock price forecasting requires understanding market autocorrelations, while trustworthy weather forecasting requires long-term modelling of meteorological autocorrelations. By identifying significant sequential patterns, the ACF enables researchers in various domains to predict events more accurately. It is a crucial statistical tool for interpreting structure in time series analysis.

6th December 2023

The qualities and patterns found in the data strongly influence the appropriateness of time series and LSTM models. I’ve learned from experience that choosing the best model and fitting it to the dataset are essential to forecasting success. The understanding I obtain from using these models continues to be crucial in forming my viewpoint on sequential data as I progress in my data analysis career.

Beyond traditional statistical methods, time series forecasting is a crucial analytical technique for uncovering trends and patterns concealed within time-based data. By utilising previous data and interpreting temporal relationships, it enables well-informed decision-making about future outcomes.

Time series analysis is essential to data science and forecasting because it provides a window into how events evolve over time. It makes it possible to analyse historical data in order to find recurring patterns, seasonal effects, and long-term trends.

4th December 2023

My work on long short-term memory (LSTM) networks has provided valuable insights. A characteristic ability of LSTMs is to deal with extended dependencies in sequential data, which is a common challenge. Their integrated memory cell and three dedicated gates (forget, input, and output) allow LSTMs to retain or discard information. This allows them to carry important information through long sequences, which turns out to be very useful for natural language processing and my time series projects.

In addition, the study of time series models has been rewarding. Time series analysis assumes that data points collected over time are related and that order is important. I mainly focused on two types of time series models: univariate and multivariate. Univariate models such as ARIMA and exponential smoothing highlight trends and seasonality in individual variables, while multivariate models such as Vector Autoregression (VAR) and Structural Time Series provide a bigger picture by looking at several interrelated variables.
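
To make the gating idea described above concrete, below is a minimal Keras sketch for one-step-ahead forecasting. The window length, layer size, and synthetic data are illustrative assumptions rather than settings from my actual projects.

```python
# A minimal Keras LSTM sketch for univariate time series forecasting.
# Shapes and hyperparameters are illustrative assumptions, not tuned values.
import numpy as np
import tensorflow as tf

window, features = 30, 1  # 30 past time steps predict the next value

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, features)),
    tf.keras.layers.LSTM(64),   # gated memory cell: forget, input, output gates
    tf.keras.layers.Dense(1),   # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(100, window, features)  # toy training windows
y = np.random.rand(100, 1)
model.fit(X, y, epochs=2, verbose=0)
```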

1st December 2023

In my research, I delved into validating the findings from our regression analysis. Using statistical tests, I examined whether the observed relationships in our study were statistically significant. This meticulous process enhances the credibility of our findings and lays a solid foundation for interpreting the implications of our work.

I formulated hypotheses (informed predictions) and rigorously tested them to determine the magnitude and direction of the correlations in our data. Such a stringent approach not only bolsters the validity of our outcomes but also aids in making informed decisions grounded in empirical evidence. The inclusion of hypothesis testing in my research ensures that my conclusions are not mere coincidences, but are underpinned by robust statistical backing.
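
A sketch of how such a test can look with statsmodels, assuming synthetic data with a known positive relationship; the null hypothesis for each coefficient is that it equals zero.

```python
# Hypothesis-testing sketch: H0 is that a coefficient equals zero.
# The data here is synthetic; the variable names are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)  # a genuine positive relationship plus noise

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)    # estimated intercept and slope (magnitude and direction)
print(model.pvalues)   # small p-value -> relationship unlikely to be coincidence
```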

29th November 2023

The data was first standardized using z-score scaling, a process that normalizes the dataset and ensures each feature contributes equally to the analysis. Following this, Principal Component Analysis (PCA) was conducted to extract principal components from the standardized data. This step is crucial in transforming the data into a set of linearly uncorrelated variables, known as principal components.
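
A minimal scikit-learn sketch of these two steps; the random matrix `X` is an assumption standing in for the actual dataset.

```python
# Z-score standardization followed by PCA; X stands in for the real dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(200, 8)                    # toy data: 200 samples, 8 features

X_scaled = StandardScaler().fit_transform(X)  # mean 0, variance 1 per feature
pca = PCA().fit(X_scaled)                     # linearly uncorrelated components
```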

Each of these principal components was then evaluated for its explained variance ratio, which indicates the proportion of the dataset’s total variance that is captured by each component. This information is essential in understanding the significance of each principal component in representing the dataset.

Furthermore, a visualization was created to display the cumulative explained variance as a function of the number of principal components used. This graphical representation is invaluable for determining the optimal number of principal components required for dimensionality reduction. It helps in deciding how many principal components should be retained to capture the majority of the variance in the data while reducing the dimensionality, thus striking a balance between data simplification and information retention.
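
Continuing the sketch above, the explained variance ratios and their cumulative sum can be inspected and plotted; the 0.95 threshold line is a common convention, not a value from my analysis.

```python
# Explained variance per component and the cumulative curve used to decide
# how many components to retain (continuing from the PCA sketch above).
import matplotlib.pyplot as plt

ratios = pca.explained_variance_ratio_       # variance share per component
cumulative = ratios.cumsum()

plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(0.95, linestyle="--")            # a common retention threshold
plt.show()
```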

27th November 2023

A binary variable was developed to denote instances where the service time exceeds 30 minutes. This variable serves as the target for the predictive model, which aims to ascertain if the service time will surpass this 30-minute threshold in the test dataset. The model’s predictive capability is then quantitatively assessed by measuring its accuracy, which reflects the proportion of total predictions that were correct.
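
A minimal sketch of the target construction and accuracy calculation; the toy DataFrame, the column name `service_time`, and the stand-in predictions are all illustrative assumptions.

```python
# Binary target for "service time over 30 minutes" and the accuracy metric.
# The DataFrame and column name `service_time` are illustrative assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({"service_time": [12, 45, 28, 60, 31, 19]})
df["over_30"] = (df["service_time"] > 30).astype(int)  # 1 if over threshold

y_true = df["over_30"]
y_pred = [0, 1, 0, 1, 0, 0]            # stand-in model predictions
print(accuracy_score(y_true, y_pred))  # share of correct predictions (5/6 here)
```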

In addition to accuracy, a confusion matrix was generated. This matrix is a critical tool in evaluating the performance of the model in binary classification tasks. It presents a detailed breakdown of the model’s predictions, showcasing not only the correct predictions (true positives and true negatives) but also the errors it made (false positives and false negatives). This comprehensive analysis allows for a deeper understanding of the model’s strengths and weaknesses, particularly in differentiating between instances with service times above and below the 30-minute mark. By combining the accuracy metric with the insights from the confusion matrix, a more nuanced evaluation of the model’s effectiveness in binary classification is achieved.
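
Continuing the sketch, scikit-learn's confusion matrix breaks those same predictions down into the four cells described above.

```python
# Confusion matrix for the same stand-in predictions: rows are actual
# classes, columns are predicted classes (scikit-learn's convention).
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```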

24th November 2023

In my project, I focused on analyzing a research dataset. This dataset was initially split into two distinct subsets: a training set and a testing set. This division is a standard practice in machine learning, allowing for the development of models on one subset (training) and evaluating their performance on another (testing).
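
A sketch of that split, assuming a small toy DataFrame in place of the actual research dataset; the 80/20 ratio and the column names are illustrative.

```python
# A conventional 80/20 train/test split; the toy DataFrame stands in for
# the actual research dataset, with study types already encoded as numbers.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "study_type":   [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "service_time": [20, 35, 50, 22, 33, 55, 18, 40, 48, 25],
})
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
```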

The next step involved calculating the average service time for different categories of studies within the dataset. This calculation is crucial as it provides insights into the typical duration associated with each study type, forming a basis for further analysis.
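
Continuing the sketch, the per-category averages can be computed with a pandas group-by.

```python
# Average service time per study type (continuing from the split above).
avg_times = train_df.groupby("study_type")["service_time"].mean()
print(avg_times)
```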

Subsequently, I prepared the features (independent variables) and the target variable (dependent variable) for developing a linear regression model. Linear regression is a statistical method used for predicting a continuous target variable based on one or more features.
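
Continuing the sketch, the features and target are separated and a linear regression model is fitted; using `study_type` as the single feature is an assumption made for illustration.

```python
# Feature/target preparation and model fitting (continuing the sketch).
from sklearn.linear_model import LinearRegression

X_train = train_df[["study_type"]]   # independent variable(s)
y_train = train_df["service_time"]   # continuous target

model = LinearRegression().fit(X_train, y_train)
```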

The model was then applied to the test set to predict service times. Predictions are essential for assessing the model’s ability to generalize to new, unseen data, which is a critical aspect of machine learning models.
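
The corresponding prediction step, continuing the sketch:

```python
# Predictions on the held-out test set (continuing the sketch).
X_test = test_df[["study_type"]]
y_pred = model.predict(X_test)
```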

For visualization, I used matplotlib, a popular Python library, to plot the regression line. This line represents the model’s predictions across the range of study types, illustrating the relationship between the type of study and the service time as interpreted by the model.
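
A sketch of that plot, continuing from the fitted model above; the axis labels are assumptions about the encoding and units.

```python
# Scatter of actual test points with the fitted regression line overlaid.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = X_test["study_type"]
plt.scatter(x, test_df["service_time"], label="actual")
grid = pd.DataFrame({"study_type": np.linspace(x.min(), x.max(), 50)})
plt.plot(grid["study_type"], model.predict(grid), label="regression line")
plt.xlabel("study type (encoded)")
plt.ylabel("service time (minutes)")
plt.legend()
plt.show()
```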

To evaluate the model’s accuracy, I employed the Root Mean Squared Error (RMSE) metric. RMSE is a standard measure in regression analysis that quantifies the difference between the observed actual outcomes and the outcomes predicted by the model. A lower RMSE value indicates better model performance.
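
Continuing the sketch, RMSE can be computed by taking the square root of scikit-learn's mean squared error.

```python
# RMSE between actual and predicted service times (continuing the sketch).
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(test_df["service_time"], y_pred))
print(f"RMSE: {rmse:.2f} minutes")
```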

The culmination of this process is a comprehensive figure. This visual representation not only depicts the predicted average service time for each study type as determined by the linear regression model but also provides an intuitive understanding of the model’s predictive accuracy and its fit to the actual data.