10th November 2023

In their lecture, the professor outlined Decision Trees, a machine learning algorithm used for classification and regression tasks, notable for its tree-like structure where nodes represent decisions based on input features and branches show possible outcomes, leading to a final prediction at the leaves. This method is appreciated for its simplicity and its versatility across different data types, with applications ranging from medical diagnosis, using patient data to predict diseases, to finance, for assessing creditworthiness based on personal financial data. Key metrics in decision trees include the Gini Index, which measures dataset impurity, with lower values indicating purer nodes, and Information Gain, which evaluates a feature's effectiveness in reducing uncertainty, guiding the algorithm to prioritize the features that best classify the dataset.
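The two metrics mentioned above can be sketched in a few lines of plain Python; the toy labels are illustrative, not from the lecture:

```python
from collections import Counter
import math

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions; 0 means a pure node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits, the basis of information gain
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting parent into left/right children
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                            # 0.5 for a 50/50 node
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0 for a perfect split
```

A perfectly separating split drives both child impurities to zero, which is exactly what the tree-building algorithm searches for.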

8th November 2023

Today’s class delved into the concept of decision trees, which graphically represent decision-making processes. These trees are built by repeatedly splitting the dataset on specific features to refine decisions, using criteria such as information gain, Gini impurity, or entropy for classification splits, and mean squared error for regression splits. The process repeats until a stopping condition is met. The lecture also addressed the limitations of decision trees, particularly when the data deviates significantly from the average. Our recent project highlighted these shortcomings, demonstrating the necessity of aligning data characteristics with the most fitting analysis method, and suggesting that alternative approaches might sometimes be preferable.
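The splitting step can be sketched for a single numeric feature: try each candidate threshold and keep the one that minimizes the weighted Gini impurity of the two children. The data and function name below are made up for illustration:

```python
def best_split(xs, ys):
    # Exhaustively try thresholds on one numeric feature and return the
    # (threshold, weighted Gini impurity) pair with the lowest impurity.
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # skip degenerate splits with an empty child
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Toy data: the label flips once the feature exceeds 2
xs = [1, 2, 3, 4]
ys = ["a", "a", "b", "b"]
print(best_split(xs, ys))  # (2, 0.0): splitting at x <= 2 yields two pure children
```

A real tree applies this search recursively to each child node until the stopping condition is met.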

6th November 2023, Monday

Today’s lecture focused on the Chi-Square test, a robust statistical tool used for examining relationships between categorical variables. It’s especially useful for evaluating if two categorical variables are independent or associated. This involves comparing actual data in a contingency table with expected data assuming independence. There are several types of Chi-Square tests, each with a specific function. The Chi-Square Test for Independence is used to determine if there’s a significant link between two variables, helping to identify dependencies. The Chi-Square Goodness-of-Fit Test checks if observed data matches a particular distribution, like normal or uniform, which is useful for evaluating model fit. Finally, the Chi-Square Test for Homogeneity investigates whether the distribution of a categorical variable is consistent across different groups or populations. These varied applications provide a thorough understanding of the Chi-Square test’s utility in analyzing and interpreting categorical data across different statistical scenarios.
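The comparison of observed and expected counts can be sketched directly: under independence, each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over the cells. The 2×2 table below is hypothetical:

```python
def chi_square_statistic(table):
    # table: rows of observed counts from a contingency table.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical table: rows = group A/B, columns = outcome yes/no
observed = [[20, 30],
            [30, 20]]
print(chi_square_statistic(observed))  # 4.0
```

With (rows − 1) × (columns − 1) = 1 degree of freedom, 4.0 exceeds the 3.841 critical value at the 0.05 level, so independence would be rejected for this made-up table.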

3rd November 2023, Friday

Data processing involves a number of steps to turn raw data into insightful information. Among these steps are:

Collection: Compiling information from different sources.

Preparation: Data refinement and conversion into an appropriate format.

Input: Entering the data into a processing system.

Processing: Applying various operations to the data, such as aggregation, transformation, sorting, and classification.

Output: Creating a variety of outcomes, including tables, graphs, and documents.

Storage: Preserving the information for later use.
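The stages above can be sketched as a tiny end-to-end pass; every value and file name here is illustrative:

```python
# Collection: raw strings gathered from some source (made-up data)
raw = [" 3", "1", "2 ", "bad", "5"]

def prepare(records):
    # Preparation: strip whitespace and drop values that fail conversion
    cleaned = []
    for r in records:
        try:
            cleaned.append(int(r.strip()))
        except ValueError:
            pass
    return cleaned

data = prepare(raw)                     # Input: feed cleaned data into processing
data.sort()                             # Processing: sorting
total = sum(data)                       # Processing: aggregation
report = f"n={len(data)}, sum={total}"  # Output: a small textual report
print(report)                           # n=4, sum=11

# Storage: preserve the result for later use (a local file here)
with open("report.txt", "w") as f:
    f.write(report)
```

Each stage hands its result to the next, which is why problems in an early stage (e.g. skipped cleaning) propagate into every later output.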

1st November 2023, Wednesday

Effective analysis relies heavily on the treatment of outliers and missing data in datasets. Outliers, observations significantly different from the majority, can be spotted through scatter plots or detected using z-scores. Techniques for handling them include removal, which risks losing valuable information; capping, which places outliers at predefined limits; binning, which groups continuous data into intervals; and transformations, which reduce variability.
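Two of these techniques, z-score detection and capping, can be sketched with the standard library alone; the data below is made up, with one obvious outlier:

```python
import statistics

def z_scores(values):
    # Standard score: how many standard deviations each value lies from the mean
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def cap(values, low, high):
    # Capping: clamp extreme values to predefined limits instead of removing them
    return [min(max(v, low), high) for v in values]

data = [10, 12, 11, 13, 12, 95]  # 95 is a clear outlier
print([round(z, 2) for z in z_scores(data)])  # the outlier gets the largest z-score
print(cap(data, 0, 20))                       # [10, 12, 11, 13, 12, 20]
```

Capping keeps the record in the dataset while limiting its influence, which is the trade-off against outright removal mentioned above.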
Null data can be categorized into three kinds: Missing Not At Random (MNAR), Missing At Random (MAR), and Missing Completely At Random (MCAR). Tools like graphical utilities and pandas help with identifying these. Deleting entire records with null values is one approach, but it may lead to the loss of significant information. Imputing missing values using statistical measures like the mean, median, or mode is effective for MCAR data. Pairwise deletion uses the available data for analysis without imputing. Iterative imputation generates multiple estimates for each missing value, model-based imputation estimates values using predictive models, and forward/backward filling uses adjacent data points to fill gaps in time series data.
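Two of the simpler strategies above, mean imputation (suited to MCAR data) and forward filling (suited to time series), can be sketched without any libraries; the series is illustrative:

```python
def mean_impute(values):
    # Replace each None with the mean of the observed values (MCAR-friendly)
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    # Carry the last observed value forward, as in time series gap-filling
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result

series = [1.0, None, 3.0, None, 5.0]
print(mean_impute(series))   # [1.0, 3.0, 3.0, 3.0, 5.0]
print(forward_fill(series))  # [1.0, 1.0, 3.0, 3.0, 5.0]
```

Note how the two methods fill the same gap differently: mean imputation uses the whole column, while forward filling trusts only the most recent observation.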