A sample space event (or simply an event) is a specific outcome, or set of outcomes, that can occur within the sample space of an experiment or trial.
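As a minimal sketch, consider rolling a fair die: the sample space is the six faces, and an event (such as "roll an even number") is a subset of that space.

```python
# Sample space of a single die roll
sample_space = {1, 2, 3, 4, 5, 6}

# The event "roll an even number" is a subset of the sample space
even_event = {x for x in sample_space if x % 2 == 0}

print(even_event)                   # outcomes in the event
print(even_event <= sample_space)   # an event is always a subset of the sample space
```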
Probability in data analytics is a powerful tool for dealing with uncertainty, modeling variability in data, and making informed decisions based on available information. It forms the basis for statistical reasoning, hypothesis testing, and various machine learning techniques.
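Under the classical definition, and assuming all outcomes are equally likely, the probability of an event is the size of the event divided by the size of the sample space:

```python
# Classical probability: |event| / |sample space|,
# assuming all outcomes are equally likely
sample_space = {1, 2, 3, 4, 5, 6}
event = {2, 4, 6}   # "roll an even number"

p = len(event) / len(sample_space)
print(p)  # 0.5
```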
• Mean: The sum of all values in a dataset divided by the number of values. It provides a measure of the central location and is sensitive to extreme values.
• Median: The middle value when the data are sorted. If there's an even number of observations, it's the average of the two middle values.
• Mode: The mode is the value that occurs most frequently in a dataset.
• Range: The range is the difference between the maximum and minimum values in a dataset.
• Variance: Variance measures how far each data point in a dataset is from the mean. It is the average of the squared differences from the mean. It quantifies the overall variability in the dataset.
• Standard Deviation: The standard deviation is the square root of the variance. It provides a measure of how spread out the values are around the mean. It is a widely used indicator of the amount of variation or dispersion in a dataset.
• Quartile: Quartiles divide a dataset into four equal parts. The first quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the median, and the third quartile (Q3) is the value below which 75% of the data falls. Quartiles are useful for understanding the distribution and identifying potential outliers.
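All of the measures above are available as pandas methods. The sketch below uses a small made-up dataset; `ddof=0` requests the population variance (the plain average of squared deviations described above):

```python
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

mean = data.mean()                    # 5.0
median = data.median()                # 4.5 (average of the two middle values)
mode = data.mode()[0]                 # 4 (most frequent value)
value_range = data.max() - data.min() # 7

variance = data.var(ddof=0)           # 4.0 (average squared deviation from the mean)
std = data.std(ddof=0)                # 2.0 (square root of the variance)

# Quartiles split the sorted data into four equal parts
q1, q2, q3 = data.quantile([0.25, 0.5, 0.75])
```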
Correlation analysis, often computed using the corr() function, is a statistical method used to measure the strength and direction of a linear relationship between two variables.
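A quick sketch of `corr()` with a hypothetical dataset; values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little linear relationship:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 55, 61, 68, 74],
})

# Pearson correlation matrix for all numeric columns
print(df.corr())

# Correlation between two specific columns
r = df["hours_studied"].corr(df["exam_score"])
```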
Anomalies, in the context of data analysis, refer to values or patterns that deviate significantly from the expected or typical behavior of the data.
Outliers are data points that stand out from the rest of the dataset because they are significantly higher or lower than most of the other values.
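One common way to flag outliers is the IQR rule, which builds directly on the quartiles defined earlier: values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR are flagged. A sketch with hypothetical data:

```python
import pandas as pd

data = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = data.quantile([0.25, 0.75])
iqr = q3 - q1                         # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = data[(data < lower) | (data > upper)]
print(outliers.tolist())  # 102 stands far above the rest
```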
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data.
A common example is the paired-samples test, typically applied to data such as before-and-after blood pressure measurements for the same individuals.
- H₀: the difference between the two samples is 0 (no effect)
- H₁: the difference between the two samples is not 0 (there is an effect)
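A sketch of such a test using SciPy's paired t-test, with hypothetical before/after blood-pressure readings (the data and the 0.05 significance level are assumptions for illustration):

```python
from scipy import stats

# Hypothetical before/after blood-pressure readings for the same individuals
before = [140, 150, 138, 145, 160, 155, 148, 152]
after  = [132, 144, 135, 140, 151, 148, 142, 146]

# Paired t-test: H0 says the mean difference is 0
t_stat, p_value = stats.ttest_rel(before, after)

if p_value < 0.05:
    print("Reject H0: the measurements changed significantly")
else:
    print("Fail to reject H0")
```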
Simple linear regression fits a straight line to the data, making it a powerful tool for understanding and predicting how one variable influences another.
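A minimal sketch using SciPy's `linregress`, with made-up x/y data; the fitted slope and intercept define the line y = slope·x + intercept, which can then be used for prediction:

```python
from scipy import stats

x = [1, 2, 3, 4, 5]        # hypothetical predictor, e.g. years of experience
y = [30, 35, 41, 44, 50]   # hypothetical response, e.g. salary in thousands

result = stats.linregress(x, y)
print(f"y = {result.slope:.2f} * x + {result.intercept:.2f}")

# Predict the response for an unseen x value
prediction = result.slope * 6 + result.intercept
```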
A line chart is a type of data visualization that displays data points over a continuous interval or time span and connects these points with straight lines. It is particularly useful for showing trends, patterns, and changes in data over time.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
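A minimal Matplotlib line chart, using made-up monthly sales figures (the non-interactive `Agg` backend is selected so the script also runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 150, 170]   # hypothetical monthly sales

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")  # points connected with straight lines
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly Sales Trend")
fig.savefig("sales_trend.png")
```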
Seaborn is a popular Python data visualization library built on top of Matplotlib. Seaborn simplifies the process of creating complex visualizations, such as statistical plots, heatmaps, pair plots, and more, with less code. Seaborn complements Matplotlib and is often used for data analysis and exploration due to its simplified and informative plotting capabilities.
- When the data are categorical, a bar graph is a good way to visualize them.
- Bar charts are also usually chosen to highlight increases or decreases over a period of time.
• Bar charts are used to view the frequency of categorical data.
• Bar charts in Seaborn are created with the barplot() function.
The Seaborn library provides a barplot() function that can automatically compute averages within each category.
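A sketch of that behavior with a small made-up dataset: barplot() averages the repeated `sales` values within each `region` before drawing the bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with repeated observations per category
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales":  [100, 120, 80, 90, 100],
})

# barplot() automatically averages sales within each region
ax = sns.barplot(data=df, x="region", y="sales")
ax.set_ylabel("Average sales")
plt.savefig("sales_by_region.png")
```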
• The histogram visualization is also bar-shaped.
• However, a histogram shows the frequency of numerical data grouped into intervals (bins), so both of its axes are numerical.
A bin, sometimes called a class interval, is a way of grouping data in a histogram.
• Histograms are commonly used to view the frequency of continuous numerical data.
• Bar charts are commonly used to view the frequency of categorical data.
Adjusting bin size using NumPy allows you to control the width of the intervals in a histogram, providing a more detailed or coarse view of your data. Smaller bin sizes capture finer patterns, while larger ones smooth out the data's distribution.
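A sketch of that trade-off, using NumPy to build the bin edges explicitly for simulated data (the normal distribution and the 5-bin vs 30-bin comparison are illustrative choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import matplotlib.pyplot as plt

# Simulated data: 1000 draws from a normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Coarse view: 5 wide bins smooth out the distribution
axes[0].hist(data, bins=np.linspace(data.min(), data.max(), 6))
axes[0].set_title("5 bins")

# Detailed view: 30 narrow bins reveal finer structure
axes[1].hist(data, bins=np.linspace(data.min(), data.max(), 31))
axes[1].set_title("30 bins")

fig.savefig("bin_sizes.png")
```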