Vectorisation is a technique used to make machine learning algorithms, like linear regression, more efficient by performing operations on entire vectors or matrices at once, instead of using loops to process one element at a time.

In the context of multiple linear regression, vectorisation allows you to compute the dot product of the feature vector and parameter vector in a single step, which simplifies the implementation and speeds up calculations.

By leveraging vectorised operations, you can take advantage of optimised linear algebra libraries, leading to faster computation and more scalable models.

The linear regression model with a single feature:

f_{w,b}(**x**) = wx + b

The linear regression model with multiple features:

f_{→w,b}(**→x**) = w_{1}x_{1} + w_{2}x_{2} + ... + w_{n}x_{n} + b

**→w** = [w_{1} w_{2} w_{3} ... w_{n}]  *parameters of the model*

b is a number

**→x** = [x_{1} x_{2} x_{3} ... x_{n}]  *feature vector*

f_{→w,b}(**→x**) = **→w** · **→x** + b = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + ... + w_{n}x_{n} + b

Here **→w** · **→x** is the *dot product*, and the sum w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + ... + w_{n}x_{n} is the *multiple linear regression* part of the model.

The following Python code defines the weights, bias, and feature vector:

```
import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])
```

In this example, **w** is the weight vector, **b** is the bias term, and **x** is the feature vector. The values in **w** and **x** represent the parameters and features for a model with 3 features (n = 3).

When performing calculations without vectorisation, each weight is multiplied by the corresponding feature value, and then the bias is added:

f_{→w,b}(**→x**) = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b

```
f = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b
```

Without vectorisation, each calculation is performed sequentially, which can be inefficient for large datasets.

If the number of features is large (e.g., n = 100,000), calculating the output manually becomes more complex and slower:

f_{→w,b}(**→x**) = ∑_{j=1}^{n} w_{j}x_{j} + b

```
f = 0
for j in range(n):
    f = f + w[j] * x[j]
f = f + b
```

This method is computationally expensive when the number of features, n, is very large.
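A quick sketch of the cost difference, using made-up random data with n = 100,000 features (the sizes and timing method are illustrative assumptions, not from the original notes):

```python
import time
import numpy as np

n = 100_000
rng = np.random.default_rng(0)  # hypothetical random weights and features
w = rng.standard_normal(n)
x = rng.standard_normal(n)
b = 4.0

# Sequential loop: one multiply-add per feature
start = time.perf_counter()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_time = time.perf_counter() - start

# Vectorised dot product: one library call
start = time.perf_counter()
f_vec = np.dot(w, x) + b
vec_time = time.perf_counter() - start

# Both give the same prediction (up to floating-point rounding)
print(np.isclose(f_loop, f_vec))
```

On most machines the `np.dot` version is orders of magnitude faster, because the library performs the multiply-adds in optimised, parallelised native code.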

Using vectorisation, the same calculation can be done in a single step.

Without vectorisation, the model and cost function look like this:

f_{→w,b}(**→x**) = w_{1}x_{1} + ... + w_{n}x_{n} + b

J(w_{1}, ..., w_{n}, b)

With vectorisation the model and cost function look like this:

f_{→w,b}(**→x**) = **→w** · **→x** + b

J(**→w**, b)

Vectorisation allows us to use efficient linear algebra operations, like the dot product, to compute the result much faster. For instance, `np.dot(w, x) + b` computes the entire prediction in a single call.
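Putting the earlier definitions together, the vectorised prediction is one line:

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

# Dot product replaces the explicit sum w[0]*x[0] + w[1]*x[1] + w[2]*x[2]
f = np.dot(w, x) + b
print(f)  # ≈ -35.0  (10 + 50 - 99 + 4)
```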

The main difference in gradient descent for multiple features is that **the parameters and features are vectors, but the update rule still follows the same principle**: you adjust each parameter by subtracting the product of the learning rate and the derivative of the cost function with respect to that parameter.

In single-feature linear regression, gradient descent updates the parameter *w* and the bias *b* using their respective derivatives with respect to the cost function. When you move to **multiple-feature linear regression**, the key difference is that both the parameters *w* and the features *x* become **vectors** rather than single values.

- The update rule for each individual parameter w_{i} (where *i* refers to the specific feature) is very similar to the single-feature case. The error term (the difference between the prediction and the target value) remains the same.
- For multiple features, **you update each parameter** w_{i} (for *i = 1, 2, …, n*) based on the derivative of the cost function with respect to that parameter, just like for a single feature.
- The process repeats for all the parameters w_{1}, w_{2}, …, w_{n}, and then the bias *b* is updated, just as it is in the single-feature case.
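The update rules above can be sketched in vectorised form. This is a minimal illustration (the helper name, toy data, learning rate, and iteration count are all assumptions for the example, not from the notes):

```python
import numpy as np

def gradient_descent_step(w, b, X, y, alpha):
    """One vectorised gradient descent update for multiple linear regression.

    X: (m, n) feature matrix, y: (m,) targets,
    w: (n,) weights, b: scalar bias, alpha: learning rate.
    """
    m = X.shape[0]
    err = X @ w + b - y          # prediction error for every example at once
    dj_dw = (X.T @ err) / m      # vector of partial derivatives, one per w_j
    dj_db = err.sum() / m
    # Every w_j and b updated simultaneously, same rule as the single-feature case
    return w - alpha * dj_dw, b - alpha * dj_db

# Tiny made-up dataset: 3 examples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])
w, b = np.zeros(2), 0.0
for _ in range(5000):
    w, b = gradient_descent_step(w, b, X, y, alpha=0.1)
```

Note that `err` is computed once per step and reused for every partial derivative, which is what makes the vectorised form efficient.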

When you have different features that take on very different ranges of values, it can cause gradient descent to run slowly. Rescaling the features so they all take on comparable ranges of values can significantly speed up gradient descent. Scaling features helps ensure that all features contribute equally to the learning process, preventing features with larger scales from dominating and thereby improving the efficiency and convergence of gradient descent.

Mean normalisation:

x_{1} = (x_{1} - μ_{1}) / (max - min)

Z-score normalisation:

x_{1} = (x_{1} - μ_{1}) / σ_{1}
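Both normalisations can be applied column-wise with numpy. A small sketch, using made-up data (size in square feet and bedroom count are illustrative features):

```python
import numpy as np

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [852.0, 2.0]])  # made-up data: size (sq ft), bedrooms

# Mean normalisation: (x - mean) / (max - min), per feature column
mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalisation: (x - mean) / standard deviation, per feature column
z_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After z-score normalisation each column has mean 0 and standard deviation 1, so both features take on comparable ranges.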

Aim for about -1 ≤ x_{j} ≤ 1 for each feature x_{j}; ranges of roughly -3 ≤ x_{j} ≤ 3 or -0.3 ≤ x_{j} ≤ 0.3 are also acceptable.

- 0 ≤ x_{1} ≤ 3 → okay, no rescaling
- -2 ≤ x_{2} ≤ 0.5 → okay, no rescaling
- -100 ≤ x_{3} ≤ 100 → too large, rescale
- -0.001 ≤ x_{4} ≤ 0.001 → too small, rescale
- 98.6 ≤ x_{5} ≤ 105 → too large, rescale

Feature engineering is the process of creating new features by transforming or combining existing ones, based on your knowledge or intuition about the problem. The goal is to make it easier for the learning algorithm to make accurate predictions by providing more relevant or insightful features, which can lead to a better-performing model than simply using the original features.

For example, feature engineering can involve creating a new feature, such as calculating the area of a plot of land (*x_{3} = x_{1} × x_{2}*) from its width (*x_{1}*) and depth (*x_{2}*).
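In code, the engineered feature is just an elementwise product appended as a new column (the width and depth values here are made up for illustration):

```python
import numpy as np

frontage = np.array([40.0, 60.0, 25.0])   # x1: hypothetical lot widths
depth = np.array([100.0, 80.0, 120.0])    # x2: hypothetical lot depths

area = frontage * depth                   # engineered feature x3 = x1 * x2
X = np.column_stack([frontage, depth, area])
print(X.shape)  # (3, 3): the model now sees three features per example
```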

Polynomial regression lets us fit non-linear relationships. Instead of modeling the target variable *y* solely as a linear function of *x*, polynomial regression includes terms such as *x*^{2} and *x*^{3}, or √*x*, enabling the model to capture curves in the data.

Selecting the appropriate degree for the polynomial is crucial. Higher-degree polynomials can model more complex relationships but may introduce overfitting, where the model captures noise instead of the underlying pattern. It's essential to choose a degree that balances model complexity with the ability to generalize to new data.

Consider a dataset used to predict housing prices based on the size of the house in square feet. A simple linear regression might not fit the data well if the relationship between size and price is non-linear. By incorporating polynomial terms, we can model this non-linear relationship more effectively.

For instance, adding a quadratic term (*x*^{2}) allows the model to fit a parabolic curve to the data. However, a quadratic model might eventually predict that prices decrease with increasing size, which may not make sense in this context since larger houses typically cost more. Alternatively, including a cubic term (*x*^{3}) can model more complex curves that continue to increase with size, potentially providing a better fit to the data.

Using functions like the square root of *x* (√*x*) is another way to model relationships where the rate of increase in price slows down as size increases. This function becomes less steep as *x* increases, capturing scenarios where additional square footage adds less value beyond a certain point.

When using polynomial features, feature scaling becomes increasingly important. Polynomial terms can have vastly different scales:

- If *x* ranges from 1 to 1,000 square feet, then:
  - *x*^{2} ranges from 1 to 1,000,000.
  - *x*^{3} ranges from 1 to 1,000,000,000.

Without scaling, features with larger values can disproportionately influence the model, leading to slow convergence or instability in gradient descent optimization. Applying feature scaling ensures that all features contribute equally to the learning process, improving the efficiency and convergence of the algorithm.
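A short sketch of both steps together: building polynomial features and then z-score scaling them so the columns end up in comparable ranges (the sizes are made-up values for illustration):

```python
import numpy as np

size = np.array([100.0, 500.0, 1000.0])   # made-up house sizes (sq ft)

# Polynomial features span wildly different ranges:
# size up to 1e3, size**2 up to 1e6, size**3 up to 1e9
X_poly = np.column_stack([size, size**2, size**3])

# Z-score scale each column before running gradient descent
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)

# Every scaled column now has mean 0 and standard deviation 1
print(X_scaled.std(axis=0))
```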