The methods of statistical learning have evolved alongside programming languages. With built-in algorithms and interfaces (Tableau, Power BI, Cloudera, etc.), people can now employ these learning models more conveniently. However, the emerging bottleneck is the data wrangling/cleaning phase of data science.
Forbes: Data scientists report that they spend most of their time massaging data rather than mining or modeling it.
In fact, there is also a difference between data engineering and data science: not every data engineer needs to be good at data science, but every data scientist should be good at data engineering.
At the end of the day, every data handler ends up with SQL… Here you can find three different implementations of SQL libraries in Python: sqlite3, MySQL, and PostgreSQL. You can also find an implementation of SQLAlchemy in this post.
Let's start with sqlite… Below you can find a basic implementation of creating a sqlite3 database in Python.
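A minimal sketch of creating a sqlite3 database with Python's standard-library `sqlite3` module (the `customers` table, its columns, and the sample row are illustrative assumptions, not from the original post):

```python
import sqlite3

# Connecting creates the database file if it does not exist;
# ":memory:" would give a throwaway in-memory database instead.
conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Create an illustrative table for a churn-style dataset.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        churned INTEGER DEFAULT 0
    )
""")

# Parameterized insert (the ? placeholders avoid SQL injection).
cur.execute("INSERT INTO customers (name, churned) VALUES (?, ?)", ("Alice", 0))
conn.commit()

# Read the data back.
cur.execute("SELECT name, churned FROM customers")
rows = cur.fetchall()
print(rows)

conn.close()
```

Because SQLite is serverless, the whole database lives in `example.db`; no server process or installation step is involved.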
Notes: SQLite is serverless and works with local files. No installation is needed, and it is very quick. All transactions in SQLite are ACID (Atomicity, Consistency, Isolation, Durability).
First of all, we are all aware that the power of data science comes from that of statistics. Therefore, take a close look at some basics of data examination as a guideline.
Let's say we have a churn dataset as below,
Before diving into estimating the output in depth, everyone should check for some basic statistical challenges such as missing data, outliers, and correlations. I don't even mention the assumptions of the given machine learning model.
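Those first two checks can be sketched in a few lines of pandas. The column names and values below are made up for illustration, since the actual churn dataset is not reproduced here:

```python
import numpy as np
import pandas as pd

# A tiny made-up churn-like frame with one deliberate missing value.
df = pd.DataFrame({
    "tenure":  [1, 34, 2, 45, np.nan],
    "charges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "churn":   [0, 0, 1, 0, 1],
})

# 1) Missing data: count NaNs per column.
missing = df.isna().sum()
print(missing)

# 2) Correlations: pairwise correlation matrix of the numeric columns.
corr = df.corr()
print(corr)
```

`isna().sum()` immediately shows which columns need imputation or dropping, and the correlation matrix flags redundant or suspiciously related variables before any modeling starts.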
One can see the outliers in the box plot visualisation of each variable.
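The whiskers of a box plot follow the IQR rule, and the same rule can be applied numerically to flag outliers. A sketch with made-up values (one obvious outlier planted at 95):

```python
import pandas as pd

# Made-up variable containing one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Quartiles and interquartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Standard box-plot fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is what the box plot draws as a point.
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [95]
```

This is exactly the cutoff a box plot visualizes, so inspecting the fences numerically and looking at the plot should agree.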
The more data we have, the more understanding we get.