Data Science: All you need to know before starting a project
“Data, data, data!" he yelled impatiently. "I can’t make bricks without clay!”
—Arthur Conan Doyle
In this article, Zoox Smart Data is pleased to bring you a little more about the universe of data and how putting it into perspective has led to what we experience today as digital transformation.
"Big Data" is the buzzword of the moment. Most people who encounter it realize that it is a powerful force in a digital transformation process. However, few know the variety of tools available that can help businesses large and small take advantage of this revolution.
But after all, what is big data, what is data science, how to start a data science project?
In this article, we try to bring, in a very elucidative way, what it is, where to start, and the perspectives of this segment for the market.
Enjoy your reading.
What is Data Science?
Before we explain what Data Science is, it is necessary to explain, in a didactic way, what big data is and why it is important.
Traditionally, big data is the term for data with enormous volume, velocity, and variety. Traditional database technologies cannot handle it - hence the need for data engineering innovations.
Data science, on the other hand, is the art of harvesting this big data to predict future behavior, to discover patterns that will help us prioritize or derive productive information, or to extract meaning from this untapped data. Data Science involves principles, processes, and techniques for understanding phenomena through (automated) data analysis.
When we talk about Data-Driven Decision Making (DDDM) we refer to the practice of basing decisions on data analysis, rather than on intuition alone.
Economist Erik Brynjolfsson and his colleagues at MIT and Penn's Wharton School conducted a study of how DDDM affects firm performance (Brynjolfsson, Hitt & Kim, 2011).
A DDDM measure was developed that ranks companies on how well they use data to make decisions. They show that statistically, the more data-driven a company is, the more productive it is - even controlling for a wide range of possible confounding factors. And the differences are not small. One standard deviation more on the DDDM scale is associated with a 4%-6% increase in productivity.
DDDM is also correlated with higher return on assets, return on equity, asset utilization, and market value - the relationship appears to be causal.
One of the key assets of Data Science is data, and the ability to extract useful knowledge from it, as it is considered an important strategic asset.
By having a full view of data as an asset, we are allowed to think explicitly about the extent to which we should invest in it.
Building a high-level data science team is a non-trivial task, but it can make a huge difference to decision making.
What do I need to know to get started in the world of Data Science?
It is challenging to imagine the endless possibilities that data gives us.
Before proceeding, we should briefly review the engineering side of data science. Discussions about data science commonly mention not only analytical skills and techniques for understanding data, but also the most commonly used tools. Data scientist definitions (and ads for the jobs) specify not only areas of expertise, but also specific programming languages and tools.
It is common to see job ads that mention data mining techniques (e.g., random forests, support vector machines), specific application areas ( recommendation systems, ad placement optimization), along with popular software tools for big data processing (Hadoop, MongoDB). There is often little distinction between the science and the technology for dealing with large data sets.
Therefore, the first step is to have knowledge of the languages, or to specialize in a single language.
Next, we should point out that data science, like computer science, is a new field. The specific concerns of data science are quite new and the general principles are just beginning to emerge.
Understanding the process and steps helps structure our data analysis thinking and makes it more systematic and therefore less prone to errors and omissions.
There is compelling evidence that data-driven decision making and big data technologies substantially improve business performance.
Data science supports data-driven decision making - and sometimes automatically drives such decision making - and relies on technologies for storing and engineering "Big Data," but its principles are separate. The principles of data science, too, differ from and are complementary to other important technologies, such as statistical hypothesis testing and database queries (which have their own books and lessons).
What do I need to know to start a Data Science project?
As you already know, Data Science is the computer science that extracts meaningful information from raw data and efficiently communicates it in practical activities.
Data engineering, on the other hand, is a domain of engineering dedicated to creating and maintaining systems that overcome data processing bottlenecks and data handling problems for applications that consume, process, and store data with great volume, variety, and velocity.
Data scientists and business analysts working in business intelligence are like cousins: both use data to work on the same business goal, but their approach, technology, and function differ to measurable degrees.
The purpose of business intelligence is to turn raw data into business information that leaders and managers can use to make decisions derived from the data. Business analysts use business intelligence tools to support business management decision making. If you want to create decision support dashboards, visualizations, or reports from complete medium-sized sets of structured business data, you can use business intelligence tools and methods.
There are some critically important skills to explore for those who wish to venture into this medium, and it is from the following topics that you will get a glimpse of the importance of each for the field.
For those who are entering the Data Science field, one of the big questions is what language to use.
Some experts argue that the performance of the language itself does not matter, but rather the resourcefulness with which it is used. Others say that there are some key languages that need to be used in order for the work to flow smoothly.
There is no right or wrong. The fact is that each language has its advantages and disadvantages. The important thing is to know them minimally and understand which one is the most appropriate for your business model and segment.
Currently the most used languages are
Data mining is the process of relating data and finding anomalies, patterns, and correlations in the vast infinity that is big data.
Through the use of some techniques, it is possible to obtain valuable insights that contribute in increasing revenue, reducing costs, reducing risks, and helps in the creation of hyper-personalized and hyper-segmented experiences for customers.
The main techniques used in data mining are:
Variety of DataThe high velocity and immense volume of data that is incorporated into big data on a daily basis causes Data Variety. Therefore, in Data Science and data engineering, you generally work with three varieties of data.
Structured Data: This is data stored, processed and manipulated in rows and columns, in table format, and is found in relational databases (RDBMS).
Unstructured data: Does not fit into any structured database format. This type of data is usually generated from human activity.
Semi-structured data: This type of data does not fit into any structured database, but can be structured by tags useful for creating segmentation, order, and hierarchy.
Now that you know the knowledge required for the development of a Data Science project, let's understand the step-by-step for its implementation.
Step by step to implement a project
Understand the problem and possible solutions
Understanding the problem you seek to solve is a basic principle of a quality project. This definition helps delimit and direct the focus of your data science team.
With a well-defined problem one can outline possible solutions and seek reliable data sources that will help structure a quality analysis, guiding assertive decision making.
Each problem must be defined, understood, and analyzed in isolation.
Data collection and processing
This is one of the most laborious and delicate steps. Text, tables, audio, photos, videos, documents or comments, the data can come from a variety of sources, formats and structures.
At this stage it is important that the data owner and the data science team work together.
The data owner is responsible for providing his critical view of the data based on his knowledge and experience, guiding which data is reliable and can be used. The data science team, on the other hand, takes this information and applies big data, sampling, and data mining techniques to get the proper treatment to send it out for analysis.
It is important to remember that the quality of the data depends on its origin and the treatments applied.
Data analysis and interpretation
With reliable data in hand, the analysis and interpretation stage begins.
In this part of the project, machine learning or statistical algorithms are used to detect patterns, trends, and generate predictive models that will help gain valuable insights to guide decision making.
Decision Making and Implementation
This is the final stage of the project, and it can be put into practice in two ways. The first is by making decisions based on the learnings generated throughout the project, and the second is by implementing models that act on the final data.
The goal of this step is to ensure that the project is useful and that the insights generated contribute to the solution of the problem defined.
All in all...
Data Science is a completely new and constantly evolving area. Without a doubt, it is a field that can contribute to the growth of several areas of your company.
These were the basic knowledge and steps that, from our years of experience with projects in the area, we can share with you.
However, always remember: there is no one right recipe. Each project is unique and depends on your objectives, area of activity, and segment.