High-Level View of Data Science
What is Data Science?
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It is closely related to data mining and big data, and shares their aim to "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems".
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.[4] It employs techniques and theories drawn from many fields, including mathematics, statistics, computer science, and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.
Some Scope of Data Science
- Self-driving cars
- Delayed or Cancelled flights information
- Route Planning
- Predictive Analytics
- Promotional Offers
- Deciding which class of planes to purchase for better performance
- FedEx uses data science models for operational efficiency
- Better Decision Making, Predictive Analysis, Pattern Discovery
Buying new furniture for the office
- Which website to use
- Check Rating of website
- Check for Discount
- Check whether the furniture is appropriate
Taking a cab
Netflix
- What kind of shows people are interested in
- Makes it easier to decide on appropriate advertising
Politics
- Elections
- Influence the voters
- Personalized messages
Process / Steps in Data Science
- Asking the right questions and exploring the data
- Modeling the data using various algorithms (basically for Machine Learning)
- Finally communicating and visualizing the results
Data Science vs Business Intelligence
| Criterion | Business Intelligence | Data Science |
| --- | --- | --- |
| Data Source | Structured data, e.g. data warehouse | Unstructured data, e.g. web logs |
| Method | Analytical | Scientific |
| Skills | Statistics, Visualization | Statistics, Visualization, Machine Learning |
| Focus | Past and present data | Present data and future predictions |
Prerequisites for Data Science
- Curiosity
  - Only by asking questions will you gain a better understanding of the business problem
- Common Sense
  - To identify new ways to solve a business problem and to detect priority problems
- Communication Skills
  - To communicate findings to business teams so that they can act on the insights
- Machine Learning
  - Machine learning is the backbone of data science. It is one of the many methods data science uses to find a solution to a problem.
- Mathematical Modelling
  - Mathematical models can be extremely helpful for making fast calculations and predictions from what you know about your data
- Statistics
  - It is foundational to data science, used to extract knowledge and obtain better results from the data
- Programming
  - You should know at least one programming language, preferably Python or R, for data modelling
- Databases
  - The discipline of querying databases teaches you to ask better questions as a data scientist
Tools / Skills used in Data Science
Data Analysis
Skills: R, Python, Statistics
Tools: SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner
Data Warehousing
Skills: ETL, SQL, Hadoop, Apache Spark
Tools: Informatica / Talend, AWS Redshift
Data Visualization
Skills: R, Python libraries
Tools: Jupyter, Tableau, Cognos, RAW
Machine Learning
Skills: Algebra, ML Algorithms, Statistics
Tools: Spark MLlib, Mahout, Azure ML Studio
What does a Data Scientist do?
- The data scientist is given a problem
- Gathers the raw data needed to solve the problem
- Processes, analyzes and prepares the data into a format that can be fed into an analytics system, be it an ML algorithm or a statistical model
- Gets meaningful results as output
- Communicate insights to others
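The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flight-delay records, the field names `route` and `delay_min`, and the helper functions are all hypothetical, and the "model" is deliberately the simplest statistical one (a mean per group).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records, as they might arrive from a log or CSV export.
RAW_RECORDS = [
    {"route": "A-B", "delay_min": "12"},
    {"route": "A-B", "delay_min": "18"},
    {"route": "A-C", "delay_min": ""},      # missing value
    {"route": "A-C", "delay_min": "45"},
    {"route": "A-C", "delay_min": "55"},
    {"route": "B-C", "delay_min": "n/a"},   # unparseable value
    {"route": "B-C", "delay_min": "5"},
]


def prepare(records):
    """Keep only records whose delay parses as a number; convert it to float."""
    cleaned = []
    for rec in records:
        try:
            delay = float(rec["delay_min"])
        except ValueError:
            continue  # skip rows the model cannot use
        cleaned.append({"route": rec["route"], "delay_min": delay})
    return cleaned


def average_delay_by_route(records):
    """A very simple statistical model: the mean delay per route."""
    by_route = defaultdict(list)
    for rec in records:
        by_route[rec["route"]].append(rec["delay_min"])
    return {route: mean(delays) for route, delays in by_route.items()}


def summarize(averages):
    """Turn the model output into a sentence a business team can act on."""
    worst = max(averages, key=averages.get)
    return f"Route {worst} has the highest average delay ({averages[worst]:.1f} min)."


cleaned = prepare(RAW_RECORDS)               # process and prepare the raw data
averages = average_delay_by_route(cleaned)   # feed it into the model
print(summarize(averages))                   # communicate the insight
```

Each function corresponds to one bullet: gathering is represented by the hard-coded list, `prepare` handles processing, `average_delay_by_route` plays the role of the analytics system, and `summarize` covers communication.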
Must-Know Machine Learning Algorithms
- Regression (continuous data)
- Clustering (unsupervised learning technique)
- Decision Tree (classification)
- Support Vector Machine
- Naive Bayes
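To make the first two entries concrete, here is a minimal sketch of regression on continuous data (ordinary least squares for a straight line) and of clustering as an unsupervised technique (k-means on one-dimensional values). The function names and sample numbers are made up for illustration; real projects would normally rely on a library rather than hand-written versions like these.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Regression: labelled, continuous target values.
slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(slope, intercept)

# Clustering: no labels, the groups are discovered from the data alone.
centers, groups = kmeans_1d([1.0, 1.5, 2.0, 9.0, 10.0, 11.0], centers=[0.0, 5.0])
print(centers, groups)
```

The contrast to notice is that `fit_line` needs known `ys` to learn from, while `kmeans_1d` receives only the values and an initial guess for the centers.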
Life Cycle of a Data Science Project
Concept Study
- Understanding the problem statement, thorough study of the business model
- Involves
  - Understanding the business problem
  - Asking questions
  - Getting a good understanding of the business model
  - Meeting with all stakeholders
  - Understanding what kind of data is available
Data Preparation
- Also known as Data Gathering, Data Munging, Data Manipulation
- Formatting and structuring of data in an appropriate way
- Filling gaps in the data, such as missing values, null values and improper data types
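As a rough illustration of gap filling, the sketch below converts text fields to numbers and replaces anything missing, null or non-numeric with the column median. The sample rows, the column names and the choice of median imputation are assumptions made for this example; other strategies (dropping rows, using the mean, domain-specific defaults) may suit a given data set better.

```python
from statistics import median

# Hypothetical raw rows showing the kinds of gaps named above.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": None, "income": "61000"},     # null value
    {"age": "29", "income": ""},          # missing value
    {"age": "forty", "income": "48000"},  # improper datatype
    {"age": "41", "income": "75000"},
]


def to_number(value):
    """Return value as a float, or None when it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def fill_gaps(rows, columns):
    """Convert each column to floats and fill gaps with the column median."""
    parsed = [{col: to_number(row.get(col)) for col in columns} for row in rows]
    for col in columns:
        known = [row[col] for row in parsed if row[col] is not None]
        fill = median(known)  # raises StatisticsError if a column has no usable value
        for row in parsed:
            if row[col] is None:
                row[col] = fill
    return parsed


prepared = fill_gaps(raw_rows, ["age", "income"])
for row in prepared:
    print(row)
```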
[Figure: Elements of Data Preparation]
Model Planning
- Could be statistical models, ML models
- Involves Exploratory Data Analysis (EDA) to understand the relationships between variables, to see what the data can tell us, and to check whether the data is appropriate
- Key variables are selected
- Training and test data sets are created
- Tools Used
Exploratory Data Analysis
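The planning steps above (checking relationships between variables, selecting key variables, creating training and test data) can be sketched as follows. The data, the column names and the 0.5 correlation cut-off are hypothetical, and Pearson correlation is just one of several ways to judge a relationship between variables.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set: two candidate features and one target.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "shoe_size": [42, 38, 44, 40, 39, 43, 41, 40],
}
target = [52, 55, 61, 64, 70, 74, 79, 85]

# EDA: how strongly does each feature move with the target?
correlations = {name: pearson(col, target) for name, col in features.items()}

# Key variables: keep features whose absolute correlation passes the cut-off.
key_variables = [name for name, r in correlations.items() if abs(r) >= 0.5]
print(key_variables)

# Training and test data built from the selected variable.
rows = list(zip(features["hours_studied"], target))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the `seed` keeps the split reproducible, which makes later comparisons between candidate models fair.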
Model Building
- Using various analytical tools and techniques, data is transformed with the goal of discovering useful information to build the right model
- Using the test data set, the built model is validated for accuracy
- Decide which Algorithm is to be used
- Tools Used
- Python packages like pandas, matplotlib, numpy
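A minimal sketch of building a model on training data and validating it on test data is shown below. It uses a one-threshold classifier (a decision stump, the smallest possible decision tree) written with the standard library only so that it stays self-contained; the data points are invented, and in practice the packages listed above or a dedicated ML library would do this work.

```python
def predict(threshold, x):
    """Predict class 1 when x is above the threshold, otherwise class 0."""
    return 1 if x > threshold else 0


def accuracy(threshold, rows):
    """Fraction of (x, label) rows that the threshold classifies correctly."""
    correct = sum(1 for x, label in rows if predict(threshold, x) == label)
    return correct / len(rows)


def fit_stump(train):
    """Pick the threshold that classifies the training rows best.

    Needs at least two distinct x values in the training rows.
    """
    xs = sorted({x for x, _ in train})
    # Candidate thresholds: midpoints between consecutive distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, train))


# Hypothetical labelled data, already split into training and test sets.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1)]
test = [(2.5, 0), (5, 1), (3.8, 0), (8, 1)]

threshold = fit_stump(train)              # build the model on training data
test_accuracy = accuracy(threshold, test)  # validate it on unseen test data
print(threshold, test_accuracy)
```

The model fits the training rows perfectly but misclassifies one test row, which is exactly the kind of gap that validation on held-out data is meant to reveal.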
Communication
- Key findings are identified and conveyed to the stakeholders and team
- Explain findings and all insights
- Get recommendations and feedback
Operationalize
- Put findings into operation
- Final reports are prepared
- Code and technical documentation are delivered
Industries with High Demand for Data Scientists
- Gaming
- Healthcare
- Used especially for diagnosing and predicting disease
- Finance
- Banks, insurance companies
- Marketing
- Finding the appropriate market
- Technology