Data science is the study of where information comes from, what it represents, and how it can be turned into a valuable resource in the creation of business and IT strategies. Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiency, recognize new market opportunities, and strengthen its competitive advantage. Data science draws on tools from multiple disciplines to gather a data set, process it, extract meaningful information from it, and interpret that information for decision-making purposes. The disciplines that make up the field include data mining, statistics, machine learning, analytics, and some programming.
Data mining is the process by which companies extract useful information from raw data, which may be structured, semi-structured, or unstructured. Software tools search very large data sets for patterns that help a business learn about its customers and develop effective marketing strategies. The term was most widely used in the late 1990s and early 2000s, when businesses consolidated all of their data into an enterprise data warehouse and brought it together to discover previously unknown trends, anomalies, and correlations.
Data analysis is the process of inspecting, cleaning, and transforming data to extract the useful information that is required, using analytical and logical reasoning. There are many ways to analyse data, including data mining, text analytics, and business intelligence. Analysis is partly a heuristic activity, in which the analyst gains insight by scanning through the data, and partly a matter of applying a mechanical or algorithmic process to derive insights, for example running through various data sets looking for meaningful correlations between them.
Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines. It is the study of where information comes from, what it represents, and how it can be turned into a valuable resource in the creation of business and IT strategies.
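As a small illustration of the inspect, clean, and correlate workflow described above, the sketch below drops incomplete records and then computes a Pearson correlation between two columns. The data, the column meanings (ad spend and units sold), and the function name `pearson` are hypothetical and chosen only for illustration; real projects would more likely use a library such as pandas.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical raw records: (ad spend, units sold); None marks a missing value.
raw = [(10, 12), (20, 25), (30, None), (40, 48), (50, 55), (60, 70)]

# Cleaning step: keep only the complete records.
clean = [(a, b) for a, b in raw if a is not None and b is not None]

# Analysis step: how strongly do the two columns move together?
ad_spend = [a for a, _ in clean]
units_sold = [b for _, b in clean]
r = pearson(ad_spend, units_sold)
print(f"kept {len(clean)} of {len(raw)} records, correlation = {r:.3f}")
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth investigating further. It does not by itself show that one variable causes the other.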
Some of the key skills a data scientist should possess are described below.
Statistical Skills: A basic statistical skill set is required, for example the ability to summarize data, create statistical graphs, and perform basic calculations. Statistics is how you learn the basic characteristics of a data set.
Computer Skills: Data is complex, and with the rise of big data, a working knowledge of tools such as Python, R, SAS, and Hadoop, or at least a few of them, is necessary.
Problem-Solving Skills: Problem solving is essential in almost any job, but it matters especially in data analysis because the same data can be analyzed in many different ways. To solve the problem at hand, or to anticipate future problems and their solutions, a data scientist needs to adopt a holistic approach: define the problem accurately, suggest solutions, and provide factual evidence in the form of data to support them. These skills are built on knowledge of data mining, machine learning, text analytics, deep learning, and similar approaches.
Target Industry Knowledge: It is important not only to know how to explain your data to different people in your company, but also to understand your client's industry, so that you can analyze and present data effectively and apply your problem-solving skills in that industry.
Communication Skills: A project manager may view data differently from a CEO. The project manager might focus on the data analysis for a single project, whereas the CEO will be looking at how that project's data could affect the company's other projects. A data scientist therefore needs strong analytical, communication, and presentation skills to present data accurately to different parts of the same organization, and to external partners and clients.
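The ability to summarize data, listed under statistical skills above, can be sketched with Python's standard `statistics` module. The helper name `summarize` and the order values are hypothetical, chosen only to show how a single outlier affects the mean and the median differently.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers.

    Needs at least two values, because the sample standard deviation
    is undefined for a single observation.
    """
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }

# Hypothetical order values; 400 is a deliberate outlier.
order_values = [120, 95, 130, 110, 400, 105, 115]
summary = summarize(order_values)

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this example the outlier pulls the mean well above the median, which is the kind of basic characteristic of the data that a summary is meant to reveal before any modeling starts.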
Step 1: Learn the basics of Python. Python is an easy language to start with, so as a novice you should first understand the fundamentals of the language.
Step 2: Basic statistics and mathematics. Learn statistics with a heavy focus on coding up examples, preferably in Python or R.
Step 3: Python for data analysis. Once you have completed Steps 1 and 2, get hands-on experience with real data analysis programming. Learn to install Anaconda, Jupyter Notebook, and Python packages such as NumPy and pandas.
Step 4: Machine learning. It is commonly classified into two categories: (i) supervised learning (regression, classification, support vector machines, kernels, neural networks) and (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems). Install the scikit-learn library to practice machine learning in Jupyter Notebook.
Step 5: Learn related skills such as NLP, deep learning, big data technologies, and data visualization. Use Python libraries such as NLTK, Keras, and TensorFlow to learn how they are implemented.
Step 6: Practice. Get exposure to data through hands-on projects, assignments, and internships. Take part in as many data analysis competitions and data hackathons as you can, since they expose you to data and real-world problems.
This is only a rough pathway; you can change the sequence to suit your needs.
According to the HackerRank Developer Skills Survey 2018, job openings for data professionals in the USA alone will increase by 364,000, to 2,720,000, by 2020, a figure attributed to IBM. That is only the picture from job openings. The future scope of data science is high, and it is here to stay for a while. In addition, data scientist topped the list of ‘Best Jobs in the USA’ for three consecutive years in the annual survey conducted by Glassdoor, an online job-hunting portal. Three of the five highest-paying professions are related to data science, so if salary is your only concern, data science is the right path for you. In India, salaries range from 0-3 lakhs to more than 1 crore, depending on your skills and experience.
The goal of statistical analysis is to summarize the data. Statistical methods make tight assumptions about the problem and the data distributions, and generalization of conclusions is pursued using statistical tests on the training dataset. Statistics promotes reducing the data as much as possible before modeling (sampling, fewer inputs), which often makes it well suited to small data sets.
The goal of data science is to learn from data of all kinds. Data science techniques generally do not make rigid prior assumptions about the problem or the data distributions, and generalization of conclusions is pursued empirically through training, validation, and test datasets. Redundancy in features (variables) is acceptable and often helpful, and it is preferable to use algorithms designed to handle a large number of features. Data science does not promote data reduction prior to learning; it promotes a culture of abundance: “the more data, the better”. Data science techniques are capable of solving complex problems.
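The empirical approach to generalization mentioned above relies on splitting the data into training, validation, and test sets. The following is a minimal sketch of such a split; the function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions made for illustration, not a prescribed method.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)                   # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical data set: 100 row identifiers.
rows = list(range(100))
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

The model is fitted on the training set, tuned against the validation set, and evaluated once on the test set, so the final score reflects data the model has never seen.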
In a nutshell, Python is better for data manipulation and repeated tasks, while R is good for ad hoc analysis and exploring datasets. R has a steep learning curve, and people without programming experience may find it overwhelming; Python is generally considered easier to pick up. The IEEE Spectrum ranking is a metric that quantifies the popularity of programming languages. In 2017 Python took first place, up from third the year before, while R was in 6th place. Python's ease of learning, strong support for analytics through packages, and adaptability make it the most used language for analysis in data science.
A recommender system is a subclass of information filtering system that seeks to predict the preference or rating that a user would give to a product. Recommender systems are widely used for movies, news, research articles, products, social tags, music, and more.
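One common way to predict a rating is user-based collaborative filtering: find users with similar tastes and average their ratings, weighted by similarity. The sketch below is a deliberately simple version of that idea. The users, movie titles, ratings, and function names are all hypothetical, and production systems typically add refinements such as mean-centering that are omitted here.

```python
import math

# Hypothetical ratings on a 1-5 scale; a missing key means "not rated yet".
ratings = {
    "ana":   {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben":   {"Alien": 4, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "chloe": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)

def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with a nonzero similarity has rated it.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        weight_total += similarity
    if weight_total == 0:
        return None
    return weighted_sum / weight_total

predicted = predict_rating("ana", "Dune", ratings)
print(f"predicted rating for ana on Dune: {predicted:.2f}")
```

Because ana's ratings resemble ben's far more than chloe's, the prediction leans toward ben's high rating for the unseen movie.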
If an algorithm learns from training data that includes a response variable, so that the knowledge can be applied to the test data, it is referred to as supervised learning. Classification is an example of supervised learning. If there is no response variable, so the algorithm has nothing to learn beforehand and must find structure in the data on its own, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.
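The contrast can be sketched on the same small data set: a nearest-centroid classifier that learns from labels (supervised) and a basic k-means clustering that never sees them (unsupervised). The points, labels, and function names are hypothetical, and the k-means initialization is deliberately naive; libraries such as scikit-learn provide more robust versions of both techniques.

```python
import math

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Supervised: the labels (response variable) are used during training.
def fit_nearest_centroid(points, labels):
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {label: centroid(members) for label, members in grouped.items()}

def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda label: math.dist(point, model[label]))

# Unsupervised: only the points are given; groups are discovered.
def k_means(points, k, iterations=10):
    centers = list(points[:k])  # naive initialization: the first k points
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = centroid(members)
    return assignment, centers

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.2), (5.0, 5.2), (0.8, 1.0), (5.3, 4.9), (1.1, 0.9), (4.8, 5.1)]
labels = ["small", "large", "small", "large", "small", "large"]

model = fit_nearest_centroid(points, labels)
predicted_label = predict(model, (4.9, 5.0))

assignment, centers = k_means(points, k=2)
print(predicted_label, assignment)
```

The classifier returns a named label because it was trained with labels, whereas k-means returns only anonymous cluster numbers whose meaning has to be interpreted afterwards.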
Deep learning is a branch of machine learning that has shown incredible promise in recent years. Part of its appeal is its strong analogy with the functioning of the human brain, which is widely regarded as the most versatile and efficient self-learning system known.