International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 2, March – April 2014 ISSN 2278-6856
A Comparative Study of Data Analysis Techniques

Prateek Bihani and S. T. Patil
Pune University, Vishwakarma Institute of Technology, Bibewadi, Pune 411037, India
Abstract
Data analysis provides the critical link between good decision making and success. Data analysis is used mainly for prediction and identification, guided by the rules of evidence: falsifiability, validity and parsimony. Existing data analysis tools are collections of data analysis methodologies that require experts as users to choose the correct methodology. However, many business users want to apply data analysis to business data to understand trends, make predictions and improve their business decisions. It is therefore important to know which data analysis technique should be applied when, and to which kind of data. This paper performs a comparative study of data analysis techniques and highlights the advantages, disadvantages and applications of each technique. It also describes some of the key mistakes to guard against when applying data analysis techniques.
1. INTRODUCTION
Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names in different business, science, and social science domains. Data mining is a particular data analysis technique that focuses on modelling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.
The available data analysis tools are mostly collections of data analysis methods that require experts as users. The users need domain knowledge and also need to know which data analysis methods should be applied to a given problem and which technique meets the requirements of the solution. The expert should also know how the data has to be prepared for the chosen technique and, finally, how the technique needs to be configured. Business users require a much more user- or problem-oriented approach to data analysis. Rather than knowing analysis methods, they are experts in the data domain and they know what they want to achieve with data analysis, if only they knew how. They might know, for example, that they want to classify insurance claims as fraudulent or non-fraudulent given historic information about the customer and the current case. They might want to understand how the analysis method actually classifies customers (e.g. with a rule set), they might require a certain classification accuracy, or they might require that the algorithm be simple enough to be implemented as an SQL query. Ideally, such users would simply like to feed all these high-level requirements and the data into a tool that would then automatically find the best algorithm in terms of the requirements, configure it, run it and create a software module that can be plugged into the business application. [1]
In this paper, we focus on a way to select the most appropriate data analysis algorithm given a problem definition, a set of requirements and a data file.
2. NEED OF DATA ANALYSIS MODELS
As companies adopt analytics as the new science of winning, organizations will need to focus both on the creation and the consumption of insights to enable better decisions. There is a need for well-designed data analysis models because analytics efforts commonly fail for the following reasons:
a. The business problem is not clear: In a rush to jump on the analytics bandwagon, business practitioners often forget that the business problem needs to be well-defined for the analytics solution to be relevant to the problem at hand.
b. Appropriate stakeholder(s) are not involved: If a firm is using analytics to design a promotion campaign for a certain product, the demand planning teams need to know what's changing in order to get the product on the shelves. Like any project team, the right stakeholders need to be involved at the right time. This is especially true when multiple functional groups are involved in a specific business problem.
c. Mystery math: With the explosion in data and the availability of technologies that bring applied math to
the analytics workbench, analytics practitioners begin to regard the technical analysis as an end in itself. Mathematical techniques are merely tools for solving the business problem at hand.
d. The right expectations are not set: Sophisticated mathematical techniques are often expected to act as magic wands, solving any and every problem at hand. More often than not, this creates unreasonable expectations. As the key sponsor of a failed forecasting project famously asked, "Why should there be any error in the forecast if you have used sophisticated mathematical techniques?" This was clearly a case of a mismatch in expectations: it was never communicated to the executive that no mathematical technique, however sophisticated, can accurately predict the future.
e. Lack of continuity: As basic a management principle as it may sound, the best analytics ideas tend to lose advantage and diminish in value for a variety of reasons, ranging from internal organizational changes to getting lost in the shuffle of organizational initiatives.
f. Losing relevance: Analytics needs to be extremely agile to keep up with changing business priorities. Quite often, the quest for the perfect mathematical technique delays the solution to the extent that it is rendered irrelevant. For example, a launch pricing analysis is irrelevant after the product launch has already happened.
g. Bridging the chasm: It is becoming increasingly apparent that investing in the creation of analytics alone does not guarantee effective consumption of analytics by the business. To truly leverage analytics as a competitive differentiator, companies need to ensure that the consumption cycle is tightly integrated with the creation of analytics.
The creation of insights requires a holistic perspective across Descriptive, Inquisitive, Predictive and Prescriptive Analytics:
a. Descriptive analytics answers the question "What happened in the business?" It looks at data and information to describe the current business situation in a way that trends, patterns and exceptions become apparent.
b. Inquisitive analytics answers the question "Why is something happening in the business?" It is the study of data to validate or reject business hypotheses.
c. Predictive analytics answers the question "What is likely to happen in the future?" It uses data modeling to determine future possibilities.
d. Prescriptive analytics is the combination of the above, providing answers to the "so what?" and "now what?" questions. For example, what should I do to retain my key customers? How do I improve my supply chain to enhance service levels while reducing costs?
The type of analysis problem restricts the list of data analysis techniques applicable to that problem. By the term analysis problem we mean whether it is a classification problem, a function approximation problem such as time series prediction, a clustering problem, a problem of finding dependencies or associations, and so on. The second category of requirements concerns preferences regarding the solution. These comprise properties such as the accuracy and simplicity of the solution, whether the method is adaptable to new data, whether it offers an explanation facility such as rule-based systems or functional models like linear regression, and how simple the explanation should be. Finally, the data might constrain the applicability of methods. The number of data records, for example, might be too small for some statistical methods, or, more generally, some methods might cope better with certain types of data than others. Depending on the type of user, the level at which the requirements of the problem are defined will vary considerably: some users may understand the difference between function approximation and classification, while others may not. Thus, there is a need for hierarchical approaches where requirements are iteratively mapped onto lower-level requirements until the lowest level is reached.
Figure 1 Types of Data Analysis
To choose among analysis methods, these requirements have to be mapped onto the properties of the methods, and the various stages of data analysis described next have to be followed.
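The mapping from high-level requirements to candidate techniques can be illustrated with a minimal Python sketch. This is purely illustrative and is not the selection tool described in [1]: the technique catalogue, the "needs explanation" flag and the record-count threshold are invented assumptions.

```python
# Hypothetical sketch of mapping requirements to candidate techniques.
# The catalogue, flags and thresholds below are invented for illustration.

def candidate_techniques(problem_type, needs_explanation=False, n_records=None):
    """Return plausible analysis techniques for a rough requirement profile."""
    catalogue = {
        "classification": ["discriminant analysis", "logistic regression", "rule induction"],
        "function_approximation": ["linear regression", "regression trees"],
        "clustering": ["k-means", "hierarchical clustering"],
        "dependency_analysis": ["factor analysis", "structural equation modeling"],
    }
    candidates = catalogue.get(problem_type, [])
    if needs_explanation:
        # Keep only methods with an explicit, readable model form (illustrative rule).
        explainable = {"discriminant analysis", "logistic regression", "rule induction",
                       "linear regression", "regression trees", "factor analysis",
                       "structural equation modeling"}
        candidates = [m for m in candidates if m in explainable]
    if n_records is not None and n_records < 50:
        # Very small samples rule out data-hungry methods (illustrative threshold).
        candidates = [m for m in candidates if m != "structural equation modeling"]
    return candidates

print(candidate_techniques("classification", needs_explanation=True, n_records=200))
```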
3. STAGES OF DATA ANALYSIS
Typically, there are five stages of data analysis, viz. Narrative, Coding, Interpretation, Confirmation, and Presentation.
a. Narrative: Review research questions; write some history; describe a social process; create summaries of interviews; describe functions/structures of a group; write up critical events chronologically; make a list of important facts. [2] Connect to your own experience. [3] Read written descriptions. [4] Relate the participant's story to your own experience; locate self in the story as related to participant(s); look at how participants speak about themselves and their world. [5] Make metaphors; note reflections on collected data. [6]
b. Coding: Create vignettes; create a conceptual framework. [2] Identify data patterns; extend analysis
by asking questions derived from the data. [3] Develop meaning from the statements; organize meanings into clusters of themes. [4] Break down text transcripts into overlapping themes and sub-themes; organize data in different ways to tap into different dimensions of the data sets. [5] Note patterns and themes; cluster; partition variables; subsume particulars into the general; factor; note relations between variables; find intervening variables; follow up surprises; develop codes and apply them to textual data; identify patterns, themes and relationships between themes; conduct an investigation of common/different aspects; categorize and sort data; order and reorder data by chronology, importance, frequency. [6]
c. Interpretation: Develop a metaphor; look at the theoretical framework; review relevant theories; engage in speculation; look for relevance to program/policy; evaluate the project. [2] Relate to theory; refocus on the basis of your tradition/discipline; evaluate against a standard or against participants' interpretation; position results in a broader analytic framework; make inferences using inductive reasoning; flesh out the analytical framework. [3] Extract significant statements related to the phenomenon under study. [4] Shift focus from individual cases to groups. [5] Look for plausibility; build a logical chain of evidence; make conceptual/theoretical coherence; weight evidence; check the meaning of outliers; use extreme cases; make if-then tests; develop interpretation of findings; contrast data to determine what fits your assumptions or others' findings; develop hunches; re-state the question to fit the data. [6]
d. Confirmation: Contrast insider views with outsider views. [2] Critique the research process; report systematic fieldwork procedures; propose a redesign of the study; stop when you come to the end, asking what needs to be done next; compare to a known case; analyze the interpretive process. [3] Use member checks to validate the written description; use numbers to document, verify, and test interpretations. [4] Look at one's assumptions. [5] Triangulate; count; make contrasts and comparisons; check for representativeness; check for researcher effects; look for negative evidence; replicate a finding; check out rival explanations; get feedback from participants; verify interpretations by member checks, peer review, triangulation; constantly compare earlier data with later data using different bases for comparison. [6]
e. Presentation: Consider the audience; draw a visual display; write in narrative form, borrowing form from participants. [2] Emphasize important data; take suggestions from editors/committee/colleagues; display findings graphically; explore alternative formats for presentation. [3] Use data analysis results to write an exhaustive description. [4] Write up results in case study form. [5] Use visual displays. [6]
Figure 2 Levels of Data Analysis
To solve a data analysis problem, one should follow each stage or level. It is important to understand what output is required from the data analysis; based on the desired output, one can decide up to which stage or level one has to go.
4. TECHNIQUES OF DATA ANALYSIS
A. Conjoint Analysis / Choice Modeling:
Definition - Allows consumers' preferences for a product or service to be broken down into trade-offs among its individual attributes for the context in which overall judgments are made. Conjoint analysis, a popular multi-attribute preference assessment technique used in market research, is a well-suited tool to evaluate a multitude of gamut mapping algorithms simultaneously. [7] The objective of conjoint analysis is to determine what combination of a limited number of attributes is most preferred by consumers. Conjoint analysis is a multi-attribute compositional model.
Application - Optimizing product configurations; studying price elasticities of demand; simulating market response to new or modified offerings; diagnosing competitive strengths and weaknesses.
Pros - Of all survey research techniques, this most closely replicates the real-world purchase process. It is flexible, as it can run "what if" scenarios, including scenarios not explicitly tested. Great for new product development and pricing. Conjoint analysis helps determine the optimal features of a product or service.
Cons - Models preference share rather than market share. There are limits to the number of features that can be included in a study.
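As a rough illustration of how part-worth utilities can be estimated, the sketch below fits a dummy-coded linear model to synthetic average ratings of four product profiles. The profiles, attribute levels and ratings are invented; real conjoint studies use designed choice tasks and respondent-level data, so this is only a minimal sketch of the idea.

```python
# Minimal sketch of conjoint-style part-worth estimation on synthetic ratings.
import numpy as np

# Each profile: [brand_B, price_high] dummy codes (brand A / low price are the baseline).
profiles = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])
ratings = np.array([8.0, 5.5, 7.0, 4.0])  # invented average preference ratings

# Fit ratings = intercept + part-worths * dummies with ordinary least squares.
X = np.column_stack([np.ones(len(profiles)), profiles])
coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)
intercept, pw_brand_b, pw_price_high = coefs
print(f"baseline utility: {intercept:.2f}")
print(f"part-worth of brand B vs A: {pw_brand_b:.2f}")
print(f"part-worth of high vs low price: {pw_price_high:.2f}")
```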
B. Factor Analysis:
Definition - Identifies a set of underlying dimensions (factors) within a set of variables, revealing unobserved structure in the data. Factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone [8]). Factor analysis is an exploratory multivariate statistical method. It is used to summarize the information contained in a
large set of variables in terms of a smaller set of composite variables, called FACTORS.
Figure 3 Percentage of Variation explained by the factors
Application - Reducing the number of variables for analysis; identifying conceptual or benefit dimensions underlying expressed product perceptions and preferences.
Pros - Simplifies large or complex sets of variables/attributes. Can be used to understand how the customer thinks. Commonly used on subjective measures such as attitudes, beliefs, and product attribute ratings.
Cons - Subjective interpretation of the results is a component. It is often a companion to other analyses such as segmentation, rather than an end in itself.
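A minimal sketch of data reduction with factor analysis, assuming scikit-learn is available: five synthetic attribute ratings driven by two hidden dimensions are summarized into two factors. The data, loadings and factor count are invented for illustration.

```python
# Illustrative sketch: reducing correlated attribute ratings to a few factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))                    # two hidden "benefit" dimensions
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.85], [0.4, 0.4]])
X = latent @ loadings.T + 0.3 * rng.normal(size=(n, 5))   # five observed ratings

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)                        # factor scores per respondent
print("estimated loadings (variables x factors):")
print(np.round(fa.components_.T, 2))
```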
C. Discriminant Analysis:
Definition - Examines how two or more groups (generally of respondents) differ from one another on the basis of a number of predictor variables. Discriminant analysis works by combining variables in such a way that the differences between predefined groups are maximized. Linear discriminant analysis (LDA) and biased discriminant analysis (BDA) are two effective techniques for dimension reduction, which pay attention to the different roles of the positive and negative samples in finding a discriminating subspace. [9]
Figure 4 Discriminant Analysis Plot
Application - Understanding and modeling differences between/among groups (e.g., buyers vs. non-buyers of different brands); predicting market behavior based on demographic and psychographic variables.
Pros - Can be thought of as regression for categorical dependent variables. Can include variables of differing scales. Prediction is a powerful tool for finding segments in databases for sales and direct marketing efforts.
Cons - Without careful implementation, models will not perform as well on new data as they do on the initial data. LDA has limited efficiency in classifying sample data from subclasses with different distributions, and BDA does not account for the underlying distribution of negative samples. [9]
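A small sketch of discriminant analysis using scikit-learn's LinearDiscriminantAnalysis on synthetic two-group data (e.g. buyers vs. non-buyers). The predictors and group means are invented; the holdout split echoes the caution above about performance on new data.

```python
# Hedged sketch: separating two groups with linear discriminant analysis.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
X = np.vstack([rng.normal([30, 40], [5, 10], size=(n, 2)),   # non-buyers: age, income
               rng.normal([42, 65], [5, 10], size=(n, 2))])  # buyers
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print("holdout accuracy:", round(lda.score(X_te, y_te), 3))
print("discriminant weights:", np.round(lda.coef_, 3))
```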
D. Cluster Analysis:
Definition - Cluster analysis is an exploratory data analysis tool for solving classification problems. Its objective is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. A cluster is a group of relatively homogeneous cases or observations. Each cluster thus describes, in terms of the data collected, the class to which its members belong, and this description may be abstracted from the particular cases to the general class or type. Cluster analysis uses any of several techniques (viz. nearest neighbors, K-means, etc.) to classify people, objects, or variables into more homogeneous groups.
Application - Identifying/describing market segments; developing typological findings and describing target markets.
Pros - Allows a deeper understanding of the market. Can greatly aid messaging and new product development by targeting homogeneous groups.
Cons - Subjective interpretation of the results is a component. The technique is mathematical and therefore has no underlying model against which to test statistical hypotheses. K-means is a fast cluster analysis method, but its accuracy depends on initialization algorithms that are usually serial and slow. [10]
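The clustering idea can be sketched with scikit-learn's KMeans on invented customer data; the default "k-means++" initialization addresses the initialization sensitivity noted above. The features, segment centres and cluster count are assumptions made for illustration only.

```python
# Minimal sketch of K-means segmentation on synthetic customer data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic segments described by (annual spend, visits per month).
segment_a = rng.normal([200, 2], [30, 0.5], size=(100, 2))
segment_b = rng.normal([900, 8], [80, 1.5], size=(100, 2))
X = np.vstack([segment_a, segment_b])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=2).fit(X)
print("cluster centres (spend, visits):")
print(np.round(km.cluster_centers_, 1))
print("segment sizes:", np.bincount(km.labels_))
```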
E. Structural Equation Modeling (SEM):
Definition - Also called causal modeling, it hypothesizes causal relationships among variables and tests the causal models with a linear equation system. It allows the inclusion of latent variables (intangible concepts such as intelligence, loyalty, or satisfaction, which are difficult to measure). The basic idea of SEM differs from the usual statistical approach of modeling individual observations, since SEM considers the covariance structure of the data [11]. In SEM, the parameters are estimated by minimizing the difference between the observed covariances and those implied by a structural or path model. The structural equation model consists of a set of linear structural equations containing observed variables and parameters defining causal relationships among the variables. Variables in the equation system can be endogenous (i.e., dependent on the other variables in the model) or exogenous (independent of the model itself). The structural equation model specifies the causal relationships among the variables, describes the causal
effects and assigns the explained and unexplained variance. [12]
Application - Customer satisfaction and loyalty studies; driver analysis.
Pros - Can model latent variables. Can utilize data of differing scales. Path diagrams make the results easier to understand and communicate to management.
Cons - Models can be complex. Tends to need large sample sizes.
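To make the covariance-fitting idea concrete, the sketch below estimates a tiny path model x -> m -> y by minimizing the squared difference between the observed and model-implied covariance matrices. This is an unweighted least-squares toy on synthetic data, not a full SEM implementation; the model structure and fitting criterion are assumptions for illustration.

```python
# Illustrative sketch of SEM estimation: choose path coefficients so that the
# model-implied covariance matrix matches the observed one as closely as possible.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
m = 0.7 * x + rng.normal(scale=0.5, size=n)
y = 0.4 * m + rng.normal(scale=0.8, size=n)
S = np.cov(np.vstack([x, m, y]))            # observed covariance matrix (3 x 3)

def implied_cov(params):
    a, b, vx, vem, vey = params             # paths x->m, m->y and (residual) variances
    var_m = a ** 2 * vx + vem
    var_y = b ** 2 * var_m + vey
    return np.array([
        [vx,         a * vx,     a * b * vx],
        [a * vx,     var_m,      b * var_m],
        [a * b * vx, b * var_m,  var_y],
    ])

def discrepancy(params):
    # Unweighted least-squares fit between observed and implied covariances.
    return np.sum((S - implied_cov(params)) ** 2)

res = minimize(discrepancy, x0=[0.5, 0.5, 1.0, 1.0, 1.0],
               bounds=[(None, None), (None, None), (1e-6, None), (1e-6, None), (1e-6, None)])
a_hat, b_hat = res.x[:2]
print(f"estimated paths: x->m = {a_hat:.2f}, m->y = {b_hat:.2f}")
```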
F. Regression Analysis:
Definition - Studies the dependence of a single, interval-scale variable (such as market share) on one (simple regression) or more (multiple regression) variables. The aim of regression analysis is to find a crisp relationship between the dependent and independent variables and to estimate the variance of the measurement error. [13] Logistic regression is an important statistical method for analyzing, modeling and predicting data with categorical attributes. [14]
Figure 5 Linear Regression Graphs indicating the type of relationships
Application - Forecasting sales, market share, profitability; modeling buying patterns and the impact of marketing programs; estimating elasticity and response functions.
Pros - A tremendous predictive modeling tool. A tried and true methodology. Diagnostics can be used to evaluate the success of the model.
Cons - Susceptible to outliers and highly correlated data. Slow convergence speed and premature convergence are two key problems in some regression analysis techniques. [15] When conducting logistic regression analysis in real-world data mining applications, we often encounter the difficulty of not having the complete set of data in advance. [14]
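A brief sketch of both flavours on synthetic data, assuming scikit-learn: a linear regression of sales on marketing spend and a logistic regression for a categorical response. The spend-sales relationship and the response mechanism are invented for illustration.

```python
# Hedged sketch: linear regression for a continuous outcome and logistic
# regression for a categorical outcome, both on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(4)
spend = rng.uniform(10, 100, size=(200, 1))                      # marketing spend
sales = 50 + 3.0 * spend[:, 0] + rng.normal(scale=20, size=200)  # noisy linear response

lin = LinearRegression().fit(spend, sales)
print(f"estimated response: sales = {lin.intercept_:.1f} + {lin.coef_[0]:.2f} * spend")

# Categorical outcome, e.g. whether a customer responded to a campaign.
responded = (rng.random(200) < 1 / (1 + np.exp(-(spend[:, 0] - 55) / 10))).astype(int)
logit = LogisticRegression().fit(spend, responded)
print("predicted response probability at spend = 80:",
      round(logit.predict_proba([[80]])[0, 1], 2))
```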
5. LIMITATIONS IN DATA ANALYSIS
"Sophistication in statistics compensates for lack of data and/or business understanding." Increased understanding and acceptance of sophisticated statistical techniques in business has resulted in greater availability of packaged solutions. These solutions have the twin advantages of increasing the usage of statistical tools by business users and reducing the lead time required to gain results and insights. However, the convenience has led to an added temptation of substituting sophisticated statistics for a lack of data or business understanding. This has resulted in overreliance on algorithmic approaches to analytics problem solving, in which business understanding is used to validate the outcome of analytics, not necessarily the analytics process. A common symptom of this problem is the prevalence of esoteric modeling and data mining techniques without enough inquiry into their appropriateness and applicability for the problem at hand. Unfortunately, it is model accuracy that often becomes the final arbiter. This results in a classic trap of choosing the technique that gives maximum accuracy over the one that makes the most business sense. It can best be avoided by striking a balance between algorithmic and heuristic approaches, which is essentially a balance between a highly accurate model and a model that makes business sense. Tilting to either extreme is dangerous.
"Extracting meaning out of randomness." Any data that you encounter has a non-zero ratio of meaningful pattern to noise, and the art lies in isolating and explaining the meaningful pattern while tolerating the unexplained noise as error. But this is easier said than done. Suppose you are relying on a sales-forecasting model to help validate the quarterly targets you received from finance. Wouldn't you want your model to be as accurate as possible so that you can set realistic goals? This is a reasonable expectation, but overemphasis on model accuracy might land you in uncharted waters. Ask any statistician or expert modeler and you will hear that statistics provides enough tools to make the data say whatever you want to hear. A common outcome of this problem is a model that is able to explain minute variations in the data on which it was built but fails miserably on any new data. This is called the problem of overfitting in statistics. It happens because noise is a random phenomenon that is beyond the control of your business or even known external factors (that is why it is noise!), and the part of your model that tries to explain this noise fails when it is confronted with new data. The way to control for this problem is to always compare in-sample validation (model accuracy on the data over which the model is built) with out-of-sample validation (model accuracy on data the model has not seen); a small sketch of this check is given at the end of this section. In data mining parlance, these are called the training and test data sets, respectively. For the model to be stable, meaning it explains only systematic patterns, the in-sample error should be reasonably close to the out-of-sample error. What is reasonable depends on the particular context of the problem, but typically if the difference between the two errors is greater than 10 percent then you have reason to worry.
"Correlation versus causation: modeling will help uncover causal relationships." This problem can be illustrated with two simple examples.
Suppose you are a meteorologist who has a poor track record of predicting
rain. Your model has the usual predictors such as weather, temperature, extent of cloud formation, wind speed, etc., but is still not reliable. It is clear that you are missing some significant predictor. One day, your analyst comes to you with a breakthrough: a variable that is a very significant predictor in the model and is able to validate historical data very accurately. Unfortunately, the variable is weekly sales of umbrellas. This example shows that modeling does not establish the direction of causality; in this case we all know that rain causes sales of umbrellas and not vice versa, but there is no way for the model to know this. Another example: suppose you are the regional sales head of a leading home appliance manufacturer and you want to figure out what drives sales of air conditioners in a region. Your insights team comes back to you with an excellent driver model that is able to explain the sales of air conditioners fairly accurately. The only problem is that the most significant variable in the model is the sale of aerated drinks. This example shows that modeling does not correct for the presence of confounding factors. We know that the sale of aerated drinks is not causing the sale of air conditioners or vice versa; ambient temperature is the common factor that impacts both. Because the impact is directionally the same, they are highly correlated and the driver model will naturally show this variable as a significant predictor. In this case, the sale of aerated drinks is a confounding variable that should not be present in the model. The recommended way to overcome this problem is to start with a hypothesis matrix. A hypothesis matrix lays down hypotheses connecting every predictor with the predicted variable and also records the direction of impact. For example, in the above problem, a hypothesis connecting price with sales would read, "As price increases, the sales of air conditioners drop." No predictor should go into the model unless there is a well thought-out business hypothesis behind it.
"Extrapolating models way beyond their permissible limits." Well-designed statistical models can answer a lot of business questions, but one has to realize that there is no perfect model that is free from all constraints. Judicious use of statistical models can aid business decision making, but not being aware of a model's limits can be counterproductive. A statistical model is based on underlying data and is subject to the limitations of the data captured. For instance, a model to predict sales cannot take into account the impact of an earthquake on sales if the historical data has never captured earthquakes. Hence, this model will not be able to predict sales accurately in the event of an earthquake. These "Black Swan" events are often the cause of considerable distress, the subprime crisis being a recent example. Another example is marketing mix models, which are used to assess the impact of marketing on sales. These models are often used as tools for scenario planning, where the business user aims to estimate sales under different spend scenarios. It is important to realize that any model is only accurate in the range of data it has seen and, therefore, if the scenario is drastically different from history, there is a very high chance of error. For example, if the marketing division decides to increase spend by 5x, the same model might not be as accurate, since it was developed on historical spend levels.
To understand this intuitively, any model is just an interpreter that translates data into a language we can understand. If the data does not speak about earthquakes or high marketing spends, the model will not be able to interpret them accurately.
"Imputing missing values with the mean or median is the best way of treating missing values." Any real-life data used for statistical analysis is likely to have quality issues, and missing values in variables are one of the most recurrent issues. Therefore, it becomes imperative for an analyst or statistician to impute missing values to avoid loss of data and retain maximum information. Often we encounter scenarios where 5-10 percent of the values in a variable are missing, and we are inclined to impute them with the mean or median value. While this does the job in certain cases, extreme caution needs to be taken before imputing missing values, as it might have significant consequences on model behavior and on the interpretation of parameter estimates. It is important to realize that missing values can tell a story and, in many cases, help us better understand the business dynamics. Hence, it is necessary to look deeper whenever a variable has missing values before settling on an imputation. For example, while conducting an analysis of premiums for a large health insurer, it was observed that 5 to 6 percent of values were missing. Further analysis revealed that the missing values were only for one state in the U.S. for a certain time period. Research revealed that the company had been temporarily banned from operating in that state due to a legal issue. It is, therefore, recommended to look for the cause of the missing values before jumping into imputation.
This is by no means an exhaustive list, but it is certainly representative of the types of errors encountered in the application of statistics to business. Some of these mistakes stem from an incomplete understanding of statistics, some from an incomplete understanding of the underlying business, and the rest from the inability to marry the two. With the advent of data analytics and decision sciences, our decisions are increasingly impacted by these errors, which can have major implications for a business; business executives therefore need to appreciate, sense and avoid these common pitfalls.
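The in-sample versus out-of-sample check referred to above can be sketched as follows, assuming scikit-learn: a deliberately over-flexible polynomial model is compared with a simple linear one on held-out data. The data, the polynomial degree and the error metric are illustrative choices, and the 10 percent gap mentioned earlier is only a rule of thumb.

```python
# Sketch of the in-sample vs out-of-sample check: an over-fitted model explains
# the training data well but degrades on data it has not seen.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(5)
X = rng.uniform(0, 3, size=(120, 1))
y = 5 + 2 * X[:, 0] + rng.normal(scale=1.0, size=120)   # true relationship is linear

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    err_in = mean_absolute_percentage_error(y_tr, model.predict(X_tr))    # training error
    err_out = mean_absolute_percentage_error(y_te, model.predict(X_te))   # holdout error
    print(f"degree {degree}: in-sample error {err_in:.1%}, out-of-sample error {err_out:.1%}")
```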
6. CONCLUSION
In each discipline, companies that invest in analytics often find that the business does not consume the outputs, for a variety of reasons. As companies realize that one of the many factors separating failure from success is their ability to effectively use analytics to make better decisions, it becomes necessary for the key stakeholders to ensure that the right set of investments is made along the process, technology and people dimensions to bridge the gap between the creation and consumption
of analytics. The sooner businesses can get this done, the better their chances of leveraging the potential competitive advantage offered by analytics. There are various techniques that can be used to achieve the desired results through data analysis; however, it is important to choose the correct data analysis technique for the problem at hand. Through this paper we have provided a comparative study of a few of the available data analysis techniques. We have also highlighted the common mistakes that are made in the name of data or statistical analysis. This paper should enable readers to select an appropriate data analysis method for their problem.
References
[1] Martin Spott and Detlef Nauck, "On Choosing an Appropriate Data Analysis Algorithm," The 2005 IEEE International Conference on Fuzzy Systems.
[2] LeCompte, M. D., & Schensul, J. J. (1999). Analyzing and interpreting ethnographic data. Walnut Creek, CA: AltaMira Press.
[3] Wolcott, H. F. (1998). Transforming qualitative data: Description, analysis, interpretation. Thousand Oaks, CA: Sage.
[4] Beck, C. T. (2003). Initiation into qualitative data analysis. Journal of Nursing Education, 42(5), 231.
[5] Doucet, A., & Mauthner, N. (1998). Voice, reflexivity, and relationships in qualitative data analysis: Background paper for workshop on "Voice in Qualitative Data Analysis." Retrieved August 15, 2001.
[6] Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). London: Sage.
[7] Peter Zolliker, Zofia Baranczuk, Iris Sprow, and Joachim Giesen, "Conjoint Analysis for Evaluating Parameterized Gamut Mapping Algorithms," IEEE Transactions on Image Processing, vol. 19, no. 3, March 2010.
[8] Z. Yi, M. Ye, J. C. Lv, and K. K. Tan, "Convergence analysis of a deterministic discrete time system of Oja's PCA learning algorithm," IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1318-1328, Nov. 2005.
[9] Yijuan Lu and Qi Tian, "Discriminant Subspace Analysis: An Adaptive Approach for Image Classification," IEEE Transactions on Multimedia, vol. 11, no. 7, November 2009.
[10] Esteves, R. M., Hacker, T., & Chunming Rong, "Cluster analysis for the cloud: Parallel competitive fitness and parallel K-means++ for large dataset analysis," Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference.
[11] K. A. Bollen, Structural Equations with Latent Variables. New York: Wiley, 1989.
[12] Laura Astolfi, Febo Cincotti, Claudio Babiloni, Filippo Carducci, Alessandra Basilisco, Paolo M. Rossini, Serenella Salinari, Donatella Mattia, Sergio Cerutti, D. Ben Dayan, Lei Ding, Ying Ni, Bin He and Fabio Babiloni, "Estimation of the Cortical Connectivity by High-Resolution EEG and Structural Equation Modeling: Simulations and Application to Finger Tapping Data," IEEE Transactions on Biomedical Engineering, vol. 52, no. 5, May 2005.
[13] Hideo Tanaka and Haekwan Lee, "Interval Regression Analysis by Quadratic Programming Approach," IEEE Transactions on Fuzzy Systems, vol. 6, no. 4, November 1998.
[14] Ruibin Xi, Nan Lin, and Yixin Chen, "Compression and Aggregation for Logistic Regression Analysis in Data Cubes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, April 2009.
[15] Xiaorong Cheng, Lin Sun, and Ping Liu, "Application of regression analysis based on genetic particle swarm algorithm in financial analysis," Computer Design and Applications (ICCDA), 2010 International Conference.