Data Analytics

This page mostly summarizes information from TechTarget.com, but it also uses material from Wikipedia and other online resources.

In an attempt to be as scientific as possible, thus making the data more useful and the reports more credible, phrasing may be used that some people will find disconcerting. If the statistical terminology used on this site has you confused, then this page may help you to understand what the terms mean and why they are important.

The analysis herein focuses on quantitative data, that is, data which deals with numbers and can be measured. At this time, the site offers no purely qualitative analysis, that is, analysis of data which deals with descriptions and cannot be physically measured (such as opinions).

Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. It is used to allow organizations to make better business decisions and in the sciences to verify or disprove existing models or theories. Data analytics focuses on inference, the process of deriving a conclusion based solely on what is already known by the researcher.

The science is generally divided into exploratory data analysis (EDA), where new features in the data are discovered, and confirmatory data analysis (CDA), where existing hypotheses are proven true or false. Qualitative data analysis (QDA) is used in the social sciences to draw conclusions from non-numerical data like words, photographs or video.

Terminology

Data is information that has been translated into a form that is more convenient to move or process. It is acceptable for the term "data" to be used as either a singular or a plural subject.

Data mining is sorting through data to identify patterns and establish relationships. Parameters include:

  • Association - looking for patterns where one event is connected to another event.
  • Sequence or path analysis - looking for patterns where one event leads to another later event.
  • Classification - looking for new patterns. (May result in a change in the way the data is organized.)
  • Clustering - finding and visually documenting groups of facts not previously known.
  • Forecasting - discovering patterns in data that can lead to reasonable predictions about the future. (This area of data mining is known as predictive analytics.)
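
As a small illustration of the association parameter above, the following Python sketch counts how often pairs of items occur together. The function name pair_support and the shopping baskets are hypothetical, made up for this example; real association mining would normally rely on a dedicated library and far more data.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions in which each pair of items co-occurs."""
    if not transactions:
        return {}
    counts = Counter()
    for items in transactions:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk"],
]

support = pair_support(baskets)
print(support[("bread", "butter")])  # 0.5: the pair appears in 2 of 4 baskets
```

A pair with high support is only a candidate association; whether the connection is meaningful still has to be judged against how common each item is on its own.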

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points in order to identify patterns and trends in the larger data set being examined. It allows work with a smaller, more manageable amount of data. Samples are best drawn, however, from data sets that are as large and as close to complete as possible. [Sampling is not an issue for us.]
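
The idea can be sketched in Python with a simple random sample, which is only one of several sampling methods. The helper sample_mean and the population of numbers from 1 to 10,000 are assumptions made for this illustration.

```python
import random
import statistics


def sample_mean(data, k, seed=None):
    """Estimate the mean of data from a simple random sample of k items."""
    rng = random.Random(seed)
    sample = rng.sample(data, k)  # draws k items without replacement
    return statistics.mean(sample)


# Hypothetical "complete" data set: the integers 1 through 10,000.
population = list(range(1, 10001))

full_mean = statistics.mean(population)           # 5000.5, using every value
estimate = sample_mean(population, 500, seed=42)  # uses only 500 values

print(full_mean, estimate)
```

The estimate will usually land near the full mean but not exactly on it; a larger sample tends to narrow that gap, at the cost of more data to process.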

Data set is a named collection of data (file) that contains individual data units organized (formatted) in a specific way and accessed by a specific access method that is based on the data set organization (such as: sequential, relative sequential, indexed sequential, and partitioned).

Null hypothesis is a term used in inferential statistics which usually refers to a general statement or default position that there is no relationship between two measured phenomena (or no difference among groups). Rejecting or disproving the null hypothesis, and thus concluding that there is a relationship between two phenomena (e.g. that a potential treatment has a measurable effect), is a central task in the modern practice of science, and gives a precise criterion for rejecting a hypothesis. The null hypothesis is generally assumed to be true until evidence indicates otherwise.
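
One way to put a null hypothesis to the test is a permutation test, sketched below in Python. This is an illustrative choice, not a method taken from the sources; the function permutation_p_value and the two groups of measurements are hypothetical. The null hypothesis here is that the group labels make no difference, so shuffling the labels should produce differences as large as the observed one fairly often.

```python
import random


def permutation_p_value(group_a, group_b, rounds=10000, seed=0):
    """Estimate how often a random relabeling of the pooled data gives a
    difference in means at least as large as the observed difference."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # pretend the group labels are arbitrary
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / rounds


# Hypothetical integer measurements, invented for illustration only.
treated = [12, 14, 15, 16, 18, 19]
control = [8, 9, 10, 11, 11, 13]

p_value = permutation_p_value(treated, control)
print(p_value)
```

A small p-value (conventionally below 0.05) is taken as evidence against the null hypothesis; a large one means the data give no reason to abandon it.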

Statistical analysis is a component of data analytics which involves collecting and scrutinizing large amounts of data to discover underlying patterns and trends. Statistical analysis can be broken down into five discrete steps, as follows:

  • Describe the nature of the data to be analyzed.
  • Explore the relation of the data to the underlying population.
  • Create a model to summarize understanding of how the data relates to the underlying population.
  • Prove (or disprove) the validity of the model.
  • Employ predictive analytics to run scenarios that will help guide future actions.
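
Several of the steps above can be sketched in Python on a tiny data set. The monthly sales figures and the helpers fit_line and r_squared are hypothetical, and a straight-line model is assumed purely for illustration. The second step, relating the data to the underlying population, depends on how the data was collected and is not shown.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys that the fitted line accounts for."""
    mean_y = statistics.mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical observations, invented for illustration only.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 15, 15, 18, 20]

# Describe the data.
print(statistics.mean(sales), statistics.stdev(sales))

# Create a model that summarizes the data.
slope, intercept = fit_line(months, sales)

# Check how well the model fits (a measure of fit, not a proof).
fit = r_squared(months, sales, slope, intercept)

# Use the model to run a scenario: projected sales for month 7.
forecast = slope * 7 + intercept
print(slope, intercept, fit, forecast)
```

A value of fit close to 1 says only that the line tracks these six points well; it does not by itself establish that the same trend will continue.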

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.