CRISP-DM: the standard model for analytics progress

The success of a reform or development project often depends on how well it proceeds as a process with a clear order and logic. CRISP-DM (Cross-Industry Standard Process for Data Mining, CRISP-DM consortium 1996) is a flexible analytics production process that has been in use since the 1990s. The model offers a simple way to bring analytics into production in a business-oriented and systematic manner. Because of its comprehensiveness and flexibility, CRISP’s standardized process remains one of the most practical models in analytics. The stages of the process do not necessarily follow one another in strict sequence; they are often iterated. Nevertheless, the model gives the analytics production process a sound, goal-oriented order.

 

According to Colin Shearer, a founding member of the CRISP team, the model was created to promote the use of data mining. The main goal was a clear, easy-to-understand process model that describes the work phases of a data mining project and emphasizes that data mining is not simply a technical task but relates directly to business objectives and processes.

In CRISP’s Business Understanding phase, the team defines one or more questions (use cases) that the process aims to answer. Defining the objectives matters because it determines how the end results can be used. The Business Understanding phase gives direction to the subsequent phases and an idea of the extent and cost of the work. The project also needs a preliminary budget in this phase.

In the Data Understanding phase, the objectives are validated: the team considers what data is at its disposal and clarifies whether the use cases can be carried out with the existing data. Data understanding is gained through surveys, reports, and data visualizations, which requires sufficient rights to read and study the data. The central goals of this phase are to understand the data formats and contents and to establish the quality of the data.
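
As an illustration of what establishing data formats and data quality can look like in practice, the following is a minimal sketch using only the Python standard library. The `profile_column` helper and the sample `customers` records are hypothetical and not part of CRISP-DM; a real project would run a comparable check against its own source extracts.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, missing values, and value types seen."""
    values = [row.get(column) for row in rows]
    # Treat both None and the empty string as missing (an assumption for this sketch).
    present = [v for v in values if v is not None and v != ""]
    types = Counter(type(v).__name__ for v in present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "types": dict(types),
    }


# Hypothetical sample records standing in for a real source extract.
customers = [
    {"customer_id": 1, "age": 34, "city": "Helsinki"},
    {"customer_id": 2, "age": None, "city": "Espoo"},
    {"customer_id": 3, "age": "41", "city": ""},
]

for name in ("customer_id", "age", "city"):
    print(name, profile_column(customers, name))
```

A profile like this makes format problems visible early: here the `age` column mixes integers and strings and has a missing value, which is exactly the kind of finding that feeds into the Data Preparation phase.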

The objective of the Data Preparation phase is to transform the data into a format suitable for modeling with analytics tools. With regard to data validity, this is the most essential and time-consuming phase: preparation usually takes 50 to 70% of the project’s time and resources. Carrying out the previous phases thoroughly can reduce the time needed. This phase normally includes linking different data sets, taking and comparing samples, summarizing information, adding new variables, removing empty values, and testing. In summary, it covers data consolidation, data cleansing, data transformation, and data screening.
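
The preparation steps named above (linking data sets, removing empty values, adding new variables) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the `prepare` function, the field names, and the threshold of 100 are assumptions made for the example, not something prescribed by CRISP-DM.

```python
def prepare(customers, orders, large_order_limit=100):
    """Link orders to customers, drop incomplete rows, add a derived variable."""
    customers_by_id = {c["customer_id"]: c for c in customers}
    prepared = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        # Cleansing: skip orders with no matching customer or no amount.
        if customer is None or order.get("amount") is None:
            continue
        # Consolidation: merge the customer and order fields into one row.
        row = {**customer, **order}
        # Transformation: add a new variable (the limit is an arbitrary example).
        row["is_large_order"] = order["amount"] >= large_order_limit
        prepared.append(row)
    return prepared


# Hypothetical input data sets.
customers = [
    {"customer_id": 1, "segment": "A"},
    {"customer_id": 2, "segment": "B"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},   # empty value, removed
    {"customer_id": 3, "amount": 40.0},   # unknown customer, removed
]

rows = prepare(customers, orders)
print(rows)
```

Even in this toy case two of the three orders are dropped, which hints at why preparation consumes so much of a project: every rule for what counts as usable data has to be decided, implemented, and tested.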

In the Modeling phase, the prepared data is analyzed to produce results. Here, statistical or mathematical methods are used to solve the business problem. Modeling is often done iteratively: typically, several models are first run with default values and then adjusted, and it may even be necessary to return to the Data Preparation phase to process the data to suit the chosen method.

The Evaluation phase examines whether the result meets the needs and expectations defined at the beginning. The validity of the mathematical models is already verified in the Modeling phase, so the aim of this phase is to ensure that the results solve the defined problem and fulfill the objectives and the targeted level of accuracy. Because several competing models can be tried in the Modeling phase, at least the confidence interval of the coefficient of determination and the prediction error should be measured in the Evaluation phase. In addition, it is good to ensure that all questions have been answered and that everything learned from the process is recorded. It is recommended that the evaluation be carried out with production data rather than a narrow sample of historical data, because the benefit will be realized when moving to production.
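
To make the comparison of competing models concrete, the sketch below computes two of the measures mentioned above, the coefficient of determination (R²) and the prediction error (here as root mean squared error), for two hypothetical models. The observed values and both sets of predictions are invented for illustration. The sketch produces point estimates only; a confidence interval would require an additional step such as resampling, which is left out here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and predictions from two competing models.
actual = [10, 12, 14, 16]
predictions = {
    "model_a": [11, 12, 13, 17],
    "model_b": [10, 14, 14, 12],
}

for name, predicted in predictions.items():
    print(name, "R2:", round(r_squared(actual, predicted), 3),
          "RMSE:", round(rmse(actual, predicted), 3))
```

Whichever measures are chosen, they only answer the technical half of the evaluation; the results still have to be checked against the business questions set in the Business Understanding phase.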

In the Deployment phase, the results that have proven to work are put to use. Such results can be, for example, a customer segmentation that serves as a basis for targeted marketing, purchase probability predictions, or a life cycle prediction for a part, used to optimize the maintenance process.

CRISP-DM is often used because of its clear processes, extensive phases, and flexibility. Other similar models exist, but some of them require that the data sources and objectives of data mining be defined in advance. In practice, the work begins from scratch, which requires both defining the objectives and reviewing the data sources extensively.

Problem-Solution: an analytics organizing model

The CRISP model is an excellent model of analytic work but, in order to succeed, organization, methods, and approaches are also vital. You have to ensure that the organization has the know-how required both to process data and information and to define business questions. Execution calls for alternative ways of gathering know-how and data.

One way to organize an analytics project is the Problem-Solution model by Houston Analytics Oy. For a project-based organization, this is a functional way to organize the work and collaborate. The work is done in two teams, both of which include people from the service provider’s side and the customer’s side.

The Problem Team’s task is to define the use cases, i.e. the problems to be solved. The Solution Team works out the best way to solve these problems. The members of the two teams are often, but not always, the same people. The model reflects the CRISP-DM process but divides the phases between the teams as follows: the Problem Team is responsible for the Business Understanding phase, and the Solution Team for the Data Understanding, Data Preparation, and Modeling phases. The Evaluation and Deployment (Release) phases are carried out together. The division of responsibilities and phases is described more clearly in Figure 11.

For people from organizations that are not analytics-focused, the work can be simplified by dividing the process chronologically, leaving the iteration viewpoint aside. One functional model of progress is the Houston Analytics way of dividing the actions into three phases:

1) In the Launch Pad phase, team members clarify the business need (Business Understanding) and break it down into parts that are understandable from an analytics point of view. The Problem Team defines the decision-making points and the questions that the project seeks to answer. Correspondingly, the Solution Team goes through the data sources and data contents that are required and usable for the solution. The mapping of possibilities can be done in workshops, where a larger group takes part in building understanding. For this, the decision-making points, their questions, and the stakeholders must be identified.

2) In the Rocket Launch phase, the team forms a harmonized data set from the collected data as the basis for the analytic models, and creates the first analytical models and their processes. For evaluation, the team builds a first version in which the gathered data and the models are linked to the practical planning process. The result is a process model based on analytics and optimization. In terms of the Problem-Solution model, the Rocket Launch phase focuses on data preparation and modeling.

3) In the Mission Control phase, the created and tested analytics process is implemented in the customer’s everyday decision-making. In practice, Mission Control takes care of bringing the first product or solution into production, and after that the operations can continue as a maintenance service, for example.

The aim of a project-time operations model is to define the customer’s so-called decision-making windows and to bring significant facts into those windows by means of easily utilized analytics. Models created with analytics tools support the customer’s processes as long as the customer’s data sets do not change substantially.