Abstract

Statistical analysis and data mining covers the broad area of data analysis, including data mining algorithms, statistical approaches, and practical applications. Statistical methods and algorithms can be combined into approaches such as factor analysis and function-based data analysis, and thorough experimental evaluations show that the results are effective and efficient. Based on an analysis of traditional methods for deriving rules for flash metal consumption design, we propose regression mining for flash metal consumption design. Following the related definitions, we construct a linear regression model and discuss in detail its parameter estimation method and the significance tests for linear correlation. We also analyze how mathematical statistics can be applied to the forming process algorithm, and study the flash calculation and its mathematical description. Building on the design of regression analysis software, we formulate design guidelines for forging flash size and design criteria for forging flash metal consumption. Stepwise regression analysis software is used to construct these criteria, and the results are analyzed. In our experiments the improved algorithm is effective and shows better self-assembling and self-adaptive ability.

Introduction

Data mining can be viewed as an extension of statistical analysis techniques used for exploratory analysis, incorporating new techniques [1]. Regression analysis is an important statistical method for the analysis of socio-economic data. It also helps in recognizing and categorizing relationships between various factors. ARIMA (auto-regressive integrated moving average), long-memory time-series modeling, and auto-regression are popular methods for such analysis [2].

Communities and Crime Dataset

The data combines socio-economic data [3] from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. Many variables are included so that algorithms that select or learn weights for attributes can be tested. Clearly unrelated attributes were not included; an attribute was picked if it had any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. Data is described below based on original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). A limitation is that the LEMAS survey covered police departments with at least 100 officers, plus a random sample of smaller departments, so many communities are missing LEMAS data. For our purposes, communities not found in both the census and crime datasets were omitted. Many of these omitted communities were from the Midwestern USA.
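The effect of this normalization on a skewed attribute can be illustrated with a minimal sketch. The exact binning procedure used for the dataset is not given here, so the code assumes min-max scaling rounded to two decimals (that is, equal-width steps of 0.01); the function name and the population figures are hypothetical.

```python
def normalize_to_unit_range(values):
    """Scale values into 0.00-1.00 in equal steps of 0.01 (assumed procedure)."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [round((v - lowest) / span, 2) for v in values]


# Hypothetical populations: three small communities and one large city.
population = [10_000, 20_000, 50_000, 1_000_000]
normalized = normalize_to_unit_range(population)
mean_normalized = sum(normalized) / len(normalized)
print(normalized, mean_normalized)
```

Because the scaling is linear, the small communities stay clustered near 0.00 and the skew of the original attribute is preserved.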

Regression

Regression analysis is used when researchers want to predict a continuous dependent variable (DV) from a number of independent variables (IVs). The purpose of regression analysis [4, 5] is to find the equation of a line that passes through the cluster of data points with the smallest total deviation of the points from the line. The deviation of a point from the line is called the ‘error’.
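As a small illustration of what 'minimal deviation' means, the sketch below compares two candidate lines on the same points. It assumes the deviations are measured as the sum of squared errors, which is the usual least-squares criterion; the data and the function name are hypothetical.

```python
def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared vertical deviations of the points from a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

sse_a = sum_squared_error(xs, ys, 0.0, 2.0)  # candidate line y = 2x
sse_b = sum_squared_error(xs, ys, 1.0, 1.0)  # candidate line y = 1 + x
print(sse_a, sse_b)
```

The line with the smaller total error fits the cluster of points better; regression searches for the line that minimizes this quantity.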

Pre-processing

Pre-processing includes data cleaning, data transformation, and data selection. In data cleaning, missing values are replaced with the minimum, maximum, or average of the attribute. Once the missing values have been handled, the data is transformed by applying Z-score normalization in order to improve the regression results.
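The two pre-processing steps can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the experiments: missing values are marked with `None`, the attribute average is used as the fill value (the minimum or maximum could be substituted), and the Z-score uses the population standard deviation.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the average of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw = [2.0, None, 4.0, 6.0]   # hypothetical attribute with one missing value
filled = impute_mean(raw)
scaled = z_score(filled)
print(filled, scaled)
```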

Regression

Regression analysis can imply a broader range of techniques than is ordinarily appreciated. Statisticians commonly define regression so that the goal is to understand “as far as possible with the available data how the conditional distribution of some response y varies across subpopulations determined by the possible values of the predictor or predictors”. If the DV takes only two values, logistic regression [6,7] can be used. Regression is used for building advanced data mining models, with applications ranging from the assessment of experimental data to statistical and econometric analysis. It is mainly used for estimating a relationship between many attributes. While it is an effective and relatively simple method, regression can be applied only to data that are internally dependable. The IVs used in regression can be either continuous or dichotomous. Variables with two levels can be used in regression analysis directly; variables with more than two levels must first be converted into variables with two levels. Usually, regression analysis is used with naturally occurring variables, although it can also be used with experimentally manipulated variables. One important point is that regression analysis cannot determine the underlying causal relationships among the variables: while the terminology is such that we say that X ‘predicts’ Y, we cannot say that X ‘causes’ Y.

Regression analysis [8] also assumes linearity. Linearity means that there is a straight-line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV; any nonlinear relationship between an IV and the DV is ignored. Linearity between an IV and the DV can be checked by looking at a bivariate scatter plot (i.e., a graph with the IV on one axis and the DV on the other). If the two variables are linearly related, the scatter plot will be oval.
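The conversion of a multi-level variable into two-level variables is commonly done by dummy coding. The following sketch assumes that approach, with the alphabetically first level taken as the reference category; the function name and the `region` values are hypothetical.

```python
def to_dichotomous(values):
    """Dummy-code a categorical variable into 0/1 indicator variables.

    The first level (in sorted order) is the reference category and gets
    no column, so k levels yield k - 1 dichotomous variables.
    """
    levels = sorted(set(values))
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels[1:]
    }


region = ["midwest", "south", "west", "south"]
coded = to_dichotomous(region)
print(coded)
```

Each resulting indicator has exactly two levels and can therefore enter the regression as an IV.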

Linear Regression

Linear regression is a statistical [9] procedure for predicting the value of a DV from an IV when the relationship between the variables can be described with a linear model [10,11]. Simple linear regression is used when researchers want to predict values of one variable given values of another variable. The relationship is typically expressed in terms of a mathematical equation such as Y = a + bX, where Y is the DV, X is the IV, a is the intercept, and b is the slope of the line.
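For simple linear regression the parameters can be estimated in closed form by ordinary least squares. The sketch below assumes the usual form with an intercept a and a slope b, so that the predicted value is a + b·x; the function name and the data are hypothetical, and the code is an illustration, not the SPSS or Statistica procedure.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least-squares estimates (intercept a, slope b) for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: covariation of x and y divided by the variation of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
intercept, slope = fit_simple_linear(xs, ys)
predicted_at_5 = intercept + slope * 5
print(intercept, slope, predicted_at_5)
```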

Implementation

The proposed technique is implemented using SPSS and Statistica. Statistica is a statistics and analytics software package developed by StatSoft. It provides data analysis, data management, statistics, data mining, and data visualization procedures. Statistica product categories include Enterprise (for use across a site or organization), Web-Based (for use with a server and browser), Concurrent Network Desktop, and Single-User Desktop.

Statistica derives from a set of software packages and add-ons that were initially developed during the mid-1980s by StatSoft. Following the 1986 release of CSS (Complete Statistical System) and the 1988 release of MacSS (Macintosh Statistical System), the first DOS version of Statistica (trademarked in capitals as STATISTICA) was released in 1991.

Statistica 5.0 was released in 1995. It ran on both the then-new 32-bit Windows 95/NT and the older Windows 3.1. It featured many new statistics [12] and graphics procedures, a word-processor-style output editor (combining tables and graphs), and a built-in development environment that enabled the user to easily design new procedures (e.g., via the included Statistica Basic language) and integrate them with the Statistica system.

For the regression analysis we use SPSS Statistics [13], a software package for statistical analysis. SPSS (originally Statistical Package for the Social Sciences, later modified to read Statistical Product and Service Solutions) was first released in 1968 after being developed by Norman H. Nie, Dale H. Bent, and C. Hadlai Hull, and a second version followed in 1972. Long produced by SPSS Inc., it is now officially named ‘IBM SPSS Statistics’; companion products cover data mining (IBM SPSS Modeler), text analytics, and collaboration and deployment (batch and automated scoring services). SPSS [14] is among the most widely used programs for statistical analysis in the social sciences. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, and others. The original SPSS manual has been described as one of ‘sociology’s most influential books’. In addition to statistical analysis, data management and data documentation (a metadata dictionary is stored in the data file) are features of the base software.

RapidMiner [15] is an open-source learning environment for data mining and machine learning. It can be used to extract meaning from a dataset. It offers hundreds of machine learning operators, helpful pre- and post-processing operators, descriptive graphic visualizations, and many other features. It is available as a stand-alone application for data analysis and as a data mining engine for integration into one's own products.

Research Methodology

Experimental Analysis

The “Model Summary” table, which provides information about the regression line’s ability to account for the total variation in the dependent variable, shows that the observed y-values are highly dispersed around the regression line. The ANOVA table presents the F-statistic; the most commonly used significance threshold is 0.05, which means that a variable or model whose p-value falls below this threshold is significant at the 95% confidence level.
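The two quantities reported in these tables can be reproduced by hand for a simple linear regression. The sketch below is an illustration on hypothetical data, not the experimental output: it computes the coefficient of determination R² (the share of total variation explained by the line) and the F-statistic with 1 and n − 2 degrees of freedom. Converting F into a p-value requires the F distribution, which statistical packages provide, so that step is left out.

```python
def regression_summary(xs, ys):
    """Return (R squared, F statistic) for a simple least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar

    # Decompose the total variation into explained and residual parts.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(xs, ys))
    ss_regression = ss_total - ss_residual

    r_squared = ss_regression / ss_total
    # One predictor: 1 degree of freedom for the model, n - 2 for the error.
    f_statistic = (ss_regression / 1) / (ss_residual / (n - 2))
    return r_squared, f_statistic


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
r_squared, f_statistic = regression_summary(xs, ys)
print(r_squared, f_statistic)
```

A low R² corresponds to observed values that are widely dispersed around the line, and a large F relative to the critical value of the F distribution corresponds to a p-value below the chosen threshold.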

Conclusion

This paper presents an improved approach to regression. We applied a statistical approach to linear regression, which leads to more accurate regression results. In future work we plan to extend this research in two directions: finding more efficient regression results, and exploring alternative statistical tools that can lead to accurate results. The paper sets out the basic definition of regression analysis, defines linear regression, and discusses in detail its parameter estimation method and the significance tests for linear correlation. It also studies the non-linear regression equation, the multiple linear regression model and its parameter estimation method, and the application of mathematical statistics to the forming process algorithm. Using data mining features, stepwise regression analysis is applied to establish forging design standard samples; we describe the software design of the stepwise regression analysis and the derivation of design standards for forging-die flash size and forging-die metal consumption. Finally, these standards are obtained with the stepwise regression analysis software, and the results are analyzed to draw the relevant conclusions.
