# Log transformations: How to handle negative data values?

The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

### Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant to the data before applying the log transform. The transformation is therefore log(Y+a), where a is the constant. Some people choose a so that min(Y+a) is a very small positive number (such as 0.001). Others choose a so that min(Y+a) = 1. In either case, a = b - min(Y), where b is the desired minimum of the translated data (a small positive number or 1).

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

```sas
proc iml;
Y = {-3, 1, 2, ., 5, 10, 100};   /** negative datum **/
LY = log10(Y + 1 - min(Y));      /** translate, then transform **/
```

### Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

```sas
n = nrow(Y);
LogY = j(n, 1);                  /** allocate result vector **/
do i = 1 to n;                   /** loop is inefficient **/
   if Y[i] > 0 then LogY[i] = log10(Y[i]);
   else LogY[i] = .;
end;
```

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

```sas
/** more efficient statements **/
LogY = j(nrow(Y), 1, .);         /** allocate vector of missing values **/
idx = loc(Y > 0);                /** find indices where Y > 0 **/
if ncol(idx) > 0 then LogY[idx] = log10(Y[idx]);
print Y LY LogY;
```

The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is nonpositive or missing.
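
For readers who don't use SAS, both solutions can be sketched in Python. This is an illustrative analogue, not the original code; the data vector and the use of `None` to stand in for SAS missing values are assumptions.

```python
import math

Y = [-3, 1, 2, None, 5, 10, 100]   # None plays the role of SAS's missing value

# Solution 1: translate so that min(Y) maps to 1, then take log10
ymin = min(v for v in Y if v is not None)
LY = [math.log10(v + 1 - ymin) if v is not None else None for v in Y]

# Solution 2: transform only positive values; everything else stays missing
LogY = [math.log10(v) if (v is not None and v > 0) else None for v in Y]
```

As in the SAS/IML version, the second method leaves the logarithms of the positive data unchanged: log10(1) = 0, log10(10) = 1, and log10(100) = 2.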

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.

1. Rick Wicklin
Posted June 2, 2011 at 2:02 pm | Permalink

Did you know that SAS now has a LOG1PX function that "returns the log of 1 plus the argument"? It's true!
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003121132.htm

2. Saumya
Posted September 1, 2011 at 6:47 am | Permalink

Dear Rick

My data set includes stock returns for around 1000 companies. In some cases the return data shows figures from -34.5 to -108. How do I make a log transformation in this case, and how large should the constant be for this kind of data? Please help.

• Posted September 1, 2011 at 7:50 am | Permalink

It depends somewhat on what you're trying to do, but you might want to express the returns as a percentage, measured from the start of the time period (1 yr, 5 yrs, or whatever). Then the negative returns are bounded by -100 percent, and you can safely compute log(101 + return).
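
As a small numeric sketch of this suggestion in Python (the return values here are hypothetical):

```python
import math

# Hypothetical percentage returns; -100% (total loss) is the worst possible case
returns = [-100.0, -34.5, 0.0, 12.5, 250.0]

# Shifting by 101 maps the worst case to log(1) = 0 instead of the undefined log(0)
log_returns = [math.log(101 + r) for r in returns]
```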

• Qunna
Posted September 24, 2014 at 1:04 pm | Permalink

Dear Rick,

My instructor is very reluctant to model on percentages. He said averaging percentages does not make sense since we have different denominators. Do you normally model on percentages? Thanks.

• Posted September 24, 2014 at 1:46 pm | Permalink

I do not. However, I don't think there is an inherent reason to avoid proportions and percentages. It is true that proportions are different from continuous unbounded data. However, power transformations are still useful and the analogue of the log transformation for proportions is the logit transformation: logit(y) = log(y/(1-y)). Atkinson's (1985) book on "Plots, Transformations, and Regression" has a whole chapter devoted to transformations for percentages and proportions.
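
A minimal Python sketch of the logit transform and its inverse (the logistic function), for proportions strictly between 0 and 1:

```python
import math

def logit(y):
    """Log-odds of a proportion y, defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def inv_logit(x):
    """Logistic function: maps any real x back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```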

• Razafimahery
Posted March 20, 2015 at 5:16 am | Permalink

Thanks Rick,
when the negative values are bounded by -100 percent, can you explain why you use log(101 + x) and not log(100 + x)?

• Posted March 20, 2015 at 7:53 am | Permalink

Because log(0) is undefined.

3. Posted October 11, 2012 at 7:54 am | Permalink

Dear Rick

I have a data set for which the dependent variable is both positive and negative. Would you say an alternative is to take absolute values, then take logs, and then multiply by -1 where the original values were negative? To me this seems reasonable, but I am not sure whether I can still interpret my coefficients in terms of percentage changes.

All best
Gjermund

• Posted October 11, 2012 at 8:19 am | Permalink

Well, I don't know your application, but I don't think I would recommend that approach because the LOG function has a singularity at zero. The LOG transformation is best for mapping changes that are between 0 and infinity. For example, if you buy a stock at a certain price, the quantity Price/Purchase_Price is always positive. It can be log-transformed. It sounds like your variable might be "relative change" such as (Price-Purchase_Price)/Purchase_Price, which can be positive or negative. I wouldn't use a log transform for the second quantity.

• Rodwell Tundu
Posted September 28, 2014 at 12:44 pm | Permalink

Good Day,
I am commenting on this particular reply because you told someone facing a problem similar to mine to refer to the solution you provided here. I am using EViews 6; in my study I have encountered negative GDP and FDI growth rates. Please advise on how I could apply your transformations to my data. Thank you

• Posted September 28, 2014 at 1:14 pm | Permalink

Read the article and do what it says. You will need to learn how to transform data in EViews.

• Rodwell Tundu
Posted September 28, 2014 at 1:54 pm | Permalink

Thank you for the prompt response. I did exactly as you instructed in solution one of the article, but the software is now asking me to define "min". How do I do that?

• Posted September 28, 2014 at 2:12 pm | Permalink

"Min" will be the smallest value for the variable that you are transforming. So if the smallest (=most negative) GDP growth is -1.2, you would use -1.2 as "Min" for GDP.

• Posted June 25, 2014 at 6:01 am | Permalink

Upon rereading your comment, perhaps you were attempting to form a transformation like this: y --> sign(y)*log(|y| + 1)
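
That transformation (sometimes called the log-modulus transformation) is easy to sketch in Python; it is odd, continuous, and defined for all real numbers:

```python
import math

def log_modulus(y):
    """sign(y) * log(|y| + 1): compresses magnitudes while preserving sign."""
    sign = -1.0 if y < 0 else 1.0
    return sign * math.log(abs(y) + 1.0)
```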

4. abera daba
Posted October 15, 2012 at 4:40 am | Permalink

Dear Rick
Dear Rick
I have a data set with both positive and negative values. I shifted the data by adding the same positive constant to all treatments, so that the most negative value became zero, and then I analyzed the data with SAS. But I am not sure this is correct; please would you help me?

All best
Abera

5. Sara
Posted October 15, 2012 at 7:49 am | Permalink

Dear Rick

I have to apply regression to Return on equity ratios, return on asset ratios, GDP growth, Inflation % and Real interest rate.

The problem arises where I want to take the natural log of the data for all variables. All data are in % form but have positive or negative values, e.g., ROA = 0.328% and ROE = -7.92%. I can easily apply the natural log to the 0.328% value and get -1.11474, but how can I apply the natural log to the negative value? If you could please show me how and give me a figure to see.

Is there any way I can just write 0 for this entry and perform my regression? It may give wrong results.

• Posted October 16, 2012 at 6:05 am | Permalink

Use the proportion y = Price / Purchase_price.
This is always in the interval (0, infinity) for non-bankrupt stocks, and y=1 means that the price has not changed since it was purchased. If this quantity spans several orders of magnitude, you can apply the log(y) transform.

6. Clarisa Anne
Posted October 20, 2012 at 8:03 pm | Permalink

Sir, I am using Eviews 7 and I have values in my data set for presidential approval ratings which are negative. I need to use log of the ratings but Eviews cannot compute it. How do I get rid of the negative values? Thank you!

• Posted October 20, 2012 at 8:49 pm | Permalink

You don't need to use the log of the rating; just use the ratings as given. Logarithms are used when data span many orders of magnitude, which doesn't apply to approval ratings. If you insist on transforming the data, use y = (r+100)/200, which maps the rating (r) from [-100,100] to [0,1].
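
That rescaling can be written as a one-line Python function (a sketch of the formula above):

```python
def rescale_rating(r):
    """Map a net approval rating r in [-100, 100] onto [0, 1]."""
    return (r + 100.0) / 200.0
```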

7. Rena
Posted November 15, 2012 at 10:58 am | Permalink

Hello Rick!

I have a question regarding the interpretation of log-transformed data where a constant was added to avoid negative values. Should we still interpret the results as saying that a 1% change in the independent variable leads to a β% change in the dependent one (where β is the coefficient found by the regression)? (Both dependent and independent variables were log transformed.)
Thanks!

• Posted November 15, 2012 at 11:08 am | Permalink

No, the interpretation is that a unit change in the LOG of the indep var leads to change of beta in the LOG of the dependent variable. This is a much more difficult interpretation because the "unit change in log(x)" now depends on x. At small values of x, log(x) changes quickly; for large values of x, log(x) changes slowly.

• Tom
Posted November 4, 2014 at 11:15 am | Permalink

Hello Rick,

My question relates to this post. I have some data that range from 0.0 to 1,960. I added 1.0 to all values and then applied a natural log transformation to make all transformed values >= 0. Next, I computed the arithmetic mean (and 95% CIs) of the log-transformed values (for several levels of categorical variables). I exponentiated these arithmetic means and CIs to get the geometric mean and its CIs. My question is: do I need to subtract 1.0 from the geometric mean and each CI to properly put it back on the original scale? Or is some other adjustment necessary? I became concerned that this approach was not adequate when I added other arbitrary values (0.001, 2, etc.) prior to log transformation and (after exponentiation) was not able to adjust the numbers to the same values by subtracting out the constant added during the translation.

Excellent blog. Thanks for your time.

Tom

• Posted November 4, 2014 at 12:58 pm | Permalink

Yes, you would need to invert the transformation, which would include adding the constant.
When you say "compute mean and CIs," I assume that you are using the standard formula xbar +/- t*s/sqrt(n), where t is a quantile for the t distribution. This formula assumes normality, so whether the CIs are good depends on whether the transformed data is approximately normally distributed for each level of the categorical variables.

• Tom
Posted November 4, 2014 at 1:19 pm | Permalink

Thanks for the prompt reply. I am not sure I follow. Could you clarify using my code (pasted below)? I am using SAS to output means of the logged values (here's my macro code):

&varname = categorical variable

title "&Varname: Logged Value Mean";
proc means data = master;
class &varname;
output out=&varname._m mean=m_&varname lclm=lcl_&varname uclm=ucl_&varname;
run;

In the next step I exponentiate and print the values. Can you advise me on how to adjust this step?

data &varname._m2;
set &varname._m;

*Subtract 1.0 from each;

C_geo_&varname = exp(m_&varname) - 1.0;
C_geo_lcl_&varname = exp(lcl_&varname) - 1.0;
C_geo_ucl_&varname = exp(ucl_&varname) - 1.0;

• Posted November 4, 2014 at 1:31 pm | Permalink

The code is correctly computing the pre-image of the normal CIs of the transformed data.
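
For readers who want to check this numerically, here is a self-contained Python sketch of the same back-transformation. The data values are hypothetical, and a normal quantile stands in for the t quantile (the Python standard library has no t distribution):

```python
import math
from statistics import mean, stdev, NormalDist

# Hypothetical nonnegative data, so log(y + 1) is defined everywhere
y = [0.0, 0.4, 1.2, 3.0, 8.5, 27.0, 110.0, 1960.0]

logged = [math.log(v + 1.0) for v in y]   # shift by 1, then take logs
n = len(logged)
xbar = mean(logged)
se = stdev(logged) / math.sqrt(n)
z = NormalDist().inv_cdf(0.975)           # 95% two-sided normal quantile

# Invert the transformation: exponentiate, then undo the +1 shift
geo_mean = math.exp(xbar) - 1.0
ci = (math.exp(xbar - z * se) - 1.0, math.exp(xbar + z * se) - 1.0)
```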

8. Tolu
Posted January 22, 2013 at 5:52 am | Permalink

Hello Rick,
I am working on human capital investment and economic growth, and my dependent variable is Real GDP, while my independent variables are labor, capital, government expenditure on health and education. my proxies are labor force population, gross fixed capital formation, government expenditure on health and education, life expectancy rate and adult literacy rate. However, I need to know which ones to log and whether to use natural or common base 10 logarithm, and why I should use one instead of the other. Thank you very much.

• Posted January 22, 2013 at 8:20 am | Permalink

A simple rule of thumb is to log-transform variables that range over several orders of magnitude. For example, if one country has a population of one million and another has a population of a billion, that is three orders of magnitude, so a regression model that includes the log(population) is worth considering. For your variables, I would choose base 10 because the results will be more interpretable. If you see that log10(X) is close to 3, you can use mental arithmetic to figure out that X is close to 1,000.

9. Jacob Rodriguez
Posted February 20, 2013 at 2:06 am | Permalink

Hi, Rick. Using log(Y+k) to deal with zero and negative values of the outcome variable seems to be problematic, if I care about the interpretation of beta_1 in E[log(Y+k)] = beta_0 + beta_1 X. I've seen some data analysts exponentiate the right side of the equation and then subtract k to complete the backtransformation. But this isn't right, as E[log(Y+k)] = log GM(Y+k), where GM is the geometric mean. So my question is: for E[log(Y+k)] = beta_0 + beta_1 X, what is the interpretation of beta_1? If k=0, then [exp(beta)-1] has the neat interpretation of percentage change in GM(Y) for a unit increase in X. But if k is not 0, do we have a similar interpretation?

- Jacob

• Posted February 20, 2013 at 5:43 am | Permalink

You've hit on a key issue: how do you interpret statistics that result from (any) transformation of a variable? As you point out, some transformations have simpler interpretations than others. There have been many books and papers written on this topic, and I recommend the ones by AC Atkinson. His book _Plots, Transformations, and Regression_ describes transformations for a wide variety of situations.

10. Neha
Posted March 13, 2013 at 4:41 am | Permalink

Sir,

I have a data set of food expenditures with the consumed quantities. Since there are no data on per-unit prices, I computed price as expenditure/quantity. Then I took the natural logarithm of the prices using Stata, but most of the values came out negative. I'm worried about my results, and I want to know whether the natural log of a price can be negative. What does the negative sign mean? Please help. Thanks in advance.

• Posted June 25, 2014 at 6:06 am | Permalink

Yes. The log can be negative. In your case it means that the ratio is less than 1.

11. Moni
Posted April 13, 2013 at 9:10 am | Permalink

Hello Rick, thanks for the useful blog.

My model (OLS regression) consists of depend. variable being the industry return and then the 3 indep. var. are total market return/oil price return/natural gas return. Each of these returns I want to log

I need to transform the negative numbers to use the log, and I want to do it the first way suggested. I apply =log(1+*value of return*). My question is: should I apply the +1 to all 2608 observations I have, or only to the negative ones?

I am very grateful for an answer here.

Regards, Moni

• Posted April 13, 2013 at 4:23 pm | Permalink

The transformation is applied to the entire variable, so you should apply it to all 2608 observations.

12. kayla
Posted May 4, 2013 at 12:51 pm | Permalink

Hi Rick

I have a savings data set with both negatives and positives. How do I log-transform it in EViews, especially the negatives?

• Posted May 4, 2013 at 7:11 pm | Permalink

See my previous responses, especially to "Gjermund Grimsby."

13. jj
Posted May 9, 2013 at 8:04 pm | Permalink

hello rick,
I have a few independent variables: earnings per share, book value, and fair value. The problem is, I have negative data for earnings per share (EPS). So, should I just transform the EPS to log(1+EPS), or do I need to do the same to book value and fair value?
Tq

• Posted May 10, 2013 at 6:23 am | Permalink

You do not need to transform each variable in the same way. It seems to me that EPS can be less than -1, so that 1+EPS can still be negative; be sure to look at the most negative value of EPS before you decide on a transformation.

Posted May 11, 2013 at 9:50 am | Permalink

Hi Rick,
Is it necessary for all variables to be normally distributed if we want to run a multiple regression? I transformed some of my variables, but the result is still not normal. So, what should I do? Your suggestion is really appreciated.

• Posted May 12, 2013 at 6:31 am | Permalink

No, regression does not require that the explanatory variables be normally distributed. If you do an internet search for "assumptions of linear regression" you will find many articles. If you want to do inference on the least squares estimates (the regression coefficients), you assume normally distributed ERRORS (residuals). That is, the Y variable is linearly related to the X variables plus some unknown error term that is normally distributed.

Posted May 12, 2013 at 1:20 pm | Permalink

hi Rick,
I have a problem with the normality test again. In order to make sure that I can use a parametric test, I need to make sure that my residual distribution is normal. However, the skewness and kurtosis of the residuals are -0.017 and -0.438 respectively, which I think is consistent with normality. Unfortunately, when I run the Kolmogorov-Smirnov test, the significance value is 0.021, which indicates the residuals are not normal. The sample size of my study is 290. Could I just ignore the Kolmogorov-Smirnov test and assume the residuals are normal because the sample is large?

• Posted May 12, 2013 at 6:36 pm | Permalink

In practice, many people just "eyeball" the residuals to check that they are approximately normal. If the residuals are approximately normal, the inference on the regression coefficient will still be good. The quantile-quantile plot in PROC UNIVARIATE is probably more valuable than the K-S test for assessing (approximate) normality.

16. saizal
Posted July 8, 2013 at 1:43 pm | Permalink

Hello Rick. I'm trying to log the size of the firm before I run a GMM regression. Size of firm is defined as:
size = (common stock/book value) x stock market price
But the problem is, there are many negative values. How do I log the data? Should I just treat the negative values as 0? When I log the whole data set using Microsoft Excel, the negative values are treated as 'missing'. Can I just run the GMM regression even though there are missing values, or is there a better alternative to handle the situation? Thanks.

• Posted July 8, 2013 at 1:55 pm | Permalink

First, if you run the regression with missing values, you are excluding all of that data when you construct the regression model. I wouldn't do that.

It seems like the problem is the definition of "size". I think most people expect "size" to be a positive quantity, such as "market capitalization" or something similar. If you can, change the way that you measure size.

Posted July 17, 2013 at 10:22 am | Permalink

Hi, I am working on GDP forecasting. The values are very large, so to make the series stationary I transform the data into log differences. I am using EViews 7. After forecasting I don't know how to convert the predicted values back into original values; they have become very small. Could you help me in this regard? Thanks

• Posted July 17, 2013 at 10:34 am | Permalink

If you are predicting log(GDP), then exponentiate the predicted values to get back to the original scale.

Posted July 17, 2013 at 11:52 am | Permalink

Sir, I am predicting d(log(gdp)) using ARIMA modeling, so I transformed the data by taking the first difference of the logarithm. Now I have the forecasting results, but the transformation changed the data, so I want to bring it back to its original form. I used the following transformation:

d(log(gdp)) = log(gdp) - log(gdp)(-1)

Does exponentiating alone work?

18. xiaoqi
Posted August 3, 2013 at 5:38 pm | Permalink

Dear sir, I am using inward FDI as a dependent variable. It has positive and negative values; the data start from 75, the numbers become larger over the time series, and the lowest value is -15348. If I add 15348 to make all the numbers positive, then the first data value becomes very large. GDP is an explanatory variable; if I add such a big number, will the regression be affected? I use EViews 7. What can I do?

lnFDI is measuring the % change of FDI in the regression, so can I just use FDI minus last period's FDI to calculate the change rate, and use the growth rate of FDI instead, but without the log?

• Posted August 3, 2013 at 10:10 pm | Permalink

Adding a constant to the response only changes the regression by changing the intercept. If you then apply a log transformation, it becomes harder to interpret the regression coefficients in terms of intuitive quantities such as %change. I think in your case you should plot the data. Is FDI linearly related to GDP? If so, don't apply the transform. If not, apply the log(Y+c) transform. Is the transformed response linearly related to the explanatory variables? SAS and other statistical software provide graphical diagnostic plots that you can use to assess the fit of the model.

19. Steven
Posted August 12, 2013 at 6:41 pm | Permalink

Hi Rick,

I have monthly growth data, which is sometimes negative. There is a desire for the growth to be measured on a per day basis (so, growth per day). I had been using the method you described Y*=ln(y+a). But, several on the team are not comfortable with that. An alternative suggested was taking the log of the values prior to differencing them. If this is the dependent variable of a log-log model, would the coefficients with this transformation be interpreted the same way as the Y*=ln(y+a) would be interpreted? What can I do with the per day aspect of this?

I really appreciate this thread, and all the useful feedback you are personally supplying.

• Posted August 13, 2013 at 7:06 am | Permalink

In general, I think it is wise for analysts to be skeptical of advice found on the internet! To answer your question, if you take logs first and then difference them, you are forming the log of a ratio, since log(y_{i+1}) - log(y_i) = log(y_{i+1} /y_i). This would mean that you would be examining the "proportion of change" from one year to the next. Assuming that none of your data are zero, this is a reasonable thing to do. It centers the "no change" situation at 1 instead of 0, and it also eliminates negative numbers (assuming your data are positive).

Try graphing the proportion of change without the log: z_i = y_{i+1} /y_i. Perhaps you can do your regression on that proportion. If so, that's what I'd try.
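
The identity above, that differencing logs equals the log of a ratio, is easy to verify numerically; the series below is hypothetical:

```python
import math

# Hypothetical positive monthly series
y = [100.0, 110.0, 99.0, 120.0, 120.0]

diffs_of_logs = [math.log(b) - math.log(a) for a, b in zip(y, y[1:])]
logs_of_ratios = [math.log(b / a) for a, b in zip(y, y[1:])]
```

A ratio of 1 (no change) maps to 0 on the log scale, which is why this transform centers "no change" at 1 before the log and 0 after it.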

• Steven
Posted August 20, 2013 at 3:34 pm | Permalink

Thanks for your feedback! I really like the log-log because the coefficients are easy to interpret. How would you interpret coefficients in this proportion change model?

• Posted August 20, 2013 at 3:42 pm | Permalink

Same as usual: the change in the proportion when an explanatory variable changes by one unit. But you're right that this model is not used as often as the log-log model.

20. Posted August 19, 2013 at 7:09 am | Permalink

I have a large absolute value as a dependent variable and some equally large independent variables. The other remaining independent variables are in rates. I want to do a regression and introduce logs; how do I go about it?

• Posted August 19, 2013 at 8:59 am | Permalink

Let's say that your independent variables are X1, X2,..., Xp and your dependent variable is Y. To use logs, define LX1=log(X1),..., LXp=log(Xp), and LY=log(Y). Then use regression to model LY as a function of LX1, LX2,..., LXp.

21. Odi
Posted October 10, 2013 at 9:47 am | Permalink

Hi Rick, this blog is really great! I don't have negative numbers, but I do have values below one, which when transformed, turn into negative values. I understand why this happens but I am not sure if this affects my analysis of the data and in which way.

• Posted October 10, 2013 at 9:54 am | Permalink

In general the answer is that it is okay to get negative values. For example, if you are analyzing the GNP of nations in units of Trillions of dollars, small and third-world nations will have a negative log(GNP), whereas major industrial nations will not.

22. Lola
Posted October 28, 2013 at 2:39 pm | Permalink

Hi, I am working with measurements of the conductivity of water and the discharge of water. I want to produce a nonlinear regression to demonstrate the relationship between the two. When I did the log transformation of both these variables, the discharge values all came back negative. How do I fix this? Do I add a constant when I am working out both logarithms?

• Posted October 28, 2013 at 3:11 pm | Permalink

A value that is between 0 and 1 will be transformed to a log-value that is negative. There is nothing to "fix."

23. Nikie
Posted March 10, 2014 at 6:21 am | Permalink

Dear Rick,

Thank you for your effort on this page; it is very helpful. I am forecasting inflation in EViews 7, and some of my relative variables are negative and not normally distributed. The smallest negative number is -1.5%. From what I understood of the above comments, I should take log(x+1.5) of this series to convert the negative numbers into positive ones (and then check for normality again).
Is that correct?
Thank you!

• Posted March 10, 2014 at 9:38 am | Permalink

Yes, that sounds right. Two comments:
1) log(0) is not defined, so add a number GREATER than 1.5 to make sure that when x=-1.5, you don't get log(0). The actual number doesn't matter much: 1.6 would work, as would 1.51 or even 2.
2) I assume from your note that x is measured in terms of percent, so that min(x)=-1.5. If min(x)=-1.5%= -0.015, then you can divide my numbers by 100.

24. jenny
Posted April 19, 2014 at 12:41 am | Permalink

Hi,Sir...
I am doing my research proposal. I use interest rate, inflation, and the deflator for my independent variables. The data contain both positive and negative values. I use this technique to log the data: lx1=@recode(x1>0,log(1+x1),-log(1-x1)) (http://forums.eviews.com/viewtopic.php?f=3&t=1212)
But at the end of the year, the value in the data is NA. Is anything wrong? Sir, could you suggest a better technique for me to log the data?
I appreciate it. Thank you!

• Posted April 19, 2014 at 6:47 am | Permalink

The only way that I can see that you would get NA is if the original data had an NA. Your transformation (which I like, and which can be written simply as sgn(x)*log(1+abs(x))) is defined for all real numbers, so the NAs are coming from the data, not from the transformation.

It sounds like you are doing this transformation on the explanatory variables. For regressions and other analyses, there is nothing wrong with having negative values in explanatory variables. Furthermore, log transformations are most useful for variables that stretch over many orders of magnitude (such as populations of cities), and none of the variables that you mention have large ranges. If it were me, I'd try using the original variables instead of transforming them.

25. kenny
Posted April 30, 2014 at 3:46 pm | Permalink

Hi - I'm curious on your thoughts about choice of "a" for log(Y+a). The two minimum values in my data (negative, of course) don't appear to be outliers when viewed on the original scale. Depending on the choice of "a", they can be made to look like severe outliers (if Y+a is very near zero or one) or not to appear like outliers at all (as Y+a increases). This has obvious implications when analyzing via regression. I know you mention 0.001 or 1 as minimums above, but are there any good resources for determining a proper "a"?

Thanks for your time and effort.

• Posted May 1, 2014 at 8:49 am | Permalink

You ask an interesting question. When applying a nonlinear transformation, you are going to change the distribution of the response. Usually you are starting with a response distribution that is skewed and you are trying to transform it into a distribution that is closer to normal. That is why the log transformation is one of the so-called "transformations to normality" or "normalizing transformation." As you point out, the choice of 'a' affects the distribution of the transformed variables. So which value of 'a' to choose?

I'm guessing that you should strive to choose a value that makes your transformed response most nearly normal. If a = -min(y) + 0.0001, then the transformed response will be strongly negatively skewed relative to its original skewness. If a = -min(y) + 1, then it will be moderately negatively skewed relative to its original skewness. If a = -min(y) + 1000, then the skewness hardly changes at all (assuming the range of Y is small relative to 1000). Unfortunately I do not have a reference for this idea, but it reminds me of the "Box-Cox transformation," which optimizes a parameter in a family of power transformations.
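
This effect is easy to see numerically. The following Python sketch (with made-up right-skewed data) computes the sample skewness of log(y + a) for several shifts a:

```python
import math

def skewness(xs):
    """Method-of-moments sample skewness."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data that include a negative value (min = -2)
y = [-2.0, 0.5, 1.0, 2.0, 3.0, 5.0, 9.0, 20.0, 60.0]

# a slightly above -min(y) exaggerates negative skew; a large a barely changes it
for a in (2.001, 3.0, 10.0, 1000.0):
    transformed = [math.log(v + a) for v in y]
    print(a, skewness(transformed))
```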

26. shweta patel
Posted May 25, 2014 at 3:36 pm | Permalink

I have data in the form of area in square cm. What kind of transformation should I use to minimize or stabilize the variability in the data?

27. Samson - ECM
Posted June 20, 2014 at 8:11 am | Permalink

Hi Rick,
I am constructing an Error Correction Model using panel data in STATA. I was trying to obtain the natural logs of my dependent variable (ratio of capital flight to real GDP) and independent variables (interest rate differential, financial openness [Chinn-Ito index], real exchange rate) for 4 countries. I am, however, constrained since all my variables contain negative numbers. I was keen on transforming since they all have different magnitudes. How would you advise I proceed in such an instance?

• Posted June 20, 2014 at 10:21 am | Permalink

If "4 countries" means "4 observations," then your regression isn't going to be very good. But to answer your question, if your goal is to reduce the order of magnitudes in a variable, you can use the log-modulus transformation: y --> sign(y)*log(|y| + 1), which is a continuous transformation that preserves signs.

28. Nils
Posted July 17, 2014 at 9:34 am | Permalink

Hi Rick,

I have absolutely no question at all, just wanted to say that I'm absolutely amazed by your responses here to a bunch of very badly phrased questions (even from people who aren't using SAS at all!). You're doing the work of dozens of undergrad thesis advisors - amazing!

• Posted July 17, 2014 at 9:48 am | Permalink

Thanks, you made my day. Yes, some of the questions contain fewer details than I would prefer, and some of my responses are no more than educated guesses. I readily admit that I am not The World's Foremost Expert on Transformations, so I hope none of these folks are relying exclusively on my judgement. Remember, never trust advice found on the Internet! :-)

29. patrick bengana
Posted August 12, 2014 at 4:24 am | Permalink

Dear Rick,
I am using EViews to test for normality of inflation values, but even when I take the log or add a constant, the variable does not become normally distributed. The inflation values I am using are quarterly changes in inflation, so some of the values are negative. Please advise me on how to make the variable normally distributed. Thanks.

• Posted August 12, 2014 at 2:26 pm | Permalink

Not all data are normally distributed. Not all data can be made normal by taking a logarithm. But that is okay because normality is not a requirement to run a linear regression. After you run a linear regression you should check the RESIDUALS of the regression. If these RESIDUALS are normally distributed, then that is evidence that your regression model captured the relationship between your response and your explanatory variables.
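As a minimal illustration of this advice (check the residuals of the fit, not the raw response), here is a short pure-Python sketch with made-up data; the data values are hypothetical:

```python
# Hypothetical data: y is roughly linear in x, so the RESIDUALS
# (not y itself) are what should look approximately normal.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Ordinary least squares fit of y = intercept + slope * x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = my - slope * mx

# The residuals are the quantities to examine for normality
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("residuals:", [round(r, 3) for r in residuals])
```

In practice you would pass the residuals to a normality test or a Q-Q plot (in SAS, PROC REG with the RESIDUAL option followed by PROC UNIVARIATE) rather than testing the response variable itself.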

30. Mary
Posted September 17, 2014 at 3:56 am | Permalink

Hi Rick,
I have marginal cost variables (obtained from a panel data analysis) that must be log-transformed before they enter the equation. But some of the data are negative, so the log transformation is undefined for those values. How can I overcome this? Could you please help? Thank you.

• Posted September 17, 2014 at 5:35 am | Permalink

31. Svilena
Posted October 9, 2014 at 9:20 am | Permalink

Hello Rick,
I run a fixed-effects model where the dependent variable is the Gini index measuring income inequality and the independent variables are: FDI stock (% of GDP), inflation rate (% change of the consumer price index), secondary school enrollment ratio, government expenditure (% of GDP), services value added (% of GDP), and GDP per capita in PPP (constant 2011 international $). Except for GDP per capita and inflation rate, which range up to 30,000 and 1,000 respectively, all the other variables range up to 100. Do you think it is appropriate to take the natural log only of GDP per capita, as is the practice in similar studies, and work with the original values of all the other variables? Or is it better to take natural logs of all the variables in the model? Some studies do it the first way and some the second, and I cannot understand the logic behind this choice. (Moreover, in my case taking logs of all the variables doesn't help make them normally distributed.)

• Posted October 9, 2014 at 9:25 am | Permalink

It is fine to take the log of a subset of the variables.

32. Felix
Posted December 25, 2014 at 3:27 am | Permalink

Hi Rick
I have a sample of 30 FTSE 100 companies' share prices and the salaries of their respective CEOs. I have calculated averages for the share prices and the salaries so that I can run a regression. How do I transform these averages into logs in EViews 8?

• Posted December 26, 2014 at 7:35 am | Permalink

Sorry, but I do not use Eviews. I use SAS software.

33. Aemro
Posted February 26, 2015 at 1:31 pm | Permalink

What is the best transformation method to use if none of the Box-Cox transformations is satisfactory for a continuous response model? What other alternatives can I use?

1. [...] which the first expression is true. For example, in a previous post, I described several ways to handle negative values in evaluating a logarithmic data transformation. You might assume that the following statements prevent the LOG function from evaluating negative [...]

2. [...] defines a helper function, SafeLog, that returns the natural log of positive quantities and returns missing values for non-positive quantitie... [...]

3. [...] run-time library to include special user-defined functions. In a previous blog post I discussed two different ways to apply a log transformation when your data might contain missing values and neg.... I'll use the log transformation example to show how to define and call user-defined functions in [...]

4. [...] to obtain a missing value for the square root of a negative number. As I showed in my article on how to handle negative values in a log function, you can use the SAS/IML CHOOSE function to return missing values: y2 = sqrt( choose(x>=0, x, [...]

5. […] you can handle zero counts in any mathematically consistent way. I have previously written about how to use a log transformation on data that contain zero or negative values. The idea is simple: instead of the standard log transformation, use the modified transformation x […]

6. […] my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about […]

7. By Beware the naked LOC - The DO Loop on August 7, 2014 at 10:31 am

[…] finds elements of a vector or matrix that satisfy some condition. For example, if you are going to apply a logarithmic transform to data, you can use the LOC function to find all of the positive […]

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, statistical graphics, statistical simulation, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.