Log transformations: How to handle negative data values?

The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance-stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that range over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a), where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. In either case, a = b – min(Y), where b is either the small positive number or 1.

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses b=1 and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

proc iml;
Y = {-3,1,2,.,5,10,100}; /** contains a negative value and a missing value **/
LY = log10(Y + 1 - min(Y)); /** translate, then transform **/
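For readers who are not using SAS (several questions below involve EViews, Stata, or Excel), the same translate-then-transform step can be sketched in Python with numpy. This is an illustrative aside, not part of the original program; NaN stands in for the SAS missing value.

```python
import numpy as np

# Mirrors the SAS/IML example above; the SAS missing value is NaN here.
y = np.array([-3.0, 1.0, 2.0, np.nan, 5.0, 10.0, 100.0])

# Choose a so that min(y + a) = 1, then take base-10 logs.
# np.nanmin skips the missing value, as the IML MIN function does.
a = 1.0 - np.nanmin(y)
ly = np.log10(y + a)
```

The missing value propagates through the transformation: NaN + a is NaN, and log10(NaN) is NaN.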

Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

n = nrow(Y);
LogY = j(n,1); /** allocate result vector **/
do i = 1 to n; /** loop is inefficient **/
   if Y[i] > 0 then LogY[i] = log10(Y[i]);
   else LogY[i] = .;
end;

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign LogY, as shown in the following statements.

/** more efficient statements **/
LogY = j(nrow(Y),1,.); /** allocate missing **/
idx = loc(Y > 0); /** find indices where Y > 0 **/
if ncol(idx) > 0 then 
   LogY[idx] = log10(Y[idx]);
print Y LY LogY;

The preceding statements initially define LogY to be a vector of missing values. The LOC function finds the indices of Y for which Y is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in LogY for any element for which Y is nonpositive or missing.
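The allocate-then-overwrite idiom carries over to other array languages. As an aside, here is a Python/numpy sketch of the same logic, with NaN playing the role of the missing value:

```python
import numpy as np

y = np.array([-3.0, 1.0, 2.0, np.nan, 5.0, 10.0, 100.0])

# Allocate a vector of missing values (NaN), then overwrite only the
# entries where y is strictly positive -- the J/LOC idiom from PROC IML.
log_y = np.full_like(y, np.nan)
mask = y > 0          # a NaN comparison is False, so missing stays missing
log_y[mask] = np.log10(y[mask])
```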

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.

tags: Data Analysis, Efficiency, Statistical Programming


  1. Rick Wicklin
    Posted June 2, 2011 at 2:02 pm | Permalink

    Did you know that SAS now has a LOG1PX function that "returns the log of 1 plus the argument"? It's true!
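Other environments have an analogous function; for example, numpy provides np.log1p. As an illustrative aside, this Python sketch shows why such functions matter: computing log(1 + x) directly loses precision when x is tiny, because 1 + x rounds in double precision before the log is taken.

```python
import numpy as np

x = 1e-15
naive = np.log(1.0 + x)    # 1 + 1e-15 is rounded in double precision first
better = np.log1p(x)       # computes log(1 + x) without forming 1 + x
```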

  2. Saumya
    Posted September 1, 2011 at 6:47 am | Permalink

    Dear Rick

    My data set includes stock returns for around 1000 companies. In some cases the return data shows figures from -34.5 to -108. How do I make a log transformation in this case? How large should the constant value be for this kind of data? Please help.

    • Posted September 1, 2011 at 7:50 am | Permalink

      It depends somewhat on what you're trying to do, but you might want to express the returns as a percentage, measured from the start of the time period (1 yr, 5 yrs, or whatever). Then the negative returns are bounded below by -100 percent, and you can safely compute log(101 + return).

      • Qunna
        Posted September 24, 2014 at 1:04 pm | Permalink

        Dear Rick,

        My instructor is very reluctant to model on percentages. He said averaging percentages did not make sense since we had different denominators. Do you normally model on percentages? Thanks.

        • Posted September 24, 2014 at 1:46 pm | Permalink

          I do not. However, I don't think there is an inherent reason to avoid proportions and percentages. It is true that proportions are different from continuous unbounded data. However, power transformations are still useful and the analogue of the log transformation for proportions is the logit transformation: logit(y) = log(y/(1-y)). Atkinson's (1985) book on "Plots, Transformations, and Regression" has a whole chapter devoted to transformations for percentages and proportions.
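As an aside for readers who want to experiment, the logit transform mentioned here is a one-liner in Python/numpy (the helper name is my own):

```python
import numpy as np

def logit(p):
    """Logit transform for proportions strictly inside (0, 1):
    log(p / (1 - p)). Maps (0, 1) onto the whole real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))
```

A proportion of 0.5 maps to 0, and the transform is symmetric about that point.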

      • Razafimahery
        Posted March 20, 2015 at 5:16 am | Permalink

        Thanks Rick,
        When the negative values are bounded by -100 percent, can you explain why you use log(101 + x) and not log(100 + x)?

  3. Posted October 11, 2012 at 7:54 am | Permalink

    Dear Rick

    I have a data set for which the dependent variable is both positive and negative. Would you say an alternative is to take absolute values, take logs, and then multiply by -1 where the original values were negative? To me this seems reasonable, but I am not sure whether I can still interpret my coefficients in terms of percentage changes.

    All best

    • Posted October 11, 2012 at 8:19 am | Permalink

      Well, I don't know your application, but I don't think I would recommend that approach because the LOG function has a singularity at zero. The LOG transformation is best for mapping changes that are between 0 and infinity. For example, if you buy a stock at a certain price, the quantity Price/Purchase_Price is always positive. It can be log-transformed. It sounds like your variable might be a "relative change" such as (Price-Purchase_Price)/Purchase_Price, which can be positive or negative. I wouldn't use a log transform for the second quantity.

      • Rodwell Tundu
        Posted September 28, 2014 at 12:44 pm | Permalink

        Good day,
        I am commenting on this particular reply because you told someone facing a problem similar to mine to refer to the solution you provided here. I am using EViews 6; in my study I have encountered negative GDP and FDI growth rates. Please advise on how I could apply your transformations to my data. Thank you.

        • Posted September 28, 2014 at 1:14 pm | Permalink

          Read the article and do what it says. You will need to learn how to transform data in EViews.

          • Rodwell Tundu
            Posted September 28, 2014 at 1:54 pm | Permalink

            Thank you for the prompt response. I did exactly as you instructed in solution one of the article, but the software is now asking me to define "min". How do I do that?

          • Posted September 28, 2014 at 2:12 pm | Permalink

            "Min" will be the smallest value for the variable that you are transforming. So if the smallest (=most negative) GDP growth is -1.2, you would use -1.2 as "Min" for GDP.

    • Posted June 25, 2014 at 6:01 am | Permalink

      Upon rereading your comment, perhaps you were attempting to form a transformation like this: y --> sign(y)*log(|y| + 1)

  4. abera daba
    Posted October 15, 2012 at 4:40 am | Permalink

    Dear Rick
    I have a data set with both positive and negative values. I changed the largest negative number among the treatments to zero by adding an equal positive number to all treatments, then I analyzed the data with SAS. But I am not sure this is correct; would you please help me?

    All best

  5. Sara
    Posted October 15, 2012 at 7:49 am | Permalink

    Dear Rick

    I have to apply regression to Return on equity ratios, return on asset ratios, GDP growth, Inflation % and Real interest rate.

    The problem lies where I want to take the natural log of the data for all variables. All data are in % form but have positive or negative values, e.g., ROA = 0.328% and ROE = -7.92%. I can easily apply the natural log to the 0.328% value and get -1.11474, but how can I apply the natural log to the negative value? If you could, please show me how and give me a figure to see.

    Is there any way I can just write 0 for this entry and perform my regression? It may give wrong results.

    • Posted October 16, 2012 at 6:05 am | Permalink

      Use the proportion y = Price / Purchase_price.
      This is always in the interval (0, infinity) for non-bankrupt stocks, and y=1 means that the price has not changed since it was purchased. If this quantity spans several orders of magnitude, you can apply the log(y) transform.

  6. Clarisa Anne
    Posted October 20, 2012 at 8:03 pm | Permalink

    Sir, I am using Eviews 7 and I have values in my data set for presidential approval ratings which are negative. I need to use log of the ratings but Eviews cannot compute it. How do I get rid of the negative values? Thank you!

    • Posted October 20, 2012 at 8:49 pm | Permalink

      You don't need to use the log of the rating; just use the ratings as given. Logarithms are used when data span many orders of magnitude, which doesn't apply to approval ratings. If you insist on transforming the data, use y = (r+100)/200, which maps the rating (r) from [-100,100] to [0,1].

  7. Rena
    Posted November 15, 2012 at 10:58 am | Permalink

    Hello Rick!

    I have a question regarding the interpretation of log transformed data where the constant was added to avoid some negative values. Should we still interpret the results in the way that 1% change in independent value leads to ß % (which is a coefficient found after regression) change in the dependent one? (both dependent and independent variables were log transformed)

    • Posted November 15, 2012 at 11:08 am | Permalink

      No, the interpretation is that a unit change in the LOG of the indep var leads to change of beta in the LOG of the dependent variable. This is a much more difficult interpretation because the "unit change in log(x)" now depends on x. At small values of x, log(x) changes quickly; for large values of x, log(x) changes slowly.

      • Tom
        Posted November 4, 2014 at 11:15 am | Permalink

        Hello Rick,

        My question relates to this post. I have some data that range from 0.0 to 1,960. I added 1.0 to all values and then applied a natural log transformation so that all transformed values are >= 0. Next, I computed the arithmetic mean (and 95% CIs) of the log-transformed values (for several levels of categorical variables). I exponentiated these arithmetic means and CIs to get the geometric mean and its CIs. My question is: do I need to subtract 1.0 from the geometric mean and each CI to properly put it back on the original scale, or is some other adjustment necessary? I became concerned that this approach was not adequate when I added other arbitrary values (0.001, 2, etc.) prior to the log transformation and (after exponentiation) was not able to adjust the numbers to the same values by subtracting out the constant added during the linear transformation.

        Excellent blog. Thanks for your time.


        • Posted November 4, 2014 at 12:58 pm | Permalink

          Yes, you would need to invert the transformation, which would include adding the constant.
          When you say "compute mean and CIs," I assume that you are using the standard formula xbar +/- t*s/sqrt(n), where t is a quantile for the t distribution. This formula assumes normality, so whether the CIs are good depends on whether the transformed data is approximately normally distributed for each level of the categorical variables.

          • Tom
            Posted November 4, 2014 at 1:19 pm | Permalink

            Thanks for the prompt reply. I am not sure I follow. Could you clarify using my code (pasted below)? I am using SAS to output means of the logged values (here's my macro code):

            &varname = categorical variable
            log_adj_wipval_b = logged dependent variable

            title "&Varname: Logged Value Mean";
            proc means data = master;
            class &varname;
            var log_adj_wipval_b ;
            output out=&varname._m mean=m_&varname lclm=lcl_&varname uclm=ucl_&varname;

            In the next step I exponentiate and print the values. Can you advise me on how to adjust this step?

            data &varname._m2;
            set &varname._m;

            *Subtract 1.0 from each;

            C_geo_&varname = exp(m_&varname) - 1.0;
            C_geo_lcl_&varname = exp(lcl_&varname) - 1.0;
            C_geo_ucl_&varname = exp(ucl_&varname) - 1.0;

          • Posted November 4, 2014 at 1:31 pm | Permalink

            The code is correctly computing the pre-image of the normal CIs of the transformed data.

  8. Tolu
    Posted January 22, 2013 at 5:52 am | Permalink

    Hello Rick,
    I am working on human capital investment and economic growth. My dependent variable is real GDP, while my independent variables are labor, capital, and government expenditure on health and education. My proxies are labor force population, gross fixed capital formation, government expenditure on health and education, life expectancy rate, and adult literacy rate. However, I need to know which ones to log and whether to use the natural or common (base-10) logarithm, and why I should use one instead of the other. Thank you very much.

    • Posted January 22, 2013 at 8:20 am | Permalink

      A simple rule of thumb is to log-transform variables that range over several orders of magnitude. For example, if one country has a population of one million and another has a population of a billion, that is three orders of magnitude, so a regression model that includes the log(population) is worth considering. For your variables, I would choose base 10 because the results will be more interpretable. If you see that log10(X) is close to 3, you can use mental arithmetic to figure out that X is close to 1,000.

  9. Jacob Rodriguez
    Posted February 20, 2013 at 2:06 am | Permalink

    Hi, Rick. Using log(Y+k) to deal with zero and negative values of the outcome variable seems to be problematic if I care about the interpretation of beta_1 in E[log(Y+k)] = beta_0 + beta_1 X. I've seen some data analysts exponentiate the right side of the equation and then subtract k to complete the back-transformation. But this isn't right, as E[log(Y+k)] = log GM(Y+k), where GM is the geometric mean. So my question is: for E[log(Y+k)] = beta_0 + beta_1 X, what is the interpretation of beta_1? If k=0, then [exp(beta_1)-1] has the neat interpretation of percentage change in GM(Y) for a unit increase in X. But if k is not 0, do we have a similar interpretation?

    - Jacob

    • Posted February 20, 2013 at 5:43 am | Permalink

      You've hit on a key issue: how do you interpret statistics that result from (any) transformation of a variable? As you point out, some transformations have simpler interpretations than others. There have been many books and papers written on this topic, and I recommend the ones by AC Atkinson. His book _Plots, Transformations, and Regression_ describes transformations for a wide variety of situations.

  10. Neha
    Posted March 13, 2013 at 4:41 am | Permalink


    I have a data set of food expenditures with the consumed quantities. Since there are no data about per-unit prices, I computed them as expenditure/quantity. Then I took the natural logarithm of the prices using Stata, but most of the values came out negative. I'm worried about my results, and I want to know: can natural log values of prices be negative? What does the negative sign mean? Please help. Thanks in advance.

    • Posted June 25, 2014 at 6:06 am | Permalink

      Yes. The log can be negative. In your case it means that the ratio is less than 1.

  11. Moni
    Posted April 13, 2013 at 9:10 am | Permalink

    Hello Rick, thanks for the useful blog.

    My model (OLS regression) consists of depend. variable being the industry return and then the 3 indep. var. are total market return/oil price return/natural gas return. Each of these returns I want to log

    I need to transform the negative numbers to use the log, and I want to do it the first way suggested.
    I apply =log(1 + *value of return*). My question is: should I apply the +1 to all 2608 observations I have, or only to the negative ones?

    I am very grateful for an answer here.

    Regards, Moni

    • Posted April 13, 2013 at 4:23 pm | Permalink

      The transformation is applied to the entire variable, so you should apply it to all 2608 observations.

  12. kayla
    Posted May 4, 2013 at 12:51 pm | Permalink

    Hi Rick

    I have a savings data set with both negatives and positives. How do I log-transform it in EViews, especially the negatives?

  13. jj
    Posted May 9, 2013 at 8:04 pm | Permalink

    Hello Rick,
    I have a few independent variables: earnings per share, book value, and fair value. The problem is that I have negative data for earnings per share (EPS). So should I just transform EPS to log(1 + EPS), or do I need to do the same to book value and fair value?

    • Posted May 10, 2013 at 6:23 am | Permalink

      You do not need to transform each variable in the same way. It seems to me that EPS can be less than -1, so that 1+EPS can still be negative, so be sure to look at the most negative value of EPS before you decide on a transformation.

  14. nad
    Posted May 11, 2013 at 9:50 am | Permalink

    Hi Rick,
    Is it necessary for all variables to be normally distributed if we want to run a multiple regression? I transformed some of my variables but the result is still not normal. So what should I do? Your suggestion is really appreciated.

    • Posted May 12, 2013 at 6:31 am | Permalink

      No, regression does not require that the explanatory variables be normally distributed. If you do an internet search for "assumptions of linear regression" you will find many articles. If you want to do inference on the least square estimates (the regression coefficients), you assume normally distributed ERRORS (residuals). That is, the Y variable is linearly related to the X variables plus some unknown error term that is normally distributed.

  15. nad
    Posted May 12, 2013 at 1:20 pm | Permalink

    Hi Rick,
    I have a problem with a normality test again. In order to make sure that I can use a parametric test, I need to make sure that my residual distribution is normal. However, when I refer to the skewness and kurtosis of the residuals, they are -0.017 and -0.438 respectively, which I think is consistent with normality. Unfortunately, when I do the Kolmogorov-Smirnov test, the significance value is 0.021, which indicates the residuals are not normal. The sample for my study is 290. Could I just ignore the Kolmogorov-Smirnov test and assume the residuals are normal since the data set is large?

    • Posted May 12, 2013 at 6:36 pm | Permalink

      In practice, many people just "eyeball" the residuals to check that they are approximately normal. If the residuals are approximately normal, the inference on the regression coefficient will still be good. The quantile-quantile plot in PROC UNIVARIATE is probably more valuable than the K-S test for assessing (approximate) normality.

  16. saizal
    Posted July 8, 2013 at 1:43 pm | Permalink

    Hello Rick. I'm trying to log the size of firm before I run a GMM regression. Size of firm is defined as:
    size = (common stock/book value) x stock market price
    But the problem is that there are many negative values. How do I log the data? Should I just treat the negative values as 0? When I log the whole data set using Microsoft Excel, the negative values are treated as 'missing'. Can I just run the GMM regression even though there are missing values? Is there a better alternative for handling the situation? Thanks.

    • Posted July 8, 2013 at 1:55 pm | Permalink

      First, if you run the regression with missing values, you are excluding all of that data when you construct the regression model. I wouldn't do that.

      It seems like the problem is the definition of "size". I think most people expect "size" to be a positive quantity, such as "market capitalization" or something similar. If you can, change the way that you measure size.

  17. arshad
    Posted July 17, 2013 at 10:22 am | Permalink

    Hi, I am working on GDP forecasting. The amounts are very high, so to make the series stationary I transformed the data into log differences. I am using EViews 7. After forecasting, I don't know how to convert these values into the original values; they have become very small. Could you please help me in this regard? Thanks.

    • Posted July 17, 2013 at 10:34 am | Permalink

      If you are predicting log(GDP), then exponentiate the predicted values to get back to the original scale.

      • arshad
        Posted July 17, 2013 at 11:52 am | Permalink

        Sir, I am predicting d(log(gdp)) using ARIMA modeling, so I transformed the data by taking the first difference of the logarithm. Now I have the forecasting results, but the data are transformed, so I want to bring them back to their original form.

        Does exponentiating alone work?

  18. xiaoqi
    Posted August 3, 2013 at 5:38 pm | Permalink

    Dear Sir, I am using inward FDI as a dependent variable. It has positive and negative numbers, the data start from 75 and become larger over the time series, and the lowest number is -15348. If I add 15348 to make all the numbers positive, then the first data value becomes very large. GDP is an explanatory variable; will the regression be affected by adding such a big number? I use EViews 7. What can I do?

    lnFDI measures the % change of FDI in the regression, so can I just use FDI minus last period's FDI to calculate the change rate, and use the growth rate of FDI instead, but without the log?

    • Posted August 3, 2013 at 10:10 pm | Permalink

      Adding a constant to the response only changes the regression by changing the intercept. If you then apply a log transformation, it becomes harder to interpret the regression coefficients in terms of intuitive quantities such as % change. I think in your case you should plot the data. Is FDI linearly related to GDP? If so, don't apply the transform. If not, apply the log(Y+c) transform. Is the transformed response linearly related to the explanatory variables? SAS and other statistical software provide graphical diagnostic plots that you can use to assess the fit of the model.

  19. Steven
    Posted August 12, 2013 at 6:41 pm | Permalink

    Hi Rick,

    I have monthly growth data, which is sometimes negative. There is a desire for the growth to be measured on a per day basis (so, growth per day). I had been using the method you described Y*=ln(y+a). But, several on the team are not comfortable with that. An alternative suggested was taking the log of the values prior to differencing them. If this is the dependent variable of a log-log model, would the coefficients with this transformation be interpreted the same way as the Y*=ln(y+a) would be interpreted? What can I do with the per day aspect of this?

    I really appreciate this thread, and all the useful feedback you are personally supplying.

    • Posted August 13, 2013 at 7:06 am | Permalink

      In general, I think it is wise for analysts to be skeptical of advice found on the internet! To answer your question, if you take logs first and then difference them, you are forming the log of a ratio, since log(y_{i+1}) - log(y_i) = log(y_{i+1}/y_i). This means you would be examining the "proportion of change" from one period to the next. Assuming that none of your data are zero, this is a reasonable thing to do. It centers the "no change" situation at 1 instead of 0, and it also eliminates negative numbers (assuming your data are positive).

      Try graphing the proportion of change without the log: z_i = y_{i+1} /y_i. Perhaps you can do your regression on that proportion. If so, that's what I'd try.
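The identity log(y_{i+1}) - log(y_i) = log(y_{i+1}/y_i) is easy to verify numerically. As an aside, a Python/numpy sketch with a made-up series:

```python
import numpy as np

y = np.array([100.0, 110.0, 99.0])   # a hypothetical monthly series

# Differencing the logs equals the log of the ratio of consecutive values:
# log(y[i+1]) - log(y[i]) = log(y[i+1] / y[i])
log_diff = np.diff(np.log(y))
ratios = y[1:] / y[:-1]
```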

      • Steven
        Posted August 20, 2013 at 3:34 pm | Permalink

        Thanks for your feedback! I really like the log-log because the coefficients are easy to interpret. How would you interpret coefficients in this proportion change model?

        • Posted August 20, 2013 at 3:42 pm | Permalink

          Same as usual: the change in the proportion when an explanatory variable changes by one unit. But you're right that this model is not used as often as the log-log model.

  20. Posted August 19, 2013 at 7:09 am | Permalink

    I have a dependent variable with large absolute values and some equally large independent variables. The remaining independent variables are rates. I want to do a regression and would like to introduce logs; how do I go about it?

    • Posted August 19, 2013 at 8:59 am | Permalink

      Let's say that your independent variables are X1, X2,..., Xp and your dependent variable is Y. To use logs, define LX1=log(X1),..., LXp=log(Xp), and LY=log(Y). Then use regression to model LY as a function of LX1, LX2,..., LXp.

  21. Odi
    Posted October 10, 2013 at 9:47 am | Permalink

    Hi Rick, this blog is really great! I don't have negative numbers, but I do have values below one, which when transformed, turn into negative values. I understand why this happens but I am not sure if this affects my analysis of the data and in which way.

    • Posted October 10, 2013 at 9:54 am | Permalink

      In general the answer is that it is okay to get negative values. For example, if you are analyzing the GNP of nations in units of Trillions of dollars, small and third-world nations will have a negative log(GNP), whereas major industrial nations will not.

  22. Lola
    Posted October 28, 2013 at 2:39 pm | Permalink

    Hi, I am working with measurements of the conductivity of water and discharge of water. I want to produce a nonlinear regression to demonstrate the relationship between the two. When I did the log transformation of both these variables, the discharge values all came back negative. How do I fix this? Do I add a constant when I am working out both logarithms?

    • Posted October 28, 2013 at 3:11 pm | Permalink

      A value that is between 0 and 1 will be transformed to a log-value that is negative. There is nothing to "fix."

  23. Nikie
    Posted March 10, 2014 at 6:21 am | Permalink

    Dear Rick,

    Thank you for your effort on this page; it is very helpful. I am forecasting inflation in EViews 7, and some of my relative variables are negative and not normally distributed. The smallest negative number is -1.5%. From what I understood of the above comments, I should take log(x+1.5) of this series to convert the negative numbers into positive ones (and then check for normality again).
    Is that correct?
    Thank you!

    • Posted March 10, 2014 at 9:38 am | Permalink

      Yes, that sounds right. Two comments:
      1) log(0) is not defined, so add a number GREATER than 1.5 to make sure that when x=-1.5, you don't get log(0). The actual number doesn't matter much: 1.6 would work, as would 1.51 or even 2.
      2) I assume from your note that x is measured in terms of percent, so that min(x)=-1.5. If min(x)=-1.5%= -0.015, then you can divide my numbers by 100.

  24. jenny
    Posted April 19, 2014 at 12:41 am | Permalink

    I am doing my research proposal. I use interest rate, inflation, and a deflator for my independent variables. The data contain both positive and negative values. I used this technique, lx1=@recode(x1>0,log(1+x1),-log(1-x1)), to log the data (http://forums.eviews.com/viewtopic.php?f=3&t=1212).
    But at the end year, the value is NA... Is anything wrong? Sir, could you suggest a better technique for me to log the data?
    I appreciate it. Thank you!

    • Posted April 19, 2014 at 6:47 am | Permalink

      The only way that I can see that you would get NA is if the original data had an NA. Your transformation (which I like, and which can be written simply as sgn(x)*log(1+abs(x))) is defined for all real numbers, so the NAs are coming from the data, not from the transformation.

      It sounds like you are doing this transformation on the explanatory variables. For regressions and other analyses, there is nothing wrong with having negative values in explanatory variables. Furthermore, log-transformations are most useful for variables that stretch over many orders of magnitudes (such as population of cities), and none of the variables that you mention have large ranges. If it were me, I'd try using the original variables instead of transforming them.

  25. kenny
    Posted April 30, 2014 at 3:46 pm | Permalink

    Hi - I'm curious on your thoughts about choice of "a" for log(Y+a). The two minimum values in my data (negative, of course) don't appear to be outliers when viewed on the original scale. Depending on the choice of "a", they can be made to look like severe outliers (if Y+a is very near zero or one) or not to appear like outliers at all (as Y+a increases). This has obvious implications when analyzing via regression. I know you mention 0.001 or 1 as minimums above, but are there any good resources for determining a proper "a"?

    Thanks for your time and effort.

    • Posted May 1, 2014 at 8:49 am | Permalink

      You ask an interesting question. When applying a nonlinear transformation, you are going to change the distribution of the response. Usually you are starting with a response distribution that is skewed and you are trying to transform it into a distribution that is closer to normal. That is why the log transformation is one of the so-called "transformations to normality" or "normalizing transformation." As you point out, the choice of 'a' affects the distribution of the transformed variables. So which value of 'a' to choose?

      I'm guessing that you should strive to choose a value that makes your transformed response most nearly normal. If a = 0.0001 – min(y), then the response will be strongly negatively skewed relative to its original skewness. If a = 1 – min(y), then the response will be moderately negatively skewed relative to its original skewness. If a = 1000 – min(y), then the skewness hardly changes at all (assuming the range of Y is small). Unfortunately I do not have a reference for this idea, but it reminds me of the "Box-Cox transformation," which optimizes a parameter in a family of power transformations.
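For reference, the Box-Cox family has a simple closed form, and the log transform is its lambda = 0 member. The following Python sketch is illustrative only (box_cox is my own helper, not a library function):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox power transform for positive y (a sketch):
    (y**lam - 1) / lam for lam != 0, and log(y) in the limit lam -> 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y**lam - 1.0) / lam
```

As lam approaches 0, the power transform converges to the log transform, which is why the log is considered a member of this family.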

  26. shweta patel
    Posted May 25, 2014 at 3:36 pm | Permalink

    I have data in the form of areas in square cm. What kind of transformation should I use to minimize or stabilize the variability in the data?

  27. Samson - ECM
    Posted June 20, 2014 at 8:11 am | Permalink

    Hi Rick,
    I am constructing an error correction model using panel data in Stata. I was trying to obtain the natural logs of my dependent variable (the ratio of capital flight to real GDP) and independent variables (interest rate differential, financial openness [the Chinn-Ito index], and the real exchange rate) for 4 countries. I am constrained, however, since all my variables contain negative numbers. I was keen on transforming since they all have different magnitudes. How would you advise I proceed in such an instance?

    • Posted June 20, 2014 at 10:21 am | Permalink

      If "4 countries" means "4 observations," then your regression isn't going to be very good. But to answer your question, if your goal is to reduce the order of magnitudes in a variable, you can use the log-modulus transformation: y --> sign(y)*log(|y| + 1), which is a continuous transformation that preserves signs.
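As an aside, the log-modulus transformation described in this reply is a one-liner in Python/numpy (the helper name is my own):

```python
import numpy as np

def log_modulus(y):
    """Log-modulus transform: sign(y) * log(|y| + 1).
    Continuous across zero, sign-preserving, and log-like for large |y|."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.log1p(np.abs(y))
```

Because the transform is odd, negative values map to negative values of the same magnitude as their positive counterparts, and zero maps to zero.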

  28. Nils
    Posted July 17, 2014 at 9:34 am | Permalink

    Hi Rick,

    I have absolutely no question at all, just wanted to say that I'm absolutely amazed by your responses here to a bunch of very badly phrased questions (even from people who aren't using SAS at all!). You're doing the work of dozens of undergrad thesis advisors - amazing!

    • Posted July 17, 2014 at 9:48 am | Permalink

      Thanks, you made my day. Yes, some of the questions contain fewer details than I would prefer, and some of my responses are no more than educated guesses. I readily admit that I am not The World's Foremost Expert on Transformations, so I hope none of these folks are relying exclusively on my judgement. Remember, never trust advice found on the Internet! :-)

  29. patrick bengana
    Posted August 12, 2014 at 4:24 am | Permalink

    dear Rick.
    I am using EViews to test for normality of inflation values, but even when I take the log or add a constant, the data do not become normally distributed. The values of inflation I am using are quarterly changes in inflation, so some of the values are negative. Please advise me on how to make the variable normally distributed. Thanks.

    • Posted August 12, 2014 at 2:26 pm | Permalink

      Not all data are normally distributed. Not all data can be made normal by taking a logarithm. But that is okay because normality is not a requirement to run a linear regression. After you run a linear regression you should check the RESIDUALS of the regression. If these RESIDUALS are normally distributed, then that is evidence that your regression model captured the relationship between your response and your explanatory variables.

  30. Mary
    Posted September 17, 2014 at 3:56 am | Permalink

    Hi Rick,
    I have marginal cost variables (obtained from a panel data analysis) whose log transformation must be taken in order to put them in the equation. But some of the data are negative, and the log transformation is undefined for the negative values. How can I overcome this? Could you please help? Thank you.

  31. Svilena
    Posted October 9, 2014 at 9:20 am | Permalink

    Hello Rick,
    I run a fixed effects model where the dependent variable is the Gini index measuring income inequality and the independent variables are: FDI stock (% of GDP), inflation rate (% change of consumer price index), secondary school enrollment ratio, government expenditure (% of GDP), services value added (% of GDP), and GDP per capita in PPP (constant 2011 international $). Except for GDP per capita and inflation rate, which range up to 30,000 and 1,000 respectively, all the other variables range up to 100. Do you think it is appropriate to take the natural log only of GDP per capita, as is the practice in similar studies, and work with the original values of all the other variables? Or is it better to take natural logs of all the variables in the model? Some studies do it the first way, some the second, and I cannot understand the logic behind this choice. (Moreover, in my case taking logs of all variables doesn't help make them normally distributed.)
    Your help is highly appreciated!

  32. Felix
    Posted December 25, 2014 at 3:27 am | Permalink

    Hi Rick
    I have a sample of 30 FTSE 100 companies' share prices and the salaries of the respective CEOs. I have calculated averages for the share prices and the salaries so that I can run a regression. How do I transform these averages into logs in EViews 8?

  33. Aemro
    Posted February 26, 2015 at 1:31 pm | Permalink

    What is the best method of transformation if none of the Box-Cox transformations is satisfactory for a continuous response model? What other alternatives can I use?

  34. Nnamdi Chukwuogor
    Posted April 3, 2015 at 9:33 pm | Permalink

    Sir, do you think it is actually appropriate to take the natural logs of rates, like lending interest rates, and of ratios, such as net domestic investment as a proportion of GDP?

  35. kapil shrimal
    Posted April 16, 2015 at 5:08 am | Permalink

    Dear sir,

    I want to do a multiple regression on financial data. My dependent variable is market capitalization (in crores), whereas the independent variables are EPS, PER, GPM, ROE, RONW, etc., which are in percentages. Some of the data have positive values, whereas some have zero and negative values. What should I do? Please advise.

  36. jhayadeguzman
    Posted April 17, 2015 at 3:07 am | Permalink

    What is the largest constant that should be added if I have negative data?

  37. Madhu Sarda
    Posted April 20, 2015 at 4:31 am | Permalink

    Hi Rick

    This blog is amazing; I have learned so many things, and your suggestions are really helpful. I have a query. I am dealing with time series of output, cost, and cost of capital where all the values are positive, but the series are not stationary, and the difference-stationary series contain negative values and zeros. In my multiple regression model I am using logQ, logC, and logK, but the log of negative values is undefined. How should I proceed? Should I use sign(Q)*log(|Q|+a)? If so, will it have any impact on the coefficients? How should I explain the results of the regression? Please reply; I am stuck and need your help.

  38. Emily Malunga
    Posted June 21, 2015 at 5:58 pm | Permalink

    Hey. My data has the variable "income" divided into percentages, as in: how would a 15% income increase affect your consumption of beans? The answers are: no change, less than the change in income, etc. The income change percentage options are 5%, 10%, 20%, 25%. How can I regress this against quantity consumed?

  39. Mahmud Mansaray
    Posted August 9, 2015 at 2:25 pm | Permalink

    Dear Rick, I am running a regression analysis on some macroeconomic variables. The variables are quarterly data, with some negative values. For example, quarterly GDP values are 234566.56, 345456.23, 678994.67, -345674.21, 879076.00, -12345.00. I would like to use a logarithm transformation on them. But how can I deal with the negative values when in fact the negative values are not single digits? Please advise.


    • Posted August 9, 2015 at 3:43 pm | Permalink

      The "Solution 1" section applies regardless of the magnitude of the negative values. For your data, the transformation becomes log(X + 1 + 345674.21).
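
      To make the recipe concrete, here is a minimal sketch in Python (the post's own code uses SAS/IML), applied to the six hypothetical GDP values from the question:

      ```python
      import math

      gdp = [234566.56, 345456.23, 678994.67, -345674.21, 879076.00, -12345.00]
      a = 1 - min(gdp)                       # a = 1 + 345674.21, so min(y + a) = 1
      log_gdp = [math.log10(y + a) for y in gdp]
      # the smallest transformed value is log10(1) = 0
      ```

      Because a is computed from the data, the same two lines work whatever the magnitude of the most negative value.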

  40. MARIA
    Posted August 13, 2015 at 5:36 am | Permalink

    Hi Rick,
    I am working on determining the climatic variables that may affect the productivity and survival of birds. The thing is that all the climatic variables have different numeric scales (mean temperature, snow days, frost days, rain days, precipitation, NAO index, etc.), and maybe that could be a problem in the analysis. (We are trying a PLSR, and we got no significance. According to the Kolmogorov-Smirnov test, the data are normally distributed.) Most of the variables have negative values. How can I transform them using the log transformation?
    I will appreciate your help. Thanks a lot :)

  41. MARIA
    Posted August 13, 2015 at 5:41 am | Permalink

    I forgot to mention, I am working with the Statistica 7.0 program.
    To transform positive values I have used log10(v1+1), for example, where v1 is variable 1, etc.

    • Posted August 13, 2015 at 5:53 am | Permalink

      I don't see anything in your data description that makes me think that your significance problem will be fixed by applying a log transformation. (Remember: in linear regression, the explanatory variables do NOT need to be normally distributed.) If the birds are affected in the way that the model specifies, and the effects in the model are large enough, you will get significant results. If you do not get significance, then either (1) your model is wrong or incomplete, or (2) you need more data to detect the small effects of your variables. In your case, it might be that local factors (abundance of food, nesting sites, predators, ...) are so much more important to the birds than the global factors that you are unable to detect the small effects.

      • Jalal Khan
        Posted January 21, 2016 at 6:20 am | Permalink

        Sir Rick,
        I am using a stock price index as the dependent variable, with exports in value, money supply in value, oil prices in value, interest rate and inflation rate in percentages, and industrial production as an index. Which of these variables do I need to log-transform before applying other tests? Some of the variables' values are large and others are small.

        • Posted January 21, 2016 at 8:01 am | Permalink

          There is no statistical requirement to transform any variable. However, explanatory variables whose range is large can sometimes lead to heteroscedastic residuals. After you fit the model for the original variables, look at the graphs of residuals vs. explanatory variables. If any look "fan shaped," consider using the log transformation on those variables.

  42. sidra
    Posted January 26, 2016 at 4:24 am | Permalink

    Hello Sir

    What if a variable has both negative and positive values and the log is taken of the absolute values? What would be the result?

  43. Ibe Happy C.S.A
    Posted March 3, 2016 at 4:10 am | Permalink

    Good day Rick.
    Your answers have really helped me. Thank you.

  44. Ahmad fraz
    Posted April 8, 2016 at 9:17 am | Permalink

    I have a question regarding the interpretation of log-transformed data where a constant was added to both the dependent and independent variables to avoid negative values or values less than one. I have the results of my regression. For example, I have 1 dependent and 2 independent variables. The regression results show that one of the independent variables has a positive beta and the other has a negative beta. The interpretation of the positive-beta variable is: "Doubling the independent variable from the mean of, e.g., 8.1% to 16.2% of companies yields a 7.5% increase in the dependent variable." The calculations might be: for example, the mean of the dependent variable is 0.081 (8.1%), which means the value for the mean observation is ln(1.081). Doubling the variable to 0.162 means a 7.5% increase in the value 1.081 (0.075 = 0.081/1.081). To get the impact on the dependent variable, multiply 7.5% by the coefficient; if the coefficient is 0.909, you get a 6.8% increase in the dependent variable. If the other variable has a negative beta, then the interpretation might be "a decrease of the independent variable by half from the mean, e.g., from 6% to 3%." How can we further explain it if beta is -2.48? Please help me.

    • Posted April 8, 2016 at 9:52 am | Permalink

      As you've noticed, if you take logarithms of Y and X, then the interpretation of coefficients becomes harder. In a simple model like log(Y) = b0 + b1*log(X), the original variables are related by Y = exp(b0)*X^b1. Let Y1 be the predicted value at the mean X = XBar. If you double X, then the predicted value is Y2 = (exp(b0)*XBar^b1)*2^b1 = Y1*2^b1. Thus doubling X results in a multiplicative change of 2^b1 in Y. This math is the same for positive and negative values of b1. If instead of doubling X you multiply it by any alpha, then the response changes by a factor of alpha^b1.
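
      A quick numerical check of this algebra, sketched in Python with made-up coefficients (including the b1 = -2.48 from the question):

      ```python
      import math

      b0, b1 = 0.5, -2.48   # hypothetical coefficients from log(Y) = b0 + b1*log(X)

      def predict(x):
          """Back-transformed prediction: Y = exp(b0) * X**b1."""
          return math.exp(b0) * x ** b1

      # doubling X multiplies the predicted Y by 2**b1, whatever the baseline
      ratio = predict(2 * 7.3) / predict(7.3)
      # ratio equals 2**b1 (about 0.18), so doubling X cuts Y by roughly 82%
      ```

      Because exp(b0) and the baseline X^b1 cancel in the ratio, the multiplicative effect 2^b1 is the same at every value of X.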

      • Ahmad fraz
        Posted April 8, 2016 at 10:49 am | Permalink

        Thanks for the prompt reply. Basically my work is similar to a paper, which presents the following results of his regression.
        “Differences in size, analyst coverage, breadth, trading cost, turnover, and change in breadth are associated with substantive differences in R-square. As an example, a firm with twice the market capitalization of the mean (mean =$1.1 billion) has an R-square ratio that is 18.2% higher, which translates in to an R-square that is 0.023 higher (an R-square of 0.175 versus the mean of 0.152). Doubling breadth of ownership from the mean of 8% to 16% of all institutions yields a 7.5% increase in the R-square ratio,19 which implies an R-square that is 0.009 greater than the mean (0.161). A decrease of round trip trading costs by half from 6.7% to 3.35% yields a similar increase in R-square of 0.01. A difference in turnover of 0.2% is associated with an R-square ratio, which is 4.4% greater, and a R-square which is 0.005 greater than the mean R-square.”
        The interpretation is a bit more challenging because a 1 is added before taking the log. For example, average breadth is 0.081 (8.1%), which means the value for the mean observation is ln(1.081). Doubling breadth to 0.162 means a 7.5% increase in the value 1.081 (0.075 = 0.081/1.081). To get the impact on the ratio, multiply 7.5% times the coefficient (0.909), to get a 6.8% increase in the R-square ratio. A 6.8% increase in the R-square ratio results in a 0.009 increase in

        I am a bit confused about these lines:
        "A decrease of round trip trading costs by half from 6.7% to 3.35% yields a similar increase in R-square of 0.01. A difference in turnover of 0.2% is associated with an R-square ratio, which is 4.4% greater, and a R-square which is 0.005 greater than the mean R-square."
        here is the link of his paper,

        He interprets his results from the tables given in Table I (page 43) and Table IV (page 48).
        I am really thankful and desperately need your help with these regression results.

      • Ahmad fraz
        Posted April 8, 2016 at 10:53 am | Permalink

        To clarify: one is added to each variable prior to taking the log.

  45. Madhu
    Posted April 22, 2016 at 5:48 am | Permalink

    Hello Rick
    You are doing a wonderful job. Thank you.
    Your blogs are really helpful. I need a favour from you. Kindly help me.
    I am dealing with a translog cost equation with one dependent and three independent variables. Transforming to stationary series caused negative values to emerge for all variables. The log transformation of negative values is not feasible, and you have suggested adding a constant to the series. Will it affect the stationarity? Will it affect the analysis? Do I need to back-transform the data? If yes, then how and when should I back-transform it?

    • Posted April 22, 2016 at 2:40 pm | Permalink

      In a linear regression, adding constants to variables does not change stationarity. In fact, it doesn't change any of the parameter estimates except for the intercept term.
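
      This is easy to verify numerically. The following Python sketch (made-up data; any regression software will show the same thing) fits a least-squares line before and after shifting the explanatory variable by a constant:

      ```python
      def ols(x, y):
          """Simple linear regression; returns (intercept, slope)."""
          n = len(x)
          xbar, ybar = sum(x) / n, sum(y) / n
          slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
                   / sum((xi - xbar) ** 2 for xi in x))
          return ybar - slope * xbar, slope

      x = [1.0, 2.0, 3.0, 4.0, 5.0]
      y = [2.1, 3.9, 6.2, 7.8, 10.1]

      b0, b1 = ols(x, y)
      b0_shift, b1_shift = ols([xi + 100 for xi in x], y)
      # the slope is unchanged; only the intercept moves: b0_shift = b0 - 100*b1
      ```

      The slope depends only on deviations from the mean, which a constant shift leaves untouched; the intercept absorbs the entire shift.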

      • Madhu
        Posted April 26, 2016 at 5:05 am | Permalink

        Thank you for the prompt reply. ☺.
        Will bother you again in future.

      • Madhu
        Posted April 26, 2016 at 5:27 am | Permalink

        Hi Rick
        I checked stationarity again after adding a constant to the series, and as you mentioned, the stationarity is not changed. I just wanted to know whether adding a constant to the variable will affect the regression coefficients. I guess it won't, but I still need your confirmation. Thanking you in advance.

7 Trackbacks

  1. [...] which the first expression is true. For example, in a previous post, I described several ways to handle negative values in evaluating a logarithmic data transformation. You might assume that the following statements prevent the LOG function from evaluating negative [...]

  2. [...] defines a helper function, SafeLog, that returns the natural log of positive quantities and returns missing values for non-positive quantitie... [...]

  3. [...] run-time library to include special user-defined functions. In a previous blog post I discussed two different ways to apply a log transformation when your data might contain missing values and neg.... I'll use the log transformation example to show how to define and call user-defined functions in [...]

  4. [...] to obtain a missing value for the square root of a negative number. As I showed in my article on how to handle negative values in a log function, you can use the SAS/IML CHOOSE function to return missing values: y2 = sqrt( choose(x>=0, x, [...]

  5. […] you can handle zero counts in any mathematically consistent way. I have previously written about how to use a log transformation on data that contain zero or negative values. The idea is simple: instead of the standard log transformation, use the modified transformation x […]

  6. […] my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about […]

  7. By Beware the naked LOC - The DO Loop on August 7, 2014 at 10:31 am

    […] finds elements of a vector or matrix that satisfy some condition. For example, if you are going to apply a logarithmic transform to data, you can use the LOC function to find all of the positive […]
