The log transformation is one of the most useful transformations in data analysis. It is used as a transformation to normality and as a variance stabilizing transformation. A log transformation is often used as part of exploratory data analysis in order to visualize (and later model) data that ranges over several orders of magnitude. Common examples include data on income, revenue, populations of cities, sizes of things, weights of things, and so forth.

In many cases, the variable of interest is positive and the log transformation is immediately applicable. However, some quantities (for example, profit) might contain a few negative values. How do you handle negative values if you want to log-transform the data?

### Solution 1: Translate, then Transform

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(*Y+a*) where *a* is the constant. Some people like to choose *a* so that min(*Y+a*) is a very small positive number (like 0.001). Others choose *a* so that min(*Y+a*) = 1. In either case, you can show that *a = b* – min(*Y*), where *b* is the target minimum: either the small number or 1.
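The formula for *a* follows in one line, because adding a constant shifts the minimum by that same constant. Writing *b* for the desired minimum of the shifted data:

$$\min(Y + a) = \min(Y) + a = b \quad\Longrightarrow\quad a = b - \min(Y).$$

For example, if min(*Y*) = -3 and *b* = 1, then *a* = 4, so the smallest value is mapped to log(1) = 0.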

In the SAS/IML language, this transformation is easily programmed in a single statement. The following example uses *b=1* and calls the LOG10 function, but you can call LOG, the natural logarithm function, if you prefer.

    proc iml;
    Y = {-3,1,2,.,5,10,100};     /** negative datum **/
    LY = log10(Y + 1 - min(Y));  /** translate, then transform **/

### Solution 2: Use Missing Values

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

This is the point at which some programmers decide to resort to loops and IF statements. For example, some programmers write the following inefficient SAS/IML code:

    n = nrow(Y);
    LogY = j(n,1);               /** allocate result vector **/
    do i = 1 to n;               /** loop is inefficient **/
       if Y[i] > 0 then LogY[i] = log(Y[i]);
       else LogY[i] = .;
    end;

The preceding approach is fine for the DATA step, but the DO loop is completely unnecessary in PROC IML. It is more efficient to use the LOC function to assign `LogY`, as shown in the following statements.

    /** more efficient statements **/
    LogY = j(nrow(Y),1,.);       /** allocate missing **/
    idx = loc(Y > 0);            /** find indices where Y > 0 **/
    if ncol(idx) > 0 then
       LogY[idx] = log10(Y[idx]);
    print Y LY LogY;

The preceding statements initially define `LogY` to be a vector of missing values. The LOC function finds the indices of `Y` for which `Y` is positive. If at least one such index is found, those positive values are transformed and overwrite the missing values. A missing value remains in `LogY` for any element for which `Y` is not positive (that is, negative, zero, or missing).

You can see why some practitioners prefer the second method over the first: the logarithms of the data are unchanged by the second method, which makes it easy to mentally convert the transformed data back to the original scale (see the transformed values for 1, 10, and 100). The translation method makes the mental conversion harder.
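To make the comparison concrete, here is a minimal sketch of both methods applied to the same example data. It is written in Python rather than SAS/IML purely for illustration, and it uses `None` as a stand-in for a SAS missing value; the variable names mirror the ones in the article.

```python
import math

# Example data from the article; None stands in for a SAS missing value.
Y = [-3, 1, 2, None, 5, 10, 100]

# Solution 1: translate so that the smallest observed value maps to 1.
observed = [y for y in Y if y is not None]
a = 1 - min(observed)  # a = b - min(Y) with b = 1
LY = [None if y is None else math.log10(y + a) for y in Y]

# Solution 2: log of positive values, missing for everything else.
LogY = [math.log10(y) if y is not None and y > 0 else None for y in Y]

for y, ly, logy in zip(Y, LY, LogY):
    print(y, ly, logy)
```

For the values 1, 10, and 100, the second method returns 0, 1, and 2, whereas the first returns log10(5), log10(14), and log10(104), which are harder to read back to the original scale.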

You can use the previous technique for other functions that have restricted domains. For example, the same technique applies to the SQRT function and to inverse trigonometric functions such as ARSIN and ARCOS.
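As a rough sketch of that generalization (again in Python for illustration, not SAS/IML), the helper below applies a function only where a domain test passes and returns a missing value elsewhere. The name `apply_or_missing` and the sample data are hypothetical.

```python
import math

def apply_or_missing(func, in_domain, values):
    """Apply func where in_domain(v) is true; return None (missing) elsewhere."""
    return [func(v) if v is not None and in_domain(v) else None for v in values]

x = [-4, 0, 0.25, None, 1, 9]  # hypothetical data with out-of-domain values

# Square root is defined for x >= 0; arcsine is defined for -1 <= x <= 1.
root = apply_or_missing(math.sqrt, lambda v: v >= 0, x)
arcsine = apply_or_missing(math.asin, lambda v: -1 <= v <= 1, x)

print(root)
print(arcsine)
```

Only the domain test changes from one function to the next, which is the same role that the `Y > 0` condition plays in the LOC-based statements.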

## 64 Comments

Did you know that SAS now has a LOG1PX function that "returns the log of 1 plus the argument"? It's true!

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003121132.htm

Dear Rick

My data set includes stock returns for around 1000 companies. The return data sometimes show figures from -34.5 to -108. How do I apply a log transformation in this case? How large should the constant be for this kind of data? Please help.

It depends somewhat on what you're trying to do, but you might want to express the returns as a percentage, measured from the start of the time period (1 yr, 5 yrs, or whatever). Then the negative returns are bounded by -100 percent, and you can safely compute log(101 + return).

Dear Rick

I have a data set for which the dependent variable is both positive and negative. Would you say an alternative is to take absolute values, then take logs, before multiplying the original values with -1. For me this seems reasonable, but I am not sure if I can interpret my coefficients in terms of percentage changes any more?

All best

Gjermund

Well, I don't know your application, but I don't think I would recommend that approach because the LOG function has a singularity at zero. The LOG transformation is best for mapping changes that are between 0 and infinity. For example, if you buy a stock at a certain price, the quantity Price/Purchase_Price is always positive. It can be log-transformed. It sounds like your variable might be "relative change" such as (Price-Purchase_Price)/Purchase_Price, which can be positive or negative. I wouldn't use a log transform for the second quantity.

Upon rereading your comment, perhaps you were attempting to form a transformation like this: y --> sign(y)*log(|y| + 1)
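For readers who want to experiment with the transformation y --> sign(y)*log(|y| + 1), here is a small sketch in Python (used for illustration only; the function names are hypothetical). It also includes the inverse, which can be handy for mapping results back to the original scale.

```python
import math

def log_modulus(y):
    """sign(y) * log(|y| + 1), defined for every real y."""
    return math.copysign(math.log1p(abs(y)), y)

def inverse_log_modulus(t):
    """Undo log_modulus: sign(t) * (exp(|t|) - 1)."""
    return math.copysign(math.expm1(abs(t)), t)

for y in [-100.0, -1.0, 0.0, 1.0, 100.0]:
    t = log_modulus(y)
    print(y, t, inverse_log_modulus(t))
```

The transformation maps 0 to 0, preserves the sign of each value, and is symmetric about the origin, so it avoids the singularity that log(|y|) has at zero.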

Dear Rick

I have data set of both positive and negative value. I have changed the large number with minus sign among the treatments to zero by adding equal positive number and also to all treatments, then I have analyzed by SAS. But I am not sure, please would you help me?

All best

Abera

To everyone who has questions about applying these ideas to specific data:

1. If you have a question about HOW (or why) to best transform specific data, post your question to the SAS Statistical support group at https://communities.sas.com/community/support-communities/sas_statistical_procedures

2. If you have a question about SAS/IML SYNTAX or are getting a programming error, post your question to the SAS/IML support group at https://communities.sas.com/community/support-communities/sas_iml_and_sas_iml_studio

In both cases,

1. supply sample data,

2. state what you've tried, and

3. link to this post so that the people who read your question know the context of your question

Dear Rick

I have to apply regression to Return on equity ratios, return on asset ratios, GDP growth, Inflation % and Real interest rate.

The problem is that I want to take the natural log of all variables. All data are in % form but have positive or negative values, e.g. ROA = 0.328% and ROE = -7.92%. I can easily apply the natural log to the 0.328% value and get -1.11474, but how can I apply the natural log to the negative value? Could you please show me how and give me a figure to see?

Is there any way I can just write 0 for this entry and perform my regression? It may give wrong results.

Use the proportion y = Price / Purchase_Price.

This is always in the interval (0, infinity) for non-bankrupt stocks, and y=1 means that the price has not changed since it was purchased. If this quantity spans several orders of magnitude, you can apply the log(y) transform.
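A short sketch may help show why the ratio is easier to log-transform than the relative change. The prices below are hypothetical, and the code is Python for illustration only.

```python
import math

purchase_price = 50.0                      # hypothetical purchase price
prices = [5.0, 25.0, 50.0, 100.0, 500.0]   # hypothetical later prices

for p in prices:
    ratio = p / purchase_price                               # positive while p > 0
    relative_change = (p - purchase_price) / purchase_price  # can be negative
    print(p, relative_change, ratio, math.log10(ratio))

ratios = [p / purchase_price for p in prices]
log_ratios = [math.log10(r) for r in ratios]
```

The relative change equals the ratio minus 1, so it is negative whenever the price has fallen. The log of the ratio is 0 when the price is unchanged, and a halving and a doubling are the same distance from 0 in opposite directions.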

Sir, I am using Eviews 7 and I have values in my data set for presidential approval ratings which are negative. I need to use log of the ratings but Eviews cannot compute it. How do I get rid of the negative values? Thank you!

You don't need to use the log of the rating; just use the ratings as given. Logarithms are used when data span many orders of magnitude, which doesn't apply to approval ratings. If you insist on transforming the data, use y = (r+100)/200, which maps the rating (r) from [-100,100] to [0,1].

Hello Rick!

I have a question regarding the interpretation of log transformed data where a constant was added to avoid some negative values. Should we still interpret the results to mean that a 1% change in the independent variable leads to a β% change in the dependent one (where β is the coefficient found by the regression)? Both the dependent and independent variables were log transformed.

Thanks!

No, the interpretation is that a unit change in the LOG of the independent variable leads to a change of beta in the LOG of the dependent variable. This is a much more difficult interpretation because the "unit change in log(x)" now depends on x. At small values of x, log(x) changes quickly; for large values of x, log(x) changes slowly.
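To see where the usual percentage reading breaks down, it can help to differentiate the fitted equation. As a sketch, assume the model is log(Y + a) = β₀ + β₁ log(X + c), where a and c are the added constants (c = 0 if X was not shifted). These symbols are introduced here for illustration and do not come from the original question.

$$\frac{dY}{Y+a} = \beta_1 \frac{dX}{X+c} \quad\Longrightarrow\quad \frac{dY/Y}{dX/X} = \beta_1 \cdot \frac{Y+a}{Y} \cdot \frac{X}{X+c}.$$

Under that assumption, a 1% change in X corresponds to roughly a β₁% change in Y only when a and c are negligible relative to Y and X. Otherwise the percentage effect differs from one observation to the next.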

Hello Rick,

I am working on human capital investment and economic growth. My dependent variable is real GDP, and my independent variables are labor, capital, and government expenditure on health and education. My proxies are labor force population, gross fixed capital formation, government expenditure on health and education, life expectancy rate, and adult literacy rate. However, I need to know which ones to log and whether to use the natural or the common (base 10) logarithm, and why I should use one instead of the other. Thank you very much.

A simple rule of thumb is to log-transform variables that range over several orders of magnitude. For example, if one country has a population of one million and another has a population of a billion, that is three orders of magnitude, so a regression model that includes the log(population) is worth considering. For your variables, I would choose base 10 because the results will be more interpretable. If you see that log10(X) is close to 3, you can use mental arithmetic to figure out that X is close to 1,000.

Hi, Rick. Using log(Y+k) to deal with zero and negative values of the outcome variable seems to be problematic, if I care about the interpretation of beta_1 in E[log(Y+k)] = beta_0 + beta_1 X. I've seen some data analysts exponentiate the right side of the equation and then they subtract k to complete the backtransformation. But this isn't right, as E[log(Y+k)] = log GM(Y+k), where GM is the geometric mean. So my question is: for E[log(Y+k)] = beta_0 + beta_1 X, what is the interpretation of beta_1? If k=0, then [exp(beta_1)-1] has the neat interpretation of percentage change in GM(Y) for a unit increase in X. But if k is not 0, do we have a similar interpretation?

- Jacob

You've hit on a key issue: how do you interpret statistics that result from (any) transformation of a variable? As you point out, some transformations have simpler interpretations than others. There have been many books and papers written on this topic, and I recommend the ones by AC Atkinson. His book _Plots, Transformations, and Regression_ describes transformations for a wide variety of situations.
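One reading does follow directly from the model as written in the question, offered here as a sketch rather than a full answer. Because E[log(Y+k) | X] = log GM(Y+k | X), a unit increase in X gives

$$\frac{\mathrm{GM}(Y+k \mid X = x+1)}{\mathrm{GM}(Y+k \mid X = x)} = \exp(\beta_1).$$

So exp(beta_1) - 1 is the proportional change in the geometric mean of the shifted response Y+k, not of Y itself. Since GM(Y+k) is generally not equal to GM(Y) + k, subtracting k afterward does not turn this into a statement about GM(Y).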

Sir,

I have a data set of food expenditures with the consumed quantities. Since there is no data about per unit prices, I got it as expenditure/quantity. Then I got the natural logarithm of prices using stata. But most of the values came as negative. I'm afraid of my results & I want to know can natural log values of prices be negative. What does it mean by the negative sign? Please help. Thanks in advance.

Yes. The log can be negative. In your case it means that the ratio is less than 1.

Hello Rick, thanks for the useful blog.

My model (OLS regression) has industry return as the dependent variable, and the three independent variables are total market return, oil price return, and natural gas return. I want to take the log of each of these returns.

I need to transform the negative numbers to use the log, and I do it the first way suggested.

I apply =log(1+*value of return*). My question is: should I apply the +1 to all 2608 observations I have, or only to the negative ones?

I am very grateful for an answer here.

Regards, Moni

The transformation is applied to the entire variable, so you should apply it to all 2608 observations.

Hi Rick

I have savings data set with both negatives and positives. How do I log transform it in eviews especially the negatives?

See my previous responses, especially to "Gjermund Grimsby."

hello rick,

I have a few independent variables: earnings per share, book value, and fair value. The problem is that I have negative data for earnings per share (EPS). So, should I just transform EPS to log(1 + EPS), or do I need to do the same to book value and fair value?

Tq

You do not need to transform each variable in the same way. It seems to me that EPS can be less than -1, in which case 1+EPS is still negative, so be sure to look at the most negative value of EPS before you decide on a transformation.

Hi Rick,

Is it necessary for all variables to be normally distributed if we want to run a multiple regression? I transformed some of my variables, but the result is still not normal. So, what should I do then? Your suggestion is really appreciated.

No, regression does not require that the explanatory variables be normally distributed. If you do an internet search for "assumptions of linear regression" you will find many articles. If you want to do inference on the least squares estimates (the regression coefficients), you assume normally distributed ERRORS (residuals). That is, the Y variable is linearly related to the X variables plus some unknown error term that is normally distributed.

hi Rick,

I have a problem with the normality test again. In order to make sure that I can use a parametric test, I need to make sure that my residual distribution is normal. However, the skewness and kurtosis of the residuals are -0.017 and -0.438 respectively, which I think can be considered normal. Unfortunately, when I run the Kolmogorov-Smirnov test, the significance value is 0.021, which indicates that the residuals are not normal. The sample size of my study is 290. Could I just ignore the Kolmogorov-Smirnov test and assume the residuals are normal because the sample is large?

In practice, many people just "eyeball" the residuals to check that they are approximately normal. If the residuals are approximately normal, the inference on the regression coefficient will still be good. The quantile-quantile plot in PROC UNIVARIATE is probably more valuable than the K-S test for assessing (approximate) normality.

Hello Rick. I'm trying to log the size of firm before I run a GMM regression. Size of firm is defined as:

size = (common stock/book value) x stock market price

But the problem is that there are many negative values. How do I log the data? Should I just treat the negative values as 0? When I log the whole data using Microsoft Excel, the negative values are treated as 'missing'. So can I just run the GMM regression even though there are missing values? Is there a better alternative to handle the situation? Thanks.

First, if you run the regression with missing values, you are excluding all of that data when you construct the regression model. I wouldn't do that.

It seems like the problem is the definition of "size". I think most people expect "size" to be a positive quantity, such as "market capitalization" or something similar. If you can, change the way that you measure size.

Hi, I am working on GDP forecasting. The amounts are very large, so to make the series stationary, I transform the data into log differences. I am using EViews 7. After forecasting, I don't know how to convert these values back into the original values. These values have become very small. Could you please help me in this regard? Thanks.

If you are predicting log(GDP), then exponentiate the predicted values to get back to the original scale.

Sir I am predicting d(log(gdp))

I am using ARIMA modeling, so I transformed the data by using the first difference of the logarithm. Now I have the forecasting results, but the transformation changed the data, so I want to bring the data back to its original form. I used the following transformation:

d(log(gdp))=log(gdp)-log(gdp)(-1)

Does exponentiating alone work?

I suggest you ask your question at https://communities.sas.com/community/support-communities/sas_forecasting
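For readers with the same question, the arithmetic can be sketched independently of any package. Because d(log(gdp)) at time t equals log(gdp_t) - log(gdp_{t-1}), exponentiating a forecasted difference gives only the growth factor gdp_t / gdp_{t-1}. To recover levels, you accumulate the differences onto the last observed log level and then exponentiate. The numbers below are hypothetical, and the code is Python for illustration, not EViews or SAS syntax.

```python
import math

last_gdp = 1500.0                       # hypothetical last observed level
dlog_forecasts = [0.02, 0.015, -0.01]   # hypothetical forecasts of d(log(gdp))

levels = []
log_level = math.log(last_gdp)
for d in dlog_forecasts:
    log_level += d                      # undo the difference: running sum
    levels.append(math.exp(log_level))  # undo the log: exponentiate

print(levels)
```

Equivalently, each forecasted level is the previous level multiplied by exp(d), so small forecasted differences correspond to small percentage changes in GDP.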

Dear sir, I am using inward FDI as a dependent variable. It has positive and negative numbers. The data start from 75 and the numbers become larger over time, and the lowest number is -15348. If I add 15348 to make all the numbers positive, then the first data value will become very large. Since GDP is an explanatory variable, will adding such a big number affect the regression? I use EViews 7. What can I do?

lnFDI is measuring the % change of FDI in the regression, so can I just use FDI minus last period FDI, and calculate the change rate, use the growth rate of FDI instead, but without log?

Adding a constant to the response only changes the regression by changing the intercept. If you then apply a log transformation, it becomes harder to interpret the regression coefficients in terms of intuitive quantities such as %change. I think in your case you should plot the data. Is inward FDI linearly related to GDP? If so, don't apply the transform. If not, apply the log(Y+c) transform. Is the transformed response linearly related to the explanatory variables? SAS and other statistical software provide graphical diagnostic plots that you can use to assess the fit of the model.
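The claim that adding a constant to the response changes only the intercept can be checked with a few lines of arithmetic. The sketch below fits a one-predictor least-squares line to hypothetical data before and after shifting the response. It is Python for illustration, and `ols_line` is a hypothetical helper, not a library routine.

```python
def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical explanatory variable
y = [-120.0, -40.0, 35.0, 90.0, 210.0]   # hypothetical response with negatives
c = 150.0                                 # constant added to the response

b0, b1 = ols_line(x, y)
b0_shifted, b1_shifted = ols_line(x, [yi + c for yi in y])
print(b0, b1)
print(b0_shifted, b1_shifted)
```

The slope is identical in both fits and the intercept moves by exactly c. That equivalence is lost once a log is applied to the shifted response, which is why the choice of constant matters for log(Y+c).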

Hi Rick,

I have monthly growth data, which is sometimes negative. There is a desire for the growth to be measured on a per day basis (so, growth per day). I had been using the method you described Y*=ln(y+a). But, several on the team are not comfortable with that. An alternative suggested was taking the log of the values prior to differencing them. If this is the dependent variable of a log-log model, would the coefficients with this transformation be interpreted the same way as the Y*=ln(y+a) would be interpreted? What can I do with the per day aspect of this?

I really appreciate this thread, and all the useful feedback you are personally supplying.

In general, I think it is wise for analysts to be skeptical of advice found on the internet! To answer your question, if you take logs first and then difference them, you are forming the log of a ratio, since log(y_{i+1}) - log(y_i) = log(y_{i+1} /y_i). This would mean that you would be examining the "proportion of change" from one period to the next. Assuming that your data are all positive, this is a reasonable thing to do. The ratio centers the "no change" situation at 1 instead of 0 (so its log is 0 when nothing changes), and it is never negative when the data are positive.

Try graphing the proportion of change without the log: z_i = y_{i+1} /y_i. Perhaps you can do your regression on that proportion. If so, that's what I'd try.
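Here is a small sketch of the proportion z_i = y_{i+1} / y_i and its relationship to differenced logs, using a hypothetical positive series. The identity applies to the underlying levels, which must be positive, and not to growth values that can be negative. The code is Python for illustration.

```python
import math

y = [120.0, 150.0, 135.0, 180.0]   # hypothetical positive series of levels

# Proportion of change from one period to the next.
ratios = [y[i + 1] / y[i] for i in range(len(y) - 1)]

# Differenced logs of the same series.
log_diffs = [math.log(y[i + 1]) - math.log(y[i]) for i in range(len(y) - 1)]

for z, d in zip(ratios, log_diffs):
    print(z, math.log(z), d)
```

A period in which the series falls gives a ratio below 1 and therefore a negative log difference, while the ratio itself stays positive.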

Thanks for your feedback! I really like the log-log because the coefficients are easy to interpret. How would you interpret coefficients in this proportion change model?

Same as usual: the change in the proportion when an explanatory variable changes by one unit. But you're right that this model is not used as often as the log-log model.

I have a dependent variable with large absolute values and some equally large independent variables. The remaining independent variables are rates. I want to run a regression and introduce logs. How do I go about it?

Let's say that your independent variables are X1, X2,..., Xp and your dependent variable is Y. To use logs, define LX1=log(X1),..., LXp=log(Xp), and LY=log(Y). Then use regression to model LY as a function of LX1, LX2,..., LXp.
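As a minimal sketch of that recipe with a single explanatory variable, the example below log-transforms hypothetical data that follow an exact power law and fits a line to the logs. It is Python for illustration; `ols_line` is a hypothetical helper, and with several explanatory variables you would use a regression routine from your statistical software instead.

```python
import math

def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

x = [1.0, 10.0, 100.0, 1000.0]     # hypothetical X spanning orders of magnitude
y = [3.0 * xi ** 2 for xi in x]    # hypothetical Y = 3 * X^2 (no noise)

lx = [math.log10(v) for v in x]    # LX = log(X)
ly = [math.log10(v) for v in y]    # LY = log(Y)
intercept, slope = ols_line(lx, ly)
print(intercept, slope)
```

In this noise-free case the fitted slope recovers the exponent of the power law and the intercept recovers the log of the multiplier, which is what makes the log-log model easy to interpret when all variables are positive.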

Hi Rick, this blog is really great! I don't have negative numbers, but I do have values below one, which when transformed, turn into negative values. I understand why this happens but I am not sure if this affects my analysis of the data and in which way.

In general the answer is that it is okay to get negative values. For example, if you are analyzing the GNP of nations in units of trillions of dollars, small and third-world nations will have a negative log(GNP), whereas major industrial nations will not.

Hi, I am working with measurements of the conductivity of water and discharge of water. I want to produce a nonlinear regression to demonstrate the relationship between the two. When I did the log transformation of both these variables, the discharge values all came back negative. How do I fix this? Do I add a constant when I am working out both logarithms?

A value that is between 0 and 1 will be transformed to a log-value that is negative. There is nothing to "fix."

Dear Rick,

Thank you for your effort on this page, it is very helpful. I am forecasting inflation in Eviews 7, and some of my relative variables are negative and not normally distributed. The smallest negative number is -1,5%. From what I understood of the above comments is that I should take log(x+1,5) of this series to convert the negative numbers into positive? (and thus to check for normality again)

Is that correct?

Thank you!

Yes, that sounds right. Two comments:

1) log(0) is not defined, so add a number GREATER than 1.5 to make sure that when x=-1.5, you don't get log(0). The actual number doesn't matter much: 1.6 would work, as would 1.51 or even 2.

2) I assume from your note that x is measured in terms of percent, so that min(x)=-1.5. If min(x)=-1.5%= -0.015, then you can divide my numbers by 100.

Hi,Sir...

I am doing my research proposal. I use interest rate, inflation, deflator for my independent variables..The data got negative value -X>0 and -X<0. I use this technique (lx1=@recode(x1>0,log(1+x1),-log(1-x1)) for log the data (http://forums.eviews.com/viewtopic.php?f=3&t=1212)

But at the end year, the data for value is NA...Is it any wrong ?? Sir, could you suggest the better technique for me to log the data??

I appreciate it. Thank you!

The only way that I can see that you would get NA is if the original data had an NA. Your transformation (which I like, and which can be simply written as sgn(x)*log(1+abs(x))) is defined for all real numbers, so the NA are coming from the data, not from the transformation.

It sounds like you are doing this transformation on the explanatory variables. For regressions and other analyses, there is nothing wrong with having negative values in explanatory variables. Furthermore, log-transformations are most useful for variables that stretch over many orders of magnitudes (such as population of cities), and none of the variables that you mention have large ranges. If it were me, I'd try using the original variables instead of transforming them.

Hi - I'm curious on your thoughts about choice of "a" for log(Y+a). The two minimum values in my data (negative, of course) don't appear to be outliers when viewed on the original scale. Depending on the choice of "a", they can be made to look like severe outliers (if Y+a is very near zero or one) or not to appear like outliers at all (as Y+a increases). This has obvious implications when analyzing via regression. I know you mention 0.001 or 1 as minimums above, but are there any good resources for determining a proper "a"?

Thanks for your time and effort.

You ask an interesting question. When applying a nonlinear transformation, you are going to change the distribution of the response. Usually you are starting with a response distribution that is skewed and you are trying to transform it into a distribution that is closer to normal. That is why the log transformation is one of the so-called "transformations to normality" or "normalizing transformation." As you point out, the choice of 'a' affects the distribution of the transformed variables. So which value of 'a' to choose?

I'm guessing that you should strive to choose a value that makes your transformed response most nearly normal. If a = 0.0001 - min(y), then the response will be strongly negatively skewed relative to its original skewness. If a = 1 - min(y), then the response will be moderately negatively skewed relative to its original skewness. If a = 1000 - min(y), then the skewness hardly changes at all (assuming the range of Y is small). Unfortunately I do not have a reference for this idea, but it reminds me of the "Box-Cox transformation," which optimizes a parameter in a family of power transformations.
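The effect of the constant on skewness can be explored numerically. The sketch below uses the article's example data (with the missing value dropped) and sets a = b - min(y) as in the article, so that the smallest shifted value equals b. The skewness formula is the simple population version, and the code is Python for illustration only.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

y = [-3.0, 1.0, 2.0, 5.0, 10.0, 100.0]   # article's data, missing value dropped

results = {"original": skewness(y)}
for b in [0.0001, 1.0, 1000.0]:
    a = b - min(y)                        # so that min(y + a) = b
    results[b] = skewness([math.log10(v + a) for v in y])

for key, value in results.items():
    print(key, round(value, 3))
```

For these data, the smallest b pushes the minimum far out into the left tail and makes the skewness negative, b = 1 gives an intermediate value, and the largest b leaves the skewness close to that of the untransformed data. A search over b for the value that brings the skewness closest to zero would be one rough way to act on the suggestion above.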

I have data in the form of area in square cm. What kind of transformation should I use to minimize or stabilize the variability in the data?

I'd try a square root transformation.

Hi Rick,

I am constructing an Error Correction Model using panel data in Stata. I was trying to obtain the natural logs of my dependent variable (ratio of capital flight to real GDP) and independent variables (interest rate differential, financial openness [Chinn-Ito index], real exchange rate) for 4 countries. I am, however, constrained since all my variables contain negative numbers. I was keen on transforming them since they all have different magnitudes. How would you advise I proceed in such an instance?

If "4 countries" means "4 observations," then your regression isn't going to be very good. But to answer your question, if your goal is to reduce the order of magnitudes in a variable, you can use the log-modulus transformation: y --> sign(y)*log(|y| + 1), which is a continuous transformation that preserves signs.

Hi Rick,

I have absolutely no question at all, just wanted to say that I'm absolutely amazed by your responses here to a bunch of very badly phrased questions (even from people who aren't using SAS at all!). You're doing the work of dozens of undergrad thesis advisors - amazing!

Thanks, you made my day. Yes, some of the questions contain fewer details than I would prefer, and some of my responses are no more than educated guesses. I readily admit that I am not The World's Foremost Expert on Transformations, so I hope none of these folks are relying exclusively on my judgement. Remember, never trust advice found on the Internet! :-)

dear Rick.

I am using EViews to test for normality of inflation values, but even when I log or add a constant the variable does not become normally distributed. The values of inflation that I am using are quarterly changes in inflation, so some of the values are negative. Please advise me on how to make the variable normally distributed. Thanks.

Not all data are normally distributed. Not all data can be made normal by taking a logarithm. But that is okay because normality is not a requirement to run a linear regression. After you run a linear regression you should check the RESIDUALS of the regression. If these RESIDUALS are normally distributed, then that is evidence that your regression model captured the relationship between your response and your explanatory variables.

Hi Rick,

I have marginal cost variables (obtained from panel data analysis) that must be log-transformed before I put them in the equation. But some of the data are negative, and the log of a negative value is undefined. How can I overcome this? Could you please help? Thank you.

Sure. Read the article on this page and do what it says.
