How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?

The annual data (FUNDA) is easy to deal with, you just need to apply the following conditions:

indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

If you have downloaded FUNDA and converted it to Stata format, the uniqueness of GVKEY-DATADATE can be verified using the following Stata command:

duplicates report gvkey datadate if indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

This command will return “no duplicates”.

The quarterly data (FUNDQ) is a little bit complicated. First of all, applying the same conditions won’t work. In fact, FUNDQ has natively satisfied these conditions. But duplicate GVKEY-DATADATE pairs still exist in FUNDQ. The root cause of duplicate GVKEY-DATADATEs is firms changing their fiscal year end. For example, if a firm changes its fiscal year end from September 30 to December 31, there will be two records for September 30 on FUNDQ—one for fiscal quarter 4 based on the original year end and the other for fiscal quarter 3 based on the new year end).

A closer look at FUNDQ shows that 99.84% of GVKEY-DATADATE pairs on FUNDQ are unique as of December 5, 2107. This suggests that no matter how you deal with duplicates, even simply and brutally delete all of them, your results probably won’t change in a noticeable way.

Have that said, if you want to remove duplicates in a more careful way, WRDS gives the following clue:

In the definition of datafqtr, WRDS notes that,

Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period.

Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data.

Unfortunately, I find that following this tip won’t work. WRDS apparently gives wrong instructions. I agree with one of the readers’ comments that “(how to remove duplicates) depends on what you need”. For example, in one of my projects, I want to look at 3-day CAR around earnings announcement date (RDQ) and use total assets as the deflator in my regression. As a result, when duplicate GVKEY-DATADATE pairs occur, the one with non-missing RDQ and ATQ will be preferred if I want to retain as many observations as possible. The following Stata commands will serve the purpose well:


This entry was posted in Learning Resources. Bookmark the permalink.

4 Responses to How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?

  1. YANLEI ZHANG says:

    Hi, Kai, I think it depends on what you need, it could be either calendar quarter or fiscal quarter. I usually try to collect the information for each calendar quarter, in that case, I only keep the recent convention for the duplicated observations.

  2. Thanh says:

    Hi Kai. I found that there is a difference in the number of observations every download from Compustat in the same period. For example, I downloaded 10 variables from Compustat in 1993-2016 with 380,0000 observations. Then I need another variable and download this variable separately from Compustat. However, I found that the number of observations are different in 2 files regardless of the same period. After that, I merge two files. After merging, some observations are not matched in 2 files. Can you help me explain why the number of observations in 2 files are different for the same period in Compustat?

  3. Vincent Zhang says:

    Hi, Kai. Thanks for sharing! I am wondering whether you notice that there are firms report financial data in Canadian dollars in COMPUSTAT (“curcd” = CAD). However, I barely saw papers mentioned that how to deal with it. What’s your opinion? Should we exclude the firms or exchange them to US dollars?

  4. Sujesh says:

    Hi Kai, the WRDS rule mentioned above mentions datacqtr (calendar quarter) and not datafqtr, if I am not mistaken. If all that you are interested in is exact duplicates, then the above method for quarterly data is fine. But there will still remain an issue of overlapping months, when firms change month of financial year end. In the strict sense, this too could be treated as a duplicate observation. For example, if in the year 2002 a firm changed financial year end from May to June, Compustat will capture both in subsequent rows. The May observation and the June observation will have 2 overlapping months (May and April). To remove all such cases of overlap, you will need to consider the observations where the datacqtr is a null/ blank. Doing so will subsume all cases of exactly duplicate observations as well. But just using the ‘datacqtr is a null’ criteria will not help uniquely identify exact duplicates. This is what I found out based on a subset of data, though I might have missed some additional possibilities.

Leave a Reply

Your email address will not be published. Required fields are marked *