The annual data (FUNDA) is easy to deal with, you just need to apply the following conditions:
indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"
If you have downloaded FUNDA and converted it to Stata format, the uniqueness of GVKEY-DATADATE can be verified using the following Stata command:
duplicates report gvkey datadate if indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"
This command will return “no duplicates”.
The quarterly data (FUNDQ) is a little bit complicated. First of all, applying the same conditions won’t work. In fact, FUNDQ has natively satisfied these conditions. But duplicate GVKEY-DATADATE pairs still exist in FUNDQ. The root cause of duplicate GVKEY-DATADATEs is firms changing their fiscal year end. For example, if a firm changes its fiscal year end from September 30 to December 31, there will be two records for September 30 on FUNDQ—one for fiscal quarter 4 based on the original year end and the other for fiscal quarter 3 based on the new year end).
A closer look at FUNDQ shows that 99.84% of GVKEY-DATADATE pairs on FUNDQ are unique as of December 5, 2107. This suggests that no matter how you deal with duplicates, even simply and brutally delete all of them, your results probably won’t change in a noticeable way.
Have that said, if you want to remove duplicates in a more careful way, WRDS gives the following clue:
In the definition of
datafqtr, WRDS notes that,
Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period.
Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data.
Unfortunately, I find that following this tip won’t work. WRDS apparently gives wrong instructions. I agree with one of the readers’ comments that “(how to remove duplicates) depends on what you need”. For example, in one of my projects, I want to look at 3-day CAR around earnings announcement date (RDQ) and use total assets as the deflator in my regression. As a result, when duplicate GVKEY-DATADATE pairs occur, the one with non-missing RDQ and ATQ will be preferred if I want to retain as many observations as possible. The following Stata commands will serve the purpose well:
use fundq, clear
duplicates tag gvkey datadate, ge(dup) //dup equals either 0 or 1
//(a) keep the one with non-missing rdq
bysort gvkey datadate: egen rdq_nonmiss=count(rdq)
tab rdq_nonmiss if dup==1
drop if dup==1 & rdq_nonmiss==1 & rdq==.
drop rdq_nonmiss dup
duplicates tag gvkey datadate, ge(dup)
//(b) keep the one with non-missing atq
bysort gvkey datadate: egen atq_nonmiss=count(atq)
tab atq_nonmiss if dup==1
drop if dup==1 & atq_nonmiss==1 & atq==.
drop atq_nonmiss dup
//(c) keep the one with more recent fiscal quarter
ge fyq=yq(fyearq, fqtr)
format fyq %tq
gsort gvkey datadate -fyq
duplicates drop gvkey datadate, force
duplicates report gvkey datadate //double check
save fundq_nodup, replace