Becker, B., and T. Milbourn. 2011. How did increased competition affect credit ratings? *Journal of Financial Economics* 101 (3):493-514.

See their Table 2 for an overview of the ratings levels for the three main rating agencies and the numerical value assignments used in their empirical work.

Paired t test:

**ttest var1 = var2**

Wilcoxon matched-pairs signed-rank test:

**signrank var1 = var2**

Sign test of matched pairs:

**signtest var1 = var2**

Unpaired or unmatched two-sample t test:

**ttest var, by(groupvar)**

Wilcoxon rank-sum test or Mann-Whitney test:

**ranksum var, by(groupvar)**

K-sample equality-of-medians test:

**median var, by(groupvar)**

UCLA IDRE has posted an article (link) that may provide a bit more explanation. UCLA IDRE is a great resource for learning statistical analysis. A big thank you to them.
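Outside of Stata, the same battery of tests is available in Python via `scipy.stats`; below is a minimal sketch on made-up data (all values are purely illustrative):

```python
from scipy import stats

# Toy paired data: measurements on the same units, before and after
var1 = [12.1, 11.8, 13.2, 12.9, 12.5, 13.0, 12.2, 12.7]
var2 = [12.6, 12.0, 13.5, 13.4, 12.4, 13.6, 12.9, 13.1]

# Paired t test (Stata: ttest var1 = var2)
t1, p_ttest = stats.ttest_rel(var1, var2)

# Wilcoxon matched-pairs signed-rank test (Stata: signrank var1 = var2)
w, p_signrank = stats.wilcoxon(var1, var2)

# Toy unpaired data: two independent groups
group1 = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
group2 = [5.8, 6.1, 5.9, 6.4, 5.7, 6.0]

# Two-sample t test (Stata: ttest var, by(groupvar))
t2, p_ttest2 = stats.ttest_ind(group1, group2)

# Wilcoxon rank-sum / Mann-Whitney test (Stata: ranksum var, by(groupvar))
u, p_ranksum = stats.mannwhitneyu(group1, group2, alternative='two-sided')

print(p_ttest, p_signrank, p_ttest2, p_ranksum)
```

Each call returns the test statistic and a two-sided p-value.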

`pwcorr` and `spearman` can do the job. However, we have to manually combine Stata output tables when producing the correlation table in the manuscript, which is time-consuming.

I found a fantastic module written by Daniel Klein. His command returns one table that combines Pearson and Spearman correlations and needs the fewest further edits. Thanks, Daniel, and please find his work here.

A sample command is as follows:

`corsp varlist, pw sig`

To install Daniel’s module, type `ssc install corsp` in Stata’s command window.

A good technical comparison of Pearson and Spearman correlations can be found here.
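To make the comparison concrete: Pearson measures linear association, while Spearman is simply Pearson computed on ranks, so it captures any monotonic relationship. A small pure-Python illustration (toy data with no ties):

```python
def pearson(a, b):
    # Pearson correlation from the definition: cov(a, b) / (sd(a) * sd(b))
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

def ranks(a):
    # Rank positions (1 = smallest); assumes no ties for simplicity
    order = sorted(range(len(a)), key=lambda i: a[i])
    r = [0] * len(a)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(a, b):
    # Spearman = Pearson correlation of the ranks
    return pearson(ranks(a), ranks(b))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]      # monotonic but nonlinear (y = x^2)
print(pearson(x, y))        # below 1: the relationship is not perfectly linear
print(spearman(x, y))       # exactly 1: the relationship is perfectly monotonic
```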

`tsset`). The command to convert string GVKEY to numerical GVKEY is very simple:

`destring gvkey, replace`

The command to revert numerical GVKEY to string GVKEY with leading zeros is as follows:

`tostring gvkey, replace format(%06.0f)`
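If you prepare GVKEY in Python rather than Stata, the same round trip is `int()` and `str.zfill()`; the GVKEY value below is just an example:

```python
gvkey_str = "001690"          # GVKEY as stored in Compustat, with leading zeros

# "destring": string -> integer (leading zeros are dropped)
gvkey_num = int(gvkey_str)
print(gvkey_num)              # 1690

# "tostring" with format %06.0f: integer -> zero-padded 6-character string
gvkey_back = str(gvkey_num).zfill(6)
print(gvkey_back)             # 001690
```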

A better way of comparing the predictive ability of different models is to use the Receiver Operating Characteristic, or ROC curve (e.g., Hosmer and Lemeshow, 2000, Chapter 5). This curve “plots the probability of detecting a true signal (sensitivity) and false signal (1 - specificity) for the entire range of possible cutpoints” (p. 160, our emphasis). The area under the ROC curve (denoted AUC) provides a measure of the model’s ability to discriminate. A value of 0.5 indicates no ability to discriminate (might as well toss a coin), while a value of 1 indicates perfect ability to discriminate, so the effective range of AUC is from 0.5 to 1.0. Hosmer and Lemeshow (2000, p. 162) indicate that an AUC of 0.5 indicates no discrimination, an AUC between 0.7 and 0.8 indicates acceptable discrimination, an AUC between 0.8 and 0.9 indicates excellent discrimination, and an AUC greater than 0.9 is considered outstanding discrimination.

The Stata command to report AUC is as follows:

`logit y x1 x2` or `probit y x1 x2`

`lroc, nograph`

The most recent edition of the book Kim and Skinner refer to is Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

A technical note from Stata: `lroc` requires that the current estimation results be from `logistic`, `logit`, `probit`, or `ivprobit`.

A side question: what’s the difference between logistic and logit regression? Nick Cox’s short answer is: “same thing with different emphases in reporting” (logistic reports odds ratios; logit reports coefficients, i.e., the log of the odds ratios). Thanks to a post on Stack Overflow.
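A useful way to internalize AUC: it equals the probability that a randomly chosen positive observation receives a higher predicted score than a randomly chosen negative one (ties count one half). A toy calculation in Python (the scores are made up):

```python
def auc(pos_scores, neg_scores):
    # Probability that a random positive outranks a random negative (ties = 0.5)
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Predicted probabilities from some hypothetical model
pos = [0.9, 0.8, 0.4]   # observations where the event occurred
neg = [0.5, 0.3, 0.2]   # observations where it did not

print(auc(pos, neg))     # 8 of the 9 pairs are ordered correctly
```

When the two score distributions are identical, this probability is 0.5, matching the “toss a coin” benchmark above.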

For a single variable, skewness is among the statistics that `sum var, detail` reports.
*Figure 1:* Returns are stored in a row

*Figure 2:* Returns are stored in a column

**If returns are stored in a row**

Stata does not provide a command to calculate the skewness in this situation. The following Stata commands will do the job.

```stata
egen ret_mean=rowmean(ret1-ret12)
egen n=rownonmiss(ret1-ret12)
foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {
    ge `v'_m3=(`v'-ret_mean)^3
}
egen m3=rowtotal(ret1_m3-ret12_m3), missing
replace m3=m3/n
foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 {
    ge `v'_m2=(`v'-ret_mean)^2
}
egen m2=rowtotal(ret1_m2-ret12_m2), missing
replace m2=m2/n
ge ret_skew=m3*m2^(-3/2)
```

**If returns are stored in a column**

Stata provides a command to calculate skewness in this situation (the `egen` function `skew()`). However, the computation is extremely slow if we have millions of observations. I would suggest calculating the skewness manually as follows:

```stata
sort permno  // add more variables here to identify a group
by permno: egen ret_mean=mean(ret)
by permno: egen n=count(ret)
ge ret_m3=(ret-ret_mean)^3
by permno: egen m3=total(ret_m3)
replace m3=m3/n
ge ret_m2=(ret-ret_mean)^2
by permno: egen m2=total(ret_m2)
replace m2=m2/n
ge skewness=m3*m2^(-3/2)
```
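Both snippets implement the same population-moment formula, skewness = m3 / m2^(3/2). A quick Python check of that formula on made-up returns:

```python
def skewness(xs):
    # Population skewness: third central moment over variance^(3/2),
    # matching the m3 * m2^(-3/2) formula in the Stata code above
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 * m2 ** (-3 / 2)

rets = [0.01, -0.02, 0.03, -0.01, 0.10]   # toy returns with one large value
print(skewness(rets))                      # positive: long right tail
```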


The model uses lawsuit data obtained from Stanford Law School’s Securities Class Action Clearinghouse (http://securities.stanford.edu). However, the website does not deliver the data in a downloadable format. I wrote a Python program for extracting the data from the website (a technique called web scraping).

I use Python 3.x; please install all required modules. I provide the data (as of 2018/07/15) in a CSV file for easy download (scac.csv).

```python
from urllib import request
from bs4 import BeautifulSoup
import re
from math import ceil
import csv

# Determine the number of pages to webscrape
scac = "http://securities.stanford.edu/filings.html"
page = request.urlopen(scac)
soup = BeautifulSoup(page, 'html.parser')
heading = soup.find_all('h4')[-1].get_text()
total_record_num = re.findall(r'\d+', heading)[0]
total_page_num = ceil(int(total_record_num) / 20)

# Webscrape all pages
container = [("filing_name", "filing_date", "district_court", "exchange", "ticker")]
i = 1
while i <= total_page_num:
    url = scac + "?page=" + repr(i)
    print(url)
    page = request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find('table', class_='table table-bordered table-striped table-hover')
    tbody = table.find('tbody')
    for row in tbody.find_all('tr'):
        columns = row.find_all('td')
        c1 = re.sub(r'[\t\n]', '', columns[0].get_text()).strip()
        c2 = re.sub(r'[\t\n]', '', columns[1].get_text()).strip()
        c3 = re.sub(r'[\t\n]', '', columns[2].get_text()).strip()
        c4 = re.sub(r'[\t\n]', '', columns[3].get_text()).strip()
        c5 = re.sub(r'[\t\n]', '', columns[4].get_text()).strip()
        container.append((c1, c2, c3, c4, c5))
    i = i + 1

# Write to a CSV file
with open('scac.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(container)
```
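The page-count step in the code above can be sanity-checked in isolation; the heading text below is made up for illustration, since the live page’s wording may differ:

```python
import re
from math import ceil

heading = "4821 filings found"                     # hypothetical heading text
total_record_num = re.findall(r'\d+', heading)[0]  # first number in the heading
total_page_num = ceil(int(total_record_num) / 20)  # 20 filings per page
print(total_page_num)
```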


- Campbell, J. Y., and G. B. Taksler. 2003. Equity volatility and corporate bond yields. *The Journal of Finance* 58 (6):2321–2350. doi:10.1046/j.1540-6261.2003.00607.x
- Rajgopal, S., and M. Venkatachalam. 2011. Financial reporting quality and idiosyncratic return volatility. *Journal of Accounting and Economics* 51 (1–2):1–20. doi:10.1016/j.jacceco.2010.06.001

The code in this post is used to calculate Campbell and Taksler’s (2003) idiosyncratic stock return volatility, but it can be easily modified for other definitions.

Specifically, this code requires an input dataset that includes two variables: `permno` and `enddt`, where `enddt` is the date of interest. The code calculates the standard deviation of daily abnormal returns over the 180 calendar days before (and including) `enddt`. Abnormal returns are calculated using four methods: (1) market-adjusted; (2) standard market model; (3) Fama-French three factors; and (4) Fama-French three factors plus momentum. The code requires at least 21 return observations (one month of trading days) over that 180-day period for a `permno` to calculate its stock return volatility.

```sas
libname local "D:\Dropbox";

* Remote signon and upload input database including permno and date;
%let wrds=wrds-cloud.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=_prompt_;
rsubmit;
proc upload data=local.input out=input; run;

* Get raw returns and FF risk factors;
proc sql;
  create table rets as
  select a.*, b.date, b.ret, c.smb, c.hml, c.umd,
         d.vwretd as mktret, (b.ret-d.vwretd) as exret
  from input a
  left join crsp.dsf b
    on a.permno=b.permno and a.enddt-180<b.date<=a.enddt
  left join ff.factors_daily c on b.date=c.date
  left join crsp.dsi d on b.date=d.date
  order by a.permno, a.enddt, b.date;
quit;

* Estimate factor exposures;
proc printto log=junk; run;
proc reg data=rets edf outest=params noprint;
  by permno enddt;
  eq0: model exret=;
  eq1: model ret=mktret;
  eq2: model ret=mktret smb hml;
  eq3: model ret=mktret smb hml umd;
run;
proc printto; run;

* Compute abnormal returns for all models for each trading day;
data abrets;
  merge rets (in=a)
        params (where=(_model_='eq0') keep=permno enddt _model_ _rmse_ _p_ _edf_
                rename=(_rmse_=std0 _p_=p0 _edf_=edf0))
        params (where=(_model_='eq1') keep=permno enddt _model_ _rmse_ intercept mktret
                rename=(_rmse_=std1 intercept=alpha1 mktret=beta1))
        params (where=(_model_='eq2') keep=permno enddt _model_ _rmse_ intercept mktret smb hml
                rename=(_rmse_=std2 intercept=alpha2 mktret=beta2 smb=sminb2 hml=hminl2))
        params (where=(_model_='eq3') keep=permno enddt _model_ _rmse_ intercept mktret smb hml umd
                rename=(_rmse_=std3 intercept=alpha3 mktret=beta3 smb=sminb3 hml=hminl3 umd=umind3));
  by permno enddt;
  var0=std0**2; var1=std1**2; var2=std2**2; var3=std3**2;
  abret0=exret;
  expret1=alpha1+beta1*mktret; abret1=ret-expret1;
  expret2=alpha2+beta2*mktret+sminb2*smb+hminl2*hml; abret2=ret-expret2;
  expret3=alpha3+beta3*mktret+sminb3*smb+hminl3*hml+umind3*umd; abret3=ret-expret3;
  nobs=p0+edf0; /* number of observations used in estimation */
  drop p0 edf0 std0 std1 std2 std3 _model_ exret;
  if a and nobs>=21;
run;

proc sort data=abrets; by permno enddt date; run;
proc means data=abrets noprint;
  by permno enddt;
  output out=retvol std(abret0)=vol0 std(abret1)=vol1 std(abret2)=vol2 std(abret3)=vol3;
run;

* Download output dataset and remote signoff;
proc download data=retvol out=local.retvol; run;
endrsubmit;
signoff;
```
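For readers without SAS/WRDS access, the market-model case (eq1 above) boils down to an OLS of returns on the market return, with idiosyncratic volatility as the standard deviation of the residuals. A self-contained Python sketch on hypothetical daily returns (not the full Campbell-Taksler pipeline):

```python
def market_model_idio_vol(ret, mktret):
    # OLS for ret = alpha + beta * mktret + e, then the (population)
    # standard deviation of the residuals as idiosyncratic volatility
    n = len(ret)
    mr, mm = sum(ret) / n, sum(mktret) / n
    beta = sum((m - mm) * (r - mr) for m, r in zip(mktret, ret)) / \
           sum((m - mm) ** 2 for m in mktret)
    alpha = mr - beta * mm
    resid = [r - (alpha + beta * m) for r, m in zip(ret, mktret)]
    var = sum(e ** 2 for e in resid) / n   # residuals have mean zero by construction
    return beta, var ** 0.5

# Hypothetical daily returns: ret moves with the market plus small noise
mktret = [0.010, -0.020, 0.015, 0.000, 0.005]
noise = [0.001, -0.001, 0.000, 0.002, -0.002]
ret = [1.5 * m + e for m, e in zip(mktret, noise)]

beta, idio_vol = market_model_idio_vol(ret, mktret)
print(beta, idio_vol)   # beta near 1.5; volatility on the order of the noise
```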


First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.

**Truncate and winsorize**

In my opinion, the best Stata commands for truncating and winsorizing are `truncateJ` and `winsorizeJ`, written by Judson Caskey. I will not explain why at length, but I highly recommend his work. Please see his website here.

To install these two user-written commands, you can type:

`net from https://sites.google.com/site/judsoncaskey/data`

`net install utilities.pkg`

After the installation, you can type `help truncateJ` or `help winsorizeJ` to learn how to use these two commands.
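As a rough illustration of what winsorizing does (this is not a re-implementation of Judson Caskey’s command; it uses a simple order-statistic cutoff, and real commands may choose percentiles differently):

```python
def winsorize(xs, p=0.01):
    # Cap values below the p-th and above the (1-p)-th sample quantile.
    # Simple order-statistic convention; production commands may interpolate.
    s = sorted(xs)
    n = len(s)
    lo = s[int(p * n)]
    hi = s[int((1 - p) * n) - 1]
    return [min(max(x, lo), hi) for x in xs]

data = list(range(1, 101))      # 1, 2, ..., 100
w = winsorize(data, p=0.05)
print(min(w), max(w))           # extremes pulled in to the 5th/95th cutoffs
```

Truncating differs only in that the extreme observations are dropped rather than capped.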

**Studentized residuals**

The first step is to run a regression without specifying any `vce` parameter in Stata (i.e., not using robust or clustered standard errors). Suppose the dependent variable is `y` and the independent variables are `x1` and `x2`. The first step should look like this:

`regress y x1 x2`

Then, use the `predict` command:

`predict rstu if e(sample), rstudent`

If the absolute value of `rstu` exceeds a certain critical value, the data point is considered an outlier and deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since `rstu` is a *t* statistic, the critical value for a traditional significance level can be applied, for example, 1.96 (or 2) for the 5% significance level. That is why in the literature we often see that data points with absolute studentized residuals greater than 2 are deleted. Some papers use the critical value of 3, which corresponds to a 0.27% significance level and seems to me less reasonable.

Now use the following command to drop “outliers” based on the critical value of 2:

`drop if abs(rstu) > 2`

The last step is to re-run the regression, but this time we can add appropriate `vce` parameters to address additional issues such as heteroskedasticity:

`regress y x1 x2, vce(robust)` or `regress y x1 x2, vce(cl gvkey)`
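To see what the |`rstu`| > 2 rule is doing, here is a pure-Python sketch of (internally) studentized residuals for a one-regressor OLS on made-up data with one planted outlier. Note that Stata’s `rstudent` is the externally studentized version, so the numbers differ slightly, but the logic is the same:

```python
def studentized_residuals(x, y):
    # Internally studentized residuals for y = a + b*x:
    # r_i = e_i / (s * sqrt(1 - h_i)), with h_i the leverage of observation i
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)       # 2 estimated parameters
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e / (s2 * (1 - h)) ** 0.5 for e, h in zip(resid, lev)]

x = [1, 2, 3, 4, 5, 6, 7]
y = [2, 4, 6, 18, 10, 12, 14]     # y = 2x, except the planted outlier at x = 4

r = studentized_residuals(x, y)
outliers = [i for i, ri in enumerate(r) if abs(ri) > 2]
print(outliers)                    # only the planted outlier survives |r| > 2
```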

**Cook’s distance**

This method is similar to studentized residuals. We compute a diagnostic statistic, namely Cook’s distance, for each observation and then delete any data points with a Cook’s distance greater than 4/N, where N is the sample size (Cook’s distance is always non-negative).

`regress y x1 x2`

`predict cooksd if e(sample), cooksd`

`drop if cooksd > 4/_N`

Next, re-run the regression with appropriate `vce` parameters:

`regress y x1 x2, vce(robust)` or `regress y x1 x2, vce(cl gvkey)`
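The same toy setting illustrates the textbook formula for Cook’s distance in a one-regressor OLS, D_i = (r_i^2 / p) * h_i / (1 - h_i), with p the number of estimated parameters (the data and setup are made up):

```python
def cooks_distance(x, y):
    # Cook's distance for y = a + b*x (p = 2 estimated parameters):
    # D_i = (r_i^2 / p) * h_i / (1 - h_i), r_i = internally studentized residual
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [(e ** 2 / (s2 * (1 - h))) / 2 * h / (1 - h)
            for e, h in zip(resid, lev)]

x = [1, 2, 3, 4, 5, 6, 7]
y = [2, 4, 6, 18, 10, 12, 14]     # same planted outlier at x = 4

d = cooks_distance(x, y)
print(d.index(max(d)))             # the planted outlier has the largest distance
```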

Lastly, I thank the authors of the following articles, from which I benefited:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

- Use Python to download TXT-format SEC filings on EDGAR (Part I)
- Use Python to download TXT-format SEC filings on EDGAR (Part II)

Although TXT-format files have the benefit of easy further processing, they are oftentimes not well formatted and thus hard to read. An HTML-format 10-K is easier on the eyes. SEC also provides the paths (namely, URLs) to HTML-format filings. With the path, we can open an HTML-format filing in a web browser, or further download the filing as a PDF.

The Python code has two parts. In the first part, we download the path data. Instead of using `master.idx` as in the above two posts, we use `crawler.idx` for this task. The path we get will be a URL like this:

https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm

Note that the path we get is a URL to an index page, not a URL to the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the Form (the URL embedded in the first row in more than 99% of cases). The code also extracts information such as filing date and period of report from the index page. The code writes the output (including filing date, period of report, and direct URL) to `log.csv`. The following is an output example: the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format Form.

```
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm
```
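The fixed-width parsing in the first part of the code hinges on the column offsets found in the header line of `crawler.idx`. A miniature, self-contained illustration (column widths and values are invented):

```python
# Build a miniature fixed-width index line in the style of crawler.idx
header = ("Company Name".ljust(20) + "Form Type".ljust(12) + "CIK".ljust(8) +
          "Date Filed".ljust(12) + "URL")
line = ("ACME CORP".ljust(20) + "10-K".ljust(12) + "12345".ljust(8) +
        "2016-02-26".ljust(12) +
        "https://www.sec.gov/Archives/edgar/data/12345/0000012345-16-000001-index.htm")

# Locate each column by its header offset, then slice the data line
typeloc = header.find('Form Type')
cikloc = header.find('CIK')
dateloc = header.find('Date Filed')
urlloc = header.find('URL')

conm = line[:typeloc].strip()
ftype = line[typeloc:cikloc].strip()
cik = line[cikloc:dateloc].strip()
date = line[dateloc:urlloc].strip()
path = line[urlloc:].strip()
print(conm, ftype, cik, date)
```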

The first part of the code:

```python
# Generate the list of index files archived in EDGAR since start_year
# (earliest: 1993) until the most recent quarter
import datetime

current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_htm_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

for url in urls:
    lines = requests.get(url).text.splitlines()
    nameloc = lines[7].find('Company Name')
    typeloc = lines[7].find('Form Type')
    cikloc = lines[7].find('CIK')
    dateloc = lines[7].find('Date Filed')
    urlloc = lines[7].find('URL')
    records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(),
                      line[cikloc:dateloc].strip(), line[dateloc:urlloc].strip(),
                      line[urlloc:].strip()]) for line in lines[9:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_htm_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_htm_idx.dta')
```

The first part of the code generates a dataset of the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). You can then select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code. The input CSV should look like this:

```
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm
```

The second part of the code:

```python
import csv
import random
import time
from selenium import webdriver

with open('log.csv', 'w', newline='') as log:
    logwriter = csv.writer(log)
    with open('sample.csv', newline='') as infile:
        records = csv.reader(infile)
        for r in records:
            log_row = r.copy()
            print('Start fetching URL to', r[2], r[3], 'filed on', r[4], '...')
            start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
            driver = webdriver.Chrome('./chromedriver')
            try:
                driver.get(r[5])
                time.sleep(3 + random.random() * 3)
                filing_date = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[1]/div[2]').text
                period_of_report = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[2]/div[2]').text
                form_text = driver.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a').text
                form_link = driver.find_element_by_link_text(form_text).get_attribute('href')
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Success!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, filing_date, period_of_report, form_link]
            except:
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Error!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, 'ERROR!']
            driver.quit()
            logwriter.writerow(log_row)
```

Note:

- Please use Python 3.x.
- Please install all required modules, such as selenium. Google the relevant documentation if you do not know how to install them.
- The second part of the code only outputs the direct URL to the HTML-format filing. If you want to save the filing as a PDF, you need to write additional Python code on your own.