Just knowledge it: DATA QUALITY

Data quality varies from excellent to awful. Since bad data can wreak havoc with all forms of analysis, lead to misleading results, and waste precious time, only use the best data that can be found when running tests and trading simulations. Some forecasting models, including those based on neural networks, can be exceedingly sensitive to a few errant data points; in such cases, clean, error-free data is essential. Time spent finding good data, and then giving it a final scrubbing, is time well spent.

Data errors take many forms, some more innocuous than others. In real-time trading, for example, ticks are occasionally received that have extremely deviant, if not obviously impossible, prices. The S&P 500 may appear to be trading at 952.00 one moment and at 250.50 the next! Is this the ultimate market crash? No. A few seconds later, another tick will come along, indicating the S&P 500 is again trading at 952.00 or thereabouts. What happened? A bad tick, a “noise spike,” occurred in the data. This kind of data error, if not detected and eliminated, can skew the results produced by almost any mechanical trading model. Although anything but innocuous, such errors are obvious, are easy to detect (even automatically), and are readily corrected or otherwise handled. More innocuous, albeit less obvious and harder to find, are the common, small errors in the settling price, and other numbers reported by the exchanges, that are frequently passed on to the consumer by the data vendor. Better data vendors repeatedly check their data and post corrections as such errors are detected. For example, on an almost daily basis, Pinnacle Data posts error corrections that are handled automatically by its software. Many of these common, small errors are not seriously damaging to software-based trading simulations, but one never knows for sure.
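To illustrate how a noise spike of this kind might be caught automatically, here is a minimal Python sketch. It is not the method used by any particular vendor or package. The function name `find_noise_spikes` and the 10% threshold `max_rel_dev` are assumptions chosen for illustration. A tick is treated as a spike when it departs sharply from both of its neighbors while those neighbors agree with each other, which mirrors the 952.00, 250.50, 952.00 pattern described above.

```python
def find_noise_spikes(prices, max_rel_dev=0.10):
    """Return the indices of isolated ticks that look like noise spikes.

    A tick is flagged when it differs from both the preceding and the
    following tick by more than max_rel_dev (a hypothetical threshold),
    while those two neighbors agree with each other.
    """
    spikes = []
    for i in range(1, len(prices) - 1):
        prev, cur, nxt = prices[i - 1], prices[i], prices[i + 1]
        neighbors_agree = abs(nxt - prev) <= max_rel_dev * prev
        departs = (abs(cur - prev) > max_rel_dev * prev
                   and abs(cur - nxt) > max_rel_dev * nxt)
        if neighbors_agree and departs:
            spikes.append(i)
    return spikes


# Illustrative tick sequence modeled on the example in the text.
ticks = [952.00, 952.10, 250.50, 952.05, 951.90]
spike_indices = find_noise_spikes(ticks)  # [2]
```

Because the check needs the tick that follows, a spike can only be confirmed one tick late. A real-time system would have to hold back or provisionally accept the suspicious tick until the next one arrives.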

Depending on the sensitivity of the trading or forecasting model being analyzed, and on such other factors as the availability of data-checking software, it may be worthwhile to run miscellaneous statistical scans to highlight suspicious data points. There are many ways to flag these data points, or outliers, as they are sometimes referred to by statisticians. Missing, extra, and logically inconsistent data points are also occasionally seen; they should be noted and corrected. As an example of data checking, two data sets were run through a utility program that scans for missing data points, outliers, and logical inconsistencies. The results appear in Tables 1-1 and 1-2, respectively.

Table 1-1 shows the output produced by the data-checking program when it was used on Pinnacle Data Corporation’s (800-724-4903) end-of-day, continuous-contract data for the S&P 500 futures. The utility found no illogical prices or volumes in this data set; there were no observed instances of a high that was less than the close, a low that was greater than the open, a volume that was less than zero, or of any cognate data faux pas. Two data points (bars) with suspiciously high ranges, however, were noted by the software: One bar with unusual range occurred on 10/19/87 (or 871019 in the report). The other was dated 10/13/89. The abnormal range observed on 10/19/87 does not reflect an error, just the normal volatility associated with a major crash like that of Black Monday; nor is a data error responsible for the aberrant range seen on 10/13/89, which appeared due to the so-called anniversary effect. Since these statistically aberrant data points were not errors, corrections were unnecessary. Nonetheless, the presence of such data points should emphasize the fact that market events involving exceptional ranges do occur and must be managed adequately by a trading system. All ranges shown in Table 1-1 are standardized ranges, computed by dividing a bar’s range by the average range over the last 20 bars. As is common with market data, the distribution of the standardized range had a longer tail than would be expected given a normally distributed underlying process. Nevertheless, the events of 10/19/87 and 10/13/89 appear to be statistically exceptional: The distribution of all other range data declined, in an orderly fashion, to zero at a standardized value of 7, well below the range of 10 seen for the critical bars.
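The two kinds of check described here, logical consistency and standardized range, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the utility that produced Table 1-1. The `Bar` layout and the function names are hypothetical, and "the last 20 bars" is taken to mean the 20 bars preceding the one being scored.

```python
from collections import namedtuple

# Hypothetical bar layout; field names are assumptions for illustration.
Bar = namedtuple("Bar", "date open high low close volume")


def logical_errors(bar):
    """Return a list of logical inconsistencies found in one bar."""
    problems = []
    if bar.high < max(bar.open, bar.low, bar.close):
        problems.append("high below open, low, or close")
    if bar.low > min(bar.open, bar.high, bar.close):
        problems.append("low above open, high, or close")
    if bar.volume < 0:
        problems.append("negative volume")
    return problems


def standardized_ranges(bars, lookback=20):
    """Divide each bar's range by the average range of the preceding
    `lookback` bars (assumed reading of "the last 20 bars").
    Bars without enough history get None."""
    ranges = [b.high - b.low for b in bars]
    scores = []
    for i, r in enumerate(ranges):
        if i < lookback:
            scores.append(None)
            continue
        avg = sum(ranges[i - lookback:i]) / lookback
        scores.append(r / avg if avg > 0 else None)
    return scores


# Made-up data: 20 quiet bars with a range of 1.0, then one wide bar.
bars = [Bar(i, 100.0, 100.5, 99.5, 100.0, 1000) for i in range(20)]
bars.append(Bar(20, 100.0, 105.0, 95.0, 100.0, 1000))
range_scores = standardized_ranges(bars)

# A bar with an opening price of zero, which lies below the low.
bad_bar = Bar(21, 0.0, 22.0, 21.0, 21.5, 1000)
bad_bar_problems = logical_errors(bad_bar)
```

A bar would then be flagged when its standardized range exceeds some cutoff. The text suggests that values near 10 stood well apart from the rest of the distribution, which tailed off at about 7, but the cutoff itself is a judgment call.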

The data-checking utility also flagged 5 bars as having exceptionally deviant closing prices. As with range, deviance has been defined in terms of a distribution, using a standardized close-to-close price measure. In this instance, the standardized measure was computed by dividing the absolute value of the difference between each closing price and its predecessor by the average of the preceding 20 such absolute values. When the 5 flagged (and most deviant) bars were omitted, the same distributional behavior that characterized the range was observed: a long-tailed distribution of close-to-close price change that fell off, in an orderly fashion, to zero at 7 standardized units. Standardized close-to-close deviance scores (DEV) of 8 were noted for 3 of the aberrant bars, and scores of 10 were observed for the remaining 2 bars. Examination of the flagged data points again suggests that unusual market activity, rather than data error, was responsible for their statistical deviance. It is not surprising that the 2 most deviant data points were the same ones noted earlier for their abnormally high range. Finally, the data-checking software did not find any missing bars, bars falling on weekends, or bars with duplicate or out-of-order dates. The only outliers detected appear to be the result of bizarre market conditions, not corrupted data. Overall, the S&P 500 data series appears to be squeaky-clean. This was expected: In our experience, Pinnacle Data Corporation (the source of the data) supplies data of very high quality.
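As a rough illustration of the close-to-close deviance measure and the date checks just described, consider the following Python sketch. The function names are hypothetical, the dates are assumed to be YYMMDD strings like the 871019 seen in the report, and the closing prices are made up. It is one possible reading of the description, not the actual utility.

```python
from datetime import datetime


def close_deviance(closes, lookback=20):
    """Score each close as |close - previous close| divided by the average
    of the preceding `lookback` absolute close-to-close changes.
    Bars without enough history get None."""
    changes = [abs(b - a) for a, b in zip(closes, closes[1:])]
    scores = [None] * len(closes)
    # changes[k] is the move into closes[k + 1].
    for k in range(lookback, len(changes)):
        avg = sum(changes[k - lookback:k]) / lookback
        if avg > 0:
            scores[k + 1] = changes[k] / avg
    return scores


def date_problems(dates, fmt="%y%m%d"):
    """Flag weekend, duplicate, and out-of-order dates in a bar series."""
    parsed = [datetime.strptime(d, fmt).date() for d in dates]
    problems = []
    for i, day in enumerate(parsed):
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            problems.append((dates[i], "weekend"))
        if i > 0:
            if day == parsed[i - 1]:
                problems.append((dates[i], "duplicate"))
            elif day < parsed[i - 1]:
                problems.append((dates[i], "out of order"))
    return problems


# Made-up closes: 20 moves of 1.0 followed by a move of 8.0.
closes = [100.0 + (i % 2) for i in range(21)] + [108.0]
dev_scores = close_deviance(closes)

# Made-up date list containing a duplicate and a misplaced Sunday.
issues = date_problems(["871016", "871019", "871019", "871018"])
```

Here the final close receives a deviance score of 8, comparable to the DEV values reported for the flagged bars, and the date scan reports the repeated 871019 as a duplicate and 871018 as both a weekend and an out-of-order date.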

As an example of how bad data quality can get, and the kinds of errors that can be expected when dealing with low-quality data, another data set was analyzed with the same data-checking utility. This data, obtained from an acquaintance, was for Apple Computer (AAPL). The data-checking results appear in Table 1-2.

In this data set, unlike in the previous one, 2 bars were flagged for having outright logical inconsistencies. One logically invalid data point had an opening price of zero, which was also lower than the low, while the other bar had a high price that was lower than the closing price. Another data point was detected as having an excessive range, which may or may not be a data error. In addition, several bars evidenced extreme closing price deviance, perhaps reflecting uncorrected stock splits. There were no duplicate or out-of-order dates, but quite a few data points were missing. In this instance, the missing data points were holidays and, therefore, only reflect differences in data handling: for a variety of reasons, we usually fill holidays with data from previous bars. Considering that the data series extended only from 1/2/97 through 11/6/98 (in contrast to the S&P 500, which ran from 1/3/83 to 5/21/98), it is distressing that several serious errors, including logical violations, were detected by a rather simple scan.
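Filling holidays with data from previous bars can be sketched as follows. The text does not say exactly which fields are carried forward, so copying the entire previous bar unchanged is an assumption, as are the function name `fill_missing_weekdays`, the tuple layout, and the prices used in the example.

```python
from datetime import date, timedelta


def fill_missing_weekdays(bars):
    """Insert a copy of the previous bar for every missing weekday.

    `bars` is a list of (date, open, high, low, close) tuples sorted by
    date. Copying the whole previous bar is an assumption; other
    conventions (for example, setting all prices to the prior close)
    are equally plausible.
    """
    filled = []
    for bar in bars:
        if filled:
            day = filled[-1][0] + timedelta(days=1)
            while day < bar[0]:
                if day.weekday() < 5:  # Monday through Friday only
                    filled.append((day,) + filled[-1][1:])
                day += timedelta(days=1)
        filled.append(bar)
    return filled


# Made-up prices around the 4 July 1997 holiday (a Friday).
series = [
    (date(1997, 7, 3), 14.0, 14.5, 13.8, 14.2),
    (date(1997, 7, 7), 14.3, 14.6, 14.1, 14.4),
]
filled_series = fill_missing_weekdays(series)
```

The filled series gains one bar, dated 4 July 1997, carrying the prices of 3 July; the weekend that follows is skipped. The same loop, run without the fill step, would serve to list missing weekdays for inspection instead.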

The implication of this exercise is that data should be purchased only from a reputable vendor who takes data quality seriously; this will save time and ensure reliable, error-free data for system development, testing, and trading. In addition, all data should be scanned for errors to avoid disturbing surprises. For an in-depth discussion of data quality, which includes coverage of how data is produced, transmitted, received, and stored, see Jurik (1999).

Monday, 17 November 2014
