Data collection


Example of data collection in the biological sciences: Adélie penguins are identified and weighed each time they cross the automated weighbridge on their way to or from the sea.[1]

Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. Data collection is a research component in all study fields, including physical and social sciences, humanities,[2] and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the questions that have been posed. Data collection and validation consists of four steps when it involves taking a census and seven steps when it involves sampling.[3]

Regardless of the field of study or preference for defining data (quantitative or qualitative), accurate data collection is essential to maintain research integrity. The selection of appropriate data collection instruments (existing, modified, or newly developed) and delineated instructions for their correct use reduce the likelihood of errors.

A formal data collection process is necessary as it ensures that the data gathered are both defined and accurate. This way, subsequent decisions based on arguments embodied in the findings are made using valid data.[4] The process provides both a baseline from which to measure and in certain cases an indication of what to improve.

Five commonly used data collection methods are:

  1. closed-ended surveys and quizzes,
  2. open-ended surveys and questionnaires,
  3. 1-on-1 interviews,
  4. focus groups, and
  5. direct observation.[5]

DMPs and data collection

DMP is the abbreviation for data management platform, a centralized storage and analytical system for data. Used mainly by marketers, DMPs compile large amounts of data and transform them into discernible information.[6] Marketers may want to receive and use first-, second-, and third-party data. DMPs enable this because they act as the aggregating system for DSPs (demand-side platforms) and SSPs (supply-side platforms). When it comes to advertising, DMPs are integral to optimizing campaigns and guiding marketers in future ones. The effectiveness of these systems illustrates that categorized, analyzed, and compiled data is far more useful than raw data.

Data collection on z/OS

z/OS is a widely used operating system for IBM mainframes. It is designed to offer a stable, secure, and continuously available environment for applications running on the mainframe. Operational data is the data that a z/OS system produces when it runs. This data indicates the health of the system and can be used to identify sources of performance and availability issues. The analysis of operational data by analytics platforms provides insights and recommended actions to make the system work more efficiently and to help resolve or prevent problems. IBM Z Common Data Provider collects IT operational data from z/OS systems, transforms it to a consumable format, and streams it to analytics platforms.[7]

IBM Z Common Data Provider supports the collection of the following operational data:[8]

  • System Management Facilities (SMF) data
  • Log data from the following sources:
    • Job log, the output which is written to a data definition (DD) by a running job
    • z/OS UNIX log file, including the UNIX System Services system log (syslogd)
    • Entry-sequenced Virtual Storage Access Method (VSAM) cluster
    • z/OS system log (SYSLOG)
    • IBM Tivoli NetView for z/OS messages
    • IBM WebSphere Application Server for z/OS High Performance Extensible Logging (HPEL) log
    • IBM Resource Measurement Facility (RMF) Monitor III reports
  • User application data, the operational data from users' own applications
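The collect, transform, and stream stages described above follow a common pipeline pattern. The sketch below is a generic illustration in Python, not the actual interface of IBM Z Common Data Provider: the log line layout, the field names, and the functions `collect`, `transform`, and `stream` are all hypothetical.

```python
import json

def collect(lines):
    """Yield raw, non-empty log lines from any iterable source."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def transform(raw_lines):
    """Parse 'timestamp system message' lines (an assumed layout) into dicts."""
    for line in raw_lines:
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        timestamp, system, message = parts
        yield {"timestamp": timestamp, "system": system, "message": message}

def stream(events, sink):
    """Serialize each event as JSON, pass it to a sink callable, return the count."""
    count = 0
    for event in events:
        sink(json.dumps(event))
        count += 1
    return count

# Hypothetical sample input; a real collector would read from system logs.
raw = [
    "2024-01-01T00:00:00Z SYS1 job started",
    "",
    "2024-01-01T00:00:05Z SYS1 job ended",
]
received = []
sent = stream(transform(collect(raw)), received.append)
print(sent, received)
```

Because each stage is a generator, records flow through one at a time, which suits continuous streaming better than loading a whole log into memory first.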

Data integrity issues[9]

The main reason for maintaining data integrity is to support the detection of errors in the data collection process. Those errors may be intentional (deliberate falsification) or unintentional (random or systematic errors).
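The difference between random and systematic errors can be illustrated with a small simulation. The sketch below is illustrative only: the true value, noise level, and bias are made-up numbers, and the function name `simulate_readings` is hypothetical. It suggests that averaging many readings tends to cancel random error but leaves a systematic offset in place.

```python
import random
import statistics

def simulate_readings(true_value, n, noise_sd=0.0, bias=0.0, seed=0):
    """Return n simulated measurements of true_value.

    noise_sd models random error (scatter around the truth);
    bias models systematic error (a constant offset, such as a
    miscalibrated scale).
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_WEIGHT = 4.0  # hypothetical true value, in kg

random_only = simulate_readings(TRUE_WEIGHT, n=10_000, noise_sd=0.2)
with_bias = simulate_readings(TRUE_WEIGHT, n=10_000, noise_sd=0.2, bias=0.5)

mean_random = statistics.mean(random_only)
mean_biased = statistics.mean(with_bias)

print(f"mean with random error only: {mean_random:.3f}")
print(f"mean with systematic error:  {mean_biased:.3f}")
```

The first mean lands close to the true value, while the second stays offset by roughly the bias no matter how many readings are taken. This is why systematic errors call for calibration or protocol changes rather than more data.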

Two approaches may protect data integrity and secure the scientific validity of study results, as described by Craddick, Crawford, Rhodes, Redican, Rukenbrod and Laws in 2003:

  • Quality assurance – all actions carried out before data collection
  • Quality control – all actions carried out during and after data collection

Quality assurance

Quality assurance focuses on prevention, which is generally a cost-effective way to protect the integrity of data collection. The clearest example is standardization of protocol, developed in a comprehensive and detailed procedures manual for data collection. Poorly written guidelines increase the risk of failing to identify problems and errors in the research process. Examples of such failures include:

  • Uncertainty of timing, methods and identification of the responsible person
  • Partial listing of items needed to be collected
  • Vague description of data collection instruments instead of rigorous step-by-step instructions on administering tests
  • Failure to recognize exact content and strategies for training and retraining staff members responsible for data collection
  • Unclear instructions for using, making adjustments to, and calibrating data collection equipment
  • No predetermined mechanism to document changes in procedures that occur during the investigation

Quality control

Because quality control actions occur during or after data collection, all the details should be carefully documented. A clearly defined communication structure is a precondition for establishing monitoring systems. Uncertainty about the flow of information should be avoided, because a poorly organized communication structure leads to lax monitoring and can limit opportunities for detecting errors. Quality control also identifies the actions needed to correct faulty data collection practices and to minimize future occurrences. A team is less likely to recognize the need for these actions if its procedures are written vaguely and are not based on feedback or education.

Data collection problems that necessitate prompt action:

  • Systematic errors
  • Violation of protocol
  • Fraud or scientific misconduct
  • Errors in individual data items
  • Individual staff or site performance problems
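Some of these problems, such as errors in individual data items and performance problems tied to a particular staff member, can be surfaced by automated checks on collected records. The following is a minimal sketch under stated assumptions: the field names, the plausible ranges in `RANGES`, and the functions `check_record` and `audit` are hypothetical and would need to come from a study's own procedures manual.

```python
from collections import Counter

# Hypothetical rules: field name -> (min, max) plausible range.
RANGES = {"weight_kg": (1.0, 10.0), "age_years": (0, 30)}
REQUIRED = ("record_id", "collector", "weight_kg", "age_years")

def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field}={value} outside [{low}, {high}]")
    return problems

def audit(records):
    """Check every record; return per-record problems and per-collector counts."""
    flagged = {}
    per_collector = Counter()
    for record in records:
        problems = check_record(record)
        if problems:
            flagged[record.get("record_id")] = problems
            per_collector[record.get("collector")] += len(problems)
    return flagged, per_collector

# Made-up sample records for illustration.
records = [
    {"record_id": 1, "collector": "A", "weight_kg": 4.2, "age_years": 5},
    {"record_id": 2, "collector": "B", "weight_kg": 42.0, "age_years": 5},
    {"record_id": 3, "collector": "B", "weight_kg": None, "age_years": 7},
]
flagged, per_collector = audit(records)
print(flagged)
print(dict(per_collector))
```

Counting problems per collector is one simple way to notice that errors cluster around a particular person or site. Checks like these cannot detect fraud or protocol violations on their own; they only flag records that merit a closer look.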

References

  1. Lescroël, A. L.; Ballard, G.; Grémillet, D.; Authier, M.; Ainley, D. G. (2014). Descamps, Sébastien (ed.). "Antarctic Climate Change: Extreme Events Disrupt Plastic Phenotypic Response in Adélie Penguins". PLOS ONE. 9 (1): e85291. doi:10.1371/journal.pone.0085291. PMC 3906005. PMID 24489657.
  2. Vuong, Quan-Hoang; La, Viet-Phuong; Vuong, Thu-Trang; Ho, Manh-Toan; Nguyen, Hong-Kong T.; Nguyen, Viet-Ha; Pham, Hiep-Hung; Ho, Manh-Tung (September 25, 2018). "An open database of productivity in Vietnam's social sciences and humanities for public use". Scientific Data. 5: 180188. doi:10.1038/sdata.2018.188. PMC 6154282. PMID 30251992.
  3. Ziafati Bafarasat, A. (2021). "Collecting and validating data: A simple guide for researchers". Advance. Preprint.
  4. Sapsford, Roger; Jupp, Victor. Data Collection and Analysis. ISBN 0-7619-5046-X.
  5. Jovancic, Nemanja. "5 Data Collection Methods for Obtaining Quantitative and Qualitative Data". LeadQuizzes. Retrieved 23 February 2020.
  6. Collin, E. M. (2020-11-04). "Data Collection: The Complete Guide". Easy Earned Money. Retrieved 2020-11-05.
  7. IBM: IBM Z Common Data Provider.
  8. IBM: IBM Z Common Data Provider Knowledge Center.
  9. Northern Illinois University (2005). "Data Collection". Responsible Conduct in Data Management. Retrieved June 8, 2019.

External links

  • Bureau of Statistics, Guyana by Arun Sooknarine