Databases to be migrated can hold a wide range of data representations and contents, from simple numeric fields to fields with complex structure and content, which may contain files, images, tables, or even complex custom objects (e.g. in XML, CLOB, or BLOB formats).
The discovery of simpler data can be done easily and very efficiently using traditional methods with some basic knowledge and value sets. However, as the data becomes more complex, this efficiency decreases until traditional methods are no longer sufficient.
Examples of such complex data are unstructured text and binary files whose type and content cannot be clearly identified and interpreted. For the sake of argument, let’s ignore the fact that the use of such data types in databases is justified only in a few specific cases, as this problem often arises when migrating complex systems. These complex data formats are usually unstructured: structurally they are only a set of bytes in a given field, about which the user often has no reliable information due to incomplete documentation. Without meta-information it is difficult to draw conclusions about the type of content and its interpretation. So our first task is to decide what kind of data these fields contain.
The most obvious way to identify the type of a file would be to check its extension, but when files are stored in a database this information is typically not available, and even if it were, it could not be fully trusted. Assuming that only the binary version of the data is available, i.e. a Binary Large Object (BLOB), the first step is to load it. All metadata is encoded in this BLOB, according to the format of the file.
Depending on the format of the file or the type of data, the metadata may appear in several places, typically in the header or at the end of the file. In addition to identifying the file format, file headers may also contain metadata about the file and its contents. Character-based (text) files usually have character-based headers, while binary formats usually have binary headers, although this is not a standard, just industry practice.
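As a minimal illustration of header-based identification, the sketch below compares the first bytes of a BLOB with a few well-known file signatures. The `SIGNATURES` table and the `sniff_header` function are names made up for this example and are not taken from MigNon; a real signature table would be much larger.

```python
from typing import Optional

# Hypothetical lookup table: leading "magic" bytes of a few common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",  # also the container used by DOCX and XLSX
}


def sniff_header(blob: bytes) -> Optional[str]:
    """Return a format name if the BLOB starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if blob.startswith(magic):
            return name
    return None  # no header match: the content has to be analyzed instead
```

A header check of this kind cannot tell a DOCX from an XLSX or a plain ZIP file, and headerless text formats never match at all, which is one reason a statistical approach is useful.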
The distribution of byte values shows a different pattern for each file type. Early attempts focused on the byte distribution of the entire file, but more recent methods work on samples, for example by analyzing only the beginning, end, and middle of files. These samples can provide a good basis for machine learning, which can determine (with some probability) the type of unknown files using a model built on the different distributions. In a procedure based on BFD (Byte Frequency Distribution), it is not necessary to read the whole file, which saves time. The values extracted from several positions in the file are statistically analyzed to obtain a fingerprint specific to the file. Based on this fingerprint, the model, which was trained on known files, finds the file type that best fits it.
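The sampling idea can be sketched in a few lines. The function name `sample_bytes` and the chunk size of 1024 bytes are assumptions chosen for illustration; the text does not state which sample sizes or positions were actually used.

```python
def sample_bytes(blob: bytes, chunk: int = 1024) -> bytes:
    """Return `chunk` bytes from the start, middle and end of `blob`.

    Short inputs are returned unchanged, because sampling would not
    save any work for them.
    """
    n = len(blob)
    if n <= 3 * chunk:
        return blob
    mid = (n - chunk) // 2  # start offset of the middle sample
    return blob[:chunk] + blob[mid:mid + chunk] + blob[-chunk:]
```

The byte statistics are then computed on the returned sample rather than on the full content.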
The figure below shows the byte distribution of some file types:
Please note that all codes and input/output data examples in this article are derived from Clarity Consulting’s MigNon tool. This data migration assistant software is a result of a multi-year R&D project, partially funded by the European Union.
Applying machine learning to file format recognition
Machine learning can be an effective method for file format recognition, especially when working with large data sets. The following models can be considered for file format detection:
- Naive Bayes: Suitable for byte- or sequence-based analysis, where typical features of each file format, such as byte distributions or characteristic patterns, are taken into account.
- Support Vector Machines (SVM): A good choice when the boundaries between file formats, i.e. decision surfaces, need to be defined on the basis of byte frequency.
- Random Forest: Among the ensemble learning methods, Random Forest is often used because it can handle many different file formats and can cope with noisy data.
- Convolutional Neural Networks (CNNs): CNNs are capable of recognizing byte distribution patterns and can treat files as images where byte frequency is interpreted as “pixels”, thus recognizing different file formats.
- Autoencoder: can be used for unsupervised learning, especially for anomaly detection, when trying to detect files with a byte pattern different from the usual one for a given file format.
- K-Nearest Neighbors (KNN): For small datasets, this can be a simple but effective way to identify file formats based on the similarity of their nearest neighbors.
- Recurrent Neural Networks (RNNs): Because RNNs efficiently handle sequential data, they can be useful for tasks where sequences of bytes in a file are important.
For automated file format recognition, the LightGBM (Light Gradient Boosting Machine) model can be a good choice for several reasons. First, it is highly efficient on large and diverse datasets, optimized for fast learning and prediction, and thus performs well for file format analysis with large byte distributions or many samples. It requires little memory, which can be an important consideration when detecting and classifying multiple file formats in a system. It is a member of the gradient boosting model family, which can efficiently model complex decision surfaces and this accuracy can be critical in file format detection where byte frequency can have discrete distributions.
The LightGBM solution is data-driven, and since a larger amount of data is needed to build a model of the right quality, we created a quasi-automated download system for our example. To implement it, we used Selenium in Python to control the browser through a Firefox driver. This allowed us to search for links and HTML elements and then perform actions on them, such as clicking or entering data. During automatic downloads, we handled file types such as CSV, DOC, DOCX, XML, JPG, JSON, PDF, PNG, XLS, and XLSX, but some files, such as MP3 or AVI, had to be collected and processed manually. Files were collected from various sources: CSV and JSON files came from kaggle.com, while PNG and JPG images were downloaded from imgur.com. YouTube videos were downloaded in MP4 format from btclod.com and converted to AVI using WinFF. For ZIP files, compression and downloading were handled by separate Python scripts. The automated downloads were organized into an appropriate directory structure, and download volume and data quality were controlled by parameters within the program. Some of our test data was collected manually to ensure variety, but we were unable to collect sufficient quantities of every file type this way, so in many cases automatic downloads remained the primary method of generating input.
The quantities of input data obtained were as follows:
Recognizing the byte frequency distribution pattern mentioned in the introduction is the basis for building the LightGBM model on the collected data. Processing a complete byte list is costly, so instead of the raw bytes the model receives a BFD, which can be seen as a kind of “partial fingerprint” of the file. Using this as input, the model can determine the type of a file. The BFD is built by creating a frequency hash map from an array of bytes, where the keys are the byte values and the values are their numbers of occurrences. Since the keys are integers from 0 to 255, this hash map can easily be converted to a 256-element array, which speeds up the normalization process. Normalization here means constructing a probability distribution from the frequency values, so that the normalized frequencies sum to 1. This normalized form is the BFD.
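The construction described above can be sketched as follows. This is an illustrative version, not the MigNon code; the function name `byte_frequency_distribution` is an assumption.

```python
from collections import Counter
from typing import List


def byte_frequency_distribution(blob: bytes) -> List[float]:
    """Return the normalized 256-element byte frequency distribution."""
    counts = Counter(blob)  # hash map: byte value -> number of occurrences
    total = len(blob)
    if total == 0:
        return [0.0] * 256  # empty input: nothing to normalize
    # Convert the hash map to a fixed-size array and normalize in one pass.
    return [counts[value] / total for value in range(256)]
```

For example, the input `b"aab"` gives 2/3 at index 97 (`a`), 1/3 at index 98 (`b`), and 0 everywhere else.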
Testing and tuning the model
The primary problem in determining file types was that the initial model, which worked with the 256 single-byte frequencies, was unable to distinguish between similar formats such as XML, JSON, ZIP, DOCX, and XLSX. Of these, the ZIP format was particularly confusing because both DOCX and XLSX files are XML documents wrapped in a ZIP container. Confusion between XML and JSON was also common, as they are very similar in structure. Due to these factors, the accuracy of the model for these file types remained low, making it necessary to improve the model and introduce new features.
To improve accuracy, the following features were added:
- Ratio of bytes 60 and 62
- Ratio of bytes 123 and 125
- Ratio of the ‘<w’ byte sequence
- Ratio of the ‘<c’ byte sequence
- The “sharedStrings” byte sequence
To refine the model, we introduced several byte-specific features to identify different file types. For example, bytes 60 and 62, which correspond to the ‘<’ and ‘>’ symbols, played a key role in accurately identifying XML files. Similarly, bytes 123 and 125, which represent the ‘{’ and ‘}’ characters, were useful in identifying JSON files. For recognizing DOCX files, we used the ratio of the ‘<w’ byte sequence, while for XLSX files we tried to introduce the ‘<c’ byte sequence.
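A sketch of how such marker features might be computed is shown below. The function names, the dictionary keys, and the choice to normalize every count by the total length of the data are assumptions for illustration; the text does not specify the exact normalization.

```python
from typing import Dict, Iterable


def byte_ratio(blob: bytes, values: Iterable[int]) -> float:
    """Share of bytes in `blob` whose value is one of `values`."""
    if not blob:
        return 0.0
    return sum(blob.count(bytes([v])) for v in values) / len(blob)


def sequence_ratio(blob: bytes, seq: bytes) -> float:
    """Occurrences of `seq` relative to the length of `blob` (assumed norm)."""
    if not blob:
        return 0.0
    return blob.count(seq) / len(blob)


def marker_features(blob: bytes) -> Dict[str, float]:
    return {
        "angle": byte_ratio(blob, (60, 62)),    # '<' and '>': XML marker
        "brace": byte_ratio(blob, (123, 125)),  # '{' and '}': JSON marker
        "lt_w": sequence_ratio(blob, b"<w"),    # DOCX marker
        "lt_c": sequence_ratio(blob, b"<c"),    # XLSX marker
    }
```

These values would be appended to the 256 single-byte frequencies before training.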
For XLSX file detection, the “sharedStrings” byte sequence worked very well because all XLSX files contain the sharedStrings.xml file, whose name is stored uncompressed in the ZIP container. However, this feature was eventually discarded because its length made it too sensitive to even the smallest changes. Although these features improved the efficiency of the model, the exact identification of ZIP files remained problematic and we had to use another method to identify them.
Since we did not know exactly what features would be needed to reliably identify ZIP files, we decided to include all possible 2-byte combinations in the model to increase accuracy. However, this approach was extremely demanding in memory, time, and processor load because the learning process had to handle a huge amount of data. To optimize this, we planned to use only the 500 most important 2-byte features, thus significantly reducing the load on computational resources without compromising the efficiency of the model.
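To give a feel for the size of the problem: there are 256 × 256 = 65,536 possible 2-byte combinations, compared with 256 single-byte features. The sketch below counts 2-byte sequences and keeps only a pre-selected subset of them. The `selected` list is hypothetical; how the most important combinations were ranked (for example from the model's feature importances) is not shown here.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Bigram = Tuple[int, int]


def bigram_counts(blob: bytes) -> Counter:
    """Count every overlapping 2-byte sequence in `blob`."""
    return Counter(zip(blob, blob[1:]))


def bigram_features(blob: bytes, selected: Sequence[Bigram]) -> List[float]:
    """Relative frequency of each selected 2-byte sequence."""
    counts = bigram_counts(blob)
    total = max(len(blob) - 1, 1)  # number of 2-byte windows
    return [counts[pair] / total for pair in selected]
```

Restricting the output to a fixed, short `selected` list keeps the feature vector small regardless of how many distinct combinations occur in the data.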
Based on our initial analyses, we achieved over 90% accuracy for all classes individually, except for XML files, where we achieved only 49%. This indicated that the model may have an overfitting problem. Overfitting can occur when the model uses too many features, so that the decision trees reach their decisions too quickly, for example at leaf nodes that reflect only a few training samples. This can result in high accuracy on the training set, but poorer generalization on real data. One of the most effective ways to deal with overfitting is to reduce the number of features used.
The final model combines three groups of features. First, 256 single-byte features that examine the frequency of occurrence of each byte. This is followed by an additional feature that measures the proportion of ‘<’ and ‘>’ symbols (bytes 60 and 62), which is particularly useful for recognizing XML files. Finally, the model includes 42 selected two-byte features that, based on previous analysis, are most relevant to file type identification. These combined features help to increase the accuracy of the model in identifying different file types.
After the model was finalized, the following accuracies were found on the test set, broken down by file type:
Since the elements of a column in a table are of the same type, accuracy can be increased by classifying these binary data points together as a set: even if one data point in the set is misclassified, the other data points will push the average toward the correct classification.
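One way to implement such set-based classification is to average the per-class probabilities over all files of the column and pick the class with the highest mean. The sketch below assumes that each file's prediction is available as a dictionary of class probabilities; the function name `classify_set` and this input format are assumptions.

```python
from typing import Dict, List, Tuple


def classify_set(per_file_probs: List[Dict[str, float]]) -> Tuple[str, float]:
    """Average class probabilities over a set of files from one column."""
    if not per_file_probs:
        raise ValueError("at least one prediction is required")
    totals: Dict[str, float] = {}
    for probs in per_file_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(per_file_probs)
    means = {label: s / n for label, s in totals.items()}
    best = max(means, key=means.get)  # class with the highest mean probability
    return best, means[best]
```

In a set of three files where the second one is individually misclassified as JSON, the average can still favor XML as long as the other two predictions are confident enough.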
The figure above shows the progression of accuracy for the 3 lowest accuracy classes (XLSX, XLS, XML) for different set sizes. The three lowest accuracy classes are chosen because they require the largest set size to achieve 100% overall accuracy. The method allows us to determine the minimum set size required in the worst case. It can be seen that even in the worst case, the average accuracy of the model increases to nearly 100% when classifying 7 files at a time.
API runtime measurements for different set sizes for the model yielded the following results (average file size was 5 MB and testing was done on a machine with an Intel Core i7-2600 3.4 GHz processor and 1333 MHz RAM).
The following runtimes were measured depending on memory usage:
- 89.428 seconds with a maximum RAM usage of 7.28 GB
- 56.451 seconds with a maximum RAM usage of 4.55 GB
- 30.186 seconds with a maximum RAM usage of 2.58 GB
Microservice-based implementation of file recognition
The service is containerized using Docker and builds on the Python FastAPI and uvicorn modules. In the API call process, the system first receives the request, then decodes the received Base64 data to binary form. The binary data is then used to generate the Byte Frequency Distribution (BFD) of the data point. Once the BFD is generated, it is passed to a LightGBM model, which classifies the file type. The system logs the events and handles any errors. Finally, at the end of the process, the system sends a response to the user.
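The processing steps for a single item can be sketched without the web framework. In the illustration below, `toy_classifier` is a stand-in for the trained LightGBM model and is not the real classifier; `handle_item`, the logger name, and the shape of the returned dictionaries are likewise assumptions.

```python
import base64
import logging
from collections import Counter
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("file_recognition")  # hypothetical logger name


def bfd(blob: bytes) -> List[float]:
    counts = Counter(blob)
    total = len(blob) or 1  # avoid division by zero for empty input
    return [counts[value] / total for value in range(256)]


def toy_classifier(features: List[float]) -> Tuple[str, float]:
    # Stand-in for the model: flags data rich in '<' and '>' as XML.
    if features[60] + features[62] > 0.05:
        return "xml", 1.0
    return "unknown", 0.0


def handle_item(b64_payload: str,
                classify: Callable[[List[float]], Tuple[str, float]]) -> Dict:
    """Decode one Base64 payload, build its BFD and classify it."""
    try:
        blob = base64.b64decode(b64_payload, validate=True)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        logger.error("Base64 decoding failed: %s", exc)
        return {"error": "invalid base64"}
    extension, probability = classify(bfd(blob))
    logger.info("classified as %s (%.2f)", extension, probability)
    return {"extension": extension, "probability": probability}
```

In the service, a function of this kind would be called for every element of the request from within the endpoint handler.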
The Docker image consists of the following files:
- log: the folder that holds results.log
- models: the folder that contains the stored models
- main.py: the main module, which uvicorn runs
- logger.py: the module that does the logging
- bfd.py: the BFD code
- model.py: the model code
- dockerfile: the Docker build script
- module_list.txt: the list of modules that dockerfile downloads
The structure of dockerfile is as follows:
In the service call, the body contains only two fields: “data”, which is a list of JSON objects, and “id”. Each of these JSON objects has a “bit” field, a string containing the Base64-encoded version of the file.
The response from the /predict endpoint is a JSON object that contains a status code. Its structure is similar to the body of the request: it contains a “data” field whose value is a list of JSON objects, each with an “id” and a “pred” field. The value of “id” is either the identifier assigned to the submitted item or, if none was specified alongside the “bit” field, a generated identifier. The prediction (“pred”) field is itself a JSON object that contains the extension and probability key-value pairs.
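The sketch below builds a request body and reads a response in the shape described above. It assumes that each item carries its own “id” next to “bit” and that the prediction keys are literally named `extension` and `probability`; the helper names are made up, and the response used in the usage note is invented sample data, not real output.

```python
import base64
import json
from typing import Dict, Tuple


def build_request(files: Dict[str, bytes]) -> str:
    """Serialize files into a JSON body with Base64-encoded 'bit' fields."""
    data = [
        {"id": file_id, "bit": base64.b64encode(content).decode("ascii")}
        for file_id, content in files.items()
    ]
    return json.dumps({"data": data})


def read_predictions(response_text: str) -> Dict[str, Tuple[str, float]]:
    """Map each item id to its predicted (extension, probability) pair."""
    payload = json.loads(response_text)
    return {
        item["id"]: (item["pred"]["extension"], item["pred"]["probability"])
        for item in payload["data"]
    }
```

For an invented response such as `{"data": [{"id": "row-1", "pred": {"extension": "pdf", "probability": 0.97}}]}`, `read_predictions` returns `{"row-1": ("pdf", 0.97)}`.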
Closing thoughts
File format recognition is a challenge in data migration, especially for undocumented legacy systems. Files stored as BLOBs arrive without reliable external metadata, which makes format identification difficult and calls for more advanced methods. Our experience shows that combining machine learning with byte frequency distribution (BFD) can efficiently identify files based on their byte distribution patterns. The accuracy of the model can be increased by adding specific byte sequences and features to reliably distinguish between different formats. The resulting final model, running in a Dockerized environment, can be implemented as a flexible and efficient service that can be used in data migration projects.