Prepare Your Data
This page is currently under review. Stay tuned!
Crystal is here to help you process your data and get insights out of it to make successful business decisions.
In order to do so, it needs a data source from which to retrieve the information wanted and, to get the best out of her performance, this data needs to be prepared according to certain guidelines.
The higher the data and structure quality is, the better results and impact crystal can bring.
We recommend you to follow our advice to get the best out of your Crystal experience.
But before we dive into specific details, let’s review some basic concepts as an introduction to the perfect crystal experience.
In this article we will see:
Crystal Foundations
Data Preparation Outside Crystal
Data Manipulation Inside Crystal
Crystal Foundations
Crystal's Conversation revolves around Topics: asking a question to Crystal means asking a Topic.
Consequently, it is very important to prepare the data to create topics that have a natural way of requesting them.
You can deep-dive on the concept of Topic here.
You can deep-dive on the rules of Conversation here.
Following, a summary of the main concepts underlying Crystal that you must keep in consideration for preparing data in the best way.
Topics
A topic is an insight on your data that you can get from crystal.
It is the combination of one or multiple main entities that can be filtered for one or more filter entities that fulfill an objective and that are surfaced to the user through a visual or textual representation.
Objectives
An objective is essentially the main purpose of the topic. Each objective can be represented through one or more visualizations that will be chosen by the admins during the phase of topic creation. Each objective has a number of consistent visualizations, that can be charts or other formats (such as ranking tables and lists).
For example, the comparison objective shows the similarity or dissimilarity between different items or facts, like in the question: “Can you compare items sold versus budget by country?” The over time objective, instead, shows how data changed over a specific time period, as in: “Can you show me the sales over time by month for the last year?"
Entities
Entities are the core of crystal’s functioning: they are essentially the content that you allow the members to explore through the advisor conversation.
There are two types of entities: the main entities and the filter entities.
A main entity is the subject the insight focuses on, according to the objective you set for the analysis. For example, if the question for the advisor would be: “Can you show me total of sales?”, the main focus of this request would be “sales”.
A filter entity is the dimension that you can associate with the main entity to drill down, filter, or deepen your analysis. For example, for our question “Show me total of sales”, a filter entity would be the value of the "country" entity, like “Show me total of sales for Germany”, or a year, like “Show me total of sales in 2019”.
A filter entity should not be confused with a filter value: for example, in our case, a filter entity might be “Country”, which would retrieve the sales for a single country, while a filter value is a specific item in the entity’s data column. In our case, “Germany” would be a specific value for the entity “Country”.
From a configuration point of view, they correspond to the column selected from the database you’ve connected.
During topic creation, you also need to provide some name variations. Essentially, they are synonyms that will be used conversationally to refer to the entity - e.g. if the entity is “items”, we can set “units” or “articles” as a name variation.
Aggregation
The aggregation in crystal is based on SQL aggregation functions and it represents how the different values of a main entity can be put together in a specific topic, meaning that the same entity can be aggregated in different ways.
For example, if we have the entity “Sales”, we can either sum the different values of the entity, count them or calculate the average.
If we want to show the user how much money we made, then we would use Sum, if we want to count how many sales transactions we had, we would go for Count and if we want to represent the average sale we would use Average.
Aliases
These synonyms associated with the same entity are called aliases and they can be used in the conversation to refer to the same entity.
Value aliases are name variations that you and your team will use as alternatives to name the non-numeric values of the defined entities, e.g. every single (non-numerical) value of the related column.
Aliases are very important for crystal, because they enable the flexible conversational experience. The same entity can be in fact associated with multiple words: “sales” can be also called “revenues”, “turnover”, and so on. Another example: if the entity is “items” words like “units” or “articles” might also be used to refer to it.
For example, you may have an entity related to money. Within that entity, it was filled out in the original data source using currency codes, like USD or EUR. By adding value aliases, you can add “American dollars” and “Euros” as variations for these codes, meaning that you can ask questions to crystal using these variations.
Value aliases are very useful when the values of the entities you use as filters are hard to spell and pronounce, such as product codes: you can associate a meaningful alias to the code and use it in the conversation with crystal, e.g. “product XG059” can be called “Blue sofa”.
You must give at least one alias to each entity, but remember that the more aliases you provide, the more the advisor will understand your users during the conversation.
Data formatting
In order to display your data correctly, crystal will follow these data formatting conventions and rules.
Supported formats
The data must be formatted with a certain logic to be properly consumed by crystal.
Strings
crystal works with Text fields. It supports most of the commonly used categorical types. Here you can find the complete list of supported data sources. Depending on the connected data source, you could need to transform some of your dataset columns.
For example
Scenario | Query |
VARCHAR type | CAST ( column AS varchar [ ( 120 ) ] ) |
Mind that this is just an example and it needs to be adapted, based on the exact case and technology used.
Special characters
We often encounter qualitative fields, such as product description, address and others, containing special characters that are not properly encoded as Unicode.
To clean these data types, it is more convenient to use the regular function expression.
How to handle
Scenario | Query |
publicitÖÖÖÖ | regexp_replace('publicitÖÖÖÖ','[Ö�]+|√†!','à','g'); |
Empty values
crystal expects empty values to be represented as NULL.
How to handle
Scenario | Query |
““ | nullif(column, "") |
“None” | nullif(column, "None") |
Numbers (floats)
The floating numeric from customer data source should follow the standard format that uses the comma as thousand separators (e.g.1,200 for one thousand two hundred) and the dot as decimal separator (e.g. 1.2 for one and two).
1.234 = one and two hundred thirty-four thousandths
1,234 = one thousand two hundred thirty-four
However it can happen quite often to find numeric data that violates these conventions.
How to handle
Scenario | Query | Result |
We just need to read it correctly from text type and save it as numeric type. | to_number('1,234,567.009678' , '999G999G999G999D999999'::text) | 1234567.009678 |
Negative sign is at the end of the number. | to_number('123,456,789.01-', '999999999G999D999S'::text); | -123456789.01 |
In the following example, you can find some common scenarios and how to fix them.
If our data already complies with the common standard (dot for decimal and comma for thousands separators):
Scenario | Query | Result |
We just need to read it correctly from text type and save it as numeric type. | to_number('1,234,567.009678' , '999G999G999G999D999999'::text) | 1234567.009678 |
Negative sign is at the end of the number. | to_number('123,456,789.01-', '999999999G999D999S'::text); | -123456789.01 |
If our data does NOT comply with the common standard:
Scenario | Query | Result |
It uses dot for thousands and comma for decimal separators (eg. 1.200 and 1,2). | NULLIF(replace(replace('5.295,22' , '.'::text, ''::text), ','::text, '.'::text), ''::text)::numeric | 5295.22 |
It uses comma for decimal separator (1,2) and nothing for thousands separator (1200). | NULLIF(replace('2271,02', ','::text, '.'::text), ''::text)::numeric | 2271.02 |
Complex fields types
At the moment crystal is not supporting complex fields like “Json objects“ for Postgres or “LOB“ for Oracle. If in a table such field types are present, crystal will not be able to use them in the topic configuration process, nor will she be able to preview the related data.
Note that, for Oracle, all the data types described here are not supported.
Product limitations
Before we start to dive into the data preparation details, take a look at this article to know more about crystal’s limitations.
Data Preparation Outside Crystal
The topic configuration is a key process for the success of the crystal project, that’s why it is extremely important to provide a clean data structure by following some guidelines.
The main entity
A data source is mainly composed of tables that are also made of columns.
Main entities can be configured by using a simple rule of thumb:
One column = one main entity
This is the suggested and recommended approach for most cases. Each column must contain the values or content of a single entity.
For example, let’s consider that we have to create a topic whose main entity is “Sales” and this is our data:
Product | City | Channel | Sales | Chargebacks |
Product 1 | Milan | Online | $20,56 | $0 |
Product 2 | London | Store | $50,36 | $0 |
Product 1 | Turin | Online | $97,76 | $0 |
Product 1 | New York | Store | $190,12 | $0 |
In this case Sales will be our main entity. Other main entities, such as the chargebacks in our example, can be configured on the related column.
A crucially important factor to keep in mind is that entities’ aliases MUST be unique , therefore we will not be able to name another main entity within crystal “Sales”.
Also note that:
If, conceptually, there are multiple columns that refer to generic “sales”, they need to be centralized in the same column and in the same table to enable the entity “sales” to be unique in the conversation and allow the system to identify the source field.
if there are multiple sales columns that conceptually have differences (e.g. actual sales & budget sales), they can be different columns and can have different alias names (such as “actual sales“ and “budget sales”).
There is a possible alternative approach that can be used in peculiar cases. Let’s take a table like this as an example:
Product | Indicators | Value |
Product 1 | Revenues | $0 |
Product 1 | Costs | $0 |
Product 1 | Inventory | $0 |
In this case, we can use the “value” column as the main entity and the “indicator” column as a filter entity. Nonetheless, this approach is solid when the values within the column “Value” can be logically aggregated with one of the supported functions: otherwise there is the risk of showing inconsistent data to the final user.
Moreover, the conversation structure will be directly affected, as the main entity, and therefore the focus of the topic, will be “value”, rather than the values of the filter dimension (such as, for example, revenues or costs).
The filter entity
If a conceptual dimension (for example, “product”) is present in multiple tables (like, 3), and therefore in multiple columns (also 3), and it has to be used as a filter for topics configured on the three objects, there are two options:
We could configure two different entities (for example, “Sales”and “Sales target“) and potentially trigger the disambiguation feature, in the case the request to the advisor is partial or incomplete (for example, by specifying only "sales").
Or, we could centralize the filter column in a single object to configure the entity in a single column.
Let’s see an example.
If we assume that “product” is the only filter dimension that we want to share between the two tables, we will have to create a master table and two “data” tables as follows, linked by a key that we will use to join the two tables.
Master table
ID | ProductID | Price |
34 | 2 | 12 |
85 | 6 | 15 |
45 | 3 | 23 |
Table 1: Sales targets
ID | ProductID | Target |
23 | 2 | 45,000 |
54 | 6 | 125,000 |
57 | 3 | 10,000 |
Table 2: Products
The “Name” column would become the Product filter entity.
ID | Name |
2 | Apples |
6 | Pears |
3 | Bananas |
Centralize all entities
Assuming that Product is the only filter dimension that we want to share between the two tables, we can directly create a single table containing all the filters and main entities.
The driver for this decision would be the granularity of the data and the amount of entities that share the same filters.
For example:
Master table: sales
ID | Product | Price |
34 | Apples | 12 |
85 | Pears | 15 |
45 | Bananas | 23 |
Table 1: Sales targets
ID | Product | Target |
23 | Apples | 45,000 |
54 | Pears | 125,000 |
57 | Bananas | 10,000 |
Table 2: Products
ID | Product | Price | Target |
1 | Apples | 12 | 45,000 |
2 | Pears | 15 | 125,000 |
3 | Bananas | 23 | 10,000 |
Calculated columns
Calculated columns are customized columns created dynamically through operations. In the below case, for example, we have the total column which is generated by the system by multiplying price per quantity. At the moment crystal does not support the creation of calculated columns.
ID | ProductID | Price | Quantity | Total |
34 | 34 | 5 | 3 | 15 |
85 | 85 | 10 | 5 | 50 |
45 | 45 | 20 | 6 | 120 |
Data Manipulation Inside crystal
Joins
Join can be performed on a single key, meaning that you cannot use multiple fields to join two tables. Moreover, join operations can be performed only if the objects reside in the same data source (such as, database connection).
The joined tables can be further joined to a maximum of 5.
The performance of crystal’s answers depends on the number of the join operations and the performance of the underlying data source (for example, indexing and other database optimization techniques can be performed to speed up the data retrieval and enhance crystal answering performance).
The supported joins are:
Inner
Left
Right
Sorting
You can sort the final result by any column of the connected data, choosing ascending or descending order.
Static filtering
In crystal, static filters correspond to the SQL where clause. It is possible to apply static filtering to the connected data: static filters are applied by default to a topic and they are not to be confused with dynamic filters, applied in the filter step of the topic configuration flow.
You can apply multiple static filters to a topic with a specific operation. For example, you can select “category” as a static filter and say that you want to filter category values that EQUAL “gadget” (meaning that only the “gadget” objects will be taken into account in the conversation), or the “price” as a static filter with operation “greater than” and value 100, and your topic will only retrieve those values which are above 100.
The available options are:
Equal
Different
Empty
Not Empty
Greater Than
Less Than
Greater Than Or Equal To
Less Than Or Equal To
Between
Aggregation functions
You can choose among the following aggregation functions:
Count
Count distinct
Sum
Average
Last updated